Evaluations

What are evaluations?

Evaluations are a way to measure the performance of your AI application. Performance is an overloaded word in AI: in traditional software it means "speed" (e.g. the number of milliseconds required to complete a request), but in AI it usually means "accuracy" or "quality".

Why are evals important?

In AI development, it's hard for teams to understand how an update will impact performance. This breaks the dev loop, making iteration feel like guesswork instead of engineering.

Evaluations solve this, helping you distill the craziness of AI applications into an effective feedback loop that enables you to ship more reliable, higher-quality products.

Specifically, great evals help you:

  • Understand whether an update is an improvement or a regression
  • Quickly drill down into good / bad examples
  • Diff specific examples vs. prior runs
  • Avoid playing whack-a-mole, where fixing one failure quietly introduces another

Breaking down evals

Evals consist of 3 parts:

  • Data: a set of examples to test your application on
  • Task: the AI function you want to test (any function that takes in an input and returns an output)
  • Scores: a set of scoring functions that take an input, output, and optional expected value and compute a score
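To make the "Scores" piece concrete, here is a minimal sketch of a hand-written scoring function. The `ScorerArgs` and `Score` shapes and the `exactMatch` name are assumptions made for illustration, not types taken from the Braintrust SDK; the sketch only shows the general idea of mapping an input, output, and optional expected value to a number between 0 and 1.

```typescript
// Hypothetical shapes for illustration; not taken from the Braintrust SDK.
interface ScorerArgs {
  input: string;
  output: string;
  expected?: string;
}

interface Score {
  name: string;
  score: number; // 0 (worst) to 1 (best)
}

// Scores 1 when the output matches the expected value exactly
// (ignoring case and surrounding whitespace), otherwise 0.
function exactMatch({ output, expected }: ScorerArgs): Score {
  if (expected === undefined) {
    // No reference answer to compare against: treated as a miss in this sketch.
    return { name: "ExactMatch", score: 0 };
  }
  const normalize = (s: string) => s.trim().toLowerCase();
  return {
    name: "ExactMatch",
    score: normalize(output) === normalize(expected) ? 1 : 0,
  };
}
```

An all-or-nothing scorer like this is easy to reason about but harsh on near misses. A graded scorer such as an edit-distance metric gives partial credit instead.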

You can define an eval by passing these 3 pieces to the Eval() function:

import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";
 
Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: () => {
      return [
        {
          input: "Foo",
          expected: "Hi Foo",
        },
        {
          input: "Bar",
          expected: "Hello Bar",
        },
      ]; // Replace with your eval dataset
    },
    task: async (input) => {
      return "Hi " + input; // Replace with your LLM call
    },
    scores: [Levenshtein],
  },
);

(see the full tutorial for more details)
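Conceptually, an eval run combines the three pieces like this: call the task on every example, apply every scorer to the result, and aggregate the scores. The sketch below is a simplified illustration of that loop only. `runEvalSketch` and its types are hypothetical names, and this is not how the Braintrust SDK is implemented (it omits logging, tracing, and concurrency entirely).

```typescript
// Conceptual sketch only: all names here are hypothetical.
interface Example {
  input: string;
  expected?: string;
}

interface ScorerArgs extends Example {
  output: string;
}

type Scorer = (args: ScorerArgs) => { name: string; score: number };

interface EvalResult {
  rows: Array<ScorerArgs & { scores: Record<string, number> }>;
  averages: Record<string, number>;
}

async function runEvalSketch(
  data: Example[],
  task: (input: string) => Promise<string>,
  scorers: Scorer[],
): Promise<EvalResult> {
  const rows: EvalResult["rows"] = [];
  for (const example of data) {
    // Task: produce an output for this example's input.
    const output = await task(example.input);

    // Scores: run every scorer against the input, output, and expected value.
    const scores: Record<string, number> = {};
    for (const scorer of scorers) {
      const { name, score } = scorer({ ...example, output });
      scores[name] = score;
    }
    rows.push({ ...example, output, scores });
  }

  // Aggregate: mean of each score across all examples.
  const averages: Record<string, number> = {};
  for (const row of rows) {
    for (const [name, score] of Object.entries(row.scores)) {
      averages[name] = (averages[name] ?? 0) + score / rows.length;
    }
  }
  return { rows, averages };
}
```

The per-example rows support drilling into individual cases, while the averages give the high-level summary of a run.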

Viewing evals

Running your eval function automatically logs your results to Braintrust, displays a summary in your terminal, and populates the UI.

This gives you great visibility into how your AI application performed. Specifically, you can:

  1. Preview each test case and score in a table
  2. Filter by high/low scores
  3. Click into any individual example and see detailed tracing
  4. See high-level scores
  5. Sort by improvements or regressions
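To illustrate what sorting by improvements or regressions means, the sketch below compares per-example scores from two runs and orders the examples so the largest regressions come first. It is a conceptual illustration only: `diffRuns`, `ScoreDiff`, and the idea of keying scores by an example id are assumptions made for this sketch, not part of the Braintrust API.

```typescript
// Hypothetical shape: per-example scores keyed by a stable example id.
type RunScores = Record<string, number>;

interface ScoreDiff {
  exampleId: string;
  before: number;
  after: number;
  delta: number; // negative = regression, positive = improvement
}

function diffRuns(before: RunScores, after: RunScores): ScoreDiff[] {
  const diffs: ScoreDiff[] = [];
  for (const [exampleId, afterScore] of Object.entries(after)) {
    const beforeScore = before[exampleId];
    // Examples that only exist in the new run have nothing to compare against.
    if (beforeScore === undefined) continue;
    diffs.push({
      exampleId,
      before: beforeScore,
      after: afterScore,
      delta: afterScore - beforeScore,
    });
  }
  // Ascending by delta: the largest regressions appear first.
  return diffs.sort((a, b) => a.delta - b.delta);
}
```

Reversing the sort order surfaces the biggest improvements instead.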

Where to go from here

Now that you understand the basics of evals, you can dive deeper into the following topics: