Automated Scores and Braintrust

The previous post ended with a working but limited eval harness: no UI, no score history, no way to compare two runs side by side. The results were JSON files on disk and any scoring was done by eye. That harness existed to teach the concepts. Now it gets thrown away.

Why Move to a Framework

The custom harness is not a product. Building it out to the point where it is actually useful would take months: a dashboard, a trend chart, a diff view between runs, automated score functions, and a way to filter by category or difficulty. That is an entire company's worth of work. Braintrust already built it.

What we actually want from an eval platform is simple: a GUI that shows scores per test case, a way to compare the current run against a previous one, and a history so you can see whether a number is moving up or down over time. Making a red number turn green is the whole job. That is how you prove an improvement worked.

Signing up for Braintrust is free and does not require a credit card. Once you have an account, grab your API key from the onboarding screen and add it to your environment variables:

Plain text

BRAINTRUST_API_KEY=your_key_here

Install the SDK alongside the autoevals library:

Bash

npm install braintrust autoevals

The autoevals package is worth knowing about even if you do not use it immediately. It is a collection of pre-built eval functions: LLM-as-a-judge scorers (where a second AI model scores the output of your first), deterministic checkers (pure functions that always give the same answer for the same input), and fine-tuned binary classifiers (small specialised models trained to give a yes/no answer on one specific question). Together they cover everything from factuality and summarization to SQL correctness and JSON validity. When you need to eval something standard, this library probably already has a scorer for it, and you can use it directly with the Braintrust SDK.

Expanding the Dataset

The initial golden dataset only covered the create-from-scratch cases. The modify cases were skipped in the manual harness because they require too much setup to do by hand. Now that the framework handles the infrastructure, we can add them.

There are also test cases in the dataset that are intentionally unsolvable with the current agent. A prompt like "make me a diagram of our architecture, covering components A, B, and C" cannot be answered without knowing what "our architecture" means. The agent has no access to that context. It cannot be solved with a better prompt or a better tool description. It requires RAG, or some form of injected context, to resolve.

These cases are in the dataset on purpose. They will fail now. They are there so that when RAG is added later, you can look back and see exactly which test cases moved from red to green and by how much. The eval tells you whether the technique actually helped.
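Braintrust's comparison view does this for you, but the underlying idea is small enough to sketch. The snippet below is an illustration only: `RunScores`, `diffRuns`, and the test case ids are hypothetical names, and it assumes each run can be reduced to one score per test case id.

```typescript
// Hypothetical shape: one score per test case per run, keyed by test case id.
type RunScores = Record<string, number>;

interface ScoreChange {
  id: string;
  before: number;
  after: number;
  delta: number;
}

// Returns every test case whose score changed between two runs,
// largest improvement first. Cases missing from either run are ignored.
function diffRuns(before: RunScores, after: RunScores): ScoreChange[] {
  const changes: ScoreChange[] = [];
  for (const id of Object.keys(before)) {
    const prev = before[id];
    const next = after[id];
    if (prev === undefined || next === undefined || prev === next) continue;
    changes.push({ id, before: prev, after: next, delta: next - prev });
  }
  return changes.sort((a, b) => b.delta - a.delta);
}

// Example: the context-dependent cases fail in the baseline and improve later.
const baseline: RunScores = { 'create-01': 1, 'context-01': 0, 'context-02': 0 };
const withRag: RunScores = { 'create-01': 1, 'context-01': 1, 'context-02': 0.5 };
const moved = diffRuns(baseline, withRag);
```

Only the two context cases show up in `moved`. The creation case is unchanged, so it drops out of the diff, which is the point of keeping the unsolvable cases around.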

Simulating Chat History for Modify Evals

The modify test cases need special handling. The modifyDiagram tool only makes sense when there is something on the canvas to modify. In a real session, the agent would have generated a diagram earlier in the conversation and the user is now asking to change something. In an eval, that history does not exist.

The fix is to construct a fake message history that puts the LLM in the right state. The approach looks like this:

1. Start with the user's initial prompt (e.g., "draw a login flowchart").
2. Add a fake assistant message showing that the LLM called generateDiagram.
3. Add a fake tool result with the elements that would have been returned.
4. Add another fake assistant message saying it completed the diagram.
5. Now add the actual eval input (e.g., "make the login box red").

TypeScript

function buildMessages(seed: TestCaseSeed, evalInput: string) {
  return [
    { role: 'user', content: seed.initialPrompt },
    {
      role: 'assistant',
      content: null,
      tool_calls: [{
        id: seed.toolCallId,
        type: 'function',
        function: {
          name: 'generateDiagram',
          arguments: JSON.stringify({ elements: seed.elements }),
        },
      }],
    },
    {
      role: 'tool',
      tool_call_id: seed.toolCallId,
      content: JSON.stringify({ elements: seed.elements }),
    },
    { role: 'assistant', content: 'I\'ve generated the diagram.' },
    { role: 'user', content: evalInput },
  ];
}

In test terminology this is sometimes called a fixture: pre-built data you set up before a test runs so the system is in a known state when the assertion happens. Here the "state" is the LLM's belief that it already generated a diagram with known element IDs. Once it believes that, you can ask it to modify specific elements and evaluate whether it does so correctly.

The LLM cannot verify that this history is real. It treats it as ground truth. This is the same property that makes prompt injection dangerous and makes eval fixtures useful.
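The model may accept a fabricated history, but chat APIs in this message format typically reject one where a tool result does not answer an earlier tool call. A cheap sanity check on the fixture catches that before an eval run does. This is a sketch under assumptions: `ChatMessage` mirrors only the fields used in the seeded history, and `checkFixture` is a hypothetical helper, not part of any SDK.

```typescript
// Minimal message shape, mirroring the fields used by the seeded history.
interface ChatMessage {
  role: 'user' | 'assistant' | 'tool';
  content: string | null;
  tool_calls?: { id: string; type: 'function'; function: { name: string; arguments: string } }[];
  tool_call_id?: string;
}

// Returns a list of problems; an empty list means the fixture is consistent.
function checkFixture(messages: ChatMessage[]): string[] {
  const problems: string[] = [];
  const requested = new Set<string>(); // tool call ids seen so far
  const answered = new Set<string>();  // tool call ids that received a result

  messages.forEach((message, index) => {
    (message.tool_calls ?? []).forEach(call => requested.add(call.id));
    if (message.role !== 'tool') return;
    const id = message.tool_call_id;
    if (id === undefined || !requested.has(id)) {
      problems.push(`message ${index}: tool result has no matching tool call`);
    } else {
      answered.add(id);
    }
  });

  requested.forEach(id => {
    if (!answered.has(id)) problems.push(`tool call ${id} never received a result`);
  });

  const last = messages[messages.length - 1];
  if (last === undefined || last.role !== 'user') {
    problems.push('history should end with the eval input as a user message');
  }
  return problems;
}

const history: ChatMessage[] = [
  { role: 'user', content: 'draw a login flowchart' },
  {
    role: 'assistant',
    content: null,
    tool_calls: [{ id: 'call_1', type: 'function', function: { name: 'generateDiagram', arguments: '{"elements":[]}' } }],
  },
  { role: 'tool', tool_call_id: 'call_1', content: '{"elements":[]}' },
  { role: 'assistant', content: "I've generated the diagram." },
  { role: 'user', content: 'make the login box red' },
];
const fixtureProblems = checkFixture(history);
```

An empty `fixtureProblems` array means every tool call has a result and the history ends on the eval input.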

Abstracting the Agent Core

When the manual harness was built, the agent had to be reconstructed from scratch inside the eval runner: import the tools, import the system prompt, pass in the model, configure max steps. Every time something changed in the agent, it had to be updated in two places.

The fix is to extract a shared agent-core.ts file that both the Cloudflare worker and the eval runner import from. Two functions live here: streamAgent for the production chat interface, and runAgent for the eval harness.

They are nearly identical. The only difference is whether the output is streamed or collected:

TypeScript

import { generateText, streamText } from 'ai';
import { tools } from './tools';
import { systemPrompt as defaultSystemPrompt } from './system-prompt';

interface AgentOptions {
  model?: string;
  messages?: unknown[];
  systemPrompt?: string;
  maxSteps?: number;
}

export async function streamAgent(options: AgentOptions = {}) {
  const {
    model = 'gpt-4o-mini',
    messages = [],
    systemPrompt = defaultSystemPrompt,
    maxSteps = 5,
  } = options;

  return streamText({ model, system: systemPrompt, messages, tools, maxSteps });
}

export async function runAgent(options: AgentOptions = {}) {
  const {
    model = 'gpt-4o-mini',
    messages = [],
    systemPrompt = defaultSystemPrompt,
    maxSteps = 5,
  } = options;

  const result = await generateText({ model, system: systemPrompt, messages, tools, maxSteps });

  return {
    text: result.text,
    elements: extractElements(result.steps),
    steps: result.steps,
  };
}

Every parameter has a default so existing code continues to work without changes. The eval runner can override the model to benchmark two models head-to-head. It can inject a seeded messages array for the modify test cases. It can swap out the system prompt to test a new version without touching production.

result.text is the final answer after all tool calls complete. result.steps is the full step array. A step is one round trip: the agent requests a tool with arguments, the tool runs, and the result comes back. extractElements, a small helper that is not part of the listing above, walks every step and collects all generateDiagram and modifyDiagram outputs into one flat array.
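One way extractElements could look is sketched below. The step shape is an assumption: it supposes each step exposes a `toolResults` array whose entries carry a `toolName` and a `result` object with an `elements` array. The real step objects from the `ai` package carry more fields, and the property holding a tool's return value may be named differently depending on the SDK version, so treat `StepLike` and `ToolResultLike` as stand-ins.

```typescript
// Assumed minimal shapes, standing in for the SDK's step objects.
interface ToolResultLike {
  toolName: string;
  result: unknown;
}
interface StepLike {
  toolResults: ToolResultLike[];
}

const DIAGRAM_TOOLS = ['generateDiagram', 'modifyDiagram'];

// Type guard: true when a tool result looks like { elements: [...] }.
function hasElements(value: unknown): value is { elements: unknown[] } {
  return (
    typeof value === 'object' &&
    value !== null &&
    Array.isArray((value as { elements?: unknown }).elements)
  );
}

// Walks every step in order and flattens the diagram tools' elements.
function extractElements(steps: StepLike[]): unknown[] {
  const elements: unknown[] = [];
  for (const step of steps) {
    for (const toolResult of step.toolResults) {
      if (!DIAGRAM_TOOLS.includes(toolResult.toolName)) continue;
      if (hasElements(toolResult.result)) elements.push(...toolResult.result.elements);
    }
  }
  return elements;
}

const sampleSteps: StepLike[] = [
  { toolResults: [{ toolName: 'generateDiagram', result: { elements: [{ id: 'a' }, { id: 'b' }] } }] },
  { toolResults: [{ toolName: 'someOtherTool', result: { elements: [{ id: 'x' }] } }] },
  { toolResults: [{ toolName: 'modifyDiagram', result: { elements: [{ id: 'a', backgroundColor: 'red' }] } }] },
];
const collected = extractElements(sampleSteps);
```

Note that this version does not deduplicate: an element that is generated and later modified appears twice in the flat array. Whether that is what a scorer wants depends on the scorer.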

The Cloudflare worker shrinks down to just calling streamAgent:

TypeScript

import { streamAgent } from './agent-core';

const result = await streamAgent({ model: 'gpt-4o-mini', messages });

The tools, system prompt, and max steps are all handled by the defaults. If any of those change, they change in one place.

Code-Based Scorers

A Braintrust scorer is a function that receives the input, the agent output, and an optional expected value. It returns an object with a name, a numeric score between 0 and 1, and metadata. Returning null tells Braintrust to skip that test case for this scorer rather than penalizing it.

The null skip matters. A test case for the modify tool should not be scored by a scorer that checks whether generateDiagram produced elements. Returning null says: this test case is not applicable here, exclude it from the average.
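The effect on the average is easy to see with a little arithmetic. The sketch below is only that arithmetic, not Braintrust's implementation, and `averageScore` is a hypothetical name.

```typescript
// Averages the applicable scores; null entries are excluded, not counted as 0.
function averageScore(scores: (number | null)[]): number | null {
  const applicable = scores.filter((s): s is number => s !== null);
  if (applicable.length === 0) return null; // nothing applicable, nothing to report
  return applicable.reduce((sum, s) => sum + s, 0) / applicable.length;
}

// Two creation cases pass; one modify case is not applicable to this scorer.
const penalized = averageScore([1, 1, 0]);  // modify case scored as a failure
const skipped = averageScore([1, 1, null]); // modify case excluded
```

Scoring the inapplicable case as 0 pulls the average down to two thirds, while skipping it leaves the average at 1, which is the number that actually describes the creation cases.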

The first scorer checks structural validity: does the agent's output contain valid Excalidraw elements? Every element must have an id, type, x, y, width, and height, and must be one of the recognized types. These are hard constraints from the Excalidraw schema. That makes the check fully deterministic.

TypeScript

const REQUIRED_FIELDS = ['id', 'type', 'x', 'y', 'width', 'height'];
const VALID_TYPES = ['rectangle', 'ellipse', 'diamond', 'arrow', 'line', 'text', 'image'];

export function validElementStructure({ output }: { output: AgentOutput }) {
  const elements = output.elements;

  if (!elements || elements.length === 0) {
    return { name: 'ValidElementStructure', score: 0, metadata: { reason: 'no elements returned' } };
  }

  const results = elements.map(el => {
    if (!el || typeof el !== 'object') return { valid: false, reason: 'not an object' };
    const missing = REQUIRED_FIELDS.filter(f => !(f in el));
    if (missing.length > 0) return { valid: false, reason: `missing: ${missing.join(', ')}` };
    if (!VALID_TYPES.includes(el.type)) return { valid: false, reason: `unknown type: ${el.type}` };
    return { valid: true };
  });

  const passCount = results.filter(r => r.valid).length;
  return {
    name: 'ValidElementStructure',
    score: passCount / results.length,
    metadata: { passCount, total: results.length },
  };
}

The score is fractional: if 7 out of 10 elements are valid, the score is 0.7. A fraction shows how close the output came, where a binary pass/fail would hide it. You could also validate with Zod, using the same schema that describes the tool's input to validate the output. One schema, consistent enforcement across the whole codebase.

Beyond the schema scorer, other useful deterministic scorers for the diagram agent include a preservation scorer (elements not targeted by modifyDiagram should be unchanged), a label keyword scorer (shapes should carry text labels that match words from the user's prompt), and a structure scorer (a "three-step flowchart" should produce roughly three connected nodes).
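As an illustration of the first of these, here is a minimal sketch of a preservation check. It is written as a standalone pure function rather than in Braintrust's scorer signature, and everything in it is an assumption for the example: the `DiagramElement` shape, the `preservationScore` name, and the idea that the test case can supply the canvas before the call and the ids the user asked to change.

```typescript
interface DiagramElement {
  id: string;
  [key: string]: unknown;
}

interface PreservationArgs {
  before: DiagramElement[]; // elements on the canvas before the modify call
  after: DiagramElement[];  // elements on the canvas after the modify call
  targetIds: string[];      // ids the user asked to change
}

function preservationScore({ before, after, targetIds }: PreservationArgs) {
  const untouched = before.filter(el => !targetIds.includes(el.id));
  if (untouched.length === 0) return null; // nothing to preserve, so skip

  const afterById = new Map(after.map(el => [el.id, el] as const));
  // JSON.stringify is a cheap equality check; note that it is sensitive to key order.
  const preserved = untouched.filter(el => {
    const match = afterById.get(el.id);
    return match !== undefined && JSON.stringify(match) === JSON.stringify(el);
  }).length;

  return {
    name: 'Preservation',
    score: preserved / untouched.length,
    metadata: { preserved, total: untouched.length },
  };
}

const canvasBefore: DiagramElement[] = [
  { id: 'login', type: 'rectangle', backgroundColor: 'blue' },
  { id: 'signup', type: 'rectangle', backgroundColor: 'blue' },
  { id: 'arrow-1', type: 'arrow', x: 10 },
];
const canvasAfter: DiagramElement[] = [
  { id: 'login', type: 'rectangle', backgroundColor: 'red' },   // the requested change
  { id: 'signup', type: 'rectangle', backgroundColor: 'blue' }, // preserved
  { id: 'arrow-1', type: 'arrow', x: 40 },                      // moved by accident
];
const preservation = preservationScore({
  before: canvasBefore,
  after: canvasAfter,
  targetIds: ['login'],
});
```

Of the two elements that should have been left alone, one survived intact, so the score is 0.5. The targeted element is excluded from the count entirely, since changing it is the point of the request.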

Wiring the Eval File

The Braintrust Eval function takes four things: a name for the dashboard, a data function, a task function, and a scores array:

TypeScript

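import { Eval } from 'braintrust';
import { runAgent } from './agent-core';

Eval('design-agent', {
  // Each golden test case becomes one row in the experiment.
  data: () =>
    testCases.map(tc => ({
      input: tc,
      expected: tc,
      metadata: { id: tc.id, difficulty: tc.difficulty, category: tc.category },
    })),

  // The task receives the row's `input`, which here is the whole test case.
  task: async (testCase) => {
    const result = await runAgent({
      model: 'gpt-4o-mini',
      // buildMessages takes (seed, evalInput). `seed` and `prompt` are assumed
      // field names on the test case; creation cases are assumed to have no seed.
      messages: testCase.seed
        ? buildMessages(testCase.seed, testCase.prompt)
        : [{ role: 'user', content: testCase.prompt }],
    });
    return { text: result.text, elements: result.elements };
  },

  scores: [validElementStructure],
});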

Adding another scorer is one line in the scores array. Difficulty and category go in metadata so you can filter by them in the Braintrust dashboard.

Reading the First Results

Running npm run eval streams results live to the terminal and the Braintrust dashboard. On the first run against 23 test cases, the schema scorer averages 89%. Two cases score 0% because they are modify cases with no elements. The fix is to return null for those at the top of the scorer, before any other check runs:

TypeScript

if (testCase.input.category === 'modify') return null;

Once that is in place, the creation test cases score 100%. Every element the agent generates for every creation input passes the structural check. That is the first green number, and the baseline everything else is measured against.
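For the guard to work, the scorer has to read the test case input as well as the output. A trimmed sketch of where the early return sits is below. It assumes the scorer arguments carry an `input` with a `category` field (as set up in the eval file), uses a deliberately simplified validity check, and the name `creationOnlyStructure` and the `'create'` category value are hypothetical.

```typescript
interface ScorerArgs {
  input: { category: string };
  output: { elements?: unknown[] };
}

function creationOnlyStructure({ input, output }: ScorerArgs) {
  // Skip modify cases entirely so they do not drag the average down.
  if (input.category === 'modify') return null;

  const elements = output.elements ?? [];
  if (elements.length === 0) {
    return { name: 'ValidElementStructure', score: 0 };
  }

  // Trimmed validity check for illustration: a non-null object with id and type.
  const valid = elements.filter(
    el => typeof el === 'object' && el !== null && 'id' in el && 'type' in el,
  ).length;
  return { name: 'ValidElementStructure', score: valid / elements.length };
}

const modifyResult = creationOnlyStructure({
  input: { category: 'modify' },
  output: { elements: [] },
});
const createResult = creationOnlyStructure({
  input: { category: 'create' },
  output: { elements: [{ id: 'a', type: 'rectangle' }, { id: 'b' }] },
});
```

The modify case comes back as null and is left out of the average. The creation case is still scored normally, here at 0.5 because one of its two elements is missing a type.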
