The Eval Harness and Runners

The last post covered the concepts: what evals are, how golden datasets work, and what manual scoring tells you. This post is the code: building the harness, structuring the dataset, and running the first eval.

The Eval Harness

Before reaching for an off-the-shelf framework, it is worth building a minimal eval harness from scratch. Not because you should ship it, but because understanding how it works at the bottom changes how you use the abstractions on top.

The types that structure the harness reflect everything covered in the previous post. A test case needs a difficulty level:

TypeScript

type Difficulty = 'simple' | 'medium' | 'hard' | 'edge';

And a category that describes what kind of failure it is testing for:

TypeScript

type Category = 'layout' | 'content' | 'structure' | 'coverage' | 'edge';

A test case brings these together with an input and an expected output:

TypeScript

interface TestCase {
  id: string;
  input: string;
  expectedOutput: unknown;
  difficulty: Difficulty;
  category: Category;
}

The golden dataset is a collection of these objects. Each one represents one thing you have decided your agent should be able to do. Taken together, they are the definition of what "working well" means for your specific agent.

Expected Characteristics and Keywords

Beyond the input and expected output, each test case can carry two more fields: expectedCharacteristics and expectedKeywords.

Expected characteristics are hints to whoever or whatever is doing the scoring. They describe what a good response should look like in plain terms: "at least one rectangle exists," "every node has an accompanying text element," "the layout has no overlapping elements." These can be evaluated by a human looking at the output, or by an LLM that receives them as part of a scoring prompt. The form depends on whether you are doing manual scoring or automated scoring, but the value is the same: you are writing down what good looks like before you run the eval, not after.
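
For the automated case, the characteristics can be folded into a prompt for an LLM judge. The following is a minimal sketch, not part of the original harness: the buildScoringPrompt name, the prompt wording, and the "met / not met" reply format are all assumptions made for illustration.

```typescript
// Hypothetical helper: turns expectedCharacteristics into a prompt for an LLM judge.
// The wording and the requested reply format are assumptions, not a fixed standard.
function buildScoringPrompt(
  input: string,
  output: string,
  expectedCharacteristics: string[],
): string {
  // Number the characteristics so the judge can refer to them one by one.
  const checklist = expectedCharacteristics
    .map((item, index) => `${index + 1}. ${item}`)
    .join('\n');

  return [
    'You are grading the output of a diagram generation agent.',
    `User request: ${input}`,
    `Agent output: ${output}`,
    'A good response has these characteristics:',
    checklist,
    'For each numbered characteristic, answer "met" or "not met" with a one-line reason.',
  ].join('\n\n');
}
```

The same list of strings works unchanged as a checklist for a human reviewer, which is why it is worth keeping the characteristics as plain sentences rather than encoding them for one scorer.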

Expected keywords are different. These can be checked deterministically with a simple string search. If the user said "label it hello," the word "hello" should be in the output. You do not need an LLM to judge that. A regex or a plain includes() check is sufficient. When you can write a deterministic function to catch a failure, you should. It is faster, cheaper, and more reliable than probabilistic scoring.
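
A keyword check of this kind can be sketched in a few lines. The checkKeywords name and the KeywordCheck shape are hypothetical, and matching case-insensitively is an assumption; a stricter harness might want exact-case matches.

```typescript
interface KeywordCheck {
  passed: boolean;
  missing: string[]; // keywords that were not found in the output
}

// Case-insensitive substring check: every expected keyword must appear in the output.
function checkKeywords(output: string, expectedKeywords: string[]): KeywordCheck {
  const haystack = output.toLowerCase();
  const missing = expectedKeywords.filter(
    (keyword) => !haystack.includes(keyword.toLowerCase()),
  );
  return { passed: missing.length === 0, missing };
}

// Usage: "hello" is present, "circle" is not, so the check fails and reports "circle".
const keywordCheckExample = checkKeywords('Created a rectangle labeled "Hello"', ['hello', 'circle']);
console.log(keywordCheckExample.passed, keywordCheckExample.missing);
```

Returning the missing keywords rather than a bare boolean makes the failure self-explanatory when you read the results file later.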

The combination gives you two layers of scoring per test case: a hard check that fails without ambiguity, and a soft check that requires judgment. Both are useful. The hard checks catch obvious failures immediately. The soft checks catch the subtler quality issues that are harder to automate.

TypeScript

interface TestCase {
  id: string;
  input: string;
  expectedOutput: unknown;
  expectedCharacteristics: string[];
  expectedKeywords: string[];
  difficulty: Difficulty;
  category: Category;
  mockElements?: unknown[];
}

The mockElements field handles test cases that require existing canvas state. A "modify this element" test cannot run without something on the canvas to modify. Since the eval harness runs outside the real application, you supply fake element data in the test case itself.
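
As an illustration, a modification test case might look like the sketch below. The id, the request wording, the characteristics, and the fields on the mock element are all made up for this example; the real element shape depends on what your canvas expects.

```typescript
type Difficulty = 'simple' | 'medium' | 'hard' | 'edge';
type Category = 'layout' | 'content' | 'structure' | 'coverage' | 'edge';

interface TestCase {
  id: string;
  input: string;
  expectedOutput: unknown;
  expectedCharacteristics: string[];
  expectedKeywords: string[];
  difficulty: Difficulty;
  category: Category;
  mockElements?: unknown[];
}

// Illustrative entry only: every value here is hypothetical.
const modifyLabelCase: TestCase = {
  id: 'modify-label-001',
  input: 'Change the label on the rectangle to "Checkout"',
  expectedOutput: null,
  expectedCharacteristics: [
    'the existing rectangle is kept, not recreated',
    'the rectangle has a text label',
  ],
  expectedKeywords: ['Checkout'],
  difficulty: 'simple',
  category: 'content',
  // Fake canvas state so the agent has something to modify.
  mockElements: [
    { id: 'rect-1', type: 'rectangle', x: 100, y: 100, width: 200, height: 80 },
  ],
};
```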

Building the Golden Dataset

The dataset is hand-authored. There is no tool that does this for you. You sit down with your knowledge of the agent's purpose and write out the cases.

For the diagram agent, a dataset of around 20 test cases covers the main scenarios: simple shape creation, multi-step flowcharts, entity relationship diagrams, complex architecture requests, modification tasks, out-of-scope requests the agent should decline, and edge cases with strange or malformed input.

Each object in the JSON file is one test case. Group them by difficulty and category. When you later want to understand why the agent struggles with modifications but handles creation well, those groupings are what let you isolate the data.

The categories also give you a way to direct improvement work. If the eval data shows that the "coverage" category has the worst scores, that is where to focus next. You do not need to improve everything at once. Pick a category, work on it, and run the evals again.
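
Finding the weakest category is a small aggregation over scored results. This is a sketch under the assumption that each scored result has been joined with its test case's category; the ScoredCase shape and both function names are hypothetical.

```typescript
type Category = 'layout' | 'content' | 'structure' | 'coverage' | 'edge';

interface ScoredCase {
  category: Category;
  score: number; // 0 to 1
}

// Average score per category, for categories that have at least one result.
function averageScoreByCategory(rows: ScoredCase[]): Partial<Record<Category, number>> {
  const sums: Partial<Record<Category, number>> = {};
  const counts: Partial<Record<Category, number>> = {};
  for (const row of rows) {
    sums[row.category] = (sums[row.category] ?? 0) + row.score;
    counts[row.category] = (counts[row.category] ?? 0) + 1;
  }
  const averages: Partial<Record<Category, number>> = {};
  for (const category of Object.keys(sums) as Category[]) {
    averages[category] = (sums[category] ?? 0) / (counts[category] ?? 1);
  }
  return averages;
}

// The category with the lowest average, or null when there are no results.
function weakestCategory(averages: Partial<Record<Category, number>>): Category | null {
  let weakest: Category | null = null;
  for (const category of Object.keys(averages) as Category[]) {
    const score = averages[category];
    if (score === undefined) continue;
    if (weakest === null || score < (averages[weakest] ?? Infinity)) weakest = category;
  }
  return weakest;
}
```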

Abstracting the System Prompt

The agent's system prompt starts as a string hardcoded somewhere in the agent file. That works for a single agent instance, but it breaks down the moment you want to run evals, because the eval harness needs to call the model with the same system prompt the production agent uses.

The fix is to extract the system prompt into its own file:

TypeScript

// src/system-prompt.ts
export const systemPrompt = `
You are a diagram generation assistant...
`;

Both the production agent and the eval harness import from this file. When the system prompt changes, it changes everywhere at once. No drift between what is being tested and what is running in production.

The same principle applies to tools. The eval harness and the production agent should use the same tool definitions. Any divergence means you are testing something different from what ships.

The Eval Runner

The eval harness has one job: load the golden dataset, run each test case through the model, and capture the results. No WebSocket. No Durable Object. No UI. Just the model, the prompt, and the output.

The runner uses generateText from the Vercel AI SDK rather than streamText. The difference is simple: streamText sends tokens back to you as they are generated, one chunk at a time -- which is what you want for a live chat UI where the user sees text appearing. generateText waits until the model is completely done and returns everything at once. For an eval harness that is capturing outputs to score, you want the full result before moving on.

TypeScript

import { createOpenAI } from '@ai-sdk/openai';
import { generateText } from 'ai';
import { tools } from '../src/tools';
import { systemPrompt } from '../src/system-prompt';

const openai = createOpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

async function runTestCase(testCase: TestCase): Promise<EvalResult> {
  const start = Date.now();

  try {
    const result = await generateText({
      model: openai('gpt-4o-mini'),
      system: systemPrompt,
      prompt: testCase.input,
      tools,
    });

    return {
      testCaseId: testCase.id,
      output: result.text,
      durationMs: Date.now() - start,
      error: null,
    };
  } catch (err) {
    return {
      testCaseId: testCase.id,
      output: null,
      durationMs: Date.now() - start,
      error: String(err),
    };
  }
}

Running the full dataset sequentially (one test case at a time) takes longer than running in parallel, but it is easier to reason about and debug. With 20 or so test cases, each taking a few seconds, the total run time is a couple of minutes. That is acceptable. You would parallelize when the dataset grows to hundreds of cases.
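
When that point comes, a capped level of concurrency is usually easier to live with than firing every request at once. The helper below is a sketch of one way to do it; mapWithConcurrency is a hypothetical name, and the right limit depends on your provider's rate limits.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight, preserving input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function runLane(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex++; // no await between the check and the increment
      results[index] = await worker(items[index] as T);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) {
    lanes.push(runLane());
  }
  await Promise.all(lanes);
  return results;
}
```

With this in place, the sequential loop could be swapped for something like mapWithConcurrency(testCases, 4, runTestCase). Because runTestCase catches its own errors, one failing case does not reject the whole batch.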

The runner also captures timing. How long a test case takes to complete is a useful signal. A case that consistently takes five times longer than others is telling you something about the model's difficulty with that input.
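
Spotting those slow cases can be automated with a comparison against the run's median duration. This is a rough sketch: the findSlowCases name is hypothetical, and the default factor of five simply mirrors the example above rather than any tuned threshold.

```typescript
interface TimedResult {
  testCaseId: string;
  durationMs: number;
}

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? ((sorted[mid - 1] as number) + (sorted[mid] as number)) / 2
    : (sorted[mid] as number);
}

// Flags cases that took more than `factor` times the median duration of the run.
function findSlowCases(results: TimedResult[], factor = 5): TimedResult[] {
  if (results.length === 0) return [];
  const typical = median(results.map((r) => r.durationMs));
  return results.filter((r) => r.durationMs > typical * factor);
}
```

The median is used instead of the mean so that the slow cases being hunted do not drag the baseline up with them.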

Eval Results and Experiments

One run of the full dataset is called an experiment -- a single recorded attempt at running every test case, which you can compare against previous and future attempts. Each experiment produces a collection of EvalResult objects, one per test case. What you want to track over time:

The score for each individual test case across multiple experiments
The average score across all test cases per experiment

The first shows you which specific inputs are improving or regressing. The second gives you the overall system trend. Both views are necessary. A rising average can hide a regression on one critical test case. Watching individual scores alongside the average catches that.

TypeScript

interface EvalResult {
  testCaseId: string;
  output: unknown;
  durationMs: number;
  error: string | null;
}

interface ScoredResult extends EvalResult {
  score: number;       // 0–1
  notes: string;       // why you scored it this way
}

The ScoredResult type includes a notes field. When you do manual scoring, you want a record of your reasoning. "Scored 0.3 because the layout has overlapping nodes and the label text is cut off" is useful context when you review the same test case three weeks later after an improvement and the score goes up to 0.8.
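
The two views described earlier, per-case scores and the overall average, can be computed from a pair of scored experiments. A minimal sketch, assuming both experiments are arrays of scored results keyed by testCaseId; the function names and the 0.05 tolerance are hypothetical choices.

```typescript
interface ScoredResult {
  testCaseId: string;
  score: number; // 0 to 1
  notes: string;
}

interface CaseDelta {
  testCaseId: string;
  before: number;
  after: number;
  delta: number;
}

function averageScore(results: ScoredResult[]): number {
  if (results.length === 0) return 0;
  return results.reduce((sum, r) => sum + r.score, 0) / results.length;
}

// Per-case score changes between two experiments, for cases present in both.
function compareExperiments(previous: ScoredResult[], current: ScoredResult[]): CaseDelta[] {
  const previousScores = new Map<string, number>();
  for (const r of previous) previousScores.set(r.testCaseId, r.score);

  const deltas: CaseDelta[] = [];
  for (const r of current) {
    const before = previousScores.get(r.testCaseId);
    if (before === undefined) continue; // new test case, nothing to compare against
    deltas.push({ testCaseId: r.testCaseId, before, after: r.score, delta: r.score - before });
  }
  return deltas;
}

// Cases whose score dropped by more than `tolerance`.
function findRegressions(deltas: CaseDelta[], tolerance = 0.05): CaseDelta[] {
  return deltas.filter((d) => d.delta < -tolerance);
}
```

Running findRegressions alongside averageScore is what surfaces the case where the average rises while one important test case quietly gets worse.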

This is not a product. It is a minimal harness built to understand the mechanics. The result files live on disk and you view them in whatever JSON viewer you have. There is no dashboard, no trend chart, no alerting. Once you understand how the raw pieces fit together, you swap this out for a proper eval platform that gives you all of those things. But you will understand what those platforms are doing, and that changes how you use them.

Extracting Tool Call Results

The agent does not just return text. It calls tools. The eval harness needs to capture what those tools returned, not just the final text response.

The AI SDK's generateText returns a steps array. Each step represents one round of model inference, that is, one call to the model. When the model calls a tool and the run is allowed to continue (multi-step behavior is opt-in, configured with maxSteps or stopWhen depending on the SDK version), there will be multiple steps: one where it decided to call the tool, and one where it used the tool result to generate its final response. Looping through the steps and collecting tool outputs is how you get a flat list of everything the agent did:

TypeScript

const elements: unknown[] = [];

for (const step of result.steps) {
  for (const toolResult of step.toolResults ?? []) {
    if (toolResult.toolName === 'generateDiagram') {
      // `result` is the field name in AI SDK 4; AI SDK 5 renamed it to `output`.
      const outputs = toolResult.result as { elements?: unknown[] };
      if (Array.isArray(outputs?.elements)) {
        elements.push(...outputs.elements);
      }
    }
  }
}

The reason for this is that there is no single place in the message history where all tool call outputs are collected. You have to walk the steps yourself. At the end you have one array containing every element the agent generated across all tool calls in that run.

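With the elements collected, the success branch of runTestCase returns an expanded object, and the EvalResult type needs matching input, response, and elements fields: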

TypeScript

return {
  testCaseId: testCase.id,
  input: testCase.input,
  response: result.text,
  elements,
  durationMs: Date.now() - start,
};

This is what gets written to disk: the input, the text response, the generated elements, the timing, and the test case ID, so you can match each result back to its golden dataset entry.

The Main Runner

The main function ties it together:

TypeScript

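import { mkdirSync, readFileSync, writeFileSync } from 'node:fs';
import { dirname, join } from 'node:path';

// Assumption: the script is started from the project root, as an npm script is.
const root = process.cwd();

async function main() {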
  const datasetPath = join(root, 'evals', 'datasets', 'golden', 'test-cases.json');
  const testCases: TestCase[] = JSON.parse(readFileSync(datasetPath, 'utf-8'));

  console.log(`Running ${testCases.length} test cases`);

  const results = [];

  for (const testCase of testCases) {
    console.log(`Running: ${testCase.id} (${testCase.difficulty})`);
    const result = await runTestCase(testCase);
    results.push(result);
  }

  const outputPath = join(root, 'evals', 'results', `run-${Date.now()}.json`);
  mkdirSync(dirname(outputPath), { recursive: true });
  writeFileSync(outputPath, JSON.stringify(results, null, 2));

  console.log(`Done. Results written to ${outputPath}`);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

The loop runs test cases one at a time, for the debugging reasons given earlier. With the 23 test cases in this dataset each taking a few seconds, a full run takes around two to three minutes. Start sequential and parallelize once you understand what can fail.

The runner is invoked via a dedicated npm script that uses the dotenv CLI to load environment variables from .env.dev before handing off to tsx:

JSON

"scripts": {
  "eval": "dotenv -e .env.dev -- tsx evals/runs.ts"
}

What the Results Tell You

When you look at the output, something becomes visible immediately. There is a correlation between how hard a test case is and how many elements the agent generates. Simple cases produce fewer elements. Medium cases produce more. Hard cases produce the most.

If that correlation holds across runs, it validates the dataset. It means the difficulty labels you assigned match what the model actually experiences. When your intuition about difficulty and the model's output are aligned, you have a dataset you can trust. When they diverge, either the labels are wrong or the model is doing something unexpected.
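
This check can be run over a results file instead of eyeballed. The sketch below assumes each result has been joined with its test case's difficulty and reduced to an element count; the RunSummary shape and function names are hypothetical, and leaving out the 'edge' level is an assumption, since edge cases have no obvious place in the ordering.

```typescript
type Difficulty = 'simple' | 'medium' | 'hard' | 'edge';

interface RunSummary {
  difficulty: Difficulty;
  elementCount: number;
}

// Average number of generated elements per difficulty level.
function averageElementsByDifficulty(rows: RunSummary[]): Partial<Record<Difficulty, number>> {
  const sums: Partial<Record<Difficulty, number>> = {};
  const counts: Partial<Record<Difficulty, number>> = {};
  for (const row of rows) {
    sums[row.difficulty] = (sums[row.difficulty] ?? 0) + row.elementCount;
    counts[row.difficulty] = (counts[row.difficulty] ?? 0) + 1;
  }
  const averages: Partial<Record<Difficulty, number>> = {};
  for (const level of Object.keys(sums) as Difficulty[]) {
    averages[level] = (sums[level] ?? 0) / (counts[level] ?? 1);
  }
  return averages;
}

// True when the average element count rises from simple to medium to hard.
// Levels with no data are skipped; 'edge' is deliberately not part of the ordering.
function difficultyOrderingHolds(averages: Partial<Record<Difficulty, number>>): boolean {
  const ordered = (['simple', 'medium', 'hard'] as Difficulty[])
    .map((level) => averages[level])
    .filter((value): value is number => value !== undefined);
  for (let i = 1; i < ordered.length; i++) {
    if ((ordered[i] as number) <= (ordered[i - 1] as number)) return false;
  }
  return true;
}
```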

This kind of observation is available only because you are running real inputs through the real model and looking at the actual outputs. There is no shortcut to this. You have to run the harness.

Eval-Driven Development

There is a way of working that flips the order. Instead of building the agent first and then writing evals to measure it, you write the evals first.

The idea is: before you write any agent logic, write the dataset. Write down what a user will ask and what a good response looks like. Build the tools as empty stubs that return hardcoded data. Run the evals against a bare-minimum agent. Let the eval results drive every subsequent decision about what to build and in what order.
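
A stubbed tool in this style might look like the sketch below. The StubTool interface is a simplified stand-in for illustration only, not the AI SDK's tool type, and the returned rectangle is hardcoded placeholder data.

```typescript
// Simplified tool shape for illustration; a real project would use its SDK's tool definitions.
interface StubTool<Input, Output> {
  description: string;
  execute: (input: Input) => Promise<Output>;
}

interface DiagramElement {
  id: string;
  type: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

interface DiagramOutput {
  elements: DiagramElement[];
}

// Stub: ignores the request and always returns the same single rectangle.
const generateDiagramStub: StubTool<{ description: string }, DiagramOutput> = {
  description: 'Generate diagram elements from a text description (stubbed)',
  execute: async () => ({
    elements: [{ id: 'stub-rect-1', type: 'rectangle', x: 0, y: 0, width: 160, height: 80 }],
  }),
};
```

Against a stub like this, nearly every test case beyond "draw one rectangle" fails, and that list of failures becomes the backlog.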

The alternative, building first and measuring later, tends to produce over-engineered agents. You add features because they seem useful, not because you have evidence they improve anything. By the time you add evals, the agent is already complex and the results are hard to interpret.

Starting with the evals is uncomfortable because the agent will fail almost everything at first. That discomfort is the point. The failures tell you exactly what to build next. You are not guessing at what the agent needs. The data is telling you.

The Manual Scoring Problem

Once the run is complete, you are left with a JSON file full of results. Each result contains the agent's raw text response and an array of Excalidraw elements.

Looking at Excalidraw elements as JSON and judging whether they represent a good diagram is not realistic. A rectangle with coordinates, a label, and an ID does not tell you if the diagram makes sense visually. You need to render it.

This is the practical problem with manual scoring at scale: the tooling matters. A useful manual scoring setup would let you paste the result JSON in and see the rendered canvas. You would see whether nodes overlap, whether the layout is readable, whether the labels are correct. Without that, you are squinting at raw coordinates and guessing.
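
Rendering is still what a real judgment needs, but some of these questions can be pre-screened in code. The sketch below checks for overlapping bounding boxes, assuming each element exposes x, y, width, and height; the Box type and function names are hypothetical, and rotation is ignored.

```typescript
interface Box {
  id: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Axis-aligned boxes overlap when they intersect on both axes. Touching edges do not count.
function boxesOverlap(a: Box, b: Box): boolean {
  return (
    a.x < b.x + b.width &&
    b.x < a.x + a.width &&
    a.y < b.y + b.height &&
    b.y < a.y + a.height
  );
}

// Every pair of ids whose bounding boxes overlap.
function findOverlaps(boxes: Box[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < boxes.length; i++) {
    for (let j = i + 1; j < boxes.length; j++) {
      const a = boxes[i] as Box;
      const b = boxes[j] as Box;
      if (boxesOverlap(a, b)) pairs.push([a.id, b.id]);
    }
  }
  return pairs;
}
```

Filter the input down to shape elements first: a text label that sits inside its container, or an arrow that connects two nodes, would otherwise be reported as an overlap.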

Before LLMs made automated scoring possible, large annotation teams would do this work at scale. Companies were built entirely around the problem of looking at model outputs and labeling them. The bottleneck was always human attention.

Your Agent Cannot Be Good at Everything

One thing that becomes clear when you start building a golden dataset is that you are forced to make a decision you would rather avoid: what is this agent actually for?

An agent cannot be excellent at many things. It can be bad at almost everything, adequate at a small set of things, and close to excellent at a very small subset. The goal is to identify that small subset and make it reliable. What are the ten things your agent should be nearly perfect at? What are the ten things it should handle well? What are the ten things it might attempt, where it is acceptable if it struggles?

And equally important: what are the things it should never attempt? An agent that tries to do everything will be unreliable at all of it. Scope is not a limitation. Scope is what makes reliability possible.

This is why expanding capability is slow and expensive. Adding an eleventh thing to the "nearly perfect" list might take two months of data collection, eval runs, and refinement, while making sure the original ten do not regress. Each addition to the capability surface is a new thing to measure, a new thing to maintain, and a new source of regressions in the things you already have.

The harness you have built captures this. The category field in your test cases makes the scope explicit. The difficulty field shows where the edges are. When the eval data tells you modification tasks succeed only 20% of the time while creation tasks succeed 70% of the time, you know exactly where to focus next.

From Harness to Framework

The custom harness is not something you would run in production. It has no dashboard, no trend visualization, no way to compare two runs side by side, and no automated scoring. The results on disk are raw JSON. You can reason about them if you look hard enough, but you cannot act on them efficiently.

This matters because manual scoring does not scale. Looking at a JSON array of Excalidraw elements and deciding if the diagram is good is not a sustainable process, especially for anything visual. You need a rendered view. You need to be able to compare this run against the last one. You need to see scores trending over time.

That is what eval frameworks and dashboards provide. A widely used option is Braintrust, which gives you a dashboard for tracking experiments, viewing individual test case results, comparing runs, and seeing score trends. Other options in the ecosystem include Langfuse, Promptfoo (acquired by OpenAI), and lighter-weight libraries for embedding evals directly into test suites.

The reason to build the custom harness first is not that it is useful in practice. It is that the concepts it forces you to confront (what a test case is, what an experiment is, how results are structured, what you actually need to score) are the same concepts the frameworks are built on. When you pick up Braintrust, you will know what problem each part of the interface is solving. That understanding is worth the pain of the manual harness.
