Human Review and the Eval Flywheel

The previous post set up Braintrust, wrote the first code-based scorer, and got the first green number. Code scorers handle everything that can be checked with a function. This post covers what happens after: how human review closes the gap between "structurally valid" and "actually good," and how repeated review cycles turn into the flywheel that drives improvement.

Rendering Results for Human Review

The Braintrust dashboard shows each test case as a row with its input, output, and score. For most output types, looking at the raw data is enough. For a visual agent, raw Excalidraw JSON tells you nothing about whether the diagram actually looks good.

A viewer panel in the app lets you paste element JSON copied from Braintrust and render it as a live canvas. This closes the gap between "the JSON passes the schema check" and "the diagram looks like what the user asked for." When doing manual review, the workflow is: copy the output JSON from Braintrust, paste it into the viewer, look at the result.

The non-deterministic nature of the agent is visible here. The same edge-case prompt, "draw a square that is also a circle," can produce a reasonable approximation on one run and a refusal on the next. That variance is why single-run results are not enough signal. Braintrust's CLI supports a flag to run the same eval N times. The average across multiple runs is a more trustworthy indicator than any individual result.
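To make the averaging point concrete, here is a minimal sketch in plain Python. It does not use the Braintrust SDK or CLI. `flaky_agent` and `score` are hypothetical stand-ins for the real agent and scorer, and the 60% success rate is an arbitrary assumption chosen only to produce variance.

```python
import random
from statistics import mean, pstdev

def flaky_agent(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for the agent: draws something or refuses at random."""
    return "diagram" if rng.random() < 0.6 else "refusal"

def score(output: str) -> float:
    """Toy scorer: 1.0 for any diagram, 0.0 for a refusal."""
    return 1.0 if output == "diagram" else 0.0

def run_trials(prompt: str, n: int, seed: int = 0) -> list:
    """Run the same prompt n times and collect one score per run."""
    rng = random.Random(seed)
    return [score(flaky_agent(prompt, rng)) for _ in range(n)]

scores = run_trials("draw a square that is also a circle", n=20)
print(f"first run: {scores[0]}")
print(f"mean of {len(scores)} runs: {mean(scores):.2f} (spread {pstdev(scores):.2f})")
```

Any single entry in `scores` is either 0 or 1, so it says little on its own. The mean and the spread across runs are what you would compare between versions of the agent.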

Human Review Scores

Braintrust supports human review scores alongside automated code scores. A reviewer opens a test case, looks at the output or the rendered canvas, and assigns a rating. The scale is yours to define. A simple four-point scale works well: unacceptable, rough but usable, acceptable, good.
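One way to write that scale down in code is sketched below. The `Rating` enum and `to_score` helper are illustrative names, not part of Braintrust, and normalizing to the 0 to 1 range is an assumption made so that human ratings can sit next to automated scores.

```python
from enum import IntEnum

class Rating(IntEnum):
    """The four-point review scale, lowest to highest."""
    UNACCEPTABLE = 0
    ROUGH_BUT_USABLE = 1
    ACCEPTABLE = 2
    GOOD = 3

def to_score(rating: Rating) -> float:
    """Map a rating onto 0..1 (assumed convention for comparing with code scores)."""
    return rating.value / Rating.GOOD.value

for rating in Rating:
    print(f"{rating.name:>17}: {to_score(rating):.2f}")
```

Whatever scale you pick, fixing the labels up front keeps ratings comparable between review sessions.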

Human review is not optional at the start. Until there is enough data to build an automated quality judge, you are the only source of ground truth for subjective quality. Spending a week going through test case outputs and rating them is not a distraction from the real work. It is the real work. The expert who built the agent is the only person who knows what "good enough to ship" means.

The Flywheel

Manual reviews are most valuable when the reviewer leaves notes explaining why they scored something low. Those notes go into Braintrust alongside the score. Over time, a dataset of outputs, scores, and reviewer reasoning accumulates.

That accumulated data feeds the next improvement cycle. You look at the cases that scored worst, read the reviewer notes, and work backward to what caused the failure: tool coverage, missing context, layout logic, a gap in the system prompt. Each failure with a note points at something concrete to fix.
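A rough sketch of that triage step follows. The `Review` record, the `cause` tag, and the sample rows are all made up for illustration. In practice the rows would be exported from wherever the reviews are stored.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Review:
    case_id: str
    score: float  # human rating normalized to 0..1
    cause: str    # reviewer's tag for the suspected root cause
    note: str     # free-text explanation

# Made-up sample data, for illustration only.
reviews = [
    Review("c1", 0.0, "layout logic", "Arrows overlap the boxes."),
    Review("c2", 1.0, "", "Looks right."),
    Review("c3", 0.33, "missing context", "Ignored elements already on the canvas."),
    Review("c4", 0.0, "layout logic", "Everything is stacked at the origin."),
]

def worst_first(items, threshold=0.5):
    """Keep the cases below the threshold, lowest score first."""
    return sorted((r for r in items if r.score < threshold), key=lambda r: r.score)

def causes_by_frequency(items):
    """Count how often each suspected cause appears, most common first."""
    return Counter(r.cause for r in items).most_common()

failures = worst_first(reviews)
ranked = causes_by_frequency(failures)
for cause, count in ranked:
    print(f"{cause}: {count} failing case(s)")
```

The cause at the top of `ranked` is a reasonable candidate for "the most impactful thing" to fix next, and the notes on those cases say what the fix should address.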

This is the flywheel. Build the agent. Run the evals. Review the failures. Fix the most impactful thing. Run the evals again. Repeat. The cycle does not end. The agent improves because the data shows where it fails, not because of guesswork.

In a real product, the dataset does not stay a static JSON file. It lives in Braintrust and grows continuously: new outputs get added, reviewer notes accumulate, and user feedback from production gets pulled in. The flywheel only turns because data keeps flowing into it.

A Scorer Should Tell You What to Fix

There is a useful test for whether a scorer is well-designed: if it fails, do you know exactly what to do next?

A schema scorer that fails tells you: the agent is producing invalid element structures. The fix is to add structured output or tighten the schema in the tool definition. A label keyword scorer that fails tells you: elements are missing expected text labels. The fix is to update the system prompt or add an explicit constraint that labels are required.
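The two scorers described above might look roughly like this. The required keys are a simplified assumption about the element shape, not the full Excalidraw schema. The return format (a dict with `name`, `score`, and `metadata`) and the `next_step` field are illustrative choices rather than a required interface.

```python
# Assumed minimal element shape for illustration; real elements carry more fields.
REQUIRED_KEYS = {"type", "x", "y", "width", "height"}

def schema_scorer(elements):
    """Fail if any element is missing a required key, and say which ones."""
    invalid = [
        i for i, el in enumerate(elements)
        if not isinstance(el, dict) or not REQUIRED_KEYS.issubset(el)
    ]
    return {
        "name": "schema",
        "score": 0.0 if invalid else 1.0,
        "metadata": {
            "invalid_indexes": invalid,
            "next_step": "add structured output or tighten the tool schema",
        },
    }

def label_keyword_scorer(elements, expected_keywords):
    """Score the fraction of expected keywords that appear in element text."""
    text = " ".join(
        str(el.get("text", "")) for el in elements if isinstance(el, dict)
    ).lower()
    missing = [kw for kw in expected_keywords if kw.lower() not in text]
    total = len(expected_keywords)
    return {
        "name": "label_keywords",
        "score": 1.0 if total == 0 else (total - len(missing)) / total,
        "metadata": {
            "missing_keywords": missing,
            "next_step": "require labels in the system prompt",
        },
    }

elements = [
    {"type": "rectangle", "x": 0, "y": 0, "width": 120, "height": 60, "text": "Login"},
    {"type": "text", "x": 0, "y": 80},  # missing width and height
]
schema_result = schema_scorer(elements)
label_result = label_keyword_scorer(elements, ["login", "database"])
print(schema_result["score"], schema_result["metadata"])
print(label_result["score"], label_result["metadata"])
```

Each result carries the evidence (which elements, which keywords) together with the score, so a failing row already points at the change to make.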

If a scorer fails and you do not know what it means or what to change, the scorer is poorly designed. The purpose of a score is not just to measure. It is to direct the next round of work. A score that cannot tell you what to improve is noise.

Think of it this way: before you write a scorer, ask yourself what you would do if this score dropped. If the answer is "I'm not sure," reconsider the scorer.

Offline Evals and Online Evals

Everything described so far is an offline eval: you run a batch of test cases separately from production rather than on live traffic, capture the results, and review them. This is the main workflow during development.

Once the agent is in production, online evals run alongside it. A user submits a prompt, the agent responds, and the automated scorers run against that real interaction immediately, on live traffic. By the time you look at the data, it has already been evaluated. A thumbs up or thumbs down from the user is an online eval. A code-based scorer checking whether the returned JSON is valid is an online eval.

The distinction matters for speed and cost. Automated code scorers run fast enough to evaluate production traffic in real time. LLM-based judges are slower and more expensive, so they tend to run offline on sampled data rather than against every live request.
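A sketch of that split, under stated assumptions: the cheap code scorer runs on every request, while only a sampled fraction is queued for a slower judge to process offline. `handle_request`, `judge_queue`, and the 10% sample rate are hypothetical. The judge itself is left out.

```python
import json
import random

def valid_json_score(output: str) -> float:
    """Cheap deterministic check that can run on every live response."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def handle_request(output: str, judge_queue: list, rng: random.Random,
                   sample_rate: float = 0.1) -> dict:
    """Score every response inline; queue only a sample for offline judging."""
    scores = {"valid_json": valid_json_score(output)}
    if rng.random() < sample_rate:
        judge_queue.append(output)  # picked up later by a batch job
    return scores

rng = random.Random(0)
judge_queue = []
results = [handle_request('{"elements": []}', judge_queue, rng) for _ in range(1000)]
print(f"scored inline: {len(results)}, queued for the judge: {len(judge_queue)}")
```

The sample rate is the knob that trades judge cost against coverage. The code scorer has no such knob because it is cheap enough to run everywhere.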

LLM as a Judge

Deterministic scorers cover a lot of ground. But some quality dimensions cannot be captured with a function. "Does this diagram communicate the idea clearly?" requires judgment that code cannot provide.

LLM as a judge means using a second LLM to evaluate the output of the first. You give it the input, the output, and a description of what good looks like, and ask it to return a score. This is useful, but it carries a cost: the judge is itself a non-deterministic system that can give different scores for the same output on different runs. It needs its own eval.

The practical approach is to resist using an LLM as a judge until there is enough manual review data to calibrate it, meaning to check whether the judge's scores actually match what a human expert would say. A judge prompt written without reference data is guessing. Once reviewers have gone through enough outputs and left enough notes, you can use that data to build a judge that reflects their taste. The judge is an approximation of the human expert, not a substitute for doing the expert work first.
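Calibration can start as a simple comparison between the judge's scores and the human ratings on the same cases. The sketch below assumes both are stored as integers on the four-point scale, keyed by case id. The sample values are made up, and the two metrics are one reasonable choice rather than an established standard.

```python
def calibration(human: dict, judge: dict, tolerance: int = 0):
    """Compare judge scores with human ratings on the cases both have scored.

    Returns (agreement_rate, mean_absolute_difference).
    """
    shared = [case for case in human if case in judge]
    if not shared:
        raise ValueError("no overlapping cases to compare")
    diffs = [abs(human[case] - judge[case]) for case in shared]
    agreement_rate = sum(d <= tolerance for d in diffs) / len(diffs)
    mean_abs_diff = sum(diffs) / len(diffs)
    return agreement_rate, mean_abs_diff

# Made-up ratings on a 0..3 scale, for illustration only.
human_ratings = {"c1": 0, "c2": 3, "c3": 2, "c4": 1}
judge_ratings = {"c1": 1, "c2": 3, "c3": 2, "c4": 3}

rate, mad = calibration(human_ratings, judge_ratings)
print(f"exact agreement: {rate:.0%}, mean absolute difference: {mad:.2f}")
```

A judge that disagrees with the reviewers this often is not ready to replace them. The cases where it disagrees are the ones to read when revising the judge prompt.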

This gets recursive quickly: building a good LLM judge requires collecting data, evaluating it, and iterating. That is the same process as building the agent. The difference is that the agent delivers value to users while the judge only measures that value. Start with deterministic scorers, add human review, and only reach for LLM judgment when manual review becomes the bottleneck.

The preference ordering is: deterministic code checks first, human review second, LLM as judge third. The further down that list you go, the more expensive, the slower, and the harder to calibrate the scoring becomes.