Evals and Golden Datasets

Hands-on practice for this lecture. Work through the exercises and quizzes to reinforce what you've learned.

Exercise 1 of 1

Pass@K vs Pass^K: Measuring Quality and Reliability

A single eval run can pass or fail by luck, so one result tells you little. Run the eval K=10 times and toggle each outcome to see how Pass@K (quality: at least one run passes) and Pass^K (reliability: every run passes) respond differently.
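To make the two metrics concrete, here is a minimal Python sketch of how they could be computed from K recorded pass/fail outcomes. The function names and the sample outcomes are illustrative assumptions, not part of the exercise. The two `expected_*` helpers further assume every run passes independently with the same probability p, which real evals may not satisfy.

```python
from typing import Sequence


def pass_at_k(outcomes: Sequence[bool]) -> bool:
    """Quality: True if at least one of the K runs passed."""
    return any(outcomes)


def pass_hat_k(outcomes: Sequence[bool]) -> bool:
    """Reliability: True only if every one of the K runs passed."""
    return len(outcomes) > 0 and all(outcomes)


def expected_pass_at_k(p: float, k: int) -> float:
    """Chance of at least one pass in k runs, assuming independent runs."""
    return 1.0 - (1.0 - p) ** k


def expected_pass_hat_k(p: float, k: int) -> float:
    """Chance that all k runs pass, assuming independent runs."""
    return p ** k


# Made-up outcomes for K=10 runs of one test case: nine passes, one failure.
runs = [True, True, False, True, True, True, True, True, True, True]

print(pass_at_k(runs))   # True: at least one run passed
print(pass_hat_k(runs))  # False: one run failed

# With an assumed per-run pass rate of 0.9, the two metrics diverge sharply.
print(expected_pass_at_k(0.9, 10))   # very close to 1
print(expected_pass_hat_k(0.9, 10))  # roughly 0.35
```

A single failure leaves Pass@K unchanged but flips Pass^K to a fail, which is why the two metrics answer different questions about the same set of runs.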


Eval Scoring: Single Run (binary pass/fail)

Test case

input: "draw a single rectangle labelled Hello"
difficulty: simple
category: create
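As a rough sketch, a test case like this one could be stored as a small record and scored with a binary check. The `EvalCase` class, the `score_single_run` function, and the list-of-shapes output format are all hypothetical, chosen only for illustration. The exercise does not specify how the system's output is represented or checked.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class EvalCase:
    """One golden-dataset entry, mirroring the fields shown above."""
    input: str
    difficulty: str
    category: str


case = EvalCase(
    input="draw a single rectangle labelled Hello",
    difficulty="simple",
    category="create",
)


def score_single_run(output: list[dict[str, Any]]) -> bool:
    """Binary pass/fail for this case only.

    Assumes (hypothetically) that the system returns a list of shape
    dicts with "type" and "label" keys.
    """
    return (
        len(output) == 1
        and output[0].get("type") == "rectangle"
        and output[0].get("label") == "Hello"
    )


# Two made-up outputs: one correct, one with an extra shape.
good = [{"type": "rectangle", "label": "Hello"}]
bad = [{"type": "rectangle", "label": "Hello"}, {"type": "ellipse", "label": ""}]

print(score_single_run(good))  # True
print(score_single_run(bad))   # False
```

Because the score is binary, a single run carries no information about how often the system would pass on repeated attempts, which is the gap the K-run metrics address.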
