Evals and Golden Datasets

Hands-on practice for this lecture. Work through the exercises and quizzes to reinforce what you've learned.

Exercise 1 of 1

Pass@K vs Pass^K: Measuring Quality and Reliability

A single eval run gives you one result, which may be lucky or unlucky. Run the eval K = 10 times and toggle each outcome to see how Pass@K (quality: at least one of the K runs passes) and Pass^K (reliability: all K runs pass) respond differently.
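The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the exercise's own implementation: the function names (`pass_at_k`, `pass_hat_k`, `expected_pass_at_k`, `expected_pass_hat_k`) and the sample outcomes are hypothetical, and the two "expected" helpers assume each run passes independently with the same probability p, which real eval runs may not satisfy.

```python
from typing import Sequence


def pass_at_k(outcomes: Sequence[bool]) -> bool:
    """Pass@K (quality): True if at least one of the K runs passed."""
    if not outcomes:
        raise ValueError("need at least one run")
    return any(outcomes)


def pass_hat_k(outcomes: Sequence[bool]) -> bool:
    """Pass^K (reliability): True only if every one of the K runs passed."""
    if not outcomes:
        raise ValueError("need at least one run")
    return all(outcomes)


def expected_pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k runs passes.

    Assumption: runs are independent, each passing with probability p.
    """
    return 1.0 - (1.0 - p) ** k


def expected_pass_hat_k(p: float, k: int) -> float:
    """Chance that all k runs pass, under the same independence assumption."""
    return p ** k


if __name__ == "__main__":
    # Hypothetical K=10 outcomes: nine passes and one failure.
    runs = [True, True, True, False, True, True, True, True, True, True]
    print("Pass@K:", pass_at_k(runs))   # one pass is enough
    print("Pass^K:", pass_hat_k(runs))  # a single failure breaks it

    # With a 90% per-run pass rate, the two metrics diverge sharply.
    print("expected Pass@10:", expected_pass_at_k(0.9, 10))
    print("expected Pass^10:", expected_pass_hat_k(0.9, 10))
```

Toggling a single outcome from pass to fail leaves Pass@K unchanged as long as any other run passed, but immediately flips Pass^K to a fail. That asymmetry is why the first is read as a quality signal and the second as a reliability signal.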

Eval Scoring — Single Run (binary pass/fail)

Test case

input: "draw a single rectangle labelled Hello"
difficulty: simple
category: create
Practice: Evals and Golden Datasets — Interactive Exercises | Durgesh Rai