Part 1
The Golden Evaluation Set
The foundation. Everything else depends on it.
A versioned, owned set of input-output pairs that defines what "correct" looks like for your LLM application. It is the single source of truth for whether the model is working.
Minimum viable setup
- 25 to 50 input-output pairs covering the most important use cases.
- Stored in git, version-controlled like code.
- Owned by the business team, not the ML team.
- Each pair tagged with use case, expected behaviour, and edge case category.
- Reviewed at least once a quarter.
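One way to make the checklist above enforceable is to store each pair as a record in a JSON Lines file in the repository and validate it in CI. The sketch below is illustrative only: the field names, the `GoldenCase` class, the JSONL layout, and the 92-day approximation of "a quarter" are assumptions, not part of any standard.

```python
import json
from dataclasses import dataclass
from datetime import date, timedelta

MIN_CASES = 25                         # lower bound from the checklist above
REVIEW_INTERVAL = timedelta(days=92)   # assumption: "a quarter" is roughly 92 days


@dataclass(frozen=True)
class GoldenCase:
    """One input-output pair plus the three tags from the checklist."""
    case_id: str
    input: str
    expected_output: str
    use_case: str
    expected_behaviour: str
    edge_case_category: str
    last_reviewed: date


def load_cases(jsonl_text: str) -> list[GoldenCase]:
    """Parse one JSON object per non-empty line into GoldenCase records."""
    cases = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        row["last_reviewed"] = date.fromisoformat(row["last_reviewed"])
        cases.append(GoldenCase(**row))
    return cases


def validate_golden_set(cases: list[GoldenCase], today: date) -> list[str]:
    """Return human-readable problems; an empty list means the set passes."""
    problems = []
    if len(cases) < MIN_CASES:
        problems.append(f"only {len(cases)} cases; expected at least {MIN_CASES}")
    seen = set()
    for case in cases:
        if case.case_id in seen:
            problems.append(f"{case.case_id}: duplicate id")
        seen.add(case.case_id)
        for tag in ("use_case", "expected_behaviour", "edge_case_category"):
            if not getattr(case, tag).strip():
                problems.append(f"{case.case_id}: missing tag {tag}")
        if today - case.last_reviewed > REVIEW_INTERVAL:
            problems.append(f"{case.case_id}: review overdue")
    return problems
```

Because the file lives in git, every label change goes through the usual review flow, which gives the business team a natural place to approve or reject it.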
What good looks like
The set has grown from around 25 cases at launch to more than 300 within the first year, driven by real production traffic and real incidents. The business team owns the labels and reviews them quarterly. Every model or prompt change is evaluated against the set automatically.
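Automatic evaluation on every change can be as simple as a script that runs in CI. The following is a minimal sketch under stated assumptions: `model_fn` stands in for whatever calls your model, `exact_match` is a placeholder scorer (real applications often need fuzzier comparison), and the 0.9 threshold is an arbitrary example value.

```python
from collections import defaultdict
from typing import Callable


def exact_match(expected: str, actual: str) -> bool:
    """Placeholder scorer: case-insensitive match after trimming whitespace."""
    return expected.strip().lower() == actual.strip().lower()


def evaluate(
    cases: list[dict],
    model_fn: Callable[[str], str],
    scorer: Callable[[str, str], bool] = exact_match,
) -> dict[str, float]:
    """Run every golden case through model_fn and return pass rate per use case."""
    totals = defaultdict(lambda: [0, 0])  # use_case -> [passed, total]
    for case in cases:
        actual = model_fn(case["input"])
        totals[case["use_case"]][1] += 1
        if scorer(case["expected_output"], actual):
            totals[case["use_case"]][0] += 1
    return {uc: passed / total for uc, (passed, total) in totals.items()}


def gate(pass_rates: dict[str, float], threshold: float = 0.9) -> list[str]:
    """Return the use cases whose pass rate falls below the threshold."""
    return [uc for uc, rate in sorted(pass_rates.items()) if rate < threshold]
```

A CI job could call `evaluate` with the candidate model or prompt and fail the build when `gate` returns a non-empty list. Reporting per use case rather than a single overall score keeps a regression in one area from hiding behind strong results elsewhere.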