Evaluators: Overview
When and how to build automated evaluators.
Decision Framework
Should I Build an Evaluator?
        │
        ▼
Can I fix it with a prompt change?
    YES → Fix the prompt first
    NO  → Is this a recurring issue?
            YES → Build evaluator
            NO  → Add to watchlist
Don't automate prematurely. Many issues are simple prompt fixes.
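The decision tree above can be sketched as a small function. This is only an illustration: the names `Action` and `next_action` are hypothetical, and the two boolean inputs stand in for judgments a person makes during error analysis.

```python
from enum import Enum


class Action(Enum):
    """Possible outcomes of the decision framework (names are illustrative)."""
    FIX_PROMPT = "Fix the prompt first"
    BUILD_EVALUATOR = "Build evaluator"
    WATCHLIST = "Add to watchlist"


def next_action(fixable_by_prompt: bool, recurring: bool) -> Action:
    """Walk the decision tree for one observed issue."""
    # A prompt fix always wins, even for recurring issues.
    if fixable_by_prompt:
        return Action.FIX_PROMPT
    # Only recurring issues justify the cost of an evaluator.
    if recurring:
        return Action.BUILD_EVALUATOR
    return Action.WATCHLIST
```

For example, `next_action(fixable_by_prompt=False, recurring=False)` returns `Action.WATCHLIST`.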
Evaluator Requirements
- Clear criteria - Specific, not "Is it good?"
- Labeled test set - 100+ examples with human labels
- Measured accuracy - Know the evaluator's true positive rate (TPR) and true negative rate (TNR) before deploying
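As a minimal sketch of the "measured accuracy" requirement, the function below compares evaluator verdicts with human labels and returns TPR and TNR. It assumes boolean labels where `True` is the positive class; which outcome counts as "positive" is a convention you would need to fix for your own evaluator, and `tpr_tnr` is a hypothetical helper name.

```python
def tpr_tnr(human_labels, evaluator_labels):
    """Return (TPR, TNR) of the evaluator measured against human labels.

    Assumption: labels are booleans and True is the positive class.
    A rate is None when the human labels contain no example of that class.
    """
    if len(human_labels) != len(evaluator_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(human_labels, evaluator_labels))
    tp = sum(1 for h, e in pairs if h and e)          # both say positive
    fn = sum(1 for h, e in pairs if h and not e)      # evaluator missed a positive
    tn = sum(1 for h, e in pairs if not h and not e)  # both say negative
    fp = sum(1 for h, e in pairs if not h and e)      # evaluator raised a false alarm

    tpr = tp / (tp + fn) if (tp + fn) else None
    tnr = tn / (tn + fp) if (tn + fp) else None
    return tpr, tnr


# Tiny illustrative sample; a real test set would have 100+ labeled examples.
human = [True, True, True, False, False]
evaluator = [True, True, False, False, True]
tpr, tnr = tpr_tnr(human, evaluator)
```

Reporting the two rates separately, rather than a single accuracy figure, shows whether the evaluator tends to miss real positives or to flag negatives incorrectly.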
Evaluator Lifecycle
- Discover - Error analysis reveals pattern
- Design - Define criteria and test cases
- Implement - Build code or LLM evaluator
- Calibrate - Validate against human labels
- Deploy - Add to experiment/CI pipeline
- Monitor - Track accuracy over time
- Maintain - Update as product evolves
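To make the Implement and Deploy stages concrete, here is a hedged sketch of a code-based evaluator and an aggregate pass rate that a CI job could compare with a threshold. The criterion itself (the output must be a JSON object with a non-empty string `answer` field) and the names `has_answer_field` and `pass_rate` are invented for illustration; they are not taken from the source.

```python
import json


def has_answer_field(output: str) -> bool:
    """Hypothetical criterion: output is a JSON object with a non-empty 'answer' string."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    answer = data.get("answer")
    return isinstance(answer, str) and bool(answer.strip())


def pass_rate(outputs) -> float:
    """Fraction of outputs that satisfy the criterion."""
    if not outputs:
        raise ValueError("need at least one output to evaluate")
    return sum(has_answer_field(o) for o in outputs) / len(outputs)


# Illustrative outputs only; in a pipeline these would come from an experiment run.
sample_outputs = ['{"answer": "42"}', "not json", '{"answer": ""}', "[1, 2]"]
rate = pass_rate(sample_outputs)
```

In a CI step, one option is to fail the build when `rate` drops below an agreed threshold; the threshold value is a project decision and is not specified here.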
What NOT to Automate
- Rare issues - Fewer than 5 instances? Add to the watchlist, don't build
- Quick fixes - Fixable by a prompt change? Fix the prompt
- Evolving criteria - Stabilize the definition first