mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-12 11:15:56 +00:00
# Evaluators: Overview
When and how to build automated evaluators.
## Decision Framework
```
Should I Build an Evaluator?
            │
            ▼
Can I fix it with a prompt change?
  YES → Fix the prompt first
  NO  → Is this a recurring issue?
          YES → Build evaluator
          NO  → Add to watchlist
```
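
The branching above can be sketched as a tiny decision function. This is a minimal illustration; the boolean parameters and return strings are assumptions, not an API from this repo:

```python
def should_build_evaluator(fixable_by_prompt: bool, recurring: bool) -> str:
    """Encode the decision framework: prompt fix > evaluator > watchlist."""
    if fixable_by_prompt:
        return "fix the prompt first"
    if recurring:
        return "build evaluator"
    return "add to watchlist"

# A recurring issue that a prompt change cannot fix warrants an evaluator.
print(should_build_evaluator(fixable_by_prompt=False, recurring=True))
# → build evaluator
```

Note the ordering: the cheap option (a prompt change) is always tried before any automation is built.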
**Don't automate prematurely.** Many issues are simple prompt fixes.
## Evaluator Requirements
1. **Clear criteria** - Specific, not "Is it good?"
2. **Labeled test set** - 100+ examples with human labels
3. **Measured accuracy** - Know TPR/TNR before deploying
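
Requirement 3 can be checked with a small calibration helper that compares an evaluator's verdicts to human pass/fail labels. Everything below is a hypothetical sketch: the `evaluator` callable, the toy length check, and the tiny labeled set are illustrative stand-ins for a real 100+ example set:

```python
def measure_accuracy(evaluator, labeled_examples):
    """Compute TPR and TNR of an evaluator against human pass/fail labels.

    evaluator: callable(output) -> bool (True = pass)
    labeled_examples: iterable of (output, human_pass) pairs
    """
    tp = tn = fp = fn = 0
    for output, human_pass in labeled_examples:
        predicted_pass = evaluator(output)
        if human_pass and predicted_pass:
            tp += 1
        elif human_pass and not predicted_pass:
            fn += 1
        elif not human_pass and predicted_pass:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity on true passes
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # specificity on true fails
    return tpr, tnr

# Toy example: a length-based evaluator checked on 4 human-labeled outputs.
evaluator = lambda text: len(text) <= 20
labeled = [
    ("short answer", True),
    ("ok", True),
    ("this response is far too long", False),
    ("fine", False),  # human fail the evaluator wrongly passes
]
print(measure_accuracy(evaluator, labeled))  # → (1.0, 0.5)
```

A TNR of 0.5 here would mean the evaluator misses half the real failures, which is exactly the kind of number you want to know *before* deploying.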
## Evaluator Lifecycle
1. **Discover** - Error analysis reveals pattern
2. **Design** - Define criteria and test cases
3. **Implement** - Build code or LLM evaluator
4. **Calibrate** - Validate against human labels
5. **Deploy** - Add to experiment/CI pipeline
6. **Monitor** - Track accuracy over time
7. **Maintain** - Update as product evolves
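
Stages 3 and 4 (Implement, Calibrate) can be sketched together. The class below is a hypothetical code-based evaluator with one clear criterion, validated against a small labeled set; none of the names come from this repo:

```python
import json


class JsonFormatEvaluator:
    """Implement: a code evaluator with one clear criterion, 'is valid JSON'."""

    def check(self, output: str) -> bool:
        try:
            json.loads(output)
            return True
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            return False

    def calibrate(self, labeled) -> float:
        """Calibrate: agreement rate against human pass/fail labels."""
        agree = sum(self.check(output) == label for output, label in labeled)
        return agree / len(labeled)


ev = JsonFormatEvaluator()
print(ev.calibrate([
    ('{"a": 1}', True),
    ("not json", False),
    ("[1, 2]", True),
]))  # → 1.0
```

The same `calibrate` call can be rerun periodically against fresh labels, which covers the Monitor stage: if agreement drifts as the product evolves, the evaluator is due for Maintain.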
## What NOT to Automate
- **Rare issues** - <5 instances? Watchlist, don't build
- **Quick fixes** - Fixable by prompt change? Fix it
- **Evolving criteria** - Stabilize definition first