Awesome/awesome-copilot

mirror of https://github.com/github/awesome-copilot.git synced 2026-04-12 19:25:55 +00:00


github-actions[bot] a68b190031 chore: publish from staged

2026-04-09 06:26:21 +00:00


Evaluators: Overview

When and how to build automated evaluators.

Decision Framework

Should I Build an Evaluator?
        │
        ▼
Can I fix it with a prompt change?
    YES → Fix the prompt first
    NO  → Is this a recurring issue?
          YES → Build evaluator
          NO  → Add to watchlist

Don't automate prematurely. Many issues are simple prompt fixes.
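
The decision framework above can be sketched as a small function. This is only an illustration: the function name `triage_issue` and its two boolean parameters are hypothetical, and the returned strings simply echo the outcomes in the diagram.

```python
def triage_issue(fixable_by_prompt: bool, recurring: bool) -> str:
    """Walk the decision framework and return the recommended action.

    Both parameters are illustrative stand-ins for the two questions
    in the diagram; they are answered by a human during error analysis.
    """
    # First question: can a prompt change fix it?
    if fixable_by_prompt:
        return "fix the prompt first"
    # Second question: is this a recurring issue?
    if recurring:
        return "build evaluator"
    return "add to watchlist"


if __name__ == "__main__":
    print(triage_issue(fixable_by_prompt=False, recurring=True))
```

The prompt check comes first on purpose: an evaluator is only considered once the cheaper fix has been ruled out.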

Evaluator Requirements

Clear criteria - Specific, not "Is it good?"
Labeled test set - 100+ examples with human labels
Measured accuracy - Know the true positive rate (TPR) and true negative rate (TNR) before deploying
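
As a minimal sketch of the "measured accuracy" requirement, the snippet below computes TPR and TNR by comparing evaluator verdicts with human labels. The function name `tpr_tnr` is hypothetical, and it assumes labels are booleans where `True` marks the positive class (whatever the evaluator is meant to detect).

```python
from typing import Sequence, Tuple


def tpr_tnr(human: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (TPR, TNR) of evaluator verdicts measured against human labels.

    Assumption: True is the positive class in both sequences.
    """
    if len(human) != len(predicted):
        raise ValueError("human and predicted must have the same length")

    positives = sum(1 for h in human if h)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("the labeled set needs both positive and negative examples")

    # True positives: human says positive and the evaluator agrees.
    true_pos = sum(1 for h, p in zip(human, predicted) if h and p)
    # True negatives: human says negative and the evaluator agrees.
    true_neg = sum(1 for h, p in zip(human, predicted) if not h and not p)

    return true_pos / positives, true_neg / negatives


if __name__ == "__main__":
    # Tiny illustrative set; a real test set would have 100+ labeled examples.
    human_labels = [True, True, True, False, False]
    evaluator_verdicts = [True, True, False, False, True]
    tpr, tnr = tpr_tnr(human_labels, evaluator_verdicts)
    print(f"TPR={tpr:.2f} TNR={tnr:.2f}")
```

Reporting the two rates separately matters because a single accuracy number can hide an evaluator that is strong on one class and weak on the other.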

Evaluator Lifecycle

Discover - Error analysis reveals pattern
Design - Define criteria and test cases
Implement - Build code or LLM evaluator
Calibrate - Validate against human labels
Deploy - Add to experiment/CI pipeline
Monitor - Track accuracy over time
Maintain - Update as product evolves
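
The Monitor step could look roughly like the sketch below: re-score the evaluator on a fresh human-labeled sample and flag it when agreement drops too far below the calibration baseline. The names `agreement` and `has_drifted` and the default `tolerance` of 0.05 are assumptions for illustration, not values from this guide.

```python
from typing import Sequence


def agreement(human: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Fraction of examples where the evaluator matches the human label."""
    if not human or len(human) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(1 for h, p in zip(human, predicted) if h == p) / len(human)


def has_drifted(
    baseline: float,
    human: Sequence[bool],
    predicted: Sequence[bool],
    tolerance: float = 0.05,  # hypothetical threshold; tune per evaluator
) -> bool:
    """Return True when agreement fell more than `tolerance` below `baseline`."""
    return baseline - agreement(human, predicted) > tolerance


if __name__ == "__main__":
    fresh_human = [True, False, True, False]
    fresh_verdicts = [True, False, False, True]
    if has_drifted(0.9, fresh_human, fresh_verdicts):
        print("Evaluator drifted: recalibrate against human labels")
```

A drift flag would typically send the evaluator back to the Calibrate step rather than straight to a rewrite.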

What NOT to Automate

Rare issues - Fewer than 5 instances? Add to the watchlist, don't build
Quick fixes - Fixable by a prompt change? Fix the prompt
Evolving criteria - Stabilize the definition first
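
One way to keep a watchlist for rare issues is a simple tally, sketched below. The `Watchlist` class and its method names are hypothetical. The threshold of 5 comes from the "rare issues" rule above. An issue that reaches the threshold is only a candidate, since the prompt-fix and stable-criteria checks still apply.

```python
from collections import Counter
from typing import List

BUILD_THRESHOLD = 5  # fewer than 5 instances stays on the watchlist


class Watchlist:
    """Tally issue sightings and surface those seen often enough to revisit."""

    def __init__(self, threshold: int = BUILD_THRESHOLD) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()

    def record(self, issue: str) -> None:
        """Note one more observed instance of `issue`."""
        self.counts[issue] += 1

    def candidates(self) -> List[str]:
        """Issues that reached the threshold, in alphabetical order."""
        return sorted(i for i, n in self.counts.items() if n >= self.threshold)


if __name__ == "__main__":
    watchlist = Watchlist()
    for _ in range(5):
        watchlist.record("missing citation")
    watchlist.record("wrong tone")
    print(watchlist.candidates())
```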