Files
awesome-copilot/plugins/phoenix/skills/phoenix-evals/references/evaluators-overview.md
2026-04-01 23:04:18 +00:00

1.1 KiB

Evaluators: Overview

When and how to build automated evaluators.

Decision Framework

Should I Build an Evaluator?
        │
        ▼
Can I fix it with a prompt change?
    YES → Fix the prompt first
    NO  → Is this a recurring issue?
          YES → Build evaluator
          NO  → Add to watchlist

Don't automate prematurely. Many issues are simple prompt fixes.

Evaluator Requirements

  1. Clear criteria - Specific, not "Is it good?"
  2. Labeled test set - 100+ examples with human labels
  3. Measured accuracy - Know TPR/TNR before deploying

Evaluator Lifecycle

  1. Discover - Error analysis reveals pattern
  2. Design - Define criteria and test cases
  3. Implement - Build code or LLM evaluator
  4. Calibrate - Validate against human labels
  5. Deploy - Add to experiment/CI pipeline
  6. Monitor - Track accuracy over time
  7. Maintain - Update as product evolves

What NOT to Automate

  • Rare issues - <5 instances? Watchlist, don't build
  • Quick fixes - Fixable by prompt change? Fix it
  • Evolving criteria - Stabilize definition first