chore: publish from staged

2026-04-13 03:35:55 +00:00 · 2026-04-09 06:26:21 +00:00
parent 017f31f495
commit a68b190031
467 changed files with 97527 additions and 276 deletions
--- a/plugins/phoenix/skills/phoenix-evals/references/evaluators-overview.md
+++ b/plugins/phoenix/skills/phoenix-evals/references/evaluators-overview.md
@@ -0,0 +1,40 @@
+# Evaluators: Overview
+
+When and how to build automated evaluators.
+
+## Decision Framework
+
+```
+Should I Build an Evaluator?
+        │
+        ▼
+Can I fix it with a prompt change?
+    YES → Fix the prompt first
+    NO  → Is this a recurring issue?
+          YES → Build evaluator
+          NO  → Add to watchlist
+```
+
+**Don't automate prematurely.** Many issues are simple prompt fixes.
+
+## Evaluator Requirements
+
+1. **Clear criteria** - Specific, not "Is it good?"
+2. **Labeled test set** - 100+ examples with human labels
+3. **Measured accuracy** - Know the true positive rate (TPR) and true negative rate (TNR) against human labels before deploying
+
+## Evaluator Lifecycle
+
+1. **Discover** - Error analysis reveals pattern
+2. **Design** - Define criteria and test cases
+3. **Implement** - Build code or LLM evaluator
+4. **Calibrate** - Validate against human labels
+5. **Deploy** - Add to experiment/CI pipeline
+6. **Monitor** - Track accuracy over time
+7. **Maintain** - Update as product evolves
+
+## What NOT to Automate
+
+- **Rare issues** - Fewer than 5 instances? Add to the watchlist instead of building
+- **Quick fixes** - Fixable by prompt change? Fix it
+- **Evolving criteria** - Stabilize definition first
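
The decision framework and the "Measured accuracy" requirement above can be sketched in plain Python. This is a minimal illustration, not part of the Phoenix API: the function names `triage` and `tpr_tnr` are hypothetical, and treating "recurring" as "at least 5 instances" is an assumption borrowed from the "Rare issues" bullet.

```python
from __future__ import annotations

# Assumption: "recurring" means at least this many observed instances,
# taken from the "<5 instances" guidance in the overview.
RARE_ISSUE_THRESHOLD = 5


def triage(fixable_by_prompt: bool, instances: int, criteria_stable: bool) -> str:
    """Return the suggested next action for an observed failure pattern."""
    if fixable_by_prompt:
        return "fix the prompt"
    if instances < RARE_ISSUE_THRESHOLD:
        return "add to watchlist"
    if not criteria_stable:
        return "stabilize criteria"
    return "build evaluator"


def tpr_tnr(human: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Compare evaluator output with human labels.

    True means "the issue is present". Returns (TPR, TNR):
    TPR = share of human-labeled positives the evaluator also flags,
    TNR = share of human-labeled negatives the evaluator also passes.
    """
    if len(human) != len(predicted):
        raise ValueError("human and predicted labels must have equal length")
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("the labeled set needs both positive and negative examples")
    true_pos = sum(h and p for h, p in zip(human, predicted))
    true_neg = sum((not h) and (not p) for h, p in zip(human, predicted))
    return true_pos / positives, true_neg / negatives


if __name__ == "__main__":
    # Toy labels for illustration only; a real test set should have 100+ examples.
    human_labels = [True, True, True, True, False, False]
    evaluator_labels = [True, True, True, False, False, True]
    tpr, tnr = tpr_tnr(human_labels, evaluator_labels)
    print(f"TPR={tpr:.2f} TNR={tnr:.2f}")
    print(triage(fixable_by_prompt=False, instances=12, criteria_stable=True))
```

Reporting TPR and TNR separately, rather than a single accuracy number, keeps an evaluator that rarely fires from looking good on a test set dominated by passing examples.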