chore: publish from staged

2026-04-13 11:45:56 +00:00 · 2026-04-01 23:04:18 +00:00
parent 5f3d66c380
commit 0c3c5bbbfb
407 changed files with 85783 additions and 237 deletions
--- a/plugins/phoenix/skills/phoenix-evals/references/fundamentals-anti-patterns.md
+++ b/plugins/phoenix/skills/phoenix-evals/references/fundamentals-anti-patterns.md
@@ -0,0 +1,43 @@
+# Anti-Patterns
+
+Common mistakes and fixes.
+
+| Anti-Pattern | Problem | Fix |
+| ------------ | ------- | --- |
+| Generic metrics | Pre-built scores don't match your failures | Build from error analysis |
+| Vibe-based evaluation | Changes judged by impression, not numbers | Measure with experiments |
+| Ignoring humans | Uncalibrated LLM judges | Validate against human labels; target >80% TPR/TNR |
+| Premature automation | Evaluators for imagined problems | Let observed failures drive evaluator design |
+| Saturation blindness | 100% pass = no signal | Keep capability evals at a 50-80% pass rate |
+| Similarity metrics | BERTScore/ROUGE for generation | Use for retrieval only |
+| Model switching | Hoping a different model works better | Run error analysis first |
+
+## Quantify Changes
+
+```python
+baseline = run_experiment(dataset, old_prompt, evaluators)  # same dataset and evaluators for both runs
+improved = run_experiment(dataset, new_prompt, evaluators)
+print(f"Improvement: {improved.pass_rate - baseline.pass_rate:+.1%}")
+```
+
+## Don't Use Similarity for Generation
+
+```python
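+# BAD: overlap with a reference does not show the output is correct
+score = bertscore(output, reference)
+
+# GOOD: check the output's claims against the source context
+correct_facts = check_facts_against_source(output, context)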
+```
+
+## Error Analysis Before Model Change
+
+```python
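+# BAD: swapping models without knowing why the current one fails
+for model in models:
+    results = test(model)
+
+# GOOD: categorize the current model's failures first
+failures = analyze_errors(results)  # results: eval output for the current model
+# Then decide whether a model change is warranted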
+```
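
The table's fix for uncalibrated LLM judges, "Validate >80% TPR/TNR", can be sketched as follows. This is a minimal, library-independent illustration and not part of the Phoenix API: `judge_agreement`, the label lists, and the strict 0.8 cutoff are assumptions made for the example. Human labels are treated as ground truth, and `True` means "pass".

```python
from typing import Sequence


def judge_agreement(human: Sequence[bool], judge: Sequence[bool]) -> dict[str, float]:
    """Return the judge's TPR and TNR, treating human labels as ground truth.

    Hypothetical helper for illustration. Raises if either class is missing
    from the human labels, since the corresponding rate would be undefined.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")
    # True positives: human says pass and judge says pass.
    tp = sum(h and j for h, j in zip(human, judge))
    # True negatives: human says fail and judge says fail.
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one human pass and one human fail")
    return {"tpr": tp / positives, "tnr": tn / negatives}


# Made-up labels for ten examples (True = pass).
human_labels = [True, True, True, True, True, False, False, False, False, False]
judge_labels = [True, True, True, True, True, False, False, False, False, True]

rates = judge_agreement(human_labels, judge_labels)
# Both rates must clear the threshold before the judge is trusted.
calibrated = rates["tpr"] > 0.8 and rates["tnr"] > 0.8
print(f"TPR={rates['tpr']:.0%} TNR={rates['tnr']:.0%} calibrated={calibrated}")
```

Reporting TPR and TNR separately, rather than a single accuracy figure, keeps a judge that passes almost everything from looking well calibrated on a dataset dominated by passing examples.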