
# Anti-Patterns

Common mistakes and fixes.

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Generic metrics | Pre-built scores don't match your failures | Build evaluators from error analysis |
| Vibe-based evaluation | No quantification | Measure with experiments |
| Ignoring humans | Uncalibrated LLM judges | Validate against human labels (>80% TPR/TNR) |
| Premature automation | Evaluators for imagined problems | Let observed failures drive evaluator design |
| Saturation blindness | 100% pass rate carries no signal | Keep capability evals in the 50-80% pass range |
| Similarity metrics | BERTScore/ROUGE for generation | Use similarity metrics for retrieval only |
| Model switching | Hoping a different model works better | Do error analysis first |
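The "ignoring humans" fix above can be sketched as a small calibration check. This is a minimal illustration, not a Phoenix API: `human` and `judge` are hypothetical parallel lists of pass/fail labels for the same examples.

```python
def tpr_tnr(human, judge):
    """Compare an LLM judge's pass/fail labels against human labels.

    Returns (true positive rate, true negative rate); both should
    exceed 0.8 before the judge is trusted unsupervised.
    """
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    pos = sum(1 for h in human if h)
    neg = len(human) - pos
    return tp / pos, tn / neg

# Judge agrees on 4/5 human passes and 3/4 human fails
human = [1, 1, 1, 1, 1, 0, 0, 0, 0]
judge = [1, 1, 1, 1, 0, 0, 0, 0, 1]
tpr, tnr = tpr_tnr(human, judge)  # (0.8, 0.75) -> TNR below threshold
```

A judge that fails this check needs prompt iteration (or more labeled data) before its scores mean anything.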

## Quantify Changes

```python
# Run the same dataset and evaluators against both prompts,
# then compare aggregate pass rates
baseline = run_experiment(dataset, old_prompt, evaluators)
improved = run_experiment(dataset, new_prompt, evaluators)
print(f"Improvement: {improved.pass_rate - baseline.pass_rate:+.1%}")
```
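If your experiment results don't expose a `pass_rate` field directly, it can be derived from per-example evaluator outcomes. A minimal sketch (not the Phoenix API), where each example carries one boolean per evaluator:

```python
def pass_rate(outcomes):
    """Fraction of examples where every evaluator passed."""
    return sum(all(ex) for ex in outcomes) / len(outcomes)

baseline = pass_rate([[True, True], [True, False], [False, True], [True, True]])
improved = pass_rate([[True, True], [True, True], [False, True], [True, True]])
print(f"Improvement: {improved - baseline:+.1%}")  # Improvement: +25.0%
```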

## Don't Use Similarity for Generation

```python
# BAD: surface similarity rewards paraphrase, not correctness
score = bertscore(output, reference)

# GOOD: check the generated claims against the source context
correct_facts = check_facts_against_source(output, context)
```
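`check_facts_against_source` is a placeholder name; in practice it is usually an LLM judge that verifies each extracted claim against the context. A deliberately crude token-overlap stand-in, just to show the shape of a groundedness check:

```python
def check_facts_against_source(output: str, context: str) -> float:
    """Naive proxy: fraction of output sentences whose words all
    appear in the source context. A real evaluator would extract
    claims and verify each with a grounded LLM judge."""
    ctx_words = set(context.lower().split())
    sentences = [s for s in output.split(".") if s.strip()]
    supported = sum(
        all(w in ctx_words for w in s.lower().split())
        for s in sentences
    )
    return supported / len(sentences)

context = "The Eiffel Tower is in Paris and opened in 1889"
output = "the eiffel tower is in paris. it opened in 1975"
score = check_facts_against_source(output, context)  # 0.5
```

Note how the unsupported date drags the score down, which is exactly what a similarity metric like BERTScore would miss.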

## Error Analysis Before Model Change

```python
# BAD: cycling through models without knowing why the current one fails
for model in models:
    results = test(model)

# GOOD: categorize the failures first
failures = analyze_errors(results)
# Then decide whether a model change is warranted
```
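A minimal version of the `analyze_errors` step above, assuming (hypothetically) that each failure record carries a `category` tag from manual review of the traces:

```python
from collections import Counter

def analyze_errors(failures):
    """Group failures by category so the dominant failure mode,
    not a hunch, decides whether a model swap is warranted."""
    return Counter(f["category"] for f in failures)

failures = [
    {"id": 1, "category": "hallucination"},
    {"id": 2, "category": "hallucination"},
    {"id": 3, "category": "retrieval_miss"},
]
counts = analyze_errors(failures)
# Counter({'hallucination': 2, 'retrieval_miss': 1}):
# retrieval misses won't be fixed by switching models
```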