Evaluators: Pre-Built
Use for exploration only. Validate before production.
Python
from phoenix.evals import LLM
from phoenix.evals.metrics import FaithfulnessEvaluator
llm = LLM(provider="openai", model="gpt-4o")
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
Note: HallucinationEvaluator is deprecated. Use FaithfulnessEvaluator instead.
FaithfulnessEvaluator returns "faithful"/"unfaithful" labels, where a score of 1.0 means faithful.
TypeScript
import { createHallucinationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";
const hallucinationEval = createHallucinationEvaluator({ model: openai("gpt-4o") });
Available (2.0)
| Evaluator | Type | Description |
|---|---|---|
| FaithfulnessEvaluator | LLM | Is the response faithful to the context? |
| CorrectnessEvaluator | LLM | Is the response correct? |
| DocumentRelevanceEvaluator | LLM | Are retrieved documents relevant? |
| ToolSelectionEvaluator | LLM | Did the agent select the right tool? |
| ToolInvocationEvaluator | LLM | Did the agent invoke the tool correctly? |
| ToolResponseHandlingEvaluator | LLM | Did the agent handle the tool response well? |
| MatchesRegex | Code | Does output match a regex pattern? |
| PrecisionRecallFScore | Code | Precision/recall/F-score metrics |
| exact_match | Code | Exact string match |
Legacy evaluators (HallucinationEvaluator, QAEvaluator, RelevanceEvaluator,
ToxicityEvaluator, SummarizationEvaluator) live in phoenix.evals.legacy and are deprecated.
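To make the code-based metrics in the table more concrete, here is a plain-Python sketch of what precision, recall, and F-score measure for a single positive label. It is not the PrecisionRecallFScore implementation; the function name `precision_recall_f1`, the `positive` parameter, and the sample labels are all hypothetical and chosen only for illustration.

```python
def precision_recall_f1(expected, predicted, positive="relevant"):
    """Binary precision, recall, and F1 for one positive label (illustrative only)."""
    pairs = list(zip(expected, predicted))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for e, p in pairs if e == positive and p == positive)
    # False positives: predicted positive but actually something else.
    fp = sum(1 for e, p in pairs if e != positive and p == positive)
    # False negatives: actually positive but predicted something else.
    fn = sum(1 for e, p in pairs if e == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical sample data: four documents with expected vs. predicted labels.
expected = ["relevant", "relevant", "irrelevant", "relevant"]
predicted = ["relevant", "irrelevant", "relevant", "relevant"]
precision, recall, f1 = precision_recall_f1(expected, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With two true positives, one false positive, and one false negative, all three values come out to 2/3 in this example.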
When to Use
| Situation | Recommendation |
|---|---|
| Exploration | Find traces to review |
| Find outliers | Sort by scores |
| Production | Validate first (>80% human agreement) |
| Domain-specific | Build custom |
Exploration Pattern
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=traces, evaluators=[faithfulness_eval])
# Score columns contain dicts — extract numeric scores
scores = results_df["faithfulness_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
low_scores = results_df[scores < 0.5]   # Review these
high_scores = results_df[scores > 0.9]  # Also sample
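The score-extraction step above can be tried without Phoenix or pandas. The sketch below applies the same rule (take `score` from a dict cell, otherwise fall back to 0.0) to a plain list; the helper names `extract_score` and `bucket_rows` and the sample cells are hypothetical, and the real shape of a result cell may differ from what is assumed here.

```python
def extract_score(cell, default=0.0):
    """Return the numeric score from a result cell, or `default` if it is missing."""
    if isinstance(cell, dict):
        score = cell.get("score")
        if isinstance(score, (int, float)):
            return float(score)
    return default


def bucket_rows(cells, low=0.5, high=0.9):
    """Split row indices into low-scoring and high-scoring groups."""
    low_idx, high_idx = [], []
    for i, cell in enumerate(cells):
        score = extract_score(cell)
        if score < low:
            low_idx.append(i)
        elif score > high:
            high_idx.append(i)
    return low_idx, high_idx


# Hypothetical cells: two scored rows, one failed row (None), one row without a score.
cells = [
    {"score": 1.0, "label": "faithful"},
    {"score": 0.0, "label": "unfaithful"},
    None,
    {"label": "faithful"},
]
low_idx, high_idx = bucket_rows(cells)
print(low_idx, high_idx)
```

Defaulting to 0.0 means rows with a missing or malformed score land in the low bucket alongside genuinely low scores, so it is worth checking whether a row failed to evaluate before treating it as unfaithful.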
Validation Required
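from sklearn.metrics import classification_report
# human_labels: manual labels; evaluator_results: evaluator output for the same rows, in the same order
print(classification_report(human_labels, evaluator_results["label"]))
# Target: >80% agreement (the accuracy row of the report)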
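As a dependency-free complement to the report above, the 80% target can be checked with a raw agreement rate: the fraction of rows where the evaluator's label matches the human label. This is a minimal sketch; the function name `agreement_rate` and the sample labels are hypothetical.

```python
def agreement_rate(human_labels, evaluator_labels):
    """Fraction of rows where the evaluator label equals the human label."""
    if len(human_labels) != len(evaluator_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled row")
    matches = sum(h == e for h, e in zip(human_labels, evaluator_labels))
    return matches / len(human_labels)


# Hypothetical sample: five human-reviewed traces.
human = ["faithful", "faithful", "unfaithful", "faithful", "unfaithful"]
evaluator = ["faithful", "unfaithful", "unfaithful", "faithful", "unfaithful"]

rate = agreement_rate(human, evaluator)
meets_target = rate > 0.8  # the target is strictly above 80%
print(f"agreement={rate:.0%} meets_target={meets_target}")
```

Here four of five labels match, so the rate is exactly 80% and does not clear a strict ">80%" bar. Raw agreement can look high on imbalanced data, which is one reason to read it alongside the per-class precision and recall in the report.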