chore: publish from staged

2026-04-13 03:35:55 +00:00 · 2026-04-13 01:02:36 +00:00
parent e37cd3123f
commit 2f4953242f
467 changed files with 97528 additions and 276 deletions
--- a/plugins/phoenix/skills/phoenix-evals/references/evaluators-pre-built.md
+++ b/plugins/phoenix/skills/phoenix-evals/references/evaluators-pre-built.md
@@ -0,0 +1,75 @@
+# Evaluators: Pre-Built
+
+Use pre-built evaluators for exploration only. Validate them against human labels before relying on them in production.
+
+## Python
+
+```python
+from phoenix.evals import LLM
+from phoenix.evals.metrics import FaithfulnessEvaluator
+
+llm = LLM(provider="openai", model="gpt-4o")
+faithfulness_eval = FaithfulnessEvaluator(llm=llm)
+```
+
+**Note**: `HallucinationEvaluator` is deprecated. Use `FaithfulnessEvaluator` instead.
+`FaithfulnessEvaluator` uses the labels "faithful" and "unfaithful"; a score of 1.0 means faithful.
+
+## TypeScript
+
+```typescript
+import { createHallucinationEvaluator } from "@arizeai/phoenix-evals";
+import { openai } from "@ai-sdk/openai";
+
+const hallucinationEval = createHallucinationEvaluator({ model: openai("gpt-4o") });
+```
+
+## Available (2.0)
+
+| Evaluator | Type | Description |
+| --------- | ---- | ----------- |
+| `FaithfulnessEvaluator` | LLM | Is the response faithful to the context? |
+| `CorrectnessEvaluator` | LLM | Is the response correct? |
+| `DocumentRelevanceEvaluator` | LLM | Are retrieved documents relevant? |
+| `ToolSelectionEvaluator` | LLM | Did the agent select the right tool? |
+| `ToolInvocationEvaluator` | LLM | Did the agent invoke the tool correctly? |
+| `ToolResponseHandlingEvaluator` | LLM | Did the agent handle the tool response well? |
+| `MatchesRegex` | Code | Does output match a regex pattern? |
+| `PrecisionRecallFScore` | Code | Precision/recall/F-score metrics |
+| `exact_match` | Code | Exact string match |
+
+Legacy evaluators (`HallucinationEvaluator`, `QAEvaluator`, `RelevanceEvaluator`,
+`ToxicityEvaluator`, `SummarizationEvaluator`) live in `phoenix.evals.legacy` and are deprecated.
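+
+To make the code-type rows in the table concrete, here is a minimal plain-Python
+sketch of what an exact match, a regex match, and a binary precision/recall/F-score
+compute. The function names, the use of `re.search`, and the `positive`/`beta`
+parameters are assumptions made for illustration; this is not the Phoenix
+implementation or its API.
+
+```python
+import re
+
+
+def exact_match_score(output, expected):
+    """1.0 when the two strings are identical, else 0.0."""
+    return 1.0 if output == expected else 0.0
+
+
+def regex_match_score(output, pattern):
+    """1.0 when the pattern occurs anywhere in the output (assumed semantics)."""
+    return 1.0 if re.search(pattern, output) else 0.0
+
+
+def precision_recall_f(expected, predicted, positive, beta=1.0):
+    """Binary precision, recall, and F-beta for one positive label."""
+    pairs = list(zip(expected, predicted))
+    tp = sum(1 for e, p in pairs if e == positive and p == positive)
+    fp = sum(1 for e, p in pairs if e != positive and p == positive)
+    fn = sum(1 for e, p in pairs if e == positive and p != positive)
+    precision = tp / (tp + fp) if (tp + fp) else 0.0
+    recall = tp / (tp + fn) if (tp + fn) else 0.0
+    denom = beta**2 * precision + recall
+    f_score = (1 + beta**2) * precision * recall / denom if denom else 0.0
+    return precision, recall, f_score
+
+
+# Hypothetical labels, for illustration only
+expected = ["a", "a", "b", "b"]
+predicted = ["a", "b", "a", "b"]
+precision, recall, f_score = precision_recall_f(expected, predicted, positive="a")
+```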
+
+## When to Use
+
+| Situation | Recommendation |
+| --------- | -------------- |
+| Exploration | Use scores to find traces worth reviewing |
+| Find outliers | Sort by score and inspect the extremes |
+| Production | Validate first (>80% agreement with human labels) |
+| Domain-specific | Build a custom evaluator |
+
+## Exploration Pattern
+
+```python
+from phoenix.evals import evaluate_dataframe
+
+results_df = evaluate_dataframe(dataframe=traces, evaluators=[faithfulness_eval])
+
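+# Score columns contain dicts, so extract the numeric score from each cell.
+# Cells that are not dicts, or that lack a "score" key, default to 0.0 and
+# therefore land in low_scores.
+scores = results_df["faithfulness_score"].apply(
+    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
+)
+low_scores = results_df[scores < 0.5]   # Likely failures: review these
+high_scores = results_df[scores > 0.9]  # Likely passes: sample these too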
+```
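+
+Defaulting missing scores to 0.0 mixes failed or skipped evaluations in with
+genuinely low scores. One possible variant, sketched here in plain Python, returns
+NaN for missing cells so they drop out of both buckets and can be counted
+separately. The helper name `extract_score` and the sample cells are hypothetical;
+only the dict shape with a `"score"` key is taken from the pattern above.
+
+```python
+import math
+
+
+def extract_score(cell):
+    """Return the numeric score from a score cell, or NaN when it is missing."""
+    if isinstance(cell, dict):
+        value = cell.get("score")
+        if isinstance(value, (int, float)) and not isinstance(value, bool):
+            return float(value)
+    return math.nan
+
+
+# Hypothetical score cells, shaped like the dicts described above
+cells = [
+    {"score": 1.0, "label": "faithful"},
+    {"score": 0.0, "label": "unfaithful"},
+    None,                   # evaluation failed or was skipped
+    {"label": "faithful"},  # dict without a numeric score
+]
+
+scores = [extract_score(cell) for cell in cells]
+
+# Comparisons against NaN are always False, so missing rows match neither bucket
+low = [i for i, s in enumerate(scores) if s < 0.5]
+high = [i for i, s in enumerate(scores) if s > 0.9]
+missing = [i for i, s in enumerate(scores) if math.isnan(s)]
+```
+
+With a DataFrame, the same helper could be passed to
+`results_df["faithfulness_score"].apply(extract_score)` in place of the lambda.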
+
+## Validation Required
+
+```python
+from sklearn.metrics import classification_report
+
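+# human_labels: labels from human review, aligned row-for-row with
+# evaluator_results["label"], the labels the evaluator produced
+print(classification_report(human_labels, evaluator_results["label"]))
+# Target: >80% agreement with human labels (the "accuracy" row of the report)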
+```
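+
+The agreement target can also be checked directly. The sketch below computes the
+raw agreement rate and a count of each (human, evaluator) label pair using only the
+standard library. The function name `agreement_rate` and the sample labels are
+hypothetical, chosen for illustration.
+
+```python
+from collections import Counter
+
+
+def agreement_rate(human, predicted):
+    """Fraction of rows where the evaluator label equals the human label."""
+    if not human or len(human) != len(predicted):
+        raise ValueError("label lists must be non-empty and the same length")
+    matches = sum(1 for h, p in zip(human, predicted) if h == p)
+    return matches / len(human)
+
+
+# Hypothetical labels, for illustration only
+human = ["faithful"] * 6 + ["unfaithful"] * 4
+predicted = (
+    ["faithful"] * 5 + ["unfaithful"]      # one faithful row marked unfaithful
+    + ["unfaithful"] * 3 + ["faithful"]    # one unfaithful row marked faithful
+)
+
+rate = agreement_rate(human, predicted)
+
+# Count each (human, evaluator) pair to see which direction the errors go
+confusion = Counter(zip(human, predicted))
+
+meets_target = rate > 0.80
+```
+
+Here the evaluator agrees on 8 of 10 rows, which is exactly 80% and so does not
+clear a strict >80% target. The pair counts show whether disagreements are mostly
+false "faithful" or false "unfaithful" labels.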