chore: publish from staged

This commit is contained in:
github-actions[bot]
2026-04-10 04:45:41 +00:00
parent 10fda505b7
commit 8395dce14c
467 changed files with 97526 additions and 276 deletions

---
name: phoenix-evals
description: Build and run evaluators for AI/LLM applications using Phoenix.
license: Apache-2.0
compatibility: Requires Phoenix server. Python skills need phoenix and openai packages; TypeScript skills need @arizeai/phoenix-client.
metadata:
author: oss@arize.com
version: "1.0.0"
languages: "Python, TypeScript"
---
# Phoenix Evals
Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
## Quick Reference
| Task | Files |
| ---- | ----- |
| Setup | [setup-python](references/setup-python.md), [setup-typescript](references/setup-typescript.md) |
| Decide what to evaluate | [evaluators-overview](references/evaluators-overview.md) |
| Choose a judge model | [fundamentals-model-selection](references/fundamentals-model-selection.md) |
| Use pre-built evaluators | [evaluators-pre-built](references/evaluators-pre-built.md) |
| Build code evaluator | [evaluators-code-python](references/evaluators-code-python.md), [evaluators-code-typescript](references/evaluators-code-typescript.md) |
| Build LLM evaluator | [evaluators-llm-python](references/evaluators-llm-python.md), [evaluators-llm-typescript](references/evaluators-llm-typescript.md), [evaluators-custom-templates](references/evaluators-custom-templates.md) |
| Batch evaluate DataFrame | [evaluate-dataframe-python](references/evaluate-dataframe-python.md) |
| Run experiment | [experiments-running-python](references/experiments-running-python.md), [experiments-running-typescript](references/experiments-running-typescript.md) |
| Create dataset | [experiments-datasets-python](references/experiments-datasets-python.md), [experiments-datasets-typescript](references/experiments-datasets-typescript.md) |
| Generate synthetic data | [experiments-synthetic-python](references/experiments-synthetic-python.md), [experiments-synthetic-typescript](references/experiments-synthetic-typescript.md) |
| Validate evaluator accuracy | [validation](references/validation.md), [validation-evaluators-python](references/validation-evaluators-python.md), [validation-evaluators-typescript](references/validation-evaluators-typescript.md) |
| Sample traces for review | [observe-sampling-python](references/observe-sampling-python.md), [observe-sampling-typescript](references/observe-sampling-typescript.md) |
| Analyze errors | [error-analysis](references/error-analysis.md), [error-analysis-multi-turn](references/error-analysis-multi-turn.md), [axial-coding](references/axial-coding.md) |
| RAG evals | [evaluators-rag](references/evaluators-rag.md) |
| Avoid common mistakes | [common-mistakes-python](references/common-mistakes-python.md), [fundamentals-anti-patterns](references/fundamentals-anti-patterns.md) |
| Production | [production-overview](references/production-overview.md), [production-guardrails](references/production-guardrails.md), [production-continuous](references/production-continuous.md) |
## Workflows
**Starting Fresh:**
[observe-tracing-setup](references/observe-tracing-setup.md) → [error-analysis](references/error-analysis.md) → [axial-coding](references/axial-coding.md) → [evaluators-overview](references/evaluators-overview.md)
**Building Evaluator:**
[fundamentals](references/fundamentals.md) → [common-mistakes-python](references/common-mistakes-python.md) → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}
**RAG Systems:**
[evaluators-rag](references/evaluators-rag.md) → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
**Production:**
[production-overview](references/production-overview.md) → [production-guardrails](references/production-guardrails.md) → [production-continuous](references/production-continuous.md)
## Reference Categories
| Prefix | Description |
| ------ | ----------- |
| `fundamentals-*` | Types, scores, anti-patterns |
| `observe-*` | Tracing, sampling |
| `error-analysis-*` | Finding failures |
| `axial-coding-*` | Categorizing failures |
| `evaluators-*` | Code, LLM, RAG evaluators |
| `experiments-*` | Datasets, running experiments |
| `validation-*` | Validating evaluator accuracy against human labels |
| `production-*` | CI/CD, monitoring |
## Key Principles
| Principle | Action |
| --------- | ------ |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |

# Axial Coding
Group open-ended notes into structured failure taxonomies.
## Process
1. **Gather** - Collect open coding notes
2. **Pattern** - Group notes with common themes
3. **Name** - Create actionable category names
4. **Quantify** - Count failures per category
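The Quantify step is just counting labels once notes are categorized — a minimal sketch (the note dicts here are illustrative, not a Phoenix data structure):

```python
from collections import Counter

# Illustrative open-coding notes, each tagged with an axial category
coded_notes = [
    {"span_id": "a1", "category": "hallucination"},
    {"span_id": "a2", "category": "tone_mismatch"},
    {"span_id": "a3", "category": "hallucination"},
    {"span_id": "a4", "category": "incompleteness"},
]

# Count failures per category, most frequent first
counts = Counter(note["category"] for note in coded_notes)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```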
## Example Taxonomy
```yaml
failure_taxonomy:
content_quality:
hallucination: [invented_facts, fictional_citations]
incompleteness: [partial_answer, missing_key_info]
inaccuracy: [wrong_numbers, wrong_dates]
communication:
tone_mismatch: [too_casual, too_formal]
clarity: [ambiguous, jargon_heavy]
context:
user_context: [ignored_preferences, misunderstood_intent]
retrieved_context: [ignored_documents, wrong_context]
safety:
missing_disclaimers: [legal, medical, financial]
```
## Add Annotation (Python)
```python
from phoenix.client import Client
client = Client()
client.spans.add_span_annotation(
span_id="abc123",
annotation_name="failure_category",
label="hallucination",
explanation="invented a feature that doesn't exist",
annotator_kind="HUMAN",
sync=True,
)
```
## Add Annotation (TypeScript)
```typescript
import { addSpanAnnotation } from "@arizeai/phoenix-client/spans";
await addSpanAnnotation({
spanAnnotation: {
spanId: "abc123",
name: "failure_category",
label: "hallucination",
explanation: "invented a feature that doesn't exist",
annotatorKind: "HUMAN",
}
});
```
## Agent Failure Taxonomy
```yaml
agent_failures:
planning: [wrong_plan, incomplete_plan]
tool_selection: [wrong_tool, missed_tool, unnecessary_call]
tool_execution: [wrong_parameters, type_error]
state_management: [lost_context, stuck_in_loop]
error_recovery: [no_fallback, wrong_fallback]
```
## Transition Matrix (Agents)
Shows where failures occur between states:
```python
from collections import defaultdict

import pandas as pd

def build_transition_matrix(conversations, states):
    # find_last_success / find_first_failure are assumed helpers that
    # return the state name of the relevant turn in a conversation
    matrix = defaultdict(lambda: defaultdict(int))
    for conv in conversations:
        if conv["failed"]:
            last_success = find_last_success(conv)
            first_failure = find_first_failure(conv)
            matrix[last_success][first_failure] += 1
    return pd.DataFrame(matrix).fillna(0)
```
## Principles
- **MECE** - Mutually exclusive, collectively exhaustive: each failure fits exactly ONE category
- **Actionable** - Categories suggest fixes
- **Bottom-up** - Let categories emerge from data

# Common Mistakes (Python)
Code patterns that LLM coding assistants frequently get wrong, usually because their training data reflects the legacy 1.0 API.
## Legacy Model Classes
```python
# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")
# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")
```
**Why**: `OpenAIModel`, `AnthropicModel`, etc. are legacy 1.0 wrappers in `phoenix.evals.legacy`.
The `LLM` class is provider-agnostic and is the current 2.0 API.
## Using run_evals Instead of evaluate_dataframe
```python
# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns list of DataFrames
# RIGHT — current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
```
**Why**: `run_evals` is the legacy 1.0 batch function. `evaluate_dataframe` is the current
2.0 function with a different return format.
## Wrong Result Column Names
```python
# WRONG — column doesn't exist
score = results_df["relevance"].mean()
# WRONG — column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()
# RIGHT — extract numeric score from dict
scores = results_df["relevance_score"].apply(
lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()
```
**Why**: `evaluate_dataframe` returns columns named `{name}_score` containing Score dicts
like `{"name": "...", "score": 1.0, "label": "...", "explanation": "..."}`.
## Deprecated project_name Parameter
```python
# WRONG
df = client.spans.get_spans_dataframe(project_name="my-project")
# RIGHT
df = client.spans.get_spans_dataframe(project_identifier="my-project")
```
**Why**: `project_name` is deprecated in favor of `project_identifier`, which also
accepts project IDs.
## Wrong Client Constructor
```python
# WRONG
client = Client(endpoint="https://app.phoenix.arize.com")
client = Client(url="https://app.phoenix.arize.com")
# RIGHT — for remote/cloud Phoenix
client = Client(base_url="https://app.phoenix.arize.com", api_key="...")
# ALSO RIGHT — for local Phoenix (falls back to env vars or localhost:6006)
client = Client()
```
**Why**: The parameter is `base_url`, not `endpoint` or `url`. For local instances,
`Client()` with no args works fine. For remote instances, `base_url` and `api_key` are required.
## Too-Aggressive Time Filters
```python
# WRONG — often returns zero spans
from datetime import datetime, timedelta
df = client.spans.get_spans_dataframe(
project_identifier="my-project",
start_time=datetime.now() - timedelta(hours=1),
)
# RIGHT — use limit to control result size instead
df = client.spans.get_spans_dataframe(
project_identifier="my-project",
limit=50,
)
```
**Why**: Traces may be from any time period. A 1-hour window frequently returns
nothing. Use `limit=` to control result size instead.
## Not Filtering Spans Appropriately
```python
# WRONG — fetches all spans including internal LLM calls, retrievers, etc.
df = client.spans.get_spans_dataframe(project_identifier="my-project")
# RIGHT for end-to-end evaluation — filter to top-level spans
df = client.spans.get_spans_dataframe(
project_identifier="my-project",
root_spans_only=True,
)
# RIGHT for RAG evaluation — fetch child spans for retriever/LLM metrics
all_spans = client.spans.get_spans_dataframe(
project_identifier="my-project",
)
retriever_spans = all_spans[all_spans["span_kind"] == "RETRIEVER"]
llm_spans = all_spans[all_spans["span_kind"] == "LLM"]
```
**Why**: For end-to-end evaluation (e.g., overall answer quality), use `root_spans_only=True`.
For RAG systems, you often need child spans separately — retriever spans for
DocumentRelevance and LLM spans for Faithfulness. Choose the right span level
for your evaluation target.
## Assuming Span Output is Plain Text
```python
# WRONG — output may be JSON, not plain text
df["output"] = df["attributes.output.value"]
# RIGHT — parse JSON and extract the answer field
import json
def extract_answer(output_value):
if not isinstance(output_value, str):
return str(output_value) if output_value is not None else ""
try:
parsed = json.loads(output_value)
if isinstance(parsed, dict):
for key in ("answer", "result", "output", "response"):
if key in parsed:
return str(parsed[key])
except (json.JSONDecodeError, TypeError):
pass
return output_value
df["output"] = df["attributes.output.value"].apply(extract_answer)
```
**Why**: LangChain and other frameworks often output structured JSON from root spans,
like `{"context": "...", "question": "...", "answer": "..."}`. Evaluators need
the actual answer text, not the raw JSON.
## Using @create_evaluator for LLM-Based Evaluation
```python
# WRONG — @create_evaluator doesn't call an LLM
@create_evaluator(name="relevance", kind="llm")
def relevance(input: str, output: str) -> str:
pass # No LLM is involved
# RIGHT — use ClassificationEvaluator for LLM-based evaluation
from phoenix.evals import ClassificationEvaluator, LLM
relevance = ClassificationEvaluator(
name="relevance",
prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
llm=LLM(provider="openai", model="gpt-4o"),
choices={"relevant": 1.0, "irrelevant": 0.0},
)
```
**Why**: `@create_evaluator` wraps a plain Python function. Setting `kind="llm"`
marks it as LLM-based but you must implement the LLM call yourself.
For LLM-based evaluation, prefer `ClassificationEvaluator` which handles
the LLM call, structured output parsing, and explanations automatically.
## Using llm_classify Instead of ClassificationEvaluator
```python
# WRONG — legacy 1.0 API
from phoenix.evals import llm_classify
results = llm_classify(
dataframe=df,
template=template_str,
model=model,
rails=["relevant", "irrelevant"],
)
# RIGHT — current 2.0 API
from phoenix.evals import ClassificationEvaluator, async_evaluate_dataframe, LLM
classifier = ClassificationEvaluator(
name="relevance",
prompt_template=template_str,
llm=LLM(provider="openai", model="gpt-4o"),
choices={"relevant": 1.0, "irrelevant": 0.0},
)
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])
```
**Why**: `llm_classify` is the legacy 1.0 function. The current pattern is to create
an evaluator with `ClassificationEvaluator` and run it with `async_evaluate_dataframe()`.
## Using HallucinationEvaluator
```python
# WRONG — deprecated
from phoenix.evals import HallucinationEvaluator
evaluator = HallucinationEvaluator(model)
# RIGHT — use FaithfulnessEvaluator (and avoid shadowing the built-in `eval`)
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals import LLM
evaluator = FaithfulnessEvaluator(llm=LLM(provider="openai", model="gpt-4o"))
```
**Why**: `HallucinationEvaluator` is deprecated. `FaithfulnessEvaluator` is its replacement;
it uses "faithful"/"unfaithful" labels with direction "maximize" (1.0 = faithful).

# Error Analysis: Multi-Turn Conversations
Debugging complex multi-turn conversation traces.
## The Approach
1. **End-to-end first** - Did the conversation achieve the goal?
2. **Find first failure** - Trace backwards to root cause
3. **Simplify** - Try single-turn before multi-turn debug
4. **N-1 testing** - Isolate turn-specific vs capability issues
## Find First Upstream Failure
```
Turn 1: User asks about flights ✓
Turn 2: Assistant asks for dates ✓
Turn 3: User provides dates ✓
Turn 4: Assistant searches WRONG dates ← FIRST FAILURE
Turn 5: Shows wrong flights (consequence)
Turn 6: User frustrated (consequence)
```
Focus on Turn 4, not Turn 6.
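Locating the first failing turn can be mechanized when turns carry a per-turn success flag — a sketch (the `"ok"` field is an assumed shape; adapt to however your traces record turn outcomes):

```python
def first_failure_turn(turns):
    """Return the 1-based index of the first failed turn, or None if all passed."""
    for i, turn in enumerate(turns, start=1):
        if not turn["ok"]:
            return i
    return None

conversation = [
    {"ok": True},   # Turn 1: user asks about flights
    {"ok": True},   # Turn 2: assistant asks for dates
    {"ok": True},   # Turn 3: user provides dates
    {"ok": False},  # Turn 4: assistant searches wrong dates
    {"ok": False},  # Turn 5: consequence
]
print(first_failure_turn(conversation))  # → 4
```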
## Simplify First
Before debugging multi-turn, test single-turn:
```python
# If single-turn also fails → problem is retrieval/knowledge
# If single-turn passes → problem is conversation context
response = chat("What's the return policy for electronics?")
```
## N-1 Testing
Give turns 1 to N-1 as context, test turn N:
```python
# Replay turns 1..N-1 as context, then regenerate turn N
context = conversation[:n - 1]
response = chat_with_context(context, user_message_n)
# Compare `response` to the actual turn N output
```
This isolates whether error is from context or underlying capability.
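The N-1 replay can be packaged as a small helper — a sketch, assuming each turn is a dict with `"user"` and `"assistant"` keys and `chat_fn(context, user_message) -> str` is your app's entry point (both shapes are assumptions):

```python
def n_minus_1_test(conversation, n, chat_fn):
    """Replay turns 1..N-1 as context, regenerate turn N, and compare."""
    context = conversation[: n - 1]
    user_message = conversation[n - 1]["user"]
    regenerated = chat_fn(context, user_message)
    actual = conversation[n - 1]["assistant"]
    return {
        "actual": actual,
        "regenerated": regenerated,
        "matches": regenerated.strip() == actual.strip(),
    }
```

A failing `matches` here, paired with a passing single-turn test, points at context handling rather than underlying capability.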
## Checklist
1. Did conversation achieve goal? (E2E)
2. Which turn first went wrong?
3. Can you reproduce with single-turn?
4. Is error from context or capability? (N-1 test)

# Error Analysis
Review traces to discover failure modes before building evaluators.
## Process
1. **Sample** - 100+ traces (errors, negative feedback, random)
2. **Open Code** - Write free-form notes per trace
3. **Axial Code** - Group notes into failure categories
4. **Quantify** - Count failures per category
5. **Prioritize** - Rank by frequency × severity
## Sample Traces
### Span-level sampling (Python — DataFrame)
```python
import pandas as pd

from phoenix.client import Client
# Client() works for local Phoenix (falls back to env vars or localhost:6006)
# For remote/cloud: Client(base_url="https://app.phoenix.arize.com", api_key="...")
client = Client()
spans_df = client.spans.get_spans_dataframe(project_identifier="my-app")
# Build representative sample
sample = pd.concat([
spans_df[spans_df["status_code"] == "ERROR"].sample(30),
spans_df[spans_df["feedback"] == "negative"].sample(30),
spans_df.sample(40),
]).drop_duplicates("span_id").head(100)
```
### Span-level sampling (TypeScript)
```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";
const { spans: errors } = await getSpans({
project: { projectName: "my-app" },
statusCode: "ERROR",
limit: 30,
});
const { spans: allSpans } = await getSpans({
project: { projectName: "my-app" },
limit: 70,
});
const sample = [...errors, ...allSpans.sort(() => Math.random() - 0.5).slice(0, 40)];
const unique = [...new Map(sample.map((s) => [s.context.span_id, s])).values()].slice(0, 100);
```
### Trace-level sampling (Python)
When errors span multiple spans (e.g., agent workflows), sample whole traces:
```python
from datetime import datetime, timedelta
traces = client.traces.get_traces(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=24),
include_spans=True,
sort="latency_ms",
order="desc",
limit=100,
)
# Each trace has: trace_id, start_time, end_time, spans
```
### Trace-level sampling (TypeScript)
```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
const { traces } = await getTraces({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
includeSpans: true,
limit: 100,
});
```
## Add Notes (Python)
```python
client.spans.add_span_note(
span_id="abc123",
note="wrong timezone - said 3pm EST but user is PST"
)
```
## Add Notes (TypeScript)
```typescript
import { addSpanNote } from "@arizeai/phoenix-client/spans";
await addSpanNote({
spanNote: {
spanId: "abc123",
note: "wrong timezone - said 3pm EST but user is PST"
}
});
```
## What to Note
| Type | Examples |
| ---- | -------- |
| Factual errors | Wrong dates, prices, made-up features |
| Missing info | Didn't answer question, omitted details |
| Tone issues | Too casual/formal for context |
| Tool issues | Wrong tool, wrong parameters |
| Retrieval | Wrong docs, missing relevant docs |
## Good Notes
```
BAD: "Response is bad"
GOOD: "Response says ships in 2 days but policy is 5-7 days"
```
## Group into Categories
```python
categories = {
"factual_inaccuracy": ["wrong shipping time", "incorrect price"],
"hallucination": ["made up a discount", "invented feature"],
"tone_mismatch": ["informal for enterprise client"],
}
# Priority = Frequency × Severity
```
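The priority formula in the comment above can be computed directly — a sketch with illustrative counts and severity weights (1 = minor, 3 = severe):

```python
# Illustrative per-category data from error analysis
severity = {"factual_inaccuracy": 3, "hallucination": 3, "tone_mismatch": 1}
frequency = {"factual_inaccuracy": 12, "hallucination": 4, "tone_mismatch": 20}

# Priority = Frequency × Severity, highest first
priority = {cat: frequency[cat] * severity[cat] for cat in frequency}
ranked = sorted(priority.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # factual_inaccuracy (36) outranks tone_mismatch (20) and hallucination (12)
```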
## Retrieve Existing Annotations
### Python
```python
# From a spans DataFrame
annotations_df = client.spans.get_span_annotations_dataframe(
spans_dataframe=sample,
project_identifier="my-app",
include_annotation_names=["quality", "correctness"],
)
# annotations_df has: span_id (index), name, label, score, explanation
# Or from specific span IDs
annotations_df = client.spans.get_span_annotations_dataframe(
span_ids=["span-id-1", "span-id-2"],
project_identifier="my-app",
)
```
### TypeScript
```typescript
import { getSpanAnnotations } from "@arizeai/phoenix-client/spans";
const { annotations } = await getSpanAnnotations({
project: { projectName: "my-app" },
spanIds: ["span-id-1", "span-id-2"],
includeAnnotationNames: ["quality", "correctness"],
});
for (const ann of annotations) {
console.log(`${ann.span_id}: ${ann.name} = ${ann.result?.label} (${ann.result?.score})`);
}
```
## Saturation
Stop when new traces reveal no new failure modes. Minimum: 100 traces.
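Saturation can be made visible by tracking cumulative distinct categories per review batch — a sketch (the batch lists are illustrative):

```python
def saturation_curve(batches):
    """Cumulative count of distinct failure categories after each batch.

    batches: list of lists of category labels, one inner list per batch.
    A flat tail suggests saturation — new traces reveal no new modes.
    """
    seen = set()
    curve = []
    for batch in batches:
        seen.update(batch)
        curve.append(len(seen))
    return curve

print(saturation_curve([
    ["hallucination", "tone_mismatch"],
    ["hallucination", "missing_info"],
    ["tone_mismatch"],  # no new category — approaching saturation
]))  # → [2, 3, 3]
```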

# Batch Evaluation with evaluate_dataframe (Python)
Run evaluators across a DataFrame. The core 2.0 batch evaluation API.
## Preferred: async_evaluate_dataframe
For batch evaluations (especially with LLM evaluators), prefer the async version
for better throughput:
```python
from phoenix.evals import async_evaluate_dataframe
results_df = await async_evaluate_dataframe(
dataframe=df, # pandas DataFrame with columns matching evaluator params
evaluators=[eval1, eval2], # List of evaluators
concurrency=5, # Max concurrent LLM calls (default 3)
    exit_on_error=False,      # Optional: False = continue past errors (default True stops on first error)
max_retries=3, # Optional: retry failed LLM calls (default 10)
)
```
## Sync Version
```python
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(
dataframe=df, # pandas DataFrame with columns matching evaluator params
evaluators=[eval1, eval2], # List of evaluators
    exit_on_error=False,      # Optional: False = continue past errors (default True stops on first error)
max_retries=3, # Optional: retry failed LLM calls (default 10)
)
```
## Result Column Format
`async_evaluate_dataframe` / `evaluate_dataframe` returns a copy of the input DataFrame with added columns.
**Result columns contain dicts, NOT raw numbers.**
For each evaluator named `"foo"`, two columns are added:
| Column | Type | Contents |
| ------ | ---- | -------- |
| `foo_score` | `dict` | `{"name": "foo", "score": 1.0, "label": "True", "explanation": "...", "metadata": {...}, "kind": "code", "direction": "maximize"}` |
| `foo_execution_details` | `dict` | `{"status": "success", "exceptions": [], "execution_seconds": 0.001}` |
Only non-None fields appear in the score dict.
### Extracting Numeric Scores
```python
# WRONG — these will fail or produce unexpected results
score = results_df["relevance"].mean() # KeyError!
score = results_df["relevance_score"].mean() # Tries to average dicts!
# RIGHT — extract the numeric score from each dict
scores = results_df["relevance_score"].apply(
lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
mean_score = scores.mean()
```
### Extracting Labels
```python
labels = results_df["relevance_score"].apply(
lambda x: x.get("label", "") if isinstance(x, dict) else ""
)
```
### Extracting Explanations (LLM evaluators)
```python
explanations = results_df["relevance_score"].apply(
lambda x: x.get("explanation", "") if isinstance(x, dict) else ""
)
```
### Finding Failures
```python
scores = results_df["relevance_score"].apply(
lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
failed_mask = scores < 0.5
failures = results_df[failed_mask]
```
## Input Mapping
Evaluators receive each row as a dict. Column names must match the evaluator's
expected parameter names. If they don't match, use `.bind()` or `bind_evaluator`:
```python
from phoenix.evals import bind_evaluator, create_evaluator, async_evaluate_dataframe
@create_evaluator(name="check", kind="code")
def check(response: str) -> bool:
return len(response.strip()) > 0
# Option 1: Use .bind() on the evaluator (returns a bound copy — assign it)
bound_check = check.bind(input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[bound_check])
# Option 2: Use bind_evaluator function
bound = bind_evaluator(evaluator=check, input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[bound])
```
Or simply rename columns to match:
```python
df = df.rename(columns={
"attributes.input.value": "input",
"attributes.output.value": "output",
})
```
## DO NOT use run_evals
```python
# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1])
# Returns List[DataFrame] — one per evaluator
# RIGHT — current 2.0 API
from phoenix.evals import async_evaluate_dataframe
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
```
Key differences:
- `run_evals` returns a **list** of DataFrames (one per evaluator)
- `async_evaluate_dataframe` returns a **single** DataFrame with all results merged
- `async_evaluate_dataframe` uses `{name}_score` dict column format
- `async_evaluate_dataframe` uses `bind_evaluator` for input mapping (not `input_mapping=` param)

# Evaluators: Code Evaluators in Python
Deterministic evaluators without LLM. Fast, cheap, reproducible.
## Basic Pattern
```python
import re
import json
from phoenix.evals import create_evaluator
@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
return bool(re.search(r'\[\d+\]', output))
@create_evaluator(name="json_valid", kind="code")
def json_valid(output: str) -> bool:
try:
json.loads(output)
return True
except json.JSONDecodeError:
return False
```
## Parameter Binding
| Parameter | Description |
| --------- | ----------- |
| `output` | Task output |
| `input` | Example input |
| `expected` | Expected output |
| `metadata` | Example metadata |
```python
@create_evaluator(name="matches_expected", kind="code")
def matches_expected(output: str, expected: dict) -> bool:
return output.strip() == expected.get("answer", "").strip()
```
## Common Patterns
- **Regex**: `re.search(pattern, output)`
- **JSON schema**: `jsonschema.validate()`
- **Keywords**: `keyword in output.lower()`
- **Length**: `len(output.split())`
- **Similarity**: `editdistance.eval()` or Jaccard
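The Similarity pattern can be sketched without extra dependencies via token-level Jaccard overlap (wrap it with `@create_evaluator(name=..., kind="code")` to use it as an evaluator; the function name is illustrative):

```python
def jaccard_similarity(output: str, expected: str) -> float:
    """Token-level Jaccard similarity in [0, 1] between output and expected."""
    a, b = set(output.lower().split()), set(expected.lower().split())
    if not a and not b:
        return 1.0  # two empty strings count as identical
    return len(a & b) / len(a | b)

print(jaccard_similarity("the cat sat", "the cat ran"))  # → 0.5
```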
## Return Types
| Return type | Result |
| ----------- | ------ |
| `bool` | `True` → score=1.0, label="True"; `False` → score=0.0, label="False" |
| `float`/`int` | Used as the `score` value directly |
| `str` (short, ≤3 words) | Used as the `label` value |
| `str` (long, ≥4 words) | Used as the `explanation` value |
| `dict` with `score`/`label`/`explanation` | Mapped to Score fields directly |
| `Score` object | Used as-is |
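For instance, an evaluator body can return a full dict that maps onto Score fields per the table above — a sketch (the function name and the 100-word threshold are illustrative):

```python
def word_count_check(output: str) -> dict:
    """Return a dict whose keys map directly onto Score fields."""
    n = len(output.split())
    return {
        "score": 1.0 if n <= 100 else 0.0,
        "label": "concise" if n <= 100 else "verbose",
        "explanation": f"response contains {n} words",
    }

print(word_count_check("a short answer"))
```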
## Important: Code vs LLM Evaluators
The `@create_evaluator` decorator wraps a plain Python function.
- `kind="code"` (default): For deterministic evaluators that don't call an LLM.
- `kind="llm"`: Marks the evaluator as LLM-based, but **you** must implement the LLM
call inside the function. The decorator does not call an LLM for you.
For most LLM-based evaluation, prefer `ClassificationEvaluator` which handles
the LLM call, structured output parsing, and explanations automatically:
```python
from phoenix.evals import ClassificationEvaluator, LLM
relevance = ClassificationEvaluator(
name="relevance",
prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
llm=LLM(provider="openai", model="gpt-4o"),
choices={"relevant": 1.0, "irrelevant": 0.0},
)
```
## Pre-Built
```python
from phoenix.experiments.evaluators import ContainsAnyKeyword, JSONParseable, MatchesRegex
evaluators = [
ContainsAnyKeyword(keywords=["disclaimer"]),
JSONParseable(),
MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}"),
]
```

# Evaluators: Code Evaluators in TypeScript
Deterministic evaluators without LLM. Fast, cheap, reproducible.
## Basic Pattern
```typescript
import { createEvaluator } from "@arizeai/phoenix-evals";
const containsCitation = createEvaluator<{ output: string }>(
({ output }) => /\[\d+\]/.test(output) ? 1 : 0,
{ name: "contains_citation", kind: "CODE" }
);
```
## With Full Results (asExperimentEvaluator)
```typescript
import { asExperimentEvaluator } from "@arizeai/phoenix-client/experiments";
const jsonValid = asExperimentEvaluator({
name: "json_valid",
kind: "CODE",
evaluate: async ({ output }) => {
try {
JSON.parse(String(output));
return { score: 1.0, label: "valid_json" };
} catch (e) {
return { score: 0.0, label: "invalid_json", explanation: String(e) };
}
},
});
```
## Parameter Types
```typescript
interface EvaluatorParams {
input: Record<string, unknown>;
output: unknown;
expected: Record<string, unknown>;
metadata: Record<string, unknown>;
}
```
## Common Patterns
- **Regex**: `/pattern/.test(output)`
- **JSON**: `JSON.parse()` + zod schema
- **Keywords**: `output.includes(keyword)`
- **Similarity**: `fastest-levenshtein`

# Evaluators: Custom Templates
Design LLM judge prompts.
## Complete Template Pattern
```python
TEMPLATE = """Evaluate faithfulness of the response to the context.
<context>{{context}}</context>
<response>{{output}}</response>
CRITERIA:
"faithful" = ALL claims supported by context
"unfaithful" = ANY claim NOT in context
EXAMPLES:
Context: "Price is $10" → Response: "It costs $10" → faithful
Context: "Price is $10" → Response: "About $15" → unfaithful
EDGE CASES:
- Empty context → cannot_evaluate
- "I don't know" when appropriate → faithful
- Partial faithfulness → unfaithful (strict)
Answer (faithful/unfaithful):"""
```
## Template Structure
1. Task description
2. Input variables in XML tags
3. Criteria definitions
4. Examples (2-4 cases)
5. Edge cases
6. Output format
## XML Tags
```
<question>{{input}}</question>
<response>{{output}}</response>
<context>{{context}}</context>
<reference>{{reference}}</reference>
```
## Common Mistakes
| Mistake | Fix |
| ------- | --- |
| Vague criteria | Define each label exactly |
| No examples | Include 2-4 cases |
| Ambiguous format | Specify exact output |
| No edge cases | Address ambiguity |

# Evaluators: LLM Evaluators in Python
LLM evaluators use a language model to judge outputs. Use when criteria are subjective.
## Quick Start
```python
from phoenix.evals import ClassificationEvaluator, LLM
llm = LLM(provider="openai", model="gpt-4o")
HELPFULNESS_TEMPLATE = """Rate how helpful the response is.
<question>{{input}}</question>
<response>{{output}}</response>
"helpful" means directly addresses the question.
"not_helpful" means does not address the question.
Your answer (helpful/not_helpful):"""
helpfulness = ClassificationEvaluator(
name="helpfulness",
prompt_template=HELPFULNESS_TEMPLATE,
llm=llm,
choices={"not_helpful": 0, "helpful": 1}
)
```
## Template Variables
Use XML tags to wrap variables for clarity:
| Variable | XML Tag |
| -------- | ------- |
| `{{input}}` | `<question>{{input}}</question>` |
| `{{output}}` | `<response>{{output}}</response>` |
| `{{reference}}` | `<reference>{{reference}}</reference>` |
| `{{context}}` | `<context>{{context}}</context>` |
## create_classifier (Factory)
Shorthand factory that returns a `ClassificationEvaluator`. Prefer direct
`ClassificationEvaluator` instantiation for more parameters/customization:
```python
from phoenix.evals import create_classifier, LLM
relevance = create_classifier(
name="relevance",
prompt_template="""Is this response relevant to the question?
<question>{{input}}</question>
<response>{{output}}</response>
Answer (relevant/irrelevant):""",
llm=LLM(provider="openai", model="gpt-4o"),
choices={"relevant": 1.0, "irrelevant": 0.0},
)
```
## Input Mapping
Column names must match template variables. Rename columns or use `bind_evaluator`:
```python
# Option 1: Rename columns to match template variables
df = df.rename(columns={"user_query": "input", "ai_response": "output"})
# Option 2: Use bind_evaluator
from phoenix.evals import bind_evaluator
bound = bind_evaluator(
evaluator=helpfulness,
input_mapping={"input": "user_query", "output": "ai_response"},
)
```
## Running
```python
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[helpfulness])
```
## Best Practices
1. **Be specific** - Define exactly what pass/fail means
2. **Include examples** - Show concrete cases for each label
3. **Explanations by default** - `ClassificationEvaluator` includes explanations automatically
4. **Study built-in prompts** - See
`phoenix.evals.__generated__.classification_evaluator_configs` for examples
of well-structured evaluation prompts (Faithfulness, Correctness, DocumentRelevance, etc.)

# Evaluators: LLM Evaluators in TypeScript
LLM evaluators use a language model to judge outputs. Uses Vercel AI SDK.
## Quick Start
```typescript
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";
const helpfulness = await createClassificationEvaluator<{
input: string;
output: string;
}>({
name: "helpfulness",
model: openai("gpt-4o"),
promptTemplate: `Rate helpfulness.
<question>{{input}}</question>
<response>{{output}}</response>
Answer (helpful/not_helpful):`,
choices: { not_helpful: 0, helpful: 1 },
});
```
## Template Variables
Use XML tags: `<question>{{input}}</question>`, `<response>{{output}}</response>`, `<context>{{context}}</context>`
## Custom Evaluator with asExperimentEvaluator
```typescript
import { asExperimentEvaluator } from "@arizeai/phoenix-client/experiments";
const customEval = asExperimentEvaluator({
name: "custom",
kind: "LLM",
evaluate: async ({ input, output }) => {
// Your LLM call here
return { score: 1.0, label: "pass", explanation: "..." };
},
});
```
## Pre-Built Evaluators
```typescript
import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals";
const faithfulnessEvaluator = createFaithfulnessEvaluator({
model: openai("gpt-4o"),
});
```
## Best Practices
- Be specific about criteria
- Include examples in prompts
- Use `<thinking>` for chain of thought

# Evaluators: Overview
When and how to build automated evaluators.
## Decision Framework
```
Should I Build an Evaluator?

Can I fix it with a prompt change?
  YES → Fix the prompt first
  NO  → Is this a recurring issue?
          YES → Build evaluator
          NO  → Add to watchlist
```
**Don't automate prematurely.** Many issues are simple prompt fixes.
## Evaluator Requirements
1. **Clear criteria** - Specific, not "Is it good?"
2. **Labeled test set** - 100+ examples with human labels
3. **Measured accuracy** - Know TPR/TNR before deploying
## Evaluator Lifecycle
1. **Discover** - Error analysis reveals pattern
2. **Design** - Define criteria and test cases
3. **Implement** - Build code or LLM evaluator
4. **Calibrate** - Validate against human labels
5. **Deploy** - Add to experiment/CI pipeline
6. **Monitor** - Track accuracy over time
7. **Maintain** - Update as product evolves
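The calibration step (4) can be sketched in plain Python: compare evaluator labels against human labels and compute true-positive and true-negative rates. The function and label names here are illustrative, not part of the phoenix-evals API.

```python
# Sketch of calibration: agreement between an evaluator and human labels.
# Treating "pass" as the positive class is an assumption; adjust to your labels.
def calibrate(human_labels, eval_labels, positive="pass"):
    pairs = list(zip(human_labels, eval_labels))
    tp = sum(h == positive and e == positive for h, e in pairs)
    fn = sum(h == positive and e != positive for h, e in pairs)
    tn = sum(h != positive and e != positive for h, e in pairs)
    fp = sum(h != positive and e == positive for h, e in pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # sensitivity on positives
    tnr = tn / (tn + fp) if tn + fp else 0.0  # specificity on negatives
    return tpr, tnr
```

Deploy only once both rates clear your bar: a judge that passes everything has perfect TPR but a useless TNR, which is why both are required.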
## What NOT to Automate
- **Rare issues** - <5 instances? Watchlist, don't build
- **Quick fixes** - Fixable by prompt change? Fix it
- **Evolving criteria** - Stabilize definition first

# Evaluators: Pre-Built
Use for exploration only. Validate before production.
## Python
```python
from phoenix.evals import LLM
from phoenix.evals.metrics import FaithfulnessEvaluator
llm = LLM(provider="openai", model="gpt-4o")
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
```
**Note**: `HallucinationEvaluator` is deprecated; use `FaithfulnessEvaluator` instead.
`FaithfulnessEvaluator` uses "faithful"/"unfaithful" labels, with a score of 1.0 meaning faithful.
## TypeScript
```typescript
import { createHallucinationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";
const hallucinationEval = createHallucinationEvaluator({ model: openai("gpt-4o") });
```
## Available (2.0)
| Evaluator | Type | Description |
| --------- | ---- | ----------- |
| `FaithfulnessEvaluator` | LLM | Is the response faithful to the context? |
| `CorrectnessEvaluator` | LLM | Is the response correct? |
| `DocumentRelevanceEvaluator` | LLM | Are retrieved documents relevant? |
| `ToolSelectionEvaluator` | LLM | Did the agent select the right tool? |
| `ToolInvocationEvaluator` | LLM | Did the agent invoke the tool correctly? |
| `ToolResponseHandlingEvaluator` | LLM | Did the agent handle the tool response well? |
| `MatchesRegex` | Code | Does output match a regex pattern? |
| `PrecisionRecallFScore` | Code | Precision/recall/F-score metrics |
| `exact_match` | Code | Exact string match |
Legacy evaluators (`HallucinationEvaluator`, `QAEvaluator`, `RelevanceEvaluator`,
`ToxicityEvaluator`, `SummarizationEvaluator`) are in `phoenix.evals.legacy` and deprecated.
## When to Use
| Situation | Recommendation |
| --------- | -------------- |
| Exploration | Find traces to review |
| Find outliers | Sort by scores |
| Production | Validate first (>80% human agreement) |
| Domain-specific | Build custom |
## Exploration Pattern
```python
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=traces, evaluators=[faithfulness_eval])
# Score columns contain dicts — extract numeric scores
scores = results_df["faithfulness_score"].apply(
lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
low_scores = results_df[scores < 0.5] # Review these
high_scores = results_df[scores > 0.9] # Also sample
```
## Validation Required
```python
from sklearn.metrics import classification_report
print(classification_report(human_labels, evaluator_results["label"]))
# Target: >80% agreement
```

# Evaluators: RAG Systems
RAG has two distinct components requiring different evaluation approaches.
## Two-Phase Evaluation
```
RETRIEVAL                        GENERATION
─────────                        ──────────
Query → Retriever → Docs         Docs + Query → LLM → Answer
           │                                │
       IR Metrics                LLM Judges / Code Checks
```
**Debug retrieval first** using IR metrics, then tackle generation quality.
## Retrieval Evaluation (IR Metrics)
Use traditional information retrieval metrics:
| Metric | What It Measures |
| ------ | ---------------- |
| Recall@k | Of all relevant docs, how many in top k? |
| Precision@k | Of k retrieved docs, how many relevant? |
| MRR | Reciprocal rank of the first relevant doc, averaged over queries |
| NDCG | Quality weighted by position |
```python
# Requires query-document relevance labels
def recall_at_k(retrieved_ids, relevant_ids, k=5):
retrieved_set = set(retrieved_ids[:k])
relevant_set = set(relevant_ids)
if not relevant_set:
return 0.0
return len(retrieved_set & relevant_set) / len(relevant_set)
```
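Precision@k and MRR follow the same shape as `recall_at_k` above; minimal sketches:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    # Of the top-k retrieved docs, what fraction is relevant?
    retrieved = retrieved_ids[:k]
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant_ids)) / len(retrieved)

def mrr(retrieved_ids, relevant_ids):
    # Reciprocal rank of the first relevant doc (0.0 if none retrieved);
    # average this value over queries to get the full MRR metric.
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```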
## Creating Retrieval Test Data
Generate query-document pairs synthetically:
```python
# Reverse process: document → questions that document answers
def generate_retrieval_test(documents):
test_pairs = []
for doc in documents:
        # Extract facts, generate questions (assumes llm() returns a list of strings)
        questions = llm(f"Generate 3 questions this document answers:\n{doc}")
for q in questions:
test_pairs.append({"query": q, "relevant_doc_id": doc.id})
return test_pairs
```
## Generation Evaluation
Use LLM judges for qualities code can't measure:
| Eval | Question |
| ---- | -------- |
| **Faithfulness** | Are all claims supported by retrieved context? |
| **Relevance** | Does answer address the question? |
| **Completeness** | Does answer cover key points from context? |
```python
from phoenix.evals import ClassificationEvaluator, LLM
FAITHFULNESS_TEMPLATE = """Given the context and answer, is every claim in the answer supported by the context?
<context>{{context}}</context>
<answer>{{output}}</answer>
"faithful" = ALL claims supported by context
"unfaithful" = ANY claim NOT in context
Answer (faithful/unfaithful):"""
faithfulness = ClassificationEvaluator(
name="faithfulness",
prompt_template=FAITHFULNESS_TEMPLATE,
llm=LLM(provider="openai", model="gpt-4o"),
choices={"unfaithful": 0, "faithful": 1}
)
```
## RAG Failure Taxonomy
Common failure modes to evaluate:
```yaml
retrieval_failures:
- no_relevant_docs: Query returns unrelated content
- partial_retrieval: Some relevant docs missed
- wrong_chunk: Right doc, wrong section
generation_failures:
- hallucination: Claims not in retrieved context
- ignored_context: Answer doesn't use retrieved docs
- incomplete: Missing key information from context
- wrong_synthesis: Misinterprets or miscombines sources
```
## Evaluation Order
1. **Retrieval first** - If wrong docs, generation will fail
2. **Faithfulness** - Is answer grounded in context?
3. **Answer quality** - Does answer address the question?
Fix retrieval problems before debugging generation.
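One way to enforce this ordering in code is to gate generation evals on retrieval quality, so faithfulness scores aren't polluted by examples where the right docs never arrived. A sketch (the function name and threshold are illustrative):

```python
def should_eval_generation(retrieved_ids, relevant_ids, k=5, threshold=0.5):
    # Only judge generation when retrieval recall@k clears a bar;
    # below it, the failure is a retrieval problem, not a generation one.
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return recall >= threshold
```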

# Experiments: Datasets in Python
Creating and managing evaluation datasets.
## Creating Datasets
```python
from phoenix.client import Client
client = Client()
# From examples
dataset = client.datasets.create_dataset(
name="qa-test-v1",
examples=[
{
"input": {"question": "What is 2+2?"},
"output": {"answer": "4"},
"metadata": {"category": "math"},
},
],
)
# From DataFrame
dataset = client.datasets.create_dataset(
dataframe=df,
name="qa-test-v1",
input_keys=["question"],
output_keys=["answer"],
metadata_keys=["category"],
)
```
## From Production Traces
```python
spans_df = client.spans.get_spans_dataframe(project_identifier="my-app")
dataset = client.datasets.create_dataset(
dataframe=spans_df[["input.value", "output.value"]],
name="production-sample-v1",
input_keys=["input.value"],
output_keys=["output.value"],
)
```
## Retrieving Datasets
```python
dataset = client.datasets.get_dataset(name="qa-test-v1")
df = dataset.to_dataframe()
```
## Key Parameters
| Parameter | Description |
| --------- | ----------- |
| `input_keys` | Columns for task input |
| `output_keys` | Columns for expected output |
| `metadata_keys` | Additional context |
## Using Evaluators in Experiments
### Evaluators as experiment evaluators
Pass phoenix-evals evaluators directly to `run_experiment` as the `evaluators` argument:
```python
from functools import partial
from phoenix.client import AsyncClient
from phoenix.evals import ClassificationEvaluator, LLM, bind_evaluator
# Define an LLM evaluator
refusal = ClassificationEvaluator(
name="refusal",
prompt_template="Is this a refusal?\nQuestion: {{query}}\nResponse: {{response}}",
llm=LLM(provider="openai", model="gpt-4o"),
choices={"refusal": 0, "answer": 1},
)
# Bind to map dataset columns to evaluator params
refusal_evaluator = bind_evaluator(refusal, {"query": "input.query", "response": "output"})
# Define experiment task
async def run_rag_task(input, rag_engine):
return rag_engine.query(input["query"])
# Run experiment with the evaluator
experiment = await AsyncClient().experiments.run_experiment(
dataset=ds,
task=partial(run_rag_task, rag_engine=query_engine),
experiment_name="baseline",
evaluators=[refusal_evaluator],
concurrency=10,
)
```
### Evaluators as the task (meta evaluation)
Use an LLM evaluator as the experiment **task** to test the evaluator itself
against human annotations:
```python
from phoenix.evals import create_evaluator
# The evaluator IS the task being tested
def run_refusal_eval(input, evaluator):
result = evaluator.evaluate(input)
return result[0]
# A simple heuristic checks judge vs human agreement
@create_evaluator(name="exact_match")
def exact_match(output, expected):
return float(output["score"]) == float(expected["refusal_score"])
# Run: evaluator is the task, exact_match evaluates it
experiment = await AsyncClient().experiments.run_experiment(
dataset=annotated_dataset,
task=partial(run_refusal_eval, evaluator=refusal),
experiment_name="judge-v1",
evaluators=[exact_match],
concurrency=10,
)
```
This pattern lets you iterate on evaluator prompts until they align with human judgments.
See `tutorials/evals/evals-2/evals_2.0_rag_demo.ipynb` for a full worked example.
## Best Practices
- **Versioning**: Create new datasets (e.g., `qa-test-v2`), don't modify
- **Metadata**: Track source, category, difficulty
- **Balance**: Ensure diverse coverage across categories

# Experiments: Datasets in TypeScript
Creating and managing evaluation datasets.
## Creating Datasets
```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createDataset } from "@arizeai/phoenix-client/datasets";
const client = createClient();
const { datasetId } = await createDataset({
client,
name: "qa-test-v1",
examples: [
{
input: { question: "What is 2+2?" },
output: { answer: "4" },
metadata: { category: "math" },
},
],
});
```
## Example Structure
```typescript
interface DatasetExample {
input: Record<string, unknown>; // Task input
output?: Record<string, unknown>; // Expected output
metadata?: Record<string, unknown>; // Additional context
}
```
## From Production Traces
```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";
const { spans } = await getSpans({
project: { projectName: "my-app" },
parentId: null, // root spans only
limit: 100,
});
const examples = spans.map((span) => ({
input: { query: span.attributes?.["input.value"] },
output: { response: span.attributes?.["output.value"] },
metadata: { spanId: span.context.span_id },
}));
await createDataset({ client, name: "production-sample", examples });
```
## Retrieving Datasets
```typescript
import { getDataset, listDatasets } from "@arizeai/phoenix-client/datasets";
const dataset = await getDataset({ client, datasetId: "..." });
const all = await listDatasets({ client });
```
## Best Practices
- **Versioning**: Create new datasets, don't modify existing
- **Metadata**: Track source, category, provenance
- **Type safety**: Use TypeScript interfaces for structure

# Experiments: Overview
Systematic testing of AI systems with datasets, tasks, and evaluators.
## Structure
```
DATASET → Examples: {input, expected_output, metadata}
TASK → function(input) → output
EVALUATORS → (input, output, expected) → score
EXPERIMENT → Run task on all examples, score results
```
## Basic Usage
```python
from phoenix.client.experiments import run_experiment
experiment = run_experiment(
dataset=my_dataset,
task=my_task,
evaluators=[accuracy, faithfulness],
experiment_name="improved-retrieval-v2",
)
print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}
```
## Workflow
1. **Create dataset** - From traces, synthetic data, or manual curation
2. **Define task** - The function to test (your LLM pipeline)
3. **Select evaluators** - Code and/or LLM-based
4. **Run experiment** - Execute and score
5. **Analyze & iterate** - Review, modify task, re-run
## Dry Runs
Test setup before full execution:
```python
experiment = run_experiment(dataset, task, evaluators, dry_run=3) # Just 3 examples
```
## Best Practices
- **Name meaningfully**: `"improved-retrieval-v2-2024-01-15"` not `"test"`
- **Version datasets**: Don't modify existing
- **Multiple evaluators**: Combine perspectives

# Experiments: Running Experiments in Python
Execute experiments with `run_experiment`.
## Basic Usage
```python
from phoenix.client import Client
from phoenix.client.experiments import run_experiment
client = Client()
dataset = client.datasets.get_dataset(name="qa-test-v1")
def my_task(example):
return call_llm(example.input["question"])
def exact_match(output, expected):
return 1.0 if output.strip().lower() == expected["answer"].strip().lower() else 0.0
experiment = run_experiment(
dataset=dataset,
task=my_task,
evaluators=[exact_match],
experiment_name="qa-experiment-v1",
)
```
## Task Functions
```python
# Basic task
def task(example):
return call_llm(example.input["question"])
# With context (RAG)
def rag_task(example):
return call_llm(f"Context: {example.input['context']}\nQ: {example.input['question']}")
```
## Evaluator Parameters
| Parameter | Access |
| --------- | ------ |
| `output` | Task output |
| `expected` | Example expected output |
| `input` | Example input |
| `metadata` | Example metadata |
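`run_experiment` injects these by parameter name, so a function evaluator can declare any subset it needs. A sketch combining three of them (the logic and dict keys are illustrative, not a built-in evaluator):

```python
def grounded_answer(output, expected, metadata):
    # Skip strict matching for examples flagged as ambiguous in metadata
    if metadata.get("ambiguous"):
        return 1.0
    # Otherwise require the expected answer to appear in the output
    return 1.0 if expected["answer"].lower() in str(output).lower() else 0.0
```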
## Options
```python
experiment = run_experiment(
dataset=dataset,
task=my_task,
evaluators=evaluators,
experiment_name="my-experiment",
dry_run=3, # Test with 3 examples
repetitions=3, # Run each example 3 times
)
```
## Results
```python
print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}
for run in experiment.runs:
print(run.output, run.scores)
```
## Add Evaluations Later
```python
from phoenix.client.experiments import evaluate_experiment
evaluate_experiment(experiment=experiment, evaluators=[new_evaluator])
```

# Experiments: Running Experiments in TypeScript
Execute experiments with `runExperiment`.
## Basic Usage
```typescript
import { createClient } from "@arizeai/phoenix-client";
import {
runExperiment,
asExperimentEvaluator,
} from "@arizeai/phoenix-client/experiments";
const client = createClient();
const task = async (example: { input: Record<string, unknown> }) => {
return await callLLM(example.input.question as string);
};
const exactMatch = asExperimentEvaluator({
name: "exact_match",
kind: "CODE",
evaluate: async ({ output, expected }) => ({
score: output === expected?.answer ? 1.0 : 0.0,
label: output === expected?.answer ? "match" : "no_match",
}),
});
const experiment = await runExperiment({
client,
experimentName: "qa-experiment-v1",
dataset: { datasetId: "your-dataset-id" },
task,
evaluators: [exactMatch],
});
```
## Task Functions
```typescript
// Basic task
const task = async (example) => await callLLM(example.input.question as string);
// With context (RAG)
const ragTask = async (example) => {
const prompt = `Context: ${example.input.context}\nQ: ${example.input.question}`;
return await callLLM(prompt);
};
```
## Evaluator Parameters
```typescript
interface EvaluatorParams {
input: Record<string, unknown>;
output: unknown;
expected: Record<string, unknown>;
metadata: Record<string, unknown>;
}
```
## Options
```typescript
const experiment = await runExperiment({
client,
experimentName: "my-experiment",
dataset: { datasetName: "qa-test-v1" },
task,
evaluators,
repetitions: 3, // Run each example 3 times
maxConcurrency: 5, // Limit concurrent executions
});
```
## Add Evaluations Later
```typescript
import { evaluateExperiment } from "@arizeai/phoenix-client/experiments";
await evaluateExperiment({ client, experiment, evaluators: [newEvaluator] });
```

# Experiments: Generating Synthetic Test Data
Creating diverse, targeted test data for evaluation.
## Dimension-Based Approach
Define axes of variation, then generate combinations:
```python
dimensions = {
"issue_type": ["billing", "technical", "shipping"],
"customer_mood": ["frustrated", "neutral", "happy"],
"complexity": ["simple", "moderate", "complex"],
}
```
## Two-Step Generation
1. **Generate tuples** (combinations of dimension values)
2. **Convert to natural queries** (separate LLM call per tuple)
```python
# Step 1: Create tuples
tuples = [
("billing", "frustrated", "complex"),
("shipping", "neutral", "simple"),
]
# Step 2: Convert to natural query
def tuple_to_query(t):
prompt = f"""Generate a realistic customer message:
Issue: {t[0]}, Mood: {t[1]}, Complexity: {t[2]}
Write naturally, include typos if appropriate. Don't be formulaic."""
return llm(prompt)
```
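Rather than hand-writing tuples, step 1 can enumerate the full cross-product of the dimensions (then down-sample if the grid gets large):

```python
import itertools

dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}

# 3 x 3 x 3 = 27 combinations, one tuple per dimension-value combination
tuples = list(itertools.product(*dimensions.values()))
```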
## Target Failure Modes
Dimensions should target known failures from error analysis:
```python
# From error analysis findings
dimensions = {
"timezone": ["EST", "PST", "UTC", "ambiguous"], # Known failure
"date_format": ["ISO", "US", "EU", "relative"], # Known failure
}
```
## Quality Control
- **Validate**: Check for placeholder text, minimum length
- **Deduplicate**: Remove near-duplicate queries using embeddings
- **Balance**: Ensure coverage across dimension values
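A minimal validation check mirroring these rules (the placeholder regex and length threshold are assumptions; tune them to your domain):

```python
import re

def validate_query(query: str, min_length: int = 20) -> bool:
    # Reject leftover placeholder text like "[NAME]" or "<product>",
    # and queries too short to be realistic
    has_placeholder = bool(re.search(r"\[.*?\]|<.*?>", query))
    return len(query) >= min_length and not has_placeholder
```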
## When to Use
| Use Synthetic | Use Real Data |
| ------------- | ------------- |
| Limited production data | Sufficient traces |
| Testing edge cases | Validating actual behavior |
| Pre-launch evals | Post-launch monitoring |
## Sample Sizes
| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Comprehensive eval | 100-500 |
| Per-dimension | 10-20 per combination |

# Experiments: Generating Synthetic Test Data (TypeScript)
Creating diverse, targeted test data for evaluation.
## Dimension-Based Approach
Define axes of variation, then generate combinations:
```typescript
const dimensions = {
issueType: ["billing", "technical", "shipping"],
customerMood: ["frustrated", "neutral", "happy"],
complexity: ["simple", "moderate", "complex"],
};
```
## Two-Step Generation
1. **Generate tuples** (combinations of dimension values)
2. **Convert to natural queries** (separate LLM call per tuple)
```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
// Step 1: Create tuples
type Tuple = [string, string, string];
const tuples: Tuple[] = [
["billing", "frustrated", "complex"],
["shipping", "neutral", "simple"],
];
// Step 2: Convert to natural query
async function tupleToQuery(t: Tuple): Promise<string> {
const { text } = await generateText({
model: openai("gpt-4o"),
prompt: `Generate a realistic customer message:
Issue: ${t[0]}, Mood: ${t[1]}, Complexity: ${t[2]}
Write naturally, include typos if appropriate. Don't be formulaic.`,
});
return text;
}
```
## Target Failure Modes
Dimensions should target known failures from error analysis:
```typescript
// From error analysis findings
const dimensions = {
timezone: ["EST", "PST", "UTC", "ambiguous"], // Known failure
dateFormat: ["ISO", "US", "EU", "relative"], // Known failure
};
```
## Quality Control
- **Validate**: Check for placeholder text, minimum length
- **Deduplicate**: Remove near-duplicate queries using embeddings
- **Balance**: Ensure coverage across dimension values
```typescript
function validateQuery(query: string): boolean {
const minLength = 20;
const hasPlaceholder = /\[.*?\]|<.*?>/.test(query);
return query.length >= minLength && !hasPlaceholder;
}
```
## When to Use
| Use Synthetic | Use Real Data |
| ------------- | ------------- |
| Limited production data | Sufficient traces |
| Testing edge cases | Validating actual behavior |
| Pre-launch evals | Post-launch monitoring |
## Sample Sizes
| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Comprehensive eval | 100-500 |
| Per-dimension | 10-20 per combination |

# Anti-Patterns
Common mistakes and fixes.
| Anti-Pattern | Problem | Fix |
| ------------ | ------- | --- |
| Generic metrics | Pre-built scores don't match your failures | Build from error analysis |
| Vibe-based | No quantification | Measure with experiments |
| Ignoring humans | Uncalibrated LLM judges | Validate >80% TPR/TNR |
| Premature automation | Evaluators for imagined problems | Let observed failures drive |
| Saturation blindness | 100% pass = no signal | Keep capability evals at 50-80% |
| Similarity metrics | BERTScore/ROUGE for generation | Use for retrieval only |
| Model switching | Hoping a model works better | Error analysis first |
## Quantify Changes
```python
baseline = run_experiment(dataset, old_prompt, evaluators)
improved = run_experiment(dataset, new_prompt, evaluators)
print(f"Improvement: {improved.pass_rate - baseline.pass_rate:+.1%}")
```
## Don't Use Similarity for Generation
```python
# BAD
score = bertscore(output, reference)
# GOOD
correct_facts = check_facts_against_source(output, context)
```
## Error Analysis Before Model Change
```python
# BAD
for model in models:
results = test(model)
# GOOD
failures = analyze_errors(results)
# Then decide if model change is warranted
```

# Model Selection
Error analysis first, model changes last.
## Decision Tree
```
Performance Issue?

Does error analysis suggest a model problem?
  NO  → Fix prompts, retrieval, tools
  YES → Is it a capability gap?
          YES → Consider a model change
          NO  → Fix the actual problem
```
## Judge Model Selection
| Principle | Action |
| --------- | ------ |
| Start capable | Use gpt-4o first |
| Optimize later | Test cheaper after criteria stable |
| Same model OK | Judge does different task |
```python
# Start with capable model
judge = ClassificationEvaluator(
llm=LLM(provider="openai", model="gpt-4o"),
...
)
# After validation, test cheaper
judge_cheap = ClassificationEvaluator(
llm=LLM(provider="openai", model="gpt-4o-mini"),
...
)
# Compare TPR/TNR on same test set
```
## Don't Model Shop
```python
# BAD
for model in ["gpt-4o", "claude-3", "gemini-pro"]:
results = run_experiment(dataset, task, model)
# GOOD
failures = analyze_errors(results)
# "Ignores context" → Fix prompt
# "Can't do math" → Maybe try better model
```
## When Model Change Is Warranted
- Failures persist after prompt optimization
- Capability gaps (reasoning, math, code)
- Error analysis confirms model limitation

# Fundamentals
Application-specific tests for AI systems. Code first, LLM for nuance, human for truth.
## Evaluator Types
| Type | Speed | Cost | Use Case |
| ---- | ----- | ---- | -------- |
| **Code** | Fast | Cheap | Regex, JSON, format, exact match |
| **LLM** | Medium | Medium | Subjective quality, complex criteria |
| **Human** | Slow | Expensive | Ground truth, calibration |
**Decision:** Code first → LLM only when code can't capture criteria → Human for calibration.
## Score Structure
| Property | Required | Description |
| -------- | -------- | ----------- |
| `name` | Yes | Evaluator name |
| `kind` | Yes | `"code"`, `"llm"`, `"human"` |
| `score` | No* | 0-1 numeric |
| `label` | No* | `"pass"`, `"fail"` |
| `explanation` | No | Rationale |
*One of `score` or `label` required.
## Binary > Likert
Use pass/fail, not 1-5 scales. Clearer criteria, easier calibration.
```python
# Multiple binary checks instead of one Likert scale
evaluators = [
AnswersQuestion(), # Yes/No
UsesContext(), # Yes/No
NoHallucination(), # Yes/No
]
```
## Quick Patterns
### Code Evaluator
```python
import re
from phoenix.evals import create_evaluator
@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
return bool(re.search(r'\[\d+\]', output))
```
### LLM Evaluator
```python
from phoenix.evals import ClassificationEvaluator, LLM
evaluator = ClassificationEvaluator(
name="helpfulness",
prompt_template="...",
llm=LLM(provider="openai", model="gpt-4o"),
choices={"not_helpful": 0, "helpful": 1}
)
```
### Run Experiment
```python
from phoenix.client.experiments import run_experiment
experiment = run_experiment(
dataset=dataset,
task=my_task,
evaluators=[evaluator1, evaluator2],
)
print(experiment.aggregate_scores)
```

# Observe: Sampling Strategies
How to efficiently sample production traces for review.
## Strategies
### 1. Failure-Focused (Highest Priority)
```python
errors = spans_df[spans_df["status_code"] == "ERROR"]
negative_feedback = spans_df[spans_df["feedback"] == "negative"]
```
### 2. Outliers
```python
long_responses = spans_df.nlargest(50, "response_length")
slow_responses = spans_df.nlargest(50, "latency_ms")
```
### 3. Stratified (Coverage)
```python
# Sample equally from each category
by_query_type = spans_df.groupby("metadata.query_type").apply(
lambda x: x.sample(min(len(x), 20))
)
```
### 4. Metric-Guided
```python
# Review traces flagged by automated evaluators
flagged = spans_df[eval_results["label"] == "hallucinated"]
borderline = spans_df[(eval_results["score"] > 0.3) & (eval_results["score"] < 0.7)]
```
## Building a Review Queue
```python
def build_review_queue(spans_df, max_traces=100):
queue = pd.concat([
spans_df[spans_df["status_code"] == "ERROR"],
spans_df[spans_df["feedback"] == "negative"],
spans_df.nlargest(10, "response_length"),
spans_df.sample(min(30, len(spans_df))),
]).drop_duplicates("span_id").head(max_traces)
return queue
```
## Sample Size Guidelines
| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Error analysis | 100+ (until saturation) |
| Golden dataset | 100-500 |
| Judge calibration | 100+ per class |
**Saturation:** Stop when new traces show the same failure patterns.
## Trace-Level Sampling
When you need whole requests (all spans per trace), use `get_traces`:
```python
from phoenix.client import Client
from datetime import datetime, timedelta
client = Client()
# Recent traces with full span trees
traces = client.traces.get_traces(
project_identifier="my-app",
limit=100,
include_spans=True,
)
# Time-windowed sampling (e.g., last hour)
traces = client.traces.get_traces(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=1),
limit=50,
include_spans=True,
)
# Filter by session (multi-turn conversations)
traces = client.traces.get_traces(
project_identifier="my-app",
session_id="user-session-abc",
include_spans=True,
)
# Sort by latency to find slowest requests
traces = client.traces.get_traces(
project_identifier="my-app",
sort="latency_ms",
order="desc",
limit=50,
)
```

# Observe: Sampling Strategies (TypeScript)
How to efficiently sample production traces for review.
## Strategies
### 1. Failure-Focused (Highest Priority)
Use server-side filters to fetch only what you need:
```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";
// Server-side filter — only ERROR spans are returned
const { spans: errors } = await getSpans({
project: { projectName: "my-project" },
statusCode: "ERROR",
limit: 100,
});
// Fetch only LLM spans
const { spans: llmSpans } = await getSpans({
project: { projectName: "my-project" },
spanKind: "LLM",
limit: 100,
});
// Filter by span name
const { spans: chatSpans } = await getSpans({
project: { projectName: "my-project" },
name: "chat_completion",
limit: 100,
});
```
### 2. Outliers
```typescript
const { spans } = await getSpans({
project: { projectName: "my-project" },
limit: 200,
});
const latency = (s: (typeof spans)[number]) =>
new Date(s.end_time).getTime() - new Date(s.start_time).getTime();
const sorted = [...spans].sort((a, b) => latency(b) - latency(a));
const slowResponses = sorted.slice(0, 50);
```
### 3. Stratified (Coverage)
```typescript
// Sample equally from each category
function stratifiedSample<T>(items: T[], groupBy: (item: T) => string, perGroup: number): T[] {
const groups = new Map<string, T[]>();
for (const item of items) {
const key = groupBy(item);
if (!groups.has(key)) groups.set(key, []);
groups.get(key)!.push(item);
}
return [...groups.values()].flatMap((g) => g.slice(0, perGroup));
}
const { spans } = await getSpans({
project: { projectName: "my-project" },
limit: 500,
});
const byQueryType = stratifiedSample(spans, (s) => String(s.attributes?.["metadata.query_type"] ?? "unknown"), 20);
```
### 4. Metric-Guided
```typescript
import { getSpanAnnotations } from "@arizeai/phoenix-client/spans";
// Fetch annotations for your spans, then filter by label
const { annotations } = await getSpanAnnotations({
project: { projectName: "my-project" },
spanIds: spans.map((s) => s.context.span_id),
includeAnnotationNames: ["hallucination"],
});
const flaggedSpanIds = new Set(
annotations.filter((a) => a.result?.label === "hallucinated").map((a) => a.span_id)
);
const flagged = spans.filter((s) => flaggedSpanIds.has(s.context.span_id));
```
## Trace-Level Sampling
When you need whole requests (all spans in a trace), use `getTraces`:
```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
// Recent traces with full span trees
const { traces } = await getTraces({
project: { projectName: "my-project" },
limit: 100,
includeSpans: true,
});
// Filter by session (e.g., multi-turn conversations)
const { traces: sessionTraces } = await getTraces({
project: { projectName: "my-project" },
sessionId: "user-session-abc",
includeSpans: true,
});
// Time-windowed sampling
const { traces: recentTraces } = await getTraces({
project: { projectName: "my-project" },
startTime: new Date(Date.now() - 60 * 60 * 1000), // last hour
limit: 50,
includeSpans: true,
});
```
## Building a Review Queue
```typescript
// Combine server-side filters into a review queue
const { spans: errorSpans } = await getSpans({
project: { projectName: "my-project" },
statusCode: "ERROR",
limit: 30,
});
const { spans: allSpans } = await getSpans({
project: { projectName: "my-project" },
limit: 100,
});
const random = [...allSpans].sort(() => Math.random() - 0.5).slice(0, 30); // rough shuffle; avoids mutating allSpans
const combined = [...errorSpans, ...random];
const unique = [...new Map(combined.map((s) => [s.context.span_id, s])).values()];
const reviewQueue = unique.slice(0, 100);
```
## Sample Size Guidelines
| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Error analysis | 100+ (until saturation) |
| Golden dataset | 100-500 |
| Judge calibration | 100+ per class |
**Saturation:** Stop when new traces show the same failure patterns.

# Observe: Tracing Setup
Configure tracing to capture data for evaluation.
## Quick Setup
```python
# Python
from phoenix.otel import register
register(project_name="my-app", auto_instrument=True)
```
```typescript
// TypeScript
import { registerPhoenix } from "@arizeai/phoenix-otel";
registerPhoenix({ projectName: "my-app", autoInstrument: true });
```
## Essential Attributes
| Attribute | Why It Matters |
| --------- | -------------- |
| `input.value` | User's request |
| `output.value` | Response to evaluate |
| `retrieval.documents` | Context for faithfulness |
| `tool.name`, `tool.parameters` | Agent evaluation |
| `llm.model_name` | Track by model |
## Custom Attributes for Evals
```python
span.set_attribute("metadata.client_type", "enterprise")
span.set_attribute("metadata.query_category", "billing")
```
## Exporting for Evaluation
### Spans (Python — DataFrame)
```python
from phoenix.client import Client
# Client() works for local Phoenix (falls back to env vars or localhost:6006)
# For remote/cloud: Client(base_url="https://app.phoenix.arize.com", api_key="...")
client = Client()
spans_df = client.spans.get_spans_dataframe(
project_identifier="my-app", # NOT project_name= (deprecated)
root_spans_only=True,
)
dataset = client.datasets.create_dataset(
name="error-analysis-set",
dataframe=spans_df[["input.value", "output.value"]],
input_keys=["input.value"],
output_keys=["output.value"],
)
```
### Spans (TypeScript)
```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";
const { spans } = await getSpans({
project: { projectName: "my-app" },
parentId: null, // root spans only
limit: 100,
});
```
### Traces (Python — structured)
Use `get_traces` when you need full trace trees (e.g., multi-turn conversations, agent workflows):
```python
from datetime import datetime, timedelta
traces = client.traces.get_traces(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=24),
include_spans=True, # includes all spans per trace
limit=100,
)
# Each trace has: trace_id, start_time, end_time, spans (when include_spans=True)
```
### Traces (TypeScript)
```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
const { traces } = await getTraces({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
includeSpans: true,
limit: 100,
});
```
## Uploading Evaluations as Annotations
### Python
```python
from phoenix.evals import evaluate_dataframe
from phoenix.evals.utils import to_annotation_dataframe
# Run evaluations
results_df = evaluate_dataframe(dataframe=spans_df, evaluators=[my_eval])
# Format results for Phoenix annotations
annotations_df = to_annotation_dataframe(results_df)
# Upload to Phoenix
client.spans.log_span_annotations_dataframe(dataframe=annotations_df)
```
### TypeScript
```typescript
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";
await logSpanAnnotations({
spanAnnotations: [
{
spanId: "abc123",
name: "quality",
label: "good",
score: 0.95,
annotatorKind: "LLM",
},
],
});
```
Annotations are visible in the Phoenix UI alongside your traces.
## Verify
Required attributes: `input.value`, `output.value`, `status_code`
For RAG: `retrieval.documents`
For agents: `tool.name`, `tool.parameters`

# Production: Continuous Evaluation
Capability vs regression evals and the ongoing feedback loop.
## Two Types of Evals
| Type | Pass Rate Target | Purpose | Update |
| ---- | ---------------- | ------- | ------ |
| **Capability** | 50-80% | Measure improvement | Add harder cases |
| **Regression** | 95-100% | Catch breakage | Add fixed bugs |
## Saturation
When capability evals hit >95% pass rate, they're saturated:
1. Graduate passing cases to regression suite
2. Add new challenging cases to capability suite
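The graduation step can be sketched as a filter over per-case pass rates; the data structures here are hypothetical, not a Phoenix API:

```python
# Hypothetical structure: case id -> recent pass rate on the capability suite
capability_pass_rates = {"case-001": 1.0, "case-002": 1.0, "case-003": 0.6}

# Cases that pass consistently graduate to the regression suite
graduated = [cid for cid, rate in capability_pass_rates.items() if rate >= 0.95]
capability_suite = [cid for cid in capability_pass_rates if cid not in graduated]
regression_suite = graduated  # in practice, append to the existing regression suite

print(graduated)  # → ['case-001', 'case-002']
```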
## Feedback Loop
```
Production → Sample traffic → Run evaluators → Find failures
↑ ↓
Deploy ← Run CI evals ← Create test cases ← Error analysis
```
## Implementation
Build a continuous monitoring loop:
1. **Sample recent traces** at regular intervals (e.g., 100 traces per hour)
2. **Run evaluators** on sampled traces
3. **Log results** to Phoenix for tracking
4. **Queue concerning results** for human review
5. **Create test cases** from recurring failure patterns
### Python
```python
from phoenix.client import Client
from datetime import datetime, timedelta
client = Client()
# 1. Sample recent spans (includes full attributes for evaluation)
spans_df = client.spans.get_spans_dataframe(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=1),
root_spans_only=True,
limit=100,
)
# 2. Run evaluators
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(
dataframe=spans_df,
evaluators=[quality_eval, safety_eval],
)
# 3. Upload results as annotations
from phoenix.evals.utils import to_annotation_dataframe
annotations_df = to_annotation_dataframe(results_df)
client.spans.log_span_annotations_dataframe(dataframe=annotations_df)
```
### TypeScript
```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";
// 1. Sample recent spans
const { spans } = await getSpans({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 60 * 60 * 1000),
parentId: null, // root spans only
limit: 100,
});
// 2. Run evaluators (user-defined)
const results = await Promise.all(
spans.map(async (span) => ({
spanId: span.context.span_id,
...await runEvaluators(span, [qualityEval, safetyEval]),
}))
);
// 3. Upload results as annotations
await logSpanAnnotations({
spanAnnotations: results.map((r) => ({
spanId: r.spanId,
name: "quality",
score: r.qualityScore,
label: r.qualityLabel,
annotatorKind: "LLM" as const,
})),
});
```
For trace-level monitoring (e.g., agent workflows), use `get_traces`/`getTraces` to fetch full traces of interest:
```python
# Python: identify slow traces
traces = client.traces.get_traces(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=1),
sort="latency_ms",
order="desc",
limit=50,
)
```
```typescript
// TypeScript: fetch recent traces (sort client-side by duration to find slow ones)
import { getTraces } from "@arizeai/phoenix-client/traces";
const { traces } = await getTraces({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 60 * 60 * 1000),
limit: 50,
});
```
## Alerting
| Condition | Severity | Action |
| --------- | -------- | ------ |
| Regression < 98% | Critical | Page oncall |
| Capability declining | Warning | Slack notify |
| Capability > 95% for 7d | Info | Schedule review |
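The table above could be encoded as a small routing function; the thresholds mirror the table, and the routing targets are placeholders:

```python
def alert_severity(regression_rate: float, capability_trend: float,
                   capability_rate: float, days_high: int) -> str:
    """Map monitored metrics to an alert level, per the table above."""
    if regression_rate < 0.98:
        return "critical"   # page oncall
    if capability_trend < 0:
        return "warning"    # Slack notify
    if capability_rate > 0.95 and days_high >= 7:
        return "info"       # schedule review
    return "none"

print(alert_severity(0.97, 0.0, 0.80, 0))  # → critical
```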
## Key Principles
- **Two suites** - Capability + Regression always
- **Graduate cases** - Move consistent passes to regression
- **Track trends** - Monitor over time, not just snapshots

# Production: Guardrails vs Evaluators
Guardrails block in real-time. Evaluators measure asynchronously.
## Key Distinction
```
Request → [INPUT GUARDRAIL] → LLM → [OUTPUT GUARDRAIL] → Response
└──→ ASYNC EVALUATOR (background)
```
## Guardrails
| Aspect | Requirement |
| ------ | ----------- |
| Timing | Synchronous, blocking |
| Latency | < 100ms |
| Purpose | Prevent harm |
| Type | Code-based (deterministic) |
**Use for:** PII detection, prompt injection, profanity, length limits, format validation.
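Such checks stay fast because they are plain code. A minimal sketch of a deterministic input guardrail — the function name, regex, and limit are illustrative, not a Phoenix API:

```python
import re

# Illustrative deterministic guardrail: synchronous, no LLM calls
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
MAX_INPUT_CHARS = 4000

def input_guardrail(text: str) -> tuple[bool, str]:
    """Return (allowed, reason). Runs before the LLM call."""
    if len(text) > MAX_INPUT_CHARS:
        return False, "input too long"
    if EMAIL_RE.search(text):
        return False, "possible PII (email address)"
    return True, "ok"

allowed, reason = input_guardrail("My email is jane@example.com")
# Blocked: the email pattern matches
```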
## Evaluators
| Aspect | Characteristic |
| ------ | -------------- |
| Timing | Async, background |
| Latency | Can be seconds |
| Purpose | Measure quality |
| Type | Can use LLMs |
**Use for:** Helpfulness, faithfulness, tone, completeness, citation accuracy.
## Decision
| Question | Answer |
| -------- | ------ |
| Must block harmful content? | Guardrail |
| Measuring quality? | Evaluator |
| Need LLM judgment? | Evaluator |
| < 100ms required? | Guardrail |
| False positives = angry users? | Evaluator |
## LLM Guardrails: Rarely
Only use LLM guardrails if:
- Latency budget > 1s
- Error cost >> LLM cost
- Low volume
- Fallback exists
**Key Principle:** Guardrails prevent harm (block). Evaluators measure quality (log).

# Production: Overview
CI/CD evals vs production monitoring - complementary approaches.
## Two Evaluation Modes
| Aspect | CI/CD Evals | Production Monitoring |
| ------ | ----------- | -------------------- |
| **When** | Pre-deployment | Post-deployment, ongoing |
| **Data** | Fixed dataset | Sampled traffic |
| **Goal** | Prevent regression | Detect drift |
| **Response** | Block deploy | Alert & analyze |
## CI/CD Evaluations
```python
# Fast, deterministic checks
ci_evaluators = [
has_required_format,
no_pii_leak,
safety_check,
regression_test_suite,
]
# Small but representative dataset (~100 examples)
run_experiment(ci_dataset, task, ci_evaluators)
```
Set thresholds: regression=0.95, safety=1.0, format=0.98.
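One way to enforce those thresholds is a small gate script at the end of the CI job; the `results` dict below is a hypothetical aggregate, not the actual `run_experiment` return shape:

```python
import sys

# Hypothetical aggregate pass rates keyed by evaluator name
results = {
    "regression_test_suite": 0.97,
    "safety_check": 1.0,
    "has_required_format": 0.99,
}
thresholds = {
    "regression_test_suite": 0.95,
    "safety_check": 1.0,
    "has_required_format": 0.98,
}

failures = [
    f"{name}: {results.get(name, 0.0):.2f} < {min_score:.2f}"
    for name, min_score in thresholds.items()
    if results.get(name, 0.0) < min_score
]
if failures:
    print("CI eval gate failed:", *failures, sep="\n  ")
    sys.exit(1)  # non-zero exit blocks the deploy
print("CI eval gate passed")
```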
## Production Monitoring
### Python
```python
from phoenix.client import Client
from datetime import datetime, timedelta
client = Client()
# Sample recent traces (last hour)
traces = client.traces.get_traces(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=1),
include_spans=True,
limit=100,
)
# Run evaluators on sampled traffic (run_evaluators_async and alert_on_failure are user-defined)
for trace in traces:
results = run_evaluators_async(trace, production_evaluators)
if any(r["score"] < 0.5 for r in results):
alert_on_failure(trace, results)
```
### TypeScript
```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
import { getSpans } from "@arizeai/phoenix-client/spans";
// Sample recent traces (last hour)
const { traces } = await getTraces({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 60 * 60 * 1000),
includeSpans: true,
limit: 100,
});
// Or sample spans directly for evaluation
const { spans } = await getSpans({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 60 * 60 * 1000),
limit: 100,
});
// Run evaluators on sampled traffic (runEvaluators and alertOnFailure are user-defined)
for (const span of spans) {
const results = await runEvaluators(span, productionEvaluators);
if (results.some((r) => r.score < 0.5)) {
await alertOnFailure(span, results);
}
}
```
Prioritize: errors → negative feedback → random sample.
## Feedback Loop
```
Production finds failure → Error analysis → Add to CI dataset → Prevents future regression
```

# Setup: Python
Packages required for Phoenix evals and experiments.
## Installation
```bash
# Core Phoenix package (includes client, evals, otel)
pip install arize-phoenix
# Or install individual packages
pip install arize-phoenix-client # Phoenix client only
pip install arize-phoenix-evals # Evaluation utilities
pip install arize-phoenix-otel # OpenTelemetry integration
```
## LLM Providers
For LLM-as-judge evaluators, install your provider's SDK:
```bash
pip install openai # OpenAI
pip install anthropic # Anthropic
pip install google-generativeai # Google
```
## Validation (Optional)
```bash
pip install scikit-learn # For TPR/TNR metrics
```
## Quick Verify
```python
from phoenix.client import Client
from phoenix.evals import LLM, ClassificationEvaluator
from phoenix.otel import register
# All imports should work
print("Phoenix Python setup complete")
```
## Key Imports (Evals 2.0)
```python
from phoenix.client import Client
from phoenix.evals import (
ClassificationEvaluator, # LLM classification evaluator (preferred)
LLM, # Provider-agnostic LLM wrapper
async_evaluate_dataframe, # Batch evaluate a DataFrame (preferred, async)
evaluate_dataframe, # Batch evaluate a DataFrame (sync)
create_evaluator, # Decorator for code-based evaluators
create_classifier, # Factory for LLM classification evaluators
bind_evaluator, # Map column names to evaluator params
Score, # Score dataclass
)
from phoenix.evals.utils import to_annotation_dataframe # Format results for Phoenix annotations
```
**Prefer**: `ClassificationEvaluator` over `create_classifier` (more parameters/customization).
**Prefer**: `async_evaluate_dataframe` over `evaluate_dataframe` (better throughput for LLM evals).
**Do NOT use** legacy 1.0 imports: `OpenAIModel`, `AnthropicModel`, `run_evals`, `llm_classify`.

# Setup: TypeScript
Packages required for Phoenix evals and experiments.
## Installation
```bash
# Using npm
npm install @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel
# Using pnpm
pnpm add @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel
```
## LLM Providers
For LLM-as-judge evaluators, install Vercel AI SDK providers:
```bash
npm install ai @ai-sdk/openai # Vercel AI SDK + OpenAI
npm install @ai-sdk/anthropic # Anthropic
npm install @ai-sdk/google # Google
```
Or use direct provider SDKs:
```bash
npm install openai # OpenAI direct
npm install @anthropic-ai/sdk # Anthropic direct
```
## Quick Verify
```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { registerPhoenix } from "@arizeai/phoenix-otel";
// All imports should work
console.log("Phoenix TypeScript setup complete");
```

# Validating Evaluators (Python)
Validate LLM evaluators against human-labeled examples. Target >80% TPR/TNR/Accuracy.
## Calculate Metrics
```python
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(human_labels, evaluator_predictions))
cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
print(f"TPR: {tpr:.2f}, TNR: {tnr:.2f}")
```
## Correct Production Estimates
```python
def correct_estimate(observed, tpr, tnr):
"""Adjust observed pass rate using known TPR/TNR."""
return (observed - (1 - tnr)) / (tpr - (1 - tnr))
```
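A quick worked example with illustrative numbers: a judge validated at TPR 0.95 and TNR 0.85 that observes a 90% pass rate implies a true rate of about 93.75%:

```python
# Illustrative numbers: judge validated at TPR=0.95, TNR=0.85
observed, tpr, tnr = 0.90, 0.95, 0.85
true_rate = (observed - (1 - tnr)) / (tpr - (1 - tnr))
print(round(true_rate, 4))  # → 0.9375
```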
## Find Misclassified
```python
# False Positives: Evaluator pass, human fail
fp_mask = (evaluator_predictions == 1) & (human_labels == 0)
false_positives = dataset[fp_mask]
# False Negatives: Evaluator fail, human pass
fn_mask = (evaluator_predictions == 0) & (human_labels == 1)
false_negatives = dataset[fn_mask]
```
## Red Flags
- TPR or TNR < 70%
- Large gap between TPR and TNR
- Kappa < 0.6

# Validating Evaluators (TypeScript)
Validate an LLM evaluator against human-labeled examples before deploying it.
Target: **>80% TPR and >80% TNR**.
Roles are inverted compared to a normal task experiment:
| Normal experiment | Evaluator validation |
|---|---|
| Task = agent logic | Task = run the evaluator under test |
| Evaluator = judge output | Evaluator = exact-match vs human ground truth |
| Dataset = agent examples | Dataset = golden hand-labeled examples |
## Golden Dataset
Use a separate dataset name so validation experiments don't mix with task experiments in Phoenix.
Store human ground truth in `metadata.groundTruthLabel`. Aim for ~50/50 balance:
```typescript
import type { Example } from "@arizeai/phoenix-client/types/datasets";
const goldenExamples: Example[] = [
{ input: { q: "Capital of France?" }, output: { answer: "Paris" }, metadata: { groundTruthLabel: "correct" } },
{ input: { q: "Capital of France?" }, output: { answer: "Lyon" }, metadata: { groundTruthLabel: "incorrect" } },
{ input: { q: "Capital of France?" }, output: { answer: "Major city..." }, metadata: { groundTruthLabel: "incorrect" } },
];
const VALIDATOR_DATASET = "my-app-qa-evaluator-validation"; // separate from task dataset
const POSITIVE_LABEL = "correct";
const NEGATIVE_LABEL = "incorrect";
```
## Validation Experiment
```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createOrGetDataset, getDatasetExamples } from "@arizeai/phoenix-client/datasets";
import { asExperimentEvaluator, runExperiment } from "@arizeai/phoenix-client/experiments";
import { myEvaluator } from "./myEvaluator.js";
const client = createClient();
const { datasetId } = await createOrGetDataset({ client, name: VALIDATOR_DATASET, examples: goldenExamples });
const { examples } = await getDatasetExamples({ client, dataset: { datasetId } });
const groundTruth = new Map(examples.map((ex) => [ex.id, ex.metadata?.groundTruthLabel as string]));
// Task: invoke the evaluator under test
const task = async (example: (typeof examples)[number]) => {
const result = await myEvaluator.evaluate({ input: example.input, output: example.output, metadata: example.metadata });
return result.label ?? "unknown";
};
// Evaluator: exact-match against human ground truth
const exactMatch = asExperimentEvaluator({
name: "exact-match", kind: "CODE",
evaluate: ({ output, metadata }) => {
const expected = metadata?.groundTruthLabel as string;
const predicted = typeof output === "string" ? output : "unknown";
return { score: predicted === expected ? 1 : 0, label: predicted, explanation: `Expected: ${expected}, Got: ${predicted}` };
},
});
const experiment = await runExperiment({
client, experimentName: `evaluator-validation-${Date.now()}`,
dataset: { datasetId }, task, evaluators: [exactMatch],
});
// Compute confusion matrix
const runs = Object.values(experiment.runs);
const predicted = new Map((experiment.evaluationRuns ?? [])
.filter((e) => e.name === "exact-match")
.map((e) => [e.experimentRunId, e.result?.label ?? null]));
let tp = 0, fp = 0, tn = 0, fn = 0;
for (const run of runs) {
if (run.error) continue;
const p = predicted.get(run.id), a = groundTruth.get(run.datasetExampleId);
if (!p || !a) continue;
if (a === POSITIVE_LABEL && p === POSITIVE_LABEL) tp++;
else if (a === NEGATIVE_LABEL && p === POSITIVE_LABEL) fp++;
else if (a === NEGATIVE_LABEL && p === NEGATIVE_LABEL) tn++;
else if (a === POSITIVE_LABEL && p === NEGATIVE_LABEL) fn++;
}
const total = tp + fp + tn + fn;
const tpr = tp + fn > 0 ? (tp / (tp + fn)) * 100 : 0;
const tnr = tn + fp > 0 ? (tn / (tn + fp)) * 100 : 0;
console.log(`TPR: ${tpr.toFixed(1)}% TNR: ${tnr.toFixed(1)}% Accuracy: ${((tp + tn) / total * 100).toFixed(1)}%`);
```
## Results & Quality Rules
| Metric | Target | Low value means |
|---|---|---|
| TPR (sensitivity) | >80% | Misses real failures (false negatives) |
| TNR (specificity) | >80% | Flags good outputs (false positives) |
| Accuracy | >80% | General weakness |
**Golden dataset rules:** ~50/50 balance · include edge cases · human-labeled only · never mutate (append new versions) · 20-50 examples is enough.
**Re-validate when:** prompt template changes · judge model changes · criteria updated · production FP/FN spike.
## See Also
- `validation.md` — Metric definitions and concepts
- `experiments-running-typescript.md``runExperiment` API
- `experiments-datasets-typescript.md``createOrGetDataset` / `getDatasetExamples`

# Validation
Validate LLM judges against human labels before deploying. Target >80% agreement.
## Requirements
| Requirement | Target |
| ----------- | ------ |
| Test set size | 100+ examples |
| Balance | ~50/50 pass/fail |
| Accuracy | >80% |
| TPR/TNR | Both >70% |
## Metrics
| Metric | Formula | Use When |
| ------ | ------- | -------- |
| **Accuracy** | (TP+TN) / Total | General |
| **TPR (Recall)** | TP / (TP+FN) | Quality assurance |
| **TNR (Specificity)** | TN / (TN+FP) | Safety-critical |
| **Cohen's Kappa** | Agreement beyond chance | Comparing evaluators |
## Quick Validation
```python
from sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score
print(classification_report(human_labels, evaluator_predictions))
print(f"Kappa: {cohen_kappa_score(human_labels, evaluator_predictions):.3f}")
# Get TPR/TNR
cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
```
## Golden Dataset Structure
```python
golden_example = {
"input": "What is the capital of France?",
"output": "Paris is the capital.",
"ground_truth_label": "correct",
}
```
## Building Golden Datasets
1. Sample production traces (errors, negative feedback, edge cases)
2. Balance ~50/50 pass/fail
3. Expert labels each example
4. Version datasets (never modify existing)
```python
# GOOD - create new version
golden_v2 = golden_v1 + [new_examples]
# BAD - never modify existing
golden_v1.append(new_example)
```
## Warning Signs
- All pass or all fail → too lenient/strict
- Random results → criteria unclear
- TPR/TNR < 70% → needs improvement
## Re-Validate When
- Prompt template changes
- Judge model changes
- Criteria changes
- Monthly