Mirror of https://github.com/github/awesome-copilot.git, synced 2026-04-13 11:45:56 +00:00.
# Axial Coding

Group open-ended notes into structured failure taxonomies.

## Process

1. **Gather** - Collect open coding notes
2. **Pattern** - Group notes with common themes
3. **Name** - Create actionable category names
4. **Quantify** - Count failures per category
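The grouping-and-counting steps can be sketched as a small script, assuming each open-coding note has already been assigned a category name (the notes below are illustrative):

```python
from collections import Counter

# Hypothetical open-coding notes, each already tagged with a category (step 3)
coded_notes = [
    {"note": "invented a discount", "category": "hallucination"},
    {"note": "said 2-day shipping, policy is 5-7", "category": "factual_inaccuracy"},
    {"note": "made up a feature", "category": "hallucination"},
]

# Step 4: count failures per category, most frequent first
counts = Counter(n["category"] for n in coded_notes)
for category, count in counts.most_common():
    print(f"{category}: {count}")
```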
## Example Taxonomy

```yaml
failure_taxonomy:
  content_quality:
    hallucination: [invented_facts, fictional_citations]
    incompleteness: [partial_answer, missing_key_info]
    inaccuracy: [wrong_numbers, wrong_dates]

  communication:
    tone_mismatch: [too_casual, too_formal]
    clarity: [ambiguous, jargon_heavy]

  context:
    user_context: [ignored_preferences, misunderstood_intent]
    retrieved_context: [ignored_documents, wrong_context]

  safety:
    missing_disclaimers: [legal, medical, financial]
```

## Add Annotation (Python)

```python
from phoenix.client import Client

client = Client()
client.spans.add_span_annotation(
    span_id="abc123",
    annotation_name="failure_category",
    label="hallucination",
    explanation="invented a feature that doesn't exist",
    annotator_kind="HUMAN",
    sync=True,
)
```

## Add Annotation (TypeScript)

```typescript
import { addSpanAnnotation } from "@arizeai/phoenix-client/spans";

await addSpanAnnotation({
  spanAnnotation: {
    spanId: "abc123",
    name: "failure_category",
    label: "hallucination",
    explanation: "invented a feature that doesn't exist",
    annotatorKind: "HUMAN",
  },
});
```

## Agent Failure Taxonomy

```yaml
agent_failures:
  planning: [wrong_plan, incomplete_plan]
  tool_selection: [wrong_tool, missed_tool, unnecessary_call]
  tool_execution: [wrong_parameters, type_error]
  state_management: [lost_context, stuck_in_loop]
  error_recovery: [no_fallback, wrong_fallback]
```

## Transition Matrix (Agents)

Shows where failures occur between states:

```python
from collections import defaultdict

import pandas as pd


def build_transition_matrix(conversations, states):
    # find_last_success / find_first_failure are helpers you supply: they
    # return the state of the last successful / first failed step
    matrix = defaultdict(lambda: defaultdict(int))
    for conv in conversations:
        if conv["failed"]:
            last_success = find_last_success(conv)
            first_failure = find_first_failure(conv)
            matrix[last_success][first_failure] += 1
    return pd.DataFrame(matrix).fillna(0)
```

## Principles

- **MECE** - Each failure fits ONE category
- **Actionable** - Categories suggest fixes
- **Bottom-up** - Let categories emerge from data
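The MECE property can be checked mechanically: no note should appear in more than one category. A minimal sketch, with an illustrative category mapping:

```python
# Hypothetical category → notes mapping; MECE requires each note to land
# in exactly one category
categories = {
    "hallucination": {"made up a discount", "invented feature"},
    "factual_inaccuracy": {"wrong shipping time", "incorrect price"},
}

def is_mece(categories: dict) -> bool:
    # Flatten all notes; duplicates mean some note was filed twice
    all_notes = [note for notes in categories.values() for note in notes]
    return len(all_notes) == len(set(all_notes))

print(is_mece(categories))
```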
# Common Mistakes (Python)

Patterns that LLMs frequently generate incorrectly from training data.

## Legacy Model Classes

```python
# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")

# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")
```

**Why**: `OpenAIModel`, `AnthropicModel`, etc. are legacy 1.0 wrappers in `phoenix.evals.legacy`. The `LLM` class is provider-agnostic and is the current 2.0 API.

## Using run_evals Instead of evaluate_dataframe

```python
# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns a list of DataFrames

# RIGHT — current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns a single DataFrame with {name}_score dict columns
```

**Why**: `run_evals` is the legacy 1.0 batch function. `evaluate_dataframe` is the current 2.0 function with a different return format.

## Wrong Result Column Names

```python
# WRONG — column doesn't exist
score = results_df["relevance"].mean()

# WRONG — column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()

# RIGHT — extract the numeric score from each dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()
```

**Why**: `evaluate_dataframe` returns columns named `{name}_score` containing Score dicts like `{"name": "...", "score": 1.0, "label": "...", "explanation": "..."}`.

## Deprecated project_name Parameter

```python
# WRONG
df = client.spans.get_spans_dataframe(project_name="my-project")

# RIGHT
df = client.spans.get_spans_dataframe(project_identifier="my-project")
```

**Why**: `project_name` is deprecated in favor of `project_identifier`, which also accepts project IDs.

## Wrong Client Constructor

```python
# WRONG
client = Client(endpoint="https://app.phoenix.arize.com")
client = Client(url="https://app.phoenix.arize.com")

# RIGHT — for remote/cloud Phoenix
client = Client(base_url="https://app.phoenix.arize.com", api_key="...")

# ALSO RIGHT — for local Phoenix (falls back to env vars or localhost:6006)
client = Client()
```

**Why**: The parameter is `base_url`, not `endpoint` or `url`. For local instances, `Client()` with no args works fine. For remote instances, `base_url` and `api_key` are required.

## Too-Aggressive Time Filters

```python
# WRONG — often returns zero spans
from datetime import datetime, timedelta
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    start_time=datetime.now() - timedelta(hours=1),
)

# RIGHT — use limit to control result size instead
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=50,
)
```

**Why**: Traces may be from any time period. A 1-hour window frequently returns nothing. Use `limit=` to control result size instead.

## Not Filtering Spans Appropriately

```python
# WRONG — fetches all spans including internal LLM calls, retrievers, etc.
df = client.spans.get_spans_dataframe(project_identifier="my-project")

# RIGHT for end-to-end evaluation — filter to top-level spans
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    root_spans_only=True,
)

# RIGHT for RAG evaluation — fetch child spans for retriever/LLM metrics
all_spans = client.spans.get_spans_dataframe(
    project_identifier="my-project",
)
retriever_spans = all_spans[all_spans["span_kind"] == "RETRIEVER"]
llm_spans = all_spans[all_spans["span_kind"] == "LLM"]
```

**Why**: For end-to-end evaluation (e.g., overall answer quality), use `root_spans_only=True`. For RAG systems, you often need child spans separately — retriever spans for DocumentRelevance and LLM spans for Faithfulness. Choose the right span level for your evaluation target.

## Assuming Span Output is Plain Text

```python
# WRONG — output may be JSON, not plain text
df["output"] = df["attributes.output.value"]

# RIGHT — parse JSON and extract the answer field
import json

def extract_answer(output_value):
    if not isinstance(output_value, str):
        return str(output_value) if output_value is not None else ""
    try:
        parsed = json.loads(output_value)
        if isinstance(parsed, dict):
            for key in ("answer", "result", "output", "response"):
                if key in parsed:
                    return str(parsed[key])
    except (json.JSONDecodeError, TypeError):
        pass
    return output_value

df["output"] = df["attributes.output.value"].apply(extract_answer)
```

**Why**: LangChain and other frameworks often output structured JSON from root spans, like `{"context": "...", "question": "...", "answer": "..."}`. Evaluators need the actual answer text, not the raw JSON.

## Using @create_evaluator for LLM-Based Evaluation

```python
# WRONG — @create_evaluator doesn't call an LLM
@create_evaluator(name="relevance", kind="llm")
def relevance(input: str, output: str) -> str:
    pass  # No LLM is involved

# RIGHT — use ClassificationEvaluator for LLM-based evaluation
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```

**Why**: `@create_evaluator` wraps a plain Python function. Setting `kind="llm"` marks it as LLM-based, but you must implement the LLM call yourself. For LLM-based evaluation, prefer `ClassificationEvaluator`, which handles the LLM call, structured output parsing, and explanations automatically.

## Using llm_classify Instead of ClassificationEvaluator

```python
# WRONG — legacy 1.0 API
from phoenix.evals import llm_classify
results = llm_classify(
    dataframe=df,
    template=template_str,
    model=model,
    rails=["relevant", "irrelevant"],
)

# RIGHT — current 2.0 API
from phoenix.evals import ClassificationEvaluator, async_evaluate_dataframe, LLM

classifier = ClassificationEvaluator(
    name="relevance",
    prompt_template=template_str,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])
```

**Why**: `llm_classify` is the legacy 1.0 function. The current pattern is to create an evaluator with `ClassificationEvaluator` and run it with `async_evaluate_dataframe()`.

## Using HallucinationEvaluator

```python
# WRONG — deprecated
from phoenix.evals import HallucinationEvaluator
eval = HallucinationEvaluator(model)

# RIGHT — use FaithfulnessEvaluator
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals import LLM
eval = FaithfulnessEvaluator(llm=LLM(provider="openai", model="gpt-4o"))
```

**Why**: `HallucinationEvaluator` is deprecated. `FaithfulnessEvaluator` is its replacement, using "faithful"/"unfaithful" labels with a maximized score (1.0 = faithful).
# Error Analysis: Multi-Turn Conversations

Debugging complex multi-turn conversation traces.

## The Approach

1. **End-to-end first** - Did the conversation achieve the goal?
2. **Find first failure** - Trace backwards to the root cause
3. **Simplify** - Try single-turn before multi-turn debugging
4. **N-1 testing** - Isolate turn-specific vs. capability issues

## Find First Upstream Failure

```
Turn 1: User asks about flights ✓
Turn 2: Assistant asks for dates ✓
Turn 3: User provides dates ✓
Turn 4: Assistant searches WRONG dates ← FIRST FAILURE
Turn 5: Shows wrong flights (consequence)
Turn 6: User frustrated (consequence)
```

Focus on Turn 4, not Turn 6.
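The scan for the first failing turn can be sketched in a few lines, assuming each turn has already been judged pass/fail (the judgments below are illustrative):

```python
# Hypothetical per-turn judgments (True = turn was correct)
turns = [True, True, True, False, False, False]

# The first failure is the earliest failed turn; later failures are consequences
first_failure = next(
    (i + 1 for i, ok in enumerate(turns) if not ok),  # 1-indexed turn number
    None,  # None if no turn failed
)
print(first_failure)
```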
## Simplify First

Before debugging multi-turn, test single-turn:

```python
# If single-turn also fails → problem is retrieval/knowledge
# If single-turn passes → problem is conversation context
response = chat("What's the return policy for electronics?")
```

## N-1 Testing

Give turns 1 to N-1 as context, then test turn N:

```python
context = conversation[:n-1]
response = chat_with_context(context, user_message_n)
# Compare to the actual turn N
```

This isolates whether the error comes from the conversation context or from the underlying capability.
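A self-contained version of the N-1 test, with a stubbed `chat_with_context` standing in for your real chat function (the stub and conversation are illustrative):

```python
def chat_with_context(context, user_message):
    # Stub: a real implementation would call your chat model with the
    # prior turns passed as conversation history
    return f"reply to: {user_message}"

conversation = [
    {"role": "user", "content": "Any flights to Lisbon?"},
    {"role": "assistant", "content": "What dates?"},
    {"role": "user", "content": "March 3-10"},
]

n = 3  # test turn N (1-indexed)
context = conversation[: n - 1]               # turns 1 .. N-1
user_message_n = conversation[n - 1]["content"]

response = chat_with_context(context, user_message_n)
# Compare `response` to the actual turn-N output from the trace
print(response)
```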
## Checklist

1. Did the conversation achieve its goal? (E2E)
2. Which turn first went wrong?
3. Can you reproduce the failure with a single turn?
4. Is the error from context or capability? (N-1 test)
# Error Analysis

Review traces to discover failure modes before building evaluators.

## Process

1. **Sample** - 100+ traces (errors, negative feedback, random)
2. **Open Code** - Write free-form notes per trace
3. **Axial Code** - Group notes into failure categories
4. **Quantify** - Count failures per category
5. **Prioritize** - Rank by frequency × severity

## Sample Traces

### Span-level sampling (Python — DataFrame)

```python
import pandas as pd

from phoenix.client import Client

# Client() works for local Phoenix (falls back to env vars or localhost:6006)
# For remote/cloud: Client(base_url="https://app.phoenix.arize.com", api_key="...")
client = Client()
spans_df = client.spans.get_spans_dataframe(project_identifier="my-app")

# Build a representative sample
sample = pd.concat([
    spans_df[spans_df["status_code"] == "ERROR"].sample(30),
    spans_df[spans_df["feedback"] == "negative"].sample(30),
    spans_df.sample(40),
]).drop_duplicates("span_id").head(100)
```

### Span-level sampling (TypeScript)

```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";

const { spans: errors } = await getSpans({
  project: { projectName: "my-app" },
  statusCode: "ERROR",
  limit: 30,
});
const { spans: allSpans } = await getSpans({
  project: { projectName: "my-app" },
  limit: 70,
});
const sample = [...errors, ...allSpans.sort(() => Math.random() - 0.5).slice(0, 40)];
const unique = [...new Map(sample.map((s) => [s.context.span_id, s])).values()].slice(0, 100);
```

### Trace-level sampling (Python)

When errors span multiple spans (e.g., agent workflows), sample whole traces:

```python
from datetime import datetime, timedelta

traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=24),
    include_spans=True,
    sort="latency_ms",
    order="desc",
    limit=100,
)
# Each trace has: trace_id, start_time, end_time, spans
```

### Trace-level sampling (TypeScript)

```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";

const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});
```

## Add Notes (Python)

```python
client.spans.add_span_note(
    span_id="abc123",
    note="wrong timezone - said 3pm EST but user is PST",
)
```

## Add Notes (TypeScript)

```typescript
import { addSpanNote } from "@arizeai/phoenix-client/spans";

await addSpanNote({
  spanNote: {
    spanId: "abc123",
    note: "wrong timezone - said 3pm EST but user is PST",
  },
});
```

## What to Note

| Type | Examples |
| ---- | -------- |
| Factual errors | Wrong dates, prices, made-up features |
| Missing info | Didn't answer the question, omitted details |
| Tone issues | Too casual/formal for the context |
| Tool issues | Wrong tool, wrong parameters |
| Retrieval | Wrong docs, missing relevant docs |

## Good Notes

```
BAD:  "Response is bad"
GOOD: "Response says ships in 2 days but policy is 5-7 days"
```

## Group into Categories

```python
categories = {
    "factual_inaccuracy": ["wrong shipping time", "incorrect price"],
    "hallucination": ["made up a discount", "invented feature"],
    "tone_mismatch": ["informal for enterprise client"],
}
# Priority = Frequency × Severity
```
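The frequency × severity ranking can be computed directly; the category mapping is repeated so the snippet stands alone, and the severity weights are illustrative, not part of any API:

```python
# Illustrative severity weights per category (1 = minor, 3 = severe)
severity = {"factual_inaccuracy": 3, "hallucination": 3, "tone_mismatch": 1}

categories = {
    "factual_inaccuracy": ["wrong shipping time", "incorrect price"],
    "hallucination": ["made up a discount", "invented feature"],
    "tone_mismatch": ["informal for enterprise client"],
}

# Priority = Frequency × Severity
priority = {
    name: len(notes) * severity[name]
    for name, notes in categories.items()
}
ranked = sorted(priority, key=priority.get, reverse=True)
print(ranked)
```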
## Retrieve Existing Annotations

### Python

```python
# From a spans DataFrame
annotations_df = client.spans.get_span_annotations_dataframe(
    spans_dataframe=sample,
    project_identifier="my-app",
    include_annotation_names=["quality", "correctness"],
)
# annotations_df has: span_id (index), name, label, score, explanation

# Or from specific span IDs
annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span-id-1", "span-id-2"],
    project_identifier="my-app",
)
```

### TypeScript

```typescript
import { getSpanAnnotations } from "@arizeai/phoenix-client/spans";

const { annotations } = await getSpanAnnotations({
  project: { projectName: "my-app" },
  spanIds: ["span-id-1", "span-id-2"],
  includeAnnotationNames: ["quality", "correctness"],
});

for (const ann of annotations) {
  console.log(`${ann.span_id}: ${ann.name} = ${ann.result?.label} (${ann.result?.score})`);
}
```

## Saturation

Stop when new traces reveal no new failure modes. Minimum: 100 traces.
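Saturation can be tracked with a simple running check: how many previously unseen failure categories does each new batch of reviewed traces contribute?

```python
def new_categories_per_batch(batches):
    # batches: list of lists of category names, one list per review batch
    seen = set()
    counts = []
    for batch in batches:
        fresh = set(batch) - seen  # categories not seen in earlier batches
        counts.append(len(fresh))
        seen |= fresh
    return counts

# When the trailing counts hit zero, you have likely reached saturation
print(new_categories_per_batch([
    ["hallucination", "tone"], ["tone", "retrieval"], ["tone"],
]))
```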
# Batch Evaluation with evaluate_dataframe (Python)

Run evaluators across a DataFrame. The core 2.0 batch evaluation API.

## Preferred: async_evaluate_dataframe

For batch evaluations (especially with LLM evaluators), prefer the async version for better throughput:

```python
from phoenix.evals import async_evaluate_dataframe

results_df = await async_evaluate_dataframe(
    dataframe=df,               # pandas DataFrame with columns matching evaluator params
    evaluators=[eval1, eval2],  # List of evaluators
    concurrency=5,              # Max concurrent LLM calls (default 3)
    exit_on_error=False,        # Optional: stop on first error (default True)
    max_retries=3,              # Optional: retry failed LLM calls (default 10)
)
```

## Sync Version

```python
from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(
    dataframe=df,               # pandas DataFrame with columns matching evaluator params
    evaluators=[eval1, eval2],  # List of evaluators
    exit_on_error=False,        # Optional: stop on first error (default True)
    max_retries=3,              # Optional: retry failed LLM calls (default 10)
)
```

## Result Column Format

`async_evaluate_dataframe` / `evaluate_dataframe` returns a copy of the input DataFrame with added columns. **Result columns contain dicts, NOT raw numbers.**

For each evaluator named `"foo"`, two columns are added:

| Column | Type | Contents |
| ------ | ---- | -------- |
| `foo_score` | `dict` | `{"name": "foo", "score": 1.0, "label": "True", "explanation": "...", "metadata": {...}, "kind": "code", "direction": "maximize"}` |
| `foo_execution_details` | `dict` | `{"status": "success", "exceptions": [], "execution_seconds": 0.001}` |

Only non-None fields appear in the score dict.

### Extracting Numeric Scores

```python
# WRONG — these will fail or produce unexpected results
score = results_df["relevance"].mean()        # KeyError!
score = results_df["relevance_score"].mean()  # Tries to average dicts!

# RIGHT — extract the numeric score from each dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
mean_score = scores.mean()
```

### Extracting Labels

```python
labels = results_df["relevance_score"].apply(
    lambda x: x.get("label", "") if isinstance(x, dict) else ""
)
```

### Extracting Explanations (LLM evaluators)

```python
explanations = results_df["relevance_score"].apply(
    lambda x: x.get("explanation", "") if isinstance(x, dict) else ""
)
```

### Finding Failures

```python
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
failed_mask = scores < 0.5
failures = results_df[failed_mask]
```

## Input Mapping

Evaluators receive each row as a dict. Column names must match the evaluator's expected parameter names. If they don't match, use `.bind()` or `bind_evaluator`:

```python
from phoenix.evals import bind_evaluator, create_evaluator, async_evaluate_dataframe

@create_evaluator(name="check", kind="code")
def check(response: str) -> bool:
    return len(response.strip()) > 0

# Option 1: Use the .bind() method on the evaluator
check.bind(input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[check])

# Option 2: Use the bind_evaluator function
bound = bind_evaluator(evaluator=check, input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[bound])
```

Or simply rename columns to match:

```python
df = df.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output",
})
```

## DO NOT use run_evals

```python
# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1])
# Returns List[DataFrame] — one per evaluator

# RIGHT — current 2.0 API
from phoenix.evals import async_evaluate_dataframe
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns a single DataFrame with {name}_score dict columns
```

Key differences:

- `run_evals` returns a **list** of DataFrames (one per evaluator)
- `async_evaluate_dataframe` returns a **single** DataFrame with all results merged
- `async_evaluate_dataframe` uses the `{name}_score` dict column format
- `async_evaluate_dataframe` uses `bind_evaluator` for input mapping (not an `input_mapping=` param)
# Evaluators: Code Evaluators in Python

Deterministic evaluators without an LLM. Fast, cheap, reproducible.

## Basic Pattern

```python
import re
import json

from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    return bool(re.search(r"\[\d+\]", output))

@create_evaluator(name="json_valid", kind="code")
def json_valid(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```

## Parameter Binding

| Parameter | Description |
| --------- | ----------- |
| `output` | Task output |
| `input` | Example input |
| `expected` | Expected output |
| `metadata` | Example metadata |

```python
@create_evaluator(name="matches_expected", kind="code")
def matches_expected(output: str, expected: dict) -> bool:
    return output.strip() == expected.get("answer", "").strip()
```

## Common Patterns

- **Regex**: `re.search(pattern, output)`
- **JSON schema**: `jsonschema.validate()`
- **Keywords**: `keyword in output.lower()`
- **Length**: `len(output.split())`
- **Similarity**: `editdistance.eval()` or Jaccard
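Two of these patterns written out as plain functions, a minimal sketch (the keyword and the length cap are arbitrary choices; wrap each function with `@create_evaluator(name=..., kind="code")` to register it):

```python
def contains_disclaimer(output: str) -> bool:
    # Keyword pattern: case-insensitive containment check
    return "disclaimer" in output.lower()

def within_length(output: str) -> bool:
    # Length pattern: word count must stay under an arbitrary cap of 200
    return len(output.split()) <= 200

print(contains_disclaimer("Disclaimer: not legal advice"))
print(within_length("short answer"))
```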
## Return Types

| Return type | Result |
| ----------- | ------ |
| `bool` | `True` → score=1.0, label="True"; `False` → score=0.0, label="False" |
| `float`/`int` | Used as the `score` value directly |
| `str` (short, ≤3 words) | Used as the `label` value |
| `str` (long, ≥4 words) | Used as the `explanation` value |
| `dict` with `score`/`label`/`explanation` | Mapped to Score fields directly |
| `Score` object | Used as-is |
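A dict return lets one evaluator set score, label, and explanation together; a sketch of such a function body (the check itself is illustrative; wrap with `@create_evaluator` as above):

```python
def has_number(output: str) -> dict:
    # Returns a Score-shaped dict: score, label, and optional explanation
    if any(ch.isdigit() for ch in output):
        return {"score": 1.0, "label": "has_number"}
    return {
        "score": 0.0,
        "label": "no_number",
        "explanation": "expected at least one numeric value in the output",
    }

print(has_number("Ships in 2 days"))
```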
## Important: Code vs LLM Evaluators

The `@create_evaluator` decorator wraps a plain Python function.

- `kind="code"` (default): For deterministic evaluators that don't call an LLM.
- `kind="llm"`: Marks the evaluator as LLM-based, but **you** must implement the LLM call inside the function. The decorator does not call an LLM for you.

For most LLM-based evaluation, prefer `ClassificationEvaluator`, which handles the LLM call, structured output parsing, and explanations automatically:

```python
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```

## Pre-Built

```python
from phoenix.experiments.evaluators import ContainsAnyKeyword, JSONParseable, MatchesRegex

evaluators = [
    ContainsAnyKeyword(keywords=["disclaimer"]),
    JSONParseable(),
    MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}"),
]
```
# Evaluators: Code Evaluators in TypeScript

Deterministic evaluators without an LLM. Fast, cheap, reproducible.

## Basic Pattern

```typescript
import { createEvaluator } from "@arizeai/phoenix-evals";

const containsCitation = createEvaluator<{ output: string }>(
  ({ output }) => (/\[\d+\]/.test(output) ? 1 : 0),
  { name: "contains_citation", kind: "CODE" }
);
```

## With Full Results (asExperimentEvaluator)

```typescript
import { asExperimentEvaluator } from "@arizeai/phoenix-client/experiments";

const jsonValid = asExperimentEvaluator({
  name: "json_valid",
  kind: "CODE",
  evaluate: async ({ output }) => {
    try {
      JSON.parse(String(output));
      return { score: 1.0, label: "valid_json" };
    } catch (e) {
      return { score: 0.0, label: "invalid_json", explanation: String(e) };
    }
  },
});
```

## Parameter Types

```typescript
interface EvaluatorParams {
  input: Record<string, unknown>;
  output: unknown;
  expected: Record<string, unknown>;
  metadata: Record<string, unknown>;
}
```

## Common Patterns

- **Regex**: `/pattern/.test(output)`
- **JSON**: `JSON.parse()` + zod schema
- **Keywords**: `output.includes(keyword)`
- **Similarity**: `fastest-levenshtein`
# Evaluators: Custom Templates

Design LLM judge prompts.

## Complete Template Pattern

```python
TEMPLATE = """Evaluate faithfulness of the response to the context.

<context>{{context}}</context>
<response>{{output}}</response>

CRITERIA:
"faithful" = ALL claims supported by context
"unfaithful" = ANY claim NOT in context

EXAMPLES:
Context: "Price is $10" → Response: "It costs $10" → faithful
Context: "Price is $10" → Response: "About $15" → unfaithful

EDGE CASES:
- Empty context → cannot_evaluate
- "I don't know" when appropriate → faithful
- Partial faithfulness → unfaithful (strict)

Answer (faithful/unfaithful):"""
```

## Template Structure

1. Task description
2. Input variables in XML tags
3. Criteria definitions
4. Examples (2-4 cases)
5. Edge cases
6. Output format
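The `{{variable}}` slots are simple substitutions filled from each row's columns; a plain-Python sketch of how one filled prompt is produced (the replace-based filling is illustrative, not the library's own renderer):

```python
def fill_template(template: str, row: dict) -> str:
    # Replace each {{name}} slot with the row's value for that column
    for name, value in row.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template

template_src = "<context>{{context}}</context>\n<response>{{output}}</response>"
prompt = fill_template(template_src, {"context": "Price is $10", "output": "It costs $10"})
print(prompt)
```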
## XML Tags

```
<question>{{input}}</question>
<response>{{output}}</response>
<context>{{context}}</context>
<reference>{{reference}}</reference>
```

## Common Mistakes

| Mistake | Fix |
| ------- | --- |
| Vague criteria | Define each label exactly |
| No examples | Include 2-4 cases |
| Ambiguous format | Specify exact output |
| No edge cases | Address ambiguity |
|
||||
# Evaluators: LLM Evaluators in Python
|
||||
|
||||
LLM evaluators use a language model to judge outputs. Use when criteria are subjective.
|
||||
|
||||
## Quick Start
|
||||
|
||||
```python
|
||||
from phoenix.evals import ClassificationEvaluator, LLM
|
||||
|
||||
llm = LLM(provider="openai", model="gpt-4o")
|
||||
|
||||
HELPFULNESS_TEMPLATE = """Rate how helpful the response is.
|
||||
|
||||
<question>{{input}}</question>
|
||||
<response>{{output}}</response>
|
||||
|
||||
"helpful" means directly addresses the question.
|
||||
"not_helpful" means does not address the question.
|
||||
|
||||
Your answer (helpful/not_helpful):"""
|
||||
|
||||
helpfulness = ClassificationEvaluator(
|
||||
name="helpfulness",
|
||||
prompt_template=HELPFULNESS_TEMPLATE,
|
||||
llm=llm,
|
||||
choices={"not_helpful": 0, "helpful": 1}
|
||||
)
|
||||
```
|
||||
|
||||
## Template Variables
|
||||
|
||||
Use XML tags to wrap variables for clarity:
|
||||
|
||||
| Variable | XML Tag |
|
||||
| -------- | ------- |
|
||||
| `{{input}}` | `<question>{{input}}</question>` |
|
||||
| `{{output}}` | `<response>{{output}}</response>` |
|
||||
| `{{reference}}` | `<reference>{{reference}}</reference>` |
|
||||
| `{{context}}` | `<context>{{context}}</context>` |
|
||||
|
||||
## create_classifier (Factory)
|
||||
|
||||
Shorthand factory that returns a `ClassificationEvaluator`. Prefer direct
|
||||
`ClassificationEvaluator` instantiation for more parameters/customization:
|
||||
|
||||
```python
|
||||
from phoenix.evals import create_classifier, LLM
|
||||
|
||||
relevance = create_classifier(
|
||||
name="relevance",
|
||||
prompt_template="""Is this response relevant to the question?
|
||||
<question>{{input}}</question>
|
||||
<response>{{output}}</response>
|
||||
Answer (relevant/irrelevant):""",
|
||||
llm=LLM(provider="openai", model="gpt-4o"),
|
||||
choices={"relevant": 1.0, "irrelevant": 0.0},
|
||||
)
|
||||
```
|
||||
|
||||
## Input Mapping
|
||||
|
||||
Column names must match template variables. Rename columns or use `bind_evaluator`:
|
||||
|
||||
```python
|
||||
# Option 1: Rename columns to match template variables
|
||||
df = df.rename(columns={"user_query": "input", "ai_response": "output"})
|
||||
|
||||
# Option 2: Use bind_evaluator
|
||||
from phoenix.evals import bind_evaluator
|
||||
|
||||
bound = bind_evaluator(
|
||||
evaluator=helpfulness,
|
||||
input_mapping={"input": "user_query", "output": "ai_response"},
|
||||
)
|
||||
```
|
||||
|
||||
## Running
|
||||
|
||||
```python
|
||||
from phoenix.evals import evaluate_dataframe
|
||||
|
||||
results_df = evaluate_dataframe(dataframe=df, evaluators=[helpfulness])
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Be specific** - Define exactly what pass/fail means
|
||||
2. **Include examples** - Show concrete cases for each label
|
||||
3. **Explanations by default** - `ClassificationEvaluator` includes explanations automatically
|
||||
4. **Study built-in prompts** - See
|
||||
`phoenix.evals.__generated__.classification_evaluator_configs` for examples
|
||||
of well-structured evaluation prompts (Faithfulness, Correctness, DocumentRelevance, etc.)
|
||||
@@ -0,0 +1,58 @@
# Evaluators: LLM Evaluators in TypeScript

LLM evaluators use a language model to judge outputs. The TypeScript package uses the Vercel AI SDK.

## Quick Start

```typescript
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const helpfulness = await createClassificationEvaluator<{
  input: string;
  output: string;
}>({
  name: "helpfulness",
  model: openai("gpt-4o"),
  promptTemplate: `Rate helpfulness.
<question>{{input}}</question>
<response>{{output}}</response>
Answer (helpful/not_helpful):`,
  choices: { not_helpful: 0, helpful: 1 },
});
```

## Template Variables

Use XML tags: `<question>{{input}}</question>`, `<response>{{output}}</response>`, `<context>{{context}}</context>`

## Custom Evaluator with asExperimentEvaluator

```typescript
import { asExperimentEvaluator } from "@arizeai/phoenix-client/experiments";

const customEval = asExperimentEvaluator({
  name: "custom",
  kind: "LLM",
  evaluate: async ({ input, output }) => {
    // Your LLM call here
    return { score: 1.0, label: "pass", explanation: "..." };
  },
});
```

## Pre-Built Evaluators

```typescript
import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals";

const faithfulnessEvaluator = createFaithfulnessEvaluator({
  model: openai("gpt-4o"),
});
```

## Best Practices

- Be specific about criteria
- Include examples in prompts
- Use `<thinking>` for chain of thought
@@ -0,0 +1,40 @@
# Evaluators: Overview

When and how to build automated evaluators.

## Decision Framework

```
Should I Build an Evaluator?
        │
        ▼
Can I fix it with a prompt change?
  YES → Fix the prompt first
  NO  → Is this a recurring issue?
    YES → Build evaluator
    NO  → Add to watchlist
```

**Don't automate prematurely.** Many issues are simple prompt fixes.

## Evaluator Requirements

1. **Clear criteria** - Specific, not "Is it good?"
2. **Labeled test set** - 100+ examples with human labels
3. **Measured accuracy** - Know TPR/TNR before deploying
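Requirement 3 can be checked with a few lines of plain Python. A minimal sketch (the helper name and label values are illustrative, not part of phoenix):

```python
def tpr_tnr(human_labels, judge_labels, positive="fail"):
    """True-positive and true-negative rate of a judge vs. human labels."""
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    tn = sum(h != positive and j != positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr, tnr
```

Report both rates: an evaluator that flags everything has a perfect TPR and a useless TNR.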
## Evaluator Lifecycle

1. **Discover** - Error analysis reveals pattern
2. **Design** - Define criteria and test cases
3. **Implement** - Build code or LLM evaluator
4. **Calibrate** - Validate against human labels
5. **Deploy** - Add to experiment/CI pipeline
6. **Monitor** - Track accuracy over time
7. **Maintain** - Update as product evolves

## What NOT to Automate

- **Rare issues** - <5 instances? Watchlist, don't build
- **Quick fixes** - Fixable by prompt change? Fix it
- **Evolving criteria** - Stabilize definition first
@@ -0,0 +1,75 @@
# Evaluators: Pre-Built

Use for exploration only. Validate before production.

## Python

```python
from phoenix.evals import LLM
from phoenix.evals.metrics import FaithfulnessEvaluator

llm = LLM(provider="openai", model="gpt-4o")
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
```

**Note**: `HallucinationEvaluator` is deprecated. Use `FaithfulnessEvaluator` instead. It uses "faithful"/"unfaithful" labels, with a score of 1.0 = faithful.

## TypeScript

```typescript
import { createHallucinationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const hallucinationEval = createHallucinationEvaluator({ model: openai("gpt-4o") });
```

## Available (2.0)

| Evaluator | Type | Description |
| --------- | ---- | ----------- |
| `FaithfulnessEvaluator` | LLM | Is the response faithful to the context? |
| `CorrectnessEvaluator` | LLM | Is the response correct? |
| `DocumentRelevanceEvaluator` | LLM | Are retrieved documents relevant? |
| `ToolSelectionEvaluator` | LLM | Did the agent select the right tool? |
| `ToolInvocationEvaluator` | LLM | Did the agent invoke the tool correctly? |
| `ToolResponseHandlingEvaluator` | LLM | Did the agent handle the tool response well? |
| `MatchesRegex` | Code | Does output match a regex pattern? |
| `PrecisionRecallFScore` | Code | Precision/recall/F-score metrics |
| `exact_match` | Code | Exact string match |

Legacy evaluators (`HallucinationEvaluator`, `QAEvaluator`, `RelevanceEvaluator`, `ToxicityEvaluator`, `SummarizationEvaluator`) live in `phoenix.evals.legacy` and are deprecated.

## When to Use

| Situation | Recommendation |
| --------- | -------------- |
| Exploration | Find traces to review |
| Find outliers | Sort by scores |
| Production | Validate first (>80% human agreement) |
| Domain-specific | Build custom |

## Exploration Pattern

```python
from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(dataframe=traces, evaluators=[faithfulness_eval])

# Score columns contain dicts — extract numeric scores
scores = results_df["faithfulness_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
low_scores = results_df[scores < 0.5]   # Review these
high_scores = results_df[scores > 0.9]  # Also sample
```

## Validation Required

```python
from sklearn.metrics import classification_report

print(classification_report(human_labels, evaluator_results["label"]))
# Target: >80% agreement
```
@@ -0,0 +1,108 @@
# Evaluators: RAG Systems

RAG has two distinct components requiring different evaluation approaches.

## Two-Phase Evaluation

```
RETRIEVAL                        GENERATION
─────────                        ──────────
Query → Retriever → Docs         Docs + Query → LLM → Answer
          │                                 │
      IR Metrics                 LLM Judges / Code Checks
```

**Debug retrieval first** using IR metrics, then tackle generation quality.

## Retrieval Evaluation (IR Metrics)

Use traditional information retrieval metrics:

| Metric | What It Measures |
| ------ | ---------------- |
| Recall@k | Of all relevant docs, how many are in the top k? |
| Precision@k | Of the k retrieved docs, how many are relevant? |
| MRR | How high is the first relevant doc? |
| NDCG | Quality weighted by position |

```python
# Requires query-document relevance labels
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    retrieved_set = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    if not relevant_set:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)
```
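MRR from the table above can be computed the same way; a small sketch under the same labeling assumptions:

```python
def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0.0 if none appear)."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```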
## Creating Retrieval Test Data

Generate query-document pairs synthetically:

```python
# Reverse process: document → questions that document answers
def generate_retrieval_test(documents):
    test_pairs = []
    for doc in documents:
        # Extract facts, generate questions
        # (assumes llm() returns a list of question strings)
        questions = llm(f"Generate 3 questions this document answers:\n{doc}")
        for q in questions:
            test_pairs.append({"query": q, "relevant_doc_id": doc.id})
    return test_pairs
```

## Generation Evaluation

Use LLM judges for qualities code can't measure:

| Eval | Question |
| ---- | -------- |
| **Faithfulness** | Are all claims supported by retrieved context? |
| **Relevance** | Does the answer address the question? |
| **Completeness** | Does the answer cover key points from the context? |

```python
from phoenix.evals import ClassificationEvaluator, LLM

FAITHFULNESS_TEMPLATE = """Given the context and answer, is every claim in the answer supported by the context?

<context>{{context}}</context>
<answer>{{output}}</answer>

"faithful" = ALL claims supported by context
"unfaithful" = ANY claim NOT in context

Answer (faithful/unfaithful):"""

faithfulness = ClassificationEvaluator(
    name="faithfulness",
    prompt_template=FAITHFULNESS_TEMPLATE,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"unfaithful": 0, "faithful": 1},
)
```

## RAG Failure Taxonomy

Common failure modes to evaluate:

```yaml
retrieval_failures:
  - no_relevant_docs: Query returns unrelated content
  - partial_retrieval: Some relevant docs missed
  - wrong_chunk: Right doc, wrong section

generation_failures:
  - hallucination: Claims not in retrieved context
  - ignored_context: Answer doesn't use retrieved docs
  - incomplete: Missing key information from context
  - wrong_synthesis: Misinterprets or miscombines sources
```

## Evaluation Order

1. **Retrieval first** - If the docs are wrong, generation will fail
2. **Faithfulness** - Is the answer grounded in the context?
3. **Answer quality** - Does the answer address the question?

Fix retrieval problems before debugging generation.
@@ -0,0 +1,133 @@
# Experiments: Datasets in Python

Creating and managing evaluation datasets.

## Creating Datasets

```python
from phoenix.client import Client

client = Client()

# From examples
dataset = client.datasets.create_dataset(
    name="qa-test-v1",
    examples=[
        {
            "input": {"question": "What is 2+2?"},
            "output": {"answer": "4"},
            "metadata": {"category": "math"},
        },
    ],
)

# From a DataFrame
dataset = client.datasets.create_dataset(
    dataframe=df,
    name="qa-test-v1",
    input_keys=["question"],
    output_keys=["answer"],
    metadata_keys=["category"],
)
```

## From Production Traces

```python
spans_df = client.spans.get_spans_dataframe(project_identifier="my-app")

dataset = client.datasets.create_dataset(
    dataframe=spans_df[["input.value", "output.value"]],
    name="production-sample-v1",
    input_keys=["input.value"],
    output_keys=["output.value"],
)
```

## Retrieving Datasets

```python
dataset = client.datasets.get_dataset(name="qa-test-v1")
df = dataset.to_dataframe()
```

## Key Parameters

| Parameter | Description |
| --------- | ----------- |
| `input_keys` | Columns for task input |
| `output_keys` | Columns for expected output |
| `metadata_keys` | Additional context |

## Using Evaluators in Experiments

### Evaluators as experiment evaluators

Pass phoenix-evals evaluators directly to `run_experiment` as the `evaluators` argument:

```python
from functools import partial
from phoenix.client import AsyncClient
from phoenix.evals import ClassificationEvaluator, LLM, bind_evaluator

# Define an LLM evaluator
refusal = ClassificationEvaluator(
    name="refusal",
    prompt_template="Is this a refusal?\nQuestion: {{query}}\nResponse: {{response}}",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"refusal": 0, "answer": 1},
)

# Bind to map dataset columns to evaluator params
refusal_evaluator = bind_evaluator(refusal, {"query": "input.query", "response": "output"})

# Define the experiment task
async def run_rag_task(input, rag_engine):
    return rag_engine.query(input["query"])

# Run the experiment with the evaluator
experiment = await AsyncClient().experiments.run_experiment(
    dataset=ds,
    task=partial(run_rag_task, rag_engine=query_engine),
    experiment_name="baseline",
    evaluators=[refusal_evaluator],
    concurrency=10,
)
```

### Evaluators as the task (meta-evaluation)

Use an LLM evaluator as the experiment **task** to test the evaluator itself against human annotations:

```python
from phoenix.evals import create_evaluator

# The evaluator IS the task being tested
def run_refusal_eval(input, evaluator):
    result = evaluator.evaluate(input)
    return result[0]

# A simple heuristic checks judge vs. human agreement
@create_evaluator(name="exact_match")
def exact_match(output, expected):
    return float(output["score"]) == float(expected["refusal_score"])

# Run: the evaluator is the task, exact_match evaluates it
experiment = await AsyncClient().experiments.run_experiment(
    dataset=annotated_dataset,
    task=partial(run_refusal_eval, evaluator=refusal),
    experiment_name="judge-v1",
    evaluators=[exact_match],
    concurrency=10,
)
```

This pattern lets you iterate on evaluator prompts until they align with human judgments. See `tutorials/evals/evals-2/evals_2.0_rag_demo.ipynb` for a full worked example.

## Best Practices

- **Versioning**: Create new datasets (e.g., `qa-test-v2`), don't modify existing ones
- **Metadata**: Track source, category, difficulty
- **Balance**: Ensure diverse coverage across categories
@@ -0,0 +1,69 @@
# Experiments: Datasets in TypeScript

Creating and managing evaluation datasets.

## Creating Datasets

```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createDataset } from "@arizeai/phoenix-client/datasets";

const client = createClient();

const { datasetId } = await createDataset({
  client,
  name: "qa-test-v1",
  examples: [
    {
      input: { question: "What is 2+2?" },
      output: { answer: "4" },
      metadata: { category: "math" },
    },
  ],
});
```

## Example Structure

```typescript
interface DatasetExample {
  input: Record<string, unknown>;     // Task input
  output?: Record<string, unknown>;   // Expected output
  metadata?: Record<string, unknown>; // Additional context
}
```

## From Production Traces

```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";

const { spans } = await getSpans({
  project: { projectName: "my-app" },
  parentId: null, // root spans only
  limit: 100,
});

const examples = spans.map((span) => ({
  input: { query: span.attributes?.["input.value"] },
  output: { response: span.attributes?.["output.value"] },
  metadata: { spanId: span.context.span_id },
}));

await createDataset({ client, name: "production-sample", examples });
```

## Retrieving Datasets

```typescript
import { getDataset, listDatasets } from "@arizeai/phoenix-client/datasets";

const dataset = await getDataset({ client, datasetId: "..." });
const all = await listDatasets({ client });
```

## Best Practices

- **Versioning**: Create new datasets, don't modify existing ones
- **Metadata**: Track source, category, provenance
- **Type safety**: Use TypeScript interfaces for structure
@@ -0,0 +1,50 @@
# Experiments: Overview

Systematic testing of AI systems with datasets, tasks, and evaluators.

## Structure

```
DATASET    → Examples: {input, expected_output, metadata}
TASK       → function(input) → output
EVALUATORS → (input, output, expected) → score
EXPERIMENT → Run task on all examples, score results
```

## Basic Usage

```python
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=my_dataset,
    task=my_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)

print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}
```

## Workflow

1. **Create dataset** - From traces, synthetic data, or manual curation
2. **Define task** - The function to test (your LLM pipeline)
3. **Select evaluators** - Code and/or LLM-based
4. **Run experiment** - Execute and score
5. **Analyze & iterate** - Review, modify the task, re-run

## Dry Runs

Test your setup before a full execution:

```python
experiment = run_experiment(dataset, task, evaluators, dry_run=3)  # Just 3 examples
```

## Best Practices

- **Name meaningfully**: `"improved-retrieval-v2-2024-01-15"`, not `"test"`
- **Version datasets**: Don't modify existing ones
- **Multiple evaluators**: Combine perspectives
@@ -0,0 +1,78 @@
# Experiments: Running Experiments in Python

Execute experiments with `run_experiment`.

## Basic Usage

```python
from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()
dataset = client.datasets.get_dataset(name="qa-test-v1")

def my_task(example):
    return call_llm(example.input["question"])

def exact_match(output, expected):
    return 1.0 if output.strip().lower() == expected["answer"].strip().lower() else 0.0

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[exact_match],
    experiment_name="qa-experiment-v1",
)
```

## Task Functions

```python
# Basic task
def task(example):
    return call_llm(example.input["question"])

# With context (RAG)
def rag_task(example):
    return call_llm(f"Context: {example.input['context']}\nQ: {example.input['question']}")
```

## Evaluator Parameters

| Parameter | Access |
| --------- | ------ |
| `output` | Task output |
| `expected` | Example expected output |
| `input` | Example input |
| `metadata` | Example metadata |
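Evaluators declare only the parameters they need, and the runner matches them by name. A hypothetical evaluator using two of the parameters above (`within_length` and the `max_length` metadata key are illustrative, not part of phoenix):

```python
def within_length(output, metadata):
    # Pass if the output stays under a per-example length budget
    max_len = metadata.get("max_length", 500)
    return 1.0 if len(str(output)) <= max_len else 0.0
```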
## Options

```python
experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=evaluators,
    experiment_name="my-experiment",
    dry_run=3,      # Test with 3 examples
    repetitions=3,  # Run each example 3 times
)
```

## Results

```python
print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}

for run in experiment.runs:
    print(run.output, run.scores)
```

## Add Evaluations Later

```python
from phoenix.client.experiments import evaluate_experiment

evaluate_experiment(experiment=experiment, evaluators=[new_evaluator])
```
@@ -0,0 +1,82 @@
# Experiments: Running Experiments in TypeScript

Execute experiments with `runExperiment`.

## Basic Usage

```typescript
import { createClient } from "@arizeai/phoenix-client";
import {
  runExperiment,
  asExperimentEvaluator,
} from "@arizeai/phoenix-client/experiments";

const client = createClient();

const task = async (example: { input: Record<string, unknown> }) => {
  return await callLLM(example.input.question as string);
};

const exactMatch = asExperimentEvaluator({
  name: "exact_match",
  kind: "CODE",
  evaluate: async ({ output, expected }) => ({
    score: output === expected?.answer ? 1.0 : 0.0,
    label: output === expected?.answer ? "match" : "no_match",
  }),
});

const experiment = await runExperiment({
  client,
  experimentName: "qa-experiment-v1",
  dataset: { datasetId: "your-dataset-id" },
  task,
  evaluators: [exactMatch],
});
```

## Task Functions

```typescript
// Basic task
const task = async (example) => await callLLM(example.input.question as string);

// With context (RAG)
const ragTask = async (example) => {
  const prompt = `Context: ${example.input.context}\nQ: ${example.input.question}`;
  return await callLLM(prompt);
};
```

## Evaluator Parameters

```typescript
interface EvaluatorParams {
  input: Record<string, unknown>;
  output: unknown;
  expected: Record<string, unknown>;
  metadata: Record<string, unknown>;
}
```

## Options

```typescript
const experiment = await runExperiment({
  client,
  experimentName: "my-experiment",
  dataset: { datasetName: "qa-test-v1" },
  task,
  evaluators,
  repetitions: 3,    // Run each example 3 times
  maxConcurrency: 5, // Limit concurrent executions
});
```

## Add Evaluations Later

```typescript
import { evaluateExperiment } from "@arizeai/phoenix-client/experiments";

await evaluateExperiment({ client, experiment, evaluators: [newEvaluator] });
```
@@ -0,0 +1,70 @@
# Experiments: Generating Synthetic Test Data

Creating diverse, targeted test data for evaluation.

## Dimension-Based Approach

Define axes of variation, then generate combinations:

```python
dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}
```
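The cross product of these axes can be enumerated with the standard library; a sketch (the sample size is arbitrary):

```python
import random
from itertools import product

dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}

# 3 x 3 x 3 = 27 combinations; sample a subset to keep generation cheap
combos = list(product(*dimensions.values()))
sample = random.sample(combos, k=min(10, len(combos)))
```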
## Two-Step Generation

1. **Generate tuples** (combinations of dimension values)
2. **Convert to natural queries** (separate LLM call per tuple)

```python
# Step 1: Create tuples
tuples = [
    ("billing", "frustrated", "complex"),
    ("shipping", "neutral", "simple"),
]

# Step 2: Convert each tuple to a natural query
def tuple_to_query(t):
    prompt = f"""Generate a realistic customer message:
Issue: {t[0]}, Mood: {t[1]}, Complexity: {t[2]}

Write naturally, include typos if appropriate. Don't be formulaic."""
    return llm(prompt)
```

## Target Failure Modes

Dimensions should target known failures from error analysis:

```python
# From error analysis findings
dimensions = {
    "timezone": ["EST", "PST", "UTC", "ambiguous"],  # Known failure
    "date_format": ["ISO", "US", "EU", "relative"],  # Known failure
}
```

## Quality Control

- **Validate**: Check for placeholder text, minimum length
- **Deduplicate**: Remove near-duplicate queries using embeddings
- **Balance**: Ensure coverage across dimension values
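A minimal validation check mirroring the criteria above (the placeholder regex and threshold are assumptions):

```python
import re

def validate_query(query: str, min_length: int = 20) -> bool:
    # Reject leftover placeholder text like "[issue]" or "<name>", and short queries
    has_placeholder = bool(re.search(r"\[.*?\]|<.*?>", query))
    return len(query) >= min_length and not has_placeholder
```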
## When to Use

| Use Synthetic | Use Real Data |
| ------------- | ------------- |
| Limited production data | Sufficient traces |
| Testing edge cases | Validating actual behavior |
| Pre-launch evals | Post-launch monitoring |

## Sample Sizes

| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Comprehensive eval | 100-500 |
| Per-dimension | 10-20 per combination |
@@ -0,0 +1,86 @@
# Experiments: Generating Synthetic Test Data (TypeScript)

Creating diverse, targeted test data for evaluation.

## Dimension-Based Approach

Define axes of variation, then generate combinations:

```typescript
const dimensions = {
  issueType: ["billing", "technical", "shipping"],
  customerMood: ["frustrated", "neutral", "happy"],
  complexity: ["simple", "moderate", "complex"],
};
```

## Two-Step Generation

1. **Generate tuples** (combinations of dimension values)
2. **Convert to natural queries** (separate LLM call per tuple)

```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Step 1: Create tuples
type Tuple = [string, string, string];
const tuples: Tuple[] = [
  ["billing", "frustrated", "complex"],
  ["shipping", "neutral", "simple"],
];

// Step 2: Convert each tuple to a natural query
async function tupleToQuery(t: Tuple): Promise<string> {
  const { text } = await generateText({
    model: openai("gpt-4o"),
    prompt: `Generate a realistic customer message:
Issue: ${t[0]}, Mood: ${t[1]}, Complexity: ${t[2]}

Write naturally, include typos if appropriate. Don't be formulaic.`,
  });
  return text;
}
```

## Target Failure Modes

Dimensions should target known failures from error analysis:

```typescript
// From error analysis findings
const dimensions = {
  timezone: ["EST", "PST", "UTC", "ambiguous"], // Known failure
  dateFormat: ["ISO", "US", "EU", "relative"],  // Known failure
};
```

## Quality Control

- **Validate**: Check for placeholder text, minimum length
- **Deduplicate**: Remove near-duplicate queries using embeddings
- **Balance**: Ensure coverage across dimension values

```typescript
function validateQuery(query: string): boolean {
  const minLength = 20;
  const hasPlaceholder = /\[.*?\]|<.*?>/.test(query);
  return query.length >= minLength && !hasPlaceholder;
}
```

## When to Use

| Use Synthetic | Use Real Data |
| ------------- | ------------- |
| Limited production data | Sufficient traces |
| Testing edge cases | Validating actual behavior |
| Pre-launch evals | Post-launch monitoring |

## Sample Sizes

| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Comprehensive eval | 100-500 |
| Per-dimension | 10-20 per combination |
@@ -0,0 +1,43 @@
# Anti-Patterns

Common mistakes and fixes.

| Anti-Pattern | Problem | Fix |
| ------------ | ------- | --- |
| Generic metrics | Pre-built scores don't match your failures | Build from error analysis |
| Vibe-based | No quantification | Measure with experiments |
| Ignoring humans | Uncalibrated LLM judges | Validate >80% TPR/TNR |
| Premature automation | Evaluators for imagined problems | Let observed failures drive |
| Saturation blindness | 100% pass = no signal | Keep capability evals at 50-80% |
| Similarity metrics | BERTScore/ROUGE for generation | Use for retrieval only |
| Model switching | Hoping a model works better | Error analysis first |

## Quantify Changes

```python
baseline = run_experiment(dataset, old_prompt, evaluators)
improved = run_experiment(dataset, new_prompt, evaluators)
print(f"Improvement: {improved.pass_rate - baseline.pass_rate:+.1%}")
```

## Don't Use Similarity for Generation

```python
# BAD
score = bertscore(output, reference)

# GOOD
correct_facts = check_facts_against_source(output, context)
```

## Error Analysis Before Model Change

```python
# BAD
for model in models:
    results = test(model)

# GOOD
failures = analyze_errors(results)
# Then decide if a model change is warranted
```
@@ -0,0 +1,58 @@
# Model Selection

Error analysis first, model changes last.

## Decision Tree

```
Performance Issue?
        │
        ▼
Error analysis suggests model problem?
  NO  → Fix prompts, retrieval, tools
  YES → Is it a capability gap?
    YES → Consider model change
    NO  → Fix the actual problem
```

## Judge Model Selection

| Principle | Action |
| --------- | ------ |
| Start capable | Use gpt-4o first |
| Optimize later | Test cheaper after criteria stable |
| Same model OK | Judge does different task |

```python
# Start with a capable model
judge = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o"),
    ...
)

# After validation, test a cheaper one
judge_cheap = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o-mini"),
    ...
)
# Compare TPR/TNR on the same test set
```

## Don't Model Shop

```python
# BAD
for model in ["gpt-4o", "claude-3", "gemini-pro"]:
    results = run_experiment(dataset, task, model)

# GOOD
failures = analyze_errors(results)
# "Ignores context" → Fix prompt
# "Can't do math"   → Maybe try better model
```

## When Model Change Is Warranted

- Failures persist after prompt optimization
- Capability gaps (reasoning, math, code)
- Error analysis confirms model limitation
@@ -0,0 +1,76 @@

# Fundamentals

Application-specific tests for AI systems. Code first, LLM for nuance, human for truth.

## Evaluator Types

| Type | Speed | Cost | Use Case |
| ---- | ----- | ---- | -------- |
| **Code** | Fast | Cheap | Regex, JSON, format, exact match |
| **LLM** | Medium | Medium | Subjective quality, complex criteria |
| **Human** | Slow | Expensive | Ground truth, calibration |

**Decision:** Code first → LLM only when code can't capture criteria → Human for calibration.

## Score Structure

| Property | Required | Description |
| -------- | -------- | ----------- |
| `name` | Yes | Evaluator name |
| `kind` | Yes | `"code"`, `"llm"`, `"human"` |
| `score` | No* | 0-1 numeric |
| `label` | No* | `"pass"`, `"fail"` |
| `explanation` | No | Rationale |

*One of `score` or `label` required.
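The score structure above can be sketched as a plain record. This is a minimal illustration of the table's fields, not a specific Phoenix class; the field names simply mirror the table:

```python
# Minimal sketch of the score structure from the table above.
# Not a Phoenix class; field names mirror the table.
def make_score(name, kind, score=None, label=None, explanation=None):
    if score is None and label is None:
        raise ValueError("one of score or label is required")
    return {
        "name": name,                # evaluator name
        "kind": kind,                # "code", "llm", or "human"
        "score": score,              # 0-1 numeric (optional)
        "label": label,              # e.g. "pass" / "fail" (optional)
        "explanation": explanation,  # rationale (optional)
    }

record = make_score("has_citation", "code", label="pass")
```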
## Binary > Likert

Use pass/fail, not 1-5 scales. Clearer criteria, easier calibration.

```python
# Multiple binary checks instead of one Likert scale
evaluators = [
    AnswersQuestion(),   # Yes/No
    UsesContext(),       # Yes/No
    NoHallucination(),   # Yes/No
]
```

## Quick Patterns

### Code Evaluator

```python
import re

from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    return bool(re.search(r'\[\d+\]', output))
```
### LLM Evaluator

```python
from phoenix.evals import ClassificationEvaluator, LLM

evaluator = ClassificationEvaluator(
    name="helpfulness",
    prompt_template="...",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"not_helpful": 0, "helpful": 1},
)
```

### Run Experiment

```python
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[evaluator1, evaluator2],
)
print(experiment.aggregate_scores)
```
@@ -0,0 +1,101 @@

# Observe: Sampling Strategies

How to efficiently sample production traces for review.

## Strategies

### 1. Failure-Focused (Highest Priority)

```python
errors = spans_df[spans_df["status_code"] == "ERROR"]
negative_feedback = spans_df[spans_df["feedback"] == "negative"]
```

### 2. Outliers

```python
long_responses = spans_df.nlargest(50, "response_length")
slow_responses = spans_df.nlargest(50, "latency_ms")
```

### 3. Stratified (Coverage)

```python
# Sample equally from each category
by_query_type = spans_df.groupby("metadata.query_type").apply(
    lambda x: x.sample(min(len(x), 20))
)
```

### 4. Metric-Guided

```python
# Review traces flagged by automated evaluators
flagged = spans_df[eval_results["label"] == "hallucinated"]
borderline = spans_df[(eval_results["score"] > 0.3) & (eval_results["score"] < 0.7)]
```
## Building a Review Queue

```python
import pandas as pd

def build_review_queue(spans_df, max_traces=100):
    queue = pd.concat([
        spans_df[spans_df["status_code"] == "ERROR"],
        spans_df[spans_df["feedback"] == "negative"],
        spans_df.nlargest(10, "response_length"),
        spans_df.sample(min(30, len(spans_df))),
    ]).drop_duplicates("span_id").head(max_traces)
    return queue
```

## Sample Size Guidelines

| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Error analysis | 100+ (until saturation) |
| Golden dataset | 100-500 |
| Judge calibration | 100+ per class |

**Saturation:** Stop when new traces show the same failure patterns.
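The saturation rule can be made concrete: track which failure categories each review batch surfaces, and stop once a batch adds nothing new. A minimal sketch, with illustrative category names:

```python
def is_saturated(batches):
    """True when the latest batch adds no failure category not seen before."""
    seen = set()
    for batch in batches[:-1]:
        seen.update(batch)
    return set(batches[-1]) <= seen

batches = [
    {"hallucination", "tone_mismatch"},  # batch 1: two new categories
    {"hallucination", "wrong_context"},  # batch 2: one new category
    {"tone_mismatch", "hallucination"},  # batch 3: nothing new, stop sampling
]
```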
## Trace-Level Sampling

When you need whole requests (all spans per trace), use `get_traces`:

```python
from phoenix.client import Client
from datetime import datetime, timedelta

client = Client()

# Recent traces with full span trees
traces = client.traces.get_traces(
    project_identifier="my-app",
    limit=100,
    include_spans=True,
)

# Time-windowed sampling (e.g., last hour)
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    limit=50,
    include_spans=True,
)

# Filter by session (multi-turn conversations)
traces = client.traces.get_traces(
    project_identifier="my-app",
    session_id="user-session-abc",
    include_spans=True,
)

# Sort by latency to find the slowest requests
traces = client.traces.get_traces(
    project_identifier="my-app",
    sort="latency_ms",
    order="desc",
    limit=50,
)
```
@@ -0,0 +1,147 @@

# Observe: Sampling Strategies (TypeScript)

How to efficiently sample production traces for review.

## Strategies

### 1. Failure-Focused (Highest Priority)

Use server-side filters to fetch only what you need:

```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";

// Server-side filter — only ERROR spans are returned
const { spans: errors } = await getSpans({
  project: { projectName: "my-project" },
  statusCode: "ERROR",
  limit: 100,
});

// Fetch only LLM spans
const { spans: llmSpans } = await getSpans({
  project: { projectName: "my-project" },
  spanKind: "LLM",
  limit: 100,
});

// Filter by span name
const { spans: chatSpans } = await getSpans({
  project: { projectName: "my-project" },
  name: "chat_completion",
  limit: 100,
});
```
### 2. Outliers

```typescript
const { spans } = await getSpans({
  project: { projectName: "my-project" },
  limit: 200,
});
const latency = (s: (typeof spans)[number]) =>
  new Date(s.end_time).getTime() - new Date(s.start_time).getTime();
const sorted = [...spans].sort((a, b) => latency(b) - latency(a));
const slowResponses = sorted.slice(0, 50);
```

### 3. Stratified (Coverage)

```typescript
// Sample equally from each category
function stratifiedSample<T>(items: T[], groupBy: (item: T) => string, perGroup: number): T[] {
  const groups = new Map<string, T[]>();
  for (const item of items) {
    const key = groupBy(item);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key)!.push(item);
  }
  return [...groups.values()].flatMap((g) => g.slice(0, perGroup));
}

const { spans } = await getSpans({
  project: { projectName: "my-project" },
  limit: 500,
});
const byQueryType = stratifiedSample(spans, (s) => s.attributes?.["metadata.query_type"] ?? "unknown", 20);
```
### 4. Metric-Guided

```typescript
import { getSpanAnnotations } from "@arizeai/phoenix-client/spans";

// Fetch annotations for your spans, then filter by label
const { annotations } = await getSpanAnnotations({
  project: { projectName: "my-project" },
  spanIds: spans.map((s) => s.context.span_id),
  includeAnnotationNames: ["hallucination"],
});

const flaggedSpanIds = new Set(
  annotations.filter((a) => a.result?.label === "hallucinated").map((a) => a.span_id)
);
const flagged = spans.filter((s) => flaggedSpanIds.has(s.context.span_id));
```
## Trace-Level Sampling

When you need whole requests (all spans in a trace), use `getTraces`:

```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";

// Recent traces with full span trees
const { traces } = await getTraces({
  project: { projectName: "my-project" },
  limit: 100,
  includeSpans: true,
});

// Filter by session (e.g., multi-turn conversations)
const { traces: sessionTraces } = await getTraces({
  project: { projectName: "my-project" },
  sessionId: "user-session-abc",
  includeSpans: true,
});

// Time-windowed sampling
const { traces: recentTraces } = await getTraces({
  project: { projectName: "my-project" },
  startTime: new Date(Date.now() - 60 * 60 * 1000), // last hour
  limit: 50,
  includeSpans: true,
});
```
## Building a Review Queue

```typescript
// Combine server-side filters into a review queue
const { spans: errorSpans } = await getSpans({
  project: { projectName: "my-project" },
  statusCode: "ERROR",
  limit: 30,
});
const { spans: allSpans } = await getSpans({
  project: { projectName: "my-project" },
  limit: 100,
});
// Copy before sorting so allSpans is not mutated (a rough shuffle is fine here)
const random = [...allSpans].sort(() => Math.random() - 0.5).slice(0, 30);

const combined = [...errorSpans, ...random];
const unique = [...new Map(combined.map((s) => [s.context.span_id, s])).values()];
const reviewQueue = unique.slice(0, 100);
```
## Sample Size Guidelines

| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Error analysis | 100+ (until saturation) |
| Golden dataset | 100-500 |
| Judge calibration | 100+ per class |

**Saturation:** Stop when new traces show the same failure patterns.
@@ -0,0 +1,144 @@

# Observe: Tracing Setup

Configure tracing to capture data for evaluation.

## Quick Setup

```python
# Python
from phoenix.otel import register

register(project_name="my-app", auto_instrument=True)
```

```typescript
// TypeScript
import { registerPhoenix } from "@arizeai/phoenix-otel";

registerPhoenix({ projectName: "my-app", autoInstrument: true });
```

## Essential Attributes

| Attribute | Why It Matters |
| --------- | -------------- |
| `input.value` | User's request |
| `output.value` | Response to evaluate |
| `retrieval.documents` | Context for faithfulness |
| `tool.name`, `tool.parameters` | Agent evaluation |
| `llm.model_name` | Track by model |
## Custom Attributes for Evals

```python
span.set_attribute("metadata.client_type", "enterprise")
span.set_attribute("metadata.query_category", "billing")
```

## Exporting for Evaluation

### Spans (Python — DataFrame)

```python
from phoenix.client import Client

# Client() works for local Phoenix (falls back to env vars or localhost:6006)
# For remote/cloud: Client(base_url="https://app.phoenix.arize.com", api_key="...")
client = Client()
spans_df = client.spans.get_spans_dataframe(
    project_identifier="my-app",  # NOT project_name= (deprecated)
    root_spans_only=True,
)

dataset = client.datasets.create_dataset(
    name="error-analysis-set",
    dataframe=spans_df[["input.value", "output.value"]],
    input_keys=["input.value"],
    output_keys=["output.value"],
)
```
### Spans (TypeScript)

```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";

const { spans } = await getSpans({
  project: { projectName: "my-app" },
  parentId: null, // root spans only
  limit: 100,
});
```

### Traces (Python — structured)

Use `get_traces` when you need full trace trees (e.g., multi-turn conversations, agent workflows):

```python
from datetime import datetime, timedelta

traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=24),
    include_spans=True,  # includes all spans per trace
    limit=100,
)
# Each trace has: trace_id, start_time, end_time, spans (when include_spans=True)
```

### Traces (TypeScript)

```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";

const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});
```
## Uploading Evaluations as Annotations

### Python

```python
from phoenix.evals import evaluate_dataframe
from phoenix.evals.utils import to_annotation_dataframe

# Run evaluations
results_df = evaluate_dataframe(dataframe=spans_df, evaluators=[my_eval])

# Format results for Phoenix annotations
annotations_df = to_annotation_dataframe(results_df)

# Upload to Phoenix
client.spans.log_span_annotations_dataframe(dataframe=annotations_df)
```

### TypeScript

```typescript
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";

await logSpanAnnotations({
  spanAnnotations: [
    {
      spanId: "abc123",
      name: "quality",
      label: "good",
      score: 0.95,
      annotatorKind: "LLM",
    },
  ],
});
```

Annotations are visible in the Phoenix UI alongside your traces.

## Verify

- Required attributes: `input.value`, `output.value`, `status_code`
- For RAG: `retrieval.documents`
- For agents: `tool.name`, `tool.parameters`
@@ -0,0 +1,137 @@

# Production: Continuous Evaluation

Capability vs regression evals and the ongoing feedback loop.

## Two Types of Evals

| Type | Pass Rate Target | Purpose | Update |
| ---- | ---------------- | ------- | ------ |
| **Capability** | 50-80% | Measure improvement | Add harder cases |
| **Regression** | 95-100% | Catch breakage | Add fixed bugs |

## Saturation

When capability evals hit >95% pass rate, they're saturated:

1. Graduate passing cases to the regression suite
2. Add new challenging cases to the capability suite
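Graduation can be automated with a simple rule, for example moving a case after three consecutive passes. A sketch; the streak rule is a design choice, not a Phoenix feature:

```python
def graduate(capability_suite, pass_history, streak=3):
    """Split capability cases: those passing `streak` runs in a row move to regression."""
    graduated = [case for case in capability_suite
                 if pass_history.get(case, [])[-streak:] == [True] * streak]
    remaining = [case for case in capability_suite if case not in graduated]
    return graduated, remaining

history = {"easy_case": [True, True, True], "hard_case": [False, True, False]}
grads, keep = graduate(["easy_case", "hard_case"], history)
# grads == ["easy_case"], keep == ["hard_case"]
```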
## Feedback Loop

```
Production → Sample traffic → Run evaluators → Find failures
    ↑                                               ↓
 Deploy ← Run CI evals ← Create test cases ← Error analysis
```
## Implementation

Build a continuous monitoring loop:

1. **Sample recent traces** at regular intervals (e.g., 100 traces per hour)
2. **Run evaluators** on sampled traces
3. **Log results** to Phoenix for tracking
4. **Queue concerning results** for human review
5. **Create test cases** from recurring failure patterns

### Python

```python
from phoenix.client import Client
from datetime import datetime, timedelta

client = Client()

# 1. Sample recent spans (includes full attributes for evaluation)
spans_df = client.spans.get_spans_dataframe(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    root_spans_only=True,
    limit=100,
)

# 2. Run evaluators
from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(
    dataframe=spans_df,
    evaluators=[quality_eval, safety_eval],
)

# 3. Upload results as annotations
from phoenix.evals.utils import to_annotation_dataframe

annotations_df = to_annotation_dataframe(results_df)
client.spans.log_span_annotations_dataframe(dataframe=annotations_df)
```
### TypeScript

```typescript
import { getSpans, logSpanAnnotations } from "@arizeai/phoenix-client/spans";

// 1. Sample recent spans
const { spans } = await getSpans({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  parentId: null, // root spans only
  limit: 100,
});

// 2. Run evaluators (runEvaluators is user-defined)
const results = await Promise.all(
  spans.map(async (span) => ({
    spanId: span.context.span_id,
    ...(await runEvaluators(span, [qualityEval, safetyEval])),
  }))
);

// 3. Upload results as annotations
await logSpanAnnotations({
  spanAnnotations: results.map((r) => ({
    spanId: r.spanId,
    name: "quality",
    score: r.qualityScore,
    label: r.qualityLabel,
    annotatorKind: "LLM" as const,
  })),
});
```
For trace-level monitoring (e.g., agent workflows), use `get_traces`/`getTraces` to identify traces:

```python
# Python: identify slow traces
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    sort="latency_ms",
    order="desc",
    limit=50,
)
```

```typescript
// TypeScript: identify slow traces
import { getTraces } from "@arizeai/phoenix-client/traces";

const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  limit: 50,
});
```

## Alerting

| Condition | Severity | Action |
| --------- | -------- | ------ |
| Regression < 98% | Critical | Page oncall |
| Capability declining | Warning | Slack notify |
| Capability > 95% for 7d | Info | Schedule review |
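The alert conditions above can be sketched as a small routing rule. Thresholds come from the table; trend detection for "capability declining" is omitted to keep the sketch short:

```python
def route_alert(suite, pass_rate, days_above_95=0):
    """Map eval pass rates to a severity per the table above."""
    if suite == "regression" and pass_rate < 0.98:
        return "critical"  # page oncall
    if suite == "capability" and pass_rate > 0.95 and days_above_95 >= 7:
        return "info"      # schedule review: the suite is saturating
    return None

severity = route_alert("regression", 0.97)
# severity == "critical"
```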
## Key Principles

- **Two suites** - Capability + Regression always
- **Graduate cases** - Move consistent passes to regression
- **Track trends** - Monitor over time, not just snapshots
@@ -0,0 +1,53 @@

# Production: Guardrails vs Evaluators

Guardrails block in real-time. Evaluators measure asynchronously.

## Key Distinction

```
Request → [INPUT GUARDRAIL] → LLM → [OUTPUT GUARDRAIL] → Response
                                             │
                                             └──→ ASYNC EVALUATOR (background)
```

## Guardrails

| Aspect | Requirement |
| ------ | ----------- |
| Timing | Synchronous, blocking |
| Latency | < 100ms |
| Purpose | Prevent harm |
| Type | Code-based (deterministic) |

**Use for:** PII detection, prompt injection, profanity, length limits, format validation.
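A code-based output guardrail of this kind is a fast deterministic check that blocks rather than scores. A minimal sketch; the regex and length limit are illustrative, not a library API:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # crude PII check
MAX_LEN = 4000

def output_guardrail(text: str) -> tuple[bool, str]:
    """Return (allowed, reason); runs in microseconds, safe to call inline."""
    if EMAIL.search(text):
        return False, "possible PII (email address)"
    if len(text) > MAX_LEN:
        return False, "length limit exceeded"
    return True, "ok"

allowed, reason = output_guardrail("Contact me at jane@example.com")
# allowed == False: the email address trips the PII check
```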
## Evaluators

| Aspect | Characteristic |
| ------ | -------------- |
| Timing | Async, background |
| Latency | Can be seconds |
| Purpose | Measure quality |
| Type | Can use LLMs |

**Use for:** Helpfulness, faithfulness, tone, completeness, citation accuracy.

## Decision

| Question | Answer |
| -------- | ------ |
| Must block harmful content? | Guardrail |
| Measuring quality? | Evaluator |
| Need LLM judgment? | Evaluator |
| < 100ms required? | Guardrail |
| False positives = angry users? | Evaluator |

## LLM Guardrails: Rarely

Only use LLM guardrails if:

- Latency budget > 1s
- Error cost >> LLM cost
- Low volume
- Fallback exists

**Key Principle:** Guardrails prevent harm (block). Evaluators measure quality (log).
@@ -0,0 +1,92 @@

# Production: Overview

CI/CD evals vs production monitoring - complementary approaches.

## Two Evaluation Modes

| Aspect | CI/CD Evals | Production Monitoring |
| ------ | ----------- | -------------------- |
| **When** | Pre-deployment | Post-deployment, ongoing |
| **Data** | Fixed dataset | Sampled traffic |
| **Goal** | Prevent regression | Detect drift |
| **Response** | Block deploy | Alert & analyze |

## CI/CD Evaluations

```python
# Fast, deterministic checks
ci_evaluators = [
    has_required_format,
    no_pii_leak,
    safety_check,
    regression_test_suite,
]

# Small but representative dataset (~100 examples)
run_experiment(ci_dataset, task, ci_evaluators)
```

Set thresholds: regression=0.95, safety=1.0, format=0.98.
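Those thresholds become a deploy gate in CI. A sketch under stated assumptions: the suite names and the shape of the aggregate-score dict are illustrative, not a Phoenix API:

```python
THRESHOLDS = {"regression": 0.95, "safety": 1.0, "format": 0.98}

def ci_gate(aggregate_scores):
    """Return True if the deploy may proceed; False blocks it."""
    failures = {name: score for name, score in aggregate_scores.items()
                if score < THRESHOLDS.get(name, 0.0)}
    if failures:
        print(f"BLOCK DEPLOY: {failures}")
        return False
    return True

ci_gate({"regression": 0.96, "safety": 1.0, "format": 0.99})  # deploy proceeds
```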
## Production Monitoring

### Python

```python
from phoenix.client import Client
from datetime import datetime, timedelta

client = Client()

# Sample recent traces (last hour)
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    include_spans=True,
    limit=100,
)

# Run evaluators on sampled traffic
# (run_evaluators_async and alert_on_failure are user-defined)
for trace in traces:
    results = run_evaluators_async(trace, production_evaluators)
    if any(r["score"] < 0.5 for r in results):
        alert_on_failure(trace, results)
```
### TypeScript

```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
import { getSpans } from "@arizeai/phoenix-client/spans";

// Sample recent traces (last hour)
const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});

// Or sample spans directly for evaluation
const { spans } = await getSpans({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  limit: 100,
});

// Run evaluators on sampled traffic
for (const span of spans) {
  const results = await runEvaluators(span, productionEvaluators);
  if (results.some((r) => r.score < 0.5)) {
    await alertOnFailure(span, results);
  }
}
```

Prioritize: errors → negative feedback → random sample.

## Feedback Loop

```
Production finds failure → Error analysis → Add to CI dataset → Prevents future regression
```
@@ -0,0 +1,64 @@

# Setup: Python

Packages required for Phoenix evals and experiments.

## Installation

```bash
# Core Phoenix package (includes client, evals, otel)
pip install arize-phoenix

# Or install individual packages
pip install arize-phoenix-client  # Phoenix client only
pip install arize-phoenix-evals   # Evaluation utilities
pip install arize-phoenix-otel    # OpenTelemetry integration
```

## LLM Providers

For LLM-as-judge evaluators, install your provider's SDK:

```bash
pip install openai               # OpenAI
pip install anthropic            # Anthropic
pip install google-generativeai  # Google
```

## Validation (Optional)

```bash
pip install scikit-learn  # For TPR/TNR metrics
```
## Quick Verify

```python
from phoenix.client import Client
from phoenix.evals import LLM, ClassificationEvaluator
from phoenix.otel import register

# All imports should work
print("Phoenix Python setup complete")
```

## Key Imports (Evals 2.0)

```python
from phoenix.client import Client
from phoenix.evals import (
    ClassificationEvaluator,   # LLM classification evaluator (preferred)
    LLM,                       # Provider-agnostic LLM wrapper
    async_evaluate_dataframe,  # Batch evaluate a DataFrame (preferred, async)
    evaluate_dataframe,        # Batch evaluate a DataFrame (sync)
    create_evaluator,          # Decorator for code-based evaluators
    create_classifier,         # Factory for LLM classification evaluators
    bind_evaluator,            # Map column names to evaluator params
    Score,                     # Score dataclass
)
from phoenix.evals.utils import to_annotation_dataframe  # Format results for Phoenix annotations
```

**Prefer**: `ClassificationEvaluator` over `create_classifier` (more parameters/customization).
**Prefer**: `async_evaluate_dataframe` over `evaluate_dataframe` (better throughput for LLM evals).

**Do NOT use** legacy 1.0 imports: `OpenAIModel`, `AnthropicModel`, `run_evals`, `llm_classify`.
@@ -0,0 +1,41 @@

# Setup: TypeScript

Packages required for Phoenix evals and experiments.

## Installation

```bash
# Using npm
npm install @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel

# Using pnpm
pnpm add @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel
```

## LLM Providers

For LLM-as-judge evaluators, install Vercel AI SDK providers:

```bash
npm install ai @ai-sdk/openai  # Vercel AI SDK + OpenAI
npm install @ai-sdk/anthropic  # Anthropic
npm install @ai-sdk/google     # Google
```

Or use direct provider SDKs:

```bash
npm install openai             # OpenAI direct
npm install @anthropic-ai/sdk  # Anthropic direct
```

## Quick Verify

```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { registerPhoenix } from "@arizeai/phoenix-otel";

// All imports should work
console.log("Phoenix TypeScript setup complete");
```
@@ -0,0 +1,43 @@

# Validating Evaluators (Python)

Validate LLM evaluators against human-labeled examples. Target >80% TPR/TNR/Accuracy.

## Calculate Metrics

```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(human_labels, evaluator_predictions))

cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
print(f"TPR: {tpr:.2f}, TNR: {tnr:.2f}")
```
## Correct Production Estimates

```python
def correct_estimate(observed, tpr, tnr):
    """Adjust observed pass rate using known TPR/TNR."""
    return (observed - (1 - tnr)) / (tpr - (1 - tnr))
```
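For a worked example of the correction: with TPR = 0.9 and TNR = 0.8, an observed 75% pass rate implies a true rate of about 78.6%. The function is restated so the example is self-contained:

```python
def correct_estimate(observed, tpr, tnr):
    """Adjust observed pass rate using known TPR/TNR."""
    return (observed - (1 - tnr)) / (tpr - (1 - tnr))

true_rate = correct_estimate(observed=0.75, tpr=0.9, tnr=0.8)
# (0.75 - 0.2) / (0.9 - 0.2) = 0.55 / 0.7 ≈ 0.786
```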
## Find Misclassified

```python
# False Positives: Evaluator pass, human fail
fp_mask = (evaluator_predictions == 1) & (human_labels == 0)
false_positives = dataset[fp_mask]

# False Negatives: Evaluator fail, human pass
fn_mask = (evaluator_predictions == 0) & (human_labels == 1)
false_negatives = dataset[fn_mask]
```

## Red Flags

- TPR or TNR < 70%
- Large gap between TPR and TNR
- Kappa < 0.6
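Cohen's kappa (the last red flag) measures agreement beyond chance. scikit-learn's `cohen_kappa_score` computes it directly; a dependency-free sketch for the binary case:

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary (0/1) label lists."""
    n = len(labels_a)
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    pa1 = sum(labels_a) / n  # rate of label 1 for rater A
    pb1 = sum(labels_b) / n  # rate of label 1 for rater B
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)  # agreement expected by chance
    return (po - pe) / (1 - pe)

human = [1, 1, 0, 0, 1, 0, 1, 0]
judge = [1, 1, 0, 0, 1, 0, 0, 1]  # two disagreements
kappa = cohen_kappa(human, judge)
# kappa == 0.5, below the 0.6 red-flag line
```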
@@ -0,0 +1,106 @@

# Validating Evaluators (TypeScript)

Validate an LLM evaluator against human-labeled examples before deploying it.
Target: **>80% TPR and >80% TNR**.

Roles are inverted compared to a normal task experiment:

| Normal experiment | Evaluator validation |
|---|---|
| Task = agent logic | Task = run the evaluator under test |
| Evaluator = judge output | Evaluator = exact-match vs human ground truth |
| Dataset = agent examples | Dataset = golden hand-labeled examples |

## Golden Dataset

Use a separate dataset name so validation experiments don't mix with task experiments in Phoenix.
Store human ground truth in `metadata.groundTruthLabel`. Aim for ~50/50 balance:

```typescript
import type { Example } from "@arizeai/phoenix-client/types/datasets";

const goldenExamples: Example[] = [
  { input: { q: "Capital of France?" }, output: { answer: "Paris" }, metadata: { groundTruthLabel: "correct" } },
  { input: { q: "Capital of France?" }, output: { answer: "Lyon" }, metadata: { groundTruthLabel: "incorrect" } },
  { input: { q: "Capital of France?" }, output: { answer: "Major city..." }, metadata: { groundTruthLabel: "incorrect" } },
];

const VALIDATOR_DATASET = "my-app-qa-evaluator-validation"; // separate from task dataset
const POSITIVE_LABEL = "correct";
const NEGATIVE_LABEL = "incorrect";
```
## Validation Experiment
|
||||
|
||||
```typescript
|
||||
import { createClient } from "@arizeai/phoenix-client";
|
import { createClient } from "@arizeai/phoenix-client";
import { createOrGetDataset, getDatasetExamples } from "@arizeai/phoenix-client/datasets";
import { asExperimentEvaluator, runExperiment } from "@arizeai/phoenix-client/experiments";
import { myEvaluator } from "./myEvaluator.js";

const client = createClient();

// VALIDATOR_DATASET, goldenExamples, POSITIVE_LABEL, and NEGATIVE_LABEL are defined elsewhere
const { datasetId } = await createOrGetDataset({ client, name: VALIDATOR_DATASET, examples: goldenExamples });
const { examples } = await getDatasetExamples({ client, dataset: { datasetId } });
const groundTruth = new Map(examples.map((ex) => [ex.id, ex.metadata?.groundTruthLabel as string]));

// Task: invoke the evaluator under test
const task = async (example: (typeof examples)[number]) => {
  const result = await myEvaluator.evaluate({ input: example.input, output: example.output, metadata: example.metadata });
  return result.label ?? "unknown";
};

// Evaluator: exact-match against human ground truth
const exactMatch = asExperimentEvaluator({
  name: "exact-match",
  kind: "CODE",
  evaluate: ({ output, metadata }) => {
    const expected = metadata?.groundTruthLabel as string;
    const predicted = typeof output === "string" ? output : "unknown";
    return { score: predicted === expected ? 1 : 0, label: predicted, explanation: `Expected: ${expected}, Got: ${predicted}` };
  },
});

const experiment = await runExperiment({
  client,
  experimentName: `evaluator-validation-${Date.now()}`,
  dataset: { datasetId },
  task,
  evaluators: [exactMatch],
});

// Compute confusion matrix
const runs = Object.values(experiment.runs);
const predicted = new Map(
  (experiment.evaluationRuns ?? [])
    .filter((e) => e.name === "exact-match")
    .map((e) => [e.experimentRunId, e.result?.label ?? null]),
);

let tp = 0, fp = 0, tn = 0, fn = 0;
for (const run of runs) {
  if (run.error) continue;
  const p = predicted.get(run.id), a = groundTruth.get(run.datasetExampleId);
  if (!p || !a) continue;
  if (a === POSITIVE_LABEL && p === POSITIVE_LABEL) tp++;
  else if (a === NEGATIVE_LABEL && p === POSITIVE_LABEL) fp++;
  else if (a === NEGATIVE_LABEL && p === NEGATIVE_LABEL) tn++;
  else if (a === POSITIVE_LABEL && p === NEGATIVE_LABEL) fn++;
}
const total = tp + fp + tn + fn;
const tpr = tp + fn > 0 ? (tp / (tp + fn)) * 100 : 0;
const tnr = tn + fp > 0 ? (tn / (tn + fp)) * 100 : 0;
console.log(`TPR: ${tpr.toFixed(1)}% TNR: ${tnr.toFixed(1)}% Accuracy: ${((tp + tn) / total * 100).toFixed(1)}%`);
```

## Results & Quality Rules

| Metric | Target | Low value means |
|---|---|---|
| TPR (sensitivity) | >80% | Misses real failures (false negatives) |
| TNR (specificity) | >80% | Flags good outputs (false positives) |
| Accuracy | >80% | General weakness |

**Golden dataset rules:** ~50/50 balance · include edge cases · human-labeled only · never mutate (append new versions) · 20–50 examples is enough.

**Re-validate when:** prompt template changes · judge model changes · criteria updated · production FP/FN spike.

## See Also

- `validation.md` — Metric definitions and concepts
- `experiments-running-typescript.md` — `runExperiment` API
- `experiments-datasets-typescript.md` — `createOrGetDataset` / `getDatasetExamples`

@@ -0,0 +1,74 @@
# Validation

Validate LLM judges against human labels before deploying. Target >80% agreement.

## Requirements

| Requirement | Target |
| ----------- | ------ |
| Test set size | 100+ examples |
| Balance | ~50/50 pass/fail |
| Accuracy | >80% |
| TPR/TNR | Both >70% |

## Metrics
|
||||
|
||||
| Metric | Formula | Use When |
|
||||
| ------ | ------- | -------- |
|
||||
| **Accuracy** | (TP+TN) / Total | General |
|
||||
| **TPR (Recall)** | TP / (TP+FN) | Quality assurance |
|
||||
| **TNR (Specificity)** | TN / (TN+FP) | Safety-critical |
|
||||
| **Cohen's Kappa** | Agreement beyond chance | Comparing evaluators |
|
||||
|
||||
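Cohen's Kappa is the one metric in the table without a closed formula, so a small hand computation helps. This is an illustrative sketch with invented counts, not part of any library API:

```python
# Cohen's kappa from a 2x2 confusion matrix: observed agreement
# corrected for the agreement two raters would reach by chance.
def cohens_kappa(tp: int, fp: int, tn: int, fn: int) -> float:
    total = tp + fp + tn + fn
    observed = (tp + tn) / total  # raw agreement
    # Chance agreement: P(both say pass) + P(both say fail),
    # taken from each rater's marginal label frequencies
    chance_pass = ((tp + fn) / total) * ((tp + fp) / total)
    chance_fail = ((tn + fp) / total) * ((tn + fn) / total)
    expected = chance_pass + chance_fail
    return (observed - expected) / (1 - expected)

print(round(cohens_kappa(tp=40, fp=5, tn=45, fn=10), 3))  # → 0.7
```

With 85% raw agreement and balanced labels, kappa lands at 0.7; the same raw agreement against a heavily skewed label split would score far lower, which is why kappa is the right metric for comparing evaluators.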
## Quick Validation
|
||||
|
||||
```python
|
||||
from sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score
|
||||
|
||||
print(classification_report(human_labels, evaluator_predictions))
|
||||
print(f"Kappa: {cohen_kappa_score(human_labels, evaluator_predictions):.3f}")
|
||||
|
||||
# Get TPR/TNR
|
||||
cm = confusion_matrix(human_labels, evaluator_predictions)
|
||||
tn, fp, fn, tp = cm.ravel()
|
||||
tpr = tp / (tp + fn)
|
||||
tnr = tn / (tn + fp)
|
||||
```
|
||||
|
## Golden Dataset Structure

```python
golden_example = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital.",
    "ground_truth_label": "correct",
}
```

## Building Golden Datasets

1. Sample production traces (errors, negative feedback, edge cases)
2. Balance ~50/50 pass/fail
3. Have an expert label each example
4. Version datasets (never modify existing)

```python
# GOOD - create a new version
golden_v2 = golden_v1 + new_examples

# BAD - never modify an existing version
golden_v1.append(new_example)
```

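Steps 1–2 above (sample, then balance) can be sketched as follows. The trace records and the `ground_truth_label` field are hypothetical stand-ins for whatever shape your sampled traces actually take:

```python
import random

def build_golden_dataset(traces: list[dict], per_class: int, seed: int = 42) -> list[dict]:
    """Draw an equal number of pass and fail traces (~50/50 balance)."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    passes = [t for t in traces if t["ground_truth_label"] == "correct"]
    fails = [t for t in traces if t["ground_truth_label"] == "incorrect"]
    return rng.sample(passes, per_class) + rng.sample(fails, per_class)

# Hypothetical sampled traces: 30 passes, 15 fails
traces = (
    [{"input": f"q{i}", "ground_truth_label": "correct"} for i in range(30)]
    + [{"input": f"q{i}", "ground_truth_label": "incorrect"} for i in range(15)]
)
golden = build_golden_dataset(traces, per_class=10)
print(len(golden))  # → 20, ten per label
```

Each returned example would then go to an expert for labeling (step 3) and be saved as a new dataset version (step 4).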
## Warning Signs

- All pass or all fail → too lenient/strict
- Random results → criteria unclear
- TPR/TNR < 70% → needs improvement

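The first and third warning signs are easy to script. A minimal sketch, assuming evaluator outputs are plain label strings and using the 70% thresholds above:

```python
from collections import Counter

def warning_signs(predictions: list[str], tpr: float, tnr: float) -> list[str]:
    """Return human-readable warnings for the failure modes listed above."""
    signs = []
    # Every prediction carries the same label: evaluator never discriminates
    if len(Counter(predictions)) == 1:
        signs.append("all outputs share one label: too lenient or too strict")
    if tpr < 0.70:
        signs.append("TPR below 70%: misses real failures")
    if tnr < 0.70:
        signs.append("TNR below 70%: flags good outputs")
    return signs

print(warning_signs(["pass"] * 10, tpr=0.9, tnr=0.65))
```

"Random results" still needs a human look at the disagreements, since randomness shows up as low kappa rather than a single threshold breach.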
## Re-Validate When

- Prompt template changes
- Judge model changes
- Criteria changes
- Monthly