mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-13 19:55:56 +00:00
Add Arize and Phoenix LLM observability skills (#1204)
* Add 9 Arize LLM observability skills: skills for the Arize AI platform covering trace export, instrumentation, datasets, experiments, evaluators, AI provider integrations, annotations, prompt optimization, and deep linking to the Arize UI.

* Add 3 Phoenix AI observability skills: skills for Phoenix (Arize open-source) covering CLI debugging, LLM evaluation workflows, and OpenInference tracing/instrumentation.

* Ignoring intentional bad spelling

* Fix CI: remove .DS_Store from generated skills README and add codespell ignore. Remove the .DS_Store artifact from the winmd-api-search asset listing in generated README.skills.md so it matches the CI Linux build output. Add "queston" to the codespell ignore list (intentional misspelling example in the arize-dataset skill).

* Add arize-ax and phoenix plugins: bundle the 9 Arize skills into an arize-ax plugin and the 3 Phoenix skills into a phoenix plugin for easier installation as single packages.

* Fix skill folder structures to match source repos: move Arize supporting files from references/ to root level and rename Phoenix references/ to rules/ to exactly match the original source repository folder structures.

* Fixing file locations

* Fixing readme

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
skills/phoenix-evals/SKILL.md (new file, +72 lines)
---
name: phoenix-evals
description: Build and run evaluators for AI/LLM applications using Phoenix.
license: Apache-2.0
compatibility: Requires Phoenix server. Python skills need phoenix and openai packages; TypeScript skills need @arizeai/phoenix-client.
metadata:
  author: oss@arize.com
  version: "1.0.0"
  languages: "Python, TypeScript"
---

# Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

## Quick Reference

| Task | Files |
| ---- | ----- |
| Setup | [setup-python](references/setup-python.md), [setup-typescript](references/setup-typescript.md) |
| Decide what to evaluate | [evaluators-overview](references/evaluators-overview.md) |
| Choose a judge model | [fundamentals-model-selection](references/fundamentals-model-selection.md) |
| Use pre-built evaluators | [evaluators-pre-built](references/evaluators-pre-built.md) |
| Build code evaluator | [evaluators-code-python](references/evaluators-code-python.md), [evaluators-code-typescript](references/evaluators-code-typescript.md) |
| Build LLM evaluator | [evaluators-llm-python](references/evaluators-llm-python.md), [evaluators-llm-typescript](references/evaluators-llm-typescript.md), [evaluators-custom-templates](references/evaluators-custom-templates.md) |
| Batch evaluate DataFrame | [evaluate-dataframe-python](references/evaluate-dataframe-python.md) |
| Run experiment | [experiments-running-python](references/experiments-running-python.md), [experiments-running-typescript](references/experiments-running-typescript.md) |
| Create dataset | [experiments-datasets-python](references/experiments-datasets-python.md), [experiments-datasets-typescript](references/experiments-datasets-typescript.md) |
| Generate synthetic data | [experiments-synthetic-python](references/experiments-synthetic-python.md), [experiments-synthetic-typescript](references/experiments-synthetic-typescript.md) |
| Validate evaluator accuracy | [validation](references/validation.md), [validation-evaluators-python](references/validation-evaluators-python.md), [validation-evaluators-typescript](references/validation-evaluators-typescript.md) |
| Sample traces for review | [observe-sampling-python](references/observe-sampling-python.md), [observe-sampling-typescript](references/observe-sampling-typescript.md) |
| Analyze errors | [error-analysis](references/error-analysis.md), [error-analysis-multi-turn](references/error-analysis-multi-turn.md), [axial-coding](references/axial-coding.md) |
| RAG evals | [evaluators-rag](references/evaluators-rag.md) |
| Avoid common mistakes | [common-mistakes-python](references/common-mistakes-python.md), [fundamentals-anti-patterns](references/fundamentals-anti-patterns.md) |
| Production | [production-overview](references/production-overview.md), [production-guardrails](references/production-guardrails.md), [production-continuous](references/production-continuous.md) |

## Workflows

**Starting Fresh:**
[observe-tracing-setup](references/observe-tracing-setup.md) → [error-analysis](references/error-analysis.md) → [axial-coding](references/axial-coding.md) → [evaluators-overview](references/evaluators-overview.md)

**Building Evaluator:**
[fundamentals](references/fundamentals.md) → [common-mistakes-python](references/common-mistakes-python.md) → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

**RAG Systems:**
[evaluators-rag](references/evaluators-rag.md) → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)

**Production:**
[production-overview](references/production-overview.md) → [production-guardrails](references/production-guardrails.md) → [production-continuous](references/production-continuous.md)

## Reference Categories

| Prefix | Description |
| ------ | ----------- |
| `fundamentals-*` | Types, scores, anti-patterns |
| `observe-*` | Tracing, sampling |
| `error-analysis-*` | Finding failures |
| `axial-coding-*` | Categorizing failures |
| `evaluators-*` | Code, LLM, RAG evaluators |
| `experiments-*` | Datasets, running experiments |
| `validation-*` | Validating evaluator accuracy against human labels |
| `production-*` | CI/CD, monitoring |

## Key Principles

| Principle | Action |
| --------- | ------ |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
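The "Validate judges" bar (>80% TPR/TNR) takes only a few lines of plain Python once judge and human labels sit side by side. A minimal sketch; this is generic code, not a Phoenix API, and `judge_agreement` is a hypothetical helper name:

```python
def judge_agreement(judge_labels, human_labels, positive="pass"):
    """Compute TPR/TNR of an LLM judge against human labels."""
    tp = fn = tn = fp = 0
    for judge, human in zip(judge_labels, human_labels):
        if human == positive:
            if judge == positive:
                tp += 1  # judge agreed on a true positive
            else:
                fn += 1
        else:
            if judge != positive:
                tn += 1  # judge agreed on a true negative
            else:
                fp += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr

tpr, tnr = judge_agreement(
    ["pass", "pass", "fail", "fail"],   # judge
    ["pass", "fail", "fail", "fail"],   # human
)
# tpr = 1.0 (1/1), tnr = 2/3 (one human "fail" judged "pass")
```

Both rates must clear the bar; a judge that passes everything has perfect TPR and useless TNR.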
skills/phoenix-evals/references/axial-coding.md (new file, +95 lines)
# Axial Coding

Group open-ended notes into structured failure taxonomies.

## Process

1. **Gather** - Collect open coding notes
2. **Pattern** - Group notes with common themes
3. **Name** - Create actionable category names
4. **Quantify** - Count failures per category
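The quantify step is plain bookkeeping; a minimal sketch, assuming each note has already been assigned a category during coding:

```python
from collections import Counter

# (note, category) pairs produced during axial coding
coded_notes = [
    ("invented a feature that doesn't exist", "hallucination"),
    ("cited a paper that was never published", "hallucination"),
    ("answered only half the question", "incompleteness"),
]

# Count failures per category
counts = Counter(category for _, category in coded_notes)
# Counter({'hallucination': 2, 'incompleteness': 1})
```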
## Example Taxonomy

```yaml
failure_taxonomy:
  content_quality:
    hallucination: [invented_facts, fictional_citations]
    incompleteness: [partial_answer, missing_key_info]
    inaccuracy: [wrong_numbers, wrong_dates]

  communication:
    tone_mismatch: [too_casual, too_formal]
    clarity: [ambiguous, jargon_heavy]

  context:
    user_context: [ignored_preferences, misunderstood_intent]
    retrieved_context: [ignored_documents, wrong_context]

  safety:
    missing_disclaimers: [legal, medical, financial]
```

## Add Annotation (Python)

```python
from phoenix.client import Client

client = Client()
client.spans.add_span_annotation(
    span_id="abc123",
    annotation_name="failure_category",
    label="hallucination",
    explanation="invented a feature that doesn't exist",
    annotator_kind="HUMAN",
    sync=True,
)
```

## Add Annotation (TypeScript)

```typescript
import { addSpanAnnotation } from "@arizeai/phoenix-client/spans";

await addSpanAnnotation({
  spanAnnotation: {
    spanId: "abc123",
    name: "failure_category",
    label: "hallucination",
    explanation: "invented a feature that doesn't exist",
    annotatorKind: "HUMAN",
  },
});
```

## Agent Failure Taxonomy

```yaml
agent_failures:
  planning: [wrong_plan, incomplete_plan]
  tool_selection: [wrong_tool, missed_tool, unnecessary_call]
  tool_execution: [wrong_parameters, type_error]
  state_management: [lost_context, stuck_in_loop]
  error_recovery: [no_fallback, wrong_fallback]
```

## Transition Matrix (Agents)

Shows where failures occur between states:
```python
from collections import defaultdict

import pandas as pd

def build_transition_matrix(conversations, states):
    # find_last_success / find_first_failure return the state name of the
    # last good step and the first failing step in a conversation
    matrix = defaultdict(lambda: defaultdict(int))
    for conv in conversations:
        if conv["failed"]:
            last_success = find_last_success(conv)
            first_failure = find_first_failure(conv)
            matrix[last_success][first_failure] += 1
    return pd.DataFrame(matrix).fillna(0)
```

## Principles

- **MECE** - Each failure fits ONE category
- **Actionable** - Categories suggest fixes
- **Bottom-up** - Let categories emerge from data
skills/phoenix-evals/references/common-mistakes-python.md (new file, +225 lines)
# Common Mistakes (Python)

Patterns that LLMs frequently generate incorrectly from training data.

## Legacy Model Classes

```python
# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")

# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")
```

**Why**: `OpenAIModel`, `AnthropicModel`, etc. are legacy 1.0 wrappers in `phoenix.evals.legacy`. The `LLM` class is provider-agnostic and is the current 2.0 API.

## Using run_evals Instead of evaluate_dataframe

```python
# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns list of DataFrames

# RIGHT — current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
```

**Why**: `run_evals` is the legacy 1.0 batch function. `evaluate_dataframe` is the current 2.0 function with a different return format.

## Wrong Result Column Names

```python
# WRONG — column doesn't exist
score = results_df["relevance"].mean()

# WRONG — column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()

# RIGHT — extract numeric score from dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()
```

**Why**: `evaluate_dataframe` returns columns named `{name}_score` containing Score dicts like `{"name": "...", "score": 1.0, "label": "...", "explanation": "..."}`.

## Deprecated project_name Parameter

```python
# WRONG
df = client.spans.get_spans_dataframe(project_name="my-project")

# RIGHT
df = client.spans.get_spans_dataframe(project_identifier="my-project")
```

**Why**: `project_name` is deprecated in favor of `project_identifier`, which also accepts project IDs.

## Wrong Client Constructor

```python
# WRONG
client = Client(endpoint="https://app.phoenix.arize.com")
client = Client(url="https://app.phoenix.arize.com")

# RIGHT — for remote/cloud Phoenix
client = Client(base_url="https://app.phoenix.arize.com", api_key="...")

# ALSO RIGHT — for local Phoenix (falls back to env vars or localhost:6006)
client = Client()
```

**Why**: The parameter is `base_url`, not `endpoint` or `url`. For local instances, `Client()` with no args works fine. For remote instances, `base_url` and `api_key` are required.

## Too-Aggressive Time Filters

```python
# WRONG — often returns zero spans
from datetime import datetime, timedelta
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    start_time=datetime.now() - timedelta(hours=1),
)

# RIGHT — use limit to control result size instead
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=50,
)
```

**Why**: Traces may be from any time period. A 1-hour window frequently returns nothing. Use `limit=` to control result size instead.

## Not Filtering Spans Appropriately

```python
# WRONG — fetches all spans including internal LLM calls, retrievers, etc.
df = client.spans.get_spans_dataframe(project_identifier="my-project")

# RIGHT for end-to-end evaluation — filter to top-level spans
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    root_spans_only=True,
)

# RIGHT for RAG evaluation — fetch child spans for retriever/LLM metrics
all_spans = client.spans.get_spans_dataframe(
    project_identifier="my-project",
)
retriever_spans = all_spans[all_spans["span_kind"] == "RETRIEVER"]
llm_spans = all_spans[all_spans["span_kind"] == "LLM"]
```

**Why**: For end-to-end evaluation (e.g., overall answer quality), use `root_spans_only=True`. For RAG systems, you often need child spans separately — retriever spans for DocumentRelevance and LLM spans for Faithfulness. Choose the right span level for your evaluation target.

## Assuming Span Output is Plain Text

```python
# WRONG — output may be JSON, not plain text
df["output"] = df["attributes.output.value"]

# RIGHT — parse JSON and extract the answer field
import json

def extract_answer(output_value):
    if not isinstance(output_value, str):
        return str(output_value) if output_value is not None else ""
    try:
        parsed = json.loads(output_value)
        if isinstance(parsed, dict):
            for key in ("answer", "result", "output", "response"):
                if key in parsed:
                    return str(parsed[key])
    except (json.JSONDecodeError, TypeError):
        pass
    return output_value

df["output"] = df["attributes.output.value"].apply(extract_answer)
```

**Why**: LangChain and other frameworks often output structured JSON from root spans, like `{"context": "...", "question": "...", "answer": "..."}`. Evaluators need the actual answer text, not the raw JSON.

## Using @create_evaluator for LLM-Based Evaluation

```python
# WRONG — @create_evaluator doesn't call an LLM
@create_evaluator(name="relevance", kind="llm")
def relevance(input: str, output: str) -> str:
    pass  # No LLM is involved

# RIGHT — use ClassificationEvaluator for LLM-based evaluation
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```

**Why**: `@create_evaluator` wraps a plain Python function. Setting `kind="llm"` marks it as LLM-based but you must implement the LLM call yourself. For LLM-based evaluation, prefer `ClassificationEvaluator` which handles the LLM call, structured output parsing, and explanations automatically.

## Using llm_classify Instead of ClassificationEvaluator

```python
# WRONG — legacy 1.0 API
from phoenix.evals import llm_classify
results = llm_classify(
    dataframe=df,
    template=template_str,
    model=model,
    rails=["relevant", "irrelevant"],
)

# RIGHT — current 2.0 API
from phoenix.evals import ClassificationEvaluator, async_evaluate_dataframe, LLM

classifier = ClassificationEvaluator(
    name="relevance",
    prompt_template=template_str,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])
```

**Why**: `llm_classify` is the legacy 1.0 function. The current pattern is to create an evaluator with `ClassificationEvaluator` and run it with `async_evaluate_dataframe()`.

## Using HallucinationEvaluator

```python
# WRONG — deprecated
from phoenix.evals import HallucinationEvaluator
eval = HallucinationEvaluator(model)

# RIGHT — use FaithfulnessEvaluator
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals import LLM
eval = FaithfulnessEvaluator(llm=LLM(provider="openai", model="gpt-4o"))
```

**Why**: `HallucinationEvaluator` is deprecated. `FaithfulnessEvaluator` is its replacement, using "faithful"/"unfaithful" labels with a maximized score (1.0 = faithful).
skills/phoenix-evals/references/error-analysis-multi-turn.md (new file, +52 lines)
# Error Analysis: Multi-Turn Conversations

Debugging complex multi-turn conversation traces.

## The Approach

1. **End-to-end first** - Did the conversation achieve the goal?
2. **Find first failure** - Trace backwards to root cause
3. **Simplify** - Try single-turn before multi-turn debug
4. **N-1 testing** - Isolate turn-specific vs capability issues

## Find First Upstream Failure

```
Turn 1: User asks about flights ✓
Turn 2: Assistant asks for dates ✓
Turn 3: User provides dates ✓
Turn 4: Assistant searches WRONG dates ← FIRST FAILURE
Turn 5: Shows wrong flights (consequence)
Turn 6: User frustrated (consequence)
```

Focus on Turn 4, not Turn 6.

## Simplify First

Before debugging multi-turn, test single-turn:

```python
# If single-turn also fails → problem is retrieval/knowledge
# If single-turn passes → problem is conversation context
response = chat("What's the return policy for electronics?")
```

## N-1 Testing

Give turns 1 to N-1 as context, test turn N:

```python
context = conversation[:n-1]
response = chat_with_context(context, user_message_n)
# Compare to actual turn N
```
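A runnable version of the N-1 loop, with the chat client injected so it can be replayed or mocked. This is a generic sketch, not a Phoenix API; `chat_with_context` is a hypothetical callable taking (history, user_message):

```python
def replay_turn_n(conversation, n, chat_with_context):
    """Replay turn n with turns 1..n-1 as context; compare to the logged reply.

    conversation: list of {"user": ..., "assistant": ...} turns (n is 1-indexed).
    chat_with_context: callable(history, user_message) -> assistant reply.
    """
    history = conversation[: n - 1]
    user_message = conversation[n - 1]["user"]
    replayed = chat_with_context(history, user_message)
    logged = conversation[n - 1]["assistant"]
    return {"replayed": replayed, "logged": logged, "matches": replayed == logged}

# Mocked client standing in for the real model
result = replay_turn_n(
    [{"user": "Return policy?", "assistant": "5-7 days"}],
    n=1,
    chat_with_context=lambda history, msg: "5-7 days",
)
# result["matches"] is True
```

If the replayed turn matches the logged failure, the error reproduces without the full conversation; if it diverges, the accumulated context is the suspect.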
This isolates whether the error comes from conversation context or underlying capability.

## Checklist

1. Did conversation achieve goal? (E2E)
2. Which turn first went wrong?
3. Can you reproduce with single-turn?
4. Is error from context or capability? (N-1 test)
skills/phoenix-evals/references/error-analysis.md (new file, +170 lines)
# Error Analysis

Review traces to discover failure modes before building evaluators.

## Process

1. **Sample** - 100+ traces (errors, negative feedback, random)
2. **Open Code** - Write free-form notes per trace
3. **Axial Code** - Group notes into failure categories
4. **Quantify** - Count failures per category
5. **Prioritize** - Rank by frequency × severity

## Sample Traces

### Span-level sampling (Python — DataFrame)

```python
import pandas as pd

from phoenix.client import Client

# Client() works for local Phoenix (falls back to env vars or localhost:6006)
# For remote/cloud: Client(base_url="https://app.phoenix.arize.com", api_key="...")
client = Client()
spans_df = client.spans.get_spans_dataframe(project_identifier="my-app")

# Build representative sample
sample = pd.concat([
    spans_df[spans_df["status_code"] == "ERROR"].sample(30),
    spans_df[spans_df["feedback"] == "negative"].sample(30),
    spans_df.sample(40),
]).drop_duplicates("span_id").head(100)
```

### Span-level sampling (TypeScript)

```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";

const { spans: errors } = await getSpans({
  project: { projectName: "my-app" },
  statusCode: "ERROR",
  limit: 30,
});
const { spans: allSpans } = await getSpans({
  project: { projectName: "my-app" },
  limit: 70,
});
const sample = [...errors, ...allSpans.sort(() => Math.random() - 0.5).slice(0, 40)];
const unique = [...new Map(sample.map((s) => [s.context.span_id, s])).values()].slice(0, 100);
```

### Trace-level sampling (Python)

When errors span multiple spans (e.g., agent workflows), sample whole traces:

```python
from datetime import datetime, timedelta

traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=24),
    include_spans=True,
    sort="latency_ms",
    order="desc",
    limit=100,
)
# Each trace has: trace_id, start_time, end_time, spans
```

### Trace-level sampling (TypeScript)

```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";

const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});
```

## Add Notes (Python)

```python
client.spans.add_span_note(
    span_id="abc123",
    note="wrong timezone - said 3pm EST but user is PST",
)
```

## Add Notes (TypeScript)

```typescript
import { addSpanNote } from "@arizeai/phoenix-client/spans";

await addSpanNote({
  spanNote: {
    spanId: "abc123",
    note: "wrong timezone - said 3pm EST but user is PST",
  },
});
```

## What to Note

| Type | Examples |
| ---- | -------- |
| Factual errors | Wrong dates, prices, made-up features |
| Missing info | Didn't answer question, omitted details |
| Tone issues | Too casual/formal for context |
| Tool issues | Wrong tool, wrong parameters |
| Retrieval | Wrong docs, missing relevant docs |

## Good Notes

```
BAD: "Response is bad"
GOOD: "Response says ships in 2 days but policy is 5-7 days"
```

## Group into Categories

```python
categories = {
    "factual_inaccuracy": ["wrong shipping time", "incorrect price"],
    "hallucination": ["made up a discount", "invented feature"],
    "tone_mismatch": ["informal for enterprise client"],
}
# Priority = Frequency × Severity
```
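The frequency × severity ranking can be made concrete with a small helper. The severity weights below are illustrative assumptions for this sketch, not Phoenix values:

```python
# Assumed severity weights per category (1 = annoying, 3 = trust-breaking)
SEVERITY = {"hallucination": 3, "factual_inaccuracy": 2, "tone_mismatch": 1}

categories = {
    "factual_inaccuracy": ["wrong shipping time", "incorrect price"],
    "hallucination": ["made up a discount", "invented feature"],
    "tone_mismatch": ["informal for enterprise client"],
}

# Priority = frequency (note count) × severity weight, highest first
priorities = sorted(
    ((name, len(notes) * SEVERITY[name]) for name, notes in categories.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
# [('hallucination', 6), ('factual_inaccuracy', 4), ('tone_mismatch', 1)]
```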
## Retrieve Existing Annotations

### Python

```python
# From a spans DataFrame
annotations_df = client.spans.get_span_annotations_dataframe(
    spans_dataframe=sample,
    project_identifier="my-app",
    include_annotation_names=["quality", "correctness"],
)
# annotations_df has: span_id (index), name, label, score, explanation

# Or from specific span IDs
annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span-id-1", "span-id-2"],
    project_identifier="my-app",
)
```

### TypeScript

```typescript
import { getSpanAnnotations } from "@arizeai/phoenix-client/spans";

const { annotations } = await getSpanAnnotations({
  project: { projectName: "my-app" },
  spanIds: ["span-id-1", "span-id-2"],
  includeAnnotationNames: ["quality", "correctness"],
});

for (const ann of annotations) {
  console.log(`${ann.span_id}: ${ann.name} = ${ann.result?.label} (${ann.result?.score})`);
}
```

## Saturation

Stop when new traces reveal no new failure modes. Minimum: 100 traces.
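One way to spot saturation is to track how many previously unseen categories each review batch contributes; when the count reaches zero, new traces are no longer teaching you anything. A generic sketch, not a Phoenix API:

```python
def new_categories_per_batch(batches):
    """For each batch of category labels, count labels never seen before."""
    seen, new_counts = set(), []
    for batch in batches:
        fresh = set(batch) - seen
        new_counts.append(len(fresh))
        seen |= fresh
    return new_counts

counts = new_categories_per_batch([
    ["hallucination", "tone"],        # 2 new
    ["hallucination", "retrieval"],   # 1 new
    ["tone", "hallucination"],        # 0 new → approaching saturation
])
# counts == [2, 1, 0]
```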
skills/phoenix-evals/references/evaluate-dataframe-python.md (new file, +137 lines)
# Batch Evaluation with evaluate_dataframe (Python)

Run evaluators across a DataFrame. The core 2.0 batch evaluation API.

## Preferred: async_evaluate_dataframe

For batch evaluations (especially with LLM evaluators), prefer the async version for better throughput:

```python
from phoenix.evals import async_evaluate_dataframe

results_df = await async_evaluate_dataframe(
    dataframe=df,               # pandas DataFrame with columns matching evaluator params
    evaluators=[eval1, eval2],  # List of evaluators
    concurrency=5,              # Max concurrent LLM calls (default 3)
    exit_on_error=False,        # Optional: stop on first error when True (default True)
    max_retries=3,              # Optional: retry failed LLM calls (default 10)
)
```

## Sync Version

```python
from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(
    dataframe=df,               # pandas DataFrame with columns matching evaluator params
    evaluators=[eval1, eval2],  # List of evaluators
    exit_on_error=False,        # Optional: stop on first error when True (default True)
    max_retries=3,              # Optional: retry failed LLM calls (default 10)
)
```

## Result Column Format

`async_evaluate_dataframe` / `evaluate_dataframe` returns a copy of the input DataFrame with added columns. **Result columns contain dicts, NOT raw numbers.**

For each evaluator named `"foo"`, two columns are added:

| Column | Type | Contents |
| ------ | ---- | -------- |
| `foo_score` | `dict` | `{"name": "foo", "score": 1.0, "label": "True", "explanation": "...", "metadata": {...}, "kind": "code", "direction": "maximize"}` |
| `foo_execution_details` | `dict` | `{"status": "success", "exceptions": [], "execution_seconds": 0.001}` |

Only non-None fields appear in the score dict.

### Extracting Numeric Scores

```python
# WRONG — these will fail or produce unexpected results
score = results_df["relevance"].mean()        # KeyError!
score = results_df["relevance_score"].mean()  # Tries to average dicts!

# RIGHT — extract the numeric score from each dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
mean_score = scores.mean()
```

### Extracting Labels

```python
labels = results_df["relevance_score"].apply(
    lambda x: x.get("label", "") if isinstance(x, dict) else ""
)
```

### Extracting Explanations (LLM evaluators)

```python
explanations = results_df["relevance_score"].apply(
    lambda x: x.get("explanation", "") if isinstance(x, dict) else ""
)
```

### Finding Failures

```python
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
failed_mask = scores < 0.5
failures = results_df[failed_mask]
```

## Input Mapping

Evaluators receive each row as a dict. Column names must match the evaluator's expected parameter names. If they don't match, use `.bind()` or `bind_evaluator`:

```python
from phoenix.evals import bind_evaluator, create_evaluator, async_evaluate_dataframe

@create_evaluator(name="check", kind="code")
def check(response: str) -> bool:
    return len(response.strip()) > 0

# Option 1: Use .bind() method on the evaluator
check.bind(input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[check])

# Option 2: Use the bind_evaluator function
bound = bind_evaluator(evaluator=check, input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[bound])
```

Or simply rename columns to match:

```python
df = df.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output",
})
```

## DO NOT use run_evals

```python
# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1])
# Returns List[DataFrame] — one per evaluator

# RIGHT — current 2.0 API
from phoenix.evals import async_evaluate_dataframe
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
```

Key differences:

- `run_evals` returns a **list** of DataFrames (one per evaluator)
- `async_evaluate_dataframe` returns a **single** DataFrame with all results merged
- `async_evaluate_dataframe` uses `{name}_score` dict column format
- `async_evaluate_dataframe` uses `bind_evaluator` for input mapping (not an `input_mapping=` param)
skills/phoenix-evals/references/evaluators-code-python.md (new file, +91 lines)
|
||||
# Evaluators: Code Evaluators in Python

Deterministic evaluators without an LLM. Fast, cheap, reproducible.

## Basic Pattern

```python
import re
import json

from phoenix.evals import create_evaluator


@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    return bool(re.search(r'\[\d+\]', output))


@create_evaluator(name="json_valid", kind="code")
def json_valid(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```

## Parameter Binding

| Parameter | Description |
| --------- | ----------- |
| `output` | Task output |
| `input` | Example input |
| `expected` | Expected output |
| `metadata` | Example metadata |

```python
@create_evaluator(name="matches_expected", kind="code")
def matches_expected(output: str, expected: dict) -> bool:
    return output.strip() == expected.get("answer", "").strip()
```

## Common Patterns

- **Regex**: `re.search(pattern, output)`
- **JSON schema**: `jsonschema.validate()`
- **Keywords**: `keyword in output.lower()`
- **Length**: `len(output.split())`
- **Similarity**: `editdistance.eval()` or Jaccard
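The similarity bullet can be as simple as token-level Jaccard overlap, sketched here with no dependencies (`jaccard_similarity` is a hypothetical helper, not a Phoenix API):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are trivially identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```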
## Return Types

| Return type | Result |
| ----------- | ------ |
| `bool` | `True` → score=1.0, label="True"; `False` → score=0.0, label="False" |
| `float`/`int` | Used as the `score` value directly |
| `str` (short, ≤3 words) | Used as the `label` value |
| `str` (long, ≥4 words) | Used as the `explanation` value |
| `dict` with `score`/`label`/`explanation` | Mapped to Score fields directly |
| `Score` object | Used as-is |
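A dict return lets one function set score, label, and explanation at once. A hedged sketch (`citation_quality` is a hypothetical evaluator; wrap it with `@create_evaluator(name=..., kind="code")` as in the Basic Pattern above):

```python
import re

def citation_quality(output: str) -> dict:
    """Score by citation count; the dict keys map onto Score fields."""
    n = len(re.findall(r"\[\d+\]", output))
    return {
        "score": min(n / 2, 1.0),  # saturate at two or more citations
        "label": "cited" if n else "uncited",
        "explanation": f"Found {n} citation marker(s).",
    }
```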
## Important: Code vs LLM Evaluators

The `@create_evaluator` decorator wraps a plain Python function.

- `kind="code"` (default): For deterministic evaluators that don't call an LLM.
- `kind="llm"`: Marks the evaluator as LLM-based, but **you** must implement the LLM call inside the function. The decorator does not call an LLM for you.

For most LLM-based evaluation, prefer `ClassificationEvaluator`, which handles the LLM call, structured output parsing, and explanations automatically:

```python
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```

## Pre-Built

```python
from phoenix.experiments.evaluators import ContainsAnyKeyword, JSONParseable, MatchesRegex

evaluators = [
    ContainsAnyKeyword(keywords=["disclaimer"]),
    JSONParseable(),
    MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}"),
]
```
# Evaluators: Code Evaluators in TypeScript

Deterministic evaluators without an LLM. Fast, cheap, reproducible.

## Basic Pattern

```typescript
import { createEvaluator } from "@arizeai/phoenix-evals";

const containsCitation = createEvaluator<{ output: string }>(
  ({ output }) => (/\[\d+\]/.test(output) ? 1 : 0),
  { name: "contains_citation", kind: "CODE" }
);
```

## With Full Results (asExperimentEvaluator)

```typescript
import { asExperimentEvaluator } from "@arizeai/phoenix-client/experiments";

const jsonValid = asExperimentEvaluator({
  name: "json_valid",
  kind: "CODE",
  evaluate: async ({ output }) => {
    try {
      JSON.parse(String(output));
      return { score: 1.0, label: "valid_json" };
    } catch (e) {
      return { score: 0.0, label: "invalid_json", explanation: String(e) };
    }
  },
});
```

## Parameter Types

```typescript
interface EvaluatorParams {
  input: Record<string, unknown>;
  output: unknown;
  expected: Record<string, unknown>;
  metadata: Record<string, unknown>;
}
```

## Common Patterns

- **Regex**: `/pattern/.test(output)`
- **JSON**: `JSON.parse()` + zod schema
- **Keywords**: `output.includes(keyword)`
- **Similarity**: `fastest-levenshtein`
# Evaluators: Custom Templates

Design LLM judge prompts.

## Complete Template Pattern

```python
TEMPLATE = """Evaluate faithfulness of the response to the context.

<context>{{context}}</context>
<response>{{output}}</response>

CRITERIA:
"faithful" = ALL claims supported by context
"unfaithful" = ANY claim NOT in context

EXAMPLES:
Context: "Price is $10" → Response: "It costs $10" → faithful
Context: "Price is $10" → Response: "About $15" → unfaithful

EDGE CASES:
- Empty context → cannot_evaluate
- "I don't know" when appropriate → faithful
- Partial faithfulness → unfaithful (strict)

Answer (faithful/unfaithful):"""
```

## Template Structure

1. Task description
2. Input variables in XML tags
3. Criteria definitions
4. Examples (2-4 cases)
5. Edge cases
6. Output format
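Since every `{{variable}}` in a template must be supplied at eval time, a small self-check helps catch drift between the template and your data. A minimal sketch (`template_vars` is a hypothetical regex helper, not a Phoenix API):

```python
import re

def template_vars(template: str) -> set[str]:
    """Names of all {{variable}} placeholders found in a prompt template."""
    return set(re.findall(r"\{\{(\w+)\}\}", template))

TEMPLATE = "<context>{{context}}</context>\n<response>{{output}}</response>"
# Columns you plan to supply; anything left over would fail at eval time
missing = template_vars(TEMPLATE) - {"context", "output"}
```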
## XML Tags

```
<question>{{input}}</question>
<response>{{output}}</response>
<context>{{context}}</context>
<reference>{{reference}}</reference>
```

## Common Mistakes

| Mistake | Fix |
| ------- | --- |
| Vague criteria | Define each label exactly |
| No examples | Include 2-4 cases |
| Ambiguous format | Specify exact output |
| No edge cases | Address ambiguity |
skills/phoenix-evals/references/evaluators-llm-python.md
# Evaluators: LLM Evaluators in Python

LLM evaluators use a language model to judge outputs. Use them when criteria are subjective.

## Quick Start

```python
from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o")

HELPFULNESS_TEMPLATE = """Rate how helpful the response is.

<question>{{input}}</question>
<response>{{output}}</response>

"helpful" means directly addresses the question.
"not_helpful" means does not address the question.

Your answer (helpful/not_helpful):"""

helpfulness = ClassificationEvaluator(
    name="helpfulness",
    prompt_template=HELPFULNESS_TEMPLATE,
    llm=llm,
    choices={"not_helpful": 0, "helpful": 1},
)
```

## Template Variables

Use XML tags to wrap variables for clarity:

| Variable | XML Tag |
| -------- | ------- |
| `{{input}}` | `<question>{{input}}</question>` |
| `{{output}}` | `<response>{{output}}</response>` |
| `{{reference}}` | `<reference>{{reference}}</reference>` |
| `{{context}}` | `<context>{{context}}</context>` |

## create_classifier (Factory)

Shorthand factory that returns a `ClassificationEvaluator`. Prefer direct `ClassificationEvaluator` instantiation when you need more parameters or customization:

```python
from phoenix.evals import create_classifier, LLM

relevance = create_classifier(
    name="relevance",
    prompt_template="""Is this response relevant to the question?
<question>{{input}}</question>
<response>{{output}}</response>
Answer (relevant/irrelevant):""",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```

## Input Mapping

Column names must match template variables. Rename columns or use `bind_evaluator`:

```python
# Option 1: Rename columns to match template variables
df = df.rename(columns={"user_query": "input", "ai_response": "output"})

# Option 2: Use bind_evaluator
from phoenix.evals import bind_evaluator

bound = bind_evaluator(
    evaluator=helpfulness,
    input_mapping={"input": "user_query", "output": "ai_response"},
)
```

## Running

```python
from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(dataframe=df, evaluators=[helpfulness])
```

## Best Practices

1. **Be specific** - Define exactly what pass/fail means
2. **Include examples** - Show concrete cases for each label
3. **Explanations by default** - `ClassificationEvaluator` includes explanations automatically
4. **Study built-in prompts** - See `phoenix.evals.__generated__.classification_evaluator_configs` for examples of well-structured evaluation prompts (Faithfulness, Correctness, DocumentRelevance, etc.)
skills/phoenix-evals/references/evaluators-llm-typescript.md
# Evaluators: LLM Evaluators in TypeScript

LLM evaluators use a language model to judge outputs. Uses the Vercel AI SDK.

## Quick Start

```typescript
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const helpfulness = await createClassificationEvaluator<{
  input: string;
  output: string;
}>({
  name: "helpfulness",
  model: openai("gpt-4o"),
  promptTemplate: `Rate helpfulness.
<question>{{input}}</question>
<response>{{output}}</response>
Answer (helpful/not_helpful):`,
  choices: { not_helpful: 0, helpful: 1 },
});
```

## Template Variables

Use XML tags: `<question>{{input}}</question>`, `<response>{{output}}</response>`, `<context>{{context}}</context>`

## Custom Evaluator with asExperimentEvaluator

```typescript
import { asExperimentEvaluator } from "@arizeai/phoenix-client/experiments";

const customEval = asExperimentEvaluator({
  name: "custom",
  kind: "LLM",
  evaluate: async ({ input, output }) => {
    // Your LLM call here
    return { score: 1.0, label: "pass", explanation: "..." };
  },
});
```

## Pre-Built Evaluators

```typescript
import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals";

const faithfulnessEvaluator = createFaithfulnessEvaluator({
  model: openai("gpt-4o"),
});
```

## Best Practices

- Be specific about criteria
- Include examples in prompts
- Use `<thinking>` for chain of thought
skills/phoenix-evals/references/evaluators-overview.md
# Evaluators: Overview

When and how to build automated evaluators.

## Decision Framework

```
Should I Build an Evaluator?
        │
        ▼
Can I fix it with a prompt change?
  YES → Fix the prompt first
  NO  → Is this a recurring issue?
          YES → Build evaluator
          NO  → Add to watchlist
```

**Don't automate prematurely.** Many issues are simple prompt fixes.

## Evaluator Requirements

1. **Clear criteria** - Specific, not "Is it good?"
2. **Labeled test set** - 100+ examples with human labels
3. **Measured accuracy** - Know TPR/TNR before deploying
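The TPR/TNR check in requirement 3 can be computed directly from paired human/judge labels. A minimal sketch, assuming string labels with "faithful" as the positive class (`tpr_tnr` is a hypothetical helper):

```python
def tpr_tnr(human_labels, judge_labels, positive="faithful"):
    """True-positive and true-negative rate of the judge against human labels."""
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    tn = sum(h != positive and j != positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr, tnr
```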
## Evaluator Lifecycle

1. **Discover** - Error analysis reveals pattern
2. **Design** - Define criteria and test cases
3. **Implement** - Build code or LLM evaluator
4. **Calibrate** - Validate against human labels
5. **Deploy** - Add to experiment/CI pipeline
6. **Monitor** - Track accuracy over time
7. **Maintain** - Update as product evolves

## What NOT to Automate

- **Rare issues** - <5 instances? Watchlist, don't build
- **Quick fixes** - Fixable by prompt change? Fix it
- **Evolving criteria** - Stabilize definition first
skills/phoenix-evals/references/evaluators-pre-built.md
# Evaluators: Pre-Built

Use for exploration only. Validate before production.

## Python

```python
from phoenix.evals import LLM
from phoenix.evals.metrics import FaithfulnessEvaluator

llm = LLM(provider="openai", model="gpt-4o")
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
```

**Note**: `HallucinationEvaluator` is deprecated. Use `FaithfulnessEvaluator` instead. It uses "faithful"/"unfaithful" labels with score 1.0 = faithful.

## TypeScript

```typescript
import { createHallucinationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const hallucinationEval = createHallucinationEvaluator({ model: openai("gpt-4o") });
```

## Available (2.0)

| Evaluator | Type | Description |
| --------- | ---- | ----------- |
| `FaithfulnessEvaluator` | LLM | Is the response faithful to the context? |
| `CorrectnessEvaluator` | LLM | Is the response correct? |
| `DocumentRelevanceEvaluator` | LLM | Are retrieved documents relevant? |
| `ToolSelectionEvaluator` | LLM | Did the agent select the right tool? |
| `ToolInvocationEvaluator` | LLM | Did the agent invoke the tool correctly? |
| `ToolResponseHandlingEvaluator` | LLM | Did the agent handle the tool response well? |
| `MatchesRegex` | Code | Does output match a regex pattern? |
| `PrecisionRecallFScore` | Code | Precision/recall/F-score metrics |
| `exact_match` | Code | Exact string match |

Legacy evaluators (`HallucinationEvaluator`, `QAEvaluator`, `RelevanceEvaluator`, `ToxicityEvaluator`, `SummarizationEvaluator`) are in `phoenix.evals.legacy` and deprecated.

## When to Use

| Situation | Recommendation |
| --------- | -------------- |
| Exploration | Find traces to review |
| Find outliers | Sort by scores |
| Production | Validate first (>80% human agreement) |
| Domain-specific | Build custom |

## Exploration Pattern

```python
from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(dataframe=traces, evaluators=[faithfulness_eval])

# Score columns contain dicts — extract numeric scores
scores = results_df["faithfulness_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
low_scores = results_df[scores < 0.5]   # Review these
high_scores = results_df[scores > 0.9]  # Also sample
```

## Validation Required

```python
from sklearn.metrics import classification_report

print(classification_report(human_labels, evaluator_results["label"]))
# Target: >80% agreement
```
skills/phoenix-evals/references/evaluators-rag.md
# Evaluators: RAG Systems

RAG has two distinct components requiring different evaluation approaches.

## Two-Phase Evaluation

```
RETRIEVAL                     GENERATION
─────────                     ──────────
Query → Retriever → Docs      Docs + Query → LLM → Answer
          │                            │
      IR Metrics              LLM Judges / Code Checks
```

**Debug retrieval first** using IR metrics, then tackle generation quality.

## Retrieval Evaluation (IR Metrics)

Use traditional information retrieval metrics:

| Metric | What It Measures |
| ------ | ---------------- |
| Recall@k | Of all relevant docs, how many in top k? |
| Precision@k | Of k retrieved docs, how many relevant? |
| MRR | How high is the first relevant doc? |
| NDCG | Quality weighted by position |

```python
# Requires query-document relevance labels
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    retrieved_set = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    if not relevant_set:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)
```
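MRR from the table above can be sketched in the same style, as the reciprocal rank of the first relevant document for a single query (a hedged sketch, not a Phoenix API):

```python
def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant doc; 0.0 if none retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging this over all queries gives the mean reciprocal rank.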
## Creating Retrieval Test Data

Generate query-document pairs synthetically:

```python
# Reverse process: document → questions that document answers
def generate_retrieval_test(documents):
    test_pairs = []
    for doc in documents:
        # Extract facts, generate questions
        questions = llm(f"Generate 3 questions this document answers:\n{doc}")
        for q in questions:
            test_pairs.append({"query": q, "relevant_doc_id": doc.id})
    return test_pairs
```

## Generation Evaluation

Use LLM judges for qualities code can't measure:

| Eval | Question |
| ---- | -------- |
| **Faithfulness** | Are all claims supported by retrieved context? |
| **Relevance** | Does the answer address the question? |
| **Completeness** | Does the answer cover key points from context? |

```python
from phoenix.evals import ClassificationEvaluator, LLM

FAITHFULNESS_TEMPLATE = """Given the context and answer, is every claim in the answer supported by the context?

<context>{{context}}</context>
<answer>{{output}}</answer>

"faithful" = ALL claims supported by context
"unfaithful" = ANY claim NOT in context

Answer (faithful/unfaithful):"""

faithfulness = ClassificationEvaluator(
    name="faithfulness",
    prompt_template=FAITHFULNESS_TEMPLATE,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"unfaithful": 0, "faithful": 1},
)
```

## RAG Failure Taxonomy

Common failure modes to evaluate:

```yaml
retrieval_failures:
  - no_relevant_docs: Query returns unrelated content
  - partial_retrieval: Some relevant docs missed
  - wrong_chunk: Right doc, wrong section

generation_failures:
  - hallucination: Claims not in retrieved context
  - ignored_context: Answer doesn't use retrieved docs
  - incomplete: Missing key information from context
  - wrong_synthesis: Misinterprets or miscombines sources
```

## Evaluation Order

1. **Retrieval first** - If the wrong docs are retrieved, generation will fail
2. **Faithfulness** - Is the answer grounded in context?
3. **Answer quality** - Does the answer address the question?

Fix retrieval problems before debugging generation.
skills/phoenix-evals/references/experiments-datasets-python.md
# Experiments: Datasets in Python

Creating and managing evaluation datasets.

## Creating Datasets

```python
from phoenix.client import Client

client = Client()

# From examples
dataset = client.datasets.create_dataset(
    name="qa-test-v1",
    examples=[
        {
            "input": {"question": "What is 2+2?"},
            "output": {"answer": "4"},
            "metadata": {"category": "math"},
        },
    ],
)

# From DataFrame
dataset = client.datasets.create_dataset(
    dataframe=df,
    name="qa-test-v1",
    input_keys=["question"],
    output_keys=["answer"],
    metadata_keys=["category"],
)
```

## From Production Traces

```python
spans_df = client.spans.get_spans_dataframe(project_identifier="my-app")

dataset = client.datasets.create_dataset(
    dataframe=spans_df[["input.value", "output.value"]],
    name="production-sample-v1",
    input_keys=["input.value"],
    output_keys=["output.value"],
)
```

## Retrieving Datasets

```python
dataset = client.datasets.get_dataset(name="qa-test-v1")
df = dataset.to_dataframe()
```

## Key Parameters

| Parameter | Description |
| --------- | ----------- |
| `input_keys` | Columns for task input |
| `output_keys` | Columns for expected output |
| `metadata_keys` | Additional context |

## Using Evaluators in Experiments

### Evaluators as experiment evaluators

Pass phoenix-evals evaluators directly to `run_experiment` as the `evaluators` argument:

```python
from functools import partial

from phoenix.client import AsyncClient
from phoenix.evals import ClassificationEvaluator, LLM, bind_evaluator

# Define an LLM evaluator
refusal = ClassificationEvaluator(
    name="refusal",
    prompt_template="Is this a refusal?\nQuestion: {{query}}\nResponse: {{response}}",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"refusal": 0, "answer": 1},
)

# Bind to map dataset columns to evaluator params
refusal_evaluator = bind_evaluator(refusal, {"query": "input.query", "response": "output"})

# Define experiment task
async def run_rag_task(input, rag_engine):
    return rag_engine.query(input["query"])

# Run experiment with the evaluator
experiment = await AsyncClient().experiments.run_experiment(
    dataset=ds,
    task=partial(run_rag_task, rag_engine=query_engine),
    experiment_name="baseline",
    evaluators=[refusal_evaluator],
    concurrency=10,
)
```

### Evaluators as the task (meta-evaluation)

Use an LLM evaluator as the experiment **task** to test the evaluator itself against human annotations:

```python
from phoenix.evals import create_evaluator

# The evaluator IS the task being tested
def run_refusal_eval(input, evaluator):
    result = evaluator.evaluate(input)
    return result[0]

# A simple heuristic checks judge vs human agreement
@create_evaluator(name="exact_match")
def exact_match(output, expected):
    return float(output["score"]) == float(expected["refusal_score"])

# Run: the evaluator is the task, exact_match evaluates it
experiment = await AsyncClient().experiments.run_experiment(
    dataset=annotated_dataset,
    task=partial(run_refusal_eval, evaluator=refusal),
    experiment_name="judge-v1",
    evaluators=[exact_match],
    concurrency=10,
)
```

This pattern lets you iterate on evaluator prompts until they align with human judgments. See `tutorials/evals/evals-2/evals_2.0_rag_demo.ipynb` for a full worked example.

## Best Practices

- **Versioning**: Create new datasets (e.g., `qa-test-v2`), don't modify existing ones
- **Metadata**: Track source, category, difficulty
- **Balance**: Ensure diverse coverage across categories
# Experiments: Datasets in TypeScript

Creating and managing evaluation datasets.

## Creating Datasets

```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createDataset } from "@arizeai/phoenix-client/datasets";

const client = createClient();

const { datasetId } = await createDataset({
  client,
  name: "qa-test-v1",
  examples: [
    {
      input: { question: "What is 2+2?" },
      output: { answer: "4" },
      metadata: { category: "math" },
    },
  ],
});
```

## Example Structure

```typescript
interface DatasetExample {
  input: Record<string, unknown>;     // Task input
  output?: Record<string, unknown>;   // Expected output
  metadata?: Record<string, unknown>; // Additional context
}
```

## From Production Traces

```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";

const { spans } = await getSpans({
  project: { projectName: "my-app" },
  parentId: null, // root spans only
  limit: 100,
});

const examples = spans.map((span) => ({
  input: { query: span.attributes?.["input.value"] },
  output: { response: span.attributes?.["output.value"] },
  metadata: { spanId: span.context.span_id },
}));

await createDataset({ client, name: "production-sample", examples });
```

## Retrieving Datasets

```typescript
import { getDataset, listDatasets } from "@arizeai/phoenix-client/datasets";

const dataset = await getDataset({ client, datasetId: "..." });
const all = await listDatasets({ client });
```

## Best Practices

- **Versioning**: Create new datasets, don't modify existing ones
- **Metadata**: Track source, category, provenance
- **Type safety**: Use TypeScript interfaces for structure
skills/phoenix-evals/references/experiments-overview.md
# Experiments: Overview

Systematic testing of AI systems with datasets, tasks, and evaluators.

## Structure

```
DATASET    → Examples: {input, expected_output, metadata}
TASK       → function(input) → output
EVALUATORS → (input, output, expected) → score
EXPERIMENT → Run task on all examples, score results
```

## Basic Usage

```python
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=my_dataset,
    task=my_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)

print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}
```

## Workflow

1. **Create dataset** - From traces, synthetic data, or manual curation
2. **Define task** - The function to test (your LLM pipeline)
3. **Select evaluators** - Code and/or LLM-based
4. **Run experiment** - Execute and score
5. **Analyze & iterate** - Review, modify task, re-run

## Dry Runs

Test your setup before a full execution:

```python
experiment = run_experiment(dataset, task, evaluators, dry_run=3)  # Just 3 examples
```

## Best Practices

- **Name meaningfully**: `"improved-retrieval-v2-2024-01-15"`, not `"test"`
- **Version datasets**: Don't modify existing ones
- **Multiple evaluators**: Combine perspectives
# Experiments: Running Experiments in Python

Execute experiments with `run_experiment`.

## Basic Usage

```python
from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()
dataset = client.datasets.get_dataset(name="qa-test-v1")

def my_task(example):
    return call_llm(example.input["question"])

def exact_match(output, expected):
    return 1.0 if output.strip().lower() == expected["answer"].strip().lower() else 0.0

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[exact_match],
    experiment_name="qa-experiment-v1",
)
```

## Task Functions

```python
# Basic task
def task(example):
    return call_llm(example.input["question"])

# With context (RAG)
def rag_task(example):
    return call_llm(f"Context: {example.input['context']}\nQ: {example.input['question']}")
```

## Evaluator Parameters

| Parameter | Access |
| --------- | ------ |
| `output` | Task output |
| `expected` | Example expected output |
| `input` | Example input |
| `metadata` | Example metadata |

## Options

```python
experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=evaluators,
    experiment_name="my-experiment",
    dry_run=3,      # Test with 3 examples
    repetitions=3,  # Run each example 3 times
)
```

## Results

```python
print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}

for run in experiment.runs:
    print(run.output, run.scores)
```

## Add Evaluations Later

```python
from phoenix.client.experiments import evaluate_experiment

evaluate_experiment(experiment=experiment, evaluators=[new_evaluator])
```
# Experiments: Running Experiments in TypeScript

Execute experiments with `runExperiment`.

## Basic Usage

```typescript
import { createClient } from "@arizeai/phoenix-client";
import {
  runExperiment,
  asExperimentEvaluator,
} from "@arizeai/phoenix-client/experiments";

const client = createClient();

const task = async (example: { input: Record<string, unknown> }) => {
  return await callLLM(example.input.question as string);
};

const exactMatch = asExperimentEvaluator({
  name: "exact_match",
  kind: "CODE",
  evaluate: async ({ output, expected }) => ({
    score: output === expected?.answer ? 1.0 : 0.0,
    label: output === expected?.answer ? "match" : "no_match",
  }),
});

const experiment = await runExperiment({
  client,
  experimentName: "qa-experiment-v1",
  dataset: { datasetId: "your-dataset-id" },
  task,
  evaluators: [exactMatch],
});
```

## Task Functions

```typescript
// Basic task
const task = async (example) => await callLLM(example.input.question as string);

// With context (RAG)
const ragTask = async (example) => {
  const prompt = `Context: ${example.input.context}\nQ: ${example.input.question}`;
  return await callLLM(prompt);
};
```

## Evaluator Parameters

```typescript
interface EvaluatorParams {
  input: Record<string, unknown>;
  output: unknown;
  expected: Record<string, unknown>;
  metadata: Record<string, unknown>;
}
```

## Options

```typescript
const experiment = await runExperiment({
  client,
  experimentName: "my-experiment",
  dataset: { datasetName: "qa-test-v1" },
  task,
  evaluators,
  repetitions: 3,     // Run each example 3 times
  maxConcurrency: 5,  // Limit concurrent executions
});
```

## Add Evaluations Later

```typescript
import { evaluateExperiment } from "@arizeai/phoenix-client/experiments";

await evaluateExperiment({ client, experiment, evaluators: [newEvaluator] });
```
@@ -0,0 +1,70 @@
# Experiments: Generating Synthetic Test Data

Creating diverse, targeted test data for evaluation.

## Dimension-Based Approach

Define axes of variation, then generate combinations:

```python
dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}
```
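The dimensions above expand to the full cross-product with `itertools.product` (a minimal sketch; in practice you would sample or prune before generating queries):

```python
from itertools import product

dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}

# Every combination of dimension values: 3 * 3 * 3 = 27 tuples
tuples = list(product(*dimensions.values()))
print(len(tuples))  # 27
```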
## Two-Step Generation

1. **Generate tuples** (combinations of dimension values)
2. **Convert to natural queries** (separate LLM call per tuple)

```python
# Step 1: Create tuples
tuples = [
    ("billing", "frustrated", "complex"),
    ("shipping", "neutral", "simple"),
]

# Step 2: Convert to natural query
def tuple_to_query(t):
    prompt = f"""Generate a realistic customer message:
Issue: {t[0]}, Mood: {t[1]}, Complexity: {t[2]}

Write naturally, include typos if appropriate. Don't be formulaic."""
    return llm(prompt)
```

## Target Failure Modes

Dimensions should target known failures from error analysis:

```python
# From error analysis findings
dimensions = {
    "timezone": ["EST", "PST", "UTC", "ambiguous"],   # Known failure
    "date_format": ["ISO", "US", "EU", "relative"],   # Known failure
}
```

## Quality Control

- **Validate**: Check for placeholder text, minimum length
- **Deduplicate**: Remove near-duplicate queries using embeddings
- **Balance**: Ensure coverage across dimension values
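The validation check can be sketched in a few lines (`validate_query` and its patterns are illustrative, not part of the Phoenix API):

```python
import re

def validate_query(query: str, min_length: int = 20) -> bool:
    # Reject placeholder text like "[NAME]" or "<product>" and too-short queries
    has_placeholder = bool(re.search(r"\[.*?\]|<.*?>", query))
    return len(query) >= min_length and not has_placeholder

print(validate_query("My bill shows a charge for [PRODUCT]"))  # False
print(validate_query("I was charged twice for my subscription last month."))  # True
```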

## When to Use

| Use Synthetic | Use Real Data |
| ------------- | ------------- |
| Limited production data | Sufficient traces |
| Testing edge cases | Validating actual behavior |
| Pre-launch evals | Post-launch monitoring |

## Sample Sizes

| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Comprehensive eval | 100-500 |
| Per-dimension | 10-20 per combination |
@@ -0,0 +1,86 @@
# Experiments: Generating Synthetic Test Data (TypeScript)

Creating diverse, targeted test data for evaluation.

## Dimension-Based Approach

Define axes of variation, then generate combinations:

```typescript
const dimensions = {
  issueType: ["billing", "technical", "shipping"],
  customerMood: ["frustrated", "neutral", "happy"],
  complexity: ["simple", "moderate", "complex"],
};
```

## Two-Step Generation

1. **Generate tuples** (combinations of dimension values)
2. **Convert to natural queries** (separate LLM call per tuple)

```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Step 1: Create tuples
type Tuple = [string, string, string];
const tuples: Tuple[] = [
  ["billing", "frustrated", "complex"],
  ["shipping", "neutral", "simple"],
];

// Step 2: Convert to natural query
async function tupleToQuery(t: Tuple): Promise<string> {
  const { text } = await generateText({
    model: openai("gpt-4o"),
    prompt: `Generate a realistic customer message:
Issue: ${t[0]}, Mood: ${t[1]}, Complexity: ${t[2]}

Write naturally, include typos if appropriate. Don't be formulaic.`,
  });
  return text;
}
```

## Target Failure Modes

Dimensions should target known failures from error analysis:

```typescript
// From error analysis findings
const dimensions = {
  timezone: ["EST", "PST", "UTC", "ambiguous"], // Known failure
  dateFormat: ["ISO", "US", "EU", "relative"], // Known failure
};
```

## Quality Control

- **Validate**: Check for placeholder text, minimum length
- **Deduplicate**: Remove near-duplicate queries using embeddings
- **Balance**: Ensure coverage across dimension values

```typescript
function validateQuery(query: string): boolean {
  const minLength = 20;
  const hasPlaceholder = /\[.*?\]|<.*?>/.test(query);
  return query.length >= minLength && !hasPlaceholder;
}
```

## When to Use

| Use Synthetic | Use Real Data |
| ------------- | ------------- |
| Limited production data | Sufficient traces |
| Testing edge cases | Validating actual behavior |
| Pre-launch evals | Post-launch monitoring |

## Sample Sizes

| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Comprehensive eval | 100-500 |
| Per-dimension | 10-20 per combination |
@@ -0,0 +1,43 @@
# Anti-Patterns

Common mistakes and fixes.

| Anti-Pattern | Problem | Fix |
| ------------ | ------- | --- |
| Generic metrics | Pre-built scores don't match your failures | Build from error analysis |
| Vibe-based | No quantification | Measure with experiments |
| Ignoring humans | Uncalibrated LLM judges | Validate >80% TPR/TNR |
| Premature automation | Evaluators for imagined problems | Let observed failures drive |
| Saturation blindness | 100% pass = no signal | Keep capability evals at 50-80% |
| Similarity metrics | BERTScore/ROUGE for generation | Use for retrieval only |
| Model switching | Hoping a model works better | Error analysis first |

## Quantify Changes

```python
baseline = run_experiment(dataset, old_prompt, evaluators)
improved = run_experiment(dataset, new_prompt, evaluators)
print(f"Improvement: {improved.pass_rate - baseline.pass_rate:+.1%}")
```

## Don't Use Similarity for Generation

```python
# BAD
score = bertscore(output, reference)

# GOOD
correct_facts = check_facts_against_source(output, context)
```

## Error Analysis Before Model Change

```python
# BAD
for model in models:
    results = test(model)

# GOOD
failures = analyze_errors(results)
# Then decide if model change is warranted
```
@@ -0,0 +1,58 @@
# Model Selection

Error analysis first, model changes last.

## Decision Tree

```
Performance Issue?
    │
    ▼
Error analysis suggests model problem?
    NO  → Fix prompts, retrieval, tools
    YES → Is it a capability gap?
        YES → Consider model change
        NO  → Fix the actual problem
```

## Judge Model Selection

| Principle | Action |
| --------- | ------ |
| Start capable | Use gpt-4o first |
| Optimize later | Test cheaper after criteria stable |
| Same model OK | Judge does different task |

```python
# Start with capable model
judge = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o"),
    ...
)

# After validation, test cheaper
judge_cheap = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o-mini"),
    ...
)
# Compare TPR/TNR on same test set
```
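The TPR/TNR comparison against human labels only needs a few lines (a sketch; `human_labels` and `judge_labels` are assumed to be parallel lists of "pass"/"fail" strings):

```python
def tpr_tnr(human_labels, judge_labels, positive="pass"):
    # True positive rate: fraction of human "pass" cases the judge also passes
    # True negative rate: fraction of human "fail" cases the judge also fails
    tp = sum(1 for h, j in zip(human_labels, judge_labels) if h == positive and j == positive)
    tn = sum(1 for h, j in zip(human_labels, judge_labels) if h != positive and j != positive)
    pos = sum(1 for h in human_labels if h == positive)
    neg = len(human_labels) - pos
    return tp / pos, tn / neg

tpr, tnr = tpr_tnr(
    ["pass", "pass", "fail", "fail"],   # human ground truth
    ["pass", "fail", "fail", "fail"],   # judge output
)
print(tpr, tnr)  # 0.5 1.0
```

Run the same test set through both judge models and keep the cheaper one only if its TPR/TNR stay above your bar (e.g. >80%).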

## Don't Model Shop

```python
# BAD
for model in ["gpt-4o", "claude-3", "gemini-pro"]:
    results = run_experiment(dataset, task, model)

# GOOD
failures = analyze_errors(results)
# "Ignores context" → Fix prompt
# "Can't do math" → Maybe try better model
```

## When Model Change Is Warranted

- Failures persist after prompt optimization
- Capability gaps (reasoning, math, code)
- Error analysis confirms model limitation
76
skills/phoenix-evals/references/fundamentals.md
Normal file
@@ -0,0 +1,76 @@
# Fundamentals

Application-specific tests for AI systems. Code first, LLM for nuance, human for truth.

## Evaluator Types

| Type | Speed | Cost | Use Case |
| ---- | ----- | ---- | -------- |
| **Code** | Fast | Cheap | Regex, JSON, format, exact match |
| **LLM** | Medium | Medium | Subjective quality, complex criteria |
| **Human** | Slow | Expensive | Ground truth, calibration |

**Decision:** Code first → LLM only when code can't capture criteria → Human for calibration.

## Score Structure

| Property | Required | Description |
| -------- | -------- | ----------- |
| `name` | Yes | Evaluator name |
| `kind` | Yes | `"code"`, `"llm"`, `"human"` |
| `score` | No* | 0-1 numeric |
| `label` | No* | `"pass"`, `"fail"` |
| `explanation` | No | Rationale |

*One of `score` or `label` required.
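Put together, a complete score record looks like this (the values are illustrative):

```python
# A full score record; at least one of "score" or "label" must be present
score = {
    "name": "has_citation",
    "kind": "code",          # "code", "llm", or "human"
    "score": 1.0,            # 0-1 numeric
    "label": "pass",         # "pass" or "fail"
    "explanation": "Output cites source [2].",
}
print(score["name"], score["label"])
```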
## Binary > Likert

Use pass/fail, not 1-5 scales. Clearer criteria, easier calibration.

```python
# Multiple binary checks instead of one Likert scale
evaluators = [
    AnswersQuestion(),   # Yes/No
    UsesContext(),       # Yes/No
    NoHallucination(),   # Yes/No
]
```

## Quick Patterns

### Code Evaluator

```python
import re

from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    return bool(re.search(r'\[\d+\]', output))
```

### LLM Evaluator

```python
from phoenix.evals import ClassificationEvaluator, LLM

evaluator = ClassificationEvaluator(
    name="helpfulness",
    prompt_template="...",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"not_helpful": 0, "helpful": 1},
)
```

### Run Experiment

```python
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[evaluator1, evaluator2],
)
print(experiment.aggregate_scores)
```
101
skills/phoenix-evals/references/observe-sampling-python.md
Normal file
@@ -0,0 +1,101 @@
# Observe: Sampling Strategies

How to efficiently sample production traces for review.

## Strategies

### 1. Failure-Focused (Highest Priority)

```python
errors = spans_df[spans_df["status_code"] == "ERROR"]
negative_feedback = spans_df[spans_df["feedback"] == "negative"]
```

### 2. Outliers

```python
long_responses = spans_df.nlargest(50, "response_length")
slow_responses = spans_df.nlargest(50, "latency_ms")
```

### 3. Stratified (Coverage)

```python
# Sample equally from each category
by_query_type = spans_df.groupby("metadata.query_type").apply(
    lambda x: x.sample(min(len(x), 20))
)
```

### 4. Metric-Guided

```python
# Review traces flagged by automated evaluators
flagged = spans_df[eval_results["label"] == "hallucinated"]
borderline = spans_df[(eval_results["score"] > 0.3) & (eval_results["score"] < 0.7)]
```
## Building a Review Queue

```python
import pandas as pd

def build_review_queue(spans_df, max_traces=100):
    queue = pd.concat([
        spans_df[spans_df["status_code"] == "ERROR"],
        spans_df[spans_df["feedback"] == "negative"],
        spans_df.nlargest(10, "response_length"),
        spans_df.sample(min(30, len(spans_df))),
    ]).drop_duplicates("span_id").head(max_traces)
    return queue
```

## Sample Size Guidelines

| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Error analysis | 100+ (until saturation) |
| Golden dataset | 100-500 |
| Judge calibration | 100+ per class |

**Saturation:** Stop when new traces show the same failure patterns.
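The saturation rule can be made concrete by tracking how many unseen failure modes each review batch introduces (a sketch; the batch sets of failure-mode names come from your own error analysis):

```python
def is_saturated(batches_of_failure_modes: list, window: int = 3) -> bool:
    # Saturated when the last `window` batches introduced no unseen failure mode
    seen = set()
    new_counts = []
    for batch in batches_of_failure_modes:
        new_counts.append(len(batch - seen))
        seen |= batch
    return len(new_counts) >= window and all(c == 0 for c in new_counts[-window:])

batches = [{"timezone"}, {"timezone", "format"}, {"format"}, {"timezone"}, set()]
print(is_saturated(batches))  # True
```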

## Trace-Level Sampling

When you need whole requests (all spans per trace), use `get_traces`:

```python
from phoenix.client import Client
from datetime import datetime, timedelta

client = Client()

# Recent traces with full span trees
traces = client.traces.get_traces(
    project_identifier="my-app",
    limit=100,
    include_spans=True,
)

# Time-windowed sampling (e.g., last hour)
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    limit=50,
    include_spans=True,
)

# Filter by session (multi-turn conversations)
traces = client.traces.get_traces(
    project_identifier="my-app",
    session_id="user-session-abc",
    include_spans=True,
)

# Sort by latency to find slowest requests
traces = client.traces.get_traces(
    project_identifier="my-app",
    sort="latency_ms",
    order="desc",
    limit=50,
)
```
147
skills/phoenix-evals/references/observe-sampling-typescript.md
Normal file
@@ -0,0 +1,147 @@
# Observe: Sampling Strategies (TypeScript)

How to efficiently sample production traces for review.

## Strategies

### 1. Failure-Focused (Highest Priority)

Use server-side filters to fetch only what you need:

```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";

// Server-side filter — only ERROR spans are returned
const { spans: errors } = await getSpans({
  project: { projectName: "my-project" },
  statusCode: "ERROR",
  limit: 100,
});

// Fetch only LLM spans
const { spans: llmSpans } = await getSpans({
  project: { projectName: "my-project" },
  spanKind: "LLM",
  limit: 100,
});

// Filter by span name
const { spans: chatSpans } = await getSpans({
  project: { projectName: "my-project" },
  name: "chat_completion",
  limit: 100,
});
```

### 2. Outliers

```typescript
const { spans } = await getSpans({
  project: { projectName: "my-project" },
  limit: 200,
});
const latency = (s: (typeof spans)[number]) =>
  new Date(s.end_time).getTime() - new Date(s.start_time).getTime();
const sorted = [...spans].sort((a, b) => latency(b) - latency(a));
const slowResponses = sorted.slice(0, 50);
```

### 3. Stratified (Coverage)

```typescript
// Sample equally from each category
function stratifiedSample<T>(
  items: T[],
  groupBy: (item: T) => string,
  perGroup: number,
): T[] {
  const groups = new Map<string, T[]>();
  for (const item of items) {
    const key = groupBy(item);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key)!.push(item);
  }
  return [...groups.values()].flatMap((g) => g.slice(0, perGroup));
}

const { spans } = await getSpans({
  project: { projectName: "my-project" },
  limit: 500,
});
const byQueryType = stratifiedSample(
  spans,
  (s) => s.attributes?.["metadata.query_type"] ?? "unknown",
  20,
);
```

### 4. Metric-Guided

```typescript
import { getSpanAnnotations } from "@arizeai/phoenix-client/spans";

// Fetch annotations for your spans, then filter by label
const { annotations } = await getSpanAnnotations({
  project: { projectName: "my-project" },
  spanIds: spans.map((s) => s.context.span_id),
  includeAnnotationNames: ["hallucination"],
});

const flaggedSpanIds = new Set(
  annotations.filter((a) => a.result?.label === "hallucinated").map((a) => a.span_id)
);
const flagged = spans.filter((s) => flaggedSpanIds.has(s.context.span_id));
```

## Trace-Level Sampling

When you need whole requests (all spans in a trace), use `getTraces`:

```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";

// Recent traces with full span trees
const { traces } = await getTraces({
  project: { projectName: "my-project" },
  limit: 100,
  includeSpans: true,
});

// Filter by session (e.g., multi-turn conversations)
const { traces: sessionTraces } = await getTraces({
  project: { projectName: "my-project" },
  sessionId: "user-session-abc",
  includeSpans: true,
});

// Time-windowed sampling
const { traces: recentTraces } = await getTraces({
  project: { projectName: "my-project" },
  startTime: new Date(Date.now() - 60 * 60 * 1000), // last hour
  limit: 50,
  includeSpans: true,
});
```

## Building a Review Queue

```typescript
// Combine server-side filters into a review queue
const { spans: errorSpans } = await getSpans({
  project: { projectName: "my-project" },
  statusCode: "ERROR",
  limit: 30,
});
const { spans: allSpans } = await getSpans({
  project: { projectName: "my-project" },
  limit: 100,
});
const random = allSpans.sort(() => Math.random() - 0.5).slice(0, 30);

const combined = [...errorSpans, ...random];
const unique = [...new Map(combined.map((s) => [s.context.span_id, s])).values()];
const reviewQueue = unique.slice(0, 100);
```

## Sample Size Guidelines

| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Error analysis | 100+ (until saturation) |
| Golden dataset | 100-500 |
| Judge calibration | 100+ per class |

**Saturation:** Stop when new traces show the same failure patterns.
144
skills/phoenix-evals/references/observe-tracing-setup.md
Normal file
@@ -0,0 +1,144 @@
# Observe: Tracing Setup

Configure tracing to capture data for evaluation.

## Quick Setup

```python
# Python
from phoenix.otel import register

register(project_name="my-app", auto_instrument=True)
```

```typescript
// TypeScript
import { registerPhoenix } from "@arizeai/phoenix-otel";

registerPhoenix({ projectName: "my-app", autoInstrument: true });
```

## Essential Attributes

| Attribute | Why It Matters |
| --------- | -------------- |
| `input.value` | User's request |
| `output.value` | Response to evaluate |
| `retrieval.documents` | Context for faithfulness |
| `tool.name`, `tool.parameters` | Agent evaluation |
| `llm.model_name` | Track by model |

## Custom Attributes for Evals

```python
span.set_attribute("metadata.client_type", "enterprise")
span.set_attribute("metadata.query_category", "billing")
```

## Exporting for Evaluation

### Spans (Python — DataFrame)

```python
from phoenix.client import Client

# Client() works for local Phoenix (falls back to env vars or localhost:6006)
# For remote/cloud: Client(base_url="https://app.phoenix.arize.com", api_key="...")
client = Client()
spans_df = client.spans.get_spans_dataframe(
    project_identifier="my-app",  # NOT project_name= (deprecated)
    root_spans_only=True,
)

dataset = client.datasets.create_dataset(
    name="error-analysis-set",
    dataframe=spans_df[["input.value", "output.value"]],
    input_keys=["input.value"],
    output_keys=["output.value"],
)
```

### Spans (TypeScript)

```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";

const { spans } = await getSpans({
  project: { projectName: "my-app" },
  parentId: null, // root spans only
  limit: 100,
});
```

### Traces (Python — structured)

Use `get_traces` when you need full trace trees (e.g., multi-turn conversations, agent workflows):

```python
from datetime import datetime, timedelta

traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=24),
    include_spans=True,  # includes all spans per trace
    limit=100,
)
# Each trace has: trace_id, start_time, end_time, spans (when include_spans=True)
```

### Traces (TypeScript)

```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";

const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});
```

## Uploading Evaluations as Annotations

### Python

```python
from phoenix.evals import evaluate_dataframe
from phoenix.evals.utils import to_annotation_dataframe

# Run evaluations
results_df = evaluate_dataframe(dataframe=spans_df, evaluators=[my_eval])

# Format results for Phoenix annotations
annotations_df = to_annotation_dataframe(results_df)

# Upload to Phoenix
client.spans.log_span_annotations_dataframe(dataframe=annotations_df)
```

### TypeScript

```typescript
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";

await logSpanAnnotations({
  spanAnnotations: [
    {
      spanId: "abc123",
      name: "quality",
      label: "good",
      score: 0.95,
      annotatorKind: "LLM",
    },
  ],
});
```

Annotations are visible in the Phoenix UI alongside your traces.

## Verify

- Required attributes: `input.value`, `output.value`, `status_code`
- For RAG: `retrieval.documents`
- For agents: `tool.name`, `tool.parameters`
137
skills/phoenix-evals/references/production-continuous.md
Normal file
@@ -0,0 +1,137 @@
# Production: Continuous Evaluation

Capability vs regression evals and the ongoing feedback loop.

## Two Types of Evals

| Type | Pass Rate Target | Purpose | Update |
| ---- | ---------------- | ------- | ------ |
| **Capability** | 50-80% | Measure improvement | Add harder cases |
| **Regression** | 95-100% | Catch breakage | Add fixed bugs |

## Saturation

When capability evals hit >95% pass rate, they're saturated:

1. Graduate passing cases to regression suite
2. Add new challenging cases to capability suite

## Feedback Loop

```
Production → Sample traffic → Run evaluators → Find failures
     ↑                                              ↓
  Deploy ← Run CI evals ← Create test cases ← Error analysis
```

## Implementation

Build a continuous monitoring loop:

1. **Sample recent traces** at regular intervals (e.g., 100 traces per hour)
2. **Run evaluators** on sampled traces
3. **Log results** to Phoenix for tracking
4. **Queue concerning results** for human review
5. **Create test cases** from recurring failure patterns

### Python

```python
from phoenix.client import Client
from datetime import datetime, timedelta

client = Client()

# 1. Sample recent spans (includes full attributes for evaluation)
spans_df = client.spans.get_spans_dataframe(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    root_spans_only=True,
    limit=100,
)

# 2. Run evaluators
from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(
    dataframe=spans_df,
    evaluators=[quality_eval, safety_eval],
)

# 3. Upload results as annotations
from phoenix.evals.utils import to_annotation_dataframe

annotations_df = to_annotation_dataframe(results_df)
client.spans.log_span_annotations_dataframe(dataframe=annotations_df)
```

### TypeScript

```typescript
import { getSpans, logSpanAnnotations } from "@arizeai/phoenix-client/spans";

// 1. Sample recent spans
const { spans } = await getSpans({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  parentId: null, // root spans only
  limit: 100,
});

// 2. Run evaluators (user-defined)
const results = await Promise.all(
  spans.map(async (span) => ({
    spanId: span.context.span_id,
    ...(await runEvaluators(span, [qualityEval, safetyEval])),
  }))
);

// 3. Upload results as annotations
await logSpanAnnotations({
  spanAnnotations: results.map((r) => ({
    spanId: r.spanId,
    name: "quality",
    score: r.qualityScore,
    label: r.qualityLabel,
    annotatorKind: "LLM" as const,
  })),
});
```

For trace-level monitoring (e.g., agent workflows), use `get_traces`/`getTraces` to identify traces:

```python
# Python: identify slow traces
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    sort="latency_ms",
    order="desc",
    limit=50,
)
```

```typescript
// TypeScript: identify slow traces
import { getTraces } from "@arizeai/phoenix-client/traces";

const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  limit: 50,
});
```

## Alerting

| Condition | Severity | Action |
| --------- | -------- | ------ |
| Regression < 98% | Critical | Page oncall |
| Capability declining | Warning | Slack notify |
| Capability > 95% for 7d | Info | Schedule review |
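The first two rows can be wired up as a simple threshold check (a sketch; `page_oncall` and `notify_slack` are hypothetical hooks you would plug in):

```python
def route_alerts(regression_pass_rate: float, capability_trend: float) -> list:
    # capability_trend: change in capability pass rate vs. the previous window
    alerts = []
    if regression_pass_rate < 0.98:
        alerts.append("critical: page oncall")   # e.g. page_oncall()
    if capability_trend < 0:
        alerts.append("warning: notify slack")   # e.g. notify_slack()
    return alerts

print(route_alerts(0.97, -0.02))  # ['critical: page oncall', 'warning: notify slack']
```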

## Key Principles

- **Two suites** - Capability + Regression always
- **Graduate cases** - Move consistent passes to regression
- **Track trends** - Monitor over time, not just snapshots
53
skills/phoenix-evals/references/production-guardrails.md
Normal file
@@ -0,0 +1,53 @@
# Production: Guardrails vs Evaluators

Guardrails block in real-time. Evaluators measure asynchronously.

## Key Distinction

```
Request → [INPUT GUARDRAIL] → LLM → [OUTPUT GUARDRAIL] → Response
                                            │
                                            └──→ ASYNC EVALUATOR (background)
```

## Guardrails

| Aspect | Requirement |
| ------ | ----------- |
| Timing | Synchronous, blocking |
| Latency | < 100ms |
| Purpose | Prevent harm |
| Type | Code-based (deterministic) |

**Use for:** PII detection, prompt injection, profanity, length limits, format validation.
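A code-based input guardrail of this kind is a fast, deterministic check. A minimal sketch (the patterns, names, and `max_length` are illustrative only; real PII detection needs a dedicated library):

```python
import re

# Illustrative patterns only, not a production PII detector
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US SSN-like
    re.compile(r"\b\d{16}\b"),              # bare card-number-like
]

def input_guardrail(text: str, max_length: int = 4000):
    """Return (allowed, reason). Synchronous and cheap: regex + length only."""
    if len(text) > max_length:
        return False, "too_long"
    if any(p.search(text) for p in PII_PATTERNS):
        return False, "pii_detected"
    return True, "ok"

print(input_guardrail("My SSN is 123-45-6789"))  # (False, 'pii_detected')
```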

## Evaluators

| Aspect | Characteristic |
| ------ | -------------- |
| Timing | Async, background |
| Latency | Can be seconds |
| Purpose | Measure quality |
| Type | Can use LLMs |

**Use for:** Helpfulness, faithfulness, tone, completeness, citation accuracy.

## Decision

| Question | Answer |
| -------- | ------ |
| Must block harmful content? | Guardrail |
| Measuring quality? | Evaluator |
| Need LLM judgment? | Evaluator |
| < 100ms required? | Guardrail |
| False positives = angry users? | Evaluator |

## LLM Guardrails: Rarely

Only use LLM guardrails if:

- Latency budget > 1s
- Error cost >> LLM cost
- Low volume
- Fallback exists

**Key Principle:** Guardrails prevent harm (block). Evaluators measure quality (log).
92
skills/phoenix-evals/references/production-overview.md
Normal file
@@ -0,0 +1,92 @@
# Production: Overview

CI/CD evals vs production monitoring - complementary approaches.

## Two Evaluation Modes

| Aspect | CI/CD Evals | Production Monitoring |
| ------ | ----------- | --------------------- |
| **When** | Pre-deployment | Post-deployment, ongoing |
| **Data** | Fixed dataset | Sampled traffic |
| **Goal** | Prevent regression | Detect drift |
| **Response** | Block deploy | Alert & analyze |

## CI/CD Evaluations

```python
# Fast, deterministic checks
ci_evaluators = [
    has_required_format,
    no_pii_leak,
    safety_check,
    regression_test_suite,
]

# Small but representative dataset (~100 examples)
run_experiment(ci_dataset, task, ci_evaluators)
```

Set thresholds: regression=0.95, safety=1.0, format=0.98.
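A CI gate over those thresholds can be sketched as follows (assuming per-evaluator pass rates are available as a dict; `ci_gate` is a hypothetical helper):

```python
THRESHOLDS = {"regression": 0.95, "safety": 1.0, "format": 0.98}

def ci_gate(pass_rates: dict) -> bool:
    # Block the deploy if any suite falls below its threshold
    failed = {k: v for k, v in pass_rates.items() if v < THRESHOLDS.get(k, 0.0)}
    if failed:
        print(f"blocking deploy: {failed}")
        return False
    return True

ok = ci_gate({"regression": 0.97, "safety": 1.0, "format": 0.99})
print(ok)  # True
```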
|
||||
## Production Monitoring
|
||||
|
||||
### Python
|
||||
|
||||
```python
|
||||
from phoenix.client import Client
|
||||
from datetime import datetime, timedelta
|
||||
|
||||
client = Client()
|
||||
|
||||
# Sample recent traces (last hour)
|
||||
traces = client.traces.get_traces(
|
||||
project_identifier="my-app",
|
||||
start_time=datetime.now() - timedelta(hours=1),
|
||||
include_spans=True,
|
||||
limit=100,
|
||||
)
|
||||
|
||||
# Run evaluators on sampled traffic
|
||||
for trace in traces:
|
||||
results = run_evaluators_async(trace, production_evaluators)
|
||||
if any(r["score"] < 0.5 for r in results):
|
||||
alert_on_failure(trace, results)
|
||||
```
|
||||
|
||||
### TypeScript
|
||||
|
||||
```typescript
|
||||
import { getTraces } from "@arizeai/phoenix-client/traces";
|
||||
import { getSpans } from "@arizeai/phoenix-client/spans";
|
||||
|
||||
// Sample recent traces (last hour)
|
||||
const { traces } = await getTraces({
|
||||
project: { projectName: "my-app" },
|
||||
startTime: new Date(Date.now() - 60 * 60 * 1000),
|
||||
includeSpans: true,
|
||||
limit: 100,
|
||||
});
|
||||
|
||||
// Or sample spans directly for evaluation
|
||||
const { spans } = await getSpans({
|
||||
project: { projectName: "my-app" },
|
||||
startTime: new Date(Date.now() - 60 * 60 * 1000),
|
||||
limit: 100,
|
||||
});
|
||||
|
||||
// Run evaluators on sampled traffic
|
||||
for (const span of spans) {
|
||||
const results = await runEvaluators(span, productionEvaluators);
|
||||
if (results.some((r) => r.score < 0.5)) {
|
||||
await alertOnFailure(span, results);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Prioritize: errors → negative feedback → random sample.
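The priority order above can be sketched as a simple sampler. A hedged example assuming a hypothetical span shape (dicts with `status` and `feedback` keys) and helper name:

```python
import random

# Hypothetical span shape: dicts with "status" and "feedback" keys.
def sample_for_eval(spans, budget=100, seed=0):
    """Pick spans to evaluate: errors first, then negative feedback, then random."""
    errors = [s for s in spans if s.get("status") == "ERROR"]
    negative = [s for s in spans
                if s.get("feedback") == "thumbs_down" and s not in errors]
    rest = [s for s in spans if s not in errors and s not in negative]
    random.Random(seed).shuffle(rest)  # seeded for reproducible sampling
    return (errors + negative + rest)[:budget]
```

With a small budget, errors and negative-feedback spans are always evaluated; random traffic fills whatever budget remains.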
## Feedback Loop

```
Production finds failure → Error analysis → Add to CI dataset → Prevents future regression
```
64
skills/phoenix-evals/references/setup-python.md
Normal file
@@ -0,0 +1,64 @@
# Setup: Python

Packages required for Phoenix evals and experiments.

## Installation

```bash
# Core Phoenix package (includes client, evals, otel)
pip install arize-phoenix

# Or install individual packages
pip install arize-phoenix-client  # Phoenix client only
pip install arize-phoenix-evals   # Evaluation utilities
pip install arize-phoenix-otel    # OpenTelemetry integration
```

## LLM Providers

For LLM-as-judge evaluators, install your provider's SDK:

```bash
pip install openai               # OpenAI
pip install anthropic            # Anthropic
pip install google-generativeai  # Google
```

## Validation (Optional)

```bash
pip install scikit-learn  # For TPR/TNR metrics
```

## Quick Verify

```python
from phoenix.client import Client
from phoenix.evals import LLM, ClassificationEvaluator
from phoenix.otel import register

# All imports should work
print("Phoenix Python setup complete")
```

## Key Imports (Evals 2.0)

```python
from phoenix.client import Client
from phoenix.evals import (
    ClassificationEvaluator,   # LLM classification evaluator (preferred)
    LLM,                       # Provider-agnostic LLM wrapper
    async_evaluate_dataframe,  # Batch evaluate a DataFrame (preferred, async)
    evaluate_dataframe,        # Batch evaluate a DataFrame (sync)
    create_evaluator,          # Decorator for code-based evaluators
    create_classifier,         # Factory for LLM classification evaluators
    bind_evaluator,            # Map column names to evaluator params
    Score,                     # Score dataclass
)
from phoenix.evals.utils import to_annotation_dataframe  # Format results for Phoenix annotations
```

**Prefer**: `ClassificationEvaluator` over `create_classifier` (more parameters/customization).
**Prefer**: `async_evaluate_dataframe` over `evaluate_dataframe` (better throughput for LLM evals).

**Do NOT use** legacy 1.0 imports: `OpenAIModel`, `AnthropicModel`, `run_evals`, `llm_classify`.
41
skills/phoenix-evals/references/setup-typescript.md
Normal file
@@ -0,0 +1,41 @@
# Setup: TypeScript

Packages required for Phoenix evals and experiments.
## Installation

```bash
# Using npm
npm install @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel

# Using pnpm
pnpm add @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel
```

## LLM Providers

For LLM-as-judge evaluators, install Vercel AI SDK providers:

```bash
npm install ai @ai-sdk/openai  # Vercel AI SDK + OpenAI
npm install @ai-sdk/anthropic  # Anthropic
npm install @ai-sdk/google     # Google
```

Or use direct provider SDKs:

```bash
npm install openai             # OpenAI direct
npm install @anthropic-ai/sdk  # Anthropic direct
```

## Quick Verify

```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { registerPhoenix } from "@arizeai/phoenix-otel";

// All imports should work
console.log("Phoenix TypeScript setup complete");
```
@@ -0,0 +1,43 @@
# Validating Evaluators (Python)

Validate LLM evaluators against human-labeled examples. Target >80% TPR/TNR/Accuracy.

## Calculate Metrics

```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(human_labels, evaluator_predictions))

cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
print(f"TPR: {tpr:.2f}, TNR: {tnr:.2f}")
```

## Correct Production Estimates

```python
def correct_estimate(observed, tpr, tnr):
    """Adjust observed pass rate using known TPR/TNR."""
    return (observed - (1 - tnr)) / (tpr - (1 - tnr))
```
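A quick worked example (repeating the helper so the snippet runs standalone): with TPR 0.90 and TNR 0.80, an observed 85% pass rate implies a corrected true pass rate of roughly 93%.

```python
def correct_estimate(observed, tpr, tnr):
    """Adjust observed pass rate using known TPR/TNR (as above)."""
    return (observed - (1 - tnr)) / (tpr - (1 - tnr))

# (0.85 - 0.20) / (0.90 - 0.20) = 0.65 / 0.70
true_rate = correct_estimate(observed=0.85, tpr=0.90, tnr=0.80)
print(round(true_rate, 3))  # 0.929
```

A perfect evaluator (TPR = TNR = 1.0) leaves the observed rate unchanged; the further TPR and TNR fall below 1.0, the larger the correction.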
## Find Misclassified

```python
# False Positives: Evaluator pass, human fail
fp_mask = (evaluator_predictions == 1) & (human_labels == 0)
false_positives = dataset[fp_mask]

# False Negatives: Evaluator fail, human pass
fn_mask = (evaluator_predictions == 0) & (human_labels == 1)
false_negatives = dataset[fn_mask]
```

## Red Flags

- TPR or TNR < 70%
- Large gap between TPR and TNR
- Kappa < 0.6
@@ -0,0 +1,106 @@
# Validating Evaluators (TypeScript)

Validate an LLM evaluator against human-labeled examples before deploying it.
Target: **>80% TPR and >80% TNR**.

Roles are inverted compared to a normal task experiment:

| Normal experiment | Evaluator validation |
|---|---|
| Task = agent logic | Task = run the evaluator under test |
| Evaluator = judge output | Evaluator = exact-match vs human ground truth |
| Dataset = agent examples | Dataset = golden hand-labeled examples |

## Golden Dataset

Use a separate dataset name so validation experiments don't mix with task experiments in Phoenix.
Store human ground truth in `metadata.groundTruthLabel`. Aim for ~50/50 balance:

```typescript
import type { Example } from "@arizeai/phoenix-client/types/datasets";

const goldenExamples: Example[] = [
  { input: { q: "Capital of France?" }, output: { answer: "Paris" }, metadata: { groundTruthLabel: "correct" } },
  { input: { q: "Capital of France?" }, output: { answer: "Lyon" }, metadata: { groundTruthLabel: "incorrect" } },
  { input: { q: "Capital of France?" }, output: { answer: "Major city..." }, metadata: { groundTruthLabel: "incorrect" } },
];

const VALIDATOR_DATASET = "my-app-qa-evaluator-validation"; // separate from task dataset
const POSITIVE_LABEL = "correct";
const NEGATIVE_LABEL = "incorrect";
```

## Validation Experiment

```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createOrGetDataset, getDatasetExamples } from "@arizeai/phoenix-client/datasets";
import { asExperimentEvaluator, runExperiment } from "@arizeai/phoenix-client/experiments";
import { myEvaluator } from "./myEvaluator.js";

const client = createClient();

const { datasetId } = await createOrGetDataset({ client, name: VALIDATOR_DATASET, examples: goldenExamples });
const { examples } = await getDatasetExamples({ client, dataset: { datasetId } });
const groundTruth = new Map(examples.map((ex) => [ex.id, ex.metadata?.groundTruthLabel as string]));

// Task: invoke the evaluator under test
const task = async (example: (typeof examples)[number]) => {
  const result = await myEvaluator.evaluate({ input: example.input, output: example.output, metadata: example.metadata });
  return result.label ?? "unknown";
};

// Evaluator: exact-match against human ground truth
const exactMatch = asExperimentEvaluator({
  name: "exact-match", kind: "CODE",
  evaluate: ({ output, metadata }) => {
    const expected = metadata?.groundTruthLabel as string;
    const predicted = typeof output === "string" ? output : "unknown";
    return { score: predicted === expected ? 1 : 0, label: predicted, explanation: `Expected: ${expected}, Got: ${predicted}` };
  },
});

const experiment = await runExperiment({
  client, experimentName: `evaluator-validation-${Date.now()}`,
  dataset: { datasetId }, task, evaluators: [exactMatch],
});

// Compute confusion matrix
const runs = Object.values(experiment.runs);
const predicted = new Map((experiment.evaluationRuns ?? [])
  .filter((e) => e.name === "exact-match")
  .map((e) => [e.experimentRunId, e.result?.label ?? null]));

let tp = 0, fp = 0, tn = 0, fn = 0;
for (const run of runs) {
  if (run.error) continue;
  const p = predicted.get(run.id), a = groundTruth.get(run.datasetExampleId);
  if (!p || !a) continue;
  if (a === POSITIVE_LABEL && p === POSITIVE_LABEL) tp++;
  else if (a === NEGATIVE_LABEL && p === POSITIVE_LABEL) fp++;
  else if (a === NEGATIVE_LABEL && p === NEGATIVE_LABEL) tn++;
  else if (a === POSITIVE_LABEL && p === NEGATIVE_LABEL) fn++;
}
const total = tp + fp + tn + fn;
const tpr = tp + fn > 0 ? (tp / (tp + fn)) * 100 : 0;
const tnr = tn + fp > 0 ? (tn / (tn + fp)) * 100 : 0;
console.log(`TPR: ${tpr.toFixed(1)}%  TNR: ${tnr.toFixed(1)}%  Accuracy: ${((tp + tn) / total * 100).toFixed(1)}%`);
```

## Results & Quality Rules

| Metric | Target | Low value means |
|---|---|---|
| TPR (sensitivity) | >80% | Misses real failures (false negatives) |
| TNR (specificity) | >80% | Flags good outputs (false positives) |
| Accuracy | >80% | General weakness |

**Golden dataset rules:** ~50/50 balance · include edge cases · human-labeled only · never mutate (append new versions) · 20–50 examples is enough.

**Re-validate when:** prompt template changes · judge model changes · criteria updated · production FP/FN spike.

## See Also

- `validation.md` — Metric definitions and concepts
- `experiments-running-typescript.md` — `runExperiment` API
- `experiments-datasets-typescript.md` — `createOrGetDataset` / `getDatasetExamples`
74
skills/phoenix-evals/references/validation.md
Normal file
@@ -0,0 +1,74 @@
# Validation

Validate LLM judges against human labels before deploying. Target >80% agreement.

## Requirements

| Requirement | Target |
| ----------- | ------ |
| Test set size | 100+ examples |
| Balance | ~50/50 pass/fail |
| Accuracy | >80% |
| TPR/TNR | Both >70% |

## Metrics

| Metric | Formula | Use When |
| ------ | ------- | -------- |
| **Accuracy** | (TP+TN) / Total | General |
| **TPR (Recall)** | TP / (TP+FN) | Quality assurance |
| **TNR (Specificity)** | TN / (TN+FP) | Safety-critical |
| **Cohen's Kappa** | Agreement beyond chance | Comparing evaluators |

## Quick Validation

```python
from sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score

print(classification_report(human_labels, evaluator_predictions))
print(f"Kappa: {cohen_kappa_score(human_labels, evaluator_predictions):.3f}")

# Get TPR/TNR
cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
```

## Golden Dataset Structure

```python
golden_example = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital.",
    "ground_truth_label": "correct",
}
```

## Building Golden Datasets

1. Sample production traces (errors, negative feedback, edge cases)
2. Balance ~50/50 pass/fail
3. Have an expert label each example
4. Version datasets (never modify existing)

```python
# GOOD - create a new version
golden_v2 = golden_v1 + [new_example]

# BAD - never modify an existing version
golden_v1.append(new_example)
```

## Warning Signs

- All pass or all fail → too lenient/strict
- Random results → criteria unclear
- TPR/TNR < 70% → needs improvement

## Re-Validate When

- Prompt template changes
- Judge model changes
- Criteria change
- Monthly