mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-11 18:55:55 +00:00
4.2 KiB
4.2 KiB
Error Analysis
Review traces to discover failure modes before building evaluators.
Process
- Sample - 100+ traces (errors, negative feedback, random)
- Open Code - Write free-form notes per trace
- Axial Code - Group notes into failure categories
- Quantify - Count failures per category
- Prioritize - Rank by frequency × severity
Sample Traces
Span-level sampling (Python — DataFrame)
from phoenix.client import Client
# Client() works for local Phoenix (falls back to env vars or localhost:6006)
# For remote/cloud: Client(base_url="https://app.phoenix.arize.com", api_key="...")
client = Client()
spans_df = client.spans.get_spans_dataframe(project_identifier="my-app")
# Build representative sample
sample = pd.concat([
spans_df[spans_df["status_code"] == "ERROR"].sample(30),
spans_df[spans_df["feedback"] == "negative"].sample(30),
spans_df.sample(40),
]).drop_duplicates("span_id").head(100)
Span-level sampling (TypeScript)
import { getSpans } from "@arizeai/phoenix-client/spans";
const { spans: errors } = await getSpans({
project: { projectName: "my-app" },
statusCode: "ERROR",
limit: 30,
});
const { spans: allSpans } = await getSpans({
project: { projectName: "my-app" },
limit: 70,
});
const sample = [...errors, ...allSpans.sort(() => Math.random() - 0.5).slice(0, 40)];
const unique = [...new Map(sample.map((s) => [s.context.span_id, s])).values()].slice(0, 100);
Trace-level sampling (Python)
When errors span multiple spans (e.g., agent workflows), sample whole traces:
from datetime import datetime, timedelta
traces = client.traces.get_traces(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=24),
include_spans=True,
sort="latency_ms",
order="desc",
limit=100,
)
# Each trace has: trace_id, start_time, end_time, spans
Trace-level sampling (TypeScript)
import { getTraces } from "@arizeai/phoenix-client/traces";
const { traces } = await getTraces({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
includeSpans: true,
limit: 100,
});
Add Notes (Python)
client.spans.add_span_note(
span_id="abc123",
note="wrong timezone - said 3pm EST but user is PST"
)
Add Notes (TypeScript)
import { addSpanNote } from "@arizeai/phoenix-client/spans";
await addSpanNote({
spanNote: {
spanId: "abc123",
note: "wrong timezone - said 3pm EST but user is PST"
}
});
What to Note
| Type | Examples |
|---|---|
| Factual errors | Wrong dates, prices, made-up features |
| Missing info | Didn't answer question, omitted details |
| Tone issues | Too casual/formal for context |
| Tool issues | Wrong tool, wrong parameters |
| Retrieval | Wrong docs, missing relevant docs |
Good Notes
BAD: "Response is bad"
GOOD: "Response says ships in 2 days but policy is 5-7 days"
Group into Categories
categories = {
"factual_inaccuracy": ["wrong shipping time", "incorrect price"],
"hallucination": ["made up a discount", "invented feature"],
"tone_mismatch": ["informal for enterprise client"],
}
# Priority = Frequency × Severity
Retrieve Existing Annotations
Python
# From a spans DataFrame
annotations_df = client.spans.get_span_annotations_dataframe(
spans_dataframe=sample,
project_identifier="my-app",
include_annotation_names=["quality", "correctness"],
)
# annotations_df has: span_id (index), name, label, score, explanation
# Or from specific span IDs
annotations_df = client.spans.get_span_annotations_dataframe(
span_ids=["span-id-1", "span-id-2"],
project_identifier="my-app",
)
TypeScript
import { getSpanAnnotations } from "@arizeai/phoenix-client/spans";
const { annotations } = await getSpanAnnotations({
project: { projectName: "my-app" },
spanIds: ["span-id-1", "span-id-2"],
includeAnnotationNames: ["quality", "correctness"],
});
for (const ann of annotations) {
console.log(`${ann.span_id}: ${ann.name} = ${ann.result?.label} (${ann.result?.score})`);
}
Saturation
Stop when new traces reveal no new failure modes. Minimum: 100 traces.