mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-12 19:25:55 +00:00
3.6 KiB
3.6 KiB
Production: Continuous Evaluation
Capability vs regression evals and the ongoing feedback loop.
Two Types of Evals
| Type | Pass Rate Target | Purpose | Update |
|---|---|---|---|
| Capability | 50-80% | Measure improvement | Add harder cases |
| Regression | 95-100% | Catch breakage | Add fixed bugs |
Saturation
When capability evals hit >95% pass rate, they're saturated:
- Graduate passing cases to regression suite
- Add new challenging cases to capability suite
Feedback Loop
Production → Sample traffic → Run evaluators → Find failures
↑ ↓
Deploy ← Run CI evals ← Create test cases ← Error analysis
Implementation
Build a continuous monitoring loop:
- Sample recent traces at regular intervals (e.g., 100 traces per hour)
- Run evaluators on sampled traces
- Log results to Phoenix for tracking
- Queue concerning results for human review
- Create test cases from recurring failure patterns
Python
from phoenix.client import Client
from datetime import datetime, timedelta
client = Client()
# 1. Sample recent spans (includes full attributes for evaluation)
spans_df = client.spans.get_spans_dataframe(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=1),
root_spans_only=True,
limit=100,
)
# 2. Run evaluators
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(
dataframe=spans_df,
evaluators=[quality_eval, safety_eval],
)
# 3. Upload results as annotations
from phoenix.evals.utils import to_annotation_dataframe
annotations_df = to_annotation_dataframe(results_df)
client.spans.log_span_annotations_dataframe(dataframe=annotations_df)
TypeScript
import { getSpans } from "@arizeai/phoenix-client/spans";
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";
// 1. Sample recent spans
const { spans } = await getSpans({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 60 * 60 * 1000),
parentId: null, // root spans only
limit: 100,
});
// 2. Run evaluators (user-defined)
const results = await Promise.all(
spans.map(async (span) => ({
spanId: span.context.span_id,
...await runEvaluators(span, [qualityEval, safetyEval]),
}))
);
// 3. Upload results as annotations
await logSpanAnnotations({
spanAnnotations: results.map((r) => ({
spanId: r.spanId,
name: "quality",
score: r.qualityScore,
label: r.qualityLabel,
annotatorKind: "LLM" as const,
})),
});
For trace-level monitoring (e.g., agent workflows), use get_traces/getTraces to identify traces:
# Python: identify slow traces
traces = client.traces.get_traces(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=1),
sort="latency_ms",
order="desc",
limit=50,
)
// TypeScript: identify slow traces
import { getTraces } from "@arizeai/phoenix-client/traces";
const { traces } = await getTraces({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 60 * 60 * 1000),
limit: 50,
});
Alerting
| Condition | Severity | Action |
|---|---|---|
| Regression < 98% | Critical | Page oncall |
| Capability declining | Warning | Slack notify |
| Capability > 95% for 7d | Info | Schedule review |
Key Principles
- Two suites - Capability + Regression always
- Graduate cases - Move consistent passes to regression
- Track trends - Monitor over time, not just snapshots