mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-12 11:15:56 +00:00
# Production: Overview
CI/CD evals and production monitoring are complementary approaches to evaluating the same application.
## Two Evaluation Modes
| Aspect | CI/CD Evals | Production Monitoring |
|---|---|---|
| When | Pre-deployment | Post-deployment, ongoing |
| Data | Fixed dataset | Sampled traffic |
| Goal | Prevent regression | Detect drift |
| Response | Block deploy | Alert & analyze |
## CI/CD Evaluations
```python
# Fast, deterministic checks
ci_evaluators = [
    has_required_format,
    no_pii_leak,
    safety_check,
    regression_test_suite,
]

# Small but representative dataset (~100 examples)
run_experiment(ci_dataset, task, ci_evaluators)
```
Set thresholds: regression=0.95, safety=1.0, format=0.98.
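The threshold check itself can be a short gate in the CI script. A minimal sketch, assuming the experiment run yields a dict of per-evaluator mean scores; `check_thresholds` and `gate` are hypothetical helpers, not a Phoenix API:

```python
import sys

# Thresholds from above: regression=0.95, safety=1.0, format=0.98
THRESHOLDS = {
    "regression_test_suite": 0.95,
    "safety_check": 1.0,
    "has_required_format": 0.98,
}

def check_thresholds(scores: dict) -> list:
    """Return names of evaluators whose mean score fell below its threshold."""
    return [
        name
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    ]

def gate(scores: dict) -> None:
    """Fail the CI job (nonzero exit) if any evaluator is below threshold."""
    failures = check_thresholds(scores)
    if failures:
        print(f"Blocking deploy; below threshold: {failures}")
        sys.exit(1)
```

Exiting nonzero is what actually blocks the deploy in most CI systems; the score dict's shape depends on how you aggregate experiment results.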
## Production Monitoring
```python
from phoenix.client import Client
from datetime import datetime, timedelta

client = Client()

# Sample recent traces (last hour)
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    include_spans=True,
    limit=100,
)

# Run evaluators on sampled traffic
for trace in traces:
    results = run_evaluators_async(trace, production_evaluators)
    if any(r["score"] < 0.5 for r in results):
        alert_on_failure(trace, results)
```
```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
import { getSpans } from "@arizeai/phoenix-client/spans";

// Sample recent traces (last hour)
const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});

// Or sample spans directly for evaluation
const { spans } = await getSpans({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  limit: 100,
});

// Run evaluators on sampled traffic
for (const span of spans) {
  const results = await runEvaluators(span, productionEvaluators);
  if (results.some((r) => r.score < 0.5)) {
    await alertOnFailure(span, results);
  }
}
```
Prioritize which traces to evaluate: errors → negative feedback → random sample.
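That ordering can be expressed as a simple partition before spending evaluator budget. A sketch, assuming each sampled trace is a dict with hypothetical `has_error` and `feedback_score` fields (your trace schema will differ):

```python
import random

def prioritize(traces: list, budget: int) -> list:
    """Order traces for evaluation: errors first, then negative feedback,
    then a random sample of the rest, truncated to the evaluation budget."""
    errors = [t for t in traces if t.get("has_error")]
    negative = [
        t for t in traces
        if not t.get("has_error") and (t.get("feedback_score") or 0) < 0
    ]
    rest = [t for t in traces if t not in errors and t not in negative]
    random.shuffle(rest)  # unbiased sample of everything else
    return (errors + negative + rest)[:budget]
```

The point is that a fixed budget (e.g. 100 LLM-as-judge calls per hour) should go to the traces most likely to reveal failures before covering random traffic.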
## Feedback Loop
Production finds failure → Error analysis → Add to CI dataset → Prevents future regression
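The "add to CI dataset" step can be automated once error analysis confirms a failure: append the trace's input (and the behavior it should have produced) to the CI dataset so every future deploy is tested against it. A minimal sketch using a local JSONL file; the function name and field names are assumptions, not a Phoenix API:

```python
import json
from pathlib import Path

def add_to_ci_dataset(trace: dict, path: str = "ci_dataset.jsonl") -> None:
    """Append a confirmed production failure as a new CI regression example."""
    example = {
        "input": trace["input"],
        # Filled in manually during error analysis; None until triaged.
        "expected_behavior": trace.get("corrected_output"),
        "source": "production",
        "trace_id": trace.get("trace_id"),
    }
    with Path(path).open("a") as f:
        f.write(json.dumps(example) + "\n")
```

Keeping the originating `trace_id` on each example preserves the link back to the production failure that motivated it.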