
Production: Overview

CI/CD evals and production monitoring are complementary approaches: one gates deployments, the other watches live traffic.

Two Evaluation Modes

| Aspect   | CI/CD Evals        | Production Monitoring    |
|----------|--------------------|--------------------------|
| When     | Pre-deployment     | Post-deployment, ongoing |
| Data     | Fixed dataset      | Sampled traffic          |
| Goal     | Prevent regression | Detect drift             |
| Response | Block deploy       | Alert & analyze          |

CI/CD Evaluations

from phoenix.experiments import run_experiment

# Fast, deterministic checks (user-defined evaluator functions)
ci_evaluators = [
    has_required_format,
    no_pii_leak,
    safety_check,
    regression_test_suite,
]

# Small but representative dataset (~100 examples)
run_experiment(ci_dataset, task, evaluators=ci_evaluators)

Set thresholds: regression=0.95, safety=1.0, format=0.98.
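A minimal sketch of how those thresholds might gate a deploy. The `gate` helper and the check names are illustrative, not part of the Phoenix API; it assumes each check produces an average score between 0 and 1.

```python
# Per-check score thresholds from the guidance above (illustrative names).
THRESHOLDS = {"regression": 0.95, "safety": 1.0, "format": 0.98}


def gate(scores: dict) -> list:
    """Return the names of checks whose score falls below its threshold."""
    return [name for name, bar in THRESHOLDS.items() if scores.get(name, 0.0) < bar]


# Any returned check name would block the deploy.
failing = gate({"regression": 0.97, "safety": 1.0, "format": 0.96})
```

In CI, a non-empty `failing` list would translate to a non-zero exit code so the pipeline stops before deployment.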

Production Monitoring

Python

import asyncio
from datetime import datetime, timedelta, timezone

from phoenix.client import Client

client = Client()

# Sample recent traces (last hour); use a timezone-aware timestamp
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now(timezone.utc) - timedelta(hours=1),
    include_spans=True,
    limit=100,
)

# Run evaluators on sampled traffic
for trace in traces:
    results = asyncio.run(run_evaluators_async(trace, production_evaluators))
    if any(r["score"] < 0.5 for r in results):
        alert_on_failure(trace, results)

TypeScript

import { getTraces } from "@arizeai/phoenix-client/traces";
import { getSpans } from "@arizeai/phoenix-client/spans";

// Sample recent traces (last hour)
const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});

// Or sample spans directly for evaluation
const { spans } = await getSpans({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  limit: 100,
});

// Run evaluators on sampled traffic
for (const span of spans) {
  const results = await runEvaluators(span, productionEvaluators);
  if (results.some((r) => r.score < 0.5)) {
    await alertOnFailure(span, results);
  }
}

Prioritize: errors → negative feedback → random sample.
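One way to implement that priority order is to bucket sampled spans before evaluating, so the evaluation budget is spent on the most suspicious traffic first. The `status` and `feedback` fields and the `prioritize` helper below are hypothetical; adapt them to whatever shape your spans actually have.

```python
import random


def prioritize(spans: list, budget: int = 100) -> list:
    """Order spans for evaluation: errors first, then spans with negative
    user feedback, then a random sample of everything else."""
    errors = [s for s in spans if s.get("status") == "ERROR"]
    negative = [
        s for s in spans
        if s.get("feedback") == "thumbs_down" and s not in errors
    ]
    rest = [s for s in spans if s not in errors and s not in negative]
    random.shuffle(rest)  # unbiased sample of unremarkable traffic
    return (errors + negative + rest)[:budget]
```

Evaluating in this order means that even under a tight budget, known-bad traffic is never skipped in favor of random samples.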

Feedback Loop

Production finds failure → Error analysis → Add to CI dataset → Prevents future regression
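The loop above can be sketched in a few lines: when a production trace fails evaluation, capture it as a new CI regression case. Here `ci_dataset` is a plain list and the trace/result fields are assumed shapes; in practice you would append to a Phoenix dataset instead.

```python
def add_regression_case(ci_dataset: list, trace: dict, results: list) -> list:
    """If any evaluator flagged this trace, record its input (and the
    failing checks) so future CI runs catch the same regression."""
    failed = [r for r in results if r["score"] < 0.5]
    if failed:
        ci_dataset.append({
            "input": trace["input"],
            "failed_checks": [r["name"] for r in failed],
        })
    return ci_dataset
```

Over time this keeps the CI dataset anchored to real failure modes rather than synthetic examples.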