# Production: Overview
CI/CD evals and production monitoring are complementary approaches to evaluation.
## Two Evaluation Modes
| Aspect | CI/CD Evals | Production Monitoring |
| ------ | ----------- | -------------------- |
| **When** | Pre-deployment | Post-deployment, ongoing |
| **Data** | Fixed dataset | Sampled traffic |
| **Goal** | Prevent regression | Detect drift |
| **Response** | Block deploy | Alert & analyze |
## CI/CD Evaluations
```python
# Fast, deterministic checks
ci_evaluators = [
    has_required_format,
    no_pii_leak,
    safety_check,
    regression_test_suite,
]

# Small but representative dataset (~100 examples)
run_experiment(ci_dataset, task, ci_evaluators)
```
Set per-evaluator thresholds: regression=0.95, safety=1.0, format=0.98. Block the deploy when any score falls below its threshold, as in the sketch below.
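A minimal sketch of threshold gating in CI; the `THRESHOLDS` dict and `gate_deploy` helper are illustrative assumptions, not part of the Phoenix API, and the scores would come from aggregating `run_experiment` results:

```python
import sys

# Hypothetical per-evaluator pass thresholds (matching the values above)
THRESHOLDS = {"regression": 0.95, "safety": 1.0, "format": 0.98}

def gate_deploy(scores: dict[str, float]) -> None:
    """Exit non-zero so the CI pipeline blocks the deploy on any failure."""
    failures = {
        name: score
        for name, score in scores.items()
        if score < THRESHOLDS.get(name, 1.0)
    }
    if failures:
        print(f"Eval thresholds not met: {failures}")
        sys.exit(1)

# Example: format (0.96) is below its 0.98 threshold, so this exits 1
gate_deploy({"regression": 0.97, "safety": 1.0, "format": 0.96})
```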
## Production Monitoring
### Python
```python
from datetime import datetime, timedelta

from phoenix.client import Client

client = Client()

# Sample recent traces (last hour)
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    include_spans=True,
    limit=100,
)

# Run evaluators on sampled traffic
for trace in traces:
    results = run_evaluators_async(trace, production_evaluators)
    if any(r["score"] < 0.5 for r in results):
        alert_on_failure(trace, results)
```
### TypeScript
```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
import { getSpans } from "@arizeai/phoenix-client/spans";

// Sample recent traces (last hour)
const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});

// Or sample spans directly for evaluation
const { spans } = await getSpans({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  limit: 100,
});

// Run evaluators on sampled traffic
for (const span of spans) {
  const results = await runEvaluators(span, productionEvaluators);
  if (results.some((r) => r.score < 0.5)) {
    await alertOnFailure(span, results);
  }
}
```
Prioritize which traffic to evaluate: errors → negative feedback → random sample, as sketched below.
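One way to implement that ordering, assuming each trace dict carries `status` and `feedback_score` fields (illustrative names, not a Phoenix schema):

```python
import random

def prioritize(traces: list[dict], budget: int = 100) -> list[dict]:
    """Spend the eval budget on errors first, then negative feedback, then a random sample."""
    errors = [t for t in traces if t.get("status") == "ERROR"]
    negative = [
        t for t in traces
        if t.get("status") != "ERROR" and t.get("feedback_score", 1.0) < 0
    ]
    rest = [t for t in traces if t not in errors and t not in negative]

    sample = errors + negative
    remaining = max(budget - len(sample), 0)
    sample += random.sample(rest, min(remaining, len(rest)))
    return sample[:budget]
```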
## Feedback Loop
```
Production finds failure → Error analysis → Add to CI dataset → Prevents future regression
```
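A sketch of closing the loop by appending a production failure to the CI dataset; the JSONL path and record shape are assumptions about how `ci_dataset` is stored, not a Phoenix API:

```python
import json

def add_to_ci_dataset(trace: dict, results: list[dict],
                      path: str = "ci_dataset.jsonl") -> None:
    """Append the failing example so the next CI run covers this case."""
    record = {
        "input": trace.get("input"),
        # Hypothetical: record which evaluators flagged the trace
        "failed_evaluators": [r["name"] for r in results if r["score"] < 0.5],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```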