chore: publish from staged

2026-04-13 03:35:55 +00:00 · 2026-04-01 23:04:18 +00:00
parent 5f3d66c380
commit 0c3c5bbbfb
407 changed files with 85783 additions and 237 deletions
--- a/plugins/phoenix/skills/phoenix-evals/references/production-continuous.md
+++ b/plugins/phoenix/skills/phoenix-evals/references/production-continuous.md
@@ -0,0 +1,137 @@
+# Production: Continuous Evaluation
+
+Capability vs regression evals and the ongoing feedback loop.
+
+## Two Types of Evals
+
+| Type | Pass Rate Target | Purpose | Update |
+| ---- | ---------------- | ------- | ------ |
+| **Capability** | 50-80% | Measure improvement | Add harder cases |
+| **Regression** | 95-100% | Catch breakage | Add fixed bugs |
+
+## Saturation
+
+When capability evals hit >95% pass rate, they're saturated:
+1. Graduate passing cases to regression suite
+2. Add new challenging cases to capability suite
+
+## Feedback Loop
+
+```
+Production → Sample traffic → Run evaluators → Find failures
+    ↑                                              ↓
+Deploy  ←  Run CI evals  ←  Create test cases  ←  Error analysis
+```
+
+## Implementation
+
+Build a continuous monitoring loop:
+
+1. **Sample recent traces** at regular intervals (e.g., 100 traces per hour)
+2. **Run evaluators** on sampled traces
+3. **Log results** to Phoenix for tracking
+4. **Queue concerning results** for human review
+5. **Create test cases** from recurring failure patterns
+
+### Python
+
+```python
+from phoenix.client import Client
+from datetime import datetime, timedelta
+
+client = Client()
+
+# 1. Sample recent spans (includes full attributes for evaluation)
+spans_df = client.spans.get_spans_dataframe(
+    project_identifier="my-app",
+    start_time=datetime.now() - timedelta(hours=1),
+    root_spans_only=True,
+    limit=100,
+)
+
+# 2. Run evaluators
+from phoenix.evals import evaluate_dataframe
+
+results_df = evaluate_dataframe(
+    dataframe=spans_df,
+    evaluators=[quality_eval, safety_eval],
+)
+
+# 3. Upload results as annotations
+from phoenix.evals.utils import to_annotation_dataframe
+
+annotations_df = to_annotation_dataframe(results_df)
+client.spans.log_span_annotations_dataframe(dataframe=annotations_df)
+```
+
+### TypeScript
+
+```typescript
+import { getSpans } from "@arizeai/phoenix-client/spans";
+import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";
+
+// 1. Sample recent spans
+const { spans } = await getSpans({
+  project: { projectName: "my-app" },
+  startTime: new Date(Date.now() - 60 * 60 * 1000),
+  parentId: null, // root spans only
+  limit: 100,
+});
+
+// 2. Run evaluators (user-defined)
+const results = await Promise.all(
+  spans.map(async (span) => ({
+    spanId: span.context.span_id,
+    ...await runEvaluators(span, [qualityEval, safetyEval]),
+  }))
+);
+
+// 3. Upload results as annotations
+await logSpanAnnotations({
+  spanAnnotations: results.map((r) => ({
+    spanId: r.spanId,
+    name: "quality",
+    score: r.qualityScore,
+    label: r.qualityLabel,
+    annotatorKind: "LLM" as const,
+  })),
+});
+```
+
+For trace-level monitoring (e.g., agent workflows), use `get_traces`/`getTraces` to identify traces:
+
+```python
+# Python: identify slow traces
+traces = client.traces.get_traces(
+    project_identifier="my-app",
+    start_time=datetime.now() - timedelta(hours=1),
+    sort="latency_ms",
+    order="desc",
+    limit=50,
+)
+```
+
+```typescript
+// TypeScript: identify slow traces
+import { getTraces } from "@arizeai/phoenix-client/traces";
+
+const { traces } = await getTraces({
+  project: { projectName: "my-app" },
+  startTime: new Date(Date.now() - 60 * 60 * 1000),
+  limit: 50,
+});
+```
+
+## Alerting
+
+| Condition | Severity | Action |
+| --------- | -------- | ------ |
+| Regression < 98% | Critical | Page oncall |
+| Capability declining | Warning | Slack notify |
+| Capability > 95% for 7d | Info | Schedule review |
+
+## Key Principles
+
+- **Two suites** - Capability + Regression always
+- **Graduate cases** - Move consistent passes to regression
+- **Track trends** - Monitor over time, not just snapshots