chore: publish from staged

2026-04-13 11:45:56 +00:00 · 2026-04-09 06:26:21 +00:00
parent 017f31f495
commit a68b190031
467 changed files with 97527 additions and 276 deletions
--- a/plugins/phoenix/skills/phoenix-evals/references/production-overview.md
+++ b/plugins/phoenix/skills/phoenix-evals/references/production-overview.md
@@ -0,0 +1,92 @@
+# Production: Overview
+
+CI/CD evals and production monitoring are complementary: CI/CD evals gate a release before it ships, and production monitoring watches live traffic after it ships.
+
+## Two Evaluation Modes
+
+| Aspect | CI/CD Evals | Production Monitoring |
+| ------ | ----------- | -------------------- |
+| **When** | Pre-deployment | Post-deployment, ongoing |
+| **Data** | Fixed dataset | Sampled traffic |
+| **Goal** | Prevent regression | Detect drift |
+| **Response** | Block deploy | Alert & analyze |
+
+## CI/CD Evaluations
+
+```python
+# Fast, deterministic checks
+ci_evaluators = [
+    has_required_format,
+    no_pii_leak,
+    safety_check,
+    regression_test_suite,
+]
+
+# Small but representative dataset (~100 examples)
+run_experiment(ci_dataset, task, ci_evaluators)
+```
+
+Set a minimum aggregate score per check and block the deploy if any check falls below it: regression=0.95, safety=1.0, format=0.98.
+
+## Production Monitoring
+
+### Python
+
+```python
+from phoenix.client import Client
+from datetime import datetime, timedelta, timezone
+
+client = Client()
+
+# Sample recent traces (last hour); an aware UTC timestamp avoids local-time ambiguity
+traces = client.traces.get_traces(
+    project_identifier="my-app",
+    start_time=datetime.now(timezone.utc) - timedelta(hours=1),
+    include_spans=True,
+    limit=100,
+)
+
+# Run evaluators on sampled traffic (run_evaluators, production_evaluators, and alert_on_failure are app-defined placeholders)
+for trace in traces:
+    results = run_evaluators(trace, production_evaluators)
+    if any(r["score"] < 0.5 for r in results):
+        alert_on_failure(trace, results)
+```
+
+### TypeScript
+
+```typescript
+import { getTraces } from "@arizeai/phoenix-client/traces";
+import { getSpans } from "@arizeai/phoenix-client/spans";
+
+// Sample recent traces (last hour)
+const { traces } = await getTraces({
+  project: { projectName: "my-app" },
+  startTime: new Date(Date.now() - 60 * 60 * 1000),
+  includeSpans: true,
+  limit: 100,
+});
+
+// Or sample spans directly for evaluation
+const { spans } = await getSpans({
+  project: { projectName: "my-app" },
+  startTime: new Date(Date.now() - 60 * 60 * 1000),
+  limit: 100,
+});
+
+// Run evaluators on sampled traffic (runEvaluators, productionEvaluators, and alertOnFailure are app-defined placeholders)
+for (const span of spans) {
+  const results = await runEvaluators(span, productionEvaluators);
+  if (results.some((r) => r.score < 0.5)) {
+    await alertOnFailure(span, results);
+  }
+}
+```
+
+Prioritize what to evaluate: traces with errors first, then traces with negative user feedback, then a random sample of the rest.
+
+## Feedback Loop
+
+```
+Production finds failure → Error analysis → Add to CI dataset → Prevents future regression
+```
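
Note on the CI/CD Evaluations section: the thresholds it lists (regression=0.95, safety=1.0, format=0.98) are stated without showing how a failing value would block a deploy. One way this could look is sketched below. It assumes each evaluator yields a list of per-example scores of 1.0 (pass) or 0.0 (fail) and that the threshold is a minimum pass rate. The names `THRESHOLDS`, `pass_rate`, and `failed_gates` are hypothetical and not part of Phoenix.

```python
# Hypothetical minimum pass rates, copied from the thresholds listed in the doc.
THRESHOLDS = {"regression": 0.95, "safety": 1.0, "format": 0.98}


def pass_rate(scores):
    """Fraction of examples that passed; scores are assumed to be 1.0 or 0.0."""
    return sum(scores) / len(scores) if scores else 0.0


def failed_gates(scores_by_evaluator, thresholds=THRESHOLDS):
    """Return {check: (rate, minimum)} for every check below its threshold.

    A check with no scores at all counts as failing (rate 0.0), so a
    missing evaluator cannot silently pass the gate.
    """
    failures = {}
    for name, minimum in thresholds.items():
        rate = pass_rate(scores_by_evaluator.get(name, []))
        if rate < minimum:
            failures[name] = (rate, minimum)
    return failures
```

A CI job could call `failed_gates(...)` on the experiment's scores, print each failing check, and exit with a non-zero status when the returned dict is non-empty. Treating a missing evaluator as a failure is a deliberate choice here, so that a misconfigured pipeline blocks the deploy instead of passing it.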