chore: publish from staged

2026-04-12 03:05:55 +00:00 · 2026-04-01 23:04:18 +00:00
parent 5f3d66c380
commit 0c3c5bbbfb
407 changed files with 85783 additions and 237 deletions
--- a/plugins/phoenix/skills/phoenix-evals/references/experiments-synthetic-python.md
+++ b/plugins/phoenix/skills/phoenix-evals/references/experiments-synthetic-python.md
@@ -0,0 +1,70 @@
+# Experiments: Generating Synthetic Test Data
+
+Creating diverse, targeted test data for evaluation.
+
+## Dimension-Based Approach
+
+Define axes of variation, then generate combinations:
+
+```python
+dimensions = {
+    "issue_type": ["billing", "technical", "shipping"],
+    "customer_mood": ["frustrated", "neutral", "happy"],
+    "complexity": ["simple", "moderate", "complex"],
+}
+```
+
+## Two-Step Generation
+
+1. **Generate tuples** (combinations of dimension values)
+2. **Convert to natural queries** (separate LLM call per tuple)
+
+```python
+# Step 1: Create tuples
+tuples = [
+    ("billing", "frustrated", "complex"),
+    ("shipping", "neutral", "simple"),
+]
+
+# Step 2: Convert to natural query
+def tuple_to_query(t):
+    prompt = f"""Generate a realistic customer message:
+    Issue: {t[0]}, Mood: {t[1]}, Complexity: {t[2]}
+    
+    Write naturally, include typos if appropriate. Don't be formulaic."""
+    return llm(prompt)
+```
+
+## Target Failure Modes
+
+Dimensions should target known failures from error analysis:
+
+```python
+# From error analysis findings
+dimensions = {
+    "timezone": ["EST", "PST", "UTC", "ambiguous"],  # Known failure
+    "date_format": ["ISO", "US", "EU", "relative"],   # Known failure
+}
+```
+
+## Quality Control
+
+- **Validate**: Check for placeholder text, minimum length
+- **Deduplicate**: Remove near-duplicate queries using embeddings
+- **Balance**: Ensure coverage across dimension values
+
+## When to Use
+
+| Use Synthetic | Use Real Data |
+| ------------- | ------------- |
+| Limited production data | Sufficient traces |
+| Testing edge cases | Validating actual behavior |
+| Pre-launch evals | Post-launch monitoring |
+
+## Sample Sizes
+
+| Purpose | Size |
+| ------- | ---- |
+| Initial exploration | 50-100 |
+| Comprehensive eval | 100-500 |
+| Per-dimension | 10-20 per combination |