mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-11 18:55:55 +00:00
# Experiments: Generating Synthetic Test Data

Creating diverse, targeted test data for evaluation.

## Dimension-Based Approach

Define axes of variation, then generate combinations:

```python
dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}
```

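
One way to enumerate the combinations is `itertools.product`. The sketch below reuses the `dimensions` dict from this section; `all_tuples`, `sample_tuples`, and the fixed seed are illustrative names and choices, not part of the original workflow.

```python
import itertools
import random

dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}

def all_tuples(dims):
    """Return every combination of dimension values, in dict key order."""
    return list(itertools.product(*dims.values()))

def sample_tuples(dims, k, seed=0):
    """Pick up to k distinct combinations at random (seeded for repeatability)."""
    combos = all_tuples(dims)
    return random.Random(seed).sample(combos, min(k, len(combos)))

combos = all_tuples(dimensions)
print(len(combos))  # 3 * 3 * 3 = 27 combinations
print(combos[0])    # ('billing', 'frustrated', 'simple')
```

Sampling is useful when the full cross product is larger than the number of queries you plan to generate.
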
## Two-Step Generation

1. **Generate tuples** (combinations of dimension values)
2. **Convert to natural queries** (separate LLM call per tuple)

```python
# Step 1: Create tuples
tuples = [
    ("billing", "frustrated", "complex"),
    ("shipping", "neutral", "simple"),
]

# Step 2: Convert to natural query
# `llm` is a placeholder for whatever text-generation call you use.
def tuple_to_query(t):
    prompt = f"""Generate a realistic customer message:
    Issue: {t[0]}, Mood: {t[1]}, Complexity: {t[2]}

    Write naturally, include typos if appropriate. Don't be formulaic."""
    return llm(prompt)
```

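
The two steps can be wired together while keeping each tuple next to the query it produced, which makes later coverage checks simpler. This is only a sketch: `build_prompt`, `generate_queries`, and `fake_llm` are hypothetical names, and `fake_llm` stands in for a real model call.

```python
from typing import Callable

FIELDS = ("issue_type", "customer_mood", "complexity")

def build_prompt(t: tuple) -> str:
    """Render one tuple into the generation prompt."""
    issue, mood, complexity = t
    return (
        "Generate a realistic customer message:\n"
        f"Issue: {issue}, Mood: {mood}, Complexity: {complexity}\n\n"
        "Write naturally, include typos if appropriate. Don't be formulaic."
    )

def generate_queries(tuples, llm: Callable[[str], str]):
    """One LLM call per tuple; keep the dimension labels with each query."""
    records = []
    for t in tuples:
        record = dict(zip(FIELDS, t))
        record["query"] = llm(build_prompt(t))
        records.append(record)
    return records

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption); echoes the labels line only.
    return prompt.splitlines()[1]

tuples = [("billing", "frustrated", "complex"), ("shipping", "neutral", "simple")]
records = generate_queries(tuples, fake_llm)
print(records[0]["query"])  # Issue: billing, Mood: frustrated, Complexity: complex
```

Passing the model call in as a parameter keeps the pipeline testable without network access.
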
## Target Failure Modes

Dimensions should target known failures from error analysis:

```python
# From error analysis findings
dimensions = {
    "timezone": ["EST", "PST", "UTC", "ambiguous"],  # Known failure
    "date_format": ["ISO", "US", "EU", "relative"],  # Known failure
}
```

## Quality Control

- **Validate**: Check for placeholder text, minimum length
- **Deduplicate**: Remove near-duplicate queries using embeddings
- **Balance**: Ensure coverage across dimension values

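
A minimal sketch of the three checks. The placeholder patterns, the 20-character minimum, and the 0.9 similarity threshold are assumed values to tune for your data. `embed` is a toy bag-of-words vectorizer standing in for a real embedding model, so that the block runs without external dependencies.

```python
import math
import re
from collections import Counter

# Assumed patterns for template leftovers such as [NAME], {name}, <name>.
PLACEHOLDER = re.compile(r"\[[^\]]*\]|\{[^}]*\}|<[^>]+>|lorem ipsum", re.IGNORECASE)

def is_valid(query, min_len=20):
    """Reject short queries and ones that still contain placeholder text."""
    return len(query.strip()) >= min_len and not PLACEHOLDER.search(query)

def embed(text):
    # Toy stand-in for an embedding model (assumption): word-count vector.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def deduplicate(queries, threshold=0.9):
    """Keep a query only if it is not too similar to one already kept."""
    kept, vectors = [], []
    for q in queries:
        v = embed(q)
        if all(cosine(v, other) < threshold for other in vectors):
            kept.append(q)
            vectors.append(v)
    return kept

def coverage(records, dimension):
    """Count how many records carry each value of one dimension."""
    return Counter(r[dimension] for r in records)

queries = [
    "I was charged twice for my order last week",
    "i was charged twice for my order last week!!",
    "Hi [NAME], where is my package?",
    "ok",
]
valid = [q for q in queries if is_valid(q)]
unique = deduplicate(valid)
print(len(valid), len(unique))  # 2 1
```

The pairwise comparison in `deduplicate` is quadratic in the number of queries, which is acceptable at the sample sizes discussed here.
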
## When to Use

| Use Synthetic | Use Real Data |
| ------------- | ------------- |
| Limited production data | Sufficient traces |
| Testing edge cases | Validating actual behavior |
| Pre-launch evals | Post-launch monitoring |

## Sample Sizes

| Purpose | Size (examples) |
| ------- | --------------- |
| Initial exploration | 50-100 |
| Comprehensive eval | 100-500 |
| Per-dimension | 10-20 per combination |

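
The per-combination guideline multiplies quickly. As a worked example, assume the three dimensions from the first section, each with three values (the dimension sizes are the only input here, and they are an assumption about your setup):

```python
import math

dimension_sizes = {"issue_type": 3, "customer_mood": 3, "complexity": 3}
per_combination = (10, 20)  # low and high ends of the per-combination guideline

n_combinations = math.prod(dimension_sizes.values())
low, high = (n_combinations * n for n in per_combination)
print(n_combinations, low, high)  # 27 270 540
```

If the total exceeds your budget, sampling a subset of combinations is one option.
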