Mirror of https://github.com/github/awesome-copilot.git (synced 2026-04-11 18:55:55 +00:00)
# Experiments: Overview

Systematic testing of AI systems with datasets, tasks, and evaluators.
## Structure

```text
DATASET    → Examples: {input, expected_output, metadata}
TASK       → function(input) → output
EVALUATORS → (input, output, expected) → score
EXPERIMENT → Run task on all examples, score results
```
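The four building blocks above can be sketched in plain Python. This is an illustrative stand-in, not the Phoenix API; all names (`dataset`, `task`, `exact_match`) are assumptions for the example:

```python
# DATASET: a list of examples with input, expected_output, and metadata
dataset = [
    {"input": "What is 2 + 2?", "expected_output": "4", "metadata": {"topic": "math"}},
]

# TASK: function(input) -> output (stand-in for a real LLM pipeline)
def task(input: str) -> str:
    return "4" if "2 + 2" in input else "unknown"

# EVALUATOR: (input, output, expected) -> score
def exact_match(input: str, output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

# EXPERIMENT: run the task on every example and score the results
scores = [
    exact_match(ex["input"], task(ex["input"]), ex["expected_output"])
    for ex in dataset
]
```

Phoenix wires these same pieces together for you; the point here is only the shape of each component.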
## Basic Usage

```python
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=my_dataset,
    task=my_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)

print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}
```
## Workflow

1. Create dataset - from traces, synthetic data, or manual curation
2. Define task - the function to test (your LLM pipeline)
3. Select evaluators - code-based and/or LLM-based
4. Run experiment - execute and score
5. Analyze & iterate - review results, modify the task, re-run
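The iterate step above amounts to comparing task versions against the same dataset. A minimal sketch in plain Python (the dataset, `task_v1`/`task_v2`, and `accuracy` are all hypothetical, not Phoenix objects):

```python
# A fixed dataset so both task versions are scored on identical examples
dataset = [
    {"input": "capital of France", "expected_output": "Paris"},
    {"input": "capital of Japan", "expected_output": "Tokyo"},
]

def task_v1(question: str) -> str:
    # Baseline: always guesses Paris
    return "Paris"

def task_v2(question: str) -> str:
    # "Improved" version: looks up the country named in the question
    lookup = {"France": "Paris", "Japan": "Tokyo"}
    return lookup.get(question.split()[-1], "?")

def accuracy(task) -> float:
    # Score a task version across the whole dataset
    hits = [task(ex["input"]) == ex["expected_output"] for ex in dataset]
    return sum(hits) / len(hits)

print(accuracy(task_v1), accuracy(task_v2))  # 0.5 1.0
```

Keeping the dataset and evaluator fixed is what makes the score difference attributable to the task change.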
## Dry Runs

Test your setup before a full execution:

```python
# Run on just 3 examples to validate the pipeline
experiment = run_experiment(dataset, task, evaluators, dry_run=3)
```
## Best Practices

- Name experiments meaningfully: "improved-retrieval-v2-2024-01-15", not "test"
- Version datasets: don't modify existing ones; create a new version instead
- Use multiple evaluators: combine perspectives (e.g., code-based and LLM-based)
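To illustrate the multiple-evaluators point: different evaluators can disagree on the same output, which is exactly why combining them is useful. A toy sketch with two code-based evaluators (the evaluator names and examples are illustrative, not Phoenix APIs):

```python
def exact_match(output: str, expected: str) -> float:
    # Strict: the output must equal the expected answer exactly
    return float(output == expected)

def contains_expected(output: str, expected: str) -> float:
    # Lenient: the expected answer just has to appear in the output
    return float(expected.lower() in output.lower())

examples = [("The answer is Paris.", "Paris"), ("Tokyo", "Tokyo")]

# Aggregate each evaluator's score across all examples
aggregate = {
    name: sum(fn(out, exp) for out, exp in examples) / len(examples)
    for name, fn in {"exact_match": exact_match,
                     "contains": contains_expected}.items()
}
print(aggregate)  # {'exact_match': 0.5, 'contains': 1.0}
```

Here the first output fails exact match but passes containment, so reporting both scores gives a fuller picture than either alone.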