Awesome/awesome-copilot

mirror of https://github.com/github/awesome-copilot.git synced 2026-04-11 18:55:55 +00:00

github-actions[bot] a68b190031 chore: publish from staged

2026-04-09 06:26:21 +00:00

Experiments: Overview

Systematic testing of AI systems with datasets, tasks, and evaluators.

Structure

DATASET     → Examples: {input, expected_output, metadata}
TASK        → function(input) → output
EVALUATORS  → (input, output, expected) → score
EXPERIMENT  → Run task on all examples, score results
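
The four pieces above can be sketched in plain Python with no Phoenix dependency. The names `Example`, `exact_match`, `toy_task`, and `run_local_experiment` are hypothetical. They only illustrate how data flows from dataset to scores and are not part of the Phoenix API.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Any, Callable, Dict, List


@dataclass
class Example:
    """One dataset row: {input, expected_output, metadata}."""
    input: Any
    expected_output: Any
    metadata: dict = field(default_factory=dict)


def exact_match(input: Any, output: Any, expected: Any) -> float:
    """Evaluator: (input, output, expected) -> score."""
    return 1.0 if output == expected else 0.0


def toy_task(text: str) -> str:
    """Task: function(input) -> output. Adds the integers in 'a+b'."""
    return str(sum(int(part) for part in text.split("+")))


def run_local_experiment(
    dataset: List[Example],
    task: Callable[[Any], Any],
    evaluators: Dict[str, Callable[[Any, Any, Any], float]],
) -> Dict[str, float]:
    """Experiment: run the task on every example and average each score."""
    scores: Dict[str, List[float]] = {name: [] for name in evaluators}
    for example in dataset:
        output = task(example.input)
        for name, evaluator in evaluators.items():
            scores[name].append(
                evaluator(example.input, output, example.expected_output)
            )
    return {name: mean(values) for name, values in scores.items()}


dataset = [
    Example("2+2", "4"),
    Example("3+3", "6"),
    Example("1+1", "3"),  # deliberately wrong label, so one example fails
]
result = run_local_experiment(dataset, toy_task, {"exact_match": exact_match})
print(result)
```

Two of the three examples match, so the averaged `exact_match` score is two thirds.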

Basic Usage

from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=my_dataset,
    task=my_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)

print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}
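
Basic Usage leaves `my_task`, `accuracy`, and `faithfulness` undefined. Below is a minimal sketch of what they might look like, assuming the task receives each example's input as a dict and the evaluators receive the task output (and, where needed, the expected output). The dict keys (`question`, `context`, `answer`) and the word-overlap heuristic are hypothetical stand-ins for a real pipeline and a real faithfulness check.

```python
import re


def _words(text: str) -> set:
    """Lowercased word set, punctuation ignored."""
    return set(re.findall(r"\w+", text.lower()))


def my_task(input: dict) -> dict:
    # Placeholder pipeline: a real task would run retrieval and call an LLM.
    # Here the "answer" is simply the first sentence of the given context.
    context = input.get("context", "")
    return {"answer": context.split(".")[0].strip(), "context": context}


def accuracy(output: dict, expected: dict) -> float:
    # Code-based evaluator: case-insensitive exact match on the answer.
    return float(output["answer"].lower() == expected["answer"].lower())


def faithfulness(output: dict) -> float:
    # Crude proxy: the share of answer words that also occur in the context.
    answer_words = _words(output["answer"])
    if not answer_words:
        return 0.0
    return len(answer_words & _words(output["context"])) / len(answer_words)


example_input = {
    "question": "What is the capital of France?",
    "context": "Paris is the capital of France. It lies on the Seine.",
}
example_expected = {"answer": "paris is the capital of france"}

output = my_task(example_input)
print(accuracy(output, example_expected), faithfulness(output))
```

Calling the functions directly like this is a cheap way to check them before handing them to `run_experiment`.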

Workflow

1. Create dataset - From traces, synthetic data, or manual curation
2. Define task - The function under test (your LLM pipeline)
3. Select evaluators - Code-based, LLM-based, or both
4. Run experiment - Execute the task on every example and score the outputs
5. Analyze & iterate - Review the scores, modify the task, re-run
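
The analyze-and-iterate step usually comes down to comparing aggregate scores between a baseline run and a modified one. Below is a minimal sketch, assuming each run's scores are available as a plain dict like the one printed in Basic Usage. `compare_runs` and the numbers are hypothetical.

```python
from typing import Dict, List


def compare_runs(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Dict[str, float]:
    """Per-evaluator score change (candidate minus baseline)."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }


# Illustrative numbers only, not real results.
baseline = {"accuracy": 0.80, "faithfulness": 0.92}
candidate = {"accuracy": 0.85, "faithfulness": 0.90}

deltas = compare_runs(baseline, candidate)
regressions: List[str] = [name for name, delta in deltas.items() if delta < 0]

print(deltas)
print("regressed:", regressions)
```

A change that raises one score while lowering another is a signal to inspect individual examples before keeping it.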

Dry Runs

Test setup before full execution:

experiment = run_experiment(dataset=my_dataset, task=my_task, evaluators=[accuracy, faithfulness], dry_run=3)  # Just 3 examples
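
Conceptually, a dry run with an integer means running the task on a small subset of the dataset. The sketch below shows the idea in plain Python. `sample_for_dry_run` is hypothetical and makes no claim about how Phoenix selects examples internally.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_for_dry_run(examples: Sequence[T], n: int, seed: int = 0) -> List[T]:
    """Pick at most n examples; a fixed seed keeps the subset reproducible."""
    rng = random.Random(seed)
    return rng.sample(list(examples), min(n, len(examples)))


all_examples = [f"example-{i}" for i in range(10)]
subset = sample_for_dry_run(all_examples, 3)
print(len(subset))
```

If the task or an evaluator raises on the subset, fix it before spending time and tokens on the full dataset.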

Best Practices

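Name experiments meaningfully: "improved-retrieval-v2-2024-01-15", not "test"
Version datasets: Create a new version instead of modifying an existing one, so earlier experiments stay comparable
Use multiple evaluators: Combine code-based and LLM-based checks to cover different perspectives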
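
The naming convention is easy to automate. Below is a minimal sketch of a helper that produces names in the "improved-retrieval-v2-2024-01-15" style. `experiment_name` is a hypothetical helper, not something Phoenix provides.

```python
from datetime import date
from typing import Optional


def experiment_name(change: str, version: int, on: Optional[date] = None) -> str:
    """Build '<change>-v<version>-<YYYY-MM-DD>'; the date defaults to today."""
    day = on if on is not None else date.today()
    return f"{change}-v{version}-{day.isoformat()}"


name = experiment_name("improved-retrieval", 2, date(2024, 1, 15))
print(name)
```

The result can be passed as `experiment_name=` in the Basic Usage call.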