mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-12 11:15:56 +00:00
1.8 KiB
1.8 KiB
Fundamentals
Application-specific tests for AI systems. Code first, LLM for nuance, human for truth.
Evaluator Types
| Type | Speed | Cost | Use Case |
|---|---|---|---|
| Code | Fast | Cheap | Regex, JSON, format, exact match |
| LLM | Medium | Medium | Subjective quality, complex criteria |
| Human | Slow | Expensive | Ground truth, calibration |
Decision: Code first → LLM only when code can't capture criteria → Human for calibration.
Score Structure
| Property | Required | Description |
|---|---|---|
name |
Yes | Evaluator name |
kind |
Yes | "code", "llm", "human" |
score |
No* | 0-1 numeric |
label |
No* | "pass", "fail" |
explanation |
No | Rationale |
*One of score or label required.
Binary > Likert
Use pass/fail, not 1-5 scales. Clearer criteria, easier calibration.
# Multiple binary checks instead of one Likert scale
evaluators = [
AnswersQuestion(), # Yes/No
UsesContext(), # Yes/No
NoHallucination(), # Yes/No
]
Quick Patterns
Code Evaluator
from phoenix.evals import create_evaluator
@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
return bool(re.search(r'\[\d+\]', output))
LLM Evaluator
from phoenix.evals import ClassificationEvaluator, LLM
evaluator = ClassificationEvaluator(
name="helpfulness",
prompt_template="...",
llm=LLM(provider="openai", model="gpt-4o"),
choices={"not_helpful": 0, "helpful": 1}
)
Run Experiment
from phoenix.client.experiments import run_experiment
experiment = run_experiment(
dataset=dataset,
task=my_task,
evaluators=[evaluator1, evaluator2],
)
print(experiment.aggregate_scores)