# Fundamentals

Application-specific tests for AI systems. Code first, LLM for nuance, human for truth.

## Evaluator Types

| Type | Speed | Cost | Use Case |
| ---- | ----- | ---- | -------- |
| **Code** | Fast | Cheap | Regex, JSON, format, exact match |
| **LLM** | Medium | Medium | Subjective quality, complex criteria |
| **Human** | Slow | Expensive | Ground truth, calibration |

**Decision:** Code first → LLM only when code can't capture the criteria → Human for calibration (see the routing sketch under Quick Patterns).

## Score Structure

| Property | Required | Description |
| -------- | -------- | ----------- |
| `name` | Yes | Evaluator name |
| `kind` | Yes | `"code"`, `"llm"`, or `"human"` |
| `score` | No* | Numeric, 0-1 |
| `label` | No* | `"pass"` or `"fail"` |
| `explanation` | No | Rationale for the score |

*At least one of `score` or `label` is required (see the score sketch under Quick Patterns).

## Binary > Likert

Use pass/fail labels, not 1-5 scales: binary criteria are clearer to write and easier to calibrate.

```python
# Several binary checks instead of one Likert scale
evaluators = [
    AnswersQuestion(),   # Yes/No
    UsesContext(),       # Yes/No
    NoHallucination(),   # Yes/No
]
```

## Quick Patterns

### Code Evaluator

```python
import re

from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    # Pass if the output contains a bracketed citation like [3]
    return bool(re.search(r"\[\d+\]", output))
```

### LLM Evaluator

```python
from phoenix.evals import ClassificationEvaluator, LLM

evaluator = ClassificationEvaluator(
    name="helpfulness",
    prompt_template="...",
    llm=LLM(provider="openai", model="gpt-4o"),
    # Map each classification label to a numeric score
    choices={"not_helpful": 0, "helpful": 1},
)
```

### Run Experiment

```python
from phoenix.client.experiments import run_experiment

# Runs the task over every dataset example, then applies each evaluator
experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[evaluator1, evaluator2],
)
print(experiment.aggregate_scores)
```
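### Choosing an Evaluator Kind

The **Decision** rule above, written as a routing function. A minimal sketch; the function and predicate names are illustrative, not part of the Phoenix API.

```python
# Hypothetical helper encoding the decision rule; not a Phoenix API.
def choose_evaluator_kind(checkable_in_code: bool, needs_ground_truth: bool) -> str:
    if needs_ground_truth:    # calibration / ground truth
        return "human"
    if checkable_in_code:     # regex, JSON, format, exact match
        return "code"
    return "llm"              # subjective quality, complex criteria
```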
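### Score Shape

What a score matching the **Score Structure** table might look like. A minimal sketch assuming a plain dict; the concrete score type Phoenix returns may differ.

```python
# Illustrative score record following the Score Structure table;
# assumes a plain dict, not necessarily Phoenix's actual score type.
citation_score = {
    "name": "has_citation",   # evaluator name
    "kind": "code",           # "code", "llm", or "human"
    "score": 1.0,             # numeric, 0-1
    "label": "pass",          # at least one of score/label is required
    "explanation": "Output contains a bracketed citation like [3].",
}
```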