# Evaluators: Code Evaluators in Python

Deterministic evaluators that score outputs without calling an LLM: fast, cheap, and reproducible.

## Basic Pattern

```python
import re
import json

from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    # True when the output contains a bracketed numeric citation like [3].
    return bool(re.search(r'\[\d+\]', output))

@create_evaluator(name="json_valid", kind="code")
def json_valid(output: str) -> bool:
    # True when the output parses as valid JSON.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```

## Parameter Binding

Evaluator function arguments are bound by parameter name:

| Parameter  | Description      |
|------------|------------------|
| `output`   | Task output      |
| `input`    | Example input    |
| `expected` | Expected output  |
| `metadata` | Example metadata |
```python
@create_evaluator(name="matches_expected", kind="code")
def matches_expected(output: str, expected: dict) -> bool:
    return output.strip() == expected.get("answer", "").strip()
```

## Common Patterns

- **Regex:** `re.search(pattern, output)`
- **JSON schema:** `jsonschema.validate()`
- **Keywords:** `keyword in output.lower()`
- **Length:** `len(output.split())`
- **Similarity:** `editdistance.eval()` or Jaccard overlap (see the sketch below)
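
As an illustration of the similarity pattern, here is a minimal sketch of a token-level Jaccard evaluator. It reuses the `expected["answer"]` convention from the earlier example, which is an assumption about your dataset schema; returning a `float` means the value is used as the score directly (see Return Types below).

```python
from phoenix.evals import create_evaluator

@create_evaluator(name="jaccard_similarity", kind="code")
def jaccard_similarity(output: str, expected: dict) -> float:
    # Token-level Jaccard: |intersection| / |union| of the word sets.
    out_tokens = set(output.lower().split())
    exp_tokens = set(expected.get("answer", "").lower().split())
    if not out_tokens and not exp_tokens:
        return 1.0  # two empty strings count as identical
    return len(out_tokens & exp_tokens) / len(out_tokens | exp_tokens)
```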

## Return Types

| Return type | Result |
|---|---|
| `bool` | `True` → score=1.0, label="True"; `False` → score=0.0, label="False" |
| `float` / `int` | Used as the score value directly |
| `str` (short, ≤ 3 words) | Used as the label value |
| `str` (long, ≥ 4 words) | Used as the explanation value |
| `dict` with `score`/`label`/`explanation` | Mapped to `Score` fields directly |
| `Score` object | Used as-is |
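
For example, returning a dict lets a single evaluator populate score, label, and explanation together. A sketch, with an arbitrary 200-word threshold chosen purely for illustration:

```python
from phoenix.evals import create_evaluator

@create_evaluator(name="length_check", kind="code")
def length_check(output: str) -> dict:
    # The dict keys map straight onto the Score fields.
    n = len(output.split())
    return {
        "score": 1.0 if n <= 200 else 0.0,
        "label": "concise" if n <= 200 else "verbose",
        "explanation": f"Output contains {n} words.",
    }
```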

## Important: Code vs LLM Evaluators

The `@create_evaluator` decorator wraps a plain Python function.

- `kind="code"` (default): for deterministic evaluators that don't call an LLM.
- `kind="llm"`: marks the evaluator as LLM-based, but you must implement the LLM call inside the function. The decorator does not call an LLM for you (see the sketch below).
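
A minimal sketch of a `kind="llm"` evaluator, using the OpenAI client directly for illustration (any client works; an `OPENAI_API_KEY` in the environment is assumed). Returning a short string means it is treated as the label, per the Return Types table:

```python
from openai import OpenAI
from phoenix.evals import create_evaluator

client = OpenAI()

@create_evaluator(name="is_polite", kind="llm")
def is_polite(output: str) -> str:
    # The function itself makes the LLM call; the decorator only tags it.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Answer with exactly one word, 'polite' or 'impolite':\n{output}",
        }],
    )
    return response.choices[0].message.content.strip()
```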

For most LLM-based evaluation, prefer `ClassificationEvaluator`, which handles the LLM call, structured output parsing, and explanations automatically:

```python
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```

## Pre-Built Evaluators

Phoenix ships ready-made code evaluators for common checks:

```python
from phoenix.experiments.evaluators import ContainsAnyKeyword, JSONParseable, MatchesRegex

evaluators = [
    ContainsAnyKeyword(keywords=["disclaimer"]),
    JSONParseable(),
    MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}"),
]
```
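
These are intended for experiment runs. A sketch, assuming a Phoenix dataset and task function already exist (`dataset` and `my_task` are hypothetical placeholders, not names from this guide):

```python
from phoenix.experiments import run_experiment

# dataset and my_task are stand-ins for your own dataset and task.
experiment = run_experiment(
    dataset,
    my_task,
    evaluators=evaluators,
)
```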