# Evaluators: Code Evaluators in Python

Deterministic evaluators that score outputs without calling an LLM: fast, cheap, and reproducible.

## Basic Pattern

```python
import re
import json

from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    # True when the output contains a bracketed numeric citation like [3].
    return bool(re.search(r'\[\d+\]', output))

@create_evaluator(name="json_valid", kind="code")
def json_valid(output: str) -> bool:
    # True when the output parses as valid JSON.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```

## Parameter Binding

Evaluator function arguments are bound by parameter name:

| Parameter  | Description      |
|------------|------------------|
| `output`   | Task output      |
| `input`    | Example input    |
| `expected` | Expected output  |
| `metadata` | Example metadata |
```python
@create_evaluator(name="matches_expected", kind="code")
def matches_expected(output: str, expected: dict) -> bool:
    return output.strip() == expected.get("answer", "").strip()
```

## Common Patterns

- **Regex:** `re.search(pattern, output)`
- **JSON schema:** `jsonschema.validate()`
- **Keywords:** `keyword in output.lower()`
- **Length:** `len(output.split())`
- **Similarity:** `editdistance.eval()` or Jaccard overlap (see the sketch below)
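
As an illustration of the similarity pattern, here is a minimal sketch of a token-level Jaccard evaluator. It reuses the `expected["answer"]` convention from the earlier example, which is an assumption about your dataset schema; returning a `float` means the value is used as the score directly (see Return Types below).

```python
from phoenix.evals import create_evaluator

@create_evaluator(name="jaccard_similarity", kind="code")
def jaccard_similarity(output: str, expected: dict) -> float:
    # Token-level Jaccard: |intersection| / |union| of the word sets.
    out_tokens = set(output.lower().split())
    exp_tokens = set(expected.get("answer", "").lower().split())
    if not out_tokens and not exp_tokens:
        return 1.0  # two empty strings count as identical
    return len(out_tokens & exp_tokens) / len(out_tokens | exp_tokens)
```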

## Return Types

| Return type | Result |
|---|---|
| `bool` | `True` → score=1.0, label="True"; `False` → score=0.0, label="False" |
| `float` / `int` | Used as the score value directly |
| `str` (short, ≤ 3 words) | Used as the label value |
| `str` (long, ≥ 4 words) | Used as the explanation value |
| `dict` with `score`/`label`/`explanation` | Mapped to `Score` fields directly |
| `Score` object | Used as-is |
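
For example, returning a dict lets a single evaluator populate score, label, and explanation together. A sketch, with an arbitrary 200-word threshold chosen purely for illustration:

```python
from phoenix.evals import create_evaluator

@create_evaluator(name="length_check", kind="code")
def length_check(output: str) -> dict:
    # The dict keys map straight onto the Score fields.
    n = len(output.split())
    return {
        "score": 1.0 if n <= 200 else 0.0,
        "label": "concise" if n <= 200 else "verbose",
        "explanation": f"Output contains {n} words.",
    }
```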

## Important: Code vs LLM Evaluators

The `@create_evaluator` decorator wraps a plain Python function.

- `kind="code"` (default): for deterministic evaluators that don't call an LLM.
- `kind="llm"`: marks the evaluator as LLM-based, but you must implement the LLM call inside the function. The decorator does not call an LLM for you (see the sketch below).
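
A minimal sketch of a `kind="llm"` evaluator, using the OpenAI client directly for illustration (any client works; an `OPENAI_API_KEY` in the environment is assumed). Returning a short string means it is treated as the label, per the Return Types table:

```python
from openai import OpenAI
from phoenix.evals import create_evaluator

client = OpenAI()

@create_evaluator(name="is_polite", kind="llm")
def is_polite(output: str) -> str:
    # The function itself makes the LLM call; the decorator only tags it.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Answer with exactly one word, 'polite' or 'impolite':\n{output}",
        }],
    )
    return response.choices[0].message.content.strip()
```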

For most LLM-based evaluation, prefer `ClassificationEvaluator`, which handles the LLM call, structured output parsing, and explanations automatically:

```python
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```

## Pre-Built Evaluators

Phoenix ships ready-made code evaluators for common checks:

```python
from phoenix.experiments.evaluators import ContainsAnyKeyword, JSONParseable, MatchesRegex

evaluators = [
    ContainsAnyKeyword(keywords=["disclaimer"]),
    JSONParseable(),
    MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}"),
]
```
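
These are intended for experiment runs. A sketch, assuming a Phoenix dataset and task function already exist (`dataset` and `my_task` are hypothetical placeholders, not names from this guide):

```python
from phoenix.experiments import run_experiment

# dataset and my_task are stand-ins for your own dataset and task.
experiment = run_experiment(
    dataset,
    my_task,
    evaluators=evaluators,
)
```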