Mirror of https://github.com/github/awesome-copilot.git (synced 2026-04-11)
Evaluators: Code Evaluators in Python
Deterministic evaluators that run without an LLM: fast, cheap, and reproducible.
Basic Pattern
```python
import json
import re

from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    return bool(re.search(r'\[\d+\]', output))

@create_evaluator(name="json_valid", kind="code")
def json_valid(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```
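Stripped of the decorator, both checks are plain Python and can be exercised directly. A minimal stand-alone sketch (the sample outputs are made up for illustration):

```python
import json
import re

def has_citation(output: str) -> bool:
    # True when the output contains a bracketed numeric citation like [1]
    return bool(re.search(r'\[\d+\]', output))

def json_valid(output: str) -> bool:
    # True when the output parses as JSON
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

print(has_citation("Paris is the capital of France [1]."))  # True
print(json_valid('{"answer": 42}'))                         # True
print(json_valid("not json"))                               # False
```

Because the wrapped function is deterministic, the same output always yields the same score, which is what makes code evaluators reproducible.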
Parameter Binding
| Parameter | Description |
|---|---|
| `output` | Task output |
| `input` | Example input |
| `expected` | Expected output |
| `metadata` | Example metadata |

```python
@create_evaluator(name="matches_expected", kind="code")
def matches_expected(output: str, expected: dict) -> bool:
    return output.strip() == expected.get("answer", "").strip()
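The decorator aside, the comparison itself is ordinary Python; a minimal stand-alone sketch with made-up inputs:

```python
def matches_expected(output: str, expected: dict) -> bool:
    # Exact match after trimming surrounding whitespace;
    # a missing "answer" key falls back to the empty string
    return output.strip() == expected.get("answer", "").strip()

print(matches_expected("  42 ", {"answer": "42"}))  # True
print(matches_expected("41", {"answer": "42"}))     # False
print(matches_expected("42", {}))                   # False (no "answer" key)
```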
Common Patterns
- Regex: `re.search(pattern, output)`
- JSON schema: `jsonschema.validate()`
- Keywords: `keyword in output.lower()`
- Length: `len(output.split())`
- Similarity: `editdistance.eval()` or Jaccard
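Several of these patterns need no third-party packages at all. A sketch of the keyword, length, and Jaccard-similarity checks in pure Python (the keyword and word-count thresholds are arbitrary placeholders):

```python
def keyword_present(output: str, keyword: str = "disclaimer") -> bool:
    # Case-insensitive substring check
    return keyword in output.lower()

def word_count_in_range(output: str, lo: int = 5, hi: int = 200) -> bool:
    # Whitespace-tokenized length check
    return lo <= len(output.split()) <= hi

def jaccard_similarity(output: str, expected: str) -> float:
    # |intersection| / |union| over lowercased word sets
    a, b = set(output.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

print(jaccard_similarity("the cat sat", "the cat ran"))  # 0.5
```

Returning a `float` from a similarity check maps straight onto the evaluator score, per the return-type rules below.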
Return Types
| Return type | Result |
|---|---|
| `bool` | `True` → score=1.0, label="True"; `False` → score=0.0, label="False" |
| `float`/`int` | Used as the score value directly |
| `str` (short, ≤3 words) | Used as the label value |
| `str` (long, ≥4 words) | Used as the explanation value |
| `dict` with score/label/explanation | Mapped to Score fields directly |
| `Score` object | Used as-is |
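The mapping in the table can be sketched as a plain Python dispatch. This is an illustration of the documented rules, not Phoenix's actual implementation:

```python
def interpret_result(value):
    # bool must be checked before int: bool is a subclass of int in Python
    if isinstance(value, bool):
        return {"score": 1.0 if value else 0.0, "label": str(value)}
    if isinstance(value, (int, float)):
        return {"score": float(value)}
    if isinstance(value, str):
        # Short strings become labels, longer ones explanations
        key = "label" if len(value.split()) <= 3 else "explanation"
        return {key: value}
    if isinstance(value, dict):
        return value  # assumed to carry score/label/explanation keys
    raise TypeError(f"unsupported return type: {type(value).__name__}")

print(interpret_result(True))        # {'score': 1.0, 'label': 'True'}
print(interpret_result("relevant"))  # {'label': 'relevant'}
```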
Important: Code vs LLM Evaluators
The @create_evaluator decorator wraps a plain Python function.
- `kind="code"` (default): for deterministic evaluators that don't call an LLM.
- `kind="llm"`: marks the evaluator as LLM-based, but you must implement the LLM call inside the function. The decorator does not call an LLM for you.
For most LLM-based evaluation, prefer ClassificationEvaluator which handles
the LLM call, structured output parsing, and explanations automatically:
```python
from phoenix.evals import LLM, ClassificationEvaluator

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```
Pre-Built
```python
from phoenix.experiments.evaluators import (
    ContainsAnyKeyword,
    JSONParseable,
    MatchesRegex,
)

evaluators = [
    ContainsAnyKeyword(keywords=["disclaimer"]),
    JSONParseable(),
    MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}"),
]
```
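The `MatchesRegex` pattern above matches ISO-8601 dates; the underlying check is equivalent to this plain-Python sketch (the function name and sample strings are illustrative, not part of the Phoenix API):

```python
import re

DATE_PATTERN = r"\d{4}-\d{2}-\d{2}"

def matches_regex(output: str, pattern: str = DATE_PATTERN) -> bool:
    # True when the pattern occurs anywhere in the output
    return bool(re.search(pattern, output))

print(matches_regex("Report generated on 2026-04-11."))  # True
print(matches_regex("Report generated yesterday."))      # False
```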