# Evaluators: Code Evaluators in Python

Deterministic evaluators without LLM calls. Fast, cheap, reproducible.

## Basic Pattern

```python
import re
import json

from phoenix.evals import create_evaluator


@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    # Bracketed numeric citations such as [1] or [12]
    return bool(re.search(r'\[\d+\]', output))


@create_evaluator(name="json_valid", kind="code")
def json_valid(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```
## Parameter Binding

Arguments are bound to the evaluator function by parameter name:

| Parameter | Description |
| --------- | ----------- |
| `output` | Task output |
| `input` | Example input |
| `expected` | Expected output |
| `metadata` | Example metadata |

```python
@create_evaluator(name="matches_expected", kind="code")
def matches_expected(output: str, expected: dict) -> bool:
    return output.strip() == expected.get("answer", "").strip()
```
## Common Patterns

- **Regex**: `re.search(pattern, output)`
- **JSON schema**: `jsonschema.validate()`
- **Keywords**: `keyword in output.lower()`
- **Length**: `len(output.split())`
- **Similarity**: `editdistance.eval()` or Jaccard
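
As a sketch, here are the keyword and length patterns wrapped as evaluators. The names, the keyword, and the 200-word budget are illustrative choices, not part of the library:

```python
from phoenix.evals import create_evaluator


@create_evaluator(name="has_disclaimer", kind="code")
def has_disclaimer(output: str) -> bool:
    # Keyword pattern: case-insensitive containment check
    return "disclaimer" in output.lower()


@create_evaluator(name="length_score", kind="code")
def length_score(output: str) -> float:
    # Length pattern: word count normalized against a hypothetical 200-word budget
    return min(len(output.split()) / 200.0, 1.0)
```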
## Return Types

| Return type | Result |
| ----------- | ------ |
| `bool` | `True` → score=1.0, label="True"; `False` → score=0.0, label="False" |
| `float`/`int` | Used as the `score` value directly |
| `str` (short, ≤3 words) | Used as the `label` value |
| `str` (long, ≥4 words) | Used as the `explanation` value |
| `dict` with `score`/`label`/`explanation` | Mapped to `Score` fields directly |
| `Score` object | Used as-is |
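
For example, a `dict` return lets a single evaluator fill in score, label, and explanation at once. A minimal sketch; the keyword list and labels are illustrative:

```python
from phoenix.evals import create_evaluator


@create_evaluator(name="keyword_coverage", kind="code")
def keyword_coverage(output: str) -> dict:
    # Hypothetical keyword list for illustration
    keywords = ["refund", "policy", "deadline"]
    hits = [k for k in keywords if k in output.lower()]
    return {
        "score": len(hits) / len(keywords),
        "label": "covered" if len(hits) == len(keywords) else "partial",
        "explanation": f"Matched {len(hits)}/{len(keywords)} keywords: {hits}",
    }
```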
## Important: Code vs LLM Evaluators

The `@create_evaluator` decorator wraps a plain Python function.

- `kind="code"` (default): For deterministic evaluators that don't call an LLM.
- `kind="llm"`: Marks the evaluator as LLM-based, but **you** must implement the LLM call inside the function. The decorator does not call an LLM for you (see the sketch below).
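
A minimal sketch of such a self-implemented `kind="llm"` evaluator, assuming the OpenAI Python client; the prompt, model, and evaluator name are illustrative:

```python
from openai import OpenAI

from phoenix.evals import create_evaluator

client = OpenAI()


@create_evaluator(name="is_polite", kind="llm")
def is_polite(output: str) -> bool:
    # You make the LLM call yourself; the decorator only wraps the function
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Answer yes or no: is the following text polite?\n\n{output}",
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```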
For most LLM-based evaluation, prefer `ClassificationEvaluator`, which handles the LLM call, structured output parsing, and explanations automatically:

```python
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```
## Pre-Built

```python
from phoenix.experiments.evaluators import ContainsAnyKeyword, JSONParseable, MatchesRegex

evaluators = [
    ContainsAnyKeyword(keywords=["disclaimer"]),
    JSONParseable(),
    MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}"),
]
```
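
These plug into an experiment run like any other evaluator. A minimal usage sketch, assuming a `dataset` and `task` are already defined:

```python
from phoenix.experiments import run_experiment

# dataset and task are assumed to exist; evaluators is the list above
experiment = run_experiment(dataset, task, evaluators=evaluators)
```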