mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-11 18:55:55 +00:00
chore: publish from staged
@@ -0,0 +1,91 @@
# Evaluators: Code Evaluators in Python

Deterministic evaluators that do not call an LLM. Fast, cheap, reproducible.

## Basic Pattern

```python
import re
import json
from phoenix.evals import create_evaluator


@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    # True when the output contains a bracketed numeric citation such as [1]
    return bool(re.search(r'\[\d+\]', output))


@create_evaluator(name="json_valid", kind="code")
def json_valid(output: str) -> bool:
    # True when the output parses as JSON
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```

## Parameter Binding

Arguments are bound by parameter name, so declare only the values the evaluator needs.

| Parameter | Description |
| --------- | ----------- |
| `output` | Task output |
| `input` | Example input |
| `expected` | Expected output |
| `metadata` | Example metadata |

```python
from phoenix.evals import create_evaluator


@create_evaluator(name="matches_expected", kind="code")
def matches_expected(output: str, expected: dict) -> bool:
    # Compare against the "answer" field of the expected output, ignoring outer whitespace
    return output.strip() == expected.get("answer", "").strip()
```
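
An evaluator can request other values from the table in the same way. The following is a minimal sketch, written as plain functions so it runs without Phoenix installed; in practice each function would carry the `@create_evaluator` decorator shown above. The `max_words` metadata key and the `question` input key are hypothetical names chosen for illustration, not part of any Phoenix schema.

```python
def within_word_budget(output: str, metadata: dict) -> bool:
    # `max_words` is a hypothetical metadata key; no limit applies when it is absent
    limit = metadata.get("max_words")
    if limit is None:
        return True
    return len(output.split()) <= limit


def mentions_question_terms(output: str, input: dict) -> float:
    # The parameter is named `input` (shadowing the builtin) because binding is by name.
    # Returns the fraction of words from the hypothetical `question` field
    # that reappear as substrings of the output.
    terms = {word.lower() for word in input.get("question", "").split()}
    if not terms:
        return 0.0
    text = output.lower()
    found = {term for term in terms if term in text}
    return len(found) / len(terms)
```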

## Common Patterns

- **Regex**: `re.search(pattern, output)`
- **JSON schema**: `jsonschema.validate()`
- **Keywords**: `keyword in output.lower()`
- **Length**: `len(output.split())`
- **Similarity**: `editdistance.eval()` or Jaccard
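
A few of the patterns above, sketched as standard-library functions. The function names and thresholds are illustrative assumptions; each could be wrapped with `@create_evaluator(..., kind="code")` once its parameters are reduced to the bound names (`output`, `expected`, and so on). The Jaccard variant here works on lowercase word sets, which is one reasonable choice among several.

```python
import re


def contains_keywords(output: str, keywords: list[str]) -> bool:
    # Case-insensitive substring check: every keyword must appear
    text = output.lower()
    return all(keyword.lower() in text for keyword in keywords)


def word_count_in_range(output: str, low: int, high: int) -> bool:
    # Whitespace-delimited word count with inclusive bounds
    return low <= len(output.split()) <= high


def jaccard_similarity(a: str, b: str) -> float:
    # Jaccard index over lowercase word sets: |A & B| / |A | B|
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)
```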

## Return Types

| Return type | Result |
| ----------- | ------ |
| `bool` | `True` → score=1.0, label="True"; `False` → score=0.0, label="False" |
| `float`/`int` | Used as the `score` value directly |
| `str` (short, ≤3 words) | Used as the `label` value |
| `str` (long, ≥4 words) | Used as the `explanation` value |
| `dict` with `score`/`label`/`explanation` | Mapped to Score fields directly |
| `Score` object | Used as-is |
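
The rules in the table can be read as a small dispatch on the return value's type. The helper below is a hypothetical illustration of that mapping, not Phoenix's implementation; it returns a plain `dict` and leaves out the `Score` object case, which is passed through unchanged.

```python
def normalize_result(value) -> dict:
    """Hypothetical helper mirroring the table above; not part of Phoenix."""
    # bool must be checked before int because bool is a subclass of int
    if isinstance(value, bool):
        return {"score": 1.0 if value else 0.0, "label": str(value)}
    if isinstance(value, (int, float)):
        return {"score": value}
    if isinstance(value, str):
        # Up to three words counts as a label, anything longer as an explanation
        key = "label" if len(value.split()) <= 3 else "explanation"
        return {key: value}
    if isinstance(value, dict):
        allowed = ("score", "label", "explanation")
        return {k: value[k] for k in allowed if k in value}
    raise TypeError(f"unsupported return type: {type(value).__name__}")
```

Checking `bool` first matters: without it, `True` would fall into the numeric branch and lose its label.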

## Important: Code vs LLM Evaluators

The `@create_evaluator` decorator wraps a plain Python function.

- `kind="code"` (default): For deterministic evaluators that don't call an LLM.
- `kind="llm"`: Marks the evaluator as LLM-based, but **you** must implement the LLM
  call inside the function. The decorator does not call an LLM for you.

For most LLM-based evaluation, prefer `ClassificationEvaluator`, which handles
the LLM call, structured output parsing, and explanations automatically:

```python
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```
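
For the less common case where `kind="llm"` is used with `@create_evaluator`, the function body has to make the model call and parse the reply itself. A rough sketch of that shape follows, assuming a hypothetical `call_llm` callable (prompt text in, completion text out) in place of a real client. It is written without the decorator so it runs standalone.

```python
from typing import Callable


def make_relevance_evaluator(call_llm: Callable[[str], str]):
    # `call_llm` is a hypothetical stand-in for a real model client
    def relevance(input: str, output: str) -> str:
        prompt = (
            "Is this relevant?\n"
            f"{input}\n{output}\n"
            "Answer with exactly one word: relevant or irrelevant."
        )
        answer = call_llm(prompt).strip().lower()
        # Map anything unexpected to a distinct label rather than guessing
        return answer if answer in {"relevant", "irrelevant"} else "unparseable"

    # With Phoenix installed, `relevance` would additionally be decorated with
    # @create_evaluator(name="relevance", kind="llm").
    return relevance


def fake_llm(prompt: str) -> str:
    # Stub used only for this example
    return " Relevant\n"


evaluator = make_relevance_evaluator(fake_llm)
print(evaluator(input="What is 2 + 2?", output="4"))  # prints "relevant"
```

The prompt construction, reply parsing, and fallback label are all handwritten here, which is the work `ClassificationEvaluator` otherwise does.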

## Pre-Built

```python
from phoenix.experiments.evaluators import ContainsAnyKeyword, JSONParseable, MatchesRegex

evaluators = [
    ContainsAnyKeyword(keywords=["disclaimer"]),
    JSONParseable(),
    MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}"),
]
```