mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-13 11:45:56 +00:00
# Fundamentals
Application-specific tests for AI systems. Code first, LLM for nuance, human for truth.
## Evaluator Types
| Type | Speed | Cost | Use Case |
| ---- | ----- | ---- | -------- |
| **Code** | Fast | Cheap | Regex, JSON, format, exact match |
| **LLM** | Medium | Medium | Subjective quality, complex criteria |
| **Human** | Slow | Expensive | Ground truth, calibration |

**Decision:** Code first → LLM only when code can't capture criteria → Human for calibration.
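
This decision rule can be sketched as a tiny helper (illustrative only; the parameter names are assumptions, not a Phoenix API):

```python
def pick_evaluator_kind(codifiable: bool, needs_calibration: bool) -> str:
    """Illustrative decision rule: prefer code, reserve humans for
    calibration/ground truth, fall back to LLM for everything else."""
    if codifiable:
        return "code"
    if needs_calibration:
        return "human"
    return "llm"
```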
## Score Structure
| Property | Required | Description |
| -------- | -------- | ----------- |
| `name` | Yes | Evaluator name |
| `kind` | Yes | `"code"`, `"llm"`, `"human"` |
| `score` | No* | 0-1 numeric |
| `label` | No* | `"pass"`, `"fail"` |
| `explanation` | No | Rationale |

*One of `score` or `label` required.
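
A score record with this shape can be represented as a plain dict; a minimal validity check (a sketch of the rules in the table above, not the actual Phoenix score class) might look like:

```python
def is_valid_score(record: dict) -> bool:
    # name and kind are required; at least one of score/label must be present
    if "name" not in record or record.get("kind") not in ("code", "llm", "human"):
        return False
    return ("score" in record) or ("label" in record)

score = {
    "name": "has_citation",
    "kind": "code",
    "label": "pass",
    "explanation": "Output contains a [n]-style citation.",
}
print(is_valid_score(score))                      # True
print(is_valid_score({"name": "x", "kind": "llm"}))  # False: no score or label
```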
## Binary > Likert
Use pass/fail, not 1-5 scales. Clearer criteria, easier calibration.
```python
# Multiple binary checks instead of one Likert scale
evaluators = [
    AnswersQuestion(),   # Yes/No
    UsesContext(),       # Yes/No
    NoHallucination(),   # Yes/No
]
```
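
Binary labels also aggregate cleanly into pass rates, which is part of why they calibrate more easily than Likert scores (a sketch; the result field names are assumptions):

```python
results = [
    {"name": "answers_question", "label": "pass"},
    {"name": "uses_context", "label": "pass"},
    {"name": "no_hallucination", "label": "fail"},
]

# Fraction of binary checks that passed
pass_rate = sum(r["label"] == "pass" for r in results) / len(results)
print(f"{pass_rate:.2f}")  # 0.67
```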
## Quick Patterns
### Code Evaluator
```python
import re

from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    return bool(re.search(r'\[\d+\]', output))
```
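
The `@create_evaluator` wrapping is Phoenix-specific, but the underlying check is plain Python and can be exercised on its own:

```python
import re

def has_citation(output: str) -> bool:
    # Matches [1]-style numeric citations anywhere in the output
    return bool(re.search(r'\[\d+\]', output))

print(has_citation("See [1] for details."))  # True
print(has_citation("No citations here."))    # False
```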
### LLM Evaluator
```python
from phoenix.evals import ClassificationEvaluator, LLM

evaluator = ClassificationEvaluator(
    name="helpfulness",
    prompt_template="...",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"not_helpful": 0, "helpful": 1},
)
```
### Run Experiment
```python
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[evaluator1, evaluator2],
)
print(experiment.aggregate_scores)
```