# Eval Tests: Evaluator Selection and Test Writing
This reference covers Step 5 of the eval-driven-dev process: choosing evaluators, writing the test file, and running `pixie test`.

**Before writing any test code, re-read `references/pixie-api.md`** (Eval Runner API and Evaluator catalog sections) for exact parameter names and current evaluator signatures — these change when the package is updated.

---

## Evaluator selection

Choose evaluators based on the **output type** and your eval criteria from Step 1, not the app type.

### Decision table
| Output type | Evaluator category | Examples |
| ----------- | ------------------ | -------- |
| Deterministic (classification labels, yes/no, fixed-format) | Heuristic: `ExactMatchEval`, `JSONDiffEval`, `ValidJSONEval` | Label classification, JSON extraction |
| Open-ended text with a reference answer | LLM-as-judge: `FactualityEval`, `ClosedQAEval`, `AnswerCorrectnessEval` | Chatbot responses, QA, summaries |
| Text with expected context/grounding | RAG evaluators: `FaithfulnessEval`, `ContextRelevancyEval` | RAG pipelines, context-grounded responses |
| Text with style/format requirements | Custom LLM-as-judge via `create_llm_evaluator` | Voice-friendly responses, tone checks |
| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone |
### Critical rules

- For open-ended LLM text, **never** use `ExactMatchEval`. LLM outputs are non-deterministic — exact match will either always fail or always pass (if comparing against the same output). Use LLM-as-judge evaluators instead.
- `AnswerRelevancyEval` is **RAG-only** — it requires a `context` value in the trace. Returns 0.0 without it. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt.
- Do NOT use comparison evaluators (`FactualityEval`, `ClosedQAEval`, `ExactMatchEval`) on items without `expected_output` — they produce meaningless scores.
### When `expected_output` IS available

Use comparison-based evaluators:

| Evaluator | Use when |
| --------- | -------- |
| `FactualityEval` | Output is factually correct compared to reference |
| `ClosedQAEval` | Output matches the expected answer |
| `ExactMatchEval` | Exact string match (structured/deterministic outputs only) |
| `AnswerCorrectnessEval` | Answer is correct vs reference |
### When `expected_output` is NOT available

Use standalone evaluators that judge quality without a reference:

| Evaluator | Use when | Note |
| --------- | -------- | ---- |
| `FaithfulnessEval` | Response faithful to provided context | RAG pipelines |
| `ContextRelevancyEval` | Retrieved context relevant to query | RAG pipelines |
| `AnswerRelevancyEval` | Answer addresses the question | **RAG only** — needs `context` in trace. Returns 0.0 without it. |
| `PossibleEval` | Output is plausible / feasible | General purpose |
| `ModerationEval` | Output is safe and appropriate | Content safety |
| `SecurityEval` | No security vulnerabilities | Security check |

For non-RAG apps needing response relevance, write a `create_llm_evaluator` instead.
---

## Custom evaluators

### `create_llm_evaluator` factory

Use when the quality dimension is domain-specific and no built-in evaluator fits:
```python
from pixie import create_llm_evaluator

concise_voice_style = create_llm_evaluator(
    name="ConciseVoiceStyle",
    prompt_template="""
You are evaluating whether this response is concise and phone-friendly.

Input: {eval_input}
Response: {eval_output}

Score 1.0 if the response is concise (under 3 sentences), directly addresses
the question, and uses conversational language suitable for a phone call.
Score 0.0 if it's verbose, off-topic, or uses written-style formatting.
""",
)
```
**How template variables work**: `{eval_input}`, `{eval_output}`, `{expected_output}` are the only placeholders. Each is replaced with a string representation of the corresponding `Evaluable` field — if the field is a dict or list, it becomes a JSON string. The LLM judge sees the full serialized value.

**Rules**:

- **Only `{eval_input}`, `{eval_output}`, `{expected_output}`** — no nested access like `{eval_input[key]}` (this will crash with a `TypeError`)
- **Keep templates short and direct** — the system prompt already tells the LLM to return `Score: X.X`. Your template just needs to present the data and define the scoring criteria.
- **Don't instruct the LLM to "parse" or "extract" data** — just present the values and state the criteria. The LLM can read JSON naturally.
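As an illustration of that serialization rule (a sketch, not pixie's actual implementation; `render_template` is a hypothetical helper):

```python
import json


def render_template(template, eval_input, eval_output, expected_output=""):
    # Sketch of the substitution behavior described above: non-string
    # fields are serialized to JSON before being dropped into the prompt,
    # so the LLM judge sees the full value.
    def as_text(value):
        return value if isinstance(value, str) else json.dumps(value)

    return template.format(
        eval_input=as_text(eval_input),
        eval_output=as_text(eval_output),
        expected_output=as_text(expected_output),
    )
```

A dict `eval_input` like `{"question": "..."}` therefore reaches the judge as the JSON text `{"question": "..."}`, which is why templates never need nested access.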
**Non-RAG response relevance** (instead of `AnswerRelevancyEval`):

```python
response_relevance = create_llm_evaluator(
    name="ResponseRelevance",
    prompt_template="""
You are evaluating whether a customer support response is relevant and helpful.

Input: {eval_input}
Response: {eval_output}
Expected: {expected_output}

Score 1.0 if the response directly addresses the question and meets expectations.
Score 0.5 if partially relevant but misses important aspects.
Score 0.0 if off-topic, ignores the question, or contradicts expectations.
""",
)
```
### Manual custom evaluator
```python
from pixie import Evaluation, Evaluable


async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```
---

## Writing the test file

Create `pixie_qa/tests/test_<feature>.py`. The pattern: a `runnable` adapter that calls the app's production function, plus `async` test functions that `await` `assert_dataset_pass`.

**Before writing any test code, re-read the `assert_dataset_pass` API reference below.** The exact parameter names matter — using `dataset=` instead of `dataset_name=`, or omitting `await`, will cause failures that are hard to debug. Do not rely on memory from earlier in the conversation.

### Test file template
```python
from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call

from myapp import answer_question


def runnable(eval_input):
    """Replays one dataset item through the app.

    Calls the same function the production app uses.
    enable_storage() here ensures traces are captured during eval runs.
    """
    enable_storage()
    answer_question(**eval_input)


async def test_answer_quality():
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-golden-set",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )
```
### `assert_dataset_pass` API — exact parameter names
```python
await assert_dataset_pass(
    runnable=runnable,              # callable that takes eval_input dict
    dataset_name="my-dataset",      # NOT dataset_path — name of dataset created in Step 4
    evaluators=[...],               # list of evaluator instances
    pass_criteria=ScoreThreshold(   # NOT thresholds — ScoreThreshold object
        threshold=0.7,              # minimum score to count as passing
        pct=0.8,                    # fraction of items that must pass
    ),
    from_trace=last_llm_call,       # which span to extract eval data from
)
```
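To make the two numbers concrete: `threshold` applies per item, `pct` to the whole run. A quick sketch of the arithmetic (illustrative only, not pixie internals):

```python
# With ScoreThreshold(threshold=0.7, pct=0.8): each item passes if its
# score is >= 0.7, and the dataset passes if >= 80% of items pass.
scores = [0.9, 0.8, 0.75, 0.6, 0.95, 0.7, 0.85, 0.72, 0.5, 0.9]

item_passes = [s >= 0.7 for s in scores]  # 8 of 10 items clear the bar
dataset_passes = sum(item_passes) / len(scores) >= 0.8  # 0.8 >= 0.8: passes
```

So a run can pass overall even when a minority of items score poorly; tighten `pct` toward 1.0 when every item must pass.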
### Common mistakes that break tests
| Mistake | Symptom | Fix |
| ------- | ------- | --- |
| `def test_...():` (sync) | RuntimeWarning "coroutine was never awaited", test passes vacuously | Use `async def test_...():` |
| No `await` | Same: "coroutine was never awaited" | Add `await` before `assert_dataset_pass(...)` |
| `dataset_path="..."` | TypeError: unexpected keyword argument | Use `dataset_name="..."` |
| `thresholds={...}` | TypeError: unexpected keyword argument | Use `pass_criteria=ScoreThreshold(...)` |
| Omitting `from_trace` | Evaluator may not find the right span | Add `from_trace=last_llm_call` |
**If `pixie test` shows "No assert_pass / assert_dataset_pass calls recorded"**, the test passed vacuously because `assert_dataset_pass` was never awaited. Fix the async signature and await immediately.
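The sync/no-`await` failure mode is easy to reproduce outside pixie. A minimal sketch with a hypothetical stand-in coroutine:

```python
import asyncio
import warnings


async def assert_dataset_pass_stub():
    # Stand-in for the real assert_dataset_pass: always fails when awaited.
    raise AssertionError("dataset did not pass")


def test_sync_bug():
    # BUG: this creates a coroutine object and discards it. The assertion
    # body never runs, so the "test" finishes without error (vacuous pass).
    assert_dataset_pass_stub()


async def test_fixed():
    # Correct shape: async test function plus await, so a failing dataset
    # actually fails the test.
    await assert_dataset_pass_stub()


with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)  # hide "never awaited"
    test_sync_bug()  # completes silently despite the failing assertion
```

This is why the scorecard check below matters: the buggy version produces no recorded evaluations at all rather than a visible failure.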
### Multiple test functions

Split into separate test functions when you have different evaluator sets:
```python
async def test_factual_answers():
    """Test items that have deterministic expected outputs."""
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-deterministic",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )


async def test_response_style():
    """Test open-ended quality criteria."""
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-open-ended",
        evaluators=[concise_voice_style],
        pass_criteria=ScoreThreshold(threshold=0.6, pct=0.8),
        from_trace=last_llm_call,
    )
```
### Key points

- `enable_storage()` belongs inside the `runnable`, not at module level — it needs to fire on each invocation so the trace is captured for that specific run.
- The `runnable` imports and calls the **same function** that production uses — the app's entry point, going through the utility function from Step 3.
- If the `runnable` calls a different function than what the utility function calls, something is wrong.
- The `eval_input` dict should contain **only the semantic arguments** the function needs (e.g., `question`, `messages`, `context`). The `@observe` decorator automatically strips `self` and `cls`.
- **Choose evaluators that match your data.** If dataset items have `expected_output`, use comparison evaluators. If not, use standalone evaluators.
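As an illustration of the `eval_input` shape (the entry point here is hypothetical): the dict's keys must match the observed function's parameter names, because the runnable unpacks it with `**eval_input`:

```python
def answer_question(question: str) -> str:
    # Hypothetical production entry point; parameter name "question"
    # must match the eval_input key below.
    return f"(stub answer for: {question})"


# One dataset item's semantic arguments, exactly as the runnable receives them.
eval_input = {"question": "What are your opening hours?"}
result = answer_question(**eval_input)
```

A mismatched key (e.g. `{"q": ...}`) would raise a `TypeError` inside the runnable before any evaluator runs.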
---

## Running tests

The test runner is `pixie test` (not `pytest`):
```bash
uv run pixie test                   # run all test_*.py in current directory
uv run pixie test pixie_qa/tests/   # specify path
uv run pixie test -k factuality     # filter by name
uv run pixie test -v                # verbose: shows per-case scores and reasoning
```
`pixie test` automatically loads the `.env` file before running tests, so API keys do not need to be exported in the shell. No `sys.path` hacks are needed in test files.

The `-v` flag is important: it shows per-case scores and evaluator reasoning, which makes it much easier to see what's passing and what isn't.

### After running, verify the scorecard

1. Shows "N/M tests passed" with real numbers
2. Does NOT say "No assert_pass / assert_dataset_pass calls recorded" (that means missing `await`)
3. Per-evaluator scores appear with real values

A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.