Files
awesome-copilot/skills/eval-driven-dev/references/eval-tests.md
Yiou Li df0ed6aa51 update eval-driven-dev skill. (#1201)
* update eval-driven-dev skill.

Split SKILL into multi-level to keep the skill body under 500 lines, rewrite instructions.

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-30 08:07:39 +11:00

242 lines
12 KiB
Markdown

# Eval Tests: Evaluator Selection and Test Writing
This reference covers Step 5 of the eval-driven-dev process: choosing evaluators, writing the test file, and running `pixie test`.
**Before writing any test code, re-read `references/pixie-api.md`** (Eval Runner API and Evaluator catalog sections) for exact parameter names and current evaluator signatures — these change when the package is updated.
---
## Evaluator selection
Choose evaluators based on the **output type** and your eval criteria from Step 1, not the app type.
### Decision table
| Output type | Evaluator category | Examples |
| ----------------------------------------------------------- | ----------------------------------------------------------------------- | ----------------------------------------- |
| Deterministic (classification labels, yes/no, fixed-format) | Heuristic: `ExactMatchEval`, `JSONDiffEval`, `ValidJSONEval` | Label classification, JSON extraction |
| Open-ended text with a reference answer | LLM-as-judge: `FactualityEval`, `ClosedQAEval`, `AnswerCorrectnessEval` | Chatbot responses, QA, summaries |
| Text with expected context/grounding | RAG evaluators: `FaithfulnessEval`, `ContextRelevancyEval` | RAG pipelines, context-grounded responses |
| Text with style/format requirements | Custom LLM-as-judge via `create_llm_evaluator` | Voice-friendly responses, tone checks |
| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone |
### Critical rules
- For open-ended LLM text, **never** use `ExactMatchEval`. LLM outputs are non-deterministic — exact match will either always fail or always pass (if comparing against the same output). Use LLM-as-judge evaluators instead.
- `AnswerRelevancyEval` is **RAG-only** — it requires a `context` value in the trace. Returns 0.0 without it. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt.
- Do NOT use comparison evaluators (`FactualityEval`, `ClosedQAEval`, `ExactMatchEval`) on items without `expected_output` — they produce meaningless scores.
### When `expected_output` IS available
Use comparison-based evaluators:
| Evaluator | Use when |
| ----------------------- | ---------------------------------------------------------- |
| `FactualityEval` | Output is factually correct compared to reference |
| `ClosedQAEval` | Output matches the expected answer |
| `ExactMatchEval` | Exact string match (structured/deterministic outputs only) |
| `AnswerCorrectnessEval` | Answer is correct vs reference |
### When `expected_output` is NOT available
Use standalone evaluators that judge quality without a reference:
| Evaluator | Use when | Note |
| ---------------------- | ------------------------------------- | ---------------------------------------------------------------- |
| `FaithfulnessEval` | Response faithful to provided context | RAG pipelines |
| `ContextRelevancyEval` | Retrieved context relevant to query | RAG pipelines |
| `AnswerRelevancyEval` | Answer addresses the question | **RAG only** — needs `context` in trace. Returns 0.0 without it. |
| `PossibleEval` | Output is plausible / feasible | General purpose |
| `ModerationEval` | Output is safe and appropriate | Content safety |
| `SecurityEval` | No security vulnerabilities | Security check |
For non-RAG apps needing response relevance, write a `create_llm_evaluator` instead.
---
## Custom evaluators
### `create_llm_evaluator` factory
Use when the quality dimension is domain-specific and no built-in evaluator fits:
```python
from pixie import create_llm_evaluator
concise_voice_style = create_llm_evaluator(
name="ConciseVoiceStyle",
prompt_template="""
You are evaluating whether this response is concise and phone-friendly.
Input: {eval_input}
Response: {eval_output}
Score 1.0 if the response is concise (under 3 sentences), directly addresses
the question, and uses conversational language suitable for a phone call.
Score 0.0 if it's verbose, off-topic, or uses written-style formatting.
""",
)
```
**How template variables work**: `{eval_input}`, `{eval_output}`, `{expected_output}` are the only placeholders. Each is replaced with a string representation of the corresponding `Evaluable` field — if the field is a dict or list, it becomes a JSON string. The LLM judge sees the full serialized value.
**Rules**:
- **Only `{eval_input}`, `{eval_output}`, `{expected_output}`** — no nested access like `{eval_input[key]}` (this will crash with a `TypeError`)
- **Keep templates short and direct** — the system prompt already tells the LLM to return `Score: X.X`. Your template just needs to present the data and define the scoring criteria.
- **Don't instruct the LLM to "parse" or "extract" data** — just present the values and state the criteria. The LLM can read JSON naturally.
**Non-RAG response relevance** (instead of `AnswerRelevancyEval`):
```python
response_relevance = create_llm_evaluator(
name="ResponseRelevance",
prompt_template="""
You are evaluating whether a customer support response is relevant and helpful.
Input: {eval_input}
Response: {eval_output}
Expected: {expected_output}
Score 1.0 if the response directly addresses the question and meets expectations.
Score 0.5 if partially relevant but misses important aspects.
Score 0.0 if off-topic, ignores the question, or contradicts expectations.
""",
)
```
### Manual custom evaluator
```python
from pixie import Evaluation, Evaluable
async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
# evaluable.eval_input — what was passed to the observed function
# evaluable.eval_output — what the function returned
# evaluable.expected_output — reference answer (UNSET if not provided)
score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
return Evaluation(score=score, reasoning="...")
```
---
## Writing the test file
Create `pixie_qa/tests/test_<feature>.py`. The pattern: a `runnable` adapter that calls the app's production function, plus `async` test functions that `await` `assert_dataset_pass`.
**Before writing any test code, re-read the `assert_dataset_pass` API reference below.** The exact parameter names matter — using `dataset=` instead of `dataset_name=`, or omitting `await`, will cause failures that are hard to debug. Do not rely on memory from earlier in the conversation.
### Test file template
```python
from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
from myapp import answer_question
def runnable(eval_input):
"""Replays one dataset item through the app.
Calls the same function the production app uses.
enable_storage() here ensures traces are captured during eval runs.
"""
enable_storage()
answer_question(**eval_input)
async def test_answer_quality():
await assert_dataset_pass(
runnable=runnable,
dataset_name="qa-golden-set",
evaluators=[FactualityEval()],
pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
from_trace=last_llm_call,
)
```
### `assert_dataset_pass` API — exact parameter names
```python
await assert_dataset_pass(
runnable=runnable, # callable that takes eval_input dict
dataset_name="my-dataset", # NOT dataset_path — name of dataset created in Step 4
evaluators=[...], # list of evaluator instances
pass_criteria=ScoreThreshold( # NOT thresholds — ScoreThreshold object
threshold=0.7, # minimum score to count as passing
pct=0.8, # fraction of items that must pass
),
from_trace=last_llm_call, # which span to extract eval data from
)
```
### Common mistakes that break tests
| Mistake | Symptom | Fix |
| ------------------------ | ------------------------------------------------------------------- | --------------------------------------------- |
| `def test_...():` (sync) | RuntimeWarning "coroutine was never awaited", test passes vacuously | Use `async def test_...():` |
| No `await` | Same: "coroutine was never awaited" | Add `await` before `assert_dataset_pass(...)` |
| `dataset_path="..."` | TypeError: unexpected keyword argument | Use `dataset_name="..."` |
| `thresholds={...}` | TypeError: unexpected keyword argument | Use `pass_criteria=ScoreThreshold(...)` |
| Omitting `from_trace` | Evaluator may not find the right span | Add `from_trace=last_llm_call` |
**If `pixie test` shows "No assert_pass / assert_dataset_pass calls recorded"**, the test passed vacuously because `assert_dataset_pass` was never awaited. Fix the async signature and await immediately.
### Multiple test functions
Split into separate test functions when you have different evaluator sets:
```python
async def test_factual_answers():
"""Test items that have deterministic expected outputs."""
await assert_dataset_pass(
runnable=runnable,
dataset_name="qa-deterministic",
evaluators=[FactualityEval()],
pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
from_trace=last_llm_call,
)
async def test_response_style():
"""Test open-ended quality criteria."""
await assert_dataset_pass(
runnable=runnable,
dataset_name="qa-open-ended",
evaluators=[concise_voice_style],
pass_criteria=ScoreThreshold(threshold=0.6, pct=0.8),
from_trace=last_llm_call,
)
```
### Key points
- `enable_storage()` belongs inside the `runnable`, not at module level — it needs to fire on each invocation so the trace is captured for that specific run.
- The `runnable` imports and calls the **same function** that production uses — the app's entry point, going through the utility function from Step 3.
- If the `runnable` calls a different function than what the utility function calls, something is wrong.
- The `eval_input` dict should contain **only the semantic arguments** the function needs (e.g., `question`, `messages`, `context`). The `@observe` decorator automatically strips `self` and `cls`.
- **Choose evaluators that match your data.** If dataset items have `expected_output`, use comparison evaluators. If not, use standalone evaluators.
---
## Running tests
The test runner is `pixie test` (not `pytest`):
```bash
uv run pixie test # run all test_*.py in current directory
uv run pixie test pixie_qa/tests/ # specify path
uv run pixie test -k factuality # filter by name
uv run pixie test -v # verbose: shows per-case scores and reasoning
```
`pixie test` automatically loads the `.env` file before running tests, so API keys do not need to be exported in the shell. No `sys.path` hacks are needed in test files.
The `-v` flag is important: it shows per-case scores and evaluator reasoning, which makes it much easier to see what's passing and what isn't.
### After running, verify the scorecard
1. Shows "N/M tests passed" with real numbers
2. Does NOT say "No assert_pass / assert_dataset_pass calls recorded" (that means missing `await`)
3. Per-evaluator scores appear with real values
A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.