# Eval Tests: Evaluator Selection and Test Writing

This reference covers Step 5 of the eval-driven-dev process: choosing evaluators, writing the test file, and running `pixie test`.

Before writing any test code, re-read `references/pixie-api.md` (Eval Runner API and Evaluator catalog sections) for exact parameter names and current evaluator signatures — these change when the package is updated.
## Evaluator selection
Choose evaluators based on the output type and your eval criteria from Step 1, not the app type.
### Decision table

| Output type | Evaluator category | Examples |
|---|---|---|
| Deterministic (classification labels, yes/no, fixed-format) | Heuristic: `ExactMatchEval`, `JSONDiffEval`, `ValidJSONEval` | Label classification, JSON extraction |
| Open-ended text with a reference answer | LLM-as-judge: `FactualityEval`, `ClosedQAEval`, `AnswerCorrectnessEval` | Chatbot responses, QA, summaries |
| Text with expected context/grounding | RAG evaluators: `FaithfulnessEval`, `ContextRelevancyEval` | RAG pipelines, context-grounded responses |
| Text with style/format requirements | Custom LLM-as-judge via `create_llm_evaluator` | Voice-friendly responses, tone checks |
| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone |
### Critical rules

- For open-ended LLM text, never use `ExactMatchEval`. LLM outputs are non-deterministic — exact match will either always fail or always pass (if comparing against the same output). Use LLM-as-judge evaluators instead.
- `AnswerRelevancyEval` is RAG-only — it requires a `context` value in the trace and returns 0.0 without it. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt.
- Do NOT use comparison evaluators (`FactualityEval`, `ClosedQAEval`, `ExactMatchEval`) on items without `expected_output` — they produce meaningless scores.
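The first rule is easy to demonstrate in plain Python (this sketch does not use the pixie API — the strings are invented for illustration): two runs of the same prompt can produce semantically identical answers that are never string-equal, so a heuristic exact match scores a correct answer as a failure.

```python
# Two semantically equivalent LLM answers to "What is the capital of France?"
# (hypothetical outputs from two runs of the same non-deterministic model)
run_1 = "The capital of France is Paris."
run_2 = "Paris is France's capital."

# Heuristic exact match: fails even though both answers are correct.
exact_match_score = 1.0 if run_1 == run_2 else 0.0
print(exact_match_score)  # 0.0
```

An LLM-as-judge evaluator would score both runs 1.0, because it compares meaning rather than bytes.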
### When expected_output IS available

Use comparison-based evaluators:

| Evaluator | Use when |
|---|---|
| `FactualityEval` | Output is factually correct compared to reference |
| `ClosedQAEval` | Output matches the expected answer |
| `ExactMatchEval` | Exact string match (structured/deterministic outputs only) |
| `AnswerCorrectnessEval` | Answer is correct vs reference |
### When expected_output is NOT available

Use standalone evaluators that judge quality without a reference:

| Evaluator | Use when | Note |
|---|---|---|
| `FaithfulnessEval` | Response faithful to provided context | RAG pipelines |
| `ContextRelevancyEval` | Retrieved context relevant to query | RAG pipelines |
| `AnswerRelevancyEval` | Answer addresses the question | RAG only — needs `context` in trace; returns 0.0 without it |
| `PossibleEval` | Output is plausible / feasible | General purpose |
| `ModerationEval` | Output is safe and appropriate | Content safety |
| `SecurityEval` | No security vulnerabilities | Security check |
For non-RAG apps that need response relevance, write a custom evaluator with `create_llm_evaluator` instead.
## Custom evaluators

### `create_llm_evaluator` factory

Use when the quality dimension is domain-specific and no built-in evaluator fits:
```python
from pixie import create_llm_evaluator

concise_voice_style = create_llm_evaluator(
    name="ConciseVoiceStyle",
    prompt_template="""
You are evaluating whether this response is concise and phone-friendly.

Input: {eval_input}
Response: {eval_output}

Score 1.0 if the response is concise (under 3 sentences), directly addresses
the question, and uses conversational language suitable for a phone call.
Score 0.0 if it's verbose, off-topic, or uses written-style formatting.
""",
)
```
How template variables work: `{eval_input}`, `{eval_output}`, `{expected_output}` are the only placeholders. Each is replaced with a string representation of the corresponding `Evaluable` field — if the field is a dict or list, it becomes a JSON string. The LLM judge sees the full serialized value.
Rules:

- Only `{eval_input}`, `{eval_output}`, `{expected_output}` — no nested access like `{eval_input[key]}` (this will crash with a `TypeError`).
- Keep templates short and direct — the system prompt already tells the LLM to return `Score: X.X`. Your template just needs to present the data and define the scoring criteria.
- Don't instruct the LLM to "parse" or "extract" data — just present the values and state the criteria. The LLM can read JSON naturally.
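The substitution behavior described above can be sketched in plain Python. The `render_template` helper below is hypothetical (it is not the pixie implementation), but it mirrors the documented semantics: dicts and lists are serialized to JSON before substitution, and nested field access in the template crashes with a `TypeError` because the value is already a string by the time `str.format` indexes into it.

```python
import json

def render_template(template: str, eval_input, eval_output, expected_output=None) -> str:
    """Hypothetical sketch of the documented substitution: each placeholder is
    replaced with a string representation of the field; dicts/lists become JSON."""
    def to_str(value):
        return json.dumps(value) if isinstance(value, (dict, list)) else str(value)
    return template.format(
        eval_input=to_str(eval_input),
        eval_output=to_str(eval_output),
        expected_output=to_str(expected_output),
    )

prompt = render_template(
    "Input: {eval_input}\nResponse: {eval_output}",
    eval_input={"question": "What is 2+2?"},  # dict -> JSON string
    eval_output="4",
)
print(prompt)

# Nested access indexes into the serialized *string*, not the original dict:
try:
    "{eval_input[key]}".format(eval_input='{"key": "value"}')
    nested_crashed = False
except TypeError:  # string indices must be integers
    nested_crashed = True
print(nested_crashed)  # True
```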
Non-RAG response relevance (instead of `AnswerRelevancyEval`):
```python
response_relevance = create_llm_evaluator(
    name="ResponseRelevance",
    prompt_template="""
You are evaluating whether a customer support response is relevant and helpful.

Input: {eval_input}
Response: {eval_output}
Expected: {expected_output}

Score 1.0 if the response directly addresses the question and meets expectations.
Score 0.5 if partially relevant but misses important aspects.
Score 0.0 if off-topic, ignores the question, or contradicts expectations.
""",
)
```
### Manual custom evaluator

```python
from pixie import Evaluation, Evaluable

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```
## Writing the test file

Create `pixie_qa/tests/test_<feature>.py`. The pattern: a runnable adapter that calls the app's production function, plus async test functions that await `assert_dataset_pass`.

Before writing any test code, re-read the `assert_dataset_pass` API reference below. The exact parameter names matter — using `dataset=` instead of `dataset_name=`, or omitting `await`, will cause failures that are hard to debug. Do not rely on memory from earlier in the conversation.
### Test file template

```python
from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
from myapp import answer_question

def runnable(eval_input):
    """Replays one dataset item through the app.

    Calls the same function the production app uses.
    enable_storage() here ensures traces are captured during eval runs.
    """
    enable_storage()
    answer_question(**eval_input)

async def test_answer_quality():
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-golden-set",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )
```
### `assert_dataset_pass` API — exact parameter names

```python
await assert_dataset_pass(
    runnable=runnable,             # callable that takes eval_input dict
    dataset_name="my-dataset",     # NOT dataset_path — name of dataset created in Step 4
    evaluators=[...],              # list of evaluator instances
    pass_criteria=ScoreThreshold(  # NOT thresholds — ScoreThreshold object
        threshold=0.7,             # minimum score to count as passing
        pct=0.8,                   # fraction of items that must pass
    ),
    from_trace=last_llm_call,      # which span to extract eval data from
)
```
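The two `ScoreThreshold` numbers combine as described by the comments above: an item passes when its score is at least `threshold`, and the dataset passes when the passing fraction is at least `pct`. A plain-Python sketch of that logic (a hypothetical helper, not pixie's implementation):

```python
def dataset_passes(scores, threshold=0.7, pct=0.8):
    """Sketch of ScoreThreshold semantics: at least `pct` of items
    must score >= `threshold` for the dataset run to pass."""
    passing = sum(1 for s in scores if s >= threshold)
    return passing / len(scores) >= pct

print(dataset_passes([0.9, 0.8, 0.75, 0.7, 0.3]))  # 4/5 = 0.8 >= 0.8 -> True
print(dataset_passes([0.9, 0.8, 0.6, 0.6, 0.3]))   # 2/5 = 0.4 < 0.8  -> False
```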
### Common mistakes that break tests

| Mistake | Symptom | Fix |
|---|---|---|
| `def test_...():` (sync) | RuntimeWarning "coroutine was never awaited", test passes vacuously | Use `async def test_...():` |
| No `await` | Same: "coroutine was never awaited" | Add `await` before `assert_dataset_pass(...)` |
| `dataset_path="..."` | `TypeError: unexpected keyword argument` | Use `dataset_name="..."` |
| `thresholds={...}` | `TypeError: unexpected keyword argument` | Use `pass_criteria=ScoreThreshold(...)` |
| Omitting `from_trace` | Evaluator may not find the right span | Add `from_trace=last_llm_call` |

If `pixie test` shows "No assert_pass / assert_dataset_pass calls recorded", the test passed vacuously because `assert_dataset_pass` was never awaited. Fix the async signature and `await` immediately.
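The vacuous-pass failure mode is easy to reproduce in plain Python, without pixie: calling an async function without `await` builds a coroutine object but never runs its body, so even a hard-failing assertion inside it is silently skipped. The stub below is a stand-in, not the real `assert_dataset_pass`:

```python
import asyncio

ran = []

async def assert_dataset_pass_stub():
    # Stand-in for a real assertion that should fail the test.
    ran.append(True)
    raise AssertionError("dataset did not pass")

def test_vacuous():
    assert_dataset_pass_stub()  # BUG: no await — the body never executes

test_vacuous()               # "passes": no exception is raised
print(ran)                   # [] — the assertion never ran

async def test_correct():
    await assert_dataset_pass_stub()  # now the assertion actually fires

caught = False
try:
    asyncio.run(test_correct())
except AssertionError:
    caught = True
print(caught)                # True
```

This is why a sync `def test_...():` or a missing `await` produces a green run with zero recorded evaluations.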
### Multiple test functions

Split into separate test functions when you have different evaluator sets:

```python
async def test_factual_answers():
    """Test items that have deterministic expected outputs."""
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-deterministic",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )

async def test_response_style():
    """Test open-ended quality criteria."""
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-open-ended",
        evaluators=[concise_voice_style],
        pass_criteria=ScoreThreshold(threshold=0.6, pct=0.8),
        from_trace=last_llm_call,
    )
```
### Key points

- `enable_storage()` belongs inside the `runnable`, not at module level — it needs to fire on each invocation so the trace is captured for that specific run.
- The `runnable` imports and calls the same function that production uses — the app's entry point, going through the utility function from Step 3.
- If the `runnable` calls a different function than what the utility function calls, something is wrong.
- The `eval_input` dict should contain only the semantic arguments the function needs (e.g., `question`, `messages`, `context`). The `@observe` decorator automatically strips `self` and `cls`.
- Choose evaluators that match your data. If dataset items have `expected_output`, use comparison evaluators. If not, use standalone evaluators.
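The runnable's `answer_question(**eval_input)` call means dataset item keys must match the function's parameter names exactly. A plain-Python illustration (the function and inputs here are hypothetical stand-ins):

```python
def answer_question(question: str, context: str = "") -> str:
    # Stand-in for the production entry point.
    return f"Answering: {question}"

eval_input = {"question": "What is the refund policy?"}
result = answer_question(**eval_input)  # dict keys become keyword arguments
print(result)  # Answering: What is the refund policy?

# A key that does not match a parameter name fails immediately:
mismatch = False
try:
    answer_question(**{"q": "What is the refund policy?"})
except TypeError:  # unexpected keyword argument 'q'
    mismatch = True
print(mismatch)  # True
```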
## Running tests

The test runner is `pixie test` (not `pytest`):

```shell
uv run pixie test                    # run all test_*.py in current directory
uv run pixie test pixie_qa/tests/    # specify path
uv run pixie test -k factuality      # filter by name
uv run pixie test -v                 # verbose: shows per-case scores and reasoning
```
`pixie test` automatically loads the `.env` file before running tests, so API keys do not need to be exported in the shell. No `sys.path` hacks are needed in test files.

The `-v` flag is important: it shows per-case scores and evaluator reasoning, which makes it much easier to see what's passing and what isn't.
### After running, verify the scorecard

- Shows "N/M tests passed" with real numbers
- Does NOT say "No assert_pass / assert_dataset_pass calls recorded" (that means a missing `await`)
- Per-evaluator scores appear with real values

A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.