# Eval Tests: Evaluator Selection and Test Writing

This reference covers Step 5 of the eval-driven-dev process: choosing evaluators, writing the test file, and running `pixie test`.

Before writing any test code, re-read `references/pixie-api.md` (the Eval Runner API and Evaluator catalog sections) for exact parameter names and current evaluator signatures — these change when the package is updated.


## Evaluator selection

Choose evaluators based on the output type and your eval criteria from Step 1, not the app type.

### Decision table

| Output type | Evaluator category | Examples |
|---|---|---|
| Deterministic (classification labels, yes/no, fixed-format) | Heuristic: `ExactMatchEval`, `JSONDiffEval`, `ValidJSONEval` | Label classification, JSON extraction |
| Open-ended text with a reference answer | LLM-as-judge: `FactualityEval`, `ClosedQAEval`, `AnswerCorrectnessEval` | Chatbot responses, QA, summaries |
| Text with expected context/grounding | RAG evaluators: `FaithfulnessEval`, `ContextRelevancyEval` | RAG pipelines, context-grounded responses |
| Text with style/format requirements | Custom LLM-as-judge via `create_llm_evaluator` | Voice-friendly responses, tone checks |
| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone |

### Critical rules

- For open-ended LLM text, never use `ExactMatchEval`. LLM outputs are non-deterministic — exact match will either always fail, or always pass trivially if you compare an output against itself. Use LLM-as-judge evaluators instead.
- `AnswerRelevancyEval` is RAG-only — it requires a context value in the trace and returns 0.0 without one. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt.
- Do NOT use comparison evaluators (`FactualityEval`, `ClosedQAEval`, `ExactMatchEval`) on items without `expected_output` — they produce meaningless scores.
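The first rule is easy to demonstrate: two correct answers from different runs rarely match character for character, so an exact-string check is the wrong tool for open-ended text. A minimal illustration (the answer strings are hypothetical):

```python
# Two semantically equivalent answers an LLM might produce on different runs.
run_a = "Paris is the capital of France."
run_b = "The capital of France is Paris."

# An exact-match check treats them as different, so the eval fails
# even though both answers are correct.
print(run_a == run_b)  # False

# An LLM-as-judge evaluator scores meaning rather than surface form,
# so both runs would score well against the same reference answer.
```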

### When `expected_output` IS available

Use comparison-based evaluators:

| Evaluator | Use when |
|---|---|
| `FactualityEval` | Output must be factually correct compared to the reference |
| `ClosedQAEval` | Output must match the expected answer |
| `ExactMatchEval` | Exact string match (structured/deterministic outputs only) |
| `AnswerCorrectnessEval` | Answer correctness is judged against the reference |

### When `expected_output` is NOT available

Use standalone evaluators that judge quality without a reference:

| Evaluator | Use when | Note |
|---|---|---|
| `FaithfulnessEval` | Response is faithful to the provided context | RAG pipelines |
| `ContextRelevancyEval` | Retrieved context is relevant to the query | RAG pipelines |
| `AnswerRelevancyEval` | Answer addresses the question | RAG only — needs context in the trace; returns 0.0 without it |
| `PossibleEval` | Output is plausible/feasible | General purpose |
| `ModerationEval` | Output is safe and appropriate | Content safety |
| `SecurityEval` | No security vulnerabilities | Security check |

For non-RAG apps that need a response-relevance check, write a custom evaluator with `create_llm_evaluator` instead.


## Custom evaluators

### `create_llm_evaluator` factory

Use this when the quality dimension is domain-specific and no built-in evaluator fits:

```python
from pixie import create_llm_evaluator

concise_voice_style = create_llm_evaluator(
    name="ConciseVoiceStyle",
    prompt_template="""
    You are evaluating whether this response is concise and phone-friendly.

    Input: {eval_input}
    Response: {eval_output}

    Score 1.0 if the response is concise (under 3 sentences), directly addresses
    the question, and uses conversational language suitable for a phone call.
    Score 0.0 if it's verbose, off-topic, or uses written-style formatting.
    """,
)
```

How template variables work: `{eval_input}`, `{eval_output}`, and `{expected_output}` are the only placeholders. Each is replaced with a string representation of the corresponding `Evaluable` field — if the field is a dict or list, it becomes a JSON string. The LLM judge sees the full serialized value.

Rules:

- Only `{eval_input}`, `{eval_output}`, and `{expected_output}` — no nested access like `{eval_input[key]}` (this will crash with a TypeError).
- Keep templates short and direct — the system prompt already tells the LLM to return `Score: X.X`. Your template just needs to present the data and define the scoring criteria.
- Don't instruct the LLM to "parse" or "extract" data — just present the values and state the criteria. The LLM can read JSON naturally.
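To make the substitution behaviour concrete, here is a rough sketch of the template rendering described above. `render_template` is a hypothetical helper for illustration, not pixie's actual internals; it assumes dict/list fields are serialized with `json.dumps` before `str.format` substitution:

```python
import json

def render_template(template: str, eval_input, eval_output, expected_output=None) -> str:
    """Hypothetical sketch: fill the three placeholders, serializing
    non-string fields to JSON strings first."""
    def to_str(value):
        return value if isinstance(value, str) else json.dumps(value)
    return template.format(
        eval_input=to_str(eval_input),
        eval_output=to_str(eval_output),
        expected_output="" if expected_output is None else to_str(expected_output),
    )

prompt = render_template(
    "Input: {eval_input}\nResponse: {eval_output}",
    eval_input={"question": "What is 2+2?"},
    eval_output="4",
)
print(prompt)
# Input: {"question": "What is 2+2?"}
# Response: 4

# Under this model, a template like "{eval_input[question]}" would try to
# index the serialized JSON *string* with a str key and raise TypeError,
# which is why nested access is disallowed.
```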

Non-RAG response relevance (instead of `AnswerRelevancyEval`):

```python
response_relevance = create_llm_evaluator(
    name="ResponseRelevance",
    prompt_template="""
    You are evaluating whether a customer support response is relevant and helpful.

    Input: {eval_input}
    Response: {eval_output}
    Expected: {expected_output}

    Score 1.0 if the response directly addresses the question and meets expectations.
    Score 0.5 if partially relevant but misses important aspects.
    Score 0.0 if off-topic, ignores the question, or contradicts expectations.
    """,
)
```

### Manual custom evaluator

```python
from pixie import Evaluation, Evaluable

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input  — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```

## Writing the test file

Create `pixie_qa/tests/test_<feature>.py`. The pattern: a runnable adapter that calls the app's production function, plus async test functions that await `assert_dataset_pass`.

Before writing any test code, re-read the `assert_dataset_pass` API reference below. The exact parameter names matter — using `dataset=` instead of `dataset_name=`, or omitting `await`, will cause failures that are hard to debug. Do not rely on memory from earlier in the conversation.

### Test file template

```python
from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call

from myapp import answer_question


def runnable(eval_input):
    """Replays one dataset item through the app.

    Calls the same function the production app uses.
    enable_storage() here ensures traces are captured during eval runs.
    """
    enable_storage()
    answer_question(**eval_input)


async def test_answer_quality():
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-golden-set",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )
```

### `assert_dataset_pass` API — exact parameter names

```python
await assert_dataset_pass(
    runnable=runnable,              # callable that takes an eval_input dict
    dataset_name="my-dataset",      # NOT dataset_path — name of the dataset created in Step 4
    evaluators=[...],               # list of evaluator instances
    pass_criteria=ScoreThreshold(   # NOT thresholds — a ScoreThreshold object
        threshold=0.7,              # minimum score for an item to count as passing
        pct=0.8,                    # fraction of items that must pass
    ),
    from_trace=last_llm_call,       # which span to extract eval data from
)
```
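One plausible reading of the `pass_criteria` semantics: an item passes when its score meets `threshold`, and the dataset passes when at least `pct` of items pass. A plain-Python sketch of that assumed logic (`dataset_passes` is illustrative, not pixie's implementation):

```python
def dataset_passes(scores, threshold=0.7, pct=0.8):
    """Assumed ScoreThreshold semantics: the dataset passes when the
    fraction of items scoring at least `threshold` is at least `pct`."""
    passing = sum(1 for s in scores if s >= threshold)
    return passing / len(scores) >= pct

print(dataset_passes([0.9, 0.8, 0.75, 0.6, 0.95]))  # 4/5 = 0.8 pass -> True
print(dataset_passes([0.9, 0.8, 0.5, 0.6, 0.95]))   # 3/5 = 0.6 pass -> False
```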

### Common mistakes that break tests

| Mistake | Symptom | Fix |
|---|---|---|
| `def test_...():` (sync) | RuntimeWarning "coroutine was never awaited"; test passes vacuously | Use `async def test_...():` |
| No `await` | Same "coroutine was never awaited" warning | Add `await` before `assert_dataset_pass(...)` |
| `dataset_path="..."` | TypeError: unexpected keyword argument | Use `dataset_name="..."` |
| `thresholds={...}` | TypeError: unexpected keyword argument | Use `pass_criteria=ScoreThreshold(...)` |
| Omitting `from_trace` | Evaluator may not find the right span | Add `from_trace=last_llm_call` |

If `pixie test` reports "No assert_pass / assert_dataset_pass calls recorded", the test passed vacuously because `assert_dataset_pass` was never awaited. Fix the async signature and the missing `await` immediately.
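This vacuous-pass failure mode is ordinary asyncio behaviour, nothing pixie-specific: calling an async function without `await` only creates a coroutine object, so the assertion inside never executes. A standalone demonstration, with `failing_assertion` standing in for `assert_dataset_pass`:

```python
import asyncio

async def failing_assertion():
    # Stand-in for assert_dataset_pass: this would fail on every run.
    raise AssertionError("dataset did not pass")

# Missing await: this only creates a coroutine object. The body never
# executes, no AssertionError escapes, and a sync test "passes" vacuously.
coro = failing_assertion()
coro.close()  # silence the "coroutine was never awaited" RuntimeWarning

# Properly awaited, the failure surfaces as expected.
try:
    asyncio.run(failing_assertion())
except AssertionError as exc:
    print(f"caught: {exc}")  # caught: dataset did not pass
```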

### Multiple test functions

Split into separate test functions when you have different evaluator sets:

```python
async def test_factual_answers():
    """Test items that have deterministic expected outputs."""
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-deterministic",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )

async def test_response_style():
    """Test open-ended quality criteria."""
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-open-ended",
        evaluators=[concise_voice_style],
        pass_criteria=ScoreThreshold(threshold=0.6, pct=0.8),
        from_trace=last_llm_call,
    )
```

## Key points

- `enable_storage()` belongs inside the runnable, not at module level — it needs to fire on each invocation so the trace is captured for that specific run.
- The runnable imports and calls the same function that production uses — the app's entry point, going through the utility function from Step 3. If the runnable calls a different function than the utility function does, something is wrong.
- The `eval_input` dict should contain only the semantic arguments the function needs (e.g., `question`, `messages`, `context`). The `@observe` decorator automatically strips `self` and `cls`.
- Choose evaluators that match your data: if dataset items have `expected_output`, use comparison evaluators; if not, use standalone evaluators.
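As a rough sketch of the self/cls-stripping behaviour described above (an illustration of the idea, not pixie's actual `@observe` implementation; `observe_sketch` and its attribute-based capture are hypothetical):

```python
import functools
import inspect

def observe_sketch(func):
    """Hypothetical decorator: capture a call's arguments as an eval_input
    dict, dropping self and cls so only semantic arguments remain."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = inspect.signature(func).bind(*args, **kwargs)
        bound.apply_defaults()
        wrapper.last_eval_input = {
            k: v for k, v in bound.arguments.items() if k not in ("self", "cls")
        }  # stand-in for real trace capture
        return func(*args, **kwargs)
    return wrapper

class QAService:
    @observe_sketch
    def answer_question(self, question: str, context: str = ""):
        return f"answer to: {question}"

QAService().answer_question("What is eval-driven dev?")
print(QAService.answer_question.last_eval_input)
# {'question': 'What is eval-driven dev?', 'context': ''}
```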

## Running tests

The test runner is `pixie test` (not `pytest`):

```sh
uv run pixie test                           # run all test_*.py in the current directory
uv run pixie test pixie_qa/tests/           # specify a path
uv run pixie test -k factuality             # filter by name
uv run pixie test -v                        # verbose: per-case scores and reasoning
```

`pixie test` automatically loads the `.env` file before running tests, so API keys do not need to be exported in the shell. No `sys.path` hacks are needed in test files.

The `-v` flag is important: it shows per-case scores and evaluator reasoning, which makes it much easier to see what is passing and what isn't.

### After running, verify the scorecard

1. It shows "N/M tests passed" with real numbers.
2. It does NOT say "No assert_pass / assert_dataset_pass calls recorded" (that means a missing `await`).
3. Per-evaluator scores appear with real values.

A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.