update eval-driven-dev skill. (#1201)

* update eval-driven-dev skill. Split SKILL into multi-level to keep the skill body under 500 lines, rewrite instructions. * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-04-11 18:55:55 +00:00 · 2026-03-29 14:07:39 -07:00
parent ac305117f6
commit df0ed6aa51
9 changed files with 1803 additions and 947 deletions
--- a/skills/eval-driven-dev/references/eval-tests.md
+++ b/skills/eval-driven-dev/references/eval-tests.md
@@ -0,0 +1,241 @@
+# Eval Tests: Evaluator Selection and Test Writing
+
+This reference covers Step 5 of the eval-driven-dev process: choosing evaluators, writing the test file, and running `pixie test`.
+
+**Before writing any test code, re-read `references/pixie-api.md`** (Eval Runner API and Evaluator catalog sections) for exact parameter names and current evaluator signatures — these change when the package is updated.
+
+---
+
+## Evaluator selection
+
+Choose evaluators based on the **output type** and your eval criteria from Step 1, not the app type.
+
+### Decision table
+
+| Output type                                                 | Evaluator category                                                      | Examples                                  |
+| ----------------------------------------------------------- | ----------------------------------------------------------------------- | ----------------------------------------- |
+| Deterministic (classification labels, yes/no, fixed-format) | Heuristic: `ExactMatchEval`, `JSONDiffEval`, `ValidJSONEval`            | Label classification, JSON extraction     |
+| Open-ended text with a reference answer                     | LLM-as-judge: `FactualityEval`, `ClosedQAEval`, `AnswerCorrectnessEval` | Chatbot responses, QA, summaries          |
+| Text with expected context/grounding                        | RAG evaluators: `FaithfulnessEval`, `ContextRelevancyEval`              | RAG pipelines, context-grounded responses |
+| Text with style/format requirements                         | Custom LLM-as-judge via `create_llm_evaluator`                          | Voice-friendly responses, tone checks     |
+| Multi-aspect quality                                        | Multiple evaluators combined                                            | Factuality + relevance + tone             |
+
+### Critical rules
+
+- For open-ended LLM text, **never** use `ExactMatchEval`. LLM outputs are non-deterministic — exact match will either always fail or always pass (if comparing against the same output). Use LLM-as-judge evaluators instead.
+- `AnswerRelevancyEval` is **RAG-only** — it requires a `context` value in the trace. Returns 0.0 without it. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt.
+- Do NOT use comparison evaluators (`FactualityEval`, `ClosedQAEval`, `ExactMatchEval`) on items without `expected_output` — they produce meaningless scores.
+
+### When `expected_output` IS available
+
+Use comparison-based evaluators:
+
+| Evaluator               | Use when                                                   |
+| ----------------------- | ---------------------------------------------------------- |
+| `FactualityEval`        | Output is factually correct compared to reference          |
+| `ClosedQAEval`          | Output matches the expected answer                         |
+| `ExactMatchEval`        | Exact string match (structured/deterministic outputs only) |
+| `AnswerCorrectnessEval` | Answer is correct vs reference                             |
+
+### When `expected_output` is NOT available
+
+Use standalone evaluators that judge quality without a reference:
+
+| Evaluator              | Use when                              | Note                                                             |
+| ---------------------- | ------------------------------------- | ---------------------------------------------------------------- |
+| `FaithfulnessEval`     | Response faithful to provided context | RAG pipelines                                                    |
+| `ContextRelevancyEval` | Retrieved context relevant to query   | RAG pipelines                                                    |
+| `AnswerRelevancyEval`  | Answer addresses the question         | **RAG only** — needs `context` in trace. Returns 0.0 without it. |
+| `PossibleEval`         | Output is plausible / feasible        | General purpose                                                  |
+| `ModerationEval`       | Output is safe and appropriate        | Content safety                                                   |
+| `SecurityEval`         | No security vulnerabilities           | Security check                                                   |
+
+For non-RAG apps needing response relevance, write a `create_llm_evaluator` instead.
+
+---
+
+## Custom evaluators
+
+### `create_llm_evaluator` factory
+
+Use when the quality dimension is domain-specific and no built-in evaluator fits:
+
+```python
+from pixie import create_llm_evaluator
+
+concise_voice_style = create_llm_evaluator(
+    name="ConciseVoiceStyle",
+    prompt_template="""
+    You are evaluating whether this response is concise and phone-friendly.
+
+    Input: {eval_input}
+    Response: {eval_output}
+
+    Score 1.0 if the response is concise (under 3 sentences), directly addresses
+    the question, and uses conversational language suitable for a phone call.
+    Score 0.0 if it's verbose, off-topic, or uses written-style formatting.
+    """,
+)
+```
+
+**How template variables work**: `{eval_input}`, `{eval_output}`, `{expected_output}` are the only placeholders. Each is replaced with a string representation of the corresponding `Evaluable` field — if the field is a dict or list, it becomes a JSON string. The LLM judge sees the full serialized value.
+
+**Rules**:
+
+- **Only `{eval_input}`, `{eval_output}`, `{expected_output}`** — no nested access like `{eval_input[key]}` (this will crash with a `TypeError`)
+- **Keep templates short and direct** — the system prompt already tells the LLM to return `Score: X.X`. Your template just needs to present the data and define the scoring criteria.
+- **Don't instruct the LLM to "parse" or "extract" data** — just present the values and state the criteria. The LLM can read JSON naturally.
+
+**Non-RAG response relevance** (instead of `AnswerRelevancyEval`):
+
+```python
+response_relevance = create_llm_evaluator(
+    name="ResponseRelevance",
+    prompt_template="""
+    You are evaluating whether a customer support response is relevant and helpful.
+
+    Input: {eval_input}
+    Response: {eval_output}
+    Expected: {expected_output}
+
+    Score 1.0 if the response directly addresses the question and meets expectations.
+    Score 0.5 if partially relevant but misses important aspects.
+    Score 0.0 if off-topic, ignores the question, or contradicts expectations.
+    """,
+)
+```
+
+### Manual custom evaluator
+
+```python
+from pixie import Evaluation, Evaluable
+
+async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
+    # evaluable.eval_input  — what was passed to the observed function
+    # evaluable.eval_output — what the function returned
+    # evaluable.expected_output — reference answer (UNSET if not provided)
+    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
+    return Evaluation(score=score, reasoning="...")
+```
+
+---
+
+## Writing the test file
+
+Create `pixie_qa/tests/test_<feature>.py`. The pattern: a `runnable` adapter that calls the app's production function, plus `async` test functions that `await` `assert_dataset_pass`.
+
+**Before writing any test code, re-read the `assert_dataset_pass` API reference below.** The exact parameter names matter — using `dataset=` instead of `dataset_name=`, or omitting `await`, will cause failures that are hard to debug. Do not rely on memory from earlier in the conversation.
+
+### Test file template
+
+```python
+from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
+
+from myapp import answer_question
+
+
+def runnable(eval_input):
+    """Replays one dataset item through the app.
+
+    Calls the same function the production app uses.
+    enable_storage() here ensures traces are captured during eval runs.
+    """
+    enable_storage()
+    answer_question(**eval_input)
+
+
+async def test_answer_quality():
+    await assert_dataset_pass(
+        runnable=runnable,
+        dataset_name="qa-golden-set",
+        evaluators=[FactualityEval()],
+        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
+        from_trace=last_llm_call,
+    )
+```
+
+### `assert_dataset_pass` API — exact parameter names
+
+```python
+await assert_dataset_pass(
+    runnable=runnable,              # callable that takes eval_input dict
+    dataset_name="my-dataset",      # NOT dataset_path — name of dataset created in Step 4
+    evaluators=[...],               # list of evaluator instances
+    pass_criteria=ScoreThreshold(   # NOT thresholds — ScoreThreshold object
+        threshold=0.7,              # minimum score to count as passing
+        pct=0.8,                    # fraction of items that must pass
+    ),
+    from_trace=last_llm_call,       # which span to extract eval data from
+)
+```
+
+### Common mistakes that break tests
+
+| Mistake                  | Symptom                                                             | Fix                                           |
+| ------------------------ | ------------------------------------------------------------------- | --------------------------------------------- |
+| `def test_...():` (sync) | RuntimeWarning "coroutine was never awaited", test passes vacuously | Use `async def test_...():`                   |
+| No `await`               | Same: "coroutine was never awaited"                                 | Add `await` before `assert_dataset_pass(...)` |
+| `dataset_path="..."`     | TypeError: unexpected keyword argument                              | Use `dataset_name="..."`                      |
+| `thresholds={...}`       | TypeError: unexpected keyword argument                              | Use `pass_criteria=ScoreThreshold(...)`       |
+| Omitting `from_trace`    | Evaluator may not find the right span                               | Add `from_trace=last_llm_call`                |
+
+**If `pixie test` shows "No assert_pass / assert_dataset_pass calls recorded"**, the test passed vacuously because `assert_dataset_pass` was never awaited. Fix the async signature and await immediately.
+
+### Multiple test functions
+
+Split into separate test functions when you have different evaluator sets:
+
+```python
+async def test_factual_answers():
+    """Test items that have deterministic expected outputs."""
+    await assert_dataset_pass(
+        runnable=runnable,
+        dataset_name="qa-deterministic",
+        evaluators=[FactualityEval()],
+        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
+        from_trace=last_llm_call,
+    )
+
+async def test_response_style():
+    """Test open-ended quality criteria."""
+    await assert_dataset_pass(
+        runnable=runnable,
+        dataset_name="qa-open-ended",
+        evaluators=[concise_voice_style],
+        pass_criteria=ScoreThreshold(threshold=0.6, pct=0.8),
+        from_trace=last_llm_call,
+    )
+```
+
+### Key points
+
+- `enable_storage()` belongs inside the `runnable`, not at module level — it needs to fire on each invocation so the trace is captured for that specific run.
+- The `runnable` imports and calls the **same function** that production uses — the app's entry point, going through the utility function from Step 3.
+- If the `runnable` calls a different function than what the utility function calls, something is wrong.
+- The `eval_input` dict should contain **only the semantic arguments** the function needs (e.g., `question`, `messages`, `context`). The `@observe` decorator automatically strips `self` and `cls`.
+- **Choose evaluators that match your data.** If dataset items have `expected_output`, use comparison evaluators. If not, use standalone evaluators.
+
+---
+
+## Running tests
+
+The test runner is `pixie test` (not `pytest`):
+
+```bash
+uv run pixie test                           # run all test_*.py in current directory
+uv run pixie test pixie_qa/tests/           # specify path
+uv run pixie test -k factuality             # filter by name
+uv run pixie test -v                        # verbose: shows per-case scores and reasoning
+```
+
+`pixie test` automatically loads the `.env` file before running tests, so API keys do not need to be exported in the shell. No `sys.path` hacks are needed in test files.
+
+The `-v` flag is important: it shows per-case scores and evaluator reasoning, which makes it much easier to see what's passing and what isn't.
+
+### After running, verify the scorecard
+
+1. Shows "N/M tests passed" with real numbers
+2. Does NOT say "No assert_pass / assert_dataset_pass calls recorded" (that means missing `await`)
+3. Per-evaluator scores appear with real values
+
+A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.