mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-11 18:55:55 +00:00
update eval-driven-dev skill. (#1201)
* update eval-driven-dev skill. Split SKILL into multi-level to keep the skill body under 500 lines, rewrite instructions. * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
241
skills/eval-driven-dev/references/eval-tests.md
Normal file
@@ -0,0 +1,241 @@
# Eval Tests: Evaluator Selection and Test Writing

This reference covers Step 5 of the eval-driven-dev process: choosing evaluators, writing the test file, and running `pixie test`.

**Before writing any test code, re-read `references/pixie-api.md`** (Eval Runner API and Evaluator catalog sections) for exact parameter names and current evaluator signatures. These change when the package is updated.

---

## Evaluator selection

Choose evaluators based on the **output type** and your eval criteria from Step 1, not the app type.

### Decision table

| Output type                                                 | Evaluator category                                                      | Examples                                  |
| ----------------------------------------------------------- | ----------------------------------------------------------------------- | ----------------------------------------- |
| Deterministic (classification labels, yes/no, fixed-format) | Heuristic: `ExactMatchEval`, `JSONDiffEval`, `ValidJSONEval`            | Label classification, JSON extraction     |
| Open-ended text with a reference answer                     | LLM-as-judge: `FactualityEval`, `ClosedQAEval`, `AnswerCorrectnessEval` | Chatbot responses, QA, summaries          |
| Text with expected context/grounding                        | RAG evaluators: `FaithfulnessEval`, `ContextRelevancyEval`              | RAG pipelines, context-grounded responses |
| Text with style/format requirements                         | Custom LLM-as-judge via `create_llm_evaluator`                          | Voice-friendly responses, tone checks     |
| Multi-aspect quality                                        | Multiple evaluators combined                                            | Factuality + relevance + tone             |

### Critical rules

- For open-ended LLM text, **never** use `ExactMatchEval`. LLM outputs are non-deterministic, so exact match will either always fail or always pass (if comparing against the same output). Use LLM-as-judge evaluators instead.
- `AnswerRelevancyEval` is **RAG-only**: it requires a `context` value in the trace and returns 0.0 without it. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt.
- Do NOT use comparison evaluators (`FactualityEval`, `ClosedQAEval`, `ExactMatchEval`) on items without `expected_output`; they produce meaningless scores.

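The first rule can be illustrated with a minimal sketch in plain Python. The `exact_match` helper below is a hypothetical stand-in, not pixie's `ExactMatchEval`; it only shows that two equally correct phrasings compare as unequal, while a fixed label compares cleanly.

```python
def exact_match(output: str, expected: str) -> float:
    """Hypothetical stand-in for an exact-match check (not pixie's ExactMatchEval)."""
    return 1.0 if output.strip() == expected.strip() else 0.0


reference = "Paris is the capital of France."
paraphrase = "The capital of France is Paris."

# Both sentences state the same fact, but the strings differ.
paraphrase_score = exact_match(paraphrase, reference)

# A deterministic label is a reasonable fit for exact matching.
label_score = exact_match("billing", "billing")

print(paraphrase_score, label_score)
```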
### When `expected_output` IS available

Use comparison-based evaluators:

| Evaluator               | Use when                                                   |
| ----------------------- | ---------------------------------------------------------- |
| `FactualityEval`        | Output is factually correct compared to reference          |
| `ClosedQAEval`          | Output matches the expected answer                         |
| `ExactMatchEval`        | Exact string match (structured/deterministic outputs only) |
| `AnswerCorrectnessEval` | Answer is correct vs reference                             |

### When `expected_output` is NOT available

Use standalone evaluators that judge quality without a reference:

| Evaluator              | Use when                              | Note                                                            |
| ---------------------- | ------------------------------------- | --------------------------------------------------------------- |
| `FaithfulnessEval`     | Response faithful to provided context | RAG pipelines                                                   |
| `ContextRelevancyEval` | Retrieved context relevant to query   | RAG pipelines                                                   |
| `AnswerRelevancyEval`  | Answer addresses the question         | **RAG only**: needs `context` in trace. Returns 0.0 without it. |
| `PossibleEval`         | Output is plausible / feasible        | General purpose                                                 |
| `ModerationEval`       | Output is safe and appropriate        | Content safety                                                  |
| `SecurityEval`         | No security vulnerabilities           | Security check                                                  |

For non-RAG apps that need a response-relevance check, write a custom evaluator with `create_llm_evaluator` instead.

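One way to read the tables above is as a small decision function. The sketch below is illustrative only: `suggest_evaluators` is a hypothetical helper, not part of pixie, and it returns evaluator names as plain strings rather than importing them. It also simplifies the no-reference, no-context case down to `PossibleEval` plus a custom judge.

```python
def suggest_evaluators(
    has_expected_output: bool,
    has_context: bool,
    deterministic: bool = False,
) -> list[str]:
    """Map dataset properties to evaluator names from this reference (hypothetical helper)."""
    if has_expected_output:
        if deterministic:
            # Labels, yes/no, fixed-format outputs.
            return ["ExactMatchEval"]
        # Open-ended text with a reference answer.
        return ["FactualityEval", "ClosedQAEval", "AnswerCorrectnessEval"]
    if has_context:
        # RAG-style traces that carry a context value.
        return ["FaithfulnessEval", "ContextRelevancyEval", "AnswerRelevancyEval"]
    # No reference and no context: general-purpose check plus a custom judge.
    return ["PossibleEval", "create_llm_evaluator"]


print(suggest_evaluators(has_expected_output=True, has_context=False))
print(suggest_evaluators(has_expected_output=False, has_context=False))
```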
---

## Custom evaluators

### `create_llm_evaluator` factory

Use when the quality dimension is domain-specific and no built-in evaluator fits:

```python
from pixie import create_llm_evaluator

concise_voice_style = create_llm_evaluator(
    name="ConciseVoiceStyle",
    prompt_template="""
    You are evaluating whether this response is concise and phone-friendly.

    Input: {eval_input}
    Response: {eval_output}

    Score 1.0 if the response is concise (under 3 sentences), directly addresses
    the question, and uses conversational language suitable for a phone call.
    Score 0.0 if it's verbose, off-topic, or uses written-style formatting.
    """,
)
```

**How template variables work**: `{eval_input}`, `{eval_output}`, `{expected_output}` are the only placeholders. Each is replaced with a string representation of the corresponding `Evaluable` field. If the field is a dict or list, it becomes a JSON string. The LLM judge sees the full serialized value.

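The substitution described above can be sketched in plain Python. This is an assumption about how the rendering behaves, not pixie's actual implementation: `to_text` and `render` are hypothetical names, and the real library may serialize values differently.

```python
import json


def to_text(value) -> str:
    """Serialize dicts and lists as JSON; fall back to str() for everything else."""
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return str(value)


def render(template: str, eval_input, eval_output, expected_output="") -> str:
    """Hypothetical rendering step; pixie's real implementation may differ."""
    return template.format(
        eval_input=to_text(eval_input),
        eval_output=to_text(eval_output),
        expected_output=to_text(expected_output),
    )


prompt = render(
    "Input: {eval_input}\nResponse: {eval_output}",
    eval_input={"question": "What are your hours?"},
    eval_output="We're open 9 to 5.",
)
print(prompt)
```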
**Rules**:

- **Only `{eval_input}`, `{eval_output}`, `{expected_output}`**: no nested access like `{eval_input[key]}` (this will crash with a `TypeError`)
- **Keep templates short and direct**: the system prompt already tells the LLM to return `Score: X.X`. Your template just needs to present the data and define the scoring criteria.
- **Don't instruct the LLM to "parse" or "extract" data**: just present the values and state the criteria. The LLM can read JSON naturally.

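The `TypeError` in the first rule is what Python's own `str.format` raises when a placeholder indexes into a value that is already a string. The sketch below assumes the template is filled with `str.format`-style substitution after the field has been serialized; that matches the behavior described above but is not confirmed against pixie's source.

```python
import json

# Assumption: the field has already been serialized to a JSON string.
serialized_input = json.dumps({"question": "What are your hours?"})

try:
    "Question: {eval_input[question]}".format(eval_input=serialized_input)
    error_name = None
except TypeError as exc:
    # str.format evaluates serialized_input["question"], and strings reject string indices.
    error_name = type(exc).__name__

print(error_name)
```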
**Non-RAG response relevance** (instead of `AnswerRelevancyEval`):

```python
from pixie import create_llm_evaluator

response_relevance = create_llm_evaluator(
    name="ResponseRelevance",
    prompt_template="""
    You are evaluating whether a customer support response is relevant and helpful.

    Input: {eval_input}
    Response: {eval_output}
    Expected: {expected_output}

    Score 1.0 if the response directly addresses the question and meets expectations.
    Score 0.5 if partially relevant but misses important aspects.
    Score 0.0 if off-topic, ignores the question, or contradicts expectations.
    """,
)
```

### Manual custom evaluator

```python
from pixie import Evaluation, Evaluable


async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input: what was passed to the observed function
    # evaluable.eval_output: what the function returned
    # evaluable.expected_output: reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```

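A manual evaluator is an ordinary async function, so its scoring logic can be exercised in isolation before it is wired into a dataset run. The sketch below uses stand-in `Evaluable` and `Evaluation` dataclasses that carry only the fields named above, so it runs without pixie installed; the real classes may have more fields, and `mentions_refund_policy` is a hypothetical example evaluator.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class Evaluable:
    """Stand-in with only the fields named above (not pixie's class)."""
    eval_input: Any
    eval_output: Any
    expected_output: Any = None


@dataclass
class Evaluation:
    """Stand-in result type (not pixie's class)."""
    score: float
    reasoning: str


async def mentions_refund_policy(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # Heuristic check: does the output mention the phrase we care about?
    found = "refund" in str(evaluable.eval_output).lower()
    return Evaluation(
        score=1.0 if found else 0.0,
        reasoning="mentions 'refund'" if found else "no mention of 'refund'",
    )


question = {"question": "Can I return this?"}
good = asyncio.run(mentions_refund_policy(
    Evaluable(eval_input=question, eval_output="Yes, refunds are available within 30 days.")
))
bad = asyncio.run(mentions_refund_policy(
    Evaluable(eval_input=question, eval_output="Please hold.")
))
print(good.score, bad.score)
```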
---

## Writing the test file

Create `pixie_qa/tests/test_<feature>.py`. The pattern: a `runnable` adapter that calls the app's production function, plus `async` test functions that `await` `assert_dataset_pass`.

**Before writing any test code, re-read the `assert_dataset_pass` API reference below.** The exact parameter names matter: using `dataset_path=` instead of `dataset_name=`, or omitting `await`, will cause failures that are hard to debug. Do not rely on memory from earlier in the conversation.

### Test file template

```python
from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call

from myapp import answer_question


def runnable(eval_input):
    """Replays one dataset item through the app.

    Calls the same function the production app uses.
    enable_storage() here ensures traces are captured during eval runs.
    """
    enable_storage()
    answer_question(**eval_input)


async def test_answer_quality():
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-golden-set",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )
```

### `assert_dataset_pass` API: exact parameter names

```python
await assert_dataset_pass(
    runnable=runnable,              # callable that takes eval_input dict
    dataset_name="my-dataset",      # NOT dataset_path; name of dataset created in Step 4
    evaluators=[...],               # list of evaluator instances
    pass_criteria=ScoreThreshold(   # NOT thresholds; ScoreThreshold object
        threshold=0.7,              # minimum score to count as passing
        pct=0.8,                    # fraction of items that must pass
    ),
    from_trace=last_llm_call,       # which span to extract eval data from
)
```

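The two `ScoreThreshold` fields can be read as a two-stage check: count the items whose score reaches `threshold`, then require that fraction to reach `pct`. The sketch below is one interpretation of the comments above, not pixie's implementation. `dataset_passes` is a hypothetical helper, and treating both comparisons as inclusive (`>=`) and an empty run as a failure are assumptions.

```python
def dataset_passes(scores: list[float], threshold: float, pct: float) -> bool:
    """Hypothetical reading of ScoreThreshold(threshold=..., pct=...)."""
    if not scores:
        return False  # assumption: an empty run should not count as a pass
    passing = sum(1 for score in scores if score >= threshold)
    return passing / len(scores) >= pct


scores = [0.9, 0.8, 0.75, 0.7, 0.4]

# 4 of 5 items reach 0.7, which meets the 80% requirement.
lenient = dataset_passes(scores, threshold=0.7, pct=0.8)

# Only 2 of 5 items reach 0.8, which falls short of 80%.
strict = dataset_passes(scores, threshold=0.8, pct=0.8)

print(lenient, strict)
```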
### Common mistakes that break tests

| Mistake                  | Symptom                                                             | Fix                                           |
| ------------------------ | ------------------------------------------------------------------- | --------------------------------------------- |
| `def test_...():` (sync) | RuntimeWarning "coroutine was never awaited", test passes vacuously | Use `async def test_...():`                   |
| No `await`               | Same: "coroutine was never awaited"                                 | Add `await` before `assert_dataset_pass(...)` |
| `dataset_path="..."`     | TypeError: unexpected keyword argument                              | Use `dataset_name="..."`                      |
| `thresholds={...}`       | TypeError: unexpected keyword argument                              | Use `pass_criteria=ScoreThreshold(...)`       |
| Omitting `from_trace`    | Evaluator may not find the right span                               | Add `from_trace=last_llm_call`                |

**If `pixie test` shows "No assert_pass / assert_dataset_pass calls recorded"**, the test passed vacuously because `assert_dataset_pass` was never awaited. Fix it right away: declare the test with `async def` and `await` the call.

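The vacuous pass comes from how Python coroutines work: calling an `async` function without `await` only creates a coroutine object, and the body never runs. The sketch below reproduces this with plain `asyncio`. `fake_assert_dataset_pass` is a stand-in that records whether its body executed; it is not the pixie function.

```python
import asyncio

calls = []


async def fake_assert_dataset_pass():
    """Stand-in for an async assertion helper; records that its body ran."""
    calls.append("ran")


def broken_check():
    # Calling an async function only creates a coroutine object; the body does not run.
    coro = fake_assert_dataset_pass()
    coro.close()  # closed explicitly here to suppress the "never awaited" warning


async def fixed_check():
    # Awaiting the coroutine actually executes the body.
    await fake_assert_dataset_pass()


broken_check()
calls_after_broken = list(calls)

asyncio.run(fixed_check())
calls_after_fixed = list(calls)

print(calls_after_broken, calls_after_fixed)
```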
### Multiple test functions

Split into separate test functions when you have different evaluator sets:

```python
async def test_factual_answers():
    """Test items that have deterministic expected outputs."""
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-deterministic",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )


async def test_response_style():
    """Test open-ended quality criteria."""
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="qa-open-ended",
        evaluators=[concise_voice_style],
        pass_criteria=ScoreThreshold(threshold=0.6, pct=0.8),
        from_trace=last_llm_call,
    )
```

### Key points

- `enable_storage()` belongs inside the `runnable`, not at module level. It needs to fire on each invocation so the trace is captured for that specific run.
- The `runnable` imports and calls the **same function** that production uses: the app's entry point, going through the utility function from Step 3.
- If the `runnable` calls a different function than what the utility function calls, something is wrong.
- The `eval_input` dict should contain **only the semantic arguments** the function needs (e.g., `question`, `messages`, `context`). The `@observe` decorator automatically strips `self` and `cls`.
- **Choose evaluators that match your data.** If dataset items have `expected_output`, use comparison evaluators. If not, use standalone evaluators.

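The "only the semantic arguments" point follows from how the template's `runnable` unpacks the dict with `**eval_input`: every key becomes a keyword argument. The sketch below uses a hypothetical `answer_question` with a single `question` parameter and a made-up `request_id` key to show that an extra key raises a `TypeError` before the app logic runs.

```python
def answer_question(question: str) -> str:
    """Hypothetical app entry point that takes one semantic argument."""
    return f"You asked: {question}"


clean_input = {"question": "What are your hours?"}
noisy_input = {"question": "What are your hours?", "request_id": "abc-123"}

# Keys map one-to-one onto the function's parameters.
answer = answer_question(**clean_input)

try:
    answer_question(**noisy_input)
    noisy_error = None
except TypeError as exc:
    # The extra key is passed as an unexpected keyword argument.
    noisy_error = type(exc).__name__

print(answer, noisy_error)
```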
---

## Running tests

The test runner is `pixie test` (not `pytest`):

```bash
uv run pixie test                    # run all test_*.py in current directory
uv run pixie test pixie_qa/tests/    # specify path
uv run pixie test -k factuality      # filter by name
uv run pixie test -v                 # verbose: shows per-case scores and reasoning
```

`pixie test` automatically loads the `.env` file before running tests, so API keys do not need to be exported in the shell. No `sys.path` hacks are needed in test files.

The `-v` flag is important: it shows per-case scores and evaluator reasoning, which makes it much easier to see what's passing and what isn't.

### After running, verify the scorecard

1. It shows "N/M tests passed" with real numbers.
2. It does NOT say "No assert_pass / assert_dataset_pass calls recorded" (that means a missing `await`).
3. Per-evaluator scores appear with real values.

A test that passes with no recorded evaluations is worse than a failing test because it gives false confidence. Debug until real scores appear.