Yiou Li 5f59ddb9cf update eval-driven-dev skill (#1352)
* update eval-driven-dev skill

* small refinement of skill description

* address review, rerun npm start.
2026-04-10 11:19:28 +10:00

# Step 4: Build the Dataset
**Why this step**: The dataset ties everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1b) — into concrete test scenarios. At test time, `pixie test` calls the runnable with `entry_kwargs`, the wrap registry is populated with `eval_input`, and evaluators score the resulting captured outputs.
---
## Understanding `entry_kwargs`, `eval_input`, and `expectation`
Before building the dataset, understand what these terms mean:
- **`entry_kwargs`** = the kwargs passed to `Runnable.run()` as a Pydantic model. These are the entry-point inputs (user message, request body, CLI args). The keys must match the fields of the Pydantic model defined for `run(args: T)`.
- **`eval_input`** = a list of `{"name": ..., "value": ...}` objects corresponding to `wrap(purpose="input")` calls in the app. At test time, these are injected automatically by the wrap registry; `wrap(purpose="input")` calls in the app return the registry value instead of calling the real external dependency.
**CRITICAL**: `eval_input` must have **at least one item** (enforced by `min_length=1` validation). If the app has no `wrap(purpose="input")` calls, you must still include at least one `eval_input` item — use the primary entry-point argument as a synthetic input:
```json
"eval_input": [
{ "name": "user_input", "value": "What are your business hours?" }
]
```
Each item is a `NamedData` object with `name` (str) and `value` (any JSON-serializable value).
- **`expectation`** (optional) = case-specific evaluation reference. What a correct output should look like for this scenario. Used by evaluators that compare output against a reference (e.g., `Factuality`, `ClosedQA`). Not needed for output-quality evaluators that don't require a reference.
- **eval output** = what the app actually produces, captured at runtime by `wrap(purpose="output")` and `wrap(purpose="state")` calls. **Not stored in the dataset** — it's produced when `pixie test` runs the app.
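The at-least-one-item rule for `eval_input` can be illustrated with a small helper. This is a minimal sketch using only the standard library, not part of pixie: `ensure_eval_input` and its `primary_key` parameter are hypothetical names introduced here for illustration.

```python
from typing import Any


def ensure_eval_input(
    entry_kwargs: dict[str, Any],
    eval_input: list[dict[str, Any]],
    primary_key: str,
) -> list[dict[str, Any]]:
    """Return eval_input unchanged, or a one-item synthetic list if it is empty.

    Hypothetical helper: it only mirrors the rule described above.
    """
    if eval_input:
        return eval_input
    # No wrap(purpose="input") data: reuse the primary entry-point argument.
    return [{"name": "user_input", "value": entry_kwargs[primary_key]}]


kwargs = {"user_message": "What are your business hours?"}
synthetic = ensure_eval_input(kwargs, [], primary_key="user_message")
existing = ensure_eval_input(
    kwargs, [{"name": "customer_profile", "value": {"tier": "gold"}}], "user_message"
)
```

Here `synthetic` holds a single `user_input` item built from `user_message`, while `existing` is returned as-is because it already satisfies the rule.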
The **reference trace** at `pixie_qa/reference-trace.jsonl` is your primary source for data shapes:
- Filter it for `purpose="input"` events to see the exact serialized format for `eval_input` values
- Read the `kwargs` record to understand the `entry_kwargs` structure
- Read `purpose="output"/"state"` events to understand what outputs the app produces, so you can write meaningful `expectation` values
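One way to inspect the trace is to group its records by purpose. The sketch below assumes each JSONL line is a JSON object and that wrap events carry a `purpose` key; the real trace schema may differ, and the sample records are invented for illustration, so check a few lines of your own trace first.

```python
import json
from collections import defaultdict
from typing import Any, Iterable


def group_by_purpose(lines: Iterable[str]) -> dict[str, list[dict[str, Any]]]:
    """Group JSONL records by their 'purpose' key (assumed field name)."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Records without a 'purpose' key (e.g. a kwargs record) go under "other".
        groups[record.get("purpose", "other")].append(record)
    return dict(groups)


# Hypothetical sample lines; real records will have their own fields.
sample_lines = [
    '{"purpose": "input", "name": "customer_profile", "data": {"tier": "gold"}}',
    '{"purpose": "output", "name": "response", "data": "Our hours are..."}',
    '{"kwargs": {"user_message": "What are your business hours?"}}',
]
groups = group_by_purpose(sample_lines)
```

With a real file, the same function can be fed an open file handle, for example `group_by_purpose(open("pixie_qa/reference-trace.jsonl", encoding="utf-8"))`.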
---
## 4a. Derive evaluator assignments
The eval criteria artifact (`pixie_qa/02-eval-criteria.md`) maps each criterion to use cases. The evaluator mapping artifact (`pixie_qa/03-evaluator-mapping.md`) maps each criterion to a concrete evaluator name. Combine these:
1. **Dataset-level default evaluators**: Criteria marked as applying to "All" use cases → their evaluator names go in the top-level `"evaluators"` array.
2. **Item-level evaluators**: Criteria that apply to only a subset → their evaluator names go in `"evaluators"` on the relevant rows only, using `"..."` to also include the defaults.
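The two rules above amount to a simple partition. The following sketch assumes the two artifacts have already been combined by hand into a list of dicts; the keys `evaluator` and `applies_to`, and the use case name, are hypothetical and not a pixie format.

```python
from typing import Any


def split_evaluators(
    criteria: list[dict[str, Any]],
) -> tuple[list[str], dict[str, list[str]]]:
    """Split criteria into dataset-level defaults and per-use-case additions."""
    defaults: list[str] = []
    per_use_case: dict[str, list[str]] = {}
    for criterion in criteria:
        if criterion["applies_to"] == "All":
            defaults.append(criterion["evaluator"])
            continue
        for use_case in criterion["applies_to"]:
            # Start with "..." so the entry keeps the dataset-level defaults.
            per_use_case.setdefault(use_case, ["..."]).append(criterion["evaluator"])
    return defaults, per_use_case


criteria = [
    {"evaluator": "Factuality", "applies_to": "All"},
    {"evaluator": "ClosedQA", "applies_to": ["ambiguous request"]},
]
defaults, per_use_case = split_evaluators(criteria)
```

In this example `defaults` would go in the top-level `"evaluators"` array, and the list under `"ambiguous request"` would go on the entries for that use case.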
## 4b. Inspect data shapes with `pixie format`
Use `pixie format` on the reference trace to see the exact data shapes **and** the real app output in dataset-entry format:
```bash
pixie format --input reference-trace.jsonl --output dataset-sample.json
```
The output looks like:
```json
{
"entry_kwargs": {
"user_message": "What are your business hours?"
},
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Alice", "tier": "gold" }
},
{
"name": "conversation_history",
"value": [{ "role": "user", "content": "What are your hours?" }]
}
],
"expectation": null,
"eval_output": {
"response": "Our business hours are Monday to Friday, 9am to 5pm..."
}
}
```
**Important**: The `eval_output` in this template is the **full real output** produced by the running app. Do NOT copy `eval_output` into your dataset entries — it would make tests trivially pass by giving evaluators the real answer. Instead:
- Use `entry_kwargs` and `eval_input` as exact templates for data keys and format
- Look at `eval_output` to understand what the app produces — then write a **concise `expectation` description** that captures the key quality criteria for each scenario
**Example**: if `eval_output.response` is `"Our business hours are Monday to Friday, 9 AM to 5 PM, and Saturday 10 AM to 2 PM."`, write `expectation` as `"Should mention weekday hours (Mon-Fri 9am-5pm) and Saturday hours"`. This is a short description that a human or LLM evaluator can compare against.
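Turning a `pixie format` template into a dataset entry can be sketched as below. The helper `entry_from_template` is a hypothetical name, not a pixie function; it copies only `entry_kwargs` and `eval_input`, and never carries `eval_output` over.

```python
import copy
from typing import Any, Optional


def entry_from_template(
    template: dict[str, Any],
    description: str,
    expectation: Optional[str] = None,
) -> dict[str, Any]:
    """Build a dataset entry from a formatted trace sample, dropping eval_output."""
    entry: dict[str, Any] = {
        "entry_kwargs": copy.deepcopy(template["entry_kwargs"]),
        "eval_input": copy.deepcopy(template["eval_input"]),
        "description": description,
    }
    if expectation is not None:
        entry["expectation"] = expectation  # concise criteria, not the real output
    return entry


template = {
    "entry_kwargs": {"user_message": "What are your business hours?"},
    "eval_input": [{"name": "customer_profile", "value": {"name": "Alice", "tier": "gold"}}],
    "expectation": None,
    "eval_output": {"response": "Our business hours are Monday to Friday, 9am to 5pm..."},
}
entry = entry_from_template(
    template,
    description="Customer asks about business hours",
    expectation="Should mention weekday hours (Mon-Fri 9am-5pm) and Saturday hours",
)
```

Deep-copying the template keeps later edits to one entry from leaking into others built from the same sample.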
## 4c. Generate dataset items
Create diverse entries guided by the reference trace and use cases:
- **`entry_kwargs` keys** must match the fields of the Pydantic model used in `Runnable.run(args: T)`
- **`eval_input`** must be a list of `{"name": ..., "value": ...}` objects matching the `name` values of `wrap(purpose="input")` calls in the app
- **Cover each use case** from `pixie_qa/02-eval-criteria.md` — at least one entry per use case, with meaningfully diverse inputs across entries
**If the user specified a dataset or data source in the prompt** (e.g., a JSON file with research questions or conversation scenarios), read that file, adapt each entry to the `entry_kwargs` / `eval_input` shape, and incorporate them into the dataset. Do NOT ignore specified data.
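Adapting a user-supplied file might look like the sketch below. The source record keys (`question`, `use_case`, `expected`) are invented for illustration, and `user_message` is borrowed from the examples in this step; substitute the keys of the actual file and the fields of your own Pydantic model.

```python
import copy
from typing import Any


def adapt_records(
    records: list[dict[str, Any]],
    shared_eval_input: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Map user-supplied records onto the entry_kwargs / eval_input shape."""
    entries = []
    for record in records:
        entry: dict[str, Any] = {
            "entry_kwargs": {"user_message": record["question"]},
            # Each entry gets its own copy so it can be edited independently.
            "eval_input": copy.deepcopy(shared_eval_input),
            "description": record["use_case"],
        }
        if record.get("expected"):
            entry["expectation"] = record["expected"]
        entries.append(entry)
    return entries


# Hypothetical user-supplied records.
records = [
    {"question": "When do you open?", "use_case": "Business hours", "expected": "Should give opening time"},
    {"question": "Change my plan", "use_case": "Ambiguous request"},
]
profile = [{"name": "customer_profile", "value": {"name": "Alice", "tier": "gold"}}]
adapted = adapt_records(records, profile)
```

Every source record yields one entry, so none of the specified data is dropped.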
## 4d. Build the dataset JSON file
Create the dataset at `pixie_qa/datasets/<name>.json`:
```json
{
"name": "qa-golden-set",
"runnable": "pixie_qa/scripts/run_app.py:AppRunnable",
"evaluators": ["Factuality", "pixie_qa/evaluators.py:concise_voice_style"],
"entries": [
{
"entry_kwargs": {
"user_message": "What are your business hours?"
},
"description": "Customer asks about business hours with gold tier account",
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Alice Johnson", "tier": "gold" }
}
],
"expectation": "Should mention Mon-Fri 9am-5pm and Sat 10am-2pm"
},
{
"entry_kwargs": {
"user_message": "I want to change something"
},
"description": "Ambiguous change request from basic tier customer",
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Bob Smith", "tier": "basic" }
}
],
"expectation": "Should ask for clarification",
"evaluators": ["...", "ClosedQA"]
},
{
"entry_kwargs": {
"user_message": "I want to end this call"
},
"description": "User requests call end after failed verification",
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Charlie Brown", "tier": "basic" }
}
],
"expectation": "Agent should call endCall tool and end the conversation",
"eval_metadata": {
"expected_tool": "endCall",
"expected_call_ended": true
},
"evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"]
}
]
}
```
### Key fields
**Entry structure** — all fields are top-level on each entry (flat structure — no nesting):
```
entry:
├── entry_kwargs (required) — args for Runnable.run()
├── eval_input (required) — list of {"name": ..., "value": ...} objects
├── description (required) — human-readable label for the test case
├── expectation (optional) — reference for comparison-based evaluators
├── eval_metadata (optional) — extra per-entry data for custom evaluators
└── evaluators (optional) — evaluator names for THIS entry
```
**Top-level fields:**
- **`name`**: Dataset name, as shown in the example above (e.g., `"qa-golden-set"`).
- **`runnable`** (required): `filepath:ClassName` reference to the `Runnable` class from Step 2 (e.g., `"pixie_qa/scripts/run_app.py:AppRunnable"`). Path is relative to the project root.
- **`evaluators`** (dataset-level, optional): Default evaluator names applied to every entry — the evaluators for criteria that apply to ALL use cases.
- **`entries`**: The list of test entries, each with the per-entry fields below.
**Per-entry fields (all top-level on each entry):**
- **`entry_kwargs`** (required): Keys match the Pydantic model fields for `Runnable.run(args: T)`. These are the app's entry-point inputs.
- **`eval_input`** (required): List of `{"name": ..., "value": ...}` objects. Names match `wrap(purpose="input")` names in the app.
- **`description`** (required): Use case one-liner from `pixie_qa/02-eval-criteria.md`.
- **`expectation`** (optional): Case-specific expectation text for evaluators that need a reference.
- **`eval_metadata`** (optional): Extra per-entry data for custom evaluators — e.g., expected tool names, boolean flags, thresholds. Accessible in evaluators as `evaluable.eval_metadata`.
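- **`evaluators`** (optional): Entry-level evaluator list. Include `"..."` to extend the dataset-level defaults, or leave it out to replace them for this entry.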
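Before running `pixie test`, the required fields listed above can be checked with a few lines of standard-library Python. This is a minimal sketch of the rules as described in this step, not pixie's own validation; `find_problems` is a hypothetical name.

```python
from typing import Any

REQUIRED_ENTRY_FIELDS = ("entry_kwargs", "eval_input", "description")


def find_problems(dataset: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems: list[str] = []
    if "runnable" not in dataset:
        problems.append("dataset: missing 'runnable'")
    for i, entry in enumerate(dataset.get("entries", [])):
        for field in REQUIRED_ENTRY_FIELDS:
            if field not in entry:
                problems.append(f"entry {i}: missing '{field}'")
        items = entry.get("eval_input")
        if isinstance(items, list):
            if not items:
                problems.append(f"entry {i}: eval_input must have at least one item")
            for item in items:
                # Each item needs both a 'name' and a 'value' key.
                if not (isinstance(item, dict) and {"name", "value"} <= item.keys()):
                    problems.append(f"entry {i}: eval_input item needs 'name' and 'value'")
        elif items is not None:
            problems.append(f"entry {i}: eval_input must be a list")
    return problems


good = {
    "runnable": "pixie_qa/scripts/run_app.py:AppRunnable",
    "entries": [
        {
            "entry_kwargs": {"user_message": "What are your business hours?"},
            "description": "Customer asks about business hours",
            "eval_input": [{"name": "customer_profile", "value": {"tier": "gold"}}],
        }
    ],
}
bad = {"entries": [{"entry_kwargs": {}, "eval_input": []}]}
good_problems = find_problems(good)
bad_problems = find_problems(bad)
```

To check a file on disk, load it with `json.load` and pass the resulting dict to `find_problems`.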
### Evaluator assignment rules
1. Evaluators that apply to ALL items go in the top-level `"evaluators"` array.
2. Items that need **additional** evaluators use `"evaluators": ["...", "ExtraEval"]` — `"..."` expands to defaults.
3. Items that need a **completely different** set use `"evaluators": ["OnlyThis"]` without `"..."`.
4. Items using only defaults: omit the `"evaluators"` field.
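The four rules can be expressed as one small function. This sketch follows the rules as written above and is not taken from pixie's implementation; `resolve_evaluators` is a hypothetical name.

```python
from typing import Optional


def resolve_evaluators(
    defaults: list[str],
    entry_evaluators: Optional[list[str]] = None,
) -> list[str]:
    """Compute the effective evaluator list for one entry."""
    if entry_evaluators is None:
        # Rule 4: the field is omitted, so only the defaults apply.
        return list(defaults)
    resolved: list[str] = []
    for name in entry_evaluators:
        if name == "...":
            resolved.extend(defaults)  # Rule 2: "..." expands to the defaults
        else:
            resolved.append(name)  # Rules 2 and 3: explicit evaluator name
    return resolved


defaults = ["Factuality", "pixie_qa/evaluators.py:concise_voice_style"]
only_defaults = resolve_evaluators(defaults)
extended = resolve_evaluators(defaults, ["...", "ClosedQA"])
replaced = resolve_evaluators(defaults, ["OnlyThis"])
```

With the defaults from the example dataset, `extended` contains both defaults followed by `ClosedQA`, and `replaced` contains only `OnlyThis`.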
---
## Dataset Creation Reference
### Using `eval_input` values
Each `eval_input` item is a `{"name": ..., "value": ...}` object. Use the reference trace as a template: copy the `"data"` field from the relevant `purpose="input"` event and adapt the values:
**Simple dict**:
```json
{ "name": "customer_profile", "value": { "name": "Alice", "tier": "gold" } }
```
**List of dicts** (e.g., conversation history):
```json
{
"name": "conversation_history",
"value": [
{ "role": "user", "content": "Hello" },
{ "role": "assistant", "content": "Hi there!" }
]
}
```
**Important**: The exact format depends on what the `wrap(purpose="input")` call captures. Always copy from the reference trace rather than constructing from scratch.
### Crafting diverse eval scenarios
Cover different aspects of each use case:
- Different user phrasings of the same request
- Edge cases (ambiguous input, missing information, error conditions)
- Entries that stress-test specific eval criteria
- At least one entry per use case from Step 1b
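A quick coverage check for the last point can be sketched as follows. It assumes each entry's `description` is exactly the use case one-liner from `pixie_qa/02-eval-criteria.md`; if your descriptions are more specific, the matching rule needs adjusting. The function name and sample use cases are hypothetical.

```python
from collections import Counter
from typing import Any


def uncovered_use_cases(
    use_cases: list[str],
    entries: list[dict[str, Any]],
) -> list[str]:
    """Return the use cases that no dataset entry describes."""
    counts = Counter(entry["description"] for entry in entries)
    return [use_case for use_case in use_cases if counts[use_case] == 0]


use_cases = ["Business hours question", "Ambiguous change request", "End call request"]
entries = [
    {"description": "Business hours question"},
    {"description": "Business hours question"},
    {"description": "Ambiguous change request"},
]
missing = uncovered_use_cases(use_cases, entries)
```

Any name left in `missing` points to a use case that still needs at least one entry.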
---
## Output
`pixie_qa/datasets/<name>.json` — the dataset file.