* update eval-driven-dev skill * small refinement of skill description * address review, rerun npm start.
10 KiB
Step 4: Build the Dataset
Why this step: The dataset ties everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1b) — into concrete test scenarios. At test time, pixie test calls the runnable with entry_kwargs, the wrap registry is populated with eval_input, and evaluators score the resulting captured outputs.
Understanding entry_kwargs, eval_input, and expectation
Before building the dataset, understand what these terms mean:
-
entry_kwargs= the kwargs passed toRunnable.run()as a Pydantic model. These are the entry-point inputs (user message, request body, CLI args). The keys must match the fields of the Pydantic model defined forrun(args: T). -
eval_input= a list of{"name": ..., "value": ...}objects corresponding towrap(purpose="input")calls in the app. At test time, these are injected automatically by the wrap registry;wrap(purpose="input")calls in the app return the registry value instead of calling the real external dependency.CRITICAL:
eval_inputmust have at least one item (enforced bymin_length=1validation). If the app has nowrap(purpose="input")calls, you must still include at least oneeval_inputitem — use the primary entry-point argument as a synthetic input:"eval_input": [ { "name": "user_input", "value": "What are your business hours?" } ]Each item is a
NamedDataobject withname(str) andvalue(any JSON-serializable value). -
expectation(optional) = case-specific evaluation reference. What a correct output should look like for this scenario. Used by evaluators that compare output against a reference (e.g.,Factuality,ClosedQA). Not needed for output-quality evaluators that don't require a reference. -
eval output = what the app actually produces, captured at runtime by
wrap(purpose="output")andwrap(purpose="state")calls. Not stored in the dataset — it's produced whenpixie testruns the app.
The reference trace at pixie_qa/reference-trace.jsonl is your primary source for data shapes:
- Filter it to see the exact serialized format for
eval_inputvalues - Read the
kwargsrecord to understand theentry_kwargsstructure - Read
purpose="output"/"state"events to understand what outputs the app produces, so you can write meaningfulexpectationvalues
4a. Derive evaluator assignments
The eval criteria artifact (pixie_qa/02-eval-criteria.md) maps each criterion to use cases. The evaluator mapping artifact (pixie_qa/03-evaluator-mapping.md) maps each criterion to a concrete evaluator name. Combine these:
- Dataset-level default evaluators: Criteria marked as applying to "All" use cases → their evaluator names go in the top-level
"evaluators"array. - Item-level evaluators: Criteria that apply to only a subset → their evaluator names go in
"evaluators"on the relevant rows only, using"..."to also include the defaults.
4b. Inspect data shapes with pixie format
Use pixie format on the reference trace to see the exact data shapes and the real app output in dataset-entry format:
pixie format --input reference-trace.jsonl --output dataset-sample.json
The output looks like:
{
"entry_kwargs": {
"user_message": "What are your business hours?"
},
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Alice", "tier": "gold" }
},
{
"name": "conversation_history",
"value": [{ "role": "user", "content": "What are your hours?" }]
}
],
"expectation": null,
"eval_output": {
"response": "Our business hours are Monday to Friday, 9am to 5pm..."
}
}
Important: The eval_output in this template is the full real output produced by the running app. Do NOT copy eval_output into your dataset entries — it would make tests trivially pass by giving evaluators the real answer. Instead:
- Use
entry_kwargsandeval_inputas exact templates for data keys and format - Look at
eval_outputto understand what the app produces — then write a conciseexpectationdescription that captures the key quality criteria for each scenario
Example: if eval_output.response is "Our business hours are Monday to Friday, 9 AM to 5 PM, and Saturday 10 AM to 2 PM.", write expectation as "Should mention weekday hours (Mon–Fri 9am–5pm) and Saturday hours" — a short description a human or LLM evaluator can compare against.
4c. Generate dataset items
Create diverse entries guided by the reference trace and use cases:
entry_kwargskeys must match the fields of the Pydantic model used inRunnable.run(args: T)eval_inputmust be a list of{"name": ..., "value": ...}objects matching thenamevalues ofwrap(purpose="input")calls in the app- Cover each use case from
pixie_qa/02-eval-criteria.md— at least one entry per use case, with meaningfully diverse inputs across entries
If the user specified a dataset or data source in the prompt (e.g., a JSON file with research questions or conversation scenarios), read that file, adapt each entry to the entry_kwargs / eval_input shape, and incorporate them into the dataset. Do NOT ignore specified data.
4d. Build the dataset JSON file
Create the dataset at pixie_qa/datasets/<name>.json:
{
"name": "qa-golden-set",
"runnable": "pixie_qa/scripts/run_app.py:AppRunnable",
"evaluators": ["Factuality", "pixie_qa/evaluators.py:concise_voice_style"],
"entries": [
{
"entry_kwargs": {
"user_message": "What are your business hours?"
},
"description": "Customer asks about business hours with gold tier account",
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Alice Johnson", "tier": "gold" }
}
],
"expectation": "Should mention Mon-Fri 9am-5pm and Sat 10am-2pm"
},
{
"entry_kwargs": {
"user_message": "I want to change something"
},
"description": "Ambiguous change request from basic tier customer",
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Bob Smith", "tier": "basic" }
}
],
"expectation": "Should ask for clarification",
"evaluators": ["...", "ClosedQA"]
},
{
"entry_kwargs": {
"user_message": "I want to end this call"
},
"description": "User requests call end after failed verification",
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Charlie Brown", "tier": "basic" }
}
],
"expectation": "Agent should call endCall tool and end the conversation",
"eval_metadata": {
"expected_tool": "endCall",
"expected_call_ended": true
},
"evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"]
}
]
}
Key fields
Entry structure — all fields are top-level on each entry (flat structure — no nesting):
entry:
├── entry_kwargs (required) — args for Runnable.run()
├── eval_input (required) — list of {"name": ..., "value": ...} objects
├── description (required) — human-readable label for the test case
├── expectation (optional) — reference for comparison-based evaluators
├── eval_metadata (optional) — extra per-entry data for custom evaluators
└── evaluators (optional) — evaluator names for THIS entry
Top-level fields:
runnable(required):filepath:ClassNamereference to theRunnableclass from Step 2 (e.g.,"pixie_qa/scripts/run_app.py:AppRunnable"). Path is relative to the project root.evaluators(dataset-level, optional): Default evaluator names applied to every entry — the evaluators for criteria that apply to ALL use cases.
Per-entry fields (all top-level on each entry):
entry_kwargs(required): Keys match the Pydantic model fields forRunnable.run(args: T). These are the app's entry-point inputs.eval_input(required): List of{"name": ..., "value": ...}objects. Names matchwrap(purpose="input")names in the app.description(required): Use case one-liner frompixie_qa/02-eval-criteria.md.expectation(optional): Case-specific expectation text for evaluators that need a reference.eval_metadata(optional): Extra per-entry data for custom evaluators — e.g., expected tool names, boolean flags, thresholds. Accessible in evaluators asevaluable.eval_metadata.evaluators(optional): Row-level evaluator override.
Evaluator assignment rules
- Evaluators that apply to ALL items go in the top-level
"evaluators"array. - Items that need additional evaluators use
"evaluators": ["...", "ExtraEval"]—"..."expands to defaults. - Items that need a completely different set use
"evaluators": ["OnlyThis"]without"...". - Items using only defaults: omit the
"evaluators"field.
Dataset Creation Reference
Using eval_input values
The eval_input values are {"name": ..., "value": ...} objects. Use the reference trace as templates — copy the "data" field from the relevant purpose="input" event and adapt the values:
Simple dict:
{ "name": "customer_profile", "value": { "name": "Alice", "tier": "gold" } }
List of dicts (e.g., conversation history):
{
"name": "conversation_history",
"value": [
{ "role": "user", "content": "Hello" },
{ "role": "assistant", "content": "Hi there!" }
]
}
Important: The exact format depends on what the wrap(purpose="input") call captures. Always copy from the reference trace rather than constructing from scratch.
Crafting diverse eval scenarios
Cover different aspects of each use case:
- Different user phrasings of the same request
- Edge cases (ambiguous input, missing information, error conditions)
- Entries that stress-test specific eval criteria
- At least one entry per use case from Step 1b
Output
pixie_qa/datasets/<name>.json — the dataset file.