Yiou Li 5f59ddb9cf update eval-driven-dev skill (#1352)
* update eval-driven-dev skill

* small refinement of skill description

* address review, rerun npm start.
2026-04-10 11:19:28 +10:00


Step 4: Build the Dataset

Why this step: The dataset ties everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1b) — into concrete test scenarios. At test time, pixie test calls the runnable with entry_kwargs, the wrap registry is populated with eval_input, and evaluators score the resulting captured outputs.


Understanding entry_kwargs, eval_input, and expectation

Before building the dataset, understand what these terms mean:

  • entry_kwargs = the kwargs passed to Runnable.run() as a Pydantic model. These are the entry-point inputs (user message, request body, CLI args). The keys must match the fields of the Pydantic model defined for run(args: T).

  • eval_input = a list of {"name": ..., "value": ...} objects corresponding to wrap(purpose="input") calls in the app. At test time, these are injected automatically by the wrap registry; wrap(purpose="input") calls in the app return the registry value instead of calling the real external dependency.

    CRITICAL: eval_input must have at least one item (enforced by min_length=1 validation). If the app has no wrap(purpose="input") calls, you must still include at least one eval_input item — use the primary entry-point argument as a synthetic input:

    "eval_input": [
      { "name": "user_input", "value": "What are your business hours?" }
    ]
    

    Each item is a NamedData object with name (str) and value (any JSON-serializable value).

  • expectation (optional) = case-specific evaluation reference. What a correct output should look like for this scenario. Used by evaluators that compare output against a reference (e.g., Factuality, ClosedQA). Not needed for output-quality evaluators that don't require a reference.

  • eval output = what the app actually produces, captured at runtime by wrap(purpose="output") and wrap(purpose="state") calls. Not stored in the dataset — it's produced when pixie test runs the app.

The reference trace at pixie_qa/reference-trace.jsonl is your primary source for data shapes:

  • Filter it to the purpose="input" events to see the exact serialized format for eval_input values
  • Read the kwargs record to understand the entry_kwargs structure
  • Read purpose="output"/"state" events to understand what outputs the app produces, so you can write meaningful expectation values
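
The filtering described above can be sketched in a few lines of Python. This is only an illustration: the record layout used here (a "purpose" field, a "name" field, and a "data" field on each JSONL line, plus a "type" field) is an assumption, so check the real pixie_qa/reference-trace.jsonl before relying on any of these key names.

```python
import json

# Hypothetical trace lines. The real record schema may differ, so inspect
# pixie_qa/reference-trace.jsonl before reusing these field names.
SAMPLE_TRACE = """\
{"type": "kwargs", "data": {"user_message": "What are your business hours?"}}
{"type": "wrap", "purpose": "input", "name": "customer_profile", "data": {"name": "Alice", "tier": "gold"}}
{"type": "wrap", "purpose": "output", "name": "response", "data": "Our business hours are..."}
"""


def events_with_purpose(jsonl_text, purpose):
    """Return the parsed records whose "purpose" field equals `purpose`."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return [record for record in records if record.get("purpose") == purpose]


input_events = events_with_purpose(SAMPLE_TRACE, "input")
for event in input_events:
    # Print each input name with its serialized value, ready to adapt.
    print(event["name"], "->", json.dumps(event["data"]))
```

To run this against the real trace, read the file contents and pass them in place of SAMPLE_TRACE.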

4a. Derive evaluator assignments

The eval criteria artifact (pixie_qa/02-eval-criteria.md) maps each criterion to use cases. The evaluator mapping artifact (pixie_qa/03-evaluator-mapping.md) maps each criterion to a concrete evaluator name. Combine these:

  1. Dataset-level default evaluators: Criteria marked as applying to "All" use cases → their evaluator names go in the top-level "evaluators" array.
  2. Item-level evaluators: Criteria that apply to only a subset → their evaluator names go in "evaluators" on the relevant rows only, using "..." to also include the defaults.
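
As a rough sketch of combining the two artifacts, the split can be expressed as a small function. The in-memory `criteria` list below is a hypothetical representation (both artifacts are Markdown files, and nothing here parses them), and the use case labels are made up for illustration.

```python
def split_evaluators(criteria):
    """Split criteria into dataset-level defaults and per-use-case extras.

    `criteria` is a hypothetical in-memory form of the two artifacts: each
    item has an "evaluator" name and "use_cases", which is "All" or a list.
    """
    defaults = []
    per_use_case = {}
    for criterion in criteria:
        if criterion["use_cases"] == "All":
            defaults.append(criterion["evaluator"])
        else:
            for use_case in criterion["use_cases"]:
                # Start each row-level list with "..." so defaults still apply.
                per_use_case.setdefault(use_case, ["..."]).append(criterion["evaluator"])
    return defaults, per_use_case


criteria = [
    {"evaluator": "Factuality", "use_cases": "All"},
    {"evaluator": "ClosedQA", "use_cases": ["ambiguous request"]},
    {"evaluator": "pixie_qa/evaluators.py:tool_call_check", "use_cases": ["end call"]},
]
defaults, per_use_case = split_evaluators(criteria)
```

Here `defaults` is what would go in the top-level "evaluators" array, and each list in `per_use_case` is what would go in "evaluators" on the rows for that use case.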

4b. Inspect data shapes with pixie format

Use pixie format on the reference trace to see the exact data shapes and the real app output in dataset-entry format:

pixie format --input reference-trace.jsonl --output dataset-sample.json

The output looks like:

{
  "entry_kwargs": {
    "user_message": "What are your business hours?"
  },
  "eval_input": [
    {
      "name": "customer_profile",
      "value": { "name": "Alice", "tier": "gold" }
    },
    {
      "name": "conversation_history",
      "value": [{ "role": "user", "content": "What are your hours?" }]
    }
  ],
  "expectation": null,
  "eval_output": {
    "response": "Our business hours are Monday to Friday, 9am to 5pm..."
  }
}

Important: The eval_output in this template is the full real output produced by the running app. Do NOT copy eval_output into your dataset entries — it would make tests trivially pass by giving evaluators the real answer. Instead:

  • Use entry_kwargs and eval_input as exact templates for data keys and format
  • Look at eval_output to understand what the app produces — then write a concise expectation description that captures the key quality criteria for each scenario

Example: if eval_output.response is "Our business hours are Monday to Friday, 9 AM to 5 PM, and Saturday 10 AM to 2 PM.", write expectation as "Should mention weekday hours (Mon-Fri 9am-5pm) and Saturday hours", a short description a human or LLM evaluator can compare against.
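
One way to make this hard to get wrong is to build entries through a helper that never copies eval_output. The sketch below assumes the sample has the shape shown above; `entry_from_sample` is a hypothetical helper, not part of pixie.

```python
import copy


def entry_from_sample(sample, description, expectation=None):
    """Build a dataset entry from a `pixie format` sample, omitting eval_output."""
    entry = {
        # Deep copies keep later edits to the entry from mutating the sample.
        "entry_kwargs": copy.deepcopy(sample["entry_kwargs"]),
        "eval_input": copy.deepcopy(sample["eval_input"]),
        "description": description,
    }
    if expectation is not None:
        entry["expectation"] = expectation
    return entry


sample = {
    "entry_kwargs": {"user_message": "What are your business hours?"},
    "eval_input": [
        {"name": "customer_profile", "value": {"name": "Alice", "tier": "gold"}}
    ],
    "expectation": None,
    "eval_output": {"response": "Our business hours are Monday to Friday, 9am to 5pm..."},
}
entry = entry_from_sample(
    sample,
    description="Customer asks about business hours with gold tier account",
    expectation="Should mention weekday hours (Mon-Fri 9am-5pm) and Saturday hours",
)
```

The expectation is still written by hand after reading eval_output; the helper only guarantees that the raw output never ends up in the entry.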

4c. Generate dataset items

Create diverse entries guided by the reference trace and use cases:

  • entry_kwargs keys must match the fields of the Pydantic model used in Runnable.run(args: T)
  • eval_input must be a list of {"name": ..., "value": ...} objects matching the name values of wrap(purpose="input") calls in the app
  • Cover each use case from pixie_qa/02-eval-criteria.md — at least one entry per use case, with meaningfully diverse inputs across entries

If the user specified a dataset or data source in the prompt (e.g., a JSON file with research questions or conversation scenarios), read that file, adapt each entry to the entry_kwargs / eval_input shape, and incorporate them into the dataset. Do NOT ignore specified data.
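
As an illustration of adapting user-specified data, the sketch below converts a list of question records into entries. The source record keys ("question", "scenario") are hypothetical, and the "user_message" field is borrowed from the examples in this step; the real keys must match your Pydantic model and your wrap(purpose="input") names. It assumes an app with no wrap(purpose="input") calls, so it uses the synthetic user_input item described earlier.

```python
def entries_from_questions(records):
    """Adapt user-supplied question records into dataset entries.

    Assumes each record has hypothetical "question" and "scenario" keys.
    """
    entries = []
    for record in records:
        entries.append(
            {
                "entry_kwargs": {"user_message": record["question"]},
                # Synthetic input: eval_input must contain at least one item.
                "eval_input": [{"name": "user_input", "value": record["question"]}],
                "description": record["scenario"],
            }
        )
    return entries


user_records = [
    {"question": "What are your business hours?", "scenario": "Business hours inquiry"},
    {"question": "I want to change something", "scenario": "Ambiguous change request"},
]
adapted = entries_from_questions(user_records)
```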

4d. Build the dataset JSON file

Create the dataset at pixie_qa/datasets/<name>.json:

{
  "name": "qa-golden-set",
  "runnable": "pixie_qa/scripts/run_app.py:AppRunnable",
  "evaluators": ["Factuality", "pixie_qa/evaluators.py:concise_voice_style"],
  "entries": [
    {
      "entry_kwargs": {
        "user_message": "What are your business hours?"
      },
      "description": "Customer asks about business hours with gold tier account",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Alice Johnson", "tier": "gold" }
        }
      ],
      "expectation": "Should mention Mon-Fri 9am-5pm and Sat 10am-2pm"
    },
    {
      "entry_kwargs": {
        "user_message": "I want to change something"
      },
      "description": "Ambiguous change request from basic tier customer",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Bob Smith", "tier": "basic" }
        }
      ],
      "expectation": "Should ask for clarification",
      "evaluators": ["...", "ClosedQA"]
    },
    {
      "entry_kwargs": {
        "user_message": "I want to end this call"
      },
      "description": "User requests call end after failed verification",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Charlie Brown", "tier": "basic" }
        }
      ],
      "expectation": "Agent should call endCall tool and end the conversation",
      "eval_metadata": {
        "expected_tool": "endCall",
        "expected_call_ended": true
      },
      "evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"]
    }
  ]
}

Key fields

Entry structure — all fields are top-level on each entry (flat structure — no nesting):

entry:
  ├── entry_kwargs    (required) — args for Runnable.run()
  ├── eval_input      (required) — list of {"name": ..., "value": ...} objects
  ├── description     (required) — human-readable label for the test case
  ├── expectation     (optional) — reference for comparison-based evaluators
  ├── eval_metadata   (optional) — extra per-entry data for custom evaluators
  └── evaluators      (optional) — evaluator names for THIS entry

Top-level fields:

  • runnable (required): filepath:ClassName reference to the Runnable class from Step 2 (e.g., "pixie_qa/scripts/run_app.py:AppRunnable"). Path is relative to the project root.
  • evaluators (dataset-level, optional): Default evaluator names applied to every entry — the evaluators for criteria that apply to ALL use cases.

Per-entry fields (all top-level on each entry):

  • entry_kwargs (required): Keys match the Pydantic model fields for Runnable.run(args: T). These are the app's entry-point inputs.
  • eval_input (required): List of {"name": ..., "value": ...} objects. Names match wrap(purpose="input") names in the app.
  • description (required): Use case one-liner from pixie_qa/02-eval-criteria.md.
  • expectation (optional): Case-specific expectation text for evaluators that need a reference.
  • eval_metadata (optional): Extra per-entry data for custom evaluators — e.g., expected tool names, boolean flags, thresholds. Accessible in evaluators as evaluable.eval_metadata.
  • evaluators (optional): Row-level evaluator override.
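
Before running pixie test, it can help to sanity-check entries against the required fields listed above. The function below is a minimal sketch in plain Python and is not pixie's own validation, which may be stricter or report errors differently.

```python
def entry_problems(entry):
    """Return a list of human-readable problems with one dataset entry."""
    problems = []
    if not isinstance(entry.get("entry_kwargs"), dict):
        problems.append("entry_kwargs must be an object")

    eval_input = entry.get("eval_input")
    if not isinstance(eval_input, list) or len(eval_input) < 1:
        problems.append("eval_input must be a list with at least one item")
    else:
        for index, item in enumerate(eval_input):
            has_name = isinstance(item, dict) and isinstance(item.get("name"), str)
            if not has_name or "value" not in item:
                problems.append(f"eval_input[{index}] needs a string name and a value")

    description = entry.get("description")
    if not isinstance(description, str) or not description.strip():
        problems.append("description must be a non-empty string")
    return problems


good_entry = {
    "entry_kwargs": {"user_message": "What are your business hours?"},
    "eval_input": [{"name": "user_input", "value": "What are your business hours?"}],
    "description": "Customer asks about business hours",
}
bad_entry = {"entry_kwargs": {"user_message": "Hi"}, "eval_input": []}
```

Calling `entry_problems(good_entry)` yields an empty list, while `bad_entry` is flagged for the empty eval_input and the missing description.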

Evaluator assignment rules

  1. Evaluators that apply to ALL items go in the top-level "evaluators" array.
  2. Items that need additional evaluators use "evaluators": ["...", "ExtraEval"]; the "..." expands to the defaults.
  3. Items that need a completely different set use "evaluators": ["OnlyThis"] without "...".
  4. Items using only defaults: omit the "evaluators" field.
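
The four rules can be read as a small resolution function. This is a sketch of the behavior as described here, not pixie's actual implementation; `effective_evaluators` is a hypothetical name.

```python
def effective_evaluators(dataset_defaults, entry):
    """Resolve the evaluators that apply to one entry under the rules above."""
    row_level = entry.get("evaluators")
    if row_level is None:
        # Rule 4: no "evaluators" field means the entry uses only the defaults.
        return list(dataset_defaults)
    resolved = []
    for name in row_level:
        if name == "...":
            # Rule 2: "..." expands to the dataset-level defaults in place.
            resolved.extend(dataset_defaults)
        else:
            resolved.append(name)
    # Rule 3: without "...", only the row-level names remain.
    return resolved


dataset_defaults = ["Factuality", "pixie_qa/evaluators.py:concise_voice_style"]
only_defaults = effective_evaluators(dataset_defaults, {})
with_extra = effective_evaluators(dataset_defaults, {"evaluators": ["...", "ClosedQA"]})
replaced = effective_evaluators(dataset_defaults, {"evaluators": ["OnlyThis"]})
```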

Dataset Creation Reference

Using eval_input values

Each eval_input item is a {"name": ..., "value": ...} object. Use the reference trace as a template: copy the "data" field from the relevant purpose="input" event and adapt the values.

Simple dict:

{ "name": "customer_profile", "value": { "name": "Alice", "tier": "gold" } }

List of dicts (e.g., conversation history):

{
  "name": "conversation_history",
  "value": [
    { "role": "user", "content": "Hello" },
    { "role": "assistant", "content": "Hi there!" }
  ]
}

Important: The exact format depends on what the wrap(purpose="input") call captures. Always copy from the reference trace rather than constructing from scratch.

Crafting diverse eval scenarios

Cover different aspects of each use case:

  • Different user phrasings of the same request
  • Edge cases (ambiguous input, missing information, error conditions)
  • Entries that stress-test specific eval criteria
  • At least one entry per use case from Step 1b

Output

pixie_qa/datasets/<name>.json — the dataset file.