update eval-driven-dev skill (#1352)

* update eval-driven-dev skill * small refinement of skill description * address review, rerun npm start.
2026-07-17 11:23:27 +00:00 · 2026-04-09 18:19:28 -07:00
parent 88b1920cb7
commit 5f59ddb9cf
19 changed files with 2180 additions and 1708 deletions
@@ -0,0 +1,228 @@
+# Step 4: Build the Dataset
+
+**Why this step**: The dataset ties everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1b) — into concrete test scenarios. At test time, `pixie test` calls the runnable with `entry_kwargs`, the wrap registry is populated with `eval_input`, and evaluators score the resulting captured outputs.
+
+---
+
+## Understanding `entry_kwargs`, `eval_input`, and `expectation`
+
+Before building the dataset, understand what these terms mean:
+
+- **`entry_kwargs`** = the kwargs passed to `Runnable.run()` as a Pydantic model. These are the entry-point inputs (user message, request body, CLI args). The keys must match the fields of the Pydantic model defined for `run(args: T)`.
+
+- **`eval_input`** = a list of `{"name": ..., "value": ...}` objects corresponding to `wrap(purpose="input")` calls in the app. At test time, these are injected automatically by the wrap registry; `wrap(purpose="input")` calls in the app return the registry value instead of calling the real external dependency.
+
+  **CRITICAL**: `eval_input` must have **at least one item** (enforced by `min_length=1` validation). If the app has no `wrap(purpose="input")` calls, you must still include at least one `eval_input` item — use the primary entry-point argument as a synthetic input:
+
+  ```json
+  "eval_input": [
+    { "name": "user_input", "value": "What are your business hours?" }
+  ]
+  ```
+
+  Each item is a `NamedData` object with `name` (str) and `value` (any JSON-serializable value).
+
+- **`expectation`** (optional) = case-specific evaluation reference. What a correct output should look like for this scenario. Used by evaluators that compare output against a reference (e.g., `Factuality`, `ClosedQA`). Not needed for output-quality evaluators that don't require a reference.
+
+- **eval output** = what the app actually produces, captured at runtime by `wrap(purpose="output")` and `wrap(purpose="state")` calls. **Not stored in the dataset** — it's produced when `pixie test` runs the app.
+
+The **reference trace** at `pixie_qa/reference-trace.jsonl` is your primary source for data shapes:
+
+- Filter it to see the exact serialized format for `eval_input` values
+- Read the `kwargs` record to understand the `entry_kwargs` structure
+- Read `purpose="output"/"state"` events to understand what outputs the app produces, so you can write meaningful `expectation` values
+
+---
+
+## 4a. Derive evaluator assignments
+
+The eval criteria artifact (`pixie_qa/02-eval-criteria.md`) maps each criterion to use cases. The evaluator mapping artifact (`pixie_qa/03-evaluator-mapping.md`) maps each criterion to a concrete evaluator name. Combine these:
+
+1. **Dataset-level default evaluators**: Criteria marked as applying to "All" use cases → their evaluator names go in the top-level `"evaluators"` array.
+2. **Item-level evaluators**: Criteria that apply to only a subset → their evaluator names go in `"evaluators"` on the relevant rows only, using `"..."` to also include the defaults.
+
+## 4b. Inspect data shapes with `pixie format`
+
+Use `pixie format` on the reference trace to see the exact data shapes **and** the real app output in dataset-entry format:
+
+```bash
+pixie format --input reference-trace.jsonl --output dataset-sample.json
+```
+
+The output looks like:
+
+```json
+{
+  "entry_kwargs": {
+    "user_message": "What are your business hours?"
+  },
+  "eval_input": [
+    {
+      "name": "customer_profile",
+      "value": { "name": "Alice", "tier": "gold" }
+    },
+    {
+      "name": "conversation_history",
+      "value": [{ "role": "user", "content": "What are your hours?" }]
+    }
+  ],
+  "expectation": null,
+  "eval_output": {
+    "response": "Our business hours are Monday to Friday, 9am to 5pm..."
+  }
+}
+```
+
+**Important**: The `eval_output` in this template is the **full real output** produced by the running app. Do NOT copy `eval_output` into your dataset entries — it would make tests trivially pass by giving evaluators the real answer. Instead:
+
+- Use `entry_kwargs` and `eval_input` as exact templates for data keys and format
+- Look at `eval_output` to understand what the app produces — then write a **concise `expectation` description** that captures the key quality criteria for each scenario
+
+**Example**: if `eval_output.response` is `"Our business hours are Monday to Friday, 9 AM to 5 PM, and Saturday 10 AM to 2 PM."`, write `expectation` as `"Should mention weekday hours (Mon–Fri 9am–5pm) and Saturday hours"` — a short description a human or LLM evaluator can compare against.
+
+## 4c. Generate dataset items
+
+Create diverse entries guided by the reference trace and use cases:
+
+- **`entry_kwargs` keys** must match the fields of the Pydantic model used in `Runnable.run(args: T)`
+- **`eval_input`** must be a list of `{"name": ..., "value": ...}` objects matching the `name` values of `wrap(purpose="input")` calls in the app
+- **Cover each use case** from `pixie_qa/02-eval-criteria.md` — at least one entry per use case, with meaningfully diverse inputs across entries
+
+**If the user specified a dataset or data source in the prompt** (e.g., a JSON file with research questions or conversation scenarios), read that file, adapt each entry to the `entry_kwargs` / `eval_input` shape, and incorporate them into the dataset. Do NOT ignore specified data.
+
+## 4d. Build the dataset JSON file
+
+Create the dataset at `pixie_qa/datasets/<name>.json`:
+
+```json
+{
+  "name": "qa-golden-set",
+  "runnable": "pixie_qa/scripts/run_app.py:AppRunnable",
+  "evaluators": ["Factuality", "pixie_qa/evaluators.py:concise_voice_style"],
+  "entries": [
+    {
+      "entry_kwargs": {
+        "user_message": "What are your business hours?"
+      },
+      "description": "Customer asks about business hours with gold tier account",
+      "eval_input": [
+        {
+          "name": "customer_profile",
+          "value": { "name": "Alice Johnson", "tier": "gold" }
+        }
+      ],
+      "expectation": "Should mention Mon-Fri 9am-5pm and Sat 10am-2pm"
+    },
+    {
+      "entry_kwargs": {
+        "user_message": "I want to change something"
+      },
+      "description": "Ambiguous change request from basic tier customer",
+      "eval_input": [
+        {
+          "name": "customer_profile",
+          "value": { "name": "Bob Smith", "tier": "basic" }
+        }
+      ],
+      "expectation": "Should ask for clarification",
+      "evaluators": ["...", "ClosedQA"]
+    },
+    {
+      "entry_kwargs": {
+        "user_message": "I want to end this call"
+      },
+      "description": "User requests call end after failed verification",
+      "eval_input": [
+        {
+          "name": "customer_profile",
+          "value": { "name": "Charlie Brown", "tier": "basic" }
+        }
+      ],
+      "expectation": "Agent should call endCall tool and end the conversation",
+      "eval_metadata": {
+        "expected_tool": "endCall",
+        "expected_call_ended": true
+      },
+      "evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"]
+    }
+  ]
+}
+```
+
+### Key fields
+
+**Entry structure** — all fields are top-level on each entry (flat structure — no nesting):
+
+```
+entry:
+  ├── entry_kwargs    (required) — args for Runnable.run()
+  ├── eval_input      (required) — list of {"name": ..., "value": ...} objects
+  ├── description     (required) — human-readable label for the test case
+  ├── expectation     (optional) — reference for comparison-based evaluators
+  ├── eval_metadata   (optional) — extra per-entry data for custom evaluators
+  └── evaluators      (optional) — evaluator names for THIS entry
+```
+
+**Top-level fields:**
+
+- **`runnable`** (required): `filepath:ClassName` reference to the `Runnable` class from Step 2 (e.g., `"pixie_qa/scripts/run_app.py:AppRunnable"`). Path is relative to the project root.
+- **`evaluators`** (dataset-level, optional): Default evaluator names applied to every entry — the evaluators for criteria that apply to ALL use cases.
+
+**Per-entry fields (all top-level on each entry):**
+
+- **`entry_kwargs`** (required): Keys match the Pydantic model fields for `Runnable.run(args: T)`. These are the app's entry-point inputs.
+- **`eval_input`** (required): List of `{"name": ..., "value": ...}` objects. Names match `wrap(purpose="input")` names in the app.
+- **`description`** (required): Use case one-liner from `pixie_qa/02-eval-criteria.md`.
+- **`expectation`** (optional): Case-specific expectation text for evaluators that need a reference.
+- **`eval_metadata`** (optional): Extra per-entry data for custom evaluators — e.g., expected tool names, boolean flags, thresholds. Accessible in evaluators as `evaluable.eval_metadata`.
+- **`evaluators`** (optional): Row-level evaluator override.
+
+### Evaluator assignment rules
+
+1. Evaluators that apply to ALL items go in the top-level `"evaluators"` array.
+2. Items that need **additional** evaluators use `"evaluators": ["...", "ExtraEval"]` — `"..."` expands to defaults.
+3. Items that need a **completely different** set use `"evaluators": ["OnlyThis"]` without `"..."`.
+4. Items using only defaults: omit the `"evaluators"` field.
+
+---
+
+## Dataset Creation Reference
+
+### Using `eval_input` values
+
+The `eval_input` values are `{"name": ..., "value": ...}` objects. Use the reference trace as templates — copy the `"data"` field from the relevant `purpose="input"` event and adapt the values:
+
+**Simple dict**:
+
+```json
+{ "name": "customer_profile", "value": { "name": "Alice", "tier": "gold" } }
+```
+
+**List of dicts** (e.g., conversation history):
+
+```json
+{
+  "name": "conversation_history",
+  "value": [
+    { "role": "user", "content": "Hello" },
+    { "role": "assistant", "content": "Hi there!" }
+  ]
+}
+```
+
+**Important**: The exact format depends on what the `wrap(purpose="input")` call captures. Always copy from the reference trace rather than constructing from scratch.
+
+### Crafting diverse eval scenarios
+
+Cover different aspects of each use case:
+
+- Different user phrasings of the same request
+- Edge cases (ambiguous input, missing information, error conditions)
+- Entries that stress-test specific eval criteria
+- At least one entry per use case from Step 1b
+
+---
+
+## Output
+
+`pixie_qa/datasets/<name>.json` — the dataset file.