From df0ed6aa51eb0347c0cedc66d3a9ca7c2fbfd05b Mon Sep 17 00:00:00 2001
From: Yiou Li <liyiousu@gmail.com>
Date: Sun, 29 Mar 2026 14:07:39 -0700
Subject: [PATCH] update eval-driven-dev skill. (#1201)

* update eval-driven-dev skill.

Split SKILL into multi-level to keep the skill body under 500 lines, rewrite instructions.

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
 docs/README.skills.md                         |    2 +-
 skills/eval-driven-dev/SKILL.md               | 1184 +++++------------
 .../references/dataset-generation.md          |  235 ++++
 .../eval-driven-dev/references/eval-tests.md  |  241 ++++
 .../references/instrumentation.md             |  174 +++
 .../references/investigation.md               |  146 ++
 .../eval-driven-dev/references/pixie-api.md   |  286 ++--
 .../references/run-harness-patterns.md        |  281 ++++
 .../references/understanding-app.md           |  201 +++
 9 files changed, 1803 insertions(+), 947 deletions(-)
 create mode 100644 skills/eval-driven-dev/references/dataset-generation.md
 create mode 100644 skills/eval-driven-dev/references/eval-tests.md
 create mode 100644 skills/eval-driven-dev/references/instrumentation.md
 create mode 100644 skills/eval-driven-dev/references/investigation.md
 create mode 100644 skills/eval-driven-dev/references/run-harness-patterns.md
 create mode 100644 skills/eval-driven-dev/references/understanding-app.md
diff --git a/docs/README.skills.md b/docs/README.skills.md
index ba71dbdd..fc1c3cf2 100644
--- a/docs/README.skills.md
+++ b/docs/README.skills.md
@@ -120,7 +120,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
 | [ef-core](../skills/ef-core/SKILL.md) | Get best practices for Entity Framework Core | None |
 | [email-drafter](../skills/email-drafter/SKILL.md) | Draft and review professional emails that match your personal writing style. Analyzes your sent emails for tone, greeting, structure, and sign-off patterns via WorkIQ, then generates context-aware drafts for any recipient. USE FOR: draft email, write email, compose email, reply email, follow-up email, analyze email tone, email style. | None |
 | [entra-agent-user](../skills/entra-agent-user/SKILL.md) | Create Agent Users in Microsoft Entra ID from Agent Identities, enabling AI agents to act as digital workers with user identity capabilities in Microsoft 365 and Azure environments. | None |
-| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Add instrumentation, build golden datasets, write eval-based tests, run them, root-cause failures, and iterate — Ensure your Python LLM application works correctly. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM. Use for making sure an LLM application works correctly, catching regressions after prompt changes, fixing unexpected behavior, or validating output quality before shipping. | `references/pixie-api.md` |
+| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM. | `references/dataset-generation.md`<br />`references/eval-tests.md`<br />`references/instrumentation.md`<br />`references/investigation.md`<br />`references/pixie-api.md`<br />`references/run-harness-patterns.md`<br />`references/understanding-app.md` |
 | [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md`<br />`references/excalidraw-schema.md`<br />`scripts/.gitignore`<br />`scripts/README.md`<br />`scripts/add-arrow.py`<br />`scripts/add-icon-to-diagram.py`<br />`scripts/split-excalidraw-library.py`<br />`templates` |
 | [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md`<br />`references/pyspark.md` |
 | [fedora-linux-triage](../skills/fedora-linux-triage/SKILL.md) | Triage and resolve Fedora issues with dnf, systemd, and SELinux-aware guidance. | None |
diff --git a/skills/eval-driven-dev/SKILL.md b/skills/eval-driven-dev/SKILL.md
index 51e02539..498bca26 100644
--- a/skills/eval-driven-dev/SKILL.md
+++ b/skills/eval-driven-dev/SKILL.md
@@ -1,862 +1,378 @@
 ---
 name: eval-driven-dev
-description: Add instrumentation, build golden datasets, write eval-based tests, run them, root-cause failures, and iterate — Ensure your Python LLM application works correctly. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM. Use for making sure an LLM application works correctly, catching regressions after prompt changes, fixing unexpected behavior, or validating output quality before shipping.
+description: >
+  Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets,
+  write and run eval tests, and iterate on failures.
+  ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals,
+  evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM.
 license: MIT
 compatibility: Python 3.11+
 metadata:
-  version: 0.1.11
+  version: 0.2.0
 ---
 
-# Evaluation-Driven Development for Python LLM Applications
+# Eval-Driven Development for Python LLM Applications
 
-This skill is about doing the work, not describing it. When a user asks you to set up evals for their app, you should be reading their code, editing their files, running commands, and producing a working test pipeline — not writing a plan for them to follow later.
+You're building an **automated QA pipeline** that tests a Python application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via `pixie test`.
 
-## Startup checks (always first)
+**What you're testing is the app itself** — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of `assertEqual` — but the thing under test is the app's code, not the LLM.
 
-Attempt to upgrade the `pixie-qa` package in the user's environment. Detect the package manager from the project (check for `uv.lock`, `poetry.lock`, `requirements.txt`, or a plain `pip` environment) and run the appropriate upgrade command:
+**What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — mock these to control inputs and reduce flakiness.
 
-- **uv**: `uv add pixie-qa --upgrade` (or `uv sync --upgrade-package pixie-qa`)
-- **poetry**: `poetry add pixie-qa@latest`
-- **pip**: `pip install --upgrade pixie-qa`
+**The deliverable is a working `pixie test` run with real scores** — not a plan, not just instrumentation, not just a dataset.
 
-If the upgrade fails (e.g., no network, version conflict), log the error and continue — a failed upgrade must not block the rest of the skill.
+This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.
 
-**All pixie-generated files live in a single `pixie_qa` directory** at the project root:
+---
+
+## Before you start
+
+Run the following to keep the skill and package up to date. If any command fails or is blocked by the environment, continue — do not let failures here block the rest of the workflow.
+
+**Update the skill:**
+
+```bash
+npx skills update
+```
+
+**Upgrade the `pixie-qa` package:**
+
+Make sure the Python virtual environment is active and use the project's package manager:
+
+```bash
+# uv project (uv.lock exists):
+uv add pixie-qa --upgrade
+
+# poetry project (poetry.lock exists):
+poetry add pixie-qa@latest
+
+# pip / no lock file:
+pip install --upgrade pixie-qa
+```
+
+---
+
+## The workflow
+
+Follow Steps 1–5 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.
+
+**Two modes:**
+
+- **Setup** ("set up evals", "add tests", "set up QA"): Complete Steps 1–5. After the test run, report results and ask whether to iterate.
+- **Iteration** ("fix", "improve", "debug"): Complete Steps 1–5 if not already done, then do one round of Step 6.
+
+If the request is ambiguous, default to setup.
+
+---
+
+### Step 1: Understand the app and define eval criteria
+
+Read the source code to understand:
+
+1. **How it runs** — entry point, startup, config/env vars
+2. **The real entry point** — how a real user invokes the app (HTTP endpoint, CLI, function call). This is what the eval must exercise — not an inner function that bypasses the request pipeline.
+3. **The request pipeline** — trace the full path from entry point to response. What middleware, routing, state management, prompt assembly, retrieval, or formatting happens along the way? All of this is under test.
+4. **External dependencies (both directions)** — identify every external system the app talks to (databases, APIs, caches, queues, file systems, speech services). For each, understand:
+   - **Data flowing IN** (external → app): what data does the app read from this system? What shapes, types, realistic values? You'll make up this data for eval scenarios.
+   - **Data flowing OUT** (app → external): what does the app write, send, or mutate in this system? These are side-effects that evaluations may need to verify (e.g., "did the app create the right calendar entry?", "did it send the correct transfer request?").
+   - **How to mock it** — look for abstract base classes, protocols, or constructor-injected backends (e.g., `TranscriptionBackend`, `SynthesisBackend`, `StorageBackend`). These are testability seams — you'll create mock implementations of these interfaces. If there's no clean interface, you'll use `unittest.mock.patch` at the module boundary.
+5. **Use cases** — distinct scenarios, what good/bad output looks like
+
+Read `references/understanding-app.md` for detailed guidance on mapping data flows and the MEMORY.md template.
+
+Write your findings to `pixie_qa/MEMORY.md` before moving on. Include:
+
+- The entry point and the full request pipeline
+- Every external dependency, what it provides/receives, and how you'll mock it
+- The testability seams (pluggable interfaces, patchable module-level objects)
+
+Determine **high-level, application-specific eval criteria**:
+
+**Good criteria are specific to the app's purpose.** Examples:
+
+- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation (under 3 sentences)?", "Does the agent route to the correct department based on the caller's request?"
+- Research report generator: "Does the report address all sub-questions in the query?", "Are claims supported by the retrieved sources?", "Is the report structured with clear sections?"
+- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when the context doesn't contain the answer?"
+
+**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.
+
+At this stage, don't pick evaluator classes or thresholds. That comes later in Step 5, after you've seen the real data shape.
+
+Record the criteria in `pixie_qa/MEMORY.md` and continue.
+
+> **Checkpoint**: MEMORY.md written with app understanding + eval criteria. Proceed to Step 2.
+
+---
+
+### Step 2: Instrument and observe a real run
+
+**Why this step**: You need to see the actual data flowing through the app before you can build anything. This step serves two goals:
+
+1. **Learn the data shapes** — what data flows in from external dependencies, and what side-effects flow out? What types, structures, realistic values? You'll need to make up this data for eval scenarios later.
+2. **Verify instrumentation captures what evaluators need** — do the traces contain the data required to assess each eval criterion from Step 1? If a criterion is "does the agent route to the correct department," the trace must capture the routing decision.
+
+**This is a normal app run with instrumentation — no mocks, no patches.**
+
+#### 2a. Decide what to instrument
+
+This is a reasoning step, not a coding step. Look at your eval criteria from Step 1 and your understanding of the codebase, and determine what data the evaluators will need:
+
+- **For each eval criterion**, ask: what observable data would prove this criterion is met or violated?
+- **Map that data to code locations** — which functions produce, consume, or transform that data?
+- **Those functions need `@observe`** — so their inputs and outputs are captured in traces.
+
+Examples:
+
+| Eval criterion                             | Data needed                                        | What to instrument                                           |
+| ------------------------------------------ | -------------------------------------------------- | ------------------------------------------------------------ |
+| "Routes to correct department"             | The routing decision (which department was chosen) | The routing/dispatch function                                |
+| "Responses grounded in retrieved context"  | The retrieved documents + the final response       | The retrieval function AND the response function             |
+| "Verifies caller identity before transfer" | Whether identity check happened, transfer decision | The identity verification function AND the transfer function |
+| "Concise phone-friendly responses"         | The final response text                            | The function that produces the LLM response                  |
+
+**LLM provider calls (OpenAI, Anthropic, etc.) are auto-captured** — `enable_storage()` activates OpenInference instrumentors that automatically trace every LLM API call with full input messages, output messages, token usage, and model parameters. You do NOT need `@observe` on the function that calls `client.chat.completions.create()` just to see the LLM interaction.
+
+**Use `@observe` for application-level functions** whose inputs, outputs, or intermediate states your evaluators need but that aren't visible from the LLM call alone. Examples: the app's entry-point function (to capture what the user sent and what the app returned), retrieval functions (to capture what context was fetched), routing functions (to capture dispatch decisions).
+
+`enable_storage()` goes at application startup. Read `references/instrumentation.md` for the full rules, code patterns, and anti-patterns for adding instrumentation.
+
+#### 2b. Add instrumentation and run the app
+
+Add `@observe` to the functions you identified in 2a. Then run the app normally — with its real external dependencies, or by manually interacting with it — to produce a **reference trace**. Do NOT mock or patch anything. This is an observation run.
+
+If the app can't run without infrastructure you don't have (a real database, third-party service credentials, etc.), use the simplest possible approach to get it running: a local Docker container, a test account, or help from the user. The goal is one real trace. After the run, confirm that the trace was captured:
+
+```bash
+uv run pixie trace list
+uv run pixie trace last
+```
+
+#### 2c. Examine the reference trace
+
+Study the trace data carefully. This is your blueprint for everything that follows. Document:
+
+1. **Data from external dependencies (inbound)** — What did the app read from databases, APIs, caches? What are the shapes, types, and realistic value ranges? This is what you'll make up in eval_input for the dataset.
+2. **Side-effects (outbound)** — What did the app write to, send to, or mutate in external systems? These need to be captured by mocks and may be part of eval_output for verification.
+3. **Intermediate states** — What did the instrumentation capture beyond the final output? Tool calls, retrieved documents, routing decisions? Are these sufficient to evaluate every criterion from Step 1?
+4. **The eval_input / eval_output structure** — What does the `@observe`-decorated function receive as input and produce as output? Note the exact field names, types, and nesting.
+
+**Check instrumentation completeness**: For each eval criterion from Step 1, verify the trace contains the data needed to evaluate it. If not, add more `@observe` decorators and re-run.
+
+**Do not proceed until you understand the data shape and have confirmed the traces capture everything your evaluators need.**
+
+> **Checkpoint**: Instrumentation added based on eval criteria. Reference trace captured with real data. For each criterion, confirm the trace contains the data needed to evaluate it. Proceed to Step 3.
+
+---
+
+### Step 3: Write a utility function to run the full app end-to-end
+
+**Why this step**: You need a function that test cases can call. Given an eval_input (app input + mock data for external dependencies), it starts the real application with external dependencies patched, sends the input through the app's real entry point, and returns the eval_output (app response + captured side-effects).
+
+#### The contract
+
+```
+run_app(eval_input) → eval_output
+```
+
+- **eval_input** = application input (what the user sends) + data from external dependencies (what databases/APIs would return)
+- **eval_output** = application output (what the user sees) + captured side-effects (what the app wrote to external systems, captured by mocks) + captured intermediate states (tool calls, routing decisions, etc., captured by instrumentation)
+
+#### How to implement
+
+1. **Patch external dependencies** — use the mocking plan from Step 1 item 4. For each external dependency, either inject a mock implementation of its interface (cleanest) or `unittest.mock.patch` the module-level client. The mock returns data from eval_input and captures side-effects for eval_output.
+
+2. **Call the app through its real entry point** — the same way a real user or client would invoke it. Look at how the app is started: if it's a web server (FastAPI, Flask), use `TestClient` or HTTP requests. If it's a CLI, use subprocess. If it's a standalone function with no server or middleware, import and call it directly.
+
+3. **Collect the response** — the app's output becomes eval_output, along with any side-effects captured by mock objects.
+
+Read `references/run-harness-patterns.md` for concrete examples of entry point invocation for different app types.
+
+**Do NOT call an inner function** like `agent.respond()` directly just because it's simpler. The whole point is to test the app's real code path — request handling, state management, prompt assembly, routing. When you call an inner function directly, you skip all of that, and the test has to reimplement it. Now you're testing test code, not app code.
+
+#### Verify
+
+Take the eval_input from your Step 2 reference trace and feed it to the utility function. The outputs won't match word-for-word (non-deterministic), but verify:
+
+- **Same structure** — same fields present, same types, same nesting
+- **Same code path** — same routing decisions, same intermediate states captured
+- **Sensible values** — eval_output fields have real, meaningful data (not null, not empty, not error messages)
+
+**If it fails after two attempts**, stop and ask the user for help.
+
+> **Checkpoint**: Utility function implemented and verified. When fed the reference trace's eval_input, it produces eval_output with the same structure and exercises the same code path. Proceed to Step 4.
+
+---
+
+### Step 4: Build the dataset
+
+**Why this step**: The dataset is a collection of eval_input items (made up by you) that define the test scenarios. Each item may also carry case-specific expectations. The eval_output is NOT pre-populated in the dataset — it's produced at test time by the utility function from Step 3.
+
+#### 4a. Determine verification and expectations
+
+Before generating data, decide how each eval criterion from Step 1 will be checked.
+
+**Examine the reference trace from Step 2** and identify:
+
+- **Structural constraints** you can verify with code — JSON schema, required fields, value types, enum ranges, string length bounds. These become validation checks on your generated eval_inputs.
+- **Semantic constraints** that require judgment — "the mock customer profile should be realistic", "the conversation history should be topically coherent". Apply these yourself when crafting the data.
+- **Which criteria are universal vs. case-specific**:
+  - **Universal criteria** apply to ALL test cases the same way → implement in the test function (e.g., "responses must be under 3 sentences", "must not hallucinate information not in context")
+  - **Case-specific criteria** vary per test case → carry as `expected_output` in the dataset item (e.g., "should mention the caller's appointment on Tuesday", "should route to billing department")
+
+#### 4b. Generate eval_input items
+
+Create eval_input items that match the data shape from the reference trace:
+
+- **Application inputs** (user queries, requests) — make these up to cover the scenarios you identified in Step 1
+- **External dependency data** (database records, API responses, cache entries) — make these up in the exact shape you observed in the reference trace
+
+Each dataset item contains:
+
+- `eval_input`: the made-up input data (app input + external dependency data)
+- `expected_output`: case-specific expectation text (optional — only for test cases with expectations beyond the universal criteria). This is a reference for evaluation, not an exact expected answer.
+
+At test time, `eval_output` is produced by the utility function from Step 3 and is not stored in the dataset itself.
+Read `references/dataset-generation.md` for the dataset creation API, data shape matching, expected_output strategy, and validation checklist.
+
+#### 4c. Validate the dataset
+
+After building:
+
+1. **Execute `build_dataset.py`** — don't just write it, run it
+2. **Verify structural constraints** — each eval_input matches the reference trace's schema (same fields, same types)
+3. **Verify diversity** — items have meaningfully different inputs, not just minor variations
+4. **Verify case-specific expectations** — `expected_output` values are specific and testable, not vague
+5. For conversational apps, include items with conversation history
+
+> **Checkpoint**: Dataset created with diverse eval_inputs matching the reference trace's data shape. Proceed to Step 5.
+
+---
+
+### Step 5: Write and run eval tests
+
+**Why this step**: With the utility function built and the dataset ready, writing tests is straightforward — wire up the function, choose evaluators for each criterion, and run.
+
+#### 5a. Map criteria to evaluators
+
+For each eval criterion from Step 1, decide how to evaluate it:
+
+- **Can it be checked with a built-in evaluator?** (factual correctness → `FactualityEval`, exact match → `ExactMatchEval`, RAG faithfulness → `FaithfulnessEval`)
+- **Does it need a custom evaluator?** Most app-specific criteria do — use `create_llm_evaluator` with a prompt that operationalizes the criterion.
+- **Is it universal or case-specific?** Universal criteria go in the test function. Case-specific criteria use `expected_output` from the dataset.
+
+For open-ended LLM text, **never** use `ExactMatchEval` — LLM outputs are non-deterministic.
+
+`AnswerRelevancyEval` is **RAG-only**: it requires a `context` value in the trace and returns 0.0 without one. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt.
+
+Read `references/eval-tests.md` for the evaluator catalog, custom evaluator examples, and the test file boilerplate.
+
+#### 5b. Write the test file and run
+
+The test file wires together: a `runnable` (calls your utility function from Step 3), a reference to the dataset, and the evaluators you chose.
+
+Read `references/eval-tests.md` for the exact `assert_dataset_pass` API, required parameter names, and common mistakes to avoid. **Re-read the API reference immediately before writing test code** — do not rely on earlier context.
+
+Run with `pixie test` — not `pytest`:
+
+```bash
+uv run pixie test pixie_qa/tests/ -v
+```
+
+**After running, verify the scorecard:**
+
+1. Shows "N/M tests passed" with real numbers
+2. Does NOT say "No assert_pass / assert_dataset_pass calls recorded" (that message means an `await` is missing)
+3. Per-evaluator scores appear with real values
+
+A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.
+
+> **Checkpoint**: Tests run and produce real scores.
+>
+> - **Setup mode**: Report results ("QA setup is complete. Tests show N/M passing.") and ask: "Want me to investigate the failures and iterate?" Stop here unless the user says yes.
+> - **Iteration mode**: Proceed directly to Step 6.
+>
+> If the test errors out (import failures, missing keys), that's a setup bug — fix and re-run. But if tests produce real pass/fail scores, that's the deliverable.
+
+---
+
+### Step 6: Investigate and iterate
+
+**Iteration mode only, or after the user has confirmed in setup mode.**
+
+When tests fail, understand _why_ — don't just adjust thresholds until things pass.
+
+Read `references/investigation.md` for procedures and root-cause patterns.
+
+The cycle: investigate root cause → fix (prompt, code, or eval config) → rebuild dataset if needed → re-run tests → repeat.
+
+---
+
+## Quick reference
+
+### Imports
+
+```python
+from pixie import enable_storage, observe, assert_dataset_pass, ScoreThreshold, last_llm_call
+from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
+```
+
+Import only with `from pixie import ...`, never from subpackages (`pixie.storage`, `pixie.evals`, etc.). There is no `pixie.qa` module.
+
+### CLI commands
+
+```bash
+uv run pixie test pixie_qa/tests/ -v    # Run eval tests (NOT pytest)
+uv run pixie trace list                 # List captured traces
+uv run pixie trace last                 # Show most recent trace
+uv run pixie trace show <id> --verbose  # Show specific trace
+uv run pixie dataset create <name>      # Create a new dataset
+```
+
+### Directory layout
 
 ```
 pixie_qa/
-  MEMORY.md              # your understanding and eval plan
-  observations.db        # SQLite trace DB (auto-created by enable_storage)
-  datasets/              # golden datasets (JSON files)
-  tests/                 # eval test files (test_*.py)
-  scripts/               # helper scripts (run_harness.py, build_dataset.py, etc.)
+  MEMORY.md      # your understanding and eval plan
+  datasets/      # golden datasets (JSON)
+  tests/         # eval test files (test_*.py)
+  scripts/       # run_app.py, build_dataset.py
 ```
 
----
+All pixie files go here — not at the project root, not in a top-level `tests/` directory.
 
-## Setup vs. Iteration: when to stop
+### Key concepts
 
-**This is critical.** What you do depends on what the user asked for.
+- **eval_input** = application input + data from external dependencies
+- **eval_output** = application output + captured side-effects + captured intermediate states (produced at test time by the utility function, NOT pre-populated in the dataset)
+- **expected_output** = case-specific evaluation reference (optional per dataset item)
+- **test function** = utility function (produces eval_output) + evaluators (check criteria)
 
-### "Setup QA" / "set up evals" / "add tests" (setup intent)
+### Evaluator selection
 
-The user wants a **working eval pipeline**. Your job is Stages 0–7: install, understand, instrument, build a run harness, capture real traces, write tests, build dataset, run tests. **Stop after the first test run**, regardless of whether tests pass or fail. Report:
+| Output type                           | Evaluator                                             | Notes                                                            |
+| ------------------------------------- | ----------------------------------------------------- | ---------------------------------------------------------------- |
+| Open-ended text with reference answer | `FactualityEval`, `ClosedQAEval`                      | Best default for most apps                                       |
+| Open-ended text, no reference         | `AnswerRelevancyEval`                                 | **RAG only** — needs `context` in trace. Returns 0.0 without it. |
+| Deterministic output                  | `ExactMatchEval`, `JSONDiffEval`                      | Never use for open-ended LLM text                                |
+| RAG with retrieved context            | `FaithfulnessEval`, `ContextRelevancyEval`            | Requires context capture in instrumentation                      |
+| Domain-specific quality               | `create_llm_evaluator(name=..., prompt_template=...)` | Custom LLM-as-judge — use for app-specific criteria              |
 
-1. What you set up (instrumentation, run harness, test file, dataset)
-2. The test results (pass/fail, scores)
-3. If tests failed: a **brief summary** of what failed and likely causes — but do NOT fix anything
+### What goes where: SKILL.md vs references
 
-Then ask: _"QA setup is complete. Tests show N/M passing. Want me to investigate the failures and start iterating?"_
+**This file** (SKILL.md) is loaded for the entire session. It contains the _what_ and _why_ — the reasoning, decision-making process, goals, and checkpoints for each step.
 
-Only proceed to Stage 8 (investigation and fixes) if the user confirms.
+**Reference files** are loaded when executing a specific step. They contain the _how_ — tactical API usage, code patterns, anti-patterns, troubleshooting, and ready-to-adapt examples.
 
-**Exception**: If the test run itself errors out (import failures, missing API keys, configuration bugs) — those are **setup problems**, not eval failures. Fix them and re-run until you get a clean test execution where pass/fail reflects actual app quality, not broken plumbing.
+When in doubt: if it's about _deciding what to do_, it's in SKILL.md. If it's about _how to implement that decision_, it's in a reference file.
 
-### "Fix" / "improve" / "debug" / "why is X failing" (iteration intent)
+### Reference files
 
-The user wants you to investigate and fix. Proceed through all stages including Stage 8 — investigate failures, root-cause them, apply fixes, rebuild dataset, re-run tests, iterate.
-
-### Ambiguous requests
-
-If the intent is unclear, default to **setup only** and ask before iterating. It's better to stop early and ask than to make unwanted changes to the user's application code.
-
----
-
-## Hard gates: when to STOP and get the user involved
-
-Some blockers cannot be worked around. When you hit one, **stop working and tell the user what you need** — do not guess, fabricate data, or skip ahead to later stages.
-
-### Missing API keys or credentials
-
-If the app or evaluators need an API key (e.g. `OPENAI_API_KEY`) and it's not set in the environment or `.env`, tell the user exactly which key is missing and wait for them to provide it. Do not:
-
-- Proceed with running the app or evals (they will fail)
-- Hardcode a placeholder key
-- Skip to later stages hoping it won't matter
-
-### Cannot run the app from a script
-
-If after reading the code (Stage 1) you cannot figure out how to invoke the app's core LLM-calling function from a standalone script — because it requires a running server, a webhook trigger, complex authentication, or external infrastructure you can't mock — **stop and ask the user**:
-
-> "I've identified `<function_name>` in `<file>` as the core function to evaluate, but it requires `<dependency>` which I can't easily mock. Can you either (a) show me how to call this function standalone, or (b) run the app yourself with a few representative inputs so I can capture traces?"
-
-### App errors during run harness execution
-
-If the run harness script (Stage 4) errors out and you can't fix it after two attempts, stop and share the error with the user. Common blockers include database connections, missing configuration files, authentication/OAuth flows, and hardware-specific dependencies.
-
-### Why stopping matters
-
-Every subsequent stage depends on having real traces from the actual app. If you can't run the app, you can't capture traces. If you can't capture traces, you can't build a real dataset. If you fabricate a dataset, the entire eval pipeline is testing a fiction, not the user's app. It's better to stop early and get the user's help than to produce an eval pipeline that tests the wrong thing.
-
----
-
-## The eval boundary: what to evaluate
-
-**Eval-driven development focuses on LLM-dependent behaviour.** The purpose is to catch quality regressions in the parts of the system that are non-deterministic and hard to test with traditional unit tests — namely, LLM calls and the decisions they drive.
-
-### In scope (evaluate this)
-
-- LLM response quality: factual accuracy, relevance, format compliance, safety
-- Agent routing decisions: did the LLM choose the right tool/handoff/action?
-- Prompt effectiveness: does the prompt produce the desired behaviour?
-- Multi-turn coherence: does the agent maintain context across turns?
-
-### Out of scope (do NOT evaluate this with evals)
-
-- **Tool implementations** (database queries, API calls, keyword matching, business logic) — these are traditional software; test them with unit tests
-- **Infrastructure** (authentication, rate limiting, caching, serialization)
-- **Deterministic post-processing** (formatting, filtering, sorting results)
-
-The boundary is: everything **downstream** of the LLM call (tools, databases, APIs) produces deterministic outputs that serve as **inputs** to the LLM-powered system. Eval tests should treat those as given facts and focus on what the LLM does with them.
-
-**Example**: If an FAQ tool has a keyword-matching bug that returns wrong data, that's a traditional bug — fix it with a regular code change, not by adjusting eval thresholds. The eval tests exist to verify that _given correct tool outputs_, the LLM agent produces correct user-facing responses.
-
-When building datasets and expected outputs, **use the actual tool/system outputs as ground truth**. The expected output for an eval case should reflect what a correct LLM response looks like _given the tool results the system actually produces_.
-
----
-
-## Stage 0: Ensure pixie-qa is Installed and API Keys Are Set
-
-Before doing anything else, check that the `pixie-qa` package is available:
-
-```bash
-python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"
-```
-
-If it's not installed, install it:
-
-```bash
-pip install pixie-qa
-```
-
-This provides the `pixie` Python module, the `pixie` CLI, and the `pixie test` runner — all required for instrumentation and evals. Don't skip this step; everything else in this skill depends on it.
-
-### Verify API keys
-
-The application under test almost certainly needs an LLM provider API key (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`). LLM-as-judge evaluators like `FactualityEval` also need `OPENAI_API_KEY`. **Before running anything**, verify the key is set:
-
-```bash
-[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"
-```
-
-If the key is not set: check whether the project uses a `.env` file. If it does, note that `python-dotenv` only loads `.env` when the app explicitly calls `load_dotenv()` — shell commands and the `pixie` CLI will not see variables from `.env` unless they're exported. Tell the user which key is missing and how to set it. **Do not proceed** with running the app or evals without a confirmed API key — you'll get failures that waste time and look like app bugs.
-
----
-
-## Stage 1: Understand the Application
-
-Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would, and it puts you in a much better position to make good decisions about what and how to evaluate.
-
-### What to investigate
-
-1. **How the software runs**: What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
-
-2. **All inputs to the LLM**: This is not limited to the user's message. Trace every piece of data that gets incorporated into any LLM prompt:
-
-   - User input (queries, messages, uploaded files)
-   - System prompts (hardcoded or templated)
-   - Retrieved context (RAG chunks, search results, database records)
-   - Tool definitions and function schemas
-   - Conversation history / memory
-   - Configuration or feature flags that change prompt behavior
-
-3. **All intermediate steps and outputs**: Walk through the code path from input to final output and document each stage:
-
-   - Retrieval / search results
-   - Tool calls and their results
-   - Agent routing / handoff decisions
-   - Intermediate LLM calls (e.g., summarization before final answer)
-   - Post-processing or formatting steps
-
-4. **The final output**: What does the user see? What format is it in? What are the quality expectations?
-
-5. **Use cases and expected behaviors**: What are the distinct things the app is supposed to handle? For each use case, what does a "good" response look like? What would constitute a failure?
-
-### Identify the eval-boundary function
-
-This is the single most important decision you'll make, and getting it right determines whether the eval pipeline tests the real app or a fiction.
-
-The **eval-boundary function** is the function in the actual production code that:
-
-1. Takes structured input (text, dict, message list) — not raw HTTP requests, audio streams, or webhook payloads
-2. Calls the LLM (directly or through a chain of internal calls)
-3. Returns the LLM's response (or a processed version of it)
-
-Everything **upstream** of this function (webhook handlers, voice-to-text processing, request parsing, authentication, session management) will be mocked or bypassed when building the run harness. Everything **at and below** this function is the real code you're evaluating.
-
-**Example**: In a Twilio voice AI app:
-
-- Twilio sends a webhook with audio → **upstream, mock this**
-- Audio processing converts speech to text → **upstream, mock this**
-- Call state is loaded from Redis → **upstream, mock or simplify this**
-- `agent.respond(user_text, conversation_history)` calls the LLM → **eval-boundary function**
-- Response text is converted to speech → **downstream, not part of eval**
-
-**Example**: In a FastAPI RAG chatbot:
-
-- HTTP endpoint receives POST request → **upstream, bypass this**
-- Request validation and auth → **upstream, bypass this**
-- `chatbot.answer(question, context)` retrieves docs and calls LLM → **eval-boundary function**
-- Response is formatted as JSON → **downstream, not part of eval**
-
-**Example**: In a simple CLI Q&A tool:
-
-- `main()` reads user input from stdin → **upstream, bypass this**
-- `answer_question(question)` calls the LLM → **eval-boundary function**
-
-When identifying the eval-boundary function, record:
-
-- The exact function name and file location
-- Its signature (parameter names and types)
-- What upstream dependencies it needs (clients, config objects, state)
-- Which of those dependencies require real credentials vs. can be mocked
-
-If you cannot identify a clear eval-boundary function — if the LLM call is deeply entangled with infrastructure code that can't be separated — **stop and ask the user**. See "Hard gates" above.
-
-### Write MEMORY.md
-
-Write your findings down in `pixie_qa/MEMORY.md`. This is the primary working document for the eval effort. It should be human-readable and detailed enough that someone unfamiliar with the project can understand the application and the eval strategy.
-
-**CRITICAL: MEMORY.md documents your understanding of the existing application code. It must NOT contain references to pixie commands, instrumentation code you plan to add, or scripts/functions that don't exist yet.** Those belong in later sections, only after they've been implemented.
-
-The understanding section should include:
-
-```markdown
-# Eval Notes: <Project Name>
-
-## How the application works
-
-### Entry point and execution flow
-
-<Describe how to start/run the app, what happens step by step>
-
-### Inputs to LLM calls
-
-<For each LLM call in the codebase, document:>
-
-- Where it is in the code (file + function name)
-- What system prompt it uses (quote it or summarize)
-- What user/dynamic content feeds into it
-- What tools/functions are available to it
-
-### Intermediate processing
-
-<Describe any steps between input and output:>
-- Retrieval, routing, tool execution, etc.
-- Include code pointers (file:line) for each step
-
-### Final output
-
-<What the user sees, what format, what the quality bar should be>
-
-### Use cases
-
-<List each distinct scenario the app handles, with examples of good/bad outputs>
-
-### Eval-boundary function
-
-- **Function**: `<class.method or function_name>`
-- **Location**: `<file:line>`
-- **Signature**: `<parameters and return type>`
-- **Upstream dependencies to mock**: <list what needs mocking for standalone execution>
-- **Why this boundary**: <explain why this is the right function to evaluate>
-
-## Evaluation plan
-
-### What to evaluate and why
-
-<Quality dimensions: factual accuracy, relevance, format compliance, safety, etc.>
-
-### Evaluation granularity
-
-<Which function/span boundary captures one "test case"? Why that boundary?>
-
-### Evaluators and criteria
-
-<For each eval test, specify: evaluator, dataset, threshold, reasoning>
-
-### Data needed for evaluation
-
-<What data points need to be captured, with code pointers to where they live>
-```
-
-If something is genuinely unclear from the code, ask the user — but most questions answer themselves once you've read the code carefully.
-
----
-
-## Stage 2: Decide What to Evaluate
-
-Now that you understand the app, you can make thoughtful choices about what to measure:
-
-- **What quality dimension matters most?** Factual accuracy for QA apps, output format for structured extraction, relevance for RAG, safety for user-facing text.
-- **Which span to evaluate:** the whole pipeline (`root`) or just the LLM call (`last_llm_call`)? If you're debugging retrieval, you might evaluate at a different point than if you're checking final answer quality.
-- **Which evaluators fit:** see `references/pixie-api.md` → Evaluators. For factual QA: `FactualityEval`. For structured output: `ValidJSONEval` / `JSONDiffEval`. For RAG pipelines: `ContextRelevancyEval` / `FaithfulnessEval`.
-- **Pass criteria:** `ScoreThreshold(threshold=0.7, pct=0.8)` means 80% of cases must score ≥ 0.7. Think about what "good enough" looks like for this app.
-- **Expected outputs:** `FactualityEval` needs them. Format evaluators usually don't.
-
-Update `pixie_qa/MEMORY.md` with the plan before writing any code.
-
----
-
-## Stage 3: Instrument the Application
-
-Add pixie instrumentation to the **existing production code**. The goal is to capture the inputs and outputs of functions that are already part of the application's normal execution path. Instrumentation must be on the **real code path** — the same code that runs when the app is used in production — so that traces are captured both during eval runs and real usage.
-
-### Add `enable_storage()` at application startup
-
-Call `enable_storage()` once at the beginning of the application's startup code — inside `main()`, or at the top of a server's initialization. **Never at module level** (top of a file outside any function), because that causes storage setup to trigger on import.
-
-Good places:
-
-- Inside `if __name__ == "__main__":` blocks
-- In a FastAPI `lifespan` or `on_startup` handler
-- At the top of `main()` / `run()` functions
-- Inside the `runnable` function in test files
-
-```python
-# ✅ CORRECT — at application startup
-async def main():
-    enable_storage()
-    ...
-
-# ✅ CORRECT — in a runnable for tests
-def runnable(eval_input):
-    enable_storage()
-    my_function(**eval_input)
-
-# ❌ WRONG — at module level, runs on import
-from pixie import enable_storage
-enable_storage()  # this runs when any file imports this module!
-```
-
-### Wrap existing functions with `@observe` or `start_observation`
-
-**CRITICAL: Instrument the production code path. Never create separate functions or alternate code paths for testing.**
-
-The `@observe` decorator or `start_observation` context manager goes on the **existing function** that the app actually calls during normal operation. If the app's entry point is an interactive `main()` loop, instrument `main()` or the core function it calls per user turn — not a new helper function that duplicates logic.
-
-```python
-# ✅ CORRECT — decorating the existing production function
-from pixie import observe
-
-@observe(name="answer_question")
-def answer_question(question: str, context: str) -> str:  # existing function
-    ...  # existing code, unchanged
-```
-
-```python
-# ✅ CORRECT — context manager inside an existing function
-from pixie import start_observation
-
-async def main():  # existing function
-    ...
-    with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
-        result = await Runner.run(current_agent, input_items, context=context)
-        # ... existing response handling ...
-        obs.set_output(response_text)
-    ...
-```
-
-```python
-# ❌ WRONG — creating a new function that duplicates logic from main()
-@observe(name="run_for_eval")
-async def run_for_eval(user_messages: list[str]) -> str:
-    # This duplicates what main() does, creating a separate code path
-    # that diverges from production. Don't do this.
-    ...
-```
-
-```python
-# ❌ WRONG — calling the LLM directly instead of calling the app's function
-@observe(name="agent_answer_question")
-def answer_question(question: str) -> str:
-    # This bypasses the entire app and calls OpenAI directly.
-    # You're testing a script you just wrote, not the user's app.
-    response = client.responses.create(
-        model="gpt-4.1",
-        input=[{"role": "user", "content": question}],
-    )
-    return response.output_text
-```
-
-**Rules:**
-
-- **Never add new wrapper functions** to the application code for eval purposes.
-- **Never bypass the app by calling the LLM provider directly** — if you find yourself writing `client.responses.create(...)` or `openai.ChatCompletion.create(...)` in a test or run harness, you're not testing the app. Import and call the app's own function instead.
-- **Never change the function's interface** (arguments, return type, behavior).
-- **Never duplicate production logic** into a separate "testable" function.
-- The instrumentation is purely additive — if you removed all pixie imports and decorators, the app would work identically.
-- After instrumentation, call `flush()` at the end of runs to make sure all spans are written.
-- For interactive apps (CLI loops, chat interfaces), instrument the **per-turn processing** function — the one that takes user input and produces a response. The eval `runnable` should call this same function.
-
-**Important**: All pixie symbols are importable from the top-level `pixie` package. Never tell users to import from submodules (`pixie.instrumentation`, `pixie.evals`, `pixie.storage.evaluable`, etc.) — always use `from pixie import ...`.
-
----
-
-## Stage 4: Create a Run Harness and Verify Traces
-
-**This stage is a hard gate.** You cannot proceed to writing tests or building datasets until you have successfully run the app's real code through the run harness and confirmed that traces appear in the database.
-
-The run harness is a short script that calls the eval-boundary function you identified in Stage 1, bypassing external infrastructure that isn't relevant to LLM evaluation.
-
-### When the app is simple
-
-If the eval-boundary function is a straightforward call with no complex dependencies (e.g., `answer_question(question: str) -> str`), the harness can be minimal:
-
-```python
-# pixie_qa/scripts/run_harness.py
-from pixie import enable_storage, flush
-from myapp import answer_question
-
-enable_storage()
-result = answer_question("What is the capital of France?")
-print(f"Result: {result}")
-flush()
-```
-
-Run it, verify traces appear, and move on.
-
-### When the app has complex dependencies
-
-Most real-world apps need more setup. The eval-boundary function often requires configuration objects, database connections, API clients, or state objects to run. Your job is to mock or stub the **minimum** necessary to call the real production function.
-
-```python
-# pixie_qa/scripts/run_harness.py
-"""Exercises the actual app code through the eval-boundary function.
-
-Mocks upstream infrastructure (webhooks, voice processing, call state, etc.)
-and calls the real production function with representative text inputs.
-"""
-from pixie import enable_storage, flush
-
-# Load .env if the project uses one for API keys
-from dotenv import load_dotenv
-load_dotenv()
-
-# Import the ACTUAL production function — not a copy, not a re-implementation
-from myapp.agents.llm.openai import OpenAILLM
-
-
-def run_one_case(question: str) -> str:
-    """Call the actual production function with minimal mocked dependencies."""
-    enable_storage()
-
-    # Construct the minimum context the function needs.
-    # Use real API client (needs real key), mock everything else.
-    llm = OpenAILLM(...)
-
-    # Call the ACTUAL function — the same one production uses
-    result = llm.run_normal_ai_response(
-        prompt=question,
-        messages=[{"role": "user", "content": question}],
-    )
-
-    flush()
-    return result
-
-
-if __name__ == "__main__":
-    test_inputs = [
-        "What are your business hours?",
-        "I need to update my account information.",
-    ]
-    for q in test_inputs:
-        print(f"Q: {q}")
-        print(f"A: {run_one_case(q)}")
-        print("---")
-```
-
-**Critical rules for the run harness:**
-
-- **Call the real function.** The same function production uses. If you find yourself writing `client.responses.create(...)` or `openai.ChatCompletion.create(...)` in the harness instead of calling the app's own function, you are bypassing the app and testing something else entirely.
-- **Mock only upstream infrastructure.** Database connections, webhook payloads, session state, audio processing — these can be mocked or stubbed. The LLM call itself must be real because that's what you're evaluating.
-- **The LLM API key must be real.** If it's missing, stop and ask the user. See "Hard gates."
-- **Keep it minimal.** This is not a full integration test. It's a way to exercise the real LLM-calling code path and capture traces.
-- **If you can't create a working harness after two attempts**, stop and ask the user for help.
-
-### Verify traces are captured
-
-After running the harness, verify that traces were actually captured:
-
-```bash
-python pixie_qa/scripts/run_harness.py
-```
-
-Then check the database:
-
-```python
-import asyncio
-from pixie import ObservationStore
-
-async def check():
-    store = ObservationStore()
-    traces = await store.list_traces(limit=5)
-    print(f"Found {len(traces)} traces")
-    for t in traces:
-        print(t)
-
-asyncio.run(check())
-```
-
-**What to check:**
-
-- At least one trace appears in the database
-- The trace contains a span for the eval-boundary function (the span name should match the `@observe(name=...)` you added in Stage 3)
-- The span has captured `eval_input` and `eval_output` with sensible values
-
-**If no traces appear:**
-
-- Is `enable_storage()` being called before the instrumented function runs?
-- Is `flush()` being called after the function returns?
-- Is the `@observe` decorator on the correct function?
-- Is the function actually being executed (not just defined/imported)?
-
-**Do not proceed to Stage 5 until you have seen real traces from the actual app in the database.** If traces don't appear, debug the issue now or ask the user for help. This is a setup problem and must be resolved before anything else.
-
----
-
-## Stage 5: Write the Eval Test File
-
-Write the test file before building the dataset. This might seem backwards, but it forces you to decide what you're actually measuring before you start collecting data — otherwise the data collection has no direction.
-
-Create `pixie_qa/tests/test_<feature>.py`. The pattern is: a `runnable` adapter that calls the app's **existing production function**, plus an async test function that calls `assert_dataset_pass`:
-
-```python
-from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
-
-from myapp import answer_question
-
-
-def runnable(eval_input):
-    """Replays one dataset item through the app.
-
-    Calls the same function the production app uses.
-    enable_storage() here ensures traces are captured during eval runs.
-    """
-    enable_storage()
-    answer_question(**eval_input)
-
-
-async def test_factuality():
-    await assert_dataset_pass(
-        runnable=runnable,
-        dataset_name="<dataset-name>",
-        evaluators=[FactualityEval()],
-        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
-        from_trace=last_llm_call,
-    )
-```
-
-Note that `enable_storage()` belongs inside the `runnable`, not at module level in the test file — it needs to fire on each invocation so the trace is captured for that specific run.
-
-The `runnable` imports and calls **the same function that production uses** — the eval-boundary function you identified in Stage 1 and verified in Stage 4. If the `runnable` calls a different function than what the run harness calls, something is wrong.
-
-The test runner is `pixie test` (not `pytest`):
-
-```bash
-pixie test                           # run all test_*.py in current directory
-pixie test pixie_qa/tests/           # specify path
-pixie test -k factuality             # filter by name
-pixie test -v                        # verbose: shows per-case scores and reasoning
-```
-
-`pixie test` automatically finds the project root (the directory containing `pyproject.toml`, `setup.py`, or `setup.cfg`) and adds it to `sys.path` — just like pytest. No `sys.path` hacks are needed in test files.
-
----
-
-## Stage 6: Build the Dataset
-
-**Prerequisite**: You must have successfully run the app and verified traces in Stage 4. If you skipped Stage 4 or it failed, go back — do not proceed.
-
-Create the dataset, then populate it by **actually running the app** with representative inputs. Dataset items must contain real app outputs captured from actual execution.
-
-```bash
-pixie dataset create <dataset-name>
-pixie dataset list   # verify it exists
-```
-
-### Run the app and capture traces to the dataset
-
-The easiest approach is to extend the run harness from Stage 4 into a dataset builder. Since you already have a working script that calls the real app code and produces traces, adapt it to save results:
-
-```python
-# pixie_qa/scripts/build_dataset.py
-import asyncio
-from pixie import enable_storage, flush, DatasetStore, Evaluable
-
-from myapp import answer_question
-
-GOLDEN_CASES = [
-    ("What is the capital of France?", "Paris"),
-    ("What is the speed of light?", "299,792,458 meters per second"),
-]
-
-async def build_dataset():
-    enable_storage()
-    store = DatasetStore()
-    try:
-        store.create("qa-golden-set")
-    except FileExistsError:
-        pass
-
-    for question, expected in GOLDEN_CASES:
-        result = answer_question(question=question)
-        flush()
-
-        store.append("qa-golden-set", Evaluable(
-            eval_input={"question": question},
-            eval_output=result,
-            expected_output=expected,
-        ))
-
-asyncio.run(build_dataset())
-```
-
-Note that `eval_output=result` is the **actual return value from running the app** — not a string you typed in.
-
-Alternatively, use the CLI for per-case capture:
-
-```bash
-# Run the app (enable_storage() must be active)
-python -c "from myapp import main; main('What is the capital of France?')"
-
-# Save the root span to the dataset
-pixie dataset save <dataset-name>
-
-# Or specifically save the last LLM call:
-pixie dataset save <dataset-name> --select last_llm_call
-
-# Add context:
-pixie dataset save <dataset-name> --notes "basic geography question"
-
-# Attach expected output for evaluators like FactualityEval:
-echo '"Paris"' | pixie dataset save <dataset-name> --expected-output
-```
-
-### The cardinal sin of dataset building
-
-**Never fabricate `eval_output` values by hand.** If you type `"eval_output": "4"` into a dataset JSON file without the app actually producing that output, the dataset is testing a fiction. A fabricated dataset is worse than no dataset because it gives false confidence — the user thinks their app is being tested, but it isn't.
-
-If you catch yourself writing or editing `eval_output` values directly in a JSON file, stop. Go back to Stage 4, run the app, and capture real outputs.
-
-### Key rules for dataset building
-
-- **Every `eval_output` must come from a real execution** of the eval-boundary function. No exceptions.
-- **Include expected outputs** for comparison-based evaluators like `FactualityEval`. Expected outputs should reflect the **correct LLM response given what the tools/system actually return** — not an idealized answer predicated on fixing non-LLM bugs.
-- **Cover the range** of inputs you care about: normal cases, edge cases, things the app might plausibly get wrong.
-- When using `pixie dataset save`, the evaluable's `eval_metadata` will automatically include `trace_id` and `span_id` for later debugging.
-
----
-
-## Stage 7: Run the Tests
-
-```bash
-pixie test pixie_qa/tests/ -v
-```
-
-The `-v` flag shows per-case scores and reasoning, which makes it much easier to see what's passing and what isn't. Check that the pass rates look reasonable given your `ScoreThreshold`.
-
-**After this stage, if the user's intent was "setup" — STOP.** Report results and ask before proceeding. See "Setup vs. Iteration" above.
-
----
-
-## Stage 8: Investigate Failures
-
-**Only proceed here if the user asked for iteration/fixing, or explicitly confirmed after setup.**
-
-When tests fail, the goal is to understand _why_, not to adjust thresholds until things pass. Investigation must be thorough and documented — the user needs to see the actual data, your reasoning, and your conclusion.
-
-### Step 1: Get the detailed test output
-
-```bash
-pixie test pixie_qa/tests/ -v    # shows score and reasoning per case
-```
-
-Capture the full verbose output. For each failing case, note:
-
-- The `eval_input` (what was sent)
-- The `eval_output` (what the app produced)
-- The `expected_output` (what was expected, if applicable)
-- The evaluator score and reasoning
-
-### Step 2: Inspect the trace data
-
-For each failing case, look up the full trace to see what happened inside the app:
-
-```python
-from pixie import DatasetStore
-
-store = DatasetStore()
-ds = store.get("<dataset-name>")
-for i, item in enumerate(ds.items):
-    print(i, item.eval_metadata)   # trace_id is here
-```
-
-Then inspect the full span tree:
-
-```python
-import asyncio
-from pixie import ObservationStore
-
-async def inspect(trace_id: str):
-    store = ObservationStore()
-    roots = await store.get_trace(trace_id)
-    for root in roots:
-        print(root.to_text())   # full span tree: inputs, outputs, LLM messages
-
-asyncio.run(inspect("the-trace-id-here"))
-```
-
-### Step 3: Root-cause analysis
-
-Walk through the trace and identify exactly where the failure originates. Common patterns:
-
-**LLM-related failures (fix with prompt/model/eval changes):**
-
-| Symptom                                                | Likely cause                                                  |
-| ------------------------------------------------------ | ------------------------------------------------------------- |
-| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully |
-| Agent routes to wrong tool/handoff                     | Routing prompt or handoff descriptions are ambiguous          |
-| Output format is wrong                                 | Missing format instructions in prompt                         |
-| LLM hallucinated instead of using tool                 | Prompt doesn't enforce tool usage                             |
-
-**Non-LLM failures (fix with traditional code changes, out of eval scope):**
-
-| Symptom                                           | Likely cause                                            |
-| ------------------------------------------------- | ------------------------------------------------------- |
-| Tool returned wrong data                          | Bug in tool implementation — fix the tool, not the eval |
-| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code           |
-| Database returned stale/wrong records             | Data issue — fix independently                          |
-| API call failed with error                        | Infrastructure issue                                    |
-
-For non-LLM failures: note them in the investigation log and recommend the code fix, but **do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code**. The eval test should measure LLM quality assuming the rest of the system works correctly.
-
-### Step 4: Document findings in MEMORY.md
-
-**Every failure investigation must be documented in `pixie_qa/MEMORY.md`** in a structured format:
-
-```markdown
-### Investigation: <test_name> failure — <date>
-
-**Test**: `test_faq_factuality` in `pixie_qa/tests/test_customer_service.py`
-**Result**: 3/5 cases passed (60%), threshold was 80% ≥ 0.7
-
-#### Failing case 1: "What rows have extra legroom?"
-
-- **eval_input**: `{"user_message": "What rows have extra legroom?"}`
-- **eval_output**: "I'm sorry, I don't have the exact row numbers for extra legroom..."
-- **expected_output**: "rows 5-8 Economy Plus with extra legroom"
-- **Evaluator score**: 0.1 (FactualityEval)
-- **Evaluator reasoning**: "The output claims not to know the answer while the reference clearly states rows 5-8..."
-
-**Trace analysis**:
-Inspected trace `abc123`. The span tree shows:
-
-1. Triage Agent routed to FAQ Agent ✓
-2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓
-3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← **root cause**
-
-**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
-The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
-The question "What rows have extra legroom?" contains none of these keywords, so it
-falls through to the default "I don't know" response.
-
-**Classification**: Non-LLM failure — the keyword-matching tool is broken.
-The LLM agent correctly routed to the FAQ agent and used the tool; the tool
-itself returned wrong data.
-
-**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
-`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix,
-not an eval/prompt change.
-
-**Verification**: After fix, re-run:
-\`\`\`bash
-python pixie_qa/scripts/build_dataset.py # refresh dataset
-pixie test pixie_qa/tests/ -k faq -v # verify
-\`\`\`
-```
-
-### Step 5: Fix and re-run
-
-Make the targeted change, rebuild the dataset if needed, and re-run. Always finish by giving the user the exact commands to verify:
-
-```bash
-pixie test pixie_qa/tests/test_<feature>.py -v
-```
-
----
-
-## Memory Template
-
-```markdown
-# Eval Notes: <Project Name>
-
-## How the application works
-
-### Entry point and execution flow
-
-<How to start/run the app. Step-by-step flow from input to output.>
-
-### Inputs to LLM calls
-
-<For EACH LLM call, document: location in code, system prompt, dynamic content, available tools>
-
-### Intermediate processing
-
-<Steps between input and output: retrieval, routing, tool calls, etc. Code pointers for each.>
-
-### Final output
-
-<What the user sees. Format. Quality expectations.>
-
-### Use cases
-
-<Each scenario with examples of good/bad outputs:>
-
-1. <Use case 1>: <description>
-   - Input example: ...
-   - Good output: ...
-   - Bad output: ...
-
-### Eval-boundary function
-
-- **Function**: `<fully qualified name>`
-- **Location**: `<file:line>`
-- **Signature**: `<params and return type>`
-- **Upstream dependencies to mock**: <what needs mocking/stubbing>
-- **Why this boundary**: <rationale>
-
-## Evaluation plan
-
-### What to evaluate and why
-
-<Quality dimensions and rationale>
-
-### Evaluators and criteria
-
-| Test | Dataset | Evaluator | Criteria | Rationale |
-| ---- | ------- | --------- | -------- | --------- |
-| ...  | ...     | ...       | ...      | ...       |
-
-### Data needed for evaluation
-
-<What data to capture, with code pointers>
-
-## Datasets
-
-| Dataset | Items | Purpose |
-| ------- | ----- | ------- |
-| ...     | ...   | ...     |
-
-## Investigation log
-
-### <date> — <test_name> failure
-
-<Full structured investigation as described in Stage 8>
-```
-
----
-
-## Reference
-
-See `references/pixie-api.md` for all CLI commands, evaluator signatures, and the Python dataset/store API.
+| Reference                            | When to read                                                                       |
+| ------------------------------------ | ---------------------------------------------------------------------------------- |
+| `references/understanding-app.md`    | Step 1 — investigating the codebase, MEMORY.md template                            |
+| `references/instrumentation.md`      | Step 2 — `@observe` and `enable_storage` rules, code patterns, anti-patterns       |
+| `references/run-harness-patterns.md` | Step 3 — examples of how to invoke different app types (web server, CLI, function) |
+| `references/dataset-generation.md`   | Step 4 — crafting eval_input items, expected_output strategy, validation           |
+| `references/eval-tests.md`           | Step 5 — evaluator selection, test file pattern, assert_dataset_pass API           |
+| `references/investigation.md`        | Step 6 — failure analysis, root-cause patterns                                     |
+| `references/pixie-api.md`            | Any step — full CLI and Python API reference                                       |
diff --git a/skills/eval-driven-dev/references/dataset-generation.md b/skills/eval-driven-dev/references/dataset-generation.md
new file mode 100644
index 00000000..cbdfebad
--- /dev/null
+++ b/skills/eval-driven-dev/references/dataset-generation.md
@@ -0,0 +1,235 @@
+# Dataset Generation
+
+This reference covers Step 4 of the eval-driven-dev process: creating the eval dataset.
+
+For full `DatasetStore`, `Evaluable`, and CLI command signatures, see `references/pixie-api.md` (Dataset Python API and CLI Commands sections).
+
+---
+
+## What a dataset contains
+
+A dataset is a collection of `Evaluable` items. Each item has:
+
+- **`eval_input`**: Made-up application input + data from external dependencies. This is what the utility function from Step 3 feeds into the app at test time.
+- **`expected_output`**: Case-specific evaluation reference (optional). The meaning depends on the evaluator — it could be an exact answer, a factual reference, or quality criteria text.
+- **`eval_output`**: **NOT stored in the dataset.** Produced at test time when the utility function replays the eval_input through the real app.
+
+You make up the dataset yourself, based on the data shapes observed in the reference trace from Step 2. You are NOT extracting data from traces; you are crafting realistic test scenarios.
+
+---
+
+## Creating the dataset
+
+### CLI
+
+```bash
+pixie dataset create <dataset-name>
+pixie dataset list   # verify it exists
+```
+
+### Python API
+
+```python
+from pixie import DatasetStore, Evaluable
+
+store = DatasetStore()
+store.create("qa-golden-set", items=[
+    Evaluable(
+        eval_input={"user_message": "What are your hours?", "customer_profile": {"name": "Alice", "tier": "gold"}},
+        expected_output="Response should mention Monday-Friday 9am-5pm and Saturday 10am-2pm",
+    ),
+    Evaluable(
+        eval_input={"user_message": "I need to cancel my order", "customer_profile": {"name": "Bob", "tier": "basic"}},
+        expected_output="Should confirm which order and explain the cancellation policy",
+    ),
+])
+```
+
+Or build incrementally:
+
+```python
+store = DatasetStore()
+store.create("qa-golden-set")
+for item in items:
+    store.append("qa-golden-set", item)
+```
+
+---
+
+## Crafting eval_input items
+
+Each eval_input must match the **exact data shape** from the reference trace. Look at what the `@observe`-decorated function received as input in Step 2 — same field names, same types, same nesting.
+
+### What goes into eval_input
+
+| Data category            | Example                                           | Source                                              |
+| ------------------------ | ------------------------------------------------- | --------------------------------------------------- |
+| Application input        | User message, query, request body                 | What a real user would send                         |
+| External dependency data | Customer profile, retrieved documents, DB records | Made up to match the shape from the reference trace |
+| Conversation history     | Previous messages in a chat                       | Made up to set up the scenario                      |
+| Configuration / context  | Feature flags, session state                      | Whatever the function expects as arguments          |
+
+### Matching the reference trace shape
+
+From the reference trace (`pixie trace last`), note:
+
+1. **Field names** — use the exact same keys (e.g., `user_message` not `message`, `customer_profile` not `profile`)
+2. **Types** — if the trace shows a list, use a list; if it shows a nested dict, use a nested dict
+3. **Realistic values** — the data should look like something the app would actually receive. Don't use placeholder text like "test input" or "lorem ipsum"
+
+**Example**: If the reference trace shows the function received:
+
+```json
+{
+  "user_message": "I'd like to reschedule my appointment",
+  "customer_profile": {
+    "name": "Jane Smith",
+    "account_id": "A12345",
+    "tier": "premium"
+  },
+  "conversation_history": [
+    { "role": "assistant", "content": "Welcome! How can I help you today?" }
+  ]
+}
+```
+
+Then every eval_input you make up must have `user_message` (string), `customer_profile` (dict with `name`, `account_id`, `tier`), and `conversation_history` (list of message dicts).
+
+---
+
+## Setting expected_output
+
+`expected_output` is a **reference for evaluation** — its meaning depends on which evaluator will consume it.
+
+### When to set it
+
+| Scenario                                    | expected_output value                                                                  | Evaluator it pairs with                                    |
+| ------------------------------------------- | -------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
+| Deterministic answer exists                 | The exact answer: `"Paris"`                                                            | `ExactMatchEval`, `FactualityEval`, `ClosedQAEval`         |
+| Open-ended but has quality criteria         | Description of good output: `"Should mention Saturday hours and be under 2 sentences"` | `create_llm_evaluator` with `{expected_output}` in prompt  |
+| Truly open-ended, no case-specific criteria | Leave as `"UNSET"` or omit                                                             | Standalone evaluators (`PossibleEval`, `FaithfulnessEval`) |
+
+### Universal vs. case-specific criteria
+
+- **Universal criteria** apply to ALL test cases → implement in the test function's evaluators (e.g., "responses must be concise", "must not hallucinate"). These don't need expected_output.
+- **Case-specific criteria** vary per test case → carry as `expected_output` in the dataset item (e.g., "should mention the caller's Tuesday appointment", "should route to billing").
+
+### Anti-patterns
+
+- **Don't generate both eval_output and expected_output from the same source.** If they're identical and you use `ExactMatchEval`, the test is circular and catches zero regressions.
+- **Don't use comparison evaluators (`FactualityEval`, `ClosedQAEval`, `ExactMatchEval`) on items without expected_output.** They produce meaningless scores.
+- **Don't mix expected_output semantics in one dataset.** If some items use expected_output as a factual answer and others as style guidance, evaluators can't handle both. Split into separate datasets or use separate test functions.
+
+---
+
+## Validating the dataset
+
+After creating the dataset, check:
+
+### 1. Structural validation
+
+Every eval_input must match the reference trace's schema:
+
+- Same fields present
+- Same types (string, int, list, dict)
+- Same nesting depth
+- No extra or missing fields compared to what the function expects
+
+### 2. Semantic validation
+
+- **Realistic values** — names, messages, and data look like real-world inputs, not test placeholders
+- **Coherent scenarios** — if there's conversation history, it should make topical sense with the user message
+- **External dependency data makes sense** — customer profiles have realistic account IDs, retrieved documents are plausible
+
+### 3. Diversity validation
+
+- Items have **meaningfully different** inputs — different user intents, different customer types, different edge cases
+- Not just minor variations of the same scenario (e.g., don't have 5 items that are all "What are your hours?" with different names)
+- Cover: normal cases, edge cases, things the app might plausibly get wrong
+
+### 4. Expected_output validation
+
+- Case-specific `expected_output` values are specific and testable, not vague
+- Items covered only by universal criteria don't redundantly carry `expected_output`
+
+### 5. Verify by listing
+
+```bash
+pixie dataset list
+```
+
+Or in the build script:
+
+```python
+ds = store.get("qa-golden-set")
+print(f"Dataset has {len(ds.items)} items")
+for i, item in enumerate(ds.items):
+    print(f"  [{i}] input keys: {list(item.eval_input.keys()) if isinstance(item.eval_input, dict) else type(item.eval_input)}")
+    print(f"       expected_output: {item.expected_output[:80] if item.expected_output != 'UNSET' else 'UNSET'}...")
+```
+
+---
+
+## Recommended build_dataset.py structure
+
+Put the build script at `pixie_qa/scripts/build_dataset.py`:
+
+```python
+"""Build the eval dataset with made-up scenarios.
+
+Each eval_input matches the data shape from the reference trace (Step 2).
+Run this script to create/recreate the dataset.
+"""
+from pixie import DatasetStore, Evaluable
+
+DATASET_NAME = "qa-golden-set"
+
+def build() -> None:
+    store = DatasetStore()
+
+    # Recreate fresh
+    try:
+        store.delete(DATASET_NAME)
+    except FileNotFoundError:
+        pass
+    store.create(DATASET_NAME)
+
+    items = [
+        # Normal case — straightforward question
+        Evaluable(
+            eval_input={
+                "user_message": "What are your business hours?",
+                "customer_profile": {"name": "Alice Johnson", "account_id": "C100", "tier": "gold"},
+            },
+            expected_output="Should mention Mon-Fri 9am-5pm and Sat 10am-2pm",
+        ),
+        # Edge case — ambiguous request
+        Evaluable(
+            eval_input={
+                "user_message": "I want to change something",
+                "customer_profile": {"name": "Bob Smith", "account_id": "C200", "tier": "basic"},
+            },
+            expected_output="Should ask for clarification about what to change",
+        ),
+        # ... more items covering different scenarios
+    ]
+
+    for item in items:
+        store.append(DATASET_NAME, item)
+
+    # Verify
+    ds = store.get(DATASET_NAME)
+    print(f"Dataset '{DATASET_NAME}' has {len(ds.items)} items")
+    for i, entry in enumerate(ds.items):
+        keys = list(entry.eval_input.keys()) if isinstance(entry.eval_input, dict) else type(entry.eval_input)
+        print(f"  [{i}] input keys: {keys}")
+
+if __name__ == "__main__":
+    build()
+```
+
+---
+
+## The cardinal rule
+
+**`eval_output` is always produced at test time, never stored in the dataset.** The dataset contains `eval_input` (made-up input matching the reference trace shape) and optionally `expected_output` (the reference to judge against). The test's `runnable` function produces `eval_output` by replaying `eval_input` through the real app.
diff --git a/skills/eval-driven-dev/references/eval-tests.md b/skills/eval-driven-dev/references/eval-tests.md
new file mode 100644
index 00000000..dcf046e3
--- /dev/null
+++ b/skills/eval-driven-dev/references/eval-tests.md
@@ -0,0 +1,241 @@
+# Eval Tests: Evaluator Selection and Test Writing
+
+This reference covers Step 5 of the eval-driven-dev process: choosing evaluators, writing the test file, and running `pixie test`.
+
+**Before writing any test code, re-read `references/pixie-api.md`** (Eval Runner API and Evaluator catalog sections) for exact parameter names and current evaluator signatures — these change when the package is updated.
+
+---
+
+## Evaluator selection
+
+Choose evaluators based on the **output type** and your eval criteria from Step 1, not the app type.
+
+### Decision table
+
+| Output type                                                 | Evaluator category                                                      | Examples                                  |
+| ----------------------------------------------------------- | ----------------------------------------------------------------------- | ----------------------------------------- |
+| Deterministic (classification labels, yes/no, fixed-format) | Heuristic: `ExactMatchEval`, `JSONDiffEval`, `ValidJSONEval`            | Label classification, JSON extraction     |
+| Open-ended text with a reference answer                     | LLM-as-judge: `FactualityEval`, `ClosedQAEval`, `AnswerCorrectnessEval` | Chatbot responses, QA, summaries          |
+| Text with expected context/grounding                        | RAG evaluators: `FaithfulnessEval`, `ContextRelevancyEval`              | RAG pipelines, context-grounded responses |
+| Text with style/format requirements                         | Custom LLM-as-judge via `create_llm_evaluator`                          | Voice-friendly responses, tone checks     |
+| Multi-aspect quality                                        | Multiple evaluators combined                                            | Factuality + relevance + tone             |
+
+### Critical rules
+
+- For open-ended LLM text, **never** use `ExactMatchEval`. LLM outputs are non-deterministic — exact match will either always fail or always pass (if comparing against the same output). Use LLM-as-judge evaluators instead.
+- `AnswerRelevancyEval` is **RAG-only** — it requires a `context` value in the trace. Returns 0.0 without it. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt.
+- Do NOT use comparison evaluators (`FactualityEval`, `ClosedQAEval`, `ExactMatchEval`) on items without `expected_output` — they produce meaningless scores.
+
+### When `expected_output` IS available
+
+Use comparison-based evaluators:
+
+| Evaluator               | Use when                                                   |
+| ----------------------- | ---------------------------------------------------------- |
+| `FactualityEval`        | Output is factually correct compared to reference          |
+| `ClosedQAEval`          | Output matches the expected answer                         |
+| `ExactMatchEval`        | Exact string match (structured/deterministic outputs only) |
+| `AnswerCorrectnessEval` | Answer is correct vs reference                             |
+
+### When `expected_output` is NOT available
+
+Use standalone evaluators that judge quality without a reference:
+
+| Evaluator              | Use when                              | Note                                                             |
+| ---------------------- | ------------------------------------- | ---------------------------------------------------------------- |
+| `FaithfulnessEval`     | Response faithful to provided context | RAG pipelines                                                    |
+| `ContextRelevancyEval` | Retrieved context relevant to query   | RAG pipelines                                                    |
+| `AnswerRelevancyEval`  | Answer addresses the question         | **RAG only** — needs `context` in trace. Returns 0.0 without it. |
+| `PossibleEval`         | Output is plausible / feasible        | General purpose                                                  |
+| `ModerationEval`       | Output is safe and appropriate        | Content safety                                                   |
+| `SecurityEval`         | No security vulnerabilities           | Security check                                                   |
+
+For non-RAG apps needing response relevance, write a `create_llm_evaluator` instead.
+
+---
+
+## Custom evaluators
+
+### `create_llm_evaluator` factory
+
+Use when the quality dimension is domain-specific and no built-in evaluator fits:
+
+```python
+from pixie import create_llm_evaluator
+
+concise_voice_style = create_llm_evaluator(
+    name="ConciseVoiceStyle",
+    prompt_template="""
+    You are evaluating whether this response is concise and phone-friendly.
+
+    Input: {eval_input}
+    Response: {eval_output}
+
+    Score 1.0 if the response is concise (under 3 sentences), directly addresses
+    the question, and uses conversational language suitable for a phone call.
+    Score 0.0 if it's verbose, off-topic, or uses written-style formatting.
+    """,
+)
+```
+
+**How template variables work**: `{eval_input}`, `{eval_output}`, `{expected_output}` are the only placeholders. Each is replaced with a string representation of the corresponding `Evaluable` field — if the field is a dict or list, it becomes a JSON string. The LLM judge sees the full serialized value.
+
+**Rules**:
+
+- **Only `{eval_input}`, `{eval_output}`, `{expected_output}`** — no nested access like `{eval_input[key]}` (this will crash with a `TypeError`)
+- **Keep templates short and direct** — the system prompt already tells the LLM to return `Score: X.X`. Your template just needs to present the data and define the scoring criteria.
+- **Don't instruct the LLM to "parse" or "extract" data** — just present the values and state the criteria. The LLM can read JSON naturally.
+
+**Non-RAG response relevance** (instead of `AnswerRelevancyEval`):
+
+```python
+response_relevance = create_llm_evaluator(
+    name="ResponseRelevance",
+    prompt_template="""
+    You are evaluating whether a customer support response is relevant and helpful.
+
+    Input: {eval_input}
+    Response: {eval_output}
+    Expected: {expected_output}
+
+    Score 1.0 if the response directly addresses the question and meets expectations.
+    Score 0.5 if partially relevant but misses important aspects.
+    Score 0.0 if off-topic, ignores the question, or contradicts expectations.
+    """,
+)
+```
+
+### Manual custom evaluator
+
+```python
+from pixie import Evaluation, Evaluable
+
+async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
+    # evaluable.eval_input  — what was passed to the observed function
+    # evaluable.eval_output — what the function returned
+    # evaluable.expected_output — reference answer (UNSET if not provided)
+    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
+    return Evaluation(score=score, reasoning="...")
+```
+
+---
+
+## Writing the test file
+
+Create `pixie_qa/tests/test_<feature>.py`. The pattern: a `runnable` adapter that calls the app's production function, plus `async` test functions that `await` `assert_dataset_pass`.
+
+**Before writing any test code, re-read the `assert_dataset_pass` API reference below.** The exact parameter names matter: using `dataset_path=` instead of `dataset_name=`, or omitting `await`, will cause failures that are hard to debug. Do not rely on memory from earlier in the conversation.
+
+### Test file template
+
+```python
+from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
+
+from myapp import answer_question
+
+
+def runnable(eval_input):
+    """Replays one dataset item through the app.
+
+    Calls the same function the production app uses.
+    enable_storage() here ensures traces are captured during eval runs.
+    """
+    enable_storage()
+    answer_question(**eval_input)
+
+
+async def test_answer_quality():
+    await assert_dataset_pass(
+        runnable=runnable,
+        dataset_name="qa-golden-set",
+        evaluators=[FactualityEval()],
+        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
+        from_trace=last_llm_call,
+    )
+```
+
+### `assert_dataset_pass` API — exact parameter names
+
+```python
+await assert_dataset_pass(
+    runnable=runnable,              # callable that takes eval_input dict
+    dataset_name="my-dataset",      # NOT dataset_path — name of dataset created in Step 4
+    evaluators=[...],               # list of evaluator instances
+    pass_criteria=ScoreThreshold(   # NOT thresholds — ScoreThreshold object
+        threshold=0.7,              # minimum score to count as passing
+        pct=0.8,                    # fraction of items that must pass
+    ),
+    from_trace=last_llm_call,       # which span to extract eval data from
+)
+```
+
+### Common mistakes that break tests
+
+| Mistake                  | Symptom                                                             | Fix                                           |
+| ------------------------ | ------------------------------------------------------------------- | --------------------------------------------- |
+| `def test_...():` (sync) | RuntimeWarning "coroutine was never awaited", test passes vacuously | Use `async def test_...():`                   |
+| No `await`               | Same: "coroutine was never awaited"                                 | Add `await` before `assert_dataset_pass(...)` |
+| `dataset_path="..."`     | TypeError: unexpected keyword argument                              | Use `dataset_name="..."`                      |
+| `thresholds={...}`       | TypeError: unexpected keyword argument                              | Use `pass_criteria=ScoreThreshold(...)`       |
+| Omitting `from_trace`    | Evaluator may not find the right span                               | Add `from_trace=last_llm_call`                |
+
+**If `pixie test` shows "No assert_pass / assert_dataset_pass calls recorded"**, the test passed vacuously because `assert_dataset_pass` was never awaited. Fix it immediately: declare the test with `async def` and `await` the call.
+
+### Multiple test functions
+
+Split into separate test functions when you have different evaluator sets:
+
+```python
+async def test_factual_answers():
+    """Test items that have deterministic expected outputs."""
+    await assert_dataset_pass(
+        runnable=runnable,
+        dataset_name="qa-deterministic",
+        evaluators=[FactualityEval()],
+        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
+        from_trace=last_llm_call,
+    )
+
+async def test_response_style():
+    """Test open-ended quality criteria."""
+    await assert_dataset_pass(
+        runnable=runnable,
+        dataset_name="qa-open-ended",
+        evaluators=[concise_voice_style],
+        pass_criteria=ScoreThreshold(threshold=0.6, pct=0.8),
+        from_trace=last_llm_call,
+    )
+```
+
+### Key points
+
+- `enable_storage()` belongs inside the `runnable`, not at module level — it needs to fire on each invocation so the trace is captured for that specific run.
+- The `runnable` imports and calls the **same function** that production uses — the app's entry point, going through the utility function from Step 3.
+- If the `runnable` calls a different function than what the utility function calls, something is wrong.
+- The `eval_input` dict should contain **only the semantic arguments** the function needs (e.g., `question`, `messages`, `context`). The `@observe` decorator automatically strips `self` and `cls`.
+- **Choose evaluators that match your data.** If dataset items have `expected_output`, use comparison evaluators. If not, use standalone evaluators.
+
+---
+
+## Running tests
+
+The test runner is `pixie test` (not `pytest`):
+
+```bash
+uv run pixie test                           # run all test_*.py in current directory
+uv run pixie test pixie_qa/tests/           # specify path
+uv run pixie test -k factuality             # filter by name
+uv run pixie test -v                        # verbose: shows per-case scores and reasoning
+```
+
+`pixie test` automatically loads the `.env` file before running tests, so API keys do not need to be exported in the shell. No `sys.path` hacks are needed in test files.
+
+The `-v` flag is important: it shows per-case scores and evaluator reasoning, which makes it much easier to see what's passing and what isn't.
+
+### After running, verify the scorecard
+
+1. Shows "N/M tests passed" with real numbers
+2. Does NOT say "No assert_pass / assert_dataset_pass calls recorded" (that means missing `await`)
+3. Per-evaluator scores appear with real values
+
+A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.
diff --git a/skills/eval-driven-dev/references/instrumentation.md b/skills/eval-driven-dev/references/instrumentation.md
new file mode 100644
index 00000000..9f8deef0
--- /dev/null
+++ b/skills/eval-driven-dev/references/instrumentation.md
@@ -0,0 +1,174 @@
+# Instrumentation
+
+This reference covers the tactical implementation of instrumentation in Step 2: how to use `@observe`, `enable_storage()`, and `start_observation` correctly.
+
+For full API signatures and all available parameters, see `references/pixie-api.md` (Instrumentation API section).
+
+For guidance on **what** to instrument (which functions, based on your eval criteria), see Step 2a in the main skill instructions.
+
+---
+
+## Adding `enable_storage()` at application startup
+
+Call `enable_storage()` once at the beginning of the application's startup code — inside `main()`, or at the top of a server's initialization. **Never at module level** (top of a file outside any function), because that causes storage setup to trigger on import.
+
+Good places:
+
+- Inside `if __name__ == "__main__":` blocks
+- In a FastAPI `lifespan` or `on_startup` handler
+- At the top of `main()` / `run()` functions
+- Inside the `runnable` function in test files
+
+```python
+# ✅ CORRECT — at application startup
+async def main():
+    enable_storage()
+    ...
+
+# ✅ CORRECT — in a runnable for tests
+def runnable(eval_input):
+    enable_storage()
+    my_function(**eval_input)
+
+# ❌ WRONG — at module level, runs on import
+from pixie import enable_storage
+enable_storage()  # this runs when any file imports this module!
+```
+
+---
+
+## Wrapping functions with `@observe` or `start_observation`
+
+Instrument the **existing function** that the app actually calls during normal operation. The `@observe` decorator or `start_observation` context manager goes on the production code path — not on new helper functions created for testing.
+
+```python
+# ✅ CORRECT — decorating the existing production function
+from pixie import observe
+
+@observe(name="answer_question")
+def answer_question(question: str, context: str) -> str:  # existing function
+    ...  # existing code, unchanged
+```
+
+```python
+# ✅ CORRECT — decorating a class method (works exactly the same way)
+from pixie import observe
+
+class OpenAIAgent:
+    def __init__(self, model: str = "gpt-4o-mini"):
+        self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
+        self.model = model
+
+    @observe(name="openai_agent_respond")
+    def respond(self, user_message: str, conversation_history: list | None = None) -> str:
+        # existing code, unchanged — @observe handles `self` automatically
+        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+        if conversation_history:
+            messages.extend(conversation_history)
+        messages.append({"role": "user", "content": user_message})
+        response = self.client.chat.completions.create(model=self.model, messages=messages)
+        return response.choices[0].message.content or ""
+```
+
+**`@observe` handles `self` and `cls` automatically** — it strips them from the captured input so only the meaningful arguments appear in traces. Do NOT create wrapper methods or call unbound methods to work around this. Just decorate the existing method directly.
+
+```python
+# ✅ CORRECT — context manager inside an existing function
+from pixie import start_observation
+
+async def main():  # existing function
+    ...
+    with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
+        result = await Runner.run(current_agent, input_items, context=context)
+        # ... existing response handling ...
+        obs.set_output(response_text)
+    ...
+```
+
+---
+
+## Anti-patterns to avoid
+
+### Creating new wrapper functions
+
+```python
+# ❌ WRONG — creating a new function that duplicates logic from main()
+@observe(name="run_for_eval")
+async def run_for_eval(user_messages: list[str]) -> str:
+    # This duplicates what main() does, creating a separate code path
+    # that diverges from production. Don't do this.
+    ...
+```
+
+### Creating wrapper methods instead of decorating the existing method
+
+```python
+# ❌ WRONG — creating a new _respond_observed wrapper method
+class OpenAIAgent:
+    def respond(self, user_message, conversation_history=None):
+        result = self._respond_observed({
+            'user_message': user_message,
+            'conversation_history': conversation_history,
+        })
+        return result['result']
+
+    @observe
+    def _respond_observed(self, args):
+        # WRONG: creates a separate code path, changes the interface,
+        # and breaks when called as an unbound method.
+        ...
+
+# ✅ CORRECT — just decorate the existing method directly
+class OpenAIAgent:
+    @observe(name="openai_agent_respond")
+    def respond(self, user_message, conversation_history=None):
+        ...  # existing code, unchanged
+```
+
+### Bypassing the app by calling the LLM directly
+
+```python
+# ❌ WRONG — calling the LLM directly instead of calling the app's function
+@observe(name="agent_answer_question")
+def answer_question(question: str) -> str:
+    # This bypasses the entire app and calls OpenAI directly.
+    # You're testing a script you just wrote, not the user's app.
+    response = client.responses.create(
+        model="gpt-4.1",
+        input=[{"role": "user", "content": question}],
+    )
+    return response.output_text
+```
+
+---
+
+## Rules
+
+- **Never add new wrapper functions** to the application code for eval purposes.
+- **Never bypass the app by calling the LLM provider directly** — if you find yourself writing `client.responses.create(...)` or `openai.ChatCompletion.create(...)` in a test or utility function, you're not testing the app. Import and call the app's own function instead.
+- **Never change the function's interface** (arguments, return type, behavior).
+- **Never duplicate production logic** into a separate "testable" function.
+- The instrumentation is purely additive — if you removed all pixie imports and decorators, the app would work identically.
+- After instrumentation, call `flush()` at the end of runs to make sure all spans are written.
+- For interactive apps (CLI loops, chat interfaces), instrument the **per-turn processing** function — the one that takes user input and produces a response. The eval `runnable` should call this same function.
+
+**Import rule**: All pixie symbols are importable from the top-level `pixie` package. Never import from submodules (`pixie.instrumentation`, `pixie.evals`, `pixie.storage.evaluable`, etc.) — always use `from pixie import ...`.
+
+---
+
+## What to instrument based on eval criteria
+
+**LLM provider calls are auto-captured.** When you call `enable_storage()`, pixie activates OpenInference instrumentors that automatically trace every LLM API call (OpenAI, Anthropic, Google, etc.) with full input/output messages, token usage, and model parameters. You do NOT need `@observe` on a function just because it contains an LLM call — the LLM call is already instrumented.
+
+**Use `@observe` for application-level functions** whose inputs, outputs, or intermediate states your evaluators need but that aren't visible from the LLM call alone:
+
+| What your evaluator needs                                  | What to instrument with `@observe`                                       |
+| ---------------------------------------------------------- | ------------------------------------------------------------------------ |
+| App-level input/output (what user sent, what app returned) | The app's entry-point or per-turn processing function                    |
+| Retrieved context (for faithfulness/grounding checks)      | The retrieval function — captures what documents were fetched            |
+| Routing/dispatch decisions                                 | The routing function — captures which tool/agent/department was selected |
+| Side-effects sent to external systems                      | The function that writes to the external system — captures what was sent |
+| Conversation history handling                              | The per-turn processing function — captures how history is assembled     |
+| Intermediate processing stages                             | Each intermediate function — captures each stage                         |
+
+If your eval criteria can be fully assessed from the auto-captured LLM inputs and outputs alone, you may not need `@observe` at all. But typically you need at least one `@observe` on the app's entry-point function to capture the application-level input/output shape that the dataset and evaluators work with.
diff --git a/skills/eval-driven-dev/references/investigation.md b/skills/eval-driven-dev/references/investigation.md
new file mode 100644
index 00000000..a6221c73
--- /dev/null
+++ b/skills/eval-driven-dev/references/investigation.md
@@ -0,0 +1,146 @@
+# Investigation and Iteration
+
+This reference covers Step 6 of the eval-driven-dev process: investigating test failures, root-causing them, and iterating on fixes.
+
+---
+
+## When to use this
+
+Only proceed with investigation if the user asked for it (iteration intent) or confirmed after seeing setup results. If the user's intent was "set up evals," stop after reporting test results and ask before investigating.
+
+---
+
+## Step-by-step investigation
+
+### 1. Get detailed test output
+
+```bash
+pixie test pixie_qa/tests/ -v    # shows score and reasoning per case
+```
+
+Capture the full verbose output. For each failing case, note:
+
+- The `eval_input` (what was sent)
+- The `eval_output` (what the app produced)
+- The `expected_output` (what was expected, if applicable)
+- The evaluator score and reasoning
+
+### 2. Inspect the trace data
+
+For each failing case, look up the full trace to see what happened inside the app:
+
+```python
+from pixie import DatasetStore
+
+store = DatasetStore()
+ds = store.get("<dataset-name>")
+for i, item in enumerate(ds.items):
+    print(i, item.eval_metadata)   # trace_id is here
+```
+
+Then inspect the full span tree:
+
+```python
+import asyncio
+from pixie import ObservationStore
+
+async def inspect(trace_id: str):
+    store = ObservationStore()
+    roots = await store.get_trace(trace_id)
+    for root in roots:
+        print(root.to_text())   # full span tree: inputs, outputs, LLM messages
+
+asyncio.run(inspect("the-trace-id-here"))
+```
+
+### 3. Root-cause analysis
+
+Walk through the trace and identify exactly where the failure originates. Common patterns:
+
+**LLM-related failures** (fix with prompt/model/eval changes):
+
+| Symptom                                                | Likely cause                                                  |
+| ------------------------------------------------------ | ------------------------------------------------------------- |
+| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully |
+| Agent routes to wrong tool/handoff                     | Routing prompt or handoff descriptions are ambiguous          |
+| Output format is wrong                                 | Missing format instructions in prompt                         |
+| LLM hallucinated instead of using tool                 | Prompt doesn't enforce tool usage                             |
+
+**Non-LLM failures** (fix with traditional code changes, out of eval scope):
+
+| Symptom                                           | Likely cause                                            |
+| ------------------------------------------------- | ------------------------------------------------------- |
+| Tool returned wrong data                          | Bug in tool implementation — fix the tool, not the eval |
+| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code           |
+| Database returned stale/wrong records             | Data issue — fix independently                          |
+| API call failed with error                        | Infrastructure issue                                    |
+
+For non-LLM failures: note them in the investigation log and recommend the code fix, but **do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code**. The eval test should measure LLM quality assuming the rest of the system works correctly.
+
+### 4. Document findings in MEMORY.md
+
+**Every failure investigation must be documented in `pixie_qa/MEMORY.md`** under the Investigation Log section:
+
+````markdown
+### <date> — <test_name> failure
+
+**Test**: `test_faq_factuality` in `pixie_qa/tests/test_customer_service.py`
+**Result**: 3/5 cases scored ≥ 0.7 (60%); the pass criteria required 80% of cases to score ≥ 0.7
+
+#### Failing case 1: "What rows have extra legroom?"
+
+- **eval_input**: `{"user_message": "What rows have extra legroom?"}`
+- **eval_output**: "I'm sorry, I don't have the exact row numbers for extra legroom..."
+- **expected_output**: "rows 5-8 Economy Plus with extra legroom"
+- **Evaluator score**: 0.1 (FactualityEval)
+- **Evaluator reasoning**: "The output claims not to know the answer while the reference clearly states rows 5-8..."
+
+**Trace analysis**:
+Inspected trace `abc123`. The span tree shows:
+
+1. Triage Agent routed to FAQ Agent ✓
+2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓
+3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← **root cause**
+
+**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
+The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
+The question "What rows have extra legroom?" contains none of these keywords, so it
+falls through to the default "I don't know" response.
+
+**Classification**: Non-LLM failure — the keyword-matching tool is broken.
+The LLM agent correctly routed to the FAQ agent and used the tool; the tool
+itself returned wrong data.
+
+**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
+`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix,
+not an eval/prompt change.
+
+**Verification**: After fix, re-run:
+
+```bash
+python pixie_qa/scripts/build_dataset.py  # refresh dataset
+pixie test pixie_qa/tests/ -k faq -v      # verify
+```
+````
+
+
+
+### 5. Fix and re-run
+
+Make the targeted change, rebuild the dataset if needed, and re-run. Always finish by giving the user the exact commands to verify:
+
+```bash
+pixie test pixie_qa/tests/test_<feature>.py -v
+```
+
+---
+
+## The iteration cycle
+
+1. Run tests → identify failures
+2. Investigate each failure → classify as LLM vs. non-LLM
+3. For LLM failures: adjust prompts, model, or eval criteria
+4. For non-LLM failures: recommend or apply code fix
+5. Rebuild dataset if the fix changed app behavior
+6. Re-run tests
+7. Repeat until passing or user is satisfied
diff --git a/skills/eval-driven-dev/references/pixie-api.md b/skills/eval-driven-dev/references/pixie-api.md
index 82cb064d..279cce49 100644
--- a/skills/eval-driven-dev/references/pixie-api.md
+++ b/skills/eval-driven-dev/references/pixie-api.md
@@ -1,5 +1,9 @@
 # pixie API Reference
 
+> This file is auto-generated by `generate_api_doc` from the
+> live pixie-qa package. Do not edit by hand — run
+> `generate_api_doc` to regenerate after updating pixie-qa.
+
 ## Configuration
 
 All settings read from environment variables at call time. By default,
@@ -22,18 +26,24 @@ from pixie import enable_storage, observe, start_observation, flush, init, add_h
 
 | Function / Decorator | Signature                                                    | Notes                                                                                               |
 | -------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------- |
-| `enable_storage()`   | `() → StorageHandler`                                        | Idempotent. Creates DB, registers handler. Call at app startup.                                     |
-| `init()`             | `(*, capture_content=True, queue_size=1000) → None`          | Called internally by `enable_storage`. Idempotent.                                                  |
-| `observe`            | `(name=None) → decorator`                                    | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
-| `start_observation`  | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside.                    |
-| `flush`              | `(timeout_seconds=5.0) → bool`                               | Drains the queue. Call after a run before using CLI commands.                                       |
-| `add_handler`        | `(handler) → None`                                           | Register a custom handler (must call `init()` first).                                               |
+| `observe`   | `observe(name: 'str \| None' = None) -> 'Callable[[Callable[P, T]], Callable[P, T]]'` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
+| `enable_storage`   | `enable_storage() -> 'StorageHandler'` | Idempotent. Creates DB, registers handler. Call at app startup. |
+| `start_observation`   | `start_observation(*, input: 'JsonValue', name: 'str \| None' = None) -> 'Generator[ObservationContext, None, None]'` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
+| `flush`   | `flush(timeout_seconds: 'float' = 5.0) -> 'bool'` | Drains the queue. Call after a run before using CLI commands. |
+| `init`   | `init(*, capture_content: 'bool' = True, queue_size: 'int' = 1000) -> 'None'` | Called internally by `enable_storage`. Idempotent. |
+| `add_handler`   | `add_handler(handler: 'InstrumentationHandler') -> 'None'` | Register a custom handler (must call `init()` first). |
+| `remove_handler`   | `remove_handler(handler: 'InstrumentationHandler') -> 'None'` | Unregister a previously added handler. |
 
 ---
 
 ## CLI Commands
 
 ```bash
+# Trace inspection
+pixie trace list [--limit N] [--errors]              # show recent traces
+pixie trace show <trace_id> [--verbose] [--json]     # show span tree for a trace
+pixie trace last [--json]                            # show most recent trace (verbose)
+
 # Dataset management
 pixie dataset create <name>
 pixie dataset list
@@ -47,6 +57,24 @@ echo '"expected value"' | pixie dataset save <name> --expected-output
 pixie test [path] [-k filter_substring] [-v]
 ```
 
+### `pixie trace` commands
+
+**`pixie trace list`** — show recent traces with summary info (trace ID, root span, timestamp, span count, errors).
+
+- `--limit N` (default 10) — number of traces to show
+- `--errors` — show only traces with errors
+
+**`pixie trace show <trace_id>`** — show the span tree for a specific trace.
+
+- Default (compact): span names, types, timing
+- `--verbose` / `-v`: full input/output data for each span
+- `--json`: machine-readable JSON output
+- Trace ID accepts prefix match (first 8+ characters)
+
+**`pixie trace last`** — shortcut to show the most recent trace in verbose mode. This is the primary command to use after running the harness.
+
+- `--json`: machine-readable JSON output
+
 **`pixie dataset save` selection modes:**
 
 - `root` (default) — the outermost `@observe` or `start_observation` span
@@ -55,112 +83,21 @@ pixie test [path] [-k filter_substring] [-v]
 
 ---
 
-## Eval Harness (`pixie`)
-
-```python
-from pixie import (
-    assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
-    EvalAssertionError, Evaluation, ScoreThreshold,
-    capture_traces, MemoryTraceHandler,
-    last_llm_call, root,
-)
-```
-
-### Key functions
-
-**`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`**
-
-- Loads dataset by name, runs `assert_pass` with all items.
-- `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself.
-- `evaluators`: list of evaluator callables.
-- `pass_criteria`: defaults to `ScoreThreshold()` (all scores >= 0.5).
-- `from_trace`: `last_llm_call` or `root` — selects which span to evaluate.
-
-**`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`**
-
-- Same, but takes explicit inputs (and optionally `Evaluable` items for expected outputs).
-
-**`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`**
-
-- Runs `runnable(eval_input)`, captures traces, evaluates. Returns one `Evaluation`.
-
-**`ScoreThreshold(threshold=0.5, pct=1.0)`**
-
-- `threshold`: min score per item (default 0.5).
-- `pct`: fraction of items that must meet threshold (default 1.0 = all).
-- Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7.
-
-**`Evaluation(score, reasoning, details={})`** — frozen result. `score` is 0.0–1.0.
-
-**`capture_traces()`** — context manager; use for in-memory trace capture without DB.
-
-**`last_llm_call(trace)`** / **`root(trace)`** — `from_trace` helpers.
-
----
-
-## Evaluators
-
-### Heuristic (no LLM needed)
-
-| Evaluator                        | Use when                                            |
-| -------------------------------- | --------------------------------------------------- |
-| `ExactMatchEval(expected=...)`   | Output must exactly equal the expected string       |
-| `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance)           |
-| `NumericDiffEval(expected=...)`  | Normalised numeric difference                       |
-| `JSONDiffEval(expected=...)`     | Structural JSON comparison                          |
-| `ValidJSONEval(schema=None)`     | Output is valid JSON (optionally matching a schema) |
-| `ListContainsEval(expected=...)` | Output list contains expected items                 |
-
-### LLM-as-judge (require OpenAI key or compatible client)
-
-| Evaluator                                             | Use when                                  |
-| ----------------------------------------------------- | ----------------------------------------- |
-| `FactualityEval(expected=..., model=..., client=...)` | Output is factually accurate vs reference |
-| `ClosedQAEval(expected=..., model=..., client=...)`   | Closed-book QA comparison                 |
-| `SummaryEval(expected=..., model=..., client=...)`    | Summarisation quality                     |
-| `TranslationEval(expected=..., language=..., ...)`    | Translation quality                       |
-| `PossibleEval(model=..., client=...)`                 | Output is feasible / plausible            |
-| `SecurityEval(model=..., client=...)`                 | No security vulnerabilities in output     |
-| `ModerationEval(threshold=..., client=...)`           | Content moderation                        |
-| `BattleEval(expected=..., model=..., client=...)`     | Head-to-head comparison                   |
-
-### RAG / retrieval
-
-| Evaluator                                         | Use when                                   |
-| ------------------------------------------------- | ------------------------------------------ |
-| `ContextRelevancyEval(expected=..., client=...)`  | Retrieved context is relevant to query     |
-| `FaithfulnessEval(client=...)`                    | Answer is faithful to the provided context |
-| `AnswerRelevancyEval(client=...)`                 | Answer addresses the question              |
-| `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference             |
-
-### Custom evaluator template
-
-```python
-from pixie import Evaluation, Evaluable
-
-async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
-    # evaluable.eval_input  — what was passed to the observed function
-    # evaluable.eval_output — what the function returned
-    # evaluable.expected_output — reference answer (UNSET if not provided)
-    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
-    return Evaluation(score=score, reasoning="...")
-```
-
----
-
 ## Dataset Python API
 
 ```python
 from pixie import DatasetStore, Evaluable
+```
 
+```python
 store = DatasetStore()                               # reads PIXIE_DATASET_DIR
-store.create("my-dataset")                          # create empty
-store.create("my-dataset", items=[...])             # create with items
-store.append("my-dataset", Evaluable(...))          # add one item
-store.get("my-dataset")                             # returns Dataset
-store.list()                                        # list names
-store.remove("my-dataset", index=2)                 # remove by index
-store.delete("my-dataset")                          # delete entirely
+store.append(...)    # add one or more items
+store.create(...)    # create empty / create with items
+store.delete(...)    # delete entirely
+store.get(...)    # returns Dataset
+store.list(...)    # list names
+store.list_details(...)    # list names with metadata
+store.remove(...)    # remove by index
 ```
 
 **`Evaluable` fields:**
@@ -179,13 +116,20 @@ from pixie import ObservationStore
 
 store = ObservationStore()   # reads PIXIE_DB_PATH
 await store.create_tables()
+```
 
-# Read traces
-await store.list_traces(limit=10, offset=0)         # → list of trace summaries
-await store.get_trace(trace_id)                     # → list[ObservationNode] (tree)
-await store.get_root(trace_id)                      # → root ObserveSpan
-await store.get_last_llm(trace_id)                  # → most recent LLMSpan
-await store.get_by_name(name, trace_id=None)        # → list of spans
+```python
+await store.create_tables()                         # → None
+await store.get_by_name(name, trace_id=None)        # → list[ObserveSpan | LLMSpan]
+await store.get_by_type(span_kind, trace_id=None)   # → list[ObserveSpan | LLMSpan], filtered by kind
+await store.get_errors(trace_id=None)               # → list[ObserveSpan | LLMSpan], error spans only
+await store.get_last_llm(trace_id)                  # → LLMSpan | None (most recent LLMSpan)
+await store.get_root(trace_id)                      # → root ObserveSpan
+await store.get_trace(trace_id)                     # → list[ObservationNode] (tree)
+await store.get_trace_flat(trace_id)                # → flat list[ObserveSpan | LLMSpan] of all spans
+await store.list_traces(limit=50, offset=0)         # → list[dict[str, Any]] of trace summaries
+await store.save(span)                              # persist a single ObserveSpan or LLMSpan
+await store.save_many(spans)                        # persist multiple spans
 
 # ObservationNode
 node.to_text()          # pretty-print span tree
@@ -193,3 +137,121 @@ node.find(name)         # find a child span by name
 node.children           # list of child ObservationNode
 node.span               # the underlying span (ObserveSpan or LLMSpan)
 ```
+
+---
+
+## Eval Runner API
+
+### `assert_dataset_pass`
+
+```python
+await assert_dataset_pass(runnable: 'Callable[..., Any]', dataset_name: 'str', evaluators: 'list[Callable[..., Any]]', *, dataset_dir: 'str | None' = None, passes: 'int' = 1, pass_criteria: 'Callable[[list[list[list[Evaluation]]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None'
+```
+
+**Parameters:**
+
+- `runnable` — callable that takes `eval_input` and runs the app
+- `dataset_name` — name of the dataset to load (NOT `dataset_path`)
+- `evaluators` — list of evaluator instances
+- `pass_criteria` — `ScoreThreshold(threshold=..., pct=...)` (NOT `thresholds`)
+- `from_trace` — span selector: use `last_llm_call` or `root`
+- `dataset_dir` — override dataset directory (default: reads from config)
+- `passes` — number of times to run the full matrix (default: 1)
+
+### `ScoreThreshold`
+
+```python
+ScoreThreshold(threshold: 'float' = 0.5, pct: 'float' = 1.0) -> None
+
+# threshold: minimum per-item score to count as passing (0.0–1.0)
+# pct:       fraction of items that must pass (0.0–1.0, default=1.0)
+```
+
+### Trace helpers
+
+```python
+from pixie import last_llm_call, root
+
+# Pass one of these as the from_trace= argument:
+from_trace=last_llm_call  # extract eval data from the most recent LLM call span
+from_trace=root           # extract eval data from the root @observe span
+```
+
+---
+
+## Evaluator catalog
+
+Import any evaluator directly from `pixie`:
+
+```python
+from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
+```
+
+### Heuristic (no LLM required)
+
+| Evaluator | Signature | Use when | Needs `expected_output`? |
+| --- | --- | --- | --- |
+| `ExactMatchEval` | `ExactMatchEval() -> 'AutoevalsAdapter'` | Output must exactly equal the expected string | **Yes** |
+| `LevenshteinMatch` | `LevenshteinMatch() -> 'AutoevalsAdapter'` | Partial string similarity (edit distance) | **Yes** |
+| `NumericDiffEval` | `NumericDiffEval() -> 'AutoevalsAdapter'` | Normalised numeric difference | **Yes** |
+| `JSONDiffEval` | `JSONDiffEval(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter'` | Structural JSON comparison | **Yes** |
+| `ValidJSONEval` | `ValidJSONEval(*, schema: 'Any' = None) -> 'AutoevalsAdapter'` | Output is valid JSON (optionally matching a schema) | No |
+| `ListContainsEval` | `ListContainsEval(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter'` | Output list contains expected items | **Yes** |
+
+### LLM-as-judge (require OpenAI key or compatible client)
+
+| Evaluator | Signature | Use when | Needs `expected_output`? |
+| --- | --- | --- | --- |
+| `FactualityEval` | `FactualityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is factually accurate vs reference | **Yes** |
+| `ClosedQAEval` | `ClosedQAEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Closed-book QA comparison | **Yes** |
+| `SummaryEval` | `SummaryEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Summarisation quality | **Yes** |
+| `TranslationEval` | `TranslationEval(*, language: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Translation quality | **Yes** |
+| `PossibleEval` | `PossibleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is feasible / plausible | No |
+| `SecurityEval` | `SecurityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No security vulnerabilities in output | No |
+| `ModerationEval` | `ModerationEval(*, threshold: 'float \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Content moderation | No |
+| `BattleEval` | `BattleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Head-to-head comparison | **Yes** |
+| `HumorEval` | `HumorEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Humor quality evaluation | **Yes** |
+| `EmbeddingSimilarityEval` | `EmbeddingSimilarityEval(*, prefix: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Embedding-based semantic similarity | **Yes** |
+
+### RAG / retrieval
+
+| Evaluator | Signature | Use when | Needs `expected_output`? |
+| --- | --- | --- | --- |
+| `ContextRelevancyEval` | `ContextRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Retrieved context is relevant to query | **Yes** |
+| `FaithfulnessEval` | `FaithfulnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is faithful to the provided context | No |
+| `AnswerRelevancyEval` | `AnswerRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer addresses the question (⚠️ requires `context` in trace; **RAG pipelines only**) | No |
+| `AnswerCorrectnessEval` | `AnswerCorrectnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is correct vs reference | **Yes** |
+
+### Other evaluators
+
+| Evaluator | Signature | Needs `expected_output`? |
+| --- | --- | --- |
+| `SqlEval` | `SqlEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No |
+
+---
+
+## Custom evaluator — `create_llm_evaluator` factory
+
+```python
+from pixie import create_llm_evaluator
+
+# Signature: create_llm_evaluator(name: str, prompt_template: str, *, model: str = "gpt-4o-mini", client: Any | None = None) -> _LLMEvaluator
+```
+
+- Returns a callable satisfying the `Evaluator` protocol
+- Template variables: `{eval_input}`, `{eval_output}`, `{expected_output}` — populated from `Evaluable` fields
+- No nested field access — include any needed metadata in `eval_input` when building the dataset
+- Score parsing extracts a 0–1 float from the LLM response
+
+### Custom evaluator — manual template
+
+```python
+from pixie import Evaluation, Evaluable
+
+async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
+    # evaluable.eval_input  — what was passed to the observed function
+    # evaluable.eval_output — what the function returned
+    # evaluable.expected_output — reference answer (UNSET if not provided)
+    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
+    return Evaluation(score=score, reasoning="...")
+```
diff --git a/skills/eval-driven-dev/references/run-harness-patterns.md b/skills/eval-driven-dev/references/run-harness-patterns.md
new file mode 100644
index 00000000..ff995d32
--- /dev/null
+++ b/skills/eval-driven-dev/references/run-harness-patterns.md
@@ -0,0 +1,281 @@
+# Running the App from Its Entry Point — Examples by App Type
+
+This reference shows concrete examples of how to write the utility function from Step 3 — the function that runs the full application end-to-end with external dependencies mocked. Each example demonstrates what an "entry point" looks like for a different kind of application and how to invoke it.
+
+For `enable_storage()` and `observe` API details, see `references/pixie-api.md` (Instrumentation API section).
+
+## What entry point to use
+
+Look at how a real user or client invokes the app, and do the same thing in your utility function:
+
+| App type                                           | Entry point example     | How to invoke it                                     |
+| -------------------------------------------------- | ----------------------- | ---------------------------------------------------- |
+| **Web server** (FastAPI, Flask)                    | HTTP/WebSocket endpoint | `TestClient`, `httpx`, or subprocess + HTTP requests |
+| **CLI application**                                | Command-line invocation | `subprocess.run()`                                   |
+| **Standalone function** (no server, no middleware) | Python function         | Import and call directly                             |
+
+**Do NOT call an inner function** like `agent.respond()` directly just because it's simpler. Between the entry point and that inner function, the app does request handling, state management, prompt assembly, routing — all of which is under test. When you call an inner function, you skip all of that and end up reimplementing it in your test. Now your test is testing test code, not app code.
+
+Mock only external dependencies (databases, speech services, third-party APIs) — everything you identified and planned in Step 1.
+
+---
+
+## Example: FastAPI / Web Server with External Services
+
+**Use this pattern when your app is a web server** (FastAPI, Flask, etc.) with external service dependencies (Redis, Twilio, speech services, databases). **This is the most common case**, since most production apps are web servers.
+
+**Approach**: Mock external dependencies, then drive the app through its HTTP/WebSocket interface. Two sub-approaches:
+
+- **Subprocess approach**: Launch the patched server as a subprocess, wait for health, then send HTTP/WebSocket requests with `httpx`. Best when the app has complex startup or uses `uvicorn.run()`.
+- **In-process approach**: Use FastAPI's `TestClient` (or `httpx.AsyncClient` with `ASGITransport`) to drive the app in-process. Simpler — no subprocess management, no ports. Best when you can import the `app` object directly.
+
+Both approaches exercise the full request pipeline: routing → middleware → state management → business logic → response assembly.
+
+### Step 1: Identify pluggable interfaces and write mock backends
+
+Look for abstract base classes, protocols, or constructor-injected backends in the codebase. These are the app's testability seams — the places where external services can be swapped out. Create mock implementations that satisfy the interface but don't call external services.
+
+```python
+# pixie_qa/scripts/mock_backends.py
+from myapp.services.transcription import TranscriptionBackend
+from myapp.services.voice_synthesis import SynthesisBackend
+
+class MockTranscriptionBackend(TranscriptionBackend):
+    """Decodes UTF-8 text instead of calling real STT service."""
+    async def transcribe_chunk(self, audio_data: bytes) -> str | None:
+        try:
+            return audio_data.decode("utf-8")
+        except UnicodeDecodeError:
+            return None
+
+class MockSynthesisBackend(SynthesisBackend):
+    """Encodes text as bytes instead of calling real TTS service."""
+    async def synthesize(self, text: str) -> bytes:
+        return text.encode("utf-8")
+```
+
+### Step 2: Write the patched server launcher
+
+Monkey-patch the app's module-level dependencies before starting the server:
+
+```python
+# pixie_qa/scripts/demo_server.py
+import uvicorn
+from pixie_qa.scripts.mock_backends import (
+    MockTranscriptionBackend,
+    MockSynthesisBackend,
+)
+
+# Patch module-level backends BEFORE uvicorn starts serving the ASGI app
+import myapp.app as the_app
+the_app.transcription_backend = MockTranscriptionBackend()
+the_app.synthesis_backend = MockSynthesisBackend()
+
+if __name__ == "__main__":
+    uvicorn.run(the_app.app, host="127.0.0.1", port=8000)
+```
+
+### Step 3: Write the utility function
+
+Launch the server subprocess, wait for health, send real requests, collect responses:
+
+```python
+# pixie_qa/scripts/run_app.py
+import subprocess
+import sys
+import time
+import httpx
+
+BASE_URL = "http://127.0.0.1:8000"
+
+def wait_for_server(timeout: float = 30.0) -> None:
+    start = time.time()
+    while time.time() - start < timeout:
+        try:
+            resp = httpx.get(f"{BASE_URL}/health", timeout=2)
+            if resp.status_code == 200:
+                return
+        except httpx.TransportError:  # connection refused, read timeout, etc. while the server boots
+            pass
+        time.sleep(0.5)
+    raise TimeoutError(f"Server did not start within {timeout}s")
+
+def main() -> None:
+    # Launch patched server
+    server = subprocess.Popen(
+        [sys.executable, "-m", "pixie_qa.scripts.demo_server"],
+    )
+    try:
+        wait_for_server()
+        # Drive the app with real inputs
+        resp = httpx.post(f"{BASE_URL}/api/chat", json={
+            "message": "What are your business hours?"
+        })
+        print(resp.json())
+    finally:
+        server.terminate()
+        server.wait()
+
+if __name__ == "__main__":
+    main()
+```
+
+**Run**: `uv run python -m pixie_qa.scripts.run_app`
+
+### Alternative: In-process with TestClient (simpler)
+
+If the app's `app` object can be imported directly, skip the subprocess and use FastAPI's `TestClient`:
+
+```python
+# pixie_qa/scripts/run_app.py
+from unittest.mock import patch
+from fastapi.testclient import TestClient
+from pixie import enable_storage, observe
+from pixie_qa.scripts.mock_backends import (
+    MockCallStateStore,  # assumed: a state-store mock defined next to the Step 1 mocks
+    MockTranscriptionBackend,
+    MockSynthesisBackend,
+)
+
+@observe
+def run_app(eval_input: dict) -> dict:
+    """Run the voice agent through its real FastAPI app layer."""
+    enable_storage()
+    # patch() imports myapp.app itself and swaps its module-level attributes inside the with block
+    with patch("myapp.app.transcription_backend", MockTranscriptionBackend()), \
+         patch("myapp.app.synthesis_backend", MockSynthesisBackend()), \
+         patch("myapp.app.call_state_store", MockCallStateStore()):
+
+        from myapp.app import app
+        client = TestClient(app)
+
+        # Drive through the real HTTP/WebSocket endpoints
+        resp = client.post("/api/chat", json={
+            "message": eval_input["user_message"],
+            "call_sid": eval_input.get("call_sid", "test-call-001"),
+        })
+        return {"response": resp.json()["response"]}
+```
+
+This approach is simpler (no subprocess, no port management) and equally valid. Both approaches exercise the full request pipeline.
+
+**Run**: `uv run python -m pixie_qa.scripts.run_app`
+
+---
+
+## Example: CLI / Command-Line App
+
+**When your app is invoked from the command line** (e.g., `python -m myapp`, a CLI tool).
+
+**Approach**: Invoke the app's entry point via `subprocess.run()`, capture stdout/stderr, parse results.
+
+```python
+# pixie_qa/scripts/run_app.py
+import subprocess
+import sys
+import json
+
+def run_app(user_input: str) -> str:
+    """Run the CLI app with the given input and return its output."""
+    result = subprocess.run(
+        [sys.executable, "-m", "myapp", "--query", user_input],
+        capture_output=True,
+        text=True,
+        timeout=120,
+    )
+    if result.returncode != 0:
+        raise RuntimeError(f"App failed: {result.stderr}")
+    return result.stdout.strip()
+
+def main() -> None:
+    inputs = [
+        "What are your business hours?",
+        "How do I reset my password?",
+        "Tell me about your return policy",
+    ]
+    for user_input in inputs:
+        output = run_app(user_input)
+        print(f"Input: {user_input}")
+        print(f"Output: {output}")
+        print("---")
+
+if __name__ == "__main__":
+    main()
+```
+
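+If the CLI app needs external dependencies mocked, create a wrapper script that patches them before invoking the entry point, and point `subprocess.run()` at the wrapper module (`-m pixie_qa.scripts.patched_app`) instead of `myapp`: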
+
+```python
+# pixie_qa/scripts/patched_app.py
+"""Entry point that patches DB/cache before running the real app."""
+import myapp.config as config
+config.redis_url = "mock://localhost"  # or use a mock implementation
+
+from myapp.main import main
+main()
+```
+
+**Run**: `uv run python -m pixie_qa.scripts.run_app`
+
+---
+
+## Example: Standalone Function (No Infrastructure)
+
+**When your app is a single function or module** with no server, no database, no external services.
+
+**Approach**: Import the function directly and call it. This is the simplest case.
+
+```python
+# pixie_qa/scripts/run_app.py
+from pixie import enable_storage, observe
+
+# Enable trace capture
+enable_storage()
+
+from myapp.agent import answer_question
+
+@observe
+def run_agent(question: str) -> str:
+    """Wrapper that captures traces for the agent call."""
+    return answer_question(question)
+
+def main() -> None:
+    inputs = [
+        "What are your business hours?",
+        "How do I reset my password?",
+        "Tell me about your return policy",
+    ]
+    for q in inputs:
+        result = run_agent(q)
+        print(f"Q: {q}")
+        print(f"A: {result}")
+        print("---")
+
+if __name__ == "__main__":
+    main()
+```
+
+If the function depends on something that needs mocking (e.g., a vector store client), patch it before calling:
+
+```python
+from unittest.mock import MagicMock
+import myapp.retriever as retriever
+
+# Replace the vector store client with a mock that always returns one canned result
+retriever.vector_client = MagicMock()
+retriever.vector_client.search.return_value = [
+    {"text": "Business hours: Mon-Fri 9am-5pm", "score": 0.95}
+]
+```
+
+**Run**: `uv run python -m pixie_qa.scripts.run_app`
+
+---
+
+## Key Rules
+
+1. **Always call through the real entry point** — the same way a real user or client would
+2. **Mock only external dependencies** — the ones you identified in Step 1
+3. **Use `uv run python -m <module>`** to run scripts — never `python <path>`
+4. **Add `enable_storage()` and `@observe`** in the utility function so traces are captured
+5. **After running, verify traces**: `uv run pixie trace list` then `uv run pixie trace show <trace_id> --verbose`
diff --git a/skills/eval-driven-dev/references/understanding-app.md b/skills/eval-driven-dev/references/understanding-app.md
new file mode 100644
index 00000000..e7c8a47a
--- /dev/null
+++ b/skills/eval-driven-dev/references/understanding-app.md
@@ -0,0 +1,201 @@
+# Understanding the Application
+
+This reference covers Step 1 of the eval-driven-dev process in detail: how to read the codebase, map the data flows, and document your findings.
+
+---
+
+## What to investigate
+
+Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would.
+
+### 1. How the software runs
+
+What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
+
+### 2. Find where the LLM provider client is called
+
+Locate every place in the codebase where an LLM provider client is invoked (e.g., `client.chat.completions.create()` in the OpenAI SDK, the legacy `openai.ChatCompletion.create()` in pre-1.0 versions of that SDK, or `client.messages.create()` in the Anthropic SDK). These are the anchor points for your analysis. For each LLM call site, record:
+
+- The file and function where the call lives
+- Which LLM provider/client is used
+- The exact arguments being passed (model, messages, tools, etc.)
+
+### 3. Track backwards: external data dependencies flowing IN
+
+Starting from each LLM call site, trace **backwards** through the code to find every piece of data that feeds into the LLM prompt. Categorize each data source:
+
+**Application inputs** (from the user / caller):
+
+- User messages, queries, uploaded files
+- Configuration or feature flags
+
+**External dependency data** (from systems outside the app):
+
+- Database lookups (conversation history from Redis, user profiles from Postgres, etc.)
+- Retrieved context (RAG chunks from a vector DB, search results from an API)
+- Cache reads
+- Third-party API responses
+
+For each external data dependency, document:
+
+- What system it comes from
+- What the data shape looks like (types, fields, structure)
+- What realistic values look like
+- Whether it requires real credentials or can be mocked
+
+**In-code data** (assembled by the application itself):
+
+- System prompts (hardcoded or templated)
+- Tool definitions and function schemas
+- Prompt-building logic that combines the above
+
+### 4. Track forwards: external side-effects flowing OUT
+
+Starting from each LLM call site, trace **forwards** through the code to find every side-effect the application causes in external systems based on the LLM's output:
+
+- Database writes (saving conversation history, updating records)
+- API calls to third-party services (sending emails, creating calendar entries, initiating transfers)
+- Messages sent to other systems (queues, webhooks, notifications)
+- File system writes
+
+For each side-effect, document:
+
+- What system is affected
+- What data is written/sent
+- Whether this side-effect is something evaluations should verify (e.g., "did the agent route to the correct department?")
+
+### 5. Identify intermediate states to capture
+
+Along the paths between input and output, identify intermediate states that are necessary for proper evaluation but aren't visible in the final output:
+
+- Tool call decisions and results (which tools were called, what they returned)
+- Agent routing / handoff decisions
+- Intermediate LLM calls (e.g., summarization before final answer)
+- Retrieval results (what context was fetched)
+- Any branching logic that determines the code path
+
+Evaluators need these states to check criteria such as "did the agent verify identity before transferring?" or "did it use the correct tool?"
+
+### 6. Use cases and expected behaviors
+
+What are the distinct things the app is supposed to handle? For each use case, what does a "good" response look like? What would constitute a failure?
+
+---
+
+## Writing MEMORY.md
+
+Write your findings to `pixie_qa/MEMORY.md`. This is the primary working document for the eval effort. It should be human-readable and detailed enough that someone unfamiliar with the project can understand the application and the eval strategy.
+
+**MEMORY.md documents your understanding of the existing application code. It must NOT contain references to pixie commands, instrumentation code you plan to add, or scripts/functions that don't exist yet.** Those belong in later steps, only after they've been implemented.
+
+### Template
+
+```markdown
+# Eval Notes: <Project Name>
+
+## How the application works
+
+### Entry point and execution flow
+
+<Describe how to start/run the app, what happens step by step>
+
+### LLM call sites
+
+<For each LLM call in the codebase, document:>
+
+- Where it is in the code (file + function name)
+- Which LLM provider/client is used
+- What arguments are passed
+
+### External data dependencies (data flowing IN to LLM)
+
+<For each external system the app reads from:>
+
+- **System**: <e.g., Redis, Postgres, vector DB, third-party API>
+- **What data**: <e.g., conversation history, user profile, retrieved documents>
+- **Data shape**: <types, fields, structure, realistic values>
+- **Code path**: <file:line where each read happens>
+- **Credentials needed**: <yes/no, what kind>
+
+### External side-effects (data flowing OUT from LLM output)
+
+<For each external system the app writes to / affects:>
+
+- **System**: <e.g., database, API, queue, file system>
+- **What happens**: <e.g., saves conversation, sends email, creates calendar entry>
+- **Code path**: <file:line where each write happens>
+- **Eval-relevant?**: <should evaluations verify this side-effect?>
+
+### Pluggable/injectable interfaces (testability seams)
+
+<For each abstract base class, protocol, or constructor-injected backend:>
+
+- **Interface**: <e.g., `TranscriptionBackend`, `SynthesisBackend`, `StorageBackend`>
+- **Defined in**: <file:line>
+- **What it wraps**: <e.g., real STT service, real TTS service, Redis>
+- **How it's injected**: <constructor param, module-level var, dependency injection framework>
+- **Mock strategy**: <what mock implementation should do — e.g., decode UTF-8 instead of real STT>
+
+These are the primary testability seams. In Step 3, you'll write mock implementations of these interfaces.
+
+### Mocking plan summary
+
+<For each external dependency, how will you replace it in the utility function (Step 3)?>
+
+| Dependency          | Mock approach                  | What mock provides (IN)                | What mock captures (OUT) |
+| ------------------- | ------------------------------ | -------------------------------------- | ------------------------ |
+| <e.g., Redis>       | <mock.patch / mock class / DI> | <conversation history from eval_input> | <saved messages>         |
+| <e.g., STT service> | <MockTranscriptionBackend>     | <text from eval_input>                 | <n/a>                    |
+
+### Intermediate states to capture
+
+<States along the execution path needed for evaluation but not in final output:>
+
+- <e.g., tool call decisions, routing choices, retrieval results>
+- Include code pointers (file:line) for each
+
+### Final output
+
+<What the user sees, what format, what the quality bar should be>
+
+### Use cases
+
+<List each distinct scenario the app handles, with examples of good/bad outputs>
+
+1. <Use case 1>: <description>
+   - Input example: ...
+   - Good output: ...
+   - Bad output: ...
+
+## Evaluation plan
+
+### What to evaluate and why
+
+<App-specific quality dimensions and rationale — filled in during Step 1>
+
+### Evaluators and criteria
+
+<Filled in during Step 5 — maps each quality criterion to a specific evaluator>
+
+| Criterion | Evaluator | Dataset | Pass criteria | Rationale |
+| --------- | --------- | ------- | ------------- | --------- |
+| ...       | ...       | ...     | ...           | ...       |
+
+### Data needed for evaluation
+
+<What data to capture, with code pointers>
+
+## Datasets
+
+| Dataset | Items | Purpose |
+| ------- | ----- | ------- |
+| ...     | ...   | ...     |
+
+## Investigation log
+
+### <date> — <test_name> failure
+
+<Structured investigation entries — filled in during Step 6>
+```
+
+If something is genuinely unclear from the code, ask the user — but most questions answer themselves once you've read the code carefully.