update eval-driven-dev skill (#1352)

* update eval-driven-dev skill
* small refinement of skill description
* address review, rerun npm start.
2026-05-29 18:11:45 +00:00 · 2026-04-09 18:19:28 -07:00
parent 88b1920cb7
commit 5f59ddb9cf
19 changed files with 2180 additions and 1708 deletions
@@ -1,14 +1,16 @@
 ---
 name: eval-driven-dev
 description: >
-  Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets,
-  write and run eval tests, and iterate on failures.
+  Set up eval-based QA for Python LLM applications: instrument the app,
+  build golden datasets, write and run eval tests, and iterate on failures.
  ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals,
  evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model.
 license: MIT
 compatibility: Python 3.11+
 metadata:
-  version: 0.2.0
+  version: 0.6.1
+  pixie-qa-version: ">=0.6.1,<0.7.0"
+  pixie-qa-source: https://github.com/yiouli/pixie-qa/
 ---

 # Eval-Driven Development for Python LLM Applications
@@ -17,7 +19,7 @@ You're building an **automated QA pipeline** that tests a Python application end

 **What you're testing is the app itself** — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of `assertEqual` — but the thing under test is the app's code, not the LLM.

-**What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — mock these to control inputs and reduce flakiness.
+During evaluation, the app's own code runs for real — routing, prompt assembly, LLM calls, response formatting — nothing is mocked or stubbed. But the data the app reads from external sources (databases, caches, third-party APIs, voice streams) is replaced with test-specified values via instrumentations. This means each test case controls exactly what data the app sees, while still exercising the full application code path.

 **The deliverable is a working `pixie test` run with real scores** — not a plan, not just instrumentation, not just a dataset.

@@ -27,352 +29,119 @@ This skill is about doing the work, not describing it. Read code, edit files, ru

 ## Before you start

-Run the following to keep the skill and package up to date. If any command fails or is blocked by the environment, continue — do not let failures here block the rest of the workflow.
-
-**Update the skill:**
-
-```bash
-npx skills update
-```
-
-**Upgrade the `pixie-qa` package**
-
-Make sure the python virtual environment is active and use the project's package manager:
-
-```bash
-# uv project (uv.lock exists):
-uv add pixie-qa --upgrade
-
-# poetry project (poetry.lock exists):
-poetry add pixie-qa@latest
-
-# pip / no lock file:
-pip install --upgrade pixie-qa
-```
+**First, activate the virtual environment**. Identify the correct virtual environment for the project and activate it. Once it is active, run the `setup.sh` script included in the skill's resources.
+The script updates the `eval-driven-dev` skill and the `pixie-qa` Python package to their latest versions, initializes the pixie working directory if it is not already initialized, and starts a web server in the background that shows updates to the user. If the skill or package update fails, continue; do not let these failures block the rest of the workflow.

 ---

 ## The workflow

-Follow Steps 1–5 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.
+Follow Steps 1–6 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.

-**Two modes:**
+**How to work — read this before doing anything else:**

- **Setup** ("set up evals", "add tests", "set up QA"): Complete Steps 1–5. After the test run, report results and ask whether to iterate.
- **Iteration** ("fix", "improve", "debug"): Complete Steps 1–5 if not already done, then do one round of Step 6.
+- **One step at a time.** Read only the current step's instructions. Do NOT read Steps 2–6 while working on Step 1.
+- **Read references only when a step tells you to.** Each step names a specific reference file. Read it when you reach that step — not before.
+- **Create artifacts immediately.** After reading code for a sub-step, write the output file for that sub-step before moving on. Don't accumulate understanding across multiple sub-steps before writing anything.
+- **Verify, then move on.** Each step has a checkpoint. Verify it, then proceed to the next step. Don't plan future steps while verifying the current one.

-If ambiguous: default to setup.
+**Run Steps 1–6 in sequence.** If the user's prompt makes it clear that earlier steps are already done (e.g., "run the existing tests", "re-run evals"), skip to the appropriate step. When in doubt, start from Step 1.

 ---

 ### Step 1: Understand the app and define eval criteria

-Read the source code to understand:
+**First, check the user's prompt for specific requirements.** Before reading app code, examine what the user asked for:

-1. **How it runs** — entry point, startup, config/env vars
-2. **The real entry point** — how a real user invokes the app (HTTP endpoint, CLI, function call). This is what the eval must exercise — not an inner function that bypasses the request pipeline.
-3. **The request pipeline** — trace the full path from entry point to response. What middleware, routing, state management, prompt assembly, retrieval, or formatting happens along the way? All of this is under test.
-4. **External dependencies (both directions)** — identify every external system the app talks to (databases, APIs, caches, queues, file systems, speech services). For each, understand:
-   - **Data flowing IN** (external → app): what data does the app read from this system? What shapes, types, realistic values? You'll make up this data for eval scenarios.
-   - **Data flowing OUT** (app → external): what does the app write, send, or mutate in this system? These are side-effects that evaluations may need to verify (e.g., "did the app create the right calendar entry?", "did it send the correct transfer request?").
-   - **How to mock it** — look for abstract base classes, protocols, or constructor-injected backends (e.g., `TranscriptionBackend`, `SynthesisBackend`, `StorageBackend`). These are testability seams — you'll create mock implementations of these interfaces. If there's no clean interface, you'll use `unittest.mock.patch` at the module boundary.
-5. **Use cases** — distinct scenarios, what good/bad output looks like
+- **Referenced documents or specs**: Does the prompt mention a file to follow (e.g., "follow the spec in EVAL_SPEC.md", "use the methodology in REQUIREMENTS.md")? If so, **read that file first** — it may specify datasets, evaluation dimensions, pass criteria, or methodology that override your defaults.
+- **Specified datasets or data sources**: Does the prompt reference specific data files (e.g., "use questions from eval_inputs/research_questions.json", "use the scenarios in call_scenarios.json")? If so, **read those files** — you must use them as the basis for your eval dataset, not fabricate generic alternatives.
+- **Specified evaluation dimensions**: Does the prompt name specific quality aspects to evaluate (e.g., "evaluate on factuality, completeness, and bias", "test identity verification and tool call correctness")? If so, **every named dimension must have a corresponding evaluator** in your test file.

-Read `references/understanding-app.md` for detailed guidance on mapping data flows and the MEMORY.md template.
+If the prompt specifies any of the above, they take priority. Read and incorporate them before proceeding.

-Write your findings to `pixie_qa/MEMORY.md` before moving on. Include:
+Step 1 has two sub-steps. Each reads its own reference file and produces its own output file. **Complete each sub-step fully before starting the next.**

- The entry point and the full request pipeline
- Every external dependency, what it provides/receives, and how you'll mock it
- The testability seams (pluggable interfaces, patchable module-level objects)
+#### Sub-step 1a: Entry point & execution flow

-Determine **high-level, application-specific eval criteria**:
+> **Reference**: Read `references/1-a-entry-point.md` now.

-**Good criteria are specific to the app's purpose.** Examples:
+Read the source code to understand how the app starts and how a real user invokes it. Write your findings to `pixie_qa/01-entry-point.md` before moving on.

- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation (under 3 sentences)?", "Does the agent route to the correct department based on the caller's request?"
- Research report generator: "Does the report address all sub-questions in the query?", "Are claims supported by the retrieved sources?", "Is the report structured with clear sections?"
- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when the context doesn't contain the answer?"
+> **Checkpoint**: `pixie_qa/01-entry-point.md` written with entry point, execution flow, user-facing interface, and env requirements.

-**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.
+#### Sub-step 1b: Eval criteria

-At this stage, don't pick evaluator classes or thresholds. That comes later in Step 5, after you've seen the real data shape.
+> **Reference**: Read `references/1-b-eval-criteria.md` now.

-Record the criteria in `pixie_qa/MEMORY.md` and continue.
+Define the app's use cases and eval criteria. Use cases drive dataset creation (Step 4); eval criteria drive evaluator selection (Step 3). Write your findings to `pixie_qa/02-eval-criteria.md` before moving on.

-> **Checkpoint**: MEMORY.md written with app understanding + eval criteria. Proceed to Step 2.
+> **Checkpoint**: `pixie_qa/02-eval-criteria.md` written with use cases, eval criteria, and their applicability scope. Do NOT read Step 2 instructions yet.

 ---

-### Step 2: Instrument and observe a real run
+### Step 2: Instrument with `wrap` and capture a reference trace

-**Why this step**: You need to see the actual data flowing through the app before you can build anything. This step serves two goals:
+> **Reference**: Read `references/2-wrap-and-trace.md` now for the detailed sub-steps.

-1. **Learn the data shapes** — what data flows in from external dependencies, and what side-effects flow out? What types, structures, realistic values? You'll need to make up this data for eval scenarios later.
-2. **Verify instrumentation captures what evaluators need** — do the traces contain the data required to assess each eval criterion from Step 1? If a criterion is "does the agent route to the correct department," the trace must capture the routing decision.
+**Goal**: Make the app testable by controlling its external data and capturing its outputs. `wrap()` calls at data boundaries let the test harness inject controlled inputs (replacing real DB/API calls) and capture outputs for scoring. The `Runnable` class provides the lifecycle interface that `pixie test` uses to set up, invoke, and tear down the app. A reference trace captured with `pixie trace` proves the instrumentation works and provides the exact data shapes needed for dataset creation in Step 4.

-**This is a normal app run with instrumentation — no mocks, no patches.**
-
-#### 2a. Decide what to instrument
-
-This is a reasoning step, not a coding step. Look at your eval criteria from Step 1 and your understanding of the codebase, and determine what data the evaluators will need:
-
- **For each eval criterion**, ask: what observable data would prove this criterion is met or violated?
- **Map that data to code locations** — which functions produce, consume, or transform that data?
- **Those functions need `@observe`** — so their inputs and outputs are captured in traces.
-
-Examples:
-
-| Eval criterion                             | Data needed                                        | What to instrument                                           |
-| ------------------------------------------ | -------------------------------------------------- | ------------------------------------------------------------ |
-| "Routes to correct department"             | The routing decision (which department was chosen) | The routing/dispatch function                                |
-| "Responses grounded in retrieved context"  | The retrieved documents + the final response       | The retrieval function AND the response function             |
-| "Verifies caller identity before transfer" | Whether identity check happened, transfer decision | The identity verification function AND the transfer function |
-| "Concise phone-friendly responses"         | The final response text                            | The function that produces the LLM response                  |
-
-**LLM provider calls (OpenAI, Anthropic, etc.) are auto-captured** — `enable_storage()` activates OpenInference instrumentors that automatically trace every LLM API call with full input messages, output messages, token usage, and model parameters. You do NOT need `@observe` on the function that calls `client.chat.completions.create()` just to see the LLM interaction.
-
-**Use `@observe` for application-level functions** whose inputs, outputs, or intermediate states your evaluators need but that aren't visible from the LLM call alone. Examples: the app's entry-point function (to capture what the user sent and what the app returned), retrieval functions (to capture what context was fetched), routing functions (to capture dispatch decisions).
-
-`enable_storage()` goes at application startup. Read `references/instrumentation.md` for the full rules, code patterns, and anti-patterns for adding instrumentation.
-
-#### 2b. Add instrumentation and run the app
-
-Add `@observe` to the functions you identified in 2a. Then run the app normally — with its real external dependencies, or by manually interacting with it — to produce a **reference trace**. Do NOT mock or patch anything. This is an observation run.
-
-If the app can't run without infrastructure you don't have (a real database, third-party service credentials, etc.), use the simplest possible approach to get it running — a local Docker container, a test account, or ask the user for help. The goal is one real trace.
-
-```bash
-uv run pixie trace list
-uv run pixie trace last
-```
-
-#### 2c. Examine the reference trace
-
-Study the trace data carefully. This is your blueprint for everything that follows. Document:
-
-1. **Data from external dependencies (inbound)** — What did the app read from databases, APIs, caches? What are the shapes, types, and realistic value ranges? This is what you'll make up in eval_input for the dataset.
-2. **Side-effects (outbound)** — What did the app write to, send to, or mutate in external systems? These need to be captured by mocks and may be part of eval_output for verification.
-3. **Intermediate states** — What did the instrumentation capture beyond the final output? Tool calls, retrieved documents, routing decisions? Are these sufficient to evaluate every criterion from Step 1?
-4. **The eval_input / eval_output structure** — What does the `@observe`-decorated function receive as input and produce as output? Note the exact field names, types, and nesting.
-
-**Check instrumentation completeness**: For each eval criterion from Step 1, verify the trace contains the data needed to evaluate it. If not, add more `@observe` decorators and re-run.
-
-**Do not proceed until you understand the data shape and have confirmed the traces capture everything your evaluators need.**
-
-> **Checkpoint**: Instrumentation added based on eval criteria. Reference trace captured with real data. For each criterion, confirm the trace contains the data needed to evaluate it. Proceed to Step 3.
+> **Checkpoint**: `pixie_qa/scripts/run_app.py` written and verified. `pixie_qa/reference-trace.jsonl` exists and all expected data points appear when formatted with `pixie format`. Do NOT read Step 3 instructions yet.

 ---

-### Step 3: Write a utility function to run the full app end-to-end
+### Step 3: Define evaluators

-**Why this step**: You need a function that test cases can call. Given an eval_input (app input + mock data for external dependencies), it starts the real application with external dependencies patched, sends the input through the app's real entry point, and returns the eval_output (app response + captured side-effects).
+> **Reference**: Read `references/3-define-evaluators.md` now for the detailed sub-steps.

-#### The contract
+**Goal**: Turn the qualitative eval criteria from Step 1b into concrete, runnable scoring functions. Each criterion maps to either a built-in evaluator or a custom one you implement. The evaluator mapping artifact bridges between criteria and the dataset, ensuring every quality dimension has a scorer.
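As a rough illustration of turning a criterion into a scorer and checking that every criterion has one, consider the sketch below. The `Evaluator` type, `concise_response`, and `unmapped_criteria` are hypothetical; pixie-qa's actual evaluator interface is described in the referenced file and may look quite different. The sentence splitter is deliberately crude (it would miscount text containing decimals or abbreviations).

```python
import re
from typing import Callable

# Hypothetical evaluator shape: app output text in, score in [0, 1] out.
# (Assumption for illustration; not pixie-qa's evaluator interface.)
Evaluator = Callable[[str], float]


def concise_response(output: str, max_sentences: int = 3) -> float:
    """Score 1.0 if the response has at most `max_sentences` sentences, else 0.0."""
    sentences = [s for s in re.split(r"[.!?]+", output) if s.strip()]
    return 1.0 if len(sentences) <= max_sentences else 0.0


# Criterion text (as written in the eval-criteria notes) mapped to its scorer.
EVALUATOR_MAPPING: dict[str, Evaluator] = {
    "Responses are short enough for a phone conversation": concise_response,
}


def unmapped_criteria(criteria: list[str], mapping: dict[str, Evaluator]) -> list[str]:
    """Return criteria that do not yet have an evaluator."""
    return [c for c in criteria if c not in mapping]
```

An empty result from `unmapped_criteria` is one way to confirm, before moving on, that no quality dimension was left without a scorer.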

-```
-run_app(eval_input) → eval_output
-```
-
- **eval_input** = application input (what the user sends) + data from external dependencies (what databases/APIs would return)
- **eval_output** = application output (what the user sees) + captured side-effects (what the app wrote to external systems, captured by mocks) + captured intermediate states (tool calls, routing decisions, etc., captured by instrumentation)
-
-#### How to implement
-
-1. **Patch external dependencies** — use the mocking plan from Step 1 item 4. For each external dependency, either inject a mock implementation of its interface (cleanest) or `unittest.mock.patch` the module-level client. The mock returns data from eval_input and captures side-effects for eval_output.
-
-2. **Call the app through its real entry point** — the same way a real user or client would invoke it. Look at how the app is started: if it's a web server (FastAPI, Flask), use `TestClient` or HTTP requests. If it's a CLI, use subprocess. If it's a standalone function with no server or middleware, import and call it directly.
-
-3. **Collect the response** — the app's output becomes eval_output, along with any side-effects captured by mock objects.
-
-Read `references/run-harness-patterns.md` for concrete examples of entry point invocation for different app types.
-
-**Do NOT call an inner function** like `agent.respond()` directly just because it's simpler. The whole point is to test the app's real code path — request handling, state management, prompt assembly, routing. When you call an inner function directly, you skip all of that, and the test has to reimplement it. Now you're testing test code, not app code.
-
-#### Verify
-
-Take the eval_input from your Step 2 reference trace and feed it to the utility function. The outputs won't match word-for-word (non-deterministic), but verify:
-
- **Same structure** — same fields present, same types, same nesting
- **Same code path** — same routing decisions, same intermediate states captured
- **Sensible values** — eval_output fields have real, meaningful data (not null, not empty, not error messages)
-
-**If it fails after two attempts**, stop and ask the user for help.
-
-> **Checkpoint**: Utility function implemented and verified. When fed the reference trace's eval_input, it produces eval_output with the same structure and exercises the same code path. Proceed to Step 4.
+> **Checkpoint**: All evaluators implemented. `pixie_qa/03-evaluator-mapping.md` written with criterion-to-evaluator mapping. Do NOT read Step 4 instructions yet.

 ---

 ### Step 4: Build the dataset

-**Why this step**: The dataset is a collection of eval_input items (made up by you) that define the test scenarios. Each item may also carry case-specific expectations. The eval_output is NOT pre-populated in the dataset — it's produced at test time by the utility function from Step 3.
+> **Reference**: Read `references/4-build-dataset.md` now for the detailed sub-steps.

-#### 4a. Determine verification and expectations
+**Goal**: Create the test scenarios that tie everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1b). Each dataset entry defines what to send to the app, what data the app should see from external services, and how to score the result. Use the reference trace from Step 2 as the source of truth for data shapes and field names.
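A small check like the following can help confirm that a dataset covers every use case before running tests. The entry fields used here (`use_case`, `input`, `external_data`) are assumptions made for this sketch, not the pixie-qa dataset schema; the real field names should come from the reference trace and the referenced file.

```python
import json
from collections import Counter


def use_case_coverage(dataset_json: str, use_cases: list[str]) -> dict[str, int]:
    """Count dataset entries per use case; a count of 0 marks a coverage gap."""
    entries = json.loads(dataset_json)
    counts = Counter(entry["use_case"] for entry in entries)
    return {use_case: counts.get(use_case, 0) for use_case in use_cases}


# Hypothetical dataset with two entries (field names are illustrative only).
EXAMPLE_DATASET = json.dumps(
    [
        {
            "use_case": "billing question",
            "input": "Why was I charged twice?",
            "external_data": {"customer_profile": {"name": "Ada", "plan": "pro"}},
        },
        {
            "use_case": "billing question",
            "input": "Can I get a refund?",
            "external_data": {"customer_profile": {"name": "Lin", "plan": "free"}},
        },
    ]
)

coverage = use_case_coverage(EXAMPLE_DATASET, ["billing question", "account transfer"])
missing = [use_case for use_case, count in coverage.items() if count == 0]
```

Here `missing` lists the use cases from Step 1b that still need at least one dataset entry.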

-Before generating data, decide how each eval criterion from Step 1 will be checked.
-
-**Examine the reference trace from Step 2** and identify:
-
- **Structural constraints** you can verify with code — JSON schema, required fields, value types, enum ranges, string length bounds. These become validation checks on your generated eval_inputs.
- **Semantic constraints** that require judgment — "the mock customer profile should be realistic", "the conversation history should be topically coherent". Apply these yourself when crafting the data.
- **Which criteria are universal vs. case-specific**:
-  - **Universal criteria** apply to ALL test cases the same way → implement in the test function (e.g., "responses must be under 3 sentences", "must not hallucinate information not in context")
-  - **Case-specific criteria** vary per test case → carry as `expected_output` in the dataset item (e.g., "should mention the caller's appointment on Tuesday", "should route to billing department")
-
-#### 4b. Generate eval_input items
-
-Create eval_input items that match the data shape from the reference trace:
-
- **Application inputs** (user queries, requests) — make these up to cover the scenarios you identified in Step 1
- **External dependency data** (database records, API responses, cache entries) — make these up in the exact shape you observed in the reference trace
-
-Each dataset item contains:
-
- `eval_input`: the made-up input data (app input + external dependency data)
- `expected_output`: case-specific expectation text (optional — only for test cases with expectations beyond the universal criteria). This is a reference for evaluation, not an exact expected answer.
-
-At test time, `eval_output` is produced by the utility function from Step 3 and is not stored in the dataset itself.
-Read `references/dataset-generation.md` for the dataset creation API, data shape matching, expected_output strategy, and validation checklist.
-
-#### 4c. Validate the dataset
-
-After building:
-
-1. **Execute `build_dataset.py`** — don't just write it, run it
-2. **Verify structural constraints** — each eval_input matches the reference trace's schema (same fields, same types)
-3. **Verify diversity** — items have meaningfully different inputs, not just minor variations
-4. **Verify case-specific expectations** — `expected_output` values are specific and testable, not vague
-5. For conversational apps, include items with conversation history
-
-> **Checkpoint**: Dataset created with diverse eval_inputs matching the reference trace's data shape. Proceed to Step 5.
+> **Checkpoint**: Dataset JSON created at `pixie_qa/datasets/<name>.json` with diverse entries covering all use cases. Do NOT read Step 5 instructions yet.

 ---

-### Step 5: Write and run eval tests
+### Step 5: Run evaluation-based tests

-**Why this step**: With the utility function built and the dataset ready, writing tests is straightforward — wire up the function, choose evaluators for each criterion, and run.
+> **Reference**: Read `references/5-run-tests.md` now for the detailed sub-steps.

-#### 5a. Map criteria to evaluators
+**Goal**: Execute the full pipeline end-to-end and verify it produces real scores. This step is about getting the machinery running — fixing any setup or data issues until every dataset entry runs and gets scored. Once tests produce results, run `pixie analyze` for pattern analysis.

-For each eval criterion from Step 1, decide how to evaluate it:
-
- **Can it be checked with a built-in evaluator?** (factual correctness → `FactualityEval`, exact match → `ExactMatchEval`, RAG faithfulness → `FaithfulnessEval`)
- **Does it need a custom evaluator?** Most app-specific criteria do — use `create_llm_evaluator` with a prompt that operationalizes the criterion.
- **Is it universal or case-specific?** Universal criteria go in the test function. Case-specific criteria use `expected_output` from the dataset.
-
-For open-ended LLM text, **never** use `ExactMatchEval` — LLM outputs are non-deterministic.
-
-`AnswerRelevancyEval` is **RAG-only** — it requires a `context` value in the trace. Returns 0.0 without it. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt.
-
-Read `references/eval-tests.md` for the evaluator catalog, custom evaluator examples, and the test file boilerplate.
-
-#### 5b. Write the test file and run
-
-The test file wires together: a `runnable` (calls your utility function from Step 3), a reference to the dataset, and the evaluators you chose.
-
-Read `references/eval-tests.md` for the exact `assert_dataset_pass` API, required parameter names, and common mistakes to avoid. **Re-read the API reference immediately before writing test code** — do not rely on earlier context.
-
-Run with `pixie test` — not `pytest`:
-
-```bash
-uv run pixie test pixie_qa/tests/ -v
-```
-
-**After running, verify the scorecard:**
-
-1. Shows "N/M tests passed" with real numbers
-2. Does NOT say "No assert_pass / assert_dataset_pass calls recorded" (that means missing `await`)
-3. Per-evaluator scores appear with real values
-
-A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.
-
-> **Checkpoint**: Tests run and produce real scores.
+> **Checkpoint**: Tests run and produce real scores. Analysis generated.
 >
-> - **Setup mode**: Report results ("QA setup is complete. Tests show N/M passing.") and ask: "Want me to investigate the failures and iterate?" Stop here unless the user says yes.
-> - **Iteration mode**: Proceed directly to Step 6.
+> If the test errors out, that's a setup bug — fix and re-run. But if tests produce real pass/fail scores, that's the deliverable.
 >
-> If the test errors out (import failures, missing keys), that's a setup bug — fix and re-run. But if tests produce real pass/fail scores, that's the deliverable.
+> **STOP GATE — read this before doing anything else after tests produce scores:**
+>
+> - If the user's original prompt asks only for setup ("set up QA", "add tests", "add evals", "set up evaluations"), **STOP HERE**. Report the test results to the user: "QA setup is complete. Tests show N/M passing. [brief summary]. Want me to investigate the failures and iterate?" Do NOT proceed to Step 6.
+> - If the user's original prompt explicitly asks for iteration ("fix", "improve", "debug", "iterate", "investigate failures", "make tests pass"), proceed to Step 6.

 ---

 ### Step 6: Investigate and iterate

-**Iteration mode only, or after the user confirmed in setup mode.**
-
-When tests fail, understand _why_ — don't just adjust thresholds until things pass.
-
-Read `references/investigation.md` for procedures and root-cause patterns.
-
-The cycle: investigate root cause → fix (prompt, code, or eval config) → rebuild dataset if needed → re-run tests → repeat.
+> **Reference**: Read `references/6-investigate.md` now — it has the stop/continue decision, analysis review, root-cause patterns, and investigation procedures. **Follow its instructions before doing any investigation work.**

 ---

-## Quick reference
+## Web Server Management

-### Imports
+pixie-qa runs a web server in the background for displaying context, traces, and eval results to the user. It's automatically started by the setup script (via `pixie start`, which launches a detached background process and returns immediately).

-```python
-from pixie import enable_storage, observe, assert_dataset_pass, ScoreThreshold, last_llm_call
-from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
-```
-
-Only `from pixie import ...` — never subpackages (`pixie.storage`, `pixie.evals`, etc.). There is no `pixie.qa` module.
-
-### CLI commands
+When the user is done with the eval-driven-dev workflow, tell them that the web server is still running and offer to stop it with:

 ```bash
-uv run pixie test pixie_qa/tests/ -v    # Run eval tests (NOT pytest)
-uv run pixie trace list                 # List captured traces
-uv run pixie trace last                 # Show most recent trace
-uv run pixie trace show <id> --verbose  # Show specific trace
-uv run pixie dataset create <name>      # Create a new dataset
+pixie stop
 ```

-### Directory layout
+IMPORTANT: once the web server is stopped, the web UI becomes inaccessible, so stop the server only if the user confirms they are done with all web UI features. If they want to keep using the web UI, do NOT stop the server.

-```
-pixie_qa/
-  MEMORY.md      # your understanding and eval plan
-  datasets/      # golden datasets (JSON)
-  tests/         # eval test files (test_*.py)
-  scripts/       # run_app.py, build_dataset.py
-```
-
-All pixie files go here — not at the project root, not in a top-level `tests/` directory.
-
-### Key concepts
-
- **eval_input** = application input + data from external dependencies
- **eval_output** = application output + captured side-effects + captured intermediate states (produced at test time by the utility function, NOT pre-populated in the dataset)
- **expected_output** = case-specific evaluation reference (optional per dataset item)
- **test function** = utility function (produces eval_output) + evaluators (check criteria)
-
-### Evaluator selection
-
-| Output type                           | Evaluator                                             | Notes                                                            |
-| ------------------------------------- | ----------------------------------------------------- | ---------------------------------------------------------------- |
-| Open-ended text with reference answer | `FactualityEval`, `ClosedQAEval`                      | Best default for most apps                                       |
-| Open-ended text, no reference         | `AnswerRelevancyEval`                                 | **RAG only** — needs `context` in trace. Returns 0.0 without it. |
-| Deterministic output                  | `ExactMatchEval`, `JSONDiffEval`                      | Never use for open-ended LLM text                                |
-| RAG with retrieved context            | `FaithfulnessEval`, `ContextRelevancyEval`            | Requires context capture in instrumentation                      |
-| Domain-specific quality               | `create_llm_evaluator(name=..., prompt_template=...)` | Custom LLM-as-judge — use for app-specific criteria              |
-
-### What goes where: SKILL.md vs references
-
-**This file** (SKILL.md) is loaded for the entire session. It contains the _what_ and _why_ — the reasoning, decision-making process, goals, and checkpoints for each step.
-
-**Reference files** are loaded when executing a specific step. They contain the _how_ — tactical API usage, code patterns, anti-patterns, troubleshooting, and ready-to-adapt examples.
-
-When in doubt: if it's about _deciding what to do_, it's in SKILL.md. If it's about _how to implement that decision_, it's in a reference file.
-
-### Reference files
-
-| Reference                            | When to read                                                                       |
-| ------------------------------------ | ---------------------------------------------------------------------------------- |
-| `references/understanding-app.md`    | Step 1 — investigating the codebase, MEMORY.md template                            |
-| `references/instrumentation.md`      | Step 2 — `@observe` and `enable_storage` rules, code patterns, anti-patterns       |
-| `references/run-harness-patterns.md` | Step 3 — examples of how to invoke different app types (web server, CLI, function) |
-| `references/dataset-generation.md`   | Step 4 — crafting eval_input items, expected_output strategy, validation           |
-| `references/eval-tests.md`           | Step 5 — evaluator selection, test file pattern, assert_dataset_pass API           |
-| `references/investigation.md`        | Step 6 — failure analysis, root-cause patterns                                     |
-| `references/pixie-api.md`            | Any step — full CLI and Python API reference                                       |
+Whenever you restart the workflow, run the `setup.sh` script in the skill's resources again to ensure the web server is running: