
Investigation and Iteration

This reference covers Step 6 of the eval-driven-dev process: investigating test failures, root-causing them, and iterating on fixes.


STOP — check before proceeding

Before doing any investigation or iteration work, you must decide whether to continue or stop and ask the user.

Continue immediately if the user's original prompt explicitly asked for iteration — look for words like "fix", "improve", "debug", "iterate", "investigate failures", or "make tests pass". In this case, proceed to the investigation steps below.

Otherwise, STOP here. Report the test results to the user:

"QA setup is complete. Tests show N/M passing. [brief summary of failures if any]. Want me to investigate the failures and iterate?"

Do not proceed with investigation until the user confirms. This is the default — most prompts like "set up evals", "add tests", "set up QA", or "add evaluations" are asking for setup only, not iteration.


Step-by-step investigation

When the user has confirmed (or their original prompt was explicitly about iteration), proceed:

1. Read the analysis

Start by reading the analysis generated in Step 5. The analysis files are at {PIXIE_ROOT}/results/<test_id>/dataset-<index>.md. These contain LLM-generated insights about patterns in successes and failures across your test run. Use the analysis to prioritize which failures to investigate first and to understand systemic issues.

2. Get detailed test output

pixie test -v    # shows score and reasoning per case

Capture the full verbose output. For each failing case, note:

  • The entry_kwargs (what was sent)
  • The captured output (what the app produced)
  • The expected_output (what was expected, if applicable)
  • The evaluator score and reasoning
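These four fields can be bundled into a simple record for the investigation log. A minimal sketch in plain Python — the field names mirror the list above, and the values are illustrative placeholders, not real pixie output:

```python
# Minimal sketch: bundle one failing case's details into a record for
# the investigation log. Field names mirror the bullet list above; the
# example values are placeholders, not real pixie output.

def make_failure_record(entry_kwargs, captured_output,
                        expected_output, score, reasoning):
    """Collect one failing case's details for later root-cause analysis."""
    return {
        "entry_kwargs": entry_kwargs,
        "captured_output": captured_output,
        "expected_output": expected_output,
        "evaluator_score": score,
        "evaluator_reasoning": reasoning,
    }

failures = [
    make_failure_record(
        {"user_message": "What rows have extra legroom?"},
        "I'm sorry, I don't have the exact row numbers...",
        "rows 5-8 Economy Plus with extra legroom",
        0.1,
        "Output claims not to know while the reference states rows 5-8",
    )
]
print(len(failures), failures[0]["evaluator_score"])
```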

3. Inspect the trace data

For each failing case, look up the full trace to see what happened inside the app:

from pixie import DatasetStore

store = DatasetStore()
ds = store.get("<dataset-name>")
for i, item in enumerate(ds.items):
    print(i, item.eval_metadata)   # trace_id is here

Then inspect the full span tree:

import asyncio
from pixie import ObservationStore

async def inspect(trace_id: str):
    store = ObservationStore()
    roots = await store.get_trace(trace_id)
    for root in roots:
        print(root.to_text())   # full span tree: inputs, outputs, LLM messages

asyncio.run(inspect("the-trace-id-here"))

4. Root-cause analysis

Walk through the trace and identify exactly where the failure originates. Common patterns:

LLM-related failures (fix with prompt/model/eval changes):

| Symptom | Likely cause |
| --- | --- |
| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully |
| Agent routes to wrong tool/handoff | Routing prompt or handoff descriptions are ambiguous |
| Output format is wrong | Missing format instructions in prompt |
| LLM hallucinated instead of using tool | Prompt doesn't enforce tool usage |

Non-LLM failures (fix with traditional code changes, out of eval scope):

| Symptom | Likely cause |
| --- | --- |
| Tool returned wrong data | Bug in tool implementation — fix the tool, not the eval |
| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code |
| Database returned stale/wrong records | Data issue — fix independently |
| API call failed with error | Infrastructure issue |

For non-LLM failures: note them in the investigation log and recommend the code fix, but do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code. The eval test should measure LLM quality assuming the rest of the system works correctly.
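The LLM vs. non-LLM triage can be sketched as a small lookup. The root-cause labels below are hypothetical — in practice you assign them manually after reading the trace:

```python
# Minimal sketch of the LLM vs. non-LLM triage described above.
# Root-cause labels are hypothetical; a real investigation assigns them
# manually after walking through the trace.

LLM_CAUSES = {"prompt_missing_instruction", "ambiguous_routing",
              "missing_format_spec", "hallucination"}
NON_LLM_CAUSES = {"tool_bug", "tool_selection_bug",
                  "stale_data", "infra_error"}

def fix_channel(root_cause: str) -> str:
    """Map a root-cause label to the kind of fix it calls for."""
    if root_cause in LLM_CAUSES:
        return "prompt/model/eval change"
    if root_cause in NON_LLM_CAUSES:
        return "code fix (do not adjust eval thresholds)"
    return "needs manual review"

print(fix_channel("tool_bug"))
print(fix_channel("hallucination"))
```

The point of the hard split is the rule stated above: eval expectations measure LLM quality and should never be loosened to absorb bugs elsewhere in the system.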

5. Document findings

Every failure investigation should be documented alongside the fix. Include:

### <date> — failure investigation

**Dataset**: `qa-golden-set`
**Result**: 3/5 cases passed (60%)

#### Failing case 1: "What rows have extra legroom?"

- **entry_kwargs**: `{"user_message": "What rows have extra legroom?"}`
- **Captured output**: "I'm sorry, I don't have the exact row numbers for extra legroom..."
- **expected_output**: "rows 5-8 Economy Plus with extra legroom"
- **Evaluator score**: 0.1 (Factuality)
- **Evaluator reasoning**: "The output claims not to know the answer while the reference clearly states rows 5-8..."

**Trace analysis**:
Inspected trace `abc123`. The span tree shows:

1. Triage Agent routed to FAQ Agent ✓
2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")`
3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← **root cause**

**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
The question "What rows have extra legroom?" contains none of these keywords, so it
falls through to the default "I don't know" response.

**Classification**: Non-LLM failure — the keyword-matching tool is broken.
The LLM agent correctly routed to the FAQ agent and used the tool; the tool
itself returned wrong data.

**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix,
not an eval/prompt change.

**Verification**: After fix, re-run:

```bash
pixie test -v      # verify
```
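The keyword-matching failure in the example above can be reproduced in miniature. This sketch is illustrative only — it is not the real `faq_lookup_tool`, just the failure mode it describes: the seat FAQ fires only on certain keywords, so a legroom question falls through to the default answer until the keyword list is extended.

```python
# Illustrative reproduction of the keyword-matching bug from the
# investigation above. Not the real faq_lookup_tool -- only the failure
# mode: the seat FAQ entry matches on keywords, so a question phrased
# around "rows" and "legroom" falls through to the default response.

SEAT_ANSWER = "Rows 5-8 are Economy Plus with extra legroom."

def faq_lookup(question: str, keywords: list[str]) -> str:
    words = question.lower().replace("?", "").split()
    if any(kw in words for kw in keywords):
        return SEAT_ANSWER
    return "I'm sorry, I don't know."

before_fix = ["seat", "seats", "seating", "plane"]
after_fix = before_fix + ["row", "rows", "legroom"]

question = "What rows have extra legroom?"
print(faq_lookup(question, before_fix))  # falls through to the default
print(faq_lookup(question, after_fix))   # now matches "rows"/"legroom"
```

This is why the failure is classified as non-LLM: the fix is a code change to the keyword list, not a prompt or eval adjustment.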

6. Fix and re-run

Make the targeted change, update the dataset if needed, and re-run:

pixie test -v

After fixes stabilize, run analysis again to see if the patterns have changed:

pixie analyze <new_test_id>

The iteration cycle

  1. Read analysis from Step 5 → prioritize failures
  2. Run tests verbose → identify specific failures
  3. Investigate each failure → classify as LLM vs. non-LLM
  4. For LLM failures: adjust prompts, model, or eval criteria
  5. For non-LLM failures: recommend or apply code fix
  6. Update dataset if the fix changed app behavior
  7. Re-run tests and analysis
  8. Repeat until passing or user is satisfied
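The cycle above can be sketched as a loop with explicit exit conditions. The `run_tests` and `apply_fixes` callables are hypothetical stand-ins for `pixie test -v` plus manual investigation, shown only to make the loop structure concrete:

```python
# Hedged sketch of the iteration cycle. run_tests and apply_fixes are
# hypothetical stand-ins for `pixie test -v` plus manual fixes; the loop
# shows the exit conditions: all cases pass, or a round budget runs out.

def iterate(run_tests, apply_fixes, max_rounds: int = 5):
    """Repeat test -> investigate -> fix until passing or out of rounds."""
    for round_no in range(1, max_rounds + 1):
        passed, total = run_tests()
        if passed == total:
            return round_no, True
        apply_fixes()
    return max_rounds, False

# Toy usage: a suite of 5 cases starting at 3 passing; each "fix"
# repairs one more case, so the loop succeeds on round 3.
state = {"passed": 3}
rounds, ok = iterate(
    run_tests=lambda: (state["passed"], 5),
    apply_fixes=lambda: state.__setitem__("passed", state["passed"] + 1),
)
print(rounds, ok)
```

In practice the round budget matters: if the suite is not converging after a few rounds, that itself is a finding to report back to the user rather than something to loop on indefinitely.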