Investigation and Iteration
This reference covers Step 6 of the eval-driven-dev process: investigating test failures, root-causing them, and iterating on fixes.
STOP — check before proceeding
Before doing any investigation or iteration work, you must decide whether to continue or stop and ask the user.
Continue immediately if the user's original prompt explicitly asked for iteration — look for words like "fix", "improve", "debug", "iterate", "investigate failures", or "make tests pass". In this case, proceed to the investigation steps below.
Otherwise, STOP here. Report the test results to the user:
"QA setup is complete. Tests show N/M passing. [brief summary of failures if any]. Want me to investigate the failures and iterate?"
Do not proceed with investigation until the user confirms. This is the default — most prompts like "set up evals", "add tests", "set up QA", or "add evaluations" are asking for setup only, not iteration.
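The gate above can be sketched as a small check. This is only a rough illustration: `asks_for_iteration` and `setup_report` are hypothetical helper names, and literal phrase matching is a crude stand-in for actually reading the user's intent.

```python
import re

# Phrases taken from the guidance above. Matching them literally is a crude
# heuristic (for example, "fixing" would not match "fix").
ITERATION_PHRASES = (
    "fix",
    "improve",
    "debug",
    "iterate",
    "investigate failures",
    "make tests pass",
)


def asks_for_iteration(prompt: str) -> bool:
    """Return True when the prompt contains an iteration phrase as whole words."""
    text = prompt.lower()
    return any(
        re.search(rf"\b{re.escape(phrase)}\b", text) for phrase in ITERATION_PHRASES
    )


def setup_report(passed: int, total: int, failure_summary: str = "") -> str:
    """Build the stop-and-ask message shown to the user after setup."""
    parts = ["QA setup is complete.", f"Tests show {passed}/{total} passing."]
    if failure_summary:
        parts.append(failure_summary.rstrip(".") + ".")
    parts.append("Want me to investigate the failures and iterate?")
    return " ".join(parts)
```

When `asks_for_iteration` returns `False`, the report from `setup_report` is sent and investigation waits for the user's confirmation.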
Step-by-step investigation
When the user has confirmed (or their original prompt was explicitly about iteration), proceed:
1. Read the analysis
Start by reading the analysis generated in Step 5. The analysis files are at {PIXIE_ROOT}/results/<test_id>/dataset-<index>.md. These contain LLM-generated insights about patterns in successes and failures across your test run. Use the analysis to prioritize which failures to investigate first and to understand systemic issues.
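To collect the analysis files for one test run, a minimal sketch along these lines may help. It assumes the directory layout described above and takes the root directory as a plain argument; `analysis_files` is a hypothetical helper, not part of pixie.

```python
from pathlib import Path


def analysis_files(pixie_root: str, test_id: str) -> list[Path]:
    """List dataset-<index>.md files for a test run, ordered by numeric index."""
    results_dir = Path(pixie_root) / "results" / test_id

    def index_of(path: Path) -> int:
        # "dataset-12" -> 12; anything unexpected sorts first.
        suffix = path.stem.split("-", 1)[1]
        return int(suffix) if suffix.isdigit() else -1

    return sorted(results_dir.glob("dataset-*.md"), key=index_of)
```

Sorting by the numeric index keeps `dataset-10.md` after `dataset-2.md`, which a plain string sort would not.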
2. Get detailed test output
```bash
pixie test -v  # shows score and reasoning per case
```
Capture the full verbose output. For each failing case, note:
- The `entry_kwargs` (what was sent)
- The captured output (what the app produced)
- The `expected_output` (what was expected, if applicable)
- The evaluator score and reasoning
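One way to keep the full verbose output around for later reference is to write it to a log file. The sketch below is an assumption about workflow, not a pixie feature: `capture_verbose_run` is a hypothetical helper that wraps the `pixie test -v` command with the standard library.

```python
import subprocess
from pathlib import Path


def capture_verbose_run(log_path: str, command=("pixie", "test", "-v")) -> int:
    """Run the test command, save stdout and stderr to log_path, return the exit code."""
    completed = subprocess.run(
        list(command), capture_output=True, text=True, check=False
    )
    Path(log_path).write_text(completed.stdout + completed.stderr, encoding="utf-8")
    return completed.returncode
```

`check=False` is deliberate: a run with failing cases may exit non-zero, and that is exactly the output worth keeping.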
3. Inspect the trace data
For each failing case, look up the full trace to see what happened inside the app:
```python
from pixie import DatasetStore

store = DatasetStore()
ds = store.get("<dataset-name>")
for i, item in enumerate(ds.items):
    print(i, item.eval_metadata)  # trace_id is here
```
Then inspect the full span tree:
```python
import asyncio

from pixie import ObservationStore


async def inspect(trace_id: str):
    store = ObservationStore()
    roots = await store.get_trace(trace_id)
    for root in roots:
        print(root.to_text())  # full span tree: inputs, outputs, LLM messages


asyncio.run(inspect("the-trace-id-here"))
```
4. Root-cause analysis
Walk through the trace and identify exactly where the failure originates. Common patterns:
LLM-related failures (fix with prompt/model/eval changes):
| Symptom | Likely cause |
|---|---|
| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully |
| Agent routes to wrong tool/handoff | Routing prompt or handoff descriptions are ambiguous |
| Output format is wrong | Missing format instructions in prompt |
| LLM hallucinated instead of using tool | Prompt doesn't enforce tool usage |
Non-LLM failures (fix with traditional code changes, out of eval scope):
| Symptom | Likely cause |
|---|---|
| Tool returned wrong data | Bug in tool implementation — fix the tool, not the eval |
| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code |
| Database returned stale/wrong records | Data issue — fix independently |
| API call failed with error | Infrastructure issue |
For non-LLM failures: note them in the investigation log and recommend the code fix, but do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code. The eval test should measure LLM quality assuming the rest of the system works correctly.
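The two tables and the rule above can be condensed into a small decision sketch. Everything here is hypothetical and simplified: `TraceFacts` stands for observations you would read off the trace by hand, and the real classification often needs more judgment than four booleans.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    LLM = "llm"
    NON_LLM = "non-llm"


@dataclass
class TraceFacts:
    """Observations read manually from the span tree of one failing case."""
    tool_called: bool
    tool_selected_by_code: bool  # True when routing is keyword- or rule-based
    tool_errored: bool
    tool_output_correct: bool


def classify(facts: TraceFacts) -> FailureKind:
    """Classify a failure as LLM-related or non-LLM, following the tables above."""
    if not facts.tool_called:
        # Skipped by code logic is a code bug; skipped by the LLM is a prompt issue.
        return FailureKind.NON_LLM if facts.tool_selected_by_code else FailureKind.LLM
    if facts.tool_errored or not facts.tool_output_correct:
        return FailureKind.NON_LLM
    # The tool behaved, so the remaining suspect is the LLM's use of its result.
    return FailureKind.LLM


def may_adjust_eval(kind: FailureKind) -> bool:
    """Eval expectations and thresholds are only in play for LLM failures."""
    return kind is FailureKind.LLM
```

For a case where the tool was called but returned wrong data, `classify` yields `FailureKind.NON_LLM` and `may_adjust_eval` returns `False`, matching the rule that evals are not loosened to accommodate code bugs.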
5. Document findings
Every failure investigation should be documented alongside the fix. Include:
### <date> — failure investigation
**Dataset**: `qa-golden-set`
**Result**: 3/5 cases passed (60%)
#### Failing case 1: "What rows have extra legroom?"
- **entry_kwargs**: `{"user_message": "What rows have extra legroom?"}`
- **captured output**: "I'm sorry, I don't have the exact row numbers for extra legroom..."
- **expected_output**: "rows 5-8 Economy Plus with extra legroom"
- **Evaluator score**: 0.1 (Factuality)
- **Evaluator reasoning**: "The output claims not to know the answer while the reference clearly states rows 5-8..."
**Trace analysis**:
Inspected trace `abc123`. The span tree shows:
1. Triage Agent routed to FAQ Agent ✓
2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓
3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← **root cause**
**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
The question "What rows have extra legroom?" contains none of these keywords, so it
falls through to the default "I don't know" response.
**Classification**: Non-LLM failure — the keyword-matching tool is broken.
The LLM agent correctly routed to the FAQ agent and used the tool; the tool
itself returned wrong data.
**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix,
not an eval/prompt change.
**Verification**: After fix, re-run:
```bash
pixie test -v # verify
```
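If you log many failing cases, generating the per-case bullets keeps entries consistent. The following is a minimal sketch that produces roughly the shape of the template above; `FailingCase` and `render_case` are hypothetical names, and the trace analysis and root-cause sections are still written by hand.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FailingCase:
    title: str
    entry_kwargs: dict[str, Any]
    captured_output: str
    expected_output: Optional[str]
    score: float
    evaluator: str
    reasoning: str


def render_case(number: int, case: FailingCase) -> str:
    """Render the header and bullet list for one failing case as Markdown."""
    lines = [
        f'#### Failing case {number}: "{case.title}"',
        f"- **entry_kwargs**: `{json.dumps(case.entry_kwargs)}`",
        f'- **captured output**: "{case.captured_output}"',
    ]
    if case.expected_output is not None:
        # expected_output only applies when the dataset item has a reference.
        lines.append(f'- **expected_output**: "{case.expected_output}"')
    lines.append(f"- **Evaluator score**: {case.score} ({case.evaluator})")
    lines.append(f'- **Evaluator reasoning**: "{case.reasoning}"')
    return "\n".join(lines)
```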
6. Fix and re-run
Make the targeted change, update the dataset if needed, and re-run:
```bash
pixie test -v
```
After fixes stabilize, run analysis again to see if the patterns have changed:
```bash
pixie analyze <new_test_id>
```
The iteration cycle
- Read analysis from Step 5 → prioritize failures
- Run tests verbose → identify specific failures
- Investigate each failure → classify as LLM vs. non-LLM
- For LLM failures: adjust prompts, model, or eval criteria
- For non-LLM failures: recommend or apply code fix
- Update dataset if the fix changed app behavior
- Re-run tests and analysis
- Repeat until passing or user is satisfied
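The cycle above can be sketched as a bounded loop. This is an illustration under stated assumptions, not a pixie API: `run_tests` and `fix_case` are hypothetical callables you would supply (for example, one that runs the tests and returns the IDs of failing cases), and the `max_rounds` cap is an added safeguard standing in for "until the user is satisfied".

```python
from typing import Callable


def iterate_until_passing(
    run_tests: Callable[[], list[str]],
    fix_case: Callable[[str], None],
    max_rounds: int = 5,
) -> tuple[bool, int]:
    """Re-run tests and apply fixes until nothing fails or max_rounds is reached.

    Returns (all_passing, number_of_test_runs_in_the_loop).
    """
    for round_number in range(1, max_rounds + 1):
        failing = run_tests()
        if not failing:
            return True, round_number
        for case_id in failing:
            # Investigate, classify, and fix (or recommend a fix for) each case.
            fix_case(case_id)
    # Out of rounds: one last run decides the final status.
    return not run_tests(), max_rounds
```

Returning instead of looping forever leaves the decision to continue with the user once the round budget is spent.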