# Investigation and Iteration This reference covers Step 6 of the eval-driven-dev process: investigating test failures, root-causing them, and iterating on fixes. --- ## STOP — check before proceeding **Before doing any investigation or iteration work, you must decide whether to continue or stop and ask the user.** **Continue immediately** if the user's original prompt explicitly asked for iteration — look for words like "fix", "improve", "debug", "iterate", "investigate failures", or "make tests pass". In this case, proceed to the investigation steps below. **Otherwise, STOP here.** Report the test results to the user: > "QA setup is complete. Tests show N/M passing. [brief summary of failures if any]. Want me to investigate the failures and iterate?" **Do not proceed with investigation until the user confirms.** This is the default — most prompts like "set up evals", "add tests", "set up QA", or "add evaluations" are asking for setup only, not iteration. --- ## Step-by-step investigation When the user has confirmed (or their original prompt was explicitly about iteration), proceed: ### 1. Read the analysis Start by reading the analysis generated in Step 5. The analysis files are at `{PIXIE_ROOT}/results//dataset-.md`. These contain LLM-generated insights about patterns in successes and failures across your test run. Use the analysis to prioritize which failures to investigate first and to understand systemic issues. ### 2. Get detailed test output ```bash pixie test -v # shows score and reasoning per case ``` Capture the full verbose output. For each failing case, note: - The `entry_kwargs` (what was sent) - The `the captured output` (what the app produced) - The `expected_output` (what was expected, if applicable) - The evaluator score and reasoning ### 3. Inspect the trace data For each failing case, look up the full trace to see what happened inside the app: ```python from pixie import DatasetStore store = DatasetStore() ds = store.get("") for i, item in enumerate(ds.items): print(i, item.eval_metadata) # trace_id is here ``` Then inspect the full span tree: ```python import asyncio from pixie import ObservationStore async def inspect(trace_id: str): store = ObservationStore() roots = await store.get_trace(trace_id) for root in roots: print(root.to_text()) # full span tree: inputs, outputs, LLM messages asyncio.run(inspect("the-trace-id-here")) ``` ### 4. Root-cause analysis Walk through the trace and identify exactly where the failure originates. Common patterns: **LLM-related failures** (fix with prompt/model/eval changes): | Symptom | Likely cause | | ------------------------------------------------------ | ------------------------------------------------------------- | | Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully | | Agent routes to wrong tool/handoff | Routing prompt or handoff descriptions are ambiguous | | Output format is wrong | Missing format instructions in prompt | | LLM hallucinated instead of using tool | Prompt doesn't enforce tool usage | **Non-LLM failures** (fix with traditional code changes, out of eval scope): | Symptom | Likely cause | | ------------------------------------------------- | ------------------------------------------------------- | | Tool returned wrong data | Bug in tool implementation — fix the tool, not the eval | | Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code | | Database returned stale/wrong records | Data issue — fix independently | | API call failed with error | Infrastructure issue | For non-LLM failures: note them in the investigation log and recommend the code fix, but **do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code**. The eval test should measure LLM quality assuming the rest of the system works correctly. ### 5. Document findings **Every failure investigation should be documented** alongside the fix. Include: ````markdown ### — failure investigation **Dataset**: `qa-golden-set` **Result**: 3/5 cases passed (60%) #### Failing case 1: "What rows have extra legroom?" - **entry_kwargs**: `{"user_message": "What rows have extra legroom?"}` - **the captured output**: "I'm sorry, I don't have the exact row numbers for extra legroom..." - **expected_output**: "rows 5-8 Economy Plus with extra legroom" - **Evaluator score**: 0.1 (Factuality) - **Evaluator reasoning**: "The output claims not to know the answer while the reference clearly states rows 5-8..." **Trace analysis**: Inspected trace `abc123`. The span tree shows: 1. Triage Agent routed to FAQ Agent ✓ 2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓ 3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← **root cause** **Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching. The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`. The question "What rows have extra legroom?" contains none of these keywords, so it falls through to the default "I don't know" response. **Classification**: Non-LLM failure — the keyword-matching tool is broken. The LLM agent correctly routed to the FAQ agent and used the tool; the tool itself returned wrong data. **Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in `faq_lookup_tool` (customer_service.py:130). This is a traditional code fix, not an eval/prompt change. **Verification**: After fix, re-run: ```bash pixie test -v # verify ``` ```` ### 6. Fix and re-run Make the targeted change, update the dataset if needed, and re-run: ```bash pixie test -v ``` After fixes stabilize, run analysis again to see if the patterns have changed: ```bash pixie analyze ``` --- ## The iteration cycle 1. Read analysis from Step 6 → prioritize failures 2. Run tests verbose → identify specific failures 3. Investigate each failure → classify as LLM vs. non-LLM 4. For LLM failures: adjust prompts, model, or eval criteria 5. For non-LLM failures: recommend or apply code fix 6. Update dataset if the fix changed app behavior 7. Re-run tests and analysis 8. Repeat until passing or user is satisfied