# Investigation and Iteration
This reference covers Step 6 of the eval-driven-dev process: investigating test failures, root-causing them, and iterating on fixes.
---
## STOP — check before proceeding
**Before doing any investigation or iteration work, you must decide whether to continue or stop and ask the user.**
**Continue immediately** if the user's original prompt explicitly asked for iteration — look for words like "fix", "improve", "debug", "iterate", "investigate failures", or "make tests pass". In this case, proceed to the investigation steps below.
**Otherwise, STOP here.** Report the test results to the user:
> "QA setup is complete. Tests show N/M passing. [brief summary of failures if any]. Want me to investigate the failures and iterate?"
**Do not proceed with investigation until the user confirms.** This is the default — most prompts like "set up evals", "add tests", "set up QA", or "add evaluations" are asking for setup only, not iteration.
---
## Step-by-step investigation
When the user has confirmed (or their original prompt was explicitly about iteration), proceed:
### 1. Read the analysis
Start by reading the analysis generated in Step 5. The analysis files are at `{PIXIE_ROOT}/results/<test_id>/dataset-<index>.md`. These contain LLM-generated insights about patterns in successes and failures across your test run. Use the analysis to prioritize which failures to investigate first and to understand systemic issues.
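A quick way to pull those files up is a small script. This is a minimal sketch, assuming `PIXIE_ROOT` is exposed as an environment variable and that you substitute the real test id from Step 5; adjust the path resolution to your setup:

```python
# Sketch: list and print the Step 5 analysis files for one test run.
# PIXIE_ROOT and the placeholder test id are assumptions -- substitute your own.
import os
from pathlib import Path

pixie_root = Path(os.environ.get("PIXIE_ROOT", "."))
test_id = "your-test-id"  # replace with the actual test id from Step 5

analysis_dir = pixie_root / "results" / test_id
for analysis_file in sorted(analysis_dir.glob("dataset-*.md")):
    print(f"--- {analysis_file.name} ---")
    print(analysis_file.read_text())
```

Reading every `dataset-<index>.md` up front makes it easier to spot failures that share a root cause before diving into individual traces.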
### 2. Get detailed test output
```bash
pixie test -v # shows score and reasoning per case
```
Capture the full verbose output. For each failing case, note:
- The `entry_kwargs` (what was sent)
- The captured output (what the app produced)
- The `expected_output` (what was expected, if applicable)
- The evaluator score and reasoning
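The fields above can be captured in a small record for the investigation log. A minimal sketch — the field names mirror the verbose output but are illustrative, not part of pixie's API:

```python
# Hypothetical record for one failing case; populate it from `pixie test -v`.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class FailingCase:
    entry_kwargs: dict[str, Any]    # what was sent
    captured_output: str            # what the app produced
    expected_output: Optional[str]  # what was expected, if applicable
    score: float                    # evaluator score
    reasoning: str                  # evaluator reasoning

case = FailingCase(
    entry_kwargs={"user_message": "What rows have extra legroom?"},
    captured_output="I'm sorry, I don't know...",
    expected_output="rows 5-8 Economy Plus with extra legroom",
    score=0.1,
    reasoning="Output claims not to know while the reference states rows 5-8.",
)
print(case.score)
```

Keeping one record per failing case makes the Step 5 write-up below mostly mechanical.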
### 3. Inspect the trace data
For each failing case, look up the full trace to see what happened inside the app:
```python
from pixie import DatasetStore
store = DatasetStore()
ds = store.get("<dataset-name>")
for i, item in enumerate(ds.items):
    print(i, item.eval_metadata)  # trace_id is here
```
Then inspect the full span tree:
```python
import asyncio
from pixie import ObservationStore
async def inspect(trace_id: str):
    store = ObservationStore()
    roots = await store.get_trace(trace_id)
    for root in roots:
        print(root.to_text())  # full span tree: inputs, outputs, LLM messages
asyncio.run(inspect("the-trace-id-here"))
```
### 4. Root-cause analysis
Walk through the trace and identify exactly where the failure originates. Common patterns:
**LLM-related failures** (fix with prompt/model/eval changes):
| Symptom | Likely cause |
| ------------------------------------------------------ | ------------------------------------------------------------- |
| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully |
| Agent routes to wrong tool/handoff | Routing prompt or handoff descriptions are ambiguous |
| Output format is wrong | Missing format instructions in prompt |
| LLM hallucinated instead of using tool | Prompt doesn't enforce tool usage |
**Non-LLM failures** (fix with traditional code changes, out of eval scope):
| Symptom | Likely cause |
| ------------------------------------------------- | ------------------------------------------------------- |
| Tool returned wrong data | Bug in tool implementation — fix the tool, not the eval |
| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code |
| Database returned stale/wrong records | Data issue — fix independently |
| API call failed with error | Infrastructure issue |
For non-LLM failures: note them in the investigation log and recommend the code fix, but **do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code**. The eval test should measure LLM quality assuming the rest of the system works correctly.
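That rule can be stated as a tiny guard — an illustrative helper (not a pixie API) that maps a classification to the fix types it permits, assuming the failure has already been classified:

```python
# Illustrative guard: eval/prompt changes are only valid remedies for LLM failures.
def allowed_fixes(classification: str) -> set[str]:
    if classification == "llm":
        return {"prompt", "model", "eval-criteria"}
    # Non-LLM: fix the tool, data, or infrastructure -- never relax the eval.
    return {"code"}

assert "eval-criteria" not in allowed_fixes("non-llm")
```

The point is the asymmetry: a non-LLM root cause never justifies loosening a threshold or rewriting an expected output.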
### 5. Document findings
**Every failure investigation should be documented** alongside the fix. Include:
````markdown
### <date> — failure investigation
**Dataset**: `qa-golden-set`
**Result**: 3/5 cases passed (60%)
#### Failing case 1: "What rows have extra legroom?"
- **entry_kwargs**: `{"user_message": "What rows have extra legroom?"}`
- **Captured output**: "I'm sorry, I don't have the exact row numbers for extra legroom..."
- **expected_output**: "rows 5-8 Economy Plus with extra legroom"
- **Evaluator score**: 0.1 (Factuality)
- **Evaluator reasoning**: "The output claims not to know the answer while the reference clearly states rows 5-8..."
**Trace analysis**:
Inspected trace `abc123`. The span tree shows:
1. Triage Agent routed to FAQ Agent ✓
2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓
3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← **root cause**
**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
The question "What rows have extra legroom?" contains none of these keywords, so it
falls through to the default "I don't know" response.
**Classification**: Non-LLM failure — the keyword-matching tool is broken.
The LLM agent correctly routed to the FAQ agent and used the tool; the tool
itself returned wrong data.
**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix,
not an eval/prompt change.
**Verification**: After fix, re-run:
```bash
pixie test -v # verify
```
````
### 6. Fix and re-run
Make the targeted change, update the dataset if needed, and re-run:
```bash
pixie test -v
```
After fixes stabilize, run analysis again to see if the patterns have changed:
```bash
pixie analyze <new_test_id>
```
---
## The iteration cycle
1. Read analysis from Step 5 → prioritize failures
2. Run tests verbose → identify specific failures
3. Investigate each failure → classify as LLM vs. non-LLM
4. For LLM failures: adjust prompts, model, or eval criteria
5. For non-LLM failures: recommend or apply code fix
6. Update dataset if the fix changed app behavior
7. Re-run tests and analysis
8. Repeat until passing or user is satisfied
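The outer loop can be sketched as a script. This is hedged: it assumes `pixie test` exits non-zero while cases are failing, which you should confirm for your pixie version, and `MAX_ROUNDS` is an arbitrary safety cap:

```python
# Illustrative outer iteration loop; the exit-code convention and round cap
# are assumptions, not documented pixie behavior.
import shutil
import subprocess

MAX_ROUNDS = 5

if shutil.which("pixie") is None:
    print("pixie CLI not found on PATH")
else:
    for round_num in range(1, MAX_ROUNDS + 1):
        result = subprocess.run(["pixie", "test", "-v"], capture_output=True, text=True)
        print(f"--- round {round_num} ---")
        print(result.stdout)
        if result.returncode == 0:
            break  # all cases passing
        # Investigate failures, classify LLM vs. non-LLM, fix, update the
        # dataset if behavior changed, then loop.
```

In practice each round's "investigate and fix" step is manual; the script only automates the re-run-and-check bookkeeping.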