mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-12 11:15:56 +00:00
* update eval-driven-dev skill * small refinement of skill description * address review, rerun npm start.
165 lines
6.6 KiB
Markdown
165 lines
6.6 KiB
Markdown
# Investigation and Iteration
|
|
|
|
This reference covers Step 6 of the eval-driven-dev process: investigating test failures, root-causing them, and iterating on fixes.
|
|
|
|
---
|
|
|
|
## STOP — check before proceeding
|
|
|
|
**Before doing any investigation or iteration work, you must decide whether to continue or stop and ask the user.**
|
|
|
|
**Continue immediately** if the user's original prompt explicitly asked for iteration — look for words like "fix", "improve", "debug", "iterate", "investigate failures", or "make tests pass". In this case, proceed to the investigation steps below.
|
|
|
|
**Otherwise, STOP here.** Report the test results to the user:
|
|
|
|
> "QA setup is complete. Tests show N/M passing. [brief summary of failures if any]. Want me to investigate the failures and iterate?"
|
|
|
|
**Do not proceed with investigation until the user confirms.** This is the default — most prompts like "set up evals", "add tests", "set up QA", or "add evaluations" are asking for setup only, not iteration.
|
|
|
|
---
|
|
|
|
## Step-by-step investigation
|
|
|
|
When the user has confirmed (or their original prompt was explicitly about iteration), proceed:
|
|
|
|
### 1. Read the analysis
|
|
|
|
Start by reading the analysis generated in Step 5. The analysis files are at `{PIXIE_ROOT}/results/<test_id>/dataset-<index>.md`. These contain LLM-generated insights about patterns in successes and failures across your test run. Use the analysis to prioritize which failures to investigate first and to understand systemic issues.
|
|
|
|
### 2. Get detailed test output
|
|
|
|
```bash
|
|
pixie test -v # shows score and reasoning per case
|
|
```
|
|
|
|
Capture the full verbose output. For each failing case, note:
|
|
|
|
- The `entry_kwargs` (what was sent)
|
|
- The `the captured output` (what the app produced)
|
|
- The `expected_output` (what was expected, if applicable)
|
|
- The evaluator score and reasoning
|
|
|
|
### 3. Inspect the trace data
|
|
|
|
For each failing case, look up the full trace to see what happened inside the app:
|
|
|
|
```python
|
|
from pixie import DatasetStore
|
|
|
|
store = DatasetStore()
|
|
ds = store.get("<dataset-name>")
|
|
for i, item in enumerate(ds.items):
|
|
print(i, item.eval_metadata) # trace_id is here
|
|
```
|
|
|
|
Then inspect the full span tree:
|
|
|
|
```python
|
|
import asyncio
|
|
from pixie import ObservationStore
|
|
|
|
async def inspect(trace_id: str):
|
|
store = ObservationStore()
|
|
roots = await store.get_trace(trace_id)
|
|
for root in roots:
|
|
print(root.to_text()) # full span tree: inputs, outputs, LLM messages
|
|
|
|
asyncio.run(inspect("the-trace-id-here"))
|
|
```
|
|
|
|
### 4. Root-cause analysis
|
|
|
|
Walk through the trace and identify exactly where the failure originates. Common patterns:
|
|
|
|
**LLM-related failures** (fix with prompt/model/eval changes):
|
|
|
|
| Symptom | Likely cause |
|
|
| ------------------------------------------------------ | ------------------------------------------------------------- |
|
|
| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully |
|
|
| Agent routes to wrong tool/handoff | Routing prompt or handoff descriptions are ambiguous |
|
|
| Output format is wrong | Missing format instructions in prompt |
|
|
| LLM hallucinated instead of using tool | Prompt doesn't enforce tool usage |
|
|
|
|
**Non-LLM failures** (fix with traditional code changes, out of eval scope):
|
|
|
|
| Symptom | Likely cause |
|
|
| ------------------------------------------------- | ------------------------------------------------------- |
|
|
| Tool returned wrong data | Bug in tool implementation — fix the tool, not the eval |
|
|
| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code |
|
|
| Database returned stale/wrong records | Data issue — fix independently |
|
|
| API call failed with error | Infrastructure issue |
|
|
|
|
For non-LLM failures: note them in the investigation log and recommend the code fix, but **do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code**. The eval test should measure LLM quality assuming the rest of the system works correctly.
|
|
|
|
### 5. Document findings
|
|
|
|
**Every failure investigation should be documented** alongside the fix. Include:
|
|
|
|
````markdown
|
|
### <date> — failure investigation
|
|
|
|
**Dataset**: `qa-golden-set`
|
|
**Result**: 3/5 cases passed (60%)
|
|
|
|
#### Failing case 1: "What rows have extra legroom?"
|
|
|
|
- **entry_kwargs**: `{"user_message": "What rows have extra legroom?"}`
|
|
- **the captured output**: "I'm sorry, I don't have the exact row numbers for extra legroom..."
|
|
- **expected_output**: "rows 5-8 Economy Plus with extra legroom"
|
|
- **Evaluator score**: 0.1 (Factuality)
|
|
- **Evaluator reasoning**: "The output claims not to know the answer while the reference clearly states rows 5-8..."
|
|
|
|
**Trace analysis**:
|
|
Inspected trace `abc123`. The span tree shows:
|
|
|
|
1. Triage Agent routed to FAQ Agent ✓
|
|
2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓
|
|
3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← **root cause**
|
|
|
|
**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
|
|
The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
|
|
The question "What rows have extra legroom?" contains none of these keywords, so it
|
|
falls through to the default "I don't know" response.
|
|
|
|
**Classification**: Non-LLM failure — the keyword-matching tool is broken.
|
|
The LLM agent correctly routed to the FAQ agent and used the tool; the tool
|
|
itself returned wrong data.
|
|
|
|
**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
|
|
`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix,
|
|
not an eval/prompt change.
|
|
|
|
**Verification**: After fix, re-run:
|
|
|
|
```bash
|
|
pixie test -v # verify
|
|
```
|
|
````
|
|
|
|
### 6. Fix and re-run
|
|
|
|
Make the targeted change, update the dataset if needed, and re-run:
|
|
|
|
```bash
|
|
pixie test -v
|
|
```
|
|
|
|
After fixes stabilize, run analysis again to see if the patterns have changed:
|
|
|
|
```bash
|
|
pixie analyze <new_test_id>
|
|
```
|
|
|
|
---
|
|
|
|
## The iteration cycle
|
|
|
|
1. Read analysis from Step 6 → prioritize failures
|
|
2. Run tests verbose → identify specific failures
|
|
3. Investigate each failure → classify as LLM vs. non-LLM
|
|
4. For LLM failures: adjust prompts, model, or eval criteria
|
|
5. For non-LLM failures: recommend or apply code fix
|
|
6. Update dataset if the fix changed app behavior
|
|
7. Re-run tests and analysis
|
|
8. Repeat until passing or user is satisfied
|