mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-11 18:55:55 +00:00
* update eval-driven-dev skill. Split SKILL into multi-level to keep the skill body under 500 lines, rewrite instructions. * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
202 lines
7.7 KiB
Markdown
202 lines
7.7 KiB
Markdown
# Understanding the Application
|
|
|
|
This reference covers Step 1 of the eval-driven-dev process in detail: how to read the codebase, map the data flows, and document your findings.
|
|
|
|
---
|
|
|
|
## What to investigate
|
|
|
|
Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would.
|
|
|
|
### 1. How the software runs
|
|
|
|
What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
|
|
|
|
### 2. Find where the LLM provider client is called
|
|
|
|
Locate every place in the codebase where an LLM provider client is invoked (e.g., `openai.ChatCompletion.create()`, `client.chat.completions.create()`, `anthropic.messages.create()`). These are the anchor points for your analysis. For each LLM call site, record:
|
|
|
|
- The file and function where the call lives
|
|
- Which LLM provider/client is used
|
|
- The exact arguments being passed (model, messages, tools, etc.)
|
|
|
|
### 3. Track backwards: external data dependencies flowing IN
|
|
|
|
Starting from each LLM call site, trace **backwards** through the code to find every piece of data that feeds into the LLM prompt. Categorize each data source:
|
|
|
|
**Application inputs** (from the user / caller):
|
|
|
|
- User messages, queries, uploaded files
|
|
- Configuration or feature flags
|
|
|
|
**External dependency data** (from systems outside the app):
|
|
|
|
- Database lookups (conversation history from Redis, user profiles from Postgres, etc.)
|
|
- Retrieved context (RAG chunks from a vector DB, search results from an API)
|
|
- Cache reads
|
|
- Third-party API responses
|
|
|
|
For each external data dependency, document:
|
|
|
|
- What system it comes from
|
|
- What the data shape looks like (types, fields, structure)
|
|
- What realistic values look like
|
|
- Whether it requires real credentials or can be mocked
|
|
|
|
**In-code data** (assembled by the application itself):
|
|
|
|
- System prompts (hardcoded or templated)
|
|
- Tool definitions and function schemas
|
|
- Prompt-building logic that combines the above
|
|
|
|
### 4. Track forwards: external side-effects flowing OUT
|
|
|
|
Starting from each LLM call site, trace **forwards** through the code to find every side-effect the application causes in external systems based on the LLM's output:
|
|
|
|
- Database writes (saving conversation history, updating records)
|
|
- API calls to third-party services (sending emails, creating calendar entries, initiating transfers)
|
|
- Messages sent to other systems (queues, webhooks, notifications)
|
|
- File system writes
|
|
|
|
For each side-effect, document:
|
|
|
|
- What system is affected
|
|
- What data is written/sent
|
|
- Whether this side-effect is something evaluations should verify (e.g., "did the agent route to the correct department?")
|
|
|
|
### 5. Identify intermediate states to capture
|
|
|
|
Along the paths between input and output, identify intermediate states that are necessary for proper evaluation but aren't visible in the final output:
|
|
|
|
- Tool call decisions and results (which tools were called, what they returned)
|
|
- Agent routing / handoff decisions
|
|
- Intermediate LLM calls (e.g., summarization before final answer)
|
|
- Retrieval results (what context was fetched)
|
|
- Any branching logic that determines the code path
|
|
|
|
These are things that evaluators will need to check criteria like "did the agent verify identity before transferring?" or "did it use the correct tool?"
|
|
|
|
### 6. Use cases and expected behaviors
|
|
|
|
What are the distinct things the app is supposed to handle? For each use case, what does a "good" response look like? What would constitute a failure?
|
|
|
|
---
|
|
|
|
## Writing MEMORY.md
|
|
|
|
Write your findings to `pixie_qa/MEMORY.md`. This is the primary working document for the eval effort. It should be human-readable and detailed enough that someone unfamiliar with the project can understand the application and the eval strategy.
|
|
|
|
**MEMORY.md documents your understanding of the existing application code. It must NOT contain references to pixie commands, instrumentation code you plan to add, or scripts/functions that don't exist yet.** Those belong in later steps, only after they've been implemented.
|
|
|
|
### Template
|
|
|
|
```markdown
|
|
# Eval Notes: <Project Name>
|
|
|
|
## How the application works
|
|
|
|
### Entry point and execution flow
|
|
|
|
<Describe how to start/run the app, what happens step by step>
|
|
|
|
### LLM call sites
|
|
|
|
<For each LLM call in the codebase, document:>
|
|
|
|
- Where it is in the code (file + function name)
|
|
- Which LLM provider/client is used
|
|
- What arguments are passed
|
|
|
|
### External data dependencies (data flowing IN to LLM)
|
|
|
|
<For each external system the app reads from:>
|
|
|
|
- **System**: <e.g., Redis, Postgres, vector DB, third-party API>
|
|
- **What data**: <e.g., conversation history, user profile, retrieved documents>
|
|
- **Data shape**: <types, fields, structure, realistic values>
|
|
- **Code path**: <file:line where each read happens>
|
|
- **Credentials needed**: <yes/no, what kind>
|
|
|
|
### External side-effects (data flowing OUT from LLM output)
|
|
|
|
<For each external system the app writes to / affects:>
|
|
|
|
- **System**: <e.g., database, API, queue, file system>
|
|
- **What happens**: <e.g., saves conversation, sends email, creates calendar entry>
|
|
- **Code path**: <file:line where each write happens>
|
|
- **Eval-relevant?**: <should evaluations verify this side-effect?>
|
|
|
|
### Pluggable/injectable interfaces (testability seams)
|
|
|
|
<For each abstract base class, protocol, or constructor-injected backend:>
|
|
|
|
- **Interface**: <e.g., `TranscriptionBackend`, `SynthesisBackend`, `StorageBackend`>
|
|
- **Defined in**: <file:line>
|
|
- **What it wraps**: <e.g., real STT service, real TTS service, Redis>
|
|
- **How it's injected**: <constructor param, module-level var, dependency injection framework>
|
|
- **Mock strategy**: <what mock implementation should do — e.g., decode UTF-8 instead of real STT>
|
|
|
|
These are the primary testability seams. In Step 3, you'll write mock implementations of these interfaces.
|
|
|
|
### Mocking plan summary
|
|
|
|
<For each external dependency, how will you replace it in the utility function (Step 3)?>
|
|
|
|
| Dependency | Mock approach | What mock provides (IN) | What mock captures (OUT) |
|
|
| ------------------- | ------------------------------ | -------------------------------------- | ------------------------ |
|
|
| <e.g., Redis> | <mock.patch / mock class / DI> | <conversation history from eval_input> | <saved messages> |
|
|
| <e.g., STT service> | <MockTranscriptionBackend> | <text from eval_input> | <n/a> |
|
|
|
|
### Intermediate states to capture
|
|
|
|
<States along the execution path needed for evaluation but not in final output:>
|
|
|
|
- <e.g., tool call decisions, routing choices, retrieval results>
|
|
- Include code pointers (file:line) for each
|
|
|
|
### Final output
|
|
|
|
<What the user sees, what format, what the quality bar should be>
|
|
|
|
### Use cases
|
|
|
|
<List each distinct scenario the app handles, with examples of good/bad outputs>
|
|
|
|
1. <Use case 1>: <description>
|
|
- Input example: ...
|
|
- Good output: ...
|
|
- Bad output: ...
|
|
|
|
## Evaluation plan
|
|
|
|
### What to evaluate and why
|
|
|
|
<App-specific quality dimensions and rationale — filled in during Step 1>
|
|
|
|
### Evaluators and criteria
|
|
|
|
<Filled in during Step 5 — maps each quality criterion to a specific evaluator>
|
|
|
|
| Criterion | Evaluator | Dataset | Pass criteria | Rationale |
|
|
| --------- | --------- | ------- | ------------- | --------- |
|
|
| ... | ... | ... | ... | ... |
|
|
|
|
### Data needed for evaluation
|
|
|
|
<What data to capture, with code pointers>
|
|
|
|
## Datasets
|
|
|
|
| Dataset | Items | Purpose |
|
|
| ------- | ----- | ------- |
|
|
| ... | ... | ... |
|
|
|
|
## Investigation log
|
|
|
|
### <date> — <test_name> failure
|
|
|
|
<Structured investigation entries — filled in during Step 6>
|
|
```
|
|
|
|
If something is genuinely unclear from the code, ask the user — but most questions answer themselves once you've read the code carefully.
|