# Understanding the Application

This reference covers Step 1 of the eval-driven-dev process in detail: how to read the codebase, map the data flows, and document your findings.

---

## What to investigate

Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would.

### 1. How the software runs

What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?

### 2. Find where the LLM provider client is called

Locate every place in the codebase where an LLM provider client is invoked (e.g., `openai.ChatCompletion.create()`, `client.chat.completions.create()`, `anthropic.messages.create()`). These are the anchor points for your analysis.

For each LLM call site, record:

- The file and function where the call lives
- Which LLM provider/client is used
- The exact arguments being passed (model, messages, tools, etc.)

### 3. Track backwards: external data dependencies flowing IN

Starting from each LLM call site, trace **backwards** through the code to find every piece of data that feeds into the LLM prompt. Categorize each data source:

**Application inputs** (from the user / caller):

- User messages, queries, uploaded files
- Configuration or feature flags

**External dependency data** (from systems outside the app):

- Database lookups (conversation history from Redis, user profiles from Postgres, etc.)
- Retrieved context (RAG chunks from a vector DB, search results from an API)
- Cache reads
- Third-party API responses

For each external data dependency, document:

- What system it comes from
- What the data shape looks like (types, fields, structure)
- What realistic values look like
- Whether it requires real credentials or can be mocked

**In-code data** (assembled by the application itself):

- System prompts (hardcoded or templated)
- Tool definitions and function schemas
- Prompt-building logic that combines the above

### 4. Track forwards: external side-effects flowing OUT

Starting from each LLM call site, trace **forwards** through the code to find every side-effect the application causes in external systems based on the LLM's output:

- Database writes (saving conversation history, updating records)
- API calls to third-party services (sending emails, creating calendar entries, initiating transfers)
- Messages sent to other systems (queues, webhooks, notifications)
- File system writes

For each side-effect, document:

- What system is affected
- What data is written/sent
- Whether this side-effect is something evaluations should verify (e.g., "did the agent route to the correct department?")

### 5. Identify intermediate states to capture

Along the paths between input and output, identify intermediate states that are necessary for proper evaluation but aren't visible in the final output:

- Tool call decisions and results (which tools were called, what they returned)
- Agent routing / handoff decisions
- Intermediate LLM calls (e.g., summarization before final answer)
- Retrieval results (what context was fetched)
- Any branching logic that determines the code path

Evaluators will need these intermediate states to check criteria like "did the agent verify identity before transferring?" or "did it use the correct tool?"

### 6. Use cases and expected behaviors

What are the distinct things the app is supposed to handle?
For each use case, what does a "good" response look like? What would constitute a failure?

---

## Writing MEMORY.md

Write your findings to `pixie_qa/MEMORY.md`. This is the primary working document for the eval effort. It should be human-readable and detailed enough that someone unfamiliar with the project can understand the application and the eval strategy.

**MEMORY.md documents your understanding of the existing application code. It must NOT contain references to pixie commands, instrumentation code you plan to add, or scripts/functions that don't exist yet.** Those belong in later steps, only after they've been implemented.

### Template

```markdown
# Eval Notes:

## How the application works

### Entry point and execution flow

### LLM call sites

- Where it is in the code (file + function name)
- Which LLM provider/client is used
- What arguments are passed

### External data dependencies (data flowing IN to LLM)

- **System**:
- **What data**:
- **Data shape**:
- **Code path**:
- **Credentials needed**:

### External side-effects (data flowing OUT from LLM output)

- **System**:
- **What happens**:
- **Code path**:
- **Eval-relevant?**:

### Pluggable/injectable interfaces (testability seams)

- **Interface**:
- **Defined in**:
- **What it wraps**:
- **How it's injected**:
- **Mock strategy**:

These are the primary testability seams. In Step 3, you'll write mock implementations of these interfaces.

### Mocking plan summary

| Dependency | Mock approach | What mock provides (IN) | What mock captures (OUT) |
| ---------- | ------------- | ----------------------- | ------------------------ |
|            |               |                         |                          |
|            |               |                         |                          |

### Intermediate states to capture

-
-

Include code pointers (file:line) for each

### Final output

### Use cases

1. :
   - Input example: ...
   - Good output: ...
   - Bad output: ...
## Evaluation plan

### What to evaluate and why

### Evaluators and criteria

| Criterion | Evaluator | Dataset | Pass criteria | Rationale |
| --------- | --------- | ------- | ------------- | --------- |
| ...       | ...       | ...     | ...           | ...       |

### Data needed for evaluation

## Datasets

| Dataset | Items | Purpose |
| ------- | ----- | ------- |
| ...     | ...   | ...     |

## Investigation log

### failure
```

If something is genuinely unclear from the code, ask the user, but most questions answer themselves once you've read the code carefully.
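To make the "testability seams" and "mocking plan" sections of the template concrete, here is a minimal Python sketch of an injectable LLM-client interface and the kind of mock Step 3 would implement. All names (`ChatClient`, `MockChatClient`, `answer`) are hypothetical illustrations, not taken from any real codebase or SDK; note how the mock both *provides* canned data flowing IN and *captures* the prompts flowing OUT, matching the two columns of the mocking plan table.

```python
from typing import Protocol


class ChatClient(Protocol):
    """Seam around the real LLM provider client (hypothetical interface)."""

    def complete(self, messages: list[dict]) -> str: ...


class MockChatClient:
    """Eval-time stand-in: returns a canned reply (IN) and records prompts (OUT)."""

    def __init__(self, canned_reply: str):
        self.canned_reply = canned_reply       # what the mock provides (IN)
        self.captured: list[list[dict]] = []   # what the mock captures (OUT)

    def complete(self, messages: list[dict]) -> str:
        self.captured.append(messages)
        return self.canned_reply


def answer(client: ChatClient, question: str) -> str:
    # Application code assembles the prompt; the client is injected,
    # so evals can swap in the mock without touching this function.
    return client.complete([{"role": "user", "content": question}])


mock = MockChatClient("Paris")
print(answer(mock, "Capital of France?"))  # -> Paris
print(mock.captured[0][0]["content"])      # -> Capital of France?
```

Because the seam is a `Protocol`, the production client only needs a matching `complete` method; no inheritance or changes to the application's call sites are required when switching between real and mock implementations.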