Step 2: Instrument with wrap and capture a reference trace

For the full wrap() API, the Runnable class, and CLI commands, see wrap-api.md.

Why this step: You need to see the actual data flowing through the app before you can build anything. This step adds wrap() calls to mark data boundaries, implements a Runnable class, captures a reference trace with pixie trace, and verifies all eval criteria can be evaluated.

This step consolidates three things: (1) data-flow analysis, (2) instrumentation, and (3) writing the runnable.


2a. Data-flow analysis and wrap instrumentation

Starting from LLM call sites, trace backwards and forwards through the code to find:

  • Entry input: what the user sends in (via the entry point)
  • Dependency input: data from external systems (databases, APIs, caches)
  • App output: data going out to users or external systems
  • Intermediate state: internal decisions relevant to evaluation (routing, tool calls)

For each data point found, immediately add a wrap() call in the application code:

import pixie

# External dependency data — value form (result of a DB/API call)
profile = pixie.wrap(db.get_profile(user_id), purpose="input", name="customer_profile",
    description="Customer profile fetched from database")

# External dependency data — function form (for lazy evaluation / avoiding the call)
history = pixie.wrap(redis.get_history, purpose="input", name="conversation_history",
    description="Conversation history from Redis")(session_id)

# App output — what the user receives
response = pixie.wrap(response_text, purpose="output", name="response",
    description="The assistant's response to the user")

# Intermediate state — internal decision relevant to evaluation
selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision",
    description="Which agent was selected to handle this request")

Rules for wrapping

  1. Wrap at the data boundary — where data enters or exits the application, not deep inside utility functions
  2. Names must be unique across the entire application (they are used as registry keys and dataset field names)
  3. Use lower_snake_case for names
  4. Don't wrap LLM call arguments or responses — those are already captured by OpenInference auto-instrumentation
  5. Don't change the function's interface: wrap() is purely additive and returns the same type

Value vs. function wrapping

# Value form: wrap a data value (result already computed)
profile = pixie.wrap(db.get_profile(user_id), purpose="input", name="customer_profile")

# Function form: wrap the callable itself — in eval mode the original function
# is NOT called; the registry value is returned instead.
profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile")(user_id)

Use function form when you want to prevent the external call from happening in eval mode (e.g., the call is expensive, has side-effects, or you simply want a clean injection point). In tracing mode, the function is called normally and the result is logged.
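The difference between the two forms can be modeled with a toy stand-in (this is an illustrative sketch, not pixie's implementation — eval_mode and registry here are hypothetical names for pixie internals):

```python
# Toy model of the two wrap forms -- NOT pixie's implementation.
# "eval_mode" and "registry" are hypothetical stand-ins for pixie internals.
eval_mode = True
registry = {"customer_profile": {"name": "Ada"}}
calls = []

def get_profile(user_id):
    calls.append(user_id)  # record that the real dependency actually ran
    return {"name": "from-db"}

def wrap_value(value, name):
    # Value form: the dependency was already called to produce `value`,
    # so eval mode can only substitute the result after the fact.
    return registry[name] if eval_mode else value

def wrap_fn(fn, name):
    # Function form: in eval mode the callable is never invoked;
    # the registry value is returned instead.
    def inner(*args, **kwargs):
        return registry[name] if eval_mode else fn(*args, **kwargs)
    return inner

p1 = wrap_value(get_profile(42), name="customer_profile")  # DB call still runs
p2 = wrap_fn(get_profile, name="customer_profile")(42)     # DB call skipped
print(calls)  # [42] -- only the value form triggered a real call
```

Both forms return the registry value in eval mode, but only the function form prevents the real dependency from executing — which is exactly why it's the right choice for expensive or side-effecting calls.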

Coverage check

After adding wrap() calls, go through each eval criterion from pixie_qa/02-eval-criteria.md and verify that every required data point has a corresponding wrap call. If a criterion needs data that isn't captured, add the wrap now — don't defer.
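The uniqueness rule can also be checked mechanically by scanning the source for wrap names (a rough regex-based sketch — it assumes name= is always passed as a double-quoted literal keyword argument on one line):

```python
import re
from collections import Counter

# Matches the name="..." keyword inside a wrap(...) call on a single line.
WRAP_NAME = re.compile(r'\bwrap\(.*?name="([a-z0-9_]+)"')

def wrap_names(source: str) -> list[str]:
    """Extract every name passed to a wrap() call in the given source text."""
    return WRAP_NAME.findall(source)

def duplicate_names(source: str) -> list[str]:
    """Return wrap names that appear more than once (names must be unique app-wide)."""
    counts = Counter(wrap_names(source))
    return [n for n, c in counts.items() if c > 1]

sample = '''
profile = pixie.wrap(db.get_profile(user_id), purpose="input", name="customer_profile")
response = pixie.wrap(response_text, purpose="output", name="response")
again = pixie.wrap(other, purpose="state", name="response")
'''
print(duplicate_names(sample))  # ["response"]
```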

2b. Implement the Runnable class

The Runnable class replaces the plain function from older versions of the skill. It exposes three lifecycle methods:

  • setup() — async, called once before any run() call; initialize shared resources here (e.g., an async HTTP client, a DB connection, pre-loaded configuration). Optional — has a default no-op.
  • run(args) — async, called concurrently for each dataset entry (up to 4 in parallel); invoke the app's real entry point with args (a validated Pydantic model built from entry_kwargs). Must be concurrency-safe — see below.
  • teardown() — async, called once after all run() calls; clean up resources. Optional — has a default no-op.

Import resolution: The project root is automatically added to sys.path when your runnable is loaded, so you can use normal import statements (e.g., from app import service) — no sys.path manipulation needed.

Place the class in pixie_qa/scripts/run_app.py:

# pixie_qa/scripts/run_app.py
from __future__ import annotations
from pydantic import BaseModel
import pixie


class AppArgs(BaseModel):
    user_message: str


class AppRunnable(pixie.Runnable[AppArgs]):
    """Runnable that drives the application for tracing and evaluation.

    wrap(purpose="input") calls in the app inject dependency data from the
    test registry automatically.  wrap(purpose="output"/"state") calls
    capture data for evaluation.  No manual mocking needed.
    """

    @classmethod
    def create(cls) -> AppRunnable:
        return cls()

    async def run(self, args: AppArgs) -> None:
        from myapp import handle_request
        await handle_request(args.user_message)

For web servers, initialize an async HTTP client in setup() and use it in run():

import httpx
from pydantic import BaseModel
import pixie


class AppArgs(BaseModel):
    user_message: str


class AppRunnable(pixie.Runnable[AppArgs]):
    _client: httpx.AsyncClient

    @classmethod
    def create(cls) -> AppRunnable:
        return cls()

    async def setup(self) -> None:
        self._client = httpx.AsyncClient(base_url="http://localhost:8000")

    async def run(self, args: AppArgs) -> None:
        await self._client.post("/chat", json={"message": args.user_message})

    async def teardown(self) -> None:
        await self._client.aclose()

For FastAPI/Starlette apps (in-process testing without starting a server), use httpx.ASGITransport to run the ASGI app directly. This is faster and avoids port management:

import asyncio
import httpx
from pydantic import BaseModel
import pixie


class AppArgs(BaseModel):
    user_message: str


class AppRunnable(pixie.Runnable[AppArgs]):
    _client: httpx.AsyncClient
    _sem: asyncio.Semaphore

    @classmethod
    def create(cls) -> AppRunnable:
        inst = cls()
        inst._sem = asyncio.Semaphore(1)  # serialise if app uses shared mutable state
        return inst

    async def setup(self) -> None:
        from myapp.main import app  # your FastAPI/Starlette app instance

        # ASGITransport runs the app in-process — no server needed
        transport = httpx.ASGITransport(app=app)
        self._client = httpx.AsyncClient(transport=transport, base_url="http://test")

    async def run(self, args: AppArgs) -> None:
        async with self._sem:
            await self._client.post("/chat", json={"message": args.user_message})

    async def teardown(self) -> None:
        await self._client.aclose()

Choose the right pattern:

  • Direct function call: when the app exposes a simple async function (no web framework)
  • httpx.AsyncClient with base_url: when you need to test against a running HTTP server
  • httpx.ASGITransport: when the app is FastAPI/Starlette — fastest, no server needed, most reliable for eval

Rules:

  • The run() method receives a Pydantic model whose fields are populated from the dataset's entry_kwargs. Define a BaseModel subclass with the fields your app needs.
  • All lifecycle methods (setup, run, teardown) are async.
  • run() must call the app through its real entry point — never bypass request handling.
  • Place the file at pixie_qa/scripts/run_app.py — name the class AppRunnable (or anything descriptive).
  • The dataset's "runnable" field references the class: "pixie_qa/scripts/run_app.py:AppRunnable".
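The runnable reference is a path:ClassName pair; a resolver might split it like this (illustrative only — pixie's actual loader may differ):

```python
def split_runnable_ref(ref: str) -> tuple[str, str]:
    """Split 'path/to/file.py:ClassName' into (file path, class name)."""
    path, _, cls = ref.rpartition(":")
    if not path or not cls:
        raise ValueError(f"expected 'path.py:ClassName', got {ref!r}")
    return path, cls

print(split_runnable_ref("pixie_qa/scripts/run_app.py:AppRunnable"))
# ("pixie_qa/scripts/run_app.py", "AppRunnable")
```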

Concurrency: run() is called concurrently for multiple dataset entries (up to 4 in parallel). If the app uses shared mutable state — SQLite, file-based DBs, global caches — you must synchronise access:

import asyncio

class AppRunnable(pixie.Runnable[AppArgs]):
    _sem: asyncio.Semaphore

    @classmethod
    def create(cls) -> AppRunnable:
        inst = cls()
        inst._sem = asyncio.Semaphore(1)  # serialise DB access
        return inst

    async def run(self, args: AppArgs) -> None:
        async with self._sem:
            await call_app(args.message)

Common concurrency pitfalls:

  • SQLite: sqlite3 connections are not safe for concurrent async writes. Use Semaphore(1) to serialise, or switch to aiosqlite with WAL mode.
  • Global mutable state: module-level dicts/lists modified in run() need a lock.
  • Rate-limited external APIs: add a semaphore to avoid 429 errors.
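For rate limits, the same semaphore pattern works with a capacity above 1 — allow a few concurrent calls instead of serialising fully. A self-contained sketch (the limit of 2 and the call names are hypothetical) that verifies concurrency never exceeds the semaphore capacity:

```python
import asyncio

async def main():
    sem = asyncio.Semaphore(2)  # at most 2 concurrent "external API" calls
    in_flight = 0
    peak = 0

    async def call_api(i):
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)   # track the highest concurrency seen
            await asyncio.sleep(0.01)     # stand-in for the real API call
            in_flight -= 1

    await asyncio.gather(*(call_api(i) for i in range(8)))
    return peak

peak = asyncio.run(main())
print(peak)  # never exceeds the semaphore limit of 2
```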

2c. Capture the reference trace with pixie trace

Use the pixie trace CLI command to run your Runnable and capture a trace file. Pass the entry input as a JSON file:

# Create a JSON file with entry kwargs
echo '{"user_message": "a realistic sample input"}' > pixie_qa/sample-input.json

pixie trace --runnable pixie_qa/scripts/run_app.py:AppRunnable \
  --input pixie_qa/sample-input.json \
  --output pixie_qa/reference-trace.jsonl

The --input flag takes a file path to a JSON file (not inline JSON). The JSON object keys become the kwargs passed to the Pydantic model.

The command calls AppRunnable.create(), then setup(), then run(args) once with the given input, then teardown(). The resulting trace is written to the output file.
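That lifecycle order can be mimicked with a minimal driver (a toy sketch of the sequence described above, not pixie's actual harness — ToyRunnable and drive are made-up names):

```python
import asyncio

class ToyRunnable:
    """Minimal stand-in that records the lifecycle order pixie trace follows."""
    log = []

    @classmethod
    def create(cls):
        cls.log.append("create")
        return cls()

    async def setup(self):
        self.log.append("setup")

    async def run(self, args):
        self.log.append(f"run({args})")

    async def teardown(self):
        self.log.append("teardown")

async def drive(entry_kwargs):
    # Mirrors the documented order: create -> setup -> run -> teardown,
    # with teardown guaranteed even if run() raises.
    runnable = ToyRunnable.create()
    await runnable.setup()
    try:
        await runnable.run(entry_kwargs)
    finally:
        await runnable.teardown()

asyncio.run(drive({"user_message": "hi"}))
print(ToyRunnable.log)
```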

The JSONL trace file will contain one line per wrap() event and one line per LLM span:

{"type": "kwargs", "value": {"user_message": "What are your hours?"}}
{"type": "wrap", "name": "customer_profile", "purpose": "input", "data": {...}, ...}
{"type": "llm_span", "request_model": "gpt-4o", "input_messages": [...], ...}
{"type": "wrap", "name": "response", "purpose": "output", "data": "Our hours are...", ...}

2d. Verify wrap coverage with pixie format

Run pixie format on the trace file to see the data in dataset-entry format. This shows you both the data shapes and what a real app output looks like:

pixie format --input pixie_qa/reference-trace.jsonl --output pixie_qa/dataset-sample.json

The output is a formatted dataset entry template — it contains:

  • entry_kwargs: the exact keys/values for the runnable arguments
  • eval_input: the data for all dependencies (from wrap(purpose="input") calls)
  • eval_output: the actual app output captured from the trace (this is the real output — use it to understand what the app produces, not as a dataset eval_output field)

For each eval criterion from pixie_qa/02-eval-criteria.md, verify the format output contains the data needed to evaluate it. If a data point is missing, go back and add the wrap() call.


Output

  • pixie_qa/scripts/run_app.py — the Runnable class
  • pixie_qa/reference-trace.jsonl — the reference trace with all expected wrap events