mirror of https://github.com/github/awesome-copilot.git synced 2026-04-11 18:55:55 +00:00

Files

Yiou Li df0ed6aa51 update eval-driven-dev skill. (#1201 )

* update eval-driven-dev skill.

Split SKILL into multi-level to keep the skill body under 500 lines, rewrite instructions.

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

2026-03-30 08:07:39 +11:00

8.2 KiB

Raw Blame History

Instrumentation

This reference covers the tactical implementation of instrumentation in Step 2: how to use @observe, enable_storage(), and start_observation correctly.

For full API signatures and all available parameters, see references/pixie-api.md (Instrumentation API section).

For guidance on what to instrument (which functions, based on your eval criteria), see Step 2a in the main skill instructions.

Adding `enable_storage()` at application startup

Call enable_storage() once at the beginning of the application's startup code — inside main(), or at the top of a server's initialization. Never at module level (top of a file outside any function), because that causes storage setup to trigger on import.

Good places:

Inside if __name__ == "__main__": blocks
In a FastAPI lifespan or on_startup handler
At the top of main() / run() functions
Inside the runnable function in test files

# ✅ CORRECT — at application startup
async def main():
    enable_storage()
    ...

# ✅ CORRECT — in a runnable for tests
def runnable(eval_input):
    enable_storage()
    my_function(**eval_input)

# ❌ WRONG — at module level, runs on import
from pixie import enable_storage
enable_storage()  # this runs when any file imports this module!

Wrapping functions with `@observe` or `start_observation`

Instrument the existing function that the app actually calls during normal operation. The @observe decorator or start_observation context manager goes on the production code path — not on new helper functions created for testing.

# ✅ CORRECT — decorating the existing production function
from pixie import observe

@observe(name="answer_question")
def answer_question(question: str, context: str) -> str:  # existing function
    ...  # existing code, unchanged

# ✅ CORRECT — decorating a class method (works exactly the same way)
from pixie import observe

class OpenAIAgent:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
        self.model = model

    @observe(name="openai_agent_respond")
    def respond(self, user_message: str, conversation_history: list | None = None) -> str:
        # existing code, unchanged — @observe handles `self` automatically
        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
        if conversation_history:
            messages.extend(conversation_history)
        messages.append({"role": "user", "content": user_message})
        response = self.client.chat.completions.create(model=self.model, messages=messages)
        return response.choices[0].message.content or ""

@observe handles self and cls automatically — it strips them from the captured input so only the meaningful arguments appear in traces. Do NOT create wrapper methods or call unbound methods to work around this. Just decorate the existing method directly.

# ✅ CORRECT — context manager inside an existing function
from pixie import start_observation

async def main():  # existing function
    ...
    with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
        result = await Runner.run(current_agent, input_items, context=context)
        # ... existing response handling ...
        obs.set_output(response_text)
    ...

Anti-patterns to avoid

Creating new wrapper functions

# ❌ WRONG — creating a new function that duplicates logic from main()
@observe(name="run_for_eval")
async def run_for_eval(user_messages: list[str]) -> str:
    # This duplicates what main() does, creating a separate code path
    # that diverges from production. Don't do this.
    ...

Creating wrapper methods instead of decorating the existing method

# ❌ WRONG — creating a new _respond_observed wrapper method
class OpenAIAgent:
    def respond(self, user_message, conversation_history=None):
        result = self._respond_observed({
            'user_message': user_message,
            'conversation_history': conversation_history,
        })
        return result['result']

    @observe
    def _respond_observed(self, args):
        # WRONG: creates a separate code path, changes the interface,
        # and breaks when called as an unbound method.
        ...

# ✅ CORRECT — just decorate the existing method directly
class OpenAIAgent:
    @observe(name="openai_agent_respond")
    def respond(self, user_message, conversation_history=None):
        ...  # existing code, unchanged

Bypassing the app by calling the LLM directly

# ❌ WRONG — calling the LLM directly instead of calling the app's function
@observe(name="agent_answer_question")
def answer_question(question: str) -> str:
    # This bypasses the entire app and calls OpenAI directly.
    # You're testing a script you just wrote, not the user's app.
    response = client.responses.create(
        model="gpt-4.1",
        input=[{"role": "user", "content": question}],
    )
    return response.output_text

Rules

Never add new wrapper functions to the application code for eval purposes.
Never bypass the app by calling the LLM provider directly — if you find yourself writing client.responses.create(...) or openai.ChatCompletion.create(...) in a test or utility function, you're not testing the app. Import and call the app's own function instead.
Never change the function's interface (arguments, return type, behavior).
Never duplicate production logic into a separate "testable" function.
The instrumentation is purely additive — if you removed all pixie imports and decorators, the app would work identically.
After instrumentation, call flush() at the end of runs to make sure all spans are written.
For interactive apps (CLI loops, chat interfaces), instrument the per-turn processing function — the one that takes user input and produces a response. The eval runnable should call this same function.

Import rule: All pixie symbols are importable from the top-level pixie package. Never import from submodules (pixie.instrumentation, pixie.evals, pixie.storage.evaluable, etc.) — always use from pixie import ....

What to instrument based on eval criteria

LLM provider calls are auto-captured. When you call enable_storage(), pixie activates OpenInference instrumentors that automatically trace every LLM API call (OpenAI, Anthropic, Google, etc.) with full input/output messages, token usage, and model parameters. You do NOT need @observe on a function just because it contains an LLM call — the LLM call is already instrumented.

Use @observe for application-level functions whose inputs, outputs, or intermediate states your evaluators need but that aren't visible from the LLM call alone:

What your evaluator needs	What to instrument with `@observe`
App-level input/output (what user sent, what app returned)	The app's entry-point or per-turn processing function
Retrieved context (for faithfulness/grounding checks)	The retrieval function — captures what documents were fetched
Routing/dispatch decisions	The routing function — captures which tool/agent/department was selected
Side-effects sent to external systems	The function that writes to the external system — captures what was sent
Conversation history handling	The per-turn processing function — captures how history is assembled
Intermediate processing stages	Each intermediate function — captures each stage

If your eval criteria can be fully assessed from the auto-captured LLM inputs and outputs alone, you may not need @observe at all. But typically you need at least one @observe on the app's entry-point function to capture the application-level input/output shape that the dataset and evaluators work with.

8.2 KiB Raw Blame History