mirror of https://github.com/github/awesome-copilot.git synced 2026-04-11 10:45:56 +00:00

Yiou Li 5f59ddb9cf update eval-driven-dev skill (#1352)

* update eval-driven-dev skill

* small refinement of skill description

* address review, rerun npm start.

2026-04-10 11:19:28 +10:00

Step 4: Build the Dataset

Why this step: The dataset ties everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1b) — into concrete test scenarios. At test time, pixie test calls the runnable with entry_kwargs, the wrap registry is populated with eval_input, and evaluators score the resulting captured outputs.

Understanding `entry_kwargs`, `eval_input`, and `expectation`

Before building the dataset, understand what these terms mean:

entry_kwargs = the kwargs passed to Runnable.run() as a Pydantic model. These are the entry-point inputs (user message, request body, CLI args). The keys must match the fields of the Pydantic model defined for run(args: T).
eval_input = a list of {"name": ..., "value": ...} objects corresponding to wrap(purpose="input") calls in the app. At test time, these are injected automatically by the wrap registry; wrap(purpose="input") calls in the app return the registry value instead of calling the real external dependency.

CRITICAL: eval_input must have at least one item (enforced by min_length=1 validation). If the app has no wrap(purpose="input") calls, you must still include at least one eval_input item — use the primary entry-point argument as a synthetic input:
```
"eval_input": [
  { "name": "user_input", "value": "What are your business hours?" }
]
```
Each item is a NamedData object with name (str) and value (any JSON-serializable value).
expectation (optional) = case-specific evaluation reference. What a correct output should look like for this scenario. Used by evaluators that compare output against a reference (e.g., Factuality, ClosedQA). Not needed for output-quality evaluators that don't require a reference.
eval_output = what the app actually produces, captured at runtime by wrap(purpose="output") and wrap(purpose="state") calls. It is not stored in the dataset; pixie test produces it when it runs the app.
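
The eval_input rules above can be checked before running pixie test. The following is a minimal sketch using only the standard library. `validate_eval_input` is a hypothetical helper, not part of pixie; it only mirrors the constraints stated in this section (at least one item, a string name, a JSON-serializable value).

```python
import json


def validate_eval_input(items):
    """Return a list of problems found in an eval_input list (empty if none).

    Hypothetical helper: it mirrors the rules described in this step,
    not pixie's own validation code.
    """
    if not isinstance(items, list) or len(items) < 1:
        return ["eval_input must be a list with at least one item"]
    problems = []
    for i, item in enumerate(items):
        # Each item needs both a "name" and a "value" key.
        if not isinstance(item, dict) or not {"name", "value"} <= set(item):
            problems.append(f"item {i}: expected 'name' and 'value' keys")
            continue
        if not isinstance(item["name"], str):
            problems.append(f"item {i}: 'name' must be a string")
        try:
            json.dumps(item["value"])  # raises if the value cannot be serialized
        except (TypeError, ValueError):
            problems.append(f"item {i}: 'value' is not JSON-serializable")
    return problems
```

An empty result means the list satisfies the rules listed here; it does not guarantee that the names match the wrap(purpose="input") calls in your app.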

The reference trace at pixie_qa/reference-trace.jsonl is your primary source for data shapes:

Filter it for purpose="input" events to see the exact serialized format for eval_input values
Read the kwargs record to understand the entry_kwargs structure
Read purpose="output"/"state" events to understand what outputs the app produces, so you can write meaningful expectation values
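
As an illustration of filtering the trace, here is a small sketch that collects input events into eval_input items. It assumes each line of the trace is a JSON object with "purpose", "name", and "data" keys. Only the "data" field and the purpose values are mentioned in this document, so treat the rest of the record layout as an assumption and check it against your own trace. `input_events_to_eval_input` is a hypothetical helper name.

```python
import json


def input_events_to_eval_input(jsonl_lines):
    """Turn purpose="input" trace events into eval_input items.

    Assumes each line is a JSON object with "purpose", "name", and "data"
    keys; verify these names against your own reference trace.
    """
    items = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("purpose") == "input":
            items.append({"name": event["name"], "value": event["data"]})
    return items
```

A file object can be passed directly, for example `with open("pixie_qa/reference-trace.jsonl") as f: items = input_events_to_eval_input(f)`.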

4a. Derive evaluator assignments

The eval criteria artifact (pixie_qa/02-eval-criteria.md) maps each criterion to use cases. The evaluator mapping artifact (pixie_qa/03-evaluator-mapping.md) maps each criterion to a concrete evaluator name. Combine these:

Dataset-level default evaluators: Criteria marked as applying to "All" use cases → their evaluator names go in the top-level "evaluators" array.
Item-level evaluators: Criteria that apply to only a subset → their evaluator names go in "evaluators" on the relevant rows only, using "..." to also include the defaults.
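
The combination rule can be sketched in a few lines. The two dictionaries below are hypothetical, hand-transcribed stand-ins for the two Markdown artifacts; nothing here parses those files or calls pixie.

```python
def assign_evaluators(criteria_use_cases, criteria_evaluators):
    """Split evaluators into dataset-level defaults and per-use-case rows.

    criteria_use_cases: criterion -> "All" or a list of use-case names
    criteria_evaluators: criterion -> evaluator name
    Both dicts are assumed inputs transcribed from the two artifacts.
    """
    defaults = []
    per_use_case = {}
    for criterion, use_cases in criteria_use_cases.items():
        evaluator = criteria_evaluators[criterion]
        if use_cases == "All":
            if evaluator not in defaults:
                defaults.append(evaluator)
        else:
            for use_case in use_cases:
                # "..." keeps the dataset-level defaults on that row.
                row = per_use_case.setdefault(use_case, ["..."])
                if evaluator not in row:
                    row.append(evaluator)
    return defaults, per_use_case
```

The first return value goes in the top-level "evaluators" array; the second gives the "evaluators" list for rows belonging to each use case.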

4b. Inspect data shapes with `pixie format`

Use pixie format on the reference trace to see the exact data shapes and the real app output in dataset-entry format:

pixie format --input reference-trace.jsonl --output dataset-sample.json

The output looks like:

{
  "entry_kwargs": {
    "user_message": "What are your business hours?"
  },
  "eval_input": [
    {
      "name": "customer_profile",
      "value": { "name": "Alice", "tier": "gold" }
    },
    {
      "name": "conversation_history",
      "value": [{ "role": "user", "content": "What are your hours?" }]
    }
  ],
  "expectation": null,
  "eval_output": {
    "response": "Our business hours are Monday to Friday, 9am to 5pm..."
  }
}

Important: The eval_output in this template is the full real output produced by the running app. Do NOT copy eval_output into your dataset entries — it would make tests trivially pass by giving evaluators the real answer. Instead:

Use entry_kwargs and eval_input as exact templates for data keys and format
Look at eval_output to understand what the app produces — then write a concise expectation description that captures the key quality criteria for each scenario

Example: if eval_output.response is "Our business hours are Monday to Friday, 9 AM to 5 PM, and Saturday 10 AM to 2 PM.", write expectation as "Should mention weekday hours (Mon–Fri 9am–5pm) and Saturday hours" — a short description a human or LLM evaluator can compare against.
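
One way to apply this guidance is to build each entry from the sample while leaving eval_output behind. The sketch below assumes the sample has the shape shown above; `sample_to_entry` is a hypothetical helper, and the description and expectation are still written by hand.

```python
import copy


def sample_to_entry(sample, description, expectation=None):
    """Build a dataset entry from one `pixie format` sample.

    Copies entry_kwargs and eval_input unchanged and deliberately leaves
    out eval_output, so the real answer never reaches the evaluators.
    """
    entry = {
        "entry_kwargs": copy.deepcopy(sample["entry_kwargs"]),
        "eval_input": copy.deepcopy(sample["eval_input"]),
        "description": description,
    }
    if expectation is not None:
        entry["expectation"] = expectation
    return entry
```

The deep copies keep later edits to one entry from leaking into the sample or into other entries built from it.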

4c. Generate dataset items

Create diverse entries guided by the reference trace and use cases:

entry_kwargs keys must match the fields of the Pydantic model used in Runnable.run(args: T)
eval_input must be a list of {"name": ..., "value": ...} objects matching the name values of wrap(purpose="input") calls in the app
Cover each use case from pixie_qa/02-eval-criteria.md — at least one entry per use case, with meaningfully diverse inputs across entries

If the user specified a dataset or data source in the prompt (e.g., a JSON file with research questions or conversation scenarios), read that file, adapt each entry to the entry_kwargs / eval_input shape, and incorporate them into the dataset. Do NOT ignore specified data.
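
Adapting a user-supplied file might look like the sketch below. The record keys ("question", "expected", "description") and the `user_message` field are assumptions for illustration; replace them with the keys in the actual file and the fields of your own Pydantic model. `adapt_records` is a hypothetical helper.

```python
import copy


def adapt_records(records, eval_input_template):
    """Convert user-supplied records into dataset entries.

    Assumes each record is a dict with a "question" key and optional
    "expected" and "description" keys (hypothetical shape).
    """
    entries = []
    for record in records:
        entry = {
            # "user_message" must be a field of your run(args: T) model.
            "entry_kwargs": {"user_message": record["question"]},
            "eval_input": copy.deepcopy(eval_input_template),
            # Fall back to the question text if no description is given.
            "description": record.get("description", record["question"]),
        }
        if "expected" in record:
            entry["expectation"] = record["expected"]
        entries.append(entry)
    return entries
```

Every record produces one entry, so none of the specified data is dropped.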

4d. Build the dataset JSON file

Create the dataset at pixie_qa/datasets/<name>.json:

{
  "name": "qa-golden-set",
  "runnable": "pixie_qa/scripts/run_app.py:AppRunnable",
  "evaluators": ["Factuality", "pixie_qa/evaluators.py:concise_voice_style"],
  "entries": [
    {
      "entry_kwargs": {
        "user_message": "What are your business hours?"
      },
      "description": "Customer asks about business hours with gold tier account",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Alice Johnson", "tier": "gold" }
        }
      ],
      "expectation": "Should mention Mon-Fri 9am-5pm and Sat 10am-2pm"
    },
    {
      "entry_kwargs": {
        "user_message": "I want to change something"
      },
      "description": "Ambiguous change request from basic tier customer",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Bob Smith", "tier": "basic" }
        }
      ],
      "expectation": "Should ask for clarification",
      "evaluators": ["...", "ClosedQA"]
    },
    {
      "entry_kwargs": {
        "user_message": "I want to end this call"
      },
      "description": "User requests call end after failed verification",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Charlie Brown", "tier": "basic" }
        }
      ],
      "expectation": "Agent should call endCall tool and end the conversation",
      "eval_metadata": {
        "expected_tool": "endCall",
        "expected_call_ended": true
      },
      "evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"]
    }
  ]
}

Key fields

Entry structure — all fields are top-level on each entry (flat structure — no nesting):

entry:
  ├── entry_kwargs    (required) — args for Runnable.run()
  ├── eval_input      (required) — list of {"name": ..., "value": ...} objects
  ├── description     (required) — human-readable label for the test case
  ├── expectation     (optional) — reference for comparison-based evaluators
  ├── eval_metadata   (optional) — extra per-entry data for custom evaluators
  └── evaluators      (optional) — evaluator names for THIS entry
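
A quick structural check against this layout can catch mistakes such as a copied eval_output. This is a sketch only: `check_entry` is a hypothetical helper, and whether pixie itself rejects unknown fields is not stated here.

```python
REQUIRED_FIELDS = ("entry_kwargs", "eval_input", "description")
OPTIONAL_FIELDS = ("expectation", "eval_metadata", "evaluators")


def check_entry(entry):
    """Return a list of structural problems for one dataset entry."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in entry
    ]
    # Anything outside the documented fields is reported, e.g. eval_output.
    unknown = sorted(set(entry) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS))
    problems += [f"unexpected field: {name}" for name in unknown]
    return problems
```

Running it over every item in "entries" before pixie test gives early feedback on missing or misplaced fields.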

Top-level fields:

runnable (required): filepath:ClassName reference to the Runnable class from Step 2 (e.g., "pixie_qa/scripts/run_app.py:AppRunnable"). Path is relative to the project root.
evaluators (dataset-level, optional): Default evaluator names applied to every entry — the evaluators for criteria that apply to ALL use cases.

Per-entry fields (all top-level on each entry):

entry_kwargs (required): Keys match the Pydantic model fields for Runnable.run(args: T). These are the app's entry-point inputs.
eval_input (required): List of {"name": ..., "value": ...} objects. Names match wrap(purpose="input") names in the app.
description (required): Use case one-liner from pixie_qa/02-eval-criteria.md.
expectation (optional): Case-specific expectation text for evaluators that need a reference.
eval_metadata (optional): Extra per-entry data for custom evaluators — e.g., expected tool names, boolean flags, thresholds. Accessible in evaluators as evaluable.eval_metadata.
evaluators (optional): Row-level evaluator list. Include "..." to extend the dataset-level defaults; omit "..." to replace them.
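
To illustrate how eval_metadata can drive a custom check, here is a sketch of the comparison logic only. It is not pixie's evaluator interface: in a real evaluator the metadata would come from evaluable.eval_metadata, while the list of observed tool names is a hypothetical input whose source depends on what your app captures.

```python
def tool_call_matches(eval_metadata, observed_tools):
    """Return True when the expected tool appears among the observed tools.

    eval_metadata: the entry's eval_metadata dict (may be None)
    observed_tools: tool names seen during the run (hypothetical input)
    Entries without an "expected_tool" key pass trivially.
    """
    expected = (eval_metadata or {}).get("expected_tool")
    if expected is None:
        return True
    return expected in observed_tools
```

Keeping the comparison in a plain function like this makes it easy to unit-test separately from the evaluator wrapper.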

Evaluator assignment rules

Evaluators that apply to ALL items go in the top-level "evaluators" array.
Items that need additional evaluators use "evaluators": ["...", "ExtraEval"] — "..." expands to defaults.
Items that need a completely different set use "evaluators": ["OnlyThis"] without "...".
Items using only defaults: omit the "evaluators" field.
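
These rules can be read as a small resolution function. The sketch below is an interpretation of the rules as written, not pixie's implementation; `resolve_evaluators` is a hypothetical name, and dropping duplicates is an added assumption.

```python
def resolve_evaluators(defaults, entry_evaluators=None):
    """Return the effective evaluator list for one entry.

    - No entry-level list: use the dataset defaults.
    - "..." in the entry list: expands to the defaults at that position.
    - No "...": the entry list replaces the defaults entirely.
    """
    if entry_evaluators is None:
        return list(defaults)
    resolved = []
    for name in entry_evaluators:
        expanded = defaults if name == "..." else [name]
        for evaluator in expanded:
            if evaluator not in resolved:  # assumption: skip duplicates
                resolved.append(evaluator)
    return resolved
```

For the example dataset above, the second entry would resolve to the two defaults followed by "ClosedQA".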

Dataset Creation Reference

Using `eval_input` values

Each eval_input item is a {"name": ..., "value": ...} object. Use the reference trace as a template: copy the "data" field from the relevant purpose="input" event and adapt the values.

Simple dict:

{ "name": "customer_profile", "value": { "name": "Alice", "tier": "gold" } }

List of dicts (e.g., conversation history):

{
  "name": "conversation_history",
  "value": [
    { "role": "user", "content": "Hello" },
    { "role": "assistant", "content": "Hi there!" }
  ]
}

Important: The exact format depends on what the wrap(purpose="input") call captures. Always copy from the reference trace rather than constructing from scratch.

Crafting diverse eval scenarios

Cover different aspects of each use case:

Different user phrasings of the same request
Edge cases (ambiguous input, missing information, error conditions)
Entries that stress-test specific eval criteria
At least one entry per use case from Step 1b

Output

pixie_qa/datasets/<name>.json — the dataset file.
