update eval-driven-dev skill. (#1201)

* update eval-driven-dev skill. Split SKILL into multiple levels to keep the skill body under 500 lines; rewrite instructions.
* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
# pixie API Reference

> This file is auto-generated by `generate_api_doc` from the
> live pixie-qa package. Do not edit by hand — run
> `generate_api_doc` to regenerate after updating pixie-qa.

## Configuration

All settings read from environment variables at call time. By default,
```python
from pixie import enable_storage, observe, start_observation, flush, init, add_handler
```

| Function / Decorator | Signature | Notes |
| -------------------- | --------- | ----- |
| `observe` | `observe(name: 'str \| None' = None) -> 'Callable[[Callable[P, T]], Callable[P, T]]'` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
| `enable_storage` | `enable_storage() -> 'StorageHandler'` | Idempotent. Creates DB, registers handler. Call at app startup. |
| `start_observation` | `start_observation(*, input: 'JsonValue', name: 'str \| None' = None) -> 'Generator[ObservationContext, None, None]'` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
| `flush` | `flush(timeout_seconds: 'float' = 5.0) -> 'bool'` | Drains the queue. Call after a run before using CLI commands. |
| `init` | `init(*, capture_content: 'bool' = True, queue_size: 'int' = 1000) -> 'None'` | Called internally by `enable_storage`. Idempotent. |
| `add_handler` | `add_handler(handler: 'InstrumentationHandler') -> 'None'` | Register a custom handler (must call `init()` first). |
| `remove_handler` | `remove_handler(handler: 'InstrumentationHandler') -> 'None'` | Unregister a previously added handler. |
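
A minimal sketch of the instrumentation flow, using only the calls in the table above (the function, names, and inputs are illustrative):

```python
from pixie import enable_storage, observe, start_observation, flush

enable_storage()  # idempotent: creates the DB and registers the storage handler

@observe(name="answer")
def answer(question: str) -> str:
    # kwargs are captured as eval_input, the return value as eval_output
    return "42"

answer(question="What is the meaning of life?")

# Manual span for work that is not a single function call
with start_observation(input={"query": "hello"}, name="greeting") as obs:
    obs.set_output("hi")
    obs.set_metadata("step", "manual")

flush()  # drain the queue before inspecting traces with the CLI
```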
---

## CLI Commands

```bash
# Trace inspection
pixie trace list [--limit N] [--errors]            # show recent traces
pixie trace show <trace_id> [--verbose] [--json]   # show span tree for a trace
pixie trace last [--json]                          # show most recent trace (verbose)

# Dataset management
pixie dataset create <name>
pixie dataset list
echo '"expected value"' | pixie dataset save <name> --expected-output

# Run evals
pixie test [path] [-k filter_substring] [-v]
```

### `pixie trace` commands
**`pixie trace list`** — show recent traces with summary info (trace ID, root span, timestamp, span count, errors).

- `--limit N` (default 10) — number of traces to show
- `--errors` — show only traces with errors

**`pixie trace show <trace_id>`** — show the span tree for a specific trace.

- Default (compact): span names, types, timing
- `--verbose` / `-v`: full input/output data for each span
- `--json`: machine-readable JSON output
- Trace ID accepts prefix match (first 8+ characters)

**`pixie trace last`** — shortcut to show the most recent trace in verbose mode. This is the primary command to use after running the harness.

- `--json`: machine-readable JSON output

**`pixie dataset save` selection modes:**

- `root` (default) — the outermost `@observe` or `start_observation` span
---

## Eval Harness (`pixie`)

```python
from pixie import (
    assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
    EvalAssertionError, Evaluation, ScoreThreshold,
    capture_traces, MemoryTraceHandler,
    last_llm_call, root,
)
```

### Key functions
**`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`**

- Loads dataset by name, runs `assert_pass` with all items.
- `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself.
- `evaluators`: list of evaluator callables.
- `pass_criteria`: defaults to `ScoreThreshold()` (all scores ≥ 0.5).
- `from_trace`: `last_llm_call` or `root` — selects which span to evaluate.

**`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`**

- Same, but takes explicit inputs (and optionally `Evaluable` items for expected outputs).

**`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`**

- Runs `runnable(eval_input)`, captures traces, evaluates. Returns one `Evaluation`.

**`ScoreThreshold(threshold=0.5, pct=1.0)`**

- `threshold`: min score per item (default 0.5).
- `pct`: fraction of items that must meet threshold (default 1.0 = all).
- Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7.

**`Evaluation(score, reasoning, details={})`** — frozen result. `score` is 0.0–1.0.

**`capture_traces()`** — context manager; use for in-memory trace capture without DB.

**`last_llm_call(trace)`** / **`root(trace)`** — `from_trace` helpers.
---
## Dataset Python API
```python
from pixie import DatasetStore, Evaluable
```

```python
store = DatasetStore()                      # reads PIXIE_DATASET_DIR
store.create("my-dataset")                  # create empty
store.create("my-dataset", items=[...])     # create with items
store.append("my-dataset", Evaluable(...))  # add one or more items
store.get("my-dataset")                     # returns Dataset
store.list()                                # list names
store.list_details()                        # list names with metadata
store.remove("my-dataset", index=2)         # remove by index
store.delete("my-dataset")                  # delete entirely
```
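
A sketch of building a dataset programmatically; it assumes `Evaluable` accepts the fields listed below as keyword arguments:

```python
from pixie import DatasetStore, Evaluable

store = DatasetStore()
store.create(
    "capital-cities",  # illustrative dataset name
    items=[
        # assumption: Evaluable takes its fields as keyword arguments
        Evaluable(eval_input="What is the capital of France?", expected_output="Paris"),
        Evaluable(eval_input="What is the capital of Japan?", expected_output="Tokyo"),
    ],
)
```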
**`Evaluable` fields:**

- `eval_input`: what was passed to the observed function
- `eval_output`: what the function returned
- `expected_output`: reference answer (`UNSET` if not provided)

---

```python
from pixie import ObservationStore

store = ObservationStore()  # reads PIXIE_DB_PATH
await store.create_tables()
```
```python
await store.create_tables() -> 'None'
await store.get_by_name(name: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'       # → list of spans
await store.get_by_type(span_kind: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'  # → list of spans filtered by kind
await store.get_errors(trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'                     # → list of error spans
await store.get_last_llm(trace_id: 'str') -> 'LLMSpan | None'                                              # → most recent LLMSpan
await store.get_root(trace_id: 'str') -> 'ObserveSpan'                                                     # → root ObserveSpan
await store.get_trace(trace_id: 'str') -> 'list[ObservationNode]'                                          # → list[ObservationNode] (tree)
await store.get_trace_flat(trace_id: 'str') -> 'list[ObserveSpan | LLMSpan]'                               # → flat list of all spans
await store.list_traces(limit: 'int' = 50, offset: 'int' = 0) -> 'list[dict[str, Any]]'                    # → list of trace summaries
await store.save(span: 'ObserveSpan | LLMSpan') -> 'None'                                                  # persist a single span
await store.save_many(spans: 'list[ObserveSpan | LLMSpan]') -> 'None'                                      # persist multiple spans

# ObservationNode
node.to_text()   # pretty-print span tree
node.find(name)  # find a child span by name
node.children    # list of child ObservationNode
node.span        # the underlying span (ObserveSpan or LLMSpan)
```
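
A sketch of reading the most recent trace back out, continuing from the `store` above; the `"trace_id"` key on the summary dicts is an assumption about the shape `list_traces` returns:

```python
summaries = await store.list_traces(limit=1)
trace_id = summaries[0]["trace_id"]  # assumed summary key

for node in await store.get_trace(trace_id):
    print(node.to_text())  # pretty-print the span tree
```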
---
## Eval Runner API

### `assert_dataset_pass`

```python
await assert_dataset_pass(
    runnable: 'Callable[..., Any]',
    dataset_name: 'str',
    evaluators: 'list[Callable[..., Any]]',
    *,
    dataset_dir: 'str | None' = None,
    passes: 'int' = 1,
    pass_criteria: 'Callable[[list[list[list[Evaluation]]]], tuple[bool, str]] | None' = None,
    from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None,
) -> 'None'
```
**Parameters:**

- `runnable` — callable that takes `eval_input` and runs the app
- `dataset_name` — name of the dataset to load (NOT `dataset_path`)
- `evaluators` — list of evaluator instances
- `pass_criteria` — `ScoreThreshold(threshold=..., pct=...)` (NOT `thresholds`)
- `from_trace` — span selector: use `last_llm_call` or `root`
- `dataset_dir` — override dataset directory (default: reads from config)
- `passes` — number of times to run the full matrix (default: 1)
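
A sketch of a complete eval test assembled from these parameters (the app function, dataset name, and evaluator choice are illustrative; the test is assumed to run under `pixie test`):

```python
from pixie import (
    assert_dataset_pass, FactualityEval, ScoreThreshold,
    last_llm_call, observe,
)

@observe(name="qa-app")
def qa_app(eval_input):
    ...  # run the real pipeline; @observe instruments it

async def test_qa_dataset():
    await assert_dataset_pass(
        qa_app,
        "qa-golden",                                 # dataset name, not a path
        [FactualityEval()],
        pass_criteria=ScoreThreshold(0.7, pct=0.8),  # 80% of items must score >= 0.7
        from_trace=last_llm_call,
    )
```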
### `ScoreThreshold`
```python
ScoreThreshold(threshold: 'float' = 0.5, pct: 'float' = 1.0) -> None

# threshold: minimum per-item score to count as passing (0.0–1.0)
# pct: fraction of items that must pass (0.0–1.0, default=1.0)
```
### Trace helpers

```python
from pixie import last_llm_call, root

# Pass one of these as the from_trace= argument:
from_trace=last_llm_call  # extract eval data from the most recent LLM call span
from_trace=root           # extract eval data from the root @observe span
```
---

## Evaluator catalog

Import any evaluator directly from `pixie`:

```python
from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
```
### Heuristic (no LLM required)

| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `ExactMatchEval` | `ExactMatchEval() -> 'AutoevalsAdapter'` | Output must exactly equal the expected string | **Yes** |
| `LevenshteinMatch` | `LevenshteinMatch() -> 'AutoevalsAdapter'` | Partial string similarity (edit distance) | **Yes** |
| `NumericDiffEval` | `NumericDiffEval() -> 'AutoevalsAdapter'` | Normalised numeric difference | **Yes** |
| `JSONDiffEval` | `JSONDiffEval(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter'` | Structural JSON comparison | **Yes** |
| `ValidJSONEval` | `ValidJSONEval(*, schema: 'Any' = None) -> 'AutoevalsAdapter'` | Output is valid JSON (optionally matching a schema) | No |
| `ListContainsEval` | `ListContainsEval(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter'` | Output list contains expected items | **Yes** |
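
For example, a schema-free JSON check can pair `ValidJSONEval` with `assert_pass` from the harness section. A sketch (the runnable and inputs are illustrative; `assert_pass` is assumed to be awaitable like `assert_dataset_pass`):

```python
from pixie import assert_pass, ValidJSONEval, observe, root

@observe(name="json-app")
def json_app(eval_input):
    ...  # the real pipeline; should produce JSON output

async def test_output_is_valid_json():
    await assert_pass(
        json_app,
        ["list three colors as a JSON array"],  # explicit eval_inputs
        [ValidJSONEval()],                      # needs no expected_output
        from_trace=root,
    )
```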
### LLM-as-judge (requires an OpenAI key or compatible client)

| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `FactualityEval` | `FactualityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is factually accurate vs reference | **Yes** |
| `ClosedQAEval` | `ClosedQAEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Closed-book QA comparison | **Yes** |
| `SummaryEval` | `SummaryEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Summarisation quality | **Yes** |
| `TranslationEval` | `TranslationEval(*, language: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Translation quality | **Yes** |
| `PossibleEval` | `PossibleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is feasible / plausible | No |
| `SecurityEval` | `SecurityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No security vulnerabilities in output | No |
| `ModerationEval` | `ModerationEval(*, threshold: 'float \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Content moderation | No |
| `BattleEval` | `BattleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Head-to-head comparison | **Yes** |
| `HumorEval` | `HumorEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Humor quality evaluation | **Yes** |
| `EmbeddingSimilarityEval` | `EmbeddingSimilarityEval(*, prefix: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Embedding-based semantic similarity | **Yes** |
### RAG / retrieval

| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `ContextRelevancyEval` | `ContextRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Retrieved context is relevant to query | **Yes** |
| `FaithfulnessEval` | `FaithfulnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is faithful to the provided context | No |
| `AnswerRelevancyEval` | `AnswerRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer addresses the question (⚠️ requires `context` in trace — **RAG pipelines only**) | No |
| `AnswerCorrectnessEval` | `AnswerCorrectnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is correct vs reference | **Yes** |
### Other evaluators

| Evaluator | Signature | Needs `expected_output`? |
| --- | --- | --- |
| `SqlEval` | `SqlEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No |
---

## Custom evaluator — `create_llm_evaluator` factory

```python
from pixie import create_llm_evaluator

create_llm_evaluator(name: 'str', prompt_template: 'str', *, model: 'str' = 'gpt-4o-mini', client: 'Any | None' = None) -> '_LLMEvaluator'
```

- Returns a callable satisfying the `Evaluator` protocol
- Template variables: `{eval_input}`, `{eval_output}`, `{expected_output}` — populated from `Evaluable` fields
- No nested field access — include any needed metadata in `eval_input` when building the dataset
- Score parsing extracts a 0–1 float from the LLM response
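
A sketch of the factory in use (the evaluator name and prompt wording are illustrative):

```python
from pixie import create_llm_evaluator

conciseness_eval = create_llm_evaluator(
    "conciseness",
    "Question: {eval_input}\n"
    "Answer: {eval_output}\n"
    "Rate from 0 to 1 how concisely the answer addresses the question. "
    "Reply with only the number.",
)
```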
### Custom evaluator — manual template

```python
from pixie import Evaluation, Evaluable

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```
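
A custom evaluator plugs in anywhere a built-in is accepted, e.g. `await assert_dataset_pass(runnable, "my-dataset", [my_evaluator])` (the dataset name is illustrative).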