# pixie API Reference

> This file is auto-generated by `generate_api_doc` from the
> live pixie-qa package. Do not edit by hand — run
> `generate_api_doc` to regenerate after updating pixie-qa.

## Configuration

All settings are read from environment variables at call time. By default,
every artefact lives inside a single `pixie_qa` project directory:

| Variable            | Default                    | Description                        |
| ------------------- | -------------------------- | ---------------------------------- |
| `PIXIE_ROOT`        | `pixie_qa`                 | Root directory for all artefacts   |
| `PIXIE_DB_PATH`     | `pixie_qa/observations.db` | SQLite database file path          |
| `PIXIE_DB_ENGINE`   | `sqlite`                   | Database engine (currently sqlite) |
| `PIXIE_DATASET_DIR` | `pixie_qa/datasets`        | Directory for dataset JSON files   |
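
Because settings are read at call time, changing an environment variable takes effect on the next call. A minimal sketch of the lookup pattern (illustrative only, not pixie's own code; for example, exporting `PIXIE_DB_PATH=/tmp/obs.db` before a run redirects the SQLite file without touching the other defaults):

```python
import os

# Defaults copied from the table above; this helper is illustrative,
# not part of the pixie-qa package.
_DEFAULTS = {
    "PIXIE_ROOT": "pixie_qa",
    "PIXIE_DB_PATH": "pixie_qa/observations.db",
    "PIXIE_DB_ENGINE": "sqlite",
    "PIXIE_DATASET_DIR": "pixie_qa/datasets",
}

def pixie_setting(name: str) -> str:
    """Read one setting from the environment at call time, else its default."""
    return os.environ.get(name, _DEFAULTS[name])
```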

---

## Instrumentation API (`pixie`)

```python
from pixie import enable_storage, observe, start_observation, flush, init, add_handler
```

| Function / Decorator | Signature | Notes |
| -------------------- | --------- | ----- |
| `observe` | `observe(name: 'str \| None' = None) -> 'Callable[[Callable[P, T]], Callable[P, T]]'` | Wraps a sync or async function. Captures all kwargs as `eval_input`, the return value as `eval_output`. |
| `enable_storage` | `enable_storage() -> 'StorageHandler'` | Idempotent. Creates the DB and registers the handler. Call at app startup. |
| `start_observation` | `start_observation(*, input: 'JsonValue', name: 'str \| None' = None) -> 'Generator[ObservationContext, None, None]'` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
| `flush` | `flush(timeout_seconds: 'float' = 5.0) -> 'bool'` | Drains the queue. Call after a run before using CLI commands. |
| `init` | `init(*, capture_content: 'bool' = True, queue_size: 'int' = 1000) -> 'None'` | Called internally by `enable_storage`. Idempotent. |
| `add_handler` | `add_handler(handler: 'InstrumentationHandler') -> 'None'` | Register a custom handler (must call `init()` first). |
| `remove_handler` | `remove_handler(handler: 'InstrumentationHandler') -> 'None'` | Unregister a previously added handler. |

---

## CLI Commands

```bash
# Trace inspection
pixie trace list [--limit N] [--errors]            # show recent traces
pixie trace show <trace_id> [--verbose] [--json]   # show span tree for a trace
pixie trace last [--json]                          # show most recent trace (verbose)

# Dataset management
pixie dataset create <name>
pixie dataset list
pixie dataset save <name>                          # root span (default)
pixie dataset save <name> --select last_llm_call   # last LLM call
pixie dataset save <name> --select by_name --span-name <name>
pixie dataset save <name> --notes "some note"
echo '"expected value"' | pixie dataset save <name> --expected-output

# Run eval tests
pixie test [path] [-k filter_substring] [-v]
```

### `pixie trace` commands

**`pixie trace list`** — show recent traces with summary info (trace ID, root span, timestamp, span count, errors).

- `--limit N` (default 10) — number of traces to show
- `--errors` — show only traces with errors

**`pixie trace show <trace_id>`** — show the span tree for a specific trace.

- Default (compact): span names, types, timing
- `--verbose` / `-v`: full input/output data for each span
- `--json`: machine-readable JSON output
- Trace ID accepts prefix match (first 8+ characters)

**`pixie trace last`** — shortcut to show the most recent trace in verbose mode. This is the primary command to use after running the harness.

- `--json`: machine-readable JSON output

**`pixie dataset save` selection modes:**

- `root` (default) — the outermost `@observe` or `start_observation` span
- `last_llm_call` — the most recent LLM API call span in the trace
- `by_name` — a span matching the `--span-name` argument (takes the last matching span)

---

## Dataset Python API

```python
from pixie import DatasetStore, Evaluable
```

```python
store = DatasetStore()    # reads PIXIE_DATASET_DIR
store.append(...)         # add one or more items
store.create(...)         # create empty / create with items
store.delete(...)         # delete entirely
store.get(...)            # returns Dataset
store.list(...)           # list names
store.list_details(...)   # list names with metadata
store.remove(...)         # remove by index
```

**`Evaluable` fields:**

- `eval_input`: the input (what `@observe` captured as function kwargs)
- `eval_output`: the output (return value of the observed function)
- `eval_metadata`: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes `trace_id` and `span_id`
- `expected_output`: reference answer for comparison (`UNSET` if not provided)
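
Put together, a dataset item can be pictured as a small record. A dataclass sketch of the fields above (the real `Evaluable` and its `UNSET` sentinel live in pixie and may differ in detail):

```python
from dataclasses import dataclass, field
from typing import Any

UNSET = object()  # stand-in sentinel; pixie exports its own UNSET

@dataclass
class EvaluableSketch:
    eval_input: Any                                               # captured kwargs
    eval_output: Any                                              # captured return value
    eval_metadata: dict[str, Any] = field(default_factory=dict)   # always has trace_id / span_id
    expected_output: Any = UNSET                                  # optional reference answer

item = EvaluableSketch(
    eval_input={"question": "capital of France?"},
    eval_output="Paris",
    eval_metadata={"trace_id": "abc123", "span_id": "def456"},
)
```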

---

## ObservationStore Python API

```python
from pixie import ObservationStore

store = ObservationStore()   # reads PIXIE_DB_PATH
await store.create_tables()
```

```python
await store.create_tables() -> 'None'
await store.get_by_name(name: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'        # → list of spans
await store.get_by_type(span_kind: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'   # → list of spans filtered by kind
await store.get_errors(trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'                      # → list of error spans
await store.get_last_llm(trace_id: 'str') -> 'LLMSpan | None'                  # → most recent LLMSpan
await store.get_root(trace_id: 'str') -> 'ObserveSpan'                         # → root ObserveSpan
await store.get_trace(trace_id: 'str') -> 'list[ObservationNode]'              # → span tree
await store.get_trace_flat(trace_id: 'str') -> 'list[ObserveSpan | LLMSpan]'   # → flat list of all spans
await store.list_traces(limit: 'int' = 50, offset: 'int' = 0) -> 'list[dict[str, Any]]'   # → list of trace summaries
await store.save(span: 'ObserveSpan | LLMSpan') -> 'None'                      # persist a single span
await store.save_many(spans: 'list[ObserveSpan | LLMSpan]') -> 'None'          # persist multiple spans

# ObservationNode
node.to_text()    # pretty-print span tree
node.find(name)   # find a child span by name
node.children     # list of child ObservationNode
node.span         # the underlying span (ObserveSpan or LLMSpan)
```

---

## Eval Runner API

### `assert_dataset_pass`

```python
await assert_dataset_pass(runnable: 'Callable[..., Any]', dataset_name: 'str', evaluators: 'list[Callable[..., Any]]', *, dataset_dir: 'str | None' = None, passes: 'int' = 1, pass_criteria: 'Callable[[list[list[list[Evaluation]]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None'
```

**Parameters:**

- `runnable` — callable that takes `eval_input` and runs the app
- `dataset_name` — name of the dataset to load (NOT `dataset_path`)
- `evaluators` — list of evaluator instances
- `pass_criteria` — `ScoreThreshold(threshold=..., pct=...)` (NOT `thresholds`)
- `from_trace` — span selector: use `last_llm_call` or `root`
- `dataset_dir` — override dataset directory (default: reads from config)
- `passes` — number of times to run the full matrix (default: 1)
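
A custom `pass_criteria` receives results nested as passes × dataset items × evaluators and returns `(passed, message)`. A hypothetical rule, sketched with bare floats standing in for `Evaluation` objects (in real use each inner value is an `Evaluation` with a `score`):

```python
def mean_pass_criteria(results, minimum: float = 0.7):
    """Hypothetical pass rule: the mean score across all cells must reach `minimum`.

    `results` is nested passes x items x evaluators; bare floats stand in
    for Evaluation objects in this sketch.
    """
    scores = [s for one_pass in results for item in one_pass for s in item]
    mean = sum(scores) / len(scores) if scores else 0.0
    return mean >= minimum, f"mean score {mean:.2f} (minimum {minimum})"
```

Here one pass over two items, each scored by two evaluators, would arrive as `[[[0.8, 0.9], [0.6, 0.7]]]`.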

### `ScoreThreshold`

```python
ScoreThreshold(threshold: 'float' = 0.5, pct: 'float' = 1.0) -> None

# threshold: minimum per-item score to count as passing (0.0–1.0)
# pct: fraction of items that must pass (0.0–1.0, default=1.0)
```
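
The two knobs combine into a single rule: an item passes when its score is at or above `threshold`, and the dataset passes when the passing fraction is at least `pct`. A plain-Python sketch of that rule (the exact comparison operators in pixie are an assumption here):

```python
def score_threshold_passes(scores: list[float], threshold: float = 0.5, pct: float = 1.0) -> bool:
    """Sketch of ScoreThreshold semantics: the fraction of items with
    score >= threshold must be at least pct."""
    if not scores:
        return True  # vacuously passing on an empty dataset (assumption)
    passing = sum(1 for s in scores if s >= threshold)
    return passing / len(scores) >= pct
```

With the defaults (`threshold=0.5`, `pct=1.0`), every item must score at least 0.5.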

### Trace helpers

```python
from pixie import last_llm_call, root

# Pass one of these as the from_trace= argument:
from_trace=last_llm_call   # extract eval data from the most recent LLM call span
from_trace=root            # extract eval data from the root @observe span
```

---

## Evaluator catalog

Import any evaluator directly from `pixie`:

```python
from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
```

### Heuristic (no LLM required)

| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `ExactMatchEval` | `ExactMatchEval() -> 'AutoevalsAdapter'` | Output must exactly equal the expected string | **Yes** |
| `LevenshteinMatch` | `LevenshteinMatch() -> 'AutoevalsAdapter'` | Partial string similarity (edit distance) | **Yes** |
| `NumericDiffEval` | `NumericDiffEval() -> 'AutoevalsAdapter'` | Normalised numeric difference | **Yes** |
| `JSONDiffEval` | `JSONDiffEval(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter'` | Structural JSON comparison | **Yes** |
| `ValidJSONEval` | `ValidJSONEval(*, schema: 'Any' = None) -> 'AutoevalsAdapter'` | Output is valid JSON (optionally matching a schema) | No |
| `ListContainsEval` | `ListContainsEval(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter'` | Output list contains expected items | **Yes** |

### LLM-as-judge (require an OpenAI key or compatible client)

| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `FactualityEval` | `FactualityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is factually accurate vs reference | **Yes** |
| `ClosedQAEval` | `ClosedQAEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Closed-book QA comparison | **Yes** |
| `SummaryEval` | `SummaryEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Summarisation quality | **Yes** |
| `TranslationEval` | `TranslationEval(*, language: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Translation quality | **Yes** |
| `PossibleEval` | `PossibleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is feasible / plausible | No |
| `SecurityEval` | `SecurityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No security vulnerabilities in output | No |
| `ModerationEval` | `ModerationEval(*, threshold: 'float \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Content moderation | No |
| `BattleEval` | `BattleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Head-to-head comparison | **Yes** |
| `HumorEval` | `HumorEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Humor quality evaluation | **Yes** |
| `EmbeddingSimilarityEval` | `EmbeddingSimilarityEval(*, prefix: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Embedding-based semantic similarity | **Yes** |

### RAG / retrieval

| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `ContextRelevancyEval` | `ContextRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Retrieved context is relevant to query | **Yes** |
| `FaithfulnessEval` | `FaithfulnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is faithful to the provided context | No |
| `AnswerRelevancyEval` | `AnswerRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer addresses the question (⚠️ requires `context` in trace — **RAG pipelines only**) | No |
| `AnswerCorrectnessEval` | `AnswerCorrectnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is correct vs reference | **Yes** |

### Other evaluators

| Evaluator | Signature | Needs `expected_output`? |
| --- | --- | --- |
| `SqlEval` | `SqlEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No |

---

## Custom evaluator — `create_llm_evaluator` factory

```python
from pixie import create_llm_evaluator

create_llm_evaluator(name: 'str', prompt_template: 'str', *, model: 'str' = 'gpt-4o-mini', client: 'Any | None' = None) -> '_LLMEvaluator'
```

- Returns a callable satisfying the `Evaluator` protocol
- Template variables: `{eval_input}`, `{eval_output}`, `{expected_output}` — populated from `Evaluable` fields
- No nested field access — include any needed metadata in `eval_input` when building the dataset
- Score parsing extracts a 0–1 float from the LLM response

### Custom evaluator — manual template

```python
from pixie import Evaluation, Evaluable

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```