# pixie API Reference
> This file is auto-generated by `generate_api_doc` from the
> live pixie-qa package. Do not edit by hand — run
> `generate_api_doc` to regenerate after updating pixie-qa.
## Configuration
All settings read from environment variables at call time. By default,
every artefact lives inside a single `pixie_qa` project directory:
| Variable | Default | Description |
| ------------------- | -------------------------- | ---------------------------------- |
| `PIXIE_ROOT` | `pixie_qa` | Root directory for all artefacts |
| `PIXIE_DB_PATH` | `pixie_qa/observations.db` | SQLite database file path |
| `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently sqlite) |
| `PIXIE_DATASET_DIR` | `pixie_qa/datasets` | Directory for dataset JSON files |
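
For example, a sketch of pointing every artefact at a custom project directory (the paths are illustrative):

```shell
# Point all pixie artefacts at a custom project directory (paths are illustrative)
export PIXIE_ROOT="./my_qa"
export PIXIE_DB_PATH="$PIXIE_ROOT/observations.db"
export PIXIE_DATASET_DIR="$PIXIE_ROOT/datasets"
```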
---
## Instrumentation API (`pixie`)
```python
from pixie import enable_storage, observe, start_observation, flush, init, add_handler
```
| Function / Decorator | Signature | Notes |
| -------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------- |
| `observe` | `observe(name: 'str \| None' = None) -> 'Callable[[Callable[P, T]], Callable[P, T]]'` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
| `enable_storage` | `enable_storage() -> 'StorageHandler'` | Idempotent. Creates DB, registers handler. Call at app startup. |
| `start_observation` | `start_observation(*, input: 'JsonValue', name: 'str \| None' = None) -> 'Generator[ObservationContext, None, None]'` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
| `flush` | `flush(timeout_seconds: 'float' = 5.0) -> 'bool'` | Drains the queue. Call after a run before using CLI commands. |
| `init` | `init(*, capture_content: 'bool' = True, queue_size: 'int' = 1000) -> 'None'` | Called internally by `enable_storage`. Idempotent. |
| `add_handler` | `add_handler(handler: 'InstrumentationHandler') -> 'None'` | Register a custom handler (must call `init()` first). |
| `remove_handler` | `remove_handler(handler: 'InstrumentationHandler') -> 'None'` | Unregister a previously added handler. |
---
## CLI Commands
```bash
# Trace inspection
pixie trace list [--limit N] [--errors] # show recent traces
pixie trace show <trace_id> [--verbose] [--json] # show span tree for a trace
pixie trace last [--json] # show most recent trace (verbose)
# Dataset management
pixie dataset create <name>
pixie dataset list
pixie dataset save <name> # root span (default)
pixie dataset save <name> --select last_llm_call # last LLM call
pixie dataset save <name> --select by_name --span-name <name>
pixie dataset save <name> --notes "some note"
echo '"expected value"' | pixie dataset save <name> --expected-output
# Run eval tests
pixie test [path] [-k filter_substring] [-v]
```
### `pixie trace` commands
**`pixie trace list`** — show recent traces with summary info (trace ID, root span, timestamp, span count, errors).
- `--limit N` (default 10) — number of traces to show
- `--errors` — show only traces with errors
**`pixie trace show <trace_id>`** — show the span tree for a specific trace.
- Default (compact): span names, types, timing
- `--verbose` / `-v`: full input/output data for each span
- `--json`: machine-readable JSON output
- Trace ID accepts prefix match (first 8+ characters)
**`pixie trace last`** — shortcut to show the most recent trace in verbose mode. This is the primary command to use after running the harness.
- `--json`: machine-readable JSON output
**`pixie dataset save` selection modes:**
- `root` (default) — the outermost `@observe` or `start_observation` span
- `last_llm_call` — the most recent LLM API call span in the trace
- `by_name` — a span matching the `--span-name` argument (takes the last matching span)
---
## Dataset Python API
```python
from pixie import DatasetStore, Evaluable
```
```python
store = DatasetStore() # reads PIXIE_DATASET_DIR
store.append(...) # add one or more items
store.create(...) # create empty / create with items
store.delete(...) # delete entirely
store.get(...) # returns Dataset
store.list(...) # list names
store.list_details(...) # list names with metadata
store.remove(...) # remove by index
```
**`Evaluable` fields:**
- `eval_input`: the input (what `@observe` captured as function kwargs)
- `eval_output`: the output (return value of the observed function)
- `eval_metadata`: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes `trace_id` and `span_id`
- `expected_output`: reference answer for comparison (`UNSET` if not provided)
---
## ObservationStore Python API
```python
from pixie import ObservationStore
store = ObservationStore() # reads PIXIE_DB_PATH
await store.create_tables()
```
```python
await store.create_tables() -> 'None'
await store.get_by_name(name: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]' # → list of spans
await store.get_by_type(span_kind: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]' # → list of spans filtered by kind
await store.get_errors(trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]' # → list of error spans
await store.get_last_llm(trace_id: 'str') -> 'LLMSpan | None' # → most recent LLMSpan
await store.get_root(trace_id: 'str') -> 'ObserveSpan' # → root ObserveSpan
await store.get_trace(trace_id: 'str') -> 'list[ObservationNode]' # → list[ObservationNode] (tree)
await store.get_trace_flat(trace_id: 'str') -> 'list[ObserveSpan | LLMSpan]' # → flat list of all spans
await store.list_traces(limit: 'int' = 50, offset: 'int' = 0) -> 'list[dict[str, Any]]' # → list of trace summaries
await store.save(span: 'ObserveSpan | LLMSpan') -> 'None' # persist a single span
await store.save_many(spans: 'list[ObserveSpan | LLMSpan]') -> 'None' # persist multiple spans
# ObservationNode
node.to_text() # pretty-print span tree
node.find(name) # find a child span by name
node.children # list of child ObservationNode
node.span # the underlying span (ObserveSpan or LLMSpan)
```
---
## Eval Runner API
### `assert_dataset_pass`
```python
await assert_dataset_pass(runnable: 'Callable[..., Any]', dataset_name: 'str', evaluators: 'list[Callable[..., Any]]', *, dataset_dir: 'str | None' = None, passes: 'int' = 1, pass_criteria: 'Callable[[list[list[list[Evaluation]]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None'
```
**Parameters:**
- `runnable` — callable that takes `eval_input` and runs the app
- `dataset_name` — name of the dataset to load (NOT `dataset_path`)
- `evaluators` — list of evaluator instances
- `pass_criteria` — `ScoreThreshold(threshold=..., pct=...)` (NOT `thresholds`)
- `from_trace` — span selector: use `last_llm_call` or `root`
- `dataset_dir` — override dataset directory (default: reads from config)
- `passes` — number of times to run the full matrix (default: 1)
### `ScoreThreshold`
```python
ScoreThreshold(threshold: 'float' = 0.5, pct: 'float' = 1.0) -> None
# threshold: minimum per-item score to count as passing (0.0–1.0)
# pct: fraction of items that must pass (0.0–1.0, default=1.0)
```
### Trace helpers
```python
from pixie import last_llm_call, root
# Pass one of these as the from_trace= argument:
from_trace=last_llm_call # extract eval data from the most recent LLM call span
from_trace=root # extract eval data from the root @observe span
```
---
## Evaluator catalog
Import any evaluator directly from `pixie`:
```python
from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
```
### Heuristic (no LLM required)
| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `ExactMatchEval` | `ExactMatchEval() -> 'AutoevalsAdapter'` | Output must exactly equal the expected string | **Yes** |
| `LevenshteinMatch` | `LevenshteinMatch() -> 'AutoevalsAdapter'` | Partial string similarity (edit distance) | **Yes** |
| `NumericDiffEval` | `NumericDiffEval() -> 'AutoevalsAdapter'` | Normalised numeric difference | **Yes** |
| `JSONDiffEval` | `JSONDiffEval(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter'` | Structural JSON comparison | **Yes** |
| `ValidJSONEval` | `ValidJSONEval(*, schema: 'Any' = None) -> 'AutoevalsAdapter'` | Output is valid JSON (optionally matching a schema) | No |
| `ListContainsEval` | `ListContainsEval(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter'` | Output list contains expected items | **Yes** |
### LLM-as-judge (require OpenAI key or compatible client)
| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `FactualityEval` | `FactualityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is factually accurate vs reference | **Yes** |
| `ClosedQAEval` | `ClosedQAEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Closed-book QA comparison | **Yes** |
| `SummaryEval` | `SummaryEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Summarisation quality | **Yes** |
| `TranslationEval` | `TranslationEval(*, language: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Translation quality | **Yes** |
| `PossibleEval` | `PossibleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is feasible / plausible | No |
| `SecurityEval` | `SecurityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No security vulnerabilities in output | No |
| `ModerationEval` | `ModerationEval(*, threshold: 'float \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Content moderation | No |
| `BattleEval` | `BattleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Head-to-head comparison | **Yes** |
| `HumorEval` | `HumorEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Humor quality evaluation | **Yes** |
| `EmbeddingSimilarityEval` | `EmbeddingSimilarityEval(*, prefix: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Embedding-based semantic similarity | **Yes** |
### RAG / retrieval
| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `ContextRelevancyEval` | `ContextRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Retrieved context is relevant to query | **Yes** |
| `FaithfulnessEval` | `FaithfulnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is faithful to the provided context | No |
| `AnswerRelevancyEval` | `AnswerRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer addresses the question (⚠️ requires `context` in trace — **RAG pipelines only**) | No |
| `AnswerCorrectnessEval` | `AnswerCorrectnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is correct vs reference | **Yes** |
### Other evaluators
| Evaluator | Signature | Needs `expected_output`? |
| --- | --- | --- |
| `SqlEval` | `SqlEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No |
---
## Custom evaluator — `create_llm_evaluator` factory
```python
from pixie import create_llm_evaluator
create_llm_evaluator(name: 'str', prompt_template: 'str', *, model: 'str' = 'gpt-4o-mini', client: 'Any | None' = None) -> '_LLMEvaluator'
```
- Returns a callable satisfying the `Evaluator` protocol
- Template variables: `{eval_input}`, `{eval_output}`, `{expected_output}` — populated from `Evaluable` fields
- No nested field access — include any needed metadata in `eval_input` when building the dataset
- Score parsing extracts a 0–1 float from the LLM response
### Custom evaluator — manual template
```python
from pixie import Evaluation, Evaluable
async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```