Files
Yiou Li 5f59ddb9cf update eval-driven-dev skill (#1352)
* update eval-driven-dev skill

* small refinement of skill description

* address review, rerun npm start.
2026-04-10 11:19:28 +10:00

368 lines
16 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Testing API Reference
> Auto-generated from pixie source code docstrings.
> Do not edit by hand — regenerate from the upstream [pixie-qa](https://github.com/yiouli/pixie-qa) source repository.
pixie.evals — evaluation harness for LLM applications.
Public API: - `Evaluation` — result dataclass for a single evaluator run - `Evaluator` — protocol for evaluation callables - `evaluate` — run one evaluator against one evaluable - `run_and_evaluate` — evaluate spans from a MemoryTraceHandler - `assert_pass` — batch evaluation with pass/fail criteria - `assert_dataset_pass` — load a dataset and run assert_pass - `EvalAssertionError` — raised when assert_pass fails - `capture_traces` — context manager for in-memory trace capture - `MemoryTraceHandler` — InstrumentationHandler that collects spans - `ScoreThreshold` — configurable pass criteria - `last_llm_call` / `root` — trace-to-evaluable helpers - `DatasetEntryResult` — evaluation results for a single dataset entry - `DatasetScorecard` — per-dataset scorecard with non-uniform evaluators - `generate_dataset_scorecard_html` — render a scorecard as HTML - `save_dataset_scorecard` — write scorecard HTML to disk
Pre-made evaluators (autoevals adapters): - `AutoevalsAdapter` — generic wrapper for any autoevals `Scorer` - `LevenshteinMatch` — edit-distance string similarity - `ExactMatch` — exact value comparison - `NumericDiff` — normalised numeric difference - `JSONDiff` — structural JSON comparison - `ValidJSON` — JSON syntax / schema validation - `ListContains` — list overlap - `EmbeddingSimilarity` — embedding cosine similarity - `Factuality` — LLM factual accuracy check - `ClosedQA` — closed-book QA evaluation - `Battle` — head-to-head comparison - `Humor` — humor detection - `Security` — security vulnerability check - `Sql` — SQL equivalence - `Summary` — summarisation quality - `Translation` — translation quality - `Possible` — feasibility check - `Moderation` — content moderation - `ContextRelevancy` — RAGAS context relevancy - `Faithfulness` — RAGAS faithfulness - `AnswerRelevancy` — RAGAS answer relevancy - `AnswerCorrectness` — RAGAS answer correctness
## Dataset JSON Format
The dataset is a JSON object with these top-level fields:
```json
{
"name": "customer-faq",
"runnable": "pixie_qa/scripts/run_app.py:AppRunnable",
"evaluators": ["Factuality"],
"entries": [
{
"entry_kwargs": { "question": "Hello" },
"description": "Basic greeting",
"eval_input": [{ "name": "input", "value": "Hello" }],
"expectation": "A friendly greeting that offers to help",
"evaluators": ["...", "ClosedQA"]
}
]
}
```
### Entry structure
All fields are top-level on each entry (flat structure — no nesting):
```
entry:
├── entry_kwargs (required) — args for Runnable.run()
├── eval_input (required) — list of {"name": ..., "value": ...} objects
├── description (required) — human-readable label for the test case
├── expectation (optional) — reference for comparison-based evaluators
├── eval_metadata (optional) — extra per-entry data for custom evaluators
└── evaluators (optional) — evaluator names for THIS entry
```
### Field reference
- `runnable` (required): `filepath:ClassName` reference to the `Runnable`
subclass that drives the app during evaluation.
- `evaluators` (dataset-level, optional): Default evaluator names — applied to
every entry that does not declare its own `evaluators`.
- `entries[].entry_kwargs` (required): Kwargs passed to `Runnable.run()` as a
Pydantic model. Keys must match the fields of the Pydantic model used in
`run(args: T)`.
- `entries[].description` (required): Human-readable label for the test case.
- `entries[].eval_input` (required): List of `{"name": ..., "value": ...}`
objects. Used to populate the wrap input registry — `wrap(purpose="input")`
calls in the app return registry values keyed by `name`.
- `entries[].expectation` (optional): Concise expectation description
for comparison-based evaluators. Should describe what a correct output looks
like, **not** copy the verbatim output. Use `pixie format` on the trace to
see the real output shape, then write a shorter description.
- `entries[].eval_metadata` (optional): Extra per-entry data for custom
evaluators — e.g., expected tool names, boolean flags, thresholds. Accessed in
evaluators as `evaluable.eval_metadata`.
- `entries[].evaluators` (optional): Row-level evaluator override. Rules:
- Omit → entry inherits dataset-level `evaluators`.
- `["...", "ClosedQA"]` → dataset defaults **plus** ClosedQA.
- `["OnlyThis"]` (no `"..."`) → **only** OnlyThis, no defaults.
## Evaluator Name Resolution
In dataset JSON, evaluator names are resolved as follows:
- **Built-in names** (bare names like `"Factuality"`, `"ExactMatch"`) are
resolved to `pixie.{Name}` automatically.
- **Custom evaluators** use `filepath:callable_name` format
(e.g. `"pixie_qa/evaluators.py:my_evaluator"`).
- Custom evaluator references point to module-level callables — classes
(instantiated automatically), factory functions (called if zero-arg),
evaluator functions (used as-is), or pre-instantiated callables (e.g.
`create_llm_evaluator` results — used as-is).
## CLI Commands
| Command | Description |
| ------------------------------------------- | ------------------------------------- |
| `pixie test [path] [-v] [--no-open]` | Run eval tests on dataset files |
| `pixie dataset create <name>` | Create a new empty dataset |
| `pixie dataset list` | List all datasets |
| `pixie dataset save <name> [--select MODE]` | Save a span to a dataset |
| `pixie dataset validate [path]` | Validate dataset JSON files |
| `pixie analyze <test_run_id>` | Generate analysis and recommendations |
---
## Types
### `Evaluable`
```python
class Evaluable(TestCase):
eval_output: list[NamedData] # wrap(purpose="output") + wrap(purpose="state") values
# Inherited from TestCase:
# eval_input: list[NamedData] # from eval_input in dataset entry
# expectation: JsonValue | _Unset # from expectation in dataset entry
# eval_metadata: dict[str, JsonValue] | None # from eval_metadata in dataset entry
# description: str | None
```
Data carrier for evaluators. Extends `TestCase` with actual output.
- `eval_input``list[NamedData]` populated from the entry's `eval_input` field. **Must have at least one item** (`min_length=1`).
- `eval_output``list[NamedData]` containing ALL `wrap(purpose="output")` and `wrap(purpose="state")` values captured during the run. Each item has `.name` (str) and `.value` (JsonValue). Use `_get_output(evaluable, "name")` to look up by name.
- `eval_metadata``dict[str, JsonValue] | None` from the entry's `eval_metadata` field
- `expected_output` — expectation text from dataset (or `UNSET` if not provided)
Attributes:
eval_input: Named input data items (from dataset). Must be non-empty.
eval_output: Named output data items (from wrap calls during run).
Each item has `.name` (str) and `.value` (JsonValue).
Contains ALL `wrap(purpose="output")` and `wrap(purpose="state")` values.
eval_metadata: Supplementary metadata (`None` when absent).
expected_output: The expected/reference output for evaluation.
Defaults to `UNSET` (not provided). May be explicitly
set to `None` to indicate "there is no expected output".
### How `wrap()` maps to `Evaluable` fields at test time
When `pixie test` runs a dataset entry, `wrap()` calls in the app populate the `Evaluable` that evaluators receive:
| `wrap()` call in app code | Evaluable field | Type | How to access in evaluator |
| ---------------------------------------- | ----------------- | ----------------- | ---------------------------------------------------- |
| `wrap(data, purpose="input", name="X")` | `eval_input` | `list[NamedData]` | Pre-populated from `eval_input` in the dataset entry |
| `wrap(data, purpose="output", name="X")` | `eval_output` | `list[NamedData]` | `_get_output(evaluable, "X")` — see helper below |
| `wrap(data, purpose="state", name="X")` | `eval_output` | `list[NamedData]` | `_get_output(evaluable, "X")` — same list as output |
| (from dataset entry `expectation`) | `expected_output` | `str \| None` | `evaluable.expected_output` |
| (from dataset entry `eval_metadata`) | `eval_metadata` | `dict \| None` | `evaluable.eval_metadata` |
**Key insight**: Both `purpose="output"` and `purpose="state"` wrap values end up in `eval_output` as `NamedData` items. There is no separate `captured_output` or `captured_state` dict. Use the helper function below to look up values by wrap name:
```python
def _get_output(evaluable: Evaluable, name: str) -> Any:
"""Look up a wrap value by name from eval_output."""
for item in evaluable.eval_output:
if item.name == name:
return item.value
return None
```
**`eval_metadata`** is for passing extra per-entry data to evaluators that isn't an app input or output — e.g., expected tool names, boolean flags, thresholds. Defined as a top-level field on the entry, accessed as `evaluable.eval_metadata`.
**Complete custom evaluator example** (tool call check + dataset entry):
```python
from pixie import Evaluation, Evaluable
def _get_output(evaluable: Evaluable, name: str) -> Any:
"""Look up a wrap value by name from eval_output."""
for item in evaluable.eval_output:
if item.name == name:
return item.value
return None
def tool_call_check(evaluable: Evaluable, *, trace=None) -> Evaluation:
expected = evaluable.eval_metadata.get("expected_tool") if evaluable.eval_metadata else None
actual = _get_output(evaluable, "function_called")
if expected is None:
return Evaluation(score=1.0, reasoning="No expected_tool specified")
match = str(actual) == str(expected)
return Evaluation(
score=1.0 if match else 0.0,
reasoning=f"Expected {expected}, got {actual}",
)
```
Corresponding dataset entry:
```json
{
"entry_kwargs": { "user_message": "I want to end this call" },
"description": "User requests call end after failed verification",
"eval_input": [{ "name": "user_input", "value": "I want to end this call" }],
"expectation": "Agent should call endCall tool",
"eval_metadata": {
"expected_tool": "endCall",
"expected_call_ended": true
},
"evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"]
}
```
### `Evaluation`
```python
Evaluation(score: 'float', reasoning: 'str', details: 'dict[str, Any]' = <factory>) -> None
```
The result of a single evaluator applied to a single test case.
Attributes:
score: Evaluation score between 0.0 and 1.0.
reasoning: Human-readable explanation (required).
details: Arbitrary JSON-serializable metadata.
### `ScoreThreshold`
```python
ScoreThreshold(threshold: 'float' = 0.5, pct: 'float' = 1.0) -> None
```
Pass criteria: _pct_ fraction of inputs must score >= _threshold_ on all evaluators.
Attributes:
threshold: Minimum score an individual evaluation must reach.
pct: Fraction of test-case inputs (0.01.0) that must pass.
## Eval Functions
### `pixie.run_and_evaluate`
```python
pixie.run_and_evaluate(evaluator: 'Callable[..., Any]', runnable: 'Callable[..., Any]', eval_input: 'Any', *, expected_output: 'Any' = <object object at 0x7788c2ad5c80>, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'Evaluation'
```
Run _runnable(eval_input)_ while capturing traces, then evaluate.
Convenience wrapper combining `_run_and_capture` and `evaluate`.
The runnable is called exactly once.
Args:
evaluator: An evaluator callable (sync or async).
runnable: The application function to test.
eval*input: The single input passed to \_runnable*.
expected_output: Optional expected value merged into the
evaluable.
from_trace: Optional callable to select a specific span from
the trace tree for evaluation.
Returns:
The `Evaluation` result.
Raises:
ValueError: If no spans were captured during execution.
### `pixie.assert_pass`
```python
pixie.assert_pass(runnable: 'Callable[..., Any]', eval_inputs: 'list[Any]', evaluators: 'list[Callable[..., Any]]', *, evaluables: 'list[Evaluable] | None' = None, pass_criteria: 'Callable[[list[list[Evaluation]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None'
```
Run evaluators against a runnable over multiple inputs.
For each input, runs the runnable once via `_run_and_capture`,
then evaluates with every evaluator concurrently via
`asyncio.gather`.
The results matrix has shape `[eval_inputs][evaluators]`.
If the pass criteria are not met, raises :class:`EvalAssertionError`
carrying the matrix.
When `evaluables` is provided, behaviour depends on whether each
item already has `eval_output` populated:
- **eval_output is None** — the `runnable` is called via
`run_and_evaluate` to produce an output from traces, and
`expected_output` from the evaluable is merged into the result.
- **eval_output is not None** — the evaluable is used directly
(the runnable is not called for that item).
Args:
runnable: The application function to test.
eval*inputs: List of inputs, each passed to \_runnable*.
evaluators: List of evaluator callables.
evaluables: Optional list of `Evaluable` items, one per input.
When provided, their `expected_output` is forwarded to
`run_and_evaluate`. Must have the same length as
_eval_inputs_.
pass_criteria: Receives the results matrix, returns
`(passed, message)`. Defaults to `ScoreThreshold()`.
from_trace: Optional span selector forwarded to
`run_and_evaluate`.
Raises:
EvalAssertionError: When pass criteria are not met.
ValueError: When _evaluables_ length does not match _eval_inputs_.
### `pixie.assert_dataset_pass`
```python
pixie.assert_dataset_pass(runnable: 'Callable[..., Any]', dataset_name: 'str', evaluators: 'list[Callable[..., Any]]', *, dataset_dir: 'str | None' = None, pass_criteria: 'Callable[[list[list[Evaluation]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None'
```
Load a dataset by name, then run `assert_pass` with its items.
This is a convenience wrapper that:
1. Loads the dataset from the `DatasetStore`.
2. Extracts `eval_input` from each item as the runnable inputs.
3. Uses the full `Evaluable` items (which carry `expected_output`)
as the evaluables.
4. Delegates to `assert_pass`.
Args:
runnable: The application function to test.
dataset_name: Name of the dataset to load.
evaluators: List of evaluator callables.
dataset_dir: Override directory for the dataset store.
When `None`, reads from `PixieConfig.dataset_dir`.
pass_criteria: Receives the results matrix, returns
`(passed, message)`.
from_trace: Optional span selector forwarded to
`assert_pass`.
Raises:
FileNotFoundError: If no dataset with _dataset_name_ exists.
EvalAssertionError: When pass criteria are not met.
## Trace Helpers
### `pixie.last_llm_call`
```python
pixie.last_llm_call(trace: 'list[ObservationNode]') -> 'Evaluable'
```
Find the `LLMSpan` with the latest `ended_at` in the trace tree.
Args:
trace: The trace tree (list of root `ObservationNode` instances).
Returns:
An `Evaluable` wrapping the most recently ended `LLMSpan`.
Raises:
ValueError: If no `LLMSpan` exists in the trace.
### `pixie.root`
```python
pixie.root(trace: 'list[ObservationNode]') -> 'Evaluable'
```
Return the first root node's span as `Evaluable`.
Args:
trace: The trace tree (list of root `ObservationNode` instances).
Returns:
An `Evaluable` wrapping the first root node's span.
Raises:
ValueError: If the trace is empty.
### `pixie.capture_traces`
```python
pixie.capture_traces() -> 'Generator[MemoryTraceHandler, None, None]'
```
Context manager that installs a `MemoryTraceHandler` and yields it.
Calls `init()` (no-op if already initialised) then registers the
handler via `add_handler()`. On exit the handler is removed and
the delivery queue is flushed so that all spans are available on
`handler.spans`.