# Testing API Reference > Auto-generated from pixie source code docstrings. > Do not edit by hand — regenerate from the upstream [pixie-qa](https://github.com/yiouli/pixie-qa) source repository. pixie.evals — evaluation harness for LLM applications. Public API: - `Evaluation` — result dataclass for a single evaluator run - `Evaluator` — protocol for evaluation callables - `evaluate` — run one evaluator against one evaluable - `run_and_evaluate` — evaluate spans from a MemoryTraceHandler - `assert_pass` — batch evaluation with pass/fail criteria - `assert_dataset_pass` — load a dataset and run assert_pass - `EvalAssertionError` — raised when assert_pass fails - `capture_traces` — context manager for in-memory trace capture - `MemoryTraceHandler` — InstrumentationHandler that collects spans - `ScoreThreshold` — configurable pass criteria - `last_llm_call` / `root` — trace-to-evaluable helpers - `DatasetEntryResult` — evaluation results for a single dataset entry - `DatasetScorecard` — per-dataset scorecard with non-uniform evaluators - `generate_dataset_scorecard_html` — render a scorecard as HTML - `save_dataset_scorecard` — write scorecard HTML to disk Pre-made evaluators (autoevals adapters): - `AutoevalsAdapter` — generic wrapper for any autoevals `Scorer` - `LevenshteinMatch` — edit-distance string similarity - `ExactMatch` — exact value comparison - `NumericDiff` — normalised numeric difference - `JSONDiff` — structural JSON comparison - `ValidJSON` — JSON syntax / schema validation - `ListContains` — list overlap - `EmbeddingSimilarity` — embedding cosine similarity - `Factuality` — LLM factual accuracy check - `ClosedQA` — closed-book QA evaluation - `Battle` — head-to-head comparison - `Humor` — humor detection - `Security` — security vulnerability check - `Sql` — SQL equivalence - `Summary` — summarisation quality - `Translation` — translation quality - `Possible` — feasibility check - `Moderation` — content moderation - `ContextRelevancy` — RAGAS context relevancy - `Faithfulness` — RAGAS faithfulness - `AnswerRelevancy` — RAGAS answer relevancy - `AnswerCorrectness` — RAGAS answer correctness ## Dataset JSON Format The dataset is a JSON object with these top-level fields: ```json { "name": "customer-faq", "runnable": "pixie_qa/scripts/run_app.py:AppRunnable", "evaluators": ["Factuality"], "entries": [ { "entry_kwargs": { "question": "Hello" }, "description": "Basic greeting", "eval_input": [{ "name": "input", "value": "Hello" }], "expectation": "A friendly greeting that offers to help", "evaluators": ["...", "ClosedQA"] } ] } ``` ### Entry structure All fields are top-level on each entry (flat structure — no nesting): ``` entry: ├── entry_kwargs (required) — args for Runnable.run() ├── eval_input (required) — list of {"name": ..., "value": ...} objects ├── description (required) — human-readable label for the test case ├── expectation (optional) — reference for comparison-based evaluators ├── eval_metadata (optional) — extra per-entry data for custom evaluators └── evaluators (optional) — evaluator names for THIS entry ``` ### Field reference - `runnable` (required): `filepath:ClassName` reference to the `Runnable` subclass that drives the app during evaluation. - `evaluators` (dataset-level, optional): Default evaluator names — applied to every entry that does not declare its own `evaluators`. - `entries[].entry_kwargs` (required): Kwargs passed to `Runnable.run()` as a Pydantic model. Keys must match the fields of the Pydantic model used in `run(args: T)`. - `entries[].description` (required): Human-readable label for the test case. - `entries[].eval_input` (required): List of `{"name": ..., "value": ...}` objects. Used to populate the wrap input registry — `wrap(purpose="input")` calls in the app return registry values keyed by `name`. - `entries[].expectation` (optional): Concise expectation description for comparison-based evaluators. Should describe what a correct output looks like, **not** copy the verbatim output. Use `pixie format` on the trace to see the real output shape, then write a shorter description. - `entries[].eval_metadata` (optional): Extra per-entry data for custom evaluators — e.g., expected tool names, boolean flags, thresholds. Accessed in evaluators as `evaluable.eval_metadata`. - `entries[].evaluators` (optional): Row-level evaluator override. Rules: - Omit → entry inherits dataset-level `evaluators`. - `["...", "ClosedQA"]` → dataset defaults **plus** ClosedQA. - `["OnlyThis"]` (no `"..."`) → **only** OnlyThis, no defaults. ## Evaluator Name Resolution In dataset JSON, evaluator names are resolved as follows: - **Built-in names** (bare names like `"Factuality"`, `"ExactMatch"`) are resolved to `pixie.{Name}` automatically. - **Custom evaluators** use `filepath:callable_name` format (e.g. `"pixie_qa/evaluators.py:my_evaluator"`). - Custom evaluator references point to module-level callables — classes (instantiated automatically), factory functions (called if zero-arg), evaluator functions (used as-is), or pre-instantiated callables (e.g. `create_llm_evaluator` results — used as-is). ## CLI Commands | Command | Description | | ------------------------------------------- | ------------------------------------- | | `pixie test [path] [-v] [--no-open]` | Run eval tests on dataset files | | `pixie dataset create ` | Create a new empty dataset | | `pixie dataset list` | List all datasets | | `pixie dataset save [--select MODE]` | Save a span to a dataset | | `pixie dataset validate [path]` | Validate dataset JSON files | | `pixie analyze ` | Generate analysis and recommendations | --- ## Types ### `Evaluable` ```python class Evaluable(TestCase): eval_output: list[NamedData] # wrap(purpose="output") + wrap(purpose="state") values # Inherited from TestCase: # eval_input: list[NamedData] # from eval_input in dataset entry # expectation: JsonValue | _Unset # from expectation in dataset entry # eval_metadata: dict[str, JsonValue] | None # from eval_metadata in dataset entry # description: str | None ``` Data carrier for evaluators. Extends `TestCase` with actual output. - `eval_input` — `list[NamedData]` populated from the entry's `eval_input` field. **Must have at least one item** (`min_length=1`). - `eval_output` — `list[NamedData]` containing ALL `wrap(purpose="output")` and `wrap(purpose="state")` values captured during the run. Each item has `.name` (str) and `.value` (JsonValue). Use `_get_output(evaluable, "name")` to look up by name. - `eval_metadata` — `dict[str, JsonValue] | None` from the entry's `eval_metadata` field - `expected_output` — expectation text from dataset (or `UNSET` if not provided) Attributes: eval_input: Named input data items (from dataset). Must be non-empty. eval_output: Named output data items (from wrap calls during run). Each item has `.name` (str) and `.value` (JsonValue). Contains ALL `wrap(purpose="output")` and `wrap(purpose="state")` values. eval_metadata: Supplementary metadata (`None` when absent). expected_output: The expected/reference output for evaluation. Defaults to `UNSET` (not provided). May be explicitly set to `None` to indicate "there is no expected output". ### How `wrap()` maps to `Evaluable` fields at test time When `pixie test` runs a dataset entry, `wrap()` calls in the app populate the `Evaluable` that evaluators receive: | `wrap()` call in app code | Evaluable field | Type | How to access in evaluator | | ---------------------------------------- | ----------------- | ----------------- | ---------------------------------------------------- | | `wrap(data, purpose="input", name="X")` | `eval_input` | `list[NamedData]` | Pre-populated from `eval_input` in the dataset entry | | `wrap(data, purpose="output", name="X")` | `eval_output` | `list[NamedData]` | `_get_output(evaluable, "X")` — see helper below | | `wrap(data, purpose="state", name="X")` | `eval_output` | `list[NamedData]` | `_get_output(evaluable, "X")` — same list as output | | (from dataset entry `expectation`) | `expected_output` | `str \| None` | `evaluable.expected_output` | | (from dataset entry `eval_metadata`) | `eval_metadata` | `dict \| None` | `evaluable.eval_metadata` | **Key insight**: Both `purpose="output"` and `purpose="state"` wrap values end up in `eval_output` as `NamedData` items. There is no separate `captured_output` or `captured_state` dict. Use the helper function below to look up values by wrap name: ```python def _get_output(evaluable: Evaluable, name: str) -> Any: """Look up a wrap value by name from eval_output.""" for item in evaluable.eval_output: if item.name == name: return item.value return None ``` **`eval_metadata`** is for passing extra per-entry data to evaluators that isn't an app input or output — e.g., expected tool names, boolean flags, thresholds. Defined as a top-level field on the entry, accessed as `evaluable.eval_metadata`. **Complete custom evaluator example** (tool call check + dataset entry): ```python from pixie import Evaluation, Evaluable def _get_output(evaluable: Evaluable, name: str) -> Any: """Look up a wrap value by name from eval_output.""" for item in evaluable.eval_output: if item.name == name: return item.value return None def tool_call_check(evaluable: Evaluable, *, trace=None) -> Evaluation: expected = evaluable.eval_metadata.get("expected_tool") if evaluable.eval_metadata else None actual = _get_output(evaluable, "function_called") if expected is None: return Evaluation(score=1.0, reasoning="No expected_tool specified") match = str(actual) == str(expected) return Evaluation( score=1.0 if match else 0.0, reasoning=f"Expected {expected}, got {actual}", ) ``` Corresponding dataset entry: ```json { "entry_kwargs": { "user_message": "I want to end this call" }, "description": "User requests call end after failed verification", "eval_input": [{ "name": "user_input", "value": "I want to end this call" }], "expectation": "Agent should call endCall tool", "eval_metadata": { "expected_tool": "endCall", "expected_call_ended": true }, "evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"] } ``` ### `Evaluation` ```python Evaluation(score: 'float', reasoning: 'str', details: 'dict[str, Any]' = ) -> None ``` The result of a single evaluator applied to a single test case. Attributes: score: Evaluation score between 0.0 and 1.0. reasoning: Human-readable explanation (required). details: Arbitrary JSON-serializable metadata. ### `ScoreThreshold` ```python ScoreThreshold(threshold: 'float' = 0.5, pct: 'float' = 1.0) -> None ``` Pass criteria: _pct_ fraction of inputs must score >= _threshold_ on all evaluators. Attributes: threshold: Minimum score an individual evaluation must reach. pct: Fraction of test-case inputs (0.0–1.0) that must pass. ## Eval Functions ### `pixie.run_and_evaluate` ```python pixie.run_and_evaluate(evaluator: 'Callable[..., Any]', runnable: 'Callable[..., Any]', eval_input: 'Any', *, expected_output: 'Any' = , from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'Evaluation' ``` Run _runnable(eval_input)_ while capturing traces, then evaluate. Convenience wrapper combining `_run_and_capture` and `evaluate`. The runnable is called exactly once. Args: evaluator: An evaluator callable (sync or async). runnable: The application function to test. eval*input: The single input passed to \_runnable*. expected_output: Optional expected value merged into the evaluable. from_trace: Optional callable to select a specific span from the trace tree for evaluation. Returns: The `Evaluation` result. Raises: ValueError: If no spans were captured during execution. ### `pixie.assert_pass` ```python pixie.assert_pass(runnable: 'Callable[..., Any]', eval_inputs: 'list[Any]', evaluators: 'list[Callable[..., Any]]', *, evaluables: 'list[Evaluable] | None' = None, pass_criteria: 'Callable[[list[list[Evaluation]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None' ``` Run evaluators against a runnable over multiple inputs. For each input, runs the runnable once via `_run_and_capture`, then evaluates with every evaluator concurrently via `asyncio.gather`. The results matrix has shape `[eval_inputs][evaluators]`. If the pass criteria are not met, raises :class:`EvalAssertionError` carrying the matrix. When `evaluables` is provided, behaviour depends on whether each item already has `eval_output` populated: - **eval_output is None** — the `runnable` is called via `run_and_evaluate` to produce an output from traces, and `expected_output` from the evaluable is merged into the result. - **eval_output is not None** — the evaluable is used directly (the runnable is not called for that item). Args: runnable: The application function to test. eval*inputs: List of inputs, each passed to \_runnable*. evaluators: List of evaluator callables. evaluables: Optional list of `Evaluable` items, one per input. When provided, their `expected_output` is forwarded to `run_and_evaluate`. Must have the same length as _eval_inputs_. pass_criteria: Receives the results matrix, returns `(passed, message)`. Defaults to `ScoreThreshold()`. from_trace: Optional span selector forwarded to `run_and_evaluate`. Raises: EvalAssertionError: When pass criteria are not met. ValueError: When _evaluables_ length does not match _eval_inputs_. ### `pixie.assert_dataset_pass` ```python pixie.assert_dataset_pass(runnable: 'Callable[..., Any]', dataset_name: 'str', evaluators: 'list[Callable[..., Any]]', *, dataset_dir: 'str | None' = None, pass_criteria: 'Callable[[list[list[Evaluation]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None' ``` Load a dataset by name, then run `assert_pass` with its items. This is a convenience wrapper that: 1. Loads the dataset from the `DatasetStore`. 2. Extracts `eval_input` from each item as the runnable inputs. 3. Uses the full `Evaluable` items (which carry `expected_output`) as the evaluables. 4. Delegates to `assert_pass`. Args: runnable: The application function to test. dataset_name: Name of the dataset to load. evaluators: List of evaluator callables. dataset_dir: Override directory for the dataset store. When `None`, reads from `PixieConfig.dataset_dir`. pass_criteria: Receives the results matrix, returns `(passed, message)`. from_trace: Optional span selector forwarded to `assert_pass`. Raises: FileNotFoundError: If no dataset with _dataset_name_ exists. EvalAssertionError: When pass criteria are not met. ## Trace Helpers ### `pixie.last_llm_call` ```python pixie.last_llm_call(trace: 'list[ObservationNode]') -> 'Evaluable' ``` Find the `LLMSpan` with the latest `ended_at` in the trace tree. Args: trace: The trace tree (list of root `ObservationNode` instances). Returns: An `Evaluable` wrapping the most recently ended `LLMSpan`. Raises: ValueError: If no `LLMSpan` exists in the trace. ### `pixie.root` ```python pixie.root(trace: 'list[ObservationNode]') -> 'Evaluable' ``` Return the first root node's span as `Evaluable`. Args: trace: The trace tree (list of root `ObservationNode` instances). Returns: An `Evaluable` wrapping the first root node's span. Raises: ValueError: If the trace is empty. ### `pixie.capture_traces` ```python pixie.capture_traces() -> 'Generator[MemoryTraceHandler, None, None]' ``` Context manager that installs a `MemoryTraceHandler` and yields it. Calls `init()` (no-op if already initialised) then registers the handler via `add_handler()`. On exit the handler is removed and the delivery queue is flushed so that all spans are available on `handler.spans`.