Testing API Reference
Auto-generated from pixie source code docstrings. Do not edit by hand — regenerate from the upstream pixie-qa source repository.
pixie.evals — evaluation harness for LLM applications.
Public API:

- Evaluation: result dataclass for a single evaluator run
- Evaluator: protocol for evaluation callables
- evaluate: run one evaluator against one evaluable
- run_and_evaluate: evaluate spans from a MemoryTraceHandler
- assert_pass: batch evaluation with pass/fail criteria
- assert_dataset_pass: load a dataset and run assert_pass
- EvalAssertionError: raised when assert_pass fails
- capture_traces: context manager for in-memory trace capture
- MemoryTraceHandler: InstrumentationHandler that collects spans
- ScoreThreshold: configurable pass criteria
- last_llm_call / root: trace-to-evaluable helpers
- DatasetEntryResult: evaluation results for a single dataset entry
- DatasetScorecard: per-dataset scorecard with non-uniform evaluators
- generate_dataset_scorecard_html: render a scorecard as HTML
- save_dataset_scorecard: write scorecard HTML to disk
Pre-made evaluators (autoevals adapters):

- AutoevalsAdapter: generic wrapper for any autoevals Scorer
- LevenshteinMatch: edit-distance string similarity
- ExactMatch: exact value comparison
- NumericDiff: normalised numeric difference
- JSONDiff: structural JSON comparison
- ValidJSON: JSON syntax / schema validation
- ListContains: list overlap
- EmbeddingSimilarity: embedding cosine similarity
- Factuality: LLM factual accuracy check
- ClosedQA: closed-book QA evaluation
- Battle: head-to-head comparison
- Humor: humor detection
- Security: security vulnerability check
- Sql: SQL equivalence
- Summary: summarisation quality
- Translation: translation quality
- Possible: feasibility check
- Moderation: content moderation
- ContextRelevancy: RAGAS context relevancy
- Faithfulness: RAGAS faithfulness
- AnswerRelevancy: RAGAS answer relevancy
- AnswerCorrectness: RAGAS answer correctness
Dataset JSON Format
The dataset is a JSON object with these top-level fields:
{
  "name": "customer-faq",
  "runnable": "pixie_qa/scripts/run_app.py:AppRunnable",
  "evaluators": ["Factuality"],
  "entries": [
    {
      "entry_kwargs": { "question": "Hello" },
      "description": "Basic greeting",
      "eval_input": [{ "name": "input", "value": "Hello" }],
      "expectation": "A friendly greeting that offers to help",
      "evaluators": ["...", "ClosedQA"]
    }
  ]
}
Entry structure
All fields are top-level on each entry (flat structure — no nesting):
entry:
├── entry_kwargs (required) — args for Runnable.run()
├── eval_input (required) — list of {"name": ..., "value": ...} objects
├── description (required) — human-readable label for the test case
├── expectation (optional) — reference for comparison-based evaluators
├── eval_metadata (optional) — extra per-entry data for custom evaluators
└── evaluators (optional) — evaluator names for THIS entry
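The required and optional fields above can be checked mechanically before running a dataset. The following is a minimal sketch using only the standard library; `check_entry` and `REQUIRED_ENTRY_FIELDS` are hypothetical names for illustration and are not part of pixie (the `pixie dataset validate` command is the supported route, and its exact checks may differ).

```python
from typing import Any

# Required top-level fields, taken from the entry structure above.
REQUIRED_ENTRY_FIELDS = ("entry_kwargs", "eval_input", "description")


def check_entry(entry: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one dataset entry (empty if none)."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_ENTRY_FIELDS
        if name not in entry
    ]
    eval_input = entry.get("eval_input")
    if eval_input is not None:
        if not isinstance(eval_input, list) or not eval_input:
            problems.append("eval_input must be a non-empty list")
        else:
            for i, item in enumerate(eval_input):
                # Each item is expected to be a {"name": ..., "value": ...} object.
                if not isinstance(item, dict) or not {"name", "value"} <= item.keys():
                    problems.append(f"eval_input[{i}] needs 'name' and 'value' keys")
    return problems


entry = {
    "entry_kwargs": {"question": "Hello"},
    "description": "Basic greeting",
    "eval_input": [{"name": "input", "value": "Hello"}],
}
print(check_entry(entry))  # []
print(check_entry({"description": "no inputs"}))
```

The second call reports the two missing required fields, while the optional fields (`expectation`, `eval_metadata`, `evaluators`) are never flagged.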
Field reference
- `runnable` (required): `filepath:ClassName` reference to the `Runnable` subclass that drives the app during evaluation.
- `evaluators` (dataset-level, optional): Default evaluator names, applied to every entry that does not declare its own `evaluators`.
- `entries[].entry_kwargs` (required): Kwargs passed to `Runnable.run()` as a Pydantic model. Keys must match the fields of the Pydantic model used in `run(args: T)`.
- `entries[].description` (required): Human-readable label for the test case.
- `entries[].eval_input` (required): List of `{"name": ..., "value": ...}` objects. Used to populate the wrap input registry: `wrap(purpose="input")` calls in the app return registry values keyed by `name`.
- `entries[].expectation` (optional): Concise expectation description for comparison-based evaluators. Should describe what a correct output looks like, not copy the verbatim output. Use `pixie format` on the trace to see the real output shape, then write a shorter description.
- `entries[].eval_metadata` (optional): Extra per-entry data for custom evaluators, e.g. expected tool names, boolean flags, thresholds. Accessed in evaluators as `evaluable.eval_metadata`.
- `entries[].evaluators` (optional): Row-level evaluator override. Rules:
  - Omit → entry inherits dataset-level `evaluators`.
  - `["...", "ClosedQA"]` → dataset defaults plus ClosedQA.
  - `["OnlyThis"]` (no `"..."`) → only OnlyThis, no defaults.
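The three override rules can be read as a small expansion step. The sketch below illustrates them under the assumption that `"..."` is replaced in place by the dataset-level defaults; `resolve_entry_evaluators` is a hypothetical helper, not a pixie function, and pixie's own handling (for example of duplicates) may differ.

```python
from typing import Optional


def resolve_entry_evaluators(
    defaults: list[str], entry_evaluators: Optional[list[str]]
) -> list[str]:
    """Expand an entry's evaluator list against the dataset-level defaults."""
    if entry_evaluators is None:
        # Field omitted: the entry inherits the dataset-level evaluators.
        return list(defaults)
    resolved: list[str] = []
    for name in entry_evaluators:
        if name == "...":
            # Placeholder: splice in the dataset defaults at this position.
            resolved.extend(defaults)
        else:
            resolved.append(name)
    return resolved


defaults = ["Factuality"]
print(resolve_entry_evaluators(defaults, None))                 # ['Factuality']
print(resolve_entry_evaluators(defaults, ["...", "ClosedQA"]))  # ['Factuality', 'ClosedQA']
print(resolve_entry_evaluators(defaults, ["OnlyThis"]))         # ['OnlyThis']
```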
Evaluator Name Resolution
In dataset JSON, evaluator names are resolved as follows:
- Built-in names (bare names like `"Factuality"`, `"ExactMatch"`) are resolved to `pixie.{Name}` automatically.
- Custom evaluators use `filepath:callable_name` format (e.g. `"pixie_qa/evaluators.py:my_evaluator"`).
- Custom evaluator references point to module-level callables: classes (instantiated automatically), factory functions (called if zero-arg), evaluator functions (used as-is), or pre-instantiated callables (e.g. `create_llm_evaluator` results, used as-is).
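As a rough illustration of the naming convention, the sketch below only classifies a name string; it does not import or instantiate anything. `classify_evaluator_name` is a hypothetical helper, and splitting on the last colon is an assumption about how a `filepath:callable_name` reference would be parsed.

```python
def classify_evaluator_name(name: str) -> tuple[str, ...]:
    """Classify a dataset evaluator name as built-in or custom."""
    if ":" in name:
        # Custom reference: split on the LAST colon so the file path stays intact.
        path, _, callable_name = name.rpartition(":")
        return ("custom", path, callable_name)
    # Bare name: resolved to pixie.{Name}.
    return ("builtin", f"pixie.{name}")


print(classify_evaluator_name("Factuality"))
# ('builtin', 'pixie.Factuality')
print(classify_evaluator_name("pixie_qa/evaluators.py:my_evaluator"))
# ('custom', 'pixie_qa/evaluators.py', 'my_evaluator')
```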
CLI Commands
| Command | Description |
|---|---|
| `pixie test [path] [-v] [--no-open]` | Run eval tests on dataset files |
| `pixie dataset create <name>` | Create a new empty dataset |
| `pixie dataset list` | List all datasets |
| `pixie dataset save <name> [--select MODE]` | Save a span to a dataset |
| `pixie dataset validate [path]` | Validate dataset JSON files |
| `pixie analyze <test_run_id>` | Generate analysis and recommendations |
Types
Evaluable
class Evaluable(TestCase):
    eval_output: list[NamedData]  # wrap(purpose="output") + wrap(purpose="state") values
    # Inherited from TestCase:
    #   eval_input: list[NamedData]                  # from eval_input in dataset entry
    #   expectation: JsonValue | _Unset              # from expectation in dataset entry
    #   eval_metadata: dict[str, JsonValue] | None   # from eval_metadata in dataset entry
    #   description: str | None
Data carrier for evaluators. Extends TestCase with actual output.
- `eval_input`: `list[NamedData]` populated from the entry's `eval_input` field. Must have at least one item (`min_length=1`).
- `eval_output`: `list[NamedData]` containing ALL `wrap(purpose="output")` and `wrap(purpose="state")` values captured during the run. Each item has `.name` (str) and `.value` (JsonValue). Use `_get_output(evaluable, "name")` to look up by name.
- `eval_metadata`: `dict[str, JsonValue] | None` from the entry's `eval_metadata` field.
- `expected_output`: expectation text from the dataset (or `UNSET` if not provided).
Attributes:
eval_input: Named input data items (from dataset). Must be non-empty.
eval_output: Named output data items (from wrap calls during run).
Each item has .name (str) and .value (JsonValue).
Contains ALL wrap(purpose="output") and wrap(purpose="state") values.
eval_metadata: Supplementary metadata (None when absent).
expected_output: The expected/reference output for evaluation.
Defaults to UNSET (not provided). May be explicitly
set to None to indicate "there is no expected output".
How wrap() maps to Evaluable fields at test time
When pixie test runs a dataset entry, wrap() calls in the app populate the Evaluable that evaluators receive:
| `wrap()` call in app code | Evaluable field | Type | How to access in evaluator |
|---|---|---|---|
| `wrap(data, purpose="input", name="X")` | `eval_input` | `list[NamedData]` | Pre-populated from `eval_input` in the dataset entry |
| `wrap(data, purpose="output", name="X")` | `eval_output` | `list[NamedData]` | `_get_output(evaluable, "X")` (see helper below) |
| `wrap(data, purpose="state", name="X")` | `eval_output` | `list[NamedData]` | `_get_output(evaluable, "X")` (same list as output) |
| (from dataset entry `expectation`) | `expected_output` | `str \| None` | `evaluable.expected_output` |
| (from dataset entry `eval_metadata`) | `eval_metadata` | `dict \| None` | `evaluable.eval_metadata` |
Key insight: Both purpose="output" and purpose="state" wrap values end up in eval_output as NamedData items. There is no separate captured_output or captured_state dict. Use the helper function below to look up values by wrap name:
from typing import Any

from pixie import Evaluable


def _get_output(evaluable: Evaluable, name: str) -> Any:
    """Look up a wrap value by name from eval_output."""
    for item in evaluable.eval_output:
        if item.name == name:
            return item.value
    return None  # no wrap value was captured under this name
eval_metadata is for passing extra per-entry data to evaluators that isn't an app input or output — e.g., expected tool names, boolean flags, thresholds. Defined as a top-level field on the entry, accessed as evaluable.eval_metadata.
Complete custom evaluator example (tool call check + dataset entry):
from typing import Any

from pixie import Evaluation, Evaluable


def _get_output(evaluable: Evaluable, name: str) -> Any:
    """Look up a wrap value by name from eval_output."""
    for item in evaluable.eval_output:
        if item.name == name:
            return item.value
    return None


def tool_call_check(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # The expected tool name comes from the entry's eval_metadata, if present.
    expected = evaluable.eval_metadata.get("expected_tool") if evaluable.eval_metadata else None
    # The actual tool name is whatever the app wrapped as "function_called".
    actual = _get_output(evaluable, "function_called")
    if expected is None:
        return Evaluation(score=1.0, reasoning="No expected_tool specified")
    match = str(actual) == str(expected)
    return Evaluation(
        score=1.0 if match else 0.0,
        reasoning=f"Expected {expected}, got {actual}",
    )
Corresponding dataset entry:
{
  "entry_kwargs": { "user_message": "I want to end this call" },
  "description": "User requests call end after failed verification",
  "eval_input": [{ "name": "user_input", "value": "I want to end this call" }],
  "expectation": "Agent should call endCall tool",
  "eval_metadata": {
    "expected_tool": "endCall",
    "expected_call_ended": true
  },
  "evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"]
}
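The entry above also carries an `expected_call_ended` flag that `tool_call_check` does not read. A second evaluator for that flag might look like the sketch below. To keep the snippet runnable without pixie installed, it defines stand-in `NamedData`, `Evaluable`, and `Evaluation` dataclasses that mirror only the fields documented here; in a real project these would come from `pixie`. The wrap name `"call_ended"` and the function `call_ended_check` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class NamedData:  # stand-in for pixie's NamedData
    name: str
    value: Any


@dataclass
class Evaluable:  # stand-in with only the fields used below
    eval_output: list[NamedData]
    eval_metadata: Optional[dict[str, Any]] = None


@dataclass
class Evaluation:  # stand-in for pixie's Evaluation
    score: float
    reasoning: str
    details: dict[str, Any] = field(default_factory=dict)


def _get_output(evaluable: Evaluable, name: str) -> Any:
    """Look up a wrap value by name from eval_output."""
    for item in evaluable.eval_output:
        if item.name == name:
            return item.value
    return None


def call_ended_check(evaluable: Evaluable, *, trace=None) -> Evaluation:
    metadata = evaluable.eval_metadata or {}
    if "expected_call_ended" not in metadata:
        return Evaluation(score=1.0, reasoning="No expected_call_ended specified")
    expected = bool(metadata["expected_call_ended"])
    # "call_ended" is a hypothetical wrap(purpose="state") name; a missing value counts as False.
    actual = bool(_get_output(evaluable, "call_ended"))
    return Evaluation(
        score=1.0 if actual == expected else 0.0,
        reasoning=f"Expected call_ended={expected}, got {actual}",
    )


sample = Evaluable(
    eval_output=[NamedData("function_called", "endCall"), NamedData("call_ended", True)],
    eval_metadata={"expected_tool": "endCall", "expected_call_ended": True},
)
print(call_ended_check(sample).score)  # 1.0
```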
Evaluation
Evaluation(score: 'float', reasoning: 'str', details: 'dict[str, Any]' = <factory>) -> None
The result of a single evaluator applied to a single test case.
Attributes:
score: Evaluation score between 0.0 and 1.0.
reasoning: Human-readable explanation (required).
details: Arbitrary JSON-serializable metadata.
ScoreThreshold
ScoreThreshold(threshold: 'float' = 0.5, pct: 'float' = 1.0) -> None
Pass criteria: pct fraction of inputs must score >= threshold on all evaluators.
Attributes:
threshold: Minimum score an individual evaluation must reach.
pct: Fraction of test-case inputs (0.0–1.0) that must pass.
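The pass rule can be spelled out on a plain matrix of scores. This is a sketch of the stated rule only, not pixie's implementation: `score_threshold_passes` is a hypothetical function, it takes raw floats instead of `Evaluation` objects, and treating an empty matrix as a failure is an assumption.

```python
def score_threshold_passes(
    scores: list[list[float]], threshold: float = 0.5, pct: float = 1.0
) -> tuple[bool, str]:
    """Apply the ScoreThreshold rule to a [inputs][evaluators] score matrix."""
    if not scores:
        return False, "no inputs were evaluated"  # assumption: empty means fail
    # An input passes only if EVERY evaluator scored it at or above the threshold.
    passing = sum(all(s >= threshold for s in row) for row in scores)
    fraction = passing / len(scores)
    return fraction >= pct, f"{passing}/{len(scores)} inputs passed (required fraction {pct})"


matrix = [
    [1.0, 0.6],  # both evaluators >= 0.5: input passes
    [0.4, 0.9],  # first evaluator < 0.5: input fails
]
print(score_threshold_passes(matrix))           # (False, '1/2 inputs passed (required fraction 1.0)')
print(score_threshold_passes(matrix, pct=0.5))  # (True, '1/2 inputs passed (required fraction 0.5)')
```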
Eval Functions
pixie.run_and_evaluate
pixie.run_and_evaluate(evaluator: 'Callable[..., Any]', runnable: 'Callable[..., Any]', eval_input: 'Any', *, expected_output: 'Any' = <sentinel>, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'Evaluation'
Run runnable(eval_input) while capturing traces, then evaluate.
Convenience wrapper combining _run_and_capture and evaluate.
The runnable is called exactly once.
Args:
evaluator: An evaluator callable (sync or async).
runnable: The application function to test.
eval_input: The single input passed to runnable.
expected_output: Optional expected value merged into the evaluable.
from_trace: Optional callable to select a specific span from the trace tree for evaluation.
Returns:
The Evaluation result.
Raises:
ValueError: If no spans were captured during execution.
pixie.assert_pass
pixie.assert_pass(runnable: 'Callable[..., Any]', eval_inputs: 'list[Any]', evaluators: 'list[Callable[..., Any]]', *, evaluables: 'list[Evaluable] | None' = None, pass_criteria: 'Callable[[list[list[Evaluation]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None'
Run evaluators against a runnable over multiple inputs.
For each input, runs the runnable once via _run_and_capture,
then evaluates with every evaluator concurrently via
asyncio.gather.
The results matrix has shape [eval_inputs][evaluators].
If the pass criteria are not met, raises EvalAssertionError
carrying the matrix.
When evaluables is provided, behaviour depends on whether each
item already has eval_output populated:
- `eval_output is None`: the `runnable` is called via `run_and_evaluate` to produce an output from traces, and `expected_output` from the evaluable is merged into the result.
- `eval_output is not None`: the evaluable is used directly (the runnable is not called for that item).
Args:
runnable: The application function to test.
eval_inputs: List of inputs, each passed to runnable.
evaluators: List of evaluator callables.
evaluables: Optional list of Evaluable items, one per input.
When provided, their expected_output is forwarded to
run_and_evaluate. Must have the same length as
eval_inputs.
pass_criteria: Receives the results matrix, returns
(passed, message). Defaults to ScoreThreshold().
from_trace: Optional span selector forwarded to
run_and_evaluate.
Raises:
EvalAssertionError: When pass criteria are not met.
ValueError: When evaluables length does not match eval_inputs.
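Because `pass_criteria` is any callable that receives the results matrix and returns `(passed, message)`, a custom rule can replace the default `ScoreThreshold()`. The sketch below shows one possible criterion based on the mean score. `mean_score_at_least` is a hypothetical factory, and the `Evaluation` dataclass is a stand-in mirroring the documented fields so the snippet runs without pixie installed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Evaluation:  # stand-in for pixie's Evaluation
    score: float
    reasoning: str
    details: dict[str, Any] = field(default_factory=dict)


PassCriteria = Callable[[list[list[Evaluation]]], tuple[bool, str]]


def mean_score_at_least(minimum: float) -> PassCriteria:
    """Build a pass_criteria callable that checks the mean of all scores."""

    def criteria(results: list[list[Evaluation]]) -> tuple[bool, str]:
        # results has shape [eval_inputs][evaluators]; flatten it to one list of scores.
        scores = [evaluation.score for row in results for evaluation in row]
        if not scores:
            return False, "no evaluations to score"
        mean = sum(scores) / len(scores)
        return mean >= minimum, f"mean score {mean:.2f} (required >= {minimum:.2f})"

    return criteria


results = [
    [Evaluation(1.0, "exact"), Evaluation(0.5, "partial")],
    [Evaluation(0.0, "wrong"), Evaluation(0.5, "partial")],
]
print(mean_score_at_least(0.5)(results))  # (True, 'mean score 0.50 (required >= 0.50)')
```

Under this assumption the callable would be passed as `pass_criteria=mean_score_at_least(0.5)`; unlike the default, it lets a strong result on one input offset a weak result on another.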
pixie.assert_dataset_pass
pixie.assert_dataset_pass(runnable: 'Callable[..., Any]', dataset_name: 'str', evaluators: 'list[Callable[..., Any]]', *, dataset_dir: 'str | None' = None, pass_criteria: 'Callable[[list[list[Evaluation]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None'
Load a dataset by name, then run assert_pass with its items.
This is a convenience wrapper that:
- Loads the dataset from the `DatasetStore`.
- Extracts `eval_input` from each item as the runnable inputs.
- Uses the full `Evaluable` items (which carry `expected_output`) as the evaluables.
- Delegates to `assert_pass`.
Args:
runnable: The application function to test.
dataset_name: Name of the dataset to load.
evaluators: List of evaluator callables.
dataset_dir: Override directory for the dataset store.
When None, reads from PixieConfig.dataset_dir.
pass_criteria: Receives the results matrix, returns
(passed, message).
from_trace: Optional span selector forwarded to
assert_pass.
Raises:
FileNotFoundError: If no dataset with dataset_name exists.
EvalAssertionError: When pass criteria are not met.
Trace Helpers
pixie.last_llm_call
pixie.last_llm_call(trace: 'list[ObservationNode]') -> 'Evaluable'
Find the LLMSpan with the latest ended_at in the trace tree.
Args:
trace: The trace tree (list of root ObservationNode instances).
Returns:
An Evaluable wrapping the most recently ended LLMSpan.
Raises:
ValueError: If no LLMSpan exists in the trace.
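The selection rule (latest `ended_at` among LLM spans anywhere in the tree) can be sketched as a depth-first walk. The `Node` class below is a hypothetical stand-in: the real `ObservationNode` and `LLMSpan` types are not described in this reference, so the field names `is_llm`, `ended_at`, and `children` are assumptions, and the sketch returns the node itself instead of an `Evaluable`.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Iterator


@dataclass
class Node:  # hypothetical stand-in for a trace tree node
    name: str
    is_llm: bool
    ended_at: datetime
    children: list["Node"] = field(default_factory=list)


def walk(nodes: list[Node]) -> Iterator[Node]:
    """Yield every node in the trace tree, depth-first."""
    for node in nodes:
        yield node
        yield from walk(node.children)


def latest_llm_node(trace: list[Node]) -> Node:
    """Return the LLM node with the latest ended_at, or raise ValueError."""
    llm_nodes = [node for node in walk(trace) if node.is_llm]
    if not llm_nodes:
        raise ValueError("no LLM span in trace")
    return max(llm_nodes, key=lambda node: node.ended_at)


trace = [
    Node("request", False, datetime(2024, 1, 1, 12, 0, 9), children=[
        Node("plan", True, datetime(2024, 1, 1, 12, 0, 3)),
        Node("tool", False, datetime(2024, 1, 1, 12, 0, 5)),
        Node("answer", True, datetime(2024, 1, 1, 12, 0, 8)),
    ]),
]
print(latest_llm_node(trace).name)  # answer
```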
pixie.root
pixie.root(trace: 'list[ObservationNode]') -> 'Evaluable'
Return the first root node's span as Evaluable.
Args:
trace: The trace tree (list of root ObservationNode instances).
Returns:
An Evaluable wrapping the first root node's span.
Raises:
ValueError: If the trace is empty.
pixie.capture_traces
pixie.capture_traces() -> 'Generator[MemoryTraceHandler, None, None]'
Context manager that installs a MemoryTraceHandler and yields it.
Calls init() (no-op if already initialised) then registers the
handler via add_handler(). On exit the handler is removed and
the delivery queue is flushed so that all spans are available on
handler.spans.