# pixie API Reference

This file is auto-generated by `generate_api_doc` from the live pixie-qa package. Do not edit by hand; run `generate_api_doc` to regenerate after updating pixie-qa.
## Configuration

All settings are read from environment variables at call time. By default, every artefact lives inside a single `pixie_qa` project directory:

| Variable | Default | Description |
|---|---|---|
| `PIXIE_ROOT` | `pixie_qa` | Root directory for all artefacts |
| `PIXIE_DB_PATH` | `pixie_qa/observations.db` | SQLite database file path |
| `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently `sqlite`) |
| `PIXIE_DATASET_DIR` | `pixie_qa/datasets` | Directory for dataset JSON files |
## Instrumentation API (pixie)

```python
from pixie import enable_storage, observe, start_observation, flush, init, add_handler
```

| Function / Decorator | Signature | Notes |
|---|---|---|
| `observe` | `observe(name: 'str \| None' = None) -> 'Callable[[Callable[P, T]], Callable[P, T]]'` | |
| `enable_storage` | `enable_storage() -> 'StorageHandler'` | Idempotent. Creates DB, registers handler. Call at app startup. |
| `start_observation` | `start_observation(*, input: 'JsonValue', name: 'str \| None' = None) -> 'Generator[ObservationContext, None, None]'` | |
| `flush` | `flush(timeout_seconds: 'float' = 5.0) -> 'bool'` | Drains the queue. Call after a run before using CLI commands. |
| `init` | `init(*, capture_content: 'bool' = True, queue_size: 'int' = 1000) -> 'None'` | Called internally by `enable_storage`. Idempotent. |
| `add_handler` | `add_handler(handler: 'InstrumentationHandler') -> 'None'` | Register a custom handler (must call `init()` first). |
| `remove_handler` | `remove_handler(handler: 'InstrumentationHandler') -> 'None'` | Unregister a previously added handler. |
## CLI Commands

```shell
# Trace inspection
pixie trace list [--limit N] [--errors]            # show recent traces
pixie trace show <trace_id> [--verbose] [--json]   # show span tree for a trace
pixie trace last [--json]                          # show most recent trace (verbose)

# Dataset management
pixie dataset create <name>
pixie dataset list
pixie dataset save <name>                          # root span (default)
pixie dataset save <name> --select last_llm_call   # last LLM call
pixie dataset save <name> --select by_name --span-name <name>
pixie dataset save <name> --notes "some note"
echo '"expected value"' | pixie dataset save <name> --expected-output

# Run eval tests
pixie test [path] [-k filter_substring] [-v]
```
### pixie trace commands

`pixie trace list` — show recent traces with summary info (trace ID, root span, timestamp, span count, errors).

- `--limit N` (default 10) — number of traces to show
- `--errors` — show only traces with errors

`pixie trace show <trace_id>` — show the span tree for a specific trace.

- Default (compact): span names, types, timing
- `--verbose` / `-v`: full input/output data for each span
- `--json`: machine-readable JSON output
- Trace ID accepts prefix match (first 8+ characters)

`pixie trace last` — shortcut to show the most recent trace in verbose mode. This is the primary command to use after running the harness.

- `--json`: machine-readable JSON output
`pixie dataset save` selection modes:

- `root` (default) — the outermost `@observe` or `start_observation` span
- `last_llm_call` — the most recent LLM API call span in the trace
- `by_name` — a span matching the `--span-name` argument (takes the last matching span)
## Dataset Python API

```python
from pixie import DatasetStore, Evaluable

store = DatasetStore()     # reads PIXIE_DATASET_DIR
store.append(...)          # add one or more items
store.create(...)          # create empty / create with items
store.delete(...)          # delete entirely
store.get(...)             # returns Dataset
store.list(...)            # list names
store.list_details(...)    # list names with metadata
store.remove(...)          # remove by index
```

`Evaluable` fields:

- `eval_input` — the input (what `@observe` captured as function kwargs)
- `eval_output` — the output (return value of the observed function)
- `eval_metadata` — dict of extra info (trace_id, span_id, provider, token counts, etc.); always includes `trace_id` and `span_id`
- `expected_output` — reference answer for comparison (`UNSET` if not provided)
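The field layout can be pictured as a dataclass. This is a stand-in for illustration; pixie's real `Evaluable` and `UNSET` are its own exports:

```python
from dataclasses import dataclass, field
from typing import Any

UNSET = object()  # illustrative sentinel; pixie provides its own UNSET

# Stand-in mirroring the documented Evaluable fields.
@dataclass
class EvaluableSketch:
    eval_input: Any                    # kwargs captured by @observe
    eval_output: Any                   # return value of the observed function
    eval_metadata: dict = field(default_factory=dict)  # includes trace_id / span_id
    expected_output: Any = UNSET       # UNSET when no reference answer was saved

item = EvaluableSketch(
    eval_input={"question": "capital of France?"},
    eval_output="Paris",
    eval_metadata={"trace_id": "abc123", "span_id": "def456"},
)
```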
## ObservationStore Python API

```python
from pixie import ObservationStore

store = ObservationStore()  # reads PIXIE_DB_PATH

await store.create_tables() -> 'None'
await store.get_by_name(name: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'       # spans matching a name
await store.get_by_type(span_kind: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'  # spans filtered by kind
await store.get_errors(trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'  # error spans
await store.get_last_llm(trace_id: 'str') -> 'LLMSpan | None'                 # most recent LLMSpan
await store.get_root(trace_id: 'str') -> 'ObserveSpan'                        # root ObserveSpan
await store.get_trace(trace_id: 'str') -> 'list[ObservationNode]'             # span tree
await store.get_trace_flat(trace_id: 'str') -> 'list[ObserveSpan | LLMSpan]'  # flat list of all spans
await store.list_traces(limit: 'int' = 50, offset: 'int' = 0) -> 'list[dict[str, Any]]'  # trace summaries
await store.save(span: 'ObserveSpan | LLMSpan') -> 'None'                     # persist a single span
await store.save_many(spans: 'list[ObserveSpan | LLMSpan]') -> 'None'         # persist multiple spans

# ObservationNode
node.to_text()   # pretty-print span tree
node.find(name)  # find a child span by name
node.children    # list of child ObservationNode
node.span        # the underlying span (ObserveSpan or LLMSpan)
```
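The node helpers can be illustrated with a minimal tree stand-in (not pixie's actual class; `find` is sketched here as a depth-first search, which is an assumption about its traversal):

```python
from dataclasses import dataclass, field

# Minimal stand-in for ObservationNode illustrating span/children/find;
# the real class also renders the tree via to_text().
@dataclass
class NodeSketch:
    span: dict
    children: list["NodeSketch"] = field(default_factory=list)

    def find(self, name: str) -> "NodeSketch | None":
        # depth-first search for self or a descendant by span name
        if self.span["name"] == name:
            return self
        for child in self.children:
            found = child.find(name)
            if found is not None:
                return found
        return None

root = NodeSketch({"name": "pipeline"},
                  [NodeSketch({"name": "llm_call"})])
```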
## Eval Runner API

### assert_dataset_pass

```python
await assert_dataset_pass(runnable: 'Callable[..., Any]', dataset_name: 'str', evaluators: 'list[Callable[..., Any]]', *, dataset_dir: 'str | None' = None, passes: 'int' = 1, pass_criteria: 'Callable[[list[list[list[Evaluation]]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None'
```

Parameters:

- `runnable` — callable that takes `eval_input` and runs the app
- `dataset_name` — name of the dataset to load (NOT `dataset_path`)
- `evaluators` — list of evaluator instances
- `pass_criteria` — `ScoreThreshold(threshold=..., pct=...)` (NOT `thresholds`)
- `from_trace` — span selector: use `last_llm_call` or `root`
- `dataset_dir` — override dataset directory (default: reads from config)
- `passes` — number of times to run the full matrix (default: 1)
### ScoreThreshold

```python
ScoreThreshold(threshold: 'float' = 0.5, pct: 'float' = 1.0) -> None
# threshold: minimum per-item score to count as passing (0.0–1.0)
# pct: fraction of items that must pass (0.0–1.0, default = 1.0)
```
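Per the `assert_dataset_pass` signature, a `pass_criteria` callable receives a `list[list[list[Evaluation]]]` (passes × items × evaluators) and returns `(bool, str)`. A sketch of `ScoreThreshold`-style logic over that shape, using a stand-in `Evaluation`; requiring every evaluator to clear the threshold is an assumption, not pixie's documented aggregation:

```python
from dataclasses import dataclass

@dataclass
class Evaluation:  # stand-in for pixie's Evaluation
    score: float
    reasoning: str = ""

# ScoreThreshold-style criteria: an item passes if all its evaluator
# scores reach `threshold`; the run passes if at least `pct` of items pass.
def score_threshold(matrix: list[list[list[Evaluation]]],
                    threshold: float = 0.5,
                    pct: float = 1.0) -> tuple[bool, str]:
    items = [evals for one_pass in matrix for evals in one_pass]
    passed = sum(all(e.score >= threshold for e in evals) for evals in items)
    ok = passed >= pct * len(items)
    return ok, f"{passed}/{len(items)} items scored >= {threshold}"

matrix = [[                 # one pass, two items, one evaluator each
    [Evaluation(score=0.9)],
    [Evaluation(score=0.2)],
]]
```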
### Trace helpers

```python
from pixie import last_llm_call, root

# Pass one of these as the from_trace= argument:
from_trace=last_llm_call  # extract eval data from the most recent LLM call span
from_trace=root           # extract eval data from the root @observe span
```
## Evaluator catalog

Import any evaluator directly from `pixie`:

```python
from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
```
### Heuristic (no LLM required)

| Evaluator | Signature | Use when | Needs expected_output? |
|---|---|---|---|
| `ExactMatchEval` | `ExactMatchEval() -> 'AutoevalsAdapter'` | Output must exactly equal the expected string | Yes |
| `LevenshteinMatch` | `LevenshteinMatch() -> 'AutoevalsAdapter'` | Partial string similarity (edit distance) | Yes |
| `NumericDiffEval` | `NumericDiffEval() -> 'AutoevalsAdapter'` | Normalised numeric difference | Yes |
| `JSONDiffEval` | `JSONDiffEval(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter'` | Structural JSON comparison | Yes |
| `ValidJSONEval` | `ValidJSONEval(*, schema: 'Any' = None) -> 'AutoevalsAdapter'` | Output is valid JSON (optionally matching a schema) | No |
| `ListContainsEval` | `ListContainsEval(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter'` | Output list contains expected items | Yes |
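To illustrate the heuristic style, here are exact-match and normalised numeric-diff scorers in the spirit of `ExactMatchEval` and `NumericDiffEval`. The normalisation formula is one plausible choice, not necessarily autoevals' exact scoring:

```python
# Illustrative heuristic scorers; autoevals' exact formulas may differ.
def exact_match(output: str, expected: str) -> float:
    # 1.0 only on exact string equality
    return 1.0 if output == expected else 0.0

def numeric_diff(output: float, expected: float) -> float:
    # 1.0 when equal, decaying with relative difference (assumed normalisation)
    if output == expected:
        return 1.0
    denom = max(abs(output), abs(expected))
    return max(0.0, 1.0 - abs(output - expected) / denom)
```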
### LLM-as-judge (require an OpenAI key or compatible client)

| Evaluator | Signature | Use when | Needs expected_output? |
|---|---|---|---|
| `FactualityEval` | `FactualityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is factually accurate vs reference | Yes |
| `ClosedQAEval` | `ClosedQAEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Closed-book QA comparison | Yes |
| `SummaryEval` | `SummaryEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Summarisation quality | Yes |
| `TranslationEval` | `TranslationEval(*, language: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Translation quality | |
| `PossibleEval` | `PossibleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is feasible / plausible | No |
| `SecurityEval` | `SecurityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No security vulnerabilities in output | No |
| `ModerationEval` | `ModerationEval(*, threshold: 'float \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Content moderation | No |
| `BattleEval` | `BattleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Head-to-head comparison | Yes |
| `HumorEval` | `HumorEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Humor quality evaluation | Yes |
| `EmbeddingSimilarityEval` | `EmbeddingSimilarityEval(*, prefix: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Embedding-based semantic similarity | |
### RAG / retrieval

| Evaluator | Signature | Use when | Needs expected_output? |
|---|---|---|---|
| `ContextRelevancyEval` | `ContextRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Retrieved context is relevant to query | Yes |
| `FaithfulnessEval` | `FaithfulnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is faithful to the provided context | No |
| `AnswerRelevancyEval` | `AnswerRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer addresses the question (⚠️ requires context in trace — RAG pipelines only) | No |
| `AnswerCorrectnessEval` | `AnswerCorrectnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is correct vs reference | Yes |
### Other evaluators

| Evaluator | Signature | Needs expected_output? |
|---|---|---|
| `SqlEval` | `SqlEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No |
### Custom evaluator — create_llm_evaluator factory

```python
from pixie import create_llm_evaluator

my_eval = create_llm_evaluator(name: 'str', prompt_template: 'str', *, model: 'str' = 'gpt-4o-mini', client: 'Any | None' = None) -> '_LLMEvaluator'
```

- Returns a callable satisfying the `Evaluator` protocol
- Template variables: `{eval_input}`, `{eval_output}`, `{expected_output}` — populated from `Evaluable` fields
- No nested field access — include any needed metadata in `eval_input` when building the dataset
- Score parsing extracts a 0–1 float from the LLM response
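The template-filling and score-parsing behaviour can be sketched without calling an LLM. The regex parse below is an assumption about how a 0–1 float might be extracted; pixie's actual parser may differ:

```python
import re
from typing import Any

# Fill the documented template variables from Evaluable-like fields.
def fill_template(template: str, eval_input: Any, eval_output: Any,
                  expected_output: Any) -> str:
    return template.format(eval_input=eval_input,
                           eval_output=eval_output,
                           expected_output=expected_output)

# Assumed parsing: take the first number in the response, clamp to [0, 1].
def parse_score(response: str) -> float:
    match = re.search(r"\d+(?:\.\d+)?", response)
    if match is None:
        return 0.0
    return min(1.0, max(0.0, float(match.group())))
```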
### Custom evaluator — manual template

```python
from pixie import Evaluation, Evaluable

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```
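The template above depends on pixie's types; with small stand-in classes (illustrative only, not pixie's real `Evaluable`/`Evaluation`) the same evaluator shape can be run directly:

```python
import asyncio
from dataclasses import dataclass
from typing import Any

# Stand-ins for pixie's Evaluable / Evaluation, for illustration only.
@dataclass
class Evaluable:
    eval_input: Any
    eval_output: Any

@dataclass
class Evaluation:
    score: float
    reasoning: str = ""

async def contains_pattern(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # substring check mirroring the manual-template example above
    hit = "expected pattern" in str(evaluable.eval_output)
    return Evaluation(score=1.0 if hit else 0.0,
                      reasoning="substring match" if hit else "pattern missing")

result = asyncio.run(contains_pattern(
    Evaluable(eval_input="prompt", eval_output="... expected pattern ...")))
```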