awesome-copilot/skills/eval-driven-dev/references/pixie-api.md
Yiou Li df0ed6aa51 update eval-driven-dev skill. (#1201)
2026-03-30 08:07:39 +11:00


pixie API Reference

This file is auto-generated by generate_api_doc from the live pixie-qa package. Do not edit by hand — run generate_api_doc to regenerate after updating pixie-qa.

Configuration

All settings are read from environment variables at call time. By default, every artefact lives inside a single pixie_qa project directory:

| Variable | Default | Description |
| --- | --- | --- |
| `PIXIE_ROOT` | `pixie_qa` | Root directory for all artefacts |
| `PIXIE_DB_PATH` | `pixie_qa/observations.db` | SQLite database file path |
| `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently sqlite) |
| `PIXIE_DATASET_DIR` | `pixie_qa/datasets` | Directory for dataset JSON files |

Instrumentation API (pixie)

from pixie import enable_storage, observe, start_observation, flush, init, add_handler
| Function / Decorator | Signature | Notes |
| --- | --- | --- |
| `observe` | `observe(name: 'str \| None' = None) -> 'Callable[[Callable[P, T]], Callable[P, T]]'` | |
| `enable_storage` | `enable_storage() -> 'StorageHandler'` | Idempotent. Creates DB, registers handler. Call at app startup. |
| `start_observation` | `start_observation(*, input: 'JsonValue', name: 'str \| None' = None) -> 'Generator[ObservationContext, None, None]'` | |
| `flush` | `flush(timeout_seconds: 'float' = 5.0) -> 'bool'` | Drains the queue. Call after a run before using CLI commands. |
| `init` | `init(*, capture_content: 'bool' = True, queue_size: 'int' = 1000) -> 'None'` | Called internally by `enable_storage`. Idempotent. |
| `add_handler` | `add_handler(handler: 'InstrumentationHandler') -> 'None'` | Register a custom handler (must call `init()` first). |
| `remove_handler` | `remove_handler(handler: 'InstrumentationHandler') -> 'None'` | Unregister a previously added handler. |

CLI Commands

# Trace inspection
pixie trace list [--limit N] [--errors]              # show recent traces
pixie trace show <trace_id> [--verbose] [--json]     # show span tree for a trace
pixie trace last [--json]                            # show most recent trace (verbose)

# Dataset management
pixie dataset create <name>
pixie dataset list
pixie dataset save <name>                              # root span (default)
pixie dataset save <name> --select last_llm_call       # last LLM call
pixie dataset save <name> --select by_name --span-name <name>
pixie dataset save <name> --notes "some note"
echo '"expected value"' | pixie dataset save <name> --expected-output

# Run eval tests
pixie test [path] [-k filter_substring] [-v]

pixie trace commands

pixie trace list — show recent traces with summary info (trace ID, root span, timestamp, span count, errors).

  • --limit N (default 10) — number of traces to show
  • --errors — show only traces with errors

pixie trace show <trace_id> — show the span tree for a specific trace.

  • Default (compact): span names, types, timing
  • --verbose / -v: full input/output data for each span
  • --json: machine-readable JSON output
  • Trace ID accepts prefix match (first 8+ characters)

pixie trace last — shortcut to show the most recent trace in verbose mode. This is the primary command to use after running the harness.

  • --json: machine-readable JSON output

pixie dataset save selection modes:

  • root (default) — the outermost @observe or start_observation span
  • last_llm_call — the most recent LLM API call span in the trace
  • by_name — a span matching the --span-name argument (takes the last matching span)
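The three modes can be sketched over a flat span list. The dict span shape below is simplified for illustration; pixie's real spans are `ObserveSpan` / `LLMSpan` objects.

```python
# Illustrative selection logic for the three modes above, over simplified
# dict spans (not pixie internals).
def select_span(spans, mode, span_name=None):
    if mode == "root":
        return next(s for s in spans if s["parent_id"] is None)
    if mode == "last_llm_call":
        return [s for s in spans if s["kind"] == "llm"][-1]
    if mode == "by_name":
        return [s for s in spans if s["name"] == span_name][-1]  # last match wins
    raise ValueError(f"unknown selection mode: {mode}")

trace = [
    {"name": "agent", "kind": "observe", "parent_id": None},
    {"name": "plan", "kind": "llm", "parent_id": "agent"},
    {"name": "answer", "kind": "llm", "parent_id": "agent"},
]
```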

Dataset Python API

from pixie import DatasetStore, Evaluable
store = DatasetStore()                               # reads PIXIE_DATASET_DIR
store.append(...)        # add one or more items
store.create(...)        # create empty / create with items
store.delete(...)        # delete entirely
store.get(...)           # returns Dataset
store.list(...)          # list names
store.list_details(...)  # list names with metadata
store.remove(...)        # remove by index

Evaluable fields:

  • eval_input: the input (what @observe captured as function kwargs)
  • eval_output: the output (return value of the observed function)
  • eval_metadata: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes trace_id and span_id
  • expected_output: reference answer for comparison (UNSET if not provided)
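Put together, a saved item might look like the following. The field names come from the list above; the concrete values are made up for illustration.

```python
# Illustrative dataset item using the Evaluable field names above.
# Values are invented; eval_metadata always carries trace_id and span_id.
item = {
    "eval_input": {"question": "What is 2 + 2?"},  # captured function kwargs
    "eval_output": "4",                            # observed return value
    "eval_metadata": {"trace_id": "a1b2c3d4", "span_id": "e5f6a7b8"},
    "expected_output": "4",                        # reference answer
}
```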

ObservationStore Python API

from pixie import ObservationStore

store = ObservationStore()   # reads PIXIE_DB_PATH
await store.create_tables() -> 'None'  # create tables; call once before first use
await store.get_by_name(name: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'  # → list of spans
await store.get_by_type(span_kind: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'  # → list of spans filtered by kind
await store.get_errors(trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'  # → list of error spans
await store.get_last_llm(trace_id: 'str') -> 'LLMSpan | None'  # → most recent LLMSpan
await store.get_root(trace_id: 'str') -> 'ObserveSpan'  # → root ObserveSpan
await store.get_trace(trace_id: 'str') -> 'list[ObservationNode]'  # → span tree
await store.get_trace_flat(trace_id: 'str') -> 'list[ObserveSpan | LLMSpan]'  # → flat list of all spans
await store.list_traces(limit: 'int' = 50, offset: 'int' = 0) -> 'list[dict[str, Any]]'  # → list of trace summaries
await store.save(span: 'ObserveSpan | LLMSpan') -> 'None'  # persist a single span
await store.save_many(spans: 'list[ObserveSpan | LLMSpan]') -> 'None'  # persist multiple spans

# ObservationNode
node.to_text()          # pretty-print span tree
node.find(name)         # find a child span by name
node.children           # list of child ObservationNode
node.span               # the underlying span (ObserveSpan or LLMSpan)
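A toy stand-in mirroring this surface (`children`, `span`, `find`, `to_text`) shows how a trace tree is typically walked. It uses simplified dict spans rather than pixie's real span classes, and its behaviour is assumed, not copied from pixie.

```python
from dataclasses import dataclass, field

# Minimal stand-in for the ObservationNode surface above; details such as
# indentation width and find() traversal order are my assumptions.
@dataclass
class Node:
    span: dict
    children: list["Node"] = field(default_factory=list)

    def find(self, name: str) -> "Node | None":
        """Depth-first search for a node whose span name matches."""
        if self.span["name"] == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def to_text(self, depth: int = 0) -> str:
        """Pretty-print the tree, one span name per line, indented by depth."""
        lines = ["  " * depth + self.span["name"]]
        for child in self.children:
            lines.append(child.to_text(depth + 1))
        return "\n".join(lines)

root = Node({"name": "agent"}, [Node({"name": "llm_call"})])
```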

Eval Runner API

assert_dataset_pass

await assert_dataset_pass(runnable: 'Callable[..., Any]', dataset_name: 'str', evaluators: 'list[Callable[..., Any]]', *, dataset_dir: 'str | None' = None, passes: 'int' = 1, pass_criteria: 'Callable[[list[list[list[Evaluation]]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None'

Parameters:

  • runnable — callable that takes eval_input and runs the app
  • dataset_name — name of the dataset to load (NOT dataset_path)
  • evaluators — list of evaluator instances
  • pass_criteria — ScoreThreshold(threshold=..., pct=...) (NOT thresholds)
  • from_trace — span selector: use last_llm_call or root
  • dataset_dir — override dataset directory (default: reads from config)
  • passes — number of times to run the full matrix (default: 1)

ScoreThreshold

ScoreThreshold(threshold: 'float' = 0.5, pct: 'float' = 1.0) -> None

# threshold: minimum per-item score to count as passing (0.0-1.0)
# pct:       fraction of items that must pass (0.0-1.0, default=1.0)
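Read together, threshold and pct define a two-level rule: an item passes when its score meets the threshold, and the dataset passes when enough items pass. A sketch of that rule as I read it (the real ScoreThreshold may handle edge cases such as an empty dataset differently):

```python
# Assumed semantics of ScoreThreshold(threshold, pct): per-item pass when
# score >= threshold; overall pass when the passing fraction >= pct.
# The empty-list behaviour here is my choice, not documented pixie behaviour.
def passes(scores: list[float], threshold: float = 0.5, pct: float = 1.0) -> bool:
    if not scores:
        return False
    passing = sum(1 for s in scores if s >= threshold)
    return passing / len(scores) >= pct
```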

Trace helpers

from pixie import last_llm_call, root

# Pass one of these as the from_trace= argument:
from_trace=last_llm_call  # extract eval data from the most recent LLM call span
from_trace=root           # extract eval data from the root @observe span

Evaluator catalog

Import any evaluator directly from pixie:

from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator

Heuristic (no LLM required)

| Evaluator | Signature | Use when | Needs expected_output? |
| --- | --- | --- | --- |
| `ExactMatchEval` | `ExactMatchEval() -> 'AutoevalsAdapter'` | Output must exactly equal the expected string | Yes |
| `LevenshteinMatch` | `LevenshteinMatch() -> 'AutoevalsAdapter'` | Partial string similarity (edit distance) | Yes |
| `NumericDiffEval` | `NumericDiffEval() -> 'AutoevalsAdapter'` | Normalised numeric difference | Yes |
| `JSONDiffEval` | `JSONDiffEval(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter'` | Structural JSON comparison | Yes |
| `ValidJSONEval` | `ValidJSONEval(*, schema: 'Any' = None) -> 'AutoevalsAdapter'` | Output is valid JSON (optionally matching a schema) | No |
| `ListContainsEval` | `ListContainsEval(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter'` | Output list contains expected items | Yes |

LLM-as-judge (requires an OpenAI key or compatible client)

| Evaluator | Signature | Use when | Needs expected_output? |
| --- | --- | --- | --- |
| `FactualityEval` | `FactualityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is factually accurate vs reference | Yes |
| `ClosedQAEval` | `ClosedQAEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Closed-book QA comparison | Yes |
| `SummaryEval` | `SummaryEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Summarisation quality | Yes |
| `TranslationEval` | `TranslationEval(*, language: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Translation quality | |
| `PossibleEval` | `PossibleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is feasible / plausible | No |
| `SecurityEval` | `SecurityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No security vulnerabilities in output | No |
| `ModerationEval` | `ModerationEval(*, threshold: 'float \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Content moderation | No |
| `BattleEval` | `BattleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Head-to-head comparison | Yes |
| `HumorEval` | `HumorEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Humor quality evaluation | Yes |
| `EmbeddingSimilarityEval` | `EmbeddingSimilarityEval(*, prefix: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Embedding-based semantic similarity | |

RAG / retrieval

| Evaluator | Signature | Use when | Needs expected_output? |
| --- | --- | --- | --- |
| `ContextRelevancyEval` | `ContextRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Retrieved context is relevant to query | Yes |
| `FaithfulnessEval` | `FaithfulnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is faithful to the provided context | No |
| `AnswerRelevancyEval` | `AnswerRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer addresses the question (⚠️ requires context in trace — RAG pipelines only) | No |
| `AnswerCorrectnessEval` | `AnswerCorrectnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is correct vs reference | Yes |

Other evaluators

| Evaluator | Signature | Needs expected_output? |
| --- | --- | --- |
| `SqlEval` | `SqlEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No |

Custom evaluator — create_llm_evaluator factory

from pixie import create_llm_evaluator

create_llm_evaluator(name: 'str', prompt_template: 'str', *, model: 'str' = 'gpt-4o-mini', client: 'Any | None' = None) -> '_LLMEvaluator'
  • Returns a callable satisfying the Evaluator protocol
  • Template variables: {eval_input}, {eval_output}, {expected_output} — populated from Evaluable fields
  • No nested field access — include any needed metadata in eval_input when building the dataset
  • Score parsing extracts a 0-1 float from the LLM response

Custom evaluator — manual template

from pixie import Evaluation, Evaluable

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input  — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")