update eval-driven-dev skill. (#1201)

* update eval-driven-dev skill. Split SKILL into multiple levels to keep the skill body under 500 lines; rewrite instructions.
* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
# pixie API Reference

> This file is auto-generated by `generate_api_doc` from the
> live pixie-qa package. Do not edit by hand — run
> `generate_api_doc` to regenerate after updating pixie-qa.

## Configuration

All settings read from environment variables at call time. By default,
```python
from pixie import enable_storage, observe, start_observation, flush, init, add_handler
```

| Function / Decorator | Signature | Notes |
| -------------------- | --------- | ----- |
| `observe` | `observe(name: 'str \| None' = None) -> 'Callable[[Callable[P, T]], Callable[P, T]]'` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
| `enable_storage` | `enable_storage() -> 'StorageHandler'` | Idempotent. Creates DB, registers handler. Call at app startup. |
| `start_observation` | `start_observation(*, input: 'JsonValue', name: 'str \| None' = None) -> 'Generator[ObservationContext, None, None]'` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
| `flush` | `flush(timeout_seconds: 'float' = 5.0) -> 'bool'` | Drains the queue. Call after a run before using CLI commands. |
| `init` | `init(*, capture_content: 'bool' = True, queue_size: 'int' = 1000) -> 'None'` | Called internally by `enable_storage`. Idempotent. |
| `add_handler` | `add_handler(handler: 'InstrumentationHandler') -> 'None'` | Register a custom handler (must call `init()` first). |
| `remove_handler` | `remove_handler(handler: 'InstrumentationHandler') -> 'None'` | Unregister a previously added handler. |
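
A minimal sketch of the instrumentation flow, using only the calls in the table above (the function, names, and inputs are illustrative):

```python
from pixie import enable_storage, observe, start_observation, flush

enable_storage()  # idempotent: creates the DB and registers the storage handler

@observe(name="answer")
def answer(question: str) -> str:
    # kwargs are captured as eval_input, the return value as eval_output
    return "42"

answer(question="What is the meaning of life?")

# Manual span for work that is not a single function call
with start_observation(input={"query": "hello"}, name="greeting") as obs:
    obs.set_output("hi")
    obs.set_metadata("step", "manual")

flush()  # drain the queue before inspecting traces with the CLI
```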
---

## CLI Commands

```bash
# Trace inspection
pixie trace list [--limit N] [--errors]            # show recent traces
pixie trace show <trace_id> [--verbose] [--json]   # show span tree for a trace
pixie trace last [--json]                          # show most recent trace (verbose)

# Dataset management
pixie dataset create <name>
pixie dataset list
echo '"expected value"' | pixie dataset save <name> --expected-output

# Run evals
pixie test [path] [-k filter_substring] [-v]
```

### `pixie trace` commands
**`pixie trace list`** — show recent traces with summary info (trace ID, root span, timestamp, span count, errors).

- `--limit N` (default 10) — number of traces to show
- `--errors` — show only traces with errors

**`pixie trace show <trace_id>`** — show the span tree for a specific trace.

- Default (compact): span names, types, timing
- `--verbose` / `-v`: full input/output data for each span
- `--json`: machine-readable JSON output
- Trace ID accepts prefix match (first 8+ characters)

**`pixie trace last`** — shortcut to show the most recent trace in verbose mode. This is the primary command to use after running the harness.

- `--json`: machine-readable JSON output

**`pixie dataset save` selection modes:**

- `root` (default) — the outermost `@observe` or `start_observation` span
---

## Eval Harness (`pixie`)

```python
from pixie import (
    assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
    EvalAssertionError, Evaluation, ScoreThreshold,
    capture_traces, MemoryTraceHandler,
    last_llm_call, root,
)
```

### Key functions
**`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`**

- Loads dataset by name, runs `assert_pass` with all items.
- `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself.
- `evaluators`: list of evaluator callables.
- `pass_criteria`: defaults to `ScoreThreshold()` (all scores ≥ 0.5).
- `from_trace`: `last_llm_call` or `root` — selects which span to evaluate.

**`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`**

- Same, but takes explicit inputs (and optionally `Evaluable` items for expected outputs).

**`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`**

- Runs `runnable(eval_input)`, captures traces, evaluates. Returns one `Evaluation`.

**`ScoreThreshold(threshold=0.5, pct=1.0)`**

- `threshold`: min score per item (default 0.5).
- `pct`: fraction of items that must meet threshold (default 1.0 = all).
- Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7.

**`Evaluation(score, reasoning, details={})`** — frozen result. `score` is 0.0–1.0.

**`capture_traces()`** — context manager; use for in-memory trace capture without DB.

**`last_llm_call(trace)`** / **`root(trace)`** — `from_trace` helpers.
---
## Dataset Python API
```python
from pixie import DatasetStore, Evaluable
```

```python
store = DatasetStore()                      # reads PIXIE_DATASET_DIR
store.create("my-dataset")                  # create empty
store.create("my-dataset", items=[...])     # create with items
store.append("my-dataset", Evaluable(...))  # add one or more items
store.get("my-dataset")                     # returns Dataset
store.list()                                # list names
store.list_details()                        # list names with metadata
store.remove("my-dataset", index=2)         # remove by index
store.delete("my-dataset")                  # delete entirely
```
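
A sketch of building a dataset programmatically; it assumes `Evaluable` accepts the fields listed below as keyword arguments:

```python
from pixie import DatasetStore, Evaluable

store = DatasetStore()
store.create(
    "capital-cities",  # illustrative dataset name
    items=[
        # assumption: Evaluable takes its fields as keyword arguments
        Evaluable(eval_input="What is the capital of France?", expected_output="Paris"),
        Evaluable(eval_input="What is the capital of Japan?", expected_output="Tokyo"),
    ],
)
```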
**`Evaluable` fields:**

- `eval_input`: what was passed to the observed function
- `eval_output`: what the function returned
- `expected_output`: reference answer (`UNSET` if not provided)

---

```python
from pixie import ObservationStore

store = ObservationStore()  # reads PIXIE_DB_PATH
await store.create_tables()
```
```python
await store.create_tables() -> 'None'
await store.get_by_name(name: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'       # → list of spans
await store.get_by_type(span_kind: 'str', trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'  # → list of spans filtered by kind
await store.get_errors(trace_id: 'str | None' = None) -> 'list[ObserveSpan | LLMSpan]'                     # → list of error spans
await store.get_last_llm(trace_id: 'str') -> 'LLMSpan | None'                                              # → most recent LLMSpan
await store.get_root(trace_id: 'str') -> 'ObserveSpan'                                                     # → root ObserveSpan
await store.get_trace(trace_id: 'str') -> 'list[ObservationNode]'                                          # → list[ObservationNode] (tree)
await store.get_trace_flat(trace_id: 'str') -> 'list[ObserveSpan | LLMSpan]'                               # → flat list of all spans
await store.list_traces(limit: 'int' = 50, offset: 'int' = 0) -> 'list[dict[str, Any]]'                    # → list of trace summaries
await store.save(span: 'ObserveSpan | LLMSpan') -> 'None'                                                  # persist a single span
await store.save_many(spans: 'list[ObserveSpan | LLMSpan]') -> 'None'                                      # persist multiple spans

# ObservationNode
node.to_text()   # pretty-print span tree
node.find(name)  # find a child span by name
node.children    # list of child ObservationNode
node.span        # the underlying span (ObserveSpan or LLMSpan)
```
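
A sketch of reading the most recent trace back out, continuing from the `store` above; the `"trace_id"` key on the summary dicts is an assumption about the shape `list_traces` returns:

```python
summaries = await store.list_traces(limit=1)
trace_id = summaries[0]["trace_id"]  # assumed summary key

for node in await store.get_trace(trace_id):
    print(node.to_text())  # pretty-print the span tree
```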
---
## Eval Runner API

### `assert_dataset_pass`

```python
await assert_dataset_pass(
    runnable: 'Callable[..., Any]',
    dataset_name: 'str',
    evaluators: 'list[Callable[..., Any]]',
    *,
    dataset_dir: 'str | None' = None,
    passes: 'int' = 1,
    pass_criteria: 'Callable[[list[list[list[Evaluation]]]], tuple[bool, str]] | None' = None,
    from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None,
) -> 'None'
```
**Parameters:**

- `runnable` — callable that takes `eval_input` and runs the app
- `dataset_name` — name of the dataset to load (NOT `dataset_path`)
- `evaluators` — list of evaluator instances
- `pass_criteria` — `ScoreThreshold(threshold=..., pct=...)` (NOT `thresholds`)
- `from_trace` — span selector: use `last_llm_call` or `root`
- `dataset_dir` — override dataset directory (default: reads from config)
- `passes` — number of times to run the full matrix (default: 1)
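
A sketch of a complete eval test assembled from these parameters (the app function, dataset name, and evaluator choice are illustrative; the test is assumed to run under `pixie test`):

```python
from pixie import (
    assert_dataset_pass, FactualityEval, ScoreThreshold,
    last_llm_call, observe,
)

@observe(name="qa-app")
def qa_app(eval_input):
    ...  # run the real pipeline; @observe instruments it

async def test_qa_dataset():
    await assert_dataset_pass(
        qa_app,
        "qa-golden",                                 # dataset name, not a path
        [FactualityEval()],
        pass_criteria=ScoreThreshold(0.7, pct=0.8),  # 80% of items must score >= 0.7
        from_trace=last_llm_call,
    )
```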
### `ScoreThreshold`
```python
ScoreThreshold(threshold: 'float' = 0.5, pct: 'float' = 1.0) -> None

# threshold: minimum per-item score to count as passing (0.0–1.0)
# pct: fraction of items that must pass (0.0–1.0, default=1.0)
```
### Trace helpers

```python
from pixie import last_llm_call, root

# Pass one of these as the from_trace= argument:
from_trace=last_llm_call  # extract eval data from the most recent LLM call span
from_trace=root           # extract eval data from the root @observe span
```
---

## Evaluator catalog

Import any evaluator directly from `pixie`:

```python
from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
```
### Heuristic (no LLM required)

| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `ExactMatchEval` | `ExactMatchEval() -> 'AutoevalsAdapter'` | Output must exactly equal the expected string | **Yes** |
| `LevenshteinMatch` | `LevenshteinMatch() -> 'AutoevalsAdapter'` | Partial string similarity (edit distance) | **Yes** |
| `NumericDiffEval` | `NumericDiffEval() -> 'AutoevalsAdapter'` | Normalised numeric difference | **Yes** |
| `JSONDiffEval` | `JSONDiffEval(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter'` | Structural JSON comparison | **Yes** |
| `ValidJSONEval` | `ValidJSONEval(*, schema: 'Any' = None) -> 'AutoevalsAdapter'` | Output is valid JSON (optionally matching a schema) | No |
| `ListContainsEval` | `ListContainsEval(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter'` | Output list contains expected items | **Yes** |
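
For example, a schema-free JSON check can pair `ValidJSONEval` with `assert_pass` from the harness section. A sketch (the runnable and inputs are illustrative; `assert_pass` is assumed to be awaitable like `assert_dataset_pass`):

```python
from pixie import assert_pass, ValidJSONEval, observe, root

@observe(name="json-app")
def json_app(eval_input):
    ...  # the real pipeline; should produce JSON output

async def test_output_is_valid_json():
    await assert_pass(
        json_app,
        ["list three colors as a JSON array"],  # explicit eval_inputs
        [ValidJSONEval()],                      # needs no expected_output
        from_trace=root,
    )
```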
### LLM-as-judge (requires an OpenAI key or compatible client)

| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `FactualityEval` | `FactualityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is factually accurate vs reference | **Yes** |
| `ClosedQAEval` | `ClosedQAEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Closed-book QA comparison | **Yes** |
| `SummaryEval` | `SummaryEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Summarisation quality | **Yes** |
| `TranslationEval` | `TranslationEval(*, language: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Translation quality | **Yes** |
| `PossibleEval` | `PossibleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Output is feasible / plausible | No |
| `SecurityEval` | `SecurityEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No security vulnerabilities in output | No |
| `ModerationEval` | `ModerationEval(*, threshold: 'float \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Content moderation | No |
| `BattleEval` | `BattleEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Head-to-head comparison | **Yes** |
| `HumorEval` | `HumorEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Humor quality evaluation | **Yes** |
| `EmbeddingSimilarityEval` | `EmbeddingSimilarityEval(*, prefix: 'str \| None' = None, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | Embedding-based semantic similarity | **Yes** |
### RAG / retrieval

| Evaluator | Signature | Use when | Needs `expected_output`? |
| --- | --- | --- | --- |
| `ContextRelevancyEval` | `ContextRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Retrieved context is relevant to query | **Yes** |
| `FaithfulnessEval` | `FaithfulnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is faithful to the provided context | No |
| `AnswerRelevancyEval` | `AnswerRelevancyEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer addresses the question (⚠️ requires `context` in trace — **RAG pipelines only**) | No |
| `AnswerCorrectnessEval` | `AnswerCorrectnessEval(*, client: 'Any' = None) -> 'AutoevalsAdapter'` | Answer is correct vs reference | **Yes** |
### Other evaluators

| Evaluator | Signature | Needs `expected_output`? |
| --- | --- | --- |
| `SqlEval` | `SqlEval(*, model: 'str \| None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'` | No |
---

## Custom evaluator — `create_llm_evaluator` factory

```python
from pixie import create_llm_evaluator

create_llm_evaluator(name: 'str', prompt_template: 'str', *, model: 'str' = 'gpt-4o-mini', client: 'Any | None' = None) -> '_LLMEvaluator'
```

- Returns a callable satisfying the `Evaluator` protocol
- Template variables: `{eval_input}`, `{eval_output}`, `{expected_output}` — populated from `Evaluable` fields
- No nested field access — include any needed metadata in `eval_input` when building the dataset
- Score parsing extracts a 0–1 float from the LLM response
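
A sketch of the factory in use (the evaluator name and prompt wording are illustrative):

```python
from pixie import create_llm_evaluator

conciseness_eval = create_llm_evaluator(
    "conciseness",
    "Question: {eval_input}\n"
    "Answer: {eval_output}\n"
    "Rate from 0 to 1 how concisely the answer addresses the question. "
    "Reply with only the number.",
)
```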
### Custom evaluator — manual template

```python
from pixie import Evaluation, Evaluable

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```
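
A custom evaluator plugs in anywhere a built-in is accepted, e.g. `await assert_dataset_pass(runnable, "my-dataset", [my_evaluator])` (the dataset name is illustrative).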