Mirror of https://github.com/github/awesome-copilot.git, synced 2026-04-12 03:05:55 +00:00

Commit: update eval-driven-dev skill (#1352)

* update eval-driven-dev skill
* small refinement of skill description
* address review, rerun npm start
New file: skills/eval-driven-dev/references/evaluators.md (531 lines)
# Built-in Evaluators

> Auto-generated from pixie source code docstrings.
> Do not edit by hand — regenerate from the upstream [pixie-qa](https://github.com/yiouli/pixie-qa) source repository.

Autoevals adapters — pre-made evaluators wrapping `autoevals` scorers.

This module provides :class:`AutoevalsAdapter`, which bridges the autoevals `Scorer` interface to pixie's `Evaluator` protocol, and a set of factory functions for common evaluation tasks.

Public API (all are also re-exported from `pixie.evals`):

**Core adapter:**

- :class:`AutoevalsAdapter` — generic wrapper for any autoevals `Scorer`.

**Heuristic scorers (no LLM required):**

- :func:`LevenshteinMatch` — edit-distance string similarity.
- :func:`ExactMatch` — exact value comparison.
- :func:`NumericDiff` — normalised numeric difference.
- :func:`JSONDiff` — structural JSON comparison.
- :func:`ValidJSON` — JSON syntax / schema validation.
- :func:`ListContains` — overlap between two string lists.

**Embedding scorer:**

- :func:`EmbeddingSimilarity` — cosine similarity via embeddings.

**LLM-as-judge scorers:**

- :func:`Factuality`, :func:`ClosedQA`, :func:`Battle`, :func:`Humor`, :func:`Security`, :func:`Sql`, :func:`Summary`, :func:`Translation`, :func:`Possible`.

**Moderation:**

- :func:`Moderation` — OpenAI content-moderation check.

**RAGAS metrics:**

- :func:`ContextRelevancy`, :func:`Faithfulness`, :func:`AnswerRelevancy`, :func:`AnswerCorrectness`.
## Evaluator Selection Guide

Choose evaluators based on the **output type** and eval criteria:

| Output type | Evaluator category | Examples |
| -------------------------------------------- | ----------------------------------------------------------- | ------------------------------------- |
| Deterministic (labels, yes/no, fixed-format) | Heuristic: `ExactMatch`, `JSONDiff`, `ValidJSON` | Label classification, JSON extraction |
| Open-ended text with a reference answer | LLM-as-judge: `Factuality`, `ClosedQA`, `AnswerCorrectness` | Chatbot responses, QA, summaries |
| Text with expected context/grounding | RAG: `Faithfulness`, `ContextRelevancy` | RAG pipelines |
| Text with style/format requirements | Custom via `create_llm_evaluator` | Voice-friendly responses, tone checks |
| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone |

Critical rules:

- For open-ended LLM text, **never** use `ExactMatch` — LLM outputs are non-deterministic.
- `AnswerRelevancy` is **RAG-only** — it requires `context` in the trace and returns 0.0 without it. For general relevance, use `create_llm_evaluator`.
- Do NOT use comparison evaluators (`Factuality`, `ClosedQA`, `ExactMatch`) on items without `expected_output` — they produce meaningless scores.
---

## Evaluator Reference

### `AnswerCorrectness`

```python
AnswerCorrectness(*, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Answer correctness evaluator (RAGAS).

Judges whether `eval_output` is correct compared to `expected_output`, combining factual similarity and semantic similarity.

**When to use**: QA scenarios in RAG pipelines where you have a reference answer and want a comprehensive correctness score.

**Requires `expected_output`**: Yes.
**Requires `eval_metadata["context"]`**: Optional (improves accuracy).

Args:

- `client`: OpenAI client instance.
### `AnswerRelevancy`

```python
AnswerRelevancy(*, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Answer relevancy evaluator (RAGAS).

Judges whether `eval_output` directly addresses the question in `eval_input`.

**When to use**: RAG pipelines only — requires `context` in the trace. Returns 0.0 without it. For general (non-RAG) response relevance, use `create_llm_evaluator` with a custom prompt instead.

**Requires `expected_output`**: No.
**Requires `eval_metadata["context"]`**: Yes — **RAG pipelines only**.

Args:

- `client`: OpenAI client instance.
### `Battle`

```python
Battle(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Head-to-head comparison evaluator (LLM-as-judge).

Uses an LLM to compare `eval_output` against `expected_output` and determine which is better given the instructions in `eval_input`.

**When to use**: A/B testing scenarios, comparing model outputs, or ranking alternative responses.

**Requires `expected_output`**: Yes.

Args:

- `model`: LLM model name.
- `client`: OpenAI client instance.
### `ClosedQA`

```python
ClosedQA(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Closed-book question-answering evaluator (LLM-as-judge).

Uses an LLM to judge whether `eval_output` correctly answers the question in `eval_input` compared to `expected_output`. Optionally forwards `eval_metadata["criteria"]` for custom grading criteria.

**When to use**: QA scenarios where the answer should match a reference — e.g. customer support answers, knowledge-base queries.

**Requires `expected_output`**: Yes — do NOT use on items without `expected_output`; produces meaningless scores.

Args:

- `model`: LLM model name.
- `client`: OpenAI client instance.
### `ContextRelevancy`

```python
ContextRelevancy(*, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Context relevancy evaluator (RAGAS).

Judges whether the retrieved context is relevant to the query. Forwards `eval_metadata["context"]` to the underlying scorer.

**When to use**: RAG pipelines — evaluating retrieval quality.

**Requires `expected_output`**: Yes.
**Requires `eval_metadata["context"]`**: Yes (RAG pipelines only).

Args:

- `client`: OpenAI client instance.
### `EmbeddingSimilarity`

```python
EmbeddingSimilarity(*, prefix: 'str | None' = None, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Embedding-based semantic similarity evaluator.

Computes cosine similarity between embedding vectors of `eval_output` and `expected_output`.

**When to use**: Comparing semantic meaning of two texts when exact wording doesn't matter. More robust than Levenshtein for paraphrased content but less nuanced than LLM-as-judge evaluators.

**Requires `expected_output`**: Yes.

Args:

- `prefix`: Optional text to prepend for domain context.
- `model`: Embedding model name.
- `client`: OpenAI client instance.
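The cosine-similarity step is plain vector arithmetic. A self-contained sketch, assuming the embedding vectors have already been obtained from an embedding model (the evaluator itself handles that call):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]  # same direction, different magnitude
print(round(cosine_similarity(v1, v2), 6))  # 1.0
```

Because cosine similarity ignores vector magnitude, two paraphrases with similar meaning score high even when their wording (and hence edit distance) differs greatly.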
### `ExactMatch`

```python
ExactMatch() -> 'AutoevalsAdapter'
```

Exact value comparison evaluator.

Returns 1.0 if `eval_output` exactly equals `expected_output`, 0.0 otherwise.

**When to use**: Deterministic, structured outputs (classification labels, yes/no answers, fixed-format strings). **Never** use for open-ended LLM text — LLM outputs are non-deterministic, so exact match will almost always fail.

**Requires `expected_output`**: Yes.
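The behaviour described above reduces to a one-line comparison, which also makes the "never for open-ended text" rule easy to see. A minimal sketch (plausible behaviour per the description, not the wrapped scorer's source):

```python
def exact_match(output, expected) -> float:
    # 1.0 on exact equality, 0.0 otherwise.
    return 1.0 if output == expected else 0.0

# Fine for deterministic labels:
print(exact_match("positive", "positive"))  # 1.0
# Useless for paraphrased LLM text, even when the meaning is identical:
print(exact_match("Paris is the capital.", "The capital is Paris."))  # 0.0
```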
### `Factuality`

```python
Factuality(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Factual accuracy evaluator (LLM-as-judge).

Uses an LLM to judge whether `eval_output` is factually consistent with `expected_output` given the `eval_input` context.

**When to use**: Open-ended text where factual correctness matters (chatbot responses, QA answers, summaries). Preferred over `ExactMatch` for LLM-generated text.

**Requires `expected_output`**: Yes — do NOT use on items without `expected_output`; produces meaningless scores.

Args:

- `model`: LLM model name.
- `client`: OpenAI client instance.
### `Faithfulness`

```python
Faithfulness(*, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Faithfulness evaluator (RAGAS).

Judges whether `eval_output` is faithful to (i.e. supported by) the provided context. Forwards `eval_metadata["context"]`.

**When to use**: RAG pipelines — ensuring the answer doesn't hallucinate beyond what the retrieved context supports.

**Requires `expected_output`**: No.
**Requires `eval_metadata["context"]`**: Yes (RAG pipelines only).

Args:

- `client`: OpenAI client instance.
### `Humor`

```python
Humor(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Humor quality evaluator (LLM-as-judge).

Uses an LLM to judge the humor quality of `eval_output` against `expected_output`.

**When to use**: Evaluating humor in creative writing, chatbot personality, or entertainment applications.

**Requires `expected_output`**: Yes.

Args:

- `model`: LLM model name.
- `client`: OpenAI client instance.
### `JSONDiff`

```python
JSONDiff(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter'
```

Structural JSON comparison evaluator.

Recursively compares two JSON structures and produces a similarity score. Handles nested objects, arrays, and mixed types.

**When to use**: Structured JSON outputs where field-level comparison is needed (e.g. extracted data, API response schemas, tool call arguments).

**Requires `expected_output`**: Yes.

Args:

- `string_scorer`: Optional pairwise scorer for string fields.
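To see what "recursive structural comparison" means in practice, here is a rough sketch of the idea — averaging field-wise agreement over the union of keys. This is not autoevals' exact algorithm (which, among other things, can delegate string leaves to `string_scorer`), just the shape of it:

```python
def json_similarity(a, b) -> float:
    # Dicts: average similarity over the union of keys (missing key -> 0).
    if isinstance(a, dict) and isinstance(b, dict):
        keys = set(a) | set(b)
        if not keys:
            return 1.0
        return sum(json_similarity(a.get(k), b.get(k)) for k in keys) / len(keys)
    # Lists: average positional similarity, penalising length mismatch.
    if isinstance(a, list) and isinstance(b, list):
        n = max(len(a), len(b))
        if n == 0:
            return 1.0
        return sum(json_similarity(x, y) for x, y in zip(a, b)) / n
    # Scalars: exact comparison at the leaves.
    return 1.0 if a == b else 0.0

got = {"name": "Ada", "age": 36, "tags": ["math", "code"]}
want = {"name": "Ada", "age": 37, "tags": ["math", "code"]}
print(round(json_similarity(got, want), 3))  # 0.667
```

Unlike `ExactMatch`, a single wrong field only costs its share of the score instead of zeroing the whole comparison.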
### `LevenshteinMatch`

```python
LevenshteinMatch() -> 'AutoevalsAdapter'
```

Edit-distance string similarity evaluator.

Computes a normalised Levenshtein distance between `eval_output` and `expected_output`. Returns 1.0 for identical strings and decreasing scores as edit distance grows.

**When to use**: Deterministic or near-deterministic outputs where small textual variations are acceptable (e.g. formatting differences, minor spelling). Not suitable for open-ended LLM text — use an LLM-as-judge evaluator instead.

**Requires `expected_output`**: Yes.
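The underlying computation can be sketched with the classic dynamic-programming edit distance; the normalisation shown (1 minus distance over the longer length) is one plausible choice, not necessarily the wrapped scorer's exact formula:

```python
def levenshtein(s: str, t: str) -> int:
    # Row-by-row DP over edit operations.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_score(output: str, expected: str) -> float:
    # Normalise to [0, 1]: identical strings score 1.0.
    longest = max(len(output), len(expected)) or 1
    return 1.0 - levenshtein(output, expected) / longest

print(levenshtein_score("kitten", "kitten"))             # 1.0
print(round(levenshtein_score("kitten", "sitting"), 3))  # 0.571
```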
### `ListContains`

```python
ListContains(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter'
```

List overlap evaluator.

Checks whether `eval_output` contains all items from `expected_output`. Scores based on overlap ratio.

**When to use**: Outputs that produce a list of items where completeness matters (e.g. extracted entities, search results, recommendations).

**Requires `expected_output`**: Yes.

Args:

- `pairwise_scorer`: Optional scorer for pairwise element comparison.
- `allow_extra_entities`: If True, extra items in output are not penalised.
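A rough sketch of the overlap scoring with exact element matching (the real scorer can use `pairwise_scorer` for fuzzy element comparison, and its exact formula may differ):

```python
def list_contains_score(output: list[str], expected: list[str],
                        allow_extra_entities: bool = False) -> float:
    # Fraction of expected items found in the output. When extras are not
    # allowed, score recall against the larger list so extras cost points.
    if not expected:
        return 1.0
    hits = sum(1 for item in expected if item in output)
    if allow_extra_entities:
        return hits / len(expected)
    return hits / max(len(expected), len(output))

out = ["Paris", "Berlin", "Madrid"]
exp = ["Paris", "Berlin"]
print(list_contains_score(out, exp, allow_extra_entities=True))  # 1.0
print(round(list_contains_score(out, exp), 3))                   # 0.667
```

The `allow_extra_entities` flag matters whenever over-extraction is cheap for the model: with it set, only completeness is graded.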
### `Moderation`

```python
Moderation(*, threshold: 'float | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Content moderation evaluator.

Uses the OpenAI moderation API to check `eval_output` for unsafe content (hate speech, violence, self-harm, etc.).

**When to use**: Any application where output safety is a concern — chatbots, content generation, user-facing AI.

**Requires `expected_output`**: No.

Args:

- `threshold`: Custom flagging threshold.
- `client`: OpenAI client instance.
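The `threshold` parameter's role can be illustrated offline. The sketch below assumes per-category scores have already been returned by the moderation API and treats "any category at or above the threshold" as a flag — a plausible reading of the description above, not the evaluator's verified logic:

```python
def moderation_score(category_scores: dict[str, float],
                     threshold: float = 0.5) -> float:
    # 1.0 = safe; 0.0 = at least one category meets the flagging threshold.
    return 0.0 if any(s >= threshold for s in category_scores.values()) else 1.0

scores = {"hate": 0.01, "violence": 0.02, "self-harm": 0.0}
print(moderation_score(scores))                   # 1.0
print(moderation_score(scores, threshold=0.02))   # 0.0
```

Lowering the threshold trades false negatives for false positives; tune it against a labelled sample of your own traffic.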
### `NumericDiff`

```python
NumericDiff() -> 'AutoevalsAdapter'
```

Normalised numeric difference evaluator.

Computes a normalised numeric distance between `eval_output` and `expected_output`. Returns 1.0 for identical numbers and decreasing scores as the difference grows.

**When to use**: Numeric outputs where approximate equality is acceptable (e.g. price calculations, scores, measurements).

**Requires `expected_output`**: Yes.
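One plausible normalisation matching the description — not necessarily the wrapped scorer's exact formula — divides the absolute difference by the larger magnitude:

```python
def numeric_diff_score(output: float, expected: float) -> float:
    # 1 - |a - b| / max(|a|, |b|); two zeros count as identical.
    denom = max(abs(output), abs(expected))
    if denom == 0:
        return 1.0
    return 1.0 - abs(output - expected) / denom

print(numeric_diff_score(100.0, 100.0))             # 1.0
print(round(numeric_diff_score(100.0, 110.0), 3))   # 0.909
```

Note the score is relative, so a $10 error on a $110 price and a $1 error on an $11 price are penalised equally.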
### `Possible`

```python
Possible(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Feasibility / plausibility evaluator (LLM-as-judge).

Uses an LLM to judge whether `eval_output` is a plausible or feasible response.

**When to use**: General-purpose quality check when you want to verify outputs are reasonable without a specific reference answer.

**Requires `expected_output`**: No.

Args:

- `model`: LLM model name.
- `client`: OpenAI client instance.
### `Security`

```python
Security(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Security vulnerability evaluator (LLM-as-judge).

Uses an LLM to check `eval_output` for security vulnerabilities based on the instructions in `eval_input`.

**When to use**: Code generation, SQL output, or any scenario where output must be checked for injection or vulnerability risks.

**Requires `expected_output`**: No.

Args:

- `model`: LLM model name.
- `client`: OpenAI client instance.
### `Sql`

```python
Sql(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```

SQL equivalence evaluator (LLM-as-judge).

Uses an LLM to judge whether `eval_output` SQL is semantically equivalent to `expected_output` SQL.

**When to use**: Text-to-SQL applications where the generated SQL should be functionally equivalent to a reference query.

**Requires `expected_output`**: Yes.

Args:

- `model`: LLM model name.
- `client`: OpenAI client instance.
### `Summary`

```python
Summary(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Summarisation quality evaluator (LLM-as-judge).

Uses an LLM to judge the quality of `eval_output` as a summary compared to the reference summary in `expected_output`.

**When to use**: Summarisation tasks where the output must capture key information from the source material.

**Requires `expected_output`**: Yes.

Args:

- `model`: LLM model name.
- `client`: OpenAI client instance.
### `Translation`

```python
Translation(*, language: 'str | None' = None, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```

Translation quality evaluator (LLM-as-judge).

Uses an LLM to judge the translation quality of `eval_output` compared to `expected_output` in the target language.

**When to use**: Machine translation or multilingual output scenarios.

**Requires `expected_output`**: Yes.

Args:

- `language`: Target language (e.g. `"Spanish"`).
- `model`: LLM model name.
- `client`: OpenAI client instance.
### `ValidJSON`

```python
ValidJSON(*, schema: 'Any' = None) -> 'AutoevalsAdapter'
```

JSON syntax and schema validation evaluator.

Returns 1.0 if `eval_output` is valid JSON (and optionally matches the provided schema), 0.0 otherwise.

**When to use**: Outputs that must be valid JSON — optionally conforming to a specific schema (e.g. tool call responses, structured extraction).

**Requires `expected_output`**: No.

Args:

- `schema`: Optional JSON Schema to validate against.
---

## Custom Evaluators: `create_llm_evaluator`

Factory for custom LLM-as-judge evaluators from prompt templates.

Usage::

    from pixie import create_llm_evaluator

    concise_voice_style = create_llm_evaluator(
        name="ConciseVoiceStyle",
        prompt_template="""
        You are evaluating whether a voice agent response is concise and
        phone-friendly.

        User said: {eval_input}
        Agent responded: {eval_output}
        Expected behavior: {expectation}

        Score 1.0 if the response is concise (under 3 sentences), directly
        addresses the question, and uses conversational language suitable for
        a phone call. Score 0.0 if it's verbose, off-topic, or uses
        written-style formatting.
        """,
    )

### `create_llm_evaluator`

```python
create_llm_evaluator(name: 'str', prompt_template: 'str', *, model: 'str' = 'gpt-4o-mini', client: 'Any | None' = None) -> '_LLMEvaluator'
```

Create a custom LLM-as-judge evaluator from a prompt template.

The template may reference these variables (populated from the :class:`~pixie.storage.evaluable.Evaluable` fields):

- `{eval_input}` — the evaluable's input data. Single-item lists expand to that item's value; multi-item lists expand to a JSON dict of `name → value` pairs.
- `{eval_output}` — the evaluable's output data (same rule as `eval_input`).
- `{expectation}` — the evaluable's expected output.

Args:

- `name`: Display name for the evaluator (shown in scorecard).
- `prompt_template`: A string template with `{eval_input}`, `{eval_output}`, and/or `{expectation}` placeholders.
- `model`: OpenAI model name (default: `gpt-4o-mini`).
- `client`: Optional pre-configured OpenAI client instance.

Returns:

An evaluator callable satisfying the `Evaluator` protocol.

Raises:

- `ValueError`: If the template uses nested field access like `{eval_input[key]}` (only top-level placeholders are supported).
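The expansion and validation rules above can be sketched in standalone form. `expand_template` below is hypothetical (pixie's internal helper is not shown), but it implements the documented behaviour: single-item lists expand to the bare value, multi-item lists to a JSON dict of `name → value` pairs, and nested field access raises `ValueError`:

```python
import json
import string

def expand_template(template: str,
                    fields: dict[str, list[tuple[str, str]]]) -> str:
    # Reject nested access such as {eval_input[key]} or {eval_input.key},
    # mirroring the factory's documented ValueError.
    for _, name, _, _ in string.Formatter().parse(template):
        if name and ("[" in name or "." in name):
            raise ValueError(f"nested field access not supported: {name}")
    values = {}
    for placeholder, items in fields.items():
        if len(items) == 1:
            values[placeholder] = items[0][1]        # single item -> bare value
        else:
            values[placeholder] = json.dumps(dict(items))  # multi -> JSON dict
    return template.format(**values)

t = "User said: {eval_input}\nAgent responded: {eval_output}"
print(expand_template(t, {
    "eval_input": [("question", "What's my balance?")],
    "eval_output": [("reply", "Your balance is $42."), ("tone", "friendly")],
}))
```

Keeping placeholders top-level makes templates robust to changes in the evaluable's field structure; anything more granular belongs in the judge prompt itself.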