Built-in Evaluators
Auto-generated from pixie source code docstrings. Do not edit by hand — regenerate from the upstream pixie-qa source repository.
Autoevals adapters — pre-made evaluators wrapping autoevals scorers.
This module provides `AutoevalsAdapter`, which bridges the
autoevals `Scorer` interface to pixie's `Evaluator` protocol, and
a set of factory functions for common evaluation tasks.

Public API (all are also re-exported from `pixie.evals`):

Core adapter:
- `AutoevalsAdapter` — generic wrapper for any autoevals `Scorer`.

Heuristic scorers (no LLM required):
- `LevenshteinMatch` — edit-distance string similarity.
- `ExactMatch` — exact value comparison.
- `NumericDiff` — normalised numeric difference.
- `JSONDiff` — structural JSON comparison.
- `ValidJSON` — JSON syntax / schema validation.
- `ListContains` — overlap between two string lists.

Embedding scorer:
- `EmbeddingSimilarity` — cosine similarity via embeddings.

LLM-as-judge scorers:
- `Factuality`, `ClosedQA`, `Battle`, `Humor`, `Security`, `Sql`, `Summary`, `Translation`, `Possible`.

Moderation:
- `Moderation` — OpenAI content-moderation check.

RAGAS metrics:
- `ContextRelevancy`, `Faithfulness`, `AnswerRelevancy`, `AnswerCorrectness`.
Evaluator Selection Guide
Choose evaluators based on the output type and eval criteria:
| Output type | Evaluator category | Examples |
|---|---|---|
| Deterministic (labels, yes/no, fixed-format) | Heuristic: ExactMatch, JSONDiff, ValidJSON | Label classification, JSON extraction |
| Open-ended text with a reference answer | LLM-as-judge: Factuality, ClosedQA, AnswerCorrectness | Chatbot responses, QA, summaries |
| Text with expected context/grounding | RAG: Faithfulness, ContextRelevancy | RAG pipelines |
| Text with style/format requirements | Custom via create_llm_evaluator | Voice-friendly responses, tone checks |
| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone |
Critical rules:
- For open-ended LLM text, never use `ExactMatch` — LLM outputs are non-deterministic.
- `AnswerRelevancy` is RAG-only — it requires `context` in the trace and returns 0.0 without it. For general relevance, use `create_llm_evaluator`.
- Do NOT use comparison evaluators (`Factuality`, `ClosedQA`, `ExactMatch`) on items without `expected_output` — they produce meaningless scores.
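The last rule above can be enforced mechanically before a run. A minimal sketch — the `item` dict shape and the helper name are hypothetical, not pixie's actual data model:

```python
# Hypothetical pre-run guard: drop comparison evaluators for any item
# that lacks a reference answer, so they never score meaninglessly.
COMPARISON_EVALUATORS = {"Factuality", "ClosedQA", "ExactMatch"}

def applicable_evaluators(item: dict, requested: list[str]) -> list[str]:
    """Filter out comparison evaluators when expected_output is missing."""
    has_reference = item.get("expected_output") is not None
    return [
        name for name in requested
        if has_reference or name not in COMPARISON_EVALUATORS
    ]
```

Reference-free evaluators such as `Moderation` or `Possible` pass through untouched either way.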
Evaluator Reference
AnswerCorrectness
AnswerCorrectness(*, client: 'Any' = None) -> 'AutoevalsAdapter'
Answer correctness evaluator (RAGAS).
Judges whether eval_output is correct compared to
expected_output, combining factual similarity and semantic
similarity.
When to use: QA scenarios in RAG pipelines where you have a reference answer and want a comprehensive correctness score.
Requires expected_output: Yes.
Requires eval_metadata["context"]: Optional (improves accuracy).
Args: client: OpenAI client instance.
AnswerRelevancy
AnswerRelevancy(*, client: 'Any' = None) -> 'AutoevalsAdapter'
Answer relevancy evaluator (RAGAS).
Judges whether eval_output directly addresses the question in
eval_input.
When to use: RAG pipelines only — requires context in the
trace. Returns 0.0 without it. For general (non-RAG) response
relevance, use create_llm_evaluator with a custom prompt instead.
Requires expected_output: No.
Requires eval_metadata["context"]: Yes — RAG pipelines only.
Args: client: OpenAI client instance.
Battle
Battle(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
Head-to-head comparison evaluator (LLM-as-judge).
Uses an LLM to compare eval_output against expected_output
and determine which is better given the instructions in eval_input.
When to use: A/B testing scenarios, comparing model outputs, or ranking alternative responses.
Requires expected_output: Yes.
Args: model: LLM model name. client: OpenAI client instance.
ClosedQA
ClosedQA(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
Closed-book question-answering evaluator (LLM-as-judge).
Uses an LLM to judge whether eval_output correctly answers the
question in eval_input compared to expected_output. Optionally
forwards eval_metadata["criteria"] for custom grading criteria.
When to use: QA scenarios where the answer should match a reference — e.g. customer support answers, knowledge-base queries.
Requires expected_output: Yes — do NOT use on items without
expected_output; produces meaningless scores.
Args: model: LLM model name. client: OpenAI client instance.
ContextRelevancy
ContextRelevancy(*, client: 'Any' = None) -> 'AutoevalsAdapter'
Context relevancy evaluator (RAGAS).
Judges whether the retrieved context is relevant to the query.
Forwards eval_metadata["context"] to the underlying scorer.
When to use: RAG pipelines — evaluating retrieval quality.
Requires expected_output: Yes.
Requires eval_metadata["context"]: Yes (RAG pipelines only).
Args: client: OpenAI client instance.
EmbeddingSimilarity
EmbeddingSimilarity(*, prefix: 'str | None' = None, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
Embedding-based semantic similarity evaluator.
Computes cosine similarity between embedding vectors of eval_output
and expected_output.
When to use: Comparing semantic meaning of two texts when exact wording doesn't matter. More robust than Levenshtein for paraphrased content but less nuanced than LLM-as-judge evaluators.
Requires expected_output: Yes.
Args: prefix: Optional text to prepend for domain context. model: Embedding model name. client: OpenAI client instance.
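The underlying score is cosine similarity between the two embedding vectors. A minimal sketch of that computation in plain Python (autoevals may additionally rescale or clamp the result):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.
    Returns 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```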
ExactMatch
ExactMatch() -> 'AutoevalsAdapter'
Exact value comparison evaluator.
Returns 1.0 if eval_output exactly equals expected_output,
0.0 otherwise.
When to use: Deterministic, structured outputs (classification labels, yes/no answers, fixed-format strings). Never use for open-ended LLM text — LLM outputs are non-deterministic, so exact match will almost always fail.
Requires expected_output: Yes.
Factuality
Factuality(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
Factual accuracy evaluator (LLM-as-judge).
Uses an LLM to judge whether eval_output is factually consistent
with expected_output given the eval_input context.
When to use: Open-ended text where factual correctness matters
(chatbot responses, QA answers, summaries). Preferred over
ExactMatch for LLM-generated text.
Requires expected_output: Yes — do NOT use on items without
expected_output; produces meaningless scores.
Args: model: LLM model name. client: OpenAI client instance.
Faithfulness
Faithfulness(*, client: 'Any' = None) -> 'AutoevalsAdapter'
Faithfulness evaluator (RAGAS).
Judges whether eval_output is faithful to (i.e. supported by)
the provided context. Forwards eval_metadata["context"].
When to use: RAG pipelines — ensuring the answer doesn't hallucinate beyond what the retrieved context supports.
Requires expected_output: No.
Requires eval_metadata["context"]: Yes (RAG pipelines only).
Args: client: OpenAI client instance.
Humor
Humor(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
Humor quality evaluator (LLM-as-judge).
Uses an LLM to judge the humor quality of eval_output against
expected_output.
When to use: Evaluating humor in creative writing, chatbot personality, or entertainment applications.
Requires expected_output: Yes.
Args: model: LLM model name. client: OpenAI client instance.
JSONDiff
JSONDiff(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter'
Structural JSON comparison evaluator.
Recursively compares two JSON structures and produces a similarity score. Handles nested objects, arrays, and mixed types.
When to use: Structured JSON outputs where field-level comparison is needed (e.g. extracted data, API response schemas, tool call arguments).
Requires expected_output: Yes.
Args: string_scorer: Optional pairwise scorer for string fields.
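The recursive comparison can be illustrated with a simplified stand-in — this is not autoevals' exact algorithm (which also supports a pluggable string scorer), just the structural idea:

```python
def json_similarity(a, b) -> float:
    """Recursive structural similarity between two parsed JSON values.
    Illustrative only -- autoevals' JSONDiff scoring details may differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        keys = set(a) | set(b)
        if not keys:
            return 1.0
        # Average the per-key similarity; a key missing on one side scores 0.
        return sum(json_similarity(a.get(k), b.get(k)) for k in keys) / len(keys)
    if isinstance(a, list) and isinstance(b, list):
        if not a and not b:
            return 1.0
        n = max(len(a), len(b))
        return sum(json_similarity(x, y) for x, y in zip(a, b)) / n
    return 1.0 if a == b else 0.0
```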
LevenshteinMatch
LevenshteinMatch() -> 'AutoevalsAdapter'
Edit-distance string similarity evaluator.
Computes a normalised Levenshtein distance between eval_output and
expected_output. Returns 1.0 for identical strings and decreasing
scores as edit distance grows.
When to use: Deterministic or near-deterministic outputs where small textual variations are acceptable (e.g. formatting differences, minor spelling). Not suitable for open-ended LLM text — use an LLM-as-judge evaluator instead.
Requires expected_output: Yes.
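The normalisation works roughly as follows — a plain dynamic-programming sketch, assuming the common `1 - distance / max_length` scaling (autoevals' exact normalisation may differ):

```python
def levenshtein_score(output: str, expected: str) -> float:
    """Normalised edit-distance similarity: 1.0 for identical strings,
    approaching 0.0 as the edit distance nears the longer string's length."""
    if not output and not expected:
        return 1.0
    # Classic two-row Levenshtein DP.
    prev = list(range(len(expected) + 1))
    for i, c_out in enumerate(output, start=1):
        curr = [i]
        for j, c_exp in enumerate(expected, start=1):
            cost = 0 if c_out == c_exp else 1
            curr.append(min(curr[j - 1] + 1,      # insertion
                            prev[j] + 1,          # deletion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    distance = prev[-1]
    return 1.0 - distance / max(len(output), len(expected))
```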
ListContains
ListContains(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter'
List overlap evaluator.
Checks whether eval_output contains all items from
expected_output. Scores based on overlap ratio.
When to use: Outputs that produce a list of items where completeness matters (e.g. extracted entities, search results, recommendations).
Requires expected_output: Yes.
Args: pairwise_scorer: Optional scorer for pairwise element comparison. allow_extra_entities: If True, extra items in output are not penalised.
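The overlap-ratio idea can be sketched with exact matching only — autoevals additionally supports fuzzy pairwise matching via `pairwise_scorer`, and its penalty for extra items may be computed differently:

```python
def list_overlap_score(output: list[str], expected: list[str],
                       allow_extra: bool = False) -> float:
    """Fraction of expected items found in the output.
    Illustrative only -- the extra-item penalty here is hypothetical."""
    if not expected:
        return 1.0
    matched = sum(1 for item in expected if item in output)
    score = matched / len(expected)
    if not allow_extra and len(output) > len(expected):
        # Hypothetical penalty for surplus items when extras are disallowed.
        score *= len(expected) / len(output)
    return score
```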
Moderation
Moderation(*, threshold: 'float | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
Content moderation evaluator.
Uses the OpenAI moderation API to check eval_output for unsafe
content (hate speech, violence, self-harm, etc.).
When to use: Any application where output safety is a concern — chatbots, content generation, user-facing AI.
Requires expected_output: No.
Args: threshold: Custom flagging threshold. client: OpenAI client instance.
NumericDiff
NumericDiff() -> 'AutoevalsAdapter'
Normalised numeric difference evaluator.
Computes a normalised numeric distance between eval_output and
expected_output. Returns 1.0 for identical numbers and decreasing
scores as the difference grows.
When to use: Numeric outputs where approximate equality is acceptable (e.g. price calculations, scores, measurements).
Requires expected_output: Yes.
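One plausible normalisation — dividing the absolute difference by the larger magnitude — looks like this; autoevals' exact formula may differ:

```python
def numeric_diff_score(output: float, expected: float) -> float:
    """Normalised numeric similarity: 1.0 for identical numbers,
    shrinking toward 0.0 as the relative difference grows.
    Illustrative only -- not autoevals' exact formula."""
    if output == expected:
        return 1.0
    denom = max(abs(output), abs(expected))
    return max(0.0, 1.0 - abs(output - expected) / denom)
```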
Possible
Possible(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
Feasibility / plausibility evaluator (LLM-as-judge).
Uses an LLM to judge whether eval_output is a plausible or
feasible response.
When to use: General-purpose quality check when you want to verify outputs are reasonable without a specific reference answer.
Requires expected_output: No.
Args: model: LLM model name. client: OpenAI client instance.
Security
Security(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
Security vulnerability evaluator (LLM-as-judge).
Uses an LLM to check eval_output for security vulnerabilities
based on the instructions in eval_input.
When to use: Code generation, SQL output, or any scenario where output must be checked for injection or vulnerability risks.
Requires expected_output: No.
Args: model: LLM model name. client: OpenAI client instance.
Sql
Sql(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
SQL equivalence evaluator (LLM-as-judge).
Uses an LLM to judge whether eval_output SQL is semantically
equivalent to expected_output SQL.
When to use: Text-to-SQL applications where the generated SQL should be functionally equivalent to a reference query.
Requires expected_output: Yes.
Args: model: LLM model name. client: OpenAI client instance.
Summary
Summary(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
Summarisation quality evaluator (LLM-as-judge).
Uses an LLM to judge the quality of eval_output as a summary
compared to the reference summary in expected_output.
When to use: Summarisation tasks where the output must capture key information from the source material.
Requires expected_output: Yes.
Args: model: LLM model name. client: OpenAI client instance.
Translation
Translation(*, language: 'str | None' = None, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
Translation quality evaluator (LLM-as-judge).
Uses an LLM to judge the translation quality of eval_output
compared to expected_output in the target language.
When to use: Machine translation or multilingual output scenarios.
Requires expected_output: Yes.
Args:
language: Target language (e.g. "Spanish").
model: LLM model name.
client: OpenAI client instance.
ValidJSON
ValidJSON(*, schema: 'Any' = None) -> 'AutoevalsAdapter'
JSON syntax and schema validation evaluator.
Returns 1.0 if eval_output is valid JSON (and optionally matches
the provided schema), 0.0 otherwise.
When to use: Outputs that must be valid JSON — optionally conforming to a specific schema (e.g. tool call responses, structured extraction).
Requires expected_output: No.
Args: schema: Optional JSON Schema to validate against.
Custom Evaluators: create_llm_evaluator
Factory for custom LLM-as-judge evaluators from prompt templates.
Usage:

```python
from pixie import create_llm_evaluator

concise_voice_style = create_llm_evaluator(
    name="ConciseVoiceStyle",
    prompt_template="""
    You are evaluating whether a voice agent response is concise and
    phone-friendly.

    User said: {eval_input}
    Agent responded: {eval_output}
    Expected behavior: {expectation}

    Score 1.0 if the response is concise (under 3 sentences), directly
    addresses the question, and uses conversational language suitable for
    a phone call. Score 0.0 if it's verbose, off-topic, or uses
    written-style formatting.
    """,
)
```
create_llm_evaluator
create_llm_evaluator(name: 'str', prompt_template: 'str', *, model: 'str' = 'gpt-4o-mini', client: 'Any | None' = None) -> '_LLMEvaluator'
Create a custom LLM-as-judge evaluator from a prompt template.
The template may reference these variables (populated from the
`Evaluable` fields in `pixie.storage.evaluable`):

- `{eval_input}` — the evaluable's input data. Single-item lists expand to that item's value; multi-item lists expand to a JSON dict of `name → value` pairs.
- `{eval_output}` — the evaluable's output data (same rule as `{eval_input}`).
- `{expectation}` — the evaluable's expected output.
Args:
name: Display name for the evaluator (shown in scorecard).
prompt_template: A string template with {eval_input},
{eval_output}, and/or {expectation} placeholders.
model: OpenAI model name (default: gpt-4o-mini).
client: Optional pre-configured OpenAI client instance.
Returns:
An evaluator callable satisfying the Evaluator protocol.
Raises:
ValueError: If the template uses nested field access like
{eval_input[key]} (only top-level placeholders are supported).
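The single-item versus multi-item expansion rule for `{eval_input}` and `{eval_output}` can be sketched as follows — a hypothetical helper mirroring the documented rule, not pixie's actual implementation:

```python
import json

def expand_field(items: list[tuple[str, object]]) -> str:
    """Expand an evaluable field for template substitution:
    a single-item list becomes that item's value; a multi-item list
    becomes a JSON dict of name -> value pairs.
    (Hypothetical helper illustrating the documented rule.)"""
    if len(items) == 1:
        return str(items[0][1])
    return json.dumps({name: value for name, value in items})
```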