# Built-in Evaluators > Auto-generated from pixie source code docstrings. > Do not edit by hand — regenerate from the upstream [pixie-qa](https://github.com/yiouli/pixie-qa) source repository. Autoevals adapters — pre-made evaluators wrapping `autoevals` scorers. This module provides :class:`AutoevalsAdapter`, which bridges the autoevals `Scorer` interface to pixie's `Evaluator` protocol, and a set of factory functions for common evaluation tasks. Public API (all are also re-exported from `pixie.evals`): **Core adapter:** - :class:`AutoevalsAdapter` — generic wrapper for any autoevals `Scorer`. **Heuristic scorers (no LLM required):** - :func:`LevenshteinMatch` — edit-distance string similarity. - :func:`ExactMatch` — exact value comparison. - :func:`NumericDiff` — normalised numeric difference. - :func:`JSONDiff` — structural JSON comparison. - :func:`ValidJSON` — JSON syntax / schema validation. - :func:`ListContains` — overlap between two string lists. **Embedding scorer:** - :func:`EmbeddingSimilarity` — cosine similarity via embeddings. **LLM-as-judge scorers:** - :func:`Factuality`, :func:`ClosedQA`, :func:`Battle`, :func:`Humor`, :func:`Security`, :func:`Sql`, :func:`Summary`, :func:`Translation`, :func:`Possible`. **Moderation:** - :func:`Moderation` — OpenAI content-moderation check. **RAGAS metrics:** - :func:`ContextRelevancy`, :func:`Faithfulness`, :func:`AnswerRelevancy`, :func:`AnswerCorrectness`. ## Evaluator Selection Guide Choose evaluators based on the **output type** and eval criteria: | Output type | Evaluator category | Examples | | -------------------------------------------- | ----------------------------------------------------------- | ------------------------------------- | | Deterministic (labels, yes/no, fixed-format) | Heuristic: `ExactMatch`, `JSONDiff`, `ValidJSON` | Label classification, JSON extraction | | Open-ended text with a reference answer | LLM-as-judge: `Factuality`, `ClosedQA`, `AnswerCorrectness` | Chatbot responses, QA, summaries | | Text with expected context/grounding | RAG: `Faithfulness`, `ContextRelevancy` | RAG pipelines | | Text with style/format requirements | Custom via `create_llm_evaluator` | Voice-friendly responses, tone checks | | Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone | Critical rules: - For open-ended LLM text, **never** use `ExactMatch` — LLM outputs are non-deterministic. - `AnswerRelevancy` is **RAG-only** — requires `context` in the trace. Returns 0.0 without it. For general relevance, use `create_llm_evaluator`. - Do NOT use comparison evaluators (`Factuality`, `ClosedQA`, `ExactMatch`) on items without `expected_output` — they produce meaningless scores. --- ## Evaluator Reference ### `AnswerCorrectness` ```python AnswerCorrectness(*, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Answer correctness evaluator (RAGAS). Judges whether `eval_output` is correct compared to `expected_output`, combining factual similarity and semantic similarity. **When to use**: QA scenarios in RAG pipelines where you have a reference answer and want a comprehensive correctness score. **Requires `expected_output`**: Yes. **Requires `eval_metadata["context"]`**: Optional (improves accuracy). Args: client: OpenAI client instance. ### `AnswerRelevancy` ```python AnswerRelevancy(*, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Answer relevancy evaluator (RAGAS). Judges whether `eval_output` directly addresses the question in `eval_input`. **When to use**: RAG pipelines only — requires `context` in the trace. Returns 0.0 without it. For general (non-RAG) response relevance, use `create_llm_evaluator` with a custom prompt instead. **Requires `expected_output`**: No. **Requires `eval_metadata["context"]`**: Yes — **RAG pipelines only**. Args: client: OpenAI client instance. ### `Battle` ```python Battle(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Head-to-head comparison evaluator (LLM-as-judge). Uses an LLM to compare `eval_output` against `expected_output` and determine which is better given the instructions in `eval_input`. **When to use**: A/B testing scenarios, comparing model outputs, or ranking alternative responses. **Requires `expected_output`**: Yes. Args: model: LLM model name. client: OpenAI client instance. ### `ClosedQA` ```python ClosedQA(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Closed-book question-answering evaluator (LLM-as-judge). Uses an LLM to judge whether `eval_output` correctly answers the question in `eval_input` compared to `expected_output`. Optionally forwards `eval_metadata["criteria"]` for custom grading criteria. **When to use**: QA scenarios where the answer should match a reference — e.g. customer support answers, knowledge-base queries. **Requires `expected_output`**: Yes — do NOT use on items without `expected_output`; produces meaningless scores. Args: model: LLM model name. client: OpenAI client instance. ### `ContextRelevancy` ```python ContextRelevancy(*, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Context relevancy evaluator (RAGAS). Judges whether the retrieved context is relevant to the query. Forwards `eval_metadata["context"]` to the underlying scorer. **When to use**: RAG pipelines — evaluating retrieval quality. **Requires `expected_output`**: Yes. **Requires `eval_metadata["context"]`**: Yes (RAG pipelines only). Args: client: OpenAI client instance. ### `EmbeddingSimilarity` ```python EmbeddingSimilarity(*, prefix: 'str | None' = None, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Embedding-based semantic similarity evaluator. Computes cosine similarity between embedding vectors of `eval_output` and `expected_output`. **When to use**: Comparing semantic meaning of two texts when exact wording doesn't matter. More robust than Levenshtein for paraphrased content but less nuanced than LLM-as-judge evaluators. **Requires `expected_output`**: Yes. Args: prefix: Optional text to prepend for domain context. model: Embedding model name. client: OpenAI client instance. ### `ExactMatch` ```python ExactMatch() -> 'AutoevalsAdapter' ``` Exact value comparison evaluator. Returns 1.0 if `eval_output` exactly equals `expected_output`, 0.0 otherwise. **When to use**: Deterministic, structured outputs (classification labels, yes/no answers, fixed-format strings). **Never** use for open-ended LLM text — LLM outputs are non-deterministic, so exact match will almost always fail. **Requires `expected_output`**: Yes. ### `Factuality` ```python Factuality(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Factual accuracy evaluator (LLM-as-judge). Uses an LLM to judge whether `eval_output` is factually consistent with `expected_output` given the `eval_input` context. **When to use**: Open-ended text where factual correctness matters (chatbot responses, QA answers, summaries). Preferred over `ExactMatch` for LLM-generated text. **Requires `expected_output`**: Yes — do NOT use on items without `expected_output`; produces meaningless scores. Args: model: LLM model name. client: OpenAI client instance. ### `Faithfulness` ```python Faithfulness(*, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Faithfulness evaluator (RAGAS). Judges whether `eval_output` is faithful to (i.e. supported by) the provided context. Forwards `eval_metadata["context"]`. **When to use**: RAG pipelines — ensuring the answer doesn't hallucinate beyond what the retrieved context supports. **Requires `expected_output`**: No. **Requires `eval_metadata["context"]`**: Yes (RAG pipelines only). Args: client: OpenAI client instance. ### `Humor` ```python Humor(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Humor quality evaluator (LLM-as-judge). Uses an LLM to judge the humor quality of `eval_output` against `expected_output`. **When to use**: Evaluating humor in creative writing, chatbot personality, or entertainment applications. **Requires `expected_output`**: Yes. Args: model: LLM model name. client: OpenAI client instance. ### `JSONDiff` ```python JSONDiff(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter' ``` Structural JSON comparison evaluator. Recursively compares two JSON structures and produces a similarity score. Handles nested objects, arrays, and mixed types. **When to use**: Structured JSON outputs where field-level comparison is needed (e.g. extracted data, API response schemas, tool call arguments). **Requires `expected_output`**: Yes. Args: string_scorer: Optional pairwise scorer for string fields. ### `LevenshteinMatch` ```python LevenshteinMatch() -> 'AutoevalsAdapter' ``` Edit-distance string similarity evaluator. Computes a normalised Levenshtein distance between `eval_output` and `expected_output`. Returns 1.0 for identical strings and decreasing scores as edit distance grows. **When to use**: Deterministic or near-deterministic outputs where small textual variations are acceptable (e.g. formatting differences, minor spelling). Not suitable for open-ended LLM text — use an LLM-as-judge evaluator instead. **Requires `expected_output`**: Yes. ### `ListContains` ```python ListContains(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter' ``` List overlap evaluator. Checks whether `eval_output` contains all items from `expected_output`. Scores based on overlap ratio. **When to use**: Outputs that produce a list of items where completeness matters (e.g. extracted entities, search results, recommendations). **Requires `expected_output`**: Yes. Args: pairwise_scorer: Optional scorer for pairwise element comparison. allow_extra_entities: If True, extra items in output are not penalised. ### `Moderation` ```python Moderation(*, threshold: 'float | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Content moderation evaluator. Uses the OpenAI moderation API to check `eval_output` for unsafe content (hate speech, violence, self-harm, etc.). **When to use**: Any application where output safety is a concern — chatbots, content generation, user-facing AI. **Requires `expected_output`**: No. Args: threshold: Custom flagging threshold. client: OpenAI client instance. ### `NumericDiff` ```python NumericDiff() -> 'AutoevalsAdapter' ``` Normalised numeric difference evaluator. Computes a normalised numeric distance between `eval_output` and `expected_output`. Returns 1.0 for identical numbers and decreasing scores as the difference grows. **When to use**: Numeric outputs where approximate equality is acceptable (e.g. price calculations, scores, measurements). **Requires `expected_output`**: Yes. ### `Possible` ```python Possible(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Feasibility / plausibility evaluator (LLM-as-judge). Uses an LLM to judge whether `eval_output` is a plausible or feasible response. **When to use**: General-purpose quality check when you want to verify outputs are reasonable without a specific reference answer. **Requires `expected_output`**: No. Args: model: LLM model name. client: OpenAI client instance. ### `Security` ```python Security(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Security vulnerability evaluator (LLM-as-judge). Uses an LLM to check `eval_output` for security vulnerabilities based on the instructions in `eval_input`. **When to use**: Code generation, SQL output, or any scenario where output must be checked for injection or vulnerability risks. **Requires `expected_output`**: No. Args: model: LLM model name. client: OpenAI client instance. ### `Sql` ```python Sql(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter' ``` SQL equivalence evaluator (LLM-as-judge). Uses an LLM to judge whether `eval_output` SQL is semantically equivalent to `expected_output` SQL. **When to use**: Text-to-SQL applications where the generated SQL should be functionally equivalent to a reference query. **Requires `expected_output`**: Yes. Args: model: LLM model name. client: OpenAI client instance. ### `Summary` ```python Summary(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Summarisation quality evaluator (LLM-as-judge). Uses an LLM to judge the quality of `eval_output` as a summary compared to the reference summary in `expected_output`. **When to use**: Summarisation tasks where the output must capture key information from the source material. **Requires `expected_output`**: Yes. Args: model: LLM model name. client: OpenAI client instance. ### `Translation` ```python Translation(*, language: 'str | None' = None, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter' ``` Translation quality evaluator (LLM-as-judge). Uses an LLM to judge the translation quality of `eval_output` compared to `expected_output` in the target language. **When to use**: Machine translation or multilingual output scenarios. **Requires `expected_output`**: Yes. Args: language: Target language (e.g. `"Spanish"`). model: LLM model name. client: OpenAI client instance. ### `ValidJSON` ```python ValidJSON(*, schema: 'Any' = None) -> 'AutoevalsAdapter' ``` JSON syntax and schema validation evaluator. Returns 1.0 if `eval_output` is valid JSON (and optionally matches the provided schema), 0.0 otherwise. **When to use**: Outputs that must be valid JSON — optionally conforming to a specific schema (e.g. tool call responses, structured extraction). **Requires `expected_output`**: No. Args: schema: Optional JSON Schema to validate against. --- ## Custom Evaluators: `create_llm_evaluator` Factory for custom LLM-as-judge evaluators from prompt templates. Usage:: from pixie import create_llm_evaluator concise_voice_style = create_llm_evaluator( name="ConciseVoiceStyle", prompt_template=""" You are evaluating whether a voice agent response is concise and phone-friendly. User said: {eval_input} Agent responded: {eval_output} Expected behavior: {expectation} Score 1.0 if the response is concise (under 3 sentences), directly addresses the question, and uses conversational language suitable for a phone call. Score 0.0 if it's verbose, off-topic, or uses written-style formatting. """, ) ### `create_llm_evaluator` ```python create_llm_evaluator(name: 'str', prompt_template: 'str', *, model: 'str' = 'gpt-4o-mini', client: 'Any | None' = None) -> '_LLMEvaluator' ``` Create a custom LLM-as-judge evaluator from a prompt template. The template may reference these variables (populated from the :class:`~pixie.storage.evaluable.Evaluable` fields): - `{eval_input}` — the evaluable's input data. Single-item lists expand to that item's value; multi-item lists expand to a JSON dict of `name → value` pairs. - `{eval_output}` — the evaluable's output data (same rule as `eval_input`). - `{expectation}` — the evaluable's expected output Args: name: Display name for the evaluator (shown in scorecard). prompt_template: A string template with `{eval_input}`, `{eval_output}`, and/or `{expectation}` placeholders. model: OpenAI model name (default: `gpt-4o-mini`). client: Optional pre-configured OpenAI client instance. Returns: An evaluator callable satisfying the `Evaluator` protocol. Raises: ValueError: If the template uses nested field access like `{eval_input[key]}` (only top-level placeholders are supported).