awesome-copilot/skills/eval-driven-dev/references/3-define-evaluators.md
Yiou Li 5f59ddb9cf update eval-driven-dev skill (#1352)
* update eval-driven-dev skill

* small refinement of skill description

* address review, rerun npm start.
2026-04-10 11:19:28 +10:00


Step 3: Define Evaluators

Why this step: With the app instrumented (Step 2), you now map each eval criterion to a concrete evaluator — implementing custom ones where needed — so the dataset (Step 4) can reference them by name.


3a. Map criteria to evaluators

Every eval criterion from Step 1b — including any dimensions specified by the user in the prompt — must have a corresponding evaluator. If the user asked for "factuality, completeness, and bias," you need three evaluators (or a multi-criteria evaluator that covers all three). Do not silently drop any requested dimension.
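One way to guard against silently dropping a dimension is a small coverage check run before moving on. The sketch below is plain Python and is not part of pixie; the criterion names come from the example above, and `completeness_check` is a hypothetical custom evaluator used only for illustration.

```python
# Criteria the user asked for (taken from the example in the text).
requested_criteria = ["factuality", "completeness", "bias"]

# Hypothetical criterion -> evaluator mapping; evaluator names are illustrative.
criterion_to_evaluators = {
    "factuality": ["Factuality"],
    "completeness": ["pixie_qa/evaluators.py:completeness_check"],
    # "bias" is intentionally left out to show what the check reports.
}


def find_uncovered(criteria, mapping):
    """Return the criteria that have no evaluator assigned, in input order."""
    return [c for c in criteria if not mapping.get(c)]


uncovered = find_uncovered(requested_criteria, criterion_to_evaluators)
print(uncovered)  # prints ['bias']
```

An empty result means every requested dimension has at least one evaluator; anything else is a criterion that still needs one.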

For each eval criterion, decide how to evaluate it:

  • Can it be checked with a built-in evaluator? (factual correctness → Factuality, exact match → ExactMatch, RAG faithfulness → Faithfulness)
  • Does it need a custom evaluator? Most app-specific criteria do — use create_llm_evaluator with a prompt that operationalizes the criterion.
  • Is it universal or case-specific? Universal criteria apply to all dataset items. Case-specific criteria apply only to certain rows.

For open-ended LLM text, never use ExactMatch — LLM outputs are non-deterministic.

AnswerRelevancy is RAG-only: it requires a context value in the trace and returns 0.0 without one. For general relevance without RAG, use create_llm_evaluator with a custom prompt.

3b. Implement custom evaluators

If any criterion requires a custom evaluator, implement it now. Place custom evaluators in pixie_qa/evaluators.py (or a sub-module if there are many).

create_llm_evaluator factory

Use when the quality dimension is domain-specific and no built-in evaluator fits.

The return value is a ready-to-use evaluator instance. Assign it to a module-level variable — pixie test will import and use it directly (no class wrapper needed):

from pixie import create_llm_evaluator

concise_voice_style = create_llm_evaluator(
    name="ConciseVoiceStyle",
    prompt_template="""
    You are evaluating whether this response is concise and phone-friendly.

    Input: {eval_input}
    Response: {eval_output}

    Score 1.0 if the response is concise (under 3 sentences), directly addresses
    the question, and uses conversational language suitable for a phone call.
    Score 0.0 if it's verbose, off-topic, or uses written-style formatting.
    """,
)

Reference the evaluator in your dataset JSON by its filepath:callable_name reference (e.g., "pixie_qa/evaluators.py:concise_voice_style").

How template variables work: {eval_input}, {eval_output}, {expectation} are the only placeholders. Each is replaced with a string representation of the corresponding Evaluable field:

  • Single-item eval_input / eval_output → the item's value (string, JSON-serialized dict/list)
  • Multi-item eval_input / eval_output → a JSON dict mapping name → value for every item

The LLM judge sees the full serialized value.
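To make the substitution concrete, here is a minimal sketch of what the judge might end up seeing for a multi-item eval_output. It assumes the placeholders behave like Python's `str.format` fields and that non-string values are serialized with `json.dumps`; `render_field` is a hypothetical helper, and pixie's actual rendering code may differ in detail.

```python
import json


def render_field(value):
    """Strings pass through unchanged; dicts and lists are JSON-serialized."""
    if isinstance(value, str):
        return value
    return json.dumps(value)


# Illustrative data: a single-item input and a multi-item output
# modeled as a name -> value dict.
eval_input = "Can you end the call now?"
eval_output = {"response": "Sure, goodbye!", "call_ended": True}

template = "Input: {eval_input}\nResponse: {eval_output}"
rendered = template.format(
    eval_input=render_field(eval_input),
    eval_output=render_field(eval_output),
)
print(rendered)
# Input: Can you end the call now?
# Response: {"response": "Sure, goodbye!", "call_ended": true}
```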

Rules:

  • Only {eval_input}, {eval_output}, {expectation} — no nested access like {eval_input[key]} (this will crash with a ValueError)
  • Keep templates short and direct — the system prompt already tells the LLM to return Score: X.X. Your template just needs to present the data and define the scoring criteria.
  • Don't instruct the LLM to "parse" or "extract" data — just present the values and state the criteria. The LLM can read JSON naturally.
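The placeholder rule can be checked before a template ever reaches the evaluator. The sketch below uses only the standard library's `string.Formatter` to list any field that is not one of the three allowed names; `invalid_placeholders` is a hypothetical helper for illustration, not a pixie function.

```python
from string import Formatter

ALLOWED = {"eval_input", "eval_output", "expectation"}


def invalid_placeholders(template):
    """Return placeholder field names that are not one of the three allowed."""
    bad = []
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text with no placeholder.
    for _literal, field_name, _spec, _conv in Formatter().parse(template):
        if field_name is not None and field_name not in ALLOWED:
            bad.append(field_name)
    return bad


print(invalid_placeholders("Input: {eval_input}\nTool: {eval_input[tool]}"))
# prints ['eval_input[tool]']
```

A template that passes returns an empty list, so the check fits easily into a unit test next to the evaluator definitions.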

Non-RAG response relevance (instead of AnswerRelevancy):

response_relevance = create_llm_evaluator(
    name="ResponseRelevance",
    prompt_template="""
    You are evaluating whether a customer support response is relevant and helpful.

    Input: {eval_input}
    Response: {eval_output}
    Expected: {expectation}

    Score 1.0 if the response directly addresses the question and meets expectations.
    Score 0.5 if partially relevant but misses important aspects.
    Score 0.0 if off-topic, ignores the question, or contradicts expectations.
    """,
)

Manual custom evaluator

Custom evaluators can be sync or async functions. Define them at module level in pixie_qa/evaluators.py so they can be referenced by name:

from pixie import Evaluation, Evaluable

def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")

Reference by filepath:callable_name in the dataset: "pixie_qa/evaluators.py:my_evaluator".
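Since evaluators may also be async, an async variant might look like the sketch below. To keep it runnable without pixie installed, `Evaluation` and `Evaluable` are replaced by minimal stand-in dataclasses that model only the fields used here; in a real project, import both from pixie instead. The evaluator name `mentions_refund_policy` and its check are hypothetical.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


# Stand-ins so the sketch runs on its own; the real classes come from pixie
# and carry more fields than shown here.
@dataclass
class Evaluation:
    score: float
    reasoning: str


@dataclass
class Evaluable:
    eval_output: Any


async def mentions_refund_policy(evaluable: Evaluable, *, trace=None) -> Evaluation:
    """Hypothetical async check: does the output mention the refund policy?"""
    await asyncio.sleep(0)  # placeholder for real async work, e.g. an API call
    found = "refund policy" in str(evaluable.eval_output).lower()
    return Evaluation(
        score=1.0 if found else 0.0,
        reasoning="mentions refund policy" if found else "refund policy not mentioned",
    )


result = asyncio.run(
    mentions_refund_policy(Evaluable(eval_output="See our Refund Policy."))
)
print(result.score)  # prints 1.0
```

The signature mirrors the sync example (`evaluable`, keyword-only `trace`), so the same filepath:callable_name reference style applies.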

Accessing eval_metadata and captured data: Custom evaluators access per-entry metadata and wrap() outputs via the Evaluable fields:

  • evaluable.eval_metadata: dict from the entry's eval_metadata field (e.g., {"expected_tool": "endCall"})
  • evaluable.eval_output: list[NamedData] containing ALL wrap(purpose="output") and wrap(purpose="state") values. Each item has .name (str) and .value (JsonValue). Use the helper below to look up by name.

from typing import Any

def _get_output(evaluable: Evaluable, name: str) -> Any:
    """Look up a wrap value by name from eval_output."""
    for item in evaluable.eval_output:
        if item.name == name:
            return item.value
    # No wrap value with this name was captured.
    return None

def call_ended_check(evaluable: Evaluable, *, trace=None) -> Evaluation:
    expected = evaluable.eval_metadata.get("expected_call_ended") if evaluable.eval_metadata else None
    actual = _get_output(evaluable, "call_ended")
    if expected is None:
        return Evaluation(score=1.0, reasoning="No expected_call_ended in eval_metadata")
    match = bool(actual) == bool(expected)
    return Evaluation(
        score=1.0 if match else 0.0,
        reasoning=f"Expected call_ended={expected}, got {actual}",
    )

3c. Produce the evaluator mapping artifact

Write the criterion-to-evaluator mapping to pixie_qa/03-evaluator-mapping.md. This artifact bridges between the eval criteria (Step 1b) and the dataset (Step 4).

CRITICAL: Use the exact evaluator names as they appear in the evaluators.md reference. Built-in evaluators use their short name (e.g., Factuality, ClosedQA), and custom evaluators use filepath:callable_name format, where callable_name is the module-level variable or function rather than the name= argument (e.g., pixie_qa/evaluators.py:concise_voice_style).
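A quick way to catch a mistyped custom reference before it reaches the dataset is to try resolving it yourself. The sketch below is an assumption about how such a check could be written with the standard library's `importlib`; it is not how pixie resolves references internally, and `resolve_reference` is a hypothetical helper. It applies only to custom filepath:callable_name references, not to built-in short names.

```python
import importlib.util
from pathlib import Path


def resolve_reference(reference, root="."):
    """Load `filepath:callable_name` and return the named module attribute."""
    file_part, sep, attr = reference.rpartition(":")
    if not sep or not file_part or not attr:
        raise ValueError(f"Expected 'filepath:callable_name', got {reference!r}")
    path = Path(root) / file_part
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    # Executes the module, so its imports must be installed;
    # raises FileNotFoundError if the file does not exist.
    spec.loader.exec_module(module)
    if not hasattr(module, attr):
        raise AttributeError(f"{path} defines no module-level name {attr!r}")
    return getattr(module, attr)
```

Run from the project root, `resolve_reference("pixie_qa/evaluators.py:concise_voice_style")` would return the evaluator object if the reference is valid and raise otherwise.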

Template

# Evaluator Mapping

## Built-in evaluators used

| Evaluator name | Criterion it covers | Applies to                 |
| -------------- | ------------------- | -------------------------- |
| Factuality     | Factual accuracy    | All items                  |
| ClosedQA       | Answer correctness  | Items with expected_output |

## Custom evaluators

| Evaluator name                             | Criterion it covers | Applies to | Source file            |
| ------------------------------------------ | ------------------- | ---------- | ---------------------- |
| pixie_qa/evaluators.py:concise_voice_style | Phone-friendly tone | All items  | pixie_qa/evaluators.py |

## Applicability summary

- **Dataset-level defaults** (apply to all items): Factuality, pixie_qa/evaluators.py:concise_voice_style
- **Item-specific** (apply to subset): ClosedQA (only items with expected_output)

Output

  • Custom evaluator implementations in pixie_qa/evaluators.py (if any custom evaluators are needed)
  • pixie_qa/03-evaluator-mapping.md — the criterion-to-evaluator mapping

Evaluator selection guide: See evaluators.md for the full evaluator catalog, selection guide (which evaluator for which output type), and create_llm_evaluator reference.

If you hit an unexpected error when implementing evaluators (import failures, API mismatch), read evaluators.md for the authoritative evaluator reference and wrap-api.md for API details before guessing at a fix.