# Evaluators: LLM Evaluators in Python

LLM evaluators use a language model to judge outputs. Use them when the evaluation criteria are subjective (e.g. helpfulness, tone) and cannot be checked deterministically in code.
## Quick Start
```python
from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o")

HELPFULNESS_TEMPLATE = """Rate how helpful the response is.

<question>{{input}}</question>
<response>{{output}}</response>

"helpful" means directly addresses the question.
"not_helpful" means does not address the question.

Your answer (helpful/not_helpful):"""

helpfulness = ClassificationEvaluator(
    name="helpfulness",
    prompt_template=HELPFULNESS_TEMPLATE,
    llm=llm,
    choices={"not_helpful": 0, "helpful": 1},
)
```
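The `choices` mapping converts the model's textual label into a numeric score. A minimal plain-Python sketch of that mapping step (an illustration of the idea, not Phoenix's internal implementation):

```python
# Illustration only: how a choices mapping turns a classifier label into a score.
choices = {"not_helpful": 0, "helpful": 1}

def label_to_score(label: str, choices: dict) -> int:
    """Map a classifier label (e.g. the LLM's answer) to its numeric score."""
    normalized = label.strip().lower()
    if normalized not in choices:
        raise ValueError(f"Unexpected label: {label!r}")
    return choices[normalized]

print(label_to_score("helpful", choices))        # 1
print(label_to_score(" Not_Helpful ", choices))  # 0
```

Normalizing the label before the lookup makes the score robust to minor formatting drift in the model's answer.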
## Template Variables
Use XML tags to wrap variables for clarity:
| Variable | XML Tag |
|---|---|
| `{{input}}` | `<question>{{input}}</question>` |
| `{{output}}` | `<response>{{output}}</response>` |
| `{{reference}}` | `<reference>{{reference}}</reference>` |
| `{{context}}` | `<context>{{context}}</context>` |
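To make the substitution concrete, here is a small sketch of how `{{variable}}` placeholders get filled in. Phoenix does its own templating internally; this only demonstrates the mechanism:

```python
import re

def render_template(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with the matching value from variables."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(variables[m.group(1)]),
        template,
    )

prompt = render_template(
    "<question>{{input}}</question>\n<response>{{output}}</response>",
    {"input": "What is 2+2?", "output": "4"},
)
print(prompt)
```

The XML tags survive rendering untouched, which is what makes them useful delimiters for the judge model.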
## create_classifier (Factory)

A shorthand factory that returns a `ClassificationEvaluator`. Prefer direct `ClassificationEvaluator` instantiation when you need more parameters or customization:
```python
from phoenix.evals import create_classifier, LLM

relevance = create_classifier(
    name="relevance",
    prompt_template="""Is this response relevant to the question?

<question>{{input}}</question>
<response>{{output}}</response>

Answer (relevant/irrelevant):""",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```
## Input Mapping

Column names must match template variables. Rename columns or use `bind_evaluator`:
```python
# Option 1: rename columns to match template variables
df = df.rename(columns={"user_query": "input", "ai_response": "output"})

# Option 2: use bind_evaluator to map columns without renaming
from phoenix.evals import bind_evaluator

bound = bind_evaluator(
    evaluator=helpfulness,
    input_mapping={"input": "user_query", "output": "ai_response"},
)
```
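Conceptually, an input mapping just routes dataset columns to the template-variable names the evaluator expects. A plain-dict sketch of that routing (not `bind_evaluator`'s implementation):

```python
# Illustration: route a row's column names to template-variable names.
input_mapping = {"input": "user_query", "output": "ai_response"}

def map_row(row: dict, mapping: dict) -> dict:
    """Return the row keyed by template-variable names instead of column names."""
    return {template_var: row[column] for template_var, column in mapping.items()}

row = {"user_query": "What is Phoenix?", "ai_response": "An observability tool."}
print(map_row(row, input_mapping))
# {'input': 'What is Phoenix?', 'output': 'An observability tool.'}
```

Mapping keeps the original dataframe untouched, which matters when several evaluators expect different variable names from the same columns.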
## Running
```python
from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(dataframe=df, evaluators=[helpfulness])
```
## Best Practices

- **Be specific** - define exactly what each label (pass/fail) means
- **Include examples** - show concrete cases for each label
- **Explanations by default** - `ClassificationEvaluator` includes explanations automatically
- **Study built-in prompts** - see `phoenix.evals.__generated__.classification_evaluator_configs` for examples of well-structured evaluation prompts (Faithfulness, Correctness, DocumentRelevance, etc.)
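Putting the first two practices together, a sketch of a more specific template with embedded examples (a hypothetical prompt written for this guide, not one of Phoenix's built-ins):

```python
# Hypothetical template illustrating "be specific" and "include examples".
CONCISENESS_TEMPLATE = """Rate whether the response is concise.

"concise" means the response answers in 1-3 sentences with no filler.
"verbose" means the response repeats itself or adds unrequested detail.

Examples:
- Q: "What port does HTTPS use?" A: "443." -> concise
- Q: "What port does HTTPS use?" A: "Great question! Before we talk about
  ports, let's review how TCP works..." -> verbose

<question>{{input}}</question>
<response>{{output}}</response>

Your answer (concise/verbose):"""

print(CONCISENESS_TEMPLATE)
```

Each label gets a precise definition plus one concrete case, so the judge model has far less room to improvise its own criteria.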