awesome-copilot/plugins/phoenix/skills/phoenix-evals/references/common-mistakes-python.md
2026-04-01 23:04:18 +00:00

Common Mistakes (Python)

Patterns that LLMs frequently generate incorrectly from training data.

Legacy Model Classes

# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")

# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")

Why: OpenAIModel, AnthropicModel, etc. are legacy 1.0 wrappers in phoenix.evals.legacy. The LLM class is provider-agnostic and is the current 2.0 API.

Using run_evals Instead of evaluate_dataframe

# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns a list of DataFrames

# RIGHT — current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns a single DataFrame with {name}_score dict columns

Why: run_evals is the legacy 1.0 batch function and returns a list of DataFrames. evaluate_dataframe is the current 2.0 function and returns a single DataFrame whose {name}_score columns hold dicts.

Wrong Result Column Names

# WRONG — column doesn't exist
score = results_df["relevance"].mean()

# WRONG — column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()

# RIGHT — extract numeric score from dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()

Why: evaluate_dataframe returns columns named {name}_score containing Score dicts like {"name": "...", "score": 1.0, "label": "...", "explanation": "..."}.
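The extraction above can be wrapped in a small helper. This is a minimal sketch, not part of Phoenix: `mean_score` is a hypothetical name, and the cell layout is assumed to match the Score dict shown above. Unlike the lambda, it skips cells with no numeric score instead of counting them as 0.0, so failed evaluations do not drag the average down. Keep the 0.0 default if failures should count against the metric.

```python
from typing import Any, Iterable, Optional


def mean_score(cells: Iterable[Any]) -> Optional[float]:
    """Average the numeric "score" field across Score-dict cells.

    Cells that are not dicts, or whose "score" is missing or non-numeric,
    are skipped. Returns None when no cell carries a usable score.
    """
    values = []
    for cell in cells:
        if not isinstance(cell, dict):
            continue
        score = cell.get("score")
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(score, (int, float)) and not isinstance(score, bool):
            values.append(float(score))
    return sum(values) / len(values) if values else None


# Hand-written cells standing in for results_df["relevance_score"]
cells = [
    {"name": "relevance", "score": 1.0, "label": "relevant"},
    {"name": "relevance", "score": 0.0, "label": "irrelevant"},
    None,  # e.g. a row where evaluation failed
]
print(mean_score(cells))  # 0.5
```

With a real result the call would be `mean_score(results_df["relevance_score"])`, since iterating a pandas Series yields its cells.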

Deprecated project_name Parameter

# WRONG
df = client.spans.get_spans_dataframe(project_name="my-project")

# RIGHT
df = client.spans.get_spans_dataframe(project_identifier="my-project")

Why: project_name is deprecated in favor of project_identifier, which also accepts project IDs.

Wrong Client Constructor

from phoenix.client import Client

# WRONG
client = Client(endpoint="https://app.phoenix.arize.com")
client = Client(url="https://app.phoenix.arize.com")

# RIGHT — for remote/cloud Phoenix
client = Client(base_url="https://app.phoenix.arize.com", api_key="...")

# ALSO RIGHT — for local Phoenix (falls back to env vars or localhost:6006)
client = Client()

Why: The parameter is base_url, not endpoint or url. For local instances, Client() with no arguments works: it falls back to environment variables or localhost:6006. For remote instances, supply both base_url and api_key.

Too-Aggressive Time Filters

# WRONG — often returns zero spans
from datetime import datetime, timedelta
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    start_time=datetime.now() - timedelta(hours=1),
)

# RIGHT — use limit to control result size instead
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=50,
)

Why: The traces in a project may have been recorded at any time, so a 1-hour window frequently returns nothing. Use limit= to control result size instead.
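If a time bound is genuinely needed, a wide window is less likely to come back empty than a one-hour one. The sketch below builds a timezone-aware lower bound with the standard library. `lookback_start` is a hypothetical helper and the 30-day default is an arbitrary illustration. How Phoenix treats naive datetimes is not covered here, so the sketch uses an aware UTC value to sidestep the question.

```python
from datetime import datetime, timedelta, timezone


def lookback_start(days: int = 30) -> datetime:
    """Return a timezone-aware UTC datetime `days` days before now."""
    if days <= 0:
        raise ValueError("days must be positive")
    return datetime.now(timezone.utc) - timedelta(days=days)


start = lookback_start(days=30)
print(start.isoformat())

# Assumed usage, combining the bound with limit= as recommended above:
# df = client.spans.get_spans_dataframe(
#     project_identifier="my-project",
#     start_time=start,
#     limit=50,
# )
```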

Not Filtering Spans Appropriately

# WRONG for end-to-end evaluation: fetches every span, including internal LLM calls, retrievers, etc.
df = client.spans.get_spans_dataframe(project_identifier="my-project")

# RIGHT for end-to-end evaluation — filter to top-level spans
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    root_spans_only=True,
)

# RIGHT for RAG evaluation — fetch child spans for retriever/LLM metrics
all_spans = client.spans.get_spans_dataframe(
    project_identifier="my-project",
)
retriever_spans = all_spans[all_spans["span_kind"] == "RETRIEVER"]
llm_spans = all_spans[all_spans["span_kind"] == "LLM"]

Why: For end-to-end evaluation (e.g., overall answer quality), use root_spans_only=True. For RAG systems, you often need child spans separately — retriever spans for DocumentRelevance and LLM spans for Faithfulness. Choose the right span level for your evaluation target.

Assuming Span Output is Plain Text

# WRONG — output may be JSON, not plain text
df["output"] = df["attributes.output.value"]

# RIGHT — parse JSON and extract the answer field
import json

def extract_answer(output_value):
    if not isinstance(output_value, str):
        return str(output_value) if output_value is not None else ""
    try:
        parsed = json.loads(output_value)
        if isinstance(parsed, dict):
            for key in ("answer", "result", "output", "response"):
                if key in parsed:
                    return str(parsed[key])
    except (json.JSONDecodeError, TypeError):
        pass
    return output_value

df["output"] = df["attributes.output.value"].apply(extract_answer)

Why: Root spans from LangChain and other frameworks often carry structured JSON output, such as {"context": "...", "question": "...", "answer": "..."}. Evaluators need the actual answer text, not the raw JSON.
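The same parsing applies to the input side: in the JSON shape above, the question sits next to the context rather than in a plain string. A minimal sketch that generalizes the idea to any list of candidate keys follows. `extract_field` is a hypothetical helper, and the `attributes.input.value` column mentioned afterwards is assumed by analogy with the output column, so check the actual DataFrame columns before relying on it.

```python
import json
from typing import Any, Sequence


def extract_field(value: Any, keys: Sequence[str]) -> str:
    """Return the first matching key from a JSON object string, else the raw text."""
    if value is None:
        return ""
    if not isinstance(value, str):
        return str(value)
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return value  # not JSON: treat as plain text
    if isinstance(parsed, dict):
        for key in keys:
            if key in parsed:
                return str(parsed[key])
    return value  # valid JSON, but none of the keys matched


raw = '{"context": "Paris is in France.", "question": "Where is Paris?", "answer": "France"}'
print(extract_field(raw, ("question", "query", "input")))              # Where is Paris?
print(extract_field(raw, ("answer", "result", "output", "response")))  # France
print(extract_field("plain text", ("answer",)))                        # plain text
```

Applied to a DataFrame, this might look like `df["input"] = df["attributes.input.value"].apply(lambda v: extract_field(v, ("question", "query", "input")))`.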

Using @create_evaluator for LLM-Based Evaluation

# WRONG — @create_evaluator doesn't call an LLM
@create_evaluator(name="relevance", kind="llm")
def relevance(input: str, output: str) -> str:
    pass  # No LLM is involved

# RIGHT — use ClassificationEvaluator for LLM-based evaluation
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)

Why: @create_evaluator wraps a plain Python function. Setting kind="llm" only marks the evaluator as LLM-based; you must still implement the LLM call yourself. For LLM-based evaluation, prefer ClassificationEvaluator, which handles the LLM call, structured output parsing, and explanations automatically.
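By contrast, a deterministic check is the kind of plain Python function that @create_evaluator wraps. The sketch below keeps the decorator commented out so it runs without Phoenix installed. `exact_match` and its normalization rules are illustrative assumptions, not a built-in Phoenix metric.

```python
def exact_match(output: str, expected: str) -> bool:
    """Compare two strings after trimming whitespace and lowercasing."""
    return output.strip().lower() == expected.strip().lower()


# With Phoenix installed, the function would be wrapped roughly like this
# (arguments beyond name= are omitted; check the installed version):
#
# @create_evaluator(name="exact_match")
# def exact_match(output: str, expected: str) -> bool: ...

print(exact_match("  Paris ", "paris"))  # True
print(exact_match("Paris", "London"))    # False
```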

Using llm_classify Instead of ClassificationEvaluator

# WRONG — legacy 1.0 API
from phoenix.evals import llm_classify
results = llm_classify(
    dataframe=df,
    template=template_str,
    model=model,
    rails=["relevant", "irrelevant"],
)

# RIGHT — current 2.0 API
from phoenix.evals import ClassificationEvaluator, async_evaluate_dataframe, LLM

classifier = ClassificationEvaluator(
    name="relevance",
    prompt_template=template_str,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])

Why: llm_classify is the legacy 1.0 function. The current pattern is to create an evaluator with ClassificationEvaluator and run it with async_evaluate_dataframe(), or with the synchronous evaluate_dataframe() shown earlier.
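The bare `await` in the example works in a notebook, where an event loop is already running, but is a syntax error at the top level of a regular script. A minimal sketch of the script form follows. `fake_async_evaluate` is a hypothetical stand-in so the block runs without Phoenix or an API key; in real code the awaited call would be async_evaluate_dataframe(dataframe=df, evaluators=[classifier]).

```python
import asyncio
from typing import Dict, List


async def fake_async_evaluate(rows: List[str]) -> List[Dict[str, float]]:
    """Hypothetical stand-in for an async evaluation call."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return [{"score": 1.0} for _ in rows]


async def main() -> List[Dict[str, float]]:
    # In real code: await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])
    return await fake_async_evaluate(["row-1", "row-2"])


# asyncio.run() creates an event loop, runs main() to completion, and closes the loop.
results = asyncio.run(main())
print(len(results))  # 2
```

Inside a notebook, asyncio.run() raises RuntimeError because a loop is already running, so use the bare `await` form there.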

Using HallucinationEvaluator

# WRONG — deprecated
from phoenix.evals import HallucinationEvaluator
evaluator = HallucinationEvaluator(model)

# RIGHT — use FaithfulnessEvaluator
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals import LLM
# Named evaluator rather than eval to avoid shadowing Python's built-in eval()
evaluator = FaithfulnessEvaluator(llm=LLM(provider="openai", model="gpt-4o"))

Why: HallucinationEvaluator is deprecated. FaithfulnessEvaluator is its replacement. It uses "faithful"/"unfaithful" labels, and its score is maximized: higher is better, with 1.0 meaning faithful.