Common Mistakes (Python)
Outdated or incorrect patterns that LLMs frequently generate from their training data.
Legacy Model Classes
# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")
# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")
Why: OpenAIModel, AnthropicModel, etc. are legacy 1.0 wrappers in phoenix.evals.legacy.
The LLM class is provider-agnostic and is the current 2.0 API.
Using run_evals Instead of evaluate_dataframe
# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns list of DataFrames
# RIGHT — current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
Why: run_evals is the legacy 1.0 batch function. evaluate_dataframe is the current
2.0 function with a different return format.
Wrong Result Column Names
# WRONG — column doesn't exist
score = results_df["relevance"].mean()
# WRONG — column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()
# RIGHT — extract numeric score from dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()
Why: evaluate_dataframe returns columns named {name}_score containing Score dicts
like {"name": "...", "score": 1.0, "label": "...", "explanation": "..."}.
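The extraction above maps missing or malformed cells to 0.0, which pulls the mean down whenever an evaluator fails on a row. Below is a minimal sketch of an alternative that skips such rows instead. It is written against a plain Python list so it runs without pandas; `mean_score` is a hypothetical helper, not part of phoenix.evals, and the sample dicts only mimic the Score shape described above.

```python
from statistics import mean
from typing import Any, Iterable, Optional


def mean_score(cells: Iterable[Any]) -> Optional[float]:
    """Average the numeric "score" field of Score-like dicts.

    Cells that are not dicts, or whose "score" is missing or None, are
    skipped rather than counted as 0.0. Returns None if nothing is usable.
    """
    values = [
        float(cell["score"])
        for cell in cells
        if isinstance(cell, dict) and cell.get("score") is not None
    ]
    return mean(values) if values else None


# Hypothetical column contents: two scored rows and one failed row.
column = [
    {"name": "relevance", "score": 1.0, "label": "relevant", "explanation": "..."},
    {"name": "relevance", "score": 0.0, "label": "irrelevant", "explanation": "..."},
    None,
]
print(mean_score(column))
```

Iterating a pandas Series yields its values, so the same helper should accept `results_df["relevance_score"]` directly. Whether to skip failed rows or count them as 0.0 depends on whether an evaluator failure should penalize the aggregate.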
Deprecated project_name Parameter
# WRONG
df = client.spans.get_spans_dataframe(project_name="my-project")
# RIGHT
df = client.spans.get_spans_dataframe(project_identifier="my-project")
Why: project_name is deprecated in favor of project_identifier, which also
accepts project IDs.
Wrong Client Constructor
# WRONG
client = Client(endpoint="https://app.phoenix.arize.com")
client = Client(url="https://app.phoenix.arize.com")
# RIGHT — for remote/cloud Phoenix
client = Client(base_url="https://app.phoenix.arize.com", api_key="...")
# ALSO RIGHT — for local Phoenix (falls back to env vars or localhost:6006)
client = Client()
Why: The parameter is base_url, not endpoint or url. For local instances,
Client() with no args works fine. For remote instances, base_url and api_key are required.
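The local/remote rule above can be enforced before the client is constructed. The sketch below is one way to do that: `client_kwargs` and `LOCAL_HOSTS` are hypothetical names, not part of the Phoenix client, and the check assumes that every non-local host needs an API key, as stated above.

```python
from typing import Dict, Optional
from urllib.parse import urlparse

# Hosts treated as a local Phoenix instance (an assumption of this sketch).
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def client_kwargs(
    base_url: Optional[str] = None, api_key: Optional[str] = None
) -> Dict[str, str]:
    """Build keyword arguments for Client(), failing fast on a remote URL with no key."""
    if base_url is None:
        # Local default: Client() falls back to env vars or localhost:6006.
        return {}
    host = urlparse(base_url).hostname
    if host not in LOCAL_HOSTS and not api_key:
        raise ValueError(f"api_key is required for remote Phoenix at {base_url}")
    kwargs = {"base_url": base_url}
    if api_key:
        kwargs["api_key"] = api_key
    return kwargs
```

The result would then be passed as `Client(**client_kwargs(...))`, so only the `base_url` and `api_key` keywords ever reach the constructor. A URL without a scheme has no parseable hostname and is treated as remote here.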
Too-Aggressive Time Filters
# WRONG — often returns zero spans
from datetime import datetime, timedelta
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    start_time=datetime.now() - timedelta(hours=1),
)
# RIGHT — use limit to control result size instead
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=50,
)
Why: Traces may be from any time period. A 1-hour window frequently returns
nothing. Use limit= to control result size instead.
Not Filtering Spans Appropriately
# WRONG — fetches all spans including internal LLM calls, retrievers, etc.
df = client.spans.get_spans_dataframe(project_identifier="my-project")
# RIGHT for end-to-end evaluation — filter to top-level spans
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    root_spans_only=True,
)
# RIGHT for RAG evaluation — fetch child spans for retriever/LLM metrics
all_spans = client.spans.get_spans_dataframe(
    project_identifier="my-project",
)
retriever_spans = all_spans[all_spans["span_kind"] == "RETRIEVER"]
llm_spans = all_spans[all_spans["span_kind"] == "LLM"]
Why: For end-to-end evaluation (e.g., overall answer quality), use root_spans_only=True.
For RAG systems, you often need child spans separately — retriever spans for
DocumentRelevance and LLM spans for Faithfulness. Choose the right span level
for your evaluation target.
Assuming Span Output is Plain Text
# WRONG — output may be JSON, not plain text
df["output"] = df["attributes.output.value"]
# RIGHT — parse JSON and extract the answer field
import json
def extract_answer(output_value):
    # Non-string values (None, dicts, numbers) are coerced to a string.
    if not isinstance(output_value, str):
        return str(output_value) if output_value is not None else ""
    try:
        parsed = json.loads(output_value)
        if isinstance(parsed, dict):
            # Return the first answer-like field that is present.
            for key in ("answer", "result", "output", "response"):
                if key in parsed:
                    return str(parsed[key])
    except (json.JSONDecodeError, TypeError):
        pass
    # Not JSON, or no known key: keep the original text.
    return output_value
df["output"] = df["attributes.output.value"].apply(extract_answer)
Why: LangChain and other frameworks often output structured JSON from root spans,
like {"context": "...", "question": "...", "answer": "..."}. Evaluators need
the actual answer text, not the raw JSON.
Using @create_evaluator for LLM-Based Evaluation
# WRONG — @create_evaluator doesn't call an LLM
@create_evaluator(name="relevance", kind="llm")
def relevance(input: str, output: str) -> str:
    pass  # No LLM is involved
# RIGHT — use ClassificationEvaluator for LLM-based evaluation
from phoenix.evals import ClassificationEvaluator, LLM
relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
Why: @create_evaluator wraps a plain Python function. Setting kind="llm"
marks the evaluator as LLM-based, but you must still implement the LLM call yourself.
For LLM-based evaluation, prefer ClassificationEvaluator, which handles
the LLM call, structured output parsing, and explanations automatically.
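By contrast, @create_evaluator is a natural fit for deterministic checks that need no model. Below is a minimal sketch of such a function. `exact_match` and its parameter names are hypothetical, and the decorator appears only in a comment so the block runs without phoenix installed. The `name=` argument is the only one taken from the text above; any other arguments should be checked against the installed version.

```python
def exact_match(output: str, expected: str) -> bool:
    """Deterministic check: equality ignoring case and surrounding whitespace."""
    return output.strip().casefold() == expected.strip().casefold()


# With phoenix installed, the same function could presumably be registered as:
#
#     from phoenix.evals import create_evaluator
#
#     @create_evaluator(name="exact_match")
#     def exact_match(output: str, expected: str) -> bool: ...

print(exact_match("  Paris ", "paris"))
```

Because the function body is ordinary Python, it can be unit-tested on its own before it is wrapped as an evaluator.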
Using llm_classify Instead of ClassificationEvaluator
# WRONG — legacy 1.0 API
from phoenix.evals import llm_classify
results = llm_classify(
    dataframe=df,
    template=template_str,
    model=model,
    rails=["relevant", "irrelevant"],
)
# RIGHT — current 2.0 API
from phoenix.evals import ClassificationEvaluator, async_evaluate_dataframe, LLM
classifier = ClassificationEvaluator(
    name="relevance",
    prompt_template=template_str,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])
Why: llm_classify is the legacy 1.0 function. The current pattern is to create
an evaluator with ClassificationEvaluator and run it with async_evaluate_dataframe().
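A bare `await` is only valid inside an async function or in an environment with top-level await, such as a Jupyter notebook. In a plain script, the call needs to be wrapped in a coroutine and driven with `asyncio.run`. The sketch below shows that structure. `fake_async_evaluate` is a hypothetical stand-in for async_evaluate_dataframe so the example runs offline, and it uses a list of dicts in place of a DataFrame.

```python
import asyncio
from typing import Any, Dict, List


async def fake_async_evaluate(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for async_evaluate_dataframe (no network, no phoenix)."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return [{**row, "relevance_score": {"score": 1.0}} for row in rows]


async def main() -> List[Dict[str, Any]]:
    # In real code this line would be:
    #     return await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])
    return await fake_async_evaluate([{"input": "q", "output": "a"}])


# asyncio.run creates an event loop, runs main() to completion, and closes the loop.
results = asyncio.run(main())
print(results[0]["relevance_score"]["score"])
```

Inside a notebook, where an event loop is already running, `asyncio.run` raises an error, and the direct `await` form shown earlier is the one to use.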
Using HallucinationEvaluator
# WRONG — deprecated
from phoenix.evals import HallucinationEvaluator
eval = HallucinationEvaluator(model)
# RIGHT — use FaithfulnessEvaluator
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals import LLM
faithfulness = FaithfulnessEvaluator(llm=LLM(provider="openai", model="gpt-4o"))
Why: HallucinationEvaluator is deprecated. FaithfulnessEvaluator is its replacement;
it uses "faithful"/"unfaithful" labels, and higher scores are better (1.0 = faithful).