# Common Mistakes (Python)

Patterns that LLMs frequently generate incorrectly from training data.

## Legacy Model Classes

```python
# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")

# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")
```

**Why**: `OpenAIModel`, `AnthropicModel`, etc. are legacy 1.0 wrappers in `phoenix.evals.legacy`. The `LLM` class is provider-agnostic and is the current 2.0 API.

## Using run_evals Instead of evaluate_dataframe

```python
# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns list of DataFrames

# RIGHT — current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
```

**Why**: `run_evals` is the legacy 1.0 batch function. `evaluate_dataframe` is the current 2.0 function with a different return format.

## Wrong Result Column Names

```python
# WRONG — column doesn't exist
score = results_df["relevance"].mean()

# WRONG — column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()

# RIGHT — extract numeric score from dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()
```

**Why**: `evaluate_dataframe` returns columns named `{name}_score` containing Score dicts like `{"name": "...", "score": 1.0, "label": "...", "explanation": "..."}`.

## Deprecated project_name Parameter

```python
# WRONG
df = client.spans.get_spans_dataframe(project_name="my-project")

# RIGHT
df = client.spans.get_spans_dataframe(project_identifier="my-project")
```

**Why**: `project_name` is deprecated in favor of `project_identifier`, which also accepts project IDs.

## Wrong Client Constructor

```python
# WRONG
client = Client(endpoint="https://app.phoenix.arize.com")
client = Client(url="https://app.phoenix.arize.com")

# RIGHT — for remote/cloud Phoenix
client = Client(base_url="https://app.phoenix.arize.com", api_key="...")

# ALSO RIGHT — for local Phoenix (falls back to env vars or localhost:6006)
client = Client()
```

**Why**: The parameter is `base_url`, not `endpoint` or `url`. For local instances, `Client()` with no args works fine. For remote instances, `base_url` and `api_key` are required.

## Too-Aggressive Time Filters

```python
# WRONG — often returns zero spans
from datetime import datetime, timedelta
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    start_time=datetime.now() - timedelta(hours=1),
)

# RIGHT — use limit to control result size instead
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=50,
)
```

**Why**: Traces may be from any time period. A 1-hour window frequently returns nothing. Use `limit=` to control result size instead.
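If a time filter is genuinely needed (for example, to exclude stale traces), a wide window combined with `limit` is less likely to come back empty than a narrow one. A minimal sketch, reusing the `client` and project from the examples above; the 30-day window is an arbitrary illustration, not a recommended default:

```python
# Sketch: a wide time window plus a limit, rather than a narrow window alone.
# Reuses the `client` and "my-project" from the examples above; the 30-day
# window is an arbitrary choice for illustration.
from datetime import datetime, timedelta

df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    start_time=datetime.now() - timedelta(days=30),
    limit=50,
)
```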
## Not Filtering Spans Appropriately

```python
# WRONG — fetches all spans including internal LLM calls, retrievers, etc.
df = client.spans.get_spans_dataframe(project_identifier="my-project")

# RIGHT for end-to-end evaluation — filter to top-level spans
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    root_spans_only=True,
)

# RIGHT for RAG evaluation — fetch child spans for retriever/LLM metrics
all_spans = client.spans.get_spans_dataframe(
    project_identifier="my-project",
)
retriever_spans = all_spans[all_spans["span_kind"] == "RETRIEVER"]
llm_spans = all_spans[all_spans["span_kind"] == "LLM"]
```

**Why**: For end-to-end evaluation (e.g., overall answer quality), use `root_spans_only=True`. For RAG systems, you often need child spans separately — retriever spans for DocumentRelevance and LLM spans for Faithfulness. Choose the right span level for your evaluation target.

## Assuming Span Output is Plain Text

```python
# WRONG — output may be JSON, not plain text
df["output"] = df["attributes.output.value"]

# RIGHT — parse JSON and extract the answer field
import json

def extract_answer(output_value):
    if not isinstance(output_value, str):
        return str(output_value) if output_value is not None else ""
    try:
        parsed = json.loads(output_value)
        if isinstance(parsed, dict):
            for key in ("answer", "result", "output", "response"):
                if key in parsed:
                    return str(parsed[key])
    except (json.JSONDecodeError, TypeError):
        pass
    return output_value

df["output"] = df["attributes.output.value"].apply(extract_answer)
```

**Why**: LangChain and other frameworks often output structured JSON from root spans, like `{"context": "...", "question": "...", "answer": "..."}`. Evaluators need the actual answer text, not the raw JSON.

## Using @create_evaluator for LLM-Based Evaluation

```python
# WRONG — @create_evaluator doesn't call an LLM
@create_evaluator(name="relevance", kind="llm")
def relevance(input: str, output: str) -> str:
    pass  # No LLM is involved

# RIGHT — use ClassificationEvaluator for LLM-based evaluation
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```

**Why**: `@create_evaluator` wraps a plain Python function. Setting `kind="llm"` marks it as LLM-based, but you must implement the LLM call yourself. For LLM-based evaluation, prefer `ClassificationEvaluator`, which handles the LLM call, structured output parsing, and explanations automatically.

## Using llm_classify Instead of ClassificationEvaluator

```python
# WRONG — legacy 1.0 API
from phoenix.evals import llm_classify
results = llm_classify(
    dataframe=df,
    template=template_str,
    model=model,
    rails=["relevant", "irrelevant"],
)

# RIGHT — current 2.0 API
from phoenix.evals import ClassificationEvaluator, async_evaluate_dataframe, LLM

classifier = ClassificationEvaluator(
    name="relevance",
    prompt_template=template_str,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])
```

**Why**: `llm_classify` is the legacy 1.0 function. The current pattern is to create an evaluator with `ClassificationEvaluator` and run it with `async_evaluate_dataframe()`.
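`async_evaluate_dataframe` is a coroutine function, so the bare `await` above only works where an event loop is already running (e.g., a notebook). In a plain script, one option is to wrap the call with `asyncio.run`; a minimal sketch, assuming `df` and `classifier` are defined as in the example above:

```python
# Sketch: running the async 2.0 API from a regular script instead of a notebook.
# Assumes `df` and `classifier` are defined as in the example above.
import asyncio

from phoenix.evals import async_evaluate_dataframe

async def main():
    return await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])

results_df = asyncio.run(main())
```

If concurrency isn't needed, the synchronous `evaluate_dataframe` shown earlier accepts the same `evaluators` list.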
## Using HallucinationEvaluator

```python
# WRONG — deprecated
from phoenix.evals import HallucinationEvaluator
evaluator = HallucinationEvaluator(model)

# RIGHT — use FaithfulnessEvaluator
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals import LLM
evaluator = FaithfulnessEvaluator(llm=LLM(provider="openai", model="gpt-4o"))
```

**Why**: `HallucinationEvaluator` is deprecated. `FaithfulnessEvaluator` is its replacement; it uses "faithful"/"unfaithful" labels, and a higher score is better (1.0 = faithful).
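The replacement evaluator runs through the same batch API covered earlier, and its results follow the `{name}_score` convention from the "Wrong Result Column Names" section. A minimal sketch; the dataframe column names (`input`, `output`, `context`) and the evaluator name (`faithfulness`, which determines the result column) are assumptions to verify against the evaluator's documentation:

```python
# Sketch: wiring FaithfulnessEvaluator into the batch API from earlier sections.
# The dataframe columns ("input", "output", "context") and the evaluator name
# ("faithfulness") are assumptions; check the evaluator's docs for exact fields.
import pandas as pd
from phoenix.evals import LLM, evaluate_dataframe
from phoenix.evals.metrics import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator(llm=LLM(provider="openai", model="gpt-4o"))

df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "output": ["Paris is the capital of France."],
        "context": ["France's capital city is Paris."],
    }
)

results_df = evaluate_dataframe(dataframe=df, evaluators=[evaluator])

# Scores come back as dicts in a {name}_score column (see "Wrong Result Column Names").
faithfulness_scores = results_df["faithfulness_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
print(faithfulness_scores.mean())
```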