# Common Mistakes (Python)
Patterns that LLMs frequently generate incorrectly from outdated training data.
## Legacy Model Classes
```python
# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")
# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")
```
**Why**: `OpenAIModel`, `AnthropicModel`, etc. are legacy 1.0 wrappers in `phoenix.evals.legacy`.
The `LLM` class is provider-agnostic and is the current 2.0 API.
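Because `LLM` is provider-agnostic, switching providers is a constructor-argument change rather than a class swap. A minimal sketch (the `anthropic` provider string and model name below are illustrative assumptions, not confirmed by this document):
```python
from phoenix.evals import LLM

# Same class, different provider; the Anthropic model name is an example only.
openai_llm = LLM(provider="openai", model="gpt-4o")
anthropic_llm = LLM(provider="anthropic", model="claude-sonnet-4-5")
```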
## Using run_evals Instead of evaluate_dataframe
```python
# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns list of DataFrames
# RIGHT — current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
```
**Why**: `run_evals` is the legacy 1.0 batch function. `evaluate_dataframe` is the current
2.0 function with a different return format.
## Wrong Result Column Names
```python
# WRONG — column doesn't exist
score = results_df["relevance"].mean()
# WRONG — column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()
# RIGHT — extract numeric score from dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()
```
**Why**: `evaluate_dataframe` returns columns named `{name}_score` containing Score dicts
like `{"name": "...", "score": 1.0, "label": "...", "explanation": "..."}`.
## Deprecated project_name Parameter
```python
# WRONG
df = client.spans.get_spans_dataframe(project_name="my-project")
# RIGHT
df = client.spans.get_spans_dataframe(project_identifier="my-project")
```
**Why**: `project_name` is deprecated in favor of `project_identifier`, which also
accepts project IDs.
## Wrong Client Constructor
```python
# WRONG
client = Client(endpoint="https://app.phoenix.arize.com")
client = Client(url="https://app.phoenix.arize.com")
# RIGHT — for remote/cloud Phoenix
client = Client(base_url="https://app.phoenix.arize.com", api_key="...")
# ALSO RIGHT — for local Phoenix (falls back to env vars or localhost:6006)
client = Client()
```
**Why**: The parameter is `base_url`, not `endpoint` or `url`. For local instances,
`Client()` with no args works fine. For remote instances, `base_url` and `api_key` are required.
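For the env-var fallback, a sketch; the variable names below are the conventional Phoenix ones but are an assumption here, so verify them against your Phoenix version's docs:
```python
import os

# Assumed env var names; not confirmed by this document.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
os.environ["PHOENIX_API_KEY"] = "..."

from phoenix.client import Client

client = Client()  # picks up the env vars; no constructor args needed
```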
## Too-Aggressive Time Filters
```python
# WRONG — often returns zero spans
from datetime import datetime, timedelta
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    start_time=datetime.now() - timedelta(hours=1),
)
# RIGHT — use limit to control result size instead
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=50,
)
```
**Why**: Traces may be from any time period. A 1-hour window frequently returns
nothing. Use `limit=` to control result size instead.
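Whatever filters you use, it is cheap to fail fast on an empty result before spending LLM calls on evaluation:
```python
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=50,
)
if df.empty:
    raise RuntimeError("No spans returned; check the project identifier and filters.")
```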
## Not Filtering Spans Appropriately
```python
# WRONG — fetches all spans including internal LLM calls, retrievers, etc.
df = client.spans.get_spans_dataframe(project_identifier="my-project")
# RIGHT for end-to-end evaluation — filter to top-level spans
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    root_spans_only=True,
)
# RIGHT for RAG evaluation — fetch child spans for retriever/LLM metrics
all_spans = client.spans.get_spans_dataframe(
    project_identifier="my-project",
)
retriever_spans = all_spans[all_spans["span_kind"] == "RETRIEVER"]
llm_spans = all_spans[all_spans["span_kind"] == "LLM"]
```
**Why**: For end-to-end evaluation (e.g., overall answer quality), use `root_spans_only=True`.
For RAG systems, you often need child spans separately — retriever spans for
DocumentRelevance and LLM spans for Faithfulness. Choose the right span level
for your evaluation target.
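For per-trace RAG metrics you usually need to line child spans up with their root span. A pandas sketch; the `context.trace_id` and `parent_id` column names are assumptions about the spans dataframe schema, so inspect `all_spans.columns` to confirm:
```python
# Assumed columns: "context.trace_id" and "parent_id"; verify before use.
root_spans = all_spans[all_spans["parent_id"].isna()]
merged = retriever_spans.merge(
    root_spans,
    on="context.trace_id",
    suffixes=("_retriever", "_root"),
)
```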
## Assuming Span Output is Plain Text
```python
# WRONG — output may be JSON, not plain text
df["output"] = df["attributes.output.value"]
# RIGHT — parse JSON and extract the answer field
import json
def extract_answer(output_value):
    if not isinstance(output_value, str):
        return str(output_value) if output_value is not None else ""
    try:
        parsed = json.loads(output_value)
        if isinstance(parsed, dict):
            for key in ("answer", "result", "output", "response"):
                if key in parsed:
                    return str(parsed[key])
    except (json.JSONDecodeError, TypeError):
        pass
    return output_value

df["output"] = df["attributes.output.value"].apply(extract_answer)
```
**Why**: LangChain and other frameworks often output structured JSON from root spans,
like `{"context": "...", "question": "...", "answer": "..."}`. Evaluators need
the actual answer text, not the raw JSON.
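The input side often needs the same treatment, since `attributes.input.value` may also be serialized JSON. A sketch with input-side keys (the key list is an assumption; adjust it to your framework's payloads):
```python
def extract_question(input_value):
    # Same approach as extract_answer, with assumed input-side keys.
    if not isinstance(input_value, str):
        return str(input_value) if input_value is not None else ""
    try:
        parsed = json.loads(input_value)
        if isinstance(parsed, dict):
            for key in ("question", "query", "input"):
                if key in parsed:
                    return str(parsed[key])
    except (json.JSONDecodeError, TypeError):
        pass
    return input_value

df["input"] = df["attributes.input.value"].apply(extract_question)
```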
## Using @create_evaluator for LLM-Based Evaluation
```python
# WRONG — @create_evaluator doesn't call an LLM
@create_evaluator(name="relevance", kind="llm")
def relevance(input: str, output: str) -> str:
    pass  # No LLM is involved

# RIGHT — use ClassificationEvaluator for LLM-based evaluation
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```
**Why**: `@create_evaluator` wraps a plain Python function. Setting `kind="llm"`
marks it as LLM-based, but you must implement the LLM call yourself.
For LLM-based evaluation, prefer `ClassificationEvaluator` which handles
the LLM call, structured output parsing, and explanations automatically.
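`@create_evaluator` is the right tool when the check is deterministic and needs no LLM at all. A heuristic sketch; the `kind="heuristic"` value mirrors the `kind="llm"` usage above and the numeric return type is an assumption, so check both against your phoenix-evals version:
```python
from phoenix.evals import create_evaluator

@create_evaluator(name="contains_citation", kind="heuristic")  # kind value assumed
def contains_citation(output: str) -> float:
    # Deterministic check, no LLM: reward outputs containing a bracketed citation.
    return 1.0 if ("[" in output and "]" in output) else 0.0
```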
## Using llm_classify Instead of ClassificationEvaluator
```python
# WRONG — legacy 1.0 API
from phoenix.evals import llm_classify
results = llm_classify(
    dataframe=df,
    template=template_str,
    model=model,
    rails=["relevant", "irrelevant"],
)
# RIGHT — current 2.0 API
from phoenix.evals import ClassificationEvaluator, async_evaluate_dataframe, LLM

classifier = ClassificationEvaluator(
    name="relevance",
    prompt_template=template_str,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])
```
**Why**: `llm_classify` is the legacy 1.0 function. The current pattern is to create
an evaluator with `ClassificationEvaluator` and run it with `async_evaluate_dataframe()`.
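`async_evaluate_dataframe` is a coroutine, so the bare `await` above only works where top-level `await` is allowed (e.g. a notebook). In a plain script, wrap it in an event loop, or fall back to the synchronous `evaluate_dataframe` shown earlier:
```python
import asyncio

async def main():
    return await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])

results_df = asyncio.run(main())
```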
## Using HallucinationEvaluator
```python
# WRONG — deprecated
from phoenix.evals import HallucinationEvaluator
evaluator = HallucinationEvaluator(model)
# RIGHT — use FaithfulnessEvaluator
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals import LLM
evaluator = FaithfulnessEvaluator(llm=LLM(provider="openai", model="gpt-4o"))
```
**Why**: `HallucinationEvaluator` is deprecated. `FaithfulnessEvaluator` is its replacement;
it uses "faithful"/"unfaithful" labels and scores so that higher is better (1.0 = faithful).