mirror of https://github.com/github/awesome-copilot.git synced 2026-04-11 10:45:56 +00:00


Jim Bennett d79183139a Add Arize and Phoenix LLM observability skills (#1204)

* Add 9 Arize LLM observability skills

Add skills for Arize AI platform covering trace export, instrumentation,
datasets, experiments, evaluators, AI provider integrations, annotations,
prompt optimization, and deep linking to the Arize UI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add 3 Phoenix AI observability skills

Add skills for Phoenix (Arize open-source) covering CLI debugging,
LLM evaluation workflows, and OpenInference tracing/instrumentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Ignoring intentional bad spelling

* Fix CI: remove .DS_Store from generated skills README and add codespell ignore

Remove .DS_Store artifact from winmd-api-search asset listing in generated
README.skills.md so it matches the CI Linux build output. Add queston to
codespell ignore list (intentional misspelling example in arize-dataset skill).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add arize-ax and phoenix plugins

Bundle the 9 Arize skills into an arize-ax plugin and the 3 Phoenix
skills into a phoenix plugin for easier installation as single packages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix skill folder structures to match source repos

Move arize supporting files from references/ to root level and rename
phoenix references/ to rules/ to exactly match the original source
repository folder structures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fixing file locations

* Fixing readme

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-02 09:58:55 +11:00

7.0 KiB


Common Mistakes (Python)

Patterns that LLMs frequently generate incorrectly from training data.

Legacy Model Classes

# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")

# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")

Why: OpenAIModel, AnthropicModel, etc. are legacy 1.0 wrappers in phoenix.evals.legacy. The LLM class is provider-agnostic and is the current 2.0 API.

Using run_evals Instead of evaluate_dataframe

# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns list of DataFrames

# RIGHT — current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns

Why: run_evals is the legacy 1.0 batch function and returns a list of DataFrames. evaluate_dataframe is the current 2.0 function and returns a single DataFrame with {name}_score columns.

Wrong Result Column Names

# WRONG — column doesn't exist
score = results_df["relevance"].mean()

# WRONG — column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()

# RIGHT — extract numeric score from dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()

Why: evaluate_dataframe returns columns named {name}_score containing Score dicts like {"name": "...", "score": 1.0, "label": "...", "explanation": "..."}.
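
Because each {name}_score cell is a dict, it is often convenient to unpack the score, label, and explanation together. The sketch below is one way to do that using only the standard library. `unpack_score` is a hypothetical helper, not part of Phoenix, and the sample cells are made up to match the dict shape described above. The same function could be passed to `Series.apply`.

```python
from statistics import mean
from typing import Any, Optional, Tuple


def unpack_score(cell: Any) -> Tuple[Optional[float], Optional[str], Optional[str]]:
    """Return (score, label, explanation) from a Score dict, or Nones if the cell is not a dict."""
    if not isinstance(cell, dict):
        return None, None, None
    return cell.get("score"), cell.get("label"), cell.get("explanation")


# Made-up cells shaped like the values in a `relevance_score` column.
cells = [
    {"name": "relevance", "score": 1.0, "label": "relevant", "explanation": "On topic."},
    {"name": "relevance", "score": 0.0, "label": "irrelevant", "explanation": "Off topic."},
    None,  # e.g. a row where evaluation produced no result
]

unpacked = [unpack_score(c) for c in cells]

# Average only the rows that actually carry a numeric score.
valid_scores = [score for score, _, _ in unpacked if score is not None]
average = mean(valid_scores) if valid_scores else None
print(average)
```

Unlike the 0.0 default used above, this version skips rows without a score, so missing results do not pull the average down. Which behavior is appropriate depends on the analysis.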

Deprecated project_name Parameter

# WRONG
df = client.spans.get_spans_dataframe(project_name="my-project")

# RIGHT
df = client.spans.get_spans_dataframe(project_identifier="my-project")

Why: project_name is deprecated in favor of project_identifier, which also accepts project IDs.

Wrong Client Constructor

from phoenix.client import Client

# WRONG
client = Client(endpoint="https://app.phoenix.arize.com")
client = Client(url="https://app.phoenix.arize.com")

# RIGHT — for remote/cloud Phoenix
client = Client(base_url="https://app.phoenix.arize.com", api_key="...")

# ALSO RIGHT — for local Phoenix (falls back to env vars or localhost:6006)
client = Client()

Why: The parameter is base_url, not endpoint or url. For local instances, Client() with no args works fine. For remote instances, base_url and api_key are required.

Too-Aggressive Time Filters

# WRONG — often returns zero spans
from datetime import datetime, timedelta
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    start_time=datetime.now() - timedelta(hours=1),
)

# RIGHT — use limit to control result size instead
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=50,
)

Why: The traces in a project may have been recorded at any time, so a 1-hour window frequently returns nothing. Use limit= to control result size instead.
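
To see why a narrow window tends to come back empty, it can help to fetch with limit first and then check when the returned spans actually started. The sketch below uses only the standard library. `count_recent` and the sample timestamps are hypothetical stand-ins for a start-time column, not Phoenix API.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def count_recent(start_times: Iterable[datetime], now: datetime, window: timedelta) -> int:
    """Count how many spans started within `window` before `now`."""
    cutoff = now - window
    return sum(1 for t in start_times if t >= cutoff)


# Made-up span start times: every span is at least two days old.
now = datetime(2026, 4, 2, 12, 0, tzinfo=timezone.utc)
start_times = [now - timedelta(days=d) for d in (2, 3, 7)]

in_last_hour = count_recent(start_times, now, timedelta(hours=1))
in_last_month = count_recent(start_times, now, timedelta(days=30))
print(in_last_hour, in_last_month)
```

With this sample data the 1-hour window matches no spans while the 30-day window matches all three, which is the situation the limit-based query avoids.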

Not Filtering Spans Appropriately

# WRONG — fetches all spans including internal LLM calls, retrievers, etc.
df = client.spans.get_spans_dataframe(project_identifier="my-project")

# RIGHT for end-to-end evaluation — filter to top-level spans
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    root_spans_only=True,
)

# RIGHT for RAG evaluation — fetch child spans for retriever/LLM metrics
all_spans = client.spans.get_spans_dataframe(
    project_identifier="my-project",
)
retriever_spans = all_spans[all_spans["span_kind"] == "RETRIEVER"]
llm_spans = all_spans[all_spans["span_kind"] == "LLM"]

Why: For end-to-end evaluation (e.g., overall answer quality), use root_spans_only=True. For RAG systems, you often need child spans separately — retriever spans for DocumentRelevance and LLM spans for Faithfulness. Choose the right span level for your evaluation target.
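
The span-level choice can be sketched without a running Phoenix instance. The rows below are made-up records. The `parent_id` field (empty for root spans) and the `CHAIN` kind are assumptions for illustration, while `span_kind` follows the column used above. `split_spans` is a hypothetical helper.

```python
from typing import Any, Dict, List

Span = Dict[str, Any]


def split_spans(spans: List[Span]) -> Dict[str, List[Span]]:
    """Group spans by evaluation target: root spans, retriever spans, and LLM spans."""
    return {
        "root": [s for s in spans if s.get("parent_id") is None],
        "retriever": [s for s in spans if s.get("span_kind") == "RETRIEVER"],
        "llm": [s for s in spans if s.get("span_kind") == "LLM"],
    }


# One made-up trace: a root span with a retriever child and an LLM child.
spans = [
    {"span_id": "a", "parent_id": None, "span_kind": "CHAIN"},
    {"span_id": "b", "parent_id": "a", "span_kind": "RETRIEVER"},
    {"span_id": "c", "parent_id": "a", "span_kind": "LLM"},
]

groups = split_spans(spans)
print({name: [s["span_id"] for s in rows] for name, rows in groups.items()})
```

Each group then feeds a different evaluator: root spans for overall answer quality, retriever spans for document relevance, and LLM spans for faithfulness.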

Assuming Span Output Is Plain Text

# WRONG — output may be JSON, not plain text
df["output"] = df["attributes.output.value"]

# RIGHT — parse JSON and extract the answer field
import json

def extract_answer(output_value):
    if not isinstance(output_value, str):
        return str(output_value) if output_value is not None else ""
    try:
        parsed = json.loads(output_value)
        if isinstance(parsed, dict):
            for key in ("answer", "result", "output", "response"):
                if key in parsed:
                    return str(parsed[key])
    except (json.JSONDecodeError, TypeError):
        pass
    return output_value

df["output"] = df["attributes.output.value"].apply(extract_answer)

Why: LangChain and other frameworks often output structured JSON from root spans, like {"context": "...", "question": "...", "answer": "..."}. Evaluators need the actual answer text, not the raw JSON.
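
A quick check of how extract_answer behaves on the common cases may be useful. The function is repeated here so the snippet runs on its own, and the sample values are made up.

```python
import json


def extract_answer(output_value):
    """Return the answer text from a span output that may be JSON or plain text."""
    if not isinstance(output_value, str):
        return str(output_value) if output_value is not None else ""
    try:
        parsed = json.loads(output_value)
        if isinstance(parsed, dict):
            for key in ("answer", "result", "output", "response"):
                if key in parsed:
                    return str(parsed[key])
    except (json.JSONDecodeError, TypeError):
        pass
    return output_value


# Made-up span outputs covering the main shapes.
samples = {
    "structured": '{"context": "...", "question": "...", "answer": "Paris"}',
    "plain": "Paris",
    "no_known_key": '{"foo": "bar"}',
    "missing": None,
}

results = {name: extract_answer(value) for name, value in samples.items()}
print(results)
```

JSON without any of the expected keys falls through unchanged, so it is worth spot-checking a few rows and extending the key list for the framework in use.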

Using @create_evaluator for LLM-Based Evaluation

# WRONG — @create_evaluator doesn't call an LLM
from phoenix.evals import create_evaluator

@create_evaluator(name="relevance", kind="llm")
def relevance(input: str, output: str) -> str:
    pass  # No LLM is involved

# RIGHT — use ClassificationEvaluator for LLM-based evaluation
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)

Why: @create_evaluator wraps a plain Python function. Setting kind="llm" marks it as LLM-based, but you must implement the LLM call yourself. For LLM-based evaluation, prefer ClassificationEvaluator, which handles the LLM call, structured output parsing, and explanations automatically.

Using llm_classify Instead of ClassificationEvaluator

# WRONG — legacy 1.0 API
from phoenix.evals import llm_classify
results = llm_classify(
    dataframe=df,
    template=template_str,
    model=model,
    rails=["relevant", "irrelevant"],
)

# RIGHT — current 2.0 API
from phoenix.evals import ClassificationEvaluator, async_evaluate_dataframe, LLM

classifier = ClassificationEvaluator(
    name="relevance",
    prompt_template=template_str,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
# Top-level await works in notebooks; in a script, call this inside an async function
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])

Why: llm_classify is the legacy 1.0 function. The current pattern is to create an evaluator with ClassificationEvaluator and run it with async_evaluate_dataframe().
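
The bare await above only works where an event loop is already running, such as a notebook. In a plain script, the call needs to sit inside a coroutine driven by `asyncio.run`. The sketch below shows that shape. `fake_async_evaluate` is a hypothetical stand-in for async_evaluate_dataframe so the snippet runs without Phoenix or an API key.

```python
import asyncio
from typing import Any, Dict, List


async def fake_async_evaluate(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Stand-in for an async evaluation call; not a Phoenix function."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return [{**row, "relevance_score": {"score": 1.0, "label": "relevant"}} for row in rows]


async def main() -> List[Dict[str, Any]]:
    rows = [{"input": "q1", "output": "a1"}]
    # In real code this line would await async_evaluate_dataframe(...).
    return await fake_async_evaluate(rows)


# asyncio.run creates the event loop that a script does not have by default.
results = asyncio.run(main())
print(results[0]["relevance_score"]["label"])
```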

Using HallucinationEvaluator

# WRONG — deprecated
from phoenix.evals import HallucinationEvaluator
evaluator = HallucinationEvaluator(model)  # renamed from `eval` to avoid shadowing the built-in

# RIGHT — use FaithfulnessEvaluator
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals import LLM
evaluator = FaithfulnessEvaluator(llm=LLM(provider="openai", model="gpt-4o"))  # avoids shadowing built-in eval

Why: HallucinationEvaluator is deprecated. FaithfulnessEvaluator is its replacement. It uses "faithful"/"unfaithful" labels, and higher scores are better (1.0 = faithful).
