mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-13 03:35:55 +00:00
chore: publish from staged

---
name: arize-evaluator
description: "INVOKE THIS SKILL for LLM-as-judge evaluation workflows on Arize: creating/updating evaluators, running evaluations on spans or experiments, tasks, trigger-run, column mapping, and continuous monitoring. Use when the user says: create an evaluator, LLM judge, hallucination/faithfulness/correctness/relevance, run eval, score my spans or experiment, ax tasks, trigger-run, trigger eval, column mapping, continuous monitoring, query filter for evals, evaluator version, or improve an evaluator prompt."
---

# Arize Evaluator Skill

This skill covers designing, creating, and running **LLM-as-judge evaluators** on Arize. An evaluator defines the judge; a **task** is how you run it against real data.

---

## Prerequisites

Proceed directly with the task — run the `ax` command you need. Do NOT check versions, env vars, or profiles upfront.

If an `ax` command fails, troubleshoot based on the error:

- `command not found` or version error → see references/ax-setup.md
- `401 Unauthorized` / missing API key → run `ax profiles show` to inspect the current profile. If the profile is missing or the API key is wrong: check `.env` for `ARIZE_API_KEY` and use it to create/update the profile via references/ax-profiles.md. If `.env` has no key either, ask the user for their Arize API key (https://app.arize.com/admin > API Keys)
- Space ID unknown → check `.env` for `ARIZE_SPACE_ID`, or run `ax spaces list -o json`, or ask the user
- LLM provider call fails (missing OPENAI_API_KEY / ANTHROPIC_API_KEY) → check `.env`, load if present, otherwise ask the user

---

## Concepts

### What is an Evaluator?

An **evaluator** is an LLM-as-judge definition. It contains:

| Field | Description |
|-------|-------------|
| **Template** | The judge prompt. Uses `{variable}` placeholders (e.g. `{input}`, `{output}`, `{context}`) that get filled in at run time via a task's column mappings. |
| **Classification choices** | The set of allowed output labels (e.g. `factual` / `hallucinated`). Binary is the default and most common. Each choice can optionally carry a numeric score. |
| **AI Integration** | Stored LLM provider credentials (OpenAI, Anthropic, Bedrock, etc.) the evaluator uses to call the judge model. |
| **Model** | The specific judge model (e.g. `gpt-4o`, `claude-sonnet-4-5`). |
| **Invocation params** | Optional JSON of model settings like `{"temperature": 0}`. Low temperature is recommended for reproducibility. |
| **Optimization direction** | Whether higher scores are better (`maximize`) or worse (`minimize`). Sets how the UI renders trends. |
| **Data granularity** | Whether the evaluator runs at the **span**, **trace**, or **session** level. Most evaluators run at the span level. |

Evaluators are **versioned** — every prompt or model change creates a new immutable version. The most recent version is active.

### What is a Task?

A **task** is how you run one or more evaluators against real data. Tasks are attached to a **project** (live traces/spans) or a **dataset** (experiment runs). A task contains:

| Field | Description |
|-------|-------------|
| **Evaluators** | List of evaluators to run. You can run multiple in one task. |
| **Column mappings** | Maps each evaluator's template variables to actual field paths on spans or experiment runs (e.g. `"input" → "attributes.input.value"`). This is what makes evaluators portable across projects and experiments. |
| **Query filter** | SQL-style expression to select which spans/runs to evaluate (e.g. `"span_kind = 'LLM'"`). Optional but important for precision. |
| **Continuous** | For project tasks: whether to automatically score new spans as they arrive. |
| **Sampling rate** | For continuous project tasks: fraction of new spans to evaluate (0–1). |

---

## Data Granularity

The `--data-granularity` flag controls what unit of data the evaluator scores. It defaults to `span` and only applies to **project tasks** (not dataset/experiment tasks — those evaluate experiment runs directly).

| Level | What it evaluates | Use for | Result column prefix |
|-------|-------------------|---------|---------------------|
| `span` (default) | Individual spans | Q&A correctness, hallucination, relevance | `eval.{name}.label` / `.score` / `.explanation` |
| `trace` | All spans in a trace, grouped by `context.trace_id` | Agent trajectory, task correctness — anything that needs the full call chain | `trace_eval.{name}.label` / `.score` / `.explanation` |
| `session` | All traces in a session, grouped by `attributes.session.id` and ordered by start time | Multi-turn coherence, overall tone, conversation quality | `session_eval.{name}.label` / `.score` / `.explanation` |
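
The result-column naming convention in the table above can be sketched as a small helper (illustrative only; the prefix-per-granularity mapping mirrors the table, and `result_columns` is a hypothetical name, not part of `ax`):

```python
# Map each granularity to the documented result-column prefix.
PREFIXES = {"span": "eval", "trace": "trace_eval", "session": "session_eval"}

def result_columns(template_name, granularity="span"):
    """Return the label/score/explanation column names an evaluator writes."""
    prefix = PREFIXES[granularity]
    return [f"{prefix}.{template_name}.{part}" for part in ("label", "score", "explanation")]

print(result_columns("hallucination", "trace"))
# ['trace_eval.hallucination.label', 'trace_eval.hallucination.score', 'trace_eval.hallucination.explanation']
```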

### How trace and session aggregation works

For **trace** granularity, spans sharing the same `context.trace_id` are grouped together. Column values used by the evaluator template are comma-joined into a single string (each value truncated to 100K characters) before being passed to the judge model.

For **session** granularity, the same trace-level grouping happens first, then traces are ordered by `start_time` and grouped by `attributes.session.id`. Session-level values are capped at 100K characters total.
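
The trace-level grouping described above behaves roughly like the sketch below (a simplified illustration based on this description, not the actual Arize implementation; the flat span-dict shape is an assumption):

```python
# Group spans by trace ID, truncate each mapped value to 100K chars,
# then comma-join the values per trace.
from collections import defaultdict

MAX_CHARS = 100_000  # per-value cap described above

def aggregate_by_trace(spans, column):
    groups = defaultdict(list)
    for span in spans:
        value = str(span.get(column, ""))[:MAX_CHARS]
        groups[span["context.trace_id"]].append(value)
    return {trace_id: ",".join(values) for trace_id, values in groups.items()}

spans = [
    {"context.trace_id": "t1", "attributes.output.value": "step one"},
    {"context.trace_id": "t1", "attributes.output.value": "step two"},
    {"context.trace_id": "t2", "attributes.output.value": "other"},
]
print(aggregate_by_trace(spans, "attributes.output.value"))
```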

### The `{conversation}` template variable

At session granularity, `{conversation}` is a special template variable that renders as a JSON array of `{input, output}` turns across all traces in the session, built from `attributes.input.value` / `attributes.llm.input_messages` (input side) and `attributes.output.value` / `attributes.llm.output_messages` (output side).

At span or trace granularity, `{conversation}` is treated as a regular template variable and resolved via column mappings like any other.
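
Conceptually, the session-level rendering builds something like the following (a simplified sketch; the real renderer also falls back to `attributes.llm.input_messages` / `attributes.llm.output_messages`, which this sketch omits):

```python
# Render one {input, output} turn per trace, ordered by trace start time.
import json

def render_conversation(traces):
    turns = [
        {"input": t.get("attributes.input.value", ""),
         "output": t.get("attributes.output.value", "")}
        for t in sorted(traces, key=lambda t: t["start_time"])
    ]
    return json.dumps(turns)

traces = [
    {"start_time": 2, "attributes.input.value": "and tomorrow?", "attributes.output.value": "Rain."},
    {"start_time": 1, "attributes.input.value": "weather today?", "attributes.output.value": "Sunny."},
]
print(render_conversation(traces))
```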

### Multi-evaluator tasks

A task can contain evaluators at different granularities. At runtime the system uses the **highest** granularity (session > trace > span) for data fetching and automatically **splits into one child run per evaluator**. Per-evaluator `query_filter` in the task's evaluators JSON further narrows which spans are included (e.g., only tool-call spans within a session).
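
For example, a mixed task might pair a session-level coherence evaluator with a span-level tool-call evaluator. A hedged sketch of the evaluators JSON (the IDs, the `span_kind = 'TOOL'` filter, and the exact accepted schema are assumptions; verify against `ax tasks get` output for an existing task):

```json
[
  {
    "evaluator_id": "SESSION_COHERENCE_EVAL_ID",
    "column_mappings": { "input": "attributes.input.value" }
  },
  {
    "evaluator_id": "TOOL_CALL_EVAL_ID",
    "query_filter": "span_kind = 'TOOL'",
    "column_mappings": { "output": "attributes.output.value" }
  }
]
```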

---

## Basic CRUD

### AI Integrations

AI integrations store the LLM provider credentials the evaluator uses. For full CRUD — listing, creating for all providers (OpenAI, Anthropic, Azure, Bedrock, Vertex, Gemini, NVIDIA NIM, custom), updating, and deleting — use the **arize-ai-provider-integration** skill.

Quick reference for the common case (OpenAI):

```bash
# Check for an existing integration first
ax ai-integrations list --space-id SPACE_ID

# Create if none exists
ax ai-integrations create \
  --name "My OpenAI Integration" \
  --provider openAI \
  --api-key $OPENAI_API_KEY
```

Copy the returned integration ID — it is required for `ax evaluators create --ai-integration-id`.

### Evaluators

```bash
# List / Get
ax evaluators list --space-id SPACE_ID
ax evaluators get EVALUATOR_ID
ax evaluators list-versions EVALUATOR_ID
ax evaluators get-version VERSION_ID

# Create (creates the evaluator and its first version)
ax evaluators create \
  --name "Answer Correctness" \
  --space-id SPACE_ID \
  --description "Judges if the model answer is correct" \
  --template-name "correctness" \
  --commit-message "Initial version" \
  --ai-integration-id INT_ID \
  --model-name "gpt-4o" \
  --include-explanations \
  --use-function-calling \
  --classification-choices '{"correct": 1, "incorrect": 0}' \
  --template 'You are an evaluator. Given the user question and the model response, decide if the response correctly answers the question.

User question: {input}

Model response: {output}

Respond with exactly one of these labels: correct, incorrect'

# Create a new version (for prompt or model changes — versions are immutable)
ax evaluators create-version EVALUATOR_ID \
  --commit-message "Added context grounding" \
  --template-name "correctness" \
  --ai-integration-id INT_ID \
  --model-name "gpt-4o" \
  --include-explanations \
  --classification-choices '{"correct": 1, "incorrect": 0}' \
  --template 'Updated prompt...

{input} / {output} / {context}'

# Update metadata only (name, description — not prompt)
ax evaluators update EVALUATOR_ID \
  --name "New Name" \
  --description "Updated description"

# Delete (permanent — removes all versions)
ax evaluators delete EVALUATOR_ID
```

**Key flags for `create`:**

| Flag | Required | Description |
|------|----------|-------------|
| `--name` | yes | Evaluator name (unique within space) |
| `--space-id` | yes | Space to create in |
| `--template-name` | yes | Eval column name — alphanumeric, spaces, hyphens, underscores |
| `--commit-message` | yes | Description of this version |
| `--ai-integration-id` | yes | AI integration ID (from above) |
| `--model-name` | yes | Judge model (e.g. `gpt-4o`) |
| `--template` | yes | Prompt with `{variable}` placeholders (single-quoted in bash) |
| `--classification-choices` | yes | JSON object mapping choice labels to numeric scores, e.g. `'{"correct": 1, "incorrect": 0}'` |
| `--description` | no | Human-readable description |
| `--include-explanations` | no | Include reasoning alongside the label |
| `--use-function-calling` | no | Prefer structured function-call output |
| `--invocation-params` | no | JSON of model params, e.g. `'{"temperature": 0}'` |
| `--data-granularity` | no | `span` (default), `trace`, or `session`. Only relevant for project tasks, not dataset/experiment tasks. See Data Granularity section. |
| `--provider-params` | no | JSON object of provider-specific parameters |

### Tasks

```bash
# List / Get
ax tasks list --space-id SPACE_ID
ax tasks list --project-id PROJ_ID
ax tasks list --dataset-id DATASET_ID
ax tasks get TASK_ID

# Create (project — continuous)
ax tasks create \
  --name "Correctness Monitor" \
  --task-type template_evaluation \
  --project-id PROJ_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --is-continuous \
  --sampling-rate 0.1

# Create (project — one-time / backfill)
ax tasks create \
  --name "Correctness Backfill" \
  --task-type template_evaluation \
  --project-id PROJ_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --no-continuous

# Create (experiment / dataset)
ax tasks create \
  --name "Experiment Scoring" \
  --task-type template_evaluation \
  --dataset-id DATASET_ID \
  --experiment-ids "EXP_ID_1,EXP_ID_2" \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"output": "output"}}]' \
  --no-continuous

# Trigger a run (project task — use data window)
ax tasks trigger-run TASK_ID \
  --data-start-time "2026-03-20T00:00:00" \
  --data-end-time "2026-03-21T23:59:59" \
  --wait

# Trigger a run (experiment task — use experiment IDs)
ax tasks trigger-run TASK_ID \
  --experiment-ids "EXP_ID_1" \
  --wait

# Monitor
ax tasks list-runs TASK_ID
ax tasks get-run RUN_ID
ax tasks wait-for-run RUN_ID --timeout 300
ax tasks cancel-run RUN_ID --force
```

**Time format for trigger-run:** `2026-03-21T09:00:00` — no trailing `Z`.
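
If you generate these timestamps programmatically, format them as naive local datetimes with no timezone suffix. A minimal sketch (the 2-day window is an arbitrary example):

```python
# Build a trigger-run window in the expected format (no trailing Z).
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%S"
end = datetime(2026, 3, 21, 23, 59, 59)
start = end - timedelta(days=2) + timedelta(seconds=1)  # hypothetical 2-day window

print(start.strftime(FMT))  # 2026-03-20T00:00:00
print(end.strftime(FMT))    # 2026-03-21T23:59:59
```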

**Additional trigger-run flags:**

| Flag | Description |
|------|-------------|
| `--max-spans` | Cap processed spans (default 10,000) |
| `--override-evaluations` | Re-score spans that already have labels |
| `--wait` / `-w` | Block until the run finishes |
| `--timeout` | Seconds to wait with `--wait` (default 600) |
| `--poll-interval` | Poll interval in seconds when waiting (default 5) |

**Run status guide:**

| Status | Meaning |
|--------|---------|
| `completed`, 0 spans | No spans in eval index for that window — widen time range |
| `cancelled` after ~1s | Integration credentials invalid |
| `cancelled` after ~3min | Found spans but LLM call failed — check model name or key |
| `completed`, N > 0 | Success — check scores in UI |

---

## Workflow A: Create an evaluator for a project

Use this when the user says something like *"create an evaluator for my Playground Traces project"*.

### Step 1: Resolve the project name to an ID

`ax spans export` requires a project **ID**, not a name — passing a name causes a validation error. Always look up the ID first:

```bash
ax projects list --space-id SPACE_ID -o json
```

Find the entry whose `"name"` matches (case-insensitive). Copy its `"id"` (a base64 string).
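
The case-insensitive match can be sketched like this (the JSON shape — a list of objects with `"id"` and `"name"` — is an assumption about `ax projects list -o json` output; verify against real output):

```python
# Resolve a project name to its base64 ID from `ax projects list -o json` output.
import json

def resolve_project_id(projects_json, name):
    for project in json.loads(projects_json):
        if project.get("name", "").lower() == name.lower():
            return project["id"]
    return None  # not found — ask the user or list names

sample = json.dumps([
    {"id": "UHJvamVjdDox", "name": "Playground Traces"},
    {"id": "UHJvamVjdDoy", "name": "Prod Agent"},
])
print(resolve_project_id(sample, "playground traces"))  # UHJvamVjdDox
```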

### Step 2: Understand what to evaluate

If the user specified the evaluator type (hallucination, correctness, relevance, etc.) → skip to Step 3.

If not, sample recent spans to base the evaluator on actual data:

```bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 10 --days 30 --stdout
```

Inspect `attributes.input`, `attributes.output`, span kinds, and any existing annotations. Identify failure modes (e.g. hallucinated facts, off-topic answers, missing context) and propose **1–3 concrete evaluator ideas**. Let the user pick.

Each suggestion must include: the evaluator name (bold), a one-sentence description of what it judges, and the binary label pair in parentheses. Format each like:

1. **Name** — Description of what is being judged. (`label_a` / `label_b`)

Example:

1. **Response Correctness** — Does the agent's response correctly address the user's financial query? (`correct` / `incorrect`)
2. **Hallucination** — Does the response fabricate facts not grounded in retrieved context? (`factual` / `hallucinated`)

### Step 3: Confirm or create an AI integration

```bash
ax ai-integrations list --space-id SPACE_ID -o json
```

If a suitable integration exists, note its ID. If not, create one using the **arize-ai-provider-integration** skill. Ask the user which provider/model they want for the judge.

### Step 4: Create the evaluator

Use the template design best practices below. Keep the evaluator name and variables **generic** — the task (Step 6) handles project-specific wiring via `column_mappings`.

```bash
ax evaluators create \
  --name "Hallucination" \
  --space-id SPACE_ID \
  --template-name "hallucination" \
  --commit-message "Initial version" \
  --ai-integration-id INT_ID \
  --model-name "gpt-4o" \
  --include-explanations \
  --use-function-calling \
  --classification-choices '{"factual": 1, "hallucinated": 0}' \
  --template 'You are an evaluator. Given the user question and the model response, decide if the response is factual or contains unsupported claims.

User question: {input}

Model response: {output}

Respond with exactly one of these labels: hallucinated, factual'
```

### Step 5: Ask — backfill, continuous, or both?

Before creating the task, ask:

> "Would you like to:
> (a) Run a **backfill** on historical spans (one-time)?
> (b) Set up **continuous** evaluation on new spans going forward?
> (c) **Both** — backfill now and keep scoring new spans automatically?"

### Step 6: Determine column mappings from real span data

Do not guess paths. Pull a sample and inspect what fields are actually present:

```bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 5 --days 7 --stdout
```

For each template variable (`{input}`, `{output}`, `{context}`), find the matching JSON path. Common starting points — **always verify on your actual data before using**:

| Template var | LLM span | CHAIN span |
|---|---|---|
| `input` | `attributes.input.value` | `attributes.input.value` |
| `output` | `attributes.llm.output_messages.0.message.content` | `attributes.output.value` |
| `context` | `attributes.retrieval.documents.contents` | — |
| `tool_output` | `attributes.input.value` (fallback) | `attributes.output.value` |

**Validate span kind alignment:** If the evaluator prompt assumes LLM final text but the task targets CHAIN spans (or vice versa), runs can cancel or score the wrong text. Make sure the `query_filter` on the task matches the span kind you mapped.

**Full example `--evaluators` JSON:**

```json
[
  {
    "evaluator_id": "EVAL_ID",
    "query_filter": "span_kind = 'LLM'",
    "column_mappings": {
      "input": "attributes.input.value",
      "output": "attributes.llm.output_messages.0.message.content",
      "context": "attributes.retrieval.documents.contents"
    }
  }
]
```

Include a mapping for **every** variable the template references. Omitting one causes runs to produce no valid scores.
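
A quick way to sanity-check a mapping before creating the task is to resolve each dotted path against one exported span. A hedged sketch (it assumes the export is nested JSON and that numeric segments index into lists, as in `output_messages.0`):

```python
# Resolve a dotted column-mapping path against a span dict.
def resolve_path(span, path):
    value = span
    for segment in path.split("."):
        if isinstance(value, list):
            value = value[int(segment)]  # numeric segment → list index
        elif isinstance(value, dict):
            value = value.get(segment)
        if value is None:
            return None  # path missing → fix the mapping, not the evaluator
    return value

span = {
    "attributes": {
        "input": {"value": "What is 2+2?"},
        "llm": {"output_messages": [{"message": {"content": "4"}}]},
    }
}
print(resolve_path(span, "attributes.llm.output_messages.0.message.content"))  # 4
```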

### Step 7: Create the task

**Backfill only (a):**

```bash
ax tasks create \
  --name "Hallucination Backfill" \
  --task-type template_evaluation \
  --project-id PROJECT_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --no-continuous
```

**Continuous only (b):**

```bash
ax tasks create \
  --name "Hallucination Monitor" \
  --task-type template_evaluation \
  --project-id PROJECT_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --is-continuous \
  --sampling-rate 0.1
```

**Both (c):** Use `--is-continuous` on create, then also trigger a backfill run in Step 8.

### Step 8: Trigger a backfill run (if requested)

First find what time range has data:

```bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 100 --days 1 --stdout   # try last 24h first
ax spans export PROJECT_ID --space-id SPACE_ID -l 100 --days 7 --stdout   # widen if empty
```

Use the `start_time` / `end_time` fields from real spans to set the window. Use the most recent data for your first test run.
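
Deriving a window from exported spans can be sketched as follows (the `start_time` string format with a trailing `Z` is an assumption about the export; the `[:19]` slice keeps `YYYY-MM-DDTHH:MM:SS`, which matches the trigger-run format):

```python
# Compute a trigger-run window from the start_time values of exported spans.
import json

def data_window(spans_json):
    times = sorted(span["start_time"][:19] for span in json.loads(spans_json))
    return times[0], times[-1]

sample = json.dumps([
    {"start_time": "2026-03-20T08:15:00Z"},
    {"start_time": "2026-03-21T17:40:22Z"},
])
print(data_window(sample))  # ('2026-03-20T08:15:00', '2026-03-21T17:40:22')
```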

```bash
ax tasks trigger-run TASK_ID \
  --data-start-time "2026-03-20T00:00:00" \
  --data-end-time "2026-03-21T23:59:59" \
  --wait
```

---

## Workflow B: Create an evaluator for an experiment

Use this when the user says something like *"create an evaluator for my experiment"* or *"evaluate my dataset runs"*.

**If the user says "dataset" but doesn't have an experiment:** A task must target an experiment (not a bare dataset). Ask:

> "Evaluation tasks run against experiment runs, not datasets directly. Would you like help creating an experiment on that dataset first?"

If yes, use the **arize-experiment** skill to create one, then return here.

### Step 1: Resolve dataset and experiment

```bash
ax datasets list --space-id SPACE_ID -o json
ax experiments list --dataset-id DATASET_ID -o json
```

Note the dataset ID and the experiment ID(s) to score.

### Step 2: Understand what to evaluate

If the user specified the evaluator type → skip to Step 3.

If not, inspect a recent experiment run to base the evaluator on actual data:

```bash
ax experiments export EXPERIMENT_ID --stdout | python3 -c "import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))"
```

Look at the `output`, `input`, `evaluations`, and `metadata` fields. Identify gaps (metrics the user cares about but doesn't have yet) and propose **1–3 evaluator ideas**. Each suggestion must include: the evaluator name (bold), a one-sentence description, and the binary label pair in parentheses — same format as Workflow A, Step 2.

### Step 3: Confirm or create an AI integration

Same as Workflow A, Step 3.

### Step 4: Create the evaluator

Same as Workflow A, Step 4. Keep variables generic.

### Step 5: Determine column mappings from real run data

Run data shape differs from span data. Inspect:

```bash
ax experiments export EXPERIMENT_ID --stdout | python3 -c "import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))"
```

Common mapping for experiment runs:

- `output` → `"output"` (top-level field on each run)
- `input` → check if it's on the run or embedded in the linked dataset examples

If `input` is not on the run JSON, export dataset examples to find the path:

```bash
ax datasets export DATASET_ID --stdout | python3 -c "import sys,json; ex=json.load(sys.stdin); print(json.dumps(ex[0], indent=2))"
```

### Step 6: Create the task

```bash
ax tasks create \
  --name "Experiment Correctness" \
  --task-type template_evaluation \
  --dataset-id DATASET_ID \
  --experiment-ids "EXP_ID" \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"output": "output"}}]' \
  --no-continuous
```

### Step 7: Trigger and monitor

```bash
ax tasks trigger-run TASK_ID \
  --experiment-ids "EXP_ID" \
  --wait

ax tasks list-runs TASK_ID
ax tasks get-run RUN_ID
```

---

## Best Practices for Template Design

### 1. Use generic, portable variable names

Use `{input}`, `{output}`, and `{context}` — not names tied to a specific project or span attribute (e.g. do not use `{attributes_input_value}`). The evaluator itself stays abstract; the **task's `column_mappings`** is where you wire it to the actual fields in a specific project or experiment. This lets the same evaluator run across multiple projects and experiments without modification.

### 2. Default to binary labels

Use exactly two clear string labels (e.g. `hallucinated` / `factual`, `correct` / `incorrect`, `pass` / `fail`). Binary labels are:

- Easiest for the judge model to produce consistently
- Most common in the industry
- Simplest to interpret in dashboards

If the user insists on more than two choices, that's fine — but recommend binary first and explain the tradeoff (more labels → more ambiguity → lower inter-rater reliability).

### 3. Be explicit about what the model must return

The template must tell the judge model to respond with **only** the label string — nothing else. The label strings in the prompt must **exactly match** the labels in `--classification-choices` (same spelling, same casing).

Good:
```
Respond with exactly one of these labels: hallucinated, factual
```

Bad (too open-ended):
```
Is this hallucinated? Answer yes or no.
```

### 4. Keep temperature low

Pass `--invocation-params '{"temperature": 0}'` for reproducible scoring. Higher temperatures introduce noise into evaluation results.

### 5. Use `--include-explanations` for debugging

During initial setup, always include explanations so you can verify the judge is reasoning correctly before trusting the labels at scale.

### 6. Pass the template in single quotes in bash

Single quotes pass the template to `ax` verbatim. Inside double quotes the shell still expands `$` variables and `!` history references, so any prompt containing those characters can be silently mangled or trigger an error:

```bash
# Correct
--template 'Judge this: {input} → {output}'

# Risky — any $var or ! sequence in the prompt gets expanded by the shell
--template "Judge this: {input} → {output}"
```

### 7. Always set `--classification-choices` to match your template labels

The labels in `--classification-choices` must exactly match the labels referenced in `--template` (same spelling, same casing). Omitting `--classification-choices` causes task runs to fail with "missing rails and classification choices."

---

## Troubleshooting

| Problem | Solution |
|---------|----------|
| `ax: command not found` | See references/ax-setup.md |
| `401 Unauthorized` | API key may not have access to this space. Verify at https://app.arize.com/admin > API Keys |
| `Evaluator not found` | `ax evaluators list --space-id SPACE_ID` |
| `Integration not found` | `ax ai-integrations list --space-id SPACE_ID` |
| `Task not found` | `ax tasks list --space-id SPACE_ID` |
| `project-id and dataset-id are mutually exclusive` | Use only one when creating a task |
| `experiment-ids required for dataset tasks` | Add `--experiment-ids` to `create` and `trigger-run` |
| `sampling-rate only valid for project tasks` | Remove `--sampling-rate` from dataset tasks |
| Validation error on `ax spans export` | Pass project ID (base64), not project name — look up via `ax projects list` |
| Template validation errors | Use single-quoted `--template '...'` in bash; single braces `{var}`, not double `{{var}}` |
| Run stuck in `pending` | `ax tasks get-run RUN_ID`; then `ax tasks cancel-run RUN_ID` |
| Run `cancelled` after ~1s | Integration credentials invalid — check AI integration |
| Run `cancelled` after ~3min | Found spans but LLM call failed — wrong model name or bad key |
| Run `completed`, 0 spans | Widen time window; eval index may not cover older data |
| No scores in UI | Fix `column_mappings` to match real paths on your spans/runs |
| Scores look wrong | Add `--include-explanations` and inspect judge reasoning on a few samples |
| Evaluator cancels on wrong span kind | Match `query_filter` and `column_mappings` to LLM vs CHAIN spans |
| Time format error on `trigger-run` | Use `2026-03-21T09:00:00` — no trailing `Z` |
| Run failed: "missing rails and classification choices" | Add `--classification-choices '{"label_a": 1, "label_b": 0}'` to `ax evaluators create` — labels must match the template |
| Run `completed`, all spans skipped | Query filter matched spans but column mappings are wrong or template variables don't resolve — export a sample span and verify paths |

---

## Related Skills

- **arize-ai-provider-integration**: Full CRUD for LLM provider integrations (create, update, delete credentials)
- **arize-trace**: Export spans to discover column paths and time ranges
- **arize-experiment**: Create experiments and export runs for experiment column mappings
- **arize-dataset**: Export dataset examples to find input fields when runs omit them
- **arize-link**: Deep links to evaluators and tasks in the Arize UI

---

## Save Credentials for Future Use

See references/ax-profiles.md § Save Credentials for Future Use.

# ax Profile Setup

Consult this when authentication fails (401, missing profile, missing API key). Do NOT run these checks proactively.

Use this when there is no profile, or a profile has incorrect settings (wrong API key, wrong region, etc.).

## 1. Inspect the current state

```bash
ax profiles show
```

Look at the output to understand what's configured:

- `API Key: (not set)` or missing → key needs to be created/updated
- No profile output or "No profiles found" → no profile exists yet
- Connected but getting `401 Unauthorized` → key is wrong or expired
- Connected but wrong endpoint/region → region needs to be updated

## 2. Fix a misconfigured profile

If a profile exists but one or more settings are wrong, patch only what's broken.

**Never pass a raw API key value as a flag.** Always reference it via the `ARIZE_API_KEY` environment variable. If the variable is not already set in the shell, instruct the user to set it first, then run the command:

```bash
# If ARIZE_API_KEY is already exported in the shell:
ax profiles update --api-key $ARIZE_API_KEY

# Fix the region (no secret involved — safe to run directly)
ax profiles update --region us-east-1b

# Fix both at once
ax profiles update --api-key $ARIZE_API_KEY --region us-east-1b
```

`update` only changes the fields you specify — all other settings are preserved. If no profile name is given, the active profile is updated.

## 3. Create a new profile

If no profile exists, or if the existing profile needs to point to a completely different setup (different org, different region):

**Always reference the key via `$ARIZE_API_KEY`, never inline a raw value.**

```bash
# Requires ARIZE_API_KEY to be exported in the shell first
ax profiles create --api-key $ARIZE_API_KEY

# Create with a region
ax profiles create --api-key $ARIZE_API_KEY --region us-east-1b

# Create a named profile
ax profiles create work --api-key $ARIZE_API_KEY --region us-east-1b
```

To use a named profile with any `ax` command, add `-p NAME`:

```bash
ax spans export PROJECT_ID -p work
```
## 4. Getting the API key

**Never ask the user to paste their API key into the chat. Never log, echo, or display an API key value.**

If `ARIZE_API_KEY` is not already set, instruct the user to export it in their shell:

```bash
export ARIZE_API_KEY="..."  # user pastes their key here in their own terminal
```

They can find their key at https://app.arize.com/admin > API Keys. Recommend they create a **scoped service key** (not a personal user key) — service keys are not tied to an individual account and are safer for programmatic use. Keys are space-scoped — make sure they copy the key for the correct space.

Once the user confirms the variable is set, proceed with `ax profiles create --api-key $ARIZE_API_KEY` or `ax profiles update --api-key $ARIZE_API_KEY` as described above.
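If the key lives in a local `.env` file instead, a minimal sketch for loading it into the shell without echoing the value (assumes plain `KEY=value` lines with no quoting or comments):

```bash
# With allexport on, every variable assigned while sourcing .env is
# exported into the environment automatically.
set -a
source .env
set +a

# Confirm the key is present without printing its value.
if [ -n "$ARIZE_API_KEY" ]; then
  echo "ARIZE_API_KEY is set"
else
  echo "ARIZE_API_KEY is missing from .env"
fi
```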

## 5. Verify

After any create or update:

```bash
ax profiles show
```

Confirm the API key and region are correct, then retry the original command.

## Space ID

There is no profile flag for the space ID. Save it as an environment variable:

**macOS/Linux** — add to `~/.zshrc` or `~/.bashrc`:

```bash
export ARIZE_SPACE_ID="U3BhY2U6..."
```

Then run `source ~/.zshrc` (or restart the terminal).

**Windows (PowerShell):**

```powershell
[System.Environment]::SetEnvironmentVariable('ARIZE_SPACE_ID', 'U3BhY2U6...', 'User')
```

Restart the terminal for it to take effect.
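On macOS/Linux, a sketch that appends the export line only if it is not already present, so repeated runs stay idempotent (assumes zsh; substitute `~/.bashrc` for bash):

```bash
# grep -qxF matches the exact literal line; append only when it is
# absent, so the rc file never accumulates duplicates.
line='export ARIZE_SPACE_ID="U3BhY2U6..."'
grep -qxF "$line" ~/.zshrc 2>/dev/null || echo "$line" >> ~/.zshrc
```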

## Save Credentials for Future Use

At the **end of the session**, if the user manually provided any credentials during this conversation **and** those values were NOT already loaded from a saved profile or environment variable, offer to save them.

**Skip this entirely if:**
- The API key was already loaded from an existing profile or the `ARIZE_API_KEY` env var
- The space ID was already set via the `ARIZE_SPACE_ID` env var
- The user only used base64 project IDs (no space ID was needed)

**How to offer:** Use **AskQuestion**: *"Would you like to save your Arize credentials so you don't have to enter them next time?"* with options `"Yes, save them"` / `"No thanks"`.

**If the user says yes:**

1. **API key** — Run `ax profiles show` to check the current state. Then run `ax profiles create --api-key $ARIZE_API_KEY` or `ax profiles update --api-key $ARIZE_API_KEY` (the key must already be exported as an env var — never pass a raw key value).

2. **Space ID** — See the Space ID section above to persist it as an environment variable.
# ax CLI — Troubleshooting

Consult this only when an `ax` command fails. Do NOT run these checks proactively.

## Check version first

If `ax` is installed (not `command not found`), always run `ax --version` before investigating further. The version must be `0.8.0` or higher — many errors are caused by an outdated install. If the version is too old, see **Version too old** below.
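The 0.8.0 gate can be scripted. This is a sketch, not part of the CLI: it assumes `ax --version` prints a semver token somewhere in its output, and that minor and patch numbers stay below 100:

```bash
# Pull the first x.y.z token out of the version output (the output
# format is an assumption), then compare numerically against 0.8.0,
# which encodes as 0*10000 + 8*100 + 0 = 800.
ver=$(ax --version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
if printf '%s\n' "$ver" | awk -F. '{ exit !($1 * 10000 + $2 * 100 + $3 < 800) }'; then
  echo "ax $ver is below 0.8.0; upgrade before troubleshooting further"
fi
```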

## `ax: command not found`

**macOS/Linux:**
1. Check common locations: `~/.local/bin/ax`, `~/Library/Python/*/bin/ax`
2. Install: `uv tool install arize-ax-cli` (preferred), `pipx install arize-ax-cli`, or `pip install arize-ax-cli`
3. Add to PATH if needed: `export PATH="$HOME/.local/bin:$PATH"`

**Windows (PowerShell):**
1. Check: `Get-Command ax` or `where.exe ax`
2. Common locations: `%APPDATA%\Python\Scripts\ax.exe`, `%LOCALAPPDATA%\Programs\Python\Python*\Scripts\ax.exe`
3. Install: `pip install arize-ax-cli`
4. Add to PATH: `$env:PATH = "$env:APPDATA\Python\Scripts;$env:PATH"`

## Version too old (below 0.8.0)

Upgrade with one of: `uv tool install --force --reinstall arize-ax-cli`, `pipx upgrade arize-ax-cli`, or `pip install --upgrade arize-ax-cli`

## SSL/certificate error

- macOS: `export SSL_CERT_FILE=/etc/ssl/cert.pem`
- Linux: `export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt`
- Fallback: `export SSL_CERT_FILE=$(python -c "import certifi; print(certifi.where())")`
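The three options above can be folded into one snippet that picks the first bundle that actually exists on the machine (the fallback assumes the `certifi` package is installed):

```bash
# Try the well-known bundle locations first; fall back to certifi
# only when neither file is present.
unset SSL_CERT_FILE
for f in /etc/ssl/cert.pem /etc/ssl/certs/ca-certificates.crt; do
  if [ -f "$f" ]; then
    export SSL_CERT_FILE="$f"
    break
  fi
done
if [ -z "${SSL_CERT_FILE:-}" ]; then
  export SSL_CERT_FILE="$(python -c 'import certifi; print(certifi.where())')"
fi
```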

## Subcommand not recognized

Upgrade ax (see above) or use the closest available alternative.

## Still failing

Stop and ask the user for help.