mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-13 11:45:56 +00:00
# Fundamentals
Application-specific tests for AI systems. Code first, LLM for nuance, human for truth.
## Evaluator Types
| Type | Speed | Cost | Use Case |
| ---- | ----- | ---- | -------- |
| **Code** | Fast | Cheap | Regex, JSON, format, exact match |
| **LLM** | Medium | Medium | Subjective quality, complex criteria |
| **Human** | Slow | Expensive | Ground truth, calibration |

**Decision:** Code first → LLM only when code can't capture criteria → Human for calibration.
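
This decision rule can be sketched as a tiny helper (illustrative only; the parameter names are assumptions, not a Phoenix API):

```python
def pick_evaluator_kind(codifiable: bool, needs_calibration: bool) -> str:
    """Illustrative decision rule: prefer code, reserve humans for
    calibration/ground truth, fall back to LLM for everything else."""
    if codifiable:
        return "code"
    if needs_calibration:
        return "human"
    return "llm"
```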
## Score Structure
| Property | Required | Description |
| -------- | -------- | ----------- |
| `name` | Yes | Evaluator name |
| `kind` | Yes | `"code"`, `"llm"`, `"human"` |
| `score` | No* | 0-1 numeric |
| `label` | No* | `"pass"`, `"fail"` |
| `explanation` | No | Rationale |

*One of `score` or `label` required.
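
A score record with this shape can be represented as a plain dict; a minimal validity check (a sketch of the rules in the table above, not the actual Phoenix score class) might look like:

```python
def is_valid_score(record: dict) -> bool:
    # name and kind are required; at least one of score/label must be present
    if "name" not in record or record.get("kind") not in ("code", "llm", "human"):
        return False
    return ("score" in record) or ("label" in record)

score = {
    "name": "has_citation",
    "kind": "code",
    "label": "pass",
    "explanation": "Output contains a [n]-style citation.",
}
print(is_valid_score(score))                      # True
print(is_valid_score({"name": "x", "kind": "llm"}))  # False: no score or label
```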
## Binary > Likert
Use pass/fail, not 1-5 scales. Clearer criteria, easier calibration.
```python
# Multiple binary checks instead of one Likert scale
evaluators = [
    AnswersQuestion(),   # Yes/No
    UsesContext(),       # Yes/No
    NoHallucination(),   # Yes/No
]
```
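
Binary labels also aggregate cleanly into pass rates, which is part of why they calibrate more easily than Likert scores (a sketch; the result field names are assumptions):

```python
results = [
    {"name": "answers_question", "label": "pass"},
    {"name": "uses_context", "label": "pass"},
    {"name": "no_hallucination", "label": "fail"},
]

# Fraction of binary checks that passed
pass_rate = sum(r["label"] == "pass" for r in results) / len(results)
print(f"{pass_rate:.2f}")  # 0.67
```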
## Quick Patterns
### Code Evaluator
```python
import re

from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    return bool(re.search(r'\[\d+\]', output))
```
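
The `@create_evaluator` wrapping is Phoenix-specific, but the underlying check is plain Python and can be exercised on its own:

```python
import re

def has_citation(output: str) -> bool:
    # Matches [1]-style numeric citations anywhere in the output
    return bool(re.search(r'\[\d+\]', output))

print(has_citation("See [1] for details."))  # True
print(has_citation("No citations here."))    # False
```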
### LLM Evaluator
```python
from phoenix.evals import ClassificationEvaluator, LLM

evaluator = ClassificationEvaluator(
    name="helpfulness",
    prompt_template="...",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"not_helpful": 0, "helpful": 1},
)
```
### Run Experiment
```python
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[evaluator1, evaluator2],
)
print(experiment.aggregate_scores)
```