Mirror of https://github.com/github/awesome-copilot.git, synced 2026-04-12 11:15:56 +00:00.
| name | description | license | compatibility | metadata |
|---|---|---|---|---|
| phoenix-evals | Build and run evaluators for AI/LLM applications using Phoenix. | Apache-2.0 | Requires a Phoenix server. Python skills need the `phoenix` and `openai` packages; TypeScript skills need `@arizeai/phoenix-client`. | |
# Phoenix Evals
Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
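The "code first, LLM for nuance" ordering can be sketched in plain Python. This is illustrative only, not the Phoenix evals API: the names `contains_citation`, `evaluate`, and `llm_judge` are hypothetical, and the deterministic check (a citation regex) is a stand-in for whatever cheap check your failure modes call for.

```python
# Hedged sketch of the "code first, LLM for nuance" principle.
# Not the Phoenix API; all names here are hypothetical.
import re

def contains_citation(answer: str) -> bool:
    """Deterministic check: does the answer cite a source like [1]?"""
    return re.search(r"\[\d+\]", answer) is not None

def evaluate(answer: str, llm_judge=None) -> str:
    # Code first: cheap, deterministic checks run before any LLM call.
    if not contains_citation(answer):
        return "fail"
    # LLM for nuance: escalate only when the deterministic check passes
    # and a judgment call (e.g. faithfulness) is still needed.
    if llm_judge is not None:
        return llm_judge(answer)
    return "pass"

print(evaluate("The sky is blue [1]."))  # pass
print(evaluate("The sky is blue."))      # fail
```

The design point is cost and determinism: answers that fail the code check never reach the (slow, non-deterministic) LLM judge.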
## Quick Reference

### Workflows
- **Starting Fresh:** observe-tracing-setup → error-analysis → axial-coding → evaluators-overview
- **Building an Evaluator:** fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}
- **RAG Systems:** evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
- **Production:** production-overview → production-guardrails → production-continuous
## Reference Categories
| Prefix | Description |
|---|---|
| `fundamentals-*` | Types, scores, anti-patterns |
| `observe-*` | Tracing, sampling |
| `error-analysis-*` | Finding failures |
| `axial-coding-*` | Categorizing failures |
| `evaluators-*` | Code, LLM, RAG evaluators |
| `experiments-*` | Datasets, running experiments |
| `validation-*` | Validating evaluator accuracy against human labels |
| `production-*` | CI/CD, monitoring |
## Key Principles
| Principle | Action |
|---|---|
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | Require >80% TPR and TNR against human labels |
| Binary > Likert | Pass/fail, not 1-5 |
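The "validate judges" and "binary > Likert" principles combine naturally: with binary pass/fail labels, judge accuracy against human labels reduces to a true-positive rate and a true-negative rate. A minimal sketch (the helper `tpr_tnr` and the sample labels are hypothetical, not from Phoenix):

```python
# Hedged sketch: validating a binary judge against human labels by
# computing TPR and TNR, per the ">80% TPR/TNR" principle.
def tpr_tnr(judge_labels, human_labels):
    """Return (TPR, TNR) for binary pass/fail label lists."""
    tp = sum(1 for j, h in zip(judge_labels, human_labels)
             if h == "pass" and j == "pass")
    tn = sum(1 for j, h in zip(judge_labels, human_labels)
             if h == "fail" and j == "fail")
    pos = sum(1 for h in human_labels if h == "pass")
    neg = sum(1 for h in human_labels if h == "fail")
    return tp / pos, tn / neg

# Hypothetical labels: humans graded 5 outputs, the LLM judge graded the same 5.
human = ["pass", "pass", "fail", "fail", "pass"]
judge = ["pass", "fail", "fail", "fail", "pass"]

tpr, tnr = tpr_tnr(judge, human)
print(tpr, tnr)  # 0.666..., 1.0
# Accept the judge only when both rates exceed 0.8;
# here TPR fails the bar, so the judge prompt needs iteration.
```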