# Evaluators: RAG Systems

RAG has two distinct components requiring different evaluation approaches.

## Two-Phase Evaluation

```
RETRIEVAL                          GENERATION
─────────                          ──────────
Query → Retriever → Docs           Docs + Query → LLM → Answer
          │                                   │
      IR Metrics                   LLM Judges / Code Checks
```

**Debug retrieval first** using IR metrics, then tackle generation quality.

## Retrieval Evaluation (IR Metrics)

Use traditional information retrieval metrics:

| Metric | What It Measures |
| ------ | ---------------- |
| Recall@k | Of all relevant docs, how many appear in the top k? |
| Precision@k | Of the k retrieved docs, how many are relevant? |
| MRR | How high does the first relevant doc rank? |
| NDCG | Relevance quality, weighted by position |

```python
# Requires query-document relevance labels
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    retrieved_set = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    if not relevant_set:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)
```
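
MRR falls out of the same relevance labels; a minimal sketch (function and argument names here are illustrative, not part of the snippet above):

```python
# Mean reciprocal rank: average of 1/rank of the first relevant doc per query.
def mean_reciprocal_rank(retrieved_by_query, relevant_by_query):
    # retrieved_by_query: {query: ranked list of doc ids}
    # relevant_by_query:  {query: set of relevant doc ids}
    total = 0.0
    for query, retrieved_ids in retrieved_by_query.items():
        relevant = relevant_by_query.get(query, set())
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant doc counts
    return total / len(retrieved_by_query) if retrieved_by_query else 0.0
```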

## Creating Retrieval Test Data

Generate query-document pairs synthetically:

```python
# Reverse process: document → questions that document answers
def generate_retrieval_test(documents):
    test_pairs = []
    for doc in documents:
        # Extract facts, generate questions
        # (llm() stands in for any chat-model call returning a list of strings)
        questions = llm(f"Generate 3 questions this document answers:\n{doc}")
        for q in questions:
            test_pairs.append({"query": q, "relevant_doc_id": doc.id})
    return test_pairs
```
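
The synthetic pairs feed directly into the IR metrics above. A sketch of the loop, assuming a hypothetical `retriever.search(query)` that returns ranked doc ids:

```python
# Score the retriever against the synthetic test set (retriever API is assumed).
test_pairs = generate_retrieval_test(documents)
scores = [
    recall_at_k(retriever.search(pair["query"]), [pair["relevant_doc_id"]], k=5)
    for pair in test_pairs
]
print(f"mean recall@5: {sum(scores) / len(scores):.2f}")
```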

## Generation Evaluation

Use LLM judges for qualities code can't measure:

| Eval | Question |
| ---- | -------- |
| **Faithfulness** | Are all claims in the answer supported by the retrieved context? |
| **Relevance** | Does the answer address the question? |
| **Completeness** | Does the answer cover the key points from the context? |

```python
from phoenix.evals import ClassificationEvaluator, LLM

FAITHFULNESS_TEMPLATE = """Given the context and answer, is every claim in the answer supported by the context?

<context>{{context}}</context>
<answer>{{output}}</answer>

"faithful" = ALL claims supported by context
"unfaithful" = ANY claim NOT in context

Answer (faithful/unfaithful):"""

faithfulness = ClassificationEvaluator(
    name="faithfulness",
    prompt_template=FAITHFULNESS_TEMPLATE,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"unfaithful": 0, "faithful": 1},
)
```
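
The relevance and completeness judges follow the same pattern. A sketch of a relevance evaluator built the same way (the `{{input}}` variable name and template wording are assumptions):

```python
RELEVANCE_TEMPLATE = """Given the question and answer, does the answer address the question?

<question>{{input}}</question>
<answer>{{output}}</answer>

"relevant" = answer addresses the question
"irrelevant" = answer is off-topic or evasive

Answer (relevant/irrelevant):"""

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template=RELEVANCE_TEMPLATE,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"irrelevant": 0, "relevant": 1},
)
```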

## RAG Failure Taxonomy

Common failure modes to evaluate:

```yaml
retrieval_failures:
  - no_relevant_docs: Query returns unrelated content
  - partial_retrieval: Some relevant docs missed
  - wrong_chunk: Right doc, wrong section

generation_failures:
  - hallucination: Claims not in retrieved context
  - ignored_context: Answer doesn't use retrieved docs
  - incomplete: Missing key information from context
  - wrong_synthesis: Misinterprets or miscombines sources
```
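
These buckets can be assigned mechanically from the scores the evaluators above already produce. A rough triage sketch; the trace fields and thresholds are assumptions, not a fixed schema:

```python
# Map per-trace eval scores onto the taxonomy above (fields are illustrative).
def classify_failure(trace):
    if trace["recall_at_5"] == 0.0:
        return "retrieval_failures/no_relevant_docs"
    if trace["recall_at_5"] < 1.0:
        return "retrieval_failures/partial_retrieval"
    if trace["faithfulness"] == 0:
        return "generation_failures/hallucination"
    if trace["completeness"] == 0:
        return "generation_failures/incomplete"
    return "pass"
```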

## Evaluation Order

1. **Retrieval first** - If the docs are wrong, generation will fail regardless
2. **Faithfulness** - Is the answer grounded in the retrieved context?
3. **Answer quality** - Does the answer address the question?

Fix retrieval problems before debugging generation.
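
One way to enforce that order in a report: condition generation pass rates on retrieval success, so the two failure classes don't blur together. A minimal sketch with assumed trace fields:

```python
# Report per-phase pass rates; generation is only judged where retrieval succeeded.
def phase_report(traces):
    retrieval_ok = [t for t in traces if t["recall_at_5"] == 1.0]
    faithful = [t for t in retrieval_ok if t["faithfulness"] == 1]
    print(f"retrieval:    {len(retrieval_ok)}/{len(traces)} pass")
    if retrieval_ok:
        print(f"faithfulness: {len(faithful)}/{len(retrieval_ok)} pass (retrieval-ok only)")
```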