# Validating Evaluators (Python)
Validate LLM evaluators against human-labeled examples. Target >80% TPR/TNR/Accuracy.
## Calculate Metrics
```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(human_labels, evaluator_predictions))

cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)  # true positive rate (sensitivity)
tnr = tn / (tn + fp)  # true negative rate (specificity)
print(f"TPR: {tpr:.2f}, TNR: {tnr:.2f}")
```
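As a self-contained check of the arithmetic above, here is a minimal sketch using hypothetical labels (1 = pass, 0 = fail); the arrays and values are illustrative, not from a real evaluation:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = pass, 0 = fail (illustrative data only).
human_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
evaluator_predictions = np.array([1, 1, 1, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(human_labels, evaluator_predictions).ravel()
tpr = tp / (tp + fn)  # 3 / (3 + 1) = 0.75
tnr = tn / (tn + fp)  # 3 / (3 + 1) = 0.75
```

Both rates sit below the >80% target, so this hypothetical evaluator would need revision before use.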
## Correct Production Estimates
```python
def correct_estimate(observed, tpr, tnr):
    """Adjust observed pass rate using known TPR/TNR."""
    return (observed - (1 - tnr)) / (tpr - (1 - tnr))
```
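This formula matches the Rogan-Gladen correction from epidemiology. With illustrative numbers (assumed here, not measured), an observed 70% pass rate under TPR = 0.90 and TNR = 0.80 corrects to roughly 71.4%:

```python
# Illustrative values: observed pass rate 0.70, TPR 0.90, TNR 0.80.
observed, tpr, tnr = 0.70, 0.90, 0.80
corrected = (observed - (1 - tnr)) / (tpr - (1 - tnr))
# (0.70 - 0.20) / (0.90 - 0.20) = 0.50 / 0.70, roughly 0.714
```

Note the denominator goes to zero when TPR + TNR = 1 (an evaluator no better than chance), at which point no correction can recover the true rate.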
## Find Misclassified
```python
# False positives: evaluator pass, human fail
fp_mask = (evaluator_predictions == 1) & (human_labels == 0)
false_positives = dataset[fp_mask]

# False negatives: evaluator fail, human pass
fn_mask = (evaluator_predictions == 0) & (human_labels == 1)
false_negatives = dataset[fn_mask]
```
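A minimal end-to-end sketch of the masking above, assuming `dataset` is a pandas DataFrame aligned row-for-row with the label arrays (the names and data here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset, aligned row-for-row with the label arrays below.
dataset = pd.DataFrame({"prompt": ["a", "b", "c", "d"]})
human_labels = np.array([1, 0, 1, 0])
evaluator_predictions = np.array([1, 1, 0, 0])

# Boolean arrays index the DataFrame rows directly.
false_positives = dataset[(evaluator_predictions == 1) & (human_labels == 0)]  # row "b"
false_negatives = dataset[(evaluator_predictions == 0) & (human_labels == 1)]  # row "c"
```

Reviewing these rows by hand is the fastest way to see whether the evaluator's prompt or the rubric needs fixing.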
## Red Flags
- TPR or TNR < 70%
- Large gap between TPR and TNR
- Cohen's kappa < 0.6
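The kappa threshold refers to Cohen's kappa, which corrects raw agreement for chance. A minimal sketch with hypothetical labels: raw agreement here is 75%, yet kappa is only 0.5, low enough to trip the red flag above:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels (1 = pass, 0 = fail): 6 of 8 agree (75% raw agreement).
human_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
evaluator_predictions = np.array([1, 1, 1, 0, 0, 0, 1, 0])

kappa = cohen_kappa_score(human_labels, evaluator_predictions)  # 0.5, below the 0.6 bar
```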