mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-11 18:55:55 +00:00
# Validating Evaluators (Python)

Validate LLM evaluators against human-labeled examples. Target >80% TPR/TNR/Accuracy.

## Calculate Metrics
```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(human_labels, evaluator_predictions))

cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
print(f"TPR: {tpr:.2f}, TNR: {tnr:.2f}")
```
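A minimal end-to-end sketch of the metrics step, using toy labels (the arrays here are illustrative, not real data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy data: 1 = pass, 0 = fail
human_labels = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
evaluator_predictions = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# With labels sorted as [0, 1], ravel() yields tn, fp, fn, tp
cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)  # 5 of 6 human-pass examples caught -> 0.83
tnr = tn / (tn + fp)  # 3 of 4 human-fail examples caught -> 0.75
print(f"TPR: {tpr:.2f}, TNR: {tnr:.2f}")
```

Here the evaluator's TNR (0.75) already falls short of the 80% target, so its "fail" verdicts would need scrutiny before trusting it in production.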
## Correct Production Estimates

```python
def correct_estimate(observed, tpr, tnr):
    """Adjust observed pass rate using known TPR/TNR."""
    return (observed - (1 - tnr)) / (tpr - (1 - tnr))
```
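For example, with illustrative numbers: if the evaluator reports an 80% pass rate but its validated TPR is 0.90 and TNR is 0.85, the corrected estimate works out lower than the raw reading:

```python
def correct_estimate(observed, tpr, tnr):
    """Adjust observed pass rate using known TPR/TNR."""
    return (observed - (1 - tnr)) / (tpr - (1 - tnr))

# (0.80 - 0.15) / (0.90 - 0.15) = 0.65 / 0.75
print(f"{correct_estimate(0.80, tpr=0.90, tnr=0.85):.2f}")  # → 0.87
```

Note the denominator goes to zero when TPR equals the false-positive rate (1 − TNR), i.e. when the evaluator is no better than chance; the correction is only meaningful for evaluators that have already passed validation.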
## Find Misclassified

```python
# False Positives: evaluator pass, human fail
fp_mask = (evaluator_predictions == 1) & (human_labels == 0)
false_positives = dataset[fp_mask]

# False Negatives: evaluator fail, human pass
fn_mask = (evaluator_predictions == 0) & (human_labels == 1)
false_negatives = dataset[fn_mask]
```
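The masks above assume `dataset` supports boolean indexing; a small sketch with a hypothetical pandas DataFrame (column name `output` is an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical evaluation set: each row is one model response
dataset = pd.DataFrame({"output": ["a", "b", "c", "d"]})
human_labels = np.array([1, 0, 1, 0])
evaluator_predictions = np.array([1, 1, 0, 0])

fp_mask = (evaluator_predictions == 1) & (human_labels == 0)
fn_mask = (evaluator_predictions == 0) & (human_labels == 1)

print(dataset[fp_mask])  # row "b": evaluator passed, human failed
print(dataset[fn_mask])  # row "c": evaluator failed, human passed
```

Reading the disagreeing examples side by side is usually the fastest way to find a flaw in the evaluator's prompt or rubric.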
## Red Flags

- TPR or TNR < 70%
- Large gap between TPR and TNR (the evaluator is biased toward one verdict)
- Cohen's kappa < 0.6
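The kappa check can be run with scikit-learn's `cohen_kappa_score`; a sketch with toy labels (illustrative data):

```python
from sklearn.metrics import cohen_kappa_score

human_labels = [1, 1, 0, 1, 0, 0, 1, 0]
evaluator_predictions = [1, 1, 0, 0, 0, 1, 1, 0]

# Observed agreement is 6/8 = 0.75; chance agreement here is 0.5,
# so kappa = (0.75 - 0.5) / (1 - 0.5) = 0.50 — below the 0.6 threshold
kappa = cohen_kappa_score(human_labels, evaluator_predictions)
print(f"Kappa: {kappa:.2f}")  # → Kappa: 0.50
```

Kappa matters because raw agreement is inflated whenever one class dominates; an evaluator that always says "pass" on a 90%-pass dataset agrees 90% of the time but has kappa near zero.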