# Batch Evaluation with evaluate_dataframe (Python)

Run evaluators across a DataFrame. The core 2.0 batch evaluation API.

## Preferred: async_evaluate_dataframe

For batch evaluations (especially with LLM evaluators), prefer the async version for better throughput:

```python
from phoenix.evals import async_evaluate_dataframe

results_df = await async_evaluate_dataframe(
    dataframe=df,               # pandas DataFrame with columns matching evaluator params
    evaluators=[eval1, eval2],  # List of evaluators
    concurrency=5,              # Max concurrent LLM calls (default 3)
    exit_on_error=False,        # Optional: stop on first error (default True)
    max_retries=3,              # Optional: retry failed LLM calls (default 10)
)
```

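The `await` call above assumes an async context (for example, a notebook). In a plain script you can drive it with the standard library's `asyncio.run`; a minimal sketch, with the wrapper name being illustrative:

```python
import asyncio

from phoenix.evals import async_evaluate_dataframe


def evaluate_sync(df, evaluators):
    # Illustrative wrapper: run the async batch evaluation from synchronous code.
    return asyncio.run(
        async_evaluate_dataframe(dataframe=df, evaluators=evaluators, concurrency=5)
    )
```
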
## Sync Version

```python
from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(
    dataframe=df,               # pandas DataFrame with columns matching evaluator params
    evaluators=[eval1, eval2],  # List of evaluators
    exit_on_error=False,        # Optional: stop on first error (default True)
    max_retries=3,              # Optional: retry failed LLM calls (default 10)
)
```

## Result Column Format

Both `async_evaluate_dataframe` and `evaluate_dataframe` return a copy of the input DataFrame with added columns.
**Result columns contain dicts, NOT raw numbers.**

For each evaluator named `"foo"`, two columns are added:

| Column | Type | Contents |
| ------ | ---- | -------- |
| `foo_score` | `dict` | `{"name": "foo", "score": 1.0, "label": "True", "explanation": "...", "metadata": {...}, "kind": "code", "direction": "maximize"}` |
| `foo_execution_details` | `dict` | `{"status": "success", "exceptions": [], "execution_seconds": 0.001}` |

Only non-None fields appear in the score dict.

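A quick way to see the format is to inspect a single row, and the `*_execution_details` column can be used to find rows where the evaluator call itself failed. A short sketch, assuming an evaluator named `relevance`:

```python
# Peek at one result dict (fields as listed in the table above).
print(results_df["relevance_score"].iloc[0])

# Rows whose evaluator call did not succeed, per the execution details.
errored = results_df[
    results_df["relevance_execution_details"].apply(
        lambda d: isinstance(d, dict) and d.get("status") != "success"
    )
]
```
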
### Extracting Numeric Scores

```python
# WRONG — these will fail or produce unexpected results
score = results_df["relevance"].mean()        # KeyError!
score = results_df["relevance_score"].mean()  # Tries to average dicts!

# RIGHT — extract the numeric score from each dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
mean_score = scores.mean()
```

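The same extraction pattern scales to any number of evaluators; a short sketch assuming evaluators named `relevance` and `correctness`:

```python
evaluator_names = ["relevance", "correctness"]  # assumed evaluator names
mean_scores = {
    name: results_df[f"{name}_score"]
    .apply(lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0)
    .mean()
    for name in evaluator_names
}
print(mean_scores)
```
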
### Extracting Labels

```python
labels = results_df["relevance_score"].apply(
    lambda x: x.get("label", "") if isinstance(x, dict) else ""
)
```

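The result is an ordinary pandas Series, so the usual aggregations apply, for example a label distribution:

```python
print(labels.value_counts())  # e.g. how many rows were labeled "True" vs "False"
```
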
### Extracting Explanations (LLM evaluators)

```python
explanations = results_df["relevance_score"].apply(
    lambda x: x.get("explanation", "") if isinstance(x, dict) else ""
)
```

### Finding Failures

```python
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
failed_mask = scores < 0.5
failures = results_df[failed_mask]
```

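For LLM evaluators it is often useful to read the explanations behind the failures; a small sketch building on `failures` from above:

```python
# Pair each failing row's score with its explanation for manual review.
failure_report = failures["relevance_score"].apply(
    lambda x: (x.get("score"), x.get("explanation", "")) if isinstance(x, dict) else (None, "")
)
print(failure_report.head())
```
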
## Input Mapping

Evaluators receive each row as a dict. Column names must match the evaluator's
expected parameter names. If they don't match, use `.bind()` or `bind_evaluator`:

```python
from phoenix.evals import bind_evaluator, create_evaluator, async_evaluate_dataframe

@create_evaluator(name="check", kind="code")
def check(response: str) -> bool:
    return len(response.strip()) > 0

# Option 1: Use the .bind() method on the evaluator, then pass the bound evaluator
bound = check.bind(input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[bound])

# Option 2: Use the bind_evaluator function
bound = bind_evaluator(evaluator=check, input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[bound])
```

Or simply rename columns to match:

```python
df = df.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output",
})
```

## DO NOT use run_evals

```python
# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1])
# Returns List[DataFrame] — one per evaluator

# RIGHT — current 2.0 API
from phoenix.evals import async_evaluate_dataframe
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
```

Key differences:

- `run_evals` returns a **list** of DataFrames (one per evaluator)
- `async_evaluate_dataframe` returns a **single** DataFrame with all results merged
- `async_evaluate_dataframe` uses the `{name}_score` dict column format
- `async_evaluate_dataframe` uses `bind_evaluator` for input mapping (not an `input_mapping=` param)