Environment
- inspect_ai version: 0.3.133
- Python version: 3.13
- OS: macOS
Description
When rescoring logs that were created with a scorer returning dictionary values (multiple metrics), the score() function fails to properly handle the original scores. During rescoring:
- Warnings appear: Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5}
- The original multi-metric scores get collapsed into a single score with the scorer name and a default value of 0.0
Expected Behavior
Given:
- An initial log created with toy_scorer_1, which returns dict values for two metrics:
  @scorer(metrics={
      "metric_a": [accuracy(), stderr()],
      "metric_b": [accuracy(), stderr()],
  })
  # Returns: Score(value={"metric_a": 1.0, "metric_b": 0.5}, answer="test")
- A second scorer toy_scorer_2 with a different metric:
  @scorer(metrics={
      "metric_c": [accuracy(), stderr()],
  })
  # Returns: Score(value={"metric_c": 0.8}, answer="test")
When rescoring with inspect_score(log, toy_scorer_2(), action="append"), the rescored log should:
- Preserve the original scores: metric_a (accuracy=1.0) and metric_b (accuracy=0.5)
- Append the new score: metric_c (accuracy=0.8)
- Result in three separate EvalScore entries in log.results.scores (sketched below)
- Not produce any warnings
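For concreteness, a sketch of how that expected outcome could be checked against the rescored log. This reuses rescored_log from the reproduction script further down; the field names (name, scorer, metrics, value) are the ones visible in the EvalScore reprs in the Output section.

# Expected (name, scorer, accuracy) triples after rescoring with action="append".
expected = {
    ("metric_a", "toy_scorer_1", 1.0),
    ("metric_b", "toy_scorer_1", 0.5),
    ("metric_c", "toy_scorer_2", 0.8),
}
actual = {
    (s.name, s.scorer, s.metrics["accuracy"].value)
    for s in rescored_log.results.scores
}
assert actual == expected, f"unexpected scores: {actual}"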
Actual Behavior
When rescoring with inspect_score(log, toy_scorer_2(), action="append"):
- Warnings appear: Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5}
- The original multi-metric scores (metric_a and metric_b) are collapsed into a single score entry named toy_scorer_1 with accuracy=0.0
- The new score metric_c is added correctly with accuracy=0.8
- The rescored log ends up with only two EvalScore entries instead of three: metric_c (correct) and toy_scorer_1 (wrong - should be metric_a and metric_b separately)
Minimal Reproduction
#!/usr/bin/env python3
"""
Minimal example showing rescoring with dict values.
"""
from pathlib import Path

from inspect_ai import Task, eval as inspect_eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import Score, Scorer, accuracy, scorer, stderr
from inspect_ai.solver import TaskState, generate, solver
from inspect_ai.log import read_eval_log, write_eval_log
from inspect_ai import score as inspect_score


# Simple toy scorer that returns a dict value
@scorer(
    metrics={
        "metric_a": [accuracy(), stderr()],
        "metric_b": [accuracy(), stderr()],
    }
)
def toy_scorer_1() -> Scorer:
    async def score(state: TaskState, target) -> Score:
        # Return dict values
        return Score(
            value={
                "metric_a": 1.0,
                "metric_b": 0.5,
            },
            answer="test_answer_1",
        )

    return score


# Another toy scorer with different metrics
@scorer(
    metrics={
        "metric_c": [accuracy(), stderr()],
    }
)
def toy_scorer_2() -> Scorer:
    async def score(state: TaskState, target) -> Score:
        # Return dict value
        return Score(
            value={
                "metric_c": 0.8,
            },
            answer="test_answer_2",
        )

    return score


# Dummy solver that does nothing
@solver
def dummy_solver():
    async def solve(state: TaskState, generate):
        state.output.completion = "dummy output"
        return state

    return solve


# Create a simple task
@task
def toy_task():
    return Task(
        dataset=[Sample(input="test input", target="test target")],
        solver=dummy_solver(),
        scorer=toy_scorer_1(),
    )


def main():
    # Step 1: Run initial eval with toy_scorer_1
    print("=" * 60)
    print("Step 1: Running initial eval with toy_scorer_1...")
    print("=" * 60)
    log_dir = Path("logs/toy_test")
    log_dir.mkdir(parents=True, exist_ok=True)
    results = inspect_eval(
        tasks=[toy_task()],
        model="mockllm/model",  # Mock model, won't actually call an LLM
        log_dir=str(log_dir),
    )
    print("\nInitial eval completed.")

    # Get the log file
    log_files = list(log_dir.glob("*.eval"))
    if not log_files:
        print("ERROR: No log files found!")
        return
    log_file = log_files[-1]  # Get most recent
    print(f"\nLog file: {log_file}")

    # Step 2: Read the log
    print("\n" + "=" * 60)
    print("Step 2: Reading log file...")
    print("=" * 60)
    log = read_eval_log(str(log_file))
    print(f"  Original scores: {log.results.scores}")

    # Step 3: Rescore with toy_scorer_2
    print("\n" + "=" * 60)
    print("Step 3: Rescoring with toy_scorer_2...")
    print("=" * 60)
    try:
        rescored_log = inspect_score(log, toy_scorer_2(), action="append")
        print("  Rescored successfully!")
        print(f"  New scores: {rescored_log.results.scores}")

        # Save the rescored log
        rescored_path = log_file.parent / f"{log_file.stem}_rescored.eval"
        write_eval_log(rescored_log, str(rescored_path))
        print(f"  Saved rescored log to: {rescored_path}")
    except Exception as e:
        print(f"  ERROR during rescoring: {e}")
        import traceback

        traceback.print_exc()

    print("\n" + "=" * 60)
    print("Done!")
    print("=" * 60)


if __name__ == "__main__":
    main()
Output
============================================================
Step 1: Running initial eval with toy_scorer_1...
============================================================
╭───────────────────────────────────────────────────────────────────────╮
│toy_task (1 sample): mockllm/model │
╰───────────────────────────────────────────────────────────────────────╯
dataset: (samples)
total time: 0:00:00
metric_a metric_b
accuracy 1.000 accuracy 0.500
stderr 0.000 stderr 0.000
Log: logs/toy_test/2025-10-06T10-38-18+01-00_toy-task_<id>.eval
Initial eval completed.
Log file: logs/toy_test/2025-10-06T10-36-03+01-00_toy-task_<id>.eval
============================================================
Step 2: Reading log file...
============================================================
Original scores: [EvalScore(name='metric_a', scorer='toy_scorer_1', reducer=None, scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=1.0, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0, params={}, metadata=None)}, metadata=None), EvalScore(name='metric_b', scorer='toy_scorer_1', reducer=None, scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=0.5, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0, params={}, metadata=None)}, metadata=None)]
============================================================
Step 3: Rescoring with toy_scorer_2...
============================================================
[10/06/25 10:38:19] WARNING Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5} _metric.py:196
WARNING Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5} _metric.py:196
Rescored successfully!
New scores: [EvalScore(name='metric_c', scorer='toy_scorer_2', reducer='mean', scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=0.8, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0, params={}, metadata=None)}, metadata=None), EvalScore(name='toy_scorer_1', scorer='toy_scorer_1', reducer='mean', scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=0.0, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0.0, params={}, metadata=None)}, metadata=None)]
Saved rescored log to: logs/toy_test/2025-10-06T10-36-03+01-00_toy-task_<id>_rescored.eval
Analysis
Notice that after rescoring:
- The warnings Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5} appear during the rescoring step
- toy_scorer_1 should have two scores (metric_a=1.0 and metric_b=0.5) but instead shows as a single collapsed score with accuracy=0.0
- The new scorer toy_scorer_2 with metric_c=0.8 is added correctly
- The problem appears to be in how score() processes existing logs with multi-metric scorers during rescoring (see the sketch after this list)
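As a pointer for triage (my reading, not a confirmed root cause): the warning text and the _metric.py:196 location look like the fallback branch of the default value_to_float() converter exported by inspect_ai.scorer, which has no handling for dict values. A minimal sketch, assuming that converter is what the rescoring path applies to the stored Score.value:

from inspect_ai.scorer import value_to_float

# Default Value -> float converter (the one built-in metrics like accuracy() use).
to_float = value_to_float()

print(to_float(1.0))  # 1.0 - plain floats pass through

# A dict value from a multi-metric scorer is not handled: this logs
# "Unable to convert value to float: {...}" and falls back to 0.0, which
# would explain the collapsed toy_scorer_1 entry with accuracy=0.0 above.
print(to_float({"metric_a": 1.0, "metric_b": 0.5}))  # 0.0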
Workaround
Currently I avoid multi-metric scorers entirely and instead use single-metric scorers with the metrics=[accuracy(), stderr()] form and simple float return values (see the sketch below).
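For reference, a sketch of that workaround shape in the same toy setup (toy_scorer_a is a hypothetical name; it mirrors toy_scorer_1 but with a single metric and a plain float value):

from inspect_ai.scorer import Score, Scorer, accuracy, scorer, stderr
from inspect_ai.solver import TaskState


# One scorer per metric, list-style metrics declaration, float Score.value.
@scorer(metrics=[accuracy(), stderr()])
def toy_scorer_a() -> Scorer:
    async def score(state: TaskState, target) -> Score:
        # Plain float instead of a dict, so the float conversion during
        # rescoring and the metric aggregation behave as expected.
        return Score(value=1.0, answer="test_answer_1")

    return score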