
Rescoring fails with multi-metric scorers that return dictionary values #2562

@Jannoshh

Environment

  • inspect_ai version: 0.3.133
  • Python version: 3.13
  • OS: macOS

Description

When rescoring logs that were created with a scorer returning dictionary values (multiple metrics), the score() function fails to properly handle the original scores. During rescoring:

  1. Warnings appear: Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5}
  2. The original multi-metric scores are collapsed into a single score named after the scorer, with a default value of 0.0 (a sketch of the underlying coercion follows this list)
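
The warning text suggests a float coercion is applied to each existing Score value before metrics are recomputed; a dict value cannot be reduced to a single float, so it falls back to 0.0. The sketch below is purely illustrative (coerce_to_float is a hypothetical name, not the library's code) and only shows why a value like {'metric_a': 1.0, 'metric_b': 0.5} ends up collapsed:

    import logging

    logger = logging.getLogger(__name__)


    def coerce_to_float(value) -> float:
        """Collapse a Score value to a single float, as single-metric reducers expect."""
        if isinstance(value, (bool, int, float)):
            return float(value)
        if isinstance(value, str):
            try:
                return float(value)
            except ValueError:
                pass
        # A dict value from a multi-metric scorer lands here and is zeroed out
        logger.warning(f"Unable to convert value to float: {value}")
        return 0.0


    print(coerce_to_float({"metric_a": 1.0, "metric_b": 0.5}))  # warns, prints 0.0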

Expected Behavior

Given:

  • An initial log created with toy_scorer_1 that returns dict values for two metrics:

    @scorer(metrics={
        "metric_a": [accuracy(), stderr()],
        "metric_b": [accuracy(), stderr()],
    })
    # Returns: Score(value={"metric_a": 1.0, "metric_b": 0.5}, answer="test")
  • A second scorer toy_scorer_2 with a different metric:

    @scorer(metrics={
        "metric_c": [accuracy(), stderr()],
    })
    # Returns: Score(value={"metric_c": 0.8}, answer="test")

When rescoring with inspect_score(log, toy_scorer_2(), action="append"), the rescored log should (see the verification sketch after this list):

  1. Preserve the original scores: metric_a (accuracy=1.0) and metric_b (accuracy=0.5)
  2. Append the new score: metric_c (accuracy=0.8)
  3. Result in three separate EvalScore entries in log.results.scores
  4. Not produce any warnings
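
A verification sketch of the expected result, using the EvalScore/EvalMetric attributes visible in the output further below (check_expected is an illustrative helper, not part of inspect_ai):

    def check_expected(rescored_log) -> None:
        # Index the EvalScore entries by name
        scores = {s.name: s for s in rescored_log.results.scores}

        # Three separate EvalScore entries
        assert set(scores) == {"metric_a", "metric_b", "metric_c"}

        # Original scores preserved ...
        assert scores["metric_a"].metrics["accuracy"].value == 1.0
        assert scores["metric_b"].metrics["accuracy"].value == 0.5

        # ... and the appended score present
        assert scores["metric_c"].metrics["accuracy"].value == 0.8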

Actual Behavior

When rescoring with inspect_score(log, toy_scorer_2(), action="append"):

  1. Warnings appear: Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5}
  2. The original multi-metric scores (metric_a and metric_b) are collapsed into a single score entry named toy_scorer_1 with accuracy=0.0
  3. The new score metric_c is added correctly with accuracy=0.8
  4. The rescored log ends up with only two EvalScore entries instead of three:
    • metric_c (correct)
    • toy_scorer_1 (wrong - should be metric_a and metric_b separately)

Minimal Reproduction

#!/usr/bin/env python3
"""
Minimal example showing rescoring with dict values.
"""

from pathlib import Path

from inspect_ai import Task, eval as inspect_eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import Score, Scorer, accuracy, scorer, stderr
from inspect_ai.solver import TaskState, generate, solver
from inspect_ai.log import read_eval_log, write_eval_log
from inspect_ai import score as inspect_score


# Simple toy scorer that returns a dict value
@scorer(
    metrics={
        "metric_a": [accuracy(), stderr()],
        "metric_b": [accuracy(), stderr()],
    }
)
def toy_scorer_1() -> Scorer:
    async def score(state: TaskState, target) -> Score:
        # Return dict values
        return Score(
            value={
                "metric_a": 1.0,
                "metric_b": 0.5,
            },
            answer="test_answer_1",
        )

    return score


# Another toy scorer with different metrics
@scorer(
    metrics={
        "metric_c": [accuracy(), stderr()],
    }
)
def toy_scorer_2() -> Scorer:
    async def score(state: TaskState, target) -> Score:
        # Return dict value
        return Score(
            value={
                "metric_c": 0.8,
            },
            answer="test_answer_2",
        )

    return score


# Dummy solver that does nothing
@solver
def dummy_solver():
    async def solve(state: TaskState, generate):
        state.output.completion = "dummy output"
        return state

    return solve


# Create a simple task
@task
def toy_task():
    return Task(
        dataset=[Sample(input="test input", target="test target")],
        solver=dummy_solver(),
        scorer=toy_scorer_1(),
    )


def main():
    # Step 1: Run initial eval with toy_scorer_1
    print("=" * 60)
    print("Step 1: Running initial eval with toy_scorer_1...")
    print("=" * 60)

    log_dir = Path("logs/toy_test")
    log_dir.mkdir(parents=True, exist_ok=True)

    results = inspect_eval(
        tasks=[toy_task()],
        model="mockllm/model",  # Mock model, won't actually call LLM
        log_dir=str(log_dir),
    )

    print(f"\nInitial eval completed.")

    # Get the log file
    log_files = list(log_dir.glob("*.eval"))
    if not log_files:
        print("ERROR: No log files found!")
        return

    log_file = log_files[-1]  # Get most recent
    print(f"\nLog file: {log_file}")

    # Step 2: Read the log
    print("\n" + "=" * 60)
    print("Step 2: Reading log file...")
    print("=" * 60)

    log = read_eval_log(str(log_file))
    print(f"  Original scores: {log.results.scores}")

    # Step 3: Rescore with toy_scorer_2
    print("\n" + "=" * 60)
    print("Step 3: Rescoring with toy_scorer_2...")
    print("=" * 60)

    try:
        rescored_log = inspect_score(log, toy_scorer_2(), action="append")
        print(f"  Rescored successfully!")
        print(f"  New scores: {rescored_log.results.scores}")

        # Save the rescored log
        rescored_path = log_file.parent / f"{log_file.stem}_rescored.eval"
        write_eval_log(rescored_log, str(rescored_path))
        print(f"  Saved rescored log to: {rescored_path}")
    except Exception as e:
        print(f"  ERROR during rescoring: {e}")
        import traceback

        traceback.print_exc()

    print("\n" + "=" * 60)
    print("Done!")
    print("=" * 60)


if __name__ == "__main__":
    main()

Output

============================================================
Step 1: Running initial eval with toy_scorer_1...
============================================================
╭───────────────────────────────────────────────────────────────────────╮
│toy_task (1 sample): mockllm/model                                     │
╰───────────────────────────────────────────────────────────────────────╯
dataset: (samples)

total time:  0:00:00

metric_a         metric_b
accuracy  1.000  accuracy  0.500
stderr    0.000  stderr    0.000

Log: logs/toy_test/2025-10-06T10-38-18+01-00_toy-task_<id>.eval


Initial eval completed.

Log file: logs/toy_test/2025-10-06T10-36-03+01-00_toy-task_<id>.eval

============================================================
Step 2: Reading log file...
============================================================
  Original scores: [EvalScore(name='metric_a', scorer='toy_scorer_1', reducer=None, scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=1.0, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0, params={}, metadata=None)}, metadata=None), EvalScore(name='metric_b', scorer='toy_scorer_1', reducer=None, scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=0.5, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0, params={}, metadata=None)}, metadata=None)]

============================================================
Step 3: Rescoring with toy_scorer_2...
============================================================
[10/06/25 10:38:19] WARNING  Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5}  _metric.py:196
                    WARNING  Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5}  _metric.py:196
  Rescored successfully!
  New scores: [EvalScore(name='metric_c', scorer='toy_scorer_2', reducer='mean', scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=0.8, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0, params={}, metadata=None)}, metadata=None), EvalScore(name='toy_scorer_1', scorer='toy_scorer_1', reducer='mean', scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=0.0, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0.0, params={}, metadata=None)}, metadata=None)]
  Saved rescored log to: logs/toy_test/2025-10-06T10-36-03+01-00_toy-task_<id>_rescored.eval

Analysis

Notice that after rescoring:

  1. The warnings Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5} appear during the rescoring step
  2. toy_scorer_1 should have two scores (metric_a=1.0 and metric_b=0.5) but instead shows as a single collapsed score with accuracy=0.0
  3. The new scorer toy_scorer_2 with metric_c=0.8 is added correctly
  4. The problem appears to be in how score() processes existing logs with multi-metric scorers during rescoring (a sketch of the expected per-metric handling follows this list)
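
A minimal sketch of the split-per-key handling the Expected Behavior section assumes. This is hypothetical: the real fix would live inside inspect_ai's score()/metrics machinery, and split_multi_metric is an illustrative name only.

    from inspect_ai.scorer import Score


    def split_multi_metric(score: Score, scorer_name: str) -> dict[str, float]:
        """Return one float per metric for a dict-valued Score instead of coercing it to 0.0."""
        if isinstance(score.value, dict):
            # Keep each metric separate: {"metric_a": 1.0, "metric_b": 0.5}
            return {name: float(value) for name, value in score.value.items()}
        # Single-valued scores keep the scorer's own name
        return {scorer_name: float(score.value)}


    print(split_multi_metric(Score(value={"metric_a": 1.0, "metric_b": 0.5}), "toy_scorer_1"))
    # -> {'metric_a': 1.0, 'metric_b': 0.5}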

Workaround

The current workaround is to avoid multi-metric scorers entirely: use single-metric scorers declared with metrics=[accuracy(), stderr()] that return simple float values, for example:
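
    # Workaround: one scorer per metric, each returning a plain float value
    from inspect_ai.scorer import Score, Scorer, accuracy, scorer, stderr
    from inspect_ai.solver import TaskState


    @scorer(metrics=[accuracy(), stderr()])
    def metric_a_scorer() -> Scorer:
        async def score(state: TaskState, target) -> Score:
            return Score(value=1.0, answer="test_answer_1")

        return score


    @scorer(metrics=[accuracy(), stderr()])
    def metric_b_scorer() -> Scorer:
        async def score(state: TaskState, target) -> Score:
            return Score(value=0.5, answer="test_answer_1")

        return score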
