
Rescoring fails with multi-metric scorers that return dictionary values #2562

@Jannoshh

Environment

  • inspect_ai version: 0.3.133
  • Python version: 3.13
  • OS: macOS

Description

When rescoring logs that were created with a scorer returning dictionary values (multiple metrics), the score() function fails to properly handle the original scores. During rescoring:

  1. Warnings appear: Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5}
  2. The original multi-metric scores are collapsed into a single score named after the scorer, with a default value of 0.0 (a sketch of the underlying coercion follows this list)
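
The warning text suggests a float coercion is applied to each existing Score value before metrics are recomputed; a dict value cannot be reduced to a single float, so it falls back to 0.0. The sketch below is purely illustrative (coerce_to_float is a hypothetical name, not the library's code) and only shows why a value like {'metric_a': 1.0, 'metric_b': 0.5} ends up collapsed:

    import logging

    logger = logging.getLogger(__name__)


    def coerce_to_float(value) -> float:
        """Collapse a Score value to a single float, as single-metric reducers expect."""
        if isinstance(value, (bool, int, float)):
            return float(value)
        if isinstance(value, str):
            try:
                return float(value)
            except ValueError:
                pass
        # A dict value from a multi-metric scorer lands here and is zeroed out
        logger.warning(f"Unable to convert value to float: {value}")
        return 0.0


    print(coerce_to_float({"metric_a": 1.0, "metric_b": 0.5}))  # warns, prints 0.0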

Expected Behavior

Given:

  • An initial log created with toy_scorer_1 that returns dict values for two metrics:

    @scorer(metrics={
        "metric_a": [accuracy(), stderr()],
        "metric_b": [accuracy(), stderr()],
    })
    # Returns: Score(value={"metric_a": 1.0, "metric_b": 0.5}, answer="test")
  • A second scorer toy_scorer_2 with a different metric:

    @scorer(metrics={
        "metric_c": [accuracy(), stderr()],
    })
    # Returns: Score(value={"metric_c": 0.8}, answer="test")

When rescoring with inspect_score(log, toy_scorer_2(), action="append"), the rescored log should (see the verification sketch after this list):

  1. Preserve the original scores: metric_a (accuracy=1.0) and metric_b (accuracy=0.5)
  2. Append the new score: metric_c (accuracy=0.8)
  3. Result in three separate EvalScore entries in log.results.scores
  4. Not produce any warnings
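
A verification sketch of the expected result, using the EvalScore/EvalMetric attributes visible in the output further below (check_expected is an illustrative helper, not part of inspect_ai):

    def check_expected(rescored_log) -> None:
        # Index the EvalScore entries by name
        scores = {s.name: s for s in rescored_log.results.scores}

        # Three separate EvalScore entries
        assert set(scores) == {"metric_a", "metric_b", "metric_c"}

        # Original scores preserved ...
        assert scores["metric_a"].metrics["accuracy"].value == 1.0
        assert scores["metric_b"].metrics["accuracy"].value == 0.5

        # ... and the appended score present
        assert scores["metric_c"].metrics["accuracy"].value == 0.8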

Actual Behavior

When rescoring with inspect_score(log, toy_scorer_2(), action="append"):

  1. Warnings appear: Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5}
  2. The original multi-metric scores (metric_a and metric_b) are collapsed into a single score entry named toy_scorer_1 with accuracy=0.0
  3. The new score metric_c is added correctly with accuracy=0.8
  4. The rescored log ends up with only two EvalScore entries instead of three:
    • metric_c (correct)
    • toy_scorer_1 (wrong - should be metric_a and metric_b separately)

Minimal Reproduction

#!/usr/bin/env python3
"""
Minimal example showing rescoring with dict values.
"""

from pathlib import Path

from inspect_ai import Task, eval as inspect_eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import Score, Scorer, accuracy, scorer, stderr
from inspect_ai.solver import TaskState, generate, solver
from inspect_ai.log import read_eval_log, write_eval_log
from inspect_ai import score as inspect_score


# Simple toy scorer that returns a dict value
@scorer(
    metrics={
        "metric_a": [accuracy(), stderr()],
        "metric_b": [accuracy(), stderr()],
    }
)
def toy_scorer_1() -> Scorer:
    async def score(state: TaskState, target) -> Score:
        # Return dict values
        return Score(
            value={
                "metric_a": 1.0,
                "metric_b": 0.5,
            },
            answer="test_answer_1",
        )

    return score


# Another toy scorer with different metrics
@scorer(
    metrics={
        "metric_c": [accuracy(), stderr()],
    }
)
def toy_scorer_2() -> Scorer:
    async def score(state: TaskState, target) -> Score:
        # Return dict value
        return Score(
            value={
                "metric_c": 0.8,
            },
            answer="test_answer_2",
        )

    return score


# Dummy solver that does nothing
@solver
def dummy_solver():
    async def solve(state: TaskState, generate):
        state.output.completion = "dummy output"
        return state

    return solve


# Create a simple task
@task
def toy_task():
    return Task(
        dataset=[Sample(input="test input", target="test target")],
        solver=dummy_solver(),
        scorer=toy_scorer_1(),
    )


def main():
    # Step 1: Run initial eval with toy_scorer_1
    print("=" * 60)
    print("Step 1: Running initial eval with toy_scorer_1...")
    print("=" * 60)

    log_dir = Path("logs/toy_test")
    log_dir.mkdir(parents=True, exist_ok=True)

    results = inspect_eval(
        tasks=[toy_task()],
        model="mockllm/model",  # Mock model, won't actually call LLM
        log_dir=str(log_dir),
    )

    print(f"\nInitial eval completed.")

    # Get the log file
    log_files = list(log_dir.glob("*.eval"))
    if not log_files:
        print("ERROR: No log files found!")
        return

    log_file = log_files[-1]  # Get most recent
    print(f"\nLog file: {log_file}")

    # Step 2: Read the log
    print("\n" + "=" * 60)
    print("Step 2: Reading log file...")
    print("=" * 60)

    log = read_eval_log(str(log_file))
    print(f"  Original scores: {log.results.scores}")

    # Step 3: Rescore with toy_scorer_2
    print("\n" + "=" * 60)
    print("Step 3: Rescoring with toy_scorer_2...")
    print("=" * 60)

    try:
        rescored_log = inspect_score(log, toy_scorer_2(), action="append")
        print(f"  Rescored successfully!")
        print(f"  New scores: {rescored_log.results.scores}")

        # Save the rescored log
        rescored_path = log_file.parent / f"{log_file.stem}_rescored.eval"
        write_eval_log(rescored_log, str(rescored_path))
        print(f"  Saved rescored log to: {rescored_path}")
    except Exception as e:
        print(f"  ERROR during rescoring: {e}")
        import traceback

        traceback.print_exc()

    print("\n" + "=" * 60)
    print("Done!")
    print("=" * 60)


if __name__ == "__main__":
    main()

Output

============================================================
Step 1: Running initial eval with toy_scorer_1...
============================================================
╭───────────────────────────────────────────────────────────────────────╮
│toy_task (1 sample): mockllm/model                                     │
╰───────────────────────────────────────────────────────────────────────╯
dataset: (samples)

total time:  0:00:00

metric_a         metric_b
accuracy  1.000  accuracy  0.500
stderr    0.000  stderr    0.000

Log: logs/toy_test/2025-10-06T10-38-18+01-00_toy-task_<id>.eval


Initial eval completed.

Log file: logs/toy_test/2025-10-06T10-36-03+01-00_toy-task_<id>.eval

============================================================
Step 2: Reading log file...
============================================================
  Original scores: [EvalScore(name='metric_a', scorer='toy_scorer_1', reducer=None, scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=1.0, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0, params={}, metadata=None)}, metadata=None), EvalScore(name='metric_b', scorer='toy_scorer_1', reducer=None, scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=0.5, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0, params={}, metadata=None)}, metadata=None)]

============================================================
Step 3: Rescoring with toy_scorer_2...
============================================================
[10/06/25 10:38:19] WARNING  Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5}  _metric.py:196
                    WARNING  Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5}  _metric.py:196
  Rescored successfully!
  New scores: [EvalScore(name='metric_c', scorer='toy_scorer_2', reducer='mean', scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=0.8, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0, params={}, metadata=None)}, metadata=None), EvalScore(name='toy_scorer_1', scorer='toy_scorer_1', reducer='mean', scored_samples=1, unscored_samples=0, params={}, metrics={'accuracy': EvalMetric(name='accuracy', value=0.0, params={}, metadata=None), 'stderr': EvalMetric(name='stderr', value=0.0, params={}, metadata=None)}, metadata=None)]
  Saved rescored log to: logs/toy_test/2025-10-06T10-36-03+01-00_toy-task_<id>_rescored.eval

Analysis

Notice that after rescoring:

  1. The warnings Unable to convert value to float: {'metric_a': 1.0, 'metric_b': 0.5} appear during the rescoring step
  2. toy_scorer_1 should have two scores (metric_a=1.0 and metric_b=0.5) but instead shows as a single collapsed score with accuracy=0.0
  3. The new scorer toy_scorer_2 with metric_c=0.8 is added correctly
  4. The problem appears to be in how score() processes existing logs with multi-metric scorers during rescoring (a sketch of the expected per-metric handling follows this list)
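
A minimal sketch of the split-per-key handling the Expected Behavior section assumes. This is hypothetical: the real fix would live inside inspect_ai's score()/metrics machinery, and split_multi_metric is an illustrative name only.

    from inspect_ai.scorer import Score


    def split_multi_metric(score: Score, scorer_name: str) -> dict[str, float]:
        """Return one float per metric for a dict-valued Score instead of coercing it to 0.0."""
        if isinstance(score.value, dict):
            # Keep each metric separate: {"metric_a": 1.0, "metric_b": 0.5}
            return {name: float(value) for name, value in score.value.items()}
        # Single-valued scores keep the scorer's own name
        return {scorer_name: float(score.value)}


    print(split_multi_metric(Score(value={"metric_a": 1.0, "metric_b": 0.5}), "toy_scorer_1"))
    # -> {'metric_a': 1.0, 'metric_b': 0.5}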

Workaround

The current workaround is to avoid multi-metric scorers entirely: use single-metric scorers declared with metrics=[accuracy(), stderr()] that return simple float values, for example:
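
    # Workaround: one scorer per metric, each returning a plain float value
    from inspect_ai.scorer import Score, Scorer, accuracy, scorer, stderr
    from inspect_ai.solver import TaskState


    @scorer(metrics=[accuracy(), stderr()])
    def metric_a_scorer() -> Scorer:
        async def score(state: TaskState, target) -> Score:
            return Score(value=1.0, answer="test_answer_1")

        return score


    @scorer(metrics=[accuracy(), stderr()])
    def metric_b_scorer() -> Scorer:
        async def score(state: TaskState, target) -> Score:
            return Score(value=0.5, answer="test_answer_1")

        return score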
