
Commit 5df392b

[Benchmark] Add mmmu (#812)
* add mmmu
* add mmmu
* resolve mmmu
* resolve image
* compat dsw
* compat dsw
* compat dsw
* Update evalscope/api/messages/chat_message.py
* Update evalscope/benchmarks/mmmu/mmmu_adapter.py
* Update evalscope/benchmarks/mmmu/mmmu_adapter.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent f12cccd commit 5df392b

File tree

22 files changed: +395 −27 lines changed


docs/en/get_started/supported_dataset/llm.md

Lines changed: 21 additions & 0 deletions
@@ -35,6 +35,7 @@ Below is the list of supported LLM benchmarks. Click on a benchmark name to jump
 | `mmlu` | [MMLU](#mmlu) | `Knowledge`, `MCQ` |
 | `mmlu_pro` | [MMLU-Pro](#mmlu-pro) | `Knowledge`, `MCQ` |
 | `mmlu_redux` | [MMLU-Redux](#mmlu-redux) | `Knowledge`, `MCQ` |
+| `mmmu` | [MMMU](#mmmu) | `Knowledge`, `MultiModal`, `QA` |
 | `musr` | [MuSR](#musr) | `MCQ`, `Reasoning` |
 | `needle_haystack` | [Needle-in-a-Haystack](#needle-in-a-haystack) | `LongContext`, `Retrieval` |
 | `process_bench` | [ProcessBench](#processbench) | `Math`, `Reasoning` |
@@ -840,6 +841,26 @@ Answer the following multiple choice question. The last line of your response sh

 ---

+### MMMU
+
+[Back to Top](#llm-benchmarks)
+- **Dataset Name**: `mmmu`
+- **Dataset ID**: [AI-ModelScope/MMMU](https://modelscope.cn/datasets/AI-ModelScope/MMMU/summary)
+- **Description**:
+  > MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) is a benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.
+- **Task Categories**: `Knowledge`, `MultiModal`, `QA`
+- **Evaluation Metrics**: `acc`
+- **Requires LLM Judge**: No
+- **Default Shots**: 0-shot
+- **Subsets**: `Accounting`, `Agriculture`, `Architecture_and_Engineering`, `Art_Theory`, `Art`, `Basic_Medical_Science`, `Biology`, `Chemistry`, `Clinical_Medicine`, `Computer_Science`, `Design`, `Diagnostics_and_Laboratory_Medicine`, `Economics`, `Electronics`, `Energy_and_Power`, `Finance`, `Geography`, `History`, `Literature`, `Manage`, `Marketing`, `Materials`, `Math`, `Mechanical_Engineering`, `Music`, `Pharmacy`, `Physics`, `Psychology`, `Public_Health`, `Sociology`
+
+- **Prompt Template**:
+```text
+{question}
+```
+
+---
+
 ### MuSR

 [Back to Top](#llm-benchmarks)
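A minimal usage sketch for the benchmark entry added above. It assumes the `TaskConfig`/`run_task` entry points and the `dataset_args`/`subset_list`/`limit` options behave as for other evalscope benchmarks; the model id is a placeholder.

```python
from evalscope import TaskConfig, run_task

# Smoke-test a multimodal model on two MMMU subsets (0-shot by default).
task_cfg = TaskConfig(
    model='qwen-vl-plus',  # placeholder: any multimodal model identifier/endpoint
    datasets=['mmmu'],     # dataset name registered by this commit
    dataset_args={'mmmu': {'subset_list': ['Math', 'Art']}},  # optional subset filter
    limit=5,               # evaluate only a few samples per subset
)
run_task(task_cfg=task_cfg)
```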

docs/zh/get_started/supported_dataset/llm.md

Lines changed: 21 additions & 0 deletions
@@ -35,6 +35,7 @@
 | `mmlu` | [MMLU](#mmlu) | `Knowledge`, `MCQ` |
 | `mmlu_pro` | [MMLU-Pro](#mmlu-pro) | `Knowledge`, `MCQ` |
 | `mmlu_redux` | [MMLU-Redux](#mmlu-redux) | `Knowledge`, `MCQ` |
+| `mmmu` | [MMMU](#mmmu) | `Knowledge`, `MultiModal`, `QA` |
 | `musr` | [MuSR](#musr) | `MCQ`, `Reasoning` |
 | `needle_haystack` | [Needle-in-a-Haystack](#needle-in-a-haystack) | `LongContext`, `Retrieval` |
 | `process_bench` | [ProcessBench](#processbench) | `Math`, `Reasoning` |
@@ -840,6 +841,26 @@ Answer the following multiple choice question. The last line of your response sh

 ---

+### MMMU
+
+[返回目录](#llm评测集)
+- **数据集名称**: `mmmu`
+- **数据集ID**: [AI-ModelScope/MMMU](https://modelscope.cn/datasets/AI-ModelScope/MMMU/summary)
+- **数据集描述**:
+  > MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) is a benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.
+- **任务类别**: `Knowledge`, `MultiModal`, `QA`
+- **评估指标**: `acc`
+- **需要LLM Judge**: 否
+- **默认提示方式**: 0-shot
+- **数据集子集**: `Accounting`, `Agriculture`, `Architecture_and_Engineering`, `Art_Theory`, `Art`, `Basic_Medical_Science`, `Biology`, `Chemistry`, `Clinical_Medicine`, `Computer_Science`, `Design`, `Diagnostics_and_Laboratory_Medicine`, `Economics`, `Electronics`, `Energy_and_Power`, `Finance`, `Geography`, `History`, `Literature`, `Manage`, `Marketing`, `Materials`, `Math`, `Mechanical_Engineering`, `Music`, `Pharmacy`, `Physics`, `Psychology`, `Public_Health`, `Sociology`
+
+- **提示模板**:
+```text
+{question}
+```
+
+---
+
 ### MuSR

 [返回目录](#llm评测集)

evalscope/api/dataset/dataset.py

Lines changed: 6 additions & 3 deletions
@@ -5,9 +5,8 @@
 from pydantic import BaseModel, Field
 from typing import Any, Callable, Dict, Iterator, List, Optional, Sequence, Union

-from evalscope.api.messages import ChatMessage, messages_pretty_str
+from evalscope.api.messages import ChatMessage, messages_to_markdown
 from evalscope.api.tool import ToolInfo
-from evalscope.utils.multi_choices import answer_character, answer_index


 class Sample(BaseModel):
@@ -51,7 +50,7 @@ def pretty_print(self) -> str:
         if isinstance(self.input, str):
             input_text = self.input
         else:
-            input_text = messages_pretty_str(self.input)
+            input_text = messages_to_markdown(self.input, max_length=50)
         return f'Sample ID: {self.id}\nInput: {input_text}\nTarget: {self.target}'


@@ -227,6 +226,8 @@ def shuffle(self, seed: Optional[int] = None) -> None:
         self._shuffled = True

     def shuffle_choices(self, seed: Optional[int] = None) -> None:
+        from evalscope.utils.multi_choices import answer_character
+
         rand = random.Random(seed)
         for sample in self.samples:
             if not sample.choices:
@@ -246,6 +247,8 @@ def shuffle_choices(self, seed: Optional[int] = None) -> None:
                 sample.target = self._remap_target(sample.target, position_map=position_map)

     def _remap_target(self, target: Union[str, List[str]], position_map: Dict[int, str]) -> Union[str, List[str]]:
+        from evalscope.utils.multi_choices import answer_index
+
         if isinstance(target, list):
             return [position_map[answer_index(t)] for t in target]
         else:
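The imports are presumably deferred into the method bodies to avoid an import cycle with `evalscope.utils.multi_choices`. Below is an illustrative sketch of the remapping they support; the helper behaviour shown (`answer_character(0) == 'A'`, `answer_index('A') == 0`) is assumed from the usage above, and the values are made up.

```python
# Illustration only: how a shuffled choice order yields a position_map and
# how the original target letter is remapped to its new letter.
def answer_character(index: int) -> str:   # assumed: 0 -> 'A', 1 -> 'B', ...
    return chr(ord('A') + index)

def answer_index(char: str) -> int:        # assumed inverse: 'A' -> 0, 'B' -> 1, ...
    return ord(char.upper()) - ord('A')

choices = ['Paris', 'London', 'Rome']      # original target: 'A' (Paris)
shuffled = [2, 0, 1]                       # new order: Rome, Paris, London

# original position -> letter of its new position
position_map = {orig: answer_character(new) for new, orig in enumerate(shuffled)}
# position_map == {2: 'A', 0: 'B', 1: 'C'}

new_target = position_map[answer_index('A')]   # 'B': Paris now sits at choice B
```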

evalscope/api/evaluator/cache.py

Lines changed: 24 additions & 1 deletion
@@ -299,6 +299,15 @@ def to_task_state(self, dataset: Dataset) -> TaskState:
             completed=True,  # Mark as completed since it was cached
         )

+    def pretty_print(self) -> str:
+        """
+        Generate a pretty-printed string representation of the model result.
+
+        Returns:
+            A string representation of the model result
+        """
+        return self.model_dump_json(indent=2)
+

 class ReviewResult(BaseModel):
     """
@@ -340,7 +349,7 @@ def from_score_state(

         return cls(
             index=state.sample_id,
-            input=state.input_text,
+            input=state.input_markdown,
             target=state.target,
             sample_score=sample_score,
         )
@@ -353,3 +362,17 @@ def to_sample_score(self) -> SampleScore:
             The sample score object
         """
         return self.sample_score
+
+    def pretty_print(self) -> str:
+        """
+        Generate a pretty-printed string representation of the review result.
+
+        Returns:
+            A string representation of the review result
+        """
+        output = [
+            f'Review Result for Sample {self.index}:',
+            f'Target: {self.target}',
+            f'Score: {self.sample_score.model_dump_json(indent=2)}',
+        ]
+        return '\n'.join(output)

evalscope/api/evaluator/state.py

Lines changed: 12 additions & 1 deletion
@@ -3,7 +3,7 @@
 from typing import Any, Dict, List, Optional, Sequence, Union, overload

 from evalscope.api.dataset import Sample
-from evalscope.api.messages import ChatMessage, ChatMessageUser, messages_pretty_str
+from evalscope.api.messages import ChatMessage, ChatMessageUser, messages_pretty_str, messages_to_markdown
 from evalscope.api.model import ModelOutput


@@ -188,6 +188,17 @@ def input_text(self) -> str:
         else:
             return messages_pretty_str(self._input)

+    @property
+    def input_markdown(self) -> str:
+        """Get the input text as markdown.
+
+        For multi-modal content, images will be represented in markdown format.
+        """
+        if isinstance(self._input, str):
+            return self._input
+        else:
+            return messages_to_markdown(self._input)
+
     @property
     def choices(self) -> Choices:
         """Choices for the sample, if applicable."""

evalscope/api/messages/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@
     ChatMessageUser,
     dict_to_chat_message,
     messages_pretty_str,
+    messages_to_markdown,
 )
 from .content import Content, ContentAudio, ContentData, ContentImage, ContentReasoning, ContentText, ContentVideo
 from .utils import parse_content_with_reasoning

evalscope/api/messages/chat_message.py

Lines changed: 47 additions & 2 deletions
@@ -3,7 +3,7 @@
 from typing import Any, Dict, List, Literal, Optional, Type, Union

 from evalscope.api.tool import ToolCall, ToolCallError
-from .content import Content, ContentReasoning, ContentText
+from .content import Content, ContentImage, ContentReasoning, ContentText
 from .utils import parse_content_with_reasoning


@@ -184,7 +184,7 @@ def dict_to_chat_message(data: Dict[str, Any]) -> ChatMessage:


 def messages_pretty_str(messages: List[ChatMessage]) -> str:
-    """Pretty print a list of chat messages."""
+    """Pretty print a list of chat messages, without images or other multi-modal content."""
     output = []
     for message in messages:
         role = message.role.capitalize()
@@ -196,3 +196,48 @@ def messages_pretty_str(messages: List[ChatMessage]) -> str:
             content += f'\nFunction: {message.function}'
         output.append(f'**{role}**: {content}')
     return '\n\n'.join(output)
+
+
+def messages_to_markdown(messages: List[ChatMessage], max_length: Optional[int] = None) -> str:
+    """Convert a list of chat messages to markdown format.
+
+    Args:
+        messages (List[ChatMessage]): The list of chat messages to convert.
+        max_length (Optional[int]): If provided, truncates the base64 string of images to this length.
+    """
+    output = []
+    for message in messages:
+        role = message.role.capitalize()
+
+        # Start with role header
+        content_parts = [f'**{role}**: ']
+
+        # Handle content based on type
+        if isinstance(message.content, str):
+            content_parts.append(message.content)
+        else:
+            for content_item in message.content:
+                if isinstance(content_item, ContentText):
+                    content_parts.append(content_item.text)
+                elif isinstance(content_item, ContentImage):
+                    # Use markdown image syntax
+                    image_base64 = content_item.image
+                    if max_length and len(image_base64) > max_length:
+                        image_base64 = image_base64[:max_length]
+                    content_parts.append(f'![image]({image_base64})')
+                elif isinstance(content_item, ContentReasoning):
+                    content_parts.append(f'**Reasoning:** {content_item.reasoning}')
+
+        # Add tool-specific information
+        if isinstance(message, ChatMessageTool):
+            if message.error:
+                content_parts.append(f'**Error:** {message.error.message}')
+            if message.function:
+                content_parts.append(f'**Function:** {message.function}')
+        elif isinstance(message, ChatMessageAssistant) and message.tool_calls:
+            for tool_call in message.tool_calls:
+                content_parts.append(f'**Tool Call:** {tool_call.function}')
+
+        output.append('\n'.join(content_parts))
+
+    return '\n\n'.join(output)
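A quick usage sketch of the new helper. The constructors below (`ChatMessageUser(content=...)`, `ContentText(text=...)`, `ContentImage(image=...)`) follow the field names visible in this diff but are otherwise assumptions; the base64 payload is a placeholder.

```python
from evalscope.api.messages import ChatMessageUser, ContentImage, ContentText, messages_to_markdown

messages = [
    ChatMessageUser(content=[
        ContentText(text='What does this chart show?'),
        ContentImage(image='data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...'),  # placeholder
    ])
]

# Truncating long base64 payloads keeps the rendered markdown readable,
# which is how Sample.pretty_print() calls it (max_length=50).
markdown = messages_to_markdown(messages, max_length=50)
print(markdown)  # '**User**: ' header, the text, then an ![image](...) entry
```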

evalscope/app/app.py

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,7 @@
 from evalscope.utils.logger import configure_logging
 from .arguments import add_argument
 from .ui import create_app_ui
+from .utils.env_utils import setup_env


 def create_app(args: argparse.Namespace):
@@ -17,6 +18,8 @@ def create_app(args: argparse.Namespace):
     """
     configure_logging(debug=args.debug)

+    setup_env(args)
+
     demo = create_app_ui(args)

     demo.launch(

evalscope/app/ui/single_model.py

Lines changed: 3 additions & 3 deletions
@@ -198,9 +198,9 @@ def update_table_components(filtered_df, page_number, score_threshold):

         # Process the data for display
         input_md = row['Input'] + '\n\n' + process_model_prediction(row['Metadata'])
-        generated_md = process_model_prediction(row['Generated'])
-        gold_md = convert_markdown_image(process_model_prediction(row['Gold']))
-        pred_md = convert_markdown_image(process_model_prediction(row['Pred']))
+        generated_md = convert_markdown_image(row['Generated'])
+        gold_md = convert_markdown_image(row['Gold'])
+        pred_md = process_model_prediction(row['Pred'])
         score_md = process_json_content(row['Score'])
         nscore_val = float(row['NScore']) if not pd.isna(row['NScore']) else 0.0

evalscope/app/utils/data_utils.py

Lines changed: 2 additions & 6 deletions
@@ -163,18 +163,14 @@ def get_model_prediction(work_dir: str, model_name: str, dataset_name: str, subs
         metadata = sample_score.sample_metadata
         prediction = score.prediction
         target = review_result.target
-        # TODO: Need a more robust way to determine target
-        if not target:
-            # Put input_image as target if not available for image generation
-            target = metadata.get('input_image', '')
         extracted_prediction = score.extracted_prediction
         raw_d = {
             'Index': str(review_result.index),
             'Input': review_result.input.replace('\n', '\n\n'),  # for markdown
             'Metadata': metadata,
-            'Generated': prediction if prediction != extracted_prediction else '*Same as Pred*',
+            'Generated': prediction,
             'Gold': target,
-            'Pred': extracted_prediction,
+            'Pred': extracted_prediction if extracted_prediction != prediction else '*Same as Generated*',
             'Score': score.model_dump(exclude_none=True),
             'NScore': normalize_score(score.main_value)
         }
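In effect, the review table now always shows the raw model output under `Generated` and collapses `Pred` when extraction changed nothing. A tiny sketch with made-up values:

```python
prediction = 'The answer is (B).'   # raw model output
extracted_prediction = 'B'          # what the answer extractor pulled out

generated = prediction
pred = extracted_prediction if extracted_prediction != prediction else '*Same as Generated*'
# generated == 'The answer is (B).', pred == 'B'
```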
