
Commit 5df392b

[Benchmark] Add mmmu (#812)
* add mmmu
* add mmmu
* resolve mmmu
* resolve image
* compat dsw
* compat dsw
* compat dsw
* Update evalscope/api/messages/chat_message.py
* Update evalscope/benchmarks/mmmu/mmmu_adapter.py
* Update evalscope/benchmarks/mmmu/mmmu_adapter.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent f12cccd commit 5df392b

File tree

22 files changed: +395 −27 lines changed


docs/en/get_started/supported_dataset/llm.md

Lines changed: 21 additions & 0 deletions
@@ -35,6 +35,7 @@ Below is the list of supported LLM benchmarks. Click on a benchmark name to jump
 | `mmlu` | [MMLU](#mmlu) | `Knowledge`, `MCQ` |
 | `mmlu_pro` | [MMLU-Pro](#mmlu-pro) | `Knowledge`, `MCQ` |
 | `mmlu_redux` | [MMLU-Redux](#mmlu-redux) | `Knowledge`, `MCQ` |
+| `mmmu` | [MMMU](#mmmu) | `Knowledge`, `MultiModal`, `QA` |
 | `musr` | [MuSR](#musr) | `MCQ`, `Reasoning` |
 | `needle_haystack` | [Needle-in-a-Haystack](#needle-in-a-haystack) | `LongContext`, `Retrieval` |
 | `process_bench` | [ProcessBench](#processbench) | `Math`, `Reasoning` |
@@ -840,6 +841,26 @@ Answer the following multiple choice question. The last line of your response sh

 ---

+### MMMU
+
+[Back to Top](#llm-benchmarks)
+- **Dataset Name**: `mmmu`
+- **Dataset ID**: [AI-ModelScope/MMMU](https://modelscope.cn/datasets/AI-ModelScope/MMMU/summary)
+- **Description**:
+  > MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) is a benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.
+- **Task Categories**: `Knowledge`, `MultiModal`, `QA`
+- **Evaluation Metrics**: `acc`
+- **Requires LLM Judge**: No
+- **Default Shots**: 0-shot
+- **Subsets**: `Accounting`, `Agriculture`, `Architecture_and_Engineering`, `Art_Theory`, `Art`, `Basic_Medical_Science`, `Biology`, `Chemistry`, `Clinical_Medicine`, `Computer_Science`, `Design`, `Diagnostics_and_Laboratory_Medicine`, `Economics`, `Electronics`, `Energy_and_Power`, `Finance`, `Geography`, `History`, `Literature`, `Manage`, `Marketing`, `Materials`, `Math`, `Mechanical_Engineering`, `Music`, `Pharmacy`, `Physics`, `Psychology`, `Public_Health`, `Sociology`
+
+- **Prompt Template**:
+```text
+{question}
+```
+
+---
+
 ### MuSR

 [Back to Top](#llm-benchmarks)
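A minimal usage sketch for the benchmark entry added above. It assumes the `TaskConfig`/`run_task` entry points and the `dataset_args`/`subset_list`/`limit` options behave as for other evalscope benchmarks; the model id is a placeholder.

```python
from evalscope import TaskConfig, run_task

# Smoke-test a multimodal model on two MMMU subsets (0-shot by default).
task_cfg = TaskConfig(
    model='qwen-vl-plus',  # placeholder: any multimodal model identifier/endpoint
    datasets=['mmmu'],     # dataset name registered by this commit
    dataset_args={'mmmu': {'subset_list': ['Math', 'Art']}},  # optional subset filter
    limit=5,               # evaluate only a few samples per subset
)
run_task(task_cfg=task_cfg)
```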

docs/zh/get_started/supported_dataset/llm.md

Lines changed: 21 additions & 0 deletions
@@ -35,6 +35,7 @@
 | `mmlu` | [MMLU](#mmlu) | `Knowledge`, `MCQ` |
 | `mmlu_pro` | [MMLU-Pro](#mmlu-pro) | `Knowledge`, `MCQ` |
 | `mmlu_redux` | [MMLU-Redux](#mmlu-redux) | `Knowledge`, `MCQ` |
+| `mmmu` | [MMMU](#mmmu) | `Knowledge`, `MultiModal`, `QA` |
 | `musr` | [MuSR](#musr) | `MCQ`, `Reasoning` |
 | `needle_haystack` | [Needle-in-a-Haystack](#needle-in-a-haystack) | `LongContext`, `Retrieval` |
 | `process_bench` | [ProcessBench](#processbench) | `Math`, `Reasoning` |
@@ -840,6 +841,26 @@ Answer the following multiple choice question. The last line of your response sh

 ---

+### MMMU
+
+[返回目录](#llm评测集)
+- **数据集名称**: `mmmu`
+- **数据集ID**: [AI-ModelScope/MMMU](https://modelscope.cn/datasets/AI-ModelScope/MMMU/summary)
+- **数据集描述**:
+  > MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) is a benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.
+- **任务类别**: `Knowledge`, `MultiModal`, `QA`
+- **评估指标**: `acc`
+- **需要LLM Judge**: 否
+- **默认提示方式**: 0-shot
+- **数据集子集**: `Accounting`, `Agriculture`, `Architecture_and_Engineering`, `Art_Theory`, `Art`, `Basic_Medical_Science`, `Biology`, `Chemistry`, `Clinical_Medicine`, `Computer_Science`, `Design`, `Diagnostics_and_Laboratory_Medicine`, `Economics`, `Electronics`, `Energy_and_Power`, `Finance`, `Geography`, `History`, `Literature`, `Manage`, `Marketing`, `Materials`, `Math`, `Mechanical_Engineering`, `Music`, `Pharmacy`, `Physics`, `Psychology`, `Public_Health`, `Sociology`
+
+- **提示模板**:
+```text
+{question}
+```
+
+---
+
 ### MuSR

 [返回目录](#llm评测集)

evalscope/api/dataset/dataset.py

Lines changed: 6 additions & 3 deletions
@@ -5,9 +5,8 @@
 from pydantic import BaseModel, Field
 from typing import Any, Callable, Dict, Iterator, List, Optional, Sequence, Union

-from evalscope.api.messages import ChatMessage, messages_pretty_str
+from evalscope.api.messages import ChatMessage, messages_to_markdown
 from evalscope.api.tool import ToolInfo
-from evalscope.utils.multi_choices import answer_character, answer_index


 class Sample(BaseModel):
@@ -51,7 +50,7 @@ def pretty_print(self) -> str:
         if isinstance(self.input, str):
             input_text = self.input
         else:
-            input_text = messages_pretty_str(self.input)
+            input_text = messages_to_markdown(self.input, max_length=50)
         return f'Sample ID: {self.id}\nInput: {input_text}\nTarget: {self.target}'


@@ -227,6 +226,8 @@ def shuffle(self, seed: Optional[int] = None) -> None:
         self._shuffled = True

     def shuffle_choices(self, seed: Optional[int] = None) -> None:
+        from evalscope.utils.multi_choices import answer_character
+
         rand = random.Random(seed)
         for sample in self.samples:
             if not sample.choices:
@@ -246,6 +247,8 @@ def shuffle_choices(self, seed: Optional[int] = None) -> None:
                 sample.target = self._remap_target(sample.target, position_map=position_map)

     def _remap_target(self, target: Union[str, List[str]], position_map: Dict[int, str]) -> Union[str, List[str]]:
+        from evalscope.utils.multi_choices import answer_index
+
         if isinstance(target, list):
             return [position_map[answer_index(t)] for t in target]
         else:
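The imports are presumably deferred into the method bodies to avoid an import cycle with `evalscope.utils.multi_choices`. Below is an illustrative sketch of the remapping they support; the helper behaviour shown (`answer_character(0) == 'A'`, `answer_index('A') == 0`) is assumed from the usage above, and the values are made up.

```python
# Illustration only: how a shuffled choice order yields a position_map and
# how the original target letter is remapped to its new letter.
def answer_character(index: int) -> str:   # assumed: 0 -> 'A', 1 -> 'B', ...
    return chr(ord('A') + index)

def answer_index(char: str) -> int:        # assumed inverse: 'A' -> 0, 'B' -> 1, ...
    return ord(char.upper()) - ord('A')

choices = ['Paris', 'London', 'Rome']      # original target: 'A' (Paris)
shuffled = [2, 0, 1]                       # new order: Rome, Paris, London

# original position -> letter of its new position
position_map = {orig: answer_character(new) for new, orig in enumerate(shuffled)}
# position_map == {2: 'A', 0: 'B', 1: 'C'}

new_target = position_map[answer_index('A')]   # 'B': Paris now sits at choice B
```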

evalscope/api/evaluator/cache.py

Lines changed: 24 additions & 1 deletion
@@ -299,6 +299,15 @@ def to_task_state(self, dataset: Dataset) -> TaskState:
             completed=True,  # Mark as completed since it was cached
         )

+    def pretty_print(self) -> str:
+        """
+        Generate a pretty-printed string representation of the model result.
+
+        Returns:
+            A string representation of the model result
+        """
+        return self.model_dump_json(indent=2)
+

 class ReviewResult(BaseModel):
     """
@@ -340,7 +349,7 @@ def from_score_state(

         return cls(
             index=state.sample_id,
-            input=state.input_text,
+            input=state.input_markdown,
             target=state.target,
             sample_score=sample_score,
         )
@@ -353,3 +362,17 @@ def to_sample_score(self) -> SampleScore:
             The sample score object
         """
         return self.sample_score
+
+    def pretty_print(self) -> str:
+        """
+        Generate a pretty-printed string representation of the review result.
+
+        Returns:
+            A string representation of the review result
+        """
+        output = [
+            f'Review Result for Sample {self.index}:',
+            f'Target: {self.target}',
+            f'Score: {self.sample_score.model_dump_json(indent=2)}',
+        ]
+        return '\n'.join(output)

evalscope/api/evaluator/state.py

Lines changed: 12 additions & 1 deletion
@@ -3,7 +3,7 @@
 from typing import Any, Dict, List, Optional, Sequence, Union, overload

 from evalscope.api.dataset import Sample
-from evalscope.api.messages import ChatMessage, ChatMessageUser, messages_pretty_str
+from evalscope.api.messages import ChatMessage, ChatMessageUser, messages_pretty_str, messages_to_markdown
 from evalscope.api.model import ModelOutput


@@ -188,6 +188,17 @@ def input_text(self) -> str:
         else:
             return messages_pretty_str(self._input)

+    @property
+    def input_markdown(self) -> str:
+        """Get the input text as markdown.
+
+        For multi-modal content, images will be represented in markdown format.
+        """
+        if isinstance(self._input, str):
+            return self._input
+        else:
+            return messages_to_markdown(self._input)
+
     @property
     def choices(self) -> Choices:
         """Choices for the sample, if applicable."""

evalscope/api/messages/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@
     ChatMessageUser,
     dict_to_chat_message,
     messages_pretty_str,
+    messages_to_markdown,
 )
 from .content import Content, ContentAudio, ContentData, ContentImage, ContentReasoning, ContentText, ContentVideo
 from .utils import parse_content_with_reasoning

evalscope/api/messages/chat_message.py

Lines changed: 47 additions & 2 deletions
@@ -3,7 +3,7 @@
 from typing import Any, Dict, List, Literal, Optional, Type, Union

 from evalscope.api.tool import ToolCall, ToolCallError
-from .content import Content, ContentReasoning, ContentText
+from .content import Content, ContentImage, ContentReasoning, ContentText
 from .utils import parse_content_with_reasoning


@@ -184,7 +184,7 @@ def dict_to_chat_message(data: Dict[str, Any]) -> ChatMessage:


 def messages_pretty_str(messages: List[ChatMessage]) -> str:
-    """Pretty print a list of chat messages."""
+    """Pretty print a list of chat messages, without images or other multi-modal content."""
     output = []
     for message in messages:
         role = message.role.capitalize()
@@ -196,3 +196,48 @@ def messages_pretty_str(messages: List[ChatMessage]) -> str:
             content += f'\nFunction: {message.function}'
         output.append(f'**{role}**: {content}')
     return '\n\n'.join(output)
+
+
+def messages_to_markdown(messages: List[ChatMessage], max_length: Optional[int] = None) -> str:
+    """Convert a list of chat messages to markdown format.
+
+    Args:
+        messages (List[ChatMessage]): The list of chat messages to convert.
+        max_length (Optional[int]): If provided, truncates the base64 string of images to this length.
+    """
+    output = []
+    for message in messages:
+        role = message.role.capitalize()
+
+        # Start with role header
+        content_parts = [f'**{role}**: ']
+
+        # Handle content based on type
+        if isinstance(message.content, str):
+            content_parts.append(message.content)
+        else:
+            for content_item in message.content:
+                if isinstance(content_item, ContentText):
+                    content_parts.append(content_item.text)
+                elif isinstance(content_item, ContentImage):
+                    # Use markdown image syntax
+                    image_base64 = content_item.image
+                    if max_length and len(image_base64) > max_length:
+                        image_base64 = image_base64[:max_length]
+                    content_parts.append(f'![image]({image_base64})')
+                elif isinstance(content_item, ContentReasoning):
+                    content_parts.append(f'**Reasoning:** {content_item.reasoning}')
+
+        # Add tool-specific information
+        if isinstance(message, ChatMessageTool):
+            if message.error:
+                content_parts.append(f'**Error:** {message.error.message}')
+            if message.function:
+                content_parts.append(f'**Function:** {message.function}')
+        elif isinstance(message, ChatMessageAssistant) and message.tool_calls:
+            for tool_call in message.tool_calls:
+                content_parts.append(f'**Tool Call:** {tool_call.function}')
+
+        output.append('\n'.join(content_parts))
+
+    return '\n\n'.join(output)
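A quick usage sketch of the new helper. The constructors below (`ChatMessageUser(content=...)`, `ContentText(text=...)`, `ContentImage(image=...)`) follow the field names visible in this diff but are otherwise assumptions; the base64 payload is a placeholder.

```python
from evalscope.api.messages import ChatMessageUser, ContentImage, ContentText, messages_to_markdown

messages = [
    ChatMessageUser(content=[
        ContentText(text='What does this chart show?'),
        ContentImage(image='data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...'),  # placeholder
    ])
]

# Truncating long base64 payloads keeps the rendered markdown readable,
# which is how Sample.pretty_print() calls it (max_length=50).
markdown = messages_to_markdown(messages, max_length=50)
print(markdown)  # '**User**: ' header, the text, then an ![image](...) entry
```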

evalscope/app/app.py

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,7 @@
 from evalscope.utils.logger import configure_logging
 from .arguments import add_argument
 from .ui import create_app_ui
+from .utils.env_utils import setup_env


 def create_app(args: argparse.Namespace):
@@ -17,6 +18,8 @@ def create_app(args: argparse.Namespace):
     """
     configure_logging(debug=args.debug)

+    setup_env(args)
+
     demo = create_app_ui(args)

     demo.launch(

evalscope/app/ui/single_model.py

Lines changed: 3 additions & 3 deletions
@@ -198,9 +198,9 @@ def update_table_components(filtered_df, page_number, score_threshold):

         # Process the data for display
         input_md = row['Input'] + '\n\n' + process_model_prediction(row['Metadata'])
-        generated_md = process_model_prediction(row['Generated'])
-        gold_md = convert_markdown_image(process_model_prediction(row['Gold']))
-        pred_md = convert_markdown_image(process_model_prediction(row['Pred']))
+        generated_md = convert_markdown_image(row['Generated'])
+        gold_md = convert_markdown_image(row['Gold'])
+        pred_md = process_model_prediction(row['Pred'])
         score_md = process_json_content(row['Score'])
         nscore_val = float(row['NScore']) if not pd.isna(row['NScore']) else 0.0

evalscope/app/utils/data_utils.py

Lines changed: 2 additions & 6 deletions
@@ -163,18 +163,14 @@ def get_model_prediction(work_dir: str, model_name: str, dataset_name: str, subs
         metadata = sample_score.sample_metadata
         prediction = score.prediction
         target = review_result.target
-        # TODO: Need a more robust way to determine target
-        if not target:
-            # Put input_image as target if not available for image generation
-            target = metadata.get('input_image', '')
         extracted_prediction = score.extracted_prediction
         raw_d = {
             'Index': str(review_result.index),
             'Input': review_result.input.replace('\n', '\n\n'),  # for markdown
             'Metadata': metadata,
-            'Generated': prediction if prediction != extracted_prediction else '*Same as Pred*',
+            'Generated': prediction,
             'Gold': target,
-            'Pred': extracted_prediction,
+            'Pred': extracted_prediction if extracted_prediction != prediction else '*Same as Generated*',
             'Score': score.model_dump(exclude_none=True),
             'NScore': normalize_score(score.main_value)
         }
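In effect, the review table now always shows the raw model output under `Generated` and collapses `Pred` when extraction changed nothing. A tiny sketch with made-up values:

```python
prediction = 'The answer is (B).'   # raw model output
extracted_prediction = 'B'          # what the answer extractor pulled out

generated = prediction
pred = extracted_prediction if extracted_prediction != prediction else '*Same as Generated*'
# generated == 'The answer is (B).', pred == 'B'
```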
