21 changes: 21 additions & 0 deletions docs/en/get_started/supported_dataset/llm.md
@@ -35,6 +35,7 @@ Below is the list of supported LLM benchmarks. Click on a benchmark name to jump
| `mmlu` | [MMLU](#mmlu) | `Knowledge`, `MCQ` |
| `mmlu_pro` | [MMLU-Pro](#mmlu-pro) | `Knowledge`, `MCQ` |
| `mmlu_redux` | [MMLU-Redux](#mmlu-redux) | `Knowledge`, `MCQ` |
| `mmmu` | [MMMU](#mmmu) | `Knowledge`, `MultiModal`, `QA` |
| `musr` | [MuSR](#musr) | `MCQ`, `Reasoning` |
| `needle_haystack` | [Needle-in-a-Haystack](#needle-in-a-haystack) | `LongContext`, `Retrieval` |
| `process_bench` | [ProcessBench](#processbench) | `Math`, `Reasoning` |
@@ -840,6 +841,26 @@ Answer the following multiple choice question. The last line of your response sh

---

### MMMU

[Back to Top](#llm-benchmarks)
- **Dataset Name**: `mmmu`
- **Dataset ID**: [AI-ModelScope/MMMU](https://modelscope.cn/datasets/AI-ModelScope/MMMU/summary)
- **Description**:
> MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) is a benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields and comprise 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.
- **Task Categories**: `Knowledge`, `MultiModal`, `QA`
- **Evaluation Metrics**: `acc`
- **Requires LLM Judge**: No
- **Default Shots**: 0-shot
- **Subsets**: `Accounting`, `Agriculture`, `Architecture_and_Engineering`, `Art_Theory`, `Art`, `Basic_Medical_Science`, `Biology`, `Chemistry`, `Clinical_Medicine`, `Computer_Science`, `Design`, `Diagnostics_and_Laboratory_Medicine`, `Economics`, `Electronics`, `Energy_and_Power`, `Finance`, `Geography`, `History`, `Literature`, `Manage`, `Marketing`, `Materials`, `Math`, `Mechanical_Engineering`, `Music`, `Pharmacy`, `Physics`, `Psychology`, `Public_Health`, `Sociology`

- **Prompt Template**:
```text
{question}
```
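
A minimal usage sketch, assuming the standard `TaskConfig`/`run_task` entry point; the model ID and `subset_list` below are illustrative placeholders, and MMMU requires a multimodal-capable model or API endpoint.

```python
from evalscope import TaskConfig, run_task

# Illustrative configuration: the model ID is a placeholder and must point to a
# multimodal-capable model; subset_list optionally restricts evaluation to a few subsets.
task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-VL-7B-Instruct',
    datasets=['mmmu'],
    dataset_args={'mmmu': {'subset_list': ['Art', 'Math']}},
    limit=10,  # evaluate only the first 10 samples per subset
)
run_task(task_cfg=task_cfg)
```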

---

### MuSR

[Back to Top](#llm-benchmarks)
21 changes: 21 additions & 0 deletions docs/zh/get_started/supported_dataset/llm.md
@@ -35,6 +35,7 @@
| `mmlu` | [MMLU](#mmlu) | `Knowledge`, `MCQ` |
| `mmlu_pro` | [MMLU-Pro](#mmlu-pro) | `Knowledge`, `MCQ` |
| `mmlu_redux` | [MMLU-Redux](#mmlu-redux) | `Knowledge`, `MCQ` |
| `mmmu` | [MMMU](#mmmu) | `Knowledge`, `MultiModal`, `QA` |
| `musr` | [MuSR](#musr) | `MCQ`, `Reasoning` |
| `needle_haystack` | [Needle-in-a-Haystack](#needle-in-a-haystack) | `LongContext`, `Retrieval` |
| `process_bench` | [ProcessBench](#processbench) | `Math`, `Reasoning` |
@@ -840,6 +841,26 @@ Answer the following multiple choice question. The last line of your response sh

---

### MMMU

[Back to Top](#llm评测集)
- **Dataset Name**: `mmmu`
- **Dataset ID**: [AI-ModelScope/MMMU](https://modelscope.cn/datasets/AI-ModelScope/MMMU/summary)
- **Description**:
> MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) is a benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields and comprise 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.
- **Task Categories**: `Knowledge`, `MultiModal`, `QA`
- **Evaluation Metrics**: `acc`
- **Requires LLM Judge**: No
- **Default Shots**: 0-shot
- **Subsets**: `Accounting`, `Agriculture`, `Architecture_and_Engineering`, `Art_Theory`, `Art`, `Basic_Medical_Science`, `Biology`, `Chemistry`, `Clinical_Medicine`, `Computer_Science`, `Design`, `Diagnostics_and_Laboratory_Medicine`, `Economics`, `Electronics`, `Energy_and_Power`, `Finance`, `Geography`, `History`, `Literature`, `Manage`, `Marketing`, `Materials`, `Math`, `Mechanical_Engineering`, `Music`, `Pharmacy`, `Physics`, `Psychology`, `Public_Health`, `Sociology`

- **Prompt Template**:
```text
{question}
```

---

### MuSR

[Back to Top](#llm评测集)
9 changes: 6 additions & 3 deletions evalscope/api/dataset/dataset.py
@@ -5,9 +5,8 @@
from pydantic import BaseModel, Field
from typing import Any, Callable, Dict, Iterator, List, Optional, Sequence, Union

from evalscope.api.messages import ChatMessage, messages_pretty_str
from evalscope.api.messages import ChatMessage, messages_to_markdown
from evalscope.api.tool import ToolInfo
from evalscope.utils.multi_choices import answer_character, answer_index


class Sample(BaseModel):
@@ -51,7 +50,7 @@ def pretty_print(self) -> str:
        if isinstance(self.input, str):
            input_text = self.input
        else:
            input_text = messages_pretty_str(self.input)
            input_text = messages_to_markdown(self.input, max_length=50)
        return f'Sample ID: {self.id}\nInput: {input_text}\nTarget: {self.target}'


@@ -227,6 +226,8 @@ def shuffle(self, seed: Optional[int] = None) -> None:
        self._shuffled = True

    def shuffle_choices(self, seed: Optional[int] = None) -> None:
        from evalscope.utils.multi_choices import answer_character

        rand = random.Random(seed)
        for sample in self.samples:
            if not sample.choices:
@@ -246,6 +247,8 @@ def shuffle_choices(self, seed: Optional[int] = None) -> None:
            sample.target = self._remap_target(sample.target, position_map=position_map)

    def _remap_target(self, target: Union[str, List[str]], position_map: Dict[int, str]) -> Union[str, List[str]]:
        from evalscope.utils.multi_choices import answer_index

        if isinstance(target, list):
            return [position_map[answer_index(t)] for t in target]
        else:
25 changes: 24 additions & 1 deletion evalscope/api/evaluator/cache.py
@@ -299,6 +299,15 @@ def to_task_state(self, dataset: Dataset) -> TaskState:
            completed=True,  # Mark as completed since it was cached
        )

    def pretty_print(self) -> str:
        """
        Generate a pretty-printed string representation of the model result.

        Returns:
            A string representation of the model result
        """
        return self.model_dump_json(indent=2)


class ReviewResult(BaseModel):
    """
@@ -340,7 +349,7 @@ def from_score_state(

        return cls(
            index=state.sample_id,
            input=state.input_text,
            input=state.input_markdown,
            target=state.target,
            sample_score=sample_score,
        )
@@ -353,3 +362,17 @@ def to_sample_score(self) -> SampleScore:
            The sample score object
        """
        return self.sample_score

    def pretty_print(self) -> str:
        """
        Generate a pretty-printed string representation of the review result.

        Returns:
            A string representation of the review result
        """
        output = [
            f'Review Result for Sample {self.index}:',
            f'Target: {self.target}',
            f'Score: {self.sample_score.model_dump_json(indent=2)}',
        ]
        return '\n'.join(output)
13 changes: 12 additions & 1 deletion evalscope/api/evaluator/state.py
@@ -3,7 +3,7 @@
from typing import Any, Dict, List, Optional, Sequence, Union, overload

from evalscope.api.dataset import Sample
from evalscope.api.messages import ChatMessage, ChatMessageUser, messages_pretty_str
from evalscope.api.messages import ChatMessage, ChatMessageUser, messages_pretty_str, messages_to_markdown
from evalscope.api.model import ModelOutput


@@ -188,6 +188,17 @@ def input_text(self) -> str:
        else:
            return messages_pretty_str(self._input)

    @property
    def input_markdown(self) -> str:
        """Get the input text as markdown.

        For multi-modal content, images will be represented in markdown format.
        """
        if isinstance(self._input, str):
            return self._input
        else:
            return messages_to_markdown(self._input)

    @property
    def choices(self) -> Choices:
        """Choices for the sample, if applicable."""
1 change: 1 addition & 0 deletions evalscope/api/messages/__init__.py
@@ -6,6 +6,7 @@
    ChatMessageUser,
    dict_to_chat_message,
    messages_pretty_str,
    messages_to_markdown,
)
from .content import Content, ContentAudio, ContentData, ContentImage, ContentReasoning, ContentText, ContentVideo
from .utils import parse_content_with_reasoning
49 changes: 47 additions & 2 deletions evalscope/api/messages/chat_message.py
@@ -3,7 +3,7 @@
from typing import Any, Dict, List, Literal, Optional, Type, Union

from evalscope.api.tool import ToolCall, ToolCallError
from .content import Content, ContentReasoning, ContentText
from .content import Content, ContentImage, ContentReasoning, ContentText
from .utils import parse_content_with_reasoning


@@ -184,7 +184,7 @@ def dict_to_chat_message(data: Dict[str, Any]) -> ChatMessage:


def messages_pretty_str(messages: List[ChatMessage]) -> str:
    """Pretty print a list of chat messages."""
    """Pretty print a list of chat messages, omitting images and other multi-modal content."""
    output = []
    for message in messages:
        role = message.role.capitalize()
@@ -196,3 +196,48 @@ def messages_pretty_str(messages: List[ChatMessage]) -> str:
            content += f'\nFunction: {message.function}'
        output.append(f'**{role}**: {content}')
    return '\n\n'.join(output)


def messages_to_markdown(messages: List[ChatMessage], max_length: Optional[int] = None) -> str:
"""Convert a list of chat messages to markdown format.

Args:
messages (List[ChatMessage]): The list of chat messages to convert.
max_length (Optional[int]): If provided, truncates the base64 string of images to this length.
"""
output = []
for message in messages:
role = message.role.capitalize()

# Start with role header
content_parts = [f'**{role}**: ']

# Handle content based on type
if isinstance(message.content, str):
content_parts.append(message.content)
else:
for content_item in message.content:
if isinstance(content_item, ContentText):
content_parts.append(content_item.text)
elif isinstance(content_item, ContentImage):
# Use markdown image syntax
image_base64 = content_item.image
if max_length and len(image_base64) > max_length:
image_base64 = image_base64[:max_length]
content_parts.append(f'![image]({image_base64})')
elif isinstance(content_item, ContentReasoning):
content_parts.append(f'**Reasoning:** {content_item.reasoning}')

# Add tool-specific information
if isinstance(message, ChatMessageTool):
if message.error:
content_parts.append(f'**Error:** {message.error.message}')
if message.function:
content_parts.append(f'**Function:** {message.function}')
elif isinstance(message, ChatMessageAssistant) and message.tool_calls:
for tool_call in message.tool_calls:
content_parts.append(f'**Tool Call:** {tool_call.function}')

output.append('\n'.join(content_parts))

return '\n\n'.join(output)
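
A brief usage sketch for the new helper; the keyword fields for `ChatMessageUser`, `ContentText`, and `ContentImage` are assumed from how they are read above, and the base64 string is a stand-in for real image data.

```python
from evalscope.api.messages import ChatMessageUser, ContentImage, ContentText, messages_to_markdown

# Hypothetical multi-modal user message: one text part and one image part.
messages = [
    ChatMessageUser(content=[
        ContentText(text='What is shown in this chart?'),
        ContentImage(image='data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...'),
    ])
]

# Truncate embedded image data to 50 characters so logs and previews stay readable.
print(messages_to_markdown(messages, max_length=50))
```

With `max_length` left unset, the full base64 payload is kept, which is what the new `TaskState.input_markdown` property relies on.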
3 changes: 3 additions & 0 deletions evalscope/app/app.py
@@ -6,6 +6,7 @@
from evalscope.utils.logger import configure_logging
from .arguments import add_argument
from .ui import create_app_ui
from .utils.env_utils import setup_env


def create_app(args: argparse.Namespace):
@@ -17,6 +18,8 @@ def create_app(args: argparse.Namespace):
    """
    configure_logging(debug=args.debug)

    setup_env(args)

    demo = create_app_ui(args)

    demo.launch(
6 changes: 3 additions & 3 deletions evalscope/app/ui/single_model.py
@@ -198,9 +198,9 @@ def update_table_components(filtered_df, page_number, score_threshold):

        # Process the data for display
        input_md = row['Input'] + '\n\n' + process_model_prediction(row['Metadata'])
        generated_md = process_model_prediction(row['Generated'])
        gold_md = convert_markdown_image(process_model_prediction(row['Gold']))
        pred_md = convert_markdown_image(process_model_prediction(row['Pred']))
        generated_md = convert_markdown_image(row['Generated'])
        gold_md = convert_markdown_image(row['Gold'])
        pred_md = process_model_prediction(row['Pred'])
        score_md = process_json_content(row['Score'])
        nscore_val = float(row['NScore']) if not pd.isna(row['NScore']) else 0.0

8 changes: 2 additions & 6 deletions evalscope/app/utils/data_utils.py
@@ -163,18 +163,14 @@ def get_model_prediction(work_dir: str, model_name: str, dataset_name: str, subs
        metadata = sample_score.sample_metadata
        prediction = score.prediction
        target = review_result.target
        # TODO: Need a more robust way to determine target
        if not target:
            # Put input_image as target if not available for image generation
            target = metadata.get('input_image', '')
        extracted_prediction = score.extracted_prediction
        raw_d = {
            'Index': str(review_result.index),
            'Input': review_result.input.replace('\n', '\n\n'),  # for markdown
            'Metadata': metadata,
            'Generated': prediction if prediction != extracted_prediction else '*Same as Pred*',
            'Generated': prediction,
            'Gold': target,
            'Pred': extracted_prediction,
            'Pred': extracted_prediction if extracted_prediction != prediction else '*Same as Generated*',
            'Score': score.model_dump(exclude_none=True),
            'NScore': normalize_score(score.main_value)
        }
12 changes: 12 additions & 0 deletions evalscope/app/utils/env_utils.py
@@ -0,0 +1,12 @@
# flake8: noqa
import os


def setup_env(args):
    compat_dsw_gradio(args)


def compat_dsw_gradio(args) -> None:
    if ('JUPYTER_NAME' in os.environ) and ('dsw-'
                                           in os.environ['JUPYTER_NAME']) and ('GRADIO_ROOT_PATH' not in os.environ):
        os.environ['GRADIO_ROOT_PATH'] = f"/{os.environ['JUPYTER_NAME']}/proxy/{args.server_port}"
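
For illustration, a hypothetical DSW session shows the effect of `compat_dsw_gradio`; the environment values below are made up, and the import path assumes the module added in this file is importable as a package.

```python
import argparse
import os

from evalscope.app.utils.env_utils import compat_dsw_gradio

# Hypothetical DSW environment: JUPYTER_NAME contains 'dsw-' and no root path is set yet.
os.environ['JUPYTER_NAME'] = 'dsw-12345'
os.environ.pop('GRADIO_ROOT_PATH', None)

compat_dsw_gradio(argparse.Namespace(server_port=7860))
print(os.environ['GRADIO_ROOT_PATH'])  # /dsw-12345/proxy/7860
```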