47 changes: 31 additions & 16 deletions README.md
@@ -51,9 +51,23 @@

## 📝 Introduction

EvalScope is [ModelScope](https://modelscope.cn/)'s official framework for model evaluation and benchmarking, designed for diverse assessment needs. It supports various model types including large language models, multimodal, embedding, reranker, and CLIP models.
EvalScope is a comprehensive model evaluation and performance benchmarking framework meticulously crafted by the [ModelScope Community](https://modelscope.cn/), offering a one-stop solution for your model assessment needs. Regardless of the type of model you are developing, EvalScope is equipped to cater to your requirements:

The framework accommodates multiple evaluation scenarios such as end-to-end RAG evaluation, arena mode, and inference performance testing. It features built-in benchmarks and metrics like MMLU, CMMLU, C-Eval, and GSM8K. Seamlessly integrated with the [ms-swift](https://github.com/modelscope/ms-swift) training framework, EvalScope enables one-click evaluations, offering comprehensive support for model training and assessment 🚀
- 🧠 Large Language Models
- 🎨 Multimodal Models
- 🔍 Embedding Models
- 🏆 Reranker Models
- 🖼️ CLIP Models
- 🎭 AIGC Models (Image-to-Text/Video)
- ...and more!

EvalScope is not merely an evaluation tool; it is a valuable ally in your model optimization journey:

- 🏅 Equipped with multiple industry-recognized benchmarks and evaluation metrics: MMLU, CMMLU, C-Eval, GSM8K, etc.
- 📊 Model inference performance stress testing: Ensuring your model excels in real-world applications.
- 🚀 Seamless integration with the [ms-swift](https://github.com/modelscope/ms-swift) training framework, enabling one-click evaluations and providing end-to-end support from training to evaluation for your model development.

Below is the overall architecture diagram of EvalScope:

<p align="center">
<img src="docs/en/_static/images/evalscope_framework.png" width="70%">
@@ -353,26 +367,27 @@ For more customized evaluations, such as customizing model parameters or dataset

```shell
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--model-args revision=master,precision=torch.float16,device_map=auto \
--generation-config do_sample=true,temperature=0.5 \
--model Qwen/Qwen3-0.6B \
--model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}' \
--generation-config '{"do_sample":true,"temperature":0.6,"max_new_tokens":512,"chat_template_kwargs":{"enable_thinking": false}}' \
--dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
--datasets gsm8k \
--limit 10
```

### Parameter
- `--model-args`: Model loading parameters, separated by commas in `key=value` format. Default parameters:
- `revision`: Model version, default is `master`
- `precision`: Model precision, default is `auto`
- `device_map`: Model device allocation, default is `auto`
- `--generation-config`: Generation parameters, separated by commas in `key=value` format. Default parameters:
- `do_sample`: Whether to use sampling, default is `false`
- `max_length`: Maximum length, default is 2048
- `max_new_tokens`: Maximum length of generation, default is 512
- `--dataset-args`: Configuration parameters for evaluation datasets, passed in `json` format. The key is the dataset name, and the value is the parameters. Note that it needs to correspond one-to-one with the values in the `--datasets` parameter:
### Parameter Description
- `--model-args`: Model loading parameters, passed as a JSON string:
- `revision`: Model version
- `precision`: Model precision
- `device_map`: Device allocation for the model
- `--generation-config`: Generation parameters, passed as a JSON string and parsed as a dictionary:
- `do_sample`: Whether to use sampling
- `temperature`: Generation temperature
- `max_new_tokens`: Maximum length of generated tokens
- `chat_template_kwargs`: Model inference template parameters
- `--dataset-args`: Settings for the evaluation dataset, passed as a JSON string where the key is the dataset name and the value is the parameters. Note that these need to correspond one-to-one with the values in the `--datasets` parameter:
- `few_shot_num`: Number of few-shot examples
- `few_shot_random`: Whether to randomly sample few-shot data, if not set, defaults to `true`
- `few_shot_random`: Whether to randomly sample few-shot data; if not set, defaults to `true`
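
The same run can also be configured from Python via the `run_task` entry point referenced elsewhere in these docs. The sketch below is only an illustration: it assumes a `TaskConfig` helper whose field names mirror the CLI flags above, so check the full parameter description for the exact API.

```python
# Minimal Python sketch mirroring the CLI example above (field names assumed to match the flags).
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen3-0.6B',
    model_args={'revision': 'master', 'precision': 'torch.float16', 'device_map': 'auto'},
    generation_config={
        'do_sample': True,
        'temperature': 0.6,
        'max_new_tokens': 512,
        'chat_template_kwargs': {'enable_thinking': False},
    },
    datasets=['gsm8k'],
    dataset_args={'gsm8k': {'few_shot_num': 0, 'few_shot_random': False}},
    limit=10,
)
run_task(task_cfg=task_cfg)
```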

Reference: [Full Parameter Description](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html)

43 changes: 29 additions & 14 deletions README_zh.md
@@ -39,9 +39,23 @@

## 📝 Introduction

EvalScope is the official model evaluation and performance benchmarking framework released by the [ModelScope Community](https://modelscope.cn/), designed for diverse model assessment needs. It supports a wide range of model types, including but not limited to large language models, multimodal models, embedding models, reranker models, and CLIP models.
EvalScope is a model evaluation and performance benchmarking framework meticulously crafted by the [ModelScope Community](https://modelscope.cn/), offering a one-stop solution for your model assessment needs. Whatever type of model you are developing, EvalScope can meet your requirements:

EvalScope also covers a variety of evaluation scenarios, such as end-to-end RAG evaluation, arena mode, and model inference performance stress testing. It ships with many commonly used benchmarks and metrics, such as MMLU, CMMLU, C-Eval, and GSM8K. In addition, through seamless integration with the [ms-swift](https://github.com/modelscope/ms-swift) training framework, evaluations can be launched with one click, providing end-to-end support for model training and evaluation 🚀
- 🧠 Large Language Models
- 🎨 Multimodal Models
- 🔍 Embedding Models
- 🏆 Reranker Models
- 🖼️ CLIP Models
- 🎭 AIGC Models (Image-to-Text/Video)
- ...and more!

EvalScope is more than an evaluation tool; it is a capable assistant on your model optimization journey:

- 🏅 Built-in industry-recognized benchmarks and evaluation metrics: MMLU, CMMLU, C-Eval, GSM8K, etc.
- 📊 Model inference performance stress testing: ensuring your model performs well in real-world applications.
- 🚀 Seamless integration with the [ms-swift](https://github.com/modelscope/ms-swift) training framework: launch evaluations with one click, with end-to-end support from training to evaluation for your model development.

Below is the overall architecture diagram of EvalScope:

<p align="center">
<img src="docs/en/_static/images/evalscope_framework.png" style="width: 70%;">
@@ -347,24 +361,25 @@ evalscope eval \

```shell
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--model-args revision=master,precision=torch.float16,device_map=auto \
--generation-config do_sample=true,temperature=0.5 \
--model Qwen/Qwen3-0.6B \
--model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}' \
--generation-config '{"do_sample":true,"temperature":0.6,"max_new_tokens":512,"chat_template_kwargs":{"enable_thinking": false}}' \
--dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
--datasets gsm8k \
--limit 10
```

### Parameter Description
- `--model-args`: Model loading parameters, comma-separated in `key=value` form. Default parameters:
- `revision`: Model version, defaults to `master`
- `precision`: Model precision, defaults to `auto`
- `device_map`: Device allocation for the model, defaults to `auto`
- `--generation-config`: Generation parameters, comma-separated in `key=value` form. Default parameters:
- `do_sample`: Whether to use sampling, defaults to `false`
- `max_length`: Maximum length, defaults to 2048
- `max_new_tokens`: Maximum generation length, defaults to 512
- `--dataset-args`: Settings for the evaluation datasets, passed in `json` format; the key is the dataset name and the value is its parameters. Note that these must correspond one-to-one with the values in the `--datasets` parameter:
- `--model-args`: Model loading parameters, passed as a JSON string:
- `revision`: Model version
- `precision`: Model precision
- `device_map`: Device allocation for the model
- `--generation-config`: Generation parameters, passed as a JSON string and parsed into a dictionary:
- `do_sample`: Whether to use sampling
- `temperature`: Generation temperature
- `max_new_tokens`: Maximum length of generated tokens
- `chat_template_kwargs`: Model inference template parameters
- `--dataset-args`: Settings for the evaluation dataset, passed as a JSON string where the key is the dataset name and the value is the parameters. Note that these need to correspond one-to-one with the values in the `--datasets` parameter:
- `few_shot_num`: Number of few-shot examples
- `few_shot_random`: Whether to randomly sample few-shot data; if not set, defaults to `true`

56 changes: 29 additions & 27 deletions docs/en/get_started/basic_usage.md
@@ -112,7 +112,8 @@ run_task(task_cfg="config.json")
- `--limit`: Maximum amount of evaluation data per dataset. If not specified, it defaults to evaluating all data, which can be used for quick validation.


### Output Results
**Output Results**

```text
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
| Model Name            | Dataset Name   | Metric Name     | Category Name   | Subset Name   | Num   | Score   |
@@ -127,45 +128,46 @@ run_task(task_cfg="config.json")


## Complex Evaluation
For more customized evaluations, like setting custom model parameters or dataset parameters, you can use the following command. The method to initiate evaluation is the same as in simple evaluation. Below is an example using the `eval` command to start the evaluation:

If you wish to conduct more customized evaluations, such as customizing model parameters or dataset parameters, you can use the following command. The evaluation method is the same as simple evaluation, and below is an example of starting the evaluation using the `eval` command:

```shell
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--model-args revision=master,precision=torch.float16,device_map=auto \
--generation-config do_sample=true,temperature=0.5 \
--model Qwen/Qwen3-0.6B \
--model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}' \
--generation-config '{"do_sample":true,"temperature":0.6,"max_new_tokens":512,"chat_template_kwargs":{"enable_thinking": false}}' \
--dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
--datasets gsm8k \
--limit 10
```

### Parameter Descriptions
- `--model-args`: Model loading parameters, separated by commas in the `key=value` format, default parameters:
- `revision`: Model version, defaults to `master`
- `precision`: Model precision, defaults to `auto`
- `device_map`: Device allocation for the model, defaults to `auto`

- `--generation-config`: Generation parameters, separated by commas in the `key=value` format, default parameters:
- `do_sample`: Whether to use sampling, defaults to `false`
- `max_length`: Maximum length, defaults to 2048
- `max_new_tokens`: Maximum length for generation, defaults to 512

- `--dataset-args`: Settings parameters for the evaluation dataset, provided in `json` format, where the key is the dataset name and the value is the parameter. Note that these must correspond one-to-one with the values in the `--datasets` parameter:
- `few_shot_num`: Number of few-shot samples
### Parameter Description
- `--model-args`: Model loading parameters, passed as a JSON string:
- `revision`: Model version
- `precision`: Model precision
- `device_map`: Device allocation for the model
- `--generation-config`: Generation parameters, passed as a JSON string and parsed as a dictionary:
- `do_sample`: Whether to use sampling
- `temperature`: Generation temperature
- `max_new_tokens`: Maximum length of generated tokens
- `chat_template_kwargs`: Model inference template parameters
- `--dataset-args`: Settings for the evaluation dataset, passed as a JSON string where the key is the dataset name and the value is the parameters. Note that these need to correspond one-to-one with the values in the `--datasets` parameter:
- `few_shot_num`: Number of few-shot examples
- `few_shot_random`: Whether to randomly sample few-shot data; if not set, defaults to `true`
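
As an illustration of the one-to-one correspondence between `--datasets` and `--dataset-args`, a run over two datasets could look like the sketch below (the second dataset name is only an example; confirm it against the built-in benchmark list):

```shell
# Hypothetical two-dataset run: each key in --dataset-args matches an entry in --datasets.
evalscope eval \
 --model Qwen/Qwen3-0.6B \
 --datasets gsm8k arc \
 --dataset-args '{"gsm8k": {"few_shot_num": 0}, "arc": {"few_shot_num": 0}}' \
 --limit 10
```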

```{seealso}
Reference: [All Parameter Descriptions](parameters.md)
```

### Output Results
**Output Results**

```text
+-----------------------+-----------------+
| Model                 | gsm8k           |
+=======================+=================+
| Qwen2.5-0.5B-Instruct | (gsm8k/acc) 0.2 |
+-----------------------+-----------------+
+------------+-----------+-----------------+----------+-------+---------+---------+
| Model      | Dataset   | Metric          | Subset   | Num   | Score   | Cat.0   |
+============+===========+=================+==========+=======+=========+=========+
| Qwen3-0.6B | gsm8k     | AverageAccuracy | main     | 10    | 0.3     | default |
+------------+-----------+-----------------+----------+-------+---------+---------+
```

```{seealso}
Reference: [Full Parameter Description](parameters.md)
```

## Model API Service Evaluation
8 changes: 8 additions & 0 deletions docs/en/get_started/faq.md
@@ -196,6 +196,10 @@ A: Try downgrading plotly to version 5.23.0.

A: Refer to https://evalscope.readthedocs.io/zh-cn/latest/get_started/parameters.html#id5 and set the `use_cache` parameter.

### Q30: The evaluation was interrupted, how can I resume it (checkpoint evaluation)?

A: It is supported. Please use the `use_cache` parameter to pass in the path of the previous evaluation output to reuse the model's prediction results and review outcomes.
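
A hedged illustration of such a resumed run is shown below; the output directory name is a placeholder, and the flag spelling assumes the usual CLI form of the `use_cache` parameter described in the parameter reference:

```shell
# Hypothetical resume of an interrupted run: point use_cache at the previous output folder
# so cached predictions and review results are reused instead of being recomputed.
evalscope eval \
 --model Qwen/Qwen3-0.6B \
 --datasets gsm8k \
 --use-cache outputs/20250101_120000
```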

## Model Stress Testing

### Q1: When testing ollama, once the concurrency exceeds 5, the Throughput (average tokens/s) stops increasing, even though my graphics card, CPU, memory, and I/O show no bottlenecks. What is the problem?
@@ -297,3 +301,7 @@ A: The `model` is the name of the model deployed by the model service framework,
### Q12: KTransformers stream output cannot be recognized and reports ZeroDivisionError: float division by zero.

A: The deployed model service does not appear to return usage information, which differs from the standard OpenAI API format; pass the `--tokenizer-path` parameter so the number of `tokens` can be calculated.

### Q13: How can I perform stress testing on a multimodal large model, and how do I input images?

A: Currently, setting the dataset to flickr8k is supported for stress testing multimodal models; see the [stress-test parameter documentation](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/parameters.html#id5) for details.
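
As a rough sketch only (endpoint URL and model name are placeholders; check the linked stress-test parameter docs for the exact flags), a multimodal stress test might be launched like this:

```shell
# Hypothetical multimodal stress test against an OpenAI-compatible endpoint;
# --dataset flickr8k makes the benchmark send image+text requests.
evalscope perf \
 --url http://127.0.0.1:8000/v1/chat/completions \
 --api openai \
 --model my-vl-model \
 --dataset flickr8k \
 --number 20 \
 --parallel 2
```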
35 changes: 18 additions & 17 deletions docs/zh/get_started/basic_usage.md
@@ -131,24 +131,25 @@ run_task(task_cfg="config.json")

```shell
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--model-args revision=master,precision=torch.float16,device_map=auto \
--generation-config do_sample=true,temperature=0.5 \
--model Qwen/Qwen3-0.6B \
--model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}' \
--generation-config '{"do_sample":true,"temperature":0.6,"max_new_tokens":512,"chat_template_kwargs":{"enable_thinking": false}}' \
--dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
--datasets gsm8k \
--limit 10
```

### Parameter Description
- `--model-args`: Model loading parameters, comma-separated in `key=value` form. Default parameters:
- `revision`: Model version, defaults to `master`
- `precision`: Model precision, defaults to `auto`
- `device_map`: Device allocation for the model, defaults to `auto`
- `--generation-config`: Generation parameters, comma-separated in `key=value` form. Default parameters:
- `do_sample`: Whether to use sampling, defaults to `false`
- `max_length`: Maximum length, defaults to 2048
- `max_new_tokens`: Maximum generation length, defaults to 512
- `--dataset-args`: Settings for the evaluation datasets, passed in `json` format; the key is the dataset name and the value is its parameters. Note that these must correspond one-to-one with the values in the `--datasets` parameter:
- `--model-args`: Model loading parameters, passed as a JSON string:
- `revision`: Model version
- `precision`: Model precision
- `device_map`: Device allocation for the model
- `--generation-config`: Generation parameters, passed as a JSON string and parsed into a dictionary:
- `do_sample`: Whether to use sampling
- `temperature`: Generation temperature
- `max_new_tokens`: Maximum length of generated tokens
- `chat_template_kwargs`: Model inference template parameters
- `--dataset-args`: Settings for the evaluation dataset, passed as a JSON string where the key is the dataset name and the value is the parameters. Note that these need to correspond one-to-one with the values in the `--datasets` parameter:
- `few_shot_num`: Number of few-shot examples
- `few_shot_random`: Whether to randomly sample few-shot data; if not set, defaults to `true`

@@ -159,11 +160,11 @@ evalscope eval \
### Output Results

```text
+-----------------------+-----------------+
| Model                 | gsm8k           |
+=======================+=================+
| Qwen2.5-0.5B-Instruct | (gsm8k/acc) 0.2 |
+-----------------------+-----------------+
+------------+-----------+-----------------+----------+-------+---------+---------+
| Model      | Dataset   | Metric          | Subset   | Num   | Score   | Cat.0   |
+============+===========+=================+==========+=======+=========+=========+
| Qwen3-0.6B | gsm8k     | AverageAccuracy | main     | 10    | 0.3     | default |
+------------+-----------+-----------------+----------+-------+---------+---------+
```

## Model API Service Evaluation
8 changes: 8 additions & 0 deletions docs/zh/get_started/faq.md
@@ -197,6 +197,10 @@ A: Try downgrading plotly to version 5.23.0

A: Refer to https://evalscope.readthedocs.io/zh-cn/latest/get_started/parameters.html#id5 and set the `use_cache` parameter.

### Q30: The evaluation was interrupted; how do I resume it (checkpoint resumption)?

A: This is supported. Use the `use_cache` parameter to pass in the output path of the previous evaluation run; the model's prediction results and review results will then be reused.

## Model Stress Testing

### Q1: When testing ollama, once the concurrency exceeds 5, the Throughput (average tokens/s) stops increasing, even though my graphics card, CPU, memory, and I/O show no bottlenecks. What is going on?
Expand Down Expand Up @@ -298,3 +302,7 @@ A: `model`填的是模型服务框架部署的模型名称,比如OpenAI的服
### Q12: KTransformers streaming output cannot be recognized and raises ZeroDivisionError: float division by zero

A: The deployed model service does not appear to return usage information, which differs from the standard OpenAI API format; pass the `--tokenizer-path` parameter so the number of `tokens` can be calculated.

### Q13: How can I stress test a multimodal large model, and how do I provide images?

A: Currently, setting the dataset to flickr8k is supported for stress testing multimodal models; see the [stress-test parameter documentation](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/parameters.html#id5) for details.
4 changes: 2 additions & 2 deletions docs/zh/get_started/parameters.md
@@ -8,12 +8,12 @@
- Specify the local path of the model, e.g. `/path/to/model`, to load the model from local storage;
- When the evaluation target is a model API service, specify the model id of the service, e.g. `Qwen2.5-0.5B-Instruct`.
- `--model-id`: Alias of the evaluated model, used for report display. Defaults to the last part of `model`; e.g. the `model-id` of `Qwen/Qwen2.5-0.5B-Instruct` is `Qwen2.5-0.5B-Instruct`.
- `--model-args`: Model loading parameters, comma-separated in `key=value` form, parsed into a dictionary. Default parameters:
- `--model-args`: Model loading parameters, passed as comma-separated `key=value` pairs or as a JSON string, parsed into a dictionary. Default parameters:
- `revision`: Model version, defaults to `master`
- `precision`: Model precision, defaults to `torch.float16`
- `device_map`: Device allocation for the model, defaults to `auto`
- `--model-task`: Model task type, defaults to `text_generation`; options are `text_generation`, `image_generation`
- `--generation-config`: Generation parameters, comma-separated in `key=value` form or passed as a JSON string, parsed into a dictionary:
- `--generation-config`: Generation parameters, passed as comma-separated `key=value` pairs or as a JSON string, parsed into a dictionary:
- When using local model inference (Transformers-based), the following parameters are supported ([full parameter guide](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig)):
- `do_sample`: Whether to use sampling, defaults to `false`
- `max_length`: Maximum length, defaults to 2048
6 changes: 0 additions & 6 deletions evalscope/benchmarks/alpaca_eval/alpaca_eval_adapter.py
@@ -96,12 +96,6 @@ def llm_match(self, gold: Any, pred: Any, judge: LLMJudge, **kwargs) -> bool:
return None

def compute_metric(self, review_res_list: List[bool], **kwargs) -> List[dict]:
"""
compute weighted mean of the bleu score of all samples

Args:
review_res_list: [{'is_correct': 1, 'is_incorrect': 0, 'is_not_attempted': 0}, ...]
"""
# zip dict answers
res_list = [res for res in review_res_list if res is not None]
