Commit 0466ef8

Merge branch 'main' into release/0.17

2 parents: a4522b9 + 8ba2d2d

3 files changed: 5 additions, 1 deletion

README.md (1 addition, 1 deletion)

```diff
@@ -111,7 +111,7 @@ Please scan the QR code below to join our community groups:
 
 
 ## 🎉 News
-
+- 🔥 **[2025.07.18]** The model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#id4).
 - 🔥 **[2025.07.16]** Support for [τ-bench](https://github.com/sierra-research/tau-bench) has been added, enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, please refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#bench).
 - 🔥 **[2025.07.14]** Support for "Humanity's Last Exam" ([Humanity's-Last-Exam](https://modelscope.cn/datasets/cais/hle)), a highly challenging evaluation benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/llm.html#humanity-s-last-exam).
 - 🔥 **[2025.07.03]** Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. See [reference](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html) for details.
```

README_zh.md (1 addition, 0 deletions; content translated from Chinese)

```diff
@@ -98,6 +98,7 @@ EvalScope is not just an evaluation tool; it is a capable assistant on your model optimization journey
 
 ## 🎉 News
 
+- 🔥 **[2025.07.18]** Model stress testing now supports randomly generating image-text data for multimodal stress testing. For usage, see the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/examples.html#id4)
 - 🔥 **[2025.07.16]** Support for [τ-bench](https://github.com/sierra-research/tau-bench), used to evaluate the performance and reliability of AI Agents in real-world environments with dynamic user and tool interactions. For usage, see the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#bench)
 - 🔥 **[2025.07.14]** Support for "Humanity's Last Exam" ([Humanity's-Last-Exam](https://modelscope.cn/datasets/cais/hle)), a highly challenging evaluation benchmark. For usage, see the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#humanity-s-last-exam)
 - 🔥 **[2025.07.03]** Refactored Arena Mode: supports custom model battles, outputs a model leaderboard, and visualizes battle results. For usage, see the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/arena.html)
```

tests/cli/test_all.py (3 additions, 0 deletions)

```diff
@@ -57,6 +57,9 @@
     'tau_bench',
 ]
 
+# Reverse the datasets list to ensure the order is from most recent to oldest
+datasets.reverse()
+
 dataset_args={
     'mmlu': {
         'subset_list': ['elementary_mathematics', 'high_school_european_history', 'nutrition'],
```
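The reversal added to the test can be sketched in isolation. Note that `list.reverse()` mutates the list in place and returns `None`, which is why the test calls it as a standalone statement rather than assigning its result. The `datasets` values below are placeholders for illustration, not the test file's full list:

```python
# Placeholder list standing in for the test file's `datasets`
# (the real list ends with 'tau_bench', the most recently added entry).
datasets = ['mmlu', 'hle', 'tau_bench']

# reverse() mutates in place and returns None, so the newest
# entries are evaluated first when the test iterates the list.
datasets.reverse()
print(datasets)  # ['tau_bench', 'hle', 'mmlu']
```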
