Commit 0466ef8

Merge branch 'main' into release/0.17

2 parents: a4522b9 + 8ba2d2d

3 files changed: 5 additions, 1 deletion

README.md (1 addition, 1 deletion)

```diff
@@ -111,7 +111,7 @@ Please scan the QR code below to join our community groups:
 
 
 ## 🎉 News
-
+- 🔥 **[2025.07.18]** The model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#id4).
 - 🔥 **[2025.07.16]** Support for [τ-bench](https://github.com/sierra-research/tau-bench) has been added, enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, please refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#bench).
 - 🔥 **[2025.07.14]** Support for "Humanity's Last Exam" ([Humanity's-Last-Exam](https://modelscope.cn/datasets/cais/hle)), a highly challenging evaluation benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/llm.html#humanity-s-last-exam).
 - 🔥 **[2025.07.03]** Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. See [reference](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html) for details.
```

README_zh.md (1 addition, 0 deletions; content translated from Chinese)

```diff
@@ -98,6 +98,7 @@ EvalScope is not just an evaluation tool; it is a capable assistant on your model optimization journey
 
 ## 🎉 News
 
+- 🔥 **[2025.07.18]** Model stress testing now supports randomly generating image-text data for multimodal stress testing. For usage, see the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/examples.html#id4)
 - 🔥 **[2025.07.16]** Support for [τ-bench](https://github.com/sierra-research/tau-bench), used to evaluate the performance and reliability of AI Agents in real-world environments with dynamic user and tool interactions. For usage, see the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#bench)
 - 🔥 **[2025.07.14]** Support for "Humanity's Last Exam" ([Humanity's-Last-Exam](https://modelscope.cn/datasets/cais/hle)), a highly challenging evaluation benchmark. For usage, see the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#humanity-s-last-exam)
 - 🔥 **[2025.07.03]** Refactored Arena Mode: supports custom model battles, outputs a model leaderboard, and visualizes battle results. For usage, see the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/arena.html)
```

tests/cli/test_all.py (3 additions, 0 deletions)

```diff
@@ -57,6 +57,9 @@
     'tau_bench',
 ]
 
+# Reverse the datasets list to ensure the order is from most recent to oldest
+datasets.reverse()
+
 dataset_args={
     'mmlu': {
         'subset_list': ['elementary_mathematics', 'high_school_european_history', 'nutrition'],
```
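The reversal added to the test can be sketched in isolation. Note that `list.reverse()` mutates the list in place and returns `None`, which is why the test calls it as a standalone statement rather than assigning its result. The `datasets` values below are placeholders for illustration, not the test file's full list:

```python
# Placeholder list standing in for the test file's `datasets`
# (the real list ends with 'tau_bench', the most recently added entry).
datasets = ['mmlu', 'hle', 'tau_bench']

# reverse() mutates in place and returns None, so the newest
# entries are evaluated first when the test iterates the list.
datasets.reverse()
print(datasets)  # ['tau_bench', 'hle', 'mmlu']
```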
