v0.17.1

Yunnglin released this 21 Jul 02:10

029cc1c

新功能

模型压测支持随机生成图文数据，用于多模态模型压测，使用方法参考。
支持τ-bench，用于评估 AI Agent在动态用户和工具交互的实际环境中的性能和可靠性，使用方法参考。
支持“人类最后的考试”(Humanity's-Last-Exam)，这一高难度评测基准，使用方法参考。

New Features

The model stress testing now supports randomly generated image-text data for multimodal model stress testing. For usage instructions, see here.
Support for τ-bench has been added, enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, see here.
Support for "Humanity's Last Exam", a high-difficulty evaluation benchmark, has been added. For usage instructions, see here.

What's Changed

[Feat] add perf sleep interval by @Yunnglin in #699
[Benchmark] Add HLE by @Yunnglin in #705
[Benchmark] Add tau-bench by @Yunnglin in #711
[Feature] Update perf random generation by @Yunnglin in #713
[Fix] Eval parser: humaneval, mmlu by @Yunnglin in #718

Full Changelog: v0.17.0...v0.17.1

Contributors

Yunnglin

Assets 2