v0.17.1
 
New Features
- Model stress testing now supports randomly generated image-text data, which can be used to stress test multimodal models. See the documentation for usage instructions.
- Added support for τ-bench, which evaluates the performance and reliability of AI agents in realistic environments with dynamic user and tool interactions. See the documentation for usage instructions.
- Added support for "Humanity's Last Exam" (HLE), a high-difficulty evaluation benchmark. See the documentation for usage instructions.
 
What's Changed
- [Feat] add perf sleep interval by @Yunnglin in #699
- [Benchmark] Add HLE by @Yunnglin in #705
- [Benchmark] Add tau-bench by @Yunnglin in #711
- [Feature] Update perf random generation by @Yunnglin in #713
- [Fix] Eval parser: humaneval, mmlu by @Yunnglin in #718
 
Full Changelog: v0.17.0...v0.17.1