Releases: modelscope/evalscope
v1.1.1
Updates
- Benchmark Extensions (see the example sketch after this list)
- Vision/Multimodal Evaluation: HallusionBench, POPE, PloyMath, MathVerse, MathVision, SimpleVQA, SeedBench2_plus
- Document Understanding: OmniDocBench
- NLP Tasks: CoNLL2003, NER Task Collection (9 tasks), AA-LCR
- Logic Reasoning: VisuLogic, ZeroBench
- Feature Enhancements
- Performance benchmark optimization: the perf stress-testing tool now produces results comparable to vLLM benchmarking; see the documentation (a perf sketch follows after this list)
- Enhanced the sandbox environment for code evaluation, supporting both local and remote execution modes for improved security and flexibility; see the documentation
- Performance and Stability Improvements
- Fixed prompt tokens calculation issues in datasets
- Added heartbeat detection mechanism during evaluation process
- Fixed GSM8K accuracy calculation and enhanced logging
- System Requirements Update
- Python Version Requirement: Upgraded to ≥3.10 (no dependency updates)
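As a quick way to try one of the newly added benchmarks, the sketch below uses the `TaskConfig`/`run_task` entry point from the evalscope README; the dataset registry name (`conll2003`) and the model id are placeholders to verify against the supported-datasets list.

```python
# Minimal sketch: run one of the newly added benchmarks against a small model.
# The dataset registry name ('conll2003') and the model id are assumptions --
# check the supported-datasets documentation for the exact names.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # any ModelScope/HF model id for a smoke test
    datasets=['conll2003'],              # assumed registry name for the CoNLL2003 benchmark
    limit=10,                            # evaluate only a handful of samples first
)

run_task(task_cfg=task_cfg)
```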
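For the perf improvements, a minimal stress-testing sketch is shown below. It assumes the `run_perf_benchmark` entry point described in the perf documentation and an OpenAI-compatible endpoint; the URL, model name, and request counts are placeholders, and the dict keys should be checked against the documented argument list.

```python
# Hedged sketch: stress-test an OpenAI-compatible endpoint with evalscope perf.
# The entry point and argument names follow my reading of the perf documentation;
# the URL, model name, and request counts are placeholders.
from evalscope.perf.main import run_perf_benchmark

task_cfg = {
    'url': 'http://127.0.0.1:8000/v1/chat/completions',  # endpoint under test (placeholder)
    'api': 'openai',
    'model': 'qwen2.5-7b-instruct',                      # served model name (placeholder)
    'dataset': 'openqa',                                 # built-in prompt source (assumed name)
    'number': 100,                                       # total requests
    'parallel': 10,                                      # concurrent requests
    'stream': True,                                      # stream to measure time-to-first-token and throughput
}

run_perf_benchmark(task_cfg)
```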
What's Changed
- Datasets: prompt tokens count bug fixed by @Aktsvigun in #873
- [Benchmark] Add HallusionBench and POPE by @Yunnglin in #875
- [Feature] Add inflight process by @Yunnglin in #880
- [Benchmark] Add PloyMath by @Yunnglin in #882
- add math_verse math_vision simple_vqa by @mushenL in #881
- fix: update Python version requirement to >=3.10 by @nowang6 in #890
- [Feature] Update perf thoughput by @Yunnglin in #894
- [Feature] Add extra query by @Yunnglin in #895
- add AA-LCR benchmark to evalscope by @sophies-cerebras in #897
- [feature] add `--visualizer` parameter instead of --XXX_api_key in stress test by @ShaohonChen in #878
- [Feature] Add sandbox doc by @Yunnglin in #899
- fix gsm8k acc and add more log by @ms-cs in #903
- [Doc] Update writing by @Yunnglin in #904
- [Benchmark] Add OmniDocBench by @Yunnglin in #908
- [Benchmark] Add CoNLL2003 benchmark by @penguinwang96825 in #912
- add seed_bench_2_plus,visu_logic_adapter,zerobench by @mushenL in #916
- [Benchmark] Add NER suite by @penguinwang96825 in #921
- [Feature] Add pred heartbeat by @ms-cs in #922
New Contributors
- @Aktsvigun made their first contribution in #873
- @nowang6 made their first contribution in #890
- @sophies-cerebras made their first contribution in #897
- @ms-cs made their first contribution in #903
- @penguinwang96825 made their first contribution in #912
Full Changelog: v1.1.0...v1.1.1
v1.1.0
Update
- The platform now supports OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, BLINK, and other multimodal evaluation benchmarks. For the full list of supported datasets, please refer to the documentation.
- Added best-practice guides for evaluating the Qwen3-Omni and Qwen3-VL models.
- Installation via `pyproject.toml` is now supported.
What's Changed
- [Doc] Add qwen omni doc by @Yunnglin in #854
- [Fix] Fix bfcl_v3 validation by @Yunnglin in #858
- [Feature] Add pyproject.toml by @Yunnglin in #857
- [Benchmark] Add ChartQA and BLINK by @Yunnglin in #861
- [Benchmark] Add DocVQA and InfoVQA by @Yunnglin in #862
- [Fix] transformers import by @Yunnglin in #865
- [Benchmark] Add OCRBench and OCRBench-v2 by @Yunnglin in #869
- [Fix] None string error by @Yunnglin in #871
Full Changelog: v1.0.2...v1.1.0
v1.0.2
New Features
- Code evaluation benchmarks (HumanEval, LiveCodeBench) now support execution in a sandbox environment. To use this feature, you must first install ms-enclave (see the sketch after this list).
- Added support for various image-text multimodal evaluation benchmarks such as RealWorldQA, AI2D, MMStar, MMBench, OmniBench, as well as pure text evaluation benchmarks like Multi-IF, HealthBench, and AMC.
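A minimal sketch of how a sandboxed code benchmark might be launched is shown below. The dataset registry name is an assumption, and the sandbox-specific switches are deliberately not guessed here; they are described in the sandbox documentation.

```python
# Hedged sketch: run a code benchmark whose generated programs execute in the
# sandbox environment. Install the sandbox backend first:
#   pip install ms-enclave
# The dataset registry name ('humaneval') is an assumption; the sandbox-specific
# options are configured as described in the documentation and are intentionally
# not guessed here.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-Coder-1.5B-Instruct',  # any code-capable model for a smoke test
    datasets=['humaneval'],                    # assumed registry name
    limit=5,                                   # keep the first run small
)

run_task(task_cfg=task_cfg)
```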
What's Changed
- [Benchmark] add Multi-IF by @Yunnglin in #822
- Add ai2d_adapter and real_world_qa_adapter by @mushenL in #824
- [Benchmark] Add health bench by @Yunnglin in #826
- fix: make _temp_run top-level to resolve M1 pickle error by @MemoryIt in #827
- [Fix] vlm tokenize by @Yunnglin in #829
- [Doc] update qwen next doc by @Yunnglin in #832
- [Fix] fix bfcl-v3 score by @Yunnglin in #833
- [Benchmark] Add MMBench and MMStar by @mushenL in #834
- [Benchmark] Add Omnibench by @Yunnglin in #837
- [Fix] Fix bfcl validation error by @Yunnglin in #838
- [Feature] add docker sandbox by @Yunnglin in #835
- [Fix] Fix thread pool error by @Yunnglin in #841
- [Benchmark] Add amc23 and OlympiadBench by @mushenL in #840
- [Benchmark] add minerva-math by @Yunnglin in #846
Full Changelog: v1.0.1...v1.0.2
v1.0.1
Update
- Evaluation of vision-language multimodal models is now supported, including MathVista and MMMU. For more information on the supported datasets, please refer to the documentation (see the sketch after this list).
- Image editing task evaluation is now supported, with the GEdit-Bench evaluation benchmark available. For usage instructions, please refer to this guide.
- The core dependency on `torch` has been removed; it is now an optional dependency under the `rag` and `aigc` extras.
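Since `torch` is no longer a core dependency, one way to evaluate the new vision-language benchmarks without installing heavy extras is to point evalscope at an OpenAI-compatible endpoint, as sketched below. The `eval_type`, `api_url`, and `api_key` parameter names reflect my reading of the documentation, and the `math_vista` registry name is an assumption; verify both against your installed version.

```python
# Hedged sketch: evaluate a served vision-language model on MathVista.
# The parameter names (eval_type, api_url, api_key) and the dataset registry
# name ('math_vista') are assumptions based on the documentation; adjust to
# match your installed version.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen2.5-vl-7b-instruct',                       # served model name (placeholder)
    eval_type='service',                                  # evaluate an API service, not local weights
    api_url='http://127.0.0.1:8801/v1/chat/completions',  # OpenAI-compatible endpoint (placeholder)
    api_key='EMPTY',
    datasets=['math_vista'],                              # assumed registry name for MathVista
    limit=20,
)

run_task(task_cfg=task_cfg)
```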
What's Changed
- [DOC] Update 1.0 custom doc by @Yunnglin in #793
- [Fix] Fix reasoning content by @Yunnglin in #797
- [Fix] Change old collection to new version by @Yunnglin in #798
- Reduce dataset loading time by @mmdbhs in #805
- [Fix] fix reranker pad token and embedding max tokens by @Yunnglin in #806
- [Feature] Add image edit task by @Yunnglin in #804
- [Benchmark] Add mmmu by @Yunnglin in #812
- add math_vista by @mushenL in #813
- [Fix] tau-bench zero scores by @Yunnglin in #814
- [Fix] collection eval by @Yunnglin in #816
- [Feature] add vlm adapter by @Yunnglin in #817
- [Feature] remove torch from framework by @Yunnglin in #818
- add MMMU_Pro by @mushenL in #819
Full Changelog: v1.0.0...v1.0.1
v1.0.0
New version
Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under evalscope/api. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations. For the incompatible (breaking) changes, please refer to the documentation.
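To make the registry-based design concrete, here is a language-level sketch of the pattern with purely hypothetical names (this is not the actual evalscope/api code): components register themselves under a string key, and the evaluator looks them up by name instead of importing them directly.

```python
# Illustrative sketch of a registry-based design (hypothetical names; this is
# NOT the actual evalscope/api code, only the pattern it describes).
from typing import Callable, Dict, Type

BENCHMARK_REGISTRY: Dict[str, Type] = {}

def register_benchmark(name: str) -> Callable[[Type], Type]:
    """Class decorator that records a benchmark adapter under a string key."""
    def decorator(cls: Type) -> Type:
        BENCHMARK_REGISTRY[name] = cls
        return cls
    return decorator

@register_benchmark('my_benchmark')
class MyBenchmarkAdapter:
    """Hypothetical adapter: loads samples and scores model outputs."""
    def load(self):
        return [{'input': '1 + 1 = ?', 'target': '2'}]

    def score(self, prediction: str, target: str) -> float:
        return float(prediction.strip() == target)

# The evaluator resolves components by name, so adding a benchmark only
# requires registering a new adapter, not modifying the core evaluator.
adapter = BENCHMARK_REGISTRY['my_benchmark']()
print(adapter.score('2', adapter.load()[0]['target']))
```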
What's Changed
- [Feature] Add image edit evaluation by @Yunnglin in #725
- [Doc] add tau-bench doc by @Yunnglin in #730
- [Fix] ragas local model by @Yunnglin in #732
- [Doc] Add qwen-code best practice doc by @Yunnglin in #734
- Fix: Incorrect keyword argument in call to csv_to_list() by @Zhuzhenghao in #745
- Add SECURITY.md by @wangxingjun778 in #750
- Update SECURITY.md by @wangxingjun778 in #752
- updata faq file by @mushenL in #744
- [Refactor] v1.0 by @Yunnglin in #739
New Contributors
- @Zhuzhenghao made their first contribution in #745
Full Changelog: v0.17.1...v1.0.0
v0.17.1
New Features
- Model stress testing now supports randomly generated image-text data for benchmarking multimodal models. For usage instructions, see here (a perf sketch follows after this list).
- Support for τ-bench has been added, enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, see here.
- Support for "Humanity's Last Exam", a high-difficulty evaluation benchmark, has been added. For usage instructions, see here.
What's Changed
- [Feat] add perf sleep interval by @Yunnglin in #699
- [Benchmark] Add HLE by @Yunnglin in #705
- [Benchmark] Add tau-bench by @Yunnglin in #711
- [Feature] Update perf random generation by @Yunnglin in #713
- [Fix] Eval parser: humaneval, mmlu by @Yunnglin in #718
Full Changelog: v0.17.0...v0.17.1
v0.17.0
New Features
- Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. See reference for more details.
- Optimized custom dataset evaluation: now supports evaluation without reference answers. Enhanced LLM judge functionality with two built-in modes: "direct scoring without a reference answer" and "checking whether an answer is consistent with the reference answer." See the reference documentation for more details (a configuration sketch follows after this list).
- Refactored result visualization: now supports side-by-side comparison of two models' evaluation results, as well as visualization of Arena mode results. See the reference documentation.
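A configuration sketch for reference-free evaluation with an LLM judge follows. Every parameter name here, `general_qa`, `dataset_args`, and `judge_model_args`, is an assumption to check against the custom-dataset and judge documentation.

```python
# Hedged sketch: evaluate a custom QA file without reference answers, letting an
# LLM judge score the outputs directly. The 'general_qa' registry name and the
# dataset_args / judge_model_args parameter names are assumptions based on my
# reading of the docs -- verify against your installed version.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',
    datasets=['general_qa'],                                 # assumed name of the custom-QA adapter
    dataset_args={
        'general_qa': {'local_path': 'data/my_qa.jsonl'},    # your own prompts (placeholder path)
    },
    judge_model_args={                                       # assumed judge configuration block
        'model_id': 'qwen2.5-72b-instruct',
        'api_url': 'http://127.0.0.1:8801/v1/chat/completions',
        'api_key': 'EMPTY',
    },
)

run_task(task_cfg=task_cfg)
```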
What's Changed
- [Feature] Add CI test workflow by @Yunnglin in #671
- [Bug] fix load local data by @Yunnglin in #673
- [Refector] visualization by @Yunnglin in #661
- [BUG] fix perf zero error by @Yunnglin in #690
- [Refactor] Refact arena mode by @Yunnglin in #677
Full Changelog: v0.16.3...v0.17.0
v0.16.3
New Features
- Introduced support for the BFCL-v3 evaluation benchmark, designed to assess the model's function-calling capabilities across diverse scenarios. For more details, refer to the documentation.
- Documentation updates include: Supported Datasets, Custom Model Evaluation, and Adding Evaluation Benchmarks.
What's Changed
- add needle show score params by @Yunnglin in #620
- fix clip request by @Yunnglin in #621
- fix logit register by @Yunnglin in #624
- Fix eval errors by @Yunnglin in #627
- [Fix] cross encoder args by @Yunnglin in #628
- [Doc]Add t2i best practice by @Yunnglin in #631
- Fix in benchmark, when the number of dataset index is less than parallel , the parallel will be insufficient. by @xcode03 in #634
- fix super gpqa error by @Yunnglin in #639
- add repetition penalty by @Yunnglin in #640
- [Feature] add overall metrics log by @Yunnglin in #653
- [Doc] Update benchmark documents by @Yunnglin in #650
- [Doc] Update the default value of max_tokens for model API errors by @Su-yj in #659
- make sure the stream parameter is included in the request_json by @Su-yj in #663
- [Benchmark] Add BFCL-v3 by @Yunnglin in #657
- [Refector] t2i metrics init by @Yunnglin in #660
- [Doc] Support general mcq jsonl, update new benchmark, model doc by @Yunnglin in #667
Full Changelog: v0.16.1...v0.16.2
v0.16.1
New Features
- Supports the `--analysis-report` boolean parameter, which uses the judge model to generate an analysis report containing interpretation of, and recommendations based on, the model's evaluation results.
- Added support for the Needle-in-a-Haystack test. Specify `needle_haystack` to run it; a corresponding heatmap is generated in the `outputs/reports` folder to visualize the model's performance. For usage, refer to this guide.
- Added support for two long-document evaluation benchmarks, DocMath and FRAMES. Please check the documentation for usage considerations.
- The `--limit` parameter now supports a float between 0 and 1, representing the percentage of the dataset to be evaluated (a brief sketch follows after this list).
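Both of the last two options can be combined in a single task, as sketched below; `needle_haystack` and the float-valued `limit` come straight from the notes above, while the model id is a placeholder.

```python
# Minimal sketch: run the Needle-in-a-Haystack test on 10% of the samples.
# 'needle_haystack' and the float-valued limit come from the release notes;
# the model id is a placeholder, and context-length options are left at their
# defaults (see the guide linked above).
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',
    datasets=['needle_haystack'],   # the heatmap is written under outputs/reports
    limit=0.1,                      # float in (0, 1]: evaluate 10% of the dataset
)

run_task(task_cfg=task_cfg)
```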
What's Changed
- [DOC] Update perf doc by @Yunnglin in #576
- Fix reranker model args by @Yunnglin in #580
- fix perf tpop by @Yunnglin in #587
- Refactor app & add report analysis by @Yunnglin in #591
- fix toolbench message by @Yunnglin in #592
- Support swanlab workspace env by @xcode03 in #600
- fix tool bench rouge error by @Yunnglin in #604
- Compat mteb v1.38 by @Yunnglin in #608
- add Frames and other long doc benchmarks by @Yunnglin in #609
- Add float limit (percentage) by @Yunnglin in #617
New Contributors
- @xcode03 made their first contribution in #600
Full Changelog: v0.16.0...v0.16.1
v0.16.0
New Features
- Supports performance stress testing of model services with multiple concurrency settings and outputs a well-formatted performance report. See the example.
- Supports the ToolBench-Static dataset to evaluate the tool invocation capabilities of models. Refer to the user guide.
- Supports DROP and Winogrande evaluation benchmarks to assess the reasoning capabilities of models.
- Supports `use_cache` to reuse evaluation results (a brief sketch follows after this list).
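A brief sketch of result reuse follows; the assumption that `use_cache` takes the path of a previous output directory, and the `drop`/`winogrande` registry names, should be checked against the documentation.

```python
# Hedged sketch: reuse results from an earlier run instead of re-querying the
# model. The assumption that use_cache takes the path of a previous outputs
# directory should be checked against the documentation.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',
    datasets=['drop', 'winogrande'],      # benchmarks added in this release (assumed registry names)
    use_cache='outputs/20250101_000000',  # placeholder path to a previous run
)

run_task(task_cfg=task_cfg)
```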
What's Changed
- fix preprocess args by @Yunnglin in #537
- fix report generation encoding to support chinese on windows by @antigone660 in #534
- fix config extension check by @Yunnglin in #544
- handle error during evaluation by @Yunnglin in #541
- Bug Fix: Allow Custom SwanLab Project Name by @ShaohonChen in #549
- support args json format by @Yunnglin in #551
- Add drop and winogrande by @Yunnglin in #546
- fix issue docs by @xiaoping378 in #562
- Add cache reuse for the judge model by @xh3204 in #566
- Support Perf multi parallel and rich output by @Yunnglin in #564
- Update review cache logic and doc by @Yunnglin in #574
- Refactor ToolBench eval by @Yunnglin in #556
New Contributors
- @antigone660 made their first contribution in #534
- @xiaoping378 made their first contribution in #562
- @xh3204 made their first contribution in #566
Full Changelog: v0.15.1...v0.16.0