Releases: modelscope/evalscope
v1.1.1
Updates
- Benchmark Extensions (see the example sketch after this list)
- Vision/Multimodal Evaluation: HallusionBench, POPE, PloyMath, MathVerse, MathVision, SimpleVQA, SeedBench2_plus
- Document Understanding: OmniDocBench
- NLP Tasks: CoNLL2003, NER Task Collection (9 tasks), AA-LCR
- Logic Reasoning: VisuLogic, ZeroBench
- Feature Enhancements
- Performance benchmark optimization: the perf stress-testing tool now produces results comparable to vLLM benchmarking; see the documentation (a perf sketch follows after this list)
- Enhanced the sandbox environment for code evaluation, supporting both local and remote execution modes for improved security and flexibility; see the documentation
- Performance and Stability Improvements
- Fixed prompt tokens calculation issues in datasets
- Added heartbeat detection mechanism during evaluation process
- Fixed GSM8K accuracy calculation and enhanced logging
- System Requirements Update
- Python Version Requirement: Upgraded to ≥3.10 (no dependency updates)
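As a quick way to try one of the newly added benchmarks, the sketch below uses the `TaskConfig`/`run_task` entry point from the evalscope README; the dataset registry name (`conll2003`) and the model id are placeholders to verify against the supported-datasets list.

```python
# Minimal sketch: run one of the newly added benchmarks against a small model.
# The dataset registry name ('conll2003') and the model id are assumptions --
# check the supported-datasets documentation for the exact names.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # any ModelScope/HF model id for a smoke test
    datasets=['conll2003'],              # assumed registry name for the CoNLL2003 benchmark
    limit=10,                            # evaluate only a handful of samples first
)

run_task(task_cfg=task_cfg)
```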
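For the perf improvements, a minimal stress-testing sketch is shown below. It assumes the `run_perf_benchmark` entry point described in the perf documentation and an OpenAI-compatible endpoint; the URL, model name, and request counts are placeholders, and the dict keys should be checked against the documented argument list.

```python
# Hedged sketch: stress-test an OpenAI-compatible endpoint with evalscope perf.
# The entry point and argument names follow my reading of the perf documentation;
# the URL, model name, and request counts are placeholders.
from evalscope.perf.main import run_perf_benchmark

task_cfg = {
    'url': 'http://127.0.0.1:8000/v1/chat/completions',  # endpoint under test (placeholder)
    'api': 'openai',
    'model': 'qwen2.5-7b-instruct',                      # served model name (placeholder)
    'dataset': 'openqa',                                 # built-in prompt source (assumed name)
    'number': 100,                                       # total requests
    'parallel': 10,                                      # concurrent requests
    'stream': True,                                      # stream to measure time-to-first-token and throughput
}

run_perf_benchmark(task_cfg)
```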
What's Changed
- Datasets: prompt tokens count bug fixed by @Aktsvigun in #873
- [Benchmark] Add HallusionBench and POPE by @Yunnglin in #875
- [Feature] Add inflight process by @Yunnglin in #880
- [Benchmark] Add PloyMath by @Yunnglin in #882
- add math_verse math_vision simple_vqa by @mushenL in #881
- fix: update Python version requirement to >=3.10 by @nowang6 in #890
- [Feature] Update perf thoughput by @Yunnglin in #894
- [Feature] Add extra query by @Yunnglin in #895
- add AA-LCR benchmark to evalscope by @sophies-cerebras in #897
- [feature] add `--visualizer` parameter instead of --XXX_api_key in stress test by @ShaohonChen in #878
- [Feature] Add sandbox doc by @Yunnglin in #899
- fix gsm8k acc and add more log by @ms-cs in #903
- [Doc] Update writing by @Yunnglin in #904
- [Benchmark] Add OmniDocBench by @Yunnglin in #908
- [Benchmark] Add CoNLL2003 benchmark by @penguinwang96825 in #912
- add seed_bench_2_plus,visu_logic_adapter,zerobench by @mushenL in #916
- [Benchmark] Add NER suite by @penguinwang96825 in #921
- [Feature] Add pred heartbeat by @ms-cs in #922
New Contributors
- @Aktsvigun made their first contribution in #873
- @nowang6 made their first contribution in #890
- @sophies-cerebras made their first contribution in #897
- @ms-cs made their first contribution in #903
- @penguinwang96825 made their first contribution in #912
Full Changelog: v1.1.0...v1.1.1
v1.1.0
Update
- The platform now supports OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, BLINK, and other multimodal evaluation benchmarks. For the full list of supported datasets, please refer to the documentation.
- Added best-practice guides for evaluating the Qwen3-Omni and Qwen3-VL models.
- Installation via `pyproject.toml` is now supported.
What's Changed
- [Doc] Add qwen omni doc by @Yunnglin in #854
- [Fix] Fix bfcl_v3 validation by @Yunnglin in #858
- [Feature] Add pyproject.toml by @Yunnglin in #857
- [Benchmark] Add ChartQA and BLINK by @Yunnglin in #861
- [Benchmark] Add DocVQA and InfoVQA by @Yunnglin in #862
- [Fix] transformers import by @Yunnglin in #865
- [Benchmark] Add OCRBench and OCRBench-v2 by @Yunnglin in #869
- [Fix] None string error by @Yunnglin in #871
Full Changelog: v1.0.2...v1.1.0
v1.0.2
New Features
- Code evaluation benchmarks (HumanEval, LiveCodeBench) now support execution in a sandbox environment. To use this feature, you must first install ms-enclave (see the sketch after this list).
- Added support for various image-text multimodal evaluation benchmarks such as RealWorldQA, AI2D, MMStar, MMBench, OmniBench, as well as pure text evaluation benchmarks like Multi-IF, HealthBench, and AMC.
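A minimal sketch of how a sandboxed code benchmark might be launched is shown below. The dataset registry name is an assumption, and the sandbox-specific switches are deliberately not guessed here; they are described in the sandbox documentation.

```python
# Hedged sketch: run a code benchmark whose generated programs execute in the
# sandbox environment. Install the sandbox backend first:
#   pip install ms-enclave
# The dataset registry name ('humaneval') is an assumption; the sandbox-specific
# options are configured as described in the documentation and are intentionally
# not guessed here.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-Coder-1.5B-Instruct',  # any code-capable model for a smoke test
    datasets=['humaneval'],                    # assumed registry name
    limit=5,                                   # keep the first run small
)

run_task(task_cfg=task_cfg)
```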
What's Changed
- [Benchmark] add Multi-IF by @Yunnglin in #822
- Add ai2d_adapter and real_world_qa_adapter by @mushenL in #824
- [Benchmark] Add health bench by @Yunnglin in #826
- fix: make _temp_run top-level to resolve M1 pickle error by @MemoryIt in #827
- [Fix] vlm tokenize by @Yunnglin in #829
- [Doc] update qwen next doc by @Yunnglin in #832
- [Fix] fix bfcl-v3 score by @Yunnglin in #833
- [Benchmark] Add MMBench and MMStar by @mushenL in #834
- [Benchmark] Add Omnibench by @Yunnglin in #837
- [Fix] Fix bfcl validation error by @Yunnglin in #838
- [Feature] add docker sandbox by @Yunnglin in #835
- [Fix] Fix thread pool error by @Yunnglin in #841
- [Benchmark] Add amc23 and OlympiadBench by @mushenL in #840
- [Benchmark] add minerva-math by @Yunnglin in #846
Full Changelog: v1.0.1...v1.0.2
v1.0.1
Update
- Evaluation of vision-language multimodal models is now supported, including MathVista and MMMU. For more information on the supported datasets, please refer to the documentation (see the sketch after this list).
- Image editing task evaluation is now supported, with the GEdit-Bench evaluation benchmark available. For usage instructions, please refer to this guide.
- The core dependency on `torch` has been removed; it is now an optional dependency under the `rag` and `aigc` extras.
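Since `torch` is no longer a core dependency, one way to evaluate the new vision-language benchmarks without installing heavy extras is to point evalscope at an OpenAI-compatible endpoint, as sketched below. The `eval_type`, `api_url`, and `api_key` parameter names reflect my reading of the documentation, and the `math_vista` registry name is an assumption; verify both against your installed version.

```python
# Hedged sketch: evaluate a served vision-language model on MathVista.
# The parameter names (eval_type, api_url, api_key) and the dataset registry
# name ('math_vista') are assumptions based on the documentation; adjust to
# match your installed version.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen2.5-vl-7b-instruct',                       # served model name (placeholder)
    eval_type='service',                                  # evaluate an API service, not local weights
    api_url='http://127.0.0.1:8801/v1/chat/completions',  # OpenAI-compatible endpoint (placeholder)
    api_key='EMPTY',
    datasets=['math_vista'],                              # assumed registry name for MathVista
    limit=20,
)

run_task(task_cfg=task_cfg)
```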
What's Changed
- [DOC] Update 1.0 custom doc by @Yunnglin in #793
- [Fix] Fix reasoning content by @Yunnglin in #797
- [Fix] Change old collection to new version by @Yunnglin in #798
- Reduce dataset loading time by @mmdbhs in #805
- [Fix] fix reranker pad token and embedding max tokens by @Yunnglin in #806
- [Feature] Add image edit task by @Yunnglin in #804
- [Benchmark] Add mmmu by @Yunnglin in #812
- add math_vista by @mushenL in #813
- [Fix] tau-bench zero scores by @Yunnglin in #814
- [Fix] collection eval by @Yunnglin in #816
- [Feature] add vlm adapter by @Yunnglin in #817
- [Feature] remove torch from framework by @Yunnglin in #818
- add MMMU_Pro by @mushenL in #819
Full Changelog: v1.0.0...v1.0.1
v1.0.0
New version
Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under evalscope/api. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations. For the incompatible (breaking) changes, please refer to the documentation.
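To make the registry-based design concrete, here is a language-level sketch of the pattern with purely hypothetical names (this is not the actual evalscope/api code): components register themselves under a string key, and the evaluator looks them up by name instead of importing them directly.

```python
# Illustrative sketch of a registry-based design (hypothetical names; this is
# NOT the actual evalscope/api code, only the pattern it describes).
from typing import Callable, Dict, Type

BENCHMARK_REGISTRY: Dict[str, Type] = {}

def register_benchmark(name: str) -> Callable[[Type], Type]:
    """Class decorator that records a benchmark adapter under a string key."""
    def decorator(cls: Type) -> Type:
        BENCHMARK_REGISTRY[name] = cls
        return cls
    return decorator

@register_benchmark('my_benchmark')
class MyBenchmarkAdapter:
    """Hypothetical adapter: loads samples and scores model outputs."""
    def load(self):
        return [{'input': '1 + 1 = ?', 'target': '2'}]

    def score(self, prediction: str, target: str) -> float:
        return float(prediction.strip() == target)

# The evaluator resolves components by name, so adding a benchmark only
# requires registering a new adapter, not modifying the core evaluator.
adapter = BENCHMARK_REGISTRY['my_benchmark']()
print(adapter.score('2', adapter.load()[0]['target']))
```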
What's Changed
- [Feature] Add image edit evaluation by @Yunnglin in #725
- [Doc] add tau-bench doc by @Yunnglin in #730
- [Fix] ragas local model by @Yunnglin in #732
- [Doc] Add qwen-code best practice doc by @Yunnglin in #734
- Fix: Incorrect keyword argument in call to csv_to_list() by @Zhuzhenghao in #745
- Add SECURITY.md by @wangxingjun778 in #750
- Update SECURITY.md by @wangxingjun778 in #752
- updata faq file by @mushenL in #744
- [Refactor] v1.0 by @Yunnglin in #739
New Contributors
- @Zhuzhenghao made their first contribution in #745
Full Changelog: v0.17.1...v1.0.0
v0.17.1
New Features
- Model stress testing now supports randomly generated image-text data for benchmarking multimodal models. For usage instructions, see here (a perf sketch follows after this list).
- Support for τ-bench has been added, enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, see here.
- Support for "Humanity's Last Exam", a high-difficulty evaluation benchmark, has been added. For usage instructions, see here.
What's Changed
- [Feat] add perf sleep interval by @Yunnglin in #699
- [Benchmark] Add HLE by @Yunnglin in #705
- [Benchmark] Add tau-bench by @Yunnglin in #711
- [Feature] Update perf random generation by @Yunnglin in #713
- [Fix] Eval parser: humaneval, mmlu by @Yunnglin in #718
Full Changelog: v0.17.0...v0.17.1
v0.17.0
New Features
- Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. See reference for more details.
- Optimized custom dataset evaluation: now supports evaluation without reference answers. Enhanced LLM judge functionality with two built-in modes: "direct scoring without a reference answer" and "checking whether an answer is consistent with the reference answer." See the reference documentation for more details (a configuration sketch follows after this list).
- Refactored result visualization: now supports side-by-side comparison of two models' evaluation results, as well as visualization of Arena mode results. See the reference documentation.
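A configuration sketch for reference-free evaluation with an LLM judge follows. Every parameter name here, `general_qa`, `dataset_args`, and `judge_model_args`, is an assumption to check against the custom-dataset and judge documentation.

```python
# Hedged sketch: evaluate a custom QA file without reference answers, letting an
# LLM judge score the outputs directly. The 'general_qa' registry name and the
# dataset_args / judge_model_args parameter names are assumptions based on my
# reading of the docs -- verify against your installed version.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',
    datasets=['general_qa'],                                 # assumed name of the custom-QA adapter
    dataset_args={
        'general_qa': {'local_path': 'data/my_qa.jsonl'},    # your own prompts (placeholder path)
    },
    judge_model_args={                                       # assumed judge configuration block
        'model_id': 'qwen2.5-72b-instruct',
        'api_url': 'http://127.0.0.1:8801/v1/chat/completions',
        'api_key': 'EMPTY',
    },
)

run_task(task_cfg=task_cfg)
```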
What's Changed
- [Feature] Add CI test workflow by @Yunnglin in #671
- [Bug] fix load local data by @Yunnglin in #673
- [Refector] visualization by @Yunnglin in #661
- [BUG] fix perf zero error by @Yunnglin in #690
- [Refactor] Refact arena mode by @Yunnglin in #677
Full Changelog: v0.16.3...v0.17.0
v0.16.3
New Features
- Introduced support for the BFCL-v3 evaluation benchmark, designed to assess the model's function-calling capabilities across diverse scenarios. For more details, refer to the documentation.
- Documentation updates include: Supported Datasets, Custom Model Evaluation, and Adding Evaluation Benchmarks.
What's Changed
- add needle show score params by @Yunnglin in #620
- fix clip request by @Yunnglin in #621
- fix logit register by @Yunnglin in #624
- Fix eval errors by @Yunnglin in #627
- [Fix] cross encoder args by @Yunnglin in #628
- [Doc]Add t2i best practice by @Yunnglin in #631
- Fix in benchmark, when the number of dataset index is less than parallel , the parallel will be insufficient. by @xcode03 in #634
- fix super gpqa error by @Yunnglin in #639
- add repetition penalty by @Yunnglin in #640
- [Feature] add overall metrics log by @Yunnglin in #653
- [Doc] Update benchmark documents by @Yunnglin in #650
- [Doc] Update the default value of max_tokens for model API errors by @Su-yj in #659
- make sure the stream parameter is included in the request_json by @Su-yj in #663
- [Benchmark] Add BFCL-v3 by @Yunnglin in #657
- [Refector] t2i metrics init by @Yunnglin in #660
- [Doc] Support general mcq jsonl, update new benchmark, model doc by @Yunnglin in #667
Full Changelog: v0.16.1...v0.16.2
v0.16.1
New Features
- Supports the `--analysis-report` boolean parameter, which uses the judge model to generate an analysis report containing interpretation of, and recommendations based on, the model's evaluation results.
- Added support for the Needle-in-a-Haystack test. Specify `needle_haystack` to run it; a corresponding heatmap is generated in the `outputs/reports` folder to visualize the model's performance. For usage, refer to this guide.
- Added support for two long-document evaluation benchmarks, DocMath and FRAMES. Please check the documentation for usage considerations.
- The `--limit` parameter now supports a float between 0 and 1, representing the percentage of the dataset to be evaluated (a brief sketch follows after this list).
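Both of the last two options can be combined in a single task, as sketched below; `needle_haystack` and the float-valued `limit` come straight from the notes above, while the model id is a placeholder.

```python
# Minimal sketch: run the Needle-in-a-Haystack test on 10% of the samples.
# 'needle_haystack' and the float-valued limit come from the release notes;
# the model id is a placeholder, and context-length options are left at their
# defaults (see the guide linked above).
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',
    datasets=['needle_haystack'],   # the heatmap is written under outputs/reports
    limit=0.1,                      # float in (0, 1]: evaluate 10% of the dataset
)

run_task(task_cfg=task_cfg)
```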
What's Changed
- [DOC] Update perf doc by @Yunnglin in #576
- Fix reranker model args by @Yunnglin in #580
- fix perf tpop by @Yunnglin in #587
- Refactor app & add report analysis by @Yunnglin in #591
- fix toolbench message by @Yunnglin in #592
- Support swanlab workspace env by @xcode03 in #600
- fix tool bench rouge error by @Yunnglin in #604
- Compat mteb v1.38 by @Yunnglin in #608
- add Frames and other long doc benchmarks by @Yunnglin in #609
- Add float limit (percentage) by @Yunnglin in #617
New Contributors
- @xcode03 made their first contribution in #600
Full Changelog: v0.16.0...v0.16.1
v0.16.0
New Features
- Supports performance stress testing of model services with multiple concurrency settings and outputs a well-formatted performance report. See the example.
- Supports the ToolBench-Static dataset to evaluate the tool invocation capabilities of models. Refer to the user guide.
- Supports DROP and Winogrande evaluation benchmarks to assess the reasoning capabilities of models.
- Supports `use_cache` to reuse evaluation results (a brief sketch follows after this list).
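A brief sketch of result reuse follows; the assumption that `use_cache` takes the path of a previous output directory, and the `drop`/`winogrande` registry names, should be checked against the documentation.

```python
# Hedged sketch: reuse results from an earlier run instead of re-querying the
# model. The assumption that use_cache takes the path of a previous outputs
# directory should be checked against the documentation.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',
    datasets=['drop', 'winogrande'],      # benchmarks added in this release (assumed registry names)
    use_cache='outputs/20250101_000000',  # placeholder path to a previous run
)

run_task(task_cfg=task_cfg)
```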
What's Changed
- fix preprocess args by @Yunnglin in #537
- fix report generation encoding to support chinese on windows by @antigone660 in #534
- fix config extension check by @Yunnglin in #544
- handle error during evaluation by @Yunnglin in #541
- Bug Fix: Allow Custom SwanLab Project Name by @ShaohonChen in #549
- support args json format by @Yunnglin in #551
- Add drop and winogrande by @Yunnglin in #546
- fix issue docs by @xiaoping378 in #562
- Add cache reuse for the judge model by @xh3204 in #566
- Support Perf multi parallel and rich output by @Yunnglin in #564
- Update review cache logic and doc by @Yunnglin in #574
- Refactor ToolBench eval by @Yunnglin in #556
New Contributors
- @antigone660 made their first contribution in #534
- @xiaoping378 made their first contribution in #562
- @xh3204 made their first contribution in #566
Full Changelog: v0.15.1...v0.16.0