Releases: modelscope/evalscope

v1.1.1

27 Oct 09:11

Updates

  1. Benchmark Extensions
  • Vision/Multimodal Evaluation: HallusionBench, POPE, PolyMath, MathVerse, MathVision, SimpleVQA, SeedBench2_plus
  • Document Understanding: OmniDocBench
  • NLP Tasks: CoNLL2003, NER Task Collection (9 tasks), AA-LCR
  • Logical Reasoning: VisuLogic, ZeroBench
  2. Feature Enhancements
  • Performance benchmarking: optimized the perf functionality to produce results comparable to vLLM benchmarking; see the documentation
  • Code evaluation: the sandbox environment now supports both local and remote execution modes, improving security and flexibility; see the documentation
  3. Performance and Stability Improvements
  • Fixed prompt tokens calculation issues in datasets
  • Added a heartbeat detection mechanism during evaluation
  • Fixed GSM8K accuracy calculation and enhanced logging (a minimal Python sketch follows this list)
  4. System Requirements Update
  • Python version requirement raised to ≥3.10 (no dependency updates)
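
To make the items above concrete, here is a minimal sketch of running one of the affected benchmarks through evalscope's Python entry points (`TaskConfig` and `run_task`). The endpoint URL, model name, and the exact values of fields such as `api_url` and `eval_type` are assumptions that may differ between versions; consult the linked documentation for authoritative usage.

```python
# Minimal sketch: re-run GSM8K against an OpenAI-compatible endpoint after the
# accuracy fix. Field values below are examples/assumptions, not prescriptions.
from evalscope import TaskConfig, run_task

task = TaskConfig(
    model='qwen2.5-7b-instruct',          # model name exposed by the serving endpoint (example)
    api_url='http://127.0.0.1:8000/v1',   # OpenAI-compatible base URL (assumed form)
    api_key='EMPTY',                      # placeholder key for a local server
    eval_type='service',                  # evaluate a served model rather than a local checkpoint (assumed value)
    datasets=['gsm8k'],
    limit=50,                             # quick smoke test on the first 50 samples
)

run_task(task_cfg=task)
```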

Full Changelog: v1.1.0...v1.1.1

v1.1.0

14 Oct 09:20

Update

  • The platform now supports OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, BLINK, and other multimodal evaluation benchmarks; for the full list of supported datasets, please refer to the documentation.
  • Added best-practice guides for evaluating the Qwen3-Omni and Qwen3-VL models.
  • Installation via pyproject.toml is now supported.

Full Changelog: v1.0.2...v1.1.0

v1.0.2

23 Sep 09:30

New Features

  • Code evaluation benchmarks (HumanEval, LiveCodeBench) now support execution in a sandbox environment; to use this feature, install ms-enclave first (a minimal sketch follows this list).
  • Added support for image-text multimodal evaluation benchmarks such as RealWorldQA, AI2D, MMStar, MMBench, and OmniBench, as well as text-only evaluation benchmarks such as Multi-IF, HealthBench, and AMC.
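
Below is a minimal sketch of the code-evaluation path described above, assuming the `humaneval`/`live_code_bench` dataset identifiers and the same `TaskConfig`/`run_task` entry points. The sandbox requires ms-enclave to be installed beforehand, and the exact switch that routes execution into the sandbox is version-specific, so it is not shown here.

```python
# Sketch only: evaluating code-generation benchmarks that can run in the sandbox.
# Prerequisite (per the note above): pip install ms-enclave
# Dataset identifiers and field values are assumptions and may differ by version.
from evalscope import TaskConfig, run_task

task = TaskConfig(
    model='qwen2.5-coder-7b-instruct',    # example model served behind an OpenAI-compatible API
    api_url='http://127.0.0.1:8000/v1',   # assumed endpoint form
    eval_type='service',                  # assumed value for evaluating a served model
    datasets=['humaneval'],               # 'live_code_bench' is expected to work analogously
    limit=20,
)

run_task(task_cfg=task)
```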

Full Changelog: v1.0.1...v1.0.2

v1.0.1

05 Sep 09:11

Update

  • Evaluation of vision-language multimodal large models is now supported, including MathVista and MMMU; for the full list of supported datasets, please refer to the documentation.
  • Image-editing task evaluation is now supported via the GEdit-Bench benchmark; for usage instructions, please refer to the documentation.
  • torch has been removed from the core dependencies and is now an optional dependency under the rag and aigc extras.

Full Changelog: v1.0.0...v1.0.1

v1.0.0

25 Aug 06:50

New version

Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under evalscope/api. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.

For the incompatible (breaking) changes, please refer to the documentation.
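
As an illustration of the registry-based design mentioned above, the sketch below shows the general pattern in plain Python. It is not the actual evalscope/api interface; it only illustrates how components such as benchmarks and metrics can register themselves under a name and be resolved by the evaluator instead of being hard-wired.

```python
# Generic registry pattern (illustrative only, NOT the evalscope/api interface).
from typing import Callable, Dict, Type

BENCHMARK_REGISTRY: Dict[str, Type] = {}

def register_benchmark(name: str) -> Callable[[Type], Type]:
    """Class decorator that records a benchmark adapter under `name`."""
    def decorator(cls: Type) -> Type:
        BENCHMARK_REGISTRY[name] = cls
        return cls
    return decorator

@register_benchmark('my_benchmark')
class MyBenchmarkAdapter:
    """Adapter producing standardized samples and results for one benchmark."""

    def load(self):
        # return standardized samples for the evaluator
        raise NotImplementedError

    def score(self, prediction, reference):
        # return a standardized result record
        raise NotImplementedError

# The core evaluator can then resolve adapters by name:
adapter_cls = BENCHMARK_REGISTRY['my_benchmark']
```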

Full Changelog: v0.17.1...v1.0.0

v0.17.1

21 Jul 02:10

New Features

  • Model stress testing now supports randomly generated image-text data for benchmarking multimodal models; for usage instructions, see the documentation.
  • Added support for τ-bench, which evaluates the performance and reliability of AI agents in realistic settings with dynamic user and tool interactions; for usage instructions, see the documentation.
  • Added support for "Humanity's Last Exam", a high-difficulty evaluation benchmark; for usage instructions, see the documentation.

Full Changelog: v0.17.0...v0.17.1

v0.17.0

04 Jul 12:58

New Features

  • Refactored Arena Mode: supports custom model battles, outputs a model leaderboard, and provides battle-result visualization; see the documentation for details.
  • Optimized custom dataset evaluation: evaluation without reference answers is now supported, and the LLM judge ships with two built-in modes, "direct scoring without a reference answer" and "checking whether the answer is consistent with the reference answer"; see the documentation for details.
  • Refactored result visualization: supports side-by-side comparison of two models' evaluation results, as well as visualization of Arena Mode results; see the documentation for details.

Full Changelog: v0.16.3...v0.17.0

v0.16.3

23 Jun 10:36

Full Changelog: v0.16.1...v0.16.2

v0.16.1

03 Jun 12:15

New Features

  • Supports passing the --analysis-report boolean parameter, which uses the judge model to generate an analysis report. The report includes interpretative analysis and recommendations based on the model evaluation results.
  • Added support for the Needle-in-a-Haystack test. Specify needle_haystack to conduct the test, and a corresponding heatmap will be generated in the outputs/reports folder, visually displaying the model's performance. For usage, refer to this guide.
  • Added support for two long document evaluation benchmarks: DocMath and FRAMES. Please check the documentation for usage considerations.
  • The --limit parameter now accepts a float between 0 and 1, interpreted as the fraction of the dataset to evaluate (a minimal configuration sketch follows this list).
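
A minimal configuration sketch for the options above, assuming the `TaskConfig`/`run_task` entry points; the model name is an example, and only the `datasets` and `limit` values are taken directly from the notes.

```python
# Sketch only: Needle-in-a-Haystack with a fractional --limit.
from evalscope import TaskConfig, run_task

task = TaskConfig(
    model='qwen2.5-7b-instruct',   # example model identifier
    datasets=['needle_haystack'],  # per the note above; heatmaps are written under outputs/reports
    limit=0.25,                    # float in (0, 1]: evaluate 25% of the dataset
)

run_task(task_cfg=task)

# The analysis report is requested with the boolean --analysis-report flag on the
# command line (see above); a judge model must be configured for it to be generated.
```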

New Contributors

  • @xcode03 made their first contribution in #600

Full Changelog: v0.16.0...v0.16.1

v0.16.0

19 May 09:14

New Features

  • Supports performance stress testing of model services with multiple concurrency levels and outputs a well-formatted benchmark report; see the example (a minimal sketch follows this list).
  • Supports the ToolBench-Static dataset to evaluate the tool invocation capabilities of models. Refer to the user guide.
  • Supports DROP and Winogrande evaluation benchmarks to assess the reasoning capabilities of models.
  • Supports use_cache to reuse evaluation results.
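
A minimal sketch of the stress-testing entry point, under the assumption that the perf module exposes `run_perf_benchmark` and accepts the dictionary-style configuration shown in its documentation; parameter names and the endpoint form may differ between versions.

```python
# Sketch only: stress test an OpenAI-compatible service at one concurrency level.
# Parameter names follow the perf documentation and should be treated as assumptions.
from evalscope.perf.main import run_perf_benchmark

task_cfg = {
    'url': 'http://127.0.0.1:8000/v1/chat/completions',  # example endpoint
    'api': 'openai',                 # OpenAI-compatible request format
    'model': 'qwen2.5-7b-instruct',  # model name exposed by the service (example)
    'dataset': 'openqa',             # assumed built-in prompt dataset
    'number': 100,                   # total number of requests
    'parallel': 8,                   # concurrency level; rerun with other values to sweep concurrency
    'stream': True,
}

run_perf_benchmark(task_cfg)
```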

Full Changelog: v0.15.1...v0.16.0