OpenCompass v0.4.2
The OpenCompass team is thrilled to announce the release of OpenCompass v0.4.2! This version introduces powerful new datasets, robust evaluation enhancements, and critical refinements to elevate your benchmarking workflows. Let’s explore the key updates!
🌟 Highlights
✨ SuperGPQA Benchmark: Added support for the SuperGPQA dataset, enabling advanced reasoning evaluation with refined subset metrics. (#1924, #1966)
✨ MultiPL-E & Code Evaluator: Introduced MultiPL-E dataset integration and a dedicated code evaluator for comprehensive programming task assessments. (#1963)
✨ OlympiadMath Dataset: Expanded mathematical reasoning benchmarks with the OlympiadMath dataset. (#1982)
🚀 New Features
🔧 MMLU Pro Evaluation: Implemented generic LLM evaluation for MMLU Pro, broadening assessment flexibility. (#1923)
🔧 LLM Judge Configuration: Enhanced LLM judge capabilities with updated dataset configurations. (#1940)
🔧 Model Configurations:
- Added InternVL-8B and InternVL-38B model configurations. (#1978)
- Introduced Qwen-72B model support. (#1959)
📖 Documentation
📝 Results Persistence Guide: Documented evaluation results persistence workflows for better reproducibility. (#1908)
📝 Dataset Statistics: Added TBD token clarifications in dataset statistics documentation. (#1986)
📝 Typo Fixes: Corrected typos in the DeepSeek-R1 documentation. (#1916)
🐛 Bug Fixes
🔧 Math Verification: Fixed math-verify evaluator logic to ensure accurate mathematical reasoning checks. (#1917)
🔧 Summarizer Logic: Resolved summarizer inconsistencies for reliable result aggregation. (#1953)
🔧 Model Compatibility:
- Fixed model_kwargs handling for vLLM accelerator compatibility. (#1958)
- Patched AIME-2024 configuration errors. (#1974)
- Addressed OpenAI model tokenization constraints. (#1960)
⚙ Enhancements and Refactors
⚙ Dataset Optimization:
- Refined dataset configurations for KorBench, LiveMathBench, and others. (#1937, #1967)
- Added configurations that omit max_out_len for more flexible evaluations. (#1968)
⚙ Infrastructure Upgrades:
- Increased memory allocation for VOLC Runner CPU jobs. (#1962)
- Updated OlympiadBench and LLM Judge pipelines. (#1954)
⚙ CI/CD Improvements:
- Fixed baseline scoring in daily tests for consistent benchmarking. (#1996)
🎉 Welcome New Contributors
🎊 A warm welcome to our newest contributors:
- @kangreen0210 for adding SuperGPQA support in #1924
- @Jiajun0425 for resolving vLLM accelerator issues in #1958
Full Changelog: 0.4.1...0.4.2
Thank you for using OpenCompass! These updates empower deeper insights and more reliable evaluations. Keep exploring, and stay tuned for future innovations! 🌟