An ever-evolving benchmark for LLMs and LMMs.
(Recommended) Create a new virtual environment and activate it. Some packages require Python>=3.11, so we suggest the following:

```bash
conda create -n onebench python=3.11 -y
conda activate onebench
```

Install the required packages:

```bash
python -m pip install -r requirements.txt
```

Install ONEBench in editable mode:

```bash
python -m pip install -e .
```

Test the installation:

```bash
python -c "import onebench"
```

[Optional] Upgrade the Google Cloud SDK:
```bash
brew install python@3.11
export CLOUDSDK_PYTHON=$(which python3.11)
gcloud components update
```

Authenticate to Google Cloud:

```bash
gcloud init
```

Download the HELM data:

```bash
python llm/download_helm.py
```

Download the Open LLM Leaderboard data:

```bash
python llm/download_open_llm_leaderboard.py
```

Download the LMSYS Chatbot Arena data:
```bash
python llm/download_chatbot_arena.py
```

The VLM results are in the `data/vlm/{dataset}` directory, where `dataset` is either `vhelm` or `lmms-eval`. The per-dataset a-matrices are located in `data/vlm/{dataset}/binary` and `data/vlm/{dataset}/numeric`. The Prometheus2 results are located in `data/vlm/{dataset}/pairwise_num`.
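To make the layout concrete, here is a minimal sketch for inspecting one of these matrices with pandas. The file format (CSV with models as rows and sample IDs as columns) and the file name `mmmu.csv` are assumptions for illustration only; list the directory contents to see what your checkout actually contains.

```python
# Minimal sketch: load one per-dataset a-matrix and inspect it.
# Assumptions (not confirmed by the repo): matrices are CSV files with models
# as rows and sample IDs as columns; "mmmu.csv" is a placeholder file name.
from pathlib import Path

import pandas as pd

binary_dir = Path("data/vlm/lmms-eval/binary")
print(sorted(p.name for p in binary_dir.glob("*.csv")))  # files actually present

matrix = pd.read_csv(binary_dir / "mmmu.csv", index_col=0)  # placeholder name
print(matrix.shape)          # (num_models, num_samples) under the assumed layout
print(matrix.mean(axis=1))   # per-model mean score across samples
```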
[TODO]: Add instructions for JSON downloads, a-matrix creation, Prometheus scripts, and capability querying.
```bibtex
@inproceedings{ghosh2025onebench,
  title={ONEBench to test them all: Sample-level benchmarking over open-ended capabilities},
  author={Ghosh, Adhiraj and Dziadzio, Sebastian and Prabhu, Ameya and Udandarao, Vishaal and Albanie, Samuel and Bethge, Matthias},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}
```
The code is released under the MIT License; see LICENSE for details.