An ever-evolving benchmark for LLMs and LMMs.
(Recommended) Create a new virtual environment and activate it. Some packages require Python>=3.11, so we suggest the following:

```bash
conda create -n onebench python=3.11 -y
conda activate onebench
```

Install the required packages:

```bash
python -m pip install -r requirements.txt
```

Install ONEBench in editable mode:

```bash
python -m pip install -e .
```

Test the installation:

```bash
python -c "import onebench"
```

[Optional] Upgrade the Google Cloud SDK:
```bash
brew install python@3.11
export CLOUDSDK_PYTHON=$(which python3.11)
gcloud components update
```

Authenticate to Google Cloud:

```bash
gcloud init
```

Download the HELM data:

```bash
python llm/download_helm.py
```

Download the Open LLM Leaderboard data:

```bash
python llm/download_open_llm_leaderboard.py
```

Download the LMSYS Chatbot Arena data:
```bash
python llm/download_chatbot_arena.py
```

The VLM results are in the `data/vlm/{dataset}` directory, where `dataset` is either `vhelm` or `lmms-eval`. The per-dataset a-matrices are located in `data/vlm/{dataset}/binary` and `data/vlm/{dataset}/numeric`. The Prometheus2 results are located in `data/vlm/{dataset}/pairwise_num`.
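To make the layout concrete, here is a minimal sketch for inspecting one of these matrices with pandas. The file format (CSV with models as rows and sample IDs as columns) and the file name `mmmu.csv` are assumptions for illustration only; list the directory contents to see what your checkout actually contains.

```python
# Minimal sketch: load one per-dataset a-matrix and inspect it.
# Assumptions (not confirmed by the repo): matrices are CSV files with models
# as rows and sample IDs as columns; "mmmu.csv" is a placeholder file name.
from pathlib import Path

import pandas as pd

binary_dir = Path("data/vlm/lmms-eval/binary")
print(sorted(p.name for p in binary_dir.glob("*.csv")))  # files actually present

matrix = pd.read_csv(binary_dir / "mmmu.csv", index_col=0)  # placeholder name
print(matrix.shape)          # (num_models, num_samples) under the assumed layout
print(matrix.mean(axis=1))   # per-model mean score across samples
```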
[TODO]: Add instructions for JSON downloads, a-matrix creation, Prometheus scripts, and capability querying.
```bibtex
@inproceedings{ghosh2025onebench,
  title={ONEBench to test them all: Sample-level benchmarking over open-ended capabilities},
  author={Ghosh, Adhiraj and Dziadzio, Sebastian and Prabhu, Ameya and Udandarao, Vishaal and Albanie, Samuel and Bethge, Matthias},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}
```
The code is released under the MIT License; see LICENSE for details.