RM-R1: Reward Modeling as Reasoning

🤗 Model | 📊 Dataset | 📖 Paper | 📖 Website

Figure: The RM‑R1 pipeline.

RM‑R1 reframes reward modeling as a reasoning problem. Instead of emitting an opaque scalar, a Reasoning Reward Model (ReasRM) first thinks out loud, generating a structured rubric or solution, and then predicts the preference between two responses. This simple shift improves both interpretability and performance: on average across multiple public benchmarks, RM‑R1 beats prior state-of-the-art reward models (e.g., INF-ORM-Llama3.1-70B, GPT-4o), while letting you read why the model prefers one answer over the other.
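To make the judging interface concrete, here is a minimal sketch of how a pairwise prompt might be assembled and the final verdict parsed. The template wording and the [[A]]/[[B]] verdict tags are illustrative assumptions, not RM‑R1's exact format; see the demo and evaluation code for the real templates.

import re

# Illustrative pairwise-judging template; RM-R1's actual prompt may differ.
JUDGE_TEMPLATE = """You are a reward model. Write a rubric, reason step by step,
then state which response is better as [[A]] or [[B]].

[Question]
{question}

[Response A]
{response_a}

[Response B]
{response_b}"""

def build_prompt(question: str, response_a: str, response_b: str) -> str:
    # Fill the (assumed) template with one preference pair.
    return JUDGE_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b
    )

def parse_preference(model_output: str) -> str | None:
    # Take the last [[A]]/[[B]] tag as the verdict; None if no verdict was emitted.
    matches = re.findall(r"\[\[([AB])\]\]", model_output)
    return matches[-1] if matches else None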

This repository provides all materials necessary to reproduce and extend RM-R1:

  • End-to-end scripts and configs for training (Distillation + RL),
  • a unified evaluation harness for public benchmarks, and
  • ready-to-run examples for deployment and inference.

All experiments are fully documented so that results can be audited or adapted to new domains with minimal changes.


📑 Table of Contents

  1. Installation
  2. Training
  3. Use Our Model
  4. Evaluation
  5. Build Your Own Dataset
  6. Features
  7. Acknowledgements
  8. Citation

Installation

Important: RM‑R1 currently depends on specific commits of veRL and vLLM. Please follow the exact steps below—even if you already have vLLM installed—otherwise compilation or runtime errors may occur.

1. Base environment

# create and enter env (Python ≥3.11 recommended)
conda create -n rm-r1 python=3.11 -y
conda activate rm-r1

2. veRL – pinned commit

# We recommend installing veRL in a directory separate from RM-R1
git clone https://github.com/volcengine/verl
cd verl
git checkout e49fb572bf85a8f0ef7124c898f509bd6d9832a1
pip install -e .
cd ..

3. vLLM – pinned commit + flash‑attention

# We recommend installing vLLM in a directory separate from RM-R1
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout ed6e9075d31e32c8548b480a47d1ffb77da1f54c
git cherry-pick caac5c2e597b1780c3df54a537c34e6061c32cff
export VLLM_COMMIT=ed6e9075d31e32c8548b480a47d1ffb77da1f54c
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/ed6e9075d31e32c8548b480a47d1ffb77da1f54c/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install --editable .

# flash‑attention 2 (for >2× speed‑up)
pip install flash-attn==2.7.2.post1 --no-build-isolation

Done! You can now run RM‑R1 for RL training.
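As an optional sanity check, the pinned builds should import cleanly from inside the rm-r1 environment (plain module imports only; nothing RM‑R1-specific is assumed):

# Optional sanity check: the pinned veRL/vLLM builds and flash-attention import cleanly.
import flash_attn
import verl
import vllm

print("vllm:", vllm.__version__)
print("flash-attn:", flash_attn.__version__)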

(Optional) Distillation / SFT environment

If you intend to reproduce the reasoning‑distillation stage from scratch, we recommend a separate environment:

conda create -n rm-r1-sft python=3.11 -y
conda activate rm-r1-sft

pip install uv && uv pip install --upgrade pip
uv pip install vllm==0.7.2

# OpenRLHF
cd OpenRLHF
uv pip install -e .

🚀 Training Workflow

All training recipes live in rm_r1/scripts/. The pipeline has two stages (for instruct models):

| Stage | Script Directory |
| --- | --- |
| Distillation | scripts/Distill/distill_qwen2.5-*.sh |
| RL with Verifiable Rewards (RLVR) | scripts/RLVR/*train_rm_r1_rlvr_*.sh |

Specify SAVE_MODEL_PATH in every distillation script and SAVE_META_DIR in every RLVR script to choose where checkpoints are stored. Other arguments such as batch size, learning rate, Slurm partition, etc., can be edited directly in each shell script. Detailed flag descriptions are available in the veRL documentation.

We support large-scale, multi-node, and multi-GPU training.

🔧 Example: Training a 14B Instruct model from scratch

# 1️⃣ Distillation (SFT)
conda activate rm-r1-sft
cd rm_r1/OpenRLHF
bash ../scripts/Distill/local/distill_qwen2.5-14b-instruct.sh

# 2️⃣ RLVR fine‑tuning
conda deactivate
conda activate rm-r1
cd ../..

# – local
bash rm_r1/scripts/RLVR/local/train_rm_r1_rlvr_qwen2.5_instruct_14b.sh

# – Slurm cluster
sbatch rm_r1/scripts/RLVR/slurm/train_rm_r1_rlvr_qwen2.5_instruct_14b.sh

🔧 Example: Fine‑tuning a DeepSeek‑distilled checkpoint

conda activate rm-r1

# – local
bash rm_r1/scripts/RLVR/local/train_rm_r1_rlvr_dpsk_distilled_14b.sh

# – Slurm
sbatch rm_r1/scripts/RLVR/slurm/train_rm_r1_rlvr_dpsk_distilled_14b.sh

Use Our Model

You can find a demo of how to use our model in the Jupyter notebook located at demo/demo.ipynb.
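Beyond the notebook, the sketch below shows one way to query an RM‑R1 checkpoint with Hugging Face Transformers. The model ID is a placeholder and the prompt wording is an assumption; follow demo/demo.ipynb for the exact template the released checkpoints expect.

# Minimal sketch (not the official demo): compare two responses with an RM-R1 checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RM-R1-UIUC/RM-R1-Qwen2.5-Instruct-14B"  # placeholder; use a released RM-R1 checkpoint ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "What is the capital of France?"
prompt = (
    "Compare the two responses to the question and, after your reasoning, "
    "state the better one as [[A]] or [[B]].\n\n"
    f"[Question]\n{question}\n\n"
    "[Response A]\nParis is the capital of France.\n\n"
    "[Response B]\nThe capital of France is Lyon.\n"
)

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))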

Evaluation

Our evaluation leverages three publicly available datasets: RewardBench, RM-Bench, and RMB. The evaluation code, along with detailed documentation, is available in the eval/ directory.

To run the evaluation, simply execute the following script:

bash eval/eval_one_command.sh

The primary arguments to modify are model and model_save_name. To make results easier to interpret, our evaluation pipeline logs the model's per-sample outputs. We provide example per-sample outputs at eval/result/.
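For instance, these logs can be aggregated with a few lines of Python. The eval/result/ path comes from the repo, but the JSONL layout and the pred/label field names below are assumptions; adjust them to the actual schema.

# Sketch: aggregate per-sample evaluation logs into an accuracy number.
# Assumes JSONL files with "pred" and "label" fields; adapt to the real schema.
import json
from pathlib import Path

records = []
for path in Path("eval/result").glob("**/*.jsonl"):
    with path.open() as f:
        records.extend(json.loads(line) for line in f if line.strip())

if records:
    correct = sum(r.get("pred") == r.get("label") for r in records)
    print(f"{correct}/{len(records)} correct ({correct / len(records):.2%})")
else:
    print("No per-sample logs found under eval/result/")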

Feel free to add your own models or datasets to the pipeline for one-line, transparent, and comprehensive reward model evaluations!


🧩 Build Your Own Dataset

This section outlines how we curated and blended data for training RM‑R1, and how you can adapt the process for your own use case.

🔗 Source Pools

We mix examples from the following datasets:

| Dataset | Size | Domain |
| --- | --- | --- |
| Skywork‑Reward‑Preference‑80K‑v0.2 | ~54k pairs | General |
| Code‑Preference‑Pairs | 8k pairs | Code |
| Math‑Step‑DPO‑10K | 10k pairs | Math |

We thank the authors of these datasets for their contributions. If you'd like to construct your own dataset, feel free to refer to our mixing script at rm_r1/dataset/mix_data/mix_data.py.
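For a rough picture of what that mixing step does (not a substitute for mix_data.py), the sketch below samples from the three pools and writes one combined file. The Hugging Face dataset IDs and the pass-through schema are assumptions, and each source will likely need its own field mapping.

# Sketch of blending preference pools into one training file.
# Hub IDs and column names are assumptions; see rm_r1/dataset/mix_data/mix_data.py for the real recipe.
import json
from datasets import load_dataset

# Pool -> number of pairs to draw (sizes mirror the table above).
sources = {
    "Skywork/Skywork-Reward-Preference-80K-v0.2": 54_000,
    "Vezora/Code-Preference-Pairs": 8_000,
    "xinlai/Math-Step-DPO-10K": 10_000,
}

with open("rm_r1_mixed_train.jsonl", "w") as out:
    for name, n in sources.items():
        ds = load_dataset(name, split="train").shuffle(seed=42)
        for row in ds.select(range(min(n, len(ds)))):
            # Column names differ across pools; map them to a unified
            # prompt/chosen/rejected schema here before writing.
            out.write(json.dumps({"source": name, **row}) + "\n")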

🛠️ Generating High‑Quality Reasoning Chains

A key component of RM‑R1 training is the use of correct and coherent distilled reasoning chains. Naively prompting strong thinking models (e.g., O3, Claude) in a zero-shot setting yields only ~75% chain accuracy. To address this, we use a two-pass bootstrapping strategy:

  1. Pass 1 (Claude-3.7-Sonnet): Generate chains via zero-shot prompting.
  2. Keep: Retain the samples that Pass 1 answered incorrectly, together with their flawed chains.
  3. Pass 2 (O3): Provide the correct answer, the prompt, and the flawed chain from Pass 1. Ask the model to regenerate a corrected reasoning chain.

This approach reliably produces chains that are both accurate and logically sound. Our implementations are provided at rm_r1/dataset/reasoning_chain_generation/.
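The control flow looks roughly like the sketch below. generate_chain and parse_choice are stubs standing in for your LLM client (Claude-3.7-Sonnet for Pass 1, O3 for Pass 2) and verdict parser, and the repair-prompt wording is an assumption rather than the repo's actual prompt.

# Sketch of the two-pass chain bootstrapping; see rm_r1/dataset/reasoning_chain_generation/
# for the actual implementation.

def generate_chain(prompt: str, model: str) -> str:
    # Stub: call your LLM provider (Claude-3.7-Sonnet for Pass 1, O3 for Pass 2).
    raise NotImplementedError

def parse_choice(chain: str) -> str | None:
    # Stub: extract the final A/B verdict from a generated chain.
    raise NotImplementedError

def bootstrap(samples: list[dict]) -> list[dict]:
    # samples: dicts with "prompt" (the pairwise comparison) and "label" ("A" or "B").
    chains = []
    for s in samples:
        # Pass 1: zero-shot chain from a strong thinking model.
        chain = generate_chain(s["prompt"], model="claude-3.7-sonnet")
        if parse_choice(chain) == s["label"]:
            chains.append({**s, "chain": chain})
            continue
        # Pass 2: supply the correct answer plus the flawed chain, ask for a corrected chain.
        repair_prompt = (
            f"{s['prompt']}\n\nThe correct choice is {s['label']}. "
            f"Here is a flawed attempt:\n{chain}\n\n"
            "Rewrite the reasoning so it correctly supports that choice."
        )
        chains.append({**s, "chain": generate_chain(repair_prompt, model="o3")})
    return chains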


Features

  • Open release of the trained models and the full accompanying datasets. ✔️
  • End-to-end pipelines for both supervised fine-tuning (SFT) and reinforcement learning (RL). ✔️
  • Support for multiple RL frameworks. ✔️
  • Support for both Slurm and interactive (local) training. ✔️
  • Support for multi-node, multi-GPU training. ✔️
  • Support for different base LLMs. ✔️
  • Support for building your own custom dataset. ✔️
  • Demo code for using RM-R1. ✔️
  • One-command evaluation on public RM benchmarks for quick, reproducible reporting. ✔️

Acknowledgements

The concept of RM-R1 is inspired by DeepSeek-R1. Its implementation is built upon veRL and OpenRLHF. We sincerely appreciate these teams' contributions to open-source research and development.


Citation

@article{chen2025rm,
  title={RM-R1: Reward Modeling as Reasoning},
  author={Chen, Xiusi and Li, Gaotang and Wang, Ziqi and Jin, Bowen and Qian, Cheng and Wang, Yu and Wang, Hongru and Zhang, Yu and Zhang, Denghui and Zhang, Tong and others},
  journal={arXiv preprint arXiv:2505.02387},
  year={2025}
}
