
🧠 RL vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning

This repository contains the code used in the paper:

Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning
Minwu Kim*, Anubhav Shrestha*, Safal Shrestha, Aadim Nepal, Keith Ross

📄 arXiv:2505.14216 | 🤗 Hugging Face


🧩 About the Paper

This paper investigates why reinforcement learning with verifiable rewards (RLVR) and distillation have different impacts on large language model (LLM) reasoning. Key findings:

  • RLVR improves accuracy but does not improve capability, due to selective reinforcement of easier questions.
  • Distillation improves both accuracy and capability, but only when it introduces new knowledge.
  • Distilling only reasoning patterns behaves similarly to RLVR, helping with easy questions but failing on hard ones.

Read the full paper 👉 arXiv:2505.14216


📁 Code Structure

.
├── environment.yml           # 🧪 Conda environment file
├── zero3.yaml               # ⚙️ DeepSpeed Zero3 config for RLVR training
│
├── train/
│   ├── RLVR/
│   │   ├── grpo_trainer.py     # RL training script using TRL (GRPO)
│   │   └── grpoconfig.yaml     # Config file for GRPO training
│   │
│   └── Distillation/
│       ├── sft_trainer.py      # Supervised fine-tuning script
│       └── sftconfig.yaml      # SFT config file
│
└── test/
    └── MATH500_eval.py        # 🎯 Evaluation script on MATH500

⚙️ Environment Setup

Create the conda environment and install dependencies:

conda env create -f environment.yml
conda activate rlvsdistill

Ensure DeepSpeed is installed and 🤗 Accelerate is configured on your system (e.g., via accelerate config).


🚀 How to Run

🔍 Evaluation (MATH500)

python test/MATH500_eval.py
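For reference, the block below is a minimal sketch of what such an evaluation loop can look like; it is not the repository's script. The model path, the HuggingFaceH4/MATH-500 dataset name, and the boxed-answer extraction helper are all assumptions made for illustration, and the repository's MATH500_eval.py may differ in prompt format and scoring.

# Hedged sketch of a MATH500-style greedy-decoding evaluation (not the repo's MATH500_eval.py).
# Assumptions: any HF causal LM path, the HuggingFaceH4/MATH-500 dataset, and
# exact-match scoring on the final \boxed{} answer.
from datasets import load_dataset
from vllm import LLM, SamplingParams

MODEL_PATH = "Qwen/Qwen2.5-1.5B-Instruct"  # assumption: replace with the checkpoint under evaluation


def extract_boxed(text: str) -> str:
    """Return the content of the last \\boxed{...} span using naive brace matching."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return ""
    i, depth, out = start + len("\\boxed{"), 1, []
    while i < len(text) and depth > 0:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        if depth > 0:
            out.append(text[i])
        i += 1
    return "".join(out).strip()


dataset = load_dataset("HuggingFaceH4/MATH-500", split="test")
prompts = [
    f"Solve the problem and put the final answer in \\boxed{{}}.\n\n{ex['problem']}"
    for ex in dataset
]

llm = LLM(model=MODEL_PATH)
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=2048))

correct = sum(
    extract_boxed(out.outputs[0].text) == ex["answer"]
    for out, ex in zip(outputs, dataset)
)
print(f"MATH500 accuracy: {correct / len(dataset):.3f}")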

🧪 RLVR Training (GRPO via TRL + VLLM)

We use TRL for policy optimization and VLLM for fast multi-process inference during reward evaluation.

accelerate launch \
  --config_file zero3.yaml \
  --num_processes <NUM_PROCESSES> \
  train/RLVR/grpo_trainer.py

💡 Note:

  • <NUM_PROCESSES> should be set to the number of available GPUs minus one. This is because VLLM reserves one process for fast batched model inference, and the remaining processes are used for parallel reward evaluation.
  • Example: if your machine has 16 GPUs, use --num_processes 15.

🛠️ Edit train/RLVR/grpoconfig.yaml to configure:

  • Base model and reward model paths
  • Dataset locations
  • Sampling parameters
  • Reward functions (a hedged sketch of one possible reward function follows this list)
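As an illustration of the last item above, here is a hedged sketch of how a verifiable-correctness reward function can be plugged into TRL's GRPO trainer. It assumes a recent TRL release exposing GRPOTrainer/GRPOConfig, a training dataset with a prompt column and a ground-truth answer column, and a placeholder model path; the repository's actual reward functions and wiring live in train/RLVR/grpo_trainer.py and grpoconfig.yaml and may differ.

# Hedged sketch of a GRPO correctness reward (not the repo's grpo_trainer.py).
# Assumptions: TRL with GRPOTrainer/GRPOConfig, and a dataset whose extra columns
# (here "answer") are forwarded to reward functions as keyword arguments.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


def correctness_reward(completions, answer, **kwargs):
    """Return 1.0 when a completion's final \\boxed{} answer matches the reference, else 0.0."""
    rewards = []
    for completion, ref in zip(completions, answer):
        pred = completion.split("\\boxed{")[-1].split("}")[0].strip()  # naive extraction for illustration
        rewards.append(1.0 if pred == str(ref).strip() else 0.0)
    return rewards


# Placeholder dataset for illustration; the paper's actual training data differs.
train_dataset = load_dataset("HuggingFaceH4/MATH-500", split="test")
train_dataset = train_dataset.rename_column("problem", "prompt")  # GRPOTrainer expects a "prompt" column

config = GRPOConfig(
    output_dir="grpo-sketch",
    num_generations=8,            # completions sampled per prompt for the group baseline
    max_completion_length=1024,
    use_vllm=True,                # generation served by VLLM, as described above
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",   # assumption: any HF causal LM path
    reward_funcs=correctness_reward,
    args=config,
    train_dataset=train_dataset,
)
trainer.train()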

📘 Distillation Training (SFT)

python train/Distillation/sft_trainer.py --config train/Distillation/sftconfig.yaml

Use the config file to specify the following (a hedged sketch of such a run follows the list):

  • Teacher and student model
  • Number of training steps
  • Prompt format and dataset
  • Saving & logging behavior
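For orientation, the block below is a hedged sketch of how a config-driven SFT run might look; it is not the repository's sft_trainer.py. The YAML keys, dataset layout, and model path are assumptions made for illustration, and it uses TRL's SFTTrainer/SFTConfig, which by default expects a "text" column of training examples (here, teacher reasoning traces).

# Hedged sketch of config-driven supervised fine-tuning (not the repo's sft_trainer.py).
# Assumptions: a YAML config with the keys read below and a dataset whose rows
# contain a single "text" column holding the teacher's solutions/traces.
import argparse

import yaml
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True)
with open(parser.parse_args().config) as f:
    cfg = yaml.safe_load(f)

train_dataset = load_dataset(cfg["dataset_name"], split="train")

args = SFTConfig(
    output_dir=cfg["output_dir"],
    max_steps=cfg["max_steps"],
    per_device_train_batch_size=cfg["per_device_train_batch_size"],
    learning_rate=cfg["learning_rate"],
    logging_steps=cfg["logging_steps"],
    save_steps=cfg["save_steps"],
)

trainer = SFTTrainer(
    model=cfg["model_name_or_path"],  # the student model; teacher traces are baked into the dataset
    args=args,
    train_dataset=train_dataset,
)
trainer.train()

The sketch would be invoked the same way as the repository's script, e.g. python sft_sketch.py --config train/Distillation/sftconfig.yaml, though the real config keys may differ.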

📦 Libraries Used

  • 🤗 Transformers – model loading and generation
  • 🤗 TRL – for RL fine-tuning via GRPO
  • VLLM – for fast batched generation and reward model inference
  • 🧪 DeepSpeed – memory-efficient distributed training

🧪 Citation

If you cite this paper in any form, please use the BibTeX entry below:

@misc{kim2025reinforcement,
      title={Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning}, 
      author={Minwu Kim and Anubhav Shrestha and Safal Shrestha and Aadim Nepal and Keith Ross},
      year={2025},
      eprint={2505.14216},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
