
Commit c9e8dda

binary-husky, kashif, qgallouedec, and lewtun authored
🚀 Scaling GRPO to 70B+ Models and Multi-Node Training with vLLM Server & NCCL Communication (huggingface#3094)
* 🚀 allow GRPO to connect to vLLM in a remote/local node with NCCL communication
* Update trl/extras/remote_vllm_helper.py (Co-authored-by: Kashif Rasul <[email protected]>)
* use argparse for options
* add imports for remote vllm helper
* formatting
* fix arguments
* use cli options
* vllm serve
* clean server
* better naming
* client
* style
* new params in generate
* this method is the new default
* update config
* do not use asserts
* update config
* separate host and port
* proper deprecation
* deprecated arg in the vllm server
* simplify moving
* document host and port
* style
* update trainer
* new generate args
* update doc
* Fix for zero3
* Better naming
* Remove remote_vllm_helper
* remove grpo_with_remote_vllm
* remove cloudpickle from deps
* Some consistency
* Update docs/source/grpo_trainer.md (Co-authored-by: lewtun <[email protected]>)
* Update setup.py (Co-authored-by: lewtun <[email protected]>)
* add revision argument to vllm server
* Update docs/source/grpo_trainer.md (Co-authored-by: lewtun <[email protected]>)
* Update docs/source/grpo_trainer.md (Co-authored-by: lewtun <[email protected]>)
* Reset the prefix cache after updating weights
* Update vllm_client.py
* Update vllm_client.py
* Update vllm_serve.py
* Add health check endpoint to vLLM server
* connection timeout
* style
* fix doc language hint
* move reset_prefix_cache to its own endpoint
* async
* merge peft adapter to send to vllm
* Looks simple. Wasn't.
* Peft compatibility
* Update docs/source/speeding_up_training.md (Co-authored-by: lewtun <[email protected]>)
* Update docs/source/speeding_up_training.md (Co-authored-by: lewtun <[email protected]>)
* Update trl/extras/vllm_client.py (Co-authored-by: lewtun <[email protected]>)
* GatheredParameters can be disabled
* gather and ungather peft weights within the same deepspeed context
* use is_vllm_available
* minor consistency fixes
* fix error when deepspeed is not installed
* fix deepspeed import when not peft
* simpler
* multinode doc
* minor code and comments changes
* style
* optional deps
* vllm_server_timeout as arg
* small refinement in doc
* update deps
* Fix VLLMClient argument in grpo_trainer; Add zero3+peft vllm transfer solution
* Revert "Fix VLLMClient argument in grpo_trainer; Add zero3+peft vllm transfer solution" (reverts commit d759c9c)
* log num_tokens
* disable vllm test (in the future we'll add a mock for vllm server for them)
* style
* fix ds3_gather_for_generation

---------

Co-authored-by: Kashif Rasul <[email protected]>
Co-authored-by: Quentin Gallouédec <[email protected]>
Co-authored-by: Quentin Gallouédec <[email protected]>
Co-authored-by: lewtun <[email protected]>
1 parent efa0114 commit c9e8dda

File tree: 10 files changed (+1067 −233 lines)


‎docs/source/grpo_trainer.md‎

Lines changed: 97 additions & 11 deletions
@@ -68,11 +68,11 @@ At each training step, we sample a batch of prompts and generate a set of \\( G
 
 ### Computing the advantage
 
-For each of the \\( G \\) sequences, we compute the reward using a reward model. To align with the comparative nature of reward models—typically trained on datasets of comparisons between outputs for the same question—the advantage is calculated to reflect these relative comparisons. It is normalized as follows:
+For each of the \\( G \\) sequences, we compute the reward using a reward model. To align with the comparative nature of reward models—typically trained on datasets of comparisons between outputs for the same question—the advantage is calculated to reflect these relative comparisons. It is normalized as follows:
 
-$$\hat{A}_{i,t} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$
+$$\hat{A}_{i,t} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$
 
-This approach gives the method its name: **Group Relative Policy Optimization (GRPO)**.
+This approach gives the method its name: **Group Relative Policy Optimization (GRPO)**.
 
 ### Estimating the KL divergence
 
@@ -83,15 +83,15 @@ $$
 
 ### Computing the loss
 
-The objective is to maximize the advantage while ensuring that the model remains close to the reference policy. Consequently, the loss is defined as follows:
+The objective is to maximize the advantage while ensuring that the model remains close to the reference policy. Consequently, the loss is defined as follows:
 
 $$
 \mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\left[\pi_\theta(o_{i,t} \mid q, o_{i,< t})\right]_{\text{no grad}}} \hat{A}_{i,t} - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
 $$
 
-where the first term represents the scaled advantage and the second term penalizes deviations from the reference policy through KL divergence.
+where the first term represents the scaled advantage and the second term penalizes deviations from the reference policy through KL divergence.
 
-In the original paper, this formulation is generalized to account for multiple updates after each generation (denoted \\( \mu \\), can be set with `num_iterations` in [`GRPOConfig`]) by leveraging the **clipped surrogate objective**:
+In the original paper, this formulation is generalized to account for multiple updates after each generation (denoted \\( \mu \\), can be set with `num_iterations` in [`GRPOConfig`]) by leveraging the **clipped surrogate objective**:
 
 $$
 \mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \hat{A}_{i,t}, \, \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
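To make the formulas in the two hunks above concrete, here is a small, self-contained Python sketch of the group-relative advantage and the first (non-clipped) per-token objective. It is illustrative only, not the trainer's implementation; the small `eps` term, the k3-style KL approximation, and the `beta` value are assumptions added for the example:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """A_hat_i = (r_i - mean(r)) / std(r), computed over the G completions of one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)  # eps guards against zero std

def grpo_loss_single_completion(logps, ref_logps, advantage, beta=0.04):
    """Per-token loss for ONE completion of length T (first objective above, no clipping).

    logps:     (T,) log pi_theta(o_t | q, o_<t) for the generated tokens (requires grad)
    ref_logps: (T,) log pi_ref(o_t | q, o_<t) under the frozen reference policy
    advantage: scalar group-relative advantage A_hat for this completion
    """
    # pi_theta / [pi_theta]_no-grad == exp(logp - logp.detach()): equal to 1 in value,
    # but it keeps the gradient of the numerator, matching the first loss above.
    ratio = torch.exp(logps - logps.detach())
    # Per-token KL approximation (k3-style estimator) -- an assumption here; the exact
    # estimator is the one described in the "Estimating the KL divergence" section.
    kl = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1
    return -(ratio * advantage - beta * kl).mean()

# Toy usage: G = 4 completions for one prompt, then the loss for the first one (5 tokens)
advantages = group_relative_advantages(torch.tensor([1.0, 0.0, 2.0, 1.0]))
logps = torch.randn(5, requires_grad=True)
loss = grpo_loss_single_completion(logps, torch.randn(5), advantages[0])
loss.backward()
```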
@@ -112,17 +112,103 @@ The GRPO Trainer logs the following metrics:
 
 ## Customization
 
-## Speed up training with vLLM-powered generation
+### Speed up training with vLLM-powered generation
 
-Generation is often the main bottleneck that makes training slow with online methods. To accelerate generation, you can use [vLLM](https://github.com/vllm-project/vllm), a library that enables fast generation. To enable it, pass `use_vllm=True` in the training arguments.
+Generation is often the main bottleneck that makes training slow with online methods. To accelerate generation, you can use [vLLM](https://github.com/vllm-project/vllm), a library that enables fast generation. To enable it, first install the package with
+
+```shell
+pip install trl[vllm]
+```
+
+Then, start the vLLM server with the desired model:
+
+```bash
+trl vllm-serve --model <model_name>
+```
+
+Then, pass `use_vllm=True` in the training arguments and run the training script:
 
 ```python
 from trl import GRPOConfig
 
 training_args = GRPOConfig(..., use_vllm=True)
-```
+```
+
+For more information, see [Speeding up training with vLLM](speeding_up_training#vllm-for-fast-generation-in-online-methods).
+
+### GRPO at scale: train a 70B+ Model on multiple nodes
+
+When training large models like **Qwen2.5-72B**, you need several key optimizations to make the training efficient and scalable across multiple GPUs and nodes. These include:
+
+- **DeepSpeed ZeRO Stage 3**: ZeRO leverages data parallelism to distribute model states (weights, gradients, optimizer states) across multiple GPUs and CPUs, reducing memory and compute requirements on each device. Since large models cannot fit on a single GPU, using ZeRO Stage 3 is required for training such models. For more details, see [DeepSpeed Integration](deepspeed_integration).
+- **Accelerate**: Accelerate is a library that simplifies distributed training across multiple GPUs and nodes. It provides a simple API to launch distributed training and handles the complexities of distributed training, such as data parallelism, gradient accumulation, and distributed data loading. For more details, see [Distributing Training](distributing_training).
+- **vLLM**: See the previous section on how to use vLLM to speed up generation.
+
+Below is an example SLURM script to train a 70B model with GRPO on multiple nodes. This script trains a model on 4 nodes and uses the 5th node for vLLM-powered generation.
 
-For more information, see [Speeding up training with vLLM](speeding_up_training#vllm-for-fast-generation-in-online-methods).
+```sh
+#!/bin/bash
+#SBATCH --nodes=5
+#SBATCH --gres=gpu:8
+
+# Get the list of allocated nodes
+NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
+
+# Assign the first 4 nodes for training and the 5th node for vLLM
+TRAIN_NODES="${NODELIST[@]:0:4}"  # Nodes 0, 1, 2, 3 for training
+VLLM_NODE="${NODELIST[4]}"        # Node 4 for vLLM
+
+# Run training on the first 4 nodes (Group 1)
+srun --nodes=4 --ntasks=4 --nodelist="${NODELIST[@]:0:4}" accelerate launch \
+     --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
+     --num_processes 32 \
+     --num_machines 4 \
+     --main_process_ip ${NODELIST[0]} \
+     --machine_rank $SLURM_PROCID \
+     --rdzv_backend c10d \
+     train_grpo.py \
+     --vllm_server_host $VLLM_NODE &
+
+# Run vLLM server on the 5th node (Group 2)
+srun --nodes=1 --ntasks=1 --nodelist="${NODELIST[4]}" trl vllm-serve --model Qwen/Qwen2.5-72B --tensor_parallel_size 8 &
+
+wait
+```
+
+```python
+import argparse
+
+from datasets import load_dataset
+from trl import GRPOTrainer, GRPOConfig
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--vllm_server_host", type=str, default="", help="The server IP")
+    args = parser.parse_args()
+
+    # Example dataset from TLDR
+    dataset = load_dataset("trl-lib/tldr", split="train")
+
+    # Dummy reward function: count the number of unique characters in the completions
+    def reward_num_unique_chars(completions, **kwargs):
+        return [len(set(c)) for c in completions]
+
+    training_args = GRPOConfig(
+        output_dir="Qwen2.5-72B-GRPO",
+        per_device_train_batch_size=4,
+        bf16=True,
+        gradient_checkpointing=True,
+        logging_steps=10,
+        use_vllm=True,
+        vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."),  # from ip-X-X-X-X to X.X.X.X
+    )
+
+    trainer = GRPOTrainer(model="Qwen/Qwen2.5-72B", args=training_args, reward_funcs=reward_num_unique_chars, train_dataset=dataset)
+    trainer.train()
+
+if __name__ == "__main__":
+    main()
+```
 
 ### Using a custom reward function
 
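Assuming the SLURM script above is saved as, say, `train_grpo.slurm` alongside `train_grpo.py`, the whole two-group job can be submitted with a single `sbatch train_grpo.slurm`; the trailing `wait` keeps the allocation alive until both the training job step and the vLLM server step have exited.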
@@ -247,7 +333,7 @@ def math_reward_func(prompts, completions, task, **kwargs):
             # Return None for non-math tasks
             rewards.append(None)
     return rewards
-
+
 
 # Coding-specific reward function
 def coding_reward_func(prompts, completions, task, **kwargs):
     rewards = []
‎docs/source/speeding_up_training.md‎

Lines changed: 47 additions & 18 deletions
@@ -37,39 +37,68 @@ training_args = OnlineDPOConfig(..., use_vllm=True)
 </hfoption>
 <hfoption id="GRPO">
 
-Then, enable it by passing `use_vllm=True` in the training arguments.
+First, start a vLLM server by running:
+
+```bash
+trl vllm-serve --model <model_name>
+```
+
+Then, run the training script and pass `use_vllm=True` in the training arguments.
 
 ```python
 from trl import GRPOConfig
 
 training_args = GRPOConfig(..., use_vllm=True)
 ```
 
-The strategy here is to use a dedicated GPU for generation powered by vLLM, while using the remainder for training.
+You can customize the server configuration by passing additional arguments.
+
+```sh
+$ trl vllm-serve --help
+usage: trl vllm-serve [-h] --model MODEL [--revision REVISION] [--tensor_parallel_size TENSOR_PARALLEL_SIZE] [--host HOST]
+                      [--port PORT] [--gpu_memory_utilization GPU_MEMORY_UTILIZATION] [--dtype DTYPE]
+                      [--max_model_len MAX_MODEL_LEN] [--enable_prefix_caching ENABLE_PREFIX_CACHING]
+
+options:
+  -h, --help            Show this help message and exit
+  --model MODEL         Model name or path to load the model from. (default: None)
+  --revision REVISION   Revision to use for the model. If not specified, the default branch will be used. (default: None)
+  --tensor_parallel_size TENSOR_PARALLEL_SIZE, --tensor-parallel-size TENSOR_PARALLEL_SIZE
+                        Number of tensor parallel workers to use. (default: 1)
+  --host HOST           Host address to run the server on. (default: 0.0.0.0)
+  --port PORT           Port to run the server on. (default: 8000)
+  --gpu_memory_utilization GPU_MEMORY_UTILIZATION, --gpu-memory-utilization GPU_MEMORY_UTILIZATION
+                        Ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache on the device
+                        dedicated to generation powered by vLLM. Higher values will increase the KV cache size and thus improve the
+                        model's throughput. However, if the value is too high, it may cause out-of-memory (OOM) errors during
+                        initialization. (default: 0.9)
+  --dtype DTYPE         Data type to use for vLLM generation. If set to 'auto', the data type will be automatically determined based on
+                        the model configuration. Find the supported values in the vLLM documentation. (default: auto)
+  --max_model_len MAX_MODEL_LEN, --max-model-len MAX_MODEL_LEN
+                        If set, the `max_model_len` to use for vLLM. This can be useful when running with reduced
+                        `vllm_gpu_memory_utilization`, leading to a reduced KV cache size. If not set, vLLM will use the model context
+                        size, which might be much larger than the KV cache, leading to inefficiencies. (default: None)
+  --enable_prefix_caching ENABLE_PREFIX_CACHING, --enable-prefix-caching ENABLE_PREFIX_CACHING
+                        Whether to enable prefix caching in vLLM. If set to `True`, ensure that the model and the hardware support this
+                        feature. (default: None)
+```
 
 <Tip warning={true}>
 
-When using vLLM, an additional GPU is required exclusively for generation. This means you need at least two available GPUs and must ensure that one remains unused by the trainer. To achieve this, run the training with `--num_processes <NUMBER_OF_GPUs - 1>`.
+When using vLLM, ensure that the GPUs assigned for training and generation are separate to avoid resource conflicts. For instance, if you plan to use 4 GPUs for training and another 4 for vLLM generation, you can specify GPU allocation using `CUDA_VISIBLE_DEVICES`.
 
-For example, if you have 4 GPUs, set `--num_processes 3` to allocate three GPUs for training while reserving one for generation.
-```bash
-accelerate launch --multi_gpu --num_processes 3 train_grpo.py
-```
+Set GPUs **0-3** for vLLM generation:
+```sh
+CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model <model_name>
+```
 
-![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/1_gpu_for_generation.png)
+And GPUs **4-7** for training:
+```sh
+CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py
+```
 
 </Tip>
 
-You can further tune the vLLM configuration by setting a specific `vllm_device` and `vllm_gpu_memory_utilization` in the [`GRPOConfig`].
-
-```python
-training_args = GRPOConfig(
-    ...,
-    use_vllm=True,
-    vllm_device="cuda:4",
-    vllm_gpu_memory_utilization=0.7,
-)
-```
 
 </hfoption>
 </hfoptions>
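As an illustrative combination of the options documented above (the exact values are placeholders, not recommendations), a dedicated 4-GPU generation setup could be started with `CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model <model_name> --tensor_parallel_size 4 --gpu_memory_utilization 0.85 --max_model_len 4096`, while the remaining GPUs run `accelerate launch` for training as shown in the tip.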

‎setup.py‎

Lines changed: 1 addition & 2 deletions
@@ -91,8 +91,7 @@
     "scikit": ["scikit-learn"],
     "test": ["parameterized", "pytest-cov", "pytest-rerunfailures", "pytest-xdist", "pytest"],
     # vllm is not available on Windows
-    # vllm 0.7.3 causes hanging while gathering. temporary pinning the version until the issue is resolved
-    "vllm": ["vllm==0.7.2; sys_platform != 'win32'"],
+    "vllm": ["vllm>=0.7.0; sys_platform != 'win32'", "fastapi", "pydantic", "requests", "uvicorn"],
     "vlm": ["Pillow"],
 }
 EXTRAS["dev"] = []
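In practice this means `pip install "trl[vllm]"` now installs `fastapi`, `pydantic`, `requests`, and `uvicorn` alongside an unpinned `vllm>=0.7.0` (still skipped on Windows via the `sys_platform` marker); these are presumably the pieces the new `trl vllm-serve` HTTP server and its client depend on.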

‎tests/test_grpo_trainer.py‎

Lines changed: 5 additions & 13 deletions
@@ -20,7 +20,7 @@
 from datasets import load_dataset
 from parameterized import parameterized
 from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
-from transformers.testing_utils import require_peft, require_torch_accelerator
+from transformers.testing_utils import require_peft
 from transformers.utils import is_peft_available
 
 from trl import GRPOConfig, GRPOTrainer
@@ -631,7 +631,7 @@ def reward_func(completions, some_values, **kwargs):
             self.assertFalse(torch.equal(param, new_param), f"Parameter {n} has not changed.")
 
     @unittest.skipIf(not is_vllm_available(), "vLLM is not available")
-    @require_torch_accelerator
+    @unittest.skip("We should add a mock for the vLLM server.")
     def test_training_vllm(self):
         """Test that training works with vLLM for generation."""
         dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
@@ -645,8 +645,6 @@ def test_training_vllm(self):
             max_completion_length=32,  # reduce the completion length to reduce memory usage
             report_to="none",
             use_vllm=True,
-            vllm_device="cuda:0",  # will raise a warning, but allows this test to work with only one GPU
-            vllm_gpu_memory_utilization=0.5,  # reduce since because we use the same device for training and vllm
         )
         trainer = GRPOTrainer(
             model="Qwen/Qwen2.5-0.5B-Instruct",  # tiny is too small for vLLM
@@ -761,7 +759,7 @@ def test_beta_zero_no_ref_model_and_no_kl(self):
             self.assertFalse(torch.equal(param, new_param), f"Parameter {n} has not changed.")
 
     @unittest.skipIf(not is_vllm_available(), "vLLM is not available")
-    @require_torch_accelerator
+    @unittest.skip("We should add a mock for the vLLM server.")
     @require_peft
     def test_training_vllm_and_peft(self):
         """Test that training works with vLLM for generation."""
@@ -778,8 +776,6 @@ def test_training_vllm_and_peft(self):
             max_completion_length=32,  # reduce the completion length to reduce memory usage
             report_to="none",
             use_vllm=True,
-            vllm_device="cuda:0",  # will raise a warning, but allows this test to work with only one GPU
-            vllm_gpu_memory_utilization=0.5,  # reduce since because we use the same device for training and vllm
         )
         lora_config = LoraConfig(
             target_modules="all-linear",
@@ -810,7 +806,7 @@ def test_training_vllm_and_peft(self):
             self.assertFalse(torch.allclose(param, new_param), f"Parameter {n} has not changed.")
 
     @unittest.skipIf(not is_vllm_available(), "vLLM is not available")
-    @require_torch_accelerator
+    @unittest.skip("We should add a mock for the vLLM server.")
     def test_training_vllm_guided_decoding(self):
         """Test that training works with vLLM for generation with guided decoding."""
         dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
@@ -824,8 +820,6 @@ def test_training_vllm_guided_decoding(self):
             max_completion_length=32,  # reduce the completion length to reduce memory usage
             report_to="none",
             use_vllm=True,
-            vllm_device="cuda:0",  # will raise a warning, but allows this test to work with only one GPU
-            vllm_gpu_memory_utilization=0.5,  # reduce since because we use the same device for training and vllm
             vllm_guided_decoding_regex=r"<reasoning>\n.*\n</reasoning>\n<answer>\n.*\n</answer>",
         )
         trainer = GRPOTrainer(
@@ -883,7 +877,7 @@ def test_training_with_additional_generation_kwargs(self):
             self.assertFalse(torch.equal(param, new_param), f"Parameter {n} has not changed.")
 
     @unittest.skipIf(not is_vllm_available(), "vLLM is not available")
-    @require_torch_accelerator
+    @unittest.skip("We should add a mock for the vLLM server.")
     def test_training_vllm_with_additional_generation_kwargs(self):
         """Test that training works with vLLM and additional generation kwargs."""
         dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
@@ -897,8 +891,6 @@ def test_training_vllm_with_additional_generation_kwargs(self):
             max_completion_length=32,  # reduce the completion length to reduce memory usage
             report_to="none",
             use_vllm=True,
-            vllm_device="cuda:0",  # will raise a warning, but allows this test to work with only one GPU
-            vllm_gpu_memory_utilization=0.5,  # reduce since because we use the same device for training and vllm
             top_p=0.9,
             top_k=10,
             min_p=0.01,

‎trl/cli.py‎

Lines changed: 7 additions & 0 deletions
@@ -25,6 +25,8 @@
 from .scripts.kto import make_parser as make_kto_parser
 from .scripts.sft import make_parser as make_sft_parser
 from .scripts.utils import TrlParser
+from .scripts.vllm_serve import main as vllm_serve_main
+from .scripts.vllm_serve import make_parser as make_vllm_serve_parser
 
 
 def main():
@@ -40,6 +42,7 @@ def main():
     make_grpo_parser(subparsers)
     make_kto_parser(subparsers)
     make_sft_parser(subparsers)
+    make_vllm_serve_parser(subparsers)
 
     # Parse the arguments
     args = parser.parse_args()
@@ -87,6 +90,10 @@ def main():
         args.training_script_args = sys.argv[2:]  # remove "trl" and "sft"
         launch_command(args)  # launch training
 
+    elif args.command == "vllm-serve":
+        (script_args,) = parser.parse_args_and_config()
+        vllm_serve_main(script_args)
+
 
 if __name__ == "__main__":
     main()
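With this wiring, the new subcommand is invoked like any other `trl` command, for example `trl vllm-serve --model <model_name> --port 8000` (the available options are the ones listed in the `--help` output shown earlier); `parse_args_and_config()` produces the `script_args` namespace that is handed to `vllm_serve_main`.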
