Any possible solutions for GRPO+LoRA on a multi-GPU setup? #3517
Replies: 2 comments 1 reply
-
If the majority of the consumed VRAM comes from the completion size, FSDP or DeepSpeed are probably not going to help much... try first to use […]. Also, vLLM could help speed things up a bit; check https://huggingface.co/blog/vllm-colocate
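For reference, a minimal sketch of how colocated vLLM generation can be enabled via `GRPOConfig`, per the linked blog post; the output directory and memory fraction are illustrative values, not from this thread:

```python
# Sketch only: colocated vLLM generation for GRPO (see blog linked above).
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-lora",
    use_vllm=True,
    vllm_mode="colocate",             # run vLLM inside the training processes
    vllm_gpu_memory_utilization=0.3,  # illustrative; leave VRAM for training
)
```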
-
@shepardyan Have you solved this issue?
-
Hello! I'm trying to train a model with LoRA using the GRPOTrainer. Due to limited GPU memory (24 GB in my case), I can't train with a sufficient context length on a single GPU, so I tried training on 4 GPUs. However, using trl/accelerate with sharded training (FSDP/DeepSpeed) runs into several problems. Here are my environment configurations.
Python Environment
transformers==4.52.3
trl==0.18.0
peft==0.15.2
bitsandbytes==0.46.0
torch==2.7.0
accelerate==1.7.0
deepspeed==0.16.9
Hardware Configuration
CPU Configuration:
GPU Configuration:
System Topology:
Minimal Example
Code
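(The original code block did not survive in this export. Below is a minimal sketch of what a `train_grpo.py` consistent with the description might look like, i.e. GRPOTrainer with a LoRA `peft_config`; the model id, dataset, reward function, and hyperparameters are placeholders, not the poster's actual values.)

```python
# Hypothetical reconstruction of train_grpo.py; model, dataset, and
# reward function are placeholders, not the original poster's values.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Any prompt-only dataset works; this one appears in the TRL docs.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

training_args = GRPOConfig(
    output_dir="grpo-lora",
    per_device_train_batch_size=2,  # 4 GPUs x 2 = 8, divisible by num_generations
    num_generations=8,
    gradient_accumulation_steps=4,
    max_prompt_length=512,
    max_completion_length=512,      # completion length dominates VRAM here
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```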
Run command
accelerate launch --use_fsdp train_grpo.py
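(Assuming the 4-GPU setup described above, the process count can also be made explicit on the command line, e.g. `accelerate launch --num_processes 4 --use_fsdp train_grpo.py`.)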
Results
Other problems
In other scripts, FSDP can hang before training starts (in my case for over 10 hours), and DeepSpeed can fail with a DeviceMesh not found error.
Has anyone successfully trained a model on multiple GPUs with GRPOTrainer? Any help would be appreciated!