Any possible solutions for GRPO+LoRA on a multi-GPU setup? #3517
Replies: 2 comments 1 reply
-
If the majority of the consumed VRAM comes from the completion size, FSDP or DeepSpeed are probably not going to help much... try first to use […]. Also, vLLM could help speed things up a bit; check https://huggingface.co/blog/vllm-colocate
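For reference, a minimal sketch of how colocated vLLM generation can be enabled via `GRPOConfig`, per the linked blog post; the output directory and memory fraction are illustrative values, not from this thread:

```python
# Sketch only: colocated vLLM generation for GRPO (see blog linked above).
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-lora",
    use_vllm=True,
    vllm_mode="colocate",             # run vLLM inside the training processes
    vllm_gpu_memory_utilization=0.3,  # illustrative; leave VRAM for training
)
```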
-
@shepardyan Have you solved this issue?
-
Hello! I'm trying to train a model with LoRA using the GRPOTrainer. Due to limited GPU memory (24 GB in my case), I can't train with a sufficient context length on a single GPU, so I tried training on 4 GPUs. However, using trl/accelerate with sharded training (FSDP/DeepSpeed) runs into several problems. Here are my environment configurations.
Python Environment
transformers==4.52.3
trl==0.18.0
peft==0.15.2
bitsandbytes==0.46.0
torch==2.7.0
accelerate==1.7.0
deepspeed==0.16.9
Hardware Configuration
CPU Configuration:
GPU Configuration:
System Topology:
Minimal Example
Code
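(The original code block did not survive in this export. Below is a minimal sketch of what a `train_grpo.py` consistent with the description might look like, i.e. GRPOTrainer with a LoRA `peft_config`; the model id, dataset, reward function, and hyperparameters are placeholders, not the poster's actual values.)

```python
# Hypothetical reconstruction of train_grpo.py; model, dataset, and
# reward function are placeholders, not the original poster's values.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Any prompt-only dataset works; this one appears in the TRL docs.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

training_args = GRPOConfig(
    output_dir="grpo-lora",
    per_device_train_batch_size=2,  # 4 GPUs x 2 = 8, divisible by num_generations
    num_generations=8,
    gradient_accumulation_steps=4,
    max_prompt_length=512,
    max_completion_length=512,      # completion length dominates VRAM here
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```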
Run command
accelerate launch --use_fsdp train_grpo.py
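(Assuming the 4-GPU setup described above, the process count can also be made explicit on the command line, e.g. `accelerate launch --num_processes 4 --use_fsdp train_grpo.py`.)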
Results
Other problems
In other scripts, FSDP can hang before training starts (in my case for over 10 hours), and DeepSpeed can fail with a DeviceMesh not found error.
Has anyone successfully trained a model on multiple GPUs with GRPOTrainer? Any help would be appreciated!