
GRPO: OOM when init self.vllm #3128

@markoov

Description


When I load the 7B model with vLLM on its own, no OOM error is reported. The error only occurs when I run the GRPO training code with "accelerate launch".
==============error traceback==============
[rank0]: File "/opt/conda/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 404, in __init__
[rank0]: self.llm = LLM(
[rank0]: ^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 1051, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 242, in __init__
[rank0]: self.llm_engine = self.engine_class.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 484, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 276, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 429, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 121, in initialize_cache
[rank0]: self.collective_rpc("initialize_cache",
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2220, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker.py", line 306, in initialize_cache
[rank0]: self._init_cache_engine()
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker.py", line 311, in _init_cache_engine
[rank0]: self.cache_engine = [
[rank0]: ^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker.py", line 312, in <listcomp>
[rank0]: CacheEngine(self.cache_config, self.model_config,
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/worker/cache_engine.py", line 69, in __init__
[rank0]: self.gpu_cache = self._allocate_kv_cache(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/vllm/worker/cache_engine.py", line 103, in _allocate_kv_cache
[rank0]: layer_kv_cache = torch.zeros(alloc_shape,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 780.00 MiB. GPU 3 has a total capacity of 23.68 GiB of which 698.94 MiB is free. Process 1953163 has 22.99 GiB memory in use. Of the allocated memory 22.64 GiB is allocated by PyTorch, and 47.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
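The numbers in the traceback show why the allocation fails: the training process already holds nearly all of GPU 3, and vLLM sizes its KV cache as a fraction of the card's *total* memory, not of what is currently free. A small stdlib-only sketch of that arithmetic (the helper below is illustrative, not part of trl or vllm):

```python
# Illustrative arithmetic taken from the OOM message above.
MIB = 1024 ** 2
GIB = 1024 ** 3

total = 23.68 * GIB       # total capacity of GPU 3
free = 698.94 * MIB       # memory actually left after training init
requested = 780.00 * MIB  # KV-cache tensor vLLM tried to allocate

# The allocation fails because the request exceeds the free memory:
assert requested > free

# A gpu_memory_utilization that would fit in the remaining headroom
# would have to be tiny -- far too small for a usable KV cache:
safe_fraction = free / total
print(f"free: {free / GIB:.2f} GiB -> utilization would need to be <= {safe_fraction:.3f}")
```

In other words, lowering `gpu_memory_utilization` alone cannot rescue this setup; vLLM needs a GPU that the training processes are not occupying.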

===========train env=================
model : Qwen/Qwen2.5-7B-Instruct
GPU : 4 × RTX 3090 (24 GB each)
python : 3.11
trl : 0.15.2
torch : 2.5.1+cu124
transformers : 4.49.0
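A common workaround with trl 0.15.x is to leave one GPU out of the accelerate process group and dedicate it to vLLM. The sketch below assumes the `vllm_device` and `vllm_gpu_memory_utilization` fields of `GRPOConfig` as they existed around that release (check your installed version; the field names are an assumption here, and the output path is hypothetical):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="qwen2.5-7b-grpo",      # hypothetical output path
    use_vllm=True,
    vllm_device="cuda:3",              # GPU reserved for generation only
    vllm_gpu_memory_utilization=0.85,  # fraction of *that* GPU vLLM may claim
)
```

Then train on the remaining three GPUs so that cuda:3 stays empty for vLLM, e.g. `accelerate launch --num_processes 3 train_grpo.py`.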
