
Conversation

@ghrua

@ghrua ghrua commented Feb 22, 2025

What does this PR do?

The current GRPO trainer uses only 1 GPU for inference and $N-1$ GPUs for training, where $N$ is the world size. This PR allows users to leverage $K$ GPUs for inference and $N-K$ GPUs for training.

Method:

  1. Request resources with $N$ GPUs.
  2. Initialize distributed training with $N-K$ processes on GPUs 0, 1, ..., $N-K-1$.
  3. Initialize $K$ vLLM objects on training processes 0 to $K-1$. However, the model parameters of these $K$ vLLM objects are placed on GPUs $N-K$, $N-K+1$, ..., $N-1$.
  4. Use each vLLM object to do real-time inference for a subset of the prompts in its process.
  5. Gather the decoded data from processes 0 to $K-1$ (a condensed sketch follows this list).
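
A minimal sketch of this layout (hypothetical helper functions, not the PR's actual code; it assumes the vLLM `LLM` constructor accepts a `device` string such as `"cuda:3"`, as the TRL trainer does with vLLM 0.7.x):

```python
import torch
import torch.distributed as dist


def maybe_build_llm(rank: int, n_gpus: int, k_vllm: int, model_name: str):
    """Training runs on ranks 0..N-K-1 (GPUs 0..N-K-1). Ranks 0..K-1 each
    additionally own one vLLM instance whose weights live on GPU N-K+rank."""
    if rank >= k_vllm:
        return None
    from vllm import LLM

    llm_gpu = n_gpus - k_vllm + rank  # one of the GPUs reserved for inference
    with torch.cuda.device(llm_gpu):
        return LLM(model=model_name, device=f"cuda:{llm_gpu}")


def generate_and_gather(llm, prompt_shard, sampling_params):
    """Ranks that own an LLM generate for their shard of prompts; the decoded
    text is then gathered on every training rank."""
    local = []
    if llm is not None:
        outputs = llm.generate(prompt_shard, sampling_params)
        local = [out.outputs[0].text for out in outputs]
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local)
    return [text for shard in gathered for text in shard]
```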

Constraints:

  1. The number of vLLM instances must be less than or equal to the number of training processes, i.e., K <= N-K. For example, in an environment with 8 GPUs, we can use at most 4 GPUs for inference. Other configurations are also allowed, e.g., 1 inference + 7 training, 2 inference + 6 training, etc.

I tested with 4 A100-80GB GPUs. The script is from open-r1. I trained the Qwen-math-1.5B model with G = num_generation = 32 and per_device_train_batch_size = 32. Under this setting, I find that the (2 infer, 2 train) config is ~12% faster than (1 infer, 2 train). Scaling up the batch size and num_generation would probably yield higher gains.
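
For reference, a hypothetical configuration mirroring this 4-GPU test (field names follow this PR and TRL's `GRPOConfig` and may differ in the merged version; the script name and launch command are illustrative):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen-math-grpo",
    use_vllm=True,
    vllm_worker_num=2,                # K: GPUs reserved for vLLM (added by this PR)
    vllm_gpu_memory_utilization=0.7,  # keep headroom on the inference GPUs
    num_generations=32,               # G in the experiment above
    per_device_train_batch_size=32,
)

# On a 4-GPU node this would be launched with N-K = 2 training processes, e.g.:
#   accelerate launch --num_processes 2 train_grpo.py
```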

Before submitting

Who can review?

@qgallouedec @edbeeching

… vllm instances on K training processes; 2) add an arg vllm_worker_num to control K; 3) avoid OOM in test_training_vllm_guided_decoding by introducing vllm_gpu_memory_utilization
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec
Member

Thanks a lot @ghrua! Huge work! I'll test it asap. It definitely makes sense :)

@qgallouedec
Member

Have you tried with vLLM 0.7.3 btw?

@ghrua
Author

ghrua commented Feb 25, 2025

Hey @qgallouedec! Thanks for reviewing my commit.

I'm sorry, I haven't tried vLLM 0.7.3 yet. Let me do a quick check; please wait a moment.

I checked, and the environment for this commit is based on 0.7.3. Sorry for misremembering.

▶ pip list | grep vllm
vllm                              0.7.3

@qgallouedec
Member

So it works with this version?
I'm asking because currently, main branch hangs at some point when you use vLLM 0.7.3, and I'm curious to know if your PR solves it

@ghrua
Author

ghrua commented Feb 25, 2025

So it works with this version? I'm asking because currently, main branch hangs at some point when you use vLLM 0.7.3, and I'm curious to know if your PR solves it

Yes, my commit works well with vLLM 0.7.3. I re-ran the code to double-check.

Two parts may be helpful (a condensed sketch combining them follows this list):

  1. I use a context manager to control the device that each vLLM instance can access during inference, because vLLM may otherwise misuse the devices of the distributed training processes:
    with torch.cuda.device(self.llm_device):
  2. Three additional patches are used for a smoother initialization of vLLM (though not very elegant 😅):
    get_rank_patch = patch("torch.distributed.get_rank", return_value=0)
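
A condensed sketch of how these two pieces fit together (hypothetical code; the actual commit patches more symbols, and the extra `get_world_size` patch shown here is an assumption):

```python
from unittest.mock import patch

import torch
from vllm import LLM, SamplingParams


def build_llm(model_name: str, llm_device: int) -> LLM:
    # Pretend to be rank 0 of a single-process world so vLLM's own
    # initialization does not clash with the already-initialized
    # training process group.
    get_rank_patch = patch("torch.distributed.get_rank", return_value=0)
    world_size_patch = patch("torch.distributed.get_world_size", return_value=1)
    with get_rank_patch, world_size_patch, torch.cuda.device(llm_device):
        return LLM(model=model_name, device=f"cuda:{llm_device}")


def generate(llm: LLM, llm_device: int, prompts: list[str]) -> list[str]:
    # Keep vLLM pinned to its reserved GPU during inference as well, so it
    # never touches the devices used by the training processes.
    with torch.cuda.device(llm_device):
        outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
    return [out.outputs[0].text for out in outputs]
```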

Member

@qgallouedec qgallouedec left a comment


Ok now I understand better what you are trying to do.
Why do you need $K$ processes to control $K$ instances of vLLM? Couldn't you do everything in the main process?

@ghrua
Author

ghrua commented Feb 27, 2025

Ok now I understand better what you are trying to do.
Why do you need $K$ processes to control $K$ instances of vLLM? Couldn't you do everything in the main process?

That was my initial plan... I tried to initialize a vllm_list of K vLLM instances in the main process and use a ThreadPoolExecutor to run inference in parallel. However, it hit lots of errors (I forget the details, but they seemed to involve an inconsistency between the queries and the KV cache). I didn't check vLLM's code in detail, but those errors seem to come from conflicts over shared static state. I also found this issue: vllm-project/vllm#1676. Please let me know if I misunderstood anything.
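
For context, a rough reconstruction of the kind of single-process setup described above (hypothetical code, shown only to illustrate the pattern that failed; the model name and prompt shards are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

from vllm import LLM, SamplingParams

K = 2
MODEL_NAME = "Qwen/Qwen2.5-Math-1.5B"  # placeholder

# K instances created in one process, each pointed at its own GPU...
vllm_list = [LLM(model=MODEL_NAME, device=f"cuda:{i}") for i in range(K)]

def run(llm, prompts):
    return llm.generate(prompts, SamplingParams(max_tokens=256))

prompt_shards = [["..."], ["..."]]  # one shard per instance

# ...and inference dispatched in parallel threads. This is roughly where the
# errors (apparently from state shared across vLLM instances) showed up.
with ThreadPoolExecutor(max_workers=K) as pool:
    results = list(pool.map(run, vllm_list, prompt_shards))
```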

Afterwards, I changed my strategy to initializing the K instances in K processes.

@SeiunSky0131

SeiunSky0131 commented Mar 10, 2025

This PR allows users to leverage K GPUs for inference and N − K GPUs for training.

Hi, that's great work! However, when I tested this code on my server with 8 NVIDIA A100 80GB GPUs, using 6 GPUs (GPU 0-5) for training and 2 GPUs (GPU 6-7) for the vLLM engine, I found that GPU 7's utilization is always 0 throughout training (see the blue line for GPU 7 in the figure below).
[Screenshot: GPU utilization over time during training]

It seems that no generation tasks are allocated to GPU 7. Could you check whether this also happens with your code?

@loki369loki

This PR allows users to leverage K GPUs for inference and N − K GPUs for training.

Hi, that's great work! However, when I tested this code on my server with 8 NVIDIA A100 80GB GPUs, using 6 GPUs (GPU 0-5) for training and 2 GPUs (GPU 6-7) for the vLLM engine, I found that GPU 7's utilization is always 0 throughout training (see the blue line for GPU 7 in the figure below). [Screenshot: GPU utilization over time during training]

It seems that no generation tasks are allocated to GPU 7. Could you check whether this also happens with your code?

Encountered the same issue: GPU 7 is not being used for vLLM inference.
[Screenshot: vllm_multi_gpu_inference_test]

@skepsun

skepsun commented Mar 19, 2025

It does not seem to solve the problem where vLLM hangs when we want to use large models (for example, 32B) that can only be placed across multiple cards.

@qgallouedec
Member

Closed via #3094
