
Conversation

@ghrua

@ghrua ghrua commented Feb 22, 2025

What does this PR do?

The current GRPO trainer uses only 1 GPU for inference and $N-1$ GPUs for training, where $N$ is the world size. This PR allows users to leverage $K$ GPUs for inference and $N-K$ GPUs for training.

Method:

  1. Request resources with $N$ GPUs.
  2. Initialize distributed training with $N-K$ processes on GPUs 0, 1, ..., $N-K-1$.
  3. Initialize $K$ vLLM objects on training processes 0 to $K-1$. However, the model parameters of these $K$ vLLM objects are placed on GPUs $N-K$, $N-K+1$, ..., $N-1$.
  4. Use each vLLM object to do real-time inference for a subset of the prompts in its process.
  5. Gather the decoded data from processes 0 to $K-1$ (a condensed sketch follows this list).
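
A minimal sketch of this layout (hypothetical helper functions, not the PR's actual code; it assumes the vLLM `LLM` constructor accepts a `device` string such as `"cuda:3"`, as the TRL trainer does with vLLM 0.7.x):

```python
import torch
import torch.distributed as dist


def maybe_build_llm(rank: int, n_gpus: int, k_vllm: int, model_name: str):
    """Training runs on ranks 0..N-K-1 (GPUs 0..N-K-1). Ranks 0..K-1 each
    additionally own one vLLM instance whose weights live on GPU N-K+rank."""
    if rank >= k_vllm:
        return None
    from vllm import LLM

    llm_gpu = n_gpus - k_vllm + rank  # one of the GPUs reserved for inference
    with torch.cuda.device(llm_gpu):
        return LLM(model=model_name, device=f"cuda:{llm_gpu}")


def generate_and_gather(llm, prompt_shard, sampling_params):
    """Ranks that own an LLM generate for their shard of prompts; the decoded
    text is then gathered on every training rank."""
    local = []
    if llm is not None:
        outputs = llm.generate(prompt_shard, sampling_params)
        local = [out.outputs[0].text for out in outputs]
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local)
    return [text for shard in gathered for text in shard]
```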

Constraints:

  1. The number of vLLM instances must be less than or equal to the number of training processes, i.e., K <= N-K. For example, in an environment with 8 GPUs, we can use at most 4 GPUs for inference. Other configurations are also allowed, e.g., 1 inference + 7 training, 2 inference + 6 training, etc.

I tested with 4 A100-80GB GPUs. The script is from open-r1. I trained the Qwen-math-1.5B model with G = num_generation = 32 and per_device_train_batch_size = 32. Under this setting, I find that the (2 infer, 2 train) config is ~12% faster than (1 infer, 2 train). Scaling up the batch size and num_generation would probably yield higher gains.
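
For reference, a hypothetical configuration mirroring this 4-GPU test (field names follow this PR and TRL's `GRPOConfig` and may differ in the merged version; the script name and launch command are illustrative):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen-math-grpo",
    use_vllm=True,
    vllm_worker_num=2,                # K: GPUs reserved for vLLM (added by this PR)
    vllm_gpu_memory_utilization=0.7,  # keep headroom on the inference GPUs
    num_generations=32,               # G in the experiment above
    per_device_train_batch_size=32,
)

# On a 4-GPU node this would be launched with N-K = 2 training processes, e.g.:
#   accelerate launch --num_processes 2 train_grpo.py
```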

Before submitting

Who can review?

@qgallouedec @edbeeching

… vllm instances on K training processes; 2) add an arg vllm_worker_num to control K; 3) avoid OOM in test_training_vllm_guided_decoding by introducing vllm_gpu_memory_utilization
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec
Member

Thanks a lot @ghrua! Huge work! I'll test it asap. It definitely makes sense :)

@qgallouedec
Member

Have you tried with vLLM 0.7.3 btw?

@ghrua
Author

ghrua commented Feb 25, 2025

Hey @qgallouedec! Thanks for reviewing my commit.

I'm sorry, I haven't tried vLLM 0.7.3 yet. Let me do a quick check; please wait a moment.

I checked, and the environment for this commit is based on 0.7.3. Sorry for misremembering.

▶ pip list | grep vllm
vllm                              0.7.3

@qgallouedec
Member

So it works with this version?
I'm asking because currently, main branch hangs at some point when you use vLLM 0.7.3, and I'm curious to know if your PR solves it

@ghrua
Author

ghrua commented Feb 25, 2025

So it works with this version? I'm asking because currently, main branch hangs at some point when you use vLLM 0.7.3, and I'm curious to know if your PR solves it

Yes, my commit works well with vLLM 0.7.3. I re-ran the code to double-check.

Two parts may be helpful (a condensed sketch combining them follows this list):

  1. I use a context manager to control the device that each vLLM instance can access during inference, because vLLM may otherwise misuse the devices of the distributed training processes:
    with torch.cuda.device(self.llm_device):
  2. Three additional patches are used for a smoother initialization of vLLM (though not very elegant 😅):
    get_rank_patch = patch("torch.distributed.get_rank", return_value=0)
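
A condensed sketch of how these two pieces fit together (hypothetical code; the actual commit patches more symbols, and the extra `get_world_size` patch shown here is an assumption):

```python
from unittest.mock import patch

import torch
from vllm import LLM, SamplingParams


def build_llm(model_name: str, llm_device: int) -> LLM:
    # Pretend to be rank 0 of a single-process world so vLLM's own
    # initialization does not clash with the already-initialized
    # training process group.
    get_rank_patch = patch("torch.distributed.get_rank", return_value=0)
    world_size_patch = patch("torch.distributed.get_world_size", return_value=1)
    with get_rank_patch, world_size_patch, torch.cuda.device(llm_device):
        return LLM(model=model_name, device=f"cuda:{llm_device}")


def generate(llm: LLM, llm_device: int, prompts: list[str]) -> list[str]:
    # Keep vLLM pinned to its reserved GPU during inference as well, so it
    # never touches the devices used by the training processes.
    with torch.cuda.device(llm_device):
        outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
    return [out.outputs[0].text for out in outputs]
```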

Member

@qgallouedec qgallouedec left a comment


Ok now I understand better what you are trying to do.
Why do you need $K$ processes to control $K$ instances of vLLM? Couldn't you do everything in the main process?

@ghrua
Author

ghrua commented Feb 27, 2025

Ok now I understand better what you are trying to do.
Why do you need $K$ processes to control $K$ instances of vLLM? Couldn't you do everything in the main process?

That was my initial plan... I tried to initialize a vllm_list of K vLLM instances in the main process and use a ThreadPoolExecutor to run inference in parallel. However, it hit lots of errors (I forget the details, but they seemed to involve an inconsistency between the queries and the KV cache). I didn't check vLLM's code in detail, but those errors seem to come from conflicts over shared static state. I also found this issue: vllm-project/vllm#1676. Please let me know if I misunderstood anything.
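
For context, a rough reconstruction of the kind of single-process setup described above (hypothetical code, shown only to illustrate the pattern that failed; the model name and prompt shards are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

from vllm import LLM, SamplingParams

K = 2
MODEL_NAME = "Qwen/Qwen2.5-Math-1.5B"  # placeholder

# K instances created in one process, each pointed at its own GPU...
vllm_list = [LLM(model=MODEL_NAME, device=f"cuda:{i}") for i in range(K)]

def run(llm, prompts):
    return llm.generate(prompts, SamplingParams(max_tokens=256))

prompt_shards = [["..."], ["..."]]  # one shard per instance

# ...and inference dispatched in parallel threads. This is roughly where the
# errors (apparently from state shared across vLLM instances) showed up.
with ThreadPoolExecutor(max_workers=K) as pool:
    results = list(pool.map(run, vllm_list, prompt_shards))
```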

Afterwards, I changed my strategy to initializing the K instances in K processes.

@SeiunSky0131

SeiunSky0131 commented Mar 10, 2025

This PR allows users to leverage K GPUs for inference and N − K GPUs for training.

Hi, that's great work! However, when I tested this code on my server with 8 NVIDIA A100 80GB GPUs, using 6 GPUs (GPU 0-5) for training and 2 GPUs (GPU 6-7) for the vLLM engine, I found that GPU 7's utilization is always 0 throughout training (see the blue line for GPU 7 in the figure below).
[Screenshot: GPU utilization over time during training]

It seems that no generation tasks are allocated to GPU 7. Could you check whether this also happens with your code?

@loki369loki

This PR allows users to leverage K GPUs for inference and N − K GPUs for training.

Hi, that's great work! However, when I tested this code on my server with 8 NVIDIA A100 80GB GPUs, using 6 GPUs (GPU 0-5) for training and 2 GPUs (GPU 6-7) for the vLLM engine, I found that GPU 7's utilization is always 0 throughout training (see the blue line for GPU 7 in the figure below). [Screenshot: GPU utilization over time during training]

It seems that no generation tasks are allocated to GPU 7. Could you check whether this also happens with your code?

Encountered the same issue: GPU 7 is not being used for vLLM inference.
[Screenshot: vllm_multi_gpu_inference_test]

@skepsun

skepsun commented Mar 19, 2025

It does not seem to solve the problem where vLLM hangs when we want to use large models (for example, 32B) that can only be placed across multiple cards.

@qgallouedec
Member

Closed via #3094
