Supporting multi-vLLM inference for GRPO #2929
Conversation
… vllm instances on K training processes; 2) add an arg vllm_worker_num to control K; 3) avoid OOM in test_training_vllm_guided_decoding by introducing vllm_gpu_memory_utilization
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks a lot @ghrua! Huge work! I'll test it asap. It definitely makes sense :)
Have you tried with vLLM 0.7.3 btw?
Hey @qgallouedec! Thanks for reviewing my commit.
I checked that the env for this commit is based on
So it works with this version?
Yes, my commit works well with vLLM 0.7.3. I re-ran the code to double-check it. Two parts may be helpful:
… of prompts for multiple vllm workers
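A rough sketch of the kind of prompt sharding referred to here; the helper names are hypothetical and not taken from the PR:

```python
# Hypothetical sketch: shard one batch of prompts across K vLLM workers and
# restore the original order afterwards.
from typing import List


def split_prompts(prompts: List[str], num_workers: int) -> List[List[str]]:
    """Split the batch into num_workers contiguous, near-equal chunks."""
    chunk_size = (len(prompts) + num_workers - 1) // num_workers
    return [prompts[i : i + chunk_size] for i in range(0, len(prompts), chunk_size)]


def gather_completions(per_worker_outputs: List[List[str]]) -> List[str]:
    """Concatenate per-worker outputs so they line up with the original prompts."""
    return [text for chunk in per_worker_outputs for text in chunk]
```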
qgallouedec left a comment
Ok now I understand better what you are trying to do.
Why do you need
That was my initial plan... I tried to initialise a vllm_list of K vllm instances in the main process and use the ThreadExecutor to call the inference in parallel. However, it ran into lots of errors (I forget the details, but it seemed to be an inconsistency between the q and kv cache). I didn't check the detailed code of vLLM, but those errors seem to come from conflicts over some shared static variables. I also found this issue: vllm-project/vllm#1676. Please let me know if I misunderstood anything. Afterwards, I changed my strategy to initialising the K instances in K processes.
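A minimal sketch of the "one vLLM engine per process" idea described above; this is not the PR's implementation, and the device assignment, model name, and sampling settings are assumptions:

```python
# Sketch only: each worker process pins itself to one GPU and builds its own
# vLLM engine, so engines never share in-process global state.
import os
import multiprocessing as mp


def vllm_worker(gpu_id, prompts, queue):
    # Restrict CUDA to a single device before vLLM initializes it.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM, SamplingParams  # import after setting the device

    llm = LLM(model="Qwen/Qwen2.5-Math-1.5B")  # model name is an assumption
    outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
    queue.put((gpu_id, [o.outputs[0].text for o in outputs]))


if __name__ == "__main__":
    mp.set_start_method("spawn")  # each child needs a clean CUDA context
    queue = mp.Queue()
    chunks = {0: ["What is 2 + 2?"], 1: ["What is 3 * 7?"]}  # one prompt chunk per GPU
    procs = [mp.Process(target=vllm_worker, args=(g, c, queue)) for g, c in chunks.items()]
    for p in procs:
        p.start()
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
```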
It doesn't seem to solve the problem that vLLM hangs when we want to use large models (for example, 32B) that can only be placed on multiple cards.
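For context, sharding a single large model across several cards is handled by vLLM's tensor parallelism, which is a different axis than the K data-parallel workers this PR adds; a minimal illustration (the model name is an assumption):

```python
# Illustration only: one engine sharded over 4 GPUs via tensor parallelism.
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct", tensor_parallel_size=4)
```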
Closed via #3094



What does this PR do?
The current GRPO only uses 1 GPU for inference and $N-1$ GPUs for training, where $N$ is the world size. This PR allows users to leverage $K$ GPUs for inference and $N-K$ GPUs for training.
Method:
Constraints:
I tested with 4 A100-80GB GPUs. The script is from open-r1. I trained the Qwen-math-1.5B model with `G = num_generation = 32` and `per_device_train_batch_size = 32`. Under this setting, I find that the (2 infer, 2 train) config is ~12% faster than (1 infer, 2 train). Scaling up the batch size and `num_generation` would probably bring higher gains.
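A rough sketch of the configuration described above; `vllm_worker_num` is the argument proposed by this PR (not part of released TRL), and the memory-utilization value and output path are assumptions:

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="qwen-math-1.5b-grpo",   # hypothetical output path
    use_vllm=True,
    vllm_worker_num=2,                  # K = 2 GPUs for inference (proposed by this PR)
    vllm_gpu_memory_utilization=0.7,    # value is an assumption; lowers OOM risk
    num_generations=32,                 # G in the test above
    per_device_train_batch_size=32,
)
```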
Before submitting
Pull Request section?
Bottleneck in GRPO training #2887
GRPOTrainer adds support for OpenAI API-compatible servers to models that generate samples #2901
documentation guidelines.
Who can review?
@qgallouedec @edbeeching