Add GRPO / Online DPO support for quantized models when using vLLM as the inference backbone #3133
Conversation
@qgallouedec Could you find time to test this PR?
Thanks @maoulee
I understand the motivation, but does it really matter here? I don't think the merge is the bottleneck. Have you done any benchmarking?
Not sure I get this one: why would you send both the adapters and the model? And even if you do, the adapter is significantly smaller than the model, so how do you end up with "twice the VRAM"?
I'm not aware of such an error. Do you mean numerical errors? Do you have any reports or pointers? I haven't looked at your changes in detail, because the PR still needs to be cleaned up: most of the changes don't seem related to what you're proposing, so please limit the number of lines changed to make the review possible. 🙏
@qgallouedec I think there are two cases: non-ZeRO-3 and ZeRO-3. In the non-ZeRO-3 case, PEFT may not consume much extra memory. In the ZeRO-3 case, however, the parameter gather needed to merge the adapter can cause a GPU OOM.
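For context (an editor's sketch, not code from the PR): merging a LoRA adapter under DeepSpeed ZeRO-3 typically requires gathering the partitioned base weights on each rank before PEFT's merge can run, which is where the memory spike comes from. A minimal illustration, assuming DeepSpeed and PEFT are installed and that `model` is a ZeRO-3-wrapped `PeftModel`:

```python
import deepspeed

# Hypothetical sketch: `model` is a PeftModel whose base weights are partitioned by ZeRO-3.
# Gathering the parameters materializes the full base weights on each participating rank,
# which is what can trigger the OOM described above.
with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=0):
    merged = model.merge_and_unload()  # full-size merged weights now resident in GPU memory
```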
I've tested this PR and it works well for r1-32b-int4 GRPO! Unfortunately, my current school workload and limited GPU access (just 2×A100 40GB) have prevented me from testing other setups thoroughly.
Hey @maoulee, I've been giving your implementation a try, as I've had issues around merging adapters, particularly when using different quantization techniques. What did you modify in the GRPO trainer to call your code correctly, particularly in the … Any help or code snippets you could share would be hugely appreciated if you have the time! Edit: the issue seems to be related to using …
Sorry, my email only reminded me just now. I have updated the code again, and you can use move_lora_to_vllm to update the adapter parameters. In vllm_client.py: … In vllm_serve.py: @app.post("/update_lora_param/") …
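Since the snippets referenced above did not survive the page export, here is a rough, hypothetical sketch (an editor's illustration, not the PR's actual code) of what a client/server pair around an /update_lora_param/ endpoint could look like. The endpoint path comes from the comment above; the helper name move_lora_to_vllm is taken from it as well, while the payload format, staging directory, and function bodies are assumptions:

```python
# vllm_serve.py (sketch) -- receives serialized LoRA weights and stages them for vLLM.
import io
import os
import torch
from fastapi import FastAPI, Request

app = FastAPI()
LORA_DIR = "/tmp/grpo_lora"  # assumed staging directory
os.makedirs(LORA_DIR, exist_ok=True)

@app.post("/update_lora_param/")
async def update_lora_param(request: Request):
    # Assumed payload: a torch-serialized state dict containing only the PEFT adapter weights.
    raw = await request.body()
    state_dict = torch.load(io.BytesIO(raw), map_location="cpu")
    torch.save(state_dict, os.path.join(LORA_DIR, "adapter_model.bin"))
    # The PR's vLLM patch would then pick these weights up as a LoRA adapter for generation.
    return {"status": "ok", "num_tensors": len(state_dict)}
```

```python
# vllm_client.py (sketch) -- called from the trainer, e.g. via a move_lora_to_vllm helper.
import io
import requests
import torch

def move_lora_to_vllm(peft_model, url="http://localhost:8000/update_lora_param/"):
    # Assumed behavior: extract only the LoRA tensors and POST them to the server.
    lora_state = {
        k: v.detach().cpu() for k, v in peft_model.state_dict().items() if "lora_" in k
    }
    buf = io.BytesIO()
    torch.save(lora_state, buf)
    resp = requests.post(url, data=buf.getvalue())
    resp.raise_for_status()
    return resp.json()
```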



What does this PR do?
This PR patches vLLM so that it can directly load the weights of a PEFT model and apply them as LoRA adapters during inference. This avoids having to merge the full model and transfer it to the generation server during online reinforcement-learning algorithms such as GRPO and Online DPO.
This provides the following benefits:
Limitations:
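To illustrate the mechanism described above, here is an editor's sketch using vLLM's public LoRA API (not the PR's patched code path): instead of merging the adapter into the base model and shipping the full weights to the generation server, the quantized base model stays loaded and the current adapter is attached per request. The model path, adapter name/ID, and adapter directory below are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# The quantized base model is loaded once; LoRA support must be enabled explicitly.
llm = LLM(model="path/to/quantized-base-model", enable_lora=True)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Each time the trainer updates the adapter, point generation at the new adapter weights
# instead of re-uploading a merged full-size model.
outputs = llm.generate(
    ["Prompt for rollout generation"],
    sampling_params,
    lora_request=LoRARequest("grpo_adapter", 1, "/tmp/grpo_lora"),
)
```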