[V1] Reuse V0's memory_profiling util for gpu worker memory profiling #19312
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Hello @yeqcharlotte, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR addresses a critical bug in the V1 memory profiling logic. Previously, V1 incorrectly included memory used by other processes on the same GPU as part of its own non_torch_allocations. This led to scenarios where attempting to start multiple vLLM instances on a single GPU, even with sufficient total memory available and appropriate gpu_memory_utilization settings, would fail with 'No available memory for the cache blocks' errors. This PR resolves this by adopting the more accurate memory profiling utility from V0, which correctly measures only the memory footprint of the current vLLM instance, allowing multiple instances to coexist on the same card as intended.
Highlights
- Bugfix: Incorrect Memory Profiling: Fixes a bug in V1's memory profiling where memory used by other processes on the same GPU was incorrectly counted towards the current instance's memory usage.
- Enables Multiple Instances on One GPU: Resolves the issue preventing multiple vLLM servers or benchmark processes from starting on the same GPU with appropriate memory utilization settings.
- Adopts V0 Profiling Method: Leverages the memory profiling utility from V0 (the memory_profiling context manager and MemorySnapshot) for more accurate measurement of the current instance's memory footprint.
- Improved Memory Logging: Adds detailed logging during initialization showing the breakdown of memory usage (total GPU, utilization, weights, non-torch, activation peak, KV cache).
Changelog
- vllm/v1/worker/gpu_worker.py
  - Imported MemorySnapshot and memory_profiling from vllm.utils (line 25).
  - Initialized self.baseline_snapshot = MemorySnapshot() after clearing the cache in init_device (line 134).
  - Wrapped the self.model_runner.profile_run() call within the memory_profiling context manager, passing the baseline snapshot and weights memory (lines 187-191).
  - Replaced the old logic for calculating non_torch_allocations and peak_memory with the results obtained from the memory_profiling context manager (result.non_kv_cache_memory) (lines 202-205).
  - Added detailed logging using logger.info to print the memory breakdown during initialization (lines 207-223).
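For reviewers skimming this summary, here is a minimal sketch of the pattern the changelog describes, not the PR's actual diff; the helper name and its parameters are stand-ins, and the vllm.utils signatures are assumed from the description above.

```python
# Hedged sketch only: estimate_available_kv_cache_memory and its parameters
# are stand-in names, not code from this PR.
import torch
from vllm.utils import MemorySnapshot, memory_profiling


def estimate_available_kv_cache_memory(model_runner, weights_memory: int,
                                       gpu_memory_utilization: float) -> int:
    # Take a per-process baseline after clearing the cache so that memory
    # held by other processes on the same GPU is excluded.
    torch.cuda.empty_cache()
    baseline_snapshot = MemorySnapshot()

    # Profile a dummy forward pass; the context manager measures this
    # instance's torch and non-torch allocations relative to the baseline.
    with memory_profiling(baseline_snapshot,
                          weights_memory=weights_memory) as profile_result:
        model_runner.profile_run()

    # The budget is a fraction of total GPU memory; everything that is not
    # KV cache (weights, activations, non-torch overhead) is subtracted.
    _, total_gpu_memory = torch.cuda.mem_get_info()
    requested_memory = int(total_gpu_memory * gpu_memory_utilization)
    return requested_memory - profile_result.non_kv_cache_memory
```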
Code Review
This pull request effectively addresses a significant bug in V1 memory profiling that prevented multiple vLLM servers from efficiently sharing a GPU. The approach of leveraging the V0 memory profiling utility through the memory_profiling context manager is a smart way to reuse existing, robust logic.
The changes are clear, and the detailed PR description, including the test plan and results, is very helpful for understanding the impact and verifying the fix. The new detailed log message for memory breakdown is also a great addition for observability.
Overall, the code quality is good, and the fix appears to be correct and well-implemented. I don't have any major concerns.
Merge Readiness
The pull request seems to be in good shape. The bug fix is well-targeted, and the solution appears robust. The provided unit and E2E tests confirm the fix. I believe this PR is ready for merging after standard CI checks pass. As always, ensure any other relevant reviewers have a chance to look before merging. I am not authorized to approve pull requests.
houseroad
left a comment
Overall, this looks like a pretty neat approach.
We can try to add a unit test to ensure that memory consumed by other processes is not incorrectly counted as non-torch memory in this case.
vllm/v1/worker/gpu_worker.py
Outdated
We can add a comment to explain the motivation for baseline_snapshot.
Wondering if @youkaichao or @WoosukKwon would like to give it a pass?

This pull request has merge conflicts that must be resolved before it can be merged.
ywang96
left a comment
The approach makes sense to me and I left a comment. Can you please also update this PR with main? Thanks!
vllm/v1/worker/gpu_worker.py
Outdated
I don't think we need to always show this message to the end user, so it's better to log this as debug rather than info.
It's also better to break up this long message for code readability. See:
vllm/v1/worker/gpu_worker.py, lines 236 to 245 in 8335667:

GiB = lambda b: b / GiB_bytes
logger.debug(
    "Initial free memory: %.2f GiB, free memory: %.2f GiB, "
    "total GPU memory: %.2f GiB", GiB(self.init_gpu_memory),
    GiB(free_gpu_memory), GiB(total_gpu_memory))
logger.debug(
    "Peak torch memory: %.2f GiB, non-torch forward-pass memory: "
    "%.2f GiB, available KVCache memory: %.2f GiB",
    GiB(peak_torch_memory), GiB(non_torch_alloc_bytes),
    GiB(available_kv_cache_memory))
    GiB(self.init_snapshot.free_memory), GiB(free_gpu_memory),
    GiB(self.requested_memory))
logger.debug(profile_result)
logger.info("Available KV cache memory: %.2f GiB",
@ywang96 I kept one info log, which I think is useful for users.
The existing memory_profiling util already has coverage for that: Line 323 in 12e5829
Speaking of that, it's a good time to check how many more V0 utils can be reused in V1.

We can create a refactor list for the V0 deprecation, with a more concrete plan for which parts of V0 should be reused.
ProExpertProg
left a comment
I like this, thanks for cleaning this up. Could you also add the logging changes from #17122?
vllm/v1/worker/gpu_worker.py
Outdated
        self.model_runner.model_memory_usage)) as profile_result:
    self.model_runner.profile_run()

free_gpu_memory, _ = torch.cuda.mem_get_info()
Could this use the existing data from the profile_result?
vllm/v1/worker/gpu_worker.py
Outdated
| f"current free memory {free_gpu_memory/GiB_bytes} GiB. " | ||
| f"Initial free memory {GiB(self.init_snapshot.free_memory)} GiB, " | ||
| f"current free memory {GiB(free_gpu_memory)} GiB. " | ||
| f"This happens when the GPU memory was not properly cleaned up " |
I think we can improve this message. We should say something along the lines of "other processes freed up memory during profiling".
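A hedged sketch of what the reworded message might look like; the helper name and its parameters are hypothetical, and only the wording suggestion comes from this thread.

```python
def profiling_error_message(initial_free_gib: float,
                            current_free_gib: float) -> str:
    # Hypothetical helper: the reworded text mentions that other processes
    # may have freed GPU memory during profiling, per the suggestion above.
    return ("Error in memory profiling. "
            f"Initial free memory {initial_free_gib:.2f} GiB, "
            f"current free memory {current_free_gib:.2f} GiB. "
            "This can happen when other processes free GPU memory while "
            "vLLM is profiling during initialization.")
```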
    peak_memory)

GiB = lambda b: b / GiB_bytes
logger.debug(
Should we make this info as well? It would be very useful.
I guess this message could be confusing to generic end users, so I'm keeping it as debug for now; developers can turn it on with VLLM_LOGGING_LEVEL=DEBUG.
It's always easier to flip it later if too many issues complain about this part.
Generally speaking, we now try to reduce the amount of server startup logs as much as possible so that it's less confusing to end users, and IMO it makes sense to keep this kind of information at the DEBUG level.
logger.debug(profile_result)
logger.info("Available KV cache memory: %.2f GiB",
            GiB(available_kv_cache_memory))
gc.collect()
Calling gc.collect() after the log doesn't feel right.
It's some proactive final cleanup after the profile run, just in case we left some objects around. It probably wouldn't matter too much either way, and the GC performance overhead here shouldn't matter either.
@ProExpertProg @Datta0 @ywang96 @maxdebayser - let me know if you folks have more feedback! Thanks!
Purpose
Update: Rebased on top of #18974. Cleaned up duplicated calls, improved code readability and logging.
Follow-up PRs to reduce OOMs:
old context before rebase:
V1's memory profiling incorrectly counts memory used by other processes as non_torch_allocations. This means that if you try to start two vLLM servers on the same GPU -- one using 70% of HBM, the other 20% -- the second one complains about not having enough memory.
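Purely as an illustration of that scenario (not the collapsed test commands in the Test Plan below): two copies of a script like the one here, launched concurrently on one GPU with different utilization values, would previously make the second instance fail with "No available memory for the cache blocks". The model name and the GPU_MEMORY_UTILIZATION environment variable are placeholders.

```python
# Illustrative reproduction only; the model and env var name are placeholders.
import os

from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    gpu_memory_utilization=float(
        os.environ.get("GPU_MEMORY_UTILIZATION", "0.2")),
)
print(llm.generate("Hello, my name is"))
```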
This seems to bother quite a few users in the V0 deprecation RFC: #18571 (comment).
So this reattempts #14419 by leveraging V0's memory profiling util to address that, as @youkaichao suggested.
Test Plan
Starting 2 servers as below:
Test Result
before
after
old test before rebase:
before
after
before
after