
Conversation

@yeqcharlotte (Collaborator) commented Jun 7, 2025

Purpose

Update: Rebased on top of #18974. Cleaned up duplicated calls and improved code readability and logging.

Follow-up PRs to reduce OOMs:

  • Account for CUDA graph memory
  • torch memory snapshot capture for profiling
  • max-model-len auto for long ctx model

Old context before rebase:
V1's memory profiling incorrectly counts memory used by other processes as non_torch_allocations. This means that if you try to start two vLLM servers on the same GPU -- one using 70% of HBM, the other 20% -- the second one will complain about not having enough memory.

This seems to bother quite a few users in the V0 deprecation RFC: #18571 (comment).

So this reattempts #14419 by leveraging V0's memory profiling util to address that, as @youkaichao suggested.
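To make the difference concrete, here is a minimal, hypothetical sketch of the two accounting approaches (illustrative only, not vLLM's actual code; torch.cuda.mem_get_info and torch.cuda.memory_reserved stand in for the real bookkeeping):

import torch

# Device-wide readings taken before and after loading weights and running a
# dummy forward pass (profile_run).
free_before, total = torch.cuda.mem_get_info()
# ... load weights, run the dummy forward pass ...
free_after, _ = torch.cuda.mem_get_info()
torch_reserved = torch.cuda.memory_reserved()  # held by this process's torch allocator

# Buggy device-wide accounting: anything on the GPU that the local torch
# allocator does not track gets blamed on this instance, including memory
# held by completely unrelated processes.
non_torch_buggy = (total - free_after) - torch_reserved

# Per-process accounting (the V0-style approach): only the drop in free
# memory relative to this instance's own baseline is attributed to it.
non_torch_per_process = (free_before - free_after) - torch_reserved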

Test Plan

  1. Unit test -- it's already covered by the following test, which is broken on main:
pytest -vv tests/entrypoints/llm/test_gpu_utilization.py
  2. E2E test
    Starting 2 servers as below:
CUDA_VISIBLE_DEVICES=0 wp vllm serve "meta-llama/Llama-3.2-1B" \
    --gpu-memory-utilization 0.7
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0 wp python3 benchmarks/benchmark_latency.py \
    --model "meta-llama/Llama-3.2-1B" \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --gpu-memory-utilization 0.2

Test Result

before

DEBUG 06-09 00:54:11 [gpu_worker.py:237] Initial free memory: 40.32 GiB, free memory: 37.74 GiB, total GPU memory: 139.72 GiB
DEBUG 06-09 00:54:11 [gpu_worker.py:241] Peak torch memory: 7.07 GiB, non-torch forward-pass memory: 0.17 GiB, available KVCache memory: 20.70 GiB

after

DEBUG 06-09 01:03:28 [gpu_worker.py:221] Initial free memory: 40.32 GiB, free memory: 37.74 GiB, requested GPU memory: 27.94 GiB
DEBUG 06-09 01:03:28 [gpu_worker.py:225] Memory profiling takes 6.13 seconds. Total non KV cache memory: 7.16GiB; torch peak memory increase: 4.69GiB; non-torch forward increase memory: 0.15GiB; weights memory: 2.32GiB.
INFO 06-09 01:03:28 [gpu_worker.py:226] Available KV cache memory: 20.78 GiB
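For reference, the numbers in the "after" log are self-consistent: requested GPU memory = total GPU memory x gpu_memory_utilization = 139.72 GiB x 0.2 ≈ 27.94 GiB; total non-KV-cache memory = weights (2.32 GiB) + torch peak increase (4.69 GiB) + non-torch forward increase (0.15 GiB) ≈ 7.16 GiB; available KV cache memory = 27.94 GiB - 7.16 GiB ≈ 20.78 GiB.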

old test before rebase:

  1. Unit test

before

ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
================================================ short test summary info =================================================
FAILED tests/entrypoints/llm/test_gpu_utilization.py::test_gpu_memory_utilization - RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

after

tests/entrypoints/llm/test_gpu_utilization.py::test_gpu_memory_utilization PASSED                                  [100%]
  2. E2E test

before

ERROR 06-03 03:14:47 [core.py:508]     check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
ERROR 06-03 03:14:47 [core.py:508]   File "/data/users/yeq/gitrepos/vllm/vllm/v1/core/kv_cache_utils.py", line 532, in check_enough_kv_cache_memory
ERROR 06-03 03:14:47 [core.py:508]     raise ValueError("No available memory for the cache blocks. "
ERROR 06-03 03:14:47 [core.py:508] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

after

INFO 06-06 23:28:31 [gpu_worker.py:224] Memory profiling takes 6.20 seconds
INFO 06-06 23:28:31 [gpu_worker.py:224] the current vLLM instance can use total_gpu_memory (139.72GiB) x gpu_memory_utilization (0.20) = 27.94GiB
INFO 06-06 23:28:31 [gpu_worker.py:224] model weights take 2.32GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 4.69GiB; the rest of the memory reserved for KV Cache is 20.78GiB.
INFO 06-06 23:28:31 [kv_cache_utils.py:638] GPU KV cache size: 681,040 tokens
INFO 06-06 23:28:31 [kv_cache_utils.py:641] Maximum concurrency for 131,072 tokens per request: 5.20x
...
Avg latency: 0.2802899726356069 seconds

@github-actions bot commented Jun 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Hello @yeqcharlotte, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request. This PR addresses a critical bug in the V1 memory profiling logic. Previously, V1 incorrectly included memory used by other processes on the same GPU as part of its own non_torch_allocations. This led to scenarios where attempting to start multiple vLLM instances on a single GPU, even with sufficient total memory available and appropriate gpu_memory_utilization settings, would fail with 'No available memory for the cache blocks' errors. This PR resolves this by adopting the more accurate memory profiling utility from V0, which correctly measures only the memory footprint of the current vLLM instance, allowing multiple instances to coexist on the same card as intended.

Highlights

  • Bugfix: Incorrect Memory Profiling: Fixes a bug in V1's memory profiling where memory used by other processes on the same GPU was incorrectly counted towards the current instance's memory usage.
  • Enables Multiple Instances on One GPU: Resolves the issue preventing multiple vLLM servers or benchmark processes from starting on the same GPU with appropriate memory utilization settings.
  • Adopts V0 Profiling Method: Leverages the memory profiling utility from V0 (memory_profiling context manager and MemorySnapshot) for more accurate measurement of the current instance's memory footprint.
  • Improved Memory Logging: Adds detailed logging during initialization showing the breakdown of memory usage (total GPU, utilization, weights, non-torch, activation peak, KV cache).

Changelog

  • vllm/v1/worker/gpu_worker.py
    • Imported MemorySnapshot and memory_profiling from vllm.utils (line 25).
    • Initialized self.baseline_snapshot = MemorySnapshot() after clearing cache in init_device (line 134).
    • Wrapped the self.model_runner.profile_run() call within the memory_profiling context manager, passing the baseline snapshot and weights memory (lines 187-191).
    • Replaced the old logic for calculating non_torch_allocations and peak_memory with the results obtained from the memory_profiling context manager (result.non_kv_cache_memory) (lines 202-205).
    • Added detailed logging using logger.info to print the memory breakdown during initialization (lines 207-223).
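Putting the changelog together, the control flow is roughly as follows. This is a simplified sketch, not the exact diff: the helper name determine_available_kv_cache_memory and the worker argument are illustrative, while MemorySnapshot, memory_profiling, profile_run, model_memory_usage, and non_kv_cache_memory are the names mentioned above.

import logging

import torch

from vllm.utils import GiB_bytes, MemorySnapshot, memory_profiling

logger = logging.getLogger(__name__)


def determine_available_kv_cache_memory(worker) -> int:
    """Illustrative only; `worker` stands in for the V1 GPUWorker."""
    # The baseline snapshot is taken earlier, in init_device() right after
    # empty_cache(), so memory already held by other processes on the same
    # GPU is excluded from this instance's accounting.
    baseline = worker.baseline_snapshot

    # Attribute only the growth caused by the dummy forward pass (plus the
    # already-known weights memory) to this instance.
    with memory_profiling(
            baseline,
            weights_memory=worker.model_runner.model_memory_usage
    ) as profile_result:
        worker.model_runner.profile_run()

    _, total_gpu_memory = torch.cuda.mem_get_info()
    requested = total_gpu_memory * worker.cache_config.gpu_memory_utilization
    available_kv_cache_memory = requested - profile_result.non_kv_cache_memory

    logger.debug(profile_result)
    logger.info("Available KV cache memory: %.2f GiB",
                available_kv_cache_memory / GiB_bytes)
    return int(available_kv_cache_memory)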

@mergify mergify bot added the v1 label Jun 7, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request effectively addresses a significant bug in V1 memory profiling that prevented multiple vLLM servers from efficiently sharing a GPU. The approach of leveraging the V0 memory profiling utility through the memory_profiling context manager is a smart way to reuse existing, robust logic.

The changes are clear, and the detailed PR description, including the test plan and results, is very helpful for understanding the impact and verifying the fix. The new detailed log message for memory breakdown is also a great addition for observability.

Overall, the code quality is good, and the fix appears to be correct and well-implemented. I don't have any major concerns.

Merge Readiness

The pull request seems to be in good shape. The bug fix is well-targeted, and the solution appears robust. The provided unit and E2E tests confirm the fix. I believe this PR is ready for merging after standard CI checks pass. As always, ensure any other relevant reviewers have a chance to look before merging. I am not authorized to approve pull requests.

@houseroad (Collaborator) left a comment

Overall, this looks like a pretty neat approach.

We could add a unit test to ensure that memory consumed by other processes is not incorrectly counted as non-torch memory in this case.

Review comment (Collaborator):

We could add a comment to explain the motivation for baseline_snapshot.
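For example, the comment could read something along these lines (wording is only a suggestion):

# Take the baseline snapshot right after empty_cache() in init_device(), before
# any weights are loaded: memory_profiling() later measures growth relative to
# this baseline, so memory that other processes already hold on the same GPU is
# not mistakenly attributed to this vLLM instance.
self.baseline_snapshot = MemorySnapshot()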

@houseroad (Collaborator) commented:

Wondering if @youkaichao or @WoosukKwon would like to give it a pass?

mergify bot commented Jun 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yeqcharlotte.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 7, 2025
@ywang96 (Member) left a comment

The approach makes sense to me and I left a comment. Can you please also update this PR with main? Thanks!

Comment on lines 207 to 223

@ywang96 (Member) commented Jun 9, 2025:

I don't think we need to always show this message to the end user, so it's better to have this as debug rather than info.

It's also better to break up this long message for code readability. See:

GiB = lambda b: b / GiB_bytes
logger.debug(
    "Initial free memory: %.2f GiB, free memory: %.2f GiB, "
    "total GPU memory: %.2f GiB", GiB(self.init_gpu_memory),
    GiB(free_gpu_memory), GiB(total_gpu_memory))
logger.debug(
    "Peak torch memory: %.2f GiB, non-torch forward-pass memory: "
    "%.2f GiB, available KVCache memory: %.2f GiB",
    GiB(peak_torch_memory), GiB(non_torch_alloc_bytes),
    GiB(available_kv_cache_memory))

Signed-off-by: Ye (Charlotte) Qi <[email protected]>
@yeqcharlotte yeqcharlotte changed the title [Bugfix][V1] Fix memory profile to allow multiple servers to start on the same card [V1] Reuse V0's memory_profile util for gpu worker memory profiling Jun 9, 2025
GiB(self.init_snapshot.free_memory), GiB(free_gpu_memory),
GiB(self.requested_memory))
logger.debug(profile_result)
logger.info("Available KV cache memory: %.2f GiB",
@yeqcharlotte (Collaborator, Author) replied:

@ywang96 I kept one message at info level, which I think is useful for users.

@yeqcharlotte yeqcharlotte changed the title [V1] Reuse V0's memory_profile util for gpu worker memory profiling [V1] Reuse V0's memory_profiling util for gpu worker memory profiling Jun 9, 2025
@yeqcharlotte (Collaborator, Author) commented:

Overall, this looks like a pretty neat approach.

We could add a unit test to ensure that memory consumed by other processes is not incorrectly counted as non-torch memory in this case.

The existing memory_profiling util already has coverage for that:

def test_memory_profiling():

Speaking of that, it's a good time to check how many more V0 utils can be reused in V1.
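For readers unfamiliar with that test, the idea is roughly the following (a simplified, hypothetical sketch, not the actual test body; it uses a plain torch allocation made before the baseline snapshot to stand in for memory that is already in use):

import torch

from vllm.utils import MemorySnapshot, memory_profiling


def test_other_memory_not_counted():
    # Stand-in for memory that is already in use before this instance's
    # baseline (e.g. held by another process on the same GPU): 256 MiB.
    foreign = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda")

    baseline = MemorySnapshot()  # taken after the "foreign" allocation

    # Pretend to load 256 MiB of weights and run a small forward pass.
    weights = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    with memory_profiling(baseline, weights_memory=weights.numel()) as result:
        activations = torch.empty(128 * 1024 * 1024,
                                  dtype=torch.uint8,
                                  device="cuda")
        del activations

    # Only the weights plus the activation peak should be attributed to this
    # instance; the 256 MiB allocated before the baseline must not be counted.
    assert result.non_kv_cache_memory < 1.5 * (256 + 128) * 1024 * 1024
    del foreign, weights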

@houseroad houseroad added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 9, 2025
@houseroad (Collaborator) commented:

We can create a refactoring list for the V0 deprecation, i.e., a more concrete plan for which parts of V0 should be reused.

@ProExpertProg (Collaborator) left a comment

I like this, thanks for cleaning this up. Could you also add the logging changes from #17122?

self.model_runner.model_memory_usage)) as profile_result:
self.model_runner.profile_run()

free_gpu_memory, _ = torch.cuda.mem_get_info()
Review comment (Collaborator):

Could this use the existing data from the profile_result?

f"current free memory {free_gpu_memory/GiB_bytes} GiB. "
f"Initial free memory {GiB(self.init_snapshot.free_memory)} GiB, "
f"current free memory {GiB(free_gpu_memory)} GiB. "
f"This happens when the GPU memory was not properly cleaned up "
Review comment (Collaborator):

I think we can improve this message. We should say something along the lines of "another process freed up memory during profiling".
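For example, something along these lines (an illustration of the suggested wording only, not the exact code in the PR):

assert free_gpu_memory < self.init_snapshot.free_memory, (
    "Error in memory profiling. "
    f"Initial free memory {GiB(self.init_snapshot.free_memory)} GiB, "
    f"current free memory {GiB(free_gpu_memory)} GiB. "
    "This can happen when other processes on the same GPU freed memory "
    "while profiling was running, or when the GPU memory was not properly "
    "cleaned up before starting this vLLM instance.")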

peak_memory)

GiB = lambda b: b / GiB_bytes
logger.debug(
Review comment:

Should we make this info as well? It would be very useful.

@yeqcharlotte (Collaborator, Author) replied:

I guess this message could be confusing to generic end users, so I'm keeping it as debug for now; developers can turn it on via VLLM_LOGGING_LEVEL=DEBUG. It's always easier to flip it later if too many issues complain about this part.

@ywang96 (Member) commented Jun 9, 2025:

Generally speaking, we now try to reduce the amount of server startup logging as much as possible so that it's less confusing to end users, and IMO it makes sense to keep this kind of information at DEBUG level.

logger.debug(profile_result)
logger.info("Available KV cache memory: %.2f GiB",
            GiB(available_kv_cache_memory))
gc.collect()
Review comment:

gc.collect() after the log doesn't feel right.

@yeqcharlotte (Collaborator, Author) replied:

It's some proactive final cleanup after the profile run, just in case we left some objects around; it probably doesn't matter too much either way. The GC performance overhead here shouldn't matter either.

Signed-off-by: Ye (Charlotte) Qi <[email protected]>
@yeqcharlotte (Collaborator, Author) commented:

@ProExpertProg @Datta0 @ywang96 @maxdebayser - let me know if you folks have more feedback! Thanks!

@houseroad houseroad merged commit cc867be into vllm-project:main Jun 10, 2025
64 checks passed

Labels

  • ready (ONLY add when PR is ready to merge/full CI is needed)
  • v1

5 participants