Conversation

jchen10 (Contributor) commented Mar 9, 2025

With the Dawn BufferMapExtendedUsages feature available, we can now copy data directly to the GPU buffer on UMA GPUs without a staging buffer. This PR enables the feature for model uploading during inference-session initialization, where ORT reaches its peak memory usage. To achieve this, we introduce a new 'SetAllocHint' method on 'IAllocator', so that the WebGPU allocator can direct the buffer manager to handle UMA buffers appropriately.

Credit to @fs-eire for the overall design of the implementation.

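For illustration, here is a minimal sketch of what the proposed hook could look like. The hint values and the default no-op body are assumptions based on this PR's description, not the final API:

#include <cstddef>

// Hypothetical hint values; UploadDst marks allocations that will be the
// destination of a host-to-GPU upload (e.g. initializers at session init).
enum class AllocHint {
  None,
  UploadDst,
};

class IAllocator {
 public:
  virtual ~IAllocator() = default;
  virtual void* Alloc(size_t size) = 0;
  virtual void Free(void* p) = 0;
  // New: lets the session hint the intended use of upcoming allocations.
  // A no-op by default, so allocators such as CPUAllocator are unaffected.
  virtual void SetAllocHint(AllocHint /*hint*/) {}
};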
jchen10 (Author) commented Mar 9, 2025

@fs-eire @xhcao
With your input in #23910, I've come up with this one, which is much simpler while still effective.

model_benchmark.exe -i ..\models-genai\Phi-3.5-mini-instruct-onnx-web
Batch size: 1, prompt tokens: 17, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       284191
        avg (tokens/s): 59.819
        p50 (us):       274340
        stddev (us):    15825
        n:              5 * 17 token(s)
Token generation:
        avg (us):       53967.1
        avg (tokens/s): 18.5298
        p50 (us):       53783.8
        stddev (us):    2169.13
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       10.26
        avg (tokens/s): 97465.9
        p50 (us):       7.7
        stddev (us):    4.48475
        n:              5 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       7138.11
        p50 (ms):       7140.06
        stddev (ms):    17.9578
        n:              5
Peak working set size (bytes): 3414822912

if (alloc_hint_ == AllocHint::UploadDst && context_.SupportsBufferMapExtendedUsages()) {
  buffer = context_.BufferManager().CreateUMA(size);
} else {
  buffer = context_.BufferManager().Create(size);
}
Contributor commented:
Given the AllocHint::UploadDst hint, does it mean that for the non-UMA path we can also create the storage buffer with mappedAtCreation = true? Then in the BufferManager::Upload path, the staging buffer could also be removed.

In that case it would be a fair comparison of peak memory, to see how much UMA actually brings :)
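For reference, creating a storage buffer with mappedAtCreation and writing it directly might look like this in the Dawn C++ API (a generic WebGPU sketch, not ORT's BufferManager code):

#include <cstring>
#include <webgpu/webgpu_cpp.h>

// Fills a storage buffer without a staging buffer by using mappedAtCreation.
// After Unmap() the buffer cannot be re-mapped (it has no MapWrite usage),
// so this only helps the initial upload.
wgpu::Buffer CreateStorageBufferWithData(const wgpu::Device& device,
                                         const void* src_data, size_t size) {
  wgpu::BufferDescriptor desc{};
  desc.size = size;
  desc.usage = wgpu::BufferUsage::Storage | wgpu::BufferUsage::CopySrc |
               wgpu::BufferUsage::CopyDst;
  desc.mappedAtCreation = true;  // buffer starts mapped; no MapWrite needed
  wgpu::Buffer buffer = device.CreateBuffer(&desc);
  std::memcpy(buffer.GetMappedRange(), src_data, size);
  buffer.Unmap();
  return buffer;
}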

jchen10 (Author) replied:

You are right. I actually tried that too; however, peak memory slightly increased rather than dropping significantly. Maybe it's worth a shot on a dGPU.

Contributor replied:

That's weird. I noticed another issue: it seems that for uploading the UMA way, we can't reuse an existing buffer and always have to create a new one. That's not a problem for the weights, since we need them for as long as the session is alive. But if some inputs live on the CPU, they also can't reuse an existing buffer on each run, right?

jchen10 (Author) replied:

The BucketCacheManager does support buffer reuse, but it has to use a staging buffer for upload and download.

#23910 can improve that so no staging buffer is needed, but it's much more complicated than this one. I haven't found a use case showing that the increased complexity pays off.
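For context, the staging-buffer upload path mentioned above looks roughly like this in generic Dawn C++ (a sketch, not ORT's actual BufferManager::Upload):

#include <cstring>
#include <webgpu/webgpu_cpp.h>

// Uploads data into an existing (possibly cache-reused) GPU buffer via a
// transient staging buffer plus a GPU-side copy.
void UploadViaStaging(const wgpu::Device& device, const wgpu::Queue& queue,
                      const wgpu::Buffer& dst, const void* src, size_t size) {
  wgpu::BufferDescriptor staging_desc{};
  staging_desc.size = size;
  staging_desc.usage = wgpu::BufferUsage::MapWrite | wgpu::BufferUsage::CopySrc;
  staging_desc.mappedAtCreation = true;
  wgpu::Buffer staging = device.CreateBuffer(&staging_desc);

  std::memcpy(staging.GetMappedRange(), src, size);
  staging.Unmap();

  // The extra allocation and copy below are exactly what the UMA path avoids.
  wgpu::CommandEncoder encoder = device.CreateCommandEncoder();
  encoder.CopyBufferToBuffer(staging, 0, dst, 0, size);
  wgpu::CommandBuffer commands = encoder.Finish();
  queue.Submit(1, &commands);
}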

Contributor replied:

Yulong submitted PR #23947 to release the staging buffer early. Can you compare UMA's perf with that one? I'm curious about UMA's gains on iGPUs.

jchen10 (Author) replied:

Here is the data for PR #23947. It seems the UMA gain is only about 0.2 GB of peak working set (3414822912 vs. 3659186176 bytes).

model_benchmark.exe -i ..\models-genai\Phi-3.5-mini-instruct-onnx-web
Batch size: 1, prompt tokens: 17, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       284987
        avg (tokens/s): 59.6519
        p50 (us):       277820
        stddev (us):    14648.9
        n:              5 * 17 token(s)
Token generation:
        avg (us):       55056.5
        avg (tokens/s): 18.1632
        p50 (us):       54971.5
        stddev (us):    2005.99
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       10.26
        avg (tokens/s): 97465.9
        p50 (us):       7.2
        stddev (us):    4.87473
        n:              5 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       7277.27
        p50 (ms):       7264.27
        stddev (ms):    41.85
        n:              5
Peak working set size (bytes): 3659186176

Contributor replied:

If possible, we'd also like to measure GPU memory usage and latency, which in my understanding should show some gain.

jchen10 (Author) replied Mar 10, 2025:

I tried adding a new metric that measures model-creation time at https://github.com/microsoft/onnxruntime-genai/blob/2b741723abb2814e899d3d6272022f4c2fa66a5d/benchmark/c/main.cpp#L137:

void RunBenchmark(const benchmark::Options& opts) {
  std::unique_ptr<OgaModel> model;
  std::vector<Duration> model_create_times;
  {
    // RAII timer: records the elapsed model-creation time into
    // model_create_times when it leaves this scope.
    Timing model_create_timing{model_create_times};
    model = OgaModel::Create(opts.model_path.c_str());
  }
  // ... rest of RunBenchmark unchanged

You can get the data I measured against this PR and #23947, respectively, here:

https://microsoft.sharepoint.com/:f:/r/teams/MS-Intelcollaboration/Shared%20Documents/General/Jie-Individual/stagingbuffer?csf=1&web=1&e=re4zLf

fs-eire (Contributor) commented Mar 10, 2025

Regarding the 3 recent PRs:

I believe there are 2 features that can be merged:

  • early release of the upload staging buffer
  • enabling UMA for downloading

The two features above should be straightforward.

For enabling UMA for uploading, I believe the current status is already much clearer than the original. However, we still need to figure out the best approach; for example, how to revise the AllocHint added to the base class IAllocator, where "upload destination" is not an appropriate concept for an abstract allocator (which is also the base class of CPUAllocator).

fs-eire commented Mar 10, 2025

After carefully checking how lifecycle and ownership are managed for instances of WebGpuContext, WebGpuExecutionProvider, GpuBufferAllocator, and BufferManager, I found an easy and clean way to implement UMA buffers for initializers.

Again, this is based on 2 assumptions:

  • During the whole model initialization process, all created GPU buffers are used for uploading initializers.
  • Applying the UMA optimization to initializer uploading should capture most of the perf gain for uploading.

The design only works when both assumptions hold.

We can add a boolean flag to GpuBufferAllocator, something like initializing, defaulting to true. Then override WebGpuExecutionProvider::OnSessionInitializationEnd() to set the flag to false. GpuBufferAllocator can use this flag to decide whether to create buffers with mappedAtCreation.

For BufferManager, the only change should be: if the buffer must be created with mappedAtCreation, bypass the cache check and always create a new one. (This part may need discussion.)
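A minimal sketch of this design, using simplified stand-in types; the real classes live in ORT's WebGPU provider and the exact signatures here are assumptions:

#include <cstddef>

// Stand-ins for ORT's WebGPU types, for illustration only.
struct WebGpuBufferManager {
  void* CreateUMA(size_t size);  // mappedAtCreation buffer; bypasses the cache
  void* Create(size_t size);     // regular buffer, may come from the bucket cache
};

struct WebGpuContext {
  bool SupportsBufferMapExtendedUsages() const;
  WebGpuBufferManager& BufferManager();
};

class GpuBufferAllocator {
 public:
  explicit GpuBufferAllocator(WebGpuContext& context) : context_(context) {}

  void* Alloc(size_t size) {
    // Assumption 1 above: while initializing_ is true, every allocation is an
    // initializer upload, so prefer a UMA (mappedAtCreation) buffer.
    if (initializing_ && context_.SupportsBufferMapExtendedUsages()) {
      return context_.BufferManager().CreateUMA(size);
    }
    return context_.BufferManager().Create(size);
  }

  // Called from WebGpuExecutionProvider::OnSessionInitializationEnd().
  void OnSessionInitializationEnd() { initializing_ = false; }

 private:
  WebGpuContext& context_;
  bool initializing_ = true;  // default true, per the design above
};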

guschmue added the ep:WebGPU ort-web webgpu provider label Mar 10, 2025
jchen10 closed this Mar 12, 2025
jchen10 deleted the uma branch April 2, 2025