Conversation

jchen10 (Contributor) commented Mar 9, 2025

With the Dawn BufferMapExtendedUsages feature available, we can now copy data directly to the GPU buffer on UMA GPUs without a staging buffer. This PR enables the feature for model uploading during inference-session initialization, where ORT reaches its peak memory usage. To achieve this, we introduce a new 'SetAllocHint' method on 'IAllocator', so that the WebGPU allocator can direct the buffer manager to handle UMA buffers appropriately.

Credit to @fs-eire for the overall design of the implementation.

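For illustration, here is a minimal sketch of what the proposed hook could look like. The hint values and the default no-op body are assumptions based on this PR's description, not the final API:

#include <cstddef>

// Hypothetical hint values; UploadDst marks allocations that will be the
// destination of a host-to-GPU upload (e.g. initializers at session init).
enum class AllocHint {
  None,
  UploadDst,
};

class IAllocator {
 public:
  virtual ~IAllocator() = default;
  virtual void* Alloc(size_t size) = 0;
  virtual void Free(void* p) = 0;
  // New: lets the session hint the intended use of upcoming allocations.
  // A no-op by default, so allocators such as CPUAllocator are unaffected.
  virtual void SetAllocHint(AllocHint /*hint*/) {}
};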
jchen10 (Author) commented Mar 9, 2025

@fs-eire @xhcao
With your input in #23910, I've come up with this one, which is much simpler while still effective.

model_benchmark.exe -i ..\models-genai\Phi-3.5-mini-instruct-onnx-web
Batch size: 1, prompt tokens: 17, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       284191
        avg (tokens/s): 59.819
        p50 (us):       274340
        stddev (us):    15825
        n:              5 * 17 token(s)
Token generation:
        avg (us):       53967.1
        avg (tokens/s): 18.5298
        p50 (us):       53783.8
        stddev (us):    2169.13
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       10.26
        avg (tokens/s): 97465.9
        p50 (us):       7.7
        stddev (us):    4.48475
        n:              5 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       7138.11
        p50 (ms):       7140.06
        stddev (ms):    17.9578
        n:              5
Peak working set size (bytes): 3414822912

if (alloc_hint_ == AllocHint::UploadDst && context_.SupportsBufferMapExtendedUsages()) {
  buffer = context_.BufferManager().CreateUMA(size);
} else {
  buffer = context_.BufferManager().Create(size);
}
Contributor commented:
Given the AllocHint::UploadDst hint, does it mean that for the non-UMA path we can also create the storage buffer with mappedAtCreation = true? Then in the BufferManager::Upload path, the staging buffer could also be removed.

In that case it would be a fair comparison of peak memory, to see how much UMA actually brings :)
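For reference, creating a storage buffer with mappedAtCreation and writing it directly might look like this in the Dawn C++ API (a generic WebGPU sketch, not ORT's BufferManager code):

#include <cstring>
#include <webgpu/webgpu_cpp.h>

// Fills a storage buffer without a staging buffer by using mappedAtCreation.
// After Unmap() the buffer cannot be re-mapped (it has no MapWrite usage),
// so this only helps the initial upload.
wgpu::Buffer CreateStorageBufferWithData(const wgpu::Device& device,
                                         const void* src_data, size_t size) {
  wgpu::BufferDescriptor desc{};
  desc.size = size;
  desc.usage = wgpu::BufferUsage::Storage | wgpu::BufferUsage::CopySrc |
               wgpu::BufferUsage::CopyDst;
  desc.mappedAtCreation = true;  // buffer starts mapped; no MapWrite needed
  wgpu::Buffer buffer = device.CreateBuffer(&desc);
  std::memcpy(buffer.GetMappedRange(), src_data, size);
  buffer.Unmap();
  return buffer;
}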

jchen10 (Author) replied:

You are right. I actually tried that too; however, peak memory slightly increased rather than dropping significantly. Maybe it's worth a shot on a dGPU.

Contributor replied:

That's weird. I noticed another issue: it seems that for uploading the UMA way, we can't reuse an existing buffer and always have to create a new one. That's not a problem for the weights, since we need them for as long as the session is alive. But if some inputs live on the CPU, they also can't reuse an existing buffer on each run, right?

jchen10 (Author) replied:

The BucketCacheManager does support buffer reuse, but it has to use a staging buffer for upload and download.

#23910 can improve that so no staging buffer is needed, but it's much more complicated than this one. I haven't found a use case showing that the increased complexity pays off.
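For context, the staging-buffer upload path mentioned above looks roughly like this in generic Dawn C++ (a sketch, not ORT's actual BufferManager::Upload):

#include <cstring>
#include <webgpu/webgpu_cpp.h>

// Uploads data into an existing (possibly cache-reused) GPU buffer via a
// transient staging buffer plus a GPU-side copy.
void UploadViaStaging(const wgpu::Device& device, const wgpu::Queue& queue,
                      const wgpu::Buffer& dst, const void* src, size_t size) {
  wgpu::BufferDescriptor staging_desc{};
  staging_desc.size = size;
  staging_desc.usage = wgpu::BufferUsage::MapWrite | wgpu::BufferUsage::CopySrc;
  staging_desc.mappedAtCreation = true;
  wgpu::Buffer staging = device.CreateBuffer(&staging_desc);

  std::memcpy(staging.GetMappedRange(), src, size);
  staging.Unmap();

  // The extra allocation and copy below are exactly what the UMA path avoids.
  wgpu::CommandEncoder encoder = device.CreateCommandEncoder();
  encoder.CopyBufferToBuffer(staging, 0, dst, 0, size);
  wgpu::CommandBuffer commands = encoder.Finish();
  queue.Submit(1, &commands);
}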

Contributor replied:

Yulong submitted PR #23947 to release the staging buffer early. Can you compare UMA's perf with that one? I'm curious about UMA's gains on iGPUs.

jchen10 (Author) replied:

Here is the data for PR #23947. It seems the UMA gain is only about 0.2 GB of peak working set (3414822912 vs. 3659186176 bytes).

model_benchmark.exe -i ..\models-genai\Phi-3.5-mini-instruct-onnx-web
Batch size: 1, prompt tokens: 17, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       284987
        avg (tokens/s): 59.6519
        p50 (us):       277820
        stddev (us):    14648.9
        n:              5 * 17 token(s)
Token generation:
        avg (us):       55056.5
        avg (tokens/s): 18.1632
        p50 (us):       54971.5
        stddev (us):    2005.99
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       10.26
        avg (tokens/s): 97465.9
        p50 (us):       7.2
        stddev (us):    4.87473
        n:              5 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       7277.27
        p50 (ms):       7264.27
        stddev (ms):    41.85
        n:              5
Peak working set size (bytes): 3659186176

Contributor replied:

If possible, we'd also like to measure GPU memory usage and latency, which in my understanding should show some gain.

jchen10 (Author) replied Mar 10, 2025:

I tried adding a new metric that measures model-creation time at https://github.com/microsoft/onnxruntime-genai/blob/2b741723abb2814e899d3d6272022f4c2fa66a5d/benchmark/c/main.cpp#L137:

void RunBenchmark(const benchmark::Options& opts) {
  std::unique_ptr<OgaModel> model;
  std::vector<Duration> model_create_times;
  {
    // RAII timer: records the elapsed model-creation time into
    // model_create_times when it leaves this scope.
    Timing model_create_timing{model_create_times};
    model = OgaModel::Create(opts.model_path.c_str());
  }
  // ... rest of RunBenchmark unchanged

You can get the data I measured against this PR and #23947, respectively, here:

https://microsoft.sharepoint.com/:f:/r/teams/MS-Intelcollaboration/Shared%20Documents/General/Jie-Individual/stagingbuffer?csf=1&web=1&e=re4zLf

fs-eire (Contributor) commented Mar 10, 2025

Regarding the 3 recent PRs:

I believe there are 2 features that can be merged:

  • early release of the upload staging buffer
  • enabling UMA for downloading

The two features above should be straightforward.

For enabling UMA for uploading, I believe the current status is already much clearer than the original. However, we still need to figure out the best approach; for example, how to revise the AllocHint added to the base class IAllocator, where "upload destination" is not an appropriate concept for an abstract allocator (which is also the base class of CPUAllocator).

fs-eire commented Mar 10, 2025

After carefully checking how lifecycle and ownership are managed for instances of WebGpuContext, WebGpuExecutionProvider, GpuBufferAllocator, and BufferManager, I found an easy and clean way to implement UMA buffers for initializers.

Again, this is based on 2 assumptions:

  • During the whole model initialization process, all created GPU buffers are used for uploading initializers.
  • Applying the UMA optimization to initializer uploading should capture most of the perf gain for uploading.

The design only works when both assumptions hold.

We can add a boolean flag to GpuBufferAllocator, something like initializing, defaulting to true. Then override WebGpuExecutionProvider::OnSessionInitializationEnd() to set the flag to false. GpuBufferAllocator can use this flag to decide whether to create buffers with mappedAtCreation.

For BufferManager, the only change should be: if the buffer must be created with mappedAtCreation, bypass the cache check and always create a new one. (This part may need discussion.)
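A minimal sketch of this design, using simplified stand-in types; the real classes live in ORT's WebGPU provider and the exact signatures here are assumptions:

#include <cstddef>

// Stand-ins for ORT's WebGPU types, for illustration only.
struct WebGpuBufferManager {
  void* CreateUMA(size_t size);  // mappedAtCreation buffer; bypasses the cache
  void* Create(size_t size);     // regular buffer, may come from the bucket cache
};

struct WebGpuContext {
  bool SupportsBufferMapExtendedUsages() const;
  WebGpuBufferManager& BufferManager();
};

class GpuBufferAllocator {
 public:
  explicit GpuBufferAllocator(WebGpuContext& context) : context_(context) {}

  void* Alloc(size_t size) {
    // Assumption 1 above: while initializing_ is true, every allocation is an
    // initializer upload, so prefer a UMA (mappedAtCreation) buffer.
    if (initializing_ && context_.SupportsBufferMapExtendedUsages()) {
      return context_.BufferManager().CreateUMA(size);
    }
    return context_.BufferManager().Create(size);
  }

  // Called from WebGpuExecutionProvider::OnSessionInitializationEnd().
  void OnSessionInitializationEnd() { initializing_ = false; }

 private:
  WebGpuContext& context_;
  bool initializing_ = true;  // default true, per the design above
};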

guschmue added the ep:WebGPU ort-web webgpu provider label Mar 10, 2025
jchen10 closed this Mar 12, 2025
jchen10 deleted the uma branch April 2, 2025