[WebGPU] Map UMA buffers for model uploading #23958
Conversation
With the Dawn BufferMapExtendedUsages feature available, we can now copy data directly into the GPU buffer on UMA GPUs without using a staging buffer. This PR enables the feature for model uploading during inference session initialization, which is where ORT reaches its peak memory usage. To achieve this, we introduce a new 'SetAllocHint' method on 'IAllocator', so that the WebGPU allocator can let the buffer manager handle UMA buffers appropriately. Credits to @fs-eire for the overall design of the implementation.
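For readers unfamiliar with the hint plumbing, a minimal sketch of the shape it takes; the enum value and method signature below are assumptions based on this description, not the final ORT API:

#include <cstddef>

enum class AllocHint { None, UploadDst };

struct IAllocator {
  virtual void* Alloc(size_t size) = 0;
  virtual void Free(void* p) = 0;
  // New in this PR: callers hint what the next allocation is for
  // (e.g. the destination of a one-time weight upload), and the
  // WebGPU allocator forwards this to the buffer manager.
  virtual void SetAllocHint(AllocHint /*hint*/) {}  // default: ignore the hint
  virtual ~IAllocator() = default;
};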
@fs-eire @xhcao
if (alloc_hint_ == AllocHint::UploadDst && context_.SupportsBufferMapExtendedUsages()) {
  buffer = context_.BufferManager().CreateUMA(size);
} else {
  buffer = context_.BufferManager().Create(size);
}
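For context, a sketch of what a CreateUMA-style helper could look like with Dawn's webgpu_cpp API; the exact usage flags are an assumption, since BufferMapExtendedUsages is what permits combining MapWrite with Storage:

#include <webgpu/webgpu_cpp.h>
#include <cstdint>

// Hypothetical helper, assuming the device was created with Dawn's
// BufferMapExtendedUsages feature enabled.
wgpu::Buffer CreateUMABuffer(const wgpu::Device& device, uint64_t size) {
  wgpu::BufferDescriptor desc{};
  desc.size = size;
  // BufferMapExtendedUsages lifts the core-spec restriction that MapWrite
  // may only be combined with CopySrc, so the CPU can write weights
  // straight into a Storage buffer.
  desc.usage = wgpu::BufferUsage::MapWrite | wgpu::BufferUsage::Storage |
               wgpu::BufferUsage::CopySrc | wgpu::BufferUsage::CopyDst;
  desc.mappedAtCreation = true;  // ready for the initial weight memcpy
  return device.CreateBuffer(&desc);
}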
If we have the AllocHint::UploadDst hint, does it mean that for the non-UMA path we can also create the storage buffer with mappedAtCreation = true? Then, in the BufferManager::Upload path, the staging buffer could also be removed. In that case, it would be fair to compare the memory peaks to see how much UMA brings :)
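A sketch of this suggestion (assuming webgpu_cpp, with device, size, and src_data in scope; it relies on the core-spec rule that any buffer may be created with mappedAtCreation = true, regardless of usage):

#include <webgpu/webgpu_cpp.h>
#include <cstring>

wgpu::BufferDescriptor desc{};
desc.size = size;
desc.usage = wgpu::BufferUsage::Storage | wgpu::BufferUsage::CopySrc |
             wgpu::BufferUsage::CopyDst;
desc.mappedAtCreation = true;  // legal for any usage, even without MapWrite
wgpu::Buffer buffer = device.CreateBuffer(&desc);
std::memcpy(buffer.GetMappedRange(), src_data, size);
buffer.Unmap();  // on a dGPU the driver may still copy to VRAM internally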
You are right. I actually tried that as well; however, the memory peak slightly increased rather than significantly decreased. Maybe it's worth a shot on a dGPU.
That's weird. I noticed another issue: it seems that for uploading via the UMA path, we can't reuse an existing buffer and always have to create a new one. That's not a problem for the weights, since we need them for as long as the session is alive. But if some inputs live on the CPU, they also can't reuse an existing buffer on each run, right?
The BucketCacheManager does support buffer reuse, but it has to use a staging buffer for upload and download. #23910 can improve on that with no staging buffer needed, but it's much more complicated than this one. I haven't found a use case showing that the increased complexity pays off.
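For readers following along, an illustrative size-bucketed reuse scheme; this is not ORT's actual BucketCacheManager, just a sketch of the idea under discussion:

#include <webgpu/webgpu_cpp.h>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative only: freed buffers park in a per-bucket free list and are
// handed back out for later allocations of the same rounded-up size.
class BucketCache {
 public:
  wgpu::Buffer Acquire(const wgpu::Device& device, uint64_t size) {
    const uint64_t bucket = RoundUpPow2(size);
    auto& free_list = buckets_[bucket];
    if (!free_list.empty()) {
      wgpu::Buffer reused = free_list.back();
      free_list.pop_back();
      return reused;  // reuse path: no new GPU allocation
    }
    wgpu::BufferDescriptor desc{};
    desc.size = bucket;
    desc.usage = wgpu::BufferUsage::Storage | wgpu::BufferUsage::CopySrc |
                 wgpu::BufferUsage::CopyDst;
    return device.CreateBuffer(&desc);
  }

  void Release(wgpu::Buffer buffer, uint64_t size) {
    buckets_[RoundUpPow2(size)].push_back(std::move(buffer));
  }

 private:
  static uint64_t RoundUpPow2(uint64_t v) {
    uint64_t p = 1;
    while (p < v) p <<= 1;
    return p;
  }
  std::unordered_map<uint64_t, std::vector<wgpu::Buffer>> buckets_;
};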
Yulong submitted PR #23947 to release the staging buffer early. Can you compare UMA's performance with that one? I am curious about UMA's gains on iGPUs.
Here is the data for PR #23947. It seems the UMA gain is just 0.2 GB.
model_benchmark.exe -i ..\models-genai\Phi-3.5-mini-instruct-onnx-web
Batch size: 1, prompt tokens: 17, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 284987
avg (tokens/s): 59.6519
p50 (us): 277820
stddev (us): 14648.9
n: 5 * 17 token(s)
Token generation:
avg (us): 55056.5
avg (tokens/s): 18.1632
p50 (us): 54971.5
stddev (us): 2005.99
n: 635 * 1 token(s)
Token sampling:
avg (us): 10.26
avg (tokens/s): 97465.9
p50 (us): 7.2
stddev (us): 4.87473
n: 5 * 1 token(s)
E2E generation (entire generation loop):
avg (ms): 7277.27
p50 (ms): 7264.27
stddev (ms): 41.85
n: 5
Peak working set size (bytes): 3659186176
If possible, we also want to measure GPU memory usage and latency, which in my understanding should show some gain.
I tried to add a new metric that measures the model creation time at https://github.com/microsoft/onnxruntime-genai/blob/2b741723abb2814e899d3d6272022f4c2fa66a5d/benchmark/c/main.cpp#L137
void RunBenchmark(const benchmark::Options& opts) {
  std::unique_ptr<OgaModel> model;
  std::vector<Duration> model_create_times;
  {
    // Scope the RAII timer so only the OgaModel::Create call is measured.
    Timing model_create_timing{model_create_times};
    model = OgaModel::Create(opts.model_path.c_str());
  }
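Timing and Duration come from that benchmark file; for context, a minimal sketch of the RAII pattern at work (the names mirror the snippet, but the implementation details below are assumed):

#include <chrono>
#include <vector>

using Duration = std::chrono::duration<double, std::micro>;

// Sketch of an RAII timer: it records the elapsed time into the supplied
// vector when it goes out of scope, so the braces above bound exactly
// what gets measured (the OgaModel::Create call).
class Timing {
 public:
  explicit Timing(std::vector<Duration>& out)
      : out_(out), start_(std::chrono::steady_clock::now()) {}
  ~Timing() { out_.push_back(std::chrono::steady_clock::now() - start_); }

 private:
  std::vector<Duration>& out_;
  std::chrono::steady_clock::time_point start_;
};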
You can find the data I measured against this PR and #23947, respectively, here:
Regarding the 3 recent PRs:
I believe there are 2 features that can be merged:
The 2 features above should be clear. For enabling UMA for uploading, I believe the current state is already much clearer than the original. However, we still need to figure out the best approach - for example, how to revise the
After carefully checking how lifecycle and ownership are managed for the instances involved - again, this is based on 2 assumptions:
The design only works when both assumptions are correct. We can add a boolean flag in