Remove unnecessary device and stream syncs #129
base: main
Conversation
-    cudaStreamSynchronize(stream);
-    cudaDeviceSynchronize();
+    // cudaStreamSynchronize(stream);
+    // cudaDeviceSynchronize();
No need to call these when there isn't an immediate error check right after the launch.
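To make the distinction concrete, here is a minimal sketch (not ThunderKittens code; the kernel, error handling, and the `DEBUG_SYNC` compile flag are placeholders): `cudaGetLastError()` reports launch-configuration failures without blocking, while a stream sync is only needed if errors that occur during kernel execution must surface immediately.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* x) { x[threadIdx.x] += 1.0f; }

int main() {
    float* d_x;
    cudaMalloc(&d_x, 32 * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dummy_kernel<<<1, 32, 0, stream>>>(d_x);

    // Non-blocking: catches invalid launch configurations right away.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));

#ifdef DEBUG_SYNC  // hypothetical compile-time debug flag
    // Blocking: surfaces errors that happen while the kernel runs.
    err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) printf("execution error: %s\n", cudaGetErrorString(err));
#endif

    cudaStreamDestroy(stream);
    cudaFree(d_x);
}
```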
}

CHECK_CUDA_ERROR(cudaGetLastError());
cudaStreamSynchronize(stream);
Usually we don't do this unconditionally; torch uses the CUDA_LAUNCH_BLOCKING flag for this instead.
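For illustration, one way to gate the sync at runtime rather than doing it on every dispatch is an environment flag, similar in spirit to how CUDA_LAUNCH_BLOCKING is used for debugging. This is a sketch only: the `TK_DEBUG_SYNC` name and the `dispatch` helper are hypothetical, and CUDA_LAUNCH_BLOCKING itself is honored by the CUDA runtime (it forces synchronous launches), so a library-level flag is only needed for explicit stream syncs.

```cuda
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* x) { x[threadIdx.x] *= 2.0f; }

// Hypothetical opt-in flag for explicit stream syncs during debugging.
static bool debug_sync_enabled() {
    const char* v = std::getenv("TK_DEBUG_SYNC");
    return v != nullptr && std::strcmp(v, "0") != 0;
}

void dispatch(float* d_x, cudaStream_t stream) {
    dummy_kernel<<<1, 32, 0, stream>>>(d_x);
    cudaGetLastError();                 // launch check is always cheap
    if (debug_sync_enabled()) {
        cudaStreamSynchronize(stream);  // block only when debugging
    }
}

int main() {
    float* d_x;
    cudaMalloc(&d_x, 32 * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    dispatch(d_x, stream);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
}
```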
cc @DanFu09
This may actually introduce some correctness issues IIUC (especially removing the stream sync). There does need to be some stream management in the dispatch. There's another set of changes I'll work on upstreaming this week that should address this.
Could you elaborate on the correctness issue? The layernorm precision is the same with and without the syncs.
Here's flashinfer's stream dispatch (without any sync):
Yep, these two lines are what we need: https://github.com/flashinfer-ai/flashinfer/blob/0a754ce4fcae45fb0ce231de0bb03bc796bb44b3/csrc/norm.cu#L67-L68. The tradeoff is that it makes compilation more expensive, so ideally we gate it behind a compiler flag. I have a solution sitting around on a branch, I just need to move it to main :)
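The linked lines are not reproduced here; as a rough sketch of this kind of dispatch (my assumption of the pattern, not flashinfer's exact code), the kernel is launched on the stream PyTorch is already using, with no explicit sync. Pulling in the torch/ATen headers is what makes compilation more expensive. The `scale_kernel` and `scale_inplace` names are placeholders.

```cuda
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include <torch/extension.h>

__global__ void scale_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void scale_inplace(torch::Tensor x) {
    // Operate on the tensor's device, then grab the stream the caller is
    // already recording work on.
    const c10::cuda::CUDAGuard guard(x.device());
    cudaStream_t stream = c10::cuda::getCurrentCUDAStream();

    int n = static_cast<int>(x.numel());
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads, 0, stream>>>(x.data_ptr<float>(), n);
    // No cudaStreamSynchronize / cudaDeviceSynchronize: ordering with the
    // rest of the PyTorch program is guaranteed by sharing its stream.
}
```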
Oh I see, thank you!
Take a look at this branch: https://github.com/HazyResearch/ThunderKittens/tree/danfu09/update-attn. Are there any other optimizations you see there? It's pretty old code at this point :)
I think the device syncs in lin_attn.cu and layer norm can be removed :) |
When reading lots of TK kernels, I see a bunch of `cudaDeviceSynchronize` and `cudaStreamSynchronize()` calls, sometimes even consecutively. However, they are very expensive and should not be used inside general kernel dispatch except for testing and debugging purposes. I see the following use cases for them:

- `cudaDeviceSynchronize` can be used to time kernels in benchmarks (though CUDA events will likely be more accurate; see the sketch after this list).
- `cudaDeviceSynchronize` for debugging, where we can add a debug flag to trigger a stream sync instead.

Removing them can give a 10%-20% speedup when porting the code to general use cases.
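To make the benchmarking point concrete, here is a minimal sketch of event-based timing on a stream instead of wrapping the launch in `cudaDeviceSynchronize()`; the kernel and sizes are placeholders, not ThunderKittens code. Only the final `cudaEventSynchronize` blocks, and only on the timed work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    dummy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaEventRecord(stop, stream);

    cudaEventSynchronize(stop);  // wait only for the timed work to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
}
```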
See hao-ai-lab/FastVideo#517
Thanks.