Conversation

Edenzzzz commented Jun 15, 2025

When reading through the TK kernels, I see a lot of cudaDeviceSynchronize() and cudaStreamSynchronize() calls, sometimes even consecutively. These calls are very expensive and should not be used inside general kernel dispatch except for testing and debugging.
I see the following use cases for them:

  1. cudaDeviceSynchronize() can be used to time kernels in benchmarks, though CUDA events are likely more accurate (see the sketch below).
  2. Both make asynchronous errors get reported at the right line on the host side. But cudaDeviceSynchronize() isn't needed for this; a debug flag that triggers a stream sync would be enough.
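As a concrete illustration of point 1, here is a minimal sketch of event-based timing; the kernel, launch configuration, and buffer are placeholders rather than TK code:

```cpp
// Minimal sketch (not TK code): time a kernel with CUDA events instead of
// cudaDeviceSynchronize. Only the stream that ran the kernel is waited on.
#include <cuda_runtime.h>

__global__ void scale_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] *= 2.0f;
}

float time_kernel_ms(float* d_out, int n, cudaStream_t stream) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events on the same stream as the kernel so only its work is timed.
    cudaEventRecord(start, stream);
    scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_out, n);
    cudaEventRecord(stop, stream);

    // Block only until the stop event completes; no device-wide sync needed.
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```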

Removing them can give a 10%-20% speedup when porting the code to general use cases.
See hao-ai-lab/FastVideo#517
Thanks.

Comment on lines -1044 to +1045

- cudaStreamSynchronize(stream);
- cudaDeviceSynchronize();
+ // cudaStreamSynchronize(stream);
+ // cudaDeviceSynchronize();
Author

No need to call these when we aren't checking for errors immediately afterwards.

}

CHECK_CUDA_ERROR(cudaGetLastError());
cudaStreamSynchronize(stream);
Author

We usually don't do this unconditionally; torch relies on the CUDA_LAUNCH_BLOCKING environment variable to make launches blocking when debugging.
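For illustration, a rough sketch of what a debug-gated sync could look like; the TK_DEBUG_SYNC variable name and the helper are hypothetical, not existing TK code:

```cpp
// Hypothetical sketch: only synchronize when a debug env var is set,
// similar in spirit to running with CUDA_LAUNCH_BLOCKING=1.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

inline bool debug_sync_enabled() {
    // TK_DEBUG_SYNC is a made-up name; cached so getenv runs only once.
    static const bool enabled = [] {
        const char* v = std::getenv("TK_DEBUG_SYNC");
        return v != nullptr && v[0] == '1';
    }();
    return enabled;
}

inline void maybe_debug_sync(cudaStream_t stream) {
    if (debug_sync_enabled()) {
        // Surfaces asynchronous launch errors at this call site.
        cudaStreamSynchronize(stream);
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        }
    }
}
```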

@Edenzzzz

cc @DanFu09

DanFu09 commented Jun 16, 2025

This may actually introduce some correctness issues IIUC (especially removing the stream sync). There does need to be some stream management in the dispatch. There's another set of changes I'll work on upstreaming this week that should address this.

Edenzzzz commented Jun 16, 2025

Could you elaborate on the correctness issue? Precision for layer norm is the same.

DanFu09 commented Jun 17, 2025

Yep, these two lines are what we need: https://github.com/flashinfer-ai/flashinfer/blob/0a754ce4fcae45fb0ce231de0bb03bc796bb44b3/csrc/norm.cu#L67-L68. The tradeoff is that it makes compilation more expensive, so ideally we gate it behind a compiler flag.

I have a solution sitting around on a branch, I just need to move it to main :)
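(For reference, a rough sketch of launching on PyTorch's current stream, which is presumably the kind of stream management being referenced; the dispatch signature and kernel launch are placeholders, not a copy of the linked lines.)

```cpp
// Rough sketch (placeholder code, not the linked flashinfer lines): launch on
// PyTorch's current stream so the kernel is ordered after upstream torch ops
// without any host-side synchronization.
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/extension.h>

void dispatch_example(torch::Tensor x) {
    // Switch to the tensor's device, then grab the stream torch is using.
    const c10::cuda::CUDAGuard device_guard(x.device());
    cudaStream_t stream = at::cuda::getCurrentCUDAStream();
    (void)stream;  // the real dispatch would pass this to the launch below

    // my_kernel stands in for the actual TK kernel launch:
    // my_kernel<<<grid, block, 0, stream>>>(...);

    // Cheap, non-blocking check for launch errors.
    cudaError_t err = cudaGetLastError();
    TORCH_CHECK(err == cudaSuccess, cudaGetErrorString(err));
}
```

If it is indeed this pattern, the compile cost would come from pulling the ATen CUDA headers into the dispatch translation unit, which is why gating it behind a compiler flag makes sense.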

@Edenzzzz

Oh I see, thank you!

DanFu09 commented Jun 18, 2025

Take a look at this branch: https://github.com/HazyResearch/ThunderKittens/tree/danfu09/update-attn

Any other optimizations you see there? It's pretty old code at this point :)

@Edenzzzz

I think the device syncs in lin_attn.cu and layer norm can be removed :)
