Conversation

Edenzzzz commented Jun 15, 2025

When reading through the TK kernels, I see a lot of cudaDeviceSynchronize() and cudaStreamSynchronize() calls, sometimes even consecutively. These calls are very expensive and should not be used inside general kernel dispatch except for testing and debugging.
I see the following use cases for them:

  1. cudaDeviceSynchronize() can be used to time kernels in benchmarks, though CUDA events are likely more accurate (see the sketch below).
  2. Both make asynchronous errors get reported at the right line on the host side. But cudaDeviceSynchronize() isn't needed for this; a debug flag that triggers a stream sync would be enough.
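As a concrete illustration of point 1, here is a minimal sketch of event-based timing; the kernel, launch configuration, and buffer are placeholders rather than TK code:

```cpp
// Minimal sketch (not TK code): time a kernel with CUDA events instead of
// cudaDeviceSynchronize. Only the stream that ran the kernel is waited on.
#include <cuda_runtime.h>

__global__ void scale_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] *= 2.0f;
}

float time_kernel_ms(float* d_out, int n, cudaStream_t stream) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events on the same stream as the kernel so only its work is timed.
    cudaEventRecord(start, stream);
    scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_out, n);
    cudaEventRecord(stop, stream);

    // Block only until the stop event completes; no device-wide sync needed.
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```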

Removing them can give a 10%-20% speedup when porting the code to general use cases.
See hao-ai-lab/FastVideo#517
Thanks.

Comment on lines -1044 to +1045

- cudaStreamSynchronize(stream);
- cudaDeviceSynchronize();
+ // cudaStreamSynchronize(stream);
+ // cudaDeviceSynchronize();
Author

No need to call these when we aren't checking for errors immediately afterwards.

}

CHECK_CUDA_ERROR(cudaGetLastError());
cudaStreamSynchronize(stream);
Author

We usually don't do this unconditionally; torch relies on the CUDA_LAUNCH_BLOCKING environment variable to make launches blocking when debugging.
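For illustration, a rough sketch of what a debug-gated sync could look like; the TK_DEBUG_SYNC variable name and the helper are hypothetical, not existing TK code:

```cpp
// Hypothetical sketch: only synchronize when a debug env var is set,
// similar in spirit to running with CUDA_LAUNCH_BLOCKING=1.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

inline bool debug_sync_enabled() {
    // TK_DEBUG_SYNC is a made-up name; cached so getenv runs only once.
    static const bool enabled = [] {
        const char* v = std::getenv("TK_DEBUG_SYNC");
        return v != nullptr && v[0] == '1';
    }();
    return enabled;
}

inline void maybe_debug_sync(cudaStream_t stream) {
    if (debug_sync_enabled()) {
        // Surfaces asynchronous launch errors at this call site.
        cudaStreamSynchronize(stream);
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        }
    }
}
```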

@Edenzzzz

cc @DanFu09

DanFu09 commented Jun 16, 2025

This may actually introduce some correctness issues IIUC (especially removing the stream sync). There does need to be some stream management in the dispatch. There's another set of changes I'll work on upstreaming this week that should address this.

Edenzzzz commented Jun 16, 2025

Could you elaborate on the correctness issue? Precision for layer norm is the same.

DanFu09 commented Jun 17, 2025

Yep, these two lines are what we need: https://github.com/flashinfer-ai/flashinfer/blob/0a754ce4fcae45fb0ce231de0bb03bc796bb44b3/csrc/norm.cu#L67-L68. The tradeoff is that it makes compilation more expensive, so ideally we gate it behind a compiler flag.

I have a solution sitting around on a branch, I just need to move it to main :)
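(For reference, a rough sketch of launching on PyTorch's current stream, which is presumably the kind of stream management being referenced; the dispatch signature and kernel launch are placeholders, not a copy of the linked lines.)

```cpp
// Rough sketch (placeholder code, not the linked flashinfer lines): launch on
// PyTorch's current stream so the kernel is ordered after upstream torch ops
// without any host-side synchronization.
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/extension.h>

void dispatch_example(torch::Tensor x) {
    // Switch to the tensor's device, then grab the stream torch is using.
    const c10::cuda::CUDAGuard device_guard(x.device());
    cudaStream_t stream = at::cuda::getCurrentCUDAStream();
    (void)stream;  // the real dispatch would pass this to the launch below

    // my_kernel stands in for the actual TK kernel launch:
    // my_kernel<<<grid, block, 0, stream>>>(...);

    // Cheap, non-blocking check for launch errors.
    cudaError_t err = cudaGetLastError();
    TORCH_CHECK(err == cudaSuccess, cudaGetErrorString(err));
}
```

If it is indeed this pattern, the compile cost would come from pulling the ATen CUDA headers into the dispatch translation unit, which is why gating it behind a compiler flag makes sense.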

@Edenzzzz

Oh I see, thank you!

DanFu09 commented Jun 18, 2025

Take a look at this branch: https://github.com/HazyResearch/ThunderKittens/tree/danfu09/update-attn

Any other optimizations you see there? It's pretty old code at this point :)

@Edenzzzz

I think the device syncs in lin_attn.cu and layer norm can be removed :)
