
Conversation

yzh119
Collaborator

@yzh119 yzh119 commented Apr 24, 2025

Mainly adapted from cutlass examples.

@johnnynunez

johnnynunez commented Apr 25, 2025

@yzh119 cutlass 3.9.0 is officially released https://github.com/NVIDIA/cutlass/releases/tag/v3.9.0

@yzh119
Collaborator Author

yzh119 commented Apr 25, 2025

Hi @johnnynunez, yes, we upgraded the 3rd-party dependency to cutlass 3.9 several weeks ago: #997.

This PR changes the original implementation to unblock the features we need.

@hwu36

hwu36 commented May 2, 2025

We will have 3.9.2 this weekend.

@yzh119
Collaborator Author

yzh119 commented May 2, 2025

@hwu36 good to know, we will adopt 3.9.2 as soon as it's ready.

@yzh119 yzh119 force-pushed the cutlass-fmha-blackwell branch from 5849003 to eef0ada on May 4, 2025 15:44
Collaborator

@cyx-6 cyx-6 left a comment


Looks great!

@yzh119 yzh119 merged commit 9a05c92 into flashinfer-ai:main May 13, 2025
1 of 2 checks passed
@yzh119
Collaborator Author

yzh119 commented May 13, 2025

Major changes made to the original cutlass example:

  1. Add barrier_O and reuse shared memory between the mainloop and epilogue to reduce shared memory usage (the original smem layout cannot support head_dim_qk=192 with head_dim_vo=128).
  2. Separate pipeline_K/pipeline_V and smem_k/smem_v to support head_dim_qk=192 with head_dim_vo=128.
  3. Remove the need for padding in the tma_load of q/k/v (we observed significant overhead from padding). The padding trick is still used in tma_store, where the overhead is tolerable because we can allocate a padded buffer without extra data movement.
  4. The original persistent tile scheduler performs poorly for causal attention, so this PR adds a naive tile scheduler, which is slightly better. We have a work-in-progress ahead-of-time static scheduler (like the earlier flashinfer plan function) that reaches better performance on causal attention and will be upstreamed later.
  5. Change the mask mode to an inference-style causal mask.

@zhyncs zhyncs deleted the cutlass-fmha-blackwell branch May 13, 2025 09:21