XiaobingSuper (Contributor) commented Jul 22, 2025

This PR adds a variable-length (var_len) case to the MLA benchmark.

XiaobingSuper (Contributor, Author) commented Jul 22, 2025

@tridao, for the var_len case, I find that FA3 has a large performance gap compared with FlashMLA (tested on H100):

total_seqlens=130864, mean_seqlens=1022, max_seqlen=2358
Seqlen = 2560, FA3 time: 131.5 us, 1418 GB/s, 277 TFLOPS/s
Seqlen = 2560, FlashMLA time: 107.6 us, 1732 GB/s, 339 TFLOPS/s
Arithmetic intensity: 195.5
Ideal time: 56 us
total_seqlens=274227, mean_seqlens=2142, max_seqlen=4696
Seqlen = 4864, FA3 time: 211.9 us, 1659 GB/s, 361 TFLOPS/s
Seqlen = 4864, FlashMLA time: 175.2 us, 2007 GB/s, 436 TFLOPS/s
Arithmetic intensity: 217.3
Ideal time: 105 us
total_seqlens=539326, mean_seqlens=4213, max_seqlen=9155
Seqlen = 9216, FA3 time: 363.3 us, 1808 GB/s, 413 TFLOPS/s
Seqlen = 9216, FlashMLA time: 292.9 us, 2243 GB/s, 513 TFLOPS/s
Arithmetic intensity: 228.7
Ideal time: 196 us
total_seqlens=1040384, mean_seqlens=8128, max_seqlen=17609
Seqlen = 17664, FA3 time: 664.7 us, 1857 GB/s, 436 TFLOPS/s
Seqlen = 17664, FlashMLA time: 505.0 us, 2444 GB/s, 574 TFLOPS/s
Arithmetic intensity: 234.8
Ideal time: 368 us
total_seqlens=2123978, mean_seqlens=16593, max_seqlen=37079
Seqlen = 37120, FA3 time: 1398.0 us, 1776 GB/s, 423 TFLOPS/s
Seqlen = 37120, FlashMLA time: 964.3 us, 2574 GB/s, 613 TFLOPS/s
Arithmetic intensity: 238.3
Ideal time: 741 us
total_seqlens=4190053, mean_seqlens=32734, max_seqlen=92105
Seqlen = 92160, FA3 time: 2515.6 us, 1933 GB/s, 464 TFLOPS/s
Seqlen = 92160, FlashMLA time: 1835.9 us, 2649 GB/s, 636 TFLOPS/s
Arithmetic intensity: 240.0
Ideal time: 1452 us
total_seqlens=8162046, mean_seqlens=63765, max_seqlen=147507
Seqlen = 147712, FA3 time: 5127.2 us, 1841 GB/s, 443 TFLOPS/s
Seqlen = 147712, FlashMLA time: 3511.0 us, 2688 GB/s, 648 TFLOPS/s
Arithmetic intensity: 240.9
Ideal time: 2817 us
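
To make the gap easier to read, here is a minimal sketch that recomputes two derived numbers from the log above: the FlashMLA speedup over FA3, and how close each kernel gets to the reported ideal time. The timings are copied by hand from the output; the `summarize` helper and the "efficiency = ideal time / measured time" definition are illustrative assumptions, not part of the benchmark script.

```python
# Rows copied from the benchmark log above:
# (padded seqlen, FA3 time in us, FlashMLA time in us, reported ideal time in us)
RESULTS = [
    (2560, 131.5, 107.6, 56),
    (4864, 211.9, 175.2, 105),
    (9216, 363.3, 292.9, 196),
    (17664, 664.7, 505.0, 368),
    (37120, 1398.0, 964.3, 741),
    (92160, 2515.6, 1835.9, 1452),
    (147712, 5127.2, 3511.0, 2817),
]


def summarize(rows):
    """Return per-row speedup and efficiency figures.

    Hypothetical helper: "efficiency" here is defined as
    ideal_time / measured_time, which is an assumption made for this sketch.
    """
    summary = []
    for seqlen, fa3_us, flashmla_us, ideal_us in rows:
        summary.append(
            {
                "seqlen": seqlen,
                # How many times faster FlashMLA is than FA3 on this row.
                "speedup": fa3_us / flashmla_us,
                # Fraction of the reported ideal time each kernel achieves.
                "fa3_eff": ideal_us / fa3_us,
                "flashmla_eff": ideal_us / flashmla_us,
            }
        )
    return summary


if __name__ == "__main__":
    for row in summarize(RESULTS):
        print(
            f"Seqlen = {row['seqlen']:>6}: "
            f"FlashMLA speedup {row['speedup']:.2f}x, "
            f"FA3 {row['fa3_eff']:.0%} of ideal, "
            f"FlashMLA {row['flashmla_eff']:.0%} of ideal"
        )
```

On these numbers the speedup stays between roughly 1.2x and 1.5x across all seven configurations, and it tends to be larger at the longer sequence lengths.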
