[webgpu] Apply Flash Attention if sliding window exceeds KV cache length #25594


Merged

Conversation

daijh
Contributor

@daijh daijh commented Jul 30, 2025

Description

#25372 adds sliding window support for Group Query Attention, disabling Flash Attention as it's not yet supported.

This PR adds a check for the sliding window and applies Flash Attention when the window size exceeds the KV cache length or total sequence length.
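The idea behind the check is that a sliding window only changes the attention pattern when it is smaller than the number of tokens being attended to; if the window already covers the whole KV cache (or total sequence), the masking is a no-op and the Flash Attention path can be taken unchanged. A minimal sketch of that predicate, with hypothetical names (`can_use_flash_attention`, `sliding_window`, `total_sequence_length`) that do not come from the PR itself:

```python
def can_use_flash_attention(sliding_window: int, total_sequence_length: int) -> bool:
    """Hypothetical predicate illustrating the PR's condition.

    If no sliding window is configured, or the window is at least as long
    as the total sequence / KV cache, windowed masking has no effect and
    Flash Attention can be applied as-is.
    """
    if sliding_window <= 0:  # no sliding window configured
        return True
    # Window covers every cached token: masking would be a no-op.
    return sliding_window >= total_sequence_length
```

With a 4096-token window and a 1024-token sequence, for example, every token already attends to the full cache, so disabling Flash Attention would cost performance for no semantic difference.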

Motivation and Context

See above.

@daijh
Contributor Author

daijh commented Jul 30, 2025

@guschmue @fs-eire @qjia7

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jul 30, 2025
qjia7
qjia7 previously approved these changes Jul 31, 2025
Contributor

@qjia7 qjia7 left a comment


LGTM, thanks.

@guschmue
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).

@daijh
Contributor Author

daijh commented Aug 1, 2025

The failed checks look like a CI infrastructure issue based on the logs; please help re-run them.

@guschmue guschmue merged commit 7cc93cf into microsoft:main Aug 1, 2025
87 of 91 checks passed
snnn pushed a commit that referenced this pull request Aug 1, 2025
…gth (#25594)

snnn added a commit that referenced this pull request Aug 1, 2025
This PR cherry-picks some pipeline changes from the main branch to the
1.23.0 release branch.


- **[build] disable CodeQL for NPM Packaging Pipeline (#25614)**
- **Refactor Java Test Pipeline (#25608)**
- **[build] upgrade Node.js for NPM packaging pipeline (#25568)**

And a WebGPU change:

- **[webgpu] Apply Flash Attention if sliding window exceeds KV cache
length (#25594)**
@daijh daijh deleted the supports-sliding-window-for-flash-attention branch August 2, 2025 00:52
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025
…gth (microsoft#25594)

Labels
ep:WebGPU ort-web webgpu provider