
add sliding window support for webgpu gqa #25372


Merged

guschmue merged 6 commits into main from gs/webgpu-sliding-window on Jul 17, 2025

Conversation

guschmue
Contributor

No description provided.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jul 11, 2025
@guschmue guschmue requested a review from Copilot July 11, 2025 19:29
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

Adds support for a sliding window mask in the GPU-based group query attention (GQA) softmax shader, allowing attention to be constrained to a local window of past tokens.

  • Propagate a new local_window_size parameter through ComputeInternal, ApplyAttention, and ComputeInPlaceSoftmax.
  • Extend InPlaceSoftmaxProgram to accept local_window_size and emit conditional shader code for sliding-window masking.
  • Update function signatures and uniform variable setup to include the new window size parameter.
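
To make the masking rule concrete, here is a minimal CPU-side C++ sketch of the behavior described above; the local_window_size parameter and its -1 "no window" sentinel come from the PR summary, while the function name and the exact window boundary convention are assumptions rather than the shader's actual code.

```cpp
#include <limits>
#include <vector>

// Reference sketch (not the actual WGSL): mask one row of attention scores so that
// query position query_pos only sees key positions inside the causal sliding window.
void ApplySlidingWindowMask(std::vector<float>& scores,  // one query row, length total_seq_len
                            int query_pos,
                            int total_seq_len,
                            int local_window_size) {
  const bool has_sliding_window = local_window_size != -1;  // -1 means "no window"
  for (int k = 0; k < total_seq_len; ++k) {
    const bool causal_ok = k <= query_pos;
    // Assumed window rule: keep only the last local_window_size key positions.
    const bool window_ok = !has_sliding_window || k > query_pos - local_window_size;
    if (!(causal_ok && window_ok)) {
      scores[k] = -std::numeric_limits<float>::infinity();  // drops to zero after softmax
    }
  }
}
```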

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| contrib_ops/webgpu/bert/group_query_attention.cc | Added a local_window_size_ guard for FlashAttention and passed it to ApplyAttention. |
| contrib_ops/webgpu/bert/attention_common.h | Extended the ApplyAttention signature to accept local_window_size. |
| contrib_ops/webgpu/bert/attention.h | Updated the InPlaceSoftmaxProgram constructor and added a local_window_size_ field. |
| contrib_ops/webgpu/bert/attention.cc | Injected sliding-window logic into the shader code and passed local_window_size through to the program. |
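
As a schematic illustration of that plumbing, the pattern is roughly: store the window size on the program object, expose it as a shader uniform, and only emit the extra masking branch when a window is set. The class below is a sketch under those assumptions, not the real onnxruntime InPlaceSoftmaxProgram API, and the embedded WGSL names are invented.

```cpp
#include <string>

// Schematic sketch of the plumbing described above; not the real onnxruntime class.
class InPlaceSoftmaxProgramSketch {
 public:
  explicit InPlaceSoftmaxProgramSketch(int local_window_size)
      : local_window_size_(local_window_size) {}

  // Emit the softmax body; the sliding-window branch appears only when a window is set.
  std::string GenerateShaderCode() const {
    std::string body = "// ... load the scores for this query row ...\n";
    if (local_window_size_ != -1) {
      // Hypothetical WGSL fragment: skip keys that fall outside the window
      // (the uniform and variable names are invented for illustration).
      body +=
          "if (i32(key_pos) <= i32(query_pos) - i32(uniforms.local_window_size)) {\n"
          "  continue;\n"
          "}\n";
    }
    body += "// ... in-place softmax over the remaining scores ...\n";
    return body;
  }

 private:
  int local_window_size_;  // -1 means no sliding window
};
```
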
Comments suppressed due to low confidence (2)

onnxruntime/contrib_ops/webgpu/bert/attention.cc:228

  • [nitpick] Consider adding unit or integration tests to cover the new sliding window behavior in InPlaceSoftmaxProgram to ensure correctness under various window sizes, including edge cases like window size zero or larger than sequence length.
  bool has_sliding_window = local_window_size_ != -1;
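
As a purely illustrative companion to that suggestion, the snippet below exercises the hypothetical ApplySlidingWindowMask sketch from earlier (assumed to be in the same translation unit) for two of the mentioned cases; it does not assert what the real kernel does for a zero-sized window, which is exactly the kind of edge case a unit test would need to pin down.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Illustrative only: checks the reference masker sketched above, not the WebGPU kernel.
int main() {
  const float neg_inf = -std::numeric_limits<float>::infinity();
  const int total_seq_len = 4;
  const int query_pos = 3;  // last query position

  std::vector<float> no_window(total_seq_len, 0.0f);
  ApplySlidingWindowMask(no_window, query_pos, total_seq_len, /*local_window_size=*/-1);

  std::vector<float> huge_window(total_seq_len, 0.0f);
  ApplySlidingWindowMask(huge_window, query_pos, total_seq_len, /*local_window_size=*/100);

  // A window larger than the sequence length masks nothing beyond causality.
  assert(no_window == huge_window);

  std::vector<float> small_window(total_seq_len, 0.0f);
  ApplySlidingWindowMask(small_window, query_pos, total_seq_len, /*local_window_size=*/2);
  assert(small_window[0] == neg_inf && small_window[1] == neg_inf);  // only the last 2 keys remain
  return 0;
}
```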

onnxruntime/contrib_ops/webgpu/bert/group_query_attention.cc:199

  • [nitpick] Align the indentation for this condition with the other conditions in the if statement to improve readability (e.g., match the two-space indent used for head_sink == nullptr).
      local_window_size_ == -1 &&
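
For context, that clause is part of the guard that decides whether the Flash Attention path is taken. A hypothetical sketch of its shape follows; only the head_sink and local_window_size clauses appear in the quoted diff, and the rest is a stand-in for the other conditions of the real if statement.

```cpp
// Hypothetical sketch of the Flash Attention guard in group_query_attention.cc:
// the fast path is only taken when no sliding window is requested.
bool CanUseFlashAttention(const void* head_sink,
                          int local_window_size,
                          bool other_requirements_met) {
  return head_sink == nullptr &&       // no attention-sink input
         local_window_size == -1 &&    // this PR: a sliding window disables Flash Attention
         other_requirements_met;       // stand-in for the remaining conditions
}
```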

@guschmue guschmue merged commit 2911e70 into main Jul 17, 2025
90 checks passed
@guschmue guschmue deleted the gs/webgpu-sliding-window branch July 17, 2025 15:19
@snnn
Member

snnn commented Jul 25, 2025

Hi there! We haven't cut the release branch for this version yet, so I'm removing the release:1.23.0 label for now to keep things tidy. Thanks so much for your contribution! We'll make sure this gets included when the release is prepared. 🤖

guschmue pushed a commit that referenced this pull request Aug 1, 2025
…gth (#25594)

### Description
#25372 adds sliding window support for Group Query Attention and disables
Flash Attention, which does not yet support sliding windows.

This PR adds a check for the sliding window and applies Flash Attention
when the window size exceeds the KV cache length or total sequence
length.

### Motivation and Context
See above.
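
A hedged sketch of the check this follow-up describes: when the requested window is at least as long as the KV range, the sliding-window mask excludes nothing beyond ordinary causal masking, so Flash Attention can still be used. The function name, the exact comparison, and the variable names are assumptions, not the actual #25594 code.

```cpp
// Sketch of the refinement described above (illustrative names, not the real code):
// keep Flash Attention when the sliding window already covers the whole KV range.
bool WindowAllowsFlashAttention(int local_window_size,
                                int total_sequence_length,
                                int kv_cache_length) {
  if (local_window_size == -1) {
    return true;  // no sliding window requested at all
  }
  // Per the PR description, Flash Attention applies when the window size exceeds
  // the KV cache length or the total sequence length (boundary convention assumed).
  return local_window_size > kv_cache_length ||
         local_window_size > total_sequence_length;
}
```
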
snnn pushed a commit that referenced this pull request Aug 1, 2025
qti-yuduo pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Aug 8, 2025
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025