Conversation

@xadupre (Member) commented Sep 5, 2025

Description

The CPU Attention implementation follows the ONNX specification. This change replicates the changes introduced by onnx/onnx#7274.

@github-actions bot (Contributor) left a comment

You can commit the suggested changes from lintrunner.

@xadupre xadupre marked this pull request as ready for review September 10, 2025 14:56
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR fixes the Attention GQA (Grouped Query Attention) implementation on CPU to comply with ONNX specifications by updating the test exclusions and correcting the GQA index calculation logic.

Key changes:

  • Updated test filters to exclude pending ONNX update tests with appropriate comments
  • Fixed GQA head indexing calculation from modulo-based to division-based approach (see the sketch after this list)
  • Corrected mask value assignment to use negative infinity instead of lowest value
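
To illustrate the indexing change in the second bullet, here is a minimal Python sketch of division-based versus modulo-based GQA head grouping. It is an illustration only; the head counts and variable names are hypothetical and do not come from attention.cc.

```python
# Hypothetical illustration of GQA head-to-KV-head mapping (not the attention.cc code).
num_q_heads = 8                            # query heads
num_kv_heads = 2                           # key/value heads shared across query heads
group_size = num_q_heads // num_kv_heads   # 4 query heads share each KV head

for q_head in range(num_q_heads):
    kv_by_division = q_head // group_size  # contiguous groups: 0,0,0,0,1,1,1,1
    kv_by_modulo = q_head % num_kv_heads   # interleaved:       0,1,0,1,0,1,0,1
    print(q_head, kv_by_division, kv_by_modulo)
```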

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/testdata/onnx_backend_test_series_filters.jsonc | Updated test exclusion patterns for attention tests pending ONNX updates |
| onnxruntime/test/onnx/main.cc | Added hardcoded test exclusions for attention GQA tests pending ONNX updates |
| onnxruntime/core/providers/cpu/llm/attention.cc | Fixed GQA head indexing calculation and corrected mask values to use negative infinity |


@tianleiwu (Contributor) left a comment

Is there any change in the attention values from changing the masked value from lowest() to -inf? I think the results should be close enough.

@xadupre (Member, Author) commented Sep 15, 2025

The reason I switched to infinity is that there is a case where the mask is added to the input. If the mask is -infinity, adding any finite value won't change it; only +infinity or NaN would. That's not true for lowest(). I don't think it is likely in practice that the mask equals lowest() while the input is near highest(), but -inf avoids that corner case entirely.
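
To make this concrete, a small NumPy sketch (illustrative only, not the C++ kernel) showing why -inf is absorbing under addition while lowest() is not:

```python
import numpy as np

score = np.float32(3.0e38)         # an (unrealistically) large attention score
lowest = np.finfo(np.float32).min  # analogue of std::numeric_limits<float>::lowest()

# lowest() as an additive mask: the masked score drifts back toward zero
print(score + lowest)                      # ~ -4.0e37, no longer "effectively minus infinity"

# -inf as an additive mask: any finite score is absorbed
print(score + np.float32(-np.inf))         # -inf (only +inf or NaN inputs would change this)
```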

@justinchuby (Contributor) commented Sep 15, 2025

The question is: do we need special handling for softmax when the mask is -inf?

# When using scaled dot product attention with a boolean mask, the softmax operation might return NaN values
# due to the presence of -inf in an entire row (padding tokens), resulting in 0/0 (NaN) in the softmax output.
# This is because there's no safe/masked softmax implementation in ONNX, so we need to handle NaN values explicitly to match
# the behavior of PyTorch with boolean masks.
# Reference: https://github.com/pytorch/pytorch/issues/103749
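
To make the concern concrete, a minimal NumPy sketch (not ONNX Runtime code) of a softmax over a row that is entirely masked with -inf:

```python
import numpy as np

def naive_softmax(x, axis=-1):
    # For a fully masked row, max(x) is -inf, so x - max(x) produces
    # (-inf) - (-inf) = NaN; even without the max trick, exp(-inf) sums
    # to 0 and the normalization becomes 0/0 = NaN.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

row = np.full(4, -np.inf, dtype=np.float32)  # every position is a padding token
print(naive_softmax(row))                    # [nan nan nan nan]
```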

@xadupre (Member, Author) commented Sep 15, 2025

> The question is: do we need special handling for softmax when the mask is -inf? […]

I see your point. We should add a unit test with such a mask.
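
A hedged sketch of what such a test could check, assuming the desired behavior is to return zeros (rather than NaN) for fully masked rows, as PyTorch does with boolean masks. The function below is a hypothetical NumPy reference, not the attention.cc implementation:

```python
import numpy as np

def masked_softmax(scores, mask):
    # Hypothetical reference: additive -inf mask, with fully masked rows
    # forced to 0 instead of NaN.
    x = np.where(mask, scores, -np.inf)
    m = np.max(x, axis=-1, keepdims=True)
    m = np.where(np.isinf(m), 0.0, m)        # avoid (-inf) - (-inf) = NaN on fully masked rows
    e = np.exp(x - m)
    denom = np.sum(e, axis=-1, keepdims=True)
    return np.divide(e, denom, out=np.zeros_like(e), where=denom > 0)

scores = np.zeros((2, 3), dtype=np.float32)
mask = np.array([[True, True, False],
                 [False, False, False]])     # second row: all padding
out = masked_softmax(scores, mask)
assert np.allclose(out[0], [0.5, 0.5, 0.0])
assert np.allclose(out[1], [0.0, 0.0, 0.0])  # zeros, not NaN, for the fully masked row
```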

@justinchuby justinchuby merged commit d530b29 into main Sep 15, 2025
95 of 98 checks passed
@justinchuby justinchuby deleted the xadupre/attentioncpu branch September 15, 2025 18:33
snnn pushed a commit that referenced this pull request Sep 15, 2025
The CPU Attention implementation follows the ONNX specification. This change replicates the changes introduced by onnx/onnx#7274.
adrianlizarraga added a commit that referenced this pull request Sep 24, 2025
### Description
Cherry-pick the following PRs into the ORT 1.23.1 branch:

- Fix Attention GQA implementation on CPU
- **MANUAL MERGE**: see
#26057
  - main merge date: Sept 15, 11:33am
  - pr: #25966
  - commit: d530b29
- Address edge GetMemInfo edge cases
  - main merge date: Sept 16, 10:32am
  - pr: #26021
  - commit: d251f3a
- Implement new Python APIs
  - main merge date: Sept 17, 11:44am
  - pr: #25999
  - commit: abc63e8
- MemcpyFromHost and MemcpyToHost support for plugin EPs
- **MERGE CONFLICT** on file
onnxruntime/test/optimizer/transpose_optimizer_test.cc. Conflicts with
#25689
  - main merge date: Sept 23, 10:42am
  - pr: #26088
  - commit: 4545732
- [TRT RTX EP] Fix bug for generating the correct subgraph in
GetCapability #26132
  - main merge date: Sept 23, 8:54pm
  - pr: #26132
  - commit: 72e56e7



---------

Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Chi Lo <[email protected]>
