Fix Attention GQA implementation on CPU #25966
Conversation
You can commit the suggested changes from lintrunner.
Pull Request Overview
This PR fixes the Attention GQA (Grouped Query Attention) implementation on CPU to comply with ONNX specifications by updating the test exclusions and correcting the GQA index calculation logic.
Key changes:
- Updated test filters to exclude pending ONNX update tests with appropriate comments
- Fixed the GQA head indexing calculation from a modulo-based to a division-based approach (see the sketch after this list)
- Corrected the mask value assignment to use negative infinity instead of the lowest finite value
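For illustration, here is a minimal sketch of the division-based head mapping described above. The function and parameter names (`q_num_heads`, `kv_num_heads`) are hypothetical and not the identifiers used in attention.cc:

```cpp
#include <cassert>

// Hypothetical sketch of the division-based GQA head mapping. Each group of
// (q_num_heads / kv_num_heads) consecutive query heads shares one KV head.
int KvHeadForQueryHead(int q_head, int q_num_heads, int kv_num_heads) {
  assert(kv_num_heads > 0 && q_num_heads % kv_num_heads == 0);
  const int group_size = q_num_heads / kv_num_heads;
  // Division-based mapping (the corrected behavior): query heads
  // 0..group_size-1 map to KV head 0, the next group to KV head 1, and so on.
  return q_head / group_size;
  // The previous modulo-based form, q_head % kv_num_heads, interleaves the
  // groups instead of keeping consecutive query heads on the same KV head.
}
```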
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/test/testdata/onnx_backend_test_series_filters.jsonc | Updated test exclusion patterns for attention tests pending ONNX updates |
| onnxruntime/test/onnx/main.cc | Added hardcoded test exclusions for attention GQA tests pending ONNX updates |
| onnxruntime/core/providers/cpu/llm/attention.cc | Fixed GQA head indexing calculation and corrected mask values to use negative infinity |
Does changing the masked value from lowest() to -inf change the attention output at all? I think the results should be close enough.
There is one case where we see NaN from softmax: https://github.com/microsoft/onnxscript/blob/a70ee8d0905f563c840bbd5338595e9ac6b1b5b4/onnxscript/function_libs/torch_lib/ops/nn.py#L2147-L2155. Not sure if it is related.
I wonder if there are benefits to using -inf? Should we change https://github.com/microsoft/onnxscript/blob/a70ee8d0905f563c840bbd5338595e9ac6b1b5b4/onnxscript/function_libs/torch_lib/ops/nn.py#L2141 to use -inf as well?
The reason I switched to infinity is that there is a case where the mask is added to the input. If the mask is -inf, no value can change it unless the value is +inf or NaN itself. That is not true for lowest(), although I don't think it is likely in practice that the mask is lowest() and the input falls anywhere in the lowest()..highest() range.
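A minimal standalone sketch of that point (not code from this PR): with an additive lowest() mask, an extreme but finite score can cancel the mask entirely, while -inf cannot be cancelled by any finite value.

```cpp
#include <iostream>
#include <limits>

int main() {
  const float neg_inf = -std::numeric_limits<float>::infinity();
  const float lowest = std::numeric_limits<float>::lowest();
  const float extreme_score = std::numeric_limits<float>::max();  // finite but extreme

  // Additive -inf mask: no finite score can un-mask the position.
  std::cout << neg_inf + extreme_score << "\n";  // prints: -inf
  // Additive lowest() mask: an extreme score cancels the mask entirely.
  std::cout << lowest + extreme_score << "\n";   // prints: 0
  return 0;
}
```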
The question is: do we need special handling in softmax when the mask is -inf?
I see your point. We should add a unit test with such a mask.
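To make the case for such a test concrete, here is a minimal standalone sketch (hypothetical names, not the actual unit test) showing why a row that is masked entirely with -inf yields NaN from a naive softmax:

```cpp
#include <cmath>
#include <iostream>
#include <limits>
#include <vector>

// Naive softmax over one attention row; illustration only, not ORT's kernel.
std::vector<float> NaiveSoftmax(const std::vector<float>& row) {
  std::vector<float> out(row.size());
  float sum = 0.0f;
  for (size_t i = 0; i < row.size(); ++i) {
    out[i] = std::exp(row[i]);  // exp(-inf) == 0
    sum += out[i];
  }
  for (float& v : out) v /= sum;  // sum == 0 for a fully masked row -> 0/0 == NaN
  return out;
}

int main() {
  const float neg_inf = -std::numeric_limits<float>::infinity();
  // Every position in the row is masked: the naive result is all NaN.
  for (float v : NaiveSoftmax({neg_inf, neg_inf, neg_inf})) std::cout << v << " ";
  std::cout << "\n";  // prints: nan nan nan (sign of NaN may vary by platform)
  return 0;
}
```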
### Description
Cherry-pick the following PRs into the ORT 1.23.1 branch:
- Fix Attention GQA implementation on CPU
  - **MANUAL MERGE**: see #26057
  - main merge date: Sept 15, 11:33am
  - pr: #25966
  - commit: d530b29
- Address GetMemInfo edge cases
  - main merge date: Sept 16, 10:32am
  - pr: #26021
  - commit: d251f3a
- Implement new Python APIs
  - main merge date: Sept 17, 11:44am
  - pr: #25999
  - commit: abc63e8
- MemcpyFromHost and MemcpyToHost support for plugin EPs
  - **MERGE CONFLICT** on file onnxruntime/test/optimizer/transpose_optimizer_test.cc. Conflicts with #25689
  - main merge date: Sept 23, 10:42am
  - pr: #26088
  - commit: 4545732
- [TRT RTX EP] Fix bug for generating the correct subgraph in GetCapability #26132
  - main merge date: Sept 23, 8:54pm
  - pr: #26132
  - commit: 72e56e7

### Motivation and Context

---------

Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Chi Lo <[email protected]>
Description
Attention on CPU is following ONNX specifications. This change replicates the changes introduced by onnx/onnx#7274.