Fix Attention GQA implementation on CPU #25966
Conversation
You can commit the suggested changes from lintrunner.
Pull Request Overview
This PR fixes the Attention GQA (Grouped Query Attention) implementation on CPU to comply with ONNX specifications by updating the test exclusions and correcting the GQA index calculation logic.
Key changes:
- Updated test filters to exclude pending ONNX update tests with appropriate comments
- Fixed the GQA head indexing calculation from a modulo-based to a division-based approach (see the sketch after this list)
- Corrected the mask value assignment to use negative infinity instead of the lowest finite value
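For illustration, here is a minimal sketch of the division-based head mapping described above. The function and parameter names (`q_num_heads`, `kv_num_heads`) are hypothetical and not the identifiers used in attention.cc:

```cpp
#include <cassert>

// Hypothetical sketch of the division-based GQA head mapping. Each group of
// (q_num_heads / kv_num_heads) consecutive query heads shares one KV head.
int KvHeadForQueryHead(int q_head, int q_num_heads, int kv_num_heads) {
  assert(kv_num_heads > 0 && q_num_heads % kv_num_heads == 0);
  const int group_size = q_num_heads / kv_num_heads;
  // Division-based mapping (the corrected behavior): query heads
  // 0..group_size-1 map to KV head 0, the next group to KV head 1, and so on.
  return q_head / group_size;
  // The previous modulo-based form, q_head % kv_num_heads, interleaves the
  // groups instead of keeping consecutive query heads on the same KV head.
}
```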
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/test/testdata/onnx_backend_test_series_filters.jsonc | Updated test exclusion patterns for attention tests pending ONNX updates |
| onnxruntime/test/onnx/main.cc | Added hardcoded test exclusions for attention GQA tests pending ONNX updates |
| onnxruntime/core/providers/cpu/llm/attention.cc | Fixed GQA head indexing calculation and corrected mask values to use negative infinity |
Does changing the masked value from lowest() to -inf change the attention output at all? I think the results should be close enough.
There is one case where we see NaN from softmax: https://github.com/microsoft/onnxscript/blob/a70ee8d0905f563c840bbd5338595e9ac6b1b5b4/onnxscript/function_libs/torch_lib/ops/nn.py#L2147-L2155. Not sure if it is related.
I wonder if there are benefits to using -inf? Should we change https://github.com/microsoft/onnxscript/blob/a70ee8d0905f563c840bbd5338595e9ac6b1b5b4/onnxscript/function_libs/torch_lib/ops/nn.py#L2141 to use -inf as well?
The reason I switched to infinity is that there is a case where the mask is added to the input. If the mask is -inf, no value can change it unless the value is +inf or NaN itself. That is not true for lowest(), although I don't think it is likely in practice that the mask is lowest() and the input falls anywhere in the lowest()..highest() range.
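A minimal standalone sketch of that point (not code from this PR): with an additive lowest() mask, an extreme but finite score can cancel the mask entirely, while -inf cannot be cancelled by any finite value.

```cpp
#include <iostream>
#include <limits>

int main() {
  const float neg_inf = -std::numeric_limits<float>::infinity();
  const float lowest = std::numeric_limits<float>::lowest();
  const float extreme_score = std::numeric_limits<float>::max();  // finite but extreme

  // Additive -inf mask: no finite score can un-mask the position.
  std::cout << neg_inf + extreme_score << "\n";  // prints: -inf
  // Additive lowest() mask: an extreme score cancels the mask entirely.
  std::cout << lowest + extreme_score << "\n";   // prints: 0
  return 0;
}
```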
The question is: do we need special handling in softmax when the mask is -inf?
I see your point. We should add a unit test with such a mask.
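To make the case for such a test concrete, here is a minimal standalone sketch (hypothetical names, not the actual unit test) showing why a row that is masked entirely with -inf yields NaN from a naive softmax:

```cpp
#include <cmath>
#include <iostream>
#include <limits>
#include <vector>

// Naive softmax over one attention row; illustration only, not ORT's kernel.
std::vector<float> NaiveSoftmax(const std::vector<float>& row) {
  std::vector<float> out(row.size());
  float sum = 0.0f;
  for (size_t i = 0; i < row.size(); ++i) {
    out[i] = std::exp(row[i]);  // exp(-inf) == 0
    sum += out[i];
  }
  for (float& v : out) v /= sum;  // sum == 0 for a fully masked row -> 0/0 == NaN
  return out;
}

int main() {
  const float neg_inf = -std::numeric_limits<float>::infinity();
  // Every position in the row is masked: the naive result is all NaN.
  for (float v : NaiveSoftmax({neg_inf, neg_inf, neg_inf})) std::cout << v << " ";
  std::cout << "\n";  // prints: nan nan nan (sign of NaN may vary by platform)
  return 0;
}
```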
### Description
Cherry-pick the following PRs into the ORT 1.23.1 branch:
- Fix Attention GQA implementation on CPU
  - **MANUAL MERGE**: see #26057
  - main merge date: Sept 15, 11:33am
  - pr: #25966
  - commit: d530b29
- Address GetMemInfo edge cases
  - main merge date: Sept 16, 10:32am
  - pr: #26021
  - commit: d251f3a
- Implement new Python APIs
  - main merge date: Sept 17, 11:44am
  - pr: #25999
  - commit: abc63e8
- MemcpyFromHost and MemcpyToHost support for plugin EPs
  - **MERGE CONFLICT** on file onnxruntime/test/optimizer/transpose_optimizer_test.cc. Conflicts with #25689
  - main merge date: Sept 23, 10:42am
  - pr: #26088
  - commit: 4545732
- [TRT RTX EP] Fix bug for generating the correct subgraph in GetCapability #26132
  - main merge date: Sept 23, 8:54pm
  - pr: #26132
  - commit: 72e56e7

### Motivation and Context

---------

Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Chi Lo <[email protected]>
Description
Attention on CPU is following ONNX specifications. This change replicates the changes introduced by onnx/onnx#7274.