Conversation

@DDEle (Contributor) commented Oct 13, 2025

Motivation

To optimize the FMHA backward pass for HDim=48 (head dimension 48) cases.

Technical Details

  • Update to ROCm/composable_kernel@95bdc74
  • Update to ROCm/composable_kernel@013ba3c
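The two updates above are plain submodule pin bumps. A self-contained sketch of that motion in a throwaway directory (all repo names and paths here are illustrative stand-ins, not aiter's actual layout):

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# "upstream" stands in for ROCm/composable_kernel (illustrative only).
git init -q upstream
git -C upstream -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "first"
git -C upstream -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "second"
pin=$(git -C upstream rev-parse HEAD~1)   # the commit to pin the superproject to

git init -q super
cd super
# Newer git requires explicitly allowing file-protocol submodule URLs.
git -c protocol.file.allow=always submodule --quiet add "$work/upstream" 3rdparty/upstream
git -C 3rdparty/upstream checkout -q "$pin"   # move the submodule to the pinned commit
git add 3rdparty/upstream                     # stage the updated gitlink
git -c user.email=ci@example.com -c user.name=ci \
    commit -q -m "Bump upstream submodule to $pin"
git submodule status                          # shows the pinned commit
```

The superproject records only the gitlink (the pinned commit hash), which is why a PR like this one shows up as a one-line submodule change per bump.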

Test Plan

MAX_JOBS=$(nproc) pytest op_tests/test_mha.py -v

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings, October 13, 2025 07:16

Copilot AI left a comment


Pull Request Overview

This PR updates the composable_kernel submodule to implement FMHA (Fused Multi-Head Attention) backward pass optimizations specifically for D48 configurations on GFX950 hardware.

  • Updates composable_kernel submodule commit to include FMHA BWD optimizations
  • Targets D48 dimension size optimizations for GFX950 GPU architecture
  • Focuses on backward pass performance improvements for attention mechanisms


@DDEle requested a review from slippedJim, October 13, 2025 09:18
@DDEle (Contributor, Author) commented Oct 16, 2025

It seems that test_gemm_a8w8_blockscale_mi350 fails (core dump) with high probability under this CK update, whereas on the current aiter main branch (with its linked CK version) the core dump occurs only with low probability.

Another pattern of this failure: the problem appears only on the first run. test_gemm_a8w8_blockscale_mi350 runs smoothly on subsequent runs (once the JIT cache exists).

@valarLip (Collaborator)

Test failed: op_tests/test_mha.py?

@valarLip self-assigned this Oct 17, 2025