[Feature]: decode MLA kernel support for nhead=8 #726

### Suggestion Description

Hi, I am testing Kimi-K2-Instruct (https://huggingface.co/moonshotai/Kimi-K2-Instruct/blob/main/config.json) with TP8 on a single MI355 node. I use the Aiter attention backend instead of the Triton backend for better performance.

It seems the decode MLA kernel only supports the DSV3/R1 case, nhead=16: https://github.com/ROCm/aiter/blob/v0.1.4/aiter/mla.py#L107
Could Aiter also support nhead=8?
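
For context, here is a rough sketch of where nhead=8 comes from. It assumes `num_attention_heads` is 64 for Kimi-K2-Instruct and 128 for DSV3/R1 (values as I read them from the published configs, please double-check against the linked config.json), and that heads are split evenly across TP ranks. `heads_per_rank` is a hypothetical helper for illustration, not an Aiter or SGLang function.

```python
def heads_per_rank(num_attention_heads: int, tp_size: int) -> int:
    """Attention heads each GPU handles when heads are split evenly across TP ranks."""
    if num_attention_heads % tp_size != 0:
        raise ValueError("num_attention_heads must be divisible by tp_size")
    return num_attention_heads // tp_size


# Assumed values, taken from the models' config.json files (verify before relying on them).
KIMI_K2_HEADS = 64
DSV3_HEADS = 128
TP_SIZE = 8

print("Kimi-K2 heads per rank:", heads_per_rank(KIMI_K2_HEADS, TP_SIZE))
print("DSV3/R1 heads per rank:", heads_per_rank(DSV3_HEADS, TP_SIZE))
```

Under these assumptions, DSV3/R1 at TP8 lands on the supported nhead=16, while Kimi-K2 at TP8 lands on nhead=8.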

# Steps to reproduce
## Image
lmsysorg/sglang:v0.4.9.post2-rocm700-mi35x
## Launch server
python3 -m sglang.launch_server --model moonshotai/Kimi-K2-Instruct --trust-remote-code --tp 8 --attention-backend aiter

### Operating System

_No response_

### GPU

_No response_

### ROCm Component

_No response_
