[Issue]: grouped_topk_opt_sort_kernel kernel memory segmentation fault #1121

### Problem Description

We observed a GPU crash on MI355x when running speculative decoding with sglang v0.5.3rc0, ROCm 7.0, and aiter 0.1.5post3. The issue occurs with the DeepSeek-R1-WMXFP4-Preview model during GSM8K benchmarking under 8-way tensor parallelism with a large chunked prefill size (131072). The crash consistently triggers inside `grouped_topk_opt_sort_kernel`. Non-speculative runs work fine.

The regression appeared after upgrading aiter from v0.1.5 (`rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250917`) to v0.1.5post2 (`rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250918`). The same command works fine with aiter v0.1.5.

### Operating System

Ubuntu 22.04.5 LTS

### CPU

AMD EPYC 9575F 64-Core Processor

### GPU

AMD MI355x * 8

### ROCm Version

ROCm 7.0

### ROCm Component

_No response_

### Steps to Reproduce

### Environment

* **GPU:** MI355x
* **Container Image:** `rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250930`
* **sglang Version:** v0.5.3rc0
* **ROCm Version:** 7.0
* **aiter Version:** 0.1.5post3
* **Speculative Algorithm:** `EAGLE`

### Setup

**Docker run command:**

```bash
container_name="v0.5.3rc0-rocm700-mi35x-speculative-decoding-0930"
sudo docker run -it --privileged --name="$container_name" \
    --network=host --device=/dev/kfd \
    --device=/dev/dri --group-add video --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined --ipc=host --shm-size 16G \
    -v "$HOME:/home/" -v /data:/data -v /data2:/data2 \
    rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250930
```

**Server launch script (`server.sh`):**

```bash
#!/usr/bin/env bash
set -euo pipefail

export HF_HUB_OFFLINE=1
export NCCL_MIN_NCHANNELS=112
export TRITON_ALLOW_NON_CONSTEXPR_GLOBALS=1
export AMDGCN_USE_BUFFER_OPS=1
export TRITON_HIP_ASYNC_COPY_BYPASS_PERMUTE=1
export TRITON_HIP_ASYNC_FAST_SWIZZLE=1
export TRITON_HIP_USE_ASYNC_COPY=1
export TRITON_HIP_USE_BLOCK_PINGPONG=0
export SGLANG_MXFP4_WEIGHT=0
export SGLANG_AITER_MOE=1
export SGLANG_AITER_NORM=1
export AITER_GEMM=1
export AITER_MLA_DECODE=1
export AITER_PREFILL=1
export AITER_ROPE=1
export SGLANG_RPD_PROFILER_DIR="./"

ROCM_DEBUG_AGENT_OPTIONS="--all --save-code-objects" \
HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2 \
HSA_ENABLE_DEBUG=1 \
python3 -m sglang.launch_server \
    --model-path /data2/DeepSeek-R1-WMXFP4-Preview/ \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --chunked-prefill-size 131072 \
    --host 0.0.0.0 \
    --port 8002 \
    --log-requests \
    --disable-radix-cache \
    --mem-fraction-static 0.95 \
    --speculative-algo EAGLE \
    --speculative-draft-model-path /data2/DeepSeek-R1-NextN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 |& tee debug_kernel.log
```

**Client script (`client.sh`):**

```bash
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 5
```

**Run commands:**

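```bash
# Terminal 1: start the server (blocks while serving; output is also written to debug_kernel.log)
cd /sgl-workspace/sglang
./server.sh

# Terminal 2: once the server is up, run the benchmark client
cd /sgl-workspace/sglang
./client.sh
```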
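
Because `server.sh` blocks while serving, the client has to be started from a second shell once the model has finished loading. A minimal sketch of a readiness wait follows. It assumes the server answers on a `/health` endpoint at the port configured above (8002) and that `curl` is available in the container. `wait_for_server` is a hypothetical helper name, not part of sglang.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a URL until it responds or a timeout expires.
# Assumption: the sglang server exposes /health on the configured port.
wait_for_server() {
    local url="${1:-http://127.0.0.1:8002/health}"
    local timeout_s="${2:-1800}"   # large models can take a long time to load
    local waited=0
    while (( waited < timeout_s )); do
        # -s: quiet, -f: treat HTTP errors as failure, discard the body
        if curl -sf -o /dev/null "$url"; then
            return 0
        fi
        sleep 5
        waited=$(( waited + 5 ))
    done
    return 1
}
```

With this in place, the second shell could run `wait_for_server && ./client.sh`, so the benchmark only starts after the server responds.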

### Error Message

Crash observed during speculative decoding with the EAGLE algorithm, as reported by the ROCm debug agent:

```
wave_11878: pc=0x7f966d095bd0 (kernel_code_entry=0x7f966d094900 <void aiter::grouped_topk_opt_sort_kernel<float, float __vector(4), 8, true, true, false>(float*, float const*, float*, int*, unsigned long, int, int, int, int, float)>) 
(stopped, reason: MEMORY_VIOLATION)
```
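
To check whether other kernels fault in the same run, the debug-agent output captured in `debug_kernel.log` can be summarized per kernel. The sketch below assumes every report uses the two-line layout shown above: a `wave_N: ... kernel_code_entry=... <signature>` line followed by a `(stopped, reason: MEMORY_VIOLATION)` line. `faulting_kernels` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count MEMORY_VIOLATION reports per kernel name.
# Assumption: each violation line is directly preceded by its wave line.
faulting_kernels() {
    local log="${1:-debug_kernel.log}"
    # Take each violation plus the line before it, keep only the kernel
    # name (text after '<' up to the first template '<' or argument '('),
    # then count occurrences per kernel.
    grep -B1 'reason: MEMORY_VIOLATION' "$log" \
        | grep -o 'kernel_code_entry=[^ ]* <[^<(]*' \
        | sed 's/^kernel_code_entry=[^ ]* <//' \
        | sort | uniq -c | sort -rn
}
```

Running `faulting_kernels debug_kernel.log` prints one line per distinct kernel, prefixed by the number of faulting waves attributed to it.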

### Expected Behavior

* Speculative decoding with `EAGLE` should run successfully without GPU memory violation.

### Actual Behavior

* GPU kernel crashes with **MEMORY_VIOLATION** inside `grouped_topk_opt_sort_kernel`.

### Steps to Reproduce

1. Launch container with the provided command.
2. Start server using `server.sh`.
3. Run benchmark client with `client.sh`.
4. Observe crash during speculative decoding.

### Additional Notes

* This issue only occurs with **speculative decoding (EAGLE)** enabled.


### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

_No response_

### Additional Information

_No response_
