[Issue]: grouped_topk_opt_sort_kernel kernel memory segmentation fault #1121

@yichiche

Problem Description

We observed a GPU crash on MI355x when running speculative decoding with sglang v0.5.3rc0, ROCm 7.0, and aiter 0.1.5post3. The crash occurs with the DeepSeek-R1-WMXFP4-Preview model during GSM8K benchmarking with 8-way tensor parallelism and a large chunked prefill size (131072). It consistently triggers inside grouped_topk_opt_sort_kernel. Non-speculative runs work fine.

The regression appeared after upgrading aiter from v0.1.5 (rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250917) to v0.1.5post2 (rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250918). The same command works fine with aiter v0.1.5.

Operating System

Ubuntu 22.04.5 LTS

CPU

AMD EPYC 9575F 64-Core Processor

GPU

AMD MI355x * 8

ROCm Version

ROCm700

ROCm Component

No response

Steps to Reproduce

Environment

  • GPU: MI355x
  • Container Image: rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250930
  • sglang Version: v0.5.3rc0
  • ROCm Version: 7.0
  • aiter Version: 0.1.5post3
  • Speculative Algorithm: EAGLE

Setup

Docker run command:

container_name="v0.5.3rc0-rocm700-mi35x-speculative-decoding-0930"
sudo docker run -it --privileged --name="$container_name" \
    --network=host --device=/dev/kfd \
    --device=/dev/dri --group-add video --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined --ipc=host --shm-size 16G \
    -v "$HOME:/home/" -v /data:/data -v /data2:/data2 \
    rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250930

Server launch script (server.sh):

#!/usr/bin/env bash
set -euo pipefail

export HF_HUB_OFFLINE=1
export NCCL_MIN_NCHANNELS=112
export TRITON_ALLOW_NON_CONSTEXPR_GLOBALS=1
export AMDGCN_USE_BUFFER_OPS=1
export TRITON_HIP_ASYNC_COPY_BYPASS_PERMUTE=1
export TRITON_HIP_ASYNC_FAST_SWIZZLE=1
export TRITON_HIP_USE_ASYNC_COPY=1
export TRITON_HIP_USE_BLOCK_PINGPONG=0
export SGLANG_MXFP4_WEIGHT=0
export SGLANG_AITER_MOE=1
export SGLANG_AITER_NORM=1
export AITER_GEMM=1
export AITER_MLA_DECODE=1
export AITER_PREFILL=1
export AITER_ROPE=1
export SGLANG_RPD_PROFILER_DIR="./"

ROCM_DEBUG_AGENT_OPTIONS="--all --save-code-objects" \
HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2 \
HSA_ENABLE_DEBUG=1 \
python3 -m sglang.launch_server \
    --model-path /data2/DeepSeek-R1-WMXFP4-Preview/ \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --chunked-prefill-size 131072 \
    --host 0.0.0.0 \
    --port 8002 \
    --log-requests \
    --disable-radix-cache \
    --mem-fraction-static 0.95 \
    --speculative-algo EAGLE \
    --speculative-draft-model-path /data2/DeepSeek-R1-NextN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    |& tee debug_kernel.log

Client script (client.sh):

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 5

Run commands:

cd /sgl-workspace/sglang
./server.sh

cd /sgl-workspace/sglang
./client.sh

Error Message

Observed crash during speculative decoding with EAGLE algorithm:

wave_11878: pc=0x7f966d095bd0 (kernel_code_entry=0x7f966d094900 <void aiter::grouped_topk_opt_sort_kernel<float, float __vector(4), 8, true, true, false>(float*, float const*, float*, int*, unsigned long, int, int, int, int, float)>) 
(stopped, reason: MEMORY_VIOLATION)
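
To check whether other kernels also fault, the debug-agent output captured in debug_kernel.log can be summarized per kernel. The following is a minimal sketch, assuming every faulting wave is logged in the same `kernel_code_entry=0x... <void NAME<...>(...)>` form as the excerpt above; the function name summarize_faulting_kernels is hypothetical and not part of any ROCm tool.

```shell
# Count faulting waves per kernel in a rocm-debug-agent log.
# Assumption: wave lines match the excerpt in this issue, i.e. they contain
# "kernel_code_entry=0x<hex> <void NAME<template args>(params)>".
summarize_faulting_kernels() {
    # 1) keep only the "kernel_code_entry=... <void NAME..." fragment,
    # 2) strip the address prefix and the optional "void " return type,
    # 3) strip template arguments, leaving the bare kernel name,
    # 4) count occurrences of each name, most frequent first.
    grep -o 'kernel_code_entry=0x[0-9a-f]* <[^(]*' "$1" \
        | sed -E 's/^kernel_code_entry=0x[0-9a-f]+ <(void )?//; s/<.*$//' \
        | sort | uniq -c | sort -rn
}

# Usage:
#   summarize_faulting_kernels debug_kernel.log
```

If the sort kernel is the only name reported, that supports the observation that the violation is confined to grouped_topk_opt_sort_kernel rather than being a symptom of earlier corruption elsewhere.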

Expected Behavior

  • Speculative decoding with EAGLE should run successfully without GPU memory violation.

Actual Behavior

  • GPU kernel crashes with MEMORY_VIOLATION inside grouped_topk_opt_sort_kernel.

Steps to Reproduce

  1. Launch container with the provided command.
  2. Start server using server.sh.
  3. Run benchmark client with client.sh.
  4. Observe crash during speculative decoding.
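
Steps 2 and 3 can be chained so the client starts only once the server is listening. This is a rough sketch, not part of the original scripts: the helper name wait_for_port and the timeout values are assumptions, and an open port only shows that the HTTP server has started, not necessarily that the model is fully ready.

```shell
# Wait until HOST:PORT accepts TCP connections, or give up after TIMEOUT seconds.
# Relies on bash's /dev/tcp redirection, so it needs bash (as server.sh does).
wait_for_port() {
    local host="$1" port="$2" timeout="${3:-600}" waited=0
    # The subshell keeps the test descriptor from leaking into the caller.
    until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Usage, from /sgl-workspace/sglang (port 8002 is the one set in server.sh):
#   ./server.sh &
#   wait_for_port 127.0.0.1 8002 1800 && ./client.sh
```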

Additional Notes

  • This issue only occurs with speculative decoding (EAGLE) enabled.
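
To confirm this with an A/B run, the same launch arguments can be used with and without the EAGLE flags. The sketch below only assembles the argument string from server.sh; the function name build_server_args and the 0/1 toggle are hypothetical conveniences, and the environment variables from server.sh still need to be exported before launching.

```shell
# Print the launch_server arguments from server.sh, adding the EAGLE
# speculative-decoding flags only when the first argument is "1".
build_server_args() {
    base="--model-path /data2/DeepSeek-R1-WMXFP4-Preview/"
    base="$base --tensor-parallel-size 8 --trust-remote-code"
    base="$base --chunked-prefill-size 131072 --host 0.0.0.0 --port 8002"
    base="$base --log-requests --disable-radix-cache --mem-fraction-static 0.95"
    spec="--speculative-algo EAGLE"
    spec="$spec --speculative-draft-model-path /data2/DeepSeek-R1-NextN"
    spec="$spec --speculative-num-steps 3 --speculative-eagle-topk 1"
    spec="$spec --speculative-num-draft-tokens 4"
    if [ "${1:-0}" = "1" ]; then
        echo "$base $spec"
    else
        echo "$base"
    fi
}

# Usage (the unquoted expansion is intentional so the string splits into words):
#   python3 -m sglang.launch_server $(build_server_args 0) |& tee baseline.log
#   python3 -m sglang.launch_server $(build_server_args 1) |& tee eagle.log
```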

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
