Description
Problem Description
We observed a GPU crash on MI355x when running speculative decoding with sglang v0.5.3rc0 / ROCm 7.0 / aiter 0.1.5post3. The crash occurs with the DeepSeek-R1-WMXFP4-Preview model during GSM8K benchmarking under 8-way tensor parallelism and a large chunked prefill size (131072). It consistently triggers inside grouped_topk_opt_sort_kernel. Non-speculative runs work fine.
The regression appeared after upgrading aiter from v0.1.5 (rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250917) to v0.1.5post2 (rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250918). The same command works with aiter v0.1.5.
Operating System
Ubuntu 22.04.5 LTS
CPU
AMD EPYC 9575F 64-Core Processor
GPU
AMD MI355x * 8
ROCm Version
ROCm700
ROCm Component
No response
Steps to Reproduce
Environment
- GPU: MI355x
- Container Image: rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250930
- sglang Version: v0.5.3rc0
- ROCm Version: 7.0
- aiter Version: 0.1.5post3
- Speculative Algorithm: EAGLE
Setup
Docker run command:
container_name="v0.5.3rc0-rocm700-mi35x-speculative-decoding-0930"
sudo docker run -it --privileged --name="$container_name" \
--network=host --device=/dev/kfd \
--device=/dev/dri --group-add video --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined --ipc=host --shm-size 16G \
-v "$HOME:/home/" -v /data:/data -v /data2:/data2 \
rocm/sgl-dev:v0.5.3rc0-rocm700-mi35x-20250930
Server launch script (server.sh):
#!/usr/bin/env bash
set -euo pipefail
export HF_HUB_OFFLINE=1
export NCCL_MIN_NCHANNELS=112
export TRITON_ALLOW_NON_CONSTEXPR_GLOBALS=1
export AMDGCN_USE_BUFFER_OPS=1
export TRITON_HIP_ASYNC_COPY_BYPASS_PERMUTE=1
export TRITON_HIP_ASYNC_FAST_SWIZZLE=1
export TRITON_HIP_USE_ASYNC_COPY=1
export TRITON_HIP_USE_BLOCK_PINGPONG=0
export SGLANG_MXFP4_WEIGHT=0
export SGLANG_AITER_MOE=1
export SGLANG_AITER_NORM=1
export AITER_GEMM=1
export AITER_MLA_DECODE=1
export AITER_PREFILL=1
export AITER_ROPE=1
export SGLANG_RPD_PROFILER_DIR="./"
ROCM_DEBUG_AGENT_OPTIONS="--all --save-code-objects" \
HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2 \
HSA_ENABLE_DEBUG=1 \
python3 -m sglang.launch_server \
--model-path /data2/DeepSeek-R1-WMXFP4-Preview/ \
--tensor-parallel-size 8 \
--trust-remote-code \
--chunked-prefill-size 131072 \
--host 0.0.0.0 \
--port 8002 \
--log-requests \
--disable-radix-cache \
--mem-fraction-static 0.95 \
--speculative-algo EAGLE \
--speculative-draft-model-path /data2/DeepSeek-R1-NextN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
|& tee debug_kernel.log
Client script (client.sh):
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 5
Run commands:
cd /sgl-workspace/sglang
./server.sh
cd /sgl-workspace/sglang
./client.sh
Error Message
Crash observed during speculative decoding with the EAGLE algorithm:
wave_11878: pc=0x7f966d095bd0 (kernel_code_entry=0x7f966d094900 <void aiter::grouped_topk_opt_sort_kernel<float, float __vector(4), 8, true, true, false>(float*, float const*, float*, int*, unsigned long, int, int, int, int, float)>)
(stopped, reason: MEMORY_VIOLATION)
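For triage, it can help to know how far into the kernel the faulting instruction sits. The sketch below subtracts kernel_code_entry from pc in the debug agent line above. It assumes the line format shown in this report; `fault_offset` is a hypothetical helper name, not part of any ROCm tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the byte offset of the faulting pc from the
# kernel entry address, both taken from a rocm-debug-agent wave line.
fault_offset() {
  local line=$1 pc entry
  # Extract the first "pc=0x..." and "kernel_code_entry=0x..." values.
  pc=$(grep -oE 'pc=0x[0-9a-fA-F]+' <<<"$line" | head -n1 | cut -d= -f2)
  entry=$(grep -oE 'kernel_code_entry=0x[0-9a-fA-F]+' <<<"$line" | head -n1 | cut -d= -f2)
  # Bash arithmetic accepts 0x-prefixed hex directly.
  printf '0x%x\n' "$(( pc - entry ))"
}

# Addresses copied from the error message in this report.
fault_offset 'wave_11878: pc=0x7f966d095bd0 (kernel_code_entry=0x7f966d094900 <...>)'
# prints 0x12d0
```

With the addresses from this report the offset is 0x12d0 (4816 bytes past the kernel entry), which may be a useful starting point when looking at the code objects saved by --save-code-objects.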
Expected Behavior
- Speculative decoding with EAGLE should run successfully without a GPU memory violation.
Actual Behavior
- GPU kernel crashes with MEMORY_VIOLATION inside grouped_topk_opt_sort_kernel.
Steps to Reproduce
- Launch the container with the provided command.
- Start the server using server.sh.
- Run the benchmark client with client.sh.
- Observe the crash during speculative decoding.
Additional Notes
- This issue only occurs with speculative decoding (EAGLE) enabled.
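To compare the failing and working configurations, the launch arguments from server.sh can be built with the EAGLE flags behind a switch. This is only a sketch: `SPECULATIVE` and `build_args` are hypothetical names introduced here (not sglang options), the environment exports from server.sh are omitted for brevity, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit the server.sh launch arguments, one per line.
# SPECULATIVE=1 (default) includes the EAGLE flags; SPECULATIVE=0 omits them.
build_args() {
  local args=(
    --model-path /data2/DeepSeek-R1-WMXFP4-Preview/
    --tensor-parallel-size 8
    --trust-remote-code
    --chunked-prefill-size 131072
    --host 0.0.0.0
    --port 8002
    --log-requests
    --disable-radix-cache
    --mem-fraction-static 0.95
  )
  if [[ "${SPECULATIVE:-1}" == "1" ]]; then
    # Speculative decoding flags, copied from server.sh in this report.
    args+=(
      --speculative-algo EAGLE
      --speculative-draft-model-path /data2/DeepSeek-R1-NextN
      --speculative-num-steps 3
      --speculative-eagle-topk 1
      --speculative-num-draft-tokens 4
    )
  fi
  printf '%s\n' "${args[@]}"
}

# Dry run: print the command; remove "echo" to actually launch the server.
echo python3 -m sglang.launch_server $(build_args)
```

Running the script with SPECULATIVE=0 prints the non-speculative baseline command, and with SPECULATIVE=1 the configuration that crashes.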
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response