Hello,
While running GPT-OSS 120B in vLLM, we noticed that FlashInfer introduces a quality regression. Specifically, we achieve the expected quality score (92% on AIME) when setting:
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 \
VLLM_USE_TRTLLM_ATTENTION=0 \
VLLM_USE_TRTLLM_CONTEXT_ATTENTION=0 \
VLLM_USE_TRTLLM_DECODE_ATTENTION=0 \
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 \
VLLM_USE_V1=1 \
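(For context, these variables are prepended to a standard vLLM server launch, roughly along the lines of the sketch below; the model ID and tensor-parallel size there are placeholders, not the exact flags from our runs.)
vllm serve openai/gpt-oss-120b --tensor-parallel-size 8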
Switching to FlashInfer-based attention, i.e.:
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 \
VLLM_USE_TRTLLM_ATTENTION=1 \
VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 \
VLLM_USE_TRTLLM_DECODE_ATTENTION=1 \
VLLM_ATTENTION_BACKEND=CUTLASS_MLA \
VLLM_USE_V1=1 \
This consistently reduces the score to around 88% in high-reasoning mode. We're not sure exactly where this issue is coming from, but it is preventing us from using the great kernels in this project. Does anyone have any idea what the source might be?
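One possible next step (purely a sketch on our side, and it assumes the context/decode flags can actually be toggled independently) would be to flip the TRTLLM context and decode attention paths one at a time, keeping everything else from the FlashInfer configuration above, to see which half accounts for the drop:

# Context attention via TRTLLM only (decode stays on the non-TRTLLM path)
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 \
VLLM_USE_TRTLLM_ATTENTION=1 \
VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 \
VLLM_USE_TRTLLM_DECODE_ATTENTION=0 \
VLLM_ATTENTION_BACKEND=CUTLASS_MLA \
VLLM_USE_V1=1 \
vllm serve openai/gpt-oss-120b --tensor-parallel-size 8

# Decode attention via TRTLLM only: same as above, but with
# VLLM_USE_TRTLLM_CONTEXT_ATTENTION=0 and VLLM_USE_TRTLLM_DECODE_ATTENTION=1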
@yeqcharlotte Added a simple repro for this issue here: yeqcharlotte/vllm#3