
[vLLM] Quality Regression with FlashInfer vs Triton Attention #1645

@jwfromm

Description

Hello,

While running GPT-OSS 120B in vLLM, we noticed that FlashInfer introduces a quality regression. Specifically, we achieve the expected quality score (92% on AIME) when setting:

VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 \
VLLM_USE_TRTLLM_ATTENTION=0 \
VLLM_USE_TRTLLM_CONTEXT_ATTENTION=0 \
VLLM_USE_TRTLLM_DECODE_ATTENTION=0 \
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 \
VLLM_USE_V1=1 \

Switching to FlashInfer-based attention, i.e.:

VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 \
VLLM_USE_TRTLLM_ATTENTION=1 \
VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 \
VLLM_USE_TRTLLM_DECODE_ATTENTION=1 \
VLLM_ATTENTION_BACKEND=CUTLASS_MLA \
VLLM_USE_V1=1 \

With this configuration, the score consistently drops to around 88% in high-reasoning mode. We're not sure exactly where the issue is coming from, but it is preventing us from using the great kernels in this project. Does anyone have ideas about what the source might be?
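For anyone trying to reproduce locally, a quick A/B spot check of the two configurations via vLLM's OpenAI-compatible API might look like the sketch below; the ports, model name, and prompt are assumptions, and a proper comparison still needs a full AIME evaluation run:

# Sketch only: assumes one server per configuration, listening on ports 8000 and 8001.
for PORT in 8000 8001; do
  curl -s http://localhost:$PORT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "openai/gpt-oss-120b",
          "messages": [{"role": "user", "content": "Solve: what is 17 * 23?"}],
          "temperature": 0
        }'
  echo
done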

@yeqcharlotte Added a simple repro for this issue here: yeqcharlotte/vllm#3
