Hello,
While running GPT-OSS 120B in vLLM, we noticed that FlashInfer introduces a quality regression. Specifically, we achieve the expected quality score (92% on AIME) when setting:
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 \
VLLM_USE_TRTLLM_ATTENTION=0 \
VLLM_USE_TRTLLM_CONTEXT_ATTENTION=0 \
VLLM_USE_TRTLLM_DECODE_ATTENTION=0 \
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 \
VLLM_USE_V1=1 \
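(For context, these variables are prepended to a standard vLLM server launch, roughly along the lines of the sketch below; the model ID and tensor-parallel size there are placeholders, not the exact flags from our runs.)
vllm serve openai/gpt-oss-120b --tensor-parallel-size 8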
Switching to FlashInfer-based attention, i.e.:
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 \
VLLM_USE_TRTLLM_ATTENTION=1 \
VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 \
VLLM_USE_TRTLLM_DECODE_ATTENTION=1 \
VLLM_ATTENTION_BACKEND=CUTLASS_MLA \
VLLM_USE_V1=1 \
This consistently reduces the score to around 88% in high-reasoning mode. We're not sure exactly where this issue is coming from, but it is preventing us from using the great kernels in this project. Does anyone have any idea what the source might be?
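One possible next step (purely a sketch on our side, and it assumes the context/decode flags can actually be toggled independently) would be to flip the TRTLLM context and decode attention paths one at a time, keeping everything else from the FlashInfer configuration above, to see which half accounts for the drop:

# Context attention via TRTLLM only (decode stays on the non-TRTLLM path)
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 \
VLLM_USE_TRTLLM_ATTENTION=1 \
VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 \
VLLM_USE_TRTLLM_DECODE_ATTENTION=0 \
VLLM_ATTENTION_BACKEND=CUTLASS_MLA \
VLLM_USE_V1=1 \
vllm serve openai/gpt-oss-120b --tensor-parallel-size 8

# Decode attention via TRTLLM only: same as above, but with
# VLLM_USE_TRTLLM_CONTEXT_ATTENTION=0 and VLLM_USE_TRTLLM_DECODE_ATTENTION=1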
@yeqcharlotte Added a simple repro for this issue here: yeqcharlotte/vllm#3