
Commit 16c00c4

Correct CUDA Graph capture for encoder-decoder models (V0 engine)
This commit fixes a bug where CUDA Graph capture provided no performance benefit for Whisper-style encoder-decoder models. Previously, `max_seq_len_to_capture` was clamped to `max_model_len`, which reflects only the decoder's maximum sequence length (448 for Whisper). For such models the encoder's sequence length (1500 for Whisper) is significantly larger, so the captured graphs were too small to cover the encoder's operations and the feature could not be leveraged properly.

The fix skips the clamp for encoder-decoder models, so the capture size accounts for both the encoder and the decoder. With graphs large enough to handle both components, CUDA Graph can now be used for these models, improving inference performance.
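As a rough illustration of the behaviour described above, here is a minimal standalone sketch, not the actual vLLM code path; the helper names and the 8192 capture value are hypothetical, while the 448/1500 limits are the Whisper-like sizes mentioned in the commit message:

# Sketch of the clamping behaviour before and after this commit, using
# Whisper-like sizes: decoder max_model_len = 448, encoder sequence length = 1500.

def clamp_capture_len_old(max_seq_len_to_capture: int, max_model_len: int) -> int:
    # Old behaviour: always clamp the capture size to max_model_len.
    return min(max_seq_len_to_capture, max_model_len)


def clamp_capture_len_new(max_seq_len_to_capture: int,
                          max_model_len: int,
                          is_encoder_decoder: bool) -> int:
    # New behaviour: skip the clamp for encoder-decoder models so the captured
    # graph stays large enough for the encoder's longer sequence as well.
    if not is_encoder_decoder:
        return min(max_seq_len_to_capture, max_model_len)
    return max_seq_len_to_capture


if __name__ == "__main__":
    # Hypothetical configured capture size of 8192 with Whisper-like limits.
    print(clamp_capture_len_old(8192, 448))        # 448: too small for the 1500-position encoder
    print(clamp_capture_len_new(8192, 448, True))  # 8192: covers encoder and decoder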
1 parent ebf7605 commit 16c00c4

File tree

1 file changed: +3 −2 lines changed


vllm/config/__init__.py

Lines changed: 3 additions & 2 deletions
@@ -1165,8 +1165,9 @@ def _verify_quantization(self) -> None:
                         "non-quantized models.", self.quantization)

     def _verify_cuda_graph(self) -> None:
-        self.max_seq_len_to_capture = min(self.max_seq_len_to_capture,
-                                          self.max_model_len)
+        if not self.is_encoder_decoder:
+            self.max_seq_len_to_capture = min(self.max_seq_len_to_capture,
+                                              self.max_model_len)
         # CUDAGraph capture not supported for enc-dec models and mllama on ROCm
         ROCM_UNSUPPORTED_MODELS = ['mllama']
         unsupported_rocm = (self.hf_config.model_type
