
Commit 16c00c4

Correct CUDA Graph capture for encoder-decoder models (V0 engine)
This commit fixes a bug where CUDA Graph capture provided no performance benefit for Whisper-style encoder-decoder models. Previously, `max_seq_len_to_capture` was clamped to `max_model_len`, which reflects only the decoder's maximum sequence length (448 for Whisper). For such models the encoder's sequence length (1500 for Whisper) is significantly larger, so the captured graphs were too small to cover the encoder's operations and the feature could not be leveraged properly.

The fix skips the clamp for encoder-decoder models, so the capture size accounts for both the encoder and the decoder. With graphs large enough to handle both components, CUDA Graph can now be used for these models, improving inference performance.
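As a rough illustration of the behaviour described above, here is a minimal standalone sketch, not the actual vLLM code path; the helper names and the 8192 capture value are hypothetical, while the 448/1500 limits are the Whisper-like sizes mentioned in the commit message:

# Sketch of the clamping behaviour before and after this commit, using
# Whisper-like sizes: decoder max_model_len = 448, encoder sequence length = 1500.

def clamp_capture_len_old(max_seq_len_to_capture: int, max_model_len: int) -> int:
    # Old behaviour: always clamp the capture size to max_model_len.
    return min(max_seq_len_to_capture, max_model_len)


def clamp_capture_len_new(max_seq_len_to_capture: int,
                          max_model_len: int,
                          is_encoder_decoder: bool) -> int:
    # New behaviour: skip the clamp for encoder-decoder models so the captured
    # graph stays large enough for the encoder's longer sequence as well.
    if not is_encoder_decoder:
        return min(max_seq_len_to_capture, max_model_len)
    return max_seq_len_to_capture


if __name__ == "__main__":
    # Hypothetical configured capture size of 8192 with Whisper-like limits.
    print(clamp_capture_len_old(8192, 448))        # 448: too small for the 1500-position encoder
    print(clamp_capture_len_new(8192, 448, True))  # 8192: covers encoder and decoder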
1 parent ebf7605 commit 16c00c4

File tree

1 file changed: +3 −2 lines changed


vllm/config/__init__.py

Lines changed: 3 additions & 2 deletions
@@ -1165,8 +1165,9 @@ def _verify_quantization(self) -> None:
                         "non-quantized models.", self.quantization)

     def _verify_cuda_graph(self) -> None:
-        self.max_seq_len_to_capture = min(self.max_seq_len_to_capture,
-                                          self.max_model_len)
+        if not self.is_encoder_decoder:
+            self.max_seq_len_to_capture = min(self.max_seq_len_to_capture,
+                                              self.max_model_len)
         # CUDAGraph capture not supported for enc-dec models and mllama on ROCm
         ROCM_UNSUPPORTED_MODELS = ['mllama']
         unsupported_rocm = (self.hf_config.model_type
