🚀 The feature, motivation and pitch
At the cudagraph capture stage, MLACommonMetadataBuilder builds the metadata with max_query_len = 1, as introduced by this PR:
def build_for_cudagraph_capture(
        self, common_attn_metadata: CommonAttentionMetadata) -> M:
    """
    This method builds the metadata for full cudagraph capture.
    Currently, only decode is supported for full cudagraphs with MLA.
    """
    m = common_attn_metadata

    assert m.num_reqs == m.num_actual_tokens, \
        "MLA only supports decode-only full CUDAGraph capture. " \
        "Make sure all cudagraph capture sizes <= max_num_seq."

    m.max_query_len = 1  # decode-only
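For clarity, here is a minimal, self-contained sketch of the decode-only invariant this method relies on (FakeMetadata and check_decode_only are mock names for illustration, not real vLLM classes): in a pure decode batch every request contributes exactly one token, so num_reqs == num_actual_tokens and max_query_len == 1.

```python
from dataclasses import dataclass

@dataclass
class FakeMetadata:
    # Only the three fields used by the assert above; not the real
    # CommonAttentionMetadata class.
    num_reqs: int
    num_actual_tokens: int
    max_query_len: int

def check_decode_only(m: FakeMetadata) -> None:
    # Mirrors the assert in build_for_cudagraph_capture above.
    assert m.num_reqs == m.num_actual_tokens, (
        "MLA only supports decode-only full CUDAGraph capture.")

# Pure decode batch: 8 requests, 1 token each -> the invariant holds.
check_decode_only(FakeMetadata(num_reqs=8, num_actual_tokens=8, max_query_len=1))
```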
When we run DeepSeek-R1 with DeepSeek MTP using simple-cuda-graph, and apply the patches that add full cudagraph support for eagle, max_query_len may be 2 (one actual decode token plus one speculative token), which produces garbage output like ok, ok, ok, ok....
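Continuing the mock sketch above (assuming exactly one MTP draft token per request): each request now contributes two tokens, so num_actual_tokens == 2 * num_reqs and max_query_len == 2, which breaks both the assert and the max_query_len = 1 assumption.

```python
# Same mock as above, now with one speculative (MTP) token per request:
# 8 requests * (1 decode token + 1 draft token) = 16 actual tokens.
m = FakeMetadata(num_reqs=8, num_actual_tokens=16, max_query_len=2)
try:
    check_decode_only(m)
except AssertionError as err:
    print(err)  # the decode-only capture invariant no longer holds
```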
We also find that, when running DeepSeek with MTP, MLACommonImpl.forward always calls self._forward_prefill and self._forward_decode is never called, while the cudagraph capture is decode-only and requires max_query_len = 1. This may cause a conflict, but I'm not sure.
class MLACommonImpl(MLAAttentionImpl[M], Generic[M]):
    ...
    def forward(
        ...
    ) -> torch.Tensor:
        ...
        if has_prefill:
            output[num_decode_tokens:] = self._forward_prefill(
                prefill_q, prefill_k_c_normed, prefill_k_pe, kv_cache,
                attn_metadata)

        if has_decode:  # When running DeepSeek R1 with MTP, has_decode is always False
            ...
            output[:num_decode_tokens] = self._forward_decode(
                decode_ql_nope, decode_q_pe, kv_cache, attn_metadata)
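My guess (not verified in the code, so the snippet below is only an illustrative sketch with made-up names, not the real vLLM split logic) is that requests are classified as decode vs. prefill by their query length; with one MTP draft token every request has a query length of 2, so everything lands on the prefill path and has_decode stays False:

```python
def split_by_query_len(query_lens: list[int], decode_threshold: int = 1):
    """Toy split: requests with query_len <= threshold go to the decode path,
    the rest go to the prefill path. Hypothetical helper, for illustration only."""
    decodes = [q for q in query_lens if q <= decode_threshold]
    prefills = [q for q in query_lens if q > decode_threshold]
    return decodes, prefills

# Plain decoding: every request has query_len == 1 -> all requests hit
# _forward_decode, and has_decode is True.
print(split_by_query_len([1, 1, 1]))   # ([1, 1, 1], [])

# With one MTP draft token: every request has query_len == 2 -> everything is
# routed to _forward_prefill, so has_decode stays False, as observed above.
print(split_by_query_len([2, 2, 2]))   # ([], [2, 2, 2])
```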
So, is there any way to support DeepSeek MTP with full cudagraph?
@ProExpertProg @LucasWilkinson @zixi-qi @YaoJiayi Looking forward to your reply.
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.