Commit f89ad25

noamgat authored and LeiWang1999 committed
[Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (vllm-project#6501)
Signed-off-by: LeiWang1999 <[email protected]>
1 parent d72f51a commit f89ad25

1 file changed (+3, -2 lines)


vllm/attention/backends/flashinfer.py

Lines changed: 3 additions & 2 deletions
@@ -20,6 +20,7 @@
 from vllm.attention.backends.utils import (PAD_SLOT_ID, compute_slot_mapping,
                                            compute_slot_mapping_start_idx,
                                            is_block_tables_empty)
+from vllm.attention.ops.paged_attn import PagedAttention
 from vllm.sequence import SequenceGroupMetadata
 from vllm.utils import get_kv_cache_torch_dtype, make_tensor_with_pad
 
@@ -61,14 +62,14 @@ def swap_blocks(
         dst_kv_cache: torch.Tensor,
         src_to_dst: torch.Tensor,
     ) -> None:
-        raise NotImplementedError
+        PagedAttention.swap_blocks(src_kv_cache, dst_kv_cache, src_to_dst)
 
     @staticmethod
     def copy_blocks(
         kv_caches: List[torch.Tensor],
         src_to_dists: torch.Tensor,
     ) -> None:
-        raise NotImplementedError
+        PagedAttention.copy_blocks(kv_caches, src_to_dists)
 
     @staticmethod
     def get_supported_head_sizes() -> List[int]:
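
For context, here is a minimal sketch of how the two cache-management hooks in vllm/attention/backends/flashinfer.py read after this patch. The method signatures and the PagedAttention calls are taken directly from the diff above; the enclosing class name (FlashInferBackend) and the inline comments are assumptions added here for illustration, and all other methods of the backend are omitted.

from typing import List

import torch

from vllm.attention.ops.paged_attn import PagedAttention


class FlashInferBackend:
    # Only the two methods touched by this commit are shown; the rest of the
    # backend class is omitted.

    @staticmethod
    def swap_blocks(
        src_kv_cache: torch.Tensor,
        dst_kv_cache: torch.Tensor,
        src_to_dst: torch.Tensor,
    ) -> None:
        # Forward to the PagedAttention helper that moves KV-cache blocks
        # from the source cache to the destination cache, instead of raising
        # NotImplementedError as before this commit.
        PagedAttention.swap_blocks(src_kv_cache, dst_kv_cache, src_to_dst)

    @staticmethod
    def copy_blocks(
        kv_caches: List[torch.Tensor],
        src_to_dists: torch.Tensor,
    ) -> None:
        # Forward to the PagedAttention helper that copies KV-cache blocks
        # within the given caches.
        PagedAttention.copy_blocks(kv_caches, src_to_dists)

Before this commit both hooks raised NotImplementedError, so any request path that needed to swap or copy KV-cache blocks while running with the FlashInfer backend would crash, which is presumably the failure the commit title refers to (vllm-project#6501). Forwarding to the PagedAttention helpers works here presumably because the FlashInfer backend keeps its KV cache in the same paged block layout those helpers operate on.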
