Commit f68b6ab

[Docs] Fix 1-2-3 list in v1/prefix_caching.md
Signed-off-by: windsonsea <[email protected]>
1 parent 19108ef commit f68b6ab

File tree: 1 file changed (+11, -11)

docs/design/v1/prefix_caching.md

Lines changed: 11 additions & 11 deletions
@@ -117,8 +117,8 @@ There are two design points to highlight:
 
 1. We allocate all KVCacheBlocks as a block pool when initializing the KV cache manager. This avoids Python object creation overhead and makes it easy to track all blocks at all times.
 2. We introduce doubly linked list pointers directly in the KVCacheBlock, so that we can construct a free queue directly. This gives us two benefits:
-   1. We get O(1) complexity when moving an element from the middle of the queue to the tail.
-   2. We avoid introducing another Python queue (e.g., `deque`), which wraps each element.
+    1. We get O(1) complexity when moving an element from the middle of the queue to the tail.
+    2. We avoid introducing another Python queue (e.g., `deque`), which wraps each element.
 
 As a result, we will have the following components when the KV cache manager is initialized:
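The two design points above map to only a little Python. Below is a minimal sketch, assuming hypothetical names (`KVCacheBlock`, `FreeKVCacheBlockQueue`, `ref_cnt`, `block_hash`, `num_free`); it illustrates the pre-allocated block pool and the intrusive doubly linked free queue, not vLLM's actual implementation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVCacheBlock:
    block_id: int
    ref_cnt: int = 0                   # number of requests using this block
    block_hash: Optional[int] = None   # set once the block is full and cached
    # The list pointers live on the block itself, so no wrapper object
    # (as a `deque` would introduce) is needed.
    prev: Optional["KVCacheBlock"] = None
    next: Optional["KVCacheBlock"] = None


class FreeKVCacheBlockQueue:
    """LRU free queue: allocate and evict from the head, free to the tail."""

    def __init__(self, blocks: list[KVCacheBlock]) -> None:
        # The entire pre-allocated block pool starts out free.
        for a, b in zip(blocks, blocks[1:]):
            a.next, b.prev = b, a
        self.head: Optional[KVCacheBlock] = blocks[0]
        self.tail: Optional[KVCacheBlock] = blocks[-1]
        self.num_free = len(blocks)

    def remove(self, blk: KVCacheBlock) -> None:
        # O(1) unlink from anywhere in the queue (benefit 1 above).
        if blk.prev is None:
            self.head = blk.next
        else:
            blk.prev.next = blk.next
        if blk.next is None:
            self.tail = blk.prev
        else:
            blk.next.prev = blk.prev
        blk.prev = blk.next = None
        self.num_free -= 1

    def popleft(self) -> KVCacheBlock:
        # Pop the least recently used block for allocation or eviction.
        assert self.head is not None, "no free blocks left"
        blk = self.head
        self.remove(blk)
        return blk

    def append(self, blk: KVCacheBlock) -> None:
        # Freed blocks go to the tail, so they are the last to be evicted.
        if self.tail is None:
            self.head = self.tail = blk
        else:
            self.tail.next, blk.prev = blk, self.tail
            self.tail = blk
        self.num_free += 1
```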

@@ -135,10 +135,10 @@ As a result, we will have the following components when the KV cache manager is initialized:
 
 **New request:** Workflow for the scheduler to schedule a new request with KV cache block allocation:
 
-1. The scheduler calls `kv_cache_manager.get_computed_blocks()` to get a sequence of blocks that have already been computed. This is done by hashing the prompt tokens in the request and looking up Cache Blocks.
+1. The scheduler calls `kv_cache_manager.get_computed_blocks()` to get a sequence of blocks that have already been computed. This is done by hashing the prompt tokens in the request and looking up cache blocks.
 2. The scheduler calls `kv_cache_manager.allocate_slots()`. It does the following steps:
-   1. Compute the number of new required blocks, and return if there are not enough blocks to allocate.
-   2. “Touch” the computed blocks. This increases the reference count of each computed block by one, and removes the block from the free queue if it wasn’t already used by other requests. This prevents the computed blocks from being evicted. See the example in the next section for illustration.
-   3. Allocate new blocks by popping blocks from the head of the free queue. If the head block is a cached block, this also “evicts” the block so that no other requests can reuse it from now on.
-   4. If an allocated block is already full of tokens, we immediately add it to the Cache Block, so that the block can be reused by other requests in the same batch.
+    1. Compute the number of new required blocks, and return if there are not enough blocks to allocate.
+    2. “Touch” the computed blocks. This increases the reference count of each computed block by one, and removes the block from the free queue if it wasn’t already used by other requests. This prevents the computed blocks from being evicted. See the example in the next section for illustration.
+    3. Allocate new blocks by popping blocks from the head of the free queue. If the head block is a cached block, this also “evicts” the block so that no other requests can reuse it from now on.
+    4. If an allocated block is already full of tokens, we immediately add it to the cache block, so that the block can be reused by other requests in the same batch.
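To make step 1 above concrete, here is a rough sketch of the hash-and-lookup idea, reusing the `KVCacheBlock` sketch from earlier. `BLOCK_SIZE`, `cached_block_map`, and the plain `hash()` call are simplifying assumptions; vLLM's real block hash covers more than the raw token IDs:

```python
from typing import Optional

BLOCK_SIZE = 4  # assumed block size for this sketch


def get_computed_blocks(
    token_ids: list[int],
    cached_block_map: dict[int, KVCacheBlock],
) -> list[KVCacheBlock]:
    """Return the longest prefix of full blocks that is already cached."""
    computed: list[KVCacheBlock] = []
    parent_hash: Optional[int] = None
    for i in range(len(token_ids) // BLOCK_SIZE):
        chunk = tuple(token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
        # Chaining the parent hash makes identical chunks at different
        # positions in the prompt hash differently.
        block_hash = hash((parent_hash, chunk))
        block = cached_block_map.get(block_hash)
        if block is None:
            break  # the first cache miss ends the computed prefix
        computed.append(block)
        parent_hash = block_hash
    return computed
```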

@@ -145,9 +145,9 @@
 **Running request:** Workflow for the scheduler to schedule a running request with KV cache block allocation:
 
 1. The scheduler calls `kv_cache_manager.allocate_slots()`. It does the following steps:
-   1. Compute the number of new required blocks, and return if there are not enough blocks to allocate.
-   2. Allocate new blocks by popping blocks from the head of the free queue. If the head block is a cached block, this also “evicts” the block so that no other requests can reuse it from now on.
-   3. Append token IDs to the slots in existing blocks as well as the new blocks. If a block is full, we add it to the Cache Block to cache it.
+    1. Compute the number of new required blocks, and return if there are not enough blocks to allocate.
+    2. Allocate new blocks by popping blocks from the head of the free queue. If the head block is a cached block, this also “evicts” the block so that no other requests can reuse it from now on.
+    3. Append token IDs to the slots in existing blocks as well as the new blocks. If a block is full, we add it to the cache block to cache it.
 
 **Duplicated blocks**
 Assuming the block size is 4 and you send a request (Request 1) with prompt ABCDEF and decoding length 3:
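As a quick arithmetic check on the example being set up above: with a block size of 4, prompt ABCDEF is 6 tokens, so Request 1 starts with one full block (ABCD) and one partial block (EF); after 3 decoded tokens it holds 9 tokens, i.e. ceil(9/4) = 3 blocks. The running-request steps can be sketched as follows, reusing `BLOCK_SIZE`, `FreeKVCacheBlockQueue`, and `cached_block_map` from the earlier sketches; the function signature and the `req` fields are hypothetical, not vLLM's API:

```python
import math


def allocate_slots_running(req, new_token_ids, free_queue, cached_block_map):
    """Sketch of the running-request steps 1-3 above.

    `req` is assumed to carry: blocks (list[KVCacheBlock]), token_ids
    (all tokens so far), and parent_hash (hash of its last full block).
    """
    # Step 1: compute the number of new required blocks; give up if the
    # free queue cannot cover them.
    num_required = math.ceil((len(req.token_ids) + len(new_token_ids)) / BLOCK_SIZE)
    num_new = num_required - len(req.blocks)
    if num_new > free_queue.num_free:
        return None  # not enough blocks to allocate

    # Step 2: pop from the head of the free queue; a popped block that
    # is still cached is evicted, so no other request can reuse it.
    for _ in range(num_new):
        blk = free_queue.popleft()
        if blk.block_hash is not None:
            del cached_block_map[blk.block_hash]
            blk.block_hash = None
        blk.ref_cnt = 1
        req.blocks.append(blk)

    # Step 3: append token IDs (standing in for writing them into the
    # block slots); a block that becomes full is hashed and published
    # to the cache immediately.
    for tok in new_token_ids:
        req.token_ids.append(tok)
        if len(req.token_ids) % BLOCK_SIZE == 0:
            blk = req.blocks[len(req.token_ids) // BLOCK_SIZE - 1]
            blk.block_hash = hash((req.parent_hash, tuple(req.token_ids[-BLOCK_SIZE:])))
            cached_block_map[blk.block_hash] = blk
            req.parent_hash = blk.block_hash
    return req.blocks
```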
@@ -199,5 +199,5 @@ When a request is finished, we free all its blocks if no other requests are using them
 When the head block (the least recently used block) of the free queue is cached, we have to evict the block to prevent it from being used by other requests. Specifically, eviction involves the following steps:
 
 1. Pop the block from the head of the free queue. This is the LRU block to be evicted.
-2. Remove the block ID from the Cache Block.
+2. Remove the block ID from the cache block.
 3. Remove the block hash.
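These three steps map directly onto the earlier sketches. A minimal hypothetical helper, with the cached-block mapping simplified to a flat dict from block hash to block:

```python
def evict_lru_block(free_queue, cached_block_map):
    # Step 1: pop the LRU block from the head of the free queue.
    blk = free_queue.popleft()
    if blk.block_hash is not None:
        # Step 2: remove the block from the cached-block mapping.
        del cached_block_map[blk.block_hash]
        # Step 3: remove the block hash, leaving the block a clean
        # slate for its next allocation.
        blk.block_hash = None
    return blk
```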

## Example
