GPUs: 2x 4090 (2x24GB)
Regarding my long context issue with CodeLlama above:
- vLLM 0.2.3: `# GPU blocks: 1464, # CPU blocks: 1310`
- vLLM 0.2.4: `# GPU blocks: 1464, # CPU blocks: 1310`
- vLLM 0.2.5 and `main`: `# GPU blocks: 112, # CPU blocks: 1310`
Something broke in the VRAM profiling (or in a step before it) that prevents vLLM from using the remaining VRAM for the KV cache. The profiling already reports values that are far too low, and there is no way to override them manually from the command line. Both GPUs had ~8 GB of free VRAM after loading the model, so vLLM simply fails to allocate it as cache.
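For reference, a rough back-of-the-envelope for how many KV-cache blocks that free memory should translate into. This is only an illustration under assumed parameters (standard Llama-2-13B geometry: 40 layers, 40 KV heads, head dim 128, fp16 cache), not vLLM's actual profiling code, but it lines up with the reported block counts:

```python
# Back-of-the-envelope KV-cache sizing (illustration only, not vLLM's actual
# profiling code). Assumes the usual Llama-2-13B geometry and an fp16 cache.
GiB = 1024 ** 3

num_layers      = 40
num_kv_heads    = 40
head_dim        = 128
dtype_bytes     = 2       # fp16
tensor_parallel = 2
block_size      = 16      # tokens per KV-cache block (vLLM default)

# Per GPU worker: K and V for its shard of the heads, for every layer, per token.
bytes_per_token = 2 * num_layers * (num_kv_heads // tensor_parallel) * head_dim * dtype_bytes
bytes_per_block = bytes_per_token * block_size   # 6,553,600 bytes, ~6.6 MB per block

print(round(1464 * bytes_per_block / GiB, 1))    # ~8.9 GiB of KV cache on 0.2.3 / 0.2.4
print(round(112  * bytes_per_block / GiB, 1))    # ~0.7 GiB of KV cache on 0.2.5 / main
print(int(8 * GiB / bytes_per_block))            # 1310 -> matches the CPU block count (--swap-space=8)
```

The 1310 CPU blocks match the 8 GB `--swap-space`, so by the same math the 112 GPU blocks on 0.2.5 correspond to well under 1 GiB of cache per GPU, even though ~8 GB is free.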
Command:
```shell
python -O -u -m vllm.entrypoints.openai.api_server \
--model=TheBloke/CodeLlama-13B-Instruct-fp16 \
--chat-template=$HOME/bin/templates/llama-2-chat.jinja \
--served-model-name=model \
--host=0.0.0.0 \
--port=8000 \
--max-model-len=16384 \
--max-num-seqs=16 \
--tensor-parallel-size=2 \
--swap-space=8 \
--gpu-memory-utilization=0.95 \
--disable-log-requests
```
Tested OK up to the full 16k context window on vLLM 0.2.3 and 0.2.4. On 0.2.5 the test fails as soon as the sequence exceeds roughly 1700 tokens (I suspect the exact limit is 112 * 16 = 1792, i.e. the number of GPU blocks times the block size of 16).
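Spelling out that block-to-token arithmetic (block size 16 is the vLLM default):

```python
block_size = 16          # tokens per KV-cache block (vLLM default)

print(112  * block_size) # 1792  -> roughly where requests start failing on 0.2.5
print(1464 * block_size) # 23424 -> comfortably covers the full 16384-token window
```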
vLLM 0.2.5 (and `main`) works fine with TheBloke/deepseek-coder-33B-instruct-AWQ; the problem does not happen with that model.
The use of `--chat-template` does not affect the problem; it is only there to get the chat template right (same as Llama-2).
I've tried changing all the meaningful command-line options in many ways; none of them helped.