GPUs: 2x 4090 (2x24GB)
Regarding my long context issue with CodeLlama above:
- vLLM 0.2.3: `# GPU blocks: 1464, # CPU blocks: 1310`
- vLLM 0.2.4: `# GPU blocks: 1464, # CPU blocks: 1310`
- vLLM 0.2.5 and `main`: `# GPU blocks: 112, # CPU blocks: 1310`
Something broke in the VRAM profiling (or in a step before it) that prevents vLLM from using the remaining VRAM for the KV cache. The profiling already reports values that are far too low, and there is no way to override them manually from the command line. Both GPUs had ~8 GB of free VRAM after loading the model, so vLLM simply fails to allocate it as cache.
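For reference, a rough back-of-the-envelope for how many KV-cache blocks that free memory should translate into. This is only an illustration under assumed parameters (standard Llama-2-13B geometry: 40 layers, 40 KV heads, head dim 128, fp16 cache), not vLLM's actual profiling code, but it lines up with the reported block counts:

```python
# Back-of-the-envelope KV-cache sizing (illustration only, not vLLM's actual
# profiling code). Assumes the usual Llama-2-13B geometry and an fp16 cache.
GiB = 1024 ** 3

num_layers      = 40
num_kv_heads    = 40
head_dim        = 128
dtype_bytes     = 2       # fp16
tensor_parallel = 2
block_size      = 16      # tokens per KV-cache block (vLLM default)

# Per GPU worker: K and V for its shard of the heads, for every layer, per token.
bytes_per_token = 2 * num_layers * (num_kv_heads // tensor_parallel) * head_dim * dtype_bytes
bytes_per_block = bytes_per_token * block_size   # 6,553,600 bytes, ~6.6 MB per block

print(round(1464 * bytes_per_block / GiB, 1))    # ~8.9 GiB of KV cache on 0.2.3 / 0.2.4
print(round(112  * bytes_per_block / GiB, 1))    # ~0.7 GiB of KV cache on 0.2.5 / main
print(int(8 * GiB / bytes_per_block))            # 1310 -> matches the CPU block count (--swap-space=8)
```

The 1310 CPU blocks match the 8 GB `--swap-space`, so by the same math the 112 GPU blocks on 0.2.5 correspond to well under 1 GiB of cache per GPU, even though ~8 GB is free.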
Command:
```shell
python -O -u -m vllm.entrypoints.openai.api_server \
--model=TheBloke/CodeLlama-13B-Instruct-fp16 \
--chat-template=$HOME/bin/templates/llama-2-chat.jinja \
--served-model-name=model \
--host=0.0.0.0 \
--port=8000 \
--max-model-len=16384 \
--max-num-seqs=16 \
--tensor-parallel-size=2 \
--swap-space=8 \
--gpu-memory-utilization=0.95 \
--disable-log-requests
```
Tested OK up to the full 16k context window on vLLM 0.2.3 and 0.2.4. On 0.2.5 the test fails as soon as the sequence exceeds roughly 1700 tokens (I suspect the exact limit is 112 * 16 = 1792, i.e. the number of GPU blocks times the block size of 16).
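Spelling out that block-to-token arithmetic (block size 16 is the vLLM default):

```python
block_size = 16          # tokens per KV-cache block (vLLM default)

print(112  * block_size) # 1792  -> roughly where requests start failing on 0.2.5
print(1464 * block_size) # 23424 -> comfortably covers the full 16384-token window
```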
vLLM 0.2.5 (and `main`) works fine with TheBloke/deepseek-coder-33B-instruct-AWQ; the problem does not happen with that model.
The use of `--chat-template` does not affect the problem; it is only there to get the chat template right (same as Llama-2).
I've tried changing all the meaningful command-line options in many ways; none of them helped.