
KV cache is low, memory profiling does not see the remaining VRAM #2136

@viktor-ferenczi

Description

GPUs: 2x 4090 (2x24GB)

Regarding my long context issue with CodeLlama above:

  • vLLM 0.2.3: # GPU blocks: 1464, # CPU blocks: 1310
  • vLLM 0.2.4: # GPU blocks: 1464, # CPU blocks: 1310
  • vLLM 0.2.5 and main: # GPU blocks: 112, # CPU blocks: 1310

Something broke in the VRAM profiling (or earlier) that prevents vLLM from using the remaining VRAM for the KV cache. The profiling step already reports far too few blocks, and there is no command-line option to override it. Both GPUs still had ~8GB of free VRAM after loading the model, so vLLM simply fails to allocate it as cache.
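
For reference, the free-VRAM figure can be double-checked outside vLLM with plain PyTorch. A minimal sketch (assumes both GPUs are visible to the process; run it after the model weights are loaded):

import torch

# Report free/total memory on each visible GPU (values in GiB).
# Whatever is free here is what should be available for the KV cache.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")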

Command:

python -O -u -m vllm.entrypoints.openai.api_server \
  --model=TheBloke/CodeLlama-13B-Instruct-fp16 \
  --chat-template=$HOME/bin/templates/llama-2-chat.jinja \
  --served-model-name=model \
  --host=0.0.0.0 \
  --port=8000 \
  --max-model-len=16384 \
  --max-num-seqs=16 \
  --tensor-parallel-size=2 \
  --swap-space=8 \
  --gpu-memory-utilization=0.95 \
  --disable-log-requests

This setup tested OK up to the full 16k context window on vLLM 0.2.3 and 0.2.4. On 0.2.5 the test fails whenever the sequence exceeds roughly 1700 tokens. (I think the exact limit is 112 * 16 = 1792 tokens, i.e. the 112 GPU blocks times the block size of 16.)
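
A back-of-the-envelope estimate also suggests that far more than 112 blocks should fit into ~8GB of free VRAM. A rough sketch, assuming the stock CodeLlama-13B config (40 layers, 40 KV heads, head dim 128, fp16) and tensor-parallel size 2; this is the usual KV cache size formula, not vLLM's exact accounting:

# Approximate size of one KV cache block on one GPU (tensor parallel size 2).
layers = 40                 # CodeLlama-13B
kv_heads_per_gpu = 40 // 2  # no GQA; the 40 heads are split across the two GPUs
head_dim = 128
block_size = 16             # tokens per block
bytes_per_elem = 2          # fp16

# key + value tensors for every token in the block
block_bytes = 2 * block_size * layers * kv_heads_per_gpu * head_dim * bytes_per_elem
print(f"~{block_bytes / 2**20:.2f} MiB per block per GPU")  # ~6.25 MiB

free_vram = 8 * 2**30  # ~8 GiB left after loading the weights
print(f"~{free_vram // block_bytes} blocks would fit")      # ~1310

That is the same order of magnitude as the 1464 GPU blocks reported by 0.2.3/0.2.4, and nowhere near 112.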

vLLM 0.2.5 (and main) works fine with TheBloke/deepseek-coder-33B-instruct-AWQ; the problem does not happen with that model.

Using --chat-template does not affect the problem; it is only there to get the chat template right (same as Llama-2).

I've tried changing all the relevant command line options in many combinations; none of them helped.
