Commit f78aeb9

Long Context usage README update (#836)
Signed-off-by: Iryna Boiko <[email protected]>
1 parent acf2cd5 commit f78aeb9


README_GAUDI.md

Lines changed: 57 additions & 0 deletions
@@ -399,6 +399,63 @@ However, disabling this feature in production environments is not recommended, a
> - `VLLM_ENGINE_ITERATION_TIMEOUT_S` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
> - `VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes.

# Long Context Support

The long context feature enables support for context windows exceeding 32K tokens. The following models are supported:

- [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
- [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b)
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
- [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
- [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)
## Environment Variables Settings

Set the following environment variables to avoid OOM or functional issues. Additional environment variable settings depend on the context length; example invocations are sketched after each flag list below:

- `VLLM_ENGINE_ITERATION_TIMEOUT_S=3600`
- `VLLM_RPC_TIMEOUT=100000`
- `VLLM_PROMPT_USE_FUSEDSDPA=1`
- `PT_HPU_ENABLE_LAZY_COLLECTIVES=true`
- `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1`
- `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1`
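
For convenience, these settings can be exported in the shell that launches the server. The variable values below come from the list above; the surrounding workflow is only a minimal sketch:

```bash
# General long-context settings (values taken from the list above).
export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
export VLLM_RPC_TIMEOUT=100000
export VLLM_PROMPT_USE_FUSEDSDPA=1
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
```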

**32K context length flags examples:**

- `VLLM_GRAPH_RESERVED_MEM` - The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B.
- `VLLM_PROMPT_BS_BUCKET_MIN=1` - Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs.
- `VLLM_PROMPT_BS_BUCKET_STEP=16` - Suggested value, depends on the model. Increasing the step value results in fewer buckets. If an OOM error occurs, the value should be increased.
- `VLLM_PROMPT_BS_BUCKET_MAX=16` - Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs.
- `VLLM_PROMPT_SEQ_BUCKET_MIN=24576` - Suggested value, depends on warmup results.
- `VLLM_PROMPT_SEQ_BUCKET_STEP=2048` - Suggested value, depends on warmup results. It is recommended to increase it for faster warmup; `VLLM_PROMPT_SEQ_BUCKET_STEP=16384` is the suggested value for Intel Gaudi 3.
- `VLLM_PROMPT_SEQ_BUCKET_MAX=32768` - Value for a context length of 32K. Use 16384 for 16K.
- `VLLM_DECODE_BLOCK_BUCKET_MIN=1024` - Suggested value, depends on warmup results.
- `VLLM_DECODE_BLOCK_BUCKET_STEP=1024` - Suggested value, depends on warmup results.
- `VLLM_DECODE_BLOCK_BUCKET_MAX=33792` - Calculated as `max_num_seqs * max_decode_seq // block_size`, where `max_decode_seq` is the sum of the input and output sequence lengths. For example:
  - `128 * ((32 + 1) * 1024) / 128`
  - `32 * ((32 + 1) * 1024) / 128`
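
To make the interplay of these flags concrete, here is a minimal 32K-context launch sketch for Llama-3.1-8B. The environment variable values are taken from the list above; the `vllm serve` invocation, `--block-size 128`, and `--max-num-seqs 128` (which matches the first `VLLM_DECODE_BLOCK_BUCKET_MAX` example) are illustrative assumptions rather than values prescribed by this README:

```bash
# 32K context length flags for Llama-3.1-8B (values from the list above).
export VLLM_GRAPH_RESERVED_MEM=0.02
export VLLM_PROMPT_BS_BUCKET_MIN=1
export VLLM_PROMPT_BS_BUCKET_STEP=16
export VLLM_PROMPT_BS_BUCKET_MAX=16
export VLLM_PROMPT_SEQ_BUCKET_MIN=24576
export VLLM_PROMPT_SEQ_BUCKET_STEP=2048
export VLLM_PROMPT_SEQ_BUCKET_MAX=32768
export VLLM_DECODE_BLOCK_BUCKET_MIN=1024
export VLLM_DECODE_BLOCK_BUCKET_STEP=1024

# max_num_seqs * max_decode_seq // block_size, e.g. 128 * ((32 + 1) * 1024) / 128 = 33792.
export VLLM_DECODE_BLOCK_BUCKET_MAX=$(( 128 * (32 + 1) * 1024 / 128 ))

# Illustrative launch; model, block size, and batch size are assumptions, not requirements.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --block-size 128 \
    --max-model-len 32768 \
    --max-num-seqs 128
```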
## Batch Size Settings

The default `batch_size=256` is not optimal for long contexts (8K+). Recompilations may occur if there is not enough KV cache space for some sequence groups.

If recompilation or recomputation warnings appear during inference, reduce `batch_size` to improve stability, as in the sketch below.
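
In vLLM, the maximum batch size is controlled by the `--max-num-seqs` engine argument (default 256). The following is a minimal sketch of lowering it; the model and the value 64 are illustrative assumptions:

```bash
# Reduce the maximum batch size for long-context serving (64 is an illustrative value).
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --max-model-len 32768 \
    --max-num-seqs 64
```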
**Recompilation message example:**

```bash
Configuration: (prompt, 1, 36864) was not warmed-up!
```

**Warning message example:**

```bash
Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory.
```

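The remedies named in this warning map to the `--gpu-memory-utilization` and `--tensor-parallel-size` server arguments. A brief sketch; the model and the values are illustrative assumptions only:

```bash
# Give the KV cache more memory, or shard the model across more devices (illustrative values).
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 8
```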

**Usage of Multi-Step Scheduling feature**

Enabling Multi-Step Scheduling is recommended for better decode performance. Refer to vllm-project#6854 for more details.
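
For reference, multi-step scheduling is typically enabled through the `--num-scheduler-steps` engine argument; the sketch below assumes that argument and an illustrative value of 10, neither of which is prescribed by this README:

```bash
# Run several decode steps per scheduler invocation (value is illustrative).
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --num-scheduler-steps 10
```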
# Troubleshooting

The following steps address Out of Memory related errors:
