Commit f78aeb9

Long Context usage README update (#836)
Signed-off-by: Iryna Boiko <[email protected]>
1 parent acf2cd5 commit f78aeb9


README_GAUDI.md

Lines changed: 57 additions & 0 deletions
@@ -399,6 +399,63 @@ However, disabling this feature in production environments is not recommended, a
> - `VLLM_ENGINE_ITERATION_TIMEOUT_S` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
> - `VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes.

# Long Context Support

The long context feature enables support for context windows exceeding 32K tokens. The following models are supported:

- [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
- [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b)
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
- [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
- [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)
## Environment Variables Settings

Set the following environment variables to avoid OOM or functional issues. Additional environment variable settings depend on the context length; example invocations are sketched after each flag list below:

- `VLLM_ENGINE_ITERATION_TIMEOUT_S=3600`
- `VLLM_RPC_TIMEOUT=100000`
- `VLLM_PROMPT_USE_FUSEDSDPA=1`
- `PT_HPU_ENABLE_LAZY_COLLECTIVES=true`
- `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1`
- `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1`
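
For convenience, these settings can be exported in the shell that launches the server. The variable values below come from the list above; the surrounding workflow is only a minimal sketch:

```bash
# General long-context settings (values taken from the list above).
export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
export VLLM_RPC_TIMEOUT=100000
export VLLM_PROMPT_USE_FUSEDSDPA=1
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
```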

**32K context length flags examples:**

- `VLLM_GRAPH_RESERVED_MEM` - The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B.
- `VLLM_PROMPT_BS_BUCKET_MIN=1` - Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs.
- `VLLM_PROMPT_BS_BUCKET_STEP=16` - Suggested value, depends on the model. Increasing the step value results in fewer buckets. If an OOM error occurs, the value should be increased.
- `VLLM_PROMPT_BS_BUCKET_MAX=16` - Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs.
- `VLLM_PROMPT_SEQ_BUCKET_MIN=24576` - Suggested value, depends on warmup results.
- `VLLM_PROMPT_SEQ_BUCKET_STEP=2048` - Suggested value, depends on warmup results. It is recommended to increase it for faster warmup; `VLLM_PROMPT_SEQ_BUCKET_STEP=16384` is the suggested value for Intel Gaudi 3.
- `VLLM_PROMPT_SEQ_BUCKET_MAX=32768` - Value for a context length of 32K. Use 16384 for 16K.
- `VLLM_DECODE_BLOCK_BUCKET_MIN=1024` - Suggested value, depends on warmup results.
- `VLLM_DECODE_BLOCK_BUCKET_STEP=1024` - Suggested value, depends on warmup results.
- `VLLM_DECODE_BLOCK_BUCKET_MAX=33792` - Calculated as `max_num_seqs * max_decode_seq // block_size`, where `max_decode_seq` is the sum of the input and output sequence lengths. For example:
  - `128 * ((32 + 1) * 1024) / 128`
  - `32 * ((32 + 1) * 1024) / 128`
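
To make the interplay of these flags concrete, here is a minimal 32K-context launch sketch for Llama-3.1-8B. The environment variable values are taken from the list above; the `vllm serve` invocation, `--block-size 128`, and `--max-num-seqs 128` (which matches the first `VLLM_DECODE_BLOCK_BUCKET_MAX` example) are illustrative assumptions rather than values prescribed by this README:

```bash
# 32K context length flags for Llama-3.1-8B (values from the list above).
export VLLM_GRAPH_RESERVED_MEM=0.02
export VLLM_PROMPT_BS_BUCKET_MIN=1
export VLLM_PROMPT_BS_BUCKET_STEP=16
export VLLM_PROMPT_BS_BUCKET_MAX=16
export VLLM_PROMPT_SEQ_BUCKET_MIN=24576
export VLLM_PROMPT_SEQ_BUCKET_STEP=2048
export VLLM_PROMPT_SEQ_BUCKET_MAX=32768
export VLLM_DECODE_BLOCK_BUCKET_MIN=1024
export VLLM_DECODE_BLOCK_BUCKET_STEP=1024

# max_num_seqs * max_decode_seq // block_size, e.g. 128 * ((32 + 1) * 1024) / 128 = 33792.
export VLLM_DECODE_BLOCK_BUCKET_MAX=$(( 128 * (32 + 1) * 1024 / 128 ))

# Illustrative launch; model, block size, and batch size are assumptions, not requirements.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --block-size 128 \
    --max-model-len 32768 \
    --max-num-seqs 128
```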
## Batch Size Settings

The default `batch_size=256` is not optimal for long contexts (8K+). Recompilations may occur if there is not enough KV cache space for some sequence groups.

If recompilation or recomputation warnings appear during inference, reduce `batch_size` to improve stability, as in the sketch below.
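
In vLLM, the maximum batch size is controlled by the `--max-num-seqs` engine argument (default 256). The following is a minimal sketch of lowering it; the model and the value 64 are illustrative assumptions:

```bash
# Reduce the maximum batch size for long-context serving (64 is an illustrative value).
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --max-model-len 32768 \
    --max-num-seqs 64
```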
**Recompilation message example:**

```bash
Configuration: (prompt, 1, 36864) was not warmed-up!
```

**Warning message example:**

```bash
Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory.
```

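The remedies named in this warning map to the `--gpu-memory-utilization` and `--tensor-parallel-size` server arguments. A brief sketch; the model and the values are illustrative assumptions only:

```bash
# Give the KV cache more memory, or shard the model across more devices (illustrative values).
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 8
```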

**Usage of Multi-Step Scheduling feature**

Enabling Multi-Step Scheduling is recommended for better decode performance. Refer to vllm-project#6854 for more details.
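
For reference, multi-step scheduling is typically enabled through the `--num-scheduler-steps` engine argument; the sketch below assumes that argument and an illustrative value of 10, neither of which is prescribed by this README:

```bash
# Run several decode steps per scheduler invocation (value is illustrative).
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --num-scheduler-steps 10
```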
# Troubleshooting

The following steps address Out of Memory related errors:
