> - `VLLM_ENGINE_ITERATION_TIMEOUT_S` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
> - `VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes (see the export sketch below).
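
For example, a minimal sketch of raising both timeouts to 10 minutes before launching the server, using the example values from the notes above:

```bash
# Raise the engine iteration timeout to 600 seconds (10 minutes).
export VLLM_ENGINE_ITERATION_TIMEOUT_S=600
# Raise the RPC timeout to 600000 milliseconds (10 minutes).
export VLLM_RPC_TIMEOUT=600000
```
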
# Long Context Support

The long context feature enables support for a token context window exceeding 32K tokens. It is supported by the following models:

Set the following environment variables to avoid OOM or functional issues. Additional environment variable settings depend on context length; a combined export sketch follows the list below:

- `VLLM_ENGINE_ITERATION_TIMEOUT_S=3600`
- `VLLM_RPC_TIMEOUT=100000`
- `VLLM_PROMPT_USE_FUSEDSDPA=1`
- `PT_HPU_ENABLE_LAZY_COLLECTIVES=true`
- `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1`
- `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1`
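
As a starting point, the settings above can be exported before starting the server. This is a minimal sketch using only the values listed above, not a tuned configuration:

```bash
# Baseline long-context settings from the list above; tune per model and context length.
export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
export VLLM_RPC_TIMEOUT=100000
export VLLM_PROMPT_USE_FUSEDSDPA=1
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
```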

**Example flags for 32K context length:**

- `VLLM_GRAPH_RESERVED_MEM` - The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B.
- `VLLM_PROMPT_BS_BUCKET_MIN=1` - Suggested value, depends on the model. You can increase it until an OOM error occurs, or decrease it if OOM occurs.
- `VLLM_PROMPT_BS_BUCKET_STEP=16` - Suggested value, depends on the model. Increasing the step value results in fewer buckets. If an OOM error occurs, increase the value.
- `VLLM_PROMPT_BS_BUCKET_MAX=16` - Suggested value, depends on the model. You can increase it until an OOM error occurs, or decrease it if OOM occurs.
- `VLLM_PROMPT_SEQ_BUCKET_MIN=24576` - Suggested value, depends on warmup results.
- `VLLM_PROMPT_SEQ_BUCKET_STEP=2048` - Suggested value, depends on warmup results. Increasing it to a higher value speeds up warmup. `VLLM_PROMPT_SEQ_BUCKET_STEP=16384` is the suggested value for Intel Gaudi 3.
- `VLLM_PROMPT_SEQ_BUCKET_MAX=32768` - Value for a context length of 32K. Use 16384 for 16K.
- `VLLM_DECODE_BLOCK_BUCKET_MIN=1024` - Suggested value, depends on warmup results.
- `VLLM_DECODE_BLOCK_BUCKET_STEP=1024` - Suggested value, depends on warmup results.
- `VLLM_DECODE_BLOCK_BUCKET_MAX=33792` - Calculated as `max_num_seqs * max_decode_seq // self.block_size`, where `max_decode_seq` represents the sum of the input and output sequence lengths. For example (see the arithmetic sketch after this list):
  - `128 * ((32 + 1) * 1024) / 128`
  - `32 * ((32 + 1) * 1024) / 128`
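
The `VLLM_DECODE_BLOCK_BUCKET_MAX` examples above can be reproduced with shell arithmetic. This sketch assumes a block size of 128 and a combined input-plus-output length of `(32 + 1) * 1024` tokens, as in the examples:

```bash
# max_decode_seq is the sum of input and output sequence lengths, here (32 + 1) * 1024 = 33792 tokens.
block_size=128
max_decode_seq=$(( (32 + 1) * 1024 ))

# max_num_seqs=128 -> 128 * 33792 / 128 = 33792
echo $(( 128 * max_decode_seq / block_size ))

# max_num_seqs=32 -> 32 * 33792 / 128 = 8448
echo $(( 32 * max_decode_seq / block_size ))
```
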
## Batch Size Settings

The default `batch_size=256` is not optimal for long contexts (8K+). Recompilations may occur if there is not enough KV cache space for some sequence groups.

If recompilation or recomputation (preemption) warnings appear during inference, reduce `batch_size` to improve stability; a launch sketch follows the message examples below.

**Recompilation message example:**

```bash
Configuration: (prompt, 1, 36864) was not warmed-up!
```

**Warning message example:**

```bash
Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory.
```
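
If these messages appear, one mitigation is to lower the batch size and give the KV cache more headroom at launch. The sketch below assumes an example model name and that the batch size corresponds to vLLM's `--max-num-seqs` argument; the values shown are illustrative, not tuned recommendations:

```bash
# Illustrative values only: lower the batch size (max_num_seqs) and raise the
# fraction of device memory made available to vLLM's KV cache.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --max-num-seqs 64 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768
```
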

**Multi-Step Scheduling Feature Usage**

Enabling Multi-Step Scheduling is recommended for better decode performance. Refer to vllm-project#6854 for more details.
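
A minimal sketch of enabling it, assuming Multi-Step Scheduling is controlled by vLLM's `--num-scheduler-steps` argument and that 10 steps is a reasonable starting point (both are assumptions, not values taken from this document):

```bash
# Run multiple decode steps per scheduler invocation to reduce scheduling overhead.
vllm serve meta-llama/Llama-3.1-8B-Instruct --num-scheduler-steps 10
```
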
# Troubleshooting

The following steps address Out-of-Memory-related errors: