
Conversation

@iboiko-habana

Long Context usage README update

README_GAUDI.md Outdated
# Long context support
**Environment variable's setting**
Environment variables for OOM/functional issues avoiding.


for avoiding OOM/functional issues.

Author

Done

@tzielinski-habana left a comment


I added a lot of linguistic comments. Maybe a technical writer should take a look as well.

README_GAUDI.md Outdated
32K context length flags example:
VLLM_GRAPH_RESERVED_MEM. it's value depends on model and long context. VLLM_GRAPH_RESERVED_MEM=0.02 for llama3.1-8b. VLLM_GRAPH_RESERVED_MEM=0.1 for llama3.1-70b.
VLLM_PROMPT_BS_BUCKET_MIN=1 # proposal for usage. depends on model. Can be increased if no OOM
VLLM_PROMPT_BS_BUCKET_STEP=16 # proposal for usage. depends on model. Can be increased until no OOM or decreased if OOM


"Can be increased until no OOM or decreased if OOM"
Increasing the step value results in fewer buckets, so if the user gets OOM, they should increase the value.

Author

Done

README_GAUDI.md Outdated
VLLM_GRAPH_RESERVED_MEM. it's value depends on model and long context. VLLM_GRAPH_RESERVED_MEM=0.02 for llama3.1-8b. VLLM_GRAPH_RESERVED_MEM=0.1 for llama3.1-70b.
VLLM_PROMPT_BS_BUCKET_MIN=1 # proposal for usage. depends on model. Can be increased if no OOM
VLLM_PROMPT_BS_BUCKET_STEP=16 # proposal for usage. depends on model. Can be increased until no OOM or decreased if OOM
VLLM_PROMPT_BS_BUCKET_MAX=16 # proposal for usage. depends on model. Can be increased until no OOM or decreased if OOM


"Can be increased until no OOM or decreased if OOM"
I don't like this sentence. Maybe:
"Suggested value, depends on a model. You can increase it until you reach an OOM error, or decrease it if OOM occurs."

Author

Done

README_GAUDI.md Outdated
VLLM_PROMPT_SEQ_BUCKET_MAX=32768 # context length 32K, 16384 for 16K
VLLM_DECODE_BLOCK_BUCKET_MIN=1024 # proposal for usage. depends on warmup results
VLLM_DECODE_BLOCK_BUCKET_STEP=1024 # proposal for usage. depends on warmup results
VLLM_DECODE_BLOCK_BUCKET_MAX=33792 # max_num_seqs * max_decode_seq // self.block_size, where max_decode_seq is input + output # i.e. 128*((32+1)* 1024)/128 or 32*((32+1)*1024)/128


You have a comment inside a comment; just one comment suffices.
Also, "i.e." means "that is" or "in other words". If you want to say "for example", use "e.g." or "Example:". I think one example is fine.

Author

proposal: max_num_seqs * max_decode_seq // self.block_size, where max_decode_seq is input + output, i.e. 128*((32+1)* 1024)/128 or 32*((32+1)*1024)/128
Let's keep the real example if possible. It was a user request in one of the vllm-fork issues to provide more details on example usage.
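
For readers who want to verify the numbers in that example, the arithmetic can be checked directly in a shell. This is only a sanity check; the assumed values (block size of 128, 32K input plus 1K output tokens, max_num_seqs of 128 or 32) are the ones used in the example above.

```bash
# Reproduce the example calculation: max_num_seqs * max_decode_seq / block_size,
# where max_decode_seq = input + output tokens. Values are taken from the example above.
echo $(( 128 * ((32 + 1) * 1024) / 128 ))   # 33792, for max_num_seqs = 128
echo $(( 32  * ((32 + 1) * 1024) / 128 ))   # 8448,  for max_num_seqs = 32
```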

README_GAUDI.md Outdated
VLLM_DECODE_BLOCK_BUCKET_MAX=33792 # max_num_seqs * max_decode_seq // self.block_size, where max_decode_seq is input + output # i.e. 128*((32+1)* 1024)/128 or 32*((32+1)*1024)/128

**Batch size setting**
Usage of default batch_size=256 is not optimal for long context 8K+. Recompilation can occur due to not enough KV cache space for some sequence group.


(...) for long context (8K+). Recompilations can occur if there is not enough KV cache space for some sequence groups.

Author

done

README_GAUDI.md Outdated
Usage of default batch_size=256 is not optimal for long context 8K+. Recompilation can occur due to not enough KV cache space for some sequence group.

Please decrease batch_size if recompialtion or next recomputation warning occurs during inference run:
recompilation message example: "Configuration: (dprompt, 1, 36864) was not warmed-up!"


dprompt? Is this a typo?

Author

Thanks. Yes, this is a typo.
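
As a practical illustration of the batch-size advice in the snippet above, a server launch with a reduced batch size might look like the sketch below. This assumes the default batch_size=256 corresponds to vLLM's `--max-num-seqs` engine argument; the model name and the value 64 are placeholders, not recommendations.

```bash
# Hypothetical launch with a smaller batch size to avoid recompilation/preemption warnings.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 32768 \
    --max-num-seqs 64
```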

README_GAUDI.md Outdated
warning example: "Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory."

**Usage of Multi-Step Scheduling feature**
It is recommended to enable "Multi-Step Scheduling feature" feature for better decode performance. More details about feature: https://github.com/vllm-project/vllm/issues/6854


It is recommended to enable Multi-Step Scheduling for better decode performance. Here are more details about the feature: vllm-project#6854

Author

done
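
For context, enabling multi-step scheduling on the command line might look like the sketch below, assuming the feature maps to vLLM's `--num-scheduler-steps` engine argument discussed in vllm-project#6854; the value 10 and the model name are placeholders.

```bash
# Hypothetical launch with multi-step scheduling enabled for better decode performance.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --num-scheduler-steps 10
```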

README_GAUDI.md Outdated
It is recommended to enable "Multi-Step Scheduling feature" feature for better decode performance. More details about feature: https://github.com/vllm-project/vllm/issues/6854

**Gaudi3 usage**
All steps above are valid for gaudi3.


Gaudi3

Author

done

README_GAUDI.md Outdated

**Gaudi3 usage**
All steps above are valid for gaudi3.
It is recommended only to change VLLM_PROMPT_SEQ_BUCKET_STEP to higher value for faster warmup.


I'm not sure I understand this sentence.

@iboiko-habana

@tzielinski-habana thanks for the comments.
Final review will be requested from @kzawora-intel.

README_GAUDI.md Outdated
> - `VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in microseconds, e.g., 600000 equals 10 minutes.
# Long context support
**Environment variable's setting**


Lines 397-405:

Environment Variables Settings

Set the following environment variables to avoid OOM/functional issues. Additional environment variable settings depend on context length:

  • VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
  • VLLM_RPC_TIMEOUT=100000
  • VLLM_PROMPT_USE_FUSEDSDPA=1
  • PT_HPU_ENABLE_LAZY_COLLECTIVES=true
  • PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1
  • VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
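
Applied as shell exports before launching the server, that baseline could look like the following sketch; the values are copied from the suggested text above.

```bash
export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
export VLLM_RPC_TIMEOUT=100000
export VLLM_PROMPT_USE_FUSEDSDPA=1
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
```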

README_GAUDI.md Outdated
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
Other environment variables settings depend on context length.

32K context length flags example:


Lines 407-417:

32K context length flags examples:

  • VLLM_GRAPH_RESERVED_MEM - The value depends on the model and context length settings. Use
    VLLM_GRAPH_RESERVED_MEM=0.02 for llama3.1-8b or VLLM_GRAPH_RESERVED_MEM=0.1 for llama3.1-70b.
  • VLLM_PROMPT_BS_BUCKET_MIN=1 - Suggested value, depends on the model. You can increase it until you
    reach an OOM error or decrease it if OOM occurs.
  • VLLM_PROMPT_BS_BUCKET_STEP=16 - Suggested value, depends on the model. Increasing the step value results
    in fewer buckets. If an OOM error occurs, the value should be increased.
  • VLLM_PROMPT_BS_BUCKET_MAX=16 - Suggested value, depends on the model. You can increase it until you
    reach an OOM error or decrease it if OOM occurs.
  • VLLM_PROMPT_SEQ_BUCKET_MIN=24576 - Suggested value, depends on warmup results.
  • VLLM_PROMPT_SEQ_BUCKET_STEP=2048 - Suggested value, depends on warmup results. It is
    recommended to increase it to a higher value for faster warmup.
  • VLLM_PROMPT_SEQ_BUCKET_MAX=32768 - Value for context length of 32K. Use 16384 for 16K.
  • VLLM_DECODE_BLOCK_BUCKET_MIN=1024 - Suggested value, depends on warmup results.
  • VLLM_DECODE_BLOCK_BUCKET_STEP=1024 - Suggested value, depends on warmup results.
  • VLLM_DECODE_BLOCK_BUCKET_MAX=33792 - max_num_seqs * max_decode_seq // self.block_size, where
    max_decode_seq represents the sum of input and output sequences.
    For example:
    • 128 * ((32 + 1) * 1024) / 128
    • 32 * ((32 + 1) * 1024) / 128
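
Put together, a 32K-context launch using the suggested flags might look like the sketch below. The bucket values are the ones listed above for llama3.1-8b; the model identifier and `--max-model-len` value are illustrative assumptions, not part of the suggested text.

```bash
export VLLM_GRAPH_RESERVED_MEM=0.02        # 0.1 for llama3.1-70b
export VLLM_PROMPT_BS_BUCKET_MIN=1
export VLLM_PROMPT_BS_BUCKET_STEP=16
export VLLM_PROMPT_BS_BUCKET_MAX=16
export VLLM_PROMPT_SEQ_BUCKET_MIN=24576
export VLLM_PROMPT_SEQ_BUCKET_STEP=2048
export VLLM_PROMPT_SEQ_BUCKET_MAX=32768    # 16384 for a 16K context
export VLLM_DECODE_BLOCK_BUCKET_MIN=1024
export VLLM_DECODE_BLOCK_BUCKET_STEP=1024
export VLLM_DECODE_BLOCK_BUCKET_MAX=33792
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 32768
```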

Author

Recommended change for Gaudi 3, as it significantly decreases execution time:
"- VLLM_PROMPT_SEQ_BUCKET_STEP=2048 - Suggested value, depends on warmup results. It is recommended to increase it to a higher value for faster warmup. VLLM_PROMPT_SEQ_BUCKET_STEP=16384 - Suggested value for Intel Gaudi 3."
@auvarovahabana, may we keep it or rephrase?


"Suggested value for Intel Gaudi 3" is OK

README_GAUDI.md Outdated
recompilation message example: "Configuration: (prompt, 1, 36864) was not warmed-up!"
warning example: "Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory."

**Usage of Multi-Step Scheduling feature**


Multi-Step Scheduling Feature Usage

README_GAUDI.md Outdated

**Gaudi3**
All steps above are valid for gaudi3.
It is recommended only to change VLLM_PROMPT_SEQ_BUCKET_STEP to higher value for faster warmup.
@auvarovahabana Feb 27, 2025


remove line 431

README_GAUDI.md Outdated
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
Other environment variables settings depend on context length.

32K context length flags example:


"Suggested value for Intel Gaudi 3" is OK

README_GAUDI.md Outdated

- `VLLM_GRAPH_RESERVED_MEM` - The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B.
- `VLLM_PROMPT_BS_BUCKET_MIN=1` - Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs.
- `VLLM_PROMPT_BS_BUCKET_STEP=16` - Suggested value, depends on the model. Increasing the step value results


Please fix PyMarkdown issue

Signed-off-by: Iryna Boiko <[email protected]>
@michalkuligowski left a comment


@iboiko-habana merged commit f78aeb9 into habana_main Mar 4, 2025
34 checks passed
@PatrykWo self-requested a review March 7, 2025 09:24