
Conversation

@iboiko-habana

Long Context usage README update

README_GAUDI.md Outdated
# Long context support
**Environment variable's setting**
Environment variables for OOM/functional issues avoiding.


for avoiding OOM/functional issues.

Author

Done

@tzielinski-habana left a comment


I added a lot of linguistic comments. Maybe a technical writer should take a look as well.

README_GAUDI.md Outdated
32K context length flags example:
VLLM_GRAPH_RESERVED_MEM. it's value depends on model and long context. VLLM_GRAPH_RESERVED_MEM=0.02 for llama3.1-8b. VLLM_GRAPH_RESERVED_MEM=0.1 for llama3.1-70b.
VLLM_PROMPT_BS_BUCKET_MIN=1 # proposal for usage. depends on model. Can be increased if no OOM
VLLM_PROMPT_BS_BUCKET_STEP=16 # proposal for usage. depends on model. Can be increased until no OOM or decreased if OOM


"Can be increased until no OOM or decreased if OOM"
Increasing the step value results in fewer buckets, so if the user gets OOM, they should increase the value.

Author

Done

README_GAUDI.md Outdated
VLLM_GRAPH_RESERVED_MEM. it's value depends on model and long context. VLLM_GRAPH_RESERVED_MEM=0.02 for llama3.1-8b. VLLM_GRAPH_RESERVED_MEM=0.1 for llama3.1-70b.
VLLM_PROMPT_BS_BUCKET_MIN=1 # proposal for usage. depends on model. Can be increased if no OOM
VLLM_PROMPT_BS_BUCKET_STEP=16 # proposal for usage. depends on model. Can be increased until no OOM or decreased if OOM
VLLM_PROMPT_BS_BUCKET_MAX=16 # proposal for usage. depends on model. Can be increased until no OOM or decreased if OOM


"Can be increased until no OOM or decreased if OOM"
I don't like this sentence. Maybe:
"Suggested value, depends on a model. You can increase it until you reach an OOM error, or decrease it if OOM occurs."

Author

Done

README_GAUDI.md Outdated
VLLM_PROMPT_SEQ_BUCKET_MAX=32768 # context length 32K, 16384 for 16K
VLLM_DECODE_BLOCK_BUCKET_MIN=1024 # proposal for usage. depends on warmup results
VLLM_DECODE_BLOCK_BUCKET_STEP=1024 # proposal for usage. depends on warmup results
VLLM_DECODE_BLOCK_BUCKET_MAX=33792 # max_num_seqs * max_decode_seq // self.block_size, where max_decode_seq is input + output # i.e. 128*((32+1)* 1024)/128 or 32*((32+1)*1024)/128


You have a comment inside a comment; just one comment suffices.
Also, "i.e." means "that is" or "in other words". If you want to say "for example", use "e.g." or "Example:". I think one example is fine.

Author

proposal: max_num_seqs * max_decode_seq // self.block_size, where max_decode_seq is input + output, i.e. 128*((32+1)* 1024)/128 or 32*((32+1)*1024)/128
Let's keep the real example if possible. It was a user request in one of the vllm-fork issues to provide more details on example usage.
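
For readers who want to verify the numbers in that example, the arithmetic can be checked directly in a shell. This is only a sanity check; the assumed values (block size of 128, 32K input plus 1K output tokens, max_num_seqs of 128 or 32) are the ones used in the example above.

```bash
# Reproduce the example calculation: max_num_seqs * max_decode_seq / block_size,
# where max_decode_seq = input + output tokens. Values are taken from the example above.
echo $(( 128 * ((32 + 1) * 1024) / 128 ))   # 33792, for max_num_seqs = 128
echo $(( 32  * ((32 + 1) * 1024) / 128 ))   # 8448,  for max_num_seqs = 32
```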

README_GAUDI.md Outdated
VLLM_DECODE_BLOCK_BUCKET_MAX=33792 # max_num_seqs * max_decode_seq // self.block_size, where max_decode_seq is input + output # i.e. 128*((32+1)* 1024)/128 or 32*((32+1)*1024)/128

**Batch size setting**
Usage of default batch_size=256 is not optimal for long context 8K+. Recompilation can occur due to not enough KV cache space for some sequence group.


(...) for long context (8K+). Recompilations can occur if there is not enough KV cache space for some sequence groups.

Author

done

README_GAUDI.md Outdated
Usage of default batch_size=256 is not optimal for long context 8K+. Recompilation can occur due to not enough KV cache space for some sequence group.

Please decrease batch_size if recompialtion or next recomputation warning occurs during inference run:
recompilation message example: "Configuration: (dprompt, 1, 36864) was not warmed-up!"


dprompt? Is this a typo?

Author

Thanks. Yes, this is a typo.
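
As a practical illustration of the batch-size advice in the snippet above, a server launch with a reduced batch size might look like the sketch below. This assumes the default batch_size=256 corresponds to vLLM's `--max-num-seqs` engine argument; the model name and the value 64 are placeholders, not recommendations.

```bash
# Hypothetical launch with a smaller batch size to avoid recompilation/preemption warnings.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 32768 \
    --max-num-seqs 64
```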

README_GAUDI.md Outdated
warning example: "Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory."

**Usage of Multi-Step Scheduling feature**
It is recommended to enable "Multi-Step Scheduling feature" feature for better decode performance. More details about feature: https://github.com/vllm-project/vllm/issues/6854


It is recommended to enable Multi-Step Scheduling for better decode performance. Here are more details about the feature: vllm-project#6854

Author

done
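
For context, enabling multi-step scheduling on the command line might look like the sketch below, assuming the feature maps to vLLM's `--num-scheduler-steps` engine argument discussed in vllm-project#6854; the value 10 and the model name are placeholders.

```bash
# Hypothetical launch with multi-step scheduling enabled for better decode performance.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --num-scheduler-steps 10
```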

README_GAUDI.md Outdated
It is recommended to enable "Multi-Step Scheduling feature" feature for better decode performance. More details about feature: https://github.com/vllm-project/vllm/issues/6854

**Gaudi3 usage**
All steps above are valid for gaudi3.


Gaudi3

Author

done

README_GAUDI.md Outdated

**Gaudi3 usage**
All steps above are valid for gaudi3.
It is recommended only to change VLLM_PROMPT_SEQ_BUCKET_STEP to higher value for faster warmup.


I'm not sure I understand this sentence.

@iboiko-habana

@tzielinski-habana thanks for the comments.
Final review will be requested from @kzawora-intel.

README_GAUDI.md Outdated
> - `VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in microseconds, e.g., 600000 equals 10 minutes.
# Long context support
**Environment variable's setting**


Lines 397-405:

Environment Variables Settings

Set the following environment variables to avoid OOM/functional issues. Additional environment variable settings depend on context length:

  • VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
  • VLLM_RPC_TIMEOUT=100000
  • VLLM_PROMPT_USE_FUSEDSDPA=1
  • PT_HPU_ENABLE_LAZY_COLLECTIVES=true
  • PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1
  • VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
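
Applied as shell exports before launching the server, that baseline could look like the following sketch; the values are copied from the suggested text above.

```bash
export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
export VLLM_RPC_TIMEOUT=100000
export VLLM_PROMPT_USE_FUSEDSDPA=1
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
```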

README_GAUDI.md Outdated
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
Other environment variables settings depend on context length.

32K context length flags example:


Lines 407-417:

32K context length flags examples:

  • VLLM_GRAPH_RESERVED_MEM - The value depends on the model and context length settings. Use
    VLLM_GRAPH_RESERVED_MEM=0.02 for llama3.1-8b or VLLM_GRAPH_RESERVED_MEM=0.1 for llama3.1-70b.
  • VLLM_PROMPT_BS_BUCKET_MIN=1 - Suggested value, depends on the model. You can increase it until you
    reach an OOM error or decrease it if OOM occurs.
  • VLLM_PROMPT_BS_BUCKET_STEP=16 - Suggested value, depends on the model. Increasing the step value results
    in fewer buckets. If an OOM error occurs, the value should be increased.
  • VLLM_PROMPT_BS_BUCKET_MAX=16 - Suggested value, depends on the model. You can increase it until you
    reach an OOM error or decrease it if OOM occurs.
  • VLLM_PROMPT_SEQ_BUCKET_MIN=24576 - Suggested value, depends on warmup results.
  • VLLM_PROMPT_SEQ_BUCKET_STEP=2048 - Suggested value, depends on warmup results. It is
    recommended to increase it to a higher value for faster warmup.
  • VLLM_PROMPT_SEQ_BUCKET_MAX=32768 - Value for context length of 32K. Use 16384 for 16K.
  • VLLM_DECODE_BLOCK_BUCKET_MIN=1024 - Suggested value, depends on warmup results.
  • VLLM_DECODE_BLOCK_BUCKET_STEP=1024 - Suggested value, depends on warmup results.
  • VLLM_DECODE_BLOCK_BUCKET_MAX=33792 - max_num_seqs * max_decode_seq // self.block_size, where
    max_decode_seq represents the sum of input and output sequences.
    For example:
    • 128 * ((32 + 1) * 1024) / 128
    • 32 * ((32 + 1) * 1024) / 128
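
Put together, a 32K-context launch using the suggested flags might look like the sketch below. The bucket values are the ones listed above for llama3.1-8b; the model identifier and `--max-model-len` value are illustrative assumptions, not part of the suggested text.

```bash
export VLLM_GRAPH_RESERVED_MEM=0.02        # 0.1 for llama3.1-70b
export VLLM_PROMPT_BS_BUCKET_MIN=1
export VLLM_PROMPT_BS_BUCKET_STEP=16
export VLLM_PROMPT_BS_BUCKET_MAX=16
export VLLM_PROMPT_SEQ_BUCKET_MIN=24576
export VLLM_PROMPT_SEQ_BUCKET_STEP=2048
export VLLM_PROMPT_SEQ_BUCKET_MAX=32768    # 16384 for a 16K context
export VLLM_DECODE_BLOCK_BUCKET_MIN=1024
export VLLM_DECODE_BLOCK_BUCKET_STEP=1024
export VLLM_DECODE_BLOCK_BUCKET_MAX=33792
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 32768
```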

Author

Recommended change for Gaudi 3, as it significantly decreases execution time:
"- VLLM_PROMPT_SEQ_BUCKET_STEP=2048 - Suggested value, depends on warmup results. It is recommended to increase it to a higher value for faster warmup. VLLM_PROMPT_SEQ_BUCKET_STEP=16384 - Suggested value for Intel Gaudi 3."
@auvarovahabana, may we keep it or rephrase?


"Suggested value for Intel Gaudi 3" is OK

README_GAUDI.md Outdated
recompilation message example: "Configuration: (prompt, 1, 36864) was not warmed-up!"
warning example: "Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory."

**Usage of Multi-Step Scheduling feature**


Multi-Step Scheduling Feature Usage

README_GAUDI.md Outdated

**Gaudi3**
All steps above are valid for gaudi3.
It is recommended only to change VLLM_PROMPT_SEQ_BUCKET_STEP to higher value for faster warmup.
@auvarovahabana Feb 27, 2025


remove line 431

README_GAUDI.md Outdated
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
Other environment variables settings depend on context length.

32K context length flags example:


"Suggested value for Intel Gaudi 3" is OK

README_GAUDI.md Outdated

- `VLLM_GRAPH_RESERVED_MEM` - The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B.
- `VLLM_PROMPT_BS_BUCKET_MIN=1` - Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs.
- `VLLM_PROMPT_BS_BUCKET_STEP=16` - Suggested value, depends on the model. Increasing the step value results


Please fix PyMarkdown issue

Signed-off-by: Iryna Boiko <[email protected]>
@michalkuligowski left a comment


@iboiko-habana merged commit f78aeb9 into habana_main Mar 4, 2025
34 checks passed
@PatrykWo self-requested a review March 7, 2025 09:24