Long Context usage README update #836
Conversation
Signed-off-by: Iryna Boiko <[email protected]>
README_GAUDI.md
Outdated
# Long context support
**Environment variable's setting**
Environment variables for OOM/functional issues avoiding.
for avoiding OOM/functional issues.
Done
tzielinski-habana
left a comment
I added a lot of linguistic comments. Maybe a technical writer should take a look as well.
README_GAUDI.md
Outdated
32K context length flags example:
VLLM_GRAPH_RESERVED_MEM. it's value depends on model and long context. VLLM_GRAPH_RESERVED_MEM=0.02 for llama3.1-8b. VLLM_GRAPH_RESERVED_MEM=0.1 for llama3.1-70b.
VLLM_PROMPT_BS_BUCKET_MIN=1 # proposal for usage. depends on model. Can be increased if no OOM
VLLM_PROMPT_BS_BUCKET_STEP=16 # proposal for usage. depends on model. Can be increased until no OOM or decreased if OOM
"Can be increased until no OOM or decreased if OOM"
Increasing the step value results in fewer buckets, so if the user gets OOM, they should increase the value.
Done
README_GAUDI.md
Outdated
VLLM_GRAPH_RESERVED_MEM. it's value depends on model and long context. VLLM_GRAPH_RESERVED_MEM=0.02 for llama3.1-8b. VLLM_GRAPH_RESERVED_MEM=0.1 for llama3.1-70b.
VLLM_PROMPT_BS_BUCKET_MIN=1 # proposal for usage. depends on model. Can be increased if no OOM
VLLM_PROMPT_BS_BUCKET_STEP=16 # proposal for usage. depends on model. Can be increased until no OOM or decreased if OOM
VLLM_PROMPT_BS_BUCKET_MAX=16 # proposal for usage. depends on model. Can be increased until no OOM or decreased if OOM
"Can be increased until no OOM or decreased if OOM"
I don't like this sentence. Maybe:
"Suggested value, depends on a model. You can increase it until you reach an OOM error, or decrease it if OOM occurs."
Done
README_GAUDI.md
Outdated
VLLM_PROMPT_SEQ_BUCKET_MAX=32768 # context length 32K, 16384 for 16K
VLLM_DECODE_BLOCK_BUCKET_MIN=1024 # proposal for usage. depends on warmup results
VLLM_DECODE_BLOCK_BUCKET_STEP=1024 # proposal for usage. depends on warmup results
VLLM_DECODE_BLOCK_BUCKET_MAX=33792 # max_num_seqs * max_decode_seq // self.block_size, where max_decode_seq is input + output # i.e. 128*((32+1)* 1024)/128 or 32*((32+1)*1024)/128
You have a comment inside a comment; just one comment suffices.
Also, "i.e." means "that is" or "in other words". If you want to say "for example", use "e.g." or "Example:". I think one example is fine.
Proposal: max_num_seqs * max_decode_seq // self.block_size, where max_decode_seq is input + output, i.e. 128*((32+1)*1024)/128 or 32*((32+1)*1024)/128.
Let's keep a real example if possible. It was a user request to provide more details on example usage in one of the vllm-fork issues.
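For readers following the formula, here is a rough sketch of that arithmetic, assuming max_num_seqs=128, a 32K input plus a 1K output budget (the "(32+1)*1024" term), and block_size=128:

```bash
# Rough sketch of the bucket-max arithmetic from the proposal above.
# The 1K output budget and block_size=128 are assumptions taken from the example term.
max_num_seqs=128
input_len=$((32 * 1024))
output_len=1024
block_size=128
max_decode_seq=$((input_len + output_len))              # input + output
echo $(( max_num_seqs * max_decode_seq / block_size ))  # 128*((32+1)*1024)/128 = 33792
```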
README_GAUDI.md
Outdated
VLLM_DECODE_BLOCK_BUCKET_MAX=33792 # max_num_seqs * max_decode_seq // self.block_size, where max_decode_seq is input + output # i.e. 128*((32+1)* 1024)/128 or 32*((32+1)*1024)/128

**Batch size setting**
Usage of default batch_size=256 is not optimal for long context 8K+. Recompilation can occur due to not enough KV cache space for some sequence group.
(...) for long context (8K+). Recompilations can occur if there is not enough KV cache space for some sequence groups.
done
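As an illustration of the batch size advice above, a sketch assuming the batch size is lowered through vLLM's `--max-num-seqs` server argument (default 256); the model name and the value 64 are placeholders, not recommendations:

```bash
# Sketch: lower the batch size (--max-num-seqs, default 256) to leave more
# KV cache headroom for long-context sequence groups. Values are illustrative.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --max-model-len 32768 \
    --max-num-seqs 64
```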
README_GAUDI.md
Outdated
Usage of default batch_size=256 is not optimal for long context 8K+. Recompilation can occur due to not enough KV cache space for some sequence group.

Please decrease batch_size if recompialtion or next recomputation warning occurs during inference run:
recompilation message example: "Configuration: (dprompt, 1, 36864) was not warmed-up!"
dprompt? Is this a typo?
Thanks. Yes, this is a typo.
README_GAUDI.md
Outdated
warning example: "Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory."

**Usage of Multi-Step Scheduling feature**
It is recommended to enable "Multi-Step Scheduling feature" feature for better decode performance. More details about feature: https://github.com/vllm-project/vllm/issues/6854
It is recommended to enable Multi-Step Scheduling for better decode performance. Here are more details about the feature: vllm-project#6854
done
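For illustration, a sketch assuming Multi-Step Scheduling is turned on through vLLM's `--num-scheduler-steps` server flag; the step count shown is only an example:

```bash
# Sketch: run several decode steps per scheduler invocation to improve decode
# throughput. The step count here is illustrative, not a recommendation.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --num-scheduler-steps 10
```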
README_GAUDI.md
Outdated
It is recommended to enable "Multi-Step Scheduling feature" feature for better decode performance. More details about feature: https://github.com/vllm-project/vllm/issues/6854

**Gaudi3 usage**
All steps above are valid for gaudi3.
Gaudi3
done
README_GAUDI.md
Outdated
**Gaudi3 usage**
All steps above are valid for gaudi3.
It is recommended only to change VLLM_PROMPT_SEQ_BUCKET_STEP to higher value for faster warmup.
I'm not sure I understand this sentence.
@tzielinski-habana thanks for the comments.
README_GAUDI.md
Outdated
> - `VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in microseconds, e.g., 600000 equals 10 minutes.
# Long context support
**Environment variable's setting**
Lines 397-405:
Environment Variables Settings
Set the following environment variables to avoid OOM/functional issues. Additional environment variable settings depend on context length:
- `VLLM_ENGINE_ITERATION_TIMEOUT_S=3600`
- `VLLM_RPC_TIMEOUT=100000`
- `VLLM_PROMPT_USE_FUSEDSDPA=1`
- `PT_HPU_ENABLE_LAZY_COLLECTIVES=true`
- `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1`
- `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1`
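For context, a minimal sketch of how these variables might be exported in the shell that launches the vLLM server; the export form is an assumption, the values come from the proposed text above:

```bash
# Minimal sketch: export the base long-context variables before starting vLLM.
# Values are taken from the proposed README text above; adjust per deployment.
export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
export VLLM_RPC_TIMEOUT=100000
export VLLM_PROMPT_USE_FUSEDSDPA=1
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
```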
README_GAUDI.md
Outdated
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
Other environment variables settings depend on context length.

32K context length flags example:
Lines 407-417:
32K context length flags examples:
- `VLLM_GRAPH_RESERVED_MEM` - The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for llama3.1-8b or `VLLM_GRAPH_RESERVED_MEM=0.1` for llama3.1-70b.
- `VLLM_PROMPT_BS_BUCKET_MIN=1` - Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs.
- `VLLM_PROMPT_BS_BUCKET_STEP=16` - Suggested value, depends on the model. Increasing the step value results in fewer buckets. If an OOM error occurs, the value should be increased.
- `VLLM_PROMPT_BS_BUCKET_MAX=16` - Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs.
- `VLLM_PROMPT_SEQ_BUCKET_MIN=24576` - Suggested value, depends on warmup results.
- `VLLM_PROMPT_SEQ_BUCKET_STEP=2048` - Suggested value, depends on warmup results. It is recommended to increase it to a higher value for faster warmup.
- `VLLM_PROMPT_SEQ_BUCKET_MAX=32768` - Value for context length of 32K. Use 16384 for 16K.
- `VLLM_DECODE_BLOCK_BUCKET_MIN=1024` - Suggested value, depends on warmup results.
- `VLLM_DECODE_BLOCK_BUCKET_STEP=1024` - Suggested value, depends on warmup results.
- `VLLM_DECODE_BLOCK_BUCKET_MAX=33792` - `max_num_seqs * max_decode_seq // self.block_size`, where `max_decode_seq` represents the sum of input and output sequences. For example: 128 * ((32 + 1) * 1024) / 128 or 32 * ((32 + 1) * 1024) / 128.
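Taken together, a hedged sketch of a 32K-context configuration using the values listed above; the choice of llama3.1-8b, and therefore `VLLM_GRAPH_RESERVED_MEM=0.02`, is an assumption:

```bash
# Sketch of the 32K-context bucket flags from the list above, assuming llama3.1-8b.
# Tune the suggested values per model and warmup results.
export VLLM_GRAPH_RESERVED_MEM=0.02
export VLLM_PROMPT_BS_BUCKET_MIN=1
export VLLM_PROMPT_BS_BUCKET_STEP=16
export VLLM_PROMPT_BS_BUCKET_MAX=16
export VLLM_PROMPT_SEQ_BUCKET_MIN=24576
export VLLM_PROMPT_SEQ_BUCKET_STEP=2048
export VLLM_PROMPT_SEQ_BUCKET_MAX=32768
export VLLM_DECODE_BLOCK_BUCKET_MIN=1024
export VLLM_DECODE_BLOCK_BUCKET_STEP=1024
export VLLM_DECODE_BLOCK_BUCKET_MAX=33792
```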
Recommended change for Gaudi 3, as it significantly decreases execution time:
"- `VLLM_PROMPT_SEQ_BUCKET_STEP=2048` - Suggested value, depends on warmup results. It is recommended to increase it to a higher value for faster warmup. `VLLM_PROMPT_SEQ_BUCKET_STEP=16384` - Suggested value for Intel Gaudi 3."
@auvarovahabana may we keep it or re-phrase?
"Suggested value for Intel Gaudi 3" is OK
README_GAUDI.md
Outdated
recompilation message example: "Configuration: (prompt, 1, 36864) was not warmed-up!"
warning example: "Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory."

**Usage of Multi-Step Scheduling feature**
Multi-Step Scheduling Feature Usage
README_GAUDI.md
Outdated
**Gaudi3**
All steps above are valid for gaudi3.
It is recommended only to change VLLM_PROMPT_SEQ_BUCKET_STEP to higher value for faster warmup.
remove line 431
README_GAUDI.md
Outdated
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
Other environment variables settings depend on context length.

32K context length flags example:
"Suggested value for Intel Gaudi 3" is OK
README_GAUDI.md
Outdated
- `VLLM_GRAPH_RESERVED_MEM` - The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B.
- `VLLM_PROMPT_BS_BUCKET_MIN=1` - Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs.
- `VLLM_PROMPT_BS_BUCKET_STEP=16` - Suggested value, depends on the model. Increasing the step value results
Please fix the PyMarkdown issue.
Signed-off-by: Iryna Boiko <[email protected]>
Signed-off-by: Iryna Boiko <[email protected]>
michalkuligowski
left a comment
Long Context usage README update