
[CB] bug fix: account for prefill token #320


Merged
yannicks1 merged 3 commits from ysc-fix-prefill-token into main on Jul 21, 2025

Conversation


@yannicks1 yannicks1 commented Jul 17, 2025

[CB] bug fix: account for prefill token when asserting context length

Prefill already provides one new token (without requiring any KV-cache block for it).

Example: with a max model length of 2048, it is possible to prefill a prompt of size 2048 (32 blocks on Spyre) when requesting only 1 output token. A 33rd block would only be needed if a 2nd output token were requested, and that would violate the max model length.
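A rough sketch of that block arithmetic (the 64-token block size is inferred from the 2048-token / 32-block example above; the variable names are illustrative, not code from the plugin):

BLOCK_SIZE = 64                      # inferred Spyre KV-cache block size
max_model_len = 2048
prompt_len = 2048

blocks_for_prompt = -(-prompt_len // BLOCK_SIZE)   # ceil division -> 32 blocks
# The token produced by prefill needs no KV-cache block of its own, so a
# 2048-token prompt with max_tokens=1 fits in exactly 32 blocks.
# Requesting a 2nd output token would require a 33rd block and exceed
# max_model_len.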

changes:

  • correct the assertion in platform.py: allow any request that satisfies prompt_padding_len + max_tokens - 1 <= max_model_len
  • correct long_context.py: allow prompt_len <= max_model_len while setting
    new_tokens = max_model_len + 1 - prompt_padding_len
    (see the sketch after this list)
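A minimal sketch of the corrected accounting; the helper names (validate_request, tokens_to_generate) are hypothetical and not the actual functions in platform.py / long_context.py:

def validate_request(prompt_padding_len: int, max_tokens: int,
                     max_model_len: int) -> None:
    # Prefill already yields the first output token, so only max_tokens - 1
    # further decode tokens have to fit after the padded prompt.
    if prompt_padding_len + max_tokens - 1 > max_model_len:
        raise ValueError(
            f"padded prompt ({prompt_padding_len}) + {max_tokens - 1} decode "
            f"tokens exceeds max_model_len ({max_model_len})")

def tokens_to_generate(prompt_padding_len: int, max_model_len: int) -> int:
    # Mirrors the long_context.py change: request every token that still fits.
    return max_model_len + 1 - prompt_padding_len

For example, with max_model_len = 2048 and a fully padded 2048-token prompt, tokens_to_generate returns 1, matching the example above.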

Signed-off-by: Yannick Schnider <[email protected]>

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure your code passes all the linting checks, otherwise your PR can't be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@yannicks1 yannicks1 marked this pull request as ready for review July 17, 2025 11:02
@@ -122,7 +122,7 @@ def round_up(t):

 tokens_to_generate = [
-    args.max_model_len - round_up(plen) for plen in prompt_lens
+    args.max_model_len + 1 - round_up(plen) for plen in prompt_lens
Collaborator

I cannot help but read plen as if it were one word; can we rename that?

Suggested change
-    args.max_model_len + 1 - round_up(plen) for plen in prompt_lens
+    args.max_model_len + 1 - round_up(p_len) for p_len in prompt_lens

or

Suggested change
-    args.max_model_len + 1 - round_up(plen) for plen in prompt_lens
+    args.max_model_len + 1 - round_up(prompt_len) for prompt_lenin prompt_lens

Collaborator

+1 for prompt_len; we don't need to save the bytes

Collaborator

prompt_lenin

Signed-off-by: Yannick Schnider <[email protected]>
@yannicks1 yannicks1 enabled auto-merge (squash) July 21, 2025 06:54
@github-actions github-actions bot added the ready label Jul 21, 2025
@yannicks1 yannicks1 merged commit 1d13d62 into main Jul 21, 2025
19 checks passed
@yannicks1 yannicks1 deleted the ysc-fix-prefill-token branch July 21, 2025 07:17