[CB] bug fix: account for prefill token #320
Conversation
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
👋 Hi! Thank you for contributing to vLLM support on Spyre.
Now you are good to go 🚀
@@ -122,7 +122,7 @@ def round_up(t):
     tokens_to_generate = [
-        args.max_model_len - round_up(plen) for plen in prompt_lens
+        args.max_model_len + 1 - round_up(plen) for plen in prompt_lens
I cannot help but read plen as if it was one word, can we rename that?
-        args.max_model_len + 1 - round_up(plen) for plen in prompt_lens
+        args.max_model_len + 1 - round_up(p_len) for p_len in prompt_lens
or
-        args.max_model_len + 1 - round_up(plen) for plen in prompt_lens
+        args.max_model_len + 1 - round_up(prompt_len) for prompt_len in prompt_lens
+1 for prompt_len, we don't need to save the bytes.
Signed-off-by: Yannick Schnider <[email protected]>
[CB] bug fix: account for prefill token when asserting context length
Prefill already provides one new token (without requiring any KV cache for it).
Example: with a max model length of 2048, it is possible to prefill a prompt of 2048 tokens (32 blocks on Spyre) when only 1 output token is requested. A 33rd block is only needed if a 2nd output token were requested, and that would violate the max model length.
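A quick sketch to sanity-check the arithmetic above (the block size of 64 is an assumption inferred from the 2048-tokens-in-32-blocks figure; none of these names come from the vllm-spyre code):

```python
import math

BLOCK_SIZE = 64          # assumed: 2048 tokens == 32 blocks
MAX_MODEL_LEN = 2048

def blocks_needed(prompt_len: int, new_tokens: int) -> int:
    # Prefill produces the first output token without needing a KV-cache slot
    # for it, so cache is only required for the prompt plus any later tokens.
    cached_tokens = prompt_len + max(new_tokens - 1, 0)
    return math.ceil(cached_tokens / BLOCK_SIZE)

print(blocks_needed(2048, 1))  # 32 -> fits within max_model_len
print(blocks_needed(2048, 2))  # 33 -> a 2nd output token would exceed it
```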
Changes:
- prompt_padding_len + max_tokens - 1 <= max_model_len
- prompt_len <= max_model_len
- while setting new_tokens = max_model_len + 1 - prompt_padding_len
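Put together, a minimal sketch of the new bounds (function names here are illustrative, not the actual vllm-spyre helpers):

```python
def request_fits(prompt_len: int, prompt_padding_len: int, max_tokens: int,
                 max_model_len: int) -> bool:
    # The first output token comes from prefill and needs no extra context
    # slot, hence the "- 1" on max_tokens.
    return (prompt_len <= max_model_len
            and prompt_padding_len + max_tokens - 1 <= max_model_len)

def max_new_tokens(prompt_padding_len: int, max_model_len: int) -> int:
    # Largest number of output tokens a padded prompt can still request.
    return max_model_len + 1 - prompt_padding_len

assert request_fits(2048, 2048, 1, 2048)      # prefill-only request fits
assert not request_fits(2048, 2048, 2, 2048)  # a 2nd token would not
assert max_new_tokens(2048, 2048) == 1
```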