[do not merge][CB] requesting only one token via index (fms api change) #253
Signed-off-by: Yannick Schnider <[email protected]>
👋 Hi! Thank you for contributing to vLLM support on Spyre.
Or this can be done with
Now you are good to go 🚀
#464 has been merged as an intermediate fix/stepping stone. Performance-wise, it is no longer critical to return precisely one token.
Closing, as this was done in #464 then!
To minimize data transfer for CB, we only want to request the last prompt token's logits instead of the entire prompt's logits.
Therefore the flag `only_last_token: bool` will be replaced by the argument `index: int` in the fms forward api. When passing an index i, fms will return the logits for the token at the i-th position only. A draft implementation of this in fms can be found here.
Note: we only request the last token logits for static batching by default, as no right padding is ever required there.
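To illustrate why a position index is more flexible than a boolean flag, here is a minimal sketch. It assumes a CB prompt is right-padded to a multiple of some block size, so the last real prompt token is not necessarily at the last position. `right_pad` and `fake_forward` are hypothetical stand-ins, not the fms API, and the "logits" are toy values that simply echo the token id at the selected position.

```python
def right_pad(tokens, block_size, pad_id=0):
    """Pad a prompt on the right up to the next multiple of block_size (assumed layout)."""
    padded_len = -(-len(tokens) // block_size) * block_size  # ceiling division
    return tokens + [pad_id] * (padded_len - len(tokens))


def fake_forward(padded_tokens, index):
    """Hypothetical stand-in for a forward pass that returns one position only.

    The returned 'logits' are a one-element list echoing the token id at
    `index`, so it is easy to see which position was selected.
    """
    return [float(padded_tokens[index])]


prompt = [11, 12, 13, 14, 15]             # 5 real prompt tokens
padded = right_pad(prompt, block_size=4)  # 8 positions, the last 3 are padding
last_real = len(prompt) - 1               # position of the last real prompt token

logits_at_index = fake_forward(padded, index=last_real)  # last prompt token
logits_at_end = fake_forward(padded, index=-1)           # a padding position
```

With right padding, taking the final position selects a pad token, whereas passing `index=last_real` selects the last prompt token. Without right padding, as in the static batching case noted above, the two coincide.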
changes:
- use `index` instead of `only_last_token`
solves #254