yannicks1
Collaborator

@yannicks1 yannicks1 commented Jun 23, 2025

[do not merge][CB] requesting only one token via index (fms api change)

To minimize data transfer for CB, we only want to request the last prompt token's logits instead of the entire prompt's logits.
Therefore, the flag only_last_token: bool will be replaced by the argument index: int in the fms forward API. When passing an index i, fms will return the logits for the token at the i-th position only. A draft implementation of this in fms can be found here.

Note: we only request the last token logits for static batching by default, as no right padding is ever required there.
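
To illustrate why an index is more useful than a last-position flag once right padding is involved, here is a minimal sketch in plain Python. It is not the fms API: `select_logits`, the nested-list layout `[batch][seq_len][vocab]`, and the toy values are all assumptions made for illustration.

```python
# Hypothetical helper, NOT the fms forward API: it only mimics the idea of
# returning logits for a single position instead of the whole prompt.
def select_logits(logits, index=None):
    """Return logits at position `index` for every sequence in the batch.

    logits: nested list shaped [batch][seq_len][vocab] (assumed layout).
    index:  position to keep; None returns the logits for all positions.
    """
    if index is None:
        return logits
    return [seq[index] for seq in logits]


# One sequence with 3 real prompt tokens, right padded to length 5, vocab size 2.
prompt_len = 3
logits = [[[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.0, 0.0], [0.0, 0.0]]]

# Selecting by index picks the last real prompt token.
last_prompt_token = select_logits(logits, index=prompt_len - 1)

# Selecting the final position would pick a pad position instead.
last_position = select_logits(logits, index=-1)

# Number of values returned with and without the index.
values_full = sum(len(tok) for seq in select_logits(logits) for tok in seq)
values_indexed = sum(len(tok) for tok in last_prompt_token)
```

In this toy case the indexed call returns 2 values instead of 10, and with zero right pads `index=prompt_len - 1` and `index=-1` coincide, which matches the note above about static batching.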

changes:

  • install fms feature branch for testing
  • passing index instead of only_last_token
  • set number of right pads for decode to 0

solves #254

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@yannicks1 yannicks1 changed the title [do not merge][CB] requesting only last token per default [do not merge][CB] requesting only one token via index (fms api change) Jun 30, 2025
@yannicks1
Collaborator Author

#464 has been merged as an intermediate fix/stepping stone. Performance-wise, it is no longer critical to return precisely one token.

@joerunde
Collaborator

joerunde commented Oct 6, 2025

Closing, as this was done in #464 then!

@joerunde joerunde closed this Oct 6, 2025
2 participants