V1 embeddings #277
Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre.
The changes introduced by PR vllm-project/vllm#16728 to the sampler architecture were incompatible with our Spyre model runner. Initially, as a stopgap solution, I copied the old sampling classes into our vllm_spyre tree just so that we could keep working on the latest changes from main. This commit now reverts that and makes the same logits processor logic work with the Spyre input batch and model runner classes. The difference from the GPU model runner is that on Spyre we don't condense the batch; instead we keep a boolean mask that is used to calculate "dense" request indices. These indices must be used for the BatchUpdateBuilder because they are the right ones to slice the `logits` tensor that is passed to the Sampler. Signed-off-by: Max de Bayser <[email protected]>
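To illustrate that indexing scheme, here is a minimal sketch (the names `request_mask`, `active_slots`, and `dense_indices` are illustrative, not the actual vllm_spyre identifiers): a boolean mask over the fixed batch slots is turned into dense row indices that line up with the condensed `logits` tensor handed to the Sampler.

```python
import torch

# Five batch slots, three of which hold active requests. The logits
# tensor passed to the Sampler has one row per *active* request, in
# slot order, so slot 0 -> row 0, slot 2 -> row 1, slot 3 -> row 2.
request_mask = torch.tensor([True, False, True, True, False])

# A cumulative sum over the mask, minus one, yields exactly that
# slot -> dense-row mapping for the active slots.
dense_indices = torch.cumsum(request_mask.int(), dim=0) - 1

active_slots = torch.nonzero(request_mask).squeeze(-1)  # tensor([0, 2, 3])
print(dense_indices[active_slots])                      # tensor([0, 1, 2])

# These dense indices are the ones a BatchUpdateBuilder must be fed,
# because they are valid row indices into the condensed logits tensor:
logits = torch.randn(int(request_mask.sum()), 32000)  # (3, vocab_size)
row_for_slot_3 = logits[dense_indices[3]]             # selects row 2
```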
Revert "…upstream (#245)". This reverts commit 962abf1. Signed-off-by: Max de Bayser <[email protected]>
All tests are passing now after the changes from the first round of reviews.
From my side this looks good! I have one small thing apart from the tiny merge conflicts, which @maxdebayser will solve within seconds :) IMO we have 1 or 2 files too many in …
Also hoping for @joerunde to give this a final pass (as it is a really big refactoring) and merge once he is back :)
I think it would be nice to update …
The following thread refers to this diff excerpt:

```python
        torch.Tensor, self.token_type_ids_cpu_tensor).numpy()
    return self._token_type_ids_cpu

def has_token_types(self) -> bool:
```
I'm pretty sure that this method is not being used, and I was wondering whether the related tensors, like `token_type_ids_cpu_tensor`, are being used either. I could find them being populated, but not being read, if I did the reading correctly.
Yes, this is to prepare for changes that haven't been merged upstream yet. I can remove these to simplify the PR for now. It's not clear yet in what shape the support will be added upstream.
LGTM! Thanks for the refactoring of the input batch, it makes sense to me.
Just a friendly reminder: update docs/supported_features.md
### [v1] remove v0 code
Now that we have v1 support for embedding models (#277), we can finally delete the v0 code. Note: for decoder models, v0 support was deprecated some time ago. Signed-off-by: Yannick Schnider <[email protected]>
Description
This PR enables embedding models on vLLM V1. In contrast to the V1 GPU implementation, I added a separate model runner here, because most embedding models have no need for continuous batching. To avoid code repetition, I refactored the input batch and model runner classes into a class hierarchy with common base classes; a sketch of the idea follows.
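As a rough, hypothetical sketch of that kind of hierarchy (class and method names here are invented for illustration and are not the actual vllm_spyre classes): the shared forward path lives in a base class, while sampling and pooling runners only specialize the post-processing step.

```python
from abc import ABC, abstractmethod

import torch


class BaseModelRunner(ABC):
    """Shared logic: run the model, then delegate post-processing."""

    def execute(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden_states = self._forward(token_ids)
        return self._postprocess(hidden_states)

    def _forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Stand-in for the real model call: (batch, seq, hidden).
        return torch.randn(token_ids.shape[0], token_ids.shape[1], 8)

    @abstractmethod
    def _postprocess(self, hidden_states: torch.Tensor) -> torch.Tensor:
        ...


class SamplingModelRunner(BaseModelRunner):
    """Decoder models: keep the last position (stand-in for logits/sampling)."""

    def _postprocess(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states[:, -1, :]


class PoolingModelRunner(BaseModelRunner):
    """Embedding models: pool over the sequence; no continuous batching."""

    def _postprocess(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states.mean(dim=1)


batch = torch.zeros(2, 5, dtype=torch.long)  # (batch, seq_len) dummy tokens
print(PoolingModelRunner().execute(batch).shape)   # torch.Size([2, 8])
print(SamplingModelRunner().execute(batch).shape)  # torch.Size([2, 8])
```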
@gmarinho2 contributed a test that verifies that the returned embeddings don't change with batch size.
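A minimal sketch of what such a batch-size invariance check can look like, assuming vLLM's `LLM.embed()` entrypoint; the model name, prompts, and tolerances are placeholders, and the actual test in this PR may differ:

```python
import pytest
import torch
from vllm import LLM

PROMPTS = ["Hello world", "vLLM on Spyre", "embedding test", "batch invariance"]


@pytest.fixture(scope="module")
def llm():
    # Placeholder model; any embedding model supported on Spyre would do.
    return LLM(model="sentence-transformers/all-MiniLM-L6-v2", task="embed")


def test_embeddings_do_not_change_with_batch_size(llm):
    # Reference: embed each prompt on its own (batch size 1).
    reference = [
        torch.tensor(llm.embed([p])[0].outputs.embedding) for p in PROMPTS
    ]
    # Embed all prompts in one batch and compare element-wise.
    for ref, out in zip(reference, llm.embed(PROMPTS)):
        emb = torch.tensor(out.outputs.embedding)
        torch.testing.assert_close(emb, ref, rtol=1e-3, atol=1e-3)
```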