V1 embeddings #277


Merged
merged 44 commits into main from v1_embeddings on Jul 28, 2025

Conversation

maxdebayser
Collaborator

@maxdebayser maxdebayser commented Jul 2, 2025

Description

This PR enables embedding models on vllm V1. In contrast with the V1 GPU implementation, here I added a separate model runner because for most of the embedding models there is no need for continuous batching. To avoid code repetition, I refactored the input batch and model runner classes into a class hierarchy with common base classes.
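The refactoring described here can be pictured with a minimal sketch. All class and attribute names below are illustrative placeholders, not the actual vllm_spyre API:

```python
# Hypothetical sketch of an input-batch class hierarchy with a common base,
# as described in the PR. The real vllm_spyre classes differ.
from dataclasses import dataclass, field


@dataclass
class BaseInputBatch:
    """State and logic shared by all Spyre input batches."""
    req_ids: list[str] = field(default_factory=list)

    def add_request(self, req_id: str) -> None:
        self.req_ids.append(req_id)


class SamplingInputBatch(BaseInputBatch):
    """Batch for decoder (generative) models."""


class PoolingInputBatch(BaseInputBatch):
    """Batch for embedding models: whole batches are processed at once,
    so no continuous-batching bookkeeping is needed."""
```

Shared request bookkeeping lives in the base class, while each subclass only carries what its model type needs.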

@gmarinho2 contributed a test that verifies that the returned embeddings don't change with batch size.
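The idea of that test can be sketched as follows; `embed` below is a toy deterministic stand-in for the real model call, which in the actual test goes through vLLM's embedding entrypoint:

```python
# Sketch of a batch-size invariance check like the one contributed to this PR.
# `embed` is a hypothetical stand-in, not the real model call.
import numpy as np


def embed(texts: list[str]) -> np.ndarray:
    # Toy embedding: seed a RNG per text so output depends only on the text,
    # never on which batch the text arrived in.
    rngs = [np.random.default_rng(abs(hash(t)) % (2**32)) for t in texts]
    return np.stack([r.standard_normal(8) for r in rngs])


def check_batch_size_invariance(texts: list[str]) -> None:
    one_by_one = np.concatenate([embed([t]) for t in texts])
    batched = embed(texts)
    assert np.allclose(one_by_one, batched, atol=1e-6)
```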


github-actions bot commented Jul 2, 2025

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure your code passes all the linting checks, otherwise your PR cannot be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

maxdebayser and others added 21 commits July 2, 2025 15:56
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
The changes introduced by PR
vllm-project/vllm#16728
to the sampler architecture were incompatible with
our spyre model runner.

Initially, as a stopgap solution, I copied the old sampling
classes into our vllm_spyre tree just so that we can keep
working on the latest changes from main.

Now this commit reverts that and makes the same logits
processor logic work for the spyre input batch and model
runner classes. The difference from the GPU model runner is
that on Spyre we don't condense the batch but keep a boolean
mask that is used to calculate "dense" request indices. These
indices must be used for the BatchUpdateBuilder because they
are the right ones to slice the `logits` tensor that is passed
to the Sampler.
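A minimal sketch of the mask-to-dense-index mapping the commit message describes (variable names are illustrative, not taken from the code):

```python
# Sketch: map a boolean occupancy mask over batch slots to "dense" request
# indices, i.e. positions in the condensed logits tensor seen by the Sampler.
import numpy as np

# True where a batch slot holds an active request.
mask = np.array([True, False, True, True, False])

# For each slot, the dense index it would have after condensing the batch.
dense_index = np.cumsum(mask) - 1

# Dense indices of the active slots, in slot order.
active_dense = dense_index[mask]
```

The running sum counts how many active slots precede (and include) each position, so active slots get consecutive dense indices starting at 0.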

Signed-off-by: Max de Bayser <[email protected]>
…upstream (#245)"

This reverts commit 962abf1.

Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Gabriel Marinho <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
@maxdebayser maxdebayser changed the title [WIP] V1 embeddings V1 embeddings Jul 15, 2025
@maxdebayser maxdebayser marked this pull request as ready for review July 15, 2025 22:53
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
@maxdebayser maxdebayser enabled auto-merge (squash) July 22, 2025 02:50
@maxdebayser
Collaborator Author

All tests are passing now after the changes from the first round of reviews.

@github-actions github-actions bot added the ready label Jul 22, 2025
Signed-off-by: Max de Bayser <[email protected]>
@yannicks1
Collaborator

From my side this looks good! I have one small thing, apart from the tiny merge conflicts that @maxdebayser will solve within seconds :)

IMO we have one or two files too many in vllm_spyre/v1/worker; it looks cluttered with the three input-batch-related .py files. Upstream has only one input batch file per accelerator (e.g. GPU, TPU), and gpu_input_batch.py alone is 760 lines. Two options:

1. Merge vllm_spyre/v1/worker/spyre_base_input_batch.py, vllm_spyre/v1/worker/spyre_input_batch.py and vllm_spyre/v1/worker/spyre_pooling_input_batch.py into a single file (called spyre_input_batch.py) of around 800 lines.
2. Move only the base class from vllm_spyre/v1/worker/spyre_base_input_batch.py into vllm_spyre/v1/worker/spyre_input_batch.py, similar to how the model runner base class lives in vllm_spyre/v1/worker/spyre_model_runner.py rather than in a separate file.

@yannicks1
Collaborator

Also hoping for @joerunde to give this a final pass (as it is a really big refactoring) and merge once he is back:)

@wallashss
Collaborator

I think it would be nice to update docs/supported_features.md

torch.Tensor, self.token_type_ids_cpu_tensor).numpy()
return self._token_type_ids_cpu

def has_token_types(self) -> bool:
Collaborator

I'm pretty sure this method is not being used, and I was wondering whether the related tensors, like token_type_ids_cpu_tensor, are being used either. I could find them being populated, but not read, if I did the reading correctly.

Collaborator Author

Yes, this is to prepare for changes that haven't been merged upstream yet. I can remove these to simplify the PR for now.

It's not clear in what shape the support will be added upstream.

Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Collaborator

@wallashss wallashss left a comment

LGTM! Thanks for the refactoring of input batch, it makes sense to me.

Just a friendly reminder: update docs/supported_features.md

@maxdebayser maxdebayser merged commit 02f4e63 into main Jul 28, 2025
15 of 18 checks passed
@maxdebayser maxdebayser deleted the v1_embeddings branch July 28, 2025 18:58
@yannicks1 yannicks1 mentioned this pull request Jul 29, 2025
yannicks1 added a commit that referenced this pull request Jul 31, 2025
### [v1] remove v0 code

Now that we have v1 support for embedding models (#277), we can finally
delete the v0 code.
Note: for decoder models, v0 support was deprecated some time ago.

---------

Signed-off-by: Yannick Schnider <[email protected]>
7 participants