✨ Reject requests, and upgrade vllm install to 0.8.0 #37
Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre. Now you are good to go 🚀
Dang, simple manual tests are working but the warmup shapes test that flexes this is failing, so I think there's still a problem :( Will keep digging...
This should be working and good to review now 👍
Looks generally good, I just have one comment about how the requests get rejected, since this seems to be a change vs. what we were doing in V0.
"""Temporary solution to reject requests that were too large to | ||
schedule. This removes the rejected requests from the scheduler, and | ||
returns empty outputs for them with finish reason `abort`. |
The behaviour in V0 was that if the request didn't match any warmup shape, it would be added to the ignore queue. This results in finish reason 'length' and empty string as output. While we didn't really like this behaviour, it was consistent with how vLLM handled similar cases (e.g., prompt length being longer than max model len) in V0. Has this changed in V1?
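For illustration, this is roughly what that V0 behaviour looks like from an offline caller's point of view (a minimal sketch; the model path and prompt are placeholders, and it assumes a V0 engine running with Spyre warmup shapes configured):

# Sketch only: a request that matches no warmup shape comes back "ignored".
from vllm import LLM, SamplingParams

llm = LLM("/models/llama-194m")  # hypothetical model path
outputs = llm.generate("a prompt that fits no warmup shape",
                       SamplingParams(max_tokens=16))

completion = outputs[0].outputs[0]
print(repr(completion.text))     # '' -> empty string as output
print(completion.finish_reason)  # 'length', even though nothing was generated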
@tdoublep Yeah, as far as I can tell the v1 scheduler doesn't have the ability to ignore requests. The new flow in the v1 engine is that scheduling is broken down into two passes, schedule() and update_from_output(). The engine expects every request to eventually schedule at least one token from schedule(), and if no tokens are scheduled for an iteration it'll just go idle waiting for more inputs. Once the model has been invoked, update_from_output() is used to construct the outputs for the engine's callers. If we look at the simple case where we get only one single request and it doesn't fit a warmup shape, we can't just not schedule it, or the engine will go idle and we never get a chance to pass a result back to the caller. So we have to schedule at least one dummy token and then pass back an empty result from update_from_output(), which is a bit sad.
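In other words, the temporary rejection path looks roughly like the following (a simplified sketch, not the exact diff; the helper name and the scheduler bookkeeping around it are approximations, only EngineCoreOutput / FinishReason / RequestStatus come from the actual change):

# Rough sketch of the reject flow described above.
def _reject_oversized_requests(self, rejected_requests):
    reject_outputs = []
    for request in rejected_requests:
        # The request was "scheduled" with a dummy token so the engine
        # doesn't go idle, but we never want real output for it.
        reject_outputs.append(
            EngineCoreOutput(request.request_id,
                             new_token_ids=[],
                             finish_reason=FinishReason.ABORT,
                             stop_reason="Request did not fit any warmup "
                                         "shape"))
        request.status = RequestStatus.FINISHED_ABORTED
        # Drop it from the scheduler's bookkeeping so it isn't scheduled again
        # (exact bookkeeping here is approximate).
        self.requests.pop(request.request_id, None)
    return reject_outputs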
As far as I can tell, all of the request validation has been moved ahead of the scheduler, so that for cases like len(prompt) > max_model_len the user gets an error and the scheduler is never invoked for that request. For online requests, that's handled in the API server for both V0 and V1:
{
"object": "error",
"message": "This model's maximum context length is 100 tokens. However, you requested 153 tokens (137 in the messages, 16 in the completion). Please reduce the length of the messages or completion.",
"type": "BadRequestError",
"param": null,
"code": 400
}
But the offline entrypoint doesn't do as much request validation; with V1 the engine handles this during "input processing" and will now raise a ValueError on any invalid inputs:
model = LLM("/models/llama-194m", max_model_len=100)
model.generate("hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/utils.py", line 1080, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/llm.py", line 462, in generate
self._validate_and_add_requests(
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/llm.py", line 1310, in _validate_and_add_requests
self._add_request(
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/llm.py", line 1328, in _add_request
self.llm_engine.add_request(
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 164, in add_request
request = self.processor.process_inputs(request_id, prompt, params,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/v1/engine/processor.py", line 141, in process_inputs
self._validate_model_inputs(processed_inputs)
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/v1/engine/processor.py", line 266, in _validate_model_inputs
raise ValueError(
ValueError: Prompt length of 137 is longer than the maximum model length of 100.
IIUC, the input processing step was removed from the main engine loop in V1 so that it could better handle these cases. Previously with V0, any errors would kill the engine completely; now V1 can raise any validation errors it wants before the request gets to the main engine loop. But I do see from Robert's incoming changes here that there should soon be per-request failure handling from inside the engine loop, which may allow us to raise a ValueError from our scheduler; that would be a lot better than this hack.
However, I think the best approach would be to let plugins hook into the v1 engine's input processing step so that we can properly enforce the new assumption that every request that makes it to the scheduler is schedulable. I can start talking to maintainers about that, but I did still want to get this temporary fix in so we can keep moving forward.
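To make the intent concrete, the kind of check that could run at input-processing time would be something like this (purely hypothetical helper, no such hook exists today; the warmup-shape fields prompt_length and new_tokens are assumptions for illustration):

# Hypothetical validation a plugin could run before a request ever reaches
# the scheduler, mirroring the ValueError the v1 processor already raises
# for prompts longer than max_model_len.
def validate_against_warmup_shapes(prompt_len, max_tokens, warmup_shapes):
    for shape in warmup_shapes:
        if prompt_len <= shape.prompt_length and max_tokens <= shape.new_tokens:
            return  # fits at least one warmup shape, nothing to do
    raise ValueError(
        f"Request with prompt length {prompt_len} and max_tokens "
        f"{max_tokens} does not fit any warmup shape")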
I just removed the backwards compatibility for 0.7.3, since 0.8.0+ is required anyway for the pluggable scheduler
LGTM in general, dropped a small comment on the choice of FinishReason and RequestStatus when rejecting requests.
reject_outputs.append(
    EngineCoreOutput(request.request_id,
                     new_token_ids=[],
                     finish_reason=FinishReason.ABORT,
                     stop_reason="Request did not fit any warmup "
                                 "shape"))
request.status = RequestStatus.FINISHED_ABORTED
@joerunde Would your solution still work when you put:
reject_outputs.append(
EngineCoreOutput(request.request_id,
new_token_ids=[],
finish_reason=FinishReason.LENGTH,
stop_reason="Request did not fit any warmup "
"shape"))
request.status = RequestStatus.FINISHED_IGNORED
here? As @tdoublep mentioned, this would be more consistent with how we used to handle it previously.
A question regarding your comment: does this mean that for both online and offline inference the RequestStatus and FinishReason are never set, because an error is raised beforehand? Are FinishReason.LENGTH and RequestStatus.FINISHED_IGNORED therefore obsolete here? If that is the case, I don't have any preference...
Yeah, as far as I can tell the RequestStatus.FINISHED_IGNORED: FinishReason.LENGTH entry in that map for v1 is unused. It looks like the new behavior is to raise an error when a prompt is too long. I guess I don't really care either way about what the finish_reason here is, so I can change it to length. I'd like to get this in, and then moving forward we can figure out how best to match v1's behavior of raising an error. I'll check on the vLLM Slack though and make sure that that's the intended behavior.
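(For context, the map in question is the v1 status-to-finish-reason mapping; as far as I remember it looks roughly like the reconstruction below, though the exact name and contents may differ.)

# Approximate reconstruction of the v1 RequestStatus -> FinishReason map;
# the FINISHED_IGNORED -> LENGTH entry is the one that appears unused now
# that too-long prompts raise a ValueError during input processing.
_FINISHED_REASON_MAP = {
    RequestStatus.FINISHED_STOPPED: FinishReason.STOP,
    RequestStatus.FINISHED_LENGTH_CAPPED: FinishReason.LENGTH,
    RequestStatus.FINISHED_ABORTED: FinishReason.ABORT,
    RequestStatus.FINISHED_IGNORED: FinishReason.LENGTH,
}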
Sounds good. Also no strong opinion here, just asked for better understanding.
LGTM!
Slack thread about this here: https://vllm-dev.slack.com/archives/C087RA55P0D/p1742828751236509
I'll go ahead and merge this in so we can upgrade, and will keep pulling on that Slack thread to figure out how to iterate on this behavior
    logit_bias=[None for _ in range(num_reqs)],
    allowed_token_ids_mask=None,
    bad_words_token_ids={})
Have we updated vllm?
This PR implements plugin-side rejection of requests that don't fit a warmup shape
Closes #16
The reason I did it this way instead of opening an upstream PR for vllm to allow the scheduler to reject requests is that the real fix is validation up front in add_request, so that the user gets a real 400-type error, which IMHO would be much better than a 200 with empty text.
This PR also updates the vllm install to 0.8.0, which is the latest working release for the plugin