Conversation

joerunde
Collaborator

This PR implements plugin-side rejection of requests that don't fit a warmup shape

Closes #16

The reason I did it this way instead of opening an upstream PR for vllm to allow the scheduler to reject requests is:

  • There are open PRs ([V1][Frontend] Improve Shutdown And Logs: vllm#11737 and vllm#14048) that should allow us to actually raise an error from add_request so that the user gets a real 400-type error, which IMHO would be much better than a 200 with empty text
  • We're accumulating a lot of drift with the v1 implementation, and I'd prefer to get a fix into the plugin now rather than wait on upstream reviews for a feature that the stock v1 scheduler doesn't support, so that we can resume testing against vllm main (or at least the latest release). There's already a blocker where the plugin won't boot with 0.8.1 that we need to address :(

This PR also updates the vllm install to 0.8.0, which is the latest working release for the plugin

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure your code passes all the linting checks, otherwise your PR can't be merged. To do so, first install the linting requirements, then run format.sh and commit the changes:

pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

Signed-off-by: Joe Runde <[email protected]>
@joerunde
Collaborator Author

Dang, simple manual tests are working, but the warmup shapes test that exercises this is failing, so I think there's still a problem :(

Will keep digging...

@joerunde
Collaborator Author

This should be working and good to review now 👍

Member

@tdoublep tdoublep left a comment

Looks generally good, just have one comment about how the requests get rejected, since this seems to be a change vs. what we were doing in V0.

Comment on lines +152 to +154
"""Temporary solution to reject requests that were too large to
schedule. This removes the rejected requests from the scheduler, and
returns empty outputs for them with finish reason `abort`.
Member

The behaviour in V0 was that if the request didn't match any warmup shape, it would be added to the ignore queue. This results in finish reason 'length' and an empty string as output. While we didn't really like this behaviour, it was consistent with how vLLM handled similar cases (e.g., prompt length being longer than max model len) in V0. Has this changed in V1?

Collaborator Author

@tdoublep Yeah, as far as I can tell the v1 scheduler doesn't have the ability to ignore requests. The new flow in the v1 engine is that scheduling is broken down into two passes, schedule() and update_from_output(). The engine expects every request to eventually have at least one token scheduled by schedule(), and if no tokens are scheduled for an iteration it'll just go idle waiting for more inputs. Once the model has been invoked, update_from_output() is used to construct the outputs for the engine's callers.

If we look at the simple case where we get a single request and it doesn't fit a warmup shape, we can't just not schedule it, or the engine will go idle and we never get a chance to pass a result back to the caller. So we have to schedule at least one dummy token and then pass back an empty result from update_from_output(), which is a bit sad.
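To make that concrete, here's roughly what one iteration of that loop looks like. This is a simplified sketch under my own assumptions about the object APIs (including the total_num_scheduled_tokens attribute), not vLLM's actual code:

# Sketch only: one iteration of the V1 engine core loop, heavily simplified.
def engine_step(scheduler, model_executor):
    scheduler_output = scheduler.schedule()
    if scheduler_output.total_num_scheduled_tokens == 0:
        # Nothing scheduled: the engine goes idle and waits for more input,
        # so a request that is never scheduled never yields an output.
        return []
    model_output = model_executor.execute_model(scheduler_output)
    # Outputs for the engine's callers are constructed here; this is where the
    # plugin can return an empty output for a request that doesn't fit any
    # warmup shape.
    return scheduler.update_from_output(scheduler_output, model_output)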

As far as I can tell, all of the request validation has been moved ahead of the scheduler, so that for cases like len(prompt) > max_model_len, the user gets an error and the scheduler is never invoked for that request. For online requests, that's handled in the API server for both V0 and V1:

{
  "object": "error",
  "message": "This model's maximum context length is 100 tokens. However, you requested 153 tokens (137 in the messages, 16 in the completion). Please reduce the length of the messages or completion.",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}

But the offline entrypoint doesn't do as much request validation; with V1, the engine handles this during "input processing" and will now raise a ValueError on any invalid inputs:

model = LLM("/models/llama-194m", max_model_len=100)
model.generate("hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/utils.py", line 1080, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/llm.py", line 462, in generate
    self._validate_and_add_requests(
  File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/llm.py", line 1310, in _validate_and_add_requests
    self._add_request(
  File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/llm.py", line 1328, in _add_request
    self.llm_engine.add_request(
  File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 164, in add_request
    request = self.processor.process_inputs(request_id, prompt, params,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/v1/engine/processor.py", line 141, in process_inputs
    self._validate_model_inputs(processed_inputs)
  File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/v1/engine/processor.py", line 266, in _validate_model_inputs
    raise ValueError(
ValueError: Prompt length of 137 is longer than the maximum model length of 100.

IIUC, the input processing step was removed from the main engine loop in V1 so that it could better handle these cases. Previously with V0, any errors would kill the engine completely; now V1 can raise any validation errors it wants before the request gets to the main engine loop. But I do see from Robert's incoming changes here that there should soon be per-request failure handling from inside the engine loop, which may allow us to raise a ValueError from our scheduler; that would be a lot better than this hack.

However, I think the best approach would be to let the plugins hook into the V1 engine's input processing step so that we can properly enforce the new assumption that every request that makes it to the scheduler is schedulable. I can start talking to maintainers about that, but I did still want to get this temporary fix in so we can keep moving forward.
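For the record, the kind of check such a hook would let us run looks something like this. This is purely hypothetical: no plugin hook into the V1 input processing step exists today, and the function name, parameters, and warmup_shapes structure here are made up for illustration:

# Hypothetical validation that an input-processing hook could run, so that a
# request fitting no warmup shape fails up front with a ValueError, mirroring
# how V1 already rejects prompts longer than max_model_len.
# `warmup_shapes` is assumed to be a list of (prompt_length, new_tokens) pairs.
def validate_against_warmup_shapes(prompt_len: int, max_new_tokens: int,
                                   warmup_shapes: list[tuple[int, int]]) -> None:
    if not any(prompt_len <= shape_prompt_len and max_new_tokens <= shape_new_tokens
               for shape_prompt_len, shape_new_tokens in warmup_shapes):
        raise ValueError(
            f"Request with prompt length {prompt_len} and max_new_tokens "
            f"{max_new_tokens} does not fit any warmup shape")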

@joerunde
Collaborator Author

I just removed the backwards compatibility for 0.7.3, since I remembered that 0.8.0+ is required anyway for the pluggable scheduler

Collaborator

@yannicks1 yannicks1 left a comment

LGTM in general, dropped a small comment on the choice of FinishReason and RequestStatus when rejecting requests.

Comment on lines +179 to +185
            reject_outputs.append(
                EngineCoreOutput(request.request_id,
                                 new_token_ids=[],
                                 finish_reason=FinishReason.ABORT,
                                 stop_reason="Request did not fit any warmup "
                                 "shape"))
            request.status = RequestStatus.FINISHED_ABORTED
Collaborator

@joerunde Would your solution still work if you put:

            reject_outputs.append(
                EngineCoreOutput(request.request_id,
                                 new_token_ids=[],
                                 finish_reason=FinishReason.LENGTH,
                                 stop_reason="Request did not fit any warmup "
                                 "shape"))
            request.status = RequestStatus.FINISHED_IGNORED

here? As @tdoublep mentioned, this would be more consistent with how we handled it previously.
A question regarding your comment: does this mean that for both online and offline inference the RequestStatus and FinishReason are never set, but an error is raised beforehand? Are FinishReason.LENGTH and RequestStatus.FINISHED_IGNORED therefore obsolete here? If that is the case, I don't have any preference...

Collaborator Author

Yeah, as far as I can tell the RequestStatus.FINISHED_IGNORED: FinishReason.LENGTH entry in that map is unused in v1. It looks like the new behavior is to raise an error when a prompt is too long. I don't really care either way what the finish_reason here is, so I can change it to length. I'd like to get this in, and then moving forward we can figure out how best to match v1's behavior of raising an error. I'll check on the vLLM Slack, though, and make sure that that's the intended behavior.

Collaborator

Sounds good. Also no strong opinion here, just asked for better understanding.

Collaborator

@yannicks1 yannicks1 left a comment

LGTM!

@joerunde
Collaborator Author

Slack thread about this here: https://vllm-dev.slack.com/archives/C087RA55P0D/p1742828751236509

@joerunde
Collaborator Author

I'll go ahead and merge this in so we can upgrade. I'll keep pulling on that Slack thread and figure out how to iterate on this behavior.

@joerunde joerunde merged commit 6fb779a into vllm-project:main Mar 24, 2025
10 checks passed
@joerunde joerunde deleted the reject-requests branch March 24, 2025 15:41
logit_bias=[None for _ in range(num_reqs)],
allowed_token_ids_mask=None,
bad_words_token_ids={})
Collaborator

Have we updated vllm?
