✨ Reject requests, and upgrade vllm install to 0.8.0 #37
Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre. Now you are good to go 🚀
Dang, simple manual tests are working but the warmup shapes test that flexes this is failing, so I think there's still a problem :( Will keep digging...
This should be working and good to review now 👍
Looks generally good, I just have one comment about how the requests get rejected, since this seems to be a change vs. what we were doing in V0.
"""Temporary solution to reject requests that were too large to | ||
schedule. This removes the rejected requests from the scheduler, and | ||
returns empty outputs for them with finish reason `abort`. |
The behaviour in V0 was that if the request didn't match any warmup shape, it would be added to the ignore queue. This results in finish reason 'length' and empty string as output. While we didn't really like this behaviour, it was consistent with how vLLM handled similar cases (e.g., prompt length being longer than max model len) in V0. Has this changed in V1?
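For illustration, this is roughly what that V0 behaviour looks like from an offline caller's point of view (a minimal sketch; the model path and prompt are placeholders, and it assumes a V0 engine running with Spyre warmup shapes configured):

# Sketch only: a request that matches no warmup shape comes back "ignored".
from vllm import LLM, SamplingParams

llm = LLM("/models/llama-194m")  # hypothetical model path
outputs = llm.generate("a prompt that fits no warmup shape",
                       SamplingParams(max_tokens=16))

completion = outputs[0].outputs[0]
print(repr(completion.text))     # '' -> empty string as output
print(completion.finish_reason)  # 'length', even though nothing was generated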
@tdoublep Yeah, as far as I can tell the v1 scheduler doesn't have the ability to ignore requests. The new flow in the v1 engine is that scheduling is broken down into two passes, schedule() and update_from_output(). The engine expects every request to eventually schedule at least one token from schedule(), and if no tokens are scheduled for an iteration it'll just go idle waiting for more inputs. Once the model has been invoked, update_from_output() is used to construct the outputs for the engine's callers. If we look at the simple case where we get only one single request and it doesn't fit a warmup shape, we can't just not schedule it, or the engine will go idle and we never get a chance to pass a result back to the caller. So we have to schedule at least one dummy token and then pass back an empty result from update_from_output(), which is a bit sad.
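In other words, the temporary rejection path looks roughly like the following (a simplified sketch, not the exact diff; the helper name and the scheduler bookkeeping around it are approximations, only EngineCoreOutput / FinishReason / RequestStatus come from the actual change):

# Rough sketch of the reject flow described above.
def _reject_oversized_requests(self, rejected_requests):
    reject_outputs = []
    for request in rejected_requests:
        # The request was "scheduled" with a dummy token so the engine
        # doesn't go idle, but we never want real output for it.
        reject_outputs.append(
            EngineCoreOutput(request.request_id,
                             new_token_ids=[],
                             finish_reason=FinishReason.ABORT,
                             stop_reason="Request did not fit any warmup "
                                         "shape"))
        request.status = RequestStatus.FINISHED_ABORTED
        # Drop it from the scheduler's bookkeeping so it isn't scheduled again
        # (exact bookkeeping here is approximate).
        self.requests.pop(request.request_id, None)
    return reject_outputs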
As far as I can tell, all of the request validation has been moved ahead of the scheduler, so that for cases like len(prompt) > max_model_len the user gets an error and the scheduler is never invoked for that request. For online requests, that's handled in the API server for both V0 and V1:
{
"object": "error",
"message": "This model's maximum context length is 100 tokens. However, you requested 153 tokens (137 in the messages, 16 in the completion). Please reduce the length of the messages or completion.",
"type": "BadRequestError",
"param": null,
"code": 400
}
But the offline entrypoint doesn't do as much request validation; with V1 the engine handles this during "input processing" and will now raise a ValueError on any invalid inputs:
model = LLM("/models/llama-194m", max_model_len=100)
model.generate("hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 hello 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/utils.py", line 1080, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/llm.py", line 462, in generate
self._validate_and_add_requests(
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/llm.py", line 1310, in _validate_and_add_requests
self._add_request(
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/llm.py", line 1328, in _add_request
self.llm_engine.add_request(
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 164, in add_request
request = self.processor.process_inputs(request_id, prompt, params,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/v1/engine/processor.py", line 141, in process_inputs
self._validate_model_inputs(processed_inputs)
File "/home/senuser/my-vllm/lib64/python3.11/site-packages/vllm/v1/engine/processor.py", line 266, in _validate_model_inputs
raise ValueError(
ValueError: Prompt length of 137 is longer than the maximum model length of 100.
IIUC, the input processing step was removed from the main engine loop in V1 so that it could better handle these cases. Previously with V0, any errors would kill the engine completely; now V1 can raise any validation errors it wants before the request gets to the main engine loop. But I do see from Robert's incoming changes here that there should soon be per-request failure handling from inside the engine loop, which may allow us to raise a ValueError from our scheduler; that would be a lot better than this hack.
However, I think the best approach would be to let plugins hook into the v1 engine's input processing step so that we can properly enforce the new assumption that every request that makes it to the scheduler is schedulable. I can start talking to maintainers about that, but I did still want to get this temporary fix in so we can keep moving forward.
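To make the intent concrete, the kind of check that could run at input-processing time would be something like this (purely hypothetical helper, no such hook exists today; the warmup-shape fields prompt_length and new_tokens are assumptions for illustration):

# Hypothetical validation a plugin could run before a request ever reaches
# the scheduler, mirroring the ValueError the v1 processor already raises
# for prompts longer than max_model_len.
def validate_against_warmup_shapes(prompt_len, max_tokens, warmup_shapes):
    for shape in warmup_shapes:
        if prompt_len <= shape.prompt_length and max_tokens <= shape.new_tokens:
            return  # fits at least one warmup shape, nothing to do
    raise ValueError(
        f"Request with prompt length {prompt_len} and max_tokens "
        f"{max_tokens} does not fit any warmup shape")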
I just removed the backwards compatibility for 0.7.3, since 0.8.0+ is required anyway for the pluggable scheduler
LGTM in general, dropped a small comment on the choice of FinishReason and RequestStatus when rejecting requests.
reject_outputs.append(
    EngineCoreOutput(request.request_id,
                     new_token_ids=[],
                     finish_reason=FinishReason.ABORT,
                     stop_reason="Request did not fit any warmup "
                                 "shape"))
request.status = RequestStatus.FINISHED_ABORTED
@joerunde Would your solution still work when you put:
reject_outputs.append(
EngineCoreOutput(request.request_id,
new_token_ids=[],
finish_reason=FinishReason.LENGTH,
stop_reason="Request did not fit any warmup "
"shape"))
request.status = RequestStatus.FINISHED_IGNORED
here? As @tdoublep mentioned, this would be more consistent with how we used to handle it previously.
A question regarding your comment: does this mean that for both online and offline inference the RequestStatus and FinishReason are never set, because an error is raised beforehand? Are FinishReason.LENGTH and RequestStatus.FINISHED_IGNORED therefore obsolete here? If that is the case, I don't have any preference...
Yeah, as far as I can tell the RequestStatus.FINISHED_IGNORED: FinishReason.LENGTH entry in that map for v1 is unused. It looks like the new behavior is to raise an error when a prompt is too long. I guess I don't really care either way about what the finish_reason here is, so I can change it to length. I'd like to get this in, and then moving forward we can figure out how best to match v1's behavior of raising an error. I'll check on the vLLM Slack though and make sure that that's the intended behavior.
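(For context, the map in question is the v1 status-to-finish-reason mapping; as far as I remember it looks roughly like the reconstruction below, though the exact name and contents may differ.)

# Approximate reconstruction of the v1 RequestStatus -> FinishReason map;
# the FINISHED_IGNORED -> LENGTH entry is the one that appears unused now
# that too-long prompts raise a ValueError during input processing.
_FINISHED_REASON_MAP = {
    RequestStatus.FINISHED_STOPPED: FinishReason.STOP,
    RequestStatus.FINISHED_LENGTH_CAPPED: FinishReason.LENGTH,
    RequestStatus.FINISHED_ABORTED: FinishReason.ABORT,
    RequestStatus.FINISHED_IGNORED: FinishReason.LENGTH,
}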
Sounds good. Also no strong opinion here, just asked for better understanding.
LGTM!
Slack thread about this here: https://vllm-dev.slack.com/archives/C087RA55P0D/p1742828751236509
I'll go ahead and merge this in so we can upgrade, and will keep pulling on that Slack thread to figure out how to iterate on this behavior
    logit_bias=[None for _ in range(num_reqs)],
    allowed_token_ids_mask=None,
    bad_words_token_ids={})
Have we updated vllm?
This PR implements plugin-side rejection of requests that don't fit a warmup shape
Closes #16
The reason I did it this way instead of opening an upstream PR for vllm to allow the scheduler to reject requests is that the real fix is validation up front in add_request, so that the user gets a real 400-type error, which IMHO would be much better than a 200 with empty text.
This PR also updates the vllm install to 0.8.0, which is the latest working release for the plugin