[Core] Increase default `max_num_batched_tokens` for multimodal models #8028
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:

- Add `ready` label to the PR
- Enable auto-merge.

🚀
Changed the title from "[Core] Increase default `max_num_batched_tokens` to 8192 for multimodal models" to "[Core] Increase default `max_num_batched_tokens` for multimodal models".
LGTM! Thanks for the quick fix.
Hmm... seems that the Fuyu test cannot run in CI with the increased `max_num_batched_tokens`.
Enabling chunked prefill causes some confusing errors for multimodal models, as `max_num_batched_tokens < num_multimodal_tokens` leads to a mismatched placeholder count when running the model.

This PR partially solves this issue by increasing the default `max_num_batched_tokens` for multimodal models so that it is sufficient for most cases.

As indicated by the TODO, it would be more ideal to determine the number of multimodal tokens in the prompt and raise an error if we detect that chunked prefill would truncate them. However, this requires some refactoring for `LLMEngine` to access the multimodal registry used in the `ModelRunner`, so let's leave that to another PR.

As mentioned by @ywang96, another improvement would be to dynamically set the default `max_num_batched_tokens`, but that also requires access to the `ModelRunner`, as the maximum number of multimodal tokens is only available after `init_mm_limits_per_prompt` is called.

FIX #7996
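As a rough illustration of the behavior described above, here is a minimal sketch of how such a model-dependent default could be resolved. The constant values and the names (`resolve_max_num_batched_tokens`, `is_multimodal_model`) are assumptions for this example, not vLLM's actual implementation:

```python
# Hypothetical sketch only -- names and values are illustrative, not vLLM's code.

DEFAULT_MAX_NUM_BATCHED_TOKENS = 512        # assumed generic chunked-prefill default
MULTIMODAL_MAX_NUM_BATCHED_TOKENS = 8192    # assumed larger multimodal default


def resolve_max_num_batched_tokens(
    user_value,            # explicit user setting, or None for "use the default"
    is_multimodal_model,   # whether the loaded model consumes image/audio tokens
):
    """Return the per-step token budget for the scheduler."""
    if user_value is not None:
        # An explicit user setting always wins over any default.
        return user_value
    if is_multimodal_model:
        # With chunked prefill, a budget smaller than the number of multimodal
        # placeholder tokens would split an image's placeholders across chunks,
        # producing the mismatched-placeholder errors this PR works around.
        return MULTIMODAL_MAX_NUM_BATCHED_TOKENS
    return DEFAULT_MAX_NUM_BATCHED_TOKENS
```

Users who still hit the mismatch with an unusually large multimodal prompt can raise the budget themselves, e.g. via the `max_num_batched_tokens` engine argument (`--max-num-batched-tokens` on the CLI), which overrides the default regardless of model type.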