[Core] Increase default max_num_batched_tokens for multimodal models #8028

Merged: 4 commits into vllm-project:main on Aug 30, 2024

Conversation

@DarkLight1337 (Member) commented on Aug 30, 2024

Enabling chunked prefill causes some confusing errors for multimodal models: when max_num_batched_tokens < num_multimodal_tokens, the prefill chunk cuts off part of the multimodal tokens, which leads to a mismatched placeholder count when running the model.

This PR partially solves this issue by increasing the default max_num_batched_tokens for multimodal models so that it is sufficient for most cases.

As indicated by the TODO, it would be better to determine the number of multimodal tokens in the prompt and raise an error if we detect that chunked prefill would truncate them. However, this requires some refactoring so that LLMEngine can access the multimodal registry used by the ModelRunner, so let's leave that to another PR.

As mentioned by @ywang96, another improvement would be to dynamically set the default max_num_batched_tokens, but that also requires access to the ModelRunner, as the maximum number of multimodal tokens is only available after init_mm_limits_per_prompt is called.

FIX #7996
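
To illustrate the idea (this is not the actual diff in this PR), here is a minimal sketch of how such a conditional default could be chosen; the constant names and the specific values below are assumptions for illustration only:

```python
# Illustrative sketch only -- constant names and values are assumptions,
# not the actual change in this PR. The point: multimodal prompts expand
# into many placeholder tokens, so the chunked-prefill token budget must
# be large enough to keep all of them in a single prefill chunk.

_DEFAULT_MAX_NUM_BATCHED_TOKENS = 2048        # assumed text-only default
_MULTIMODAL_MAX_NUM_BATCHED_TOKENS = 4096     # raised default for multimodal models


def resolve_max_num_batched_tokens(
    user_value: int | None,
    is_multimodal_model: bool,
) -> int:
    """Pick the per-step token budget used by the scheduler."""
    if user_value is not None:
        # Respect an explicit user setting, even if it could truncate
        # multimodal tokens (detecting that case is left as a TODO).
        return user_value
    if is_multimodal_model:
        return _MULTIMODAL_MAX_NUM_BATCHED_TOKENS
    return _DEFAULT_MAX_NUM_BATCHED_TOKENS
```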

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI is run, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge

🚀

@DarkLight1337 DarkLight1337 changed the title [Core] Increase max_num_batched_tokens to 8192 for multimodal models [Core] Increase default max_num_batched_tokens for multimodal models Aug 30, 2024

@ywang96 (Member) left a comment

LGTM! Thanks for the quick fix.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) August 30, 2024 10:17
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Aug 30, 2024
@DarkLight1337 (Member, Author) commented on Aug 30, 2024

Hmm... it seems that the Fuyu test cannot run in CI with the increased max_num_batched_tokens=8192 (it works locally). I'll reduce it to 4096 then.
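
For reference, the value can also be overridden explicitly when constructing the engine; a minimal sketch, assuming the LLM entrypoint forwards these engine arguments (the model name is chosen only as an example):

```python
# Sketch of overriding the scheduler token budget from user code.
# Keyword availability may vary across vLLM versions; treat this as an assumption.
from vllm import LLM

llm = LLM(
    model="adept/fuyu-8b",          # example multimodal model
    enable_chunked_prefill=True,
    max_num_batched_tokens=4096,    # large enough for the image placeholder tokens
)
```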

@WoosukKwon WoosukKwon disabled auto-merge August 30, 2024 15:20
@WoosukKwon WoosukKwon merged commit 98cef6a into vllm-project:main Aug 30, 2024
35 of 38 checks passed
@DarkLight1337 DarkLight1337 deleted the fix-mm-prefill branch August 30, 2024 15:20
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025
Labels
ready (ONLY add when PR is ready to merge/full CI is needed)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: InternVL2-26B infer error: Attempted to assign 7 x 256 = 1792 multimodal tokens to 506 placeholders (#7996)
3 participants