[Core] Increase default `max_num_batched_tokens` for multimodal models #8028
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:

- Add `ready` label to the PR
- Enable auto-merge.

🚀
Changed the title from "[Core] Increase default `max_num_batched_tokens` to 8192 for multimodal models" to "[Core] Increase default `max_num_batched_tokens` for multimodal models".
LGTM! Thanks for the quick fix.
Hmm... seems that the Fuyu test cannot run in CI with the increased `max_num_batched_tokens`.
Enabling chunked prefill causes some confusing errors for multimodal models, as `max_num_batched_tokens < num_multimodal_tokens` leads to a mismatched placeholder count when running the model.

This PR partially solves this issue by increasing the default `max_num_batched_tokens` for multimodal models so that it is sufficient for most cases.

As indicated by the TODO, it would be more ideal to determine the number of multimodal tokens in the prompt and raise an error if we detect that chunked prefill would truncate them. However, this requires some refactoring for `LLMEngine` to access the multimodal registry used in the `ModelRunner`, so let's leave that to another PR.

As mentioned by @ywang96, another improvement would be to dynamically set the default `max_num_batched_tokens`, but that also requires access to the `ModelRunner`, as the maximum number of multimodal tokens is only available after `init_mm_limits_per_prompt` is called.

FIX #7996
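As a rough illustration of the behavior described above, here is a minimal sketch of how such a model-dependent default could be resolved. The constant values and the names (`resolve_max_num_batched_tokens`, `is_multimodal_model`) are assumptions for this example, not vLLM's actual implementation:

```python
# Hypothetical sketch only -- names and values are illustrative, not vLLM's code.

DEFAULT_MAX_NUM_BATCHED_TOKENS = 512        # assumed generic chunked-prefill default
MULTIMODAL_MAX_NUM_BATCHED_TOKENS = 8192    # assumed larger multimodal default


def resolve_max_num_batched_tokens(
    user_value,            # explicit user setting, or None for "use the default"
    is_multimodal_model,   # whether the loaded model consumes image/audio tokens
):
    """Return the per-step token budget for the scheduler."""
    if user_value is not None:
        # An explicit user setting always wins over any default.
        return user_value
    if is_multimodal_model:
        # With chunked prefill, a budget smaller than the number of multimodal
        # placeholder tokens would split an image's placeholders across chunks,
        # producing the mismatched-placeholder errors this PR works around.
        return MULTIMODAL_MAX_NUM_BATCHED_TOKENS
    return DEFAULT_MAX_NUM_BATCHED_TOKENS
```

Users who still hit the mismatch with an unusually large multimodal prompt can raise the budget themselves, e.g. via the `max_num_batched_tokens` engine argument (`--max-num-batched-tokens` on the CLI), which overrides the default regardless of model type.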