Closed
Labels
feature request: New feature or request
Description
🚀 The feature, motivation and pitch
Most multimodal models already accept image embeddings as input; see the earlier PR #6613.
IMO there is no reason not to support this for Qwen2-VL as well.
I started adding this feature to Qwen2-VL but ran into some difficulties.
For example, I can't generate the new prompt_token_ids from the image embedding alone, because the existing code reads the size from the original image. See here:
height, width = get_image_size(image, channel_dim=input_data_format)
And here, if we return only the image embeddings, an error is raised: AssertionError: mrope embedding type requires multi-modal input mapper returns 'image_grid_thw' or 'video_grid_thw'.
Do we need to pass through more parameters for Qwen2-VL? Please give me some tips.
Here is my draft code: #8856
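To make the question concrete, here is a minimal sketch of what passing the grid shape alongside the embeddings could look like. It is not the draft PR's code and not an existing vLLM API: `ImageEmbeddingInput`, `num_placeholder_tokens` and `validate_embedding_input` are hypothetical names, and the spatial merge size of 2 is an assumption taken from the published Qwen2-VL configs. The idea is that the number of image placeholder tokens can be derived from `image_grid_thw` instead of from the original image.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumption: Qwen2-VL merges each 2x2 block of patches into one token.
SPATIAL_MERGE_SIZE = 2


@dataclass
class ImageEmbeddingInput:
    """Hypothetical container: precomputed embeddings plus their grid shapes."""
    image_embeds: Sequence[Sequence[float]]      # (num_tokens, hidden_size)
    image_grid_thw: List[Tuple[int, int, int]]   # one (t, h, w) per image


def num_placeholder_tokens(grid_thw: Tuple[int, int, int],
                           merge_size: int = SPATIAL_MERGE_SIZE) -> int:
    """Number of image tokens for one image, computed from its patch grid."""
    t, h, w = grid_thw
    if h % merge_size or w % merge_size:
        raise ValueError(f"grid {grid_thw} is not divisible by {merge_size}")
    return t * (h // merge_size) * (w // merge_size)


def validate_embedding_input(inp: ImageEmbeddingInput) -> int:
    """Check that the embedding rows match the grids; return the token count."""
    expected = sum(num_placeholder_tokens(g) for g in inp.image_grid_thw)
    if len(inp.image_embeds) != expected:
        raise ValueError(
            f"got {len(inp.image_embeds)} embedding rows, expected {expected}")
    return expected
```

With something like this, the prompt could be expanded with the right number of image placeholder tokens without calling get_image_size on the image, and the same `image_grid_thw` would be available for the mrope position computation.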
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.