Enable QWen VL video preprocess #2514
Please share the CVS ticket number in JIRA so this can be assessed from an architecture perspective.
Thanks.
Signed-off-by: xipingya <[email protected]>
Only calculate once for video processing. Signed-off-by: xipingya <[email protected]>
2: add ov::Property::video Signed-off-by: xipingya <[email protected]>
Co-authored-by: Wanglei Shen <[email protected]>
Pull Request Overview
Enables video processing for QWen VL models by adding video input support throughout the VLM pipeline. The main change allows QWen VL models to handle video input through temporal patch processing, which groups video frames and merges them into combined patches for more efficient processing.
- Adds video parameter support to all VLM pipeline interfaces and binding functions
- Implements video encoding functionality specifically for QWen2VL models
- Updates the generation config to include an `is_video` flag for video-specific processing
Reviewed Changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
src/python/py_vlm_pipeline.cpp | Adds video parameter to Python VLM pipeline bindings |
src/python/py_continuous_batching_pipeline.cpp | Updates continuous batching pipeline with video support |
src/python/openvino_genai/py_openvino_genai.pyi | Adds is_video property and video parameter to type stubs |
src/cpp/src/visual_language/vision_encoder.hpp | Adds virtual encode_video method to base VisionEncoder |
src/cpp/src/visual_language/qwen2vl/classes.hpp | Declares video encoding implementation for QWen2VL |
src/cpp/src/visual_language/qwen2vl/classes.cpp | Implements video preprocessing and encoding logic |
src/cpp/src/visual_language/pipeline_base.hpp | Updates base pipeline interface to include video parameter |
src/cpp/src/visual_language/pipeline.cpp | Updates main pipeline implementation with video support |
src/cpp/src/visual_language/llava_next/classes.hpp | Updates LLaVANext interface with video parameter |
src/cpp/src/visual_language/llava_next/classes.cpp | Adds video warning for unsupported models |
src/cpp/src/visual_language/llava/classes.hpp | Updates LLaVA interface with video parameter |
src/cpp/src/visual_language/llava/classes.cpp | Adds video warning for unsupported models |
src/cpp/src/visual_language/inputs_embedder.hpp | Updates inputs embedder interface for video support |
src/cpp/src/visual_language/inputs_embedder.cpp | Implements video encoding routing logic |
src/cpp/src/visual_language/continuous_batching_adapter.hpp | Updates adapter interface with video parameter |
src/cpp/src/visual_language/clip.cpp | Optimizes bicubic resize with early exit for same-size images |
src/cpp/src/continuous_batching/pipeline_impl.cpp | Updates implementation to handle video parameters |
src/cpp/src/continuous_batching/pipeline_base.hpp | Updates base interface with video support |
src/cpp/src/continuous_batching/pipeline_base.cpp | Implements video parameter handling in pipeline |
src/cpp/src/continuous_batching/pipeline.cpp | Updates main pipeline with video parameter support |
src/cpp/include/openvino/genai/visual_language/pipeline.hpp | Adds video property and parameter to public interface |
src/cpp/include/openvino/genai/continuous_batching_pipeline.hpp | Updates public interface with video support |
# Conflicts:
#   src/cpp/src/continuous_batching/pipeline_base.cpp
#   src/cpp/src/visual_language/inputs_embedder.cpp
#   src/cpp/src/visual_language/inputs_embedder.hpp
#   src/cpp/src/visual_language/qwen2vl/classes.cpp
#   src/cpp/src/visual_language/qwen2vl/classes.hpp
get_inputs_embeds(const std::string& prompt, const std::vector<ov::Tensor>& images ...
get_inputs_embeds_with_token_type_ids(const std::string& prompt, const std::vector<ov::Tensor>& images, ...
Because:
1: They are never called in the current code.
2: To get the embedding features, we usually need to apply a chat template first.
I think keeping only the interfaces below is enough:
get_inputs_embeds(const std::string& prompt, const std::vector<EncodedImage>& images...
get_inputs_embeds_with_token_type_ids(const std::string& prompt, const std::vector<EncodedImage>& images...
Signed-off-by: xipingya <[email protected]>
2: Enable video for get_input_embeds Signed-off-by: xipingya <[email protected]>
Co-authored-by: Chen Peter <[email protected]>
…om/xipingyan/openvino.genai into xp/enable_qwen_vl_video_preprocess
std::vector<ov::Tensor> videos
The std::vector holds multiple videos.
Each ov::Tensor has the layout [N,H,W,C], where N is the number of frames in one video.
Signed-off-by: xipingya <[email protected]>
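To make the [N,H,W,C] layout concrete, here is a minimal sketch that does not depend on OpenVINO. The `VideoNHWC` struct, its `offset` and `at` helpers, and the `uint8_t` element type are hypothetical stand-ins for illustration only; the PR itself passes each video as an `ov::Tensor`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one video stored as [N,H,W,C]:
// N frames, each H x W pixels with C interleaved channels.
struct VideoNHWC {
    std::size_t frames, height, width, channels;
    std::vector<std::uint8_t> data;  // frames * height * width * channels elements

    VideoNHWC(std::size_t n, std::size_t h, std::size_t w, std::size_t c)
        : frames(n), height(h), width(w), channels(c), data(n * h * w * c, 0) {}

    // Row-major offset: the channel index varies fastest, the frame index slowest.
    std::size_t offset(std::size_t n, std::size_t h, std::size_t w, std::size_t c) const {
        return ((n * height + h) * width + w) * channels + c;
    }

    std::uint8_t& at(std::size_t n, std::size_t h, std::size_t w, std::size_t c) {
        return data[offset(n, h, w, c)];
    }
};

// Multiple videos: one entry per video, mirroring std::vector<ov::Tensor> videos.
using Videos = std::vector<VideoNHWC>;
```

With this layout, moving to the next frame skips `height * width * channels` elements, so each frame is a contiguous block in memory.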
Signed-off-by: xiping.yan <[email protected]>
tickets: CVS-173219
1: Enable video preprocessing for Qwen VL model.
Add: ov::Property<std::vector<ov::Tensor>> video{"video"};
2: The main updates:
For video: frames are merged 2-in-1, so if 9 frames are input, only 5 merged frames are actually processed.
For image: for 2-in-1 merging each image is simply duplicated, so if 9 images are input, 9 images are actually processed.
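The 9-to-5 versus 9-to-9 arithmetic above can be sketched as a small counting function. This is only an illustration: `processed_count` and `kTemporalPatchSize` are hypothetical names, not part of the OpenVINO GenAI code, and rounding up for an odd trailing frame is an assumption that matches the 9-to-5 figure in the description.

```cpp
#include <cstddef>

// Assumed pair size for the "2-in-1" merging described in the PR.
constexpr std::size_t kTemporalPatchSize = 2;

// Hypothetical helper: how many merged units are actually processed
// for a given number of input frames or images.
constexpr std::size_t processed_count(std::size_t num_inputs, bool is_video) {
    if (!is_video) {
        // Each image is duplicated to fill its own pair: one unit per image.
        return num_inputs;
    }
    // Video frames are grouped in pairs; an odd trailing frame still
    // occupies one unit, hence the round-up division.
    return (num_inputs + kTemporalPatchSize - 1) / kTemporalPatchSize;
}

static_assert(processed_count(9, true) == 5, "9 video frames -> 5 merged units");
static_assert(processed_count(9, false) == 9, "9 images -> 9 units");
```

For long inputs the video path therefore processes roughly half as many units as passing the same frames as separate images.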
Introduce an "If" node to merge video and image preprocessing into one OV subgroup.