58 commits
6c49dc8
Avoid resizing for same width and height images.
xipingyan Jul 30, 2025
c7d9932
Enable video process for qwen*-vl
xipingyan Jul 30, 2025
2ee043f
Add python interface: generate config: is_video, default false.
xipingyan Jul 31, 2025
29c74fd
fallback video_encode to image encode in base class.
xipingyan Aug 5, 2025
78dac29
Update calc target image size.
xipingyan Aug 5, 2025
7b2c115
Reduce shared codes, fallback to image process via return empty vector;
xipingyan Aug 5, 2025
10d8e8d
1: remove is_video,
xipingyan Aug 9, 2025
a3000d4
Update src/cpp/src/visual_language/llava/classes.cpp
xipingyan Sep 11, 2025
062fc40
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Sep 12, 2025
4d8375d
Update src/cpp/src/visual_language/pipeline.cpp
xipingyan Sep 12, 2025
ef9f868
rename according to copilot suggestion
xipingyan Sep 12, 2025
ad95828
Merge branch 'xp/enable_qwen_vl_video_preprocess' of https://github.c…
xipingyan Sep 12, 2025
f92b19b
rename rgbs to images
xipingyan Sep 12, 2025
66cdf38
enable if node to unify image and video preprocess.
xipingyan Sep 15, 2025
3eda036
cpp preprocess: enable video preprocess.
xipingyan Sep 15, 2025
3df267f
Pass same_images
xipingyan Sep 15, 2025
bf3169b
add comments for same image
xipingyan Sep 15, 2025
e1250aa
Update loop condition, and rename variables.
xipingyan Sep 16, 2025
fe0ab92
Update src/cpp/src/visual_language/pipeline_base.hpp
xipingyan Sep 16, 2025
dec67b2
video should be frames.
xipingyan Sep 16, 2025
caee3fd
Add pytest for video input.
xipingyan Sep 16, 2025
6a49a48
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Sep 16, 2025
800638e
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
peterchen-intel Sep 17, 2025
1502b28
Remove is_video python attribute.
xipingyan Sep 17, 2025
4d8e867
rename video to videos
xipingyan Sep 17, 2025
ea7fc94
Update docs, and add video for add_request.
xipingyan Sep 17, 2025
60364bf
Fix docs format.
xipingyan Sep 17, 2025
4ea5b3d
Fix test error: can't catch exception.
xipingyan Sep 18, 2025
8a0ab2e
Fix: cannot be narrowed from type 'int' to 'float' in initializer list
xipingyan Sep 18, 2025
28337ea
Support no image or video input;
xipingyan Sep 18, 2025
f3fd7d4
Add checking input for python api.
xipingyan Sep 18, 2025
a80d28e
cpp interface: generate, remove video. add is_video, default false
xipingyan Sep 18, 2025
6ab0a35
update get_inputs_embeds_with_token_type_ids and get_inputs_embeds, i…
xipingyan Sep 18, 2025
c531982
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Sep 18, 2025
dc30ec1
update pyi interface of generate.
xipingyan Sep 19, 2025
5edf0a5
Remove "const bool& is_video" in add_request and generate.
xipingyan Sep 24, 2025
2215f8a
Update src/cpp/src/visual_language/qwen2vl/classes.cpp
xipingyan Sep 25, 2025
14352a7
Update src/python/openvino_genai/py_openvino_genai.pyi
xipingyan Sep 25, 2025
89afa54
copilot gave a wrong suggestion. add images and video param for add_r…
xipingyan Sep 25, 2025
3b5c6cd
Merge remote-tracking branch 'origin/master' into xp/enable_qwen_vl_v…
xipingyan Sep 25, 2025
8768795
Add examples to .md
xipingyan Sep 25, 2025
be57bf2
Fix test video error, and input multiple images.
xipingyan Sep 25, 2025
d96c5dd
Update test based on 4D video.
xipingyan Sep 26, 2025
aaf20b0
Add vlm test dependency: opencv-python
xipingyan Sep 27, 2025
a2ad61b
Merge remote-tracking branch 'origin/master' into xp/enable_qwen_vl_v…
xipingyan Sep 27, 2025
6f5189b
Enable mix video and image input.
xipingyan Sep 27, 2025
c0829a3
split encode_images into encode_images and encode_video
xipingyan Sep 28, 2025
f25770b
Remove:
xipingyan Sep 28, 2025
72c621b
1: Add <video_pad> placeholder,
xipingyan Sep 28, 2025
132b228
Update position_ids after enable video.
xipingyan Sep 29, 2025
8c0e13d
add video history id.
xipingyan Sep 30, 2025
64ba684
Update src/cpp/include/openvino/genai/visual_language/pipeline.hpp
xipingyan Sep 30, 2025
bbbef65
Merge branch 'xp/enable_qwen_vl_video_preprocess' of https://github.c…
xipingyan Sep 30, 2025
6e33dcf
Rename video to videos, reducing confusion.
xipingyan Sep 30, 2025
6bf63de
Remove useless header.
xipingyan Sep 30, 2025
eb4faea
Update video-> videos in Readme
xipingyan Sep 30, 2025
123221b
all video -> videos
xipingyan Sep 30, 2025
515c911
Call image processing when the model does not implement video processing.
xipingyan Sep 30, 2025
@@ -165,7 +165,11 @@ class OPENVINO_GENAI_EXPORTS ContinuousBatchingPipeline {
/// @param request_id must be unique for every add_request() call.
GenerationHandle add_request(uint64_t request_id, const ov::Tensor& input_ids, const ov::genai::GenerationConfig& sampling_params);
GenerationHandle add_request(uint64_t request_id, const std::string& prompt, const ov::genai::GenerationConfig& sampling_params);
GenerationHandle add_request(uint64_t request_id, const std::string& prompt, const std::vector<ov::Tensor>& images, const ov::genai::GenerationConfig& sampling_params);
GenerationHandle add_request(uint64_t request_id,
const std::string& prompt,
const std::vector<ov::Tensor>& images,
const std::vector<ov::Tensor>& video,
const ov::genai::GenerationConfig& sampling_params);

void step();

@@ -177,6 +181,7 @@ class OPENVINO_GENAI_EXPORTS ContinuousBatchingPipeline {
std::vector<VLMDecodedResults> generate(
const std::vector<std::string>& prompts,
const std::vector<std::vector<ov::Tensor>>& images,
const std::vector<std::vector<ov::Tensor>>& videos,
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer=std::monostate{});
/**
4 changes: 4 additions & 0 deletions src/cpp/include/openvino/genai/visual_language/pipeline.hpp
@@ -98,6 +98,7 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline {
VLMDecodedResults generate(
const std::string& prompt,
const std::vector<ov::Tensor>& rgbs,
const std::vector<ov::Tensor>& video,
const GenerationConfig& generation_config,
const StreamerVariant& streamer
);
@@ -235,7 +236,10 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline {
/*
* utils that allow to use generate() in the following way:
* pipe.generate(prompt, ov::genai::image(image_tensor)).
* pipe.generate(prompt, ov::genai::images(image_tensors)).
* pipe.generate(prompt, ov::genai::video(video_tensors)).
*/
static constexpr ov::Property<ov::Tensor> image{"image"};
static constexpr ov::Property<std::vector<ov::Tensor>> images{"images"};
static constexpr ov::Property<std::vector<ov::Tensor>> video{"video"};
}
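
With the new video property, frames can be passed to VLMPipeline::generate() the same way as still images. A minimal usage sketch under assumptions: the variadic properties overload of generate() accepts these properties together with ov::genai::generation_config, and decode_frames() is a hypothetical helper returning one ov::Tensor per decoded frame.

#include <iostream>
#include <vector>
#include "openvino/genai/visual_language/pipeline.hpp"

// Hypothetical helper: decode a clip into per-frame ov::Tensor objects.
std::vector<ov::Tensor> decode_frames(const std::string& path);

int main() {
    ov::genai::VLMPipeline pipe("./Qwen2-VL-2B-Instruct", "CPU");  // model dir is a placeholder
    std::vector<ov::Tensor> frames = decode_frames("clip.mp4");
    ov::genai::VLMDecodedResults result = pipe.generate(
        "Describe what happens in the video.",
        ov::genai::video(frames),                         // property added by this PR
        ov::genai::generation_config(ov::genai::greedy()));
    std::cout << result.texts[0] << std::endl;
    return 0;
}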
11 changes: 8 additions & 3 deletions src/cpp/src/continuous_batching/pipeline.cpp
@@ -237,8 +237,12 @@ GenerationHandle ContinuousBatchingPipeline::add_request(uint64_t request_id, co
return m_impl->add_request(request_id, input_ids, sampling_params);
}

GenerationHandle ContinuousBatchingPipeline::add_request(uint64_t request_id, const std::string& prompt, const std::vector<ov::Tensor>& images, const ov::genai::GenerationConfig& sampling_params) {
return m_impl->add_request(request_id, prompt, images, sampling_params);
GenerationHandle ContinuousBatchingPipeline::add_request(uint64_t request_id,
const std::string& prompt,
const std::vector<ov::Tensor>& images,
const std::vector<ov::Tensor>& video,
const ov::genai::GenerationConfig& sampling_params) {
return m_impl->add_request(request_id, prompt, images, video, sampling_params);
}

void ContinuousBatchingPipeline::step() {
@@ -272,9 +276,10 @@ std::vector<GenerationResult> ContinuousBatchingPipeline::generate(const std::ve
std::vector<VLMDecodedResults> ContinuousBatchingPipeline::generate(
const std::vector<std::string>& prompts,
const std::vector<std::vector<ov::Tensor>>& images,
const std::vector<std::vector<ov::Tensor>>& videos,
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer) {
return m_impl->generate(prompts, images, sampling_params, streamer);
return m_impl->generate(prompts, images, videos, sampling_params, streamer);
}


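For reference, a hedged sketch of the batched entry point; models_path, scheduler_config, frames (a std::vector<ov::Tensor>), and photo (an ov::Tensor) are assumed to be prepared elsewhere, and each prompt gets its own, possibly empty, image and video vectors.

ov::genai::ContinuousBatchingPipeline pipe(models_path, scheduler_config, "CPU");

std::vector<std::string> prompts = {"Summarize the clip.", "What is in the photo?"};
std::vector<std::vector<ov::Tensor>> images = {{}, {photo}};  // second prompt: one still image
std::vector<std::vector<ov::Tensor>> videos = {frames, {}};   // first prompt: video frames

auto results = pipe.generate(prompts, images, videos,
                             {ov::genai::greedy(), ov::genai::greedy()});
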
36 changes: 31 additions & 5 deletions src/cpp/src/continuous_batching/pipeline_base.cpp
@@ -51,7 +51,8 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
// TODO: remove this code and within model runner add check: if sequence group type is tokens,
// but embedding model is available => compute embeddings first, then pass to LLM
std::vector<std::vector<ov::Tensor>> images(prompts.size());
auto results_vlm = generate(prompts, images, sampling_params, streamer);
std::vector<std::vector<ov::Tensor>> videos(prompts.size());
auto results_vlm = generate(prompts, images, videos, sampling_params, streamer);
std::vector<GenerationResult> resutls;
for (auto& vlm_result : results_vlm) {
GenerationResult result;
@@ -150,13 +151,15 @@ std::vector<VLMDecodedResults>
ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
const std::vector<std::string>& prompts,
const std::vector<std::vector<ov::Tensor>>& rgbs_vector,
const std::vector<std::vector<ov::Tensor>>& video_vector,
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer) {
auto generate_start_time = std::chrono::steady_clock::now();
OPENVINO_ASSERT(m_model_input_type == ModelInputType::EMBEDDINGS);

OPENVINO_ASSERT(prompts.size() == sampling_params.size(), "Number of prompts should be equal to the number of generation configs.");
OPENVINO_ASSERT(prompts.size() == rgbs_vector.size(), "Number of prompts should be equal to the number of images vectors.");
OPENVINO_ASSERT(prompts.size() == video_vector.size(), "Number of prompts should be equal to the number of video vectors.");

std::vector<ov::Tensor> input_embeds_list;
std::vector<VLMPerfMetrics> vlm_perf_metrics(prompts.size());
@@ -165,9 +168,14 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
if (m_is_chat_conversation) {
OPENVINO_ASSERT(1 == prompts.size(), "Can't chat with multiple prompts");
const auto& rgbs = rgbs_vector[0];
const auto& video = video_vector[0];
const auto& prompt = prompts[0];
auto start_get_inputs_embeds = std::chrono::steady_clock::now();
encoded_images = m_inputs_embedder->encode_images(rgbs);
if (rgbs.size() > 0) {
encoded_images = m_inputs_embedder->encode_images(rgbs, false);
} else if (video.size() > 0) {
encoded_images = m_inputs_embedder->encode_images(video, true);
}
m_history_images.insert(m_history_images.end(), encoded_images.begin(), encoded_images.end());

const auto [unified_prompt, image_sequence] = m_inputs_embedder->normalize_prompt(prompt, m_image_id, encoded_images);
@@ -177,15 +185,26 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
std::string templated_history = m_tokenizer.apply_chat_template(m_history, true);

m_inputs_embedder->set_apply_chat_template_status(false);
input_embeds_list.push_back(m_inputs_embedder->get_inputs_embeds(templated_history, m_history_images, vlm_perf_metrics[0], rgbs.size() > 0, m_history_image_ids));
input_embeds_list.push_back(m_inputs_embedder->get_inputs_embeds(templated_history,
m_history_images,
vlm_perf_metrics[0],
encoded_images.size() > 0,
m_history_image_ids));
auto end_get_inputs_embeds = std::chrono::steady_clock::now();
vlm_perf_metrics[0].vlm_raw_metrics.prepare_embeddings_durations.emplace_back(PerfMetrics::get_microsec(end_get_inputs_embeds - start_get_inputs_embeds));

} else {
for (size_t i = 0; i < prompts.size(); i++) {
const auto& prompt = prompts[i];
const auto& rgbs = rgbs_vector[i];
const auto encoded_images = m_inputs_embedder->encode_images(rgbs);
const auto& video = video_vector[i];
std::vector<ov::genai::EncodedImage> encoded_images;
if (rgbs.size() > 0) {
encoded_images = m_inputs_embedder->encode_images(rgbs, false);
} else if (video.size() > 0) {
encoded_images = m_inputs_embedder->encode_images(video, true);
}

auto [unified_prompt, image_sequence] = m_inputs_embedder->normalize_prompt(prompt, m_image_id, encoded_images);

auto start_get_inputs_embeds = std::chrono::steady_clock::now();
@@ -241,14 +260,21 @@ GenerationHandle
ContinuousBatchingPipeline::IContinuousBatchingPipeline::add_request(uint64_t request_id,
const std::string& prompt,
const std::vector<ov::Tensor>& rgbs,
const std::vector<ov::Tensor>& video,
GenerationConfig sampling_params) {
OPENVINO_ASSERT(m_model_input_type == ModelInputType::EMBEDDINGS, "Model doesn't support embeddings.");
ov::genai::VLMPerfMetrics metrics;
ov::Tensor inputs;
{
std::lock_guard<std::mutex> lock(m_embeddings_mutex);
m_inputs_embedder->set_apply_chat_template_status(sampling_params.apply_chat_template);
const auto encoded_images = m_inputs_embedder->encode_images(rgbs);

std::vector<ov::genai::EncodedImage> encoded_images;
if (rgbs.size() > 0) {
encoded_images = m_inputs_embedder->encode_images(rgbs, false);
} else if (video.size() > 0) {
encoded_images = m_inputs_embedder->encode_images(video, true);
}

const auto [unified_prompt, image_sequence] = m_inputs_embedder->normalize_prompt(prompt, 0, encoded_images);
inputs = m_inputs_embedder->get_inputs_embeds(unified_prompt, encoded_images, metrics, true, image_sequence);
2 changes: 2 additions & 0 deletions src/cpp/src/continuous_batching/pipeline_base.hpp
@@ -92,6 +92,7 @@ class ContinuousBatchingPipeline::IContinuousBatchingPipeline {
GenerationHandle add_request(uint64_t request_id,
const std::string& prompt,
const std::vector<ov::Tensor>& rgbs,
const std::vector<ov::Tensor>& video,
GenerationConfig sampling_params);

/**
@@ -124,6 +125,7 @@ class ContinuousBatchingPipeline::IContinuousBatchingPipeline {
generate(
const std::vector<std::string>& prompts,
const std::vector<std::vector<ov::Tensor>>& rgbs,
const std::vector<std::vector<ov::Tensor>>& videos,
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer);

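A sketch of driving the request-level API with the new video argument, assuming a constructed pipeline and decoded frames; the step() and has_non_finished_requests() loop already exists on ContinuousBatchingPipeline.

ov::genai::GenerationHandle handle = pipe.add_request(
    /*request_id=*/0,
    "Describe the clip.",
    /*images=*/{},
    /*video=*/frames,
    ov::genai::greedy());

while (pipe.has_non_finished_requests()) {
    pipe.step();
}
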
2 changes: 1 addition & 1 deletion src/cpp/src/continuous_batching/pipeline_impl.cpp
@@ -246,7 +246,7 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::add_request(uint64_t request
timer.end();
return add_request(request_id, inputs, sampling_params);
} else if (m_model_input_type == ModelInputType::EMBEDDINGS) {
return ContinuousBatchingPipeline::IContinuousBatchingPipeline::add_request(request_id, prompt, {}, sampling_params);
return ContinuousBatchingPipeline::IContinuousBatchingPipeline::add_request(request_id, prompt, {}, {}, sampling_params);
} else {
OPENVINO_THROW("Unknown model input type.");
}
7 changes: 6 additions & 1 deletion src/cpp/src/visual_language/clip.cpp
@@ -76,7 +76,12 @@ void bicubic_resize(const clip_image_u8 &img, clip_image_u8 &dst, int target_wid

dst.nx = target_width;
dst.ny = target_height;
dst.buf.resize(3 * target_width * target_height);
const int target_size = 3 * target_width * target_height;
dst.buf.resize(target_size);
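// Same-size input needs no interpolation: copy the pixels and return early.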
if (img.nx == target_width && img.ny == target_height) {
std::memcpy(dst.buf.data(), img.buf.data(), target_size);
return;
}

float Cc;
float C[5];
3 changes: 2 additions & 1 deletion src/cpp/src/visual_language/continuous_batching_adapter.hpp
@@ -44,11 +44,12 @@ class ov::genai::VLMPipeline::VLMContinuousBatchingAdapter : public ov::genai::V
VLMDecodedResults generate(
const std::string& prompt,
const std::vector<ov::Tensor>& rgbs,
const std::vector<ov::Tensor>& video,
GenerationConfig generation_config,
const StreamerVariant& streamer
) override {
auto start_time = std::chrono::steady_clock::now();
auto result = m_impl.generate({prompt}, {rgbs}, {generation_config}, streamer)[0];
auto result = m_impl.generate({prompt}, {rgbs}, {video}, {generation_config}, streamer)[0];
auto stop_time = std::chrono::steady_clock::now();

VLMDecodedResults decoded;
17 changes: 13 additions & 4 deletions src/cpp/src/visual_language/inputs_embedder.cpp
@@ -163,9 +163,18 @@ std::vector<ov::Tensor> InputsEmbedder::IInputsEmbedder::to_single_image_tensors
return single_image_tensors;
}

std::vector<ov::genai::EncodedImage> InputsEmbedder::IInputsEmbedder::encode_images(const std::vector<ov::Tensor>& images) {
std::vector<EncodedImage> embeds;
std::vector<ov::genai::EncodedImage> InputsEmbedder::IInputsEmbedder::encode_images(const std::vector<ov::Tensor>& images, const bool& is_video) {
std::vector<ov::Tensor> single_images = to_single_image_tensors(images);
std::vector<EncodedImage> embeds;

if (is_video) {
embeds = m_vision_encoder->encode_video(single_images);
if (!embeds.empty()) {
return embeds;
}
// Empty result: fall back to per-image processing below.
}

for (const ov::Tensor& image : single_images) {
embeds.emplace_back(m_vision_encoder->encode(image));
}
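
The contract assumed by the is_video branch above: a vision encoder that cannot handle frames as video returns an empty vector from encode_video(), and the per-image loop then runs as the fallback. A hedged sketch of a backend honoring that contract; MyVisionEncoder and supports_video() are hypothetical names.

std::vector<ov::genai::EncodedImage> MyVisionEncoder::encode_video(const std::vector<ov::Tensor>& frames) {
    if (!supports_video()) {
        return {};  // empty result tells the caller to fall back to per-image encoding
    }
    std::vector<ov::genai::EncodedImage> embeds;
    for (const ov::Tensor& frame : frames) {
        embeds.emplace_back(encode(frame));  // model-specific temporal handling would go here
    }
    return embeds;
}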
@@ -240,8 +249,8 @@ ov::Tensor InputsEmbedder::get_inputs_embeds(const std::string& prompt, const st
return m_impl->get_inputs_embeds(prompt, images, metrics, recalculate_merged_embeddings, image_sequence);
}

std::vector<ov::genai::EncodedImage> InputsEmbedder::encode_images(const std::vector<ov::Tensor>& images) {
return m_impl->encode_images(images);
std::vector<ov::genai::EncodedImage> InputsEmbedder::encode_images(const std::vector<ov::Tensor>& images, const bool& is_video) {
return m_impl->encode_images(images, is_video);
}

std::pair<ov::Tensor, std::optional<int64_t>> InputsEmbedder::get_position_ids(const size_t inputs_embeds_size, const size_t history_size) {
4 changes: 2 additions & 2 deletions src/cpp/src/visual_language/inputs_embedder.hpp
@@ -39,7 +39,7 @@ class InputsEmbedder {

ov::Tensor get_inputs_embeds(const std::string& prompt, const std::vector<ov::genai::EncodedImage>& images, ov::genai::VLMPerfMetrics& metrics, bool recalculate_merged_embeddings = true, const std::vector<size_t>& image_sequence = {});

std::vector<ov::genai::EncodedImage> encode_images(const std::vector<ov::Tensor>& images);
std::vector<ov::genai::EncodedImage> encode_images(const std::vector<ov::Tensor>& images, const bool& is_video = false);

// compute position ids for language model input
std::pair<ov::Tensor, std::optional<int64_t>> get_position_ids(const size_t inputs_embeds_size, const size_t history_size);
@@ -102,7 +102,7 @@ class InputsEmbedder {

ov::Tensor get_inputs_embeds(const std::string& prompt, const std::vector<ov::Tensor>& images, ov::genai::VLMPerfMetrics& metrics, const std::vector<size_t>& image_sequence);

virtual std::vector<ov::genai::EncodedImage> encode_images(const std::vector<ov::Tensor>& images);
virtual std::vector<ov::genai::EncodedImage> encode_images(const std::vector<ov::Tensor>& images, const bool& is_video = false);

virtual std::pair<ov::Tensor, std::optional<int64_t>> get_position_ids(const size_t inputs_embeds_size, const size_t history_size);

5 changes: 4 additions & 1 deletion src/cpp/src/visual_language/llava/classes.cpp
@@ -92,7 +92,10 @@ InputsEmbedderLLaVA::InputsEmbedderLLaVA(
const ov::AnyMap device_config) :
IInputsEmbedder(vlm_config, models_map, tokenizer, config_dir_path, device, device_config) { }

std::vector<ov::genai::EncodedImage> InputsEmbedderLLaVA::encode_images(const std::vector<ov::Tensor>& images) {
std::vector<ov::genai::EncodedImage> InputsEmbedderLLaVA::encode_images(const std::vector<ov::Tensor>& images, const bool& is_video) {
if (is_video) {
std::cout << "== Warning: LLaVA doesn't support video process. Input images are processed as separate images." << std::endl;
}
std::vector<EncodedImage> embeds;
ov::AnyMap vision_config = {{"patch_size", m_vlm_config.vision_config_patch_size}};
std::vector<ov::Tensor> single_images = to_single_image_tensors(images);
2 changes: 1 addition & 1 deletion src/cpp/src/visual_language/llava/classes.hpp
@@ -37,7 +37,7 @@ class InputsEmbedderLLaVA : public InputsEmbedder::IInputsEmbedder {

ov::Tensor get_inputs_embeds(const std::string& prompt, const std::vector<ov::genai::EncodedImage>& images, ov::genai::VLMPerfMetrics& metrics, bool recalculate_merged_embeddings = true, const std::vector<size_t>& image_sequence = {}) override;

std::vector<ov::genai::EncodedImage> encode_images(const std::vector<ov::Tensor>& images) override;
std::vector<ov::genai::EncodedImage> encode_images(const std::vector<ov::Tensor>& images, const bool& is_video = false) override;

std::pair<std::string, std::vector<size_t>> normalize_prompt(
const std::string& prompt,
6 changes: 5 additions & 1 deletion src/cpp/src/visual_language/llava_next/classes.cpp
@@ -333,7 +333,11 @@ ov::Tensor pack_image_features_llava_next(

} // namespace

std::vector<ov::genai::EncodedImage> InputsEmbedderLLaVANext::encode_images(const std::vector<ov::Tensor>& images) {
std::vector<ov::genai::EncodedImage> InputsEmbedderLLaVANext::encode_images(const std::vector<ov::Tensor>& images, const bool& is_video) {
if (is_video) {
std::cout << "== Warning: LLaVANext doesn't support video process. " << std::endl;
}

std::vector<EncodedImage> embeds;
ov::AnyMap vision_config = {{"patch_size", m_vlm_config.vision_config_patch_size}};
std::vector<ov::Tensor> single_images = to_single_image_tensors(images);
2 changes: 1 addition & 1 deletion src/cpp/src/visual_language/llava_next/classes.hpp
@@ -24,7 +24,7 @@ class InputsEmbedderLLaVANext : public InputsEmbedderLLaVA {

ov::Tensor get_inputs_embeds(const std::string& prompt, const std::vector<ov::genai::EncodedImage>& images, ov::genai::VLMPerfMetrics& metrics, bool recalculate_merged_embeddings = true, const std::vector<size_t>& image_sequence = {}) override;

std::vector<ov::genai::EncodedImage> encode_images(const std::vector<ov::Tensor>& images) override;
std::vector<ov::genai::EncodedImage> encode_images(const std::vector<ov::Tensor>& images, const bool& is_video) override;

std::pair<std::string, std::vector<size_t>> normalize_prompt(
const std::string& prompt,
13 changes: 10 additions & 3 deletions src/cpp/src/visual_language/pipeline.cpp
@@ -153,6 +153,7 @@ class VLMPipeline::VLMPipelineImpl : public VLMPipelineBase{
VLMDecodedResults generate(
const std::string& prompt,
const std::vector<ov::Tensor>& rgbs,
const std::vector<ov::Tensor>& video,
GenerationConfig generation_config,
const StreamerVariant& streamer
) override {
@@ -183,7 +184,12 @@
"Currently only \"num_return_sequences\" equal to 1 is supported for NPU device!");
}

const auto encoded_images = m_inputs_embedder->encode_images(rgbs);
std::vector<ov::genai::EncodedImage> encoded_images;
if (rgbs.size() > 0) {
encoded_images = m_inputs_embedder->encode_images(rgbs, false);
} else if (video.size() > 0) {
encoded_images = m_inputs_embedder->encode_images(video, true);
}
auto [unified_prompt, image_sequence] = m_inputs_embedder->normalize_prompt(prompt, m_image_id, encoded_images);

if (m_is_chat_conversation) {
@@ -444,10 +450,11 @@ VLMPipeline::~VLMPipeline() = default;
VLMDecodedResults VLMPipeline::generate(
const std::string& prompt,
const std::vector<ov::Tensor>& rgbs,
const std::vector<ov::Tensor>& video,
const GenerationConfig& generation_config,
const StreamerVariant& streamer
) {
return m_pimpl->generate(prompt, rgbs, generation_config, streamer);
return m_pimpl->generate(prompt, rgbs, video, generation_config, streamer);
}

VLMDecodedResults VLMPipeline::generate(
@@ -456,7 +463,7 @@ VLMDecodedResults VLMPipeline::generate(
const GenerationConfig& generation_config,
const StreamerVariant& streamer
) {
return m_pimpl->generate(prompt, {rgb}, generation_config, streamer);
return m_pimpl->generate(prompt, {rgb}, {}, generation_config, streamer);
}
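
A hedged sketch of calling the explicit overload with streaming; frames is assumed to exist, and the lambda form relies on StreamerVariant accepting a std::function<StreamingStatus(std::string)>.

auto res = pipe.generate(
    "What happens in the clip?",
    /*rgbs=*/{},
    /*video=*/frames,
    ov::genai::greedy(),
    [](std::string chunk) {
        std::cout << chunk << std::flush;
        return ov::genai::StreamingStatus::RUNNING;
    });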

VLMDecodedResults VLMPipeline::generate(