Commit 1874c6a

[Doc] Update vlm.rst to include an example on videos (#9155)
Co-authored-by: Cyrus Leung <[email protected]>
1 parent 9a94ca4 commit 1874c6a

1 file changed: +27 −0 lines changed

docs/source/models/vlm.rst

Lines changed: 27 additions & 0 deletions
@@ -135,6 +135,33 @@ Instead of passing in a single image, you can pass in a list of images.

A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.

Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:

.. code-block:: python

    from vllm import LLM

    # Specify the maximum number of frames per video to be 4. This can be changed.
    llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

    # Create the request payload.
    video_frames = ...  # Load your video, making sure it only has the number of frames specified earlier.
    message = {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
        ],
    }
    for frame in video_frames:
        base64_image = encode_image(frame)  # Base64-encode each frame as a JPEG.
        new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        message["content"].append(new_image)

    # Perform inference and log output.
    outputs = llm.chat([message])

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

Online Inference
----------------