Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Xuyang Liu1,2*, Yiyu Wang1*, Junpeng Ma3, Linfeng Zhang1✉
1Shanghai Jiao Tong University, 2Sichuan University, 3Fudan University
2025.05.21
We release our latest work VidCom2, a plug-and-play inference acceleration method for VideoLLMs. Code is available!
- Model Adaptability: Compatible with most VideoLLMs (e.g., LLaVA, Qwen-VL series).
- Operator Compatibility: Works seamlessly with efficient operators like Flash Attention 2.
- Strong Performance: Retains only 25% of video tokens while preserving 99.6% of LLaVA-OV's performance.
- High Efficiency: Cuts LLaVA-OV generation time by 70.8% and overall latency by 43.0%.
TLDR: We present VidCom2, a plug-and-play framework that dynamically compresses video tokens based on frame uniqueness, achieving state-of-the-art efficiency and performance across various VideoLLMs and benchmarks.
The core implementation of our code is in llava/model/vidcom2.py. In LLaVA-OneVision it is called here, in LLaVA-Video it is called here, and the Qwen2-VL implementation is located here.
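As a rough illustration of the plug-and-play pattern, the compression step sits between the vision encoder and the LLM. The names and signature below are placeholders for exposition only, not the actual repo code; the real entry point is vidcom2_compression in llava/model/vidcom2.py.

```python
import torch

def forward_with_compression(vision_encoder, llm, video_frames, text_ids,
                             compress_fn, retention_ratio=0.25):
    # Encode frames into visual tokens: (num_frames, tokens_per_frame, hidden_dim)
    visual_tokens = vision_encoder(video_frames)
    # Drop redundant visual tokens before they ever reach the LLM
    visual_tokens = compress_fn(visual_tokens, retention_ratio=retention_ratio)
    # The LLM only sees the compressed visual tokens plus the text prompt
    return llm(visual_tokens=visual_tokens, input_ids=text_ids)
```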
- Clone this repository:
git clone https://github.com/xuyang-liu16/VidCom2.git
cd VidCom2
- Environment Setup and Preparation:
conda create -n VidCom2 python=3.10 -y
conda activate VidCom2
pip install --upgrade pip # Enable PEP 660 support.
pip install -e ".[train]"
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
We use the lmms-eval toolkit to evaluate our models.
You can reproduce all of our ablation experiments by modifying the parameters of the vidcom2_compression function; by default, the retention ratio is 0.25 (a simplified sketch of the idea is given below).
Flash Attention is optional; however, all of our efficiency analyses enable Flash Attention 2 whenever the model supports it.
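For intuition only, here is a minimal, self-contained sketch of retention-ratio-based token selection driven by per-frame uniqueness. It does not reproduce the exact scoring in llava/model/vidcom2.py; all names and the scoring heuristic are illustrative assumptions.

```python
import torch

def compress_by_frame_uniqueness(tokens: torch.Tensor,
                                 retention_ratio: float = 0.25) -> torch.Tensor:
    """tokens: (num_frames, tokens_per_frame, dim) visual features.
    Returns a flat (num_kept, dim) tensor of retained tokens."""
    num_frames, tokens_per_frame, _ = tokens.shape
    frame_means = tokens.mean(dim=1)                      # (F, D)
    video_mean = frame_means.mean(dim=0, keepdim=True)    # (1, D)
    # A frame is "unique" if it differs from the video-level average
    uniqueness = 1 - torch.cosine_similarity(frame_means, video_mean, dim=-1)  # (F,)
    # Split the global token budget across frames in proportion to uniqueness
    budget = int(retention_ratio * num_frames * tokens_per_frame)
    per_frame = uniqueness / uniqueness.sum().clamp(min=1e-6) * budget
    per_frame = per_frame.long().clamp(min=1, max=tokens_per_frame)
    kept = []
    for f in range(num_frames):
        # Within a frame, prefer tokens that deviate most from the frame mean
        scores = 1 - torch.cosine_similarity(tokens[f], frame_means[f:f + 1], dim=-1)
        idx = scores.topk(int(per_frame[f])).indices
        kept.append(tokens[f, idx])
    return torch.cat(kept, dim=0)
```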
To evaluate LLaVA-OneVision-7B, you can use:
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,attn_implementation=flash_attention_2 \
--tasks videomme,mlvu_dev,longvideobench_val_v,mvbench \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_onevision \
--output_path ./logs/
To evaluate LLaVA-Video-7B, you can use:
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava_vid \
--model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,max_frames_num=64,mm_spatial_pool_mode=average,attn_implementation=flash_attention_2 \
--tasks videomme,mlvu_dev,longvideobench_val_v,mvbench \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_vid \
--output_path ./logs/
To evaluate Qwen2-VL, you can use:
accelerate launch --num_processes=8 \
-m lmms_eval \
--model qwen2_vl \
--model_args=pretrained=Qwen/Qwen2-VL-7B-Instruct,attn_implementation=flash_attention_2 \
--tasks videomme,mlvu_dev,longvideobench_val_v,mvbench \
--batch_size 1 \
--log_samples \
--log_samples_suffix reproduce \
--output_path ./logs/
VidCom2 employs lmms-eval for inference, and all efficiency analyses are based on LLaVA-OneVision-7B.
The code for measuring the total time of model generation and peak GPU memory is located in lmms_eval/models/llava_onevision.py.
The code for measuring the total time of LLM generation is located in llava/model/language_model/llava_qwen.py.
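If you want to replicate these measurements in your own scripts, the following pattern (a hedged sketch, not the exact code used in lmms-eval) captures wall-clock generation time and peak GPU memory:

```python
import time
import torch

def timed_generate(model, **generate_kwargs):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()          # make sure prior GPU work is finished
    start = time.time()
    output = model.generate(**generate_kwargs)
    torch.cuda.synchronize()          # wait for generation to complete
    elapsed = time.time() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"generation time: {elapsed:.2f}s | peak GPU memory: {peak_gib:.2f} GiB")
    return output
```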
If our findings help your research, please consider citing our paper in your publications.
@article{liu2025vidcom2,
title={Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models},
author={Liu, Xuyang and Wang, Yiyu and Ma, Junpeng and Zhang, Linfeng},
journal={arXiv preprint arXiv:2505.14454},
year={2025}
}
We extend our gratitude to the open-source efforts of LLaVA-OneVision and Qwen2-VL.
For any question about our paper or code, please email [email protected].