Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Xuyang Liu1,2*, Yiyu Wang1*, Junpeng Ma3, Linfeng Zhang1✉
1Shanghai Jiao Tong University, 2Sichuan University, 3Fudan University
2025.05.21
We release our latest work VidCom2, a plug-and-play inference acceleration method for VideoLLMs. Code is available!
- Model Adaptability: Compatible with most VideoLLMs (e.g., LLaVA, Qwen-VL series).
- Operator Compatibility: Works seamlessly with efficient operators like Flash Attention 2.
- Strong Performance: Retains only 25% of video tokens while preserving 99.6% of LLaVA-OV's performance.
- High Efficiency: Cuts LLaVA-OV generation time by 70.8% and overall latency by 43.0%.
TLDR: We present VidCom2, a plug-and-play framework that dynamically compresses video tokens based on frame uniqueness, achieving state-of-the-art efficiency and performance across various VideoLLMs and benchmarks.
The core implementation of our code is in llava/model/vidcom2.py. In LLaVA-OneVision it is called here, in LLaVA-Video it is called here, and the Qwen2-VL implementation is located here.
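As a rough illustration of the plug-and-play pattern, the compression step sits between the vision encoder and the LLM. The names and signature below are placeholders for exposition only, not the actual repo code; the real entry point is vidcom2_compression in llava/model/vidcom2.py.

```python
import torch

def forward_with_compression(vision_encoder, llm, video_frames, text_ids,
                             compress_fn, retention_ratio=0.25):
    # Encode frames into visual tokens: (num_frames, tokens_per_frame, hidden_dim)
    visual_tokens = vision_encoder(video_frames)
    # Drop redundant visual tokens before they ever reach the LLM
    visual_tokens = compress_fn(visual_tokens, retention_ratio=retention_ratio)
    # The LLM only sees the compressed visual tokens plus the text prompt
    return llm(visual_tokens=visual_tokens, input_ids=text_ids)
```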
- Clone this repository:
git clone https://github.com/xuyang-liu16/VidCom2.git
cd VidCom2
- Environment Setup and Preparation:
conda create -n VidCom2 python=3.10 -y
conda activate VidCom2
pip install --upgrade pip # Enable PEP 660 support.
pip install -e ".[train]"
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
We use the lmms-eval toolkit to evaluate our models.
You can reproduce all of our ablation experiments by modifying the parameters of the vidcom2_compression function; by default, the retention ratio is 0.25 (a simplified sketch of the idea is given below).
Flash Attention is optional; however, all of our efficiency analyses enable Flash Attention 2 whenever the model supports it.
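For intuition only, here is a minimal, self-contained sketch of retention-ratio-based token selection driven by per-frame uniqueness. It does not reproduce the exact scoring in llava/model/vidcom2.py; all names and the scoring heuristic are illustrative assumptions.

```python
import torch

def compress_by_frame_uniqueness(tokens: torch.Tensor,
                                 retention_ratio: float = 0.25) -> torch.Tensor:
    """tokens: (num_frames, tokens_per_frame, dim) visual features.
    Returns a flat (num_kept, dim) tensor of retained tokens."""
    num_frames, tokens_per_frame, _ = tokens.shape
    frame_means = tokens.mean(dim=1)                      # (F, D)
    video_mean = frame_means.mean(dim=0, keepdim=True)    # (1, D)
    # A frame is "unique" if it differs from the video-level average
    uniqueness = 1 - torch.cosine_similarity(frame_means, video_mean, dim=-1)  # (F,)
    # Split the global token budget across frames in proportion to uniqueness
    budget = int(retention_ratio * num_frames * tokens_per_frame)
    per_frame = uniqueness / uniqueness.sum().clamp(min=1e-6) * budget
    per_frame = per_frame.long().clamp(min=1, max=tokens_per_frame)
    kept = []
    for f in range(num_frames):
        # Within a frame, prefer tokens that deviate most from the frame mean
        scores = 1 - torch.cosine_similarity(tokens[f], frame_means[f:f + 1], dim=-1)
        idx = scores.topk(int(per_frame[f])).indices
        kept.append(tokens[f, idx])
    return torch.cat(kept, dim=0)
```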
To evaluate LLaVA-OneVision-7B, you can use:
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,attn_implementation=flash_attention_2 \
--tasks videomme,mlvu_dev,longvideobench_val_v,mvbench \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_onevision \
--output_path ./logs/
To evaluate LLaVA-Video-7B, you can use:
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava_vid \
--model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,max_frames_num=64,mm_spatial_pool_mode=average,attn_implementation=flash_attention_2 \
--tasks videomme,mlvu_dev,longvideobench_val_v,mvbench \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_vid \
--output_path ./logs/
To evaluate Qwen2-VL, you can use:
accelerate launch --num_processes=8 \
-m lmms_eval \
--model qwen2_vl \
--model_args=pretrained=Qwen/Qwen2-VL-7B-Instruct,attn_implementation=flash_attention_2 \
--tasks videomme,mlvu_dev,longvideobench_val_v,mvbench \
--batch_size 1 \
--log_samples \
--log_samples_suffix reproduce \
--output_path ./logs/
VidCom2 employs lmms-eval for inference, and all efficiency analyses are based on LLaVA-OneVision-7B.
The code for measuring the total time of model generation and peak GPU memory is located in lmms_eval/models/llava_onevision.py.
The code for measuring the total time of LLM generation is located in llava/model/language_model/llava_qwen.py.
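If you want to replicate these measurements in your own scripts, the following pattern (a hedged sketch, not the exact code used in lmms-eval) captures wall-clock generation time and peak GPU memory:

```python
import time
import torch

def timed_generate(model, **generate_kwargs):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()          # make sure prior GPU work is finished
    start = time.time()
    output = model.generate(**generate_kwargs)
    torch.cuda.synchronize()          # wait for generation to complete
    elapsed = time.time() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"generation time: {elapsed:.2f}s | peak GPU memory: {peak_gib:.2f} GiB")
    return output
```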
If our findings help your research, please consider citing our paper in your publications.
@article{liu2025vidcom2,
title={Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models},
author={Liu, Xuyang and Wang, Yiyu and Ma, Junpeng and Zhang, Linfeng},
journal={arXiv preprint arXiv:2505.14454},
year={2025}
}
We extend our gratitude to the open-source efforts of LLaVA-OneVision and Qwen2-VL.
For any question about our paper or code, please email [email protected].