
📹 Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models 🚀

Xuyang Liu1,2*, Yiyu Wang1*, Junpeng Ma3, Linfeng Zhang1✉

1Shanghai Jiao Tong University, 2Sichuan University, 3Fudan University

🔥 News

  • 2025.05.21 🤗🤗 We release our latest work VidCom2, a plug-and-play inference acceleration method for VideoLLMs. Code is available!

📌 Highlights

  • Model Adaptability: Compatible with most VideoLLMs (e.g., LLaVA, Qwen-VL series).
  • Operator Compatibility: Works seamlessly with efficient operators like Flash Attention 2.
  • Strong Performance: Uses only 25% of the tokens while maintaining 99.6% of LLaVA-OV's performance.
  • High Efficiency: Cuts LLaVA-OV generation time by 70.8% and overall latency by 43.0%.

✨ Overview

TLDR: We present VidCom2, a plug-and-play framework that dynamically compresses video tokens based on frame uniqueness, achieving state-of-the-art efficiency and performance across various VideoLLMs and benchmarks.
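
To make the idea concrete, below is a minimal sketch of frame-adaptive token compression in PyTorch: score each frame's uniqueness, give more of the token budget to distinctive frames, and keep the most frame-specific tokens within each frame. The shapes, the cosine-similarity scoring, and the function name compress_video_tokens are illustrative assumptions, not the actual implementation in llava/model/vidcom2.py.

import torch
import torch.nn.functional as F

def compress_video_tokens(frames: torch.Tensor, retention_ratio: float = 0.25):
    """Toy frame-adaptive compression of visual tokens.

    frames: (T, N, D) tensor of N visual tokens per frame for T frames.
    Returns a list of per-frame tensors holding the kept tokens.
    """
    T, N, D = frames.shape

    # Frame uniqueness: how far each frame's mean token is from the
    # average over all frames (more distinctive frames score higher).
    frame_means = frames.mean(dim=1)                        # (T, D)
    global_mean = frame_means.mean(dim=0, keepdim=True)     # (1, D)
    uniqueness = 1.0 - F.cosine_similarity(frame_means, global_mean, dim=-1)  # (T,)

    # Split the overall token budget across frames in proportion to
    # uniqueness, so distinctive frames keep more tokens.
    total_budget = int(retention_ratio * T * N)
    weights = uniqueness / uniqueness.sum().clamp_min(1e-6)
    budgets = (weights * total_budget).long().clamp(min=1, max=N)

    kept = []
    for t in range(T):
        # Within a frame, keep the tokens least similar to the frame mean,
        # i.e. the ones carrying the most frame-specific information.
        sim = F.cosine_similarity(frames[t], frame_means[t].unsqueeze(0), dim=-1)
        idx = sim.argsort()[: int(budgets[t])]
        kept.append(frames[t, idx])
    return kept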

💥 Core Code

The core implementation is in llava/model/vidcom2.py. In LLaVA-OneVision it is called here, and in LLaVA-Video it is called here. The Qwen2-VL implementation is located here.
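
For orientation, the snippet below sketches where a plug-and-play compressor of this kind typically sits in a VideoLLM forward pass: after the vision encoder and before the visual tokens reach the LLM. The helper names and the signature of vidcom2_compression shown here (features plus a retention ratio) are assumptions for illustration; refer to llava/model/vidcom2.py and the call sites above for the actual interface.

# Hypothetical call site, showing where a plug-and-play compressor such as
# vidcom2_compression would be invoked inside a VideoLLM forward pass.
# The signature used here is assumed for illustration only.

def encode_video(video_frames, vision_tower, projector, vidcom2_compression):
    # 1. Encode every sampled frame into visual tokens, e.g. (T, N, D).
    vision_features = vision_tower(video_frames)

    # 2. Compress the tokens *before* they reach the language model,
    #    so both prefill and decoding cost drop accordingly.
    compressed = vidcom2_compression(vision_features, retention_ratio=0.25)

    # 3. Project the surviving tokens into the LLM embedding space.
    return projector(compressed)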

🛠 Preparation

  1. Clone this repository:
git clone https://github.com/xuyang-liu16/VidCom2.git
cd VidCom2
  2. Environment Setup and Preparation:
conda create -n VidCom2 python=3.10 -y
conda activate VidCom2
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git

🚀 Performance Evaluation

We use the lmms-eval toolkit to evaluate our models. 👉 You can reproduce all of our ablation experiments by modifying the parameters of the vidcom2_compression function! By default, the retention ratio is 0.25.

Flash Attention is optional for evaluation, but for the efficiency analysis it should be enabled whenever the model supports it.

To evaluate LLaVA-OneVision-7B, you can use:

accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model llava_onevision \
  --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,attn_implementation=flash_attention_2 \
  --tasks videomme,mlvu_dev,longvideobench_val_v,mvbench \
  --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_onevision \
  --output_path ./logs/

To evaluate LLaVA-Video-7B, you can use:

accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model llava_vid \
  --model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,max_frames_num=64,mm_spatial_pool_mode=average,attn_implementation=flash_attention_2 \
  --tasks videomme,mlvu_dev,longvideobench_val_v,mvbench \
  --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_vid \
  --output_path ./logs/

To evaluate Qwen2-VL, you can use:

accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model qwen2_vl \
  --model_args=pretrained=Qwen/Qwen2-VL-7B-Instruct,attn_implementation=flash_attention_2 \
  --tasks videomme,mlvu_dev,longvideobench_val_v,mvbench \
  --batch_size 1 \
  --log_samples \
  --log_samples_suffix reproduce \
  --output_path ./logs/

🩻 Efficiency Analysis

VidCom2 employs lmms-eval for inference, and all efficiency analyses are based on LLaVA-OneVision-7B.

The code for measuring the total model generation time and GPU peak memory is located in lmms_eval/models/llava_onevision.py.

The code for measuring the total LLM generation time is located in llava/model/language_model/llava_qwen.py.
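
If you want to take similar measurements in your own setup, a generic pattern is to time the generate call and read PyTorch's peak-memory counter, as sketched below. This is a simplified illustration, not the exact instrumentation in those files.

import time
import torch

def timed_generate(model, inputs, **gen_kwargs):
    """Measure wall-clock generation time and GPU peak memory for one call.

    A generic sketch of the kind of instrumentation described above,
    not the exact code used in lmms-eval or LLaVA.
    """
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()

    output = model.generate(**inputs, **gen_kwargs)

    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"generation time: {elapsed:.2f} s | peak GPU memory: {peak_gb:.2f} GB")
    return output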

📌 Citation

If our findings help your research, please consider citing our paper in your publications.

@article{liu2025vidcom2,
  title={Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models},
  author={Liu, Xuyang and Wang, Yiyu and Ma, Junpeng and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2505.14454},
  year={2025}
}

πŸ‘ Acknowledgment

We extend our gratitude to the open-source efforts of LLaVA-OneVision and Qwen2-VL.

📩 Contact

For any questions about our paper or code, please email [email protected].
