
Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation (CVPR 2025)


Tianhao Qi*, Jianlong Yuan✝, Wanquan Feng, Shancheng Fang✉, Jiawei Liu,
SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang


(*Work done during an internship at ByteDance Intelligent Creation, ✝Project lead, ✉Corresponding author)

From the University of Science and Technology of China, ByteDance Intelligent Creation, and Yuanshi Inc.

🔆 Introduction

TL;DR: We present Mask²DiT, a novel dual-mask-based diffusion transformer designed for multi-scene long video generation. It supports both synthesizing a fixed number of scenes and auto-regressively appending new ones, advancing the scalability and continuity of long video synthesis.
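The exact masking scheme is defined in the paper; as a rough illustration only, the sketch below builds a hypothetical scene-aligned binary attention mask in which each text segment attends only within its own scene while video tokens also attend across scenes. The token layout, function name, and cross-scene rule here are assumptions, not the repository's implementation.

```python
# Hypothetical sketch of a scene-aligned attention mask (illustrative only,
# not the official Mask²DiT implementation).
import torch

def build_scene_mask(num_scenes: int, text_len: int, video_len: int) -> torch.Tensor:
    """Return a boolean mask of shape (L, L); True = attention allowed.

    Assumed token layout: [text_1, video_1, text_2, video_2, ...].
    """
    L = num_scenes * (text_len + video_len)
    mask = torch.zeros(L, L, dtype=torch.bool)
    is_video = torch.zeros(L, dtype=torch.bool)
    for s in range(num_scenes):
        start = s * (text_len + video_len)
        t0, v0 = start, start + text_len      # text span, then video span
        v1 = v0 + video_len
        mask[t0:v1, t0:v1] = True             # intra-scene text/video attention
        is_video[v0:v1] = True
    # assumption: video tokens additionally attend to all video tokens,
    # preserving temporal coherence across scene boundaries
    mask |= is_video.unsqueeze(1) & is_video.unsqueeze(0)
    return mask
```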

⭐⭐ Fixed-Scene Video Generation.

Videos generated with a fixed number of scenes using Mask²DiT.
Each scene maintains coherent appearance and motion across temporal boundaries.

⭐⭐ Auto-Regressive Scene Expansion.

Mask²DiT extends multi-scene narratives auto-regressively,
producing long and coherent videos with evolving context.

📝 Changelog

  • [2025.10.15]: 🔥🔥 Released the code and checkpoint.
  • [2025.03.26]: 🔥🔥 Released the arXiv paper and project page.

🧩 Inference

We provide two inference pipelines for long video generation:

  • 🎬 Fixed-Scene Generation — generate videos with a fixed number of scenes.
  • 🔄 Auto-Regressive Scene Expansion — expand scenes continuously based on previous context.

1️⃣ Prepare Pretrained Model

Download the pretrained model from Hugging Face and place it under:

./models/
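If you prefer to script the download, a minimal sketch using `huggingface_hub` follows; the repo id below is a placeholder, so substitute the actual Hugging Face repository for Mask²DiT.

```python
# Minimal download sketch; the repo_id is a placeholder, not the real one.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/<mask2dit-checkpoint>",  # placeholder: fill in the actual repo
    local_dir="./models",
)
```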

2️⃣ Environment Setup

We recommend using a virtual environment to install the required dependencies. You can create a virtual environment using conda as follows:

conda create -n mask2dit python=3.11.2
conda activate mask2dit
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
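After installation, a quick sanity check confirms that the pinned PyTorch build can see your GPU:

```python
# Verify the CUDA-enabled PyTorch install before running inference.
import torch

print(torch.__version__)          # expected: 2.4.1+cu124
print(torch.cuda.is_available())  # should print True on a CUDA-capable machine
```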

🎬 Fixed-Scene Video Generation

Use this script to synthesize videos with a fixed number of scenes:

python examples/cogvideox_fun/predict_multi_scene_t2v_mask2dit.py

📦 Output:

The generated multi-scene video will be saved under samples/mask2dit-cogvideox-5b-multi-scene-t2v.


🔄 Auto-Regressive Scene Expansion

Use this script to expand video scenes sequentially based on the previous context:

python examples/cogvideox_fun/predict_autoregressive_scene_expansion_mask2dit.py

📦 Output:

This mode auto-regressively extends the video while maintaining global temporal consistency, storing the expanded video under samples/mask2dit-cogvideox-5b-autoregressive-scene-expansion.
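Conceptually, the expansion loop resembles the sketch below; the pipeline interface and method names are illustrative assumptions, not the repository's actual API. Each new scene is generated conditioned on the latents produced so far.

```python
# Hypothetical sketch of auto-regressive scene expansion; `generate_scene`
# and its `context` argument are illustrative, not the repository's API.
import torch

def expand_scenes(pipeline, prompts):
    context = None  # latents of previously generated scenes
    scenes = []
    for prompt in prompts:
        # condition the new scene on the accumulated history
        latents = pipeline.generate_scene(prompt, context=context)
        # dim=2 is assumed to be the temporal axis of the latent tensor
        context = latents if context is None else torch.cat([context, latents], dim=2)
        scenes.append(latents)
    return scenes
```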

🧑‍🏫 Training

1️⃣ Prepare Training Data

Please prepare your datasets following the provided templates, datasets/pretrain.csv and datasets/sft.json.

💡 You can modify these template files to fit your own dataset paths and captions.
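As a rough illustration of how such templates are typically consumed, the snippet below reads both files; the column and field names are assumptions, so check the actual template files for the real schema.

```python
# Hypothetical reader for the data templates; "file_path" and "text" are
# assumed column/field names, not the verified schema of these files.
import csv
import json

with open("datasets/pretrain.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["file_path"], row["text"])  # video clip path + caption

with open("datasets/sft.json") as f:
    print(json.load(f)[0])  # inspect the first SFT sample
```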

2️⃣ Pretraining

We pretrain Mask²DiT using the provided datasets/pretrain.csv. Use the following script to start pretraining:

bash scripts/cogvideox_fun/train_mask2dit_pretrain.sh

3️⃣ Supervised Fine-Tuning (SFT)

After pretraining, we fine-tune Mask²DiT using the provided datasets/sft.json. Use the following script to start SFT:

bash scripts/cogvideox_fun/train_mask2dit_sft.sh

🙏 Acknowledgement

This project is built upon the open-source repository VideoX-Fun.
We sincerely thank the original authors for their excellent work and open-source contributions.

Bibtex

If you find our work useful for your research, please consider citing it with the following BibTeX:

@inproceedings{qi2025mask,
  title={Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation},
  author={Qi, Tianhao and Yuan, Jianlong and Feng, Wanquan and Fang, Shancheng and Liu, Jiawei and Zhou, SiYu and He, Qian and Xie, Hongtao and Zhang, Yongdong},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={18837--18846},
  year={2025}
}
