
Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation (CVPR 2025)


Tianhao Qi*, Jianlong Yuan✝, Wanquan Feng, Shancheng Fang✉, Jiawei Liu,
SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang


(*Work done during an internship at ByteDance Intelligent Creation, ✝Project lead, ✉Corresponding author)

From the University of Science and Technology of China, ByteDance Intelligent Creation, and Yuanshi Inc.

🔆 Introduction

TL;DR: We present Mask²DiT, a novel dual-mask-based diffusion transformer designed for multi-scene long video generation. It supports both synthesizing a fixed number of scenes and auto-regressively appending new ones, advancing the scalability and continuity of long video synthesis.
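The exact masking scheme is defined in the paper; as a rough illustration only, the sketch below builds a hypothetical scene-aligned binary attention mask in which each text segment attends only within its own scene while video tokens also attend across scenes. The token layout, function name, and cross-scene rule here are assumptions, not the repository's implementation.

```python
# Hypothetical sketch of a scene-aligned attention mask (illustrative only,
# not the official Mask²DiT implementation).
import torch

def build_scene_mask(num_scenes: int, text_len: int, video_len: int) -> torch.Tensor:
    """Return a boolean mask of shape (L, L); True = attention allowed.

    Assumed token layout: [text_1, video_1, text_2, video_2, ...].
    """
    L = num_scenes * (text_len + video_len)
    mask = torch.zeros(L, L, dtype=torch.bool)
    is_video = torch.zeros(L, dtype=torch.bool)
    for s in range(num_scenes):
        start = s * (text_len + video_len)
        t0, v0 = start, start + text_len      # text span, then video span
        v1 = v0 + video_len
        mask[t0:v1, t0:v1] = True             # intra-scene text/video attention
        is_video[v0:v1] = True
    # assumption: video tokens additionally attend to all video tokens,
    # preserving temporal coherence across scene boundaries
    mask |= is_video.unsqueeze(1) & is_video.unsqueeze(0)
    return mask
```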

⭐⭐ Fixed-Scene Video Generation.

Videos generated with a fixed number of scenes using Mask²DiT.
Each scene maintains coherent appearance and motion across temporal boundaries.

⭐⭐ Auto-Regressive Scene Expansion.

Mask²DiT extends multi-scene narratives auto-regressively,
producing long and coherent videos with evolving context.

📝 Changelog

  • [2025.10.15]: 🔥🔥 Released the code and checkpoint.
  • [2025.03.26]: 🔥🔥 Released the arXiv paper and project page.

🧩 Inference

We provide two inference pipelines for long video generation:

  • 🎬 Fixed-Scene Generation — generate videos with a fixed number of scenes.
  • 🔄 Auto-Regressive Scene Expansion — expand scenes continuously based on previous context.

1️⃣ Prepare Pretrained Model

Download the pretrained model from Hugging Face and place it under:

./models/
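If you prefer to script the download, a minimal sketch using `huggingface_hub` follows; the repo id below is a placeholder, so substitute the actual Hugging Face repository for Mask²DiT.

```python
# Minimal download sketch; the repo_id is a placeholder, not the real one.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/<mask2dit-checkpoint>",  # placeholder: fill in the actual repo
    local_dir="./models",
)
```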

2️⃣ Environment Setup

We recommend using a virtual environment to install the required dependencies. You can create a virtual environment using conda as follows:

conda create -n mask2dit python=3.11.2
conda activate mask2dit
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
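After installation, a quick sanity check confirms that the pinned PyTorch build can see your GPU:

```python
# Verify the CUDA-enabled PyTorch install before running inference.
import torch

print(torch.__version__)          # expected: 2.4.1+cu124
print(torch.cuda.is_available())  # should print True on a CUDA-capable machine
```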

🎬 Fixed-Scene Video Generation

Use this script to synthesize videos with a fixed number of scenes:

python examples/cogvideox_fun/predict_multi_scene_t2v_mask2dit.py

📦 Output:

The generated multi-scene video will be saved under samples/mask2dit-cogvideox-5b-multi-scene-t2v.


🔄 Auto-Regressive Scene Expansion

Use this script to expand video scenes sequentially based on the previous context:

python examples/cogvideox_fun/predict_autoregressive_scene_expansion_mask2dit.py

📦 Output:

This mode auto-regressively extends the video while maintaining global temporal consistency, storing the expanded video under samples/mask2dit-cogvideox-5b-autoregressive-scene-expansion.
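Conceptually, the expansion loop resembles the sketch below; the pipeline interface and method names are illustrative assumptions, not the repository's actual API. Each new scene is generated conditioned on the latents produced so far.

```python
# Hypothetical sketch of auto-regressive scene expansion; `generate_scene`
# and its `context` argument are illustrative, not the repository's API.
import torch

def expand_scenes(pipeline, prompts):
    context = None  # latents of previously generated scenes
    scenes = []
    for prompt in prompts:
        # condition the new scene on the accumulated history
        latents = pipeline.generate_scene(prompt, context=context)
        # dim=2 is assumed to be the temporal axis of the latent tensor
        context = latents if context is None else torch.cat([context, latents], dim=2)
        scenes.append(latents)
    return scenes
```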

🧑‍🏫 Training

1️⃣ Prepare Training Data

Please prepare your datasets following the provided templates, datasets/pretrain.csv and datasets/sft.json.

💡 You can modify these template files to fit your own dataset paths and captions.
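As a rough illustration of how such templates are typically consumed, the snippet below reads both files; the column and field names are assumptions, so check the actual template files for the real schema.

```python
# Hypothetical reader for the data templates; "file_path" and "text" are
# assumed column/field names, not the verified schema of these files.
import csv
import json

with open("datasets/pretrain.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["file_path"], row["text"])  # video clip path + caption

with open("datasets/sft.json") as f:
    print(json.load(f)[0])  # inspect the first SFT sample
```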

2️⃣ Pretraining

We pretrain Mask²DiT using the provided datasets/pretrain.csv. Use the following script to start pretraining:

bash scripts/cogvideox_fun/train_mask2dit_pretrain.sh

3️⃣ Supervised Fine-Tuning (SFT)

After pretraining, we fine-tune Mask²DiT using the provided datasets/sft.json. Use the following script to start SFT:

bash scripts/cogvideox_fun/train_mask2dit_sft.sh

🙏 Acknowledgement

This project is built upon the open-source repository VideoX-Fun.
We sincerely thank the original authors for their excellent work and open-source contributions.

Bibtex

If you find our work useful for your research, please consider citing it with the following BibTeX:

@inproceedings{qi2025mask,
  title={Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation},
  author={Qi, Tianhao and Yuan, Jianlong and Feng, Wanquan and Fang, Shancheng and Liu, Jiawei and Zhou, SiYu and He, Qian and Xie, Hongtao and Zhang, Yongdong},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={18837--18846},
  year={2025}
}
