Tianhao Qi*, Jianlong Yuan✝, Wanquan Feng, Shancheng Fang✉, Jiawei Liu,
SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang
(*Work done during an internship at ByteDance Intelligent Creation, ✝Project lead, ✉Corresponding author)
From the University of Science and Technology of China, ByteDance Intelligent Creation, and Yuanshi Inc.
TL;DR: We present Mask²DiT, a novel dual-mask-based diffusion transformer for multi-scene long video generation. It supports both synthesizing a fixed number of scenes and auto-regressively extending the video with new scenes, improving the scalability and continuity of long video synthesis.
Videos generated with a fixed number of scenes using Mask²DiT.
Each scene maintains coherent appearance and motion across temporal boundaries.
Mask²DiT extends multi-scene narratives auto-regressively,
producing long and coherent videos with evolving context.
- [2025.10.15]: 🔥🔥 Released the code and checkpoint.
- [2025.03.26]: 🔥🔥 Released the arXiv paper and project page.
We provide two inference pipelines for long video generation:
- 🎬 Fixed-Scene Generation — generate videos with a fixed number of scenes.
- 🔄 Auto-Regressive Scene Expansion — expand scenes continuously based on previous context.
Download the pretrained model from Hugging Face and place it under:
./models/
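For example, the weights can be fetched with the huggingface_hub Python API; note that the repository id below is only a placeholder, so substitute the actual Mask²DiT model id from the release page:

from huggingface_hub import snapshot_download

# NOTE: "<org>/<mask2dit-checkpoint>" is a placeholder repo id, not the real one.
# Download the checkpoint files into ./models/ as expected by the inference scripts.
snapshot_download(repo_id="<org>/<mask2dit-checkpoint>", local_dir="./models/")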
We recommend using a virtual environment to install the required dependencies. You can create a virtual environment using conda as follows:
conda create -n mask2dit python=3.11.2
conda activate mask2dit
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
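Optionally, you can confirm that the CUDA build of PyTorch was installed correctly (this only calls PyTorch's standard API):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"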
Use this script to synthesize videos with a fixed number of scenes:
python examples/cogvideox_fun/predict_multi_scene_t2v_mask2dit.py
The generated multi-scene videos will be saved under samples/mask2dit-cogvideox-5b-multi-scene-t2v.
Use this script to expand video scenes sequentially based on the given context:
python examples/cogvideox_fun/predict_autoregressive_scene_expansion_mask2dit.py
This mode auto-regressively extends the video while maintaining global temporal consistency, storing the expanded video under samples/mask2dit-cogvideox-5b-autoregressive-scene-expansion.
Please prepare your datasets following the provided examples:
- datasets/pretrain.csv → used for pretraining
- datasets/sft.json → used for supervised fine-tuning (SFT)
💡 You can modify these template files to fit your own dataset paths and captions.
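If you are unsure which fields the templates expect, a quick way to check is to print them directly from the provided example files; the snippet below is a minimal sketch that assumes pretrain.csv has a header row and sft.json is standard JSON:

import csv, json

# Print the column names used by the pretraining template (assumes a header row)
with open("datasets/pretrain.csv", newline="") as f:
    print(next(csv.reader(f)))

# Print the top-level structure of the SFT template
with open("datasets/sft.json") as f:
    data = json.load(f)
    print(data[0] if isinstance(data, list) else list(data.keys()))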
We pretrain Mask²DiT using the provided datasets/pretrain.csv. Use the following script to start pretraining:
bash scripts/cogvideox_fun/train_mask2dit_pretrain.sh
After pretraining, we fine-tune Mask²DiT using the provided datasets/sft.json. Use the following script to start SFT:
bash scripts/cogvideox_fun/train_mask2dit_sft.sh
This project is built upon the open-source repository
VideoX-Fun.
We sincerely thank the original authors for their excellent work and open-source contributions.
If you find our work useful for your research, please consider citing it using the following BibTeX:
@inproceedings{qi2025mask,
title={Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation},
author={Qi, Tianhao and Yuan, Jianlong and Feng, Wanquan and Fang, Shancheng and Liu, Jiawei and Zhou, SiYu and He, Qian and Xie, Hongtao and Zhang, Yongdong},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={18837--18846},
year={2025}
}