Jinbo Xing, Menghan Xia*, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu,
Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong
(* corresponding author)
From CUHK and Tencent AI Lab.
IEEE TVCG 2024
Make-Your-Video is a customized video generation model conditioned on both text and motion structure (depth) control. It inherits rich visual concepts from a pre-trained image LDM and supports longer video inference.
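At inference time the idea is roughly: estimate a depth map for each frame of a guidance video (the Installation section below ships a MiDaS DPT-Hybrid checkpoint for this) and condition the video LDM on that depth sequence plus a text prompt. The sketch below only illustrates this flow; the MiDaS calls are standard `torch.hub` usage, while `MakeYourVideoSampler` and its `sample` signature are hypothetical placeholders, not this repo's actual interface.

```python
# Conceptual sketch only: not this repo's actual API.
import torch

# Per-frame depth estimation with MiDaS DPT-Hybrid (same family as the
# dpt_hybrid-midas checkpoint used in the Installation section below).
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Hybrid")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform
midas.eval()

def frames_to_depth(frames):
    """frames: list of HxWx3 RGB uint8 numpy arrays -> list of 2-D relative depth maps."""
    depths = []
    with torch.no_grad():
        for frame in frames:
            batch = transform(frame)                # 1x3xH'xW' input tensor
            depths.append(midas(batch).squeeze(0))  # H'xW' relative depth
    return depths

# Hypothetical conditioning step (placeholder names, not the real interface):
# sampler = MakeYourVideoSampler("checkpoints/makeyourvideo_256_v1/model.ckpt")
# video = sampler.sample(prompt="A dam discharging water",
#                        depth_maps=frames_to_depth(guidance_frames))
```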
Showcase comparisons (result videos omitted here): each example shows the real-life scene alongside results from Ours, Text2Video-zero+CtrlNet, and LVDMExt+Adapter, for the prompts:
- "A dam discharging water"
- "A futuristic rocket ship on a launchpad, with sleek design, glowing lights"
- [2023.11.30]: 🔥🔥 Release the main model.
- [2023.06.01]: 🔥🔥 Create this repo and launch the project webpage.
| Model | Resolution | Checkpoint |
|---|---|---|
| MakeYourVideo256 | 256x256 | Hugging Face |
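The checkpoint can also be fetched programmatically with `huggingface_hub`. Note that the repo id below is a placeholder, not a confirmed value; substitute the repository behind the "Hugging Face" link in the table above.

```python
# Sketch: download the released checkpoint from Hugging Face.
# NOTE: "ORG/MakeYourVideo256" is a placeholder repo id, not the confirmed one.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="ORG/MakeYourVideo256", filename="model.ckpt")
print(ckpt_path)  # then copy or symlink to checkpoints/makeyourvideo_256_v1/model.ckpt
```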
Generating a video takes approximately 13 seconds and requires a peak GPU memory of 20 GB on a single NVIDIA A100 (40G) GPU.
```
conda create -n makeyourvideo python=3.8.5
conda activate makeyourvideo
pip install -r requirements.txt
```
- Download the pre-trained depth estimation model from Hugging Face, and put `dpt_hybrid-midas-501f0c75.pt` in `checkpoints/depth/dpt_hybrid-midas-501f0c75.pt`.
- Download the pretrained Make-Your-Video model via Hugging Face, and put `model.ckpt` in `checkpoints/makeyourvideo_256_v1/model.ckpt`.
- Run the following command in a terminal:
```
sh scripts/run.sh
```
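Before launching `scripts/run.sh`, a quick (unofficial) way to confirm that both checkpoints from the steps above landed where the scripts expect them:

```python
# Sanity check (not part of the repo): verify the checkpoint layout described above.
from pathlib import Path

expected = [
    Path("checkpoints/depth/dpt_hybrid-midas-501f0c75.pt"),
    Path("checkpoints/makeyourvideo_256_v1/model.ckpt"),
]
for path in expected:
    print(("found  " if path.is_file() else "MISSING") + f": {path}")
```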
VideoCrafter1: Framework for high-quality video generation.
DynamiCrafter: Open-domain image animation using video diffusion priors.
Play with these projects in the same conda environment!
```bibtex
@article{xing2023make,
  title={Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance},
  author={Xing, Jinbo and Xia, Menghan and Liu, Yuxin and Zhang, Yuechen and Zhang, Yong and He, Yingqing and Liu, Hanyuan and Chen, Haoxin and Cun, Xiaodong and Wang, Xintao and others},
  journal={arXiv preprint arXiv:2306.00943},
  year={2023}
}
```
We develop this repository for RESEARCH purposes, so it can only be used for personal/research/non-commercial purposes.
We gratefully acknowledge the Visual Geometry Group at the University of Oxford for collecting the WebVid-10M dataset, and we follow its terms of access.