arXiv 2025
TL;DR. We use a segmentation mask as an additional input to specify the target object for video diffusion models.
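For intuition only, the sketch below shows one common way such mask conditioning can be wired into a latent video diffusion model: downsample the binary mask to the latent resolution and concatenate it channel-wise with the noisy latents. The function name and tensor shapes are illustrative assumptions, not TAViD's actual conditioning mechanism.

```python
# Illustrative sketch of channel-wise mask conditioning; not TAViD's exact architecture.
import torch
import torch.nn.functional as F

def concat_mask_condition(noisy_latents: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """noisy_latents: (B, C, T, H, W) video latents; mask: (B, 1, H_img, W_img) binary target mask."""
    b, c, t, h, w = noisy_latents.shape
    # Resize the target mask to the latent spatial resolution.
    mask_lat = F.interpolate(mask.float(), size=(h, w), mode="nearest")  # (B, 1, H, W)
    # Broadcast the mask over the temporal dimension.
    mask_lat = mask_lat.unsqueeze(2).expand(b, 1, t, h, w)  # (B, 1, T, H, W)
    # Concatenate along channels; the denoiser's input layer must accept C + 1 channels.
    return torch.cat([noisy_latents, mask_lat], dim=1)
```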
- [2025/05/20] Initial code for inference and checkpoint released.
- [2025/03/25] Paper released.
TODO list.
- Release training code and dataset
- Add attention visualization code
- Add Gradio app
Clone the repository.
git clone https://github.com/taeksuu/tavid.git
Create a conda environment and install the required packages.
conda create -n tavid python=3.11
conda activate tavid
pip install -r requirements.txt
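Assuming PyTorch is pulled in by requirements.txt, a quick sanity check like the one below confirms that CUDA is visible before you move on to inference:

```python
# Optional environment check; assumes PyTorch was installed via requirements.txt.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```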
Download the model checkpoints and place them in the ./checkpoints folder.
huggingface-cli download Taeksoo/TAViD --local-dir checkpoints
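If the huggingface-cli executable is not on your PATH, the same download can be done from Python with the huggingface_hub package (which also provides the CLI); the repo id and target folder below mirror the command above:

```python
# Python alternative to the CLI download above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Taeksoo/TAViD", local_dir="checkpoints")
```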
You can run the inference code with the following command. Just add the trigger word "target" in front of the noun you'd like to specify, as in the example below. You will find your output videos in the ./results folder.
We have tested the inference code on RTX 3090 and A100 GPUs.
python inference.py \
--image_path assets/image.png \
--mask_path assets/mask_0.png \
--prompt "In a serene, well-lit kitchen with clean, modern lines, the woman reaches forward, and picks up the target mug cup with her hand. She brings the target mug to her lips, taking a slow, thoughtful sip of the coffee, her gaze unfocused as if lost in contemplation. The steam from the coffee curls gently in the air, adding warmth to the quiet ambiance of the room."
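If you want to compare several candidate masks (e.g., assets/mask_0.png, assets/mask_1.png) under the same prompt, a small wrapper such as the sketch below simply calls inference.py with the flags shown above. The glob pattern and the short prompt are illustrative assumptions; in practice a long, detailed prompt works better, as noted next.

```python
# Hypothetical batch runner over all masks in ./assets; paths and prompt are illustrative.
import subprocess
from pathlib import Path

PROMPT = "The woman reaches forward and picks up the target mug, then takes a slow sip of coffee."

for mask_path in sorted(Path("assets").glob("mask_*.png")):
    subprocess.run(
        [
            "python", "inference.py",
            "--image_path", "assets/image.png",
            "--mask_path", str(mask_path),
            "--prompt", PROMPT,
        ],
        check=True,
    )
```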
Since our base model, CogVideoX, is trained with long prompts, prompt quality directly impacts output quality. Please refer to this guide from CogVideoX for prompt enhancement. The generated videos can still suffer from limitations, including object disappearances or implausible dynamics. You may need to try multiple times to get the best results.
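As a rough sketch of the kind of prompt enhancement that guide describes, you can ask a capable LLM to expand a short description into a detailed paragraph while keeping the "target" trigger word attached to the right noun. The OpenAI client, model name, and system instructions below are illustrative assumptions and not part of this repository:

```python
# Illustrative prompt-enhancement sketch; requires `pip install openai` and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "Expand the user's short video description into one detailed paragraph covering the "
    "scene, lighting, motion, and camera. Keep the word 'target' immediately before the "
    "object it marks."
)

short_prompt = "The woman picks up the target mug and drinks coffee."
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": short_prompt},
    ],
)
print(response.choices[0].message.content)
```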
We will soon release the training code and data.
If you find TAViD useful for your work, please consider citing:
@article{kim2025target,
  title={Target-Aware Video Diffusion Models},
  author={Kim, Taeksoo and Joo, Hanbyul},
  journal={arXiv preprint arXiv:2503.18950},
  year={2025}
}
We sincerely thank the authors of the following amazing works for their open-sourced code, models, and datasets:
- Training: CogVideo, Finetrainers
- Datasets: Ego-Exo4D, BEHAVE