arXiv 2025
TL;DR. We use a segmentation mask as an additional input to specify the target object for video diffusion models.
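For intuition only, the sketch below shows one common way such mask conditioning can be wired into a latent video diffusion model: downsample the binary mask to the latent resolution and concatenate it channel-wise with the noisy latents. The function name and tensor shapes are illustrative assumptions, not TAViD's actual conditioning mechanism.

```python
# Illustrative sketch of channel-wise mask conditioning; not TAViD's exact architecture.
import torch
import torch.nn.functional as F

def concat_mask_condition(noisy_latents: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """noisy_latents: (B, C, T, H, W) video latents; mask: (B, 1, H_img, W_img) binary target mask."""
    b, c, t, h, w = noisy_latents.shape
    # Resize the target mask to the latent spatial resolution.
    mask_lat = F.interpolate(mask.float(), size=(h, w), mode="nearest")  # (B, 1, H, W)
    # Broadcast the mask over the temporal dimension.
    mask_lat = mask_lat.unsqueeze(2).expand(b, 1, t, h, w)  # (B, 1, T, H, W)
    # Concatenate along channels; the denoiser's input layer must accept C + 1 channels.
    return torch.cat([noisy_latents, mask_lat], dim=1)
```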
- [2025/05/20] Initial code for inference and checkpoint released.
- [2025/03/25] Paper released.
TODO list.
- Release training code and dataset
- Add attention visualization code
- Add Gradio app
Clone the repository.
git clone https://github.com/taeksuu/tavid.git
Create a conda environment and install the required packages.
conda create -n tavid python=3.11
conda activate tavid
pip install -r requirements.txt
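Assuming PyTorch is pulled in by requirements.txt, a quick sanity check like the one below confirms that CUDA is visible before you move on to inference:

```python
# Optional environment check; assumes PyTorch was installed via requirements.txt.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```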
Download the model checkpoints and place them in the ./checkpoints folder.
huggingface-cli download Taeksoo/TAViD --local-dir checkpoints
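If the huggingface-cli executable is not on your PATH, the same download can be done from Python with the huggingface_hub package (which also provides the CLI); the repo id and target folder below mirror the command above:

```python
# Python alternative to the CLI download above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Taeksoo/TAViD", local_dir="checkpoints")
```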
You can run the inference code with the following command. Just add the trigger word "target" in front of the noun you'd like to specify, as in the example below. You will find your output videos in the ./results folder.
We have tested the inference code on RTX 3090 and A100 GPUs.
python inference.py \
--image_path assets/image.png \
--mask_path assets/mask_0.png \
--prompt "In a serene, well-lit kitchen with clean, modern lines, the woman reaches forward, and picks up the target mug cup with her hand. She brings the target mug to her lips, taking a slow, thoughtful sip of the coffee, her gaze unfocused as if lost in contemplation. The steam from the coffee curls gently in the air, adding warmth to the quiet ambiance of the room."
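If you want to compare several candidate masks (e.g., assets/mask_0.png, assets/mask_1.png) under the same prompt, a small wrapper such as the sketch below simply calls inference.py with the flags shown above. The glob pattern and the short prompt are illustrative assumptions; in practice a long, detailed prompt works better, as noted next.

```python
# Hypothetical batch runner over all masks in ./assets; paths and prompt are illustrative.
import subprocess
from pathlib import Path

PROMPT = "The woman reaches forward and picks up the target mug, then takes a slow sip of coffee."

for mask_path in sorted(Path("assets").glob("mask_*.png")):
    subprocess.run(
        [
            "python", "inference.py",
            "--image_path", "assets/image.png",
            "--mask_path", str(mask_path),
            "--prompt", PROMPT,
        ],
        check=True,
    )
```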
Since our base model, CogVideoX, is trained with long prompts, prompt quality directly impacts output quality. Please refer to this guide from CogVideoX for prompt enhancement. The generated videos can still suffer from limitations, including object disappearances or implausible dynamics. You may need to try multiple times to get the best results.
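As a rough sketch of the kind of prompt enhancement that guide describes, you can ask a capable LLM to expand a short description into a detailed paragraph while keeping the "target" trigger word attached to the right noun. The OpenAI client, model name, and system instructions below are illustrative assumptions and not part of this repository:

```python
# Illustrative prompt-enhancement sketch; requires `pip install openai` and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "Expand the user's short video description into one detailed paragraph covering the "
    "scene, lighting, motion, and camera. Keep the word 'target' immediately before the "
    "object it marks."
)

short_prompt = "The woman picks up the target mug and drinks coffee."
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": short_prompt},
    ],
)
print(response.choices[0].message.content)
```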
We will soon release the training code and data.
If you find TAViD useful for your work, please consider citing:
@article{kim2025target,
  title={Target-Aware Video Diffusion Models},
  author={Kim, Taeksoo and Joo, Hanbyul},
  journal={arXiv preprint arXiv:2503.18950},
  year={2025}
}
We sincerely thank the authors of the following amazing works for their open-sourced code, models, and datasets:
- Training: CogVideo, Finetrainers
- Datasets: Ego-Exo4D, BEHAVE