
Target-Aware Video Diffusion Models

arXiv 2025

Taeksoo Kim, Hanbyul Joo

Paper PDF | Project Page | Hugging Face Model

TL;DR. We use a segmentation mask as an additional input to specify the target object for video diffusion models.
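
As one illustration of what "additional input" can mean, the sketch below shows a common way to condition a diffusion model on a mask: resize the binary target mask to the latent resolution and concatenate it to the video latent as an extra channel. This is a minimal sketch only, not necessarily TAViD's exact mechanism, and add_mask_channel is a hypothetical name.

import torch
import torch.nn.functional as F

def add_mask_channel(latent: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # latent: (B, C, T, H, W) video latent; mask: (B, 1, H_img, W_img) with values in {0, 1}
    b, c, t, h, w = latent.shape
    mask_lat = F.interpolate(mask, size=(h, w), mode="nearest")  # match the latent resolution
    mask_lat = mask_lat.unsqueeze(2).expand(b, 1, t, h, w)       # repeat the mask across all frames
    return torch.cat([latent, mask_lat], dim=1)                  # (B, C+1, T, H, W)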

Updates

  • [2025/05/20] Initial code for inference and checkpoint released.
  • [2025/03/25] Paper released.

TODO list.

  • Release training code and dataset
  • Add attention visualization code
  • Add Gradio app

Table of Contents

  • Installation
  • Inference
  • Training and Dataset
  • Citation
  • Acknowledgements

Installation

Clone the repository.

git clone https://github.com/taeksuu/tavid.git
cd tavid

Create a conda environment and install the required packages.

conda create -n tavid python=3.11
conda activate tavid
pip install -r requirements.txt

Download the model checkpoints and place them in the ./checkpoints folder.

huggingface-cli download Taeksoo/TAViD --local-dir checkpoints
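
Alternatively, the same checkpoint can be fetched from Python with huggingface_hub, the library that backs huggingface-cli:

from huggingface_hub import snapshot_download

# Downloads the full Taeksoo/TAViD repository snapshot into ./checkpoints
snapshot_download(repo_id="Taeksoo/TAViD", local_dir="checkpoints")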

Inference

You can run the inference code with the following command. Just add the trigger word "target" in front of the noun you'd like to specify, as in the example below. You will find the output videos in the ./results folder.
We have tested the inference code on RTX 3090 and A100 GPUs.

python inference.py \
  --image_path assets/image.png \
  --mask_path assets/mask_0.png \
  --prompt "In a serene, well-lit kitchen with clean, modern lines, the woman reaches forward, and picks up the target mug cup with her hand. She brings the target mug to her lips, taking a slow, thoughtful sip of the coffee, her gaze unfocused as if lost in contemplation. The steam from the coffee curls gently in the air, adding warmth to the quiet ambiance of the room."

Since our base model, CogVideoX, is trained with long prompts, prompt quality directly impacts output quality. Please refer to this guide from CogVideoX for prompt enhancement. The generated videos can still suffer from limitations, including object disappearance or implausible dynamics, so you may need to try multiple times for the best results.
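
If you build prompts programmatically, the trigger-word rule amounts to a simple string edit. A hypothetical helper (not part of this repo) might look like:

def add_trigger(prompt: str, noun: str) -> str:
    # Prefix every occurrence of the chosen noun with the trigger word "target".
    return prompt.replace(noun, f"target {noun}")

print(add_trigger("The woman picks up the mug cup and takes a sip.", "mug cup"))
# -> The woman picks up the target mug cup and takes a sip.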

Training and Dataset

We will soon release the training code and data.

Citation

If you find TAViD useful for your work, please consider citing:

@article{kim2025target,
    title={Target-Aware Video Diffusion Models},
    author={Kim, Taeksoo and Joo, Hanbyul},
    journal={arXiv preprint arXiv:2503.18950},
    year={2025}
}

Acknowledgements

We sincerely thank the authors of the following amazing works for their open-source code, models, and datasets:
