SketchColour: Channel Concat Guided DiT-based Sketch-to-Colour Pipeline for 2D Animation
SketchColour takes the coloured first frame of a scene, together with the entire scene in sketch form, and colours each subsequent frame based on that reference. Evaluated on the SAKUGA dataset, SketchColour outperforms state-of-the-art video colourization methods (including LVCD, ToonCrafter, and AniDoc) across all metrics, despite using only half the training data of competing models. Compared with previous work, it produces accurate colourization that adheres closely to the sketch reference while minimizing colour bleeding.
Please see our demo page for results.
- Release the paper and demo page. Visit https://bconstantine.github.io/SketchColour
- Release the inference code. We also provide some inference samples (first frame, caption, and sketch) taken from the SAKUGA dataset.
- Release the training code.
- Build Gradio Demo
SketchColour is implemented in PyTorch. Training and inference were performed on 2 NVIDIA A40 GPUs (DDP). We use mixed precision during training, with float32 precision for the transformers and bfloat16 precision for the remaining components. We also precompute latents for training, while CPU offloading, VAE slicing, and VAE tiling are used for both training and inference. This results in a per-GPU memory requirement of 28 GB for training and 37 GB for inference. To speed up training, we use Torch compilation.
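For reference, the snippet below shows how these memory-saving switches are typically enabled on a diffusers CogVideoX pipeline. It only illustrates the options named above; our actual training and inference entry points are built on finetrainers and differ in detail.

```python
# Illustration of the memory-saving options mentioned above on a diffusers
# CogVideoX image-to-video pipeline (not our actual entry point).
import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)

pipe.enable_model_cpu_offload()  # keep idle components on the CPU
pipe.vae.enable_slicing()        # decode the latent batch slice by slice
pipe.vae.enable_tiling()         # decode large frames tile by tile

# Optional: compile the transformer for faster repeated forward passes.
pipe.transformer = torch.compile(pipe.transformer)
```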
git clone https://github.com/bconstantine/SketchColour.git
cd SketchColour
We use separate conda environments for data preprocessing and for training and inference.
For data preprocessing, please use the following environment:
conda create -n sketchcolour_preprocess python=3.8
conda activate sketchcolour_preprocess
pip install -r requirements.txt
To set up the training and inference environment, use:
conda env create -f sketchcolour.yml
We used the SAKUGA dataset as the basis for our training, validation, and test sets. The dataset can be found on GitHub here.
Download the SAKUGA dataset and place the files in the following directory structure:
.
├── dataset
│   ├── sampled_train
│   │   ├── parquet   (place the corresponding SAKUGA parquet files here)
│   │   ├── download  (place the SAKUGA mp4 content here)
│   │   ├── split     (generated by `preprocess/preprocess_video_split_sampled_trim.py` for scene splitting)
│   │   ├── keyframe  (generated by `preprocess/preprocess_video_split_sampled_trim.py` for video cropping and keyframe extraction)
│   │   └── sketch    (generated by `preprocess/preprocess_keyframe_sketch.py` for sketch generation)
│   ├── sampled_val   (same structure as sampled_train)
│   └── sampled_test  (same structure as sampled_train)
Preprocessing consists of three steps: scene splitting, video cropping, and keyframe generation. We have provided convenience scripts to run these steps:
conda activate sketchcolour_preprocess
./run_video_splitting_sampled_train.sh
./run_video_splitting_sampled_val.sh
./run_video_splitting_sampled_test.sh
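As a rough illustration of what the scene-splitting step does, here is a minimal sketch using a PySceneDetect-style content detector. This is an assumption for illustration only; the actual splitting logic and thresholds live in `preprocess/preprocess_video_split_sampled_trim.py`.

```python
# Minimal sketch of scene splitting, assuming a PySceneDetect-style detector.
# The repository's own logic may use a different method or thresholds.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

video_path = "dataset/sampled_train/download/example.mp4"  # hypothetical input
scene_list = detect(video_path, ContentDetector())

# Write one clip per detected scene; the repo scripts place the results
# under the split/ folder shown in the directory structure above.
split_video_ffmpeg(video_path, scene_list)
```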
To run sketch generation, first download the netG_A_latest model, available from the InformativeDrawings repository. After placing those weights into the weights/ folder, sketch generation can be run as follows:
conda activate sketchcolour_preprocess
./run_sketch_generation_sampled_train.sh
./run_sketch_generation_sampled_val.sh
./run_sketch_generation_sampled_test.sh
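For orientation, the sketch below shows roughly how a keyframe is converted into a line drawing with the InformativeDrawings generator. The `Generator` import, its constructor arguments, and the weight filename are assumptions based on the InformativeDrawings codebase; `preprocess/preprocess_keyframe_sketch.py` contains the exact settings we use.

```python
# Rough sketch of the sketch-generation step (constructor arguments and the
# weight filename are assumptions; see preprocess/preprocess_keyframe_sketch.py).
import torch
from PIL import Image
from torchvision import transforms
from model import Generator  # Generator class from the InformativeDrawings codebase

net = Generator(3, 1, 3)  # RGB in, single-channel line drawing out
net.load_state_dict(torch.load("weights/netG_A_latest.pth", map_location="cpu"))
net.eval()

to_tensor = transforms.Compose([transforms.Resize(512), transforms.ToTensor()])
frame = to_tensor(Image.open("keyframe.png").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    sketch = net(frame)  # values in [0, 1], white background, dark lines
transforms.ToPILImage()(sketch.squeeze(0)).save("sketch.png")
```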
Finally, our training script requires the input format to be similar to the training structure for CogVideoX. We provide a script to restructure the files accordingly:
conda activate sketchcolour_preprocess
python preprocess/cogvideox_preprocess_required_format.py
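After this step, each split should roughly follow the layout finetrainers expects for CogVideoX-style training, something like the tree below. The exact filenames are an assumption on our part; they are determined by the script above and the finetrainers documentation.

.
├── prompts.txt   (one caption per line)
├── videos.txt    (relative path to the corresponding clip, one per line)
└── videos
    ├── 00000.mp4
    ├── 00001.mp4
    └── ...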
Our full model weights can be downloaded here. Place the final_model/ folder into src/train_logs/.
We wrote a custom training script on top of finetrainers, using CogVideoX-I2V-5B as our base model. The training script implements torch compilation, VAE slicing, VAE tiling, and CPU offloading, which keeps peak GPU memory usage at 28 GB per GPU per batch (DDP).
For training, run the provided script:
conda activate sketchcolour
./src/examples/training/control/cogvideox/i2v-control/train.sh
You may configure GPU use by modifying train.sh. Refer to the finetrainers documentation for details on the appropriate arguments. We ran training for 40K steps with a batch size of 2, which took roughly 4 days to complete.
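As context for what the training script learns, the "channel concat" in SketchColour refers to conditioning the DiT on the sketch by concatenating the sketch latents with the noisy video latents along the channel dimension and widening the patch-embedding projection to match. The snippet below is a minimal sketch of that idea with assumed tensor shapes and layer sizes, not the exact training code.

```python
import torch
import torch.nn as nn

# Minimal sketch of channel-concat conditioning for a DiT. Shapes and the
# patch-embedding configuration are assumptions, not the exact training values.
B, C, F, H, W = 1, 16, 13, 60, 90
noisy_latents = torch.randn(B, C, F, H, W)   # noised colour-video latents
sketch_latents = torch.randn(B, C, F, H, W)  # VAE-encoded sketch video

# Concatenate along the channel axis; the DiT now sees 2C input channels.
dit_input = torch.cat([noisy_latents, sketch_latents], dim=1)  # [B, 2C, F, H, W]

# The patch-embedding projection must be widened from C to 2C input channels.
# A common trick is to copy the pretrained weights into the first C channels
# and zero-initialise the new ones, so training starts from the base model.
embed_dim = 3072  # illustrative
old_proj = nn.Conv2d(C, embed_dim, kernel_size=2, stride=2)      # pretrained
new_proj = nn.Conv2d(2 * C, embed_dim, kernel_size=2, stride=2)  # widened
with torch.no_grad():
    new_proj.weight.zero_()
    new_proj.weight[:, :C] = old_proj.weight
    new_proj.bias.copy_(old_proj.bias)
```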
We also wrote a custom inference script, again using finetrainers. The inference script implements torch compilation, VAE slicing, VAE tiling, and CPU offloading, which keeps peak GPU memory usage at 37 GB per GPU per batch (DDP). Each video (using 50 denoising steps) takes around 15 minutes to complete.
As diffusers CPU offloading is incompatible with multi-GPU inference, we implemented parallel inference by running a single main shell script (inference.sh) that delegates work to sub-processes (mini_run.sh).
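Conceptually, the parallelisation works like the sketch below: the main process shards the test samples and launches one worker per GPU, pinning each worker to its device via CUDA_VISIBLE_DEVICES. The worker script name and flags here are hypothetical; the real logic lives in inference.sh and mini_run.sh.

```python
# Illustrative launcher: shard the workload across GPUs by pinning each
# subprocess to one device. The worker script name and flags are hypothetical.
import os
import subprocess

NUM_GPUS = 2
samples = sorted(os.listdir("dataset/sampled_test/sketch"))  # items to colour

procs = []
for gpu in range(NUM_GPUS):
    shard = samples[gpu::NUM_GPUS]  # round-robin shard for this GPU
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
    procs.append(subprocess.Popen(
        ["python", "inference_worker.py", "--samples", ",".join(shard)],
        env=env,
    ))

for p in procs:
    p.wait()
```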
To run inference, run the provided script:
conda activate sketchcolour
./src/examples/training/control/cogvideox/i2v-control/inference.sh
You may configure GPU use by modifying run.sh. Refer to the finetrainers documentation for details on the appropriate arguments.
We would like to express our gratitude to the following open-source projects that have been instrumental during our development process:
- CogVideo: An open-source video generation framework by THUDM, which we use as our DiT base model.
- finetrainers: A memory-optimized training library for diffusion models. Our entire training and inference pipeline uses the finetrainers repository as its base.
- sakuga: A large-scale animation dataset. We use the SAKUGA dataset for our training and evaluation.
Special thanks to the contributors of these works for their hard work and dedication!
If you find this work useful, please consider giving a star and citing it!
@article{sadihin2025sketchcolour,
author = {Bryan Constantine Sadihin and Michael Hua Wang and
Shei Pern Chua and Hang Su},
title = {SketchColour: Channel Concat Guided DiT-based Sketch-to-Colour
Pipeline for 2D Animation},
journal = {arXiv preprint arXiv:2507.01586},
year = {2025},
note = {URL: \url{https://arxiv.org/abs/2507.01586}}
}