SketchColour: Channel Concat Guided DiT-based Sketch-to-Colour Pipeline for 2D Animation
SketchColour takes the coloured first frame of a scene, together with the entire scene in sketch form, and colours each subsequent frame based on that reference. Evaluated on the SAKUGA dataset, SketchColour outperforms state-of-the-art video colourization methods (including LVCD, ToonCrafter, and AniDoc) across all metrics, despite using only half the training data of competing models. Compared with previous work, it produces accurate colourization that adheres closely to the sketch reference while minimizing colour bleeding.
Please see our demo page for results.
- Release the paper and demo page. Visit https://bconstantine.github.io/SketchColour
- Release the inference code. We also provide some inference samples (first frame, caption, and sketch) taken from the SAKUGA dataset.
- Release the training code.
- Build Gradio Demo
SketchColour is implemented in PyTorch. Training and inference were performed on 2 NVIDIA A40 GPUs (DDP). We use mixed precision during training, with float32 precision for the transformers and bfloat16 precision for the remaining components. We also precompute latents for training, while CPU offloading, VAE slicing, and VAE tiling are used for both training and inference. This results in a per-GPU memory requirement of 28 GB for training and 37 GB for inference. To speed up training, we use Torch compilation.
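For reference, the snippet below shows how these memory-saving switches are typically enabled on a diffusers CogVideoX pipeline. It only illustrates the options named above; our actual training and inference entry points are built on finetrainers and differ in detail.

```python
# Illustration of the memory-saving options mentioned above on a diffusers
# CogVideoX image-to-video pipeline (not our actual entry point).
import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)

pipe.enable_model_cpu_offload()  # keep idle components on the CPU
pipe.vae.enable_slicing()        # decode the latent batch slice by slice
pipe.vae.enable_tiling()         # decode large frames tile by tile

# Optional: compile the transformer for faster repeated forward passes.
pipe.transformer = torch.compile(pipe.transformer)
```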
git clone https://github.com/bconstantine/SketchColour.git
cd SketchColour
We use separate conda environments for data preprocessing and for training and inference.
For data preprocessing, please use the following environment:
conda create -n sketchcolour_preprocess python=3.8
conda activate sketchcolour_preprocess
pip install -r requirements.txt
To set up the training and inference environment, use:
conda env create -f sketchcolour.yml
We used the SAKUGA dataset as the basis for our training, validation, and test sets. The dataset can be found on GitHub here.
Download the SAKUGA dataset and place the files in the following directory structure:
.
├── dataset
│   ├── sampled_train
│   │   ├── parquet   (place the corresponding SAKUGA parquet files here)
│   │   ├── download  (place the SAKUGA mp4 content here)
│   │   ├── split     (generated by `preprocess/preprocess_video_split_sampled_trim.py` for scene splitting)
│   │   ├── keyframe  (generated by `preprocess/preprocess_video_split_sampled_trim.py` for video cropping and keyframe extraction)
│   │   └── sketch    (generated by `preprocess/preprocess_keyframe_sketch.py` for sketch generation)
│   ├── sampled_val   (same structure as sampled_train)
│   └── sampled_test  (same structure as sampled_train)
Preprocessing consists of three steps: scene splitting, video cropping, and keyframe generation. We have provided convenience scripts to run these steps:
conda activate sketchcolour_preprocess
./run_video_splitting_sampled_train.sh
./run_video_splitting_sampled_val.sh
./run_video_splitting_sampled_test.sh
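As a rough illustration of what the scene-splitting step does, here is a minimal sketch using a PySceneDetect-style content detector. This is an assumption for illustration only; the actual splitting logic and thresholds live in `preprocess/preprocess_video_split_sampled_trim.py`.

```python
# Minimal sketch of scene splitting, assuming a PySceneDetect-style detector.
# The repository's own logic may use a different method or thresholds.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

video_path = "dataset/sampled_train/download/example.mp4"  # hypothetical input
scene_list = detect(video_path, ContentDetector())

# Write one clip per detected scene; the repo scripts place the results
# under the split/ folder shown in the directory structure above.
split_video_ffmpeg(video_path, scene_list)
```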
To run sketch generation, first download the netG_A_latest model, available from the InformativeDrawings repository. After placing those weights into the weights/ folder, sketch generation can be run as follows:
conda activate sketchcolour_preprocess
./run_sketch_generation_sampled_train.sh
./run_sketch_generation_sampled_val.sh
./run_sketch_generation_sampled_test.sh
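For orientation, the sketch below shows roughly how a keyframe is converted into a line drawing with the InformativeDrawings generator. The `Generator` import, its constructor arguments, and the weight filename are assumptions based on the InformativeDrawings codebase; `preprocess/preprocess_keyframe_sketch.py` contains the exact settings we use.

```python
# Rough sketch of the sketch-generation step (constructor arguments and the
# weight filename are assumptions; see preprocess/preprocess_keyframe_sketch.py).
import torch
from PIL import Image
from torchvision import transforms
from model import Generator  # Generator class from the InformativeDrawings codebase

net = Generator(3, 1, 3)  # RGB in, single-channel line drawing out
net.load_state_dict(torch.load("weights/netG_A_latest.pth", map_location="cpu"))
net.eval()

to_tensor = transforms.Compose([transforms.Resize(512), transforms.ToTensor()])
frame = to_tensor(Image.open("keyframe.png").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    sketch = net(frame)  # values in [0, 1], white background, dark lines
transforms.ToPILImage()(sketch.squeeze(0)).save("sketch.png")
```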
Finally, our training script requires the input format to be similar to the training structure for CogVideoX. We provide a script to restructure the files accordingly:
conda activate sketchcolour_preprocess
python preprocess/cogvideox_preprocess_required_format.py
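After this step, each split should roughly follow the layout finetrainers expects for CogVideoX-style training, something like the tree below. The exact filenames are an assumption on our part; they are determined by the script above and the finetrainers documentation.

.
├── prompts.txt   (one caption per line)
├── videos.txt    (relative path to the corresponding clip, one per line)
└── videos
    ├── 00000.mp4
    ├── 00001.mp4
    └── ...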
Our full model weights can be downloaded here. Place the final_model/ folder into src/train_logs/.
We wrote a custom training script on top of finetrainers, using CogVideoX-I2V-5B as our base model. The training script implements torch compilation, VAE slicing, VAE tiling, and CPU offloading, which keeps peak GPU memory usage at 28 GB per GPU per batch (DDP).
For training, run the provided script:
conda activate sketchcolour
./src/examples/training/control/cogvideox/i2v-control/train.sh
You may configure GPU use by modifying train.sh. Refer to the finetrainers documentation for details on the appropriate arguments. We ran training for 40K steps with a batch size of 2, which took roughly 4 days to complete.
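As context for what the training script learns, the "channel concat" in SketchColour refers to conditioning the DiT on the sketch by concatenating the sketch latents with the noisy video latents along the channel dimension and widening the patch-embedding projection to match. The snippet below is a minimal sketch of that idea with assumed tensor shapes and layer sizes, not the exact training code.

```python
import torch
import torch.nn as nn

# Minimal sketch of channel-concat conditioning for a DiT. Shapes and the
# patch-embedding configuration are assumptions, not the exact training values.
B, C, F, H, W = 1, 16, 13, 60, 90
noisy_latents = torch.randn(B, C, F, H, W)   # noised colour-video latents
sketch_latents = torch.randn(B, C, F, H, W)  # VAE-encoded sketch video

# Concatenate along the channel axis; the DiT now sees 2C input channels.
dit_input = torch.cat([noisy_latents, sketch_latents], dim=1)  # [B, 2C, F, H, W]

# The patch-embedding projection must be widened from C to 2C input channels.
# A common trick is to copy the pretrained weights into the first C channels
# and zero-initialise the new ones, so training starts from the base model.
embed_dim = 3072  # illustrative
old_proj = nn.Conv2d(C, embed_dim, kernel_size=2, stride=2)      # pretrained
new_proj = nn.Conv2d(2 * C, embed_dim, kernel_size=2, stride=2)  # widened
with torch.no_grad():
    new_proj.weight.zero_()
    new_proj.weight[:, :C] = old_proj.weight
    new_proj.bias.copy_(old_proj.bias)
```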
We also wrote a custom inference script, again using finetrainers. The inference script implements torch compilation, VAE slicing, VAE tiling, and CPU offloading, which keeps peak GPU memory usage at 37 GB per GPU per batch (DDP). Each video (using 50 denoising steps) takes around 15 minutes to complete.
As diffusers CPU offloading is incompatible with multi-GPU inference, we implemented parallel inference by running a single main shell script (inference.sh) that delegates work to sub-processes (mini_run.sh).
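Conceptually, the parallelisation works like the sketch below: the main process shards the test samples and launches one worker per GPU, pinning each worker to its device via CUDA_VISIBLE_DEVICES. The worker script name and flags here are hypothetical; the real logic lives in inference.sh and mini_run.sh.

```python
# Illustrative launcher: shard the workload across GPUs by pinning each
# subprocess to one device. The worker script name and flags are hypothetical.
import os
import subprocess

NUM_GPUS = 2
samples = sorted(os.listdir("dataset/sampled_test/sketch"))  # items to colour

procs = []
for gpu in range(NUM_GPUS):
    shard = samples[gpu::NUM_GPUS]  # round-robin shard for this GPU
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
    procs.append(subprocess.Popen(
        ["python", "inference_worker.py", "--samples", ",".join(shard)],
        env=env,
    ))

for p in procs:
    p.wait()
```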
To run inference, run the provided script:
conda activate sketchcolour
./src/examples/training/control/cogvideox/i2v-control/inference.sh
You may configure GPU use by modifying run.sh. Refer to the finetrainers documentation for details on the appropriate arguments.
We would like to express our gratitude to the following open-source projects that have been instrumental during our development process:
- CogVideo: An open-source video generation framework by THUDM, which we use as our DiT base model.
- finetrainers: A memory-optimized training library for diffusion models. Our entire training and inference pipeline uses the finetrainers repository as its base.
- sakuga: A large-scale animation dataset. We use the SAKUGA dataset for our training and evaluation.
Special thanks to the contributors of these works for their hard work and dedication!
If you find this work useful, please consider giving a star and citing it!
@article{sadihin2025sketchcolour,
author = {Bryan Constantine Sadihin and Michael Hua Wang and
Shei Pern Chua and Hang Su},
title = {SketchColour: Channel Concat Guided DiT-based Sketch-to-Colour
Pipeline for 2D Animation},
journal = {arXiv preprint arXiv:2507.01586},
year = {2025},
note = {URL: \url{https://arxiv.org/abs/2507.01586}}
}