rCM: Score-Regularized Continuous-Time Consistency Model
🚀 SOTA Diffusion Distillation & Few-Step Video Generation
rCM is the first work that:
- Scales up continuous-time consistency distillation (e.g., sCM/MeanFlow) to 10B+ parameter video diffusion models.
- Provides an open-source FlashAttention-2 Jacobian-vector product (JVP) kernel with support for parallelisms such as FSDP/CP.
- Identifies the quality bottleneck of sCM and overcomes it via a forward–reverse divergence joint distillation framework (sketched below).
- Delivers models that generate videos with both high quality and strong diversity in only 2–4 steps.
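For orientation, the joint objective can be read schematically as a consistency-distillation term (forward divergence) regularized by a score-distillation term (reverse divergence). This is our shorthand, not the paper's exact formulation; see the paper for the precise losses and weighting:

```latex
% Schematic only; the exact divergences and weighting follow the paper.
\mathcal{L}_{\mathrm{rCM}}
  = \underbrace{\mathcal{L}_{\mathrm{sCM}}}_{\text{forward divergence (consistency)}}
  + \lambda \, \underbrace{\mathcal{L}_{\mathrm{score}}}_{\text{reverse divergence (score regularization)}}
```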
| sCM | DMD2 | rCM (Ours) |
|---|---|---|
| Wan1.3B-sCM-4step.mp4 | Wan1.3B-DMD2-4step.mp4 | Wan1.3B-rCM-4step.mp4 |
rCM achieves both high quality and strong diversity.
| 1-step | 2-step | 4-step |
|---|---|---|
| 1step.mp4 | 2step.mp4 | 4step.mp4 |
hotpot.mp4
This codebase is built on top of Cosmos-Predict2. Please follow its environment setup instructions.
Below is an example inference script for running rCM on T2V:
```bash
# Basic usage:
#   PYTHONPATH=. python rcm/inference/wan2pt1_t2v_rcm_infer.py [arguments]
#
# Arguments:
#   --model_size         Model size: "1.3B" or "14B" (default: 1.3B)
#   --num_samples        Number of videos to generate (default: 1)
#   --num_steps          Sampling steps, 1-4 (default: 4)
#   --sigma_max          Initial sigma for rCM (default: 80); larger values (e.g., 1600) reduce diversity but may enhance quality
#   --dit_path           Path to the distilled DiT checkpoint (REQUIRED for inference)
#   --vae_path           Path to the Wan2.1 VAE (default: checkpoints/Wan2.1_VAE.pth)
#   --text_encoder_path  Path to the umT5 text encoder (default: checkpoints/models_t5_umt5-xxl-enc-bf16.pth)
#   --prompt             Text prompt for video generation (default: A stylish woman walks down a Tokyo street...)
#   --resolution         Output resolution, e.g., "480p", "720p" (default: 480p)
#   --aspect_ratio       Aspect ratio in W:H format (default: 16:9)
#   --seed               Random seed for reproducibility (default: 0)
#   --save_path          Output file path including extension (default: output/generated_video.mp4)

# Example
PYTHONPATH=. python rcm/inference/wan2pt1_t2v_rcm_infer.py \
    --dit_path checkpoints/rCM_Wan2.1_T2V_1.3B_480p.pt \
    --num_samples 5 \
    --prompt "A cinematic shot of a snowy mountain at sunrise"
```

See Wan examples for additional usage examples.
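For intuition about what `--num_steps` and `--sigma_max` control, below is a minimal sketch of generic multistep consistency sampling (a common pattern for consistency models; this is our simplification, not the repo's exact sampler). Sampling starts from pure noise at `sigma_max`, maps to a clean estimate in one call, then alternates re-noising at decreasing sigmas with further denoising calls:

```python
import torch

def multistep_consistency_sample(f, shape, sigmas, device="cuda"):
    """Generic multistep consistency sampling (illustrative only).

    `f(x, sigma)` is a consistency model mapping a noisy sample at noise
    level `sigma` to a clean estimate; `sigmas` is a decreasing schedule
    whose first entry plays the role of --sigma_max.
    """
    x = sigmas[0] * torch.randn(shape, device=device)  # pure noise at sigma_max
    x0 = f(x, sigmas[0])                               # 1-step sample
    for sigma in sigmas[1:]:
        # Re-noise the clean estimate to a smaller noise level, then
        # denoise again; each extra step trades speed for refinement.
        x = x0 + sigma * torch.randn(shape, device=device)
        x0 = f(x, sigma)
    return x0
```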
The full distillation pipeline still requires refactoring. We provide essential reference code for the key components:
- FlashAttention-2 JVP kernel: `rcm/utils/flash_attention_jvp_triton.py` (see the JVP sketch after this list)
- JVP-adapted Wan2.1 student network: `rcm/networks/wan2pt1_jvp.py`
- Training: `rcm/models/t2v_model_distill_rcm.py`
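Continuous-time consistency distillation differentiates the student network along the diffusion trajectory, which requires Jacobian-vector products through every layer, including attention. As a hedged illustration of the quantity the Triton kernel fuses (its actual API may differ), forward-mode AD over a plain PyTorch attention yields the same JVP; `naive_attention` here is our own reference implementation, not a repo function:

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Plain scaled dot-product attention; traceable by forward-mode AD.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# Primal inputs and tangent directions: (batch, heads, seq, head_dim).
q, k, v = (torch.randn(1, 4, 128, 64) for _ in range(3))
tq, tk, tv = (torch.randn(1, 4, 128, 64) for _ in range(3))

# torch.func.jvp returns the attention output together with its directional
# derivative along (tq, tk, tv). A fused kernel such as the one in
# rcm/utils/flash_attention_jvp_triton.py computes the same quantity without
# materializing the full attention matrix, which is what makes JVPs feasible
# at video-model sequence lengths.
out, out_tangent = torch.func.jvp(naive_attention, (q, k, v), (tq, tk, tv))
print(out.shape, out_tangent.shape)  # both torch.Size([1, 4, 128, 64])
```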
There are promising directions to explore building on rCM. For example:
- Few-step distilled models lag behind the teacher in aspects such as physical consistency; this can potentially be improved via reward-based post-training.
- The forward–reverse divergence joint distillation framework of rCM could be extended to autoregressive video diffusion.
We thank the Cosmos-Predict2 project for providing the awesome open-source video diffusion training codebase.
```bibtex
@article{zheng2025rcm,
  title={Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency},
  author={Zheng, Kaiwen and Wang, Yuji and Ma, Qianli and Chen, Huayu and Zhang, Jintao and Balaji, Yogesh and Chen, Jianfei and Liu, Ming-Yu and Zhu, Jun and Zhang, Qinsheng},
  journal={arXiv preprint arXiv:2510.08431},
  year={2025}
}
```