Project Page | Technical Report | Try generating videos with Waver on Discord
Waver 1.0 is a next-generation, universal foundation model family for unified image and video generation, built on rectified flow Transformers and engineered for industry-grade performance.
- All-in-One Model: Simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework.
- 1080p Resolution and Flexible Length: Supports up to 1080p high-resolution image and video generation, with flexible resolution, aspect ratio, and video length (2–10 seconds).
- SOTA Performance: Ranks in the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data up to 2025-08-05 12:00 GMT+8), consistently outperforming existing open-source models and matching or exceeding state-of-the-art commercial solutions.
- Superior Motion Modeling: Excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis.
Waver: Wave Your Way to Lifelike Video Generation
Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Zehuan Yuan, Bingyue Peng
demo_hq.mp4
demo_motion.mp4
To comprehensively evaluate the capabilities of the model and to explore the upper limits of its motion generation abilities, we propose Waver-Bench 1.0. It consists of 304 samples covering a wide range of scenarios, including sports, daily activities, landscapes, animals, machinery, surreal scenes, and animations.
common_show.mp4
Complex and large-amplitude motions have consistently posed significant challenges in text-to-video generation. To systematically evaluate and compare the upper limits of various text-to-video models in terms of motion generation, we constructed a comprehensive benchmark called the Hermes Motion Testset. This evaluation set comprises 96 prompts covering 32 distinct types of sports activities, including tennis, basketball, gymnastics, rowing, boxing, and equestrianism.
olympic.mp4
We compared Waver 1.0 with leading open-source and closed-source models on Waver-Bench 1.0 and the Hermes Motion Testset in terms of motion quality, visual quality, and prompt following. In manual evaluation, Waver 1.0's results are superior to those of both closed-source and open-source models.
Waver 1.0 is an all-in-one video generation model that excels in text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation. It offers flexible resolution and aspect ratio, arbitrary video length, and is easily extensible for controllable video generation.
We use Wan-VAE to obtain a compressed video latent for its efficiency, and adopt flan-t5-xxl and Qwen2.5-32B-Instruct to extract text features. Our DiT model is built upon rectified flow Transformers. Video and text modalities are fused using a Dual Stream + Single Stream approach, where M and N denote the numbers of the two types of blocks. We achieve joint training of T2V and I2V within a single model simply by modifying the input channels: 16 for the video tokens, 16 for the image (first-frame) tokens, and 4 for the task mask. We incorporate the image latent with a probability of 20% during joint training of T2V and I2V.
| Model | M | N | Input Dimension | Output Dimension | Num Heads | Head Dim |
|---|---|---|---|---|---|---|
| 12B | 16 | 40 | 36 | 16 | 24 | 128 |
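To make the joint-conditioning scheme concrete, the sketch below assembles the 36-channel DiT input by concatenating the 16-channel video latent, a 16-channel first-frame latent (zeroed for T2V samples), and a 4-channel task mask along the channel dimension, keeping the image branch for roughly 20% of samples. The function name, tensor layout, and the way the probability is applied are illustrative assumptions, not the released implementation.

```python
import torch

def build_dit_input(video_latent, first_frame_latent, p_i2v=0.2):
    """Assemble the 36-channel DiT input for joint T2V/I2V training (illustrative sketch).

    video_latent:       [B, 16, T, H, W] VAE latent of the target video
    first_frame_latent: [B, 16, 1, H, W] VAE latent of the conditioning frame
    """
    B, _, T, H, W = video_latent.shape
    device = video_latent.device

    # With probability ~20%, a sample keeps its first-frame condition (I2V);
    # otherwise the image branch is zeroed out and the sample is trained as T2V.
    use_image = (torch.rand(B, device=device) < p_i2v).float()

    # 16 channels: first-frame latent broadcast over time, zeroed for T2V samples.
    image_cond = first_frame_latent.expand(-1, -1, T, -1, -1) * use_image.view(B, 1, 1, 1, 1)

    # 4 channels: task mask marking which frames are given as conditioning.
    task_mask = torch.zeros(B, 4, T, H, W, device=device)
    task_mask[:, :, 0] = use_image.view(B, 1, 1, 1)

    # Final input: 16 (video) + 16 (image) + 4 (mask) = 36 channels.
    return torch.cat([video_latent, image_cond, task_mask], dim=1)
```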
Waver-Refiner adopts the DiT architecture and is trained with the flow matching method. We first upsample the low-resolution video (480p or 720p) to 1080p and then add noise to it. The refiner takes this noised, upsampled video as input and outputs a high-quality 1080p video. The refiner employs a window attention mechanism and halves the number of inference steps, significantly improving inference speed. Specifically, inference time is reduced by ~40% for 720p→1080p and ~60% for 480p→1080p compared with direct 1080p generation.
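A minimal sketch of this refinement path is given below, assuming an Euler sampler over a rectified flow velocity field; the upsampling method, noise strength, and step count are placeholders for illustration rather than the released configuration.

```python
import torch
import torch.nn.functional as F

def refine_to_1080p(low_res_video, refiner, num_steps=20, noise_strength=0.7):
    """Illustrative sketch of the Waver-Refiner inference path (not the released code).

    low_res_video: [B, C, T, H, W] 480p or 720p video tensor
    refiner:       a DiT denoiser trained with flow matching and window attention
    """
    B, C, T, H, W = low_res_video.shape

    # 1. Upsample the low-resolution video to 1080p (bilinear here, purely illustrative).
    frames = low_res_video.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
    frames = F.interpolate(frames, size=(1080, 1920), mode="bilinear", align_corners=False)
    video_1080 = frames.reshape(B, T, C, 1080, 1920).permute(0, 2, 1, 3, 4)

    # 2. Add noise so the refiner starts partway along the flow trajectory,
    #    which is why it needs roughly half the steps of direct 1080p generation.
    noise = torch.randn_like(video_1080)
    x = (1.0 - noise_strength) * video_1080 + noise_strength * noise

    # 3. Integrate the learned velocity field from t = noise_strength down to 0.
    timesteps = torch.linspace(noise_strength, 0.0, num_steps + 1, device=x.device)
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        v = refiner(x, t_cur.expand(B))   # predicted velocity (rectified flow)
        x = x + (t_next - t_cur) * v      # simple Euler step
    return x
```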
We found that training on low-resolution videos is crucial for learning motion, so we invested substantial computational resources in training on 192p videos and then gradually increased the resolution to 480p and 720p. Following the flow matching training setting in SD3, we gradually increase the sigma shift when training on 480p and 720p videos; for 720p training, we set the sigma shift to 3.0 during training and 7.0 during inference. For T2I, we sample timesteps from a lognorm(0.5, 1) distribution; for T2V and I2V, we use mode(1.29) sampling. Empirically, we find that the mode sampling strategy yields greater motion in video generation tasks.
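For reference, the sketch below writes out the two timestep samplers in the form given by the SD3 report, together with the sigma-shift mapping; the exact parameterization used in Waver training is an assumption, not confirmed by the source.

```python
import torch

def sample_t_lognorm(batch_size, mean=0.5, std=1.0):
    """Logit-normal timestep sampling, lognorm(0.5, 1): t = sigmoid(u), u ~ N(mean, std)."""
    u = torch.randn(batch_size) * std + mean
    return torch.sigmoid(u)

def sample_t_mode(batch_size, s=1.29):
    """SD3 'mode' sampling with heavy tails, used here for T2V/I2V:
    t = 1 - u - s * (cos^2(pi*u/2) - 1 + u), with u ~ Uniform(0, 1)."""
    u = torch.rand(batch_size)
    return 1.0 - u - s * (torch.cos(torch.pi / 2.0 * u) ** 2 - 1.0 + u)

def shift_t(t, shift=3.0):
    """SD3-style sigma shift: larger shift pushes more steps toward the high-noise region."""
    return shift * t / (1.0 + (shift - 1.0) * t)
```

Under this mapping, a shift of 3.0 would be applied to the 720p training timesteps and a shift of 7.0 to the inference schedule, matching the values quoted above.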
We employ a prompt tagging approach to distinguish between different types of training data, assigning distinct tags based on both video style and video quality. During training, we prepend a prompt describing the video's style to the caption and append a prompt describing the video's quality to the end of the caption. During inference, we place prompts describing undesirable qualities, such as low definition or slow motion, into the negative prompt. For specific style requirements (e.g., anime style), we prepend the corresponding descriptive prompt to the overall prompt via prompt rewriting.
The following videos show six different styles for the same prompt, "A man and a woman are walking hand in hand along a bustling city street at night.": Real, Ghibli-inspired 2D animation, 3D animation, 3D animation in voxel style, Disney animated film style, and 2D animation in a cartoon picture book style.
anime_show.mp4
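As a minimal sketch of the tagging scheme described above, the snippet below assembles a tagged training caption and a positive/negative prompt pair for inference; the tag strings and helper names are hypothetical and do not reflect Waver's actual tags.

```python
def build_training_caption(caption, style_tag, quality_tag):
    """Prepend a style tag and append a quality tag to a training caption (illustrative)."""
    return f"{style_tag} {caption} {quality_tag}"

def build_inference_prompts(user_prompt, style_tag=None):
    """Compose positive/negative prompts at inference time (tag strings are hypothetical)."""
    positive = f"{style_tag} {user_prompt}" if style_tag else user_prompt
    # Undesirable qualities are pushed into the negative prompt, as described above.
    negative = "low definition, slow motion"
    return positive, negative

# Example: request a Ghibli-inspired 2D animation rendering of a prompt via a style tag.
pos, neg = build_inference_prompts(
    "A man and a woman are walking hand in hand along a bustling city street at night.",
    style_tag="Ghibli-inspired 2D animation.",
)
```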
We extend APG to video generation to enhance realism and reduce artifacts. APG decomposes the update term in CFG into parallel and orthogonal components and down-weights the parallel component to achieve high-quality generations without oversaturation. We find that normalizing the latent over the [C, H, W] dimensions yields fewer artifacts than normalizing over [C, T, H, W]. For the hyperparameters, a normalization threshold of 27 and a guidance scale of 8 achieve a good balance between realism and artifacts.
apg.mp4
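A rough sketch of the APG update described above is given below: the CFG update term is rescaled per frame over the [C, H, W] dimensions, then split into components parallel and orthogonal to the conditional prediction, with the parallel part down-weighted. The down-weighting factor eta and the exact normalization details are assumptions rather than Waver's released settings.

```python
import torch

def apg_guidance(cond, uncond, guidance_scale=8.0, norm_threshold=27.0, eta=0.0):
    """Adaptive Projected Guidance for video latents (illustrative sketch).

    cond, uncond: [B, C, T, H, W] conditional / unconditional model predictions.
    """
    diff = cond - uncond

    # Rescale the update term per frame: one norm per batch element and time index,
    # i.e. taken over the [C, H, W] dimensions as described above.
    norm = torch.linalg.vector_norm(diff, dim=(1, 3, 4), keepdim=True)
    diff = diff * torch.clamp(norm_threshold / (norm + 1e-8), max=1.0)

    # Decompose the update into parallel / orthogonal parts w.r.t. the cond prediction.
    cond_flat = cond.flatten(1)
    diff_flat = diff.flatten(1)
    unit = cond_flat / (cond_flat.norm(dim=1, keepdim=True) + 1e-8)
    parallel = (diff_flat * unit).sum(dim=1, keepdim=True) * unit
    orthogonal = diff_flat - parallel

    # Down-weight the parallel component (eta < 1) to avoid oversaturation.
    guided = cond_flat + (guidance_scale - 1.0) * (eta * parallel + orthogonal)
    return guided.view_as(cond)
```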
@article{zhang2025Waver,
title={Waver: Wave Your Way to Lifelike Video Generation},
author={Zhang, Yifu and Yang, Hao and Zhang, Yuqi and Hu, Yifei and Zhu, Fengda and Lin, Chuang and Mei, Xiaofeng and Jiang, Yi and Yuan, Zehuan and Peng, Bingyue},
journal={arXiv preprint arXiv:2508.15761},
year={2025}
}