- [Update Aug. 27, 2025] We present a new variant of MeanAudio: MeanAudio-L-Full, a 480M latent flow transformer achieving SOTA performance on both single-step and multi-step audio generation. Try it out at our 🤗 Hugging Face space!
- [Update Aug. 17, 2025] We present MeanAudio-S-Full: a 120M latent flow transformer trained with the MeanFlow objective on ~10,000 hours of audio data sourced from AudioCaps, AudioSet, WavCaps, VGGSound, MusicCaps, and LP-MusicCaps.
MeanAudio is a novel MeanFlow-based model tailored for fast and faithful text-to-audio generation. It can synthesize realistic sound in a single step, achieving a real-time factor (RTF) of 0.013 on a single NVIDIA 3090 GPU, i.e., roughly 77× faster than real time. It also demonstrates strong performance in multi-step generation.
1. Create a new conda environment:

```bash
conda create -n meanaudio python=3.11 -y
conda activate meanaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
```
2. Clone the repository and install with pip:

```bash
git clone https://github.com/xiquan-li/MeanAudio.git
cd MeanAudio
pip install -e .
```
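Optionally, verify that PyTorch was installed with CUDA support:

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```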
To generate audio with our pre-trained model, simply run:

```bash
python demo.py --prompt 'your prompt' --num_steps 1
```

This will automatically download the pre-trained checkpoints from Hugging Face and generate audio according to your prompt. By default, this uses meanaudio-s-full. The output audio will be saved under `MeanAudio/output/`, and the checkpoints under `MeanAudio/weights/`.
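To trade speed for quality, you can raise the number of sampling steps, for example:

```bash
# Multi-step sampling with the same entry point; 25 steps is an arbitrary example
python demo.py --prompt 'a dog barking in the rain' --num_steps 25
```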
Alternatively, you can manually download the pre-trained models from this Folder and put them into `MeanAudio/weights/`. Then, you can use `scripts/meanflow/infer_meanflow.sh` and `scripts/flowmatching/infer_flowmatching.sh` to generate audio with the pre-trained models.
| Model Name | Size | Dataset | Objective | Pre-trained | Link |
|---|---|---|---|---|---|
| MeanAudio-S-AC | 120M | AudioCaps | Mean Flow | FluxAudio-S-Full | Here |
| FluxAudio-S-Full | 120M | All | Flow Matching | - | Here |
| MeanAudio-S-Full | 120M | All | Mean Flow | - | Here |
| MeanAudio-L-Full | 480M | All | Mean Flow | - | Here |
Before training, make sure that all files from here are placed in `MeanAudio/weights/`.
We first extract VAE latents & text encoder embeddings to enable fast and efficient training. `scripts/extract_audio_latents.sh` provides a detailed guide for this. The pipeline consists of two steps: a) partition audios into 10 s clips, and b) extract latents & embeddings into npz files.

To skip this laborious pre-processing step, we have uploaded an extracted version of AudioCaps. Feel free to download it from this link, unzip it, and put it under `MeanAudio/data/`. Then you can jump directly to the second step. 😊

However, if you want to train the model on datasets other than AudioCaps, you should still run `scripts/extract_audio_latents.sh` to do feature extraction. Remember to adjust `config/data/t5_clap.yaml` to point to the correct metadata paths.
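For a quick sanity check of the extracted features, you can open one of the npz files. The path and key names below are hypothetical placeholders; refer to the extraction script for the actual layout:

```python
import numpy as np

# Inspect one pre-extracted training record.
# NOTE: the file path and key names ('latent', 'text_emb') are hypothetical;
# see scripts/extract_audio_latents.sh for the actual field names.
record = np.load("data/audiocaps/clip_000001.npz", allow_pickle=True)
print(record.files)            # list the arrays stored for this clip
latent = record["latent"]      # VAE latent of a 10 s audio clip
text_emb = record["text_emb"]  # cached text-encoder embedding
```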
We rely on av-benchmark for validation & evaluation. Please install it before training.
Use the script below to train a MeanAudio model. By default, this will initialize the flow transformer from the pre-trained checkpoint `fluxaudio_fm.pth` and perform MeanFlow fine-tuning:

```bash
bash scripts/meanflow/train_meanflow.sh
```
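For intuition, the MeanFlow objective regresses an average-velocity field whose target is derived from the instantaneous velocity via a Jacobian-vector product. Below is a minimal PyTorch sketch of this loss, assuming a hypothetical `model(z, r, t, cond)` signature and a linear noising path; the actual training code in this repo may use different conventions:

```python
import torch
import torch.nn.functional as F
from torch.func import jvp

def meanflow_loss(model, x, cond):
    """Minimal MeanFlow loss sketch (not the exact MeanAudio implementation).

    x:    clean latents, shape (B, ...)
    cond: text conditioning, passed through to the network unchanged
    """
    b = x.shape[0]
    view = (-1,) + (1,) * (x.dim() - 1)
    t = torch.rand(b, device=x.device)        # current time
    r = torch.rand(b, device=x.device) * t    # earlier time, so r <= t
    eps = torch.randn_like(x)
    z_t = (1 - t.view(view)) * x + t.view(view) * eps  # linear noising path
    v = eps - x                                          # instantaneous velocity

    # Average velocity u(z_t, r, t) and its total time derivative in one pass,
    # via a Jacobian-vector product along (dz/dt, dr/dt, dt/dt) = (v, 0, 1).
    u, dudt = jvp(
        lambda z, r_, t_: model(z, r_, t_, cond),
        (z_t, r, t),
        (v, torch.zeros_like(r), torch.ones_like(t)),
    )

    # MeanFlow identity: u = v - (t - r) * du/dt, with the target detached.
    u_tgt = v - (t - r).view(view) * dudt
    return F.mse_loss(u, u_tgt.detach())
```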
Use the script below to train a Flux-style transformer with the conditional flow matching objective:

```bash
bash scripts/flowmatching/train_flowmatching.sh
```

The obtained model can serve as a strong initialization for the mixed-flow fine-tuning.
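For comparison, a minimal sketch of the conditional flow matching objective under the same hypothetical interface is shown below; note there is no `r` input and no Jacobian-vector product, since it regresses the instantaneous velocity directly:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x, cond):
    """Minimal conditional flow matching loss sketch (interface hypothetical)."""
    b = x.shape[0]
    view = (-1,) + (1,) * (x.dim() - 1)
    t = torch.rand(b, device=x.device)
    eps = torch.randn_like(x)
    z_t = (1 - t.view(view)) * x + t.view(view) * eps  # same linear path
    v_pred = model(z_t, t, cond)                       # predict velocity
    return F.mse_loss(v_pred, eps - x)                 # regress eps - x
```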
Use the script below to run evaluation. Before this, please install av-benchmark for metrics calculation. You can specify `num_steps` and `ckpt_path` to evaluate different models with different sampling steps:

```bash
bash scripts/meanflow/eval_meanflow.sh
```
```bibtex
@article{li2025meanaudio,
  title={MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows},
  author={Li, Xiquan and Liu, Junxi and Liang, Yuzhe and Niu, Zhikang and Chen, Wenxi and Chen, Xie},
  journal={arXiv preprint arXiv:2508.06098},
  year={2025}
}
```
Many thanks to:
- MMAudio for the MMDiT code and training & inference structure
- MeanFlow-pytorch and MeanFlow-official for the mean flow implementation
- Make-An-Audio 2 for the BigVGAN vocoder and the VAE
- av-benchmark for benchmarking results