xiquan-li/MeanAudio

MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

Paper Hugging Face Model Hugging Face Space Webpage

News 🔥

  • [Update Aug. 27, 2025] We present a new variant of MeanAudio: MeanAudio-L-Full, a 480M latent flow transformer achieving SOTA performance on both single-step and multi-step audio generation. Try it out at our 🤗 Hugging Face Space!

  • [Update Aug. 17, 2025] We present MeanAudio-S-Full: a 120M latent flow transformer trained with the MeanFlow objective on ~10,000 hours of audio data sourced from AudioCaps, AudioSet, WavCaps, VGGSound, MusicCaps, and LP-MusicCaps.

Overview

MeanAudio is a novel MeanFlow-based model tailored for fast and faithful text-to-audio generation. It can synthesize realistic sound in a single step, achieving a real-time factor (RTF) of 0.013 on a single NVIDIA RTX 3090 GPU. It also performs strongly in multi-step generation.
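To make the RTF figure concrete, here is a minimal sketch that uses the usual definition of real-time factor (generation time divided by the duration of the generated audio). The 10-second clip length is an assumption made for illustration, and the function names are hypothetical rather than part of this repository.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def generation_time(rtf: float, audio_seconds: float) -> float:
    """Invert the definition: how long generation takes at a given RTF."""
    return rtf * audio_seconds


# Assumption for illustration: a 10-second output clip.
clip_seconds = 10.0
print(f"{generation_time(0.013, clip_seconds):.2f} s to generate {clip_seconds:.0f} s of audio")
```

An RTF below 1 means audio is produced faster than it plays back; at 0.013, generation takes roughly 1.3% of the clip's duration.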

Environment Setup

1. Create a new conda environment:

conda create -n meanaudio python=3.11 -y
conda activate meanaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade

2. Install with pip:

git clone https://github.com/xiquan-li/MeanAudio.git

cd MeanAudio
pip install -e .
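After installation, a quick check like the one below can confirm that PyTorch is importable and whether it sees a CUDA device. This is an optional sanity check, not part of the repository; the helper name `describe_torch_install` is hypothetical.

```python
import importlib.util


def describe_torch_install() -> str:
    """Summarize the PyTorch install without failing if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    return (
        f"torch {torch.__version__}, "
        f"CUDA build: {torch.version.cuda}, "
        f"CUDA available: {torch.cuda.is_available()}"
    )


print(describe_torch_install())
```

If `CUDA available` is reported as `False`, the wheel index used in step 1 may not match the installed driver.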

Quick Start

To generate audio with our pre-trained model, simply run:

python demo.py --prompt 'your prompt' --num_steps 1

This automatically downloads the pre-trained checkpoints from Hugging Face and generates audio for your prompt. By default, it uses meanaudio-s-full. The output audio is saved to MeanAudio/output/, and the checkpoints to MeanAudio/weights/.

Alternatively, you can manually download the pre-trained models from this folder and put them in MeanAudio/weights/. You can then use scripts/meanflow/infer_meanflow.sh and scripts/flowmatching/infer_flowmatching.sh to generate audio with the pre-trained models.
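To generate audio for several prompts in a row, the demo command can be wrapped in a small script. The sketch below is not part of the repository: it only uses the `--prompt` and `--num_steps` flags shown above, and the helper names and example prompts are hypothetical. It assumes it is run from the MeanAudio directory.

```python
import subprocess
import sys
from typing import List, Sequence


def build_demo_command(prompt: str, num_steps: int = 1) -> List[str]:
    """Build the argument list for one demo.py call."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [sys.executable, "demo.py", "--prompt", prompt, "--num_steps", str(num_steps)]


def generate_all(prompts: Sequence[str], num_steps: int = 1, repo_dir: str = ".") -> None:
    """Run demo.py once per prompt; raises if any call fails."""
    for prompt in prompts:
        subprocess.run(build_demo_command(prompt, num_steps), cwd=repo_dir, check=True)


# Dry run: print the commands that generate_all would execute.
example_prompts = ["a dog barking in the distance", "rain falling on a tin roof"]
for p in example_prompts:
    print(" ".join(build_demo_command(p, num_steps=1)))
```

Calling `generate_all(example_prompts, num_steps=1)` would then run the commands. Passing the arguments as a list avoids shell-quoting problems with prompts that contain spaces or quotes.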

Variants

| Model Name | Size | Dataset | Objective | Pre-trained | Link |
| --- | --- | --- | --- | --- | --- |
| MeanAudio-S-AC | 120M | AudioCaps | Mean Flow | FluxAudio-S-Full | Here |
| FluxAudio-S-Full | 120M | All $^*$ | Flow Matching | - | Here |
| MeanAudio-S-Full | 120M | All $^*$ | Mean Flow | - | Here |
| MeanAudio-L-Full | 480M | All $^*$ | Mean Flow | - | Here |

$^*$: All denotes AudioCaps + WavCaps + AudioSet + VGGSound + LP-MusicCaps-MC + LP-MusicCaps-MTT, which together form approximately 3M audio-text pairs (about 10,000 hours of audio data).
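As a rough sanity check on these figures, the average clip length implied by the two approximate totals can be computed directly. This is back-of-the-envelope arithmetic on the rounded numbers above, not a measured dataset statistic; the function name is hypothetical.

```python
def average_clip_seconds(total_hours: float, num_pairs: int) -> float:
    """Average audio duration per audio-text pair, in seconds."""
    if num_pairs <= 0:
        raise ValueError("num_pairs must be positive")
    return total_hours * 3600.0 / num_pairs


# Rounded totals quoted above: about 10,000 hours and about 3M pairs.
print(f"{average_clip_seconds(10_000, 3_000_000):.1f} s per pair on average")
```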

Training

Before training, make sure that all files from here are placed in MeanAudio/weights.

1. Latent & Text Feature Extraction:

We first extract VAE latents and text encoder embeddings to enable fast and efficient training. scripts/extract_audio_latents.sh provides a detailed guide. The pipeline has two steps: a) partition the audio into 10 s clips; b) extract latents and embeddings into npz files.

To avoid the laborious data pre-processing step, we have uploaded an extracted version of AudioCaps. Feel free to download it from this link, unzip it and put it under MeanAudio/data/. Then you can directly jump to the second step. 😊

However, if you want to train the model on datasets other than AudioCaps, you still need to run scripts/extract_audio_latents.sh for feature extraction. Remember to set the correct metadata paths in config/data/t5_clap.yaml.
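For intuition about step a) of the extraction pipeline, the sketch below computes sample-index boundaries for fixed-length clips. It is an illustration only, not the logic of scripts/extract_audio_latents.sh: the function name, the 16 kHz sample rate, and the choice to drop a short trailing segment by default are all assumptions.

```python
from typing import List, Tuple


def clip_boundaries(
    num_samples: int,
    sample_rate: int,
    clip_seconds: float = 10.0,
    keep_remainder: bool = False,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive fixed-length clips."""
    clip_len = int(round(sample_rate * clip_seconds))
    if clip_len <= 0:
        raise ValueError("clip length must be at least one sample")
    # Full-length clips only.
    bounds = [(s, s + clip_len) for s in range(0, num_samples - clip_len + 1, clip_len)]
    # Optionally keep the shorter trailing segment.
    if keep_remainder and num_samples % clip_len:
        bounds.append((len(bounds) * clip_len, num_samples))
    return bounds


# Example: a 25-second recording at an assumed 16 kHz sample rate.
print(clip_boundaries(25 * 16_000, 16_000))
```

Each (start, end) pair can be used to slice a waveform array before it is passed to the latent and embedding extractors.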

2. Install Validation Packages:

We rely on av-benchmark for validation and evaluation. Please install it before training.

3. Train with MeanFlow objective:

Use the script below to train a MeanAudio model. By default, this will initialize the flow transformer from the pretrained ckpt fluxaudio_fm.pth and do MeanFlow fine-tuning.

bash scripts/meanflow/train_meanflow.sh

4. (Optional) Pre-training with Standard Flow Matching:

Use the script below to train a Flux-style transformer using the conditional flow matching objective:

bash scripts/flowmatching/train_flowmatching.sh

The obtained model can serve as a strong initialization for the mixed-flow fine-tuning.

Evaluation

Use the script below to run evaluation. Before doing so, install av-benchmark, which is used for metric calculation. You can specify num_steps and ckpt_path to evaluate different models with different numbers of sampling steps.

bash scripts/meanflow/eval_meanflow.sh 
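To illustrate what the number of sampling steps controls, here is a toy sketch of sampling with an average-velocity field, following the general MeanFlow formulation rather than this repository's sampler. It assumes the convention that t = 1 is noise and t = 0 is data, and uses the update z_r = z_t - (t - r) * u(z_t, r, t). All names are hypothetical, and the closed-form toy field (every sample flows to one fixed target) stands in for the trained network.

```python
from typing import Callable, List

Vector = List[float]
MeanVelocity = Callable[[Vector, float, float], Vector]


def sample_mean_flow(noise: Vector, mean_velocity: MeanVelocity, num_steps: int = 1) -> Vector:
    """Move from t=1 (noise) to t=0 (data) in num_steps equal intervals."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    z = list(noise)
    for i in range(num_steps):
        t = 1.0 - i / num_steps
        r = 1.0 - (i + 1) / num_steps
        u = mean_velocity(z, r, t)  # average velocity over [r, t]
        z = [zi - (t - r) * ui for zi, ui in zip(z, u)]
    return z


def toy_mean_velocity(target: Vector) -> MeanVelocity:
    """Exact average velocity when all data equals `target` on a linear path."""
    def u(z: Vector, r: float, t: float) -> Vector:
        return [(zi - ci) / t for zi, ci in zip(z, target)]
    return u


u_fn = toy_mean_velocity([0.5, -1.0])
print(sample_mean_flow([2.0, 3.0], u_fn, num_steps=1))
print(sample_mean_flow([2.0, 3.0], u_fn, num_steps=4))
```

In this toy setting the average velocity is exact, so one step and four steps reach the same target. With a learned model the field is only approximate, which is why results can differ across step counts.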

Citation

@article{li2025meanaudio,
  title={MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows},
  author={Li, Xiquan and Liu, Junxi and Liang, Yuzhe and Niu, Zhikang and Chen, Wenxi and Chen, Xie},
  journal={arXiv preprint arXiv:2508.06098},
  year={2025}
}

Acknowledgement

Many thanks to:
