📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
- 🚀 First Unified Continuous Speech Tokenizer: the first continuous audio tokenizer to effectively integrate semantic and acoustic features, suitable for both understanding and generation tasks.
- 🎧 High-Quality Reconstruction: Achieves high-quality audio generation by modeling continuous features with a VAE, minimizing information loss and preserving intricate acoustic textures.
- 🌐 Convolution-Free Efficiency: Built on a pure causal transformer architecture that eliminates convolutional layers entirely, for better efficiency and a simpler design (a toy sketch of this design follows below).
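To make the design in these bullets concrete, the toy module below pairs a convolution-free causal transformer with a VAE-style continuous bottleneck. It is an illustrative sketch only: all class names, dimensions, and the assumption that audio has already been framed into feature vectors are our own, not the actual MingTok-Audio implementation (the real API is the `AudioVAE` usage example further down).

```python
# Illustrative sketch only: a convolution-free causal transformer encoder, a VAE-style
# continuous bottleneck, and a causal transformer decoder. Names and sizes are placeholders.
import torch
import torch.nn as nn


def causal_mask(t: int, device) -> torch.Tensor:
    # True above the diagonal = masked out, so each frame attends only to itself and the past
    return torch.ones(t, t, device=device).triu(1).bool()


class CausalTransformer(nn.Module):
    def __init__(self, dim: int = 256, layers: int = 4, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(x, mask=causal_mask(x.size(1), x.device))


class ToyContinuousVAETokenizer(nn.Module):
    """Encode framed audio features into continuous latents and decode them back."""

    def __init__(self, feat_dim: int = 128, dim: int = 256, latent_dim: int = 64):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, dim)
        self.encoder = CausalTransformer(dim)
        self.to_mu = nn.Linear(dim, latent_dim)
        self.to_logvar = nn.Linear(dim, latent_dim)
        self.from_latent = nn.Linear(latent_dim, dim)
        self.decoder = CausalTransformer(dim)
        self.out_proj = nn.Linear(dim, feat_dim)

    def encode(self, feats: torch.Tensor):
        h = self.encoder(self.in_proj(feats))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # VAE reparameterization: sample one continuous latent vector per frame (no codebook lookup)
        latent = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return latent, mu, logvar

    def decode(self, latent: torch.Tensor) -> torch.Tensor:
        return self.out_proj(self.decoder(self.from_latent(latent)))


if __name__ == '__main__':
    model = ToyContinuousVAETokenizer()
    feats = torch.randn(1, 50, 128)           # (batch, frames, feature dim)
    latent, mu, logvar = model.encode(feats)   # continuous latents
    recon = model.decode(latent)
    print(latent.shape, recon.shape)           # torch.Size([1, 50, 64]) torch.Size([1, 50, 128])
```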
```bash
pip install -r requirements.txt
```
```python
import torch
import torchaudio

from audio_tokenizer.modeling_audio_vae import AudioVAE

# Load the pretrained tokenizer and move it to the GPU
model = AudioVAE.from_pretrained('inclusionAI/MingTok-Audio')
model = model.cuda()
model.eval()

# Load a waveform and build the model input (lengths are in samples)
waveform, sr = torchaudio.load('data/1089-134686-0000.flac', backend='soundfile')
sample = {'waveform': waveform.cuda(), 'waveform_length': torch.tensor([waveform.size(-1)]).cuda()}

# Encode to continuous latents, then decode back to a waveform
with torch.no_grad():
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        latent, frame_num = model.encode_latent(**sample)
        output_waveform = model.decode(latent)
torchaudio.save('./1089-134686-0000_reconstruct.wav', output_waveform.cpu()[0], sample_rate=16000)
```

| System | Frame Rate (Hz) | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
|---|---|---|---|---|---|---|---|
| MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
| GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
| Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
| XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
| BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
| MingTok-Audio(ours) | 50 | 4.21 | 0.96 | 0.98 | 4.04 | 0.96 | 0.98 |
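For reference, reconstruction metrics such as PESQ and STOI can be computed on your own reconstructions with torchmetrics. This is a generic recipe and not necessarily the exact evaluation pipeline behind the table above; speaker similarity (SIM) additionally needs a speaker-embedding model and is omitted here.

```python
# Hedged sketch: PESQ / STOI between an original and a reconstructed waveform
# using torchmetrics (pip install torchmetrics pesq pystoi). Not necessarily
# the evaluation setup used for the table above.
import torchaudio
from torchmetrics.audio import (
    PerceptualEvaluationSpeechQuality,
    ShortTimeObjectiveIntelligibility,
)

ref, sr = torchaudio.load('data/1089-134686-0000.flac', backend='soundfile')
rec, _ = torchaudio.load('./1089-134686-0000_reconstruct.wav')

# Align lengths in case the codec pads the output by a few samples
n = min(ref.size(-1), rec.size(-1))
ref, rec = ref[..., :n], rec[..., :n]

pesq = PerceptualEvaluationSpeechQuality(fs=16000, mode='wb')  # wide-band PESQ
stoi = ShortTimeObjectiveIntelligibility(fs=16000)

print('PESQ:', pesq(rec, ref).item())
print('STOI:', stoi(rec, ref).item())
```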
| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|---|---|---|---|---|---|---|---|---|
| Understanding ASR | Kimi-Audio | 2.56 | 1.28 | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B(ours) | 2.84 | 1.62 | 9.80 | 16.50 | 5.51 | 5.46 | 14.65 |
| Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|---|---|---|---|---|---|
| Generation | Seed-TTS | 1.12 | 0.80 | 2.25 | 0.76 |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | 1.39 | - |
| | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| | Ming-UniAudio-16B-A3B(ours) | 0.95 | 0.70 | 1.85 | 0.58 |
- We borrowed a lot of code from X-Codec-2.0 for tokenizer training.
- We thank the OpenAI team for developing the Whisper model and making its weights publicly available.
This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project's root directory.
If you find our work helpful, please consider citing it.

