📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
- 🚀 First Unified Continuous Speech Tokenizer: the first continuous audio tokenizer to effectively integrate semantic and acoustic features, suitable for both understanding and generation tasks.
- 🎧 High-Quality Reconstruction: Achieves high-quality audio generation by modeling continuous features with a VAE, minimizing information loss and preserving intricate acoustic textures.
- 🌐 Convolution-Free Efficiency: Built on a pure causal transformer architecture that eliminates convolutional layers entirely, for better efficiency and a simpler design (a toy sketch of this design follows below).
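To make the design in these bullets concrete, the toy module below pairs a convolution-free causal transformer with a VAE-style continuous bottleneck. It is an illustrative sketch only: all class names, dimensions, and the assumption that audio has already been framed into feature vectors are our own, not the actual MingTok-Audio implementation (the real API is the `AudioVAE` usage example further down).

```python
# Illustrative sketch only: a convolution-free causal transformer encoder, a VAE-style
# continuous bottleneck, and a causal transformer decoder. Names and sizes are placeholders.
import torch
import torch.nn as nn


def causal_mask(t: int, device) -> torch.Tensor:
    # True above the diagonal = masked out, so each frame attends only to itself and the past
    return torch.ones(t, t, device=device).triu(1).bool()


class CausalTransformer(nn.Module):
    def __init__(self, dim: int = 256, layers: int = 4, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(x, mask=causal_mask(x.size(1), x.device))


class ToyContinuousVAETokenizer(nn.Module):
    """Encode framed audio features into continuous latents and decode them back."""

    def __init__(self, feat_dim: int = 128, dim: int = 256, latent_dim: int = 64):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, dim)
        self.encoder = CausalTransformer(dim)
        self.to_mu = nn.Linear(dim, latent_dim)
        self.to_logvar = nn.Linear(dim, latent_dim)
        self.from_latent = nn.Linear(latent_dim, dim)
        self.decoder = CausalTransformer(dim)
        self.out_proj = nn.Linear(dim, feat_dim)

    def encode(self, feats: torch.Tensor):
        h = self.encoder(self.in_proj(feats))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # VAE reparameterization: sample one continuous latent vector per frame (no codebook lookup)
        latent = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return latent, mu, logvar

    def decode(self, latent: torch.Tensor) -> torch.Tensor:
        return self.out_proj(self.decoder(self.from_latent(latent)))


if __name__ == '__main__':
    model = ToyContinuousVAETokenizer()
    feats = torch.randn(1, 50, 128)           # (batch, frames, feature dim)
    latent, mu, logvar = model.encode(feats)   # continuous latents
    recon = model.decode(latent)
    print(latent.shape, recon.shape)           # torch.Size([1, 50, 64]) torch.Size([1, 50, 128])
```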
```bash
pip install -r requirements.txt
```
```python
import torch
import torchaudio

from audio_tokenizer.modeling_audio_vae import AudioVAE

# Load the pretrained tokenizer and move it to the GPU
model = AudioVAE.from_pretrained('inclusionAI/MingTok-Audio')
model = model.cuda()
model.eval()

# Load a waveform and build the model input (lengths are in samples)
waveform, sr = torchaudio.load('data/1089-134686-0000.flac', backend='soundfile')
sample = {'waveform': waveform.cuda(), 'waveform_length': torch.tensor([waveform.size(-1)]).cuda()}

# Encode to continuous latents, then decode back to a waveform
with torch.no_grad():
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        latent, frame_num = model.encode_latent(**sample)
        output_waveform = model.decode(latent)
torchaudio.save('./1089-134686-0000_reconstruct.wav', output_waveform.cpu()[0], sample_rate=16000)
```

| System | Frame Rate (Hz) | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
|---|---|---|---|---|---|---|---|
| MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
| GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
| Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
| XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
| BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
| MingTok-Audio(ours) | 50 | 4.21 | 0.96 | 0.98 | 4.04 | 0.96 | 0.98 |
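For reference, reconstruction metrics such as PESQ and STOI can be computed on your own reconstructions with torchmetrics. This is a generic recipe and not necessarily the exact evaluation pipeline behind the table above; speaker similarity (SIM) additionally needs a speaker-embedding model and is omitted here.

```python
# Hedged sketch: PESQ / STOI between an original and a reconstructed waveform
# using torchmetrics (pip install torchmetrics pesq pystoi). Not necessarily
# the evaluation setup used for the table above.
import torchaudio
from torchmetrics.audio import (
    PerceptualEvaluationSpeechQuality,
    ShortTimeObjectiveIntelligibility,
)

ref, sr = torchaudio.load('data/1089-134686-0000.flac', backend='soundfile')
rec, _ = torchaudio.load('./1089-134686-0000_reconstruct.wav')

# Align lengths in case the codec pads the output by a few samples
n = min(ref.size(-1), rec.size(-1))
ref, rec = ref[..., :n], rec[..., :n]

pesq = PerceptualEvaluationSpeechQuality(fs=16000, mode='wb')  # wide-band PESQ
stoi = ShortTimeObjectiveIntelligibility(fs=16000)

print('PESQ:', pesq(rec, ref).item())
print('STOI:', stoi(rec, ref).item())
```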
| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|---|---|---|---|---|---|---|---|---|
| Understanding ASR | Kimi-Audio | 2.56 | 1.28 | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B(ours) | 2.84 | 1.62 | 9.80 | 16.50 | 5.51 | 5.46 | 14.65 |
| Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|---|---|---|---|---|---|
| Generation | Seed-TTS | 1.12 | 0.80 | 2.25 | 0.76 |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | 1.39 | - |
| | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| | Ming-UniAudio-16B-A3B(ours) | 0.95 | 0.70 | 1.85 | 0.58 |
- We borrowed a lot of code from X-Codec-2.0 for tokenizer training.
- We thank the OpenAI team for developing the Whisper model and making its weights publicly available.
This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project's root directory.
If you find our work helpful, please consider citing it.

