📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. Its core is a unified continuous speech tokenizer that effectively integrates semantic and acoustic features within an end-to-end model. Building on this tokenizer, we developed a speech language model that strikes a balance between generation and understanding capabilities. Leveraging this foundation model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon Ming-Lite-Omni. Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.
- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio
- 🔥 First Speech LLM with a unified continuous tokenizer for both understanding and generation: Ming-UniAudio
- 🔥 First universal free-form speech editing model for diverse semantic and acoustic editing tasks without any temporal region specification: Ming-UniAudio-Edit
- 🔥 First benchmark for free-form speech editing: Ming-Freeform-Audio-Edit-Benchmark
- [2025.09.30] 🔥 We release Ming-UniAudio with significant improvements across speech understanding, generation, and free-form editing tasks.
Compared to other audio-assisted LLMs, Ming-UniAudio features the following key optimizations:
- Unified Continuous Speech Tokenizer: Ming-UniAudio proposes MingTok-Audio, a unified continuous speech tokenizer built on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks.
- Unified Speech Language Model for Generation and Understanding: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-quality speech synthesis (a minimal data-flow sketch follows this list).
- Instruction-Guided Free-Form Speech Editing: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with Ming-Freeform-Audio-Edit, the first open-source evaluation set for such tasks.
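To make the data flow concrete, below is a minimal sketch of the closed loop described above: the tokenizer's encoder turns a waveform into continuous latents, the LLM backbone operates on those latents, a diffusion-style head maps hidden states back to latents, and the tokenizer's decoder reconstructs a waveform. All class and method names here (`ToyContinuousTokenizer`, `ToySpeechLM`, `encode`, `decode`) are illustrative placeholders, not the actual Ming-UniAudio API; see the runnable example further down for the real entry points.

```python
# Illustrative sketch of the tokenizer <-> LLM closed loop; all names are
# hypothetical placeholders, not the real Ming-UniAudio modules.
import torch
import torch.nn as nn


class ToyContinuousTokenizer(nn.Module):
    """Stand-in for MingTok-Audio: waveform -> continuous latents -> waveform."""

    def __init__(self, frame_size=320, latent_dim=64):
        super().__init__()
        self.frame_size = frame_size
        self.encoder = nn.Linear(frame_size, latent_dim)  # simplified VAE encoder
        self.decoder = nn.Linear(latent_dim, frame_size)  # simplified VAE decoder

    def encode(self, wav):  # (B, T) -> (B, N, D)
        frames = wav.unfold(1, self.frame_size, self.frame_size)
        return self.encoder(frames)

    def decode(self, latents):  # (B, N, D) -> (B, T)
        return self.decoder(latents).flatten(-2)


class ToySpeechLM(nn.Module):
    """Stand-in for the LLM backbone plus the diffusion head."""

    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        self.proj_in = nn.Linear(latent_dim, hidden)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        # In the real model this is a diffusion head; a linear map stands in here.
        self.latent_head = nn.Linear(hidden, latent_dim)

    def forward(self, latents):
        h = self.backbone(self.proj_in(latents))
        return self.latent_head(h)  # predicted speech latents


if __name__ == "__main__":
    tokenizer, lm = ToyContinuousTokenizer(), ToySpeechLM()
    wav = torch.randn(1, 16000)            # 1 s of fake 16 kHz audio
    latents = tokenizer.encode(wav)        # understanding path: audio -> latents
    predicted = lm(latents)                # LLM works entirely in latent space
    wav_out = tokenizer.decode(predicted)  # generation path: latents -> audio
    print(latents.shape, wav_out.shape)
```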
In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale.
| System | Frame Rate (Hz) | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
|---|---|---|---|---|---|---|---|
| MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
| GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
| Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
| XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
| BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
| MingTok-Audio(ours) | 50 | 4.21 | 0.96 | 0.98 | 4.04 | 0.96 | 0.98 |
| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|---|---|---|---|---|---|---|---|---|
| Understanding ASR | Kimi-Audio | 2.56 | 1.28 | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | 9.80 | 16.50 | 5.51 | 5.46 | 14.65 |
| Task | Model | Speech-English WER | Speech-English NE-WER | Speech-English NE-FNR | Dialogue-English WER | Dialogue-English NE-WER | Dialogue-English NE-FNR | Speech-Mandarin WER | Speech-Mandarin NE-WER | Speech-Mandarin NE-FNR | Dialogue-Mandarin WER | Dialogue-Mandarin NE-WER | Dialogue-Mandarin NE-FNR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Understanding Context ASR | Qwen2-Audio | 11.49 | 27.27 | 35.08 | 13.99 | 33.02 | 32.92 | 9.92 | 24.10 | 30.02 | 7.00 | 22.76 | 26.17 |
| | Baichuan-Audio | 7.52 | 5.87 | 4.55 | 5.66 | 10.01 | 3.64 | 2.16 | 6.65 | 2.35 | 2.96 | 11.48 | 3.94 |
| | Kimi-Audio | 2.90 | 6.68 | 8.01 | 4.67 | 13.50 | 11.31 | 1.95 | 11.13 | 15.28 | 2.90 | 15.91 | 16.68 |
| | Baichuan-Omni-1.5 | 8.16 | 7.69 | 6.53 | 9.91 | 14.40 | 5.54 | 2.98 | 8.39 | 4.71 | 5.00 | 16.83 | 7.84 |
| | Qwen2.5-Omni-3B | 3.99 | 7.80 | 9.69 | 4.83 | 14.36 | 12.85 | 2.13 | 10.55 | 14.11 | 3.12 | 15.07 | 15.17 |
| | Qwen2.5-Omni-7B | 3.96 | 7.38 | 8.72 | 5.32 | 11.83 | 9.24 | 1.84 | 9.80 | 12.19 | 2.40 | 14.06 | 13.17 |
| | Ming-UniAudio-16B-A3B-Edit (ours) | 4.00 | 3.56 | 3.69 | 5.34 | 8.73 | 2.53 | 1.58 | 5.98 | 2.40 | 3.04 | 9.50 | 1.48 |
| Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|---|---|---|---|---|---|
| Generation | Seed-TTS | 1.12 | 0.80 | 2.25 | 0.76 |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | 1.39 | - |
| | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| | Ming-UniAudio-16B-A3B (ours) | 0.95 | 0.70 | 1.85 | 0.58 |
Semantic editing:

| Task | Model | WER(%) zh | WER(%) en | ACC zh | ACC en | SIM zh | SIM en | no-edit WER(%) zh | no-edit WER(%) en |
|---|---|---|---|---|---|---|---|---|---|
| Deletion-basic | Ming-UniAudio-16B-A3B-Edit | 11.89 | 14.85 | 100 | 82.22 | 0.78 | 0.76 | 11.49 | 24.26 |
| Deletion | Ming-UniAudio-16B-A3B-Edit | 22.92 | 27.60 | 82.92 | 85 | 0.81 | 0.74 | 17.50 | 35.21 |
| Insertion-basic | Ming-UniAudio-16B-A3B-Edit | 3.42 | 6.63 | 80 | 71.43 | 0.83 | 0.79 | 3.52 | 17.70 |
| Insertion | Ming-UniAudio-16B-A3B-Edit | 3.89 | 7.592 | 79.31 | 62.31 | 0.83 | 0.79 | 4.10 | 18.84 |
| Substitution-basic | Ming-UniAudio-16B-A3B-Edit | 4.52 | 8.99 | 78.62 | 59.78 | 0.82 | 0.78 | 4.63 | 19.28 |
| Substitution | Ming-UniAudio-16B-A3B-Edit | 4.56 | 7.64 | 76.62 | 65.62 | 0.83 | 0.77 | 4.75 | 18.39 |
| Dialect Conversion | Ming-UniAudio-16B-A3B-Edit | 8.93 | - | 0.50 | - | 0.66 | - | - | - |

Acoustic editing:

| Task | Model | WER(%) zh | WER(%) en | SIM zh | SIM en | RDE(%) zh | RDE(%) en | RAE(%) zh | RAE(%) en |
|---|---|---|---|---|---|---|---|---|---|
| Speed changing | Ming-UniAudio-16B-A3B-Edit | 5.88 | 17.53 | 0.66 | 0.57 | 6.36 | 5.92 | - | - |
| Pitch changing | Ming-UniAudio-16B-A3B-Edit | 7.45 | 13.37 | 0.36 | 0.24 | - | - | - | - |
| Volume changing | Ming-UniAudio-16B-A3B-Edit | 1.71 | 1.35 | 0.86 | 0.80 | - | - | 14.9 | 11.7 |
| Task | Model | Model Type | DNSMOS OVRL | DNSMOS SIG | DNSMOS BAK |
|---|---|---|---|---|---|
| Denoise | FullSubNet | specialized | 2.93 | 3.05 | 3.51 |
| | Inter-Subnet | specialized | 2.98 | 3.17 | 3.15 |
| | CDiffuSE | specialized | 2.84 | 3.37 | 3.52 |
| | SGMSE | specialized | 3.11 | 3.47 | 3.41 |
| | StoRM | specialized | 3.15 | 3.54 | 3.69 |
| | GenSE | specialized | 3.43 | 3.65 | 4.18 |
| | MiMo-Audio | general | 3.30 | 3.56 | 4.10 |
| | Ming-UniAudio-16B-A3B-Edit (ours) | general | 3.26 | 3.59 | 3.97 |
You can download our latest models and benchmark from both Hugging Face and ModelScope.
| Type | Model | Input modality | Output modality | Download |
|---|---|---|---|---|
| Tokenizer | MingTok-Audio | audio | audio | 🤗 HuggingFace 🤖 ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | 🤗 HuggingFace 🤖 ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | 🤗 HuggingFace 🤖 ModelScope |
| Benchmark | Ming-Freeform-Audio-Edit | - | - | 🤗 HuggingFace 🤖 ModelScope Eval tools |
```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master
```
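If you prefer to pull the weights from Hugging Face instead, a minimal Python alternative is sketched below (it assumes the repository id listed in the download table is available on the Hub):

```python
# Hypothetical alternative: download the checkpoint from Hugging Face.
# The repo id is taken from the download table above; adjust local_dir as needed.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B",
)
```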
Note: This download process will take several minutes to several hours, depending on your network conditions.
Additional demonstration cases are available on our project page.
```shell
pip install -r requirements.txt
```
You can set up the environment using Docker in two ways.
- Option 1: Pull from Docker Hub (Recommended)
```shell
# 1. Pull the pre-built image
docker pull yongjielv/ming_uniaudio:v1.0

# 2. Run the container
docker run -it --gpus all yongjielv/ming_uniaudio:v1.0 /bin/bash
```
- Option 2: Build from Source
```shell
# 1. Build the image
docker build -t ming-uniaudio:v1.0 -f ./docker/ming_uniaudio.dockerfile .

# 2. Run the container
docker run -it --gpus all ming-uniaudio:v1.0 /bin/bash
```
We provide a step-by-step running example:
Step 1 - Download the source code
```shell
git clone https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```
Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory
Download our model following Model & Benchmark Downloads
```shell
mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B inclusionAI/Ming-UniAudio-16B-A3B
```
Step 3 - Enter the code directory; you can refer to the following code to run the Ming-UniAudio model.
```shell
jupyter notebook cookbooks/demo.ipynb
```
We also provide a simple example of how to use this repo below. For detailed usage, please refer to demobook.ipynb.
```python
import warnings
import torch
from transformers import AutoProcessor
import os
import sys

current_dir = os.path.dirname(os.path.abspath(__file__))
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)

from modeling_bailingmm import BailingMMNativeForConditionalGeneration
import random
import numpy as np
from loguru import logger
from sentence_manager.sentence_manager import SentenceNormalizer
import re
import yaml


def seed_everything(seed=1895):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything()
warnings.filterwarnings("ignore")


# Thin wrapper around the Ming-UniAudio checkpoint for understanding, generation, and editing.
class MingAudio:
    def __init__(self, model_path, device="cuda:0"):
        self.device = device
        self.model = BailingMMNativeForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
        ).eval().to(torch.bfloat16).to(self.device)
        self.processor = AutoProcessor.from_pretrained(".", trust_remote_code=True)
        self.tokenizer = self.processor.tokenizer
        self.sample_rate = self.processor.audio_processor.sample_rate
        self.patch_size = self.processor.audio_processor.patch_size
        self.normalizer = self.init_tn_normalizer(tokenizer=self.tokenizer)

    def init_tn_normalizer(self, config_file_path=None, tokenizer=None):
        if config_file_path is None:
            default_config_path = os.path.join(
                os.path.dirname(os.path.dirname(os.path.realpath(__file__))),
                "sentence_manager/default_config.yaml"
            )
            config_file_path = default_config_path
        with open(config_file_path, 'r') as f:
            self.sentence_manager_config = yaml.safe_load(f)

        if "split_token" not in self.sentence_manager_config:
            self.sentence_manager_config["split_token"] = []
        assert isinstance(self.sentence_manager_config["split_token"], list)
        if tokenizer is not None:
            self.sentence_manager_config["split_token"].append(re.escape(tokenizer.eos_token))

        normalizer = SentenceNormalizer(self.sentence_manager_config.get("text_norm", {}))
        return normalizer

    def speech_understanding(self, messages, lang=None):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)

        # Optionally append a language hint after the prompt.
        if lang is not None:
            language = torch.tensor([self.tokenizer.encode(f'{lang}\t')]).to(inputs['input_ids'].device)
            inputs['input_ids'] = torch.cat([inputs['input_ids'], language], dim=1)
            attention_mask = inputs['attention_mask']
            inputs['attention_mask'] = torch.ones(inputs['input_ids'].shape, dtype=attention_mask.dtype)

        # Multimodal features must match the bfloat16 model weights.
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=512,
            eos_token_id=self.processor.gen_terminator,
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]

        return output_text

    def speech_generation(
        self,
        text,
        prompt_wav_path,
        prompt_text,
        lang='zh',
        output_wav_path='out.wav'
    ):
        text = self.normalizer.normalize(text)
        waveform = self.model.generate_tts(
            text=text,
            prompt_wav_path=prompt_wav_path,
            prompt_text=prompt_text,
            patch_size=self.patch_size,
            tokenizer=self.tokenizer,
            lang=lang,
            output_wav_path=output_wav_path,
            sample_rate=self.sample_rate,
            device=self.device
        )

        return waveform

    def speech_edit(
        self,
        messages,
        output_wav_path='out.wav',
        use_cot=True
    ):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)

        # Append the '<answer>' tag so decoding starts inside the answer span (chain-of-thought prompting).
        if use_cot:
            ans = torch.tensor([self.tokenizer.encode('<answer>')]).to(inputs['input_ids'].device)
            inputs['input_ids'] = torch.cat([inputs['input_ids'], ans], dim=1)
            attention_mask = inputs['attention_mask']
            inputs['attention_mask'] = torch.ones(inputs['input_ids'].shape, dtype=attention_mask.dtype)

        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        edited_speech, edited_text = self.model.generate_edit(
            **inputs,
            tokenizer=self.tokenizer,
            output_wav_path=output_wav_path
        )

        return edited_speech, edited_text


if __name__ == "__main__":
    model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B")

    # ASR
    messages = [
        {
            "role": "HUMAN",
            "content": [
                {
                    "type": "text",
                    "text": "Please recognize the language of this speech and transcribe it. Format: oral.",
                },
                {"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"},
            ],
        },
    ]
    response = model.speech_understanding(messages=messages)
    logger.info(f"Generated Response: {response}")

    # TTS
    model.speech_generation(
        text='我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。',
        prompt_wav_path='data/wavs/10002287-00000094.wav',
        prompt_text='在此奉劝大家别乱打美白针。',
        output_wav_path='data/output/tts.wav'
    )

    # Edit
    # model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B-Edit")
    messages = [
        {
            "role": "HUMAN",
            "content": [
                {"type": "audio", "audio": "data/wavs/00004768-00000024.wav", "target_sample_rate": 16000},
                {
                    "type": "text",
                    "text": "<prompt>Please recognize the language of this speech and transcribe it. And insert '实现' before the character or word at index 3.\n</prompt>",
                },
            ],
        },
    ]
    response = model.speech_edit(messages=messages, output_wav_path='data/output/ins.wav')
    logger.info(f"Generated Response: {response}")
```
Note: We tested the examples on NVIDIA H800-80GB/H20-96G hardware with CUDA 12.4.
If you find our work helpful, please consider citing it.


