Ming-UniAudio

📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. Its core is a unified continuous speech tokenizer that effectively integrates semantic and acoustic features within an end-to-end model. Building on this unified continuous audio tokenizer, we developed a speech language model that strikes a balance between generation and understanding capabilities. Leveraging this foundational model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon Ming-Lite-Omni. Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.

  • 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio
  • 🔥 First speech LLM with a unified continuous tokenizer for both understanding and generation: Ming-UniAudio
  • 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks, without any temporal region specification: Ming-UniAudio-Edit
  • 🔥 First benchmark for free-form speech editing: Ming-Freeform-Audio-Edit-Benchmark

📌 Updates

  • [2025.09.30] 🔥 We release Ming-UniAudio with significant improvements across speech understanding, generation, and free-form editing tasks.

Key Features

Compared to other audio-assisted LLMs, Ming-UniAudio offers the following key features:

  • Unified Continuous Speech Tokenizer: Ming-UniAudio proposes MingTok-Audio, a unified continuous speech tokenizer built on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features; its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks (see the toy sketch after this list).

  • Unified Speech Language Model for Generation and Understanding: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-quality speech synthesis.
  • Instruction-Guided Free-Form Speech Editing: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with Ming-Freeform-Audio-Edit, the first open-source evaluation set for such tasks.
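
To make the "continuous tokenizer" idea concrete, here is a toy PyTorch sketch of a VAE-style encode/decode path: audio frames are mapped to continuous latents (rather than discrete codebook indices) and decoded back. Every name below is invented for illustration; this is not the MingTok-Audio implementation, which ships with the model release.

```python
# Toy illustration of a continuous (VAE-style) speech tokenizer.
# All module names are hypothetical; the real MingTok-Audio uses a causal
# Transformer with hierarchical features, not these tiny convolutions.
import torch
import torch.nn as nn

class ToyContinuousTokenizer(nn.Module):
    def __init__(self, n_mels=80, latent_dim=32):
        super().__init__()
        # Encoder: mel frames -> mean / log-variance of a continuous latent
        self.encoder = nn.Conv1d(n_mels, 2 * latent_dim, kernel_size=3, padding=1)
        # Decoder: continuous latent -> reconstructed mel frames
        self.decoder = nn.Conv1d(latent_dim, n_mels, kernel_size=3, padding=1)

    def encode(self, mel):  # mel: (batch, n_mels, frames)
        mu, logvar = self.encoder(mel).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return z            # continuous "tokens": (batch, latent_dim, frames)

    def decode(self, z):
        return self.decoder(z)

tok = ToyContinuousTokenizer()
mel = torch.randn(1, 80, 200)      # dummy mel-spectrogram input
latents = tok.encode(mel)          # continuous features an LLM could consume
recon = tok.decode(latents)        # reconstruction path used for generation
print(latents.shape, recon.shape)
```

The point of the continuous design is that the same latent sequence can be consumed by the LLM for understanding and predicted by the LLM (together with the diffusion head) for generation, closing the loop without a discrete codebook.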

Evaluation

In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale.

Speech Tokenizer

Comparison of reconstruction performance across different acoustic tokenizers. The best results are in bold.

| System | Frame Rate | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
| GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
| Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
| XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
| BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
| MingTok-Audio (ours) | 50 | **4.21** | **0.96** | **0.98** | **4.04** | **0.96** | **0.98** |
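
PESQ and STOI in the table above are standard reconstruction metrics. If you want to run the same kind of check on your own reference/reconstruction pairs, a minimal sketch using the third-party `pesq` and `pystoi` packages is shown below (SIM additionally requires a speaker-embedding model and is omitted; this is not the evaluation script used for the table).

```python
# Reconstruction-quality metrics for a (reference, reconstructed) wav pair.
# `pesq`, `pystoi`, and `soundfile` are third-party packages
# (pip install pesq pystoi soundfile); the file paths are placeholders.
import soundfile as sf
from pesq import pesq          # ITU-T P.862 perceptual quality (fs must be 8 or 16 kHz)
from pystoi import stoi        # short-time objective intelligibility

ref, sr = sf.read("ref.wav")       # 16 kHz mono reference utterance
deg, _ = sf.read("recon.wav")      # its reconstruction from the tokenizer
deg = deg[: len(ref)]              # align lengths before scoring

print("PESQ (wideband):", pesq(sr, ref, deg, "wb"))
print("STOI:", stoi(ref, deg, sr, extended=False))
```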

Speech Understanding

ASR performance comparison on various audio benchmark datasets. The best results are in bold.

| Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Kimi-Audio | **2.56** | **1.28** | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| Qwen2.5-Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| Qwen2-Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | **9.80** | **16.50** | **5.51** | **5.46** | **14.65** |
Context ASR performance comparison on various audio benchmark datasets. Each cell reports WER / NE-WER / NE-FNR.

| Model | Speech-English | Dialogue-English | Speech-Mandarin | Dialogue-Mandarin |
| --- | --- | --- | --- | --- |
| Qwen2-Audio | 11.49 / 27.27 / 35.08 | 13.99 / 33.02 / 32.92 | 9.92 / 24.10 / 30.02 | 7.00 / 22.76 / 26.17 |
| Baichuan-Audio | 7.52 / 5.87 / 4.55 | 5.66 / 10.01 / 3.64 | 2.16 / 6.65 / 2.35 | 2.96 / 11.48 / 3.94 |
| Kimi-Audio | 2.90 / 6.68 / 8.01 | 4.67 / 13.50 / 11.31 | 1.95 / 11.13 / 15.28 | 2.90 / 15.91 / 16.68 |
| Baichuan-Omni-1.5 | 8.16 / 7.69 / 6.53 | 9.91 / 14.40 / 5.54 | 2.98 / 8.39 / 4.71 | 5.00 / 16.83 / 7.84 |
| Qwen2.5-Omni-3B | 3.99 / 7.80 / 9.69 | 4.83 / 14.36 / 12.85 | 2.13 / 10.55 / 14.11 | 3.12 / 15.07 / 15.17 |
| Qwen2.5-Omni-7B | 3.96 / 7.38 / 8.72 | 5.32 / 11.83 / 9.24 | 1.84 / 9.80 / 12.19 | 2.40 / 14.06 / 13.17 |
| Ming-UniAudio-16B-A3B-Edit (ours) | 4.00 / 3.56 / 3.69 | 5.34 / 8.73 / 2.53 | 1.58 / 5.98 / 2.40 | 3.04 / 9.50 / 1.48 |
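
The WER numbers above compare model transcripts against reference text. If you want to score your own transcriptions, a minimal sketch with the third-party `jiwer` package follows; for Mandarin, character error rate (`jiwer.cer`) is the usual choice. Note that text normalization (casing, punctuation) affects the exact numbers, and this is not the repo's official scoring script.

```python
# Word/character error rate between reference and hypothesis transcripts.
# `jiwer` is a third-party package (pip install jiwer).
import jiwer

ref_en = "please recognize the language of this speech"
hyp_en = "please recognise the language of the speech"
print("English WER:", jiwer.wer(ref_en, hyp_en))

ref_zh = "我们的愿景是构建未来服务业的数字化基础设施"
hyp_zh = "我们的愿景是构建未来服务业的数字基础设施"
print("Mandarin CER:", jiwer.cer(ref_zh, hyp_zh))
```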

Speech Generation

Performance comparison on various audio benchmark datasets. The best results are in bold.

| Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
| --- | --- | --- | --- | --- |
| Seed-TTS | 1.12 | **0.80** | 2.25 | **0.76** |
| MiMo-Audio | 1.96 | - | 5.37 | - |
| Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | **1.39** | - |
| Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| Ming-UniAudio-16B-A3B (ours) | **0.95** | 0.70 | 1.85 | 0.58 |

Speech Editing

Performance of Ming-UniAudio-16B-A3B-Edit on various speech editing benchmarks. Cells report zh / en where both languages are evaluated.

| Task | WER(%) zh / en | ACC zh / en | SIM zh / en | no-edit WER(%) zh / en |
| --- | --- | --- | --- | --- |
| Deletion-basic | 11.89 / 14.85 | 100 / 82.22 | 0.78 / 0.76 | 11.49 / 24.26 |
| Deletion | 22.92 / 27.60 | 82.92 / 85 | 0.81 / 0.74 | 17.50 / 35.21 |
| Insertion-basic | 3.42 / 6.63 | 80 / 71.43 | 0.83 / 0.79 | 3.52 / 17.70 |
| Insertion | 3.89 / 7.592 | 79.31 / 62.31 | 0.83 / 0.79 | 4.10 / 18.84 |
| Substitution-basic | 4.52 / 8.99 | 78.62 / 59.78 | 0.82 / 0.78 | 4.63 / 19.28 |
| Substitution | 4.56 / 7.64 | 76.62 / 65.62 | 0.83 / 0.77 | 4.75 / 18.39 |
| Dialect Conversion | 8.93 | 0.50 | 0.66 | - |

| Task | WER(%) zh / en | SIM zh / en | RDE/RAE(%) zh / en |
| --- | --- | --- | --- |
| Speed changing | 5.88 / 17.53 | 0.66 / 0.57 | RDE: 6.36 / 5.92 |
| Pitch changing | 7.45 / 13.37 | 0.36 / 0.24 | - |
| Volume changing | 1.71 / 1.35 | 0.86 / 0.80 | RAE: 14.9 / 11.7 |

Denoise

Performance comparison on various audio benchmark datasets. The best results are in bold.

| Model | Model Type | DNSMOS OVRL | DNSMOS SIG | DNSMOS BAK |
| --- | --- | --- | --- | --- |
| FullSubNet | specialized | 2.93 | 3.05 | 3.51 |
| Inter-Subnet | specialized | 2.98 | 3.17 | 3.15 |
| CDiffuSE | specialized | 2.84 | 3.37 | 3.52 |
| SGMSE | specialized | 3.11 | 3.47 | 3.41 |
| StoRM | specialized | 3.15 | 3.54 | 3.69 |
| GenSE | specialized | **3.43** | **3.65** | **4.18** |
| MiMo-Audio | general | 3.30 | 3.56 | 4.10 |
| Ming-UniAudio-16B-A3B-Edit (ours) | general | 3.26 | 3.59 | 3.97 |

Model & Benchmark Downloads

You can download our latest models and benchmark from both Hugging Face and ModelScope.

| Type | Model | Input modality | Output modality | Download |
| --- | --- | --- | --- | --- |
| Tokenizer | MingTok-Audio | audio | audio | 🤗 HuggingFace · 🤖 ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | 🤗 HuggingFace · 🤖 ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | 🤗 HuggingFace · 🤖 ModelScope |
| Benchmark | Ming-Freeform-Audio-Edit | - | - | 🤗 HuggingFace · 🤖 ModelScope · Eval tools |
If you are in mainland China, we strongly recommend downloading our models from 🤖 ModelScope:
pip install modelscope
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B  --revision master

Note: This download process will take several minutes to several hours, depending on your network conditions.
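
If you prefer Hugging Face, the equivalent download can be done with the `huggingface_hub` Python API; the sketch below assumes the repo id mirrors the ModelScope name used above.

```python
# Download the model snapshot from Hugging Face (pip install huggingface_hub).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B",
)
```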

Use Cases

Additional demonstration cases are available on our project page.

Environment Preparation

Installation with pip

pip install -r requirements.txt

Installation with docker

You can set up the environment using Docker in two ways.

  • Option 1: Pull from Docker Hub (Recommended)
# 1. Pull the pre-built image
docker pull yongjielv/ming_uniaudio:v1.0

# 2. Run the container
docker run -it --gpus all yongjielv/ming_uniaudio:v1.0 /bin/bash
  • Option 2: Build from Source
# 1. Build the image
docker build -t ming-uniaudio:v1.0 -f ./docker/ming_uniaudio.dockerfile .

# 2. Run the container
docker run -it --gpus all ming-uniaudio:v1.0 /bin/bash

Example Usage

We provide a step-by-step running example:

Step 1 - Download the source code

git clone https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio

Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory

Download our model following the instructions in Model & Benchmark Downloads.

mkdir inclusionAI 
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B inclusionAI/Ming-UniAudio-16B-A3B

Step 3 - Enter the code directory and refer to the following code to run the Ming-UniAudio model.

jupyter notebook cookbooks/demo.ipynb

We also provide a simple example of how to use this repo below. For detailed usage, please refer to demobook.ipynb.

import warnings
import torch
from transformers import AutoProcessor
import os
import sys
# Make the repository root importable so the local modules below resolve
current_dir = os.path.dirname(os.path.abspath(__file__))
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)

from modeling_bailingmm import BailingMMNativeForConditionalGeneration
import random
import numpy as np
from loguru import logger
from sentence_manager.sentence_manager import SentenceNormalizer
import re
import yaml

def seed_everything(seed=1895):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything()
warnings.filterwarnings("ignore")

class MingAudio:
    def __init__(self, model_path, device="cuda:0"):
        self.device = device
        # Load the pretrained weights in bfloat16 and move them to the target device
        self.model = BailingMMNativeForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
        ).eval().to(torch.bfloat16).to(self.device)
        self.processor = AutoProcessor.from_pretrained(".", trust_remote_code=True)
        self.tokenizer = self.processor.tokenizer
        self.sample_rate = self.processor.audio_processor.sample_rate
        self.patch_size = self.processor.audio_processor.patch_size
        self.normalizer = self.init_tn_normalizer(tokenizer=self.tokenizer)

    def init_tn_normalizer(self, config_file_path=None, tokenizer=None):

        if config_file_path is None:
            default_config_path = os.path.join(
                os.path.dirname(os.path.dirname(os.path.realpath(__file__))), 
                "sentence_manager/default_config.yaml"
            )
            config_file_path = default_config_path
        with open(config_file_path, 'r') as f:
            self.sentence_manager_config = yaml.safe_load(f)
        if "split_token" not in self.sentence_manager_config:
            self.sentence_manager_config["split_token"] = []
        assert isinstance(self.sentence_manager_config["split_token"], list)
        if tokenizer is not None:
            self.sentence_manager_config["split_token"].append(re.escape(tokenizer.eos_token))
        normalizer = SentenceNormalizer(self.sentence_manager_config.get("text_norm", {}))
        
        return normalizer

    def speech_understanding(self, messages, lang=None):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)

        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)
        
        # Optionally append a language tag (e.g. "zh") to steer transcription
        if lang is not None:
            language = torch.tensor([self.tokenizer.encode(f'{lang}\t')]).to(inputs['input_ids'].device)
            inputs['input_ids'] = torch.cat([inputs['input_ids'], language], dim=1)
            attention_mask = inputs['attention_mask']
            inputs['attention_mask'] = torch.ones(inputs['input_ids'].shape, dtype=attention_mask.dtype)
        # Cast image/video/audio features to bfloat16 to match the model dtype
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=512,
            eos_token_id=self.processor.gen_terminator,
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]

        return output_text

    def speech_generation(
        self, 
        text,
        prompt_wav_path,
        prompt_text,
        lang='zh',
        output_wav_path='out.wav'
    ):
        text = self.normalizer.normalize(text)
        waveform = self.model.generate_tts(
            text=text,
            prompt_wav_path=prompt_wav_path,
            prompt_text=prompt_text,
            patch_size=self.patch_size,
            tokenizer=self.tokenizer,
            lang=lang,
            output_wav_path=output_wav_path,
            sample_rate=self.sample_rate,
            device=self.device
        )
        
        return waveform

    def speech_edit(
        self, 
        messages,
        output_wav_path='out.wav',
        use_cot=True
    ):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)

        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)

        # When chain-of-thought is enabled, append the '<answer>' tag to the prompt
        if use_cot:
            ans = torch.tensor([self.tokenizer.encode('<answer>')]).to(inputs['input_ids'].device)
            inputs['input_ids'] = torch.cat([inputs['input_ids'], ans], dim=1)
            attention_mask = inputs['attention_mask']
            inputs['attention_mask'] = torch.ones(inputs['input_ids'].shape, dtype=attention_mask.dtype)
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        edited_speech, edited_text = self.model.generate_edit(
            **inputs,
            tokenizer=self.tokenizer,
            output_wav_path=output_wav_path
        )
        return edited_speech, edited_text

if __name__ == "__main__":
    model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B")
    # ASR
    messages = [
        {
            "role": "HUMAN",
            "content": [
                {
                    "type": "text",
                    "text": "Please recognize the language of this speech and transcribe it. Format: oral.",
                },
                
                {"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"},
            ],
        },
    ]
    
    response = model.speech_understanding(messages=messages)
    logger.info(f"Generated Response: {response}")

    # TTS
    model.speech_generation(
        text='我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。',
        prompt_wav_path='data/wavs/10002287-00000094.wav',
        prompt_text='在此奉劝大家别乱打美白针。',
        output_wav_path='data/output/tts.wav'
    )

    # Edit
    # model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B-Edit")
    messages = [
        {
            "role": "HUMAN",
            "content": [
                {"type": "audio", "audio": "data/wavs/00004768-00000024.wav", "target_sample_rate": 16000},
                {
                    "type": "text",
                    "text": "<prompt>Please recognize the language of this speech and transcribe it. And insert '实现' before the character or word at index 3.\n</prompt>",
                },
            ],
        },
    ]
    
    response = model.speech_edit(messages=messages, output_wav_path='data/output/ins.wav')
    logger.info(f"Generated Response: {response}")

Note: We tested the examples on NVIDIA H800-80GB / H20-96G hardware with CUDA 12.4.
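
After running the TTS and editing calls above, you can sanity-check the generated files; a small sketch using the third-party `soundfile` package (the output paths come from the example above):

```python
# Quick sanity check of the generated audio (pip install soundfile).
import soundfile as sf

for path in ["data/output/tts.wav", "data/output/ins.wav"]:
    audio, sr = sf.read(path)
    print(f"{path}: {len(audio) / sr:.2f} s at {sr} Hz")
```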

Citation

If you find our work helpful, please consider citing us.
