Camb-ai/mars6-turbo

MARS6: A Small & Robust Hierarchical-Codec Text-to-Speech Model

MARS6 is a compact autoregressive TTS system based on a hierarchical neural audio codec. It uses a two-stage decoder to predict coarse-to-fine tokens at only 12 Hz, leading to fast inference while preserving high audio quality and speaker similarity. MARS6 achieves robust zero-shot voice cloning and expressive synthesis even on challenging in-the-wild references. This repository provides a lightweight implementation for inference using our public turbo checkpoints.


Model Architecture

Below is a high-level diagram of MARS6 from our paper. The encoder processes the text and speaker embeddings (from an external speaker encoder), producing a sequence of latent features. The hierarchical decoder operates at a low 12 Hz "global" level while autoregressively expanding each frame into multiple discrete codec tokens with a small "local" decoder.

MARS6 Model Architecture
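
To make the two-level decoding concrete, here is a minimal Python sketch of the control flow only. It is not the MARS6 implementation: `toy_local_step` is a placeholder standing in for the local decoder, and the value of 7 tokens per global frame is an assumption chosen for illustration, not a figure taken from this repository. Only the 12 Hz global rate comes from the description above.

```python
import math
from typing import Callable, List

GLOBAL_RATE_HZ = 12  # global decoder frame rate, as stated in the text above


def num_global_frames(duration_s: float, rate_hz: int = GLOBAL_RATE_HZ) -> int:
    """Number of global-decoder steps needed to cover `duration_s` seconds."""
    return math.ceil(duration_s * rate_hz)


def hierarchical_decode(
    n_frames: int,
    tokens_per_frame: int,
    local_step: Callable[[int, List[int]], int],
) -> List[List[int]]:
    """Outer loop: one step per global frame.
    Inner loop: autoregressively expand that frame into local codec tokens."""
    frames: List[List[int]] = []
    for frame_idx in range(n_frames):
        frame_tokens: List[int] = []
        for _ in range(tokens_per_frame):
            # Each local token is predicted from the tokens already emitted
            # for this frame (the real model would also condition on the
            # global decoder state for the frame).
            frame_tokens.append(local_step(frame_idx, frame_tokens))
        frames.append(frame_tokens)
    return frames


def toy_local_step(frame_idx: int, previous: List[int]) -> int:
    """Placeholder, NOT a model: a deterministic function of the position."""
    return (frame_idx * 31 + len(previous) * 7) % 1024


if __name__ == "__main__":
    n = num_global_frames(2.5)  # 2.5 s of audio at 12 Hz
    tokens = hierarchical_decode(n, tokens_per_frame=7, local_step=toy_local_step)
    print(n, "global frames,", sum(len(f) for f in tokens), "codec tokens")
```

The point of the sketch is the cost structure: the expensive global step runs only 12 times per second of audio, while the cheap local step runs once per codec token.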


Quick Start Guide

This section outlines how to install and run MARS6 for inference. You can either clone this repository and install dependencies or load MARS6 directly via Torch Hub.

Installation

  1. Clone this repo:

    git clone https://github.com/Camb-ai/mars6-turbo.git
    cd mars6-turbo

  2. Install required dependencies:

    pip install snac msclap ipykernel iprogress

    (Make sure you have a recent version of Python, e.g. 3.9+. We recommend installing into a conda environment or a Python venv.)
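
After installing, a quick sanity check of the environment can save a failed run later. The snippet below is a small sketch, not part of this repository: it checks the Python version against the 3.9 minimum mentioned above and reports which modules cannot be found. It assumes the import names `snac` and `msclap` match the pip package names; adjust the list if your environment differs.

```python
import importlib.util
import sys
from typing import Iterable, List, Tuple

MIN_PYTHON: Tuple[int, int] = (3, 9)  # minimum version from the note above


def python_is_recent(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """True if `version` (major, minor) is at least MIN_PYTHON."""
    return tuple(version) >= MIN_PYTHON


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: these import names correspond to the pip packages above.
    required = ["snac", "msclap"]
    print("Python OK:", python_is_recent())
    print("Missing modules:", missing_modules(required))
```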

Model Inference

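Run inference.py from the command line:

    python inference.py --audio "referencepath.wav" --save_path "outputpath.wav" --text "Text we wish to output. All right here!" --transcript "Transcript of the reference. For if you wish to deep clone"

Here --audio is the reference recording, --save_path is where the output is written, --text is the text to speak, and --transcript is the transcript of the reference, supplied when you want deep cloning.

Alternatively, open MARS6_turbo_inference_demo.ipynb to run inference in a Jupyter notebook.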
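
If you want to call the script from your own Python code, for example to synthesise several lines in a loop, one option is to wrap the command above with `subprocess`. This is a sketch under assumptions: the helper names `build_inference_command` and `synthesize` are hypothetical, only the four flags shown above are used, and treating `--transcript` as optional is a guess based on its description rather than something confirmed here.

```python
import subprocess
import sys
from typing import List, Optional


def build_inference_command(
    audio: str,
    save_path: str,
    text: str,
    transcript: Optional[str] = None,
    script: str = "inference.py",
) -> List[str]:
    """Build the argument list for the inference.py command shown above."""
    cmd = [
        sys.executable, script,
        "--audio", audio,
        "--save_path", save_path,
        "--text", text,
    ]
    # Assumption: --transcript may be omitted when deep cloning is not wanted.
    if transcript is not None:
        cmd += ["--transcript", transcript]
    return cmd


def synthesize(audio: str, save_path: str, text: str,
               transcript: Optional[str] = None) -> None:
    """Run inference.py in the current directory; raises if it exits non-zero."""
    subprocess.run(
        build_inference_command(audio, save_path, text, transcript),
        check=True,
    )


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try anywhere.
    print(build_inference_command("referencepath.wav", "outputpath.wav", "Hello there."))
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or punctuation.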


Acknowledgements

  • We use minBPE by Karpathy for byte-pair tokenization utilities.
  • We were inspired by ideas from MEGABYTE for multi-scale token processing.
  • We leverage SNAC for the discrete audio codec.
  • WavLM and CLAP are used to extract speaker embeddings.
  • We draw on additional TTS references and techniques from VALL-E, StyleTTS2, XTTSv2, and more (see paper).

Thank you to the authors of these amazing works above for making this model possible!


Contributions

We welcome any contributions that improve the model. We'd also love to see how you have used MARS6-Turbo in different scenarios. Please use the 🙌 Show and tell category in Discussions to share your examples.

Contribution format:

The preferred way to contribute to our repo is to fork the master repository on GitHub:

  1. Fork the repo on GitHub
  2. Clone your fork and add this repo as the upstream remote: git remote add upstream git@github.com:Camb-ai/mars6-turbo.git
  3. Create a new local branch, make your changes, and commit them.
  4. Push the new branch to your fork: git push --set-upstream origin <NAME-NEW-BRANCH>
  5. On GitHub, go to your fork and click 'Pull Request' to begin the PR process. Please make sure to include a description of what you did/fixed.

Join Our Team

We're an ambitious team, globally distributed, with a singular aim of making everyone's voice count. At CAMB.AI, we're a research team of Interspeech-published, Carnegie Mellon, ex-Siri engineers and we're looking for you to join our team.

We're actively hiring; please drop us an email at [email protected] if you're interested. Visit our careers page for more info.

Community

Join the CAMB.AI community on Forum and Discord to share any suggestions, feedback, or questions with our team.

Support Camb.ai on Ko-fi ❤️!


Citation

If you use this repository or MARS6 in your research, please cite our paper with the following:

@inproceedings{mars6-2025icassp,
  author    = {Baas, Matthew and Scholtz, Pieter and Mehta, Arnav and Dyson, Elliott and Prakash, Akshat and Kamper, Herman},
  title     = {{MARS6}: A Small and Robust Hierarchical-Codec Text-to-Speech Model},
  booktitle = {IEEE ICASSP},
  year      = {2025},
  doi       = {10.1234/icassp.2025.mars6}
}

Thank you for trying MARS6! If you run into problems, please open a GitHub issue.
