MARS6 is a compact autoregressive TTS system based on a hierarchical neural audio codec. It uses a two-stage decoder to predict coarse-to-fine tokens at only 12 Hz, leading to fast inference while preserving high audio quality and speaker similarity. MARS6 achieves robust zero-shot voice cloning and expressive synthesis even on challenging in-the-wild references. This repository provides a lightweight implementation for inference using our public turbo checkpoints.
Below is a high-level diagram of MARS6 from our paper. The encoder processes the text and speaker embeddings (from an external speaker encoder), producing a sequence of latent features. The hierarchical decoder operates at a low 12 Hz "global" level while autoregressively expanding each frame into multiple discrete codec tokens with a small "local" decoder.
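To make the rate arithmetic concrete, here is a small illustrative sketch of the hierarchical token budget. It is not the model code: the 12 Hz global rate comes from the description above, but the number of local codec tokens per global frame is a hypothetical example value.

```python
# Illustrative sketch of MARS6's coarse-to-fine token budget (NOT the actual
# model code). GLOBAL_RATE_HZ = 12 is from the paper; tokens_per_frame is a
# hypothetical example, not the real SNAC token count.

GLOBAL_RATE_HZ = 12  # coarse "global" decoder steps per second of audio

def token_budget(duration_s: float, tokens_per_frame: int) -> tuple:
    """Return (global_frames, total_local_tokens) for a clip of duration_s.

    The global decoder runs once per 12 Hz frame; the small local decoder
    then autoregressively expands each frame into tokens_per_frame codec tokens.
    """
    global_frames = round(duration_s * GLOBAL_RATE_HZ)
    return global_frames, global_frames * tokens_per_frame

if __name__ == "__main__":
    frames, tokens = token_budget(duration_s=5.0, tokens_per_frame=7)
    print(frames, tokens)  # 60 global steps, 420 local codec tokens
```

The key point is that the expensive global model only runs at 12 Hz; the cheap local decoder does the per-frame expansion, which is what keeps inference fast.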
- Project Page: MARS6-Turbo Demo & Samples
- arXiv Paper: MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model
- Website: https://camb.ai
This section outlines how to install and run MARS6 for inference. You can either clone this repository and install dependencies or load MARS6 directly via Torch Hub.
- Clone this repo:

```bash
git clone https://github.com/Camb-ai/mars6-turbo.git
cd mars6-turbo
```

- Install the required dependencies:

```bash
pip install snac msclap ipykernel iprogress
```

Make sure you have a modern version of Python (3.9+); it is best practice to use a conda environment or a Python venv.
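If you want to sanity-check that the dependencies above installed correctly, a quick (optional) snippet like the following reports any that are missing; the package list mirrors the `pip install` step above.

```python
import importlib.util

# Importable package names from the install step above.
REQUIRED = ["snac", "msclap", "ipykernel"]

def missing_packages(names):
    """Return the subset of package names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All MARS6 dependencies found.")
```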
Use `inference.py` for direct command-line usage:

```bash
python inference.py --audio "referencepath.wav" --save_path "outputpath.wav" --text "Text we wish to output. All right here!" --transcript "Transcript of the reference. For if you wish to deep clone"
```

Or open `MARS6_turbo_inference_demo.ipynb` for a Jupyter notebook walkthrough.
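For batch synthesis, one option is to drive the CLI from Python. This is just a convenience sketch, not part of the repo: it builds an `inference.py` command using the flags shown above, with placeholder file paths.

```python
import subprocess  # only needed if you actually run the command
import sys
from typing import List, Optional

def synthesize_cmd(audio: str, text: str, save_path: str,
                   transcript: Optional[str] = None) -> List[str]:
    """Build an inference.py command line using the CLI flags shown above."""
    cmd = [sys.executable, "inference.py",
           "--audio", audio,
           "--save_path", save_path,
           "--text", text]
    if transcript is not None:
        # Providing the reference transcript enables "deep clone" mode.
        cmd += ["--transcript", transcript]
    return cmd

# To actually synthesize (run from the repo root):
# subprocess.run(synthesize_cmd("reference.wav", "Hello!", "out.wav"), check=True)
```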
- We use minBPE by Karpathy for byte-pair tokenization utilities.
- We were inspired by ideas from MEGABYTE for multi-scale token processing.
- We leverage SNAC for the discrete audio codec.
- WavLM and CLAP are used for speaker embeddings.
- Additional TTS references and techniques from VALL-E, StyleTTS2, XTTSv2, and more (see paper).
Thank you to the authors of these amazing works above for making this model possible!
We welcome any contributions that improve the model. We'd also love to see how you used MARS6-Turbo in different scenarios; please use the 🙌 Show and tell category in Discussions to share your examples.
Contribution format:
The preferred way to contribute is to fork the main repository on GitHub:
- Fork the repo on GitHub
- Clone your fork locally and set this repo as the upstream remote:

```bash
git remote add upstream [email protected]:Camb-ai/mars6-turbo.git
```
- Make a new local branch and make your changes, commit changes.
- Push your changes to a new branch on your fork:

```bash
git push --set-upstream origin <NAME-NEW-BRANCH>
```
- On GitHub, go to your fork and click 'Pull Request' to begin the PR process. Please make sure to include a description of what you did/fixed.
We're an ambitious, globally distributed team with a singular aim: making everyone's voice count. At CAMB.AI, our research team includes Interspeech-published authors, Carnegie Mellon alumni, and ex-Siri engineers, and we're looking for you to join us.
We're actively hiring; please drop us an email at [email protected] if you're interested. Visit our careers page for more info.
Join CAMB.AI community on Forum and Discord to share any suggestions, feedback, or questions with our team.
If you use this repository or MARS6 in your research, please cite our paper:

```bibtex
@inproceedings{mars6-2025icassp,
  author    = {Baas, Matthew and Scholtz, Pieter and Mehta, Arnav and Dyson, Elliott and Prakash, Akshat and Kamper, Herman},
  title     = {{MARS6}: A Small and Robust Hierarchical-Codec Text-to-Speech Model},
  booktitle = {IEEE ICASSP},
  year      = {2025}
}
```
Thank you for trying MARS6! For problems or questions, please open a GitHub issue.