Conversation

Owner

@Blaizzy Blaizzy commented May 17, 2025

Summary

  • save joined audio files using the model's sample rate
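The idea behind this change can be sketched as follows. This is a minimal illustration, not the project's code: `join_and_save` is a hypothetical helper, the segments are assumed to be plain sequences of float samples in [-1, 1], and the caller is assumed to pass the sample rate reported by the model rather than a hard-coded constant. It uses only the standard-library `wave` and `struct` modules.

```python
import struct
import wave


def join_and_save(segments, path, sample_rate):
    """Concatenate audio segments and write one 16-bit mono WAV file.

    Hypothetical helper for illustration. `sample_rate` should come from
    the model that produced the audio, so playback speed and pitch match.
    """
    # Flatten the list of segments into one stream of samples.
    joined = [sample for segment in segments for sample in segment]
    # Clamp to [-1, 1] and convert to 16-bit signed PCM.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in joined]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        f.setframerate(sample_rate)  # the model's rate, not a fixed default
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(joined)
```

Writing the header with a rate different from the one the samples were generated at would make the joined file play back too fast or too slow, which is the kind of mismatch this summary describes.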

Testing

  • pytest -q (fails: command not found)

@Blaizzy Blaizzy merged commit c45f399 into main May 17, 2025
2 checks passed
Blaizzy added a commit that referenced this pull request May 19, 2025
* base arch of server

* add tts and stt endpoints

* functioning server

* connect server and ui

* Add audio utilities, use them where possible (#161)

* Add audio utilities, use them where possible.

* Formatting.

* Fix tests.

* Fix tests.

* More test fixes.

* fix server

* Fix join audio sample rate (#162)

* update nextjs

* fix stt view

* working STT

* working text to speech

* remove voices

* remove home

* add custom model and delete file

* refactor model mapping

* add animation and use env vars for frontend config

* remove unused

* refactor model loading

* add tests

* mock generate

* fix tests

* remove old player

* update readme

---------

Co-authored-by: Lucas Newman <[email protected]>
Blaizzy added a commit that referenced this pull request Nov 7, 2025
* add ui v2

* Server v2 (#153)

* base arch of server

* add tts and stt endpoints

* functioning server

* connect server and ui

* Add audio utilities, use them where possible (#161)

* Add audio utilities, use them where possible.

* Formatting.

* Fix tests.

* Fix tests.

* More test fixes.

* fix server

* Fix join audio sample rate (#162)

* update nextjs

* fix stt view

* working STT

* working text to speech

* remove voices

* remove home

* add custom model and delete file

* refactor model mapping

* add animation and use env vars for frontend config

* remove unused

* refactor model loading

* add tests

* mock generate

* fix tests

* remove old player

* update readme

---------

Co-authored-by: Lucas Newman <[email protected]>

* add main and remove unused

* add gitignore

* update gitignore

* fix: (outetts)loading model: Speaker file not found. (#189)

Error loading model: Speaker file not found: mlx-audio/lib/python3.12/site-packages/mlx_audio/tts/models/outetts/default_speaker.json

* Fix deprecated save in MLX-LM (#194)

* fix deprecated save weights

* update mlx-vlm

* add pytest-asyncio

* Implementation of Misaki G2P tokenizer  (#193)

* implementation of Misaki for on-device

* KokoroTokenizer tests

---------

Co-authored-by: Prince Canuma <[email protected]>

* Add IndexTTS (#187)

* correctly load model

* got latent generation working

* add ECAPA-TDNN and BigVGANConditioning
* added sanitize method to original bigvgan
* renamed BigVGAN Activation1D.activation to .act

* fixed various bugs

* fit in existing model formats

* add init test for IndexTTS

* skip sanitize if already sanitized

* masking logic + `tqdm.trange` + better default sampler

* uses `WNConv1D` and `WNConvTranspose1d`

* fix test & validate already converted in bigvgan

* added normalizer

* add normalizer dependencies

* removed specifiers

* fix tests

* seems like wetext could just work fine on other platforms

* removed wetext and added number normalizer for English

---------

Co-authored-by: Prince Canuma <[email protected]>

* Load both lexicon files us_gold and us_silver with words in us_gold taking precedence  (#195)

* implementation of Misaki for on-device

* KokoroTokenizer tests

* load both lexicon files with phonemes in us_gold taking precedence over us_silver
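
The precedence rule described above can be sketched as a dictionary merge. This assumes each lexicon file is a flat JSON object mapping words to phoneme strings, which is an assumption about the file format; `load_lexicon` is a hypothetical name, not the tokenizer's actual function.

```python
import json


def load_lexicon(gold_path, silver_path):
    """Merge two JSON lexicons; entries in gold override entries in silver.

    Hypothetical helper for illustration only.
    """
    with open(silver_path, encoding="utf-8") as f:
        silver = json.load(f)
    with open(gold_path, encoding="utf-8") as f:
        gold = json.load(f)
    # In a dict merge the later mapping wins, so gold is unpacked last.
    return {**silver, **gold}
```

Words that appear only in the silver file are still available, while any word present in both resolves to the gold entry.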

---------

Co-authored-by: Prince Canuma <[email protected]>

* Add S3 neural audio codec. (#204)

* add lexicon files for British sounds, gb_gold and gb_silver (#197)

* Fix Mimi codec. (#209)

Co-authored-by: Prince Canuma <[email protected]>

* Add ability to use a custom URL to load Kokoro safetensors (#185)

* Add ability to use a custom URL to load Kokoro

* Add ability to use a custom URL to load Orpheus

* Handle error when loading kokoro weights

---------

Co-authored-by: Prince Canuma <[email protected]>

* Handle transformers-style config for Sesame CSM models. (#211)

* Add Xcode build troubleshooting documentation (#210)

- Document Metal Toolchain error with Xcode Beta versions
- Provide step-by-step solution for missing Metal Toolchain component
- Include alternative build approaches and debug commands
- Add comprehensive build command reference

Fixes build failures when using Xcode Beta that lacks Metal Toolchain component required for mlx-swift Metal shader compilation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <[email protected]>
Co-authored-by: Prince Canuma <[email protected]>

* Multi model support (#213)

* feat: Add all possible languages selection and mapping for Kokoro TTS

* Add support for all TTS models in web interface

- Added /models endpoint to list available TTS models with configurations
- Extended model dropdown to include Kokoro, CSM/Sesame, Bark, OuteTTS, and Spark
- Added model-specific UI elements (reference audio upload for voice cloning)
- Updated JavaScript for dynamic UI changes based on selected model capabilities
- Enhanced server TTS endpoint to handle model-specific parameters
- Support for reference audio in CSM/Sesame models for voice cloning
- Dynamic language/voice options based on model capabilities

* feat: Enhance TTS endpoint and web interface for model-specific parameters
- Added support for all kokoro languages and voices
- Added support for CSM/Sesame
- Updated TTS endpoint to handle speed, pitch, and gender parameters for Spark models.
- Modified audio player UI to dynamically load model types, languages, and voices based on selected model capabilities.
- Added support for Spark-specific controls including speech speed, pitch, and gender selection.
- Improved JavaScript logic for fetching models and updating UI elements accordingly.

---------

Co-authored-by: Prince Canuma <[email protected]>

* Add voxtral (#214)

* add voxtral

* use lm modules

* add mistral commons

* refactor generate

* remove lazy

* make lazy imports

* bump version

* Use indeterminate progress for CSM models (#216)

* Use indeterminate progress for CSM models.

* Fix multiple paragraph generation.

* Bump version to 0.2.5 (#219)

* fix wav2vec (#222)

* Fix RTF calculation in kokoro model

Reset start_time after each segment to ensure accurate real-time factor calculations for multi-segment audio generation.
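
A minimal sketch of per-segment timing, assuming the real-time factor is taken as processing time divided by the duration of the generated audio. `generate_with_rtf`, `synthesize`, and the injectable `clock` parameter are hypothetical names for illustration and are not taken from the Kokoro model code.

```python
import time


def generate_with_rtf(segments, synthesize, sample_rate, clock=time.perf_counter):
    """Yield (audio, rtf) for each text segment, timing segments separately.

    Hypothetical sketch: `synthesize` is any callable returning a sequence
    of samples for one segment of text.
    """
    start_time = clock()
    for text in segments:
        audio = synthesize(text)
        elapsed = clock() - start_time
        duration = len(audio) / sample_rate  # seconds of audio produced
        rtf = elapsed / duration if duration > 0 else float("inf")
        yield audio, rtf
        # Reset the timer so the next segment's RTF does not include
        # the time already spent on earlier segments.
        start_time = clock()
```

Without the reset, the elapsed time would keep accumulating across segments while the duration covered only the latest one, so the reported factor would grow with every segment.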

* fix: avoid unnecessary audio transcription for the index tts model

Use the inspect.signature method to detect the ref_text parameter to avoid unnecessary transcription operations during text-to-speech conversion.
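
The check described above can be sketched with the standard-library `inspect` module. The helper names `accepts_ref_text` and `prepare_kwargs`, and the `transcribe` callable, are hypothetical and stand in for whatever the project actually uses; only `inspect.signature` and its `parameters` mapping are real APIs here.

```python
import inspect


def accepts_ref_text(generate_fn):
    """Return True if `generate_fn` declares a `ref_text` parameter."""
    return "ref_text" in inspect.signature(generate_fn).parameters


def prepare_kwargs(generate_fn, ref_audio, transcribe):
    """Build keyword arguments, transcribing only when the model needs it.

    Hypothetical helper: `transcribe` is any callable that turns the
    reference audio into text.
    """
    kwargs = {"ref_audio": ref_audio}
    if accepts_ref_text(generate_fn):
        # Only pay for speech-to-text when the target function uses it.
        kwargs["ref_text"] = transcribe(ref_audio)
    return kwargs
```

A model whose generate function takes no `ref_text` argument then skips the transcription step entirely.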

* Add Sesame TTS components and configurations

* Add Sesame TTS attention and model argument implementations

* Add SesameModel implementation for dual-transformer TTS system

* Implement Sesame TTS tokenizer and model wrapper

* Add SesameVoiceManager for voice prompt management

* Refactor VectorQuantization to use MLXNN.Linear

* Refactor Mimi and SesameModelWrapper

* Add Sesame TTS components for example

* Update SesameWeightLoader to load weights

* Update Sesame TTS with configuration loading

* Update Sesame TTS with improved input handling and masking logic

* Refactor SesameAttention and SesameTokenizer

* Get the weights loading and RVQ working

* Update to match python implementation

* Refactor SesameModel and Mimi codec

* Update Sesame TTS Swift

* Update Sesame TTS with improved Mimi codec

* Refactor Sesame TTS

* Update ContentView to use Marvis TTS as the default provider

* Refactor MarvisTTS by removing memory monitoring and debug logging

* Refactor ContentView to integrate Sesame TTS as the default provider

* Update Sesame TTS playback management with improved buffering

* Update ContentView iOS to support Sesame

* Update ContentView iOS to manage speaker selection

* Refactor SesameTTS for better Swift naming conventions and add async streaming

* Use SesameSession

* Add Swift integration documentation

* Clean up project by removing unused files and adding Xcode-related entries to .gitignore

* Update SesameSession with optional audio playback

* Fix Swift TTS import issues by updating Xcode project configuration

- Removed problematic mlx-swift-audio package dependency from Xcode project
- Added proper MLX package dependencies (MLX, MLXNN, MLXRandom, MLXLMCommon, MLXLLM, Transformers)
- Updated workspace configuration to use parent directory
- Modified SesameSession.swift to support optional audio playback

Note: Swift Package Manager builds successfully, but Xcode has issues resolving MLX dependencies

* Clean up Xcode package dependencies - remove redundant entries

- Removed explicit mlx-swift and swift-transformers dependencies from Xcode
- Keep only mlx-swift-examples which brings in both as transitive dependencies
- This matches the Package.swift structure and avoids duplication

Note: Xcode still has MLX dependency resolution issues, but Swift Package Manager works correctly

* Update Xcode project configuration

* Fix Xcode project build - remove redundant package dependencies

- Removed duplicate mlx-swift and swift-transformers dependencies
- Keep only mlx-swift-examples which provides both as transitive dependencies
- Updated package version to upToNextMajorVersion with minimumVersion 2.25.7
- Xcode project now builds successfully ✅

Before: mlx-swift + mlx-swift-examples + swift-transformers (redundant)
After: mlx-swift-examples (includes mlx-swift + swift-transformers)

* Add audio playback files to Xcode project

* Use batched vocoding to reduce peak memory usage with Sesame arch models. (#236)

* Cache RoPE by dtype for Sesame arch models for improved generation performance. (#232)

* Install Metal toolchain for Swift tests. (#233)

* Adopt interface changes from mlx-lm to fix Sesame-arch models. (#242)

Co-authored-by: Prince Canuma <[email protected]>

* Update package dependencies and formatting (#247)

* Improve Swift TTS app UX (#248)

* Refactor SesameSession initializers and update ContentView for improved TTS provider handling

* Refactor ContentView and update TTS

* Update ContentView

* Integrate Marvis TTS model into Swift

* Update macOS generate button with loading indicators and stop functionality

- Add progress indicators when generating
- Show 'Loading...' vs 'Generating...' states
- Add Stop button with proper disable states
- Use @StateObject for KokoroTTSModel to observe changes
- Match iOS button styling and behavior
- Update title to 'MLX Audio Eval' and remove mouth icon

* Update package dependencies in Package.swift and Package.resolved

* Refactor ContentView and introduce new inspector components for TTS

* Add audio playback management

* Add Marvis session status indicator into TTS views

* Refactor VoicePickerSection and AudioPlayerView for improved layout and functionality

* Update ContentView with inspector

* Update README and ContentView

* Swift_TTS to MLXAudio

* Update Package.swift

---------

Co-authored-by: Prince Canuma <[email protected]>

* Add quality selection and streaming controls to Marvis with UI support for macOS & iOS (#249)

* Add quality selection feature

* Refactor QualityLevel enum

* Add streaming audio generation option

* Add streaming interval configuration

* Refactor Marvis session management

* Refactor unused variables (#250)

Co-authored-by: Prince Canuma <[email protected]>

* Refactor MarvisModel to handle optional backbone and decoder flavors (#251)

* Refactor MarvisModel to handle optional backbone and decoder flavors

* Add 6-bit model support and quantization handling

- Updated default model to marvis-tts-100m-v0.2-MLX-6bit
- Fixed quantization config to handle JSONValue types (supports mixed types like mode string and bits/group_size numbers)
- Updated installWeights to properly extract quantization parameters from JSONValue enum

* Refactor MarvisSession to improve quantization handling and update default model

* Fix iOS 16 compatibility and ESpeakNG framework linking for iOS app (#252)

* Refactor ContentView and ESpeakNGEngine

* Update iOS platform version to v17 in Package.swift

* Refactor UI components to use platform-specific colors

* Update button styles for macOS and iOS in TextInputSection.swift

* Refactor onChange handlers

* Add testable reference and build action for MLXAudioTests in Xcode scheme

* Add SpeakNG.xcframework in Embed Frameworks phase of MLXAudio target

---------

Co-authored-by: Prince Canuma <[email protected]>

* Add memory increase limit for iOS  (#253)

* Refactor ContentView and ESpeakNGEngine

* Update iOS platform version to v17 in Package.swift

* Refactor UI components to use platform-specific colors

* Update button styles for macOS and iOS in TextInputSection.swift

* Refactor onChange handlers

* Add testable reference and build action for MLXAudioTests in Xcode scheme

* Add SpeakNG.xcframework in Embed Frameworks phase of MLXAudio target

* Add entitlements file for increased memory limit

---------

Co-authored-by: Prince Canuma <[email protected]>

* Update audio playback management in Marvis TTS (#254)

* Bump version and add new copy files (#255)

* bump version

* add wav and text to copy

* add text to copy pattern

* fix tests

* Server v2 (#153)

* base arch of server

* add tts and stt endpoints

* functioning server

* connect server and ui

* Add audio utilities, use them where possible (#161)

* Add audio utilities, use them where possible.

* Formatting.

* Fix tests.

* Fix tests.

* More test fixes.

* fix server

* Fix join audio sample rate (#162)

* update nextjs

* fix stt view

* working STT

* working text to speech

* remove voices

* remove home

* add custom model and delete file

* refactor model mapping

* add animation and use env vars for frontend config

* remove unused

* refactor model loading

* add tests

* mock generate

* fix tests

* remove old player

* update readme

---------

Co-authored-by: Lucas Newman <[email protected]>

* add main and remove unused

* set marvis as default

* format

---------

Co-authored-by: Lucas Newman <[email protected]>
Co-authored-by: sam <[email protected]>
Co-authored-by: Sachin Desai <[email protected]>
Co-authored-by: Senstella <[email protected]>
Co-authored-by: Adrien Grondin <[email protected]>
Co-authored-by: Kyle Kinkade <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: Ivan Fioravanti <[email protected]>
Co-authored-by: Josh Bleecher Snyder <[email protected]>
Co-authored-by: David Feng <[email protected]>
Co-authored-by: bytefer <[email protected]>
Co-authored-by: Rudrank Riyam <[email protected]>
Co-authored-by: Liam Wittig <[email protected]>
2 participants