Commit f385b21

Blaizzy and lucasnewman authored

Server v2 (#153)

* base arch of server
* add tts and stt endpoints
* functioning server
* connect server and ui
* Add audio utilities, use them where possible (#161)
* Add audio utilities, use them where possible.
* Formatting.
* Fix tests.
* Fix tests.
* More test fixes.
* fix server
* Fix join audio sample rate (#162)
* update nextjs
* fix stt view
* working STT
* working text to speech
* remove voices
* remove home
* add custom model and delete file
* refactor model mapping
* add animation and use env vars for frontend config
* remove unused
* refactor model loading
* add tests
* mock generate
* fix tests
* remove old player
* update readme

Co-authored-by: Lucas Newman <[email protected]>

1 parent 9042b56 commit f385b21

File tree

26 files changed: +1536 / -2675 lines changed

README.md

Lines changed: 56 additions & 37 deletions

````diff
@@ -64,27 +64,40 @@ print("Audiobook chapter successfully generated!")
 
 ```
 
-### Web Interface & API Server
-
-MLX-Audio includes a web interface with a 3D visualization that reacts to audio frequencies. The interface allows you to:
-
-1. Generate TTS with different voices and speed settings
-2. Upload and play your own audio files
-3. Visualize audio with an interactive 3D orb
-4. Automatically saves generated audio files to the outputs directory in the current working folder
-5. Open the output folder directly from the interface (when running locally)
-
-#### Features
+### Web Interface & FastAPI Server
+
+MLX-Audio provides a modern web interface with real-time audio visualization capabilities. The interface offers:
+
+1. Text-to-Speech generation with customizable voices and parameters
+2. Speech-to-Text transcription with support for multiple languages
+3. Audio file upload and playback functionality
+4. Interactive 3D audio visualization
+5. Automatic audio file management in the outputs directory
+6. Direct access to the output folder from the interface (local deployment only)
+
+#### Key Features
+
+- **Voice Customization**: Select from multiple voice presets including AF Heart, AF Nova, AF Bella, and BF Emma
+- **Speech Rate Control**: Fine-tune speech generation speed using an intuitive slider (range: 0.5x - 2.0x)
+- **Dynamic 3D Visualization**: Experience audio through an interactive 3D orb that responds to frequency changes
+- **Audio Management**: Upload, play, and visualize custom audio files
+- **Smart Playback**: Optional automatic playback of generated audio
+- **File Management**: Quick access to the output directory through an integrated file explorer button
+- **Speech Recognition**: Convert speech to text with support for multiple languages and models
+
+To start the web interface and API server:
 
-- **Multiple Voice Options**: Choose from different voice styles (AF Heart, AF Nova, AF Bella, BF Emma)
-- **Adjustable Speech Speed**: Control the speed of speech generation with an interactive slider (0.5x to 2.0x)
-- **Real-time 3D Visualization**: A responsive 3D orb that reacts to audio frequencies
-- **Audio Upload**: Play and visualize your own audio files
-- **Auto-play Option**: Automatically play generated audio
-- **Output Folder Access**: Convenient button to open the output folder in your system's file explorer
+UI:
+```bash
+# Configure the API base URL and port
+export NEXT_PUBLIC_API_BASE_URL=http://localhost
+export NEXT_PUBLIC_API_PORT=8000
 
-To start the web interface and API server:
+# Start UI server
+cd mlx_audio/ui
+npm run dev
+```
 
+Server:
 ```bash
 # Using the command-line interface
 mlx_audio.server
@@ -109,26 +122,23 @@ http://127.0.0.1:8000
 
 The server provides the following REST API endpoints:
 
-- `POST /tts`: Generate TTS audio
-  - Parameters (form data):
-    - `text`: The text to convert to speech (required)
-    - `voice`: Voice to use (default: "af_heart")
-    - `speed`: Speech speed from 0.5 to 2.0 (default: 1.0)
-  - Returns: JSON with filename of generated audio
-
-- `GET /audio/(unknown)`: Retrieve generated audio file
-
-- `POST /play`: Play audio directly from the server
-  - Parameters (form data):
-    - `filename`: The filename of the audio to play (required)
-  - Returns: JSON with status and filename
+- `POST /v1/audio/speech`: Generate speech from text following the OpenAI TTS specification.
+  - JSON body parameters:
+    - `model`: Name or path of the TTS model to use.
+    - `input`: Text to convert to speech.
+    - `voice`: Optional voice preset.
+    - `speed`: Optional speech speed (default `1.0`).
+  - Returns the generated audio in WAV format.
 
-- `POST /stop`: Stop any currently playing audio
-  - Returns: JSON with status
+- `POST /v1/audio/transcriptions`: Transcribe audio files using an STT model in a format compatible with OpenAI's API.
+  - Multipart form parameters:
+    - `file`: The audio file to transcribe.
+    - `model`: Name or path of the STT model.
+  - Returns JSON containing the transcribed `text`.
 
-- `POST /open_output_folder`: Open the output folder in the system's file explorer
-  - Returns: JSON with status and path
-  - Note: This feature only works when running the server locally
+- `GET /v1/models`: List loaded models.
+- `POST /v1/models`: Load a model by name.
+- `DELETE /v1/models`: Unload a model.
 
 > Note: Generated audio files are stored in `~/.mlx_audio/outputs` by default, or in a fallback directory if that location is not writable.
 
@@ -217,7 +227,7 @@ mx.save_safetensors("./8bit/kokoro-v1_0.safetensors", weights, metadata={"format
 - For the web interface and API:
   - FastAPI
   - Uvicorn
-
+
 ## License
 
 [MIT License](LICENSE)
@@ -227,3 +237,12 @@ mx.save_safetensors("./8bit/kokoro-v1_0.safetensors", weights, metadata={"format
 - Thanks to the Apple MLX team for providing a great framework for building TTS and STS models.
 - This project uses the Kokoro model architecture for text-to-speech synthesis.
 - The 3D visualization uses Three.js for rendering.
+
+
+@misc{mlx-audio,
+  author = {Canuma, Prince},
+  title = {MLX Audio},
+  year = {2025},
+  howpublished = {\url{https://github.com/Blaizzy/mlx-audio}},
+  note = {A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon.}
+}
````
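As a rough illustration of the new OpenAI-style TTS endpoint documented above, the JSON body for `POST /v1/audio/speech` can be assembled with the standard library. The helper name `build_speech_request` and the model string below are hypothetical placeholders for this sketch, not part of MLX-Audio itself:

```python
import json
from typing import Optional


def build_speech_request(model: str, text: str, voice: Optional[str] = None,
                         speed: float = 1.0) -> dict:
    # Mirrors the documented /v1/audio/speech JSON body:
    # `model`, `input`, optional `voice`, optional `speed` (default 1.0).
    body = {"model": model, "input": text, "speed": speed}
    if voice is not None:
        body["voice"] = voice
    return body


payload = build_speech_request("path/to/tts-model", "Hello from MLX-Audio!",
                               voice="af_heart")
print(json.dumps(payload, indent=2))
```

Posting this payload with any HTTP client to `http://127.0.0.1:8000/v1/audio/speech` should, per the endpoint description above, return the generated audio as WAV.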

mlx_audio/codec/models/vocos/mel.py

Lines changed: 2 additions & 146 deletions

```diff
@@ -1,152 +1,8 @@
 from __future__ import annotations
 
-import math
-from functools import lru_cache
-from typing import Optional
-
 import mlx.core as mx
 
-
-@lru_cache(maxsize=None)
-def mel_filters(
-    sample_rate: int,
-    n_fft: int,
-    n_mels: int,
-    f_min: float = 0,
-    f_max: Optional[float] = None,
-    norm: Optional[str] = None,
-    mel_scale: str = "htk",
-) -> mx.array:
-    def hz_to_mel(freq, mel_scale="htk"):
-        if mel_scale == "htk":
-            return 2595.0 * math.log10(1.0 + freq / 700.0)
-
-        # slaney scale
-        f_min, f_sp = 0.0, 200.0 / 3
-        mels = (freq - f_min) / f_sp
-        min_log_hz = 1000.0
-        min_log_mel = (min_log_hz - f_min) / f_sp
-        logstep = math.log(6.4) / 27.0
-        if freq >= min_log_hz:
-            mels = min_log_mel + math.log(freq / min_log_hz) / logstep
-        return mels
-
-    def mel_to_hz(mels, mel_scale="htk"):
-        if mel_scale == "htk":
-            return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
-
-        # slaney scale
-        f_min, f_sp = 0.0, 200.0 / 3
-        freqs = f_min + f_sp * mels
-        min_log_hz = 1000.0
-        min_log_mel = (min_log_hz - f_min) / f_sp
-        logstep = math.log(6.4) / 27.0
-        freqs = mx.where(
-            mels >= min_log_mel,
-            min_log_hz * mx.exp(logstep * (mels - min_log_mel)),
-            freqs,
-        )
-        return freqs
-
-    f_max = f_max or sample_rate / 2
-
-    # generate frequency points
-
-    n_freqs = n_fft // 2 + 1
-    all_freqs = mx.linspace(0, sample_rate // 2, n_freqs)
-
-    # convert frequencies to mel and back to hz
-
-    m_min = hz_to_mel(f_min, mel_scale)
-    m_max = hz_to_mel(f_max, mel_scale)
-    m_pts = mx.linspace(m_min, m_max, n_mels + 2)
-    f_pts = mel_to_hz(m_pts, mel_scale)
-
-    # compute slopes for filterbank
-
-    f_diff = f_pts[1:] - f_pts[:-1]
-    slopes = mx.expand_dims(f_pts, 0) - mx.expand_dims(all_freqs, 1)
-
-    # calculate overlapping triangular filters
-
-    down_slopes = (-slopes[:, :-2]) / f_diff[:-1]
-    up_slopes = slopes[:, 2:] / f_diff[1:]
-    filterbank = mx.maximum(
-        mx.zeros_like(down_slopes), mx.minimum(down_slopes, up_slopes)
-    )
-
-    if norm == "slaney":
-        enorm = 2.0 / (f_pts[2 : n_mels + 2] - f_pts[:n_mels])
-        filterbank *= mx.expand_dims(enorm, 0)
-
-    filterbank = filterbank.moveaxis(0, 1)
-    return filterbank
-
-
-@lru_cache(maxsize=None)
-def hanning(size):
-    return mx.array(
-        [0.5 * (1 - math.cos(2 * math.pi * n / (size - 1))) for n in range(size)]
-    )
-
-
-def stft(x, window, nperseg=256, noverlap=None, nfft=None, pad_mode="reflect"):
-    if nfft is None:
-        nfft = nperseg
-    if noverlap is None:
-        noverlap = nfft // 4
-
-    def _pad(x, padding, pad_mode="constant"):
-        if pad_mode == "constant":
-            return mx.pad(x, [(padding, padding)])
-        elif pad_mode == "reflect":
-            prefix = x[1 : padding + 1][::-1]
-            suffix = x[-(padding + 1) : -1][::-1]
-            return mx.concatenate([prefix, x, suffix])
-        else:
-            raise ValueError(f"Invalid pad_mode {pad_mode}")
-
-    if window.shape[0] < nfft:
-        pad_left = (nfft - window.shape[0]) // 2
-        pad_right = nfft - window.shape[0] - pad_left
-        window = mx.pad(window, (pad_left, pad_right))
-
-    padding = nfft // 2
-    x = _pad(x, padding, pad_mode)
-
-    strides = [noverlap, 1]
-    t = (x.size - nperseg + noverlap) // noverlap
-    shape = [t, nfft]
-    x = mx.as_strided(x, shape=shape, strides=strides)
-    return mx.fft.rfft(x * window)
-
-
-def istft(x, window, nperseg=256, noverlap=None, nfft=None):
-    if nfft is None:
-        nfft = nperseg
-    if noverlap is None:
-        noverlap = nfft // 4
-
-    t = (x.shape[0] - 1) * noverlap + nperseg
-    reconstructed = mx.zeros(t)
-    window_sum = mx.zeros(t)
-
-    for i in range(x.shape[0]):
-        # inverse FFT of each frame
-        frame_time = mx.fft.irfft(x[i])
-
-        # get the position in the time-domain signal to add the frame
-        start = i * noverlap
-        end = start + nperseg
-
-        # overlap-add the inverse transformed frame, scaled by the window
-        reconstructed[start:end] += frame_time * window
-        window_sum[start:end] += window
-
-    # normalize by the sum of the window values
-    reconstructed = mx.where(window_sum != 0, reconstructed / window_sum, reconstructed)
-
-    return reconstructed
+from mlx_audio.utils import hanning, mel_filters, stft
 
 
 def log_mel_spectrogram(
@@ -163,7 +19,7 @@ def log_mel_spectrogram(
     if padding > 0:
         audio = mx.pad(audio, (0, padding))
 
-    freqs = stft(audio, hanning(n_fft), nperseg=n_fft, noverlap=hop_length)
+    freqs = stft(audio, window=hanning(n_fft), n_fft=n_fft, win_length=hop_length)
     magnitudes = freqs[:-1, :].abs()
     filters = mel_filters(
         sample_rate=sample_rate,
```
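The HTK mel-scale math that the removed `mel_filters` implements (now consolidated in `mlx_audio.utils`) is a simple log mapping between frequency and mel. A dependency-free sketch of the forward and inverse conversions, lifted from the formulas in the deleted code:

```python
import math


def hz_to_mel(freq: float) -> float:
    # HTK mel scale: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + freq / 700.0)


def mel_to_hz(mel: float) -> float:
    # Exact inverse of the HTK mapping above
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The scale is anchored so 1 kHz lands near 1000 mel, and the
# mapping round-trips cleanly within floating-point precision.
m = hz_to_mel(1000.0)
assert abs(mel_to_hz(m) - 1000.0) < 1e-6
```

The filterbank itself then just places `n_mels + 2` points evenly in mel space, maps them back to Hz, and builds overlapping triangular filters between adjacent points, as in the removed code above.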

mlx_audio/codec/models/vocos/vocos.py

Lines changed: 7 additions & 6 deletions

```diff
@@ -9,8 +9,10 @@
 import yaml
 from huggingface_hub import snapshot_download
 
+from mlx_audio.utils import hanning, istft
+
 from ..encodec import Encodec
-from .mel import hanning, istft, log_mel_spectrogram
+from .mel import log_mel_spectrogram
 
 
 class FeatureExtractor(nn.Module):
@@ -130,11 +132,10 @@ def __call__(self, x: mx.array) -> mx.array:
         y = mx.sin(p)
         S = mag * (x + 1j * y)
         audio = istft(
-            S.squeeze(0).swapaxes(0, 1),
-            hanning(self.n_fft),
-            self.n_fft,
-            self.hop_length,
-            self.n_fft,
+            S.squeeze(0),
+            window=hanning(self.n_fft),
+            hop_length=self.hop_length,
+            win_length=self.n_fft,
         )
         return audio
```
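The `istft` call above is a standard overlap-add reconstruction: each inverse-transformed frame is added into the output buffer at an offset of one hop per frame. A minimal pure-Python sketch of the overlap-add step (not MLX; window scaling and normalization omitted for clarity):

```python
def overlap_add(frames: list, hop: int) -> list:
    # Output length follows the removed istft's formula:
    # (n_frames - 1) * hop + frame_len
    frame_len = len(frames[0])
    out = [0.0] * ((len(frames) - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        for j, sample in enumerate(frame):
            out[i * hop + j] += sample  # accumulate overlapping regions
    return out


# Two length-4 frames with hop 2: the middle two samples overlap and sum.
print(overlap_add([[1, 1, 1, 1], [1, 1, 1, 1]], hop=2))
# -> [1.0, 1.0, 2.0, 2.0, 1.0, 1.0]
```

In the real implementation, each frame is multiplied by the analysis window before accumulation and the result is divided by the accumulated window sum, which is why the doubled middle samples do not distort the signal.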

mlx_audio/codec/tests/test_vocos.py

Lines changed: 4 additions & 4 deletions

```diff
@@ -65,12 +65,12 @@ def test_vocos_24khz(self):
 
         # reconstruct from mel spec
         reconstructed_audio = model(audio)
-        self.assertEqual(reconstructed_audio.shape, (120576,))
+        self.assertEqual(reconstructed_audio.shape, (119552,))
 
         # decode from mel spec
         mel_spec = log_mel_spectrogram(audio)
         decoded = model.decode(mel_spec)
-        self.assertEqual(decoded.shape, (120576,))
+        self.assertEqual(decoded.shape, (119552,))
 
         model = Vocos.from_hparams(config_encodec)
 
@@ -79,14 +79,14 @@ def test_vocos_24khz(self):
         reconstructed_audio = model(
             audio, bandwidth_id=mx.array(bandwidth_id)[None, ...]
         )
-        self.assertEqual(reconstructed_audio.shape, (120960,))
+        self.assertEqual(reconstructed_audio.shape, (119680,))
 
         # decode with encodec codes
         codes = model.get_encodec_codes(audio, bandwidth_id=bandwidth_id)
         decoded = model.decode_from_codes(
             codes, bandwidth_id=mx.array(bandwidth_id)[None, ...]
        )
-        self.assertEqual(decoded.shape, (120960,))
+        self.assertEqual(decoded.shape, (119680,))
 
 
 if __name__ == "__main__":
```
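The updated shape expectations follow from framing arithmetic: the removed `istft` computed its output length as `t = (x.shape[0] - 1) * noverlap + nperseg`, so any change in hop or window-length conventions shifts the reconstructed length. A small sketch of that arithmetic (the parameter values below are illustrative only, not the exact settings Vocos uses):

```python
def istft_length(n_frames: int, hop: int, frame_len: int) -> int:
    # Same formula as the removed istft:
    # t = (x.shape[0] - 1) * noverlap + nperseg
    return (n_frames - 1) * hop + frame_len


# One frame reconstructs to exactly one frame length;
# each additional frame contributes one hop of new samples.
print(istft_length(1, 256, 1024))    # -> 1024
print(istft_length(100, 256, 1024))  # -> 26368
```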
