@Blaizzy commented May 18, 2025

Description:

Users running iterative or long text-to-speech generation tasks with Kokoro and other models could see memory usage grow significantly over time. This was primarily due to the MLX framework's internal cache accumulating tensor data (such as KV caches and activations) across generation segments or iterations without being cleared. In testing, the MLX cache was observed to grow to ~70-73 GB and remain high, giving the overall Python process a substantial memory footprint.

Solution ✅

This PR addresses the high memory consumption by calling mlx.core.clear_cache() within the Model.generate method after each audio segment is produced. This ensures that memory the MLX cache allocated for one segment is released before the next segment is processed, preventing unbounded growth.
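A minimal sketch of the pattern (the segment loop and the `synthesize`/`clear_cache` callables are hypothetical stand-ins so the shape can be shown without loading a model; in the actual fix, `clear_cache` is `mlx.core.clear_cache` called inside `Model.generate`):

```python
def generate_segments(segments, synthesize, clear_cache):
    """Yield audio for each text segment, clearing the MLX buffer cache
    between segments so cached memory cannot accumulate.

    `synthesize` stands in for the model's per-segment generation step and
    `clear_cache` for mlx.core.clear_cache; both are injected here so this
    sketch runs without MLX installed.
    """
    for text in segments:
        audio = synthesize(text)  # may allocate large KV caches/activations
        clear_cache()             # release cached buffers before the next segment
        yield audio
```

Because the cache is cleared after each yielded segment, the cache size seen by the caller stays bounded by a single segment's working set rather than the sum over all segments.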

Impact & Results:

With this change, the MLX cache memory is effectively managed and does not accumulate across segments. This leads to a drastically reduced and stable memory footprint for applications using the Kokoro TTS model.

[Screenshot: per-iteration MLX cache memory before and after the fix]

This demonstrates a reduction in per-iteration cache memory from tens of gigabytes to a few megabytes, making the model significantly more memory-efficient for sequential generation tasks.

Considerations:

While clearing the cache prevents excessive memory usage, MLX can no longer reuse the memory buffers it cached during previous segments, so they must be reallocated. This may have a minor impact on per-segment generation speed. However, for typical TTS applications, where memory stability and the ability to process long texts are crucial, the reduced memory usage outweighs this trade-off.

How to Verify:

Run a script that iteratively calls tts.generate() (or a wrapper around it) for multiple text inputs or segments. Monitor mlx.core.get_cache_memory() and overall process memory (e.g., with psutil). The cache memory should remain low and stable after each generation call, and overall process memory should not grow uncontrollably from MLX cache accumulation.
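One way to script that check (the callables are injected so this sketch runs without a model; in practice `run_generation` would wrap `tts.generate()` and `get_cache_memory` would be `mlx.core.get_cache_memory`, and a psutil `Process.memory_info().rss` reading could be recorded alongside each cache reading):

```python
def measure_cache_growth(run_generation, get_cache_memory, iterations=5):
    """Call the generation function repeatedly and record the MLX cache
    size after each call; returns the list of per-iteration readings."""
    readings = []
    for _ in range(iterations):
        run_generation()
        readings.append(get_cache_memory())
    return readings


def cache_is_stable(readings, tolerance_bytes=64 * 1024 * 1024):
    """True if cache memory stays within `tolerance_bytes` across iterations.

    The 64 MiB tolerance is an arbitrary illustrative threshold; with the
    pre-fix behavior described above, readings drifted by tens of GB.
    """
    return max(readings) - min(readings) <= tolerance_bytes
```

With the fix applied, the readings should hover around a small, constant value; without it, they grow monotonically across iterations.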


@Blaizzy Blaizzy changed the title Pc/fix memory spike Fix memory spikes May 18, 2025
@Blaizzy Blaizzy mentioned this pull request May 18, 2025
@Blaizzy Blaizzy marked this pull request as ready for review May 18, 2025 23:49
@Blaizzy Blaizzy merged commit a01642e into main May 19, 2025
2 checks passed


Successfully merging this pull request may close these issues.

- Voices not unloaded dynamically with mlx_audio.server
- Model type kokoro not supported
- Memory Usage in Kokoro
