Description:
Users running iterative or long text-to-speech generation tasks with Kokoro and other models could experience significantly high and growing memory usage. This was primarily due to the MLX framework's internal cache accumulating tensor data (such as KV caches and activations) across multiple generation segments or iterations without being cleared. In testing scenarios, the MLX cache memory was observed to grow to ~70-73 GB and remain high, leading to a substantial memory footprint for the overall Python process.

Solution ✅
This PR addresses the high memory consumption by strategically calling `mx.core.clear_cache()` within the `Model.generate` method, after each audio segment is produced. This ensures that memory allocated by the MLX cache for one segment is released before processing the next, preventing unbounded growth. A sketch of the change appears below.
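For illustration only, a minimal sketch of where the call sits, assuming a segment loop inside `Model.generate`; `_split_segments` and `_generate_segment` are hypothetical stand-ins, not the actual model internals:

```python
import mlx.core as mx

class Model:
    def generate(self, text, **kwargs):
        # Hypothetical segment loop; the real method's structure differs.
        for segment in self._split_segments(text):
            audio = self._generate_segment(segment, **kwargs)
            yield audio
            # Release MLX's internal buffer cache (KV caches, activations)
            # so it cannot accumulate across segments.
            mx.clear_cache()  # i.e. mlx.core.clear_cache()
```

Note that `clear_cache()` only releases buffers MLX is holding for reuse, not live arrays, so calling it between segments does not invalidate results already produced.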
Impact & Results:
With this change, the MLX cache no longer accumulates across segments, giving applications that use the Kokoro TTS model a drastically reduced and stable memory footprint: per-iteration cache memory drops from tens of gigabytes to a few megabytes, making the model significantly more memory-efficient for sequential generation tasks.
Considerations:
While clearing the cache prevents excessive memory usage, it means that MLX cannot reuse computations cached from previous segments. This might have a minor performance impact on per-segment generation speed. However, for typical TTS applications where memory stability and the ability to process long texts are crucial, the benefits of reduced memory usage outweigh this potential trade-off.
How to Verify:
Run a script that iteratively calls `tts.generate()` (or a function wrapping it) for multiple text inputs or segments, monitoring `mx.core.get_cache_memory()` and overall process memory (e.g., using `psutil`). The cache memory should remain low and stable after each generation call, and the overall process memory should not grow uncontrollably due to MLX cache accumulation. A sketch of such a script follows.
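A minimal verification sketch, assuming a `tts` object exposing a `generate(text)` method; how you construct that object, and whether its output needs to be consumed, depends on your setup:

```python
import mlx.core as mx
import psutil

def check_cache_stability(tts, texts):
    """Generate each text and report MLX cache and process memory afterwards."""
    proc = psutil.Process()
    for i, text in enumerate(texts):
        tts.generate(text)  # iterate/consume the result if your wrapper yields segments
        cache_mb = mx.get_cache_memory() / 1e6   # MLX buffer cache, in MB
        rss_mb = proc.memory_info().rss / 1e6    # whole-process resident set, in MB
        print(f"iter {i}: MLX cache {cache_mb:.1f} MB, process RSS {rss_mb:.1f} MB")
```

With the fix in place, the reported cache size should stay in the low megabytes on every iteration rather than climbing toward the ~70 GB figure described above.

Checklist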