Conversation

@AbrahamSanders (Contributor) commented Sep 19, 2025

Hey, thanks for the great work!

This PR addresses #8 and #21 by adding a streaming API to the VoxCPM and VoxCPMModel classes.

model.generate continues to return the complete audio, while model.generate_streaming returns a generator that yields a single 0.08s chunk at each generation step.

To keep streaming smooth, the last 3 latents are passed to the audio VAE for context and only the last 1280-sample chunk is returned. Three latents seemed to be the minimum context needed to remove audible artifacts from the streamed audio (I determined this by ear; feel free to adjust if you feel it's not enough).

import numpy as np
import soundfile as sf
from voxcpm import VoxCPM
from transformers.trainer_utils import set_seed

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
text = "Streaming text to speech is easy with VoxCPM!"

# Non-streaming
set_seed(42)
final = model.generate(text)
sf.write("output.wav", final, model.tts_model.sample_rate)

# Streaming
set_seed(42)
chunks = []
for chunk in model.generate_streaming(text):
    chunks.append(chunk)
final = np.concatenate(chunks)
sf.write("output_streaming.wav", final, model.tts_model.sample_rate)

The two results should be audibly identical!
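
For reference, a minimal sketch of the chunk-decoding idea described above; the helper name and argument are mine, not the actual VoxCPM internals:

import torch

def decode_streaming_chunk(audio_vae, pred_feat_seq, chunk_samples=1280):
    # Decode the last 3 predicted latents together so the VAE has enough
    # context to avoid audible artifacts at chunk boundaries...
    context = torch.cat(pred_feat_seq[-3:], dim=1)
    audio = audio_vae.decode(context.to(torch.float32))
    # ...but keep only the 0.08s of samples produced by the newest latent.
    return audio[..., -chunk_samples:].squeeze(1).cpu()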

Copilot AI left a comment


Pull Request Overview

This PR adds a streaming API to the VoxCPM text-to-speech model, allowing users to receive audio chunks incrementally during generation instead of waiting for the complete audio. The streaming implementation maintains audio quality by using overlapping context between chunks.

  • Introduces generate_streaming methods that yield audio chunks as generators
  • Refactors existing generation methods to use internal _generate implementations that support both streaming and non-streaming modes (see the sketch after this list)
  • Adds streaming support across the entire inference pipeline from the core VoxCPM class down to the model internals
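
A rough, self-contained sketch of that pattern (toy class; VoxCPM's actual class names, signatures, and inference loop differ):

import numpy as np

class StreamingTTS:
    def _generate(self, text: str):
        # Internal generator shared by both public methods; each step yields
        # one small audio chunk (silence here, purely for illustration).
        for _ in text.split():
            yield np.zeros(1280, dtype=np.float32)

    def generate_streaming(self, text: str):
        # Streaming mode: hand chunks to the caller as they are produced.
        yield from self._generate(text)

    def generate(self, text: str):
        # Non-streaming mode: drain the generator and return the full waveform.
        return np.concatenate(list(self._generate(text)))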

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

Changed files:
  • src/voxcpm/model/voxcpm.py: refactors generation and inference methods to support streaming; adds context management for smooth audio chunks
  • src/voxcpm/core.py: updates the VoxCPM class to expose the streaming API and handle numpy array conversion for streaming chunks
  • README.md: adds documentation and example usage for the new streaming functionality


def optimize(self, disable: bool = False):
    try:
        if disable:
            raise ValueError("Optimization disabled by user")

Copilot AI Sep 22, 2025


The error message 'Optimization disabled by user' is misleading: this isn't an error condition but expected behavior when disable=True. Consider removing the artificial ValueError and simply returning early, or using a more descriptive message like 'Optimization skipped as requested'.

Suggested change:
-            raise ValueError("Optimization disabled by user")
+            print("Optimization skipped as requested")
+            return self
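
The early-return variant the comment describes would look roughly like this (a sketch; the actual optimization body is elided):

def optimize(self, disable: bool = False):
    # Skip optimization explicitly instead of raising an artificial error.
    if disable:
        print("Optimization skipped as requested")
        return self
    try:
        ...  # real optimization of the model goes here
    except Exception as e:
        print(f"Optimization failed, running unoptimized: {e}")
    return self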

Comment on lines +634 to +635
# return the last three predicted latent features to provide enough context for smooth decoding
pred_feat_chunk = torch.cat(pred_feat_seq[-3:], dim=1)

Copilot AI Sep 22, 2025


The magic number 3 for context features is hardcoded but mentioned in the PR description as adjustable. Consider making this a configurable parameter or at least extracting it to a named constant like STREAMING_CONTEXT_SIZE = 3 for better maintainability.
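
Applying that suggestion would look roughly like this (constant name taken from the comment above):

# Number of trailing latents the audio VAE needs for artifact-free decoding.
STREAMING_CONTEXT_SIZE = 3

pred_feat_chunk = torch.cat(pred_feat_seq[-STREAMING_CONTEXT_SIZE:], dim=1)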

patch_len = self.patch_size * self.chunk_size
for latent_pred, _ in inference_result:
    decode_audio = self.audio_vae.decode(latent_pred.to(torch.float32))
    decode_audio = decode_audio[..., -patch_len:].squeeze(1).cpu()

Copilot AI Sep 22, 2025


The patch_len calculation and slicing logic [..., -patch_len:] is duplicated in multiple places (lines 296, 509). Consider extracting this into a helper method to reduce code duplication and improve maintainability.
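
One possible helper for the duplicated slicing logic (method name is hypothetical, not from the PR):

def _decode_last_patch(self, latent_pred: torch.Tensor) -> torch.Tensor:
    # Decode a latent sequence and keep only the newest patch of samples,
    # mirroring the slicing done at the call sites mentioned above.
    patch_len = self.patch_size * self.chunk_size
    audio = self.audio_vae.decode(latent_pred.to(torch.float32))
    return audio[..., -patch_len:].squeeze(1).cpu()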

zipenhancer_model_path: ModelScope acoustic noise suppression model
    id or local path. If None, denoiser will not be initialized.
enable_denoiser: Whether to initialize the denoiser pipeline.
optimize: Whether to optimize the model with torch.compile. True by default, but can be disabled for debugging.

Copilot AI Sep 22, 2025


The parameter description mentions 'torch.compile' but the actual optimization logic in the model uses different mechanisms (torch_tensorrt, torch2trt). The documentation should accurately describe what optimization actually does or be more generic.

Suggested change:
-    optimize: Whether to optimize the model with torch.compile. True by default, but can be disabled for debugging.
+    optimize: Whether to optimize the model for inference speed. True by default, but can be disabled for debugging.

@liuxin99 merged commit b0714ad into OpenBMB:main on Sep 22, 2025
@dignome mentioned this pull request on Oct 12, 2025