Add a streaming API for VoxCPM #26
Conversation
Pull Request Overview
This PR adds a streaming API to the VoxCPM text-to-speech model, allowing users to receive audio chunks incrementally during generation instead of waiting for the complete audio. The streaming implementation maintains audio quality by using overlapping context between chunks.
- Introduces `generate_streaming` methods that yield audio chunks as generators (a minimal usage sketch follows this list)
- Refactors existing generation methods to use internal `_generate` implementations that support both streaming and non-streaming modes
- Adds streaming support across the entire inference pipeline, from the core VoxCPM class down to the model internals
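For orientation, here is a minimal sketch of how the streaming API might be consumed. The import path, `from_pretrained` loader, and model id are assumptions based on typical VoxCPM usage rather than code from this PR; the 16 kHz sample rate follows from the 1280-sample / 0.08 s chunks described in the PR discussion below.

```python
import numpy as np
import soundfile as sf

from voxcpm import VoxCPM  # import path assumed

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")  # loader / model id assumed

text = "Streaming lets playback start before generation finishes."

# Non-streaming: one complete waveform, as before.
full_audio = model.generate(text=text)

# Streaming: consume small chunks as soon as they are produced.
chunks = []
for chunk in model.generate_streaming(text=text):
    chunks.append(chunk)  # each chunk is a short numpy array of samples
    # ...or feed the chunk straight to an audio player / network stream here

sf.write("streamed.wav", np.concatenate(chunks), 16000)  # 1280 samples per 0.08 s => 16 kHz
```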
 
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.
| File | Description | 
|---|---|
| src/voxcpm/model/voxcpm.py | Refactors generation and inference methods to support streaming, adds context management for smooth audio chunks | 
| src/voxcpm/core.py | Updates VoxCPM class to expose streaming API and handle numpy array conversion for streaming chunks | 
| README.md | Adds documentation and example usage for the new streaming functionality | 
```python
def optimize(self, disable: bool = False):
    try:
        if disable:
            raise ValueError("Optimization disabled by user")
```
    
      
    
Copilot AI · Sep 22, 2025
The error message 'Optimization disabled by user' is misleading: this isn't an error condition but expected behavior when `disable=True`. Consider removing the artificial `ValueError` and simply returning early, or using a more descriptive message like 'Optimization skipped as requested'.
```diff
-            raise ValueError("Optimization disabled by user")
+            print("Optimization skipped as requested")
+            return self
```
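For completeness, a sketch of the early-return variant the comment suggests; the body of the `try` block and the exception handling are assumptions about the surrounding method, not code from this PR:

```python
def optimize(self, disable: bool = False):
    # Skipping optimization is expected behavior, not an error, so return
    # early instead of raising and catching an artificial ValueError.
    if disable:
        print("Optimization skipped as requested")
        return self
    try:
        # actual optimization logic goes here, unchanged from the current code
        ...
    except Exception as e:
        print(f"Optimization failed, continuing with the unoptimized model: {e}")
    return self
```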
```python
# return the last three predicted latent features to provide enough context for smooth decoding
pred_feat_chunk = torch.cat(pred_feat_seq[-3:], dim=1)
```
    
      
    
Copilot AI · Sep 22, 2025
The magic number 3 for context features is hardcoded but mentioned in the PR description as adjustable. Consider making this a configurable parameter or at least extracting it to a named constant like STREAMING_CONTEXT_SIZE = 3 for better maintainability.
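A minimal sketch of the named-constant approach; the name `STREAMING_CONTEXT_SIZE` is the reviewer's suggestion, and its placement at module level is an assumption:

```python
# Number of trailing latent frames fed to the audio VAE as decoding context.
STREAMING_CONTEXT_SIZE = 3

# ...inside the generation loop, replacing the hardcoded slice above:
pred_feat_chunk = torch.cat(pred_feat_seq[-STREAMING_CONTEXT_SIZE:], dim=1)
```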
```python
patch_len = self.patch_size * self.chunk_size
for latent_pred, _ in inference_result:
    decode_audio = self.audio_vae.decode(latent_pred.to(torch.float32))
    decode_audio = decode_audio[..., -patch_len:].squeeze(1).cpu()
```
    
      
    
Copilot AI · Sep 22, 2025
The patch_len calculation and slicing logic [..., -patch_len:] is duplicated in multiple places (lines 296, 509). Consider extracting this into a helper method to reduce code duplication and improve maintainability.
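One possible shape for such a helper; the method name and its placement on the model class are illustrative, not part of the PR:

```python
def _decode_audio_tail(self, latent_pred: torch.Tensor) -> torch.Tensor:
    """Decode a latent chunk and keep only the newest patch of samples.

    Centralizes the patch_len slicing currently duplicated at both call sites.
    """
    patch_len = self.patch_size * self.chunk_size
    decoded = self.audio_vae.decode(latent_pred.to(torch.float32))
    return decoded[..., -patch_len:].squeeze(1).cpu()
```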
```
zipenhancer_model_path: ModelScope acoustic noise suppression model
    id or local path. If None, denoiser will not be initialized.
enable_denoiser: Whether to initialize the denoiser pipeline.
optimize: Whether to optimize the model with torch.compile. True by default, but can be disabled for debugging.
```
    
      
    
Copilot AI · Sep 22, 2025
The parameter description mentions 'torch.compile' but the actual optimization logic in the model uses different mechanisms (torch_tensorrt, torch2trt). The documentation should accurately describe what optimization actually does or be more generic.
```diff
-    optimize: Whether to optimize the model with torch.compile. True by default, but can be disabled for debugging.
+    optimize: Whether to optimize the model for inference speed. True by default, but can be disabled for debugging.
```
Hey, thanks for the great work!
This PR addresses #8, #21 by adding a streaming API to the VoxCPM and VoxCPMModel classes.
`model.generate` continues to output a complete audio clip, while `model.generate_streaming` yields a generator that returns a single 0.08 s chunk at every generation step. To ensure streaming is smooth, the last 3 latents are passed to the audio VAE for context and only the last 1280-sample chunk is returned. 3 latents seemed to be the minimum context needed to remove audible artifacts from the streaming audio (I determined this by ear; feel free to adjust if you feel it's not enough).
The two results should be audibly identical!
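As a quick way to compare the two outputs by ear, here is a sketch that saves both; the loader call and sample rate are assumptions (16 kHz follows from 1280 samples per 0.08 s), and with sampling enabled the two runs are expected to be perceptually, not bit-for-bit, identical:

```python
import numpy as np
import soundfile as sf

from voxcpm import VoxCPM  # import path assumed

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")  # loader / model id assumed
text = "The streaming and non-streaming outputs should sound the same."

full = model.generate(text=text)
streamed = np.concatenate(list(model.generate_streaming(text=text)))

sf.write("full.wav", full, 16000)
sf.write("streamed.wav", streamed, 16000)
```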