
Conversation

subhankar-ghosh
Collaborator

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Adds a streaming inference algorithm to MagpieTTS.

Collection: TTS

  • Add specific line-by-line info of high-level changes in this PR.

Usage

  • Example invocation:
python scripts/magpietts/infer_and_evaluate_streaming.py \
--checkpoint_files ${CKPT} \
--hparams_files ${HPARAM} \
--codecmodel_path ${CODEC} \
--out_dir ${OUT_DIR} \
--datasets ${DATASET} \
--use_cfg \
--disable_fcd \
--apply_attention_prior

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

subhankar-ghosh and others added 19 commits July 30, 2025 20:12
Signed-off-by: subhankar-ghosh <[email protected]>
Signed-off-by: subhankar-ghosh <[email protected]>
Signed-off-by: subhankar-ghosh <[email protected]>
Signed-off-by: subhankar-ghosh <[email protected]>
subhankar-ghosh and others added 4 commits August 25, 2025 08:22
Signed-off-by: subhankar-ghosh <[email protected]>
Signed-off-by: subhankar-ghosh <[email protected]>
Signed-off-by: subhankar-ghosh <[email protected]>
Signed-off-by: subhankar-ghosh <[email protected]>
@github-actions github-actions bot removed the common label Aug 25, 2025
Signed-off-by: subhankar-ghosh <[email protected]>
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds streaming inference functionality to MagpieTTS, enabling real-time text-to-speech generation by processing text input incrementally rather than all at once. The streaming algorithm maintains sliding windows for both text and audio history while managing attention priors to ensure coherent audio generation across text chunks.
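The sliding-window mechanics described above can be sketched as follows. This is a minimal illustration of the idea, not the PR's actual implementation: the names (stream_tts, generate_frames), the window sizes, and the decode step are all hypothetical.

```python
from collections import deque

def stream_tts(text_chunks, generate_frames, text_window=64, audio_window=256):
    """Illustrative sliding-window streaming loop (hypothetical names).

    text_chunks:     iterable of token lists arriving incrementally
    generate_frames: stand-in for the model's decode step; takes the
                     current text window, the audio history, and a
                     left_offset counting text tokens dropped so far
    """
    text_buf = deque(maxlen=text_window)    # sliding text window
    audio_buf = deque(maxlen=audio_window)  # sliding audio history
    left_offset = 0                         # text tokens dropped from the left

    for chunk in text_chunks:
        for tok in chunk:
            if len(text_buf) == text_buf.maxlen:
                left_offset += 1            # window slides right by one token
            text_buf.append(tok)
        # Decode audio for the currently visible text, conditioned on history;
        # left_offset lets the model map window positions back to global ones.
        frames = generate_frames(list(text_buf), list(audio_buf), left_offset)
        audio_buf.extend(frames)
        yield frames
```

A caller would iterate over the generator and play each yielded frame batch as it arrives, which is what makes the generation incremental rather than all-at-once.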

Key Changes:

  • Implementation of streaming inference algorithm with windowing mechanisms for text and audio tokens
  • Addition of specialized attention prior handling for streaming mode with exponential weight support
  • Extraction of common argument parsing functionality to support both streaming and non-streaming inference scripts
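The "attention prior with exponential weight support" is only described at a high level here. One plausible construction, assuming a near-diagonal text-audio alignment whose diagonal is shifted by left_offset, might look like the sketch below; the function name, the decay scheme, and the normalization are assumptions, not the PR's code.

```python
import numpy as np

def streaming_attention_prior(n_audio, n_text, left_offset=0, decay=0.9):
    """Hypothetical exponential attention prior for streaming decoding.

    The weight falls off exponentially with distance from the expected
    near-diagonal alignment; left_offset shifts the diagonal to account
    for text tokens already dropped from the sliding window.
    """
    audio_pos = np.arange(n_audio)[:, None]   # shape (n_audio, 1)
    text_pos = np.arange(n_text)[None, :]     # shape (1, n_text)
    # Expected global text index for each audio frame.
    expected = audio_pos * (n_text + left_offset) / max(n_audio, 1)
    # Distance of each (frame, token) pair from the expected alignment.
    dist = np.abs((text_pos + left_offset) - expected)
    prior = decay ** dist                     # exponential falloff
    return prior / prior.sum(axis=1, keepdims=True)  # per-frame normalization
```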

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

  • scripts/magpietts/infer_and_evaluate_streaming.py — New streaming inference script with chunked text processing and windowed generation
  • nemo/collections/tts/models/magpietts.py — Core streaming methods, including windowed text processing and streaming-specific attention prior construction
  • scripts/magpietts/infer_and_evaluate.py — Refactored to extract common argument-parsing logic; removed combined violin-plot functionality
  • scripts/magpietts/README.md — Added documentation and a usage example for the new streaming inference capability


Comment on lines +1979 to +1980
It also uses a end_of_text flag to indicate whether the text has ended.
It also uses a left_offset to account for the fact that the text is not provided in a single chunk.

Copilot AI Aug 25, 2025


The docstring repeats the explanation about left_offset twice. The last two sentences are redundant and should be consolidated.

Suggested change:
- It also uses a end_of_text flag to indicate whether the text has ended.
- It also uses a left_offset to account for the fact that the text is not provided in a single chunk.
+ It also uses an end_of_text flag to indicate whether the text has ended.


@rfejgin
Collaborator

rfejgin commented Aug 27, 2025

Hi Subhankar, nice work!

I haven't dived into all the details, but here are a few things that come to mind:

  1. There seems to be some code duplication between the streaming and non-streaming versions. I wonder if it will become hard to maintain over time. Specifically, in:
  • infer_and_evaluate_streaming.py vs infer_and_evaluate.py — things like setting up the checkpoint name, logging of metrics, etc. I do see that it reuses certain functions from infer_and_evaluate.py, but maybe there is more commonality to extract?
  • in magpietts.py does construct_streaming_inference_prior() have major differences (that we actively use) from construct_streaming_prior() aside from including the offset?

I know that there is a tradeoff between eliminating code duplication vs making unified code overly complex, but maybe the above are worth another look?

  2. A README or a pointer to documentation on the design of the streaming algorithm would be of interest, since it's non-trivial.
  3. More of a minor point, but I wonder if logic that needs model-specific details, like creating a BOS token, would be better placed inside the MagpieTTSModel class (accessible to the external script via some API). That way it would also be easier to reuse when folks call infer_batch() directly (not through infer_and_evaluate_streaming.py), which is what I believe they do in Riva.
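That last point could look something like the sketch below. The class body, the method name make_bos_frame, and the token value are invented for illustration; this is not the actual MagpieTTSModel API.

```python
# Hypothetical refactor: keep token-layout details inside the model so that
# external scripts (and direct infer_batch() callers) never build BOS by hand.
class MagpieTTSModel:
    BOS_TOKEN_ID = 0  # assumed value, for illustration only

    def make_bos_frame(self, batch_size: int):
        """Return a batch of BOS frames; callers need no token-layout knowledge."""
        return [[self.BOS_TOKEN_ID] for _ in range(batch_size)]

# A caller would then write:
#     bos = model.make_bos_frame(batch_size)
# instead of constructing the BOS token from model internals.
```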
