Improve the responsiveness of voices by trimming the leading silence #17614

@gexgd0419

Description

Is your feature request related to a problem? Please describe.

This is related to #13284.

#17592 closed that issue by making SAPI5 voices output audio via WASAPI. That improved responsiveness, but we can go further by trimming the leading silence from each utterance.

Take Microsoft Zira Desktop (SAPI5) as an example. When speaking at 1X speed, the leading silence is 100 ms long; at the maximum rate (3X speed), it shrinks to about 30 ms. Removing that silence would make the voice respond noticeably faster.

Other voices, such as OneCore voices, also have a few milliseconds of leading silence.

Describe the solution you'd like

We can detect and remove the leading silence in WavePlayer, either on the Python side or on the C++ side. Since eSpeak, OneCore, and SAPI5 (plus MSSP) voices all use WavePlayer now, they would all benefit. The synthesizer may need to tell WavePlayer when the audio starts and ends, so that WavePlayer can locate the leading silence more easily.
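The detection step could be as simple as scanning each PCM buffer for the first sample whose amplitude exceeds a noise threshold. A minimal sketch, assuming little-endian signed 16-bit PCM (the function name and the threshold value are illustrative, not part of any NVDA API):

```python
import struct


def find_leading_silence(data: bytes, threshold: int = 100, sample_width: int = 2) -> int:
    """Return the byte offset of the first sample whose absolute amplitude
    exceeds ``threshold``, or ``len(data)`` if the whole buffer is silent.

    Assumes little-endian signed 16-bit PCM when ``sample_width`` is 2.
    """
    count = len(data) // sample_width
    samples = struct.unpack(f"<{count}h", data[: count * sample_width])
    for i, sample in enumerate(samples):
        if abs(sample) > threshold:
            return i * sample_width
    return len(data)
```

WavePlayer could then skip `data[:offset]` on the first buffer of an utterance. A nonzero threshold matters because "silence" from a synthesizer is often low-level noise rather than exact zeros.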

Describe alternatives you've considered

Create a stand-alone module for detecting and trimming leading silence, in either Python or C++. Synthesizers would pass their audio data through this module before feeding it to WavePlayer.
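Because synthesizers deliver audio in chunks, such a module would need to keep state across calls: it trims until it sees the first audible sample, then passes everything through unchanged. A sketch of that idea, assuming little-endian signed 16-bit mono PCM (the class name, threshold, and `reset` hook are all hypothetical):

```python
import struct


class LeadingSilenceTrimmer:
    """Stateful filter that drops leading silence from a stream of PCM
    chunks, then passes all later audio through unchanged.
    Assumes little-endian signed 16-bit mono PCM.
    """

    def __init__(self, threshold: int = 100):
        self.threshold = threshold  # amplitudes at or below this count as silence
        self.trimming = True  # True until the first audible sample is seen

    def process(self, chunk: bytes) -> bytes:
        if not self.trimming:
            return chunk  # past the leading silence: pass through untouched
        count = len(chunk) // 2
        samples = struct.unpack(f"<{count}h", chunk[: count * 2])
        for i, sample in enumerate(samples):
            if abs(sample) > self.threshold:
                self.trimming = False
                return chunk[i * 2 :]  # keep from the first audible sample on
        return b""  # chunk is entirely silent: drop it

    def reset(self) -> None:
        """Call at the start of each new utterance to trim again."""
        self.trimming = True
```

A synthesizer would call `reset()` per utterance and route each chunk through `process()` before handing it to WavePlayer, so the trimming logic stays independent of any one synthesizer driver.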

Additional context

I'm not sure which of these approaches is best for implementing this.

Metadata

Labels: component/speech, p4 (priority), performance, triaged (has been triaged, issue is waiting for implementation)
