Description
Is your feature request related to a problem? Please describe.
This is related to #13284.
#17592 closed that issue by making SAPI5 voices output via WASAPI. This did improve responsiveness, but we can improve it even further by removing the leading silence.
Take Microsoft Zira Desktop (SAPI5) as an example. When speaking at 1X speed, the leading silence is 100 ms long. When speaking at its maximum rate (3X speed), the leading silence becomes about 30 ms long. If we can remove the leading silence, the voice will respond even faster.
Other voices, such as OneCore voices, also have a few milliseconds of leading silence.
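For reference, leading silence like this can be measured offline. Below is a rough Python sketch (not NVDA code) that counts the initial near-zero samples of a saved WAV file; the 16-bit PCM assumption and the threshold value are mine, not something the synthesizers report.

```python
import array
import wave

SILENCE_THRESHOLD = 330  # ~1% of int16 full scale; an arbitrary tuning value

def leading_silence_ms(path: str) -> float:
    """Return the duration of the leading silence in a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "sketch assumes 16-bit PCM"
        rate = wav.getframerate()
        channels = wav.getnchannels()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    # Find the first sample whose amplitude exceeds the threshold.
    for i, sample in enumerate(samples):
        if abs(sample) > SILENCE_THRESHOLD:
            return (i // channels) / rate * 1000.0
    return len(samples) / channels / rate * 1000.0  # the whole file is silent

# Example: print(leading_silence_ms("zira_1x.wav"))
```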
Describe the solution you'd like
We can detect and remove the silent part of the audio in WavePlayer, either in the Python part or in the C++ part. As eSpeak, OneCore and SAPI5 (plus MSSP) all use WavePlayer now, they can all benefit from this. The synthesizer may need to tell WavePlayer when the audio will start or end, so that WavePlayer can locate the "leading silence" part more easily.
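A minimal Python sketch of the in-WavePlayer idea, assuming 16-bit PCM chunks containing whole samples and an arbitrary amplitude threshold; it only illustrates the per-utterance state that would be needed, not a proposed implementation (the real change might live in nvwave or in the C++ WASAPI backend, and would also have to handle other sample formats and trailing silence):

```python
import array

class _LeadingSilenceTrimmer:
    """Sketch: drop audio until the first non-silent sample, then pass everything through."""

    def __init__(self, threshold: int = 330):  # threshold is a guess, not from this issue
        self.threshold = threshold
        self.audioStarted = False

    def process(self, data: bytes) -> bytes:
        """Process one 16-bit PCM chunk; assumes len(data) is a multiple of the sample size."""
        if self.audioStarted:
            return data
        samples = array.array("h", data)
        for i, sample in enumerate(samples):
            if abs(sample) > self.threshold:
                self.audioStarted = True
                return samples[i:].tobytes()
        return b""  # the whole chunk is silence; drop it
```

WavePlayer would reset this state whenever a new utterance starts (for example around stop()/idle()), and a synthesizer-provided hint about where the audio starts could let it skip the scan entirely.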
Describe alternatives you've considered
Create a stand-alone module for detecting and removing the silent part of the audio, either in Python or in C++. The synthesizers would pass the audio data through this module before feeding it to WavePlayer.
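If the stand-alone route were taken, driver-side usage might look roughly like this; `LeadingSilenceTrimmer` is the illustrative class from the sketch above, not an existing NVDA API, and only `WavePlayer.feed` exists today.

```python
# Hypothetical synth-driver-side usage of a stand-alone trimmer module.
trimmer = LeadingSilenceTrimmer()  # reset or recreate at the start of each utterance

def onAudioChunk(data: bytes):
    trimmed = trimmer.process(data)
    if trimmed:
        player.feed(trimmed)  # player is an existing nvwave.WavePlayer instance
```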
Additional context
I'm not sure which of these approaches would be best to implement.