Commit 1dc0f76

Add leading silence detection and removal logic (#17648)
Closes #17614

Summary of the issue:
Some voices output leading silence before the actual speech. Removing that silence shortens the delay between a keypress and the user hearing the audio, which makes the voices more responsive.

Description of user facing changes
Users may find the voices more responsive. All voices using NVDA's WavePlayer will be affected, including eSpeak-NG, OneCore, SAPI5, and some third-party voice add-ons.

This should only affect leading silence. Silence between sentences or at punctuation marks is not changed, but this may depend on how the voice uses WavePlayer.

Description of development approach
I wrote a header-only library, silenceDetect.h, in nvdaHelper/local. It supports most wave formats (8/16/24/32-bit integer and 32/64-bit float wave) and uses a simple algorithm: check each sample to see if it is outside the threshold range (currently hard-coded to +/- 1/2^10 or 0.0009765625). It relies on templates and requires the C++20 standard.

The WasapiPlayer in wasapi.cpp is updated to handle silence. A new member function, startTrimmingLeadingSilence, and its exported version, wasPlay_startTrimmingLeadingSilence, are added to set or clear the isTrimmingLeadingSilence flag. If isTrimmingLeadingSilence is true, the next chunk fed in will have its leading silence removed. When non-silence is detected, isTrimmingLeadingSilence is reset to false. So every time a new utterance is about to be spoken, startTrimmingLeadingSilence should be called.

In nvwave.py, startTrimmingLeadingSilence() will be called when:
- the player is initialized;
- the player is stopped;
- idle is called;
- _idleCheck determines that the player is idle.

Voices usually call idle when an utterance is completed so that audio ducking works correctly, so idle is used here to mark the starting point of the next utterance. If a voice doesn't use idle this way, this logic may not work correctly. As long as the synthesizer uses idle as intended, the synthesizer's code doesn't need to be modified to benefit from this feature.

As leading silence can also be introduced by a BreakCommand at the beginning of the speech sequence, WavePlayer checks the speech sequence first; if there is a BreakCommand at the beginning, the leading silence is not trimmed for the current utterance. To check the exact speech sequence that is about to be spoken, a new extension point, pre_synthSpeak, is added in synthDriverHandler. It is invoked just before SpeechManager calls getSynth().speak(). The existing pre_speech is called before SpeechManager processes and queues the speech sequence, so pre_synthSpeak is needed to provide a more accurate sequence.

When the purpose of a WavePlayer is not SPEECH, it does not trim the leading silence by default, because of the way playWaveFile works (it calls idle after every chunk). Users of WavePlayer can still enable or disable automatic trimming by calling enableTrimmingLeadingSilence, or initiate trimming manually for the next audio section by calling startTrimmingLeadingSilence.

Other approaches that may be worth considering (but have not been implemented):
- Put silence detection/removal logic in a separate module instead of in WavePlayer. The drawback is that every voice synthesizer module would need to be modified to use the separate module.
- Use another audio library, such as PyDub, to detect/remove silence.
- Add a setting item to turn this feature on or off.
- Add a public API function to allow a synthesizer to opt out of this feature.
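The leading-BreakCommand check described in the development approach can be sketched roughly as follows. This is a minimal, self-contained illustration and not NVDA's actual code: `BreakCommand` and `OtherCommand` are stand-in dataclasses so the snippet runs without NVDA, and `starts_with_break` and `should_trim` are hypothetical helper names. It assumes that scanning can stop at the first text item, since any silence before that text would come from the voice itself.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class BreakCommand:
    """Stand-in for a speech command that inserts a pause (hypothetical)."""
    time: int = 0  # pause length in milliseconds


@dataclass
class OtherCommand:
    """Stand-in for any other non-text speech command (hypothetical)."""
    name: str = ""


SpeechItem = Union[str, BreakCommand, OtherCommand]


def starts_with_break(sequence: Sequence[SpeechItem]) -> bool:
    """Return True if a break occurs before any text in the sequence."""
    for item in sequence:
        if isinstance(item, BreakCommand):
            return True
        if isinstance(item, str):
            # Text comes first, so any leading silence is the voice's own.
            return False
    # Neither text nor a break was found.
    return False


def should_trim(sequence: Sequence[SpeechItem]) -> bool:
    """Trim leading silence only when it was not requested by a break."""
    return not starts_with_break(sequence)
```

In this sketch, a sequence such as `[BreakCommand(200), "hello"]` keeps its leading silence, while `["hello", BreakCommand(200), "world"]` is eligible for trimming because the break is not at the start.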
1 parent ffd1cf5 commit 1dc0f76

11 files changed: +327 -2 lines changed

nvdaHelper/local/nvdaHelperLocal.def

Lines changed: 1 addition & 0 deletions
@@ -82,6 +82,7 @@ EXPORTS
 	wasPlay_pause
 	wasPlay_resume
 	wasPlay_setChannelVolume
+	wasPlay_startTrimmingLeadingSilence
 	wasPlay_startup
 	wasSilence_init
 	wasSilence_playFor

nvdaHelper/local/silenceDetect.h

Lines changed: 219 additions & 0 deletions
@@ -0,0 +1,219 @@
+// A part of NonVisual Desktop Access (NVDA)
+// This file is covered by the GNU General Public License.
+// See the file COPYING for more details.
+// Copyright (C) 2025 NV Access Limited, gexgd0419
+
+#ifndef SILENCEDETECT_H
+#define SILENCEDETECT_H
+
+#include <windows.h>
+#include <mmreg.h>
+#include <stdint.h>
+#include <type_traits>
+#include <limits>
+
+namespace SilenceDetect {
+
+/**
+ * Compile-time wave format tag.
+ * Supports integer and floating-point formats.
+ * `SampleType` should be the smallest numeric type that can hold a sample, for example, 32-bit int for 24-bit format.
+ * Signedness of `SampleType` matters. For unsigned types, the zero point is at middle, e.g. 128 for 8-bit unsigned.
+ * `bytesPerSample` should be <= `sizeof(SampleType)` for integer formats,
+ * and == `sizeof(SampleType)` for floating-point formats.
+ * Assumes C++20 standard.
+ */
+template <typename SampleType, size_t bytesPerSample = sizeof(SampleType)>
+struct WaveFormat {
+	static_assert(std::is_arithmetic_v<SampleType>, "SampleType should be an integer or floating-point type");
+	static_assert(!(std::is_floating_point_v<SampleType> && bytesPerSample != sizeof(SampleType)),
+		"When SampleType is a floating-point type, bytesPerSample should be equal to sizeof(SampleType)");
+	static_assert(!(std::is_integral_v<SampleType> && !(bytesPerSample <= sizeof(SampleType) && bytesPerSample > 0)),
+		"When SampleType is an integer type, bytesPerSample should be less than or equal to sizeof(SampleType) and greater than 0");
+
+	typedef SampleType SampleType;
+	static constexpr size_t bytesPerSample = bytesPerSample;
+
+	static constexpr SampleType zeroPoint() {
+		// for unsigned types, zero point is at middle
+		// for signed types, zero is zero
+		if constexpr (std::is_unsigned_v<SampleType>)
+			return SampleType(1) << (bytesPerSample * 8 - 1);
+		else
+			return SampleType();
+	}
+
+	static constexpr SampleType (max)() {
+		if constexpr (std::is_floating_point_v<SampleType>) {
+			// For floating-point samples, maximum value is 1.0
+			return SampleType(1);
+		} else {
+			// Trim the maximum value to `bytesPerSample` bytes
+			return (std::numeric_limits<SampleType>::max)() >> ((sizeof(SampleType) - bytesPerSample) * 8);
+		}
+	}
+
+	static constexpr SampleType (min)() {
+		if constexpr (std::is_floating_point_v<SampleType>) {
+			// For floating-point samples, minimum value is -1.0
+			return SampleType(-1);
+		} else {
+			// Trim the minimum value to `bytesPerSample` bytes
+			return (std::numeric_limits<SampleType>::min)() >> ((sizeof(SampleType) - bytesPerSample) * 8);
+		}
+	}
+
+	static constexpr SampleType defaultThreshold() {
+		// Default threshold: 1 / 2^10 or 0.0009765625
+		if constexpr (std::is_floating_point_v<SampleType>)
+			return SampleType(1) / (1 << 10);
+		else if constexpr (bytesPerSample * 8 > 10)
+			return SampleType(1) << (bytesPerSample * 8 - 10);
+		else
+			return SampleType();
+	}
+
+	static constexpr auto toSigned(SampleType smp) {
+		if constexpr (std::is_integral_v<SampleType>) {
+			// In C++20, signed integer types must use two's complement,
+			// so the following conversion is well-defined.
+			using SignedType = std::make_signed_t<SampleType>;
+			return SignedType(smp - zeroPoint());
+		} else {
+			return smp;
+		}
+	}
+
+	static constexpr SampleType fromSigned(SampleType smp) {
+		if constexpr (std::is_integral_v<SampleType>) {
+			// Signed overflow is undefined behavior,
+			// so convert to unsigned first.
+			using UnsignedType = std::make_unsigned_t<SampleType>;
+			return SampleType(UnsignedType(smp) + zeroPoint());
+		} else {
+			return smp;
+		}
+	}
+
+	static constexpr SampleType signExtend(SampleType smp) {
+		if constexpr (std::is_unsigned_v<SampleType> || bytesPerSample == sizeof(SampleType)) {
+			return smp;
+		} else {
+			constexpr auto shift = (sizeof(SampleType) - bytesPerSample) * 8;
+			// Convert to unsigned first to prevent left-shifting negative numbers
+			using UnsignedType = std::make_unsigned_t<SampleType>;
+			return SampleType(UnsignedType(smp) << shift) >> shift;
+		}
+	}
+};
+
+inline WORD getFormatTag(const WAVEFORMATEX* wfx) {
+	if (wfx->wFormatTag == WAVE_FORMAT_EXTENSIBLE) {
+		auto wfext = reinterpret_cast<const WAVEFORMATEXTENSIBLE*>(wfx);
+		if (IS_VALID_WAVEFORMATEX_GUID(&wfext->SubFormat))
+			return EXTRACT_WAVEFORMATEX_ID(&wfext->SubFormat);
+	}
+	return wfx->wFormatTag;
+}
+
+/**
+ * Return the leading silence wave data length, in bytes.
+ * Assumes the wave data to be of one channel (mono).
+ * Uses a `WaveFormat` type (`Fmt`) to determine the wave format.
+ */
+template <class Fmt>
+size_t getLeadingSilenceSizeMono(
+	const unsigned char* waveData,
+	size_t size,
+	typename Fmt::SampleType threshold
+) {
+	using SampleType = Fmt::SampleType;
+	constexpr size_t bytesPerSample = Fmt::bytesPerSample;
+
+	if (size < bytesPerSample)
+		return 0;
+
+	constexpr SampleType zeroPoint = Fmt::zeroPoint();
+	const SampleType minValue = zeroPoint - threshold, maxValue = zeroPoint + threshold;
+
+	// Check each sample
+	SampleType smp = SampleType();
+	const unsigned char* const pEnd = waveData + (size - (size % bytesPerSample));
+	for (const unsigned char* p = waveData; p < pEnd; p += bytesPerSample) {
+		memcpy(&smp, p, bytesPerSample);
+		smp = Fmt::signExtend(smp);
+		// this sample is out of range, so the previous sample is the final sample of leading silence.
+		if (smp < minValue || smp > maxValue)
+			return p - waveData;
+	}
+
+	// The whole data block is silence
+	return size;
+}
+
+/**
+ * Invoke a functor with an argument of a WaveFormat type that corresponds to the specified WAVEFORMATEX.
+ * Return false if the WAVEFORMATEX is unknown.
+ */
+template <class Func>
+bool callByWaveFormat(const WAVEFORMATEX* wfx, Func&& func) {
+	switch (getFormatTag(wfx)) {
+	case WAVE_FORMAT_PCM:
+		switch (wfx->wBitsPerSample) {
+		case 8:  // 8-bits are unsigned, others are signed
+			func(WaveFormat<uint8_t>());
+			break;
+		case 16:
+			func(WaveFormat<int16_t>());
+			break;
+		case 24:
+			func(WaveFormat<int32_t, 3>());
+			break;
+		case 32:
+			func(WaveFormat<int32_t>());
+			break;
+		default:
+			return false;
+		}
+		break;
+	case WAVE_FORMAT_IEEE_FLOAT:
+		switch (wfx->wBitsPerSample) {
+		case 32:
+			func(WaveFormat<float>());
+			break;
+		case 64:
+			func(WaveFormat<double>());
+			break;
+		default:
+			return false;
+		}
+		break;
+	default:
+		return false;
+	}
+	return true;
+}
+
+/**
+ * Return the leading silence wave data length, in bytes.
+ * Uses a `WAVEFORMATEX` to determine the wave format.
+ */
+inline size_t getLeadingSilenceSize(
+	const WAVEFORMATEX* wfx,
+	const unsigned char* waveData,
+	size_t size
+) {
+	size_t len;
+	if (!callByWaveFormat(wfx, [=, &len](auto fmtTag) {
+		using Fmt = decltype(fmtTag);
+		len = getLeadingSilenceSizeMono<Fmt>(
+			waveData, size, Fmt::defaultThreshold());
+	}))
+		return 0;
+
+	return len - len % wfx->nBlockAlign;  // round down to block (channel) boundaries
+}
+
+}  // namespace SilenceDetect
+
+#endif  // SILENCEDETECT_H
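
To make the per-sample check concrete, here is a rough Python re-expression of the same idea for little-endian integer PCM. It is a sketch for illustration, not a port of the header: the function names `default_threshold` and `leading_silence_size` are hypothetical, floating-point formats are omitted, and 8-bit data is assumed to be unsigned with its zero point at 128, as the header's comments state.

```python
def default_threshold(bytes_per_sample: int) -> int:
    """Mirror the integer branch of defaultThreshold(): 1 << (bits - 10), or 0."""
    bits = bytes_per_sample * 8
    return 1 << (bits - 10) if bits > 10 else 0


def leading_silence_size(data: bytes, bytes_per_sample: int, block_align: int) -> int:
    """Return the number of leading bytes whose samples stay within the threshold."""
    unsigned = bytes_per_sample == 1  # 8-bit PCM is unsigned; wider PCM is signed
    zero_point = 128 if unsigned else 0
    threshold = default_threshold(bytes_per_sample)
    # Ignore a trailing partial sample, if any.
    usable = len(data) - len(data) % bytes_per_sample
    # If no loud sample is found, the whole block counts as silence.
    size = len(data)
    for offset in range(0, usable, bytes_per_sample):
        sample = int.from_bytes(
            data[offset:offset + bytes_per_sample], "little", signed=not unsigned
        )
        if abs(sample - zero_point) > threshold:
            size = offset
            break
    # Round down to a block (all-channels frame) boundary.
    return size - size % block_align
```

With these defaults, a 16-bit sample counts as silent when its magnitude is at most 64 and a 24-bit sample when it is at most 16384, while an 8-bit sample is silent only when it sits exactly at the zero point, because the integer threshold falls back to 0 for formats of 10 bits or fewer.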

nvdaHelper/local/wasapi.cpp

Lines changed: 25 additions & 0 deletions
@@ -24,6 +24,7 @@ This license can be found at:
 #include <mmdeviceapi.h>
 #include <common/log.h>
 #include <random>
+#include "silenceDetect.h"

 /**
  * Support for audio playback using WASAPI.
@@ -194,6 +195,8 @@ class WasapiPlayer {
 	HRESULT resume();
 	HRESULT setChannelVolume(unsigned int channel, float level);

+	void startTrimmingLeadingSilence(bool start);
+
 	private:
 	void maybeFireCallback();

@@ -245,6 +248,7 @@ class WasapiPlayer {
 	unsigned int defaultDeviceChangeCount;
 	unsigned int deviceStateChangeCount;
 	bool isUsingPreferredDevice = false;
+	bool isTrimmingLeadingSilence = false;
 };

 WasapiPlayer::WasapiPlayer(wchar_t* endpointId, WAVEFORMATEX format,
@@ -342,6 +346,19 @@ HRESULT WasapiPlayer::feed(unsigned char* data, unsigned int size,
 		return true;
 	};

+	if (isTrimmingLeadingSilence) {
+		size_t silenceSize = SilenceDetect::getLeadingSilenceSize(&format, data, size);
+		if (silenceSize >= size) {
+			// The whole chunk is silence. Continue checking for silence in the next chunk.
+			remainingFrames = 0;
+		} else {
+			// Silence ends in this chunk. Skip the silence and continue.
+			data += silenceSize;
+			remainingFrames = (size - silenceSize) / format.nBlockAlign;
+			isTrimmingLeadingSilence = false;  // Stop checking for silence
+		}
+	}
+
 	while (remainingFrames > 0) {
 		UINT32 paddingFrames;

@@ -643,6 +660,10 @@ HRESULT WasapiPlayer::setChannelVolume(unsigned int channel, float level) {
 	return volume->SetChannelVolume(channel, level);
 }

+void WasapiPlayer::startTrimmingLeadingSilence(bool start) {
+	isTrimmingLeadingSilence = start;
+}
+
 HRESULT WasapiPlayer::disableCommunicationDucking(IMMDevice* device) {
 	// Disable the default ducking experience used when a communication audio
 	// session is active, as we never want NVDA's audio to be ducked.
@@ -839,6 +860,10 @@ HRESULT wasPlay_setChannelVolume(
 	return player->setChannelVolume(channel, level);
 }

+void wasPlay_startTrimmingLeadingSilence(WasapiPlayer* player, bool start) {
+	player->startTrimmingLeadingSilence(start);
+}
+
 /**
  * This must be called once per session at startup before wasPlay_create is
  * called.
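
How the flag carries across successive `feed` calls can be illustrated with a small Python model. This is a sketch under stated assumptions, not the WASAPI code: `TrimmingPlayer` and its `played` list are hypothetical stand-ins for the real player and the audio device, and silence detection is simplified to 16-bit mono samples with a magnitude of at most 64.

```python
class TrimmingPlayer:
    """Toy model of a player that drops leading silence across chunks (hypothetical)."""

    THRESHOLD = 64  # assumed silence threshold for 16-bit samples

    def __init__(self):
        self.is_trimming = False
        self.played = []  # stands in for audio actually sent to the device

    def start_trimming_leading_silence(self, start=True):
        """Set or clear the flag, e.g. at the start of a new utterance."""
        self.is_trimming = start

    def _leading_silence_size(self, data):
        usable = len(data) - len(data) % 2
        for offset in range(0, usable, 2):
            sample = int.from_bytes(data[offset:offset + 2], "little", signed=True)
            if abs(sample) > self.THRESHOLD:
                return offset
        return len(data)

    def feed(self, data):
        if self.is_trimming:
            silence = self._leading_silence_size(data)
            if silence >= len(data):
                # Whole chunk is silent: drop it and keep trimming the next one.
                return
            # Silence ends inside this chunk: skip it and stop trimming.
            data = data[silence:]
            self.is_trimming = False
        self.played.append(data)
```

In this model, fully silent chunks fed right after `start_trimming_leading_silence()` are discarded, the first chunk containing audible samples is shortened, and every later chunk passes through untouched until the flag is set again. That matches the intent that pauses inside an utterance are left alone.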

projectDocs/dev/developerGuide/developerGuide.md

Lines changed: 1 addition & 0 deletions
@@ -1393,6 +1393,7 @@ For examples of how to define and use new extension points, please see the code
 |`Action` |`synthIndexReached` |Notifies when a synthesizer reaches an index during speech.|
 |`Action` |`synthDoneSpeaking` |Notifies when a synthesizer finishes speaking.|
 |`Action` |`synthChanged` |Notifies of synthesizer changes.|
+|`Action` |`pre_synthSpeak` |Notifies when the current synthesizer is about to speak something.|

 ### tones {#tonesExtPts}

source/config/configSpec.py

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@
 	autoDialectSwitching = boolean(default=false)
 	delayedCharacterDescriptions = boolean(default=false)
 	excludedSpeechModes = int_list(default=list())
+	trimLeadingSilence = boolean(default=true)

 	[[__many__]]
 		capPitchChange = integer(default=30,min=-100,max=100)

source/gui/settingsDialogs.py

Lines changed: 19 additions & 0 deletions
@@ -3737,6 +3737,7 @@ def __init__(self, parent):
 		# Advanced settings panel
 		label = _("Speech")
 		speechSizer = wx.StaticBoxSizer(wx.VERTICAL, self, label=label)
+		speechBox = speechSizer.GetStaticBox()
 		speechGroup = guiHelper.BoxSizerHelper(speechSizer, sizer=speechSizer)
 		sHelper.addItem(speechGroup)

@@ -3767,6 +3768,14 @@ def __init__(self, parent):
 			["featureFlag", "cancelExpiredFocusSpeech"],
 		)

+		# Translators: This is the label for a checkbox control in the
+		# Advanced settings panel.
+		label = _("Trim leading silence in speech audio")
+		self.trimLeadingSilenceCheckBox = speechGroup.addItem(wx.CheckBox(speechBox, label=label))
+		self.bindHelpEvent("TrimLeadingSilenceSpeech", self.trimLeadingSilenceCheckBox)
+		self.trimLeadingSilenceCheckBox.SetValue(config.conf["speech"]["trimLeadingSilence"])
+		self.trimLeadingSilenceCheckBox.defaultValue = self._getDefaultValue(["speech", "trimLeadingSilence"])
+
 		# Translators: This is the label for a group of advanced options in the
 		# Advanced settings panel
 		label = _("Virtual Buffers")
@@ -3951,6 +3960,7 @@ def haveConfigDefaultsBeenRestored(self):
 			and self.wtStrategyCombo.isValueConfigSpecDefault()
 			and self.cancelExpiredFocusSpeechCombo.GetSelection()
 			== self.cancelExpiredFocusSpeechCombo.defaultValue
+			and self.trimLeadingSilenceCheckBox.IsChecked() == self.trimLeadingSilenceCheckBox.defaultValue
 			and self.loadChromeVBufWhenBusyCombo.isValueConfigSpecDefault()
 			and self.caretMoveTimeoutSpinControl.GetValue() == self.caretMoveTimeoutSpinControl.defaultValue
 			and self.reportTransparentColorCheckBox.GetValue()
@@ -3979,6 +3989,7 @@ def restoreToDefaults(self):
 		self.diffAlgoCombo.SetSelection(self.diffAlgoCombo.defaultValue)
 		self.wtStrategyCombo.resetToConfigSpecDefault()
 		self.cancelExpiredFocusSpeechCombo.SetSelection(self.cancelExpiredFocusSpeechCombo.defaultValue)
+		self.trimLeadingSilenceCheckBox.SetValue(self.trimLeadingSilenceCheckBox.defaultValue)
 		self.loadChromeVBufWhenBusyCombo.resetToConfigSpecDefault()
 		self.caretMoveTimeoutSpinControl.SetValue(self.caretMoveTimeoutSpinControl.defaultValue)
 		self.reportTransparentColorCheckBox.SetValue(self.reportTransparentColorCheckBox.defaultValue)
@@ -3989,6 +4000,14 @@ def restoreToDefaults(self):

 	def onSave(self):
 		log.debug("Saving advanced config")
+
+		if config.conf["speech"]["trimLeadingSilence"] != self.trimLeadingSilenceCheckBox.IsChecked():
+			# Reload the synthesizer if "trimLeadingSilence" changes
+			config.conf["speech"]["trimLeadingSilence"] = self.trimLeadingSilenceCheckBox.IsChecked()
+			currentSynth = getSynth()
+			if not setSynth(currentSynth.name):
+				_synthWarningDialog(currentSynth.name)
+
 		config.conf["development"]["enableScratchpadDir"] = self.scratchpadCheckBox.IsChecked()
 		selectiveUIAEventRegistrationChoice = self.selectiveUIAEventRegistrationCombo.GetSelection()
 		config.conf["UIA"]["eventRegistration"] = self.selectiveUIAEventRegistrationVals[
