Commit 3a1ebf8

gexgd0419 authored and cary-rowen committed
Make SAPI5 & MSSP voices use WavePlayer (WASAPI) (nvaccess#17592)
Closes nvaccess#13284

Summary of the issue:
Currently, SAPI5 and MSSP voices use their own audio output mechanisms instead of the WavePlayer (WASAPI) inside NVDA. According to my tests, this may make them less responsive than eSpeak and OneCore voices, which use the WavePlayer, and less responsive than SAPI5 voices in other screen readers. It also gives NVDA less control over audio output. For example, the audio ducking logic inside WavePlayer cannot be applied to SAPI5 voices, so additional code is required to compensate for this.

Description of user facing changes
SAPI5 and MSSP voices now use the WavePlayer, which may make them more responsive (less delay). According to my tests, this can reduce the delay by at least 50 ms. This change does not yet trim the leading silence. If that is done as well, the delay can be expected to drop further.

Description of development approach
Instead of setting self.tts.audioOutput to a real output device, do the following:

* Create an implementation class, SynthDriverAudioStream, that implements the COM interface IStream and can be used to stream in audio data from the voices.
* Use an SpCustomStream object to wrap SynthDriverAudioStream and provide the wave format.
* Assign the SpCustomStream object to self.tts.AudioOutputStream, so SAPI outputs audio to this stream instead.

Each time an audio chunk needs to be streamed in, ISequentialStream_RemoteWrite is called, and we just feed the audio to the player. IStream_RemoteSeek can also be called when SAPI wants to know the current byte position of the stream (dlibMove should be zero and dwOrigin should be STREAM_SEEK_CUR in this case), but it is not used to actually "seek" to a new position. IStream_Commit can be called by MSSP voices to "flush" the audio data; we do nothing there. Other methods are left unimplemented, as they are not used when acting as an audio output stream.

Previously, comtypes.client.GetEvents was used to get the event notifications. But those notifications are routed to the main thread via the main message loop. According to the documentation of ISpNotifySource:

"Note that both variations of callbacks as well as the window message notification require a window message pump to run on the thread that initialized the notification source. Callback will only be called as the result of window message processing, and will always be called on the same thread that initialized the notify source. However, using Win32 events for SAPI event notification does not require a window message pump."

Because the audio data is generated and sent via IStream on a dedicated thread, receiving events on the main thread can make synchronizing events and audio difficult. So here SapiSink is changed to become an implementation of ISpNotifySink. Notifications received via ISpNotifySink are "free-threaded", sent on the original thread instead of being routed to the main thread. To connect the sink, use ISpNotifySource::SetNotifySink. To get the actual event that triggers the notification, use ISpEventSource::GetEvents. Events can contain pointers to objects or memory, so they need to be freed manually.

Finally, all audio-ducking-related code is removed. WavePlayer should now be able to handle audio ducking when using SAPI5 and MSSP voices.
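The output-stream contract described above (a write feeds the player, a zero-offset seek from the current position reports the bytes written so far, and any other seek is unsupported) can be sketched in plain Python without COM. This is only an illustration: `PositionTrackingSink` and `RecordingPlayer` are hypothetical stand-ins, not NVDA or comtypes classes, and the real driver reports the unsupported case by returning `E_NOTIMPL` rather than raising `NotImplementedError`.

```python
STREAM_SEEK_CUR = 1  # numeric value of the Win32 STREAM_SEEK_CUR origin


class RecordingPlayer:
    """Hypothetical stand-in for an audio player: it only records what it is fed."""

    def __init__(self) -> None:
        self.chunks = []

    def feed(self, data: bytes) -> None:
        self.chunks.append(data)


class PositionTrackingSink:
    """Write-only sink that forwards audio to a player and tracks the byte position."""

    def __init__(self, player: RecordingPlayer) -> None:
        self._player = player
        self._written = 0
        self.speaking = True

    def write(self, data: bytes) -> int:
        # Report zero bytes consumed once speech has been cancelled.
        if not self.speaking:
            return 0
        self._player.feed(data)
        self._written += len(data)
        return len(data)

    def seek(self, offset: int, origin: int) -> int:
        # Only the "where am I now?" query is answered; real seeking is unsupported.
        if origin == STREAM_SEEK_CUR and offset == 0:
            return self._written
        raise NotImplementedError("seeking is not supported")


player = RecordingPlayer()
sink = PositionTrackingSink(player)
sink.write(b"\x00" * 320)
sink.write(b"\x01" * 160)
position = sink.seek(0, STREAM_SEEK_CUR)  # total bytes written so far
```

Keeping the position as a simple running total is enough here because the stream is only ever appended to.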
1 parent 1d5d615 commit 3a1ebf8

3 files changed: 157 additions and 122 deletions

source/synthDrivers/mssp.py

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@

 class SynthDriver(SynthDriver):
 	COM_CLASS = "speech.SPVoice"
+	CUSTOMSTREAM_COM_CLASS = "speech.SpCustomStream"

 	name = "mssp"
 	description = "Microsoft Speech Platform"

source/synthDrivers/sapi5.py

Lines changed: 151 additions & 122 deletions
@@ -4,14 +4,17 @@
 # This file is covered by the GNU General Public License.
 # See the file COPYING for more details.

-from typing import Optional
+from ctypes import POINTER, c_ubyte, c_wchar_p, cast, windll, _Pointer
 from enum import IntEnum
 import locale
 from collections import OrderedDict
+from typing import TYPE_CHECKING
+from comInterfaces.SpeechLib import ISpEventSource, ISpNotifySource, ISpNotifySink
 import comtypes.client
-from comtypes import COMError
+from comtypes import COMError, COMObject, IUnknown, hresult, ReturnHRESULT
 import winreg
-import audioDucking
+import nvwave
+from objidl import _LARGE_INTEGER, _ULARGE_INTEGER, IStream
 from synthDriverHandler import SynthDriver, VoiceInfo, synthIndexReached, synthDoneSpeaking
 import config
 from logHandler import log
@@ -31,14 +34,6 @@
 )


-class SPAudioState(IntEnum):
-	# https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms720596(v=vs.85)
-	CLOSED = 0
-	STOP = 1
-	PAUSE = 2
-	RUN = 3
-
-
 class SpeechVoiceSpeakFlags(IntEnum):
 	# https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms720892(v=vs.85)
 	Async = 1
@@ -53,41 +48,133 @@ class SpeechVoiceEvents(IntEnum):
 	Bookmark = 16


-class SapiSink(object):
-	"""Handles SAPI event notifications.
-	See https://msdn.microsoft.com/en-us/library/ms723587(v=vs.85).aspx
+if TYPE_CHECKING:
+	LP_c_ubyte = _Pointer[c_ubyte]
+else:
+	LP_c_ubyte = POINTER(c_ubyte)
+
+
+class SynthDriverAudioStream(COMObject):
+	"""
+	Implements IStream to receive streamed-in audio data.
+	Should be wrapped in an SpCustomStream
+	(which also provides the wave format information),
+	then set as the AudioOutputStream.
 	"""

+	_com_interfaces_ = [IStream]
+
 	def __init__(self, synthRef: weakref.ReferenceType):
 		self.synthRef = synthRef
+		self._writtenBytes = 0

-	def StartStream(self, streamNum, pos):
+	def ISequentialStream_RemoteWrite(self, pv: LP_c_ubyte, cb: int) -> int:
+		"""This is called when SAPI wants to write (output) a wave data chunk.
+		:param pv: A pointer to the first wave data byte.
+		:param cb: The number of bytes to write.
+		:returns: The number of bytes written.
+		"""
 		synth = self.synthRef()
 		if synth is None:
-			log.debugWarning("Called StartStream method on SapiSink while driver is dead")
-			return
-		if synth._audioDucker:
-			if audioDucking._isDebug():
-				log.debug("Enabling audio ducking due to starting speech stream")
-			synth._audioDucker.enable()
+			log.debugWarning("Called Write method on AudioStream while driver is dead")
+			return 0
+		if not synth.isSpeaking:
+			return 0
+		synth.player.feed(pv, cb)
+		self._writtenBytes += cb
+		return cb
+
+	def IStream_RemoteSeek(self, dlibMove: _LARGE_INTEGER, dwOrigin: int) -> _ULARGE_INTEGER:
+		"""This is called when SAPI wants to get the current stream position.
+		Seeking to another position is not supported.
+		:param dlibMove: The displacement to be added to the location indicated by the dwOrigin parameter.
+		Only 0 is supported.
+		:param dwOrigin: The origin for the displacement specified in dlibMove.
+		Only 1 (STREAM_SEEK_CUR) is supported.
+		:returns: The current stream position.
+		"""
+		if dwOrigin == 1 and dlibMove.QuadPart == 0:
+			# SAPI is querying the current position.
+			return _ULARGE_INTEGER(self._writtenBytes)
+		# Return E_NOTIMPL without logging an error.
+		raise ReturnHRESULT(hresult.E_NOTIMPL, None)
+
+	def IStream_Commit(self, grfCommitFlags: int):
+		"""This is called when MSSP wants to flush the written data.
+		Does nothing."""
+		pass
+
+
+class SapiSink(COMObject):
+	"""
+	Implements ISpNotifySink to handle SAPI event notifications.
+	Should be passed to ISpNotifySource::SetNotifySink().
+	Notifications will be sent on the original thread,
+	instead of being routed to the main thread.
+	"""
+
+	_com_interfaces_ = [ISpNotifySink]
+
+	def __init__(self, synthRef: weakref.ReferenceType):
+		self.synthRef = synthRef

-	def Bookmark(self, streamNum, pos, bookmark, bookmarkId):
+	def ISpNotifySink_Notify(self):
+		"""This is called when there's a new event notification.
+		Queued events will be retrieved."""
 		synth = self.synthRef()
 		if synth is None:
-			log.debugWarning("Called Bookmark method on SapiSink while driver is dead")
+			log.debugWarning("Called Notify method on SapiSink while driver is dead")
 			return
-		synthIndexReached.notify(synth=synth, index=bookmarkId)
+		# Get all queued events
+		eventSource = synth.tts.QueryInterface(ISpEventSource)
+		while True:
+			# returned tuple: (event, numFetched)
+			eventTuple = eventSource.GetEvents(1) # Get one event
+			if eventTuple[1] != 1:
+				break
+			event = eventTuple[0]
+			if event.eEventId == 1: # SPEI_START_INPUT_STREAM
+				self.StartStream(event.ulStreamNum, event.ullAudioStreamOffset)
+			elif event.eEventId == 2: # SPEI_END_INPUT_STREAM
+				self.EndStream(event.ulStreamNum, event.ullAudioStreamOffset)
+			elif event.eEventId == 4: # SPEI_TTS_BOOKMARK
+				self.Bookmark(
+					event.ulStreamNum,
+					event.ullAudioStreamOffset,
+					cast(event.lParam, c_wchar_p).value,
+					event.wParam,
+				)
+			# free lParam
+			if event.elParamType == 1 or event.elParamType == 2: # token or object
+				pUnk = cast(event.lParam, POINTER(IUnknown))
+				del pUnk
+			elif event.elParamType == 3 or event.elParamType == 4: # pointer or string
+				windll.ole32.CoTaskMemFree(event.lParam)
+
+	def StartStream(self, streamNum: int, pos: int):
+		synth = self.synthRef()
+		synth.isSpeaking = True

-	def EndStream(self, streamNum, pos):
+	def Bookmark(self, streamNum: int, pos: int, bookmark: str, bookmarkId: int):
 		synth = self.synthRef()
-		if synth is None:
-			log.debugWarning("Called Bookmark method on EndStream while driver is dead")
+		if not synth.isSpeaking:
 			return
+		# Bookmark event is raised before the audio after that point.
+		# Queue an IndexReached event at this point.
+		synth.player.feed(None, 0, lambda: self.onIndexReached(bookmarkId))
+
+	def EndStream(self, streamNum: int, pos: int):
+		synth = self.synthRef()
+		synth.isSpeaking = False
+		synth.player.idle()
 		synthDoneSpeaking.notify(synth=synth)
-		if synth._audioDucker:
-			if audioDucking._isDebug():
-				log.debug("Disabling audio ducking due to speech stream end")
-			synth._audioDucker.disable()
+
+	def onIndexReached(self, index: int):
+		synth = self.synthRef()
+		if synth is None:
+			log.debugWarning("Called onIndexReached method on SapiSink while driver is dead")
+			return
+		synthIndexReached.notify(synth=synth, index=index)


 class SynthDriver(SynthDriver):
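The drain loop in `ISpNotifySink_Notify` fetches one event at a time until the source reports that nothing was fetched, and dispatches on the event ID. A minimal COM-free sketch of that pattern follows. `Event`, `FakeEventSource`, and `drain` are hypothetical names used only for illustration; the three event IDs are the ones quoted in the diff comments, and the manual freeing of `lParam` that the real code performs is omitted here.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Event IDs as quoted in the diff comments.
SPEI_START_INPUT_STREAM = 1
SPEI_END_INPUT_STREAM = 2
SPEI_TTS_BOOKMARK = 4


@dataclass
class Event:
    eventId: int
    wParam: int = 0


class FakeEventSource:
    """Hypothetical queue that mimics an (event, numFetched) style fetch call."""

    def __init__(self, events: List[Event]) -> None:
        self._queue = deque(events)

    def getEvents(self, count: int = 1) -> Tuple[Optional[Event], int]:
        if not self._queue:
            return None, 0
        return self._queue.popleft(), 1


def drain(source: FakeEventSource, handlers: Dict[int, Callable[[Event], None]]) -> int:
    """Fetch queued events one by one and dispatch each to its handler."""
    handled = 0
    while True:
        event, fetched = source.getEvents(1)
        if fetched != 1:
            break  # nothing left in the queue
        handler = handlers.get(event.eventId)
        if handler is not None:
            handler(event)
            handled += 1
    return handled


log: List[str] = []
source = FakeEventSource(
    [
        Event(SPEI_START_INPUT_STREAM),
        Event(SPEI_TTS_BOOKMARK, wParam=7),
        Event(SPEI_END_INPUT_STREAM),
    ]
)
handled = drain(
    source,
    {
        SPEI_START_INPUT_STREAM: lambda e: log.append("start"),
        SPEI_TTS_BOOKMARK: lambda e: log.append(f"bookmark {e.wParam}"),
        SPEI_END_INPUT_STREAM: lambda e: log.append("end"),
    },
)
```

Draining until the fetch count drops to zero matters because a single notification may cover several queued events.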
@@ -110,6 +197,7 @@ class SynthDriver(SynthDriver):
 	supportedNotifications = {synthIndexReached, synthDoneSpeaking}

 	COM_CLASS = "SAPI.SPVoice"
+	CUSTOMSTREAM_COM_CLASS = "SAPI.SpCustomStream"

 	name = "sapi5"
 	description = "Microsoft Speech API version 5"
@@ -123,24 +211,21 @@ def check(cls):
 		except: # noqa: E722
 			return False

-	ttsAudioStream = (
-		None #: Holds the ISPAudio interface for the current voice, to aid in stopping and pausing audio
-	)
-	_audioDucker: Optional[audioDucking.AudioDucker] = None
-
 	def __init__(self, _defaultVoiceToken=None):
 		"""
 		@param _defaultVoiceToken: an optional sapi voice token which should be used as the default voice (only useful for subclasses)
 		@type _defaultVoiceToken: ISpeechObjectToken
 		"""
-		if audioDucking.isAudioDuckingSupported():
-			self._audioDucker = audioDucking.AudioDucker()
 		self._pitch = 50
+		self.player = None
+		self.isSpeaking = False
 		self._initTts(_defaultVoiceToken)

 	def terminate(self):
-		self._eventsConnection = None
 		self.tts = None
+		if self.player:
+			self.player.close()
+			self.player = None

 	def _getAvailableVoices(self):
 		voices = OrderedDict()
@@ -204,27 +289,31 @@ def _initTts(self, voice=None):
 			# Therefore, set the voice before setting the audio output.
 			# Otherwise, we will get poor speech quality in some cases.
 			self.tts.voice = voice
-		# SAPI5 automatically selects the system default audio device, so there's no use doing work if the user has selected to use the system default.
-		# Besides, our default value is not a valid endpoint ID.
-		if (outputDevice := config.conf["audio"]["outputDevice"]) != config.conf.getConfigValidation(
-			("audio", "outputDevice"),
-		).default:
-			for audioOutput in self.tts.GetAudioOutputs():
-				# SAPI's audio output IDs are registry keys. It seems that the final path segment is the endpoint ID.
-				if audioOutput.Id.endswith(outputDevice):
-					self.tts.audioOutput = audioOutput
-					break
-		self._eventsConnection = comtypes.client.GetEvents(self.tts, SapiSink(weakref.ref(self)))
+
+		self.tts.AudioOutput = self.tts.AudioOutput # Reset the audio and its format parameters
+		fmt = self.tts.AudioOutputStream.Format
+		wfx = fmt.GetWaveFormatEx()
+		if self.player:
+			self.player.close()
+		self.player = nvwave.WavePlayer(
+			channels=wfx.Channels,
+			samplesPerSec=wfx.SamplesPerSec,
+			bitsPerSample=wfx.BitsPerSample,
+			outputDevice=config.conf["audio"]["outputDevice"],
+		)
+		audioStream = SynthDriverAudioStream(weakref.ref(self))
+		# Use SpCustomStream to wrap our IStream implementation and the correct wave format
+		customStream = comtypes.client.CreateObject(self.CUSTOMSTREAM_COM_CLASS)
+		customStream.BaseStream = audioStream
+		customStream.Format = fmt
+		self.tts.AudioOutputStream = customStream
+
+		# Set event notify sink
 		self.tts.EventInterests = (
 			SpeechVoiceEvents.StartInputStream | SpeechVoiceEvents.Bookmark | SpeechVoiceEvents.EndInputStream
 		)
-		from comInterfaces.SpeechLib import ISpAudio
-
-		try:
-			self.ttsAudioStream = self.tts.audioOutputStream.QueryInterface(ISpAudio)
-		except COMError:
-			log.debugWarning("SAPI5 voice does not support ISPAudio")
-			self.ttsAudioStream = None
+		notifySource = self.tts.QueryInterface(ISpNotifySource)
+		notifySource.SetNotifySink(SapiSink(weakref.ref(self)))

 	def _set_voice(self, value):
 		tokens = self._getVoiceTokens()
@@ -370,74 +459,14 @@ def outputTags():

 		text = "".join(textList)
 		flags = SpeechVoiceSpeakFlags.IsXML | SpeechVoiceSpeakFlags.Async
-		# Ducking should be complete before the synth starts producing audio.
-		# For this to happen, the speech method must block until ducking is complete.
-		# Ducking should be disabled when the synth is finished producing audio.
-		# Note that there may be calls to speak with a string that results in no audio,
-		# it is important that in this case the audio does not get stuck ducked.
-		# When there is no audio produced the startStream and endStream handlers are not called.
-		# To prevent audio getting stuck ducked, it is unducked at the end of speech.
-		# There are some known issues:
-		# - When there is no audio produced by the synth, a user may notice volume lowering (ducking) temporarily.
-		# - If the call to startStream handler is delayed significantly, users may notice a variation in volume
-		# (as ducking is disabled at the end of speak, and re-enabled when the startStream handler is called)
-
-		# A note on the synchronicity of components of this approach:
-		# SAPISink.StartStream event handler (callback):
-		# the synth speech is not blocked by this event callback.
-		# SAPISink.EndStream event handler (callback):
-		# assumed also to be async but not confirmed. Synchronicity is irrelevant to the current approach.
-		# AudioDucker.disable returns before the audio is completely unducked.
-		# AudioDucker.enable() ducking will complete before the function returns.
-		# It is not possible to "double duck the audio", calling twice yields the same result as calling once.
-		# AudioDucker class instances count the number of enables/disables,
-		# in order to unduck there must be no remaining enabled audio ducker instances.
-		# Due to this a temporary audio ducker is used around the call to speak.
-		# SAPISink.StartStream: Ducking here may allow the early speech to start before ducking is completed.
-		if audioDucking.isAudioDuckingSupported():
-			tempAudioDucker = audioDucking.AudioDucker()
-		else:
-			tempAudioDucker = None
-		if tempAudioDucker:
-			if audioDucking._isDebug():
-				log.debug("Enabling audio ducking due to speak call")
-			tempAudioDucker.enable()
-		try:
-			self.tts.Speak(text, flags)
-		finally:
-			if tempAudioDucker:
-				if audioDucking._isDebug():
-					log.debug("Disabling audio ducking after speak call")
-				tempAudioDucker.disable()
+		self.tts.Speak(text, flags)

 	def cancel(self):
 		# SAPI5's default means of stopping speech can sometimes lag at end of speech, especially with Win8 / Win 10 Microsoft Voices.
-		# Therefore instruct the underlying audio interface to stop first, before interupting and purging any remaining speech.
-		if self.ttsAudioStream:
-			self.ttsAudioStream.setState(SPAudioState.STOP, 0)
+		# Therefore instruct the audio player to stop first, before interupting and purging any remaining speech.
+		self.isSpeaking = False
+		self.player.stop()
 		self.tts.Speak(None, SpeechVoiceSpeakFlags.Async | SpeechVoiceSpeakFlags.PurgeBeforeSpeak)
-		if self._audioDucker:
-			if audioDucking._isDebug():
-				log.debug("Disabling audio ducking due to setting output audio state to stop")
-			self._audioDucker.disable()

 	def pause(self, switch: bool):
-		# SAPI5's default means of pausing in most cases is either extremely slow
-		# (e.g. takes more than half a second) or does not work at all.
-		# Therefore instruct the underlying audio interface to pause instead.
-		if self.ttsAudioStream:
-			oldState = self.ttsAudioStream.GetStatus().State
-			if switch and oldState == SPAudioState.RUN:
-				# pausing
-				if self._audioDucker:
-					if audioDucking._isDebug():
-						log.debug("Disabling audio ducking due to setting output audio state to pause")
-					self._audioDucker.disable()
-				self.ttsAudioStream.setState(SPAudioState.PAUSE, 0)
-			elif not switch and oldState == SPAudioState.PAUSE:
-				# unpausing
-				if self._audioDucker:
-					if audioDucking._isDebug():
-						log.debug("Enabling audio ducking due to setting output audio state to run")
-					self._audioDucker.enable()
-				self.ttsAudioStream.setState(SPAudioState.RUN, 0)
+		self.player.pause(switch)
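In this diff, `Bookmark` feeds an empty chunk whose completion callback raises the index notification, and `cancel` stops the player before purging speech. The intended ordering can be illustrated with a toy model, assuming a player that runs a chunk's callback once that chunk has finished playing. `ToyPlayer` and its methods are hypothetical and say nothing about how `nvwave.WavePlayer` is implemented internally.

```python
from collections import deque
from typing import Callable, List, Optional


class ToyPlayer:
    """Hypothetical FIFO player: each chunk may carry a callback run after it is played."""

    def __init__(self) -> None:
        self._queue = deque()
        self.played: List[bytes] = []

    def feed(self, data: bytes, onDone: Optional[Callable[[], None]] = None) -> None:
        self._queue.append((data, onDone))

    def stop(self) -> None:
        # Discard everything still queued, including pending callbacks.
        self._queue.clear()

    def playNext(self) -> bool:
        """Play one queued chunk; return False when the queue is empty."""
        if not self._queue:
            return False
        data, onDone = self._queue.popleft()
        if data:
            self.played.append(data)
        if onDone is not None:
            onDone()
        return True


events: List[str] = []
player = ToyPlayer()
player.feed(b"audio before bookmark")
player.feed(b"", onDone=lambda: events.append("index 1 reached"))  # empty marker chunk
player.feed(b"audio after bookmark")

player.playNext()  # plays the first chunk; the marker has not fired yet
firedEarly = bool(events)
player.playNext()  # the marker chunk completes and its callback runs
player.stop()  # cancel: the remaining chunk is dropped
```

Because the marker travels through the same queue as the audio, the index is reported when playback reaches the bookmark rather than when the synthesizer generates it.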

user_docs/en/changes.md

Lines changed: 5 additions & 0 deletions
@@ -52,6 +52,8 @@ To use this feature, "allow NVDA to control the volume of other applications" mu
 Prefix matching on command line flags, e.g. using `--di` for `--disable-addons` is no longer supported. (#11644, @CyrilleB79)
 * The keyboard settings for "Speak typed characters" and "Speak typed words" now have three options: Off, Always, and Only in edit controls. (#17505, @Cary-rowen)
   * By default, "Speak typed characters" is now set to "Only in edit controls".
+* Microsoft Speech API version 5 and Microsoft Speech Platform voices now use WASAPI for audio output, which may improve the responsiveness of those voices. (#13284, @gexgd0419)
+

 ### Bug Fixes

@@ -171,6 +173,9 @@ As the NVDA update check URL is now configurable directly within NVDA, no replac
 * `updateCheck.UpdateAskInstallDialog` no longer automatically performs an action when the update or postpone buttons are pressed.
 Instead, a `callback` property has been added, which returns a function that performs the appropriate action when called with the return value from the dialog. (#17582)
 * Dialogs opened with `gui.runScriptModalDialog` are now recognised as modal by NVDA. (#17582)
+* Because SAPI5 voices now use `nvwave.WavePlayer` to output audio: (#17592, @gexgd0419)
+  * `synthDrivers.sapi5.SPAudioState` has been removed.
+  * `synthDrivers.sapi5.SynthDriver.ttsAudioStream` has been removed.
 * Changed keyboard typing echo configuration from boolean to integer values. (#17505, @Cary-rowen)
   * `config.conf["keyboard"]["speakTypedCharacters"]` and `config.conf["keyboard"]["speakTypedWords"]` now use integer values.
   * Added `TypingEcho` enum in `config.configFlags` to represent these modes, 0=Off, 1=Only in edit controls, 2=Always.
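
Code that reads the removed `ttsAudioStream` attribute will raise `AttributeError` after this change. One defensive pattern is to probe for the attribute instead of assuming it exists. The sketch below uses hypothetical stand-in classes (`OldStyleDriver`, `NewStyleDriver`) and a hypothetical helper `usesWavePlayer`; none of them are part of NVDA.

```python
class OldStyleDriver:
    """Hypothetical stand-in for a driver that still exposes ttsAudioStream."""

    ttsAudioStream = object()


class NewStyleDriver:
    """Hypothetical stand-in for a driver that plays through a `player` attribute."""

    player = object()


def usesWavePlayer(driver: object) -> bool:
    """Return True when the driver has no ttsAudioStream but does have a player."""
    return getattr(driver, "ttsAudioStream", None) is None and hasattr(driver, "player")


newStyle = usesWavePlayer(NewStyleDriver())
oldStyle = usesWavePlayer(OldStyleDriver())
```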
