Make SAPI5 & MSSP voices use WavePlayer (WASAPI) (nvaccess#17592)

gexgd0419 · cary-rowen · commit 3a1ebf871380 · 2025-01-11T11:05:03.000+08:00
Closes nvaccess#13284

Summary of the issue:
Currently, SAPI5 and MSSP voices use their own audio output mechanisms instead of the WavePlayer (WASAPI) inside NVDA. According to my test results, this may make them less responsive than eSpeak and OneCore voices, which use the WavePlayer, and less responsive than SAPI5 voices in other screen readers. It also gives NVDA less control over audio output. For example, the audio ducking logic inside WavePlayer cannot be applied to SAPI5 voices, so additional code is required to compensate.

Description of user facing changes:
SAPI5 and MSSP voices now use the WavePlayer, which may make them more responsive (less delay). According to my test results, this can reduce the delay by at least 50 ms. This change does not yet trim the leading silence; doing that as well should reduce the delay further.

Description of development approach:
Instead of setting self.tts.audioOutput to a real output device, do the following:

- Create an implementation class, SynthDriverAudioStream, that implements the COM interface IStream, which can be used to stream in audio data from the voices.
- Use an SpCustomStream object to wrap SynthDriverAudioStream and provide the wave format.
- Assign the SpCustomStream object to self.tts.AudioOutputStream, so that SAPI outputs audio to this stream instead.

Each time an audio chunk needs to be streamed in, ISequentialStream_RemoteWrite is called, and we just feed the audio to the player. IStream_RemoteSeek can also be called when SAPI wants to know the current byte position of the stream (dlibMove should be zero and dwOrigin should be STREAM_SEEK_CUR in this case), but it is not used to actually "seek" to a new position. IStream_Commit can be called by MSSP voices to "flush" the audio data; we do nothing there. Other methods are left unimplemented, as they are not used when acting as an audio output stream.

Previously, comtypes.client.GetEvents was used to get the event notifications, but those notifications are routed to the main thread via the main message loop. According to the documentation of ISpNotifySource:

"Note that both variations of callbacks as well as the window message notification require a window message pump to run on the thread that initialized the notification source. Callback will only be called as the result of window message processing, and will always be called on the same thread that initialized the notify source. However, using Win32 events for SAPI event notification does not require a window message pump."

Because the audio data is generated and sent via IStream on a dedicated thread, receiving events on the main thread can make synchronizing events and audio difficult. So SapiSink is changed to become an implementation of ISpNotifySink. Notifications received via ISpNotifySink are "free-threaded", sent on the original thread instead of being routed to the main thread. To connect the sink, use ISpNotifySource::SetNotifySink. To get the actual event that triggers the notification, use ISpEventSource::GetEvents. Events can contain pointers to objects or memory, so they need to be freed manually.

Finally, all audio ducking related code is removed. WavePlayer should now be able to handle audio ducking when using SAPI5 and MSSP voices.
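
For readers who want to see the stream contract in isolation, here is a minimal pure-Python sketch of the write and position-query behaviour described in the development approach. It does not use COM or comtypes. `FakePlayer` and `AudioStreamModel` are hypothetical names made up for this illustration, and `FakePlayer` only stands in for `nvwave.WavePlayer` by recording what it is fed.

```python
STREAM_SEEK_CUR = 1  # numeric value of STREAM_SEEK_CUR, as used in the diff


class FakePlayer:
    """Hypothetical stand-in for nvwave.WavePlayer; it only records fed chunks."""

    def __init__(self):
        self.chunks = []

    def feed(self, data: bytes) -> None:
        self.chunks.append(data)


class AudioStreamModel:
    """Hypothetical model of the write/seek contract (no COM involved)."""

    def __init__(self, player: FakePlayer):
        self.player = player
        self.isSpeaking = False
        self._writtenBytes = 0

    def write(self, data: bytes) -> int:
        # Audio arriving while not speaking (e.g. after cancel) is dropped.
        if not self.isSpeaking:
            return 0
        self.player.feed(data)
        self._writtenBytes += len(data)
        return len(data)

    def seek(self, move: int, origin: int) -> int:
        # Only "what is the current position?" queries are answered.
        if origin == STREAM_SEEK_CUR and move == 0:
            return self._writtenBytes
        raise NotImplementedError("seeking to a new position is not supported")


player = FakePlayer()
stream = AudioStreamModel(player)

stream.isSpeaking = True
stream.write(b"\x00" * 320)
stream.write(b"\x00" * 160)
position = stream.seek(0, STREAM_SEEK_CUR)  # 480 bytes written so far

stream.isSpeaking = False
dropped = stream.write(b"\x00" * 100)  # returns 0, nothing reaches the player
```

In the real driver the same bookkeeping lives in `SynthDriverAudioStream`, where the unsupported seek case returns `E_NOTIMPL` rather than raising a Python exception.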
diff --git a/source/synthDrivers/mssp.py b/source/synthDrivers/mssp.py
@@ -9,6 +9,7 @@
 
 class SynthDriver(SynthDriver):
 	COM_CLASS = "speech.SPVoice"
+	CUSTOMSTREAM_COM_CLASS = "speech.SpCustomStream"
 
 	name = "mssp"
 	description = "Microsoft Speech Platform"
diff --git a/source/synthDrivers/sapi5.py b/source/synthDrivers/sapi5.py
@@ -4,14 +4,17 @@
 # This file is covered by the GNU General Public License.
 # See the file COPYING for more details.
 
-from typing import Optional
+from ctypes import POINTER, c_ubyte, c_wchar_p, cast, windll, _Pointer
 from enum import IntEnum
 import locale
 from collections import OrderedDict
+from typing import TYPE_CHECKING
+from comInterfaces.SpeechLib import ISpEventSource, ISpNotifySource, ISpNotifySink
 import comtypes.client
-from comtypes import COMError
+from comtypes import COMError, COMObject, IUnknown, hresult, ReturnHRESULT
 import winreg
-import audioDucking
+import nvwave
+from objidl import _LARGE_INTEGER, _ULARGE_INTEGER, IStream
 from synthDriverHandler import SynthDriver, VoiceInfo, synthIndexReached, synthDoneSpeaking
 import config
 from logHandler import log
@@ -31,14 +34,6 @@
 )
 
 
-class SPAudioState(IntEnum):
-	# https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms720596(v=vs.85)
-	CLOSED = 0
-	STOP = 1
-	PAUSE = 2
-	RUN = 3
-
-
 class SpeechVoiceSpeakFlags(IntEnum):
 	# https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms720892(v=vs.85)
 	Async = 1
@@ -53,41 +48,133 @@ class SpeechVoiceEvents(IntEnum):
 	Bookmark = 16
 
 
-class SapiSink(object):
-	"""Handles SAPI event notifications.
-	See https://msdn.microsoft.com/en-us/library/ms723587(v=vs.85).aspx
+if TYPE_CHECKING:
+	LP_c_ubyte = _Pointer[c_ubyte]
+else:
+	LP_c_ubyte = POINTER(c_ubyte)
+
+
+class SynthDriverAudioStream(COMObject):
+	"""
+	Implements IStream to receive streamed-in audio data.
+	Should be wrapped in an SpCustomStream
+	(which also provides the wave format information),
+	then set as the AudioOutputStream.
 	"""
 
+	_com_interfaces_ = [IStream]
+
 	def __init__(self, synthRef: weakref.ReferenceType):
 		self.synthRef = synthRef
+		self._writtenBytes = 0
 
-	def StartStream(self, streamNum, pos):
+	def ISequentialStream_RemoteWrite(self, pv: LP_c_ubyte, cb: int) -> int:
+		"""This is called when SAPI wants to write (output) a wave data chunk.
+		:param pv: A pointer to the first wave data byte.
+		:param cb: The number of bytes to write.
+		:returns: The number of bytes written.
+		"""
 		synth = self.synthRef()
 		if synth is None:
-			log.debugWarning("Called StartStream method on SapiSink while driver is dead")
-			return
-		if synth._audioDucker:
-			if audioDucking._isDebug():
-				log.debug("Enabling audio ducking due to starting speech stream")
-			synth._audioDucker.enable()
+			log.debugWarning("Called Write method on AudioStream while driver is dead")
+			return 0
+		if not synth.isSpeaking:
+			return 0
+		synth.player.feed(pv, cb)
+		self._writtenBytes += cb
+		return cb
+
+	def IStream_RemoteSeek(self, dlibMove: _LARGE_INTEGER, dwOrigin: int) -> _ULARGE_INTEGER:
+		"""This is called when SAPI wants to get the current stream position.
+		Seeking to another position is not supported.
+		:param dlibMove: The displacement to be added to the location indicated by the dwOrigin parameter.
+			Only 0 is supported.
+		:param dwOrigin: The origin for the displacement specified in dlibMove.
+			Only 1 (STREAM_SEEK_CUR) is supported.
+		:returns: The current stream position.
+		"""
+		if dwOrigin == 1 and dlibMove.QuadPart == 0:
+			# SAPI is querying the current position.
+			return _ULARGE_INTEGER(self._writtenBytes)
+		# Return E_NOTIMPL without logging an error.
+		raise ReturnHRESULT(hresult.E_NOTIMPL, None)
+
+	def IStream_Commit(self, grfCommitFlags: int):
+		"""This is called when MSSP wants to flush the written data.
+		Does nothing."""
+		pass
+
+
+class SapiSink(COMObject):
+	"""
+	Implements ISpNotifySink to handle SAPI event notifications.
+	Should be passed to ISpNotifySource::SetNotifySink().
+	Notifications will be sent on the original thread,
+	instead of being routed to the main thread.
+	"""
+
+	_com_interfaces_ = [ISpNotifySink]
+
+	def __init__(self, synthRef: weakref.ReferenceType):
+		self.synthRef = synthRef
 
-	def Bookmark(self, streamNum, pos, bookmark, bookmarkId):
+	def ISpNotifySink_Notify(self):
+		"""This is called when there's a new event notification.
+		Queued events will be retrieved."""
 		synth = self.synthRef()
 		if synth is None:
-			log.debugWarning("Called Bookmark method on SapiSink while driver is dead")
+			log.debugWarning("Called Notify method on SapiSink while driver is dead")
 			return
-		synthIndexReached.notify(synth=synth, index=bookmarkId)
+		# Get all queued events
+		eventSource = synth.tts.QueryInterface(ISpEventSource)
+		while True:
+			# returned tuple: (event, numFetched)
+			eventTuple = eventSource.GetEvents(1)  # Get one event
+			if eventTuple[1] != 1:
+				break
+			event = eventTuple[0]
+			if event.eEventId == 1:  # SPEI_START_INPUT_STREAM
+				self.StartStream(event.ulStreamNum, event.ullAudioStreamOffset)
+			elif event.eEventId == 2:  # SPEI_END_INPUT_STREAM
+				self.EndStream(event.ulStreamNum, event.ullAudioStreamOffset)
+			elif event.eEventId == 4:  # SPEI_TTS_BOOKMARK
+				self.Bookmark(
+					event.ulStreamNum,
+					event.ullAudioStreamOffset,
+					cast(event.lParam, c_wchar_p).value,
+					event.wParam,
+				)
+			# free lParam
+			if event.elParamType == 1 or event.elParamType == 2:  # token or object
+				pUnk = cast(event.lParam, POINTER(IUnknown))
+				del pUnk
+			elif event.elParamType == 3 or event.elParamType == 4:  # pointer or string
+				windll.ole32.CoTaskMemFree(event.lParam)
+
+	def StartStream(self, streamNum: int, pos: int):
+		synth = self.synthRef()
+		synth.isSpeaking = True
 
-	def EndStream(self, streamNum, pos):
+	def Bookmark(self, streamNum: int, pos: int, bookmark: str, bookmarkId: int):
 		synth = self.synthRef()
-		if synth is None:
-			log.debugWarning("Called Bookmark method on EndStream while driver is dead")
+		if not synth.isSpeaking:
 			return
+		# Bookmark event is raised before the audio after that point.
+		# Queue an IndexReached event at this point.
+		synth.player.feed(None, 0, lambda: self.onIndexReached(bookmarkId))
+
+	def EndStream(self, streamNum: int, pos: int):
+		synth = self.synthRef()
+		synth.isSpeaking = False
+		synth.player.idle()
 		synthDoneSpeaking.notify(synth=synth)
-		if synth._audioDucker:
-			if audioDucking._isDebug():
-				log.debug("Disabling audio ducking due to speech stream end")
-			synth._audioDucker.disable()
+
+	def onIndexReached(self, index: int):
+		synth = self.synthRef()
+		if synth is None:
+			log.debugWarning("Called onIndexReached method on SapiSink while driver is dead")
+			return
+		synthIndexReached.notify(synth=synth, index=index)
 
 
 class SynthDriver(SynthDriver):
@@ -110,6 +197,7 @@ class SynthDriver(SynthDriver):
 	supportedNotifications = {synthIndexReached, synthDoneSpeaking}
 
 	COM_CLASS = "SAPI.SPVoice"
+	CUSTOMSTREAM_COM_CLASS = "SAPI.SpCustomStream"
 
 	name = "sapi5"
 	description = "Microsoft Speech API version 5"
@@ -123,24 +211,21 @@ def check(cls):
 		except:  # noqa: E722
 			return False
 
-	ttsAudioStream = (
-		None  #: Holds the ISPAudio interface for the current voice, to aid in stopping and pausing audio
-	)
-	_audioDucker: Optional[audioDucking.AudioDucker] = None
-
 	def __init__(self, _defaultVoiceToken=None):
 		"""
 		@param _defaultVoiceToken: an optional sapi voice token which should be used as the default voice (only useful for subclasses)
 		@type _defaultVoiceToken: ISpeechObjectToken
 		"""
-		if audioDucking.isAudioDuckingSupported():
-			self._audioDucker = audioDucking.AudioDucker()
 		self._pitch = 50
+		self.player = None
+		self.isSpeaking = False
 		self._initTts(_defaultVoiceToken)
 
 	def terminate(self):
-		self._eventsConnection = None
 		self.tts = None
+		if self.player:
+			self.player.close()
+			self.player = None
 
 	def _getAvailableVoices(self):
 		voices = OrderedDict()
@@ -204,27 +289,31 @@ def _initTts(self, voice=None):
 			# Therefore, set the voice before setting the audio output.
 			# Otherwise, we will get poor speech quality in some cases.
 			self.tts.voice = voice
-		# SAPI5 automatically selects the system default audio device, so there's no use doing work if the user has selected to use the system default.
-		# Besides, our default value is not a valid endpoint ID.
-		if (outputDevice := config.conf["audio"]["outputDevice"]) != config.conf.getConfigValidation(
-			("audio", "outputDevice"),
-		).default:
-			for audioOutput in self.tts.GetAudioOutputs():
-				# SAPI's audio output IDs are registry keys. It seems that the final path segment is the endpoint ID.
-				if audioOutput.Id.endswith(outputDevice):
-					self.tts.audioOutput = audioOutput
-					break
-		self._eventsConnection = comtypes.client.GetEvents(self.tts, SapiSink(weakref.ref(self)))
+
+		self.tts.AudioOutput = self.tts.AudioOutput  # Reset the audio and its format parameters
+		fmt = self.tts.AudioOutputStream.Format
+		wfx = fmt.GetWaveFormatEx()
+		if self.player:
+			self.player.close()
+		self.player = nvwave.WavePlayer(
+			channels=wfx.Channels,
+			samplesPerSec=wfx.SamplesPerSec,
+			bitsPerSample=wfx.BitsPerSample,
+			outputDevice=config.conf["audio"]["outputDevice"],
+		)
+		audioStream = SynthDriverAudioStream(weakref.ref(self))
+		# Use SpCustomStream to wrap our IStream implementation and the correct wave format
+		customStream = comtypes.client.CreateObject(self.CUSTOMSTREAM_COM_CLASS)
+		customStream.BaseStream = audioStream
+		customStream.Format = fmt
+		self.tts.AudioOutputStream = customStream
+
+		# Set event notify sink
 		self.tts.EventInterests = (
 			SpeechVoiceEvents.StartInputStream | SpeechVoiceEvents.Bookmark | SpeechVoiceEvents.EndInputStream
 		)
-		from comInterfaces.SpeechLib import ISpAudio
-
-		try:
-			self.ttsAudioStream = self.tts.audioOutputStream.QueryInterface(ISpAudio)
-		except COMError:
-			log.debugWarning("SAPI5 voice does not support ISPAudio")
-			self.ttsAudioStream = None
+		notifySource = self.tts.QueryInterface(ISpNotifySource)
+		notifySource.SetNotifySink(SapiSink(weakref.ref(self)))
 
 	def _set_voice(self, value):
 		tokens = self._getVoiceTokens()
@@ -370,74 +459,14 @@ def outputTags():
 
 		text = "".join(textList)
 		flags = SpeechVoiceSpeakFlags.IsXML | SpeechVoiceSpeakFlags.Async
-		# Ducking should be complete before the synth starts producing audio.
-		# For this to happen, the speech method must block until ducking is complete.
-		# Ducking should be disabled when the synth is finished producing audio.
-		# Note that there may be calls to speak with a string that results in no audio,
-		# it is important that in this case the audio does not get stuck ducked.
-		# When there is no audio produced the startStream and endStream handlers are not called.
-		# To prevent audio getting stuck ducked, it is unducked at the end of speech.
-		# There are some known issues:
-		# - When there is no audio produced by the synth, a user may notice volume lowering (ducking) temporarily.
-		# - If the call to startStream handler is delayed significantly, users may notice a variation in volume
-		# (as ducking is disabled at the end of speak, and re-enabled when the startStream handler is called)
-
-		# A note on the synchronicity of components of this approach:
-		# SAPISink.StartStream event handler (callback):
-		# the synth speech is not blocked by this event callback.
-		# SAPISink.EndStream event handler (callback):
-		# assumed also to be async but not confirmed. Synchronicity is irrelevant to the current approach.
-		# AudioDucker.disable returns before the audio is completely unducked.
-		# AudioDucker.enable() ducking will complete before the function returns.
-		# It is not possible to "double duck the audio", calling twice yields the same result as calling once.
-		# AudioDucker class instances count the number of enables/disables,
-		# in order to unduck there must be no remaining enabled audio ducker instances.
-		# Due to this a temporary audio ducker is used around the call to speak.
-		# SAPISink.StartStream: Ducking here may allow the early speech to start before ducking is completed.
-		if audioDucking.isAudioDuckingSupported():
-			tempAudioDucker = audioDucking.AudioDucker()
-		else:
-			tempAudioDucker = None
-		if tempAudioDucker:
-			if audioDucking._isDebug():
-				log.debug("Enabling audio ducking due to speak call")
-			tempAudioDucker.enable()
-		try:
-			self.tts.Speak(text, flags)
-		finally:
-			if tempAudioDucker:
-				if audioDucking._isDebug():
-					log.debug("Disabling audio ducking  after speak call")
-				tempAudioDucker.disable()
+		self.tts.Speak(text, flags)
 
 	def cancel(self):
 		# SAPI5's default means of stopping speech can sometimes lag at end of speech, especially with Win8 / Win 10 Microsoft Voices.
-		# Therefore  instruct the underlying audio interface to stop first, before interupting and purging any remaining speech.
-		if self.ttsAudioStream:
-			self.ttsAudioStream.setState(SPAudioState.STOP, 0)
+		# Therefore instruct the audio player to stop first, before interrupting and purging any remaining speech.
+		self.isSpeaking = False
+		self.player.stop()
 		self.tts.Speak(None, SpeechVoiceSpeakFlags.Async | SpeechVoiceSpeakFlags.PurgeBeforeSpeak)
-		if self._audioDucker:
-			if audioDucking._isDebug():
-				log.debug("Disabling audio ducking due to setting output audio state to stop")
-			self._audioDucker.disable()
 
 	def pause(self, switch: bool):
-		# SAPI5's default means of pausing in most cases is either extremely slow
-		# (e.g. takes more than half a second) or does not work at all.
-		# Therefore instruct the underlying audio interface to pause instead.
-		if self.ttsAudioStream:
-			oldState = self.ttsAudioStream.GetStatus().State
-			if switch and oldState == SPAudioState.RUN:
-				# pausing
-				if self._audioDucker:
-					if audioDucking._isDebug():
-						log.debug("Disabling audio ducking due to setting output audio state to pause")
-					self._audioDucker.disable()
-				self.ttsAudioStream.setState(SPAudioState.PAUSE, 0)
-			elif not switch and oldState == SPAudioState.PAUSE:
-				# unpausing
-				if self._audioDucker:
-					if audioDucking._isDebug():
-						log.debug("Enabling audio ducking due to setting output audio state to run")
-					self._audioDucker.enable()
-				self.ttsAudioStream.setState(SPAudioState.RUN, 0)
+		self.player.pause(switch)
diff --git a/user_docs/en/changes.md b/user_docs/en/changes.md
@@ -52,6 +52,8 @@ To use this feature, "allow NVDA to control the volume of other applications" mu
 Prefix matching on command line flags, e.g. using `--di` for `--disable-addons` is no longer supported. (#11644, @CyrilleB79)
 * The keyboard settings for "Speak typed characters" and "Speak typed words" now have three options: Off, Always, and Only in edit controls. (#17505, @Cary-rowen)
   * By default, "Speak typed characters" is now set to "Only in edit controls".
+* Microsoft Speech API version 5 and Microsoft Speech Platform voices now use WASAPI for audio output, which may improve the responsiveness of those voices. (#13284, @gexgd0419)
+
 
 ### Bug Fixes
 
@@ -171,6 +173,9 @@ As the NVDA update check URL is now configurable directly within NVDA, no replac
 * `updateCheck.UpdateAskInstallDialog` no longer automatically performs an action when the update or postpone buttons are pressed.
 Instead, a `callback` property has been added, which returns a function that performs the appropriate action when called with the return value from the dialog. (#17582)
 * Dialogs opened with `gui.runScriptModalDialog` are now recognised as modal by NVDA. (#17582)
+* Because SAPI5 voices now use `nvwave.WavePlayer` to output audio: (#17592, @gexgd0419)
+  * `synthDrivers.sapi5.SPAudioState` has been removed.
+  * `synthDrivers.sapi5.SynthDriver.ttsAudioStream` has been removed.
 * Changed keyboard typing echo configuration from boolean to integer values. (#17505, @Cary-rowen)
   * `config.conf["keyboard"]["speakTypedCharacters"]` and `config.conf["keyboard"]["speakTypedWords"]` now use integer values.
   * Added `TypingEcho` enum in `config.configFlags` to represent these modes, 0=Off, 1=Only in edit controls, 2=Always.