Conversation

to-audiobook

Very long audio files might crash whisperX during the call to load_audio() if the system runs out of RAM.

This PR adds the parameter useTmpFiles to load_audio(), which makes ffmpeg resample the audio through a temporary file instead of doing it all in memory, thus substantially increasing the audio length whisperX can handle.
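A minimal sketch of the temp-file approach described above. The function and parameter names here are illustrative, not necessarily the ones in the patch: ffmpeg writes raw 16-bit PCM to a temporary file, which is then loaded and normalized, rather than streaming the whole decoded audio through an in-memory pipe.

```python
import os
import subprocess
import tempfile

import numpy as np

SAMPLE_RATE = 16000  # whisper models expect 16 kHz mono


def pcm16_to_float32(raw: bytes) -> np.ndarray:
    """Normalize little-endian 16-bit PCM bytes to float32 in [-1, 1)."""
    return np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0


def load_audio_via_tmp(path: str, sr: int = SAMPLE_RATE) -> np.ndarray:
    """Decode/resample with ffmpeg into a temporary raw PCM file,
    then load that file, avoiding one huge in-memory pipe buffer."""
    tmp = tempfile.NamedTemporaryFile(suffix=".pcm", delete=False)
    tmp.close()
    try:
        subprocess.run(
            ["ffmpeg", "-nostdin", "-y", "-i", path,
             "-f", "s16le", "-ac", "1", "-ar", str(sr), tmp.name],
            check=True, capture_output=True,
        )
        pcm = np.fromfile(tmp.name, dtype=np.int16)
    finally:
        os.remove(tmp.name)
    return pcm.astype(np.float32) / 32768.0
```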

NOTE: if the audio file is long enough to crash the system during the load_audio() call, requiring the use of temporary files, the system will probably run out of memory during the diarization stage too. I dealt with that by splitting the source audio into two or more parts, using the timestamps from the alignment stage to avoid splitting the audio in the middle of a sentence. That code is not included in this patch and, even if it were, it still has some issues: if we split the audio, the diarization might assign different speaker IDs to the same speaker in each of those parts.
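The split-point selection mentioned above could be sketched roughly as follows. This is my reading of the idea, not the author's actual code: given the aligned segments as (start, end) times, pick the segment boundaries closest to evenly spaced targets so no cut lands mid-sentence.

```python
from typing import List, Tuple


def choose_split_points(
    segments: List[Tuple[float, float]], n_parts: int
) -> List[float]:
    """Pick split times at aligned-segment boundaries.

    `segments` is a list of (start, end) timestamps from the alignment
    stage. Returns n_parts - 1 split times, each snapped to the segment
    boundary nearest to an even division of the total duration, so the
    audio is never cut in the middle of a sentence.
    """
    total = segments[-1][1]
    # candidate cut points: the end of every segment except the last
    boundaries = [end for _, end in segments[:-1]]
    targets = [total * i / n_parts for i in range(1, n_parts)]
    return [min(boundaries, key=lambda b: abs(b - t)) for t in targets]
```

As the note says, this does not solve the remaining problem: diarization run separately on each part may label the same speaker differently.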

Freeing the ffmpeg output buffer after its contents are loaded into a numpy array leaves much more free RAM for the subsequent operations.

This helps avoid running out of RAM when dealing with very long audio files.
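The buffer-freeing idea can be illustrated like this (names are mine; the key point is that `.astype()` makes an independent copy, so the original bytes object can be dropped):

```python
import numpy as np


def bytes_to_audio(raw: bytes) -> np.ndarray:
    # np.frombuffer only creates a view over `raw`; .astype(np.float32)
    # then copies into a fresh array, so the result no longer references
    # the original buffer and `raw` can be garbage-collected.
    return np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0


# usage sketch: after converting, drop the (potentially huge) ffmpeg
# stdout buffer so the interpreter can reclaim that memory:
#
#   out = ffmpeg_process.stdout  # raw PCM bytes
#   audio = bytes_to_audio(out)
#   del out
```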
Apparently they changed some default argument values after transformers v4.51.0. I believe num_beams is the culprit: it used to be 1, now it is set to 5. See

huggingface/transformers#40682

So, according to them, if you pass num_beams=1 to the pipeline, versions >4.51.0 will be as fast as before. But since I am not exactly sure where to put that, I'll just lock the version for now.
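Locking the version could look like this in a pip-managed environment (a sketch; the exact pin depends on where the project declares its dependencies):

```shell
# Pin transformers at the last release before the default changed,
# per the comment above ("after transformers v4.51.0").
pip install "transformers<=4.51.0"
```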