
Conversation

@Narsil (Contributor) commented Nov 7, 2022

What does this PR do?

This adds `chunk_length_s` support to seq2seq models.

Approach

Since we have no way of finding a matching between output tokens and input audio with seq2seq models,
this is an alternative route.

This runs the pipeline on the various chunks and collects all generated outputs.
It then tries to find the longest sequence of non-special ids that could correspond
to the overlapping subsequences within the batch (see the sketch below).
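
As a rough illustration, here is a minimal sketch of the stitching idea (exact-match only, whereas the actual implementation also tolerates a few token mismatches; the helper name is made up for this example):

def stitch_token_sequences(sequences):
    # Greedily merge chunk outputs on their longest suffix/prefix overlap.
    merged = list(sequences[0])
    for seq in sequences[1:]:
        seq = list(seq)
        best_overlap = 0
        # Try every possible overlap length, keeping the longest exact match.
        for overlap in range(1, min(len(merged), len(seq)) + 1):
            if merged[-overlap:] == seq[:overlap]:
                best_overlap = overlap
        # Append only the part of the new chunk that was not already covered.
        merged.extend(seq[best_overlap:])
    return merged

# Three overlapping chunk outputs stitched back into one sequence:
print(stitch_token_sequences([[5, 6, 7, 8], [7, 8, 9, 10], [9, 10, 11]]))
# -> [5, 6, 7, 8, 9, 10, 11]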

Pros

  • It should work on any seq2seq models
  • It should work decently when the stride is long enough to give good overlap of tokens, so that the stitching can work correctly
  • It should be somewhat robust to a few token errors
  • It should perform best on mostly continuous talk (so that there is model output that can overlap)

Cons

  • This method is unsound and will fail under some circumstances
  • It will fail when there is silence in the overlap. If there is silence then there are no overlapping tokens, and the stitching might get lost. By default it will concatenate, but it might be thrown off by boundaries in the stride.
  • It will fail spectacularly when someone repeats a single word over and over. Then the overlap might be TOO large. This is impossible to distinguish without access to timestamps (which only Whisper can currently produce, and that comes with caveats). The current algorithm will favor long chains of matching tokens.
  • It will have issues with capitalization and out-of-domain areas. For instance, "Yes, sir." and "Sir Thomas" might be 2 chunks which have different capitalization. Since the current algorithm works at the token level, the two tokens "sir" and "Sir" are different and will fail to match, leading to a "Yes, sir. Sir Thomas" stitching instead of the intended "Yes, Sir Thomas.".

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@sgugger (Collaborator) left a comment

Thanks for working on this. Not sure if the PR is ready for (at least core maintainer) review yet?

Comment on lines 284 to 287
# if self.type not in {"ctc", "ctc_with_lm"}:
# raise ValueError(
# "`chunk_length_s` is only valid for CTC models, use other chunking options for other models"
# )
Collaborator

To clean up?

Comment on lines 149 to 152
# self.assertEqual(
# str(v.exception),
# "`chunk_length_s` is only valid for CTC models, use other chunking options for other models",
# )
Collaborator

To clean up as well?

Comment on lines 272 to 280
# waveform = np.tile(np.arange(1000, dtype=np.float32), 34)
# output = speech_recognizer(waveform)
# self.assertEqual(output, {"text": ""})

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation").sort("id")
filename = ds[40]["file"]
# output = speech_recognizer(filename)
# self.assertEqual(output, {"text": " A man said to the universe, Sir, I exist."})
print(filename)
Collaborator

Comments and print statements to clean up.

@Narsil (Contributor, Author) commented Nov 7, 2022

Thanks for working on this. Not sure if the PR is ready for (at least core maintainer) review yet?

Yup, sorry, it was slightly too early for you.
The core idea is still there.

We chunk with stride, and we make a hopeful stitch to find the longest sequence from all the subsequences (see the sketch below).
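
For illustration, a minimal sketch of the chunking-with-stride part (hypothetical helper, not the pipeline's actual code): consecutive windows overlap by the left and right strides, so the stitching step has shared tokens to match on.

def chunk_with_stride(audio, chunk_len, stride_left, stride_right):
    # Step forward by chunk_len minus both strides so that consecutive
    # chunks overlap by stride_left + stride_right samples.
    step = chunk_len - stride_left - stride_right
    for start in range(0, len(audio), step):
        yield audio[start : start + chunk_len]

audio = list(range(100))
chunks = list(chunk_with_stride(audio, chunk_len=40, stride_left=10, stride_right=10))
print([(c[0], c[-1]) for c in chunks])
# -> [(0, 39), (20, 59), (40, 79), (60, 99), (80, 99)]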

PROs:

  • It's extremely generic.
  • It should work in a lot of scenarios including repeating tokens

CONs:

  • It's technically unsound, meaning that if the model infers widely varying tokens, there's no way to reconstruct what the model would actually predict on the whole file.
  • I expect it can fail spectacularly on well-crafted examples where someone repeats the same word over and over, where the longest match will be MUCH longer than what was actually voiced.

@ArthurZucker (Collaborator) commented Nov 8, 2022

As we discussed offline with @Narsil, I will be implementing `find_common_sequence` in O(N) 😉 Will open a new PR!

@Narsil (Contributor, Author) commented Nov 8, 2022

As we discussed offline with @Narsil, I will be implementing `find_common_sequence` in O(N) 😉 Will open a new PR!

Seems it's going to be complex because of the fault tolerance, which does seem to be important.

You can try doing something like

#!wget https://www.archive.org/download/around_world_80_days_mfs_librivox/around_world_in_80_days_01_verne.mp3
from transformers import pipeline

speech_recognizer = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-small",
    framework="pt",
    batch_size=2,
    device=0,
    chunk_length_s=30,
    generate_kwargs={"max_new_tokens": 1024},
)

out = speech_recognizer(["around_world_in_80_days_01_verne.mp3"])
print(out)

This will require some suboptimal stitches to work.

@Narsil Narsil requested a review from sgugger November 8, 2022 17:02
@Narsil (Contributor, Author) commented Nov 8, 2022

@sgugger it's now ready for review.

The TODO is left intentionally. It might really become relevant on hour+ long files, where the current naive algorithm might become too slow. However, the code is likely to be orders of magnitude more complex (if an O(n) solution exists; I'm pretty sure we could find an expected O(n) algorithm, but I'm not sure about the worst case).
The current code works correctly and has the fault tolerance we need to be useful.

I added a warning because the current code will fail in some known circumstances. I updated the PR description to reflect those. If those trade-offs are not good enough, I'm happy not to merge this PR in this state.

The only other option I see is Whisper-specific (using timestamps), and it would only alleviate some of the issues.

@ArthurZucker (Collaborator)

Before merging, I would love to try it a little bit; otherwise LGTM (looking for a solution to the faults).

@Narsil (Contributor, Author) commented Nov 14, 2022

@ArthurZucker What are your conclusions?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ArthurZucker (Collaborator)

I think that including timestamp tokens in the process could help with the error tolerance, as they are consistently predicted at the end of pauses in the speech. If the stride is big enough to at least include pauses in speech, it boils down to matching these.
Moreover, given that we know approximately the time between tokens, we can use this as some kind of guiding information. I am working on something, but we can merge for now and have an improved PR later on 😉

@Narsil (Contributor, Author) commented Nov 14, 2022

@sgugger would like your opinion on this if possible.

The results are pretty decent IMO on regular speech. I'm still mentioning the caveats because they are real.

@ArthurZucker (Collaborator) left a comment

LGTM thanks a lot for working on this

Comment on lines +273 to +274
output = speech_recognizer([filename], chunk_length_s=5, batch_size=4)
self.assertEqual(output, [{"text": " A man said to the universe, Sir, I exist."}])
Collaborator

Nice

@sgugger (Collaborator) left a comment

Just one comment on the warning, otherwise LGTM! Thanks!

Comment on lines 289 to 293
logger.warning(
    "Using `chunk_length_s` is very experimental. The results will not necessarily be entirely"
    " accurate and will have caveats. More information:"
    " https://github.com/huggingface/transformers/pull/20104"
)
Collaborator

Can we add some logic to only throw this warning once? Users are complaining Transformers is too verbose.

Contributor Author

Is there already an existing way to do that?

Otherwise I can create some tool for it.
Is there any other location where we could add this "single" warning? (Will add in a different PR.)

Collaborator

We use a dict in the state like this one. No need to overengineer another solution IMO.
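
For reference, a minimal sketch of what such a warn-once guard could look like (hypothetical names, not the actual transformers helper):

import logging

logger = logging.getLogger(__name__)
_warned_once = set()  # module-level state recording which warnings have already fired

def warning_once(key, message):
    # Emit a given warning only the first time it is requested.
    if key not in _warned_once:
        _warned_once.add(key)
        logger.warning(message)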

Contributor Author

Done.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.


@Narsil Narsil merged commit 25c451e into huggingface:main Nov 14, 2022
@Narsil Narsil deleted the whisper_chunking branch November 14, 2022 22:57
@pearl-yu pearl-yu mentioned this pull request Mar 16, 2023