Adding chunking for whisper (all seq2seq actually). Very crude matching algorithm. #20104
Conversation
sgugger left a comment
Thanks for working on this. Not sure if the PR is ready for (at least core maintainer) review yet?
```python
# if self.type not in {"ctc", "ctc_with_lm"}:
#     raise ValueError(
#         "`chunk_length_s` is only valid for CTC models, use other chunking options for other models"
#     )
```
To clean up?
```python
# self.assertEqual(
#     str(v.exception),
#     "`chunk_length_s` is only valid for CTC models, use other chunking options for other models",
# )
```
To clean up as well?
```python
# waveform = np.tile(np.arange(1000, dtype=np.float32), 34)
# output = speech_recognizer(waveform)
# self.assertEqual(output, {"text": ""})
```

```python
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation").sort("id")
filename = ds[40]["file"]
# output = speech_recognizer(filename)
# self.assertEqual(output, {"text": " A man said to the universe, Sir, I exist."})
print(filename)
```
Comments and print statements to clean up.
Yup, sorry, it was slightly early for you. We chunk with stride, and we make a hopeful stitch to find the longest sequence from all the subsequences; a minimal sketch of the chunking step follows the list.
PROs:
CONs:
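For intuition, here is a minimal sketch of the chunk-with-stride step. The name `chunk_with_stride` is illustrative only; the actual pipeline operates on feature-extractor inputs, not raw arrays like this:

```python
import numpy as np

def chunk_with_stride(audio: np.ndarray, chunk_len: int, stride: int):
    # Yield overlapping windows: consecutive chunks share `stride`
    # samples on each side, and the overlap is what the stitch matches on.
    assert chunk_len > 2 * stride, "chunks must advance by at least one sample"
    step = chunk_len - 2 * stride
    for start in range(0, len(audio), step):
        yield audio[start : start + chunk_len]
        if start + chunk_len >= len(audio):
            break

# e.g. a 10-sample array with chunk_len=6, stride=2 yields
# windows [0:6], [2:8], [4:10]
list(chunk_with_stride(np.arange(10.0), chunk_len=6, stride=2))
```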
As we discussed offline with @Narsil, will be implementing the
Seems it's going to be complex because of fault tolerance, which does seem to be important. You can try doing something like:

```python
#!wget https://www.archive.org/download/around_world_80_days_mfs_librivox/around_world_in_80_days_01_verne.mp3
from transformers import pipeline

speech_recognizer = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-small",
    framework="pt",
    batch_size=2,
    device=0,
    chunk_length_s=30,
    generate_kwargs={"max_new_tokens": 1024},
)
out = speech_recognizer(["around_world_in_80_days_01_verne.mp3"])
print(out)
```

This will require some suboptimal stitches to work.
@sgugger it's now ready for review. The TODO is left intentionally. It might really become relevant on hour-plus-long files, where the current naive algorithm might become too slow. However, the code is likely to be orders of magnitude more complex (if an O(n) solution exists; I'm pretty sure we could find an expected O(n) algorithm, but not sure about the worst case). I added a warning because the current code will fail in some known circumstances. I updated the PR description to reflect those. If those tradeoffs are not good enough, I'm happy to not merge this PR in this state. The only other option I see is whisper-specific with timestamps, and it would only alleviate some of the issues.
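For reference, a rough sketch of what a naive stitch of chunk outputs can look like; `stitch` is a hypothetical helper, and the heuristic actually merged in the PR differs in its details:

```python
def stitch(sequences):
    # Greedily merge consecutive token sequences: find the longest run
    # where the tail of the accumulated result equals the head of the
    # next chunk, then append only the unseen suffix. Quadratic in the
    # overlap size, hence the O(n) discussion above.
    result = list(sequences[0])
    for seq in sequences[1:]:
        best_overlap = 0
        for k in range(1, min(len(result), len(seq)) + 1):
            if result[-k:] == seq[:k]:
                best_overlap = k
        result.extend(seq[best_overlap:])
    return result

print(stitch([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7]]))  # [1, 2, 3, 4, 5, 6, 7]
```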
Before merging, would love to try a little bit, otherwise LGTM (looking for a solution to the faults) |
@ArthurZucker What are your conclusions?
(force-pushed from 5eb0179 to bd13f54)
I think that including timestamp tokens in the process could help with the error tolerance, as they are consistently predicted at the end of pauses in the speech. If the stride is big enough to at least include pauses in the speech, it boils down to matching these.
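As a sketch of that idea (an assumption, not code from the PR): Whisper's timestamp tokens occupy ids at or above a `timestamp_begin` offset, so each chunk's output could be cut at those tokens and aligned segment-by-segment:

```python
def split_on_timestamps(token_ids, timestamp_begin):
    # Cut a chunk's output at timestamp tokens (ids >= timestamp_begin),
    # yielding the text segments between pauses; matching whole segments
    # across overlapping chunks is more forgiving than raw token runs.
    segments, current = [], []
    for t in token_ids:
        if t >= timestamp_begin:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(t)
    if current:
        segments.append(current)
    return segments
```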
@sgugger would like your opinion on this if possible. The results are pretty decent imo on regular speech. I'm still mentioning the caveats because they are real.
ArthurZucker left a comment
LGTM thanks a lot for working on this
```python
output = speech_recognizer([filename], chunk_length_s=5, batch_size=4)
self.assertEqual(output, [{"text": " A man said to the universe, Sir, I exist."}])
```
Nice
sgugger left a comment
Just one comment on the warning, otherwise LGTM! Thanks!
```python
logger.warning(
    "Using `chunk_length_s` is very experimental. The results will not necessarily be entirely"
    " accurate and will have caveats. More information:"
    " https://github.com/huggingface/transformers/pull/20104"
)
```
Can we add some logic to only throw this warning once? Users are complaining Transformers is too verbose.
Is there already an existing way to do that?
Otherwise I can create some tool for it.
Any other location where we could add this "single" warning? (Will add in a different PR.)
We use a dict in the state like this one. No need to overengineer another solution IMO.
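For illustration, a minimal warn-once mechanism along those lines; the names here are hypothetical sketches, not the actual transformers internals:

```python
import logging

logger = logging.getLogger(__name__)
_already_warned = set()  # process-wide state, in the spirit of "a dict in the state"

def warn_once(key, message):
    # Log each distinct warning a single time, then stay quiet.
    if key not in _already_warned:
        _already_warned.add(key)
        logger.warning(message)
```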
Done.
What does this PR do?
This adds `chunk_length_s` to `seq2seq` algorithms.
Approach
Since we have no way of finding a matching between output and input with `seq2seq`, this is an alternative route.
This runs the pipeline on the various chunks and finds all generated output.
Then it tries to find the longest sequence of non-special ids that could correspond to the subsequences within the batch.
Pros
Cons
- Not exactly what `whisper` can currently do (and it does come with caveats).
- The current algorithm will favor long chains of matching tokens: "sir" and "Sir" are different and will fail to match, leading to a "Yes, sir. Sir Thomas" stitching instead of the intended "Yes, Sir Thomas." (see the sketch below).

Fixes # (issue)
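As a quick way to see that caveat (a hypothetical check, not part of the PR): the tokenizer is case-sensitive, so the two capitalizations encode to different ids and an exact-token overlap match cannot bridge them:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/whisper-small")
# Case-sensitive BPE: the two id sequences below differ, so an
# exact-token stitch across overlapping chunks fails on them.
print(tok(" sir", add_special_tokens=False).input_ids)
print(tok(" Sir", add_special_tokens=False).input_ids)
```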
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.