
v0.2.0


@personalizedrefrigerator released this 25 Mar 22:58 (commit 9440ff2)

Summary

This release includes multilingual and French-only voice typing models for Joplin.

The base-q8_0 model is recommended for most users. The -small models are the slowest but most accurate of the models attached below; the -tiny models are the fastest but least accurate.

Note: The models containing -q4_0 have been observed to occasionally omit the last word of spoken input. This issue may be resolved by laurent22/joplin#12013.

Language-specific models

Two types of models are available in this release: French-only models and multilingual models. The French-only models end in .fr.zip. The multilingual models have no language specifier before the .zip.
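If you are scripting model downloads, the naming convention above can be checked mechanically. The helper below is a hypothetical illustration of that convention, not part of the release:

```python
# Illustrative sketch: classify a model file by this release's naming
# convention. French-only models end in ".fr.zip"; multilingual models
# have no language specifier before ".zip".

def model_language(filename: str) -> str:
    """Return 'fr' for French-only models, 'multilingual' otherwise."""
    if filename.endswith(".fr.zip"):
        return "fr"
    return "multilingual"

for name in ["whisper-base-q8_0.zip", "whisper-base-q8_0.fr.zip"]:
    print(name, "->", model_language(name))
```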

The multilingual models are more accurate for some languages than others, and should have word error rates similar to those presented in the Whisper paper, pages 23-24.

Training data

Upstream training: The attached models are fine-tuned from OpenAI Whisper models. See OpenAI/Whisper: Training data for details about the upstream training data. For more information about the upstream Whisper training process, see the Whisper paper.

Fine-tuning: The following datasets were used to fine-tune the models:

Accuracy

French

On the first 128 samples of the French multilingual Librispeech dataset's test split, the models have the following word error rates (WERs), character error rates (CERs), and average evaluation times (avg_times):

| Path | WER (%) | CER (%) | avg_time (s) |
| --- | --- | --- | --- |
| whisper-tiny-q4_0.bin | 44.9 | 21.3 | 1.64 |
| whisper-tiny-q8_0.bin | 38.6 | 17.3 | 1.78 |
| whisper-base-q4_0.bin | 31.1 | 14.2 | 2.93 |
| whisper-base-q8_0.bin | 27.4 | 11.8 | 3.70 |
| whisper-base-q8_0.fr.bin | 20.7 | 8.5 | 3.70 |
| whisper-small-q5_0.bin | 16.5 | 6.4 | 15.54 |
| whisper-small-q8_0.bin | 15.9 | 6.1 | 11.94 |
| whisper-small-q8_0.fr.bin | 15.1 | 5.6 | 11.95 |

Similarly, on the test split of the French (Voxpopuli) dataset:

| Path | WER (%) | CER (%) | avg_time (s) |
| --- | --- | --- | --- |
| whisper-tiny-q4_0.bin | 40.8 | 19.8 | 1.10 |
| whisper-tiny-q8_0.bin | 32.2 | 15.6 | 1.31 |
| whisper-base-q4_0.bin | 26.6 | 12.9 | 1.95 |
| whisper-base-q8_0.bin | 23.6 | 11.5 | 2.61 |
| whisper-base-q8_0.fr.bin | 12.3 | 6.5 | 2.50 |
| whisper-small-q5_0.bin | 15.1 | 8.0 | 10.98 |
| whisper-small-q8_0.bin | 15.0 | 8.0 | 8.42 |
| whisper-small-q8_0.fr.dynamic_ctx.bin | 15.2 | 8.1 | 8.50 |

English

On the first 128 samples of the English Voxpopuli dataset's test split, the models have the following word error rates (WERs), character error rates (CERs), and average evaluation times (avg_times):

| Path | WER (%) | CER (%) | avg_time (s) |
| --- | --- | --- | --- |
| whisper-tiny-q4_0.bin | 17.0 | 9.2 | 0.76 |
| whisper-tiny-q8_0.bin | 15.2 | 8.1 | 0.97 |
| whisper-base-q4_0.bin | 13.2 | 7.5 | 1.55 |
| whisper-base-q8_0.bin | 12.3 | 6.7 | 2.01 |
| whisper-base-q8_0.fr.bin | 123.5 | 85.9 | 3.63 |
| whisper-small-q5_0.bin | 10.7 | 6.2 | 9.18 |
| whisper-small-q8_0.bin | 10.4 | 6.0 | 6.88 |
| whisper-small-q8_0.fr.bin | 10.1 | 5.7 | 6.90 |

A smaller error rate is better.
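For reference, WER and CER are standard edit-distance metrics: the word-level (or character-level) Levenshtein distance between the transcription and the reference, divided by the reference length. The sketch below shows one minimal way to compute them; the actual evaluation pipeline for this release may differ (see the linked comparison notebook):

```python
# Minimal sketch of WER/CER as normalized Levenshtein distances.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# One deleted word out of six reference words -> WER of 1/6.
print(f"WER: {wer('the cat sat on the mat', 'the cat sat on mat'):.1%}")
```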

See the Whisper model comparison notebook for additional details.