
v0.2.0


@personalizedrefrigerator released this 25 Mar 22:58 (commit 9440ff2)

Summary

This release includes multilingual and French-only voice typing models for Joplin.

The base-q8_0 model is recommended for most users. The -small models are the slowest but most accurate of the models attached below; the -tiny models are the fastest but least accurate.

Note: The models containing -q4_0 have been observed to occasionally omit the last word of spoken input. This issue may be resolved by laurent22/joplin#12013.

Language-specific models

Two types of models are available in this release: French-only models and multilingual models. The French-only models end in .fr.zip. The multilingual models have no language specifier before the .zip.
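If you are scripting model downloads, the naming convention above can be checked mechanically. The helper below is a hypothetical illustration of that convention, not part of the release:

```python
# Illustrative sketch: classify a model file by this release's naming
# convention. French-only models end in ".fr.zip"; multilingual models
# have no language specifier before ".zip".

def model_language(filename: str) -> str:
    """Return 'fr' for French-only models, 'multilingual' otherwise."""
    if filename.endswith(".fr.zip"):
        return "fr"
    return "multilingual"

for name in ["whisper-base-q8_0.zip", "whisper-base-q8_0.fr.zip"]:
    print(name, "->", model_language(name))
```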

The multilingual models are more accurate for some languages than others, and should have word error rates similar to those presented in the Whisper paper, pages 23-24.

Training data

Upstream training: The attached models are fine-tuned from OpenAI Whisper models. See OpenAI/Whisper: Training data for details about the upstream training data. For more information about the upstream Whisper training process, see the Whisper paper.

Fine-tuning: The following datasets were used to fine-tune the models:

Accuracy

French

On the first 128 samples of the French multilingual Librispeech dataset's test split, the models have the following word error rates (WERs), character error rates (CERs), and average evaluation times (avg_times):

| Path | WER (%) | CER (%) | avg_time (s) |
| --- | --- | --- | --- |
| whisper-tiny-q4_0.bin | 44.9 | 21.3 | 1.64 |
| whisper-tiny-q8_0.bin | 38.6 | 17.3 | 1.78 |
| whisper-base-q4_0.bin | 31.1 | 14.2 | 2.93 |
| whisper-base-q8_0.bin | 27.4 | 11.8 | 3.70 |
| whisper-base-q8_0.fr.bin | 20.7 | 8.5 | 3.70 |
| whisper-small-q5_0.bin | 16.5 | 6.4 | 15.54 |
| whisper-small-q8_0.bin | 15.9 | 6.1 | 11.94 |
| whisper-small-q8_0.fr.bin | 15.1 | 5.6 | 11.95 |

Similarly, on the test split of the French (Voxpopuli) dataset:

| Path | WER (%) | CER (%) | avg_time (s) |
| --- | --- | --- | --- |
| whisper-tiny-q4_0.bin | 40.8 | 19.8 | 1.10 |
| whisper-tiny-q8_0.bin | 32.2 | 15.6 | 1.31 |
| whisper-base-q4_0.bin | 26.6 | 12.9 | 1.95 |
| whisper-base-q8_0.bin | 23.6 | 11.5 | 2.61 |
| whisper-base-q8_0.fr.bin | 12.3 | 6.5 | 2.50 |
| whisper-small-q5_0.bin | 15.1 | 8.0 | 10.98 |
| whisper-small-q8_0.bin | 15.0 | 8.0 | 8.42 |
| whisper-small-q8_0.fr.dynamic_ctx.bin | 15.2 | 8.1 | 8.50 |

English

On the first 128 samples of the English Voxpopuli dataset's test split, the models have the following word error rates (WERs), character error rates (CERs), and average evaluation times (avg_times):

| Path | WER (%) | CER (%) | avg_time (s) |
| --- | --- | --- | --- |
| whisper-tiny-q4_0.bin | 17.0 | 9.2 | 0.76 |
| whisper-tiny-q8_0.bin | 15.2 | 8.1 | 0.97 |
| whisper-base-q4_0.bin | 13.2 | 7.5 | 1.55 |
| whisper-base-q8_0.bin | 12.3 | 6.7 | 2.01 |
| whisper-base-q8_0.fr.bin | 123.5 | 85.9 | 3.63 |
| whisper-small-q5_0.bin | 10.7 | 6.2 | 9.18 |
| whisper-small-q8_0.bin | 10.4 | 6.0 | 6.88 |
| whisper-small-q8_0.fr.bin | 10.1 | 5.7 | 6.90 |

A smaller error rate is better.
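For reference, WER and CER are standard edit-distance metrics: the word-level (or character-level) Levenshtein distance between the transcription and the reference, divided by the reference length. The sketch below shows one minimal way to compute them; the actual evaluation pipeline for this release may differ (see the linked comparison notebook):

```python
# Minimal sketch of WER/CER as normalized Levenshtein distances.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# One deleted word out of six reference words -> WER of 1/6.
print(f"WER: {wer('the cat sat on the mat', 'the cat sat on mat'):.1%}")
```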

See the Whisper model comparison notebook for additional details.