Version 5 Changes

# Summary

## Major New Features
1. **Significantly** smaller file sizes
   1. 54% smaller file sizes for English, 73% smaller for Chinese (see #806 for details)
   2. This results in a **~50% decrease** in runtime for first-time users (who do not yet have the data downloaded/cached)
1. Significantly lower memory usage
   1. Worker memory utilization in the [web benchmark](https://github.com/naptha/tesseract.js/blob/dev/v5/benchmarks/browser/speed-benchmark.html) is reduced from 311 MB to 164 MB (47% reduction)
   2. The lower memory footprint makes it feasible to use more workers, significantly improving performance for projects that utilize [schedulers](https://github.com/naptha/tesseract.js/blob/master/docs/intro.md) for parallel processing
1. Compatible with iOS 17 (using default settings)
   1. iOS 17 broke compatibility with Tesseract.js v4; upgrading to v5 should resolve this
      1. See discussion section below for details
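
As a rough illustration of the scheduler point above, the sketch below spreads recognition jobs across several workers. The `Tesseract` argument is the imported Tesseract.js module (passed in as a parameter only so the function is easy to try with a stub), and the `recognizeInParallel` name and `workerCount` value are assumptions made for this example, not part of the library.

```javascript
// Sketch: run recognition jobs in parallel through a scheduler.
// `recognizeInParallel` and `workerCount` are hypothetical names for illustration.
async function recognizeInParallel(Tesseract, images, workerCount = 4) {
  const scheduler = Tesseract.createScheduler();
  try {
    // Create the workers concurrently, using the v5 `createWorker` signature.
    const workers = await Promise.all(
      Array.from({ length: workerCount }, () => Tesseract.createWorker("eng"))
    );
    workers.forEach((worker) => scheduler.addWorker(worker));

    // Each job is handed to whichever worker the scheduler picks.
    const results = await Promise.all(
      images.map((image) => scheduler.addJob("recognize", image))
    );
    return results.map((result) => result.data.text);
  } finally {
    // Terminating the scheduler also terminates the workers added to it.
    await scheduler.terminate();
  }
}
```

A sensible `workerCount` depends on the device; the memory figures above only suggest that more workers may now fit than under v4.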

## Breaking Changes Impacting Many Users
1. `createWorker` arguments changed
   1. Setting non-default language and OEM now happens in `createWorker`
      1. E.g. `createWorker("chi_sim", 1)`
1. `worker.initialize` and `worker.loadLanguage` functions now do nothing and can be deleted from code
   1. Loading the language and initialization now occurs in `createWorker`
   2. Workers can be re-initialized with different settings using `worker.reinitialize`

In other words, code should be modified from this:

```
const worker = await Tesseract.createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const ret = await worker.recognize(file);
```
To this:

```
const worker = await Tesseract.createWorker("eng");
const ret = await worker.recognize(file);
```


## Breaking Changes Impacting Fewer Users
1. Users who manually set `corePath` will need to update the contents of their `corePath` directory
   1. `corePath` should point to a directory that contains **all 4** of the files below from Tesseract.js-core v5:
      1. `tesseract-core.wasm.js`
      2. `tesseract-core-simd.wasm.js`
      3. `tesseract-core-lstm.wasm.js`
      4. `tesseract-core-simd-lstm.wasm.js`
   1. Tesseract.js will automatically select the correct version to use 
1. `worker.detect` function disabled by default
   1. Orientation + script detection is a function of the Legacy model only, which is no longer included by default
   2. To enable, set arguments `legacyCore: true` and `legacyLang: true` in `createWorker` options
      1. E.g. `Tesseract.createWorker("eng", 1, {legacyCore: true, legacyLang: true});`
1. Language of progress logs standardized
   1. This should only impact users who parse status logs (e.g. to update a loading bar)
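
For projects that drive a loading bar from the progress logs, a minimal sketch of a logger that does not depend on the exact status wording might look like the following. It assumes each log message carries a `status` string and a numeric `progress` between 0 and 1; `makeProgressLogger` and `onUpdate` are hypothetical names introduced for this example.

```javascript
// Sketch: convert logger messages into a percentage for a loading bar.
// Assumes `message.progress` is a number in [0, 1] and `message.status` is a string.
// `onUpdate` is a hypothetical callback supplied by the application.
function makeProgressLogger(onUpdate) {
  return (message) => {
    // Ignore messages that carry no numeric progress.
    if (typeof message.progress !== "number") return;
    // Clamp defensively so the bar never leaves the 0-100 range.
    const clamped = Math.min(1, Math.max(0, message.progress));
    onUpdate({ status: message.status, percent: Math.round(clamped * 100) });
  };
}
```

The returned function can be passed as the `logger` option, e.g. `Tesseract.createWorker("eng", 1, {logger: makeProgressLogger(render)})`, where `render` stands in for the application's own UI code. Displaying `status` as-is, rather than matching it against hard-coded strings, avoids depending on the wording of the logs.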

## Non-Breaking Changes
1. Language data loaded from `jsdelivr` by default (rather than GitHub pages)
   1. This should result in improved performance and uptime
1. Separate "development" build (which produced `tesseract.dev.js` and `worker.dev.js`) removed
1. Documentation and examples were modified to prevent new users from using `Tesseract.recognize` and `Tesseract.detect`
   1. Users who already use these functions are encouraged to modify their code to use `worker.recognize` and `worker.detect` instead

# Discussion

## How can file sizes be reduced by so much? 
Tesseract contains 2 recognition models: LSTM and Legacy.  The vast majority of users only use the LSTM model (the default).  However, the Legacy model takes up more space, and previous versions of Tesseract.js loaded all of the resources required for both models.  As a result, most users downloaded a large amount of code and data they never used.  For example, for Chinese (simplified) 73% of the size of the code and data was attributable to the (usually) unused Legacy model.

## What justifies the breaking changes to `createWorker`/`loadLanguage`/`initialize`? 
The primary reason is that these changes are necessary to facilitate the major improvement of v5—significantly reducing file sizes.  How this reduction is achieved is described in the answer directly above.  As Tesseract.js is a JavaScript library generally run in the browser, having reasonable file sizes is a high priority.  This is especially true as use on mobile devices becomes more common.  Making this improvement would have been impossible without combining `createWorker`/`loadLanguage`/`initialize`.  

Previously, the user specified which recognition model (OEM) to use during `initialize`.  As `initialize` was run after `createWorker` and `loadLanguage` (which load the code and language required for each model), there was no way for these functions to only load the data required for the chosen model.  By combining these functions, Tesseract.js knows what model is being used before it loads code or data, so can load only the required resources. 
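
To make the reasoning above concrete, here is an illustrative sketch of how a build could be chosen once the OEM is known up front. This is not the library's actual source: the `selectCoreFile` function and the `simdSupported` flag are hypothetical, and the assumption that the `-simd` builds are meant for environments with WebAssembly SIMD support is inferred from the file names alone.

```javascript
// Illustrative only: not Tesseract.js's real selection code.
// `selectCoreFile` and `simdSupported` are hypothetical names.
function selectCoreFile({ oem = 1, legacyCore = false, simdSupported = true } = {}) {
  // OEM 0 (Legacy) and OEM 2 (Legacy + LSTM) need the Legacy code;
  // this sketch treats every other value as LSTM-only.
  const needsLegacy = legacyCore || oem === 0 || oem === 2;
  const simd = simdSupported ? "-simd" : "";
  const lstm = needsLegacy ? "" : "-lstm";
  return `tesseract-core${simd}${lstm}.wasm.js`;
}
```

With the defaults, this returns the smaller LSTM-only build; requesting the Legacy model (or setting `legacyCore: true`) returns the larger combined build. That decision could not be made while the OEM was only supplied later, in `initialize`.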

In addition to this primary reason, combining these functions should simplify the process of creating a worker.  The large number of functions required to create a new worker (4 in `v3` and 3 in `v4`) was pushing some users towards using `Tesseract.recognize` instead (as this handles everything in a single function).  Simplifying the process of creating a new worker will hopefully result in more users using workers, which is more efficient than `Tesseract.recognize` (which creates and destroys a worker every time it is used). 
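
As a sketch of the worker-reuse pattern described above, the function below creates a single worker, uses it for every file, and terminates it at the end. The `recognizeAll` name is hypothetical, and `createWorker` is passed in as a parameter (for example `(lang) => Tesseract.createWorker(lang)`) purely so the function can be tried with a stub.

```javascript
// Sketch: reuse one worker for many files instead of creating one per call.
// `recognizeAll` is a hypothetical helper name used for illustration.
async function recognizeAll(createWorker, files, lang = "eng") {
  // In v5, language loading and initialization happen inside `createWorker`.
  const worker = await createWorker(lang);
  try {
    const texts = [];
    for (const file of files) {
      const { data } = await worker.recognize(file);
      texts.push(data.text);
    }
    return texts;
  } finally {
    // Free the worker's resources even if recognition throws.
    await worker.terminate();
  }
}
```

By contrast, calling `Tesseract.recognize` once per file would create and destroy a worker for each file.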

## How can I restore the old behavior (loading both LSTM + Legacy models)? 
Within `createWorker`, if you set `oem` to `0` (Tesseract Legacy) or `2` (Tesseract Legacy + LSTM), code and language data for both the Legacy and LSTM models will be loaded automatically.  You can force both models to be loaded regardless of `oem` by setting `legacyCore: true` and `legacyLang: true` in the `createWorker` options.  For example:

```
const worker = await Tesseract.createWorker("eng", 1, {legacyCore: true, legacyLang: true});
```

If your application re-initializes existing workers with a different language or OEM, this is now achieved using `worker.reinitialize` (rather than `worker.loadLanguage` and `worker.initialize`).  For example, the following snippet recognizes `file` using the LSTM model, and then switches to the Legacy model and re-runs recognition. 

```
const worker = await Tesseract.createWorker("eng", 1, {legacyCore: true, legacyLang: true});
const retLSTM = await worker.recognize(file);

await worker.reinitialize("eng", 0);
const retLegacy = await worker.recognize(file);
```

## How does this release impact iOS compatibility? 
iOS `v17.0` and `v17.1` include a bug that causes the Legacy + LSTM build of Tesseract.js to crash.  Apple patched this issue in iOS `v17.2`.  This bug does not impact the LSTM-only build, which became the default in Tesseract.js v5.  Therefore, developers who want their application to be compatible with iOS `v17.0` and `v17.1` are advised to upgrade to Tesseract.js v5. Discussion regarding this issue is documented in #804.

## I am still having trouble upgrading my project, what should I do? 
Start by reviewing the [examples directory](https://github.com/naptha/tesseract.js/tree/dev/v5/examples); most uses of Tesseract.js have a corresponding example.  If you are still struggling to upgrade your project after reviewing both this issue and the examples, feel free to open a new GitHub issue.
