[FEATURE] [BUG] Language selection for Tesseract/text extraction

This is both a bug (the language parameter is not passed to Tesseract, even though Tesseract can work in different languages) and a feature request (adding that behaviour to Islandora).

**Overview of feature request**
 
Problem: When documents, paged content, etc. are ingested into Islandora, the Tesseract microservice runs OCR. [Tesseract does seem to be installed with a handful of other languages](https://github.com/Islandora-Devops/isle-buildkit/blob/main/hypercube/Dockerfile#L22-L34), but Islandora natively **only** runs OCR in English: there is no way, in the normal ingest process, to specify a different language. As a result, documents containing non-English characters (e.g. accented letters, other alphabets) do not get accurate OCR.

Request: a method (possibly using Contexts?) to:

1. Identify the language via the Repository Item's Language field (would have to be configurable)
2. Transform the language term into the correct format for Tesseract
3. Pass the language as a parameter to Tesseract as part of the text extraction process
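
To make steps 2 and 3 concrete, here is a minimal sketch in Python. It is not Islandora's or the microservice's actual code. It assumes the Language field yields ISO 639-1 codes such as `sv`, and the mapping table and function names (`to_tesseract_lang`, `build_tesseract_args`) are hypothetical. The only Tesseract behaviour relied on is the `-l` option, which accepts a language name or several joined with `+`.

```python
from typing import List

# Hypothetical mapping from ISO 639-1 codes (assumed to come from the
# Repository Item's Language field) to Tesseract language names.
# It should only list languages that are actually installed.
ISO_TO_TESSERACT = {
    "en": "eng",
    "sv": "swe",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
}

# Fallback that matches the current English-only behaviour.
DEFAULT_LANG = "eng"


def to_tesseract_lang(codes: List[str]) -> str:
    """Step 2: turn field values into a Tesseract -l argument."""
    mapped: List[str] = []
    for code in codes:
        lang = ISO_TO_TESSERACT.get(code.strip().lower())
        # Skip unknown codes and avoid duplicates, preserving order.
        if lang and lang not in mapped:
            mapped.append(lang)
    # Tesseract accepts several languages joined with "+".
    return "+".join(mapped) if mapped else DEFAULT_LANG


def build_tesseract_args(image_path: str, codes: List[str]) -> List[str]:
    """Step 3: build the argument list, writing the OCR text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", to_tesseract_lang(codes)]


if __name__ == "__main__":
    print(build_tesseract_args("page.tiff", ["sv"]))
```

Falling back to `eng` when no known language is set keeps existing ingests unchanged, which may be a reasonable default for a configurable feature like this.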
 
**What kind of user is the feature intended for?**

Anyone ingesting content
 
 
**What inspired the request?**
 
I ingested a Swedish newspaper, only to find that the machine-generated OCR did not recognize any of the accented characters. On investigation, it turns out there is no way to activate non-English text extraction in Islandora natively, only through a special shell command to Tesseract.
 
**What existing behavior do you want changed?**
 
Send the language of the document as a parameter to Tesseract when extracting text.
 
**Any brand new behavior you want to add to Islandora?**
 
Provide a context for producing this behaviour, and perhaps a configuration to identify the Language field in the Repository Item content type.
 
**Any related open or closed issues to this feature request?**

None that I have identified.
