[FEATURE] [BUG] Language selection for Tesseract/text extraction #1064

@bondjimbond

This is both a bug (no language parameter is passed to Tesseract, even though Tesseract can work in different languages) and a feature request (adding that behaviour to Islandora).

Overview of feature request

Problem: When documents, paged content, etc. are ingested into Islandora, the Tesseract microservice runs OCR. Tesseract does seem to be installed with a handful of other languages, but Islandora natively treats every document as English: there is no way, in the normal ingest process, to specify a different language. This means that documents with non-English characters (e.g. accented letters, other alphabets) do not get proper OCR.

Request: a method (possibly using Contexts?) to:

  1. Identify the language via the Repository Item's Language field (would have to be configurable)
  2. Transform the language term into the correct format for Tesseract
  3. Pass the language as a parameter to Tesseract as part of the text extraction process
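
To make steps 2 and 3 concrete, here is a minimal sketch in Python (not Islandora's actual code). It assumes the Language field yields ISO 639-1 codes such as `sv`. The mapping table, the function names, and the file names are hypothetical and only illustrate the idea. The `-l` option and the `+` separator for multiple languages are part of the standard Tesseract command line.

```python
import shlex

# Hypothetical lookup table: ISO 639-1 codes (an assumption about what the
# Language field stores) to Tesseract language names. Only a few are listed.
ISO_TO_TESSERACT = {
    "en": "eng",
    "sv": "swe",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
}

# Fall back to English when no known language is given.
DEFAULT_LANG = "eng"


def to_tesseract_lang(codes):
    """Step 2: turn a list of language codes into a value for Tesseract's -l option."""
    langs = []
    for code in codes:
        lang = ISO_TO_TESSERACT.get(code.strip().lower())
        # Skip unknown codes and avoid duplicates, keeping the original order.
        if lang and lang not in langs:
            langs.append(lang)
    # Tesseract accepts several languages joined with "+", e.g. "swe+eng".
    return "+".join(langs) if langs else DEFAULT_LANG


def build_ocr_command(image_path, output_base, codes):
    """Step 3: build the tesseract argument list with the language parameter."""
    return ["tesseract", image_path, output_base, "-l", to_tesseract_lang(codes)]


if __name__ == "__main__":
    # Example with made-up file names for a Swedish newspaper page.
    command = build_ocr_command("page-0001.tif", "page-0001", ["sv"])
    print(shlex.join(command))
```

Whether the mapping should live in a Context, in configuration, or in the microservice itself is an open question; the sketch only shows the shape of the transformation.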

What kind of user is the feature intended for?

Anyone ingesting content

What inspired the request?

I ingested a Swedish newspaper, only to find that the machine-generated OCR was not recognizing any of the accented characters. After investigating, it turns out there is no way to activate non-English text extraction in Islandora natively, only through a special shell command to Tesseract.

What existing behavior do you want changed?

Send the language of the document as a parameter to Tesseract when extracting text.

Any brand new behavior you want to add to Islandora?

Provide a context for producing this behaviour, and perhaps a configuration setting to identify the Language field in the Repository Item content type.

Any related open or closed issues to this feature request?

None I have identified.
