Description
This is both a bug (no language parameter is passed to Tesseract, even though Tesseract can run OCR in different languages) and a feature request (adding that behaviour to Islandora).
Overview of feature request
Problem: When documents, paged content, etc. are ingested into Islandora, the Tesseract microservice runs OCR. Tesseract does seem to be installed with a handful of other languages, but Islandora natively only requests OCR in English: there is no way, in the normal ingest processes, to specify a different language. This means that documents with non-English characters (e.g. accents, different alphabets) do not get proper OCR.
Request: a method (possibly using Contexts?) to:
- Identify the language via the Repository Item's Language field (would have to be configurable)
- Transform the language term into the correct format for Tesseract
- Pass the language as a parameter to Tesseract as part of the text extraction process
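
To make the three steps above concrete, here is a minimal Python sketch. It is not how the Islandora microservice is implemented; the mapping table, function names, and file name are hypothetical, and it assumes the Language field yields an ISO 639-1 style code such as `sv` or `sv-SE`. The only real Tesseract features it relies on are the `-l` language option and the `stdout` output target, and the matching language data would have to be installed for the call to succeed.

```python
# Hypothetical mapping from ISO 639-1 codes (as a Language field might
# store them) to the language names Tesseract uses for its trained data.
TESSERACT_LANGS = {
    "en": "eng",
    "sv": "swe",
    "fr": "fra",
    "de": "deu",
}


def to_tesseract_lang(code, default="eng"):
    """Transform a language code into a Tesseract language name.

    Falls back to `default` when the code is missing or unmapped,
    which mirrors the current English-only behaviour.
    """
    if not code:
        return default
    # Accept regional variants such as "sv-SE" or "sv_SE".
    primary = code.strip().lower().replace("_", "-").split("-")[0]
    return TESSERACT_LANGS.get(primary, default)


def build_ocr_command(image_path, language_code):
    """Build the argument list for a Tesseract call with an explicit language."""
    lang = to_tesseract_lang(language_code)
    # `stdout` makes Tesseract write the recognised text to standard output.
    return ["tesseract", image_path, "stdout", "-l", lang]


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(" ".join(build_ocr_command("page-001.tif", "sv")))
```

Falling back to `eng` keeps existing ingests working when an item has no language set or the term has no mapping; whether that default should itself be configurable is an open design question.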
 
What kind of user is the feature intended for?
Anyone ingesting content
What inspired the request?
Ingested a Swedish newspaper, only to find that the machine-generated OCR was not recognizing any of the accented characters. On investigation, it turns out there is no way to activate non-English text extraction in Islandora natively, only through a special shell command to Tesseract.
What existing behavior do you want changed?
Send the language of the document as a parameter to Tesseract when extracting text.
Any brand new behavior do you want to add to Islandora?
Provide a context for producing this behaviour, and perhaps a configuration option to identify the Language field in the Repository Item content type.
Any related open or closed issues to this feature request?
None I have identified.