Force Docspell's OCR Engine to apply

According to the documentation, the following setting is available:

```conf
# For PDF files it is first tried to read the text parts of the
# PDF. But PDFs can be complex documents and they may contain text
# and images. If the returned text is shorter than the value
# below, OCR is run afterwards. Then both extracted texts are
# compared and the longer will be used.
DOCSPELL_JOEX_EXTRACTION_PDF_MIN__TEXT__LEN=500
```

It would be nice to be able to set a value like "-1" to force Docspell's own OCR result to be used.

My motivation:
I have plenty of PDF files that have already been OCRed.
However, some of them were processed incorrectly or not accurately enough.
Some were processed with the wrong language, have encoding errors, etc.

For example:
I have a few (already OCRed) PDF files that contain text like this:
"T H I S  I S  A T E S T" instead of "THIS IS A TEST"

Because of the extra spaces, the text already embedded in such a file is most likely long enough to exceed joex's length check. It is also always longer than the text produced by joex's own OCR, which is shorter but more accurate. As a result, the inaccurate embedded text always wins.
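To illustrate, here is a minimal Python sketch of the selection rule as I understand it from the config comment above. This is not Docspell's actual code: `choose_text`, `run_ocr` and the handling of a negative threshold are hypothetical names and behavior that I made up to show the problem and the requested change.

```python
from typing import Callable


def choose_text(embedded: str, run_ocr: Callable[[], str], min_text_len: int = 500) -> str:
    """Pick the text to keep for a PDF (illustrative only, not Docspell's code)."""
    # Hypothetical proposal: a negative threshold always forces the OCR result.
    if min_text_len < 0:
        return run_ocr()
    # Rule from the config comment: OCR only runs if the embedded text is short.
    if len(embedded) >= min_text_len:
        return embedded
    ocr_text = run_ocr()
    # The longer text wins, even if it is the less accurate one.
    return ocr_text if len(ocr_text) > len(embedded) else embedded


embedded = "T H I S  I S  A T E S T"  # text from the badly OCRed PDF
clean = "THIS IS A TEST"              # what a fresh OCR run would ideally return

current = choose_text(embedded, lambda: clean)                  # default threshold
forced = choose_text(embedded, lambda: clean, min_text_len=-1)  # proposed behavior
```

With the default threshold the spaced-out text is kept because it is longer. With the hypothetical `-1` the fresh OCR result is kept instead.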

It would be nice to have a way to always force joex's OCR result to be used.


Force Docspell's OCR Engine to apply #1628
