Force Docspell's OCR Engine to apply #1628

@Snify89

According to the documentation, you may set the setting:

# For PDF files it is first tried to read the text parts of the
# PDF. But PDFs can be complex documents and they may contain text
# and images. If the returned text is shorter than the value
# below, OCR is run afterwards. Then both extracted texts are
# compared and the longer will be used.
DOCSPELL_JOEX_EXTRACTION_PDF_MIN__TEXT__LEN=500
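To make the quoted comment concrete, here is a minimal sketch of the selection rule as I read it from the documentation. It is not Docspell's actual code (Docspell is written in Scala); `choose_text`, `run_ocr` and `min_text_len` are hypothetical names used only for illustration.

```python
from typing import Callable


def choose_text(embedded_text: str,
                run_ocr: Callable[[], str],
                min_text_len: int = 500) -> str:
    """Pick the text to keep for a PDF, following the documented rule.

    Hypothetical helper: a reading of the config comment, not joex itself.
    """
    # Embedded text is long enough: OCR is never run.
    if len(embedded_text) >= min_text_len:
        return embedded_text
    # Otherwise run OCR and keep whichever text is longer.
    ocr_text = run_ocr()
    return ocr_text if len(ocr_text) > len(embedded_text) else embedded_text
```

Under this reading, the threshold only decides whether OCR runs at all; even when it does run, the longer text is kept, not necessarily the OCR result.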

It would be nice to be able to set a value such as "-1" to force Docspell's own OCR result to be used.

My motivation:
I have plenty of PDF files that have already been OCRed.
However, some of them were processed incorrectly or not accurately enough.
Some were processed with the wrong language, some have encoding errors, etc.

For example:
I have a few (already OCRed) PDF files that contain text like this:
"T H I S I S A T E S T" instead of "THIS IS A TEST"

Because of this, the text embedded in such an already OCRed file is likely to pass joex's length check, and it is in any case longer than the text produced by joex's own OCR (which is shorter, but more accurate). So the embedded text always wins the length comparison.
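A quick illustration of the length problem, using only the example strings from above. The "longer wins" comparison here is my reading of the documented behavior, not joex's actual code, and the variable names are made up for the example.

```python
# Text embedded by the earlier (bad) OCR pass versus a correct OCR result.
embedded_text = "T H I S I S A T E S T"
joex_ocr_text = "THIS IS A TEST"

# The spaced-out text is longer even though it is the worse of the two.
print(len(embedded_text), len(joex_ocr_text))  # 21 14

# A "longer wins" comparison therefore keeps the garbled text.
chosen = max(embedded_text, joex_ocr_text, key=len)
print(chosen)  # T H I S I S A T E S T
```

The extra spaces alone add 7 characters here, so the inaccurate text wins no matter how good the new OCR result is.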

It would be nice to always force joex's OCR data to apply.
