Memory leak when using tess4j for parallel processing in docker environment #248

@milen-dimitrov

I encountered this memory leak a few weeks ago and have determined that it only occurs when doing parallel OCR processing with tess4j inside a Docker container.

While the container is running, the Java heap and native memory remain stable, but the container's RAM usage keeps increasing.

To reproduce the leak, I iterate over PDF files and create a 4-thread pool for each PDF file:
ExecutorService executor = Executors.newFixedThreadPool(4);

Each of the 4 threads processes one page at a time.
For each page, a new Tesseract() instance is created and tesseract.doOCR(pageImage) is called to do the OCR.
When processing of the PDF file finishes, I close the thread pool with executor.shutdownNow().
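
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the sample repository: the class name PerPdfPoolSketch and the ocrPage parameter are hypothetical, and ocrPage stands in for creating a new Tesseract() and calling tesseract.doOCR(pageImage), so that the sketch runs without tess4j on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class PerPdfPoolSketch {
    // ocrPage is a hypothetical stand-in for the per-page OCR step; in the real
    // project it would create a new Tesseract() and call tesseract.doOCR(pageImage).
    static <P> List<String> processPdf(List<P> pages, Function<P, String> ocrPage) throws Exception {
        // A fresh 4-thread pool is created for every PDF file.
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (P page : pages) {
                Callable<String> task = () -> ocrPage.apply(page);
                futures.add(executor.submit(task));
            }
            List<String> texts = new ArrayList<>();
            for (Future<String> future : futures) {
                texts.add(future.get()); // wait for each page, keeping page order
            }
            return texts;
        } finally {
            // The pool is discarded as soon as this PDF is finished.
            executor.shutdownNow();
        }
    }
}
```

Calling processPdf once per PDF file therefore creates and destroys four worker threads for every file, which is the condition under which the leak shows up.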

I was able to circumvent the leak by making the thread pool static and never shutting it down, so the threads are only reused.
This keeps RAM usage from growing indefinitely, but I don't think recreating the thread pool and shutting it down for each file should be a problem.
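
A rough sketch of that workaround, under the same assumptions as before: SharedPoolSketch and ocrPage are hypothetical names, and ocrPage stands in for the new Tesseract() / doOCR(pageImage) step. The daemon thread factory is an addition of this sketch, not something from the original project; it only ensures that idle pool threads do not keep the JVM alive on exit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class SharedPoolSketch {
    // One static pool for the whole process, created once and never shut down.
    // Daemon threads are an assumption of this sketch so the JVM can still exit.
    private static final ExecutorService SHARED_POOL =
            Executors.newFixedThreadPool(4, runnable -> {
                Thread thread = new Thread(runnable);
                thread.setDaemon(true);
                return thread;
            });

    // ocrPage is a hypothetical stand-in for creating a Tesseract instance
    // and calling doOCR on the page image.
    static <P> List<String> processPdf(List<P> pages, Function<P, String> ocrPage) throws Exception {
        List<Callable<String>> tasks = new ArrayList<>();
        for (P page : pages) {
            tasks.add(() -> ocrPage.apply(page));
        }
        List<String> texts = new ArrayList<>();
        // invokeAll blocks until every page is done and returns futures in task order.
        for (Future<String> future : SHARED_POOL.invokeAll(tasks)) {
            texts.add(future.get());
        }
        return texts;
    }
}
```

Here the same four worker threads serve every PDF file, so no threads are created or torn down between files.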

If I run the code outside the Docker container, there is no memory leak.
If I run the code in the container but with only one thread, there is no memory leak either.

I created a Git repository with a sample Java project that illustrates and reproduces the leak. Just build and run the Docker image:
https://github.com/milen-dimitrov/TessMemoryLeakSample

There are also these messages, which may be relevant. I get them when I interrupt my program:
https://github.com/milen-dimitrov/TessMemoryLeakSample/blob/main/Screenshot_20230408_200926.png?raw=true
