Fix memory leaks #977

@Balearica

Description

Several Node.js users have reported that using a single worker with hundreds of images increases memory usage linearly over time, which indicates the presence of a memory leak. The recommended solution has been to periodically terminate workers and create new ones. While this is good advice for other reasons (see note below), we should still attempt to resolve the memory leak.

Based on user reports, the leak is small enough that it only impacts Node.js users recognizing many images on a server, so it is likely relatively small on a per-image basis. The most likely explanation is that there is some issue with how we export results from Tesseract. This is based purely on process of elimination--if the issue were with the input (images), the leak would be much larger in magnitude, and if the leak occurred within Tesseract itself, it would presumably have been reported and (hopefully) patched in the main Tesseract repo.

Note for users: the advice not to reuse the same workers in perpetuity on a server is good, even if the memory leak gets fixed. This is because Tesseract workers "learn" over time by default. While this learning generally improves results, it assumes that (1) previous results are generally correct and (2) the image being recognized closely resembles previous images. As a result, if the same worker is used with hundreds of different documents from different users, it is common for Tesseract to "learn" something incorrect or inapplicable, making results worse than if a fresh worker had been used.
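The recycling advice above can be sketched as a small wrapper that retires a worker after a fixed number of recognitions. This is a minimal sketch, not part of the library: `makeRecyclingRecognizer` and the `maxJobs` threshold are hypothetical names, and the injected `createWorker` factory is only assumed to return an object with `recognize`/`terminate` methods in the shape of the tesseract.js worker API.

```javascript
// Hypothetical helper: wraps a worker factory so each worker is
// terminated and replaced after `maxJobs` recognitions, bounding any
// per-image leak and any accumulated "learning" from prior documents.
function makeRecyclingRecognizer(createWorker, maxJobs = 100) {
  let worker = null; // current worker, created lazily
  let jobs = 0;      // recognitions performed by the current worker

  return {
    async recognize(image) {
      if (worker === null) worker = await createWorker();
      const result = await worker.recognize(image);
      jobs += 1;
      if (jobs >= maxJobs) {
        // Retire the worker before leaks (or learned state) accumulate.
        await worker.terminate();
        worker = null;
        jobs = 0;
      }
      return result;
    },
    async close() {
      // Terminate any outstanding worker when the caller is done.
      if (worker !== null) {
        await worker.terminate();
        worker = null;
      }
    },
  };
}
```

With tesseract.js this would be used as `makeRecyclingRecognizer(() => createWorker('eng'), 100)`; the right `maxJobs` value depends on image size and available memory, so it is a tuning knob rather than a fixed recommendation.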
