DavidVentura

Calling recognize on the iterator instead of getHOCRText is ~20% faster for my test cases.

Image 1

| Test | Recognize() (ms) | GetHOCRText() (ms) |
| --- | --- | --- |
| 1 | 1366 | 1718 |
| 2 | 1116 | 1242 |
| 3 | 1048 | 1239 |
| Average | 1177 | 1400 |

Image 2

| Test | Recognize() (ms) | GetHOCRText() (ms) |
| --- | --- | --- |
| 1 | 878 | 1603 |
| 2 | 929 | 1201 |
| 3 | 1396 | 1132 |
| Average | 1068 | 1312 |
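
For reference, the relative difference can be worked out from the averages in the two tables. The sketch below is only arithmetic on those reported numbers; the class and method names are made up for illustration and are not part of the library.

```java
final class SpeedupCheck {
    /** Percentage of time saved by the candidate relative to the baseline. */
    static double percentLessTime(double baselineMs, double candidateMs) {
        return (baselineMs - candidateMs) / baselineMs * 100.0;
    }

    /** Percentage of extra time the slower call needs relative to the faster one. */
    static double percentOverhead(double fasterMs, double slowerMs) {
        return (slowerMs - fasterMs) / fasterMs * 100.0;
    }

    public static void main(String[] args) {
        // Averages copied from the tables above: {Recognize(), GetHOCRText()} in ms.
        double[][] averages = { {1177, 1400}, {1068, 1312} };
        for (int i = 0; i < averages.length; i++) {
            double recognize = averages[i][0];
            double hocr = averages[i][1];
            System.out.printf("Image %d: %.1f%% less time, %.1f%% overhead%n",
                    i + 1,
                    percentLessTime(hocr, recognize),
                    percentOverhead(recognize, hocr));
        }
    }
}
```

Measured as time saved, the averages give roughly 16% and 19%; measured as overhead of GetHOCRText() over Recognize(), roughly 19% and 23%. With only three runs per image and noticeable run-to-run variation (test 3 of Image 2 goes the other way), these figures are indicative rather than precise.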

@Robyer
Member

Robyer commented Jul 29, 2025

Hi, sorry for the late reply.

Have you tried comparing getHOCRText with getUTF8Text? The recognize function is called at the start of both of these methods (if the image is not already recognized). The only difference is that getHOCRText passes a monitor to the recognize call, so that it is informed about progress and the user can cancel the processing, while getUTF8Text does not (same as the recognize in your PR).

So could the 20% difference be just because of that (plus some for the extra markup of the HOCR format)?
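
A comparison like this could be timed with a small harness along the following lines. This is a minimal sketch, not code from the PR: `OcrTimingSketch` and its methods are hypothetical names, and the two lambdas are placeholders to be swapped for the real getUTF8Text and getHOCRText calls. Because recognition is skipped when the image is already recognized, each timed run is assumed to start from a fresh, unrecognized image; otherwise only the first run measures recognition.

```java
import java.util.Arrays;
import java.util.function.Supplier;

final class OcrTimingSketch {
    /** Median of the samples; does not modify the input array. */
    static double median(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("need at least one sample");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Runs the task `runs` times and returns the median wall-clock time in ms. */
    static double medianMillis(Supplier<String> task, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        double[] samples = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.get(); // assumed to reset the image and then run the OCR call
            samples[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return median(samples);
    }

    public static void main(String[] args) {
        // Placeholders only: replace with code that sets the image again and
        // then calls getUTF8Text or getHOCRText on the real API object.
        Supplier<String> plainText = () -> "placeholder for getUTF8Text";
        Supplier<String> hocrText = () -> "placeholder for getHOCRText";

        System.out.printf("getUTF8Text: %.3f ms (median)%n", medianMillis(plainText, 5));
        System.out.printf("getHOCRText: %.3f ms (median)%n", medianMillis(hocrText, 5));
    }
}
```

The median is used instead of the mean so that a single slow run, such as the outlier in the Image 2 table, has less influence on the result.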

@DavidVentura
Author

About 15% overhead seems to come from the progress callback being non-null. There is still ~5% overhead in getHOCRText vs Recognize.

@Robyer
Member

Robyer commented Aug 24, 2025

Thanks, that makes sense.

So if you don't want the callback or the HOCR text format, just use getUTF8Text, as that will be the fastest; there is no need for a separate Recognize call.

Or do you still see some benefit in using Recognize separately from getUTF8Text? If not, we can close this PR.
