Skip to content

Conversation

@jonashaag
Copy link

@jonashaag jonashaag commented Aug 13, 2025

refs #864

Note that this is partially written by GPT-5.

Is this the right approach? Should I add tests for other documents?

Copy link
Collaborator

@jbarlow83 jbarlow83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing that needs to be done is to arrange for a DeterministicExecutor. It should work like
StandardExecutor._execute, except it needs to ensure the task_finished function is called in task order, not in completion order.

In practical terms this means e.g. even if some later pages finish OCR earlier, finalizing the page will be held back and done in order. (This is a MapReduce; we need to enforce ordering on the Reduce.) This will ensure the output is ordered in a consistent way.

It will be slower than StandardExecutor so it should not be the default.

set_pikepdf_as_editor=False, update_docinfo=False, strict=False
) as meta_original,
pdf.open_metadata() as meta_pdf,
pdf.open_metadata(set_pikepdf_as_editor=not pdf_save_settings.get("deterministic_id", False)) as meta_pdf,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should set pikepdf as editor here. It's still deterministic.

text_xobj_name = Name.random(prefix="OCR-")
if self.context.options.deterministic_output:
# Use a stable name per page for deterministic output
text_xobj_name = Name(f"/OCR-{page_num:06d}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should bind QPDFObjectHandle::getUniqueResourceName in pikepdf to make this easier.

We can't count on any particular name being available (in particular, if OCRmyPDF is being reused on the same file in some weird way, or if the input file is merged from multiple files generated by OCRmyPDF).

Without have getUniqueResourceName, just make a prefix, appending a counter, and keep incrementing until there's no conflict with an existing name.

pdfmark['/Producer'] = f'pikepdf {PIKEPDF_VERSION}'
pdfmark['/ModDate'] = encode_pdf_date(datetime.now(timezone.utc))
if not options.deterministic_output:
pdfmark['/ModDate'] = encode_pdf_date(datetime.now(timezone.utc))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to leaving ModDate undisturbed, we should use the environment variables used to control deterministic builds when compiling.

https://blog.conan.io/2019/09/02/Deterministic-builds-with-C-C++.html

In brief, if set ModDate to SOURCE_DATE_EPOCH which would be an environment variable holding a Unix epoch timestamp (seconds since 1970). If ZERO_AR_DATE is 1, then set ModDate to 0 (1970). That allows the user to control the timestamp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants