Add `deterministic_output` option #1559

jonashaag · 2025-08-13T10:11:36Z

Note that this is partially written by GPT-5.

Is this the right approach? Should I add tests for other documents?

jbarlow83

Another thing that needs to be done is to arrange for a DeterministicExecutor. It should work like
StandardExecutor._execute, except it needs to ensure the task_finished function is called in task order, not in completion order.

In practical terms, this means that even if some later pages finish OCR earlier, finalizing those pages is held back and done in page order. (This is a MapReduce; we need to enforce ordering on the Reduce.) This will ensure the output is ordered in a consistent way.

It will be slower than StandardExecutor so it should not be the default.
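A minimal sketch of the ordering idea described above, using only the standard library. It is not OCRmyPDF's executor API: `execute_in_task_order` and `demo_worker` are hypothetical names, and a thread pool stands in for whatever worker pool the real executor uses. The map phase runs in parallel, while the reduce callback is invoked strictly in submission order.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def execute_in_task_order(
    worker: Callable[[T], R],
    task_args: Iterable[T],
    task_finished: Callable[[R], None],
    max_workers: int = 4,
) -> None:
    """Run worker() in parallel, but call task_finished() in task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front so the map phase stays parallel.
        futures = [pool.submit(worker, arg) for arg in task_args]
        # Walk the futures in submission order. result() blocks until that
        # specific task is done, so results that finished early wait their turn.
        for future in futures:
            task_finished(future.result())


def demo_worker(page: int) -> int:
    # Hypothetical stand-in for OCR: later pages finish sooner.
    time.sleep(0.05 * (4 - page))
    return page


finished_pages: list[int] = []
execute_in_task_order(demo_worker, range(4), finished_pages.append)
```

The cost is the one noted above: a slow early page delays finalization of every later page, and completed results are held in memory until their turn comes.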

jbarlow83 · 2025-08-13T19:16:30Z

src/ocrmypdf/_metadata.py

                set_pikepdf_as_editor=False, update_docinfo=False, strict=False
            ) as meta_original,
-            pdf.open_metadata() as meta_pdf,
+            pdf.open_metadata(set_pikepdf_as_editor=not pdf_save_settings.get("deterministic_id", False)) as meta_pdf,


We should set pikepdf as editor here. It's still deterministic.

jbarlow83 · 2025-08-13T19:16:36Z

src/ocrmypdf/_graft.py

-            text_xobj_name = Name.random(prefix="OCR-")
+            if self.context.options.deterministic_output:
+                # Use a stable name per page for deterministic output
+                text_xobj_name = Name(f"/OCR-{page_num:06d}")


I should bind QPDFObjectHandle::getUniqueResourceName in pikepdf to make this easier.

We can't count on any particular name being available (in particular, if OCRmyPDF is being reused on the same file in some weird way, or if the input file is merged from multiple files generated by OCRmyPDF).

Without getUniqueResourceName, just pick a prefix, append a counter, and keep incrementing until there's no conflict with an existing name.
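The counter approach could look roughly like this. It is a sketch in plain Python with no pikepdf calls: `first_free_name` is a hypothetical helper, and `existing` is assumed to be the set of names already present in the page's XObject resources, however the caller collects them.

```python
from itertools import count
from typing import Collection


def first_free_name(existing: Collection[str], prefix: str = "/OCR-") -> str:
    """Return prefix plus the lowest counter value not already in use."""
    for n in count():
        candidate = f"{prefix}{n:06d}"
        if candidate not in existing:
            return candidate
    raise AssertionError("unreachable: count() never ends")


# Hypothetical names already on a page that was processed before.
existing_names = {"/Im0", "/OCR-000000", "/OCR-000001"}
new_name = first_free_name(existing_names)
```

Because the counter always starts at zero and the existing names come from the input file, the same input yields the same name on every run.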

jbarlow83 · 2025-08-13T19:26:38Z

src/ocrmypdf/_metadata.py

    pdfmark['/Producer'] = f'pikepdf {PIKEPDF_VERSION}'
-    pdfmark['/ModDate'] = encode_pdf_date(datetime.now(timezone.utc))
+    if not options.deterministic_output:
+        pdfmark['/ModDate'] = encode_pdf_date(datetime.now(timezone.utc))


In addition to leaving ModDate undisturbed, we should use the environment variables used to control deterministic builds when compiling.

https://blog.conan.io/2019/09/02/Deterministic-builds-with-C-C++.html

In brief: if SOURCE_DATE_EPOCH is set, set ModDate to its value. It is an environment variable holding a Unix epoch timestamp (seconds since 1970). If ZERO_AR_DATE is 1, set ModDate to 0 (1970). That allows the user to control the timestamp.
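A sketch of how the environment lookup might be written. `mod_date_from_env` is a hypothetical helper, not existing OCRmyPDF code. The comment does not say which variable wins when both are set, so giving ZERO_AR_DATE precedence here is an assumption, as is returning `None` to mean "leave ModDate undisturbed".

```python
import os
from datetime import datetime, timezone
from typing import Mapping, Optional


def mod_date_from_env(
    environ: Mapping[str, str] = os.environ,
) -> Optional[datetime]:
    """Pick a reproducible ModDate from the environment, if one is requested."""
    # Assumption: ZERO_AR_DATE=1 takes precedence and pins the date to 1970.
    if environ.get("ZERO_AR_DATE") == "1":
        return datetime.fromtimestamp(0, tz=timezone.utc)
    # SOURCE_DATE_EPOCH holds seconds since 1970-01-01T00:00:00Z.
    epoch = environ.get("SOURCE_DATE_EPOCH")
    if epoch:
        return datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    # Neither variable is set: the caller leaves ModDate untouched.
    return None


pinned = mod_date_from_env({"SOURCE_DATE_EPOCH": "86400"})
```

The returned value could then be passed to `encode_pdf_date` in place of `datetime.now(timezone.utc)` in the hunk above.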

jonashaag added 3 commits August 13, 2025 11:42

wip

2623412

wip

2b07fa8

wip

cccee37

jbarlow83 reviewed Aug 13, 2025
