- 
          
 - 
                Notifications
    
You must be signed in to change notification settings  - Fork 2.2k
 
          Add deterministic_output option
          #1559
        
          New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing that needs to be done is to arrange for a DeterministicExecutor. It should work like
StandardExecutor._execute, except it needs to ensure the task_finished function is called in task order, not in completion order.
In practical terms this means e.g. even if some later pages finish OCR earlier, finalizing the page will be held back and done in order. (This is a MapReduce; we need to enforce ordering on the Reduce.) This will ensure the output is ordered in a consistent way.
It will be slower than StandardExecutor so it should not be the default.
| set_pikepdf_as_editor=False, update_docinfo=False, strict=False | ||
| ) as meta_original, | ||
| pdf.open_metadata() as meta_pdf, | ||
| pdf.open_metadata(set_pikepdf_as_editor=not pdf_save_settings.get("deterministic_id", False)) as meta_pdf, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should set pikepdf as editor here. It's still deterministic.
| text_xobj_name = Name.random(prefix="OCR-") | ||
| if self.context.options.deterministic_output: | ||
| # Use a stable name per page for deterministic output | ||
| text_xobj_name = Name(f"/OCR-{page_num:06d}") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should bind QPDFObjectHandle::getUniqueResourceName in pikepdf to make this easier.
We can't count on any particular name being available (in particular, if OCRmyPDF is being reused on the same file in some weird way, or if the input file is merged from multiple files generated by OCRmyPDF).
Without have getUniqueResourceName, just make a prefix, appending a counter, and keep incrementing until there's no conflict with an existing name.
| pdfmark['/Producer'] = f'pikepdf {PIKEPDF_VERSION}' | ||
| pdfmark['/ModDate'] = encode_pdf_date(datetime.now(timezone.utc)) | ||
| if not options.deterministic_output: | ||
| pdfmark['/ModDate'] = encode_pdf_date(datetime.now(timezone.utc)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to leaving ModDate undisturbed, we should use the environment variables used to control deterministic builds when compiling.
https://blog.conan.io/2019/09/02/Deterministic-builds-with-C-C++.html
In brief, if set ModDate to SOURCE_DATE_EPOCH which would be an environment variable holding a Unix epoch timestamp (seconds since 1970). If ZERO_AR_DATE is 1, then set ModDate to 0 (1970). That allows the user to control the timestamp.
refs #864
Note that this is partially written by GPT-5.
Is this the right approach? Should I add tests for other documents?