@cau-git cau-git commented Aug 19, 2025

This PR improves memory efficiency during conversion, particularly for long documents.

  • Calls the PdfDocument.unload_pages API of docling-parse v4 (introduced in v4.2.2) from the pipelines where necessary
  • Deletes Page.parsed_page once a page has been processed, unless the pipeline_options.generate_parsed_page option requests keeping it (this changes the previous default)
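
For readers who want to see the pattern in isolation, here is a minimal, self-contained sketch of the two changes listed above. It does not import docling or docling-parse: `StubPdfDocument`, `load_page`, `process_pages`, and the half-open `(start, end)` page-range semantics of `unload_pages` are assumptions made for illustration, not the library's actual API. Only the names `unload_pages`, `parsed_page`, and `generate_parsed_page` come from this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PipelineOptions:
    # Option named in this PR; assumed to default to False after the change.
    generate_parsed_page: bool = False


@dataclass
class Page:
    page_no: int
    # Stand-in for the parser's per-page output.
    parsed_page: Optional[dict] = None


@dataclass
class StubPdfDocument:
    """Hypothetical stand-in for a docling-parse PdfDocument."""

    loaded: set = field(default_factory=set)

    def load_page(self, page_no: int) -> dict:
        # Hypothetical helper: parsing a page keeps its data cached in the document.
        self.loaded.add(page_no)
        return {"page_no": page_no, "cells": ["..."]}

    def unload_pages(self, page_range: tuple) -> None:
        # Assumed semantics: drop cached data for pages in [start, end).
        start, end = page_range
        self.loaded -= set(range(start, end))


def process_pages(doc: StubPdfDocument, page_numbers, options: PipelineOptions):
    pages = []
    for page_no in page_numbers:
        page = Page(page_no=page_no, parsed_page=doc.load_page(page_no))
        # Downstream pipeline steps would consume page.parsed_page here.
        if not options.generate_parsed_page:
            # Drop the per-page parse result unless the caller asked to keep it.
            page.parsed_page = None
        # Release the parser-side cache for this page as soon as it is done.
        doc.unload_pages((page_no, page_no + 1))
        pages.append(page)
    return pages
```

With the default options, every returned `Page` has `parsed_page` set to `None` and the stub document holds no cached pages; passing `PipelineOptions(generate_parsed_page=True)` keeps the per-page data while still releasing the parser-side cache.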

TODO

  • Recreate uv.lock with final docling-parse 4.2.x version after fixing release

Notes:

  • New behaviour: Page.parsed_page is no longer available by default in the conversion result, including when other PDF backends are used.

Issue resolved by this Pull Request:
Resolves #2077

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

… parsed_page data unless requested to keep

Signed-off-by: Christoph Auer <[email protected]>

github-actions bot commented Aug 19, 2025

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

mergify bot commented Aug 19, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
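
As a quick illustration of what this rule accepts, the pattern can be tried against the titles used in this PR. This is only a sketch: it assumes Python's `re` semantics are close enough to how the merge bot evaluates `~=`, and `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    # True when the title starts with a conventional-commit type,
    # an optional "(scope)", an optional "!", and a colon.
    return TITLE_RULE.match(title) is not None
```

Under that assumption, both the original `fix:` title and the later `perf:` titles of this PR satisfy the rule, while a title without a type prefix does not.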

@cau-git cau-git changed the title from "fix: Call PdfDocument.unload_pages from the pipelines" to "perf: Call PdfDocument.unload_pages from the pipelines" Aug 19, 2025
dolfim-ibm previously approved these changes Aug 20, 2025

@dolfim-ibm dolfim-ibm left a comment

lgtm

@cau-git cau-git changed the title from "perf: Call PdfDocument.unload_pages from the pipelines" to "perf: Clean up resources when using docling-parse v4, don" Aug 20, 2025
@cau-git cau-git changed the title from "perf: Clean up resources when using docling-parse v4, don" to "perf: Clean up resources with docling-parse v4, no parsed_page output by default" Aug 20, 2025

codecov bot commented Aug 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

@cau-git cau-git merged commit 5f57ff2 into main Aug 20, 2025
14 checks passed
@cau-git cau-git deleted the cau/docling-parse-page-unload branch August 20, 2025 08:46

dosubot bot commented Aug 20, 2025

Documentation updates
No documents were updated by changes in this PR

Development

Successfully merging this pull request may close these issues.

Docling Parse v4 accumulating memory