Resolve a memory leak caused by large data on out_queue (related to #494) #545
Description of the problems or issues
Is your pull request related to a problem? Please describe.
Fonduer accelerates document parsing with multi-processing. Each worker process gets documents from `in_queue` (shared memory) and puts the parsed data plus the document name onto `out_queue` (shared memory). This is a well-known pattern, but it can hang because of a memory leak in the shared memory: the previous code put the parsed (relatively large) data onto `out_queue`, and another process then got the data from `out_queue` and committed it to the Postgres DB (a minimal sketch of this pattern follows below). See also #494.
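For illustration, here is a rough sketch of that previous pattern. The helpers `parse_document` and `commit_to_postgres` and the driver are hypothetical placeholders, not the actual Fonduer code; the point is that the full parsed payload crosses the queue, so large objects can accumulate in it when the single consumer falls behind.

```python
import multiprocessing as mp

def parse_document(doc_name):
    # Placeholder for Fonduer's actual parser; the real parsed data is
    # relatively large, which this string simulates.
    return {"name": doc_name, "payload": "x" * 1_000_000}

def commit_to_postgres(parsed):
    # Placeholder for committing parsed data to the Postgres DB.
    pass

def parse_worker(in_queue, out_queue):
    # Previous pattern: the whole parsed payload is put on out_queue.
    while True:
        doc_name = in_queue.get()
        if doc_name is None:  # sentinel: no more documents
            break
        parsed = parse_document(doc_name)
        out_queue.put((doc_name, parsed))  # large objects pile up here

if __name__ == "__main__":
    in_queue, out_queue = mp.Queue(), mp.Queue()
    for name in ["doc-0", "doc-1", "doc-2"]:
        in_queue.put(name)
    in_queue.put(None)
    worker = mp.Process(target=parse_worker, args=(in_queue, out_queue))
    worker.start()
    # Single consumer: if it drains out_queue more slowly than the workers
    # fill it, the parsed data accumulates in the queue and memory grows.
    for _ in range(3):
        doc_name, parsed = out_queue.get()
        commit_to_postgres(parsed)
    worker.join()
```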
Does your pull request fix any issue?
See #494
Description of the proposed changes
Change the `out_queue` payload to contain only the document name, not the parsed data. Instead of committing the data received from `out_queue` in a separate process, each worker process commits its parsed data before putting the document name onto `out_queue` (see the sketch below).
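A minimal sketch of the proposed pattern, again with hypothetical placeholder helpers rather than the actual Fonduer code; the surrounding driver would look like the previous sketch, except the consumer now receives only document names.

```python
def parse_document(doc_name):
    # Placeholder for Fonduer's actual parser.
    return {"name": doc_name, "payload": "x" * 1_000_000}

def commit_to_postgres(parsed):
    # Placeholder: each worker commits via its own DB session/connection.
    pass

def parse_worker(in_queue, out_queue):
    # New pattern: commit inside the worker, then enqueue only the name.
    while True:
        doc_name = in_queue.get()
        if doc_name is None:  # sentinel: no more documents
            break
        parsed = parse_document(doc_name)
        commit_to_postgres(parsed)  # committed here, in the worker process
        out_queue.put(doc_name)     # only a small string crosses the queue
```

The trade-off is that each worker process needs its own database session, but the data crossing process boundaries shrinks to a short string per document.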
Test plan
Run the existing tests and monitor Python memory usage (a minimal monitoring snippet is sketched below).
In my case (3000 HTML files, 12 MB total), Python memory usage dropped from 1.4 GB to 700 MB.
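One possible way to watch the resident memory of the running process; this assumes `psutil`, which is not a Fonduer dependency and not necessarily what was used for the numbers above.

```python
import os
import psutil  # assumption: psutil is installed separately

process = psutil.Process(os.getpid())
rss_mb = process.memory_info().rss / (1024 * 1024)
print(f"resident memory: {rss_mb:.1f} MB")
```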
Checklist