Resolve a memory leak caused by large data on out_queue (related to #494) #545
Description of the problems or issues
Is your pull request related to a problem? Please describe.
Fonduer accelerates document parsing with multi-processing. Each worker process gets documents from `in_queue` (shared memory) and puts the parsed data plus the document name onto `out_queue` (shared memory). This is a well-known pattern, but it can hang because of a memory leak in the shared memory: the previous code put the parsed (relatively large) data onto `out_queue`, and another process then got the data from `out_queue` and committed it to the Postgres DB (a minimal sketch of this pattern follows below). See also #494.
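For illustration, here is a rough sketch of that previous pattern. The helpers `parse_document` and `commit_to_postgres` and the driver are hypothetical placeholders, not the actual Fonduer code; the point is that the full parsed payload crosses the queue, so large objects can accumulate in it when the single consumer falls behind.

```python
import multiprocessing as mp

def parse_document(doc_name):
    # Placeholder for Fonduer's actual parser; the real parsed data is
    # relatively large, which this string simulates.
    return {"name": doc_name, "payload": "x" * 1_000_000}

def commit_to_postgres(parsed):
    # Placeholder for committing parsed data to the Postgres DB.
    pass

def parse_worker(in_queue, out_queue):
    # Previous pattern: the whole parsed payload is put on out_queue.
    while True:
        doc_name = in_queue.get()
        if doc_name is None:  # sentinel: no more documents
            break
        parsed = parse_document(doc_name)
        out_queue.put((doc_name, parsed))  # large objects pile up here

if __name__ == "__main__":
    in_queue, out_queue = mp.Queue(), mp.Queue()
    for name in ["doc-0", "doc-1", "doc-2"]:
        in_queue.put(name)
    in_queue.put(None)
    worker = mp.Process(target=parse_worker, args=(in_queue, out_queue))
    worker.start()
    # Single consumer: if it drains out_queue more slowly than the workers
    # fill it, the parsed data accumulates in the queue and memory grows.
    for _ in range(3):
        doc_name, parsed = out_queue.get()
        commit_to_postgres(parsed)
    worker.join()
```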
Does your pull request fix any issue?
See #494
Description of the proposed changes
Change the `out_queue` payload to contain only the document name, not the parsed data. Instead of committing the data received from `out_queue` in a separate process, each worker process commits its parsed data before putting the document name onto `out_queue` (see the sketch below).
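A minimal sketch of the proposed pattern, again with hypothetical placeholder helpers rather than the actual Fonduer code; the surrounding driver would look like the previous sketch, except the consumer now receives only document names.

```python
def parse_document(doc_name):
    # Placeholder for Fonduer's actual parser.
    return {"name": doc_name, "payload": "x" * 1_000_000}

def commit_to_postgres(parsed):
    # Placeholder: each worker commits via its own DB session/connection.
    pass

def parse_worker(in_queue, out_queue):
    # New pattern: commit inside the worker, then enqueue only the name.
    while True:
        doc_name = in_queue.get()
        if doc_name is None:  # sentinel: no more documents
            break
        parsed = parse_document(doc_name)
        commit_to_postgres(parsed)  # committed here, in the worker process
        out_queue.put(doc_name)     # only a small string crosses the queue
```

The trade-off is that each worker process needs its own database session, but the data crossing process boundaries shrinks to a short string per document.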
Test plan
Run the existing tests and monitor Python memory usage (a minimal monitoring snippet is sketched below).
In my case (3000 HTML files, 12 MB total), Python memory usage dropped from 1.4 GB to 700 MB.
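One possible way to watch the resident memory of the running process; this assumes `psutil`, which is not a Fonduer dependency and not necessarily what was used for the numbers above.

```python
import os
import psutil  # assumption: psutil is installed separately

process = psutil.Process(os.getpid())
rss_mb = process.memory_info().rss / (1024 * 1024)
print(f"resident memory: {rss_mb:.1f} MB")
```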
Checklist