Description of the bug
When parsing a large HTML file (32MB), the following error occurred:
/home/user/.venv/lib/python3.7/site-packages/fonduer/parser/parser.py:286: UserWarning: Document XXX not added to database, because of parse error:
Unable to allocate 370. GiB for an array with shape (294791, 336473) and data type int32
Note that "XXX" in the error message above masks the document name for privacy.
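For scale, the size in the message follows directly from the array shape and data type it reports. The check below is plain arithmetic on those numbers, not Fonduer code, and the variable names are only illustrative:

```python
# Shape and dtype are copied from the error message above.
rows, cols = 294791, 336473
bytes_per_int32 = 4  # an int32 element occupies 4 bytes

# A dense array needs one element per (row, column) pair.
total_bytes = rows * cols * bytes_per_int32
total_gib = total_bytes / 2**30

print(f"{total_gib:.1f} GiB")  # 369.5 GiB, which the message rounds to 370
```

That is roughly 15 times the 24GB available on the Docker host, so an allocation of this shape cannot succeed on this machine.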
To Reproduce
Steps to reproduce the behavior:
- Enable lingual=True
- Parse a large HTML file
The above error message is not always shown.
Most of the time, Fonduer fails silently.
Expected behavior
A machine with 24GB of memory should be able to parse a 32MB HTML file.
Error Logs/Screenshots
See above.
Environment (please complete the following information)
- OS: Docker (based on HiromuHota/fonduer-tutorials:0.8.2 but updated to the latest commit on the master branch)
- PostgreSQL Version: 12.1
- Poppler Utils Version: N/A
- spaCy Version: 2.1.9
- Fonduer Version: master (9a33ada)
- Docker Host
  - OS: Ubuntu (20.04)
  - MEM: 24GB
  - CPU: 8 cores
Additional context
The original PDF has 389 pages, and I converted it into HTML using pdftotext -bbox-layout.
The generated HTML file is 32MB.
When I disabled all four modalities (structural=False, tabular=False, lingual=False, visual=False), this issue did not happen.
When I enabled only lingual (i.e., structural=False, tabular=False, lingual=True, visual=False), this issue happened.
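The two configurations compared above can be written down as follows. This is a minimal sketch, not the exact script I ran: the helper name parse_docs and its arguments (connection string, document path) are hypothetical placeholders, and it assumes the usual Meta / HTMLDocPreprocessor / Parser entry points.

```python
# Modality flags for the two runs described above.
ALL_DISABLED = dict(structural=False, tabular=False, lingual=False, visual=False)
ONLY_LINGUAL = dict(structural=False, tabular=False, lingual=True, visual=False)


def parse_docs(conn_string, docs_path, modalities):
    """Parse the HTML documents under docs_path with the given modality flags.

    conn_string and docs_path are placeholders for a PostgreSQL connection
    string and the location of the converted HTML file(s).
    """
    # Imported lazily so this sketch can be loaded without Fonduer installed.
    from fonduer import Meta
    from fonduer.parser import Parser
    from fonduer.parser.preprocessors import HTMLDocPreprocessor

    session = Meta.init(conn_string).Session()
    doc_preprocessor = HTMLDocPreprocessor(docs_path)
    parser = Parser(session, **modalities)
    parser.apply(doc_preprocessor)
    return session
```

Calling parse_docs with ALL_DISABLED corresponds to the run that completed, and calling it with ONLY_LINGUAL corresponds to the run that failed; the only difference between the two is the lingual flag.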
I think this issue is directly related to #439 (comment), where we discussed that Parser itself is very memory-hungry.