@@ -3,7 +3,7 @@ pdftotree
|License| |Stars| |PyPI| |Version| |Issues| |Travis| |Coveralls| |CodeStyle|
- **WARNING**: ``pdftotree`` *is experimental code and is NOT stable or maintained. It is not integrated with or supported by Fonduer.*
+ **WARNING**: ``pdftotree`` *is experimental code and is NOT stable. It is not integrated with or supported by Fonduer.*
Fonduer_ performs knowledge base construction from richly formatted data such
as tables. A crucial step in this process is the construction of the
@@ -16,8 +16,10 @@ This package is the result of building our own module as replacement to Adobe
Acrobat. Several open source tools are available for pdf to html conversion but
these tools do not preserve the cell structure in a table. Our goal in this
project is to develop a tool that extracts text, figures and tables in a pdf
- document and maintains the structure of the document using a tree data
- structure.
+ document and returns them in an easily consumable format.
+
+ Up to v0.4.1, pdftotree's output was formatted in its own "HTML-like" format.
+ From v0.5.0, it conforms to hOCR_, an open-standard format for OCR results.
Dependencies
------------
@@ -49,19 +51,14 @@ pdftotree
~~~~~~~~~
This is the primary command-line utility provided with this Python package.
- This takes a PDF file as input, and produces an HTML-like representation of the
- data::
+ This takes a PDF file as input and produces an hOCR file as output::
usage: pdftotree [options] pdf_file
- Script to extract tree structure from PDF files. Takes a PDF as input and
- outputs an HTML-like representation of the document's structure. By default,
- this conversion is done using heuristics. However, a model can be provided as
- a parameter to use a machine-learning-based approach.
+ Convert PDF into hOCR.
positional arguments:
- pdf_file PDF file name for which tree structure needs to be
- extracted
+ pdf_file Path to input PDF file.
optional arguments:
-h, --help show this help message and exit
-m MODEL_PATH, --model_path MODEL_PATH
Pretrained model, generated by extract_tables tool
-o OUTPUT, --output OUTPUT
- Path where tree structure should be saved. If none,
- HTML is printed to stdout.
+ Path to output hOCR file. If not given, it will be
+ printed to stdout.
-f FAVOR_FIGURES, --favor_figures FAVOR_FIGURES
Whether figures must be favored over other parts such
as tables and section headers
@@ -207,3 +204,4 @@ Then you can run our tests::
.. _version file: https://github.com/HazyResearch/pdftotree/blob/master/pdftotree/_version.py
.. _editable mode: https://packaging.python.org/tutorials/distributing-packages/#working-in-development-mode
.. _flake8: http://flake8.pycqa.org/en/latest/
+ .. _hOCR: http://kba.cloud/hocr-spec/1.2/
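
Since this change switches the output to hOCR, a reader may want a rough idea of how such a file could be consumed. The following is a minimal sketch using only the Python standard library. It assumes that words are marked with the hOCR class ``ocrx_word`` and carry an integer ``bbox`` property in their ``title`` attribute, as the hOCR specification describes. The sample markup is hand-written for illustration and is not actual pdftotree output, and the names ``parse_bbox``, ``WordExtractor`` and ``extract_words`` are hypothetical helpers, not part of pdftotree.

```python
from html.parser import HTMLParser

# Hand-written hOCR fragment for illustration only; NOT real pdftotree output.
SAMPLE_HOCR = """
<html><body>
<div class="ocr_page" title="bbox 0 0 612 792">
  <span class="ocrx_word" title="bbox 72 90 110 104">Hello</span>
  <span class="ocrx_word" title="bbox 114 90 160 104">world</span>
</div>
</body></html>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent.

    Title properties are separated by ';', e.g. "bbox 1 2 3 4; x_wconf 95".
    Assumes integer coordinates.
    """
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordExtractor(HTMLParser):
    """Collect (text, bbox) pairs for every element with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # (bbox, text chunks) while inside a word element

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "ocrx_word" in classes:
            self._current = (parse_bbox(attr.get("title") or ""), [])

    def handle_data(self, data):
        if self._current is not None:
            self._current[1].append(data)

    def handle_endtag(self, tag):
        # Assumes word elements contain only text, so the next end tag closes the word.
        if self._current is not None:
            bbox, chunks = self._current
            self.words.append(("".join(chunks).strip(), bbox))
            self._current = None


def extract_words(hocr):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    parser = WordExtractor()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

To try this on real output, the file written via the ``-o`` option described above could be read and passed to ``extract_words``; whether the class names and coordinate types match should be checked against the actual file first.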