Commit 704aea1

Update README
Author: Hiromu Hota
1 parent c2170d4 commit 704aea1

1 file changed, +11 -13 lines changed

README.rst

Lines changed: 11 additions & 13 deletions
@@ -3,7 +3,7 @@ pdftotree
 
 |License| |Stars| |PyPI| |Version| |Issues| |Travis| |Coveralls| |CodeStyle|
 
-**WARNING**: ``pdftotree`` *is experimental code and is NOT stable or maintained. It is not integrated with or supported by Fonduer.*
+**WARNING**: ``pdftotree`` *is experimental code and is NOT stable. It is not integrated with or supported by Fonduer.*
 
 Fonduer_ performs knowledge base construction from richly formatted data such
 as tables. A crucial step in this process is the construction of the
@@ -16,8 +16,10 @@ This package is the result of building our own module as replacement to Adobe
 Acrobat. Several open source tools are available for pdf to html conversion but
 these tools do not preserve the cell structure in a table. Our goal in this
 project is to develop a tool that extracts text, figures and tables in a pdf
-document and maintains the structure of the document using a tree data
-structure.
+document and returns them in an easily consumable format.
+
+Up to v0.4.1, pdftotree's output was formatted in its own "HTML-like" format.
+From v0.5.0, it conforms to hOCR_, an open-standard format for OCR results.
 
 Dependencies
 ------------
@@ -49,19 +51,14 @@ pdftotree
 ~~~~~~~~~
 
 This is the primary command-line utility provided with this Python package.
-This takes a PDF file as input, and produces an HTML-like representation of the
-data::
+This takes a PDF file as input and produces an hOCR file as output::
 
 usage: pdftotree [options] pdf_file
 
-Script to extract tree structure from PDF files. Takes a PDF as input and
-outputs an HTML-like representation of the document's structure. By default,
-this conversion is done using heuristics. However, a model can be provided as
-a parameter to use a machine-learning-based approach.
+Convert PDF into hOCR.
 
 positional arguments:
-pdf_file PDF file name for which tree structure needs to be
-extracted
+pdf_file Path to input PDF file.
 
 optional arguments:
 -h, --help show this help message and exit
@@ -71,8 +68,8 @@ data::
 -m MODEL_PATH, --model_path MODEL_PATH
 Pretrained model, generated by extract_tables tool
 -o OUTPUT, --output OUTPUT
-Path where tree structure should be saved. If none,
-HTML is printed to stdout.
+Path to output hOCR file. If not given, it will be
+printed to stdout.
 -f FAVOR_FIGURES, --favor_figures FAVOR_FIGURES
 Whether figures must be favored over other parts such
 as tables and section headers
@@ -207,3 +204,4 @@ Then you can run our tests::
 .. _version file: https://github.com/HazyResearch/pdftotree/blob/master/pdftotree/_version.py
 .. _editable mode: https://packaging.python.org/tutorials/distributing-packages/#working-in-development-mode
 .. _flake8: http://flake8.pycqa.org/en/latest/
+.. _hOCR: http://kba.cloud/hocr-spec/1.2/
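
Since this change makes hOCR the documented output format from v0.5.0, a short sketch of how a consumer might read such a file may be useful. hOCR is HTML in which elements carry a class such as ``ocrx_word`` and a ``title`` attribute holding semicolon-separated properties such as ``bbox x0 y0 x1 y1``. The example below uses only the Python standard library. The ``SAMPLE`` string is hand-written for illustration and is not actual pdftotree output; the names ``HocrWordParser`` and ``extract_words`` are hypothetical, and the exact classes pdftotree emits are not stated in this commit.

```python
from html.parser import HTMLParser

# Hand-written hOCR fragment for illustration only (not pdftotree output).
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 612 792'>
  <span class='ocr_line' title='bbox 72 90 200 104'>
    <span class='ocrx_word' title='bbox 72 90 120 104'>Hello</span>
    <span class='ocrx_word' title='bbox 126 90 200 104; x_wconf 95'>world</span>
  </span>
</div>
"""


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for elements with class ``ocrx_word``.

    Assumes word elements contain plain text with no nested tags.
    """

    def __init__(self):
        super().__init__()
        self.words = []     # finished (text, (x0, y0, x1, y1)) pairs
        self._bbox = None   # bbox of the word element currently open
        self._text = []     # text chunks seen inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" not in (attrs.get("class") or "").split():
            return
        # The title attribute holds semicolon-separated hOCR properties.
        for prop in (attrs.get("title") or "").split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                self._bbox = tuple(int(p) for p in parts[1:])
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # The first end tag after a word start closes that word.
        if self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    """Return a list of (text, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


words = extract_words(SAMPLE)
for text, bbox in words:
    print(text, bbox)
```

To try this on real output, one would presumably write the hOCR to a file with the ``-o`` option shown in the usage text above and pass that file's contents to ``extract_words``, adjusting the class name if pdftotree marks words differently.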