Closed
Description
Describe the bug
When I try to convert a PDF that looks scanned (according to TreeExtract) into hOCR, pdftotree fails with the error in the title (the UnboundLocalError shown in the logs below).
To Reproduce
Steps to reproduce the behavior:
- Install pdftotree from the master branch (6c4518d).
- Download the test PDFs for fonduer (http://i.stanford.edu/hazy/share/fonduer/fonduer_test_data_v0.3.0.tar.gz) and extract CentralSemiconductorCorp_2N4013.pdf.
- Execute: pdftotree CentralSemiconductorCorp_2N4013.pdf
Expected behavior
The command runs without an error.
Error Logs/Screenshots
$ pdftotree ../fonduer/tests/data/pdf/CentralSemiconductorCorp_2N4013.pdf
Traceback (most recent call last):
  File "/Users/hiromu/miniconda3/envs/pdftotree/bin/pdftotree", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/Users/hiromu/workspace/pdftotree/bin/pdftotree", line 116, in <module>
    args.visualize,
  File "/Users/hiromu/workspace/pdftotree/pdftotree/core.py", line 69, in parse
    pdf_html = extractor.get_html_tree()
  File "/Users/hiromu/workspace/pdftotree/pdftotree/TreeExtract.py", line 277, in get_html_tree
    "title", f"bbox 0 0 {int(pwidth)} {int(pheight)}; ppageno {page_num-1}"
UnboundLocalError: local variable 'pwidth' referenced before assignment
Environment
- pdftotree version: 0.5.0+dev (6c4518d)
Additional context
I think this is a regression caused by #71.
The same example PDF was also used in #27.
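For illustration, the kind of failure shown in the traceback can be reproduced in isolation. The sketch below is not the pdftotree code: build_title and its elems_by_page argument are hypothetical names, and it only assumes that pwidth and pheight are bound inside a loop (or branch) that may not run for a page with no extracted elements, which could be the case for a scanned PDF.

```python
def build_title(elems_by_page, page_num):
    """Hypothetical stand-in; NOT the real TreeExtract.get_html_tree."""
    for elem in elems_by_page.get(page_num, []):
        # pwidth and pheight are bound only if this loop body runs at least once.
        pwidth, pheight = elem["width"], elem["height"]
    # With no elements for the page, the names above were never assigned,
    # so this line raises UnboundLocalError.
    return f"bbox 0 0 {int(pwidth)} {int(pheight)}; ppageno {page_num - 1}"


if __name__ == "__main__":
    # A page with one element works.
    print(build_title({1: [{"width": 612.0, "height": 792.0}]}, 1))

    # A page with no elements (assumed here to resemble the scanned case) fails.
    try:
        build_title({}, 1)
    except UnboundLocalError as exc:
        print(f"UnboundLocalError: {exc}")
```

If the real cause is similar, one possible fix would be to obtain the page dimensions before (or independently of) iterating over the extracted elements, so that they are always defined when the title attribute is built.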