HiromuHota
Contributor

@HiromuHota HiromuHota commented Oct 6, 2020

Description of the problems or issues

Is your pull request related to a problem? Please describe.

See #72

Does your pull request fix any issue?

Fix #72

Description of the proposed changes

Just correctly retrieve page dimensions from layout

Test plan

Test against the pdf causing #72.

Checklist

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have updated the CHANGELOG.md accordingly.

@HiromuHota
Contributor Author

Parsing CentralSemiconductorCorp_2N4013.pdf at 936b1c9 gives me

<?xml version="1.0" ?>
<html>
	<head>
		<meta content="Converted from PDF by pdftotree 0.5.0+dev" name="ocr-system"/>
		<meta content="ocr_page ocr_table ocrx_block ocrx_word" name="ocr-capabilities"/>
		<meta content="10" name="ocr-number-of-pages"/>
	</head>
	<body>
		<div class="ocr_page" id="page_1" title="bbox 0 0 612 792; ppageno 0"/>
		<div class="ocr_page" id="page_2" title="bbox 0 0 612 792; ppageno 1"/>
		<div class="ocr_page" id="page_3" title="bbox 0 0 612 792; ppageno 2"/>
		<div class="ocr_page" id="page_4" title="bbox 0 0 612 792; ppageno 3"/>
		<div class="ocr_page" id="page_5" title="bbox 0 0 612 792; ppageno 4"/>
		<div class="ocr_page" id="page_6" title="bbox 0 0 612 792; ppageno 5"/>
		<div class="ocr_page" id="page_7" title="bbox 0 0 612 792; ppageno 6"/>
		<div class="ocr_page" id="page_8" title="bbox 0 0 612 792; ppageno 7"/>
		<div class="ocr_page" id="page_9" title="bbox 0 0 612 792; ppageno 8"/>
		<div class="ocr_page" id="page_10" title="bbox 0 0 612 792; ppageno 9"/>
	</body>
</html>

The output contains no text.
This happens because pdftotree currently outputs LTChar objects only when they are children of an LTTextLine, whereas the LTChar objects in this PDF are children of an LTFigure for some reason.
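
To illustrate the idea, here is a minimal sketch of collecting characters wherever they sit in the layout tree, instead of only under text lines. It is not the code in this PR: `Char`, `Container`, and `iter_chars` are hypothetical stand-ins so the example runs without a PDF. With pdfminer, the same walk should apply to an `LTPage` by passing `LTChar` as `char_type`, since layout containers are iterable.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Char:
    """Hypothetical stand-in for a leaf character object such as LTChar."""
    text: str


@dataclass
class Container:
    """Hypothetical stand-in for an iterable layout node (page, figure, text line)."""
    kind: str
    children: List[object] = field(default_factory=list)

    def __iter__(self):
        return iter(self.children)


def iter_chars(node, char_type=Char) -> Iterator:
    """Yield every char_type instance under node, whatever its parent is."""
    if isinstance(node, char_type):
        yield node
        return
    if isinstance(node, (str, bytes)):
        # Strings are iterable but are not layout nodes; do not descend.
        return
    try:
        children = iter(node)
    except TypeError:
        # Non-iterable leaf that is not a character (e.g. a line or curve).
        return
    for child in children:
        yield from iter_chars(child, char_type)


# A page where some characters live under a text line and others under figures.
page = Container("page", [
    Container("textline", [Char("a"), Char("b")]),
    Container("figure", [Char("c"), Container("figure", [Char("d")])]),
])
text = "".join(c.text for c in iter_chars(page))
```

A walk that only descends into text lines would return just the first two characters here; the recursive walk also picks up the ones nested under figures.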

@HiromuHota HiromuHota changed the title from "Retrieve page dimensions from layout (fix #72)" to "Extract LTChar even if they are not children of LTTextLine" Oct 7, 2020
@codecov-io

codecov-io commented Oct 7, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@22f9996).
The diff coverage is 96.96%.

@@            Coverage Diff            @@
##             master      #79   +/-   ##
=========================================
  Coverage          ?   65.62%           
=========================================
  Files             ?       21           
  Lines             ?     2508           
  Branches          ?        0           
=========================================
  Hits              ?     1646           
  Misses            ?      862           
  Partials          ?        0           
Flag Coverage Δ
#unittests 65.62% <96.96%> (?)

Impacted Files Coverage Δ
pdftotree/utils/pdf/layout_utils.py 19.51% <ø> (ø)
pdftotree/TreeExtract.py 88.62% <89.28%> (ø)
pdftotree/utils/pdf/pdf_parsers.py 92.42% <100.00%> (ø)
pdftotree/utils/pdf/pdf_utils.py 95.77% <100.00%> (ø)
tests/test_basic.py 95.00% <100.00%> (ø)

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 22f9996...ac32a12.

@HiromuHota HiromuHota marked this pull request as ready for review October 7, 2020 17:17
@HiromuHota HiromuHota requested review from lukehsiao and senwu October 7, 2020 17:17
Development

Successfully merging this pull request may close these issues.

UnboundLocalError: local variable 'pwidth' referenced before assignment
3 participants