Commit eaf2fb2

community(pypdfloader): added page_label in metadata for pypdf loader (#29225)
# Description

## Summary

This PR adds support for handling multi-labeled page numbers in the **PyPDFLoader**. Some PDFs use complex page numbering systems where the actual content may begin after multiple introductory pages. The `page_label` field helps accurately reflect the document’s page structure, making it easier to handle such cases during document parsing.

## Motivation

This feature improves document parsing accuracy by allowing users to access the actual page labels instead of relying only on the physical page numbers. This is particularly useful for documents where the first few pages have roman numerals or other non-standard page labels.

## Use Case

This feature is especially useful for **Retrieval-Augmented Generation** (RAG) systems where users may reference page numbers when asking questions. Some PDFs have both labeled page numbers (like roman numerals for introductory sections) and index-based page numbers.

For example, a user might ask: "What is mentioned on page 5?"

The system can now check both:

- **Index-based page number** (`page`)
- **Labeled page number** (`page_label`)

This dual-check helps improve retrieval accuracy. Additionally, the results can be validated with an **agent or tool** to ensure the retrieved pages match the user’s query contextually.

## Code Changes

- Added a `page_label` field to the metadata of the Document class in **PyPDFLoader**.
- Implemented support for retrieving `page_label` from the `pdf_reader.page_labels`.
- Created a test case (`test_pypdf_loader_with_multi_label_page_numbers`) with a sample PDF containing multi-labeled pages (`geotopo-komprimiert.pdf`) [[Source of pdf](https://github.com/py-pdf/sample-files/blob/main/009-pdflatex-geotopo/GeoTopo-komprimiert.pdf)].
- Updated existing tests to ensure compatibility and verify `page_label` extraction.

## Tests Added

- Added a new test case for a PDF with multi-labeled pages.
- Verified both `page` and `page_label` metadata fields are correctly extracted.

## Screenshots

<img width="549" alt="image" src="https://github.com/user-attachments/assets/65db9f5c-032e-4592-926f-824777c28f33" />
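
The dual-check described under "Use Case" could be sketched roughly as follows. This is an illustration only, not code from the PR: `Doc` is a hypothetical stand-in for a loaded document, `find_pages` is a made-up helper name, and the label list is invented sample data. The sketch compares a numeric query against the 0-based `page` index exactly as stored; an application that treats user-facing page numbers as 1-based would need to subtract one first.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (not the LangChain class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def find_pages(docs, query_page):
    """Return docs whose page label or page index matches the requested page.

    `query_page` is the page as the user wrote it, e.g. "5" or "iv".
    """
    wanted = str(query_page).strip().lower()
    matches = []
    for doc in docs:
        label = str(doc.metadata.get("page_label", "")).lower()
        index = doc.metadata.get("page")
        # Match on the label first, then on the stored 0-based index.
        if label == wanted or (wanted.isdecimal() and index == int(wanted)):
            matches.append(doc)
    return matches


# Invented sample data: three roman-numbered pages, then arabic numbering.
labels = ["i", "ii", "iii", "1", "2", "3", "4"]
docs = [
    Doc(f"text of page {i}", {"page": i, "page_label": label})
    for i, label in enumerate(labels)
]

hits = find_pages(docs, "1")
roman_hits = find_pages(docs, "ii")
```

A query for "1" returns two candidates here (index 1 and label "1"), which is the ambiguity the description suggests resolving with an agent or tool.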
1 parent 1a38948 commit eaf2fb2

3 files changed: +27 -1 lines changed

libs/community/langchain_community/document_loaders/parsers/pdf.py

Lines changed: 5 additions & 1 deletion
@@ -123,7 +123,11 @@ def _extract_text_from_page(page: pypdf.PageObject) -> str:
                 Document(
                     page_content=_extract_text_from_page(page=page)
                     + self._extract_images_from_page(page),
-                    metadata={"source": blob.source, "page": page_number},
+                    metadata={
+                        "source": blob.source,
+                        "page": page_number,
+                        "page_label": pdf_reader.page_labels[page_number],
+                    },
                     # type: ignore[attr-defined]
                 )
                 for page_number, page in enumerate(pdf_reader.pages)
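
The hunk above relies on `page_labels` being index-aligned with `pages`, so that the same `page_number` addresses both. A minimal sketch of that per-page metadata construction, outside the loader, might look like this. `build_page_metadata` is a hypothetical helper name, and `fake_reader` is a stand-in object that models only the two attributes used; it is not a pypdf reader.

```python
from types import SimpleNamespace


def build_page_metadata(source, reader):
    """Build one metadata dict per page, mirroring the changed hunk."""
    return [
        {
            "source": source,
            "page": page_number,
            # Labels are looked up by the same index as the page itself.
            "page_label": reader.page_labels[page_number],
        }
        for page_number, _page in enumerate(reader.pages)
    ]


# Stand-in reader (assumption): only `pages` and `page_labels` are modelled.
fake_reader = SimpleNamespace(
    pages=["<page 0>", "<page 1>", "<page 2>"],
    page_labels=["i", "ii", "1"],
)

metadata = build_page_metadata("example.pdf", fake_reader)
```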
205 KB
Binary file not shown.

libs/community/tests/unit_tests/document_loaders/test_pdf.py

Lines changed: 22 additions & 0 deletions
@@ -12,6 +12,10 @@
     Path(__file__).parent.parent
     / "document_loaders/sample_documents/layout-parser-paper.pdf"
 )
+path_to_multi_label_page_numbers_pdf = (
+    Path(__file__).parent.parent
+    / "document_loaders/sample_documents/geotopo-komprimiert.pdf"
+)
 path_to_layout_pdf_txt = (
     Path(__file__).parent.parent.parent
     / "integration_tests/examples/layout-parser-paper-page-1.txt"
@@ -32,6 +36,7 @@ def test_pypdf_loader() -> None:
     assert len(docs) == 16
     for page, doc in enumerate(docs):
         assert doc.metadata["page"] == page
+        assert doc.metadata["page_label"] == str(page + 1)
         assert doc.metadata["source"].endswith("layout-parser-paper.pdf")
         assert len(doc.page_content) > 10

@@ -49,6 +54,7 @@ def test_pypdf_loader_with_layout() -> None:
     assert len(docs) == 16
     for page, doc in enumerate(docs):
         assert doc.metadata["page"] == page
+        assert doc.metadata["page_label"] == str(page + 1)
         assert doc.metadata["source"].endswith("layout-parser-paper.pdf")
         assert len(doc.page_content) > 10

@@ -60,3 +66,19 @@ def test_pypdf_loader_with_layout() -> None:
     cleaned_first_page = re.sub(r"\x00", "", first_page)
     cleaned_expected = re.sub(r"\x00", "", expected)
     assert cleaned_first_page == cleaned_expected
+
+
+@pytest.mark.requires("pypdf")
+def test_pypdf_loader_with_multi_labled_page_numbers() -> None:
+    """Test PyPDFLoader with a pdf that contains multi-labled page numbers."""
+    loader = PyPDFLoader(str(path_to_multi_label_page_numbers_pdf))
+    docs = loader.load()
+
+    assert len(docs) == 7
+
+    assert docs[0].metadata["page"] == 0
+    assert docs[0].metadata["page_label"] == "i"
+
+    # Since the actual page numbers in this pdf starts from 4th page
+    assert docs[3].metadata["page"] == 3
+    assert docs[3].metadata["page_label"] == "1"
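
To make the expected values in the new test easier to follow, here is a small sketch of how a two-range labelling scheme maps indices to labels. It assumes, beyond what the test asserts, that the sample PDF has exactly three roman-numbered front-matter pages followed by arabic numbering; the test itself only confirms the labels at indices 0 and 3. `to_roman` and `make_labels` are hypothetical helpers, not part of pypdf or LangChain.

```python
def to_roman(n):
    """Convert a positive integer to a lowercase roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def make_labels(front_matter_pages, total_pages):
    """Label the front matter i, ii, ... and the remaining pages 1, 2, ..."""
    roman = [to_roman(i) for i in range(1, front_matter_pages + 1)]
    arabic = [str(i) for i in range(1, total_pages - front_matter_pages + 1)]
    return roman + arabic


# Assumed layout: 3 front-matter pages in a 7-page document.
labels = make_labels(3, 7)
```

Under that assumption, index 0 carries the label "i" and index 3 carries the label "1", matching the two pairs of assertions in the test.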
