[New Package] Add PaddleOCR Reader for extracting text from images in PDFs #19827

MichaelYipInGitHub · 2025-09-10T03:49:11Z

submit for llama-index-readers-paddle-ocr

Description

This PR introduces a new data loader (PaddleOCRReader) for the LlamaIndex ecosystem. This reader is specifically designed to extract text from image-based PDFs or scanned documents by leveraging the powerful PaddleOCR engine.

Motivation & Context:
Many valuable documents are stored as scanned PDFs or carry crucial information inside images (charts, diagrams, screenshots). Existing text-based PDF readers cannot process these. This integration bridges that gap by using PaddleOCR, which offers strong OCR (Optical Character Recognition) accuracy for Chinese and English, to convert image content within PDFs into Document objects that LlamaIndex can then index and query.

Key Features:

Extracts text from image-based PDFs and scanned documents.
Uses PaddleOCR, an OCR engine with strong support for both English and Chinese.
Returns a standard LlamaIndex Document object, making it seamless to use with existing pipelines.
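
For readers who want a picture of the flow the features above describe, here is a minimal sketch of turning per-page OCR output into document objects. It is not the code in this PR: the `Document` dataclass is a stand-in for the LlamaIndex class, and `pages_to_documents`, `OcrFn`, and `fake_ocr` are hypothetical names chosen for illustration, with the OCR engine injected as a plain callable so the example runs without PaddleOCR installed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Stand-in for the LlamaIndex Document class (text plus metadata)."""

    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


# Hypothetical OCR callable: takes raw page-image bytes, returns recognized lines.
OcrFn = Callable[[bytes], List[str]]


def pages_to_documents(
    page_images: List[bytes], ocr_fn: OcrFn, source: str
) -> List[Document]:
    """Run OCR on each page image and wrap non-empty results in Documents."""
    docs: List[Document] = []
    for page_number, image in enumerate(page_images, start=1):
        # Strip whitespace and drop lines that are empty after stripping.
        lines = [line.strip() for line in ocr_fn(image) if line.strip()]
        if not lines:
            continue  # skip pages where OCR found no text
        docs.append(
            Document(
                text="\n".join(lines),
                metadata={"source": source, "page": page_number},
            )
        )
    return docs


def fake_ocr(image: bytes) -> List[str]:
    """Toy stand-in for an OCR engine: decodes bytes instead of reading pixels."""
    return image.decode("utf-8").split("|")
```

In a real pipeline the callable would wrap the OCR engine; injecting it keeps the page-to-document logic easy to unit test.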

Dependencies:
This change requires the paddle and paddleocr packages, which are already specified in the pyproject.toml file.

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

[√] Yes

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

[√] Yes (Initial version set to 0.1.0)

Type of Change

Please delete options that are not relevant.

[√] New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

[√] I added new unit tests to cover this change
[ ] I believe this change is already covered by existing unit tests

Testing Details:
I have added unit tests in tests/test_readers_paddle_ocr.py that:
Verify the class can be initialized without errors.

Suggested Checklist:

[√] I have performed a self-review of my own code
[√] I have commented my code, particularly in hard-to-understand areas
[√] I have made corresponding changes to the documentation (Added a comprehensive README.md)
[ ] I have added Google Colab support for the newly added notebooks. (N/A for a reader package)
[√] My changes generate no new warnings
[√] I have added tests that prove my fix is effective or that my feature works
[√] New and existing unit tests pass locally with my changes
[√] I ran uv run make format; uv run make lint to appease the lint gods

AstraBert

Doesn't look bad, but I left some comments on things that do not make too much sense to me

llama-index-integrations/readers/llama-index-readers-paddle-ocr/pyproject.toml

...x-integrations/readers/llama-index-readers-paddle-ocr/llama_index/readers/paddle_ocr/base.py

...a-index-integrations/readers/llama-index-readers-paddle-ocr/tests/test_readers_paddle_ocr.py

1. Version: Changed to 0.1.0
2. License: Changed to MIT
3. Class Naming: PDFPaddleOCR changed to PaddleOcrReader
4. Default Language: Changed from "ch" to "en"
5. Arbitrary Filtering: Changed arbitrary logic like '"第", "页"...' to '"page", "of"'
6. Testing: Added several test cases covering all methods. I ran uv run --pytest -v and all tests passed.
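
Item 5 above mentions filtering on the tokens "page" and "of". The PR's actual filtering logic is not shown in this thread, so the following is only a sketch of one way such a filter could work: it drops OCR lines that look like page footers. The `PAGE_FOOTER` pattern and the `drop_page_footers` helper are hypothetical and assume footers shaped like "Page 3 of 10" or "3 of 10".

```python
import re
from typing import List

# Assumed footer shape ("Page 3 of 10" or "3 of 10"); not taken from the PR's code.
PAGE_FOOTER = re.compile(r"^\s*(page\s+)?\d+\s+of\s+\d+\s*$", re.IGNORECASE)


def drop_page_footers(lines: List[str]) -> List[str]:
    """Return the OCR lines with page-footer lines removed."""
    return [line for line in lines if not PAGE_FOOTER.match(line)]
```

Matching the whole line, rather than checking whether it merely contains "page" or "of", avoids discarding body text that happens to use those words.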

MichaelYipInGitHub · 2025-09-15T09:47:31Z

Hi @AstraBert, as a new contributor, I really appreciate your helpful response!
I just uploaded the new code, which follows your suggestions, to my 'llama-index-readers-paddle-ocr' branch. Feel free to leave comments.
Thanks again!

AstraBert · 2025-09-15T09:50:09Z

hey @MichaelYipInGitHub thanks a lot for adjusting the PR! We need to make linting pass tho, and you can do it via:

uv pip install pre-commit
pre-commit install
pre-commit run -a
git add . && git commit -m "ci: lint" && git push origin llama-index-readers-paddle-ocr

Once linting passes, the PR will merge automatically :)

MichaelYipInGitHub · 2025-09-16T08:03:49Z

auto-merge was automatically disabled

Thanks @AstraBert, but it seems auto-merge has been automatically disabled. Could you help enable it again? Thanks!

submit for llama-index-readers-paddle-ocr

94ee1bc

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Sep 10, 2025

AstraBert reviewed Sep 11, 2025

dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Sep 15, 2025

AstraBert approved these changes Sep 15, 2025

dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 15, 2025

AstraBert enabled auto-merge (squash) September 15, 2025 09:45

ci: lint

859d26b

auto-merge was automatically disabled September 15, 2025 14:22
Head branch was pushed to by a user without write access

AstraBert enabled auto-merge (squash) September 16, 2025 08:10

AstraBert merged commit 1d270ef into run-llama:main Sep 16, 2025
11 checks passed
