@MichaelYipInGitHub
Contributor

@MichaelYipInGitHub MichaelYipInGitHub commented Sep 10, 2025

submit for llama-index-readers-paddle-ocr

Description

This PR introduces a new data loader (PaddleOCRReader) for the LlamaIndex ecosystem. This reader is specifically designed to extract text from image-based PDFs or scanned documents by leveraging the powerful PaddleOCR engine.

Motivation & Context:
Many valuable documents are stored as scanned PDFs or contain crucial information within images (charts, diagrams, screenshots). Existing text-based PDF readers cannot process these. This integration bridges that gap: it uses PaddleOCR, which recognizes both Chinese and English text accurately, to convert image content within PDFs into Document objects that LlamaIndex can then index and query.

Key Features:

  • Extracts text from image-based PDFs and scanned documents.
  • Utilizes PaddleOCR, a state-of-the-art OCR engine with great support for both English and Chinese.
  • Returns a standard LlamaIndex Document object, making it seamless to use with existing pipelines.
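The flow these features describe (run OCR on each page image, then wrap the recognized text in a Document) might look roughly like the sketch below. It is not the PR's implementation: `Document` here is a small stand-in for the LlamaIndex class, `pages_to_documents` and `fake_ocr` are hypothetical names, the OCR engine is passed in as a plain callable instead of calling PaddleOCR, and emitting one Document per page is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Stand-in for the LlamaIndex Document: text plus metadata."""

    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


# Hypothetical OCR callable: one page image (bytes) in, recognized lines out.
OcrFn = Callable[[bytes], List[str]]


def pages_to_documents(
    page_images: List[bytes], ocr: OcrFn, source: str
) -> List[Document]:
    """Run OCR on each page image and wrap non-empty results in Documents."""
    docs: List[Document] = []
    for page_number, image in enumerate(page_images, start=1):
        # Keep only lines that still contain text after trimming whitespace.
        lines = [line.strip() for line in ocr(image) if line.strip()]
        if not lines:
            continue  # skip pages where nothing was recognized
        docs.append(
            Document(
                text="\n".join(lines),
                metadata={"source": source, "page": page_number},
            )
        )
    return docs


def fake_ocr(image: bytes) -> List[str]:
    """Fake engine for illustration: treats the bytes as '|'-separated lines."""
    return image.decode("utf-8").split("|")


docs = pages_to_documents(
    [b"Invoice 42|Total: 10 EUR", b"", b"Thank you"], fake_ocr, "scan.pdf"
)
```

Passing the engine in as a callable keeps the sketch runnable without PaddleOCR installed; a real reader would build the engine itself.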

Dependencies:
This change requires paddle and paddleocr packages, which are already specified in the pyproject.toml file.

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • [√] Yes

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • [√] Yes (Initial version set to 0.1.0)

Type of Change

Please delete options that are not relevant.

  • [√] New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • [√] I added new unit tests to cover this change
  • [ ] I believe this change is already covered by existing unit tests

Testing Details:
I have added unit tests in tests/test_readers_paddle_ocr.py that verify the class can be initialized without errors.
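An initialization test of the kind described here could be sketched as follows. The real test file is not shown in this thread, so `SketchReader`, its `lang` and `engine_factory` parameters, and the lazy engine construction are all hypothetical; only the default language "en" comes from the discussion further down.

```python
import unittest


class SketchReader:
    """Hypothetical stand-in for the reader; not the PR's actual class."""

    def __init__(self, lang="en", engine_factory=None):
        if not lang:
            raise ValueError("lang must be a non-empty string")
        self.lang = lang
        # Assumption: the OCR engine is built lazily, so construction is cheap.
        self._engine_factory = engine_factory
        self._engine = None


class TestSketchReaderInit(unittest.TestCase):
    def test_default_language(self):
        self.assertEqual(SketchReader().lang, "en")

    def test_init_does_not_build_engine(self):
        calls = []
        SketchReader(engine_factory=lambda: calls.append("built"))
        # The factory must not be invoked during __init__.
        self.assertEqual(calls, [])

    def test_empty_language_rejected(self):
        with self.assertRaises(ValueError):
            SketchReader(lang="")
```

Saved to a file, these tests can be run with `python -m unittest` or with pytest.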

Suggested Checklist:

  • [√] I have performed a self-review of my own code
  • [√] I have commented my code, particularly in hard-to-understand areas
  • [√] I have made corresponding changes to the documentation (Added a comprehensive README.md)
  • [ ] I have added Google Colab support for the newly added notebooks. (N/A for a reader package)
  • [√] My changes generate no new warnings
  • [√] I have added tests that prove my fix is effective or that my feature works
  • [√] New and existing unit tests pass locally with my changes
  • [√] I ran uv run make format; uv run make lint to appease the lint gods

submit for llama-index-readers-paddle-ocr
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Sep 10, 2025
Member

@AstraBert AstraBert left a comment

Doesn't look bad, but I left some comments on things that do not make too much sense to me

Add PaddleOCR reader integration

1. Version:
Changed to 0.1.0

2. License:
Changed to MIT

3. Class Naming:
PDFPaddleOCR changed to PaddleOcrReader

4. Default Language:
Changed from "ch" to "en"

5. Arbitrary Filtering:
Changed the arbitrary filter markers from Chinese ('"第", "页"...', i.e. "page N") to English ('"page", "of"').

6. Testing:
Added several test cases covering all methods. I ran uv run -- pytest -v and all tests passed.
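The thread does not show the filtering logic from item 5, so the following is only a guess at what marker-based filtering of OCR output can look like. It drops short lines that contain every configured marker, such as "Page 3 of 10". The function name, the `markers` parameter, and the `max_len` cutoff are assumptions for illustration, not the PR's API.

```python
from typing import Iterable, List, Sequence

# English markers, matching the defaults mentioned in item 5.
DEFAULT_MARKERS: Sequence[str] = ("page", "of")


def drop_page_marker_lines(
    lines: Iterable[str],
    markers: Sequence[str] = DEFAULT_MARKERS,
    max_len: int = 20,
) -> List[str]:
    """Drop short lines that contain every marker (case-insensitive).

    The length cutoff is a heuristic so that ordinary sentences which
    happen to contain the marker words are kept.
    """
    lowered_markers = [m.lower() for m in markers]
    kept: List[str] = []
    for line in lines:
        stripped = line.strip()
        lowered = stripped.lower()
        is_marker_line = len(stripped) <= max_len and all(
            m in lowered for m in lowered_markers
        )
        if not is_marker_line:
            kept.append(line)
    return kept


ocr_lines = [
    "Page 3 of 10",
    "The history of the page layout is long and complicated.",
    "第 3 页",
    "Total: 42",
]

english_filtered = drop_page_marker_lines(ocr_lines)
chinese_filtered = drop_page_marker_lines(ocr_lines, markers=("第", "页"))
```

Making the markers a parameter, rather than hard-coding one language, lets the same filter serve both English and Chinese documents.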
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Sep 15, 2025
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 15, 2025
@AstraBert AstraBert enabled auto-merge (squash) September 15, 2025 09:45
@MichaelYipInGitHub
Contributor Author

Hi @AstraBert, as a new contributor, I really appreciate your helpful response!
I just uploaded the new code, which follows your suggestions, to my 'llama-index-readers-paddle-ocr' branch. Feel free to leave comments.
Thanks again!

@AstraBert
Member

AstraBert commented Sep 15, 2025

hey @MichaelYipInGitHub thanks a lot for adjusting the PR! We need to make linting pass tho, and you can do it via:

uv pip install pre-commit
pre-commit install
pre-commit run -a
git add . && git commit -m "ci: lint" && git push origin llama-index-readers-paddle-ocr

Once linting passes, the PR will merge automatically :)

auto-merge was automatically disabled September 15, 2025 14:22

Head branch was pushed to by a user without write access

@MichaelYipInGitHub
Contributor Author

auto-merge was automatically disabled

Thanks @AstraBert, but it seems auto-merge has been automatically disabled. Could you help enable it again? Thanks!

@AstraBert AstraBert enabled auto-merge (squash) September 16, 2025 08:10
@AstraBert AstraBert merged commit 1d270ef into run-llama:main Sep 16, 2025
11 checks passed
frankiekim5 pushed a commit to frankiekim5/bedrock-agentcore-memory that referenced this pull request Sep 24, 2025
… PDFs (run-llama#19827)

* submit for llama-index-readers-paddle-ocr

* Add PaddleOCR reader integration

* ci: lint