[New Package] Add PaddleOCR Reader for extracting text from images in PDFs #19827
submit for llama-index-readers-paddle-ocr
Doesn't look bad, but I left some comments on things that don't make much sense to me
llama-index-integrations/readers/llama-index-readers-paddle-ocr/pyproject.toml
...x-integrations/readers/llama-index-readers-paddle-ocr/llama_index/readers/paddle_ocr/base.py
...a-index-integrations/readers/llama-index-readers-paddle-ocr/tests/test_readers_paddle_ocr.py
1. Version: Changed to 0.1.0
2. License: Changed to MIT
3. Class Naming: PDFPaddleOCR changed to PaddleOcrReader
4. Default Language: Changed from "ch" to "en"
5. Arbitrary Filtering: Changed arbitrary logic like '"第", "页"...' to '"page", "of"'
6. Testing: Added several test cases covering all methods. I ran uv run -- pytest -v and all tests passed.
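Item 5 above refers to dropping page-marker text from the OCR output. The PR's actual filtering code is not shown in this conversation, so the following is only a minimal sketch of one way such a filter could look. The names `is_page_marker` and `filter_ocr_lines` and the regular expression are assumptions made for illustration, not the PR's implementation.

```python
import re
from typing import Iterable, List

# Hypothetical pattern (an assumption, not taken from the PR): matches whole
# lines such as "Page 3 of 10" or "page 3", ignoring case and outer whitespace.
_PAGE_MARKER = re.compile(r"^\s*page\s+\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def is_page_marker(line: str) -> bool:
    """Return True if the whole line looks like a page-number marker."""
    return bool(_PAGE_MARKER.match(line))


def filter_ocr_lines(lines: Iterable[str]) -> List[str]:
    """Drop blank lines and page markers, keeping the remaining text stripped."""
    return [
        line.strip()
        for line in lines
        if line.strip() and not is_page_marker(line)
    ]


if __name__ == "__main__":
    sample = ["Page 1 of 3", "Invoice total: 42", "   ", "page 2"]
    print(filter_ocr_lines(sample))
```

Matching the whole line, rather than checking whether a line merely contains "page" or "of", keeps ordinary sentences that happen to use those words.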
Hi @AstraBert, as a new contributor, I really appreciate your helpful response!
hey @MichaelYipInGitHub thanks a lot for adjusting the PR! We need to make linting pass tho, and you can do it via:
uv pip install pre-commit
pre-commit install
pre-commit run -a
git add . && git commit -m "ci: lint" && git push origin llama-index-readers-paddle-ocr
Once linting passes, the PR will merge automatically :)
Head branch was pushed to by a user without write access
Thanks @AstraBert, but it seems auto-merge has been automatically disabled. Could you help enable it again? Thanks
Description
This PR introduces a new data loader (PaddleOCRReader) for the LlamaIndex ecosystem. This reader is specifically designed to extract text from image-based PDFs or scanned documents by leveraging the powerful PaddleOCR engine.
Motivation & Context:
Many valuable documents are stored as scanned PDFs or contain crucial information within images (charts, diagrams, screenshots). Existing text-based PDF readers cannot process these. This integration bridges that gap by using PaddleOCR's excellent accuracy in OCR (Optical Character Recognition) for Chinese and English to convert image content within PDFs into readable Document objects that LlamaIndex can then index and query.
Key Features:
Document object, making it seamless to use with existing pipelines.
Dependencies:
This change requires the paddle and paddleocr packages, which are already specified in the pyproject.toml file.
New Package?
Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?
Version Bump?
Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)
0.1.0)
Type of Change
Please delete options that are not relevant.
How Has This Been Tested?
Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.
Testing Details:
I have added unit tests in tests/test_readers_paddle_ocr.py that:
Verify the class can be initialized without errors.
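The test file itself is not reproduced on this page. As a rough illustration of the kind of checks described, here is a self-contained sketch that tests a reader without running a real OCR engine by injecting a fake OCR function. Everything in it is hypothetical: `SketchOcrReader`, `Doc`, and `fake_ocr` are stand-ins written for this example and are not the PR's class, LlamaIndex's `Document`, or PaddleOCR's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Stand-in for a document object: page text plus metadata (hypothetical)."""

    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


class SketchOcrReader:
    """Hypothetical reader: turns page images into one Doc per page."""

    def __init__(self, ocr_fn: Callable[[bytes], List[str]], lang: str = "en") -> None:
        # ocr_fn is injected so tests can swap the real OCR engine for a fake.
        self.ocr_fn = ocr_fn
        self.lang = lang

    def load_data(self, page_images: List[bytes]) -> List[Doc]:
        docs = []
        for page_number, image in enumerate(page_images, start=1):
            lines = self.ocr_fn(image)
            docs.append(
                Doc(
                    text="\n".join(lines),
                    metadata={"page": page_number, "lang": self.lang},
                )
            )
        return docs


def fake_ocr(image: bytes) -> List[str]:
    """Fake OCR engine: returns deterministic text derived from the input size."""
    return [f"text from {len(image)} bytes"]


def test_reader_initializes_with_default_language() -> None:
    reader = SketchOcrReader(ocr_fn=fake_ocr)
    assert reader.lang == "en"


def test_load_data_returns_one_doc_per_page() -> None:
    reader = SketchOcrReader(ocr_fn=fake_ocr)
    docs = reader.load_data([b"a", b"bcd"])
    assert [d.metadata["page"] for d in docs] == [1, 2]
    assert docs[1].text == "text from 3 bytes"
```

Injecting the OCR callable keeps such tests fast and independent of model downloads. A real test suite would more likely patch the OCR engine where the reader imports it.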
Suggested Checklist:
README.md)
uv run make format; uv run make lint to appease the lint gods