🔃 refactor: Improve Document Loaders, add `langchain-ollama` to Lite Build #170
…te edition to work again with Ollama. Safely load PDF with images flag On, with fallback to try and read the file with it Off.
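The fallback described above (try with image extraction on, retry with it off) could be sketched roughly as follows. This is not the code from this PR: `load_pdf_safely` and `FlakyLoader` are hypothetical names, and `FlakyLoader` only stands in for a real PDF loader that accepts an image-extraction flag, so the example runs without any PDF library installed.

```python
import logging
from typing import Callable, List, Protocol

logger = logging.getLogger(__name__)


class Loader(Protocol):
    """Anything with a load() method that returns the extracted pages."""

    def load(self) -> List[str]: ...


def load_pdf_safely(make_loader: Callable[[bool], Loader]) -> List[str]:
    """Load with image extraction on; if that raises, retry with it off.

    make_loader is a hypothetical factory: it receives the image-extraction
    flag and returns a loader configured accordingly.
    """
    try:
        return make_loader(True).load()
    except Exception as exc:  # broad on purpose: any extraction failure triggers the retry
        logger.warning("Image extraction failed (%s); retrying without images", exc)
        return make_loader(False).load()


class FlakyLoader:
    """Stand-in loader (assumption) whose image extraction always fails."""

    def __init__(self, extract_images: bool) -> None:
        self.extract_images = extract_images

    def load(self) -> List[str]:
        if self.extract_images:
            raise ValueError("unsupported image filter")
        return ["page 1 text"]


pages = load_pdf_safely(FlakyLoader)
```

If the second attempt also fails, the exception propagates to the caller, so a file that cannot be read at all still surfaces as an error instead of being silently skipped.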
@danny-avila Any news about my PR?
This should not be added to the lite version since the lite version should not be installing all the dependencies required for
Can you document the enhanced functionality and bug fixes? What are they?
While the current image provides comprehensive functionality, we've identified a recurring need for a more specialized, lightweight option. We propose a new
The immediate and most critical reason for these upgrades was to resolve the catastrophic failures we observed when processing certain PDF files, especially those with security features or complex image content. This specific problem, along with its resolution, is precisely what's demonstrated in the provided screenshots. My contribution to LangChain (here) directly addresses a root cause within that library. Beyond this crucial fix, keeping our dependencies reasonably up-to-date is a standard industry practice. It ensures we proactively incorporate the latest security patches, benefit from general performance improvements, and maintain compatibility with the evolving Python ecosystem. While comprehensively documenting every minor bug fix or subtle feature enhancement across all upgraded libraries like pypdf, python-pptx, and cryptography would be quite extensive, the cumulative effect is a more robust, secure, and future-proof application, better equipped to handle the wide variety of files we process.
Ok looks like it only adds 100-200 MB which is fine, will accept this PR as is.
This pull request introduces significant updates to the development environment, document loading logic, and dependency versions. Key changes include the addition of a development container configuration, enhancements to document loader functionality, and upgrades to several Python dependencies for improved compatibility and performance.
Development Environment Setup:
* `.devcontainer/Dockerfile`: Added a Dockerfile for setting up a development container, including system dependencies (`git`, `sudo`, `pandoc`, `libmagic1`) and a non-root user (`vscode`) for development. Configured Python environment variables and switched to the non-root user.
* `.devcontainer/devcontainer.json`: Added a configuration file for the development container, specifying build arguments, VS Code extensions, port forwarding, and post-creation commands. Enabled features like "docker-outside-of-docker" for enhanced functionality.

Document Loader Enhancements:
* `app/utils/document_loader.py`: Improved the `get_loader` function to support additional MIME types for document loading and introduced the `SafePyPDFLoader` class to gracefully handle image extraction failures in PDFs.

Dependency Updates:
* `requirements.lite.txt`: Upgraded multiple dependencies, including `langchain`, `pypdf`, `python-pptx`, and `cryptography`, to newer versions for enhanced functionality and bug fixes. Added `langchain-ollama` as a new dependency to fix the Ollama connector in the Lite version.

Testing Additions:
* `tests/utils/test_document_loader.py`: Added unit tests for the `SafePyPDFLoader` class and its integration into the `get_loader` function, ensuring proper behavior and compatibility.

These changes collectively improve the development workflow, expand document processing capabilities, and ensure the codebase remains up-to-date with the latest library versions.
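To illustrate what selecting a loader by MIME type can look like, here is a minimal sketch. It is not the `get_loader` implementation from this PR: the `pick_loader` name, the `LOADER_BY_MIME` table, and the loader labels are all assumptions made for the example, and the real function may support different types and return loader objects rather than strings.

```python
import mimetypes
from typing import Optional

# Hypothetical mapping from MIME type to a loader label (not the PR's table).
LOADER_BY_MIME = {
    "application/pdf": "pdf",
    "text/plain": "text",
    "text/markdown": "markdown",
    "application/json": "json",
}


def pick_loader(filename: str, content_type: Optional[str] = None) -> str:
    """Pick a loader label from the declared MIME type, else from the file name."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = (content_type or "").split(";")[0].strip().lower()
    if mime not in LOADER_BY_MIME:
        # Declared type is missing or unknown: guess from the file extension.
        guessed, _ = mimetypes.guess_type(filename)
        mime = guessed or mime
    # Unknown types fall back to the plain-text loader in this sketch.
    return LOADER_BY_MIME.get(mime, "text")
```

Checking the declared content type first and the file extension second means an upload sent as `application/octet-stream` can still be routed by its name, which is one common reason to widen MIME handling.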
Screenshots
Before
After