gafda
Contributor

@gafda gafda commented Jul 9, 2025

This pull request introduces significant updates to the development environment, document loading logic, and dependency versions. Key changes include the addition of a development container configuration, enhancements to document loader functionality, and upgrades to several Python dependencies for improved compatibility and performance.

Development Environment Setup:

  • .devcontainer/Dockerfile: Added a Dockerfile for setting up a development container, including system dependencies (git, sudo, pandoc, libmagic1) and a non-root user (vscode) for development. Configured Python environment variables and switched to the non-root user.
  • .devcontainer/devcontainer.json: Added a configuration file for the development container, specifying build arguments, VS Code extensions, port forwarding, and post-creation commands. Enabled features like "docker-outside-of-docker" for enhanced functionality.

Document Loader Enhancements:

  • app/utils/document_loader.py: Improved the get_loader function to support additional MIME types for document loading and introduced the SafePyPDFLoader class to gracefully handle image extraction failures in PDFs.
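
The fallback behaviour described for SafePyPDFLoader can be illustrated with a minimal, self-contained sketch. This is not the PR's actual implementation: `load_pdf_safely`, `fake_loader`, and `ImageExtractionError` are hypothetical names invented for illustration, and the real class presumably wraps a LangChain PDF loader rather than a plain callable.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


class ImageExtractionError(Exception):
    """Hypothetical stand-in for the error a PDF parser raises on a bad image."""


def load_pdf_safely(
    load: Callable[[str, bool], List[str]],
    path: str,
    extract_images: bool = True,
) -> List[str]:
    """Load a PDF with image extraction on, retrying with it off on failure.

    `load` is an assumed callable taking (path, extract_images) and
    returning a list of page texts; it is not a real library API.
    """
    if not extract_images:
        return load(path, False)
    try:
        return load(path, True)
    except Exception as exc:  # deliberately broad in this sketch
        logger.warning(
            "Image extraction failed for %s (%s); retrying without images",
            path,
            exc,
        )
        return load(path, False)


def fake_loader(path: str, extract_images: bool) -> List[str]:
    """Toy loader that fails whenever image extraction is requested."""
    if extract_images:
        raise ImageExtractionError("unsupported image filter")
    return [f"text of {path}"]


# The first attempt raises, so the result comes from the retry without images.
pages = load_pdf_safely(fake_loader, "report.pdf")
```

Catching a broad `Exception` is a simplification here; a real loader would probably narrow this to the errors the underlying parser is known to raise.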

Dependency Updates:

  • requirements.lite.txt: Upgraded multiple dependencies, including langchain, pypdf, python-pptx, and cryptography, to newer versions for enhanced functionality and bug fixes. Added langchain-ollama as a new dependency to fix Ollama connector to Lite version.

Testing Additions:

  • tests/utils/test_document_loader.py: Added unit tests for the SafePyPDFLoader class and its integration into the get_loader function, ensuring proper behavior and compatibility.
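
As a rough idea of what such tests might check, here is a self-contained sketch using only the standard library. The MIME-type table, `pick_loader`, and the stub classes are assumptions made for illustration; the PR's real get_loader signature and mapping are not shown in this thread.

```python
import unittest


class DefaultLoader:
    """Hypothetical fallback loader used for unrecognised MIME types."""

    def __init__(self, path: str) -> None:
        self.path = path


class SafePDFLoaderStub(DefaultLoader):
    """Placeholder standing in for the PR's SafePyPDFLoader."""


class TextLoaderStub(DefaultLoader):
    """Placeholder for a plain-text loader."""


# Assumed MIME-type table; the actual mapping in the PR may differ.
LOADERS_BY_MIME = {
    "application/pdf": SafePDFLoaderStub,
    "text/plain": TextLoaderStub,
}


def pick_loader(path: str, mime_type: str) -> DefaultLoader:
    """Return a loader instance for the MIME type, else the default loader."""
    loader_cls = LOADERS_BY_MIME.get(mime_type, DefaultLoader)
    return loader_cls(path)


class PickLoaderTest(unittest.TestCase):
    def test_pdf_uses_safe_loader(self) -> None:
        loader = pick_loader("a.pdf", "application/pdf")
        self.assertIsInstance(loader, SafePDFLoaderStub)

    def test_unknown_type_falls_back(self) -> None:
        loader = pick_loader("a.bin", "application/octet-stream")
        self.assertIs(type(loader), DefaultLoader)


# Run the suite explicitly so the sketch also works when pasted into a REPL.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PickLoaderTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The second test uses an exact type check so that a subclass returned by mistake would not pass as the default loader.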

These changes collectively improve the development workflow, expand document processing capabilities, and ensure the codebase remains up-to-date with the latest library versions.

Screenshots

Before

image

After

image

…te edition to work again with Ollama. Safely load PDF with images flag On, with fallback to try and read the file with it Off.
@gafda gafda marked this pull request as ready for review July 10, 2025 16:36
@gafda
Contributor Author

gafda commented Aug 1, 2025

@danny-avila Any news about my PR?

@danny-avila
Owner

> Added langchain-ollama as a new dependency to fix Ollama connector to Lite version.

This should not be added to the lite version since the lite version should not be installing all the dependencies required for langchain-ollama.

> Upgraded multiple dependencies, including langchain, pypdf, python-pptx, and cryptography, to newer versions for enhanced functionality and bug fixes.

Can you document the enhanced functionality and bug fixes? what are they?

@gafda
Contributor Author

gafda commented Aug 4, 2025

> > Added langchain-ollama as a new dependency to fix Ollama connector to Lite version.
>
> This should not be added to the lite version since the lite version should not be installing all the dependencies required for langchain-ollama.

While the current image provides comprehensive functionality, we've identified a recurring need for a more specialized, lightweight option. We propose a new lite-ollama variant that specifically includes the Ollama library. This would cater effectively to users who require direct access to Ollama for purposes such as LLM testing – a workflow our team extensively utilizes. The streamlined nature of its past availability in a similar configuration was indeed a significant asset for these rapid development environments.

> > Upgraded multiple dependencies, including langchain, pypdf, python-pptx, and cryptography, to newer versions for enhanced functionality and bug fixes.
>
> Can you document the enhanced functionality and bug fixes? what are they?

That's an excellent question, and I appreciate your diligence in understanding the changes.

The immediate and most critical reason for these upgrades was to resolve the catastrophic failures we observed when processing certain PDF files, especially those with security features or complex image content. This specific problem, along with its resolution, is precisely what's demonstrated in the provided screenshots. My contribution to LangChain (here) directly addresses a root cause within that library.

Beyond this crucial fix, keeping our dependencies reasonably up-to-date is a standard industry practice. It ensures we proactively incorporate the latest security patches, benefit from general performance improvements, and maintain compatibility with the evolving Python ecosystem. While comprehensively documenting every minor bug fix or subtle feature enhancement across all upgraded libraries like pypdf, python-pptx, and cryptography would be quite extensive, the cumulative effect is a more robust, secure, and future-proof application, better equipped to handle the wide variety of files we process.

@danny-avila danny-avila changed the title from "Refactor: Improve document loaders, fix Lite edition, and enhance PDF loading." to "🔃 refactor: Improve Document Loaders, fix Lite edition, and PDF loading" on Aug 17, 2025
@danny-avila
Owner

> While the current image provides comprehensive functionality, we've identified a recurring need for a more specialized, lightweight option. We propose a new lite-ollama variant that specifically includes the Ollama library. This would cater effectively to users who require direct access to Ollama for purposes such as LLM testing – a workflow our team extensively utilizes. The streamlined nature of its past availability in a similar configuration was indeed a significant asset for these rapid development environments.

langchain-ollama significantly increases the resulting image size, so the image would no longer be "lite". You can use the non-lite image instead. I will double-check the image size just to be sure.

@danny-avila
Owner

OK, it looks like it only adds 100-200 MB, which is fine. I will accept this PR as is.

@danny-avila danny-avila changed the title from "🔃 refactor: Improve Document Loaders, fix Lite edition, and PDF loading" to "🔃 refactor: Improve Document Loaders, add Ollama to Lite edition" on Aug 17, 2025
@danny-avila danny-avila changed the title from "🔃 refactor: Improve Document Loaders, add Ollama to Lite edition" to "🔃 refactor: Improve Document Loaders, add langchain-ollama to Lite Build" on Aug 17, 2025
@danny-avila danny-avila merged commit 15e31da into danny-avila:main Aug 17, 2025
1 check passed
dirkpetersen pushed a commit to dirkpetersen/rag_api that referenced this pull request Aug 23, 2025