🔃 refactor: Improve Document Loaders, add `langchain-ollama` to Lite Build #170

gafda · 2025-07-09T11:04:52Z

This pull request introduces significant updates to the development environment, document loading logic, and dependency versions. Key changes include the addition of a development container configuration, enhancements to document loader functionality, and upgrades to several Python dependencies for improved compatibility and performance.

Development Environment Setup:

.devcontainer/Dockerfile: Added a Dockerfile for setting up a development container, including system dependencies (git, sudo, pandoc, libmagic1) and a non-root user (vscode) for development. Configured Python environment variables and switched to the non-root user.
.devcontainer/devcontainer.json: Added a configuration file for the development container, specifying build arguments, VS Code extensions, port forwarding, and post-creation commands. Enabled features like "docker-outside-of-docker" for enhanced functionality.

Document Loader Enhancements:

app/utils/document_loader.py: Improved the get_loader function to support additional MIME types for document loading and introduced the SafePyPDFLoader class to gracefully handle image extraction failures in PDFs.
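
To illustrate the fallback idea described here (try with image extraction on, retry with it off), a minimal sketch follows. It is not the PR's `SafePyPDFLoader`; the names `FallbackPDFLoader` and `make_loader` are hypothetical, and the real class may differ in structure and error handling.

```python
from typing import Callable, List, Protocol


class _Loader(Protocol):
    """Anything with a load() method, e.g. a PDF loader instance."""

    def load(self) -> List[str]: ...


class FallbackPDFLoader:
    """Hypothetical sketch of a loader that retries without image extraction."""

    def __init__(
        self,
        path: str,
        make_loader: Callable[[str, bool], _Loader],  # assumed factory: (path, extract_images)
        extract_images: bool = True,
    ) -> None:
        self.path = path
        self.make_loader = make_loader
        self.extract_images = extract_images

    def load(self) -> List[str]:
        try:
            # First attempt honors the configured image-extraction flag.
            return self.make_loader(self.path, self.extract_images).load()
        except Exception:
            if not self.extract_images:
                # Images were already off, so there is nothing to fall back to.
                raise
            # Retry with image extraction turned off.
            return self.make_loader(self.path, False).load()
```

In practice `make_loader` could wrap LangChain's `PyPDFLoader`, which accepts an `extract_images` argument. Catching a bare `Exception` is a simplification; a real implementation would likely narrow it and log the failure.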

Dependency Updates:

requirements.lite.txt: Upgraded multiple dependencies, including langchain, pypdf, python-pptx, and cryptography, to newer versions for enhanced functionality and bug fixes. Added langchain-ollama as a new dependency so that the Ollama connector works in the Lite version.
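
As a rough way to check whether a given build actually ships the Ollama integration, one could probe for the `langchain_ollama` module (the import name of the langchain-ollama package). The helper names below are hypothetical and are not part of this repository.

```python
import importlib.util


def has_module(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def ollama_available() -> bool:
    """Hypothetical helper: True when langchain-ollama is installed."""
    return has_module("langchain_ollama")


if __name__ == "__main__":
    print("langchain-ollama installed:", ollama_available())
```

A check like this only confirms the package is importable; it says nothing about whether an Ollama server is reachable.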

Testing Additions:

tests/utils/test_document_loader.py: Added unit tests for the SafePyPDFLoader class and its integration into the get_loader function, ensuring proper behavior and compatibility.
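
A test for this kind of fallback behavior might look like the sketch below. It uses a stand-in function rather than the PR's actual `SafePyPDFLoader`, so `load_with_fallback` and `make_loader` are assumptions made for illustration only.

```python
import unittest
from unittest.mock import MagicMock, call


def load_with_fallback(make_loader, path):
    """Stand-in for the behavior under test (hypothetical, not the PR's code)."""
    try:
        return make_loader(path, extract_images=True).load()
    except Exception:
        # Retry once with image extraction disabled.
        return make_loader(path, extract_images=False).load()


class FallbackTest(unittest.TestCase):
    def test_retries_without_images(self):
        # First loader fails on load(), second one succeeds.
        failing = MagicMock()
        failing.load.side_effect = ValueError("image extraction failed")
        working = MagicMock()
        working.load.return_value = ["page text"]
        make_loader = MagicMock(side_effect=[failing, working])

        result = load_with_fallback(make_loader, "sample.pdf")

        self.assertEqual(result, ["page text"])
        self.assertEqual(
            make_loader.call_args_list,
            [
                call("sample.pdf", extract_images=True),
                call("sample.pdf", extract_images=False),
            ],
        )
```

Mocking the loader factory keeps the test independent of any real PDF file or PDF library.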

These changes collectively improve the development workflow, expand document processing capabilities, and ensure the codebase remains up-to-date with the latest library versions.

Screenshots: Before / After (images not preserved)

gafda · 2025-08-01T10:11:15Z

@danny-avila Any news about my PR?

danny-avila · 2025-08-01T13:41:37Z

> Added langchain-ollama as a new dependency to fix Ollama connector to Lite version.

This should not be added to the lite version since the lite version should not be installing all the dependencies required for langchain-ollama.

> Upgraded multiple dependencies, including langchain, pypdf, python-pptx, and cryptography, to newer versions for enhanced functionality and bug fixes.

Can you document the enhanced functionality and bug fixes? what are they?

gafda · 2025-08-04T09:31:39Z

> > Added langchain-ollama as a new dependency to fix Ollama connector to Lite version.
>
> This should not be added to the lite version since the lite version should not be installing all the dependencies required for langchain-ollama.

While the current image provides comprehensive functionality, we've identified a recurring need for a more specialized, lightweight option. We propose a new lite-ollama variant that specifically includes the Ollama library. This would cater effectively to users who require direct access to Ollama for purposes such as LLM testing – a workflow our team extensively utilizes. The streamlined nature of its past availability in a similar configuration was indeed a significant asset for these rapid development environments.

> > Upgraded multiple dependencies, including langchain, pypdf, python-pptx, and cryptography, to newer versions for enhanced functionality and bug fixes.
>
> Can you document the enhanced functionality and bug fixes? what are they?

That's an excellent question, and I appreciate your diligence in understanding the changes.

The immediate and most critical reason for these upgrades was to resolve the catastrophic failures we observed when processing certain PDF files, especially those with security features or complex image content. This specific problem, along with its resolution, is precisely what's demonstrated in the provided screenshots. My contribution to LangChain (here) directly addresses a root cause within that library.

Beyond this crucial fix, keeping our dependencies reasonably up-to-date is a standard industry practice. It ensures we proactively incorporate the latest security patches, benefit from general performance improvements, and maintain compatibility with the evolving Python ecosystem. While comprehensively documenting every minor bug fix or subtle feature enhancement across all upgraded libraries like pypdf, python-pptx, and cryptography would be quite extensive, the cumulative effect is a more robust, secure, and future-proof application, better equipped to handle the wide variety of files we process.

danny-avila · 2025-08-17T17:16:26Z

> While the current image provides comprehensive functionality, we've identified a recurring need for a more specialized, lightweight option. We propose a new lite-ollama variant that specifically includes the Ollama library. This would cater effectively to users who require direct access to Ollama for purposes such as LLM testing – a workflow our team extensively utilizes. The streamlined nature of its past availability in a similar configuration was indeed a significant asset for these rapid development environments.

langchain-ollama significantly increases the resulting image size, so the build would no longer be "lite". You can use the non-lite image instead. I will double-check the image size just to be sure.

danny-avila · 2025-08-17T17:58:41Z

OK, looks like it only adds 100-200 MB, which is fine. Will accept this PR as is.

Upgrade some libraries to improve multiple document loaders. Fixed Li…

c5c51f0

…te edition to work again with Ollama. Safely load PDF with images flag On, with fallback to try and read the file with it Off.

gafda marked this pull request as ready for review July 10, 2025 16:36

danny-avila mentioned this pull request Aug 5, 2025

fix: update LangChain packages for HuggingFace Hub 0.33.1+ compatibility #165

Open

danny-avila changed the title ~~Refactor: Improve document loaders, fix Lite edition, and enhance PDF loading.~~ 🔃 refactor: Improve Document Loaders, fix Lite edition, and PDF loading Aug 17, 2025

danny-avila changed the title ~~🔃 refactor: Improve Document Loaders, fix Lite edition, and PDF loading~~ 🔃 refactor: Improve Document Loaders, add Ollama to Lite edition Aug 17, 2025

danny-avila changed the title ~~🔃 refactor: Improve Document Loaders, add Ollama to Lite edition~~ 🔃 refactor: Improve Document Loaders, add langchain-ollama to Lite Build Aug 17, 2025

danny-avila approved these changes Aug 17, 2025

danny-avila merged commit 15e31da into danny-avila:main Aug 17, 2025
1 check passed

dirkpetersen pushed a commit to dirkpetersen/rag_api that referenced this pull request Aug 23, 2025

🔃 refactor: Improve Document Loaders, add langchain-ollama to Lite …

7799ba4

…Build (danny-avila#170)

danny-avila mentioned this pull request Aug 27, 2025

File upload error issue when changing PDF_EXTRACT_IMAGES=true #144

Closed
