@GaelReinaudi

Fix: Auto-try blank password for encrypted PDFs without password

Problem

Many PDF files are technically "encrypted" but have no real password: an empty string is enough to decrypt them. PDF viewers (Adobe, Preview, etc.) and other PDF libraries (PyPDF2, pikepdf, etc.) handle this case without asking the user for anything.

Currently, agno's PDFReader fails immediately when encountering such PDFs, causing the "zombie document bug":

  • PDF is uploaded to PostgreSQL ✅
  • PDF fails to vectorize due to "password protected" error ❌
  • Document exists in content DB but NOT in vector DB ❌
  • Document is unsearchable despite being "uploaded" ❌

Why Can't Users Just Pass password=""?

Three blockers prevent this:

  1. S3Reader hardcodes PDFReader() (line 62 in s3_reader.py):

    return PDFReader().read(pdf=BytesIO(...), name=doc_name)

    Even if S3Reader accepted a password parameter, it would not be passed on to PDFReader.

  2. The or operator bug (line 237 in pdf_reader.py):

    pdf_password = password or self.password

    If you pass password="", the empty string is falsy in Python, so the expression evaluates to self.password and the explicit blank password is silently dropped.

  3. Users don't know the password is blank:
    How would a user know to pass password=""? PDF viewers handle this automatically - agno should too.

This fix makes agno match industry-standard behavior by auto-trying blank passwords.
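
The `or` pitfall in blocker 2 is easy to reproduce in isolation. The sketch below uses two hypothetical helper names (`resolve_password_with_or` and `resolve_password_explicit`) that are not part of agno. The second helper shows one possible way to tell "no password given" apart from "blank password given", by comparing against None instead of relying on truthiness.

```python
from typing import Optional


def resolve_password_with_or(password: Optional[str], default: Optional[str]) -> Optional[str]:
    # Mirrors the `password or self.password` pattern: every falsy value,
    # including the empty string, falls through to the default.
    return password or default


def resolve_password_explicit(password: Optional[str], default: Optional[str]) -> Optional[str]:
    # Only None means "not provided"; an explicit "" is preserved.
    return password if password is not None else default
```

With a default of "instance-secret", the first helper returns "instance-secret" for password="", while the second returns "". This PR does not depend on that distinction, since it tries the blank password whenever no usable password is available.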

Solution

Automatically try a blank password ("") when:

  1. A PDF is encrypted
  2. No password is provided by the user

This matches industry-standard behavior and prevents the zombie document bug.

Changes

Modified libs/agno/agno/knowledge/reader/pdf_reader.py:

  • _decrypt_pdf() method now attempts blank password before failing
  • Only logs error if blank password also fails
  • Maintains backward compatibility with explicit password usage
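
The decision flow described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `decrypt_with_blank_fallback`, `SupportsDecrypt`, and `FakeReader` are hypothetical names. The function only assumes a reader object that exposes an `is_encrypted` attribute and a `decrypt(password)` method returning a falsy value on failure, which is the shape of interface that pypdf's `PdfReader` offers.

```python
import logging
from typing import Optional, Protocol

logger = logging.getLogger(__name__)


class SupportsDecrypt(Protocol):
    """Assumed reader interface (hypothetical name, modeled on pypdf's PdfReader)."""

    is_encrypted: bool

    def decrypt(self, password: str) -> int: ...


def decrypt_with_blank_fallback(
    reader: SupportsDecrypt, name: str, password: Optional[str] = None
) -> bool:
    """Return True if the document is readable, False if decryption failed."""
    if not reader.is_encrypted:
        return True  # Nothing to decrypt.

    if password:
        # An explicit, non-empty password is used directly (existing behavior).
        if reader.decrypt(password):
            return True
        # The wording of this message is illustrative only.
        logger.error('Failed to decrypt PDF file "%s" with the provided password', name)
        return False

    # No password supplied: try the blank password before giving up.
    if reader.decrypt(""):
        logger.info('Successfully decrypted PDF file "%s" with blank password', name)
        return True

    logger.error('PDF file "%s" is password protected but no password provided', name)
    return False


class FakeReader:
    """Stand-in reader used only to exercise the sketch without a real PDF."""

    def __init__(self, encrypted: bool, user_password: str = "") -> None:
        self.is_encrypted = encrypted
        self._user_password = user_password

    def decrypt(self, password: str) -> int:
        # Non-zero on success, zero on failure.
        return 1 if password == self._user_password else 0
```

Calling the function with `FakeReader(True, "")` and no password succeeds through the blank-password path, while `FakeReader(True, "secret")` with no password still fails, so a genuinely protected file is not silently accepted.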

Behavior

Scenario                            | Before   | After
Not encrypted                       | ✅ Works | ✅ Works
Encrypted, no password, blank works | ❌ Fails | ✅ Works
Encrypted, no password, blank fails | ❌ Fails | ❌ Fails (same)
Encrypted, password provided        | ✅ Works | ✅ Works

Testing

Tested with real-world encrypted-but-no-password PDF (4.BH-Bevel-Helical_Manual-1.pdf):

  • Before: ERROR PDF file "X" is password protected but no password provided
  • After: INFO Successfully decrypted PDF file "X" with blank password → vectorization succeeds

Impact

  • S3Reader: Automatically benefits (uses PDFReader internally)
  • Local uploads: Automatically benefits
  • No breaking changes: Existing code with explicit passwords continues to work
  • User experience: "Just works" for common encrypted PDFs

Additional Context

This is a common pattern in PDF libraries and viewers:

  • PyPDF2: Automatically tries blank password
  • pikepdf: Allows empty password in decrypt
  • Adobe Reader: Opens "secured" PDFs without prompting for password

agno should match this industry-standard behavior.


How to Review

  1. Check the logic flow in _decrypt_pdf()
  2. Verify blank password is only tried when no password provided
  3. Confirm existing password-based decryption still works
  4. Test with encrypted-but-no-password PDF if available

Checklist

  • Code change is minimal and focused
  • Maintains backward compatibility
  • Improves user experience
  • Matches industry-standard behavior
  • Clear commit message with context

…assword

Many PDF files are technically 'encrypted' but don't require an actual password -
they just need an empty string to decrypt. This is common behavior in PDF viewers
and other PDF libraries.

WHY CAN'T USERS JUST PASS password=""?

Three blockers prevent this:

1. S3Reader hardcodes PDFReader() with no parameters (s3_reader.py:62)
   Even if S3Reader accepted a password parameter, it doesn't pass it to PDFReader.

2. The 'or' operator bug (pdf_reader.py:237):
   pdf_password = password or self.password
   If you pass password="", the empty string is falsy in Python, so the expression
   evaluates to self.password and the explicit blank password is dropped.

3. Users don't know the password is blank
   How would a user know to pass password=""? PDF viewers handle this automatically.

This fix automatically tries a blank password when:
- A PDF is encrypted
- No password is provided by the user

This prevents the 'zombie document' bug where encrypted-but-no-password PDFs
would fail to process, ending up in PostgreSQL but not vectorized in the vector DB.

Behavior:
- If PDF is not encrypted: proceed normally
- If PDF is encrypted and no password provided: try blank password first
- If blank password works: success (log info message)
- If blank password fails: return error (existing behavior)
- If password is provided: use it directly (existing behavior)

This matches the behavior of popular PDF libraries and viewers (PyPDF2, pikepdf,
Adobe Reader) and improves the user experience by handling a common edge case
automatically.
@GaelReinaudi requested a review from a team as a code owner on October 23, 2025, 18:10
…ord-auto-try

# Conflicts:
#	libs/agno/agno/knowledge/reader/pdf_reader.py