[Bug] Metascraper parsers that do internal requests don't respect the proxy settings #1863

@llCorvinSll

Describe the Bug

When a direct internet connection is not available and the site can only be reached through a proxy, the crawler job fails with a timeout.
Removing the metascraperLogo() plugin from the metascraper setup seems to resolve the issue.

Setting a generic HTTP_PROXY environment variable also seems to solve the problem, but this approach can occasionally break other internal network connections, and not all libraries respect the NO_PROXY variable.
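To illustrate why relying on NO_PROXY is fragile, here is a minimal sketch of one possible matching rule. The function name `shouldBypassProxy`, the matching rules, and the example hostnames are assumptions made for illustration, not Karakeep or metascraper code. Real libraries differ in how they treat wildcards, leading dots, ports, and IP ranges, which is why the same NO_PROXY value can behave differently from one library to the next.

```typescript
// Sketch only: decide whether a host should bypass the proxy, given a
// NO_PROXY-style comma-separated list. The rules below are an assumption;
// they do not describe any particular library's behaviour.
function shouldBypassProxy(hostname: string, noProxy: string): boolean {
  const host = hostname.trim().toLowerCase();
  const entries = noProxy
    .split(",")
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry.length > 0);

  for (const entry of entries) {
    // A lone "*" disables the proxy for every host.
    if (entry === "*") return true;

    // Treat ".example.com" and "example.com" the same way.
    const domain = entry.charAt(0) === "." ? entry.slice(1) : entry;
    const suffix = "." + domain;

    // Match the domain itself or any of its subdomains.
    if (host === domain || host.slice(-suffix.length) === suffix) return true;
  }
  return false;
}
```

With a hypothetical `NO_PROXY=localhost,.internal.lan`, a request to `www.instagram.com` would still go through the proxy, while `db.internal.lan` would be reached directly.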

Steps to Reproduce

  1. Block the target site, or internet access in general, for the crawler.
  2. Set up a proxy for accessing the target site (Instagram, etc.).
  3. Add a link to the blocked site to Karakeep.

Expected Behaviour

Scraping must rely 100% on the proxy settings.

Screenshots or Additional Context

The message "Done extracting metadata from the page." never appears in the worker logs:

2025-08-21T19:36:27.309Z info: [Crawler][172] Successfully navigated to "https://www.instagram.com/p/DKe1DhCtlUB". Waiting for the page to load ...
2025-08-21T19:36:30.045Z info: [Crawler][172] Finished waiting for the page to load.
2025-08-21T19:36:30.071Z info: [Crawler][172] Successfully fetched the page content.
2025-08-21T19:36:30.132Z info: [Crawler][172] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-08-21T19:36:30.148Z info: [Crawler][172] Will attempt to extract metadata from page ...
2025-08-21T19:36:30.296Z info: [Crawler][172] Will attempt to extract readable content ...
2025-08-21T19:36:30.560Z info: [Crawler][172] Done extracting readable content.
2025-08-21T19:36:30.652Z info: [Crawler][172] Stored the screenshot as assetId: f8d970d7-83f3-4908-a5c9-1c65a70048fd
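The log excerpt above can be checked automatically. The following is a rough sketch of a helper that lists crawler jobs which logged the start of metadata extraction but never logged its completion. The function name `findStalledMetadataJobs` is hypothetical, and the sketch assumes the completion message carries the same `[Crawler][<id>]` prefix as the other lines.

```typescript
// Sketch only: message texts are copied from the log excerpt in this issue;
// the "[Crawler][<id>]" prefix on the completion line is an assumption.
const START_MSG = "Will attempt to extract metadata from page";
const DONE_MSG = "Done extracting metadata from the page.";
const JOB_ID = /\[Crawler\]\[(\d+)\]/;

function findStalledMetadataJobs(logLines: string[]): string[] {
  const started: string[] = [];
  const done: { [jobId: string]: boolean } = {};

  for (const line of logLines) {
    const match = JOB_ID.exec(line);
    if (match === null) continue; // not a crawler line
    const jobId = match[1];

    if (line.indexOf(START_MSG) !== -1) {
      // Remember each job once, in the order it started extraction.
      if (started.indexOf(jobId) === -1) started.push(jobId);
    } else if (line.indexOf(DONE_MSG) !== -1) {
      done[jobId] = true;
    }
  }

  // Jobs that started metadata extraction but never reported completion.
  return started.filter((jobId) => !done[jobId]);
}
```

Run against the lines above, this would report job `172`, which matches the observed timeout.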

Device Details

No response

Exact Karakeep Version

0.26.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem

Labels

bug (Something isn't working), pri/high (High priority issue), status/approved (This issue is ready to be implemented)
