Better resource & state management with UserPool #3090

@barjin

Description

Resource management is currently done in multiple places (BrowserPool, SessionPool, ProxyConfiguration...), which leads to complexity and potential resource conflicts.

Typical issue:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const crawler = new PlaywrightCrawler({
    proxyConfiguration: new ProxyConfiguration({
        // one inner array per proxy tier
        tieredProxyUrls: [['first.proxy'], ['second.proxy']],
    }),
    // implicit browserPool,
    // implicit sessionPool,
    retryOnBlocked: true,
    // ...
});

await crawler.run(['https://a-heavily-protected.page']);

While a 403 response from a-heavily-protected.page will cause a session rotation (and a proxy URL rotation), the next request will still be made through first.proxy, because the BrowserPool instance reuses the running browser (the proxy URL cannot be changed for a running browser process).

This is unexpected, and working around it requires additional options (e.g. browserPerProxy). Those, in turn, may degrade performance, as Crawlee will try to launch too many browser instances. The developer experience (DX) of adding new crawlers or browser types is also not great.
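
To make the mismatch concrete, here is a minimal toy model of the behavior described above. It does not use Crawlee at all: `ToyBrowserPool` and `ToyProxyRotation` are hypothetical stand-ins, and the real BrowserPool and proxy tiering logic are more involved. The sketch only assumes that a running browser is reused and that its proxy is fixed at launch.

```typescript
// Toy model only: hypothetical stand-ins, not Crawlee's BrowserPool or ProxyConfiguration.
interface ToyBrowser {
    readonly id: number;
    readonly proxyUrl: string; // fixed at launch, cannot change while the process runs
}

class ToyBrowserPool {
    private running?: ToyBrowser;
    private launched = 0;

    // Reuses the running browser whenever one exists, regardless of the requested proxy.
    getBrowser(requestedProxyUrl: string): ToyBrowser {
        if (!this.running) {
            this.launched += 1;
            this.running = { id: this.launched, proxyUrl: requestedProxyUrl };
        }
        return this.running;
    }
}

class ToyProxyRotation {
    private tier = 0;
    private readonly tiers: string[];

    constructor(tiers: string[]) {
        this.tiers = tiers;
    }

    current(): string {
        return this.tiers[this.tier];
    }

    // Called after a blocked response (e.g. a 403): move to the next tier if there is one.
    rotate(): void {
        this.tier = Math.min(this.tier + 1, this.tiers.length - 1);
    }
}

const proxies = new ToyProxyRotation(['first.proxy', 'second.proxy']);
const pool = new ToyBrowserPool();

const firstAttempt = pool.getBrowser(proxies.current());
proxies.rotate(); // the 403 triggers a rotation
const retry = pool.getBrowser(proxies.current());

console.log(proxies.current());      // the rotation selected 'second.proxy'
console.log(retry.proxyUrl);         // but the reused browser is still on 'first.proxy'
console.log(firstAttempt === retry); // true: the same browser instance served both attempts
```

The rotation and the browser lifecycle are managed by two components that do not know about each other, which is the kind of conflict this issue is about.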

Proposed solution

Consolidate the resource management system around ResourceOwners (users) and a UserPool.

import { chromium, firefox } from 'playwright';
// `UserPool` and `BrowserUser` are the API proposed in this issue.

const crawler = new PlaywrightCrawler({
    userPool: new UserPool([
        // `BrowserUser` instance owns the browser process
        new BrowserUser({ proxy: 'first.proxy', launcher: chromium }),
        new BrowserUser({ proxy: 'second.proxy', launcher: firefox }),
    ]),
    requestHandler: async ({ response: { statusCode }, switchUser }) => {
        if (statusCode === 403) {
            // changes `request.userId` and throws, so the request is retried
            // with a different user (+ browser + proxy)
            await switchUser();
        }
    },
    // ...
});

await crawler.run(['https://a-heavily-protected.page']);
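
The `switchUser` semantics above (reassign `request.userId`, then throw so the request is retried) could be modeled roughly as follows. This is a self-contained sketch under stated assumptions, not the PoC implementation: `ToyUser`, `ToyUserPool`, `SwitchUserError` and `runWithRetries` are hypothetical names, users are picked round-robin, and `switchUser` is synchronous here for brevity, whereas the proposal awaits it.

```typescript
// Hypothetical sketch: none of these names come from Crawlee or the PoC PR.
interface ToyUser {
    readonly id: string;
    readonly proxy: string; // each user owns its own resources (here just a proxy)
}

interface ToyRequest {
    url: string;
    userId: string;
}

class SwitchUserError extends Error {
    constructor() {
        super('switching user');
        this.name = 'SwitchUserError';
    }
}

function isSwitchUser(error: unknown): error is SwitchUserError {
    return error instanceof Error && error.name === 'SwitchUserError';
}

class ToyUserPool {
    private readonly users: ToyUser[];

    constructor(users: ToyUser[]) {
        if (users.length === 0) throw new Error('ToyUserPool needs at least one user');
        this.users = users;
    }

    firstUserId(): string {
        return this.users[0].id;
    }

    get(userId: string): ToyUser {
        const user = this.users.find((u) => u.id === userId);
        if (!user) throw new Error(`Unknown user: ${userId}`);
        return user;
    }

    // Round-robin (an assumption): return the user that follows `userId` in the pool.
    nextAfter(userId: string): ToyUser {
        const index = this.users.findIndex((u) => u.id === userId);
        return this.users[(index + 1) % this.users.length];
    }
}

type Handler = (ctx: { request: ToyRequest; user: ToyUser; switchUser: () => never }) => void;

// Runs the handler, retrying with the newly assigned user whenever `switchUser` was called.
function runWithRetries(pool: ToyUserPool, url: string, handler: Handler, maxRetries = 3): string[] {
    const request: ToyRequest = { url, userId: pool.firstUserId() };
    const proxiesUsed: string[] = [];
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        const user = pool.get(request.userId);
        proxiesUsed.push(user.proxy);
        const switchUser = (): never => {
            request.userId = pool.nextAfter(request.userId).id;
            throw new SwitchUserError();
        };
        try {
            handler({ request, user, switchUser });
            return proxiesUsed;
        } catch (error) {
            if (!isSwitchUser(error)) throw error; // unrelated errors are not swallowed
        }
    }
    throw new Error(`Gave up on ${url} after ${maxRetries} retries`);
}

const userPool = new ToyUserPool([
    { id: 'user-1', proxy: 'first.proxy' },
    { id: 'user-2', proxy: 'second.proxy' },
]);

// Pretend that 'first.proxy' is blocked (403) and 'second.proxy' is not.
const proxiesUsed = runWithRetries(userPool, 'https://a-heavily-protected.page', ({ user, switchUser }) => {
    const statusCode = user.proxy === 'first.proxy' ? 403 : 200;
    if (statusCode === 403) switchUser();
});

console.log(proxiesUsed); // first attempt on 'first.proxy', retry on 'second.proxy'
```

Because the user owns both the browser and the proxy, switching the user switches them together, so the retry cannot end up on a browser that is still bound to the old proxy.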

See initial ideas behind the feature in Notion and incomplete design docs at https://jindrich.bar/misc/userpool-rfc.

PoC PR is at #3048

Labels: t-tooling (issues with this label are in the ownership of the tooling team)
