Description
Resource management is currently done in multiple places (BrowserPool, SessionPool, ProxyConfiguration...), which leads to complexity and potential resource conflicts.
Typical issue:
const crawler = new PlaywrightCrawler({
    proxyConfiguration: new ProxyConfiguration({
        tieredUrls: ['first.proxy', 'second.proxy'],
    }),
    // implicit browserPool,
    // implicit sessionPool,
    retryOnBlocked: true,
    ...
});
await crawler.run(['https://a-heavily-protected.page']);
While a 403 response from a-heavily-protected.page will cause a session rotation (and with it a proxy URL rotation), the next request will still be made through first.proxy, because the BrowserPool instance reuses the already running browser (the proxy URL of a running browser process cannot be changed).
This behavior is unexpected, and working around it requires other options (e.g. browserPerProxy). Those, in turn, may degrade performance, as Crawlee will try to launch too many browser instances. The DX for adding new crawlers or browser types is also not great.
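
The mismatch described above can be reproduced with a toy model. This is only a sketch: `ToyBrowser` and `ToyBrowserPool` are hypothetical names invented for illustration and are not Crawlee classes. The only assumption carried over from the issue is that a browser's proxy is fixed at launch and that the pool prefers reusing a running browser.

```typescript
// Toy model only: none of these types are Crawlee APIs.
interface ToyBrowser {
    id: number;
    proxyUrl: string; // fixed at launch, cannot be changed afterwards
}

class ToyBrowserPool {
    private running: ToyBrowser | undefined;
    private nextId = 1;

    // Reuses the running browser if there is one, ignoring the requested proxy.
    getBrowser(requestedProxyUrl: string): ToyBrowser {
        if (this.running === undefined) {
            this.running = { id: this.nextId++, proxyUrl: requestedProxyUrl };
        }
        return this.running;
    }
}

const firstProxy = 'first.proxy';
const secondProxy = 'second.proxy';
const pool = new ToyBrowserPool();

// First attempt: the session uses the first proxy and a browser is launched with it.
const firstAttempt = pool.getBrowser(firstProxy);

// After a 403, the session (and its proxy URL) rotates to the second proxy,
// but the pool hands back the browser that is already running.
const secondAttempt = pool.getBrowser(secondProxy);

console.log(secondAttempt.id === firstAttempt.id); // true
console.log(secondAttempt.proxyUrl); // first.proxy
```

The session layer believes the retry goes through `second.proxy`, while the traffic still leaves through `first.proxy`, because the two resources are managed independently.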
Proposed solution
Consolidate the resource management system around ResourceOwners (users) and a UserPool.
const crawler = new PlaywrightCrawler({
    userPool: new UserPool([
        // each `BrowserUser` instance owns its browser process
        new BrowserUser({ proxy: 'first.proxy', launcher: chromium }),
        new BrowserUser({ proxy: 'second.proxy', launcher: firefox }),
    ]),
    // the handler must be async, since it awaits `switchUser()`
    requestHandler: async ({ response: { statusCode }, switchUser }) => {
        // the status code is a number, not a string
        if (statusCode === 403) {
            await switchUser(); // changes `request.userId` and throws, so the request is retried with a different user (+ browser + proxy)
        }
    },
    ...
});
await crawler.run(['https://a-heavily-protected.page']);
See the initial ideas behind the feature in Notion and the incomplete design docs at https://jindrich.bar/misc/userpool-rfc.
A PoC PR is at #3048.
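
To make the "changes `request.userId` and throws" semantics of `switchUser` concrete, here is a minimal sketch of how such a pool could behave. Everything in it is an assumption made for illustration: `ToyUser`, `ToyRequest`, `ToyUserPool`, `runWithRetries`, the round-robin rotation and the `'SwitchUserError'` error name are hypothetical and are not taken from the PoC PR or the design docs. It is also synchronous for brevity, whereas a real crawler would be async.

```typescript
// Hypothetical sketch: none of these names come from Crawlee or the PoC PR.
interface ToyUser {
    id: string;
    proxy: string;
}

interface ToyRequest {
    url: string;
    userId?: string;
}

class ToyUserPool {
    private users: ToyUser[];

    constructor(users: ToyUser[]) {
        if (users.length === 0) throw new Error('ToyUserPool needs at least one user');
        this.users = users;
    }

    private pick(index: number): ToyUser {
        const user = this.users[index];
        if (user === undefined) throw new Error(`no user at index ${index}`);
        return user;
    }

    // Returns the user pinned to the request, or pins the first user.
    userFor(request: ToyRequest): ToyUser {
        const pinned = this.users.find((user) => user.id === request.userId);
        const user = pinned ?? this.pick(0);
        request.userId = user.id;
        return user;
    }

    // Pins the next user (round-robin) and throws, so that the caller retries.
    switchUser(request: ToyRequest): never {
        const index = this.users.findIndex((user) => user.id === request.userId);
        const next = this.pick((index + 1) % this.users.length);
        request.userId = next.id;
        const error = new Error(`retry ${request.url} as ${next.id}`);
        error.name = 'SwitchUserError';
        throw error;
    }
}

type ToyHandler = (user: ToyUser, switchUser: () => never) => string;

// Re-runs the handler whenever it asks for a different user.
function runWithRetries(pool: ToyUserPool, request: ToyRequest, handler: ToyHandler, maxRetries = 3): string {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        const user = pool.userFor(request);
        try {
            return handler(user, () => pool.switchUser(request));
        } catch (error) {
            const isSwitch = error instanceof Error && error.name === 'SwitchUserError';
            if (!isSwitch) throw error; // unrelated failures are not swallowed
        }
    }
    throw new Error(`gave up on ${request.url}`);
}

const userPool = new ToyUserPool([
    { id: 'user-1', proxy: 'first.proxy' },
    { id: 'user-2', proxy: 'second.proxy' },
]);
const request: ToyRequest = { url: 'https://a-heavily-protected.page' };

// Pretend that first.proxy gets a 403 and second.proxy does not.
const usedProxy = runWithRetries(userPool, request, (user, switchUser) => {
    const statusCode = user.proxy === 'first.proxy' ? 403 : 200;
    if (statusCode === 403) switchUser();
    return user.proxy;
});

console.log(usedProxy); // second.proxy
```

Because the user owns both the proxy and (in the proposal) the browser process, pinning `request.userId` is enough to guarantee that the retry actually leaves through a different proxy.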