Commit 2ff3efa

💄 style: add blockAds & stealth params for Browserless (lobehub#8255)

* ✨ feat: add blockAds & stealth params for Browserless
* Apply `sourcery-ai` suggestion

  Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
* 📝 docs: add docs for `BROWSERLESS_BLOCK_ADS` & `BROWSERLESS_STEALTH_MODE`

---------

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

1 parent f63b137 commit 2ff3efa

3 files changed

+79
-1
lines changed

docs/self-hosting/advanced/online-search.mdx

Lines changed: 34 additions & 0 deletions
@@ -88,6 +88,40 @@ BROWSERLESS_URL=https://chrome.browserless.io
 
 ---
 
+## `BROWSERLESS_BLOCK_ADS`
+
+Enables ad blocking functionality. When using [Browserless](https://www.browserless.io/) for web scraping, it automatically blocks common ad resources (such as scripts, images, trackers, etc.), improving scraping speed and page clarity.
+
+```env
+BROWSERLESS_BLOCK_ADS=1
+```
+
+> 📌 Supported values:
+>
+> * `1`: Enable ad blocking (recommended);
+> * `0`: Disable ad blocking (default).
+
+> ✅ It is recommended to use with `BROWSERLESS_STEALTH_MODE=1` to enhance stealth and scraping success rate.
+
+---
+
+## `BROWSERLESS_STEALTH_MODE`
+
+Enables stealth mode. When using [Browserless](https://www.browserless.io/) for web scraping, it applies various anti-detection techniques (such as modifying the user agent, removing webdriver traits, simulating user interactions) to bypass anti-bot mechanisms.
+
+```env
+BROWSERLESS_STEALTH_MODE=1
+```
+
+> 📌 Supported values:
+>
+> * `1`: Enable stealth mode (recommended);
+> * `0`: Disable stealth mode (default).
+
+> ⚠️ Some websites use advanced anti-scraping techniques. Enabling stealth mode can significantly improve scraping success rate.
+
+---
+
 ## `GOOGLE_PSE_ENGINE_ID`
 
 Configure the Search Engine ID for Google Programmable Search Engine (Google PSE), used to restrict the search scope. Must be used alongside `GOOGLE_PSE_API_KEY`.
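
Reviewer note: the documented values map onto a strict string comparison in `browserless.ts` (`process.env.BROWSERLESS_BLOCK_ADS === '1'`). The sketch below illustrates what that comparison implies for other values. The `parseFlag` helper and the `env` object are hypothetical names used only for illustration; the commit itself compares `process.env` values inline.

```typescript
// Hypothetical helper mirroring the `process.env.X === '1'` comparison in browserless.ts.
// Only the exact string '1' enables a flag; anything else, including 'true', leaves it off.
function parseFlag(value: string | undefined): boolean {
  return value === '1';
}

// Illustrative environment; the real module reads these from process.env.
const env: Record<string, string | undefined> = {
  BROWSERLESS_BLOCK_ADS: '1',
  BROWSERLESS_STEALTH_MODE: 'true', // not the literal '1', so stealth stays disabled
};

const blockAds = parseFlag(env.BROWSERLESS_BLOCK_ADS);
const stealth = parseFlag(env.BROWSERLESS_STEALTH_MODE);

console.log({ blockAds, stealth }); // { blockAds: true, stealth: false }
```

Under this reading, an unset variable behaves the same as `0`, which matches the documented default.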

docs/self-hosting/advanced/online-search.zh-CN.mdx

Lines changed: 34 additions & 0 deletions
@@ -84,6 +84,40 @@ BROWSERLESS_URL=https://chrome.browserless.io
 
 ---
 
+## `BROWSERLESS_BLOCK_ADS`
+
+启用广告拦截功能,在使用 [Browserless](https://www.browserless.io/) 进行网页抓取时自动屏蔽常见广告资源(如脚本、图片、追踪器等),提高抓取速度与页面清晰度。
+
+```env
+BROWSERLESS_BLOCK_ADS=1
+```
+
+> 📌 支持的值:
+>
+> * `1`:启用广告拦截(推荐);
+> * `0`:禁用广告拦截(默认)。
+
+> ✅ 建议与 `BROWSERLESS_STEALTH_MODE=1` 一起使用,提高爬虫的隐蔽性和成功率。
+
+---
+
+## `BROWSERLESS_STEALTH_MODE`
+
+启用隐身模式,在使用 [Browserless](https://www.browserless.io/) 抓取网页时,通过一系列防检测手段(如修改 UA、移除 webdriver 特征、模拟用户操作)来规避反爬虫机制。
+
+```env
+BROWSERLESS_STEALTH_MODE=1
+```
+
+> 📌 支持的值:
+>
+> * `1`:启用隐身模式(推荐);
+> * `0`:禁用隐身模式(默认)。
+
+> ⚠️ 某些网站存在高级反爬机制,启用隐身模式可以显著提升抓取成功率。
+
+---
+
 ## `GOOGLE_PSE_ENGINE_ID`
 
 配置 Google Programmable Search Engine(Google PSE)的搜索引擎 ID,用于限定搜索范围。需配合 `GOOGLE_PSE_API_KEY` 一起使用。

packages/web-crawler/src/crawImpl/browserless.ts

Lines changed: 11 additions & 1 deletion
@@ -10,6 +10,9 @@ const REJECT_REQUEST_PATTERN =
   '.*\\.(?!(html|css|js|json|xml|webmanifest|txt|md)(\\?|#|$))[\\w-]+(?:[\\?#].*)?$';
 const BROWSERLESS_TOKEN = process.env.BROWSERLESS_TOKEN;
 
+const BROWSERLESS_BLOCK_ADS = process.env.BROWSERLESS_BLOCK_ADS === '1';
+const BROWSERLESS_STEALTH_MODE = process.env.BROWSERLESS_STEALTH_MODE === '1';
+
 class BrowserlessInitError extends Error {
   constructor() {
     super('`BROWSERLESS_URL` or `BROWSERLESS_TOKEN` are required');
@@ -30,7 +33,14 @@ export const browserless: CrawlImpl = async (url, { filterOptions }) => {
 
   try {
     const res = await fetch(
-      qs.stringifyUrl({ query: { token: BROWSERLESS_TOKEN }, url: urlJoin(BASE_URL, '/content') }),
+      qs.stringifyUrl({
+        query: {
+          blockAds: BROWSERLESS_BLOCK_ADS,
+          launch: JSON.stringify({ stealth: BROWSERLESS_STEALTH_MODE }),
+          token: BROWSERLESS_TOKEN,
+        },
+        url: urlJoin(BASE_URL, '/content'),
+      }),
       {
         body: JSON.stringify(input),
         headers: {
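
Reviewer note: to see roughly what request URL the new query parameters produce, here is a minimal standalone sketch. It is an approximation under stated assumptions, not the code in this commit: `buildContentUrl` is a hypothetical helper, it uses the standard `URL` API in place of `query-string` and `url-join`, and `demo-token` is a placeholder value. The base URL is the one shown in the documentation hunk header.

```typescript
// Hypothetical helper: approximates the `/content` request URL built in browserless.ts.
// Assumption: booleans are serialized as the strings 'true' / 'false', and the
// `launch` parameter carries a JSON-encoded object, as in the diff above.
function buildContentUrl(
  baseUrl: string,
  token: string,
  blockAds: boolean,
  stealth: boolean,
): string {
  // Note: new URL('/content', baseUrl) discards any path already present in baseUrl,
  // which may differ from how `url-join` treats a base URL that has a path.
  const url = new URL('/content', baseUrl);
  url.searchParams.set('blockAds', String(blockAds));
  url.searchParams.set('launch', JSON.stringify({ stealth }));
  url.searchParams.set('token', token);
  return url.toString();
}

// Placeholder token for illustration only.
const contentUrl = buildContentUrl('https://chrome.browserless.io', 'demo-token', true, true);
console.log(contentUrl);
```

The `launch` value is percent-encoded in the final URL, so the JSON braces and quotes do not appear literally in the query string.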
