**README.md** (12 additions, 5 deletions)
Scrapling isn't just another Web Scraping library. It's the first **adaptive** scraping library that learns from website changes and evolves with them. While other libraries break when websites update their structure, Scrapling automatically relocates your elements and keeps your scrapers running.
Built for the modern Web, Scrapling features its own rapid parsing engine and fetchers to handle all Web Scraping challenges you face or will face. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.
### Advanced Websites Fetching with Session Support
- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. It can impersonate browsers' TLS fingerprints and headers, and use HTTP/3.
- **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class, supporting Playwright's Chromium, real Chrome, and a custom stealth mode.
- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher`, using a modified version of Firefox and fingerprint spoofing. It can easily bypass all types of Cloudflare's Turnstile and Interstitial challenges with automation.
- **Session Management**: Persistent session support with the `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
- **Async Support**: Complete async support across all fetchers, plus dedicated async session classes.
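To make the fetcher and session classes above concrete, here is a minimal sketch of how a one-off request and a persistent session might look. The class names come from this README, but the exact method signatures (for example, the `impersonate` argument) are assumptions on our part, so consult the API reference before relying on them:

```python
# Minimal sketch of the fetchers described above. The class names come from
# this README; the exact signatures (e.g., the `impersonate` argument) are
# assumptions -- check the Scrapling API reference before relying on them.
from scrapling.fetchers import Fetcher, FetcherSession

# One-off stealthy HTTP request, impersonating a real browser's fingerprint
page = Fetcher.get("https://example.com", impersonate="chrome")
print(page.status)

# Persistent session: cookies and state are kept across requests
with FetcherSession(impersonate="chrome") as session:
    page = session.get("https://example.com")
    # The returned page exposes the parsing engine, e.g., CSS selectors
    titles = [el.text for el in page.css("h1")]
```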
> There are many additional features, such as the MCP server and the interactive Web Scraping Shell, but we want to keep this page concise. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/)
## Performance Benchmarks
Scrapling isn't just powerful—it's also blazing fast, and the updates since version 0.3 have delivered exceptional performance improvements across all operations.
### Text Extraction Speed Test (5000 nested elements)
Starting with v0.3.2, this installation only includes the parser engine and its…
Don't forget that you need to install the browser dependencies with `scrapling install` after installing any of these extras (if you haven't already).
### Docker
You can also use a Docker image with all extras and browsers installed by pulling it with the following command:
```bash
docker pull scrapling
```
This image is automatically built and pushed to Docker Hub through GitHub Actions.
## Contributing
We welcome contributions! Please read our [contributing guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.
## Disclaimer
> [!CAUTION]
> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files.
**docs/ai/mcp-server.md** (38 additions, 3 deletions)
The Scrapling MCP Server provides six powerful tools for web scraping:
- **`bulk_fetch`**: An async version of the above tool that allows scraping of multiple URLs in different browser tabs at the same time!
### 🔒 Stealth Scraping
- **`stealthy_fetch`**: Uses our modified version of the Camoufox browser to bypass Cloudflare Turnstile/Interstitial and other anti-bot systems, with complete control over the request/browser!
- **`bulk_stealthy_fetch`**: An async version of the above tool that allows stealth scraping of multiple URLs in different browser tabs at the same time!
### Key Capabilities
- **Smart Content Extraction**: Convert web pages/elements to Markdown or HTML, or extract a clean version of the text content
- **CSS Selector Support**: Use the Scrapling engine to target specific elements with precision before handing the content to the AI
- **Anti-Bot Bypass**: Handle Cloudflare Turnstile, Interstitial, and other protections
- **Proxy Support**: Use proxies for anonymity and geo-targeting
- **Browser Impersonation**: Mimic real browsers with TLS fingerprinting, real browser headers matching that version, and more
- **Parallel Processing**: Scrape multiple URLs concurrently for efficiency
#### But why use Scrapling MCP Server instead of other available tools?
Aside from its stealth capabilities and ability to bypass Cloudflare Turnstile/Interstitial, Scrapling's server is the only one that allows you to pass a CSS selector in the prompt to extract specific elements before handing the content to the AI.
The way other servers work is that they extract the content, then pass all of it to the AI to extract the fields you want. This causes the AI to consume many tokens that are not needed (from irrelevant content). Scrapling solves this problem by letting you pass a CSS selector to narrow down the content before passing it to the AI, which makes the whole process much faster and more efficient.
```bash
pip install "scrapling[ai]"
scrapling install
```
Or use the Docker image directly:
```bash
docker pull scrapling
```
## Setting up the MCP Server
52
57
53
58
Here we will explain how to add Scrapling MCP Server to [Claude Desktop](https://claude.ai/download) and [Claude Code](https://www.anthropic.com/claude-code), but the same logic applies to any other chatbot that supports MCP:
For me, on my Mac, it returned `/Users/<MyUsername>/.venv/bin/scrapling`, so the…
#### Docker
If you are using the Docker image, the configuration would look something like this:
```json
{
  "mcpServers": {
    "ScraplingServer": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm", "scrapling", "mcp"
      ]
    }
  }
}
```
The same logic applies to [Cursor](https://docs.cursor.com/en/context/mcp), [WindSurf](https://windsurf.com/university/tutorials/configuring-first-mcp-server), and others.
Here's the main article from Anthropic on how to add MCP servers to Claude Code.
Then, after you've added the server, you need to completely quit and restart the app you used above. In Claude Desktop, you should see an MCP server indicator (🔧) in the bottom-right corner of the chat input or see `ScraplingServer` in the `Search and tools` dropdown in the chat input box.
### Streamable HTTP
As of version 0.3.6, we have added the ability to make the MCP server use the 'Streamable HTTP' transport mode instead of the traditional 'stdio' transport.
So instead of using the following command (the 'stdio' one):
```bash
scrapling mcp
```
Use the following to enable 'Streamable HTTP' transport mode:
```bash
scrapling mcp --http
```
By default, the server listens on host '0.0.0.0' and port 8000; both can be configured.
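For example, binding the server to localhost on a different port might look like the following. Note that the `--host` and `--port` flag names are assumptions on our part; run `scrapling mcp --help` to confirm the exact option names:

```bash
# Hypothetical flag names -- verify with `scrapling mcp --help`
scrapling mcp --http --host 127.0.0.1 --port 9000
```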
Now we will show you some examples of prompts we used while testing the MCP server, but you are probably more creative and better at prompt engineering than we are :)