
Commit 8ad7cd6

docs: Update all pages/docstrings to reflect recent changes
1 parent 0cf3bcf commit 8ad7cd6

10 files changed: +131 -86 lines changed


README.md

Lines changed: 12 additions & 5 deletions
@@ -49,7 +49,7 @@
 Scrapling isn't just another Web Scraping library. It's the first **adaptive** scraping library that learns from website changes and evolves with them. While other libraries break when websites update their structure, Scrapling automatically relocates your elements and keeps your scrapers running.

-Built for the modern Web, Scrapling has its own rapid parsing engine and its fetchers to handle all Web Scraping challenges you are facing or will face. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.
+Built for the modern Web, Scrapling features its own rapid parsing engine and fetchers to handle all Web Scraping challenges you face or will face. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.

 ```python
 >> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
@@ -87,7 +87,7 @@ Built for the modern Web, Scrapling has its own rapid parsing engine and its fet
 ### Advanced Websites Fetching with Session Support
 - **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprint, headers, and use HTTP3.
 - **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class supporting Playwright's Chromium, real Chrome, and custom stealth mode.
-- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` using a modified version of Firefox and fingerprint spoofing. Can bypass all levels of Cloudflare's Turnstile with automation easily.
+- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` using a modified version of Firefox and fingerprint spoofing. Can bypass all types of Cloudflare's Turnstile and Interstitial with automation easily.
 - **Session Management**: Persistent session support with `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
 - **Async Support**: Complete async support across all fetchers and dedicated async session classes.
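To make the fetcher list above concrete, here is a minimal sketch of how these classes are typically used. The method names follow the examples in Scrapling's own README; the exact parameters and return values are assumptions, so verify them against the full documentation.

```python
# Minimal sketch, not part of this commit: method names follow Scrapling's
# README examples; exact parameters are assumptions.
from scrapling.fetchers import Fetcher, StealthyFetcher

# Fast, stealthy HTTP request
page = Fetcher.get('https://example.com')
print(page.status)
titles = page.css('h1::text')  # CSS selection on the parsed response

# Full stealth browser automation for protected pages
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
```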
@@ -235,11 +235,11 @@ scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.
 ```

 > [!NOTE]
-> There are many additional features, but we want to keep this page short, like the MCP server and the interactive Web Scraping Shell. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/)
+> There are many additional features, such as the MCP server and the interactive Web Scraping Shell, but we want to keep this page concise. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/)

 ## Performance Benchmarks

-Scrapling isn't just powerful—it's also blazing fast, and the updates since version 0.3 deliver exceptional performance improvements across all operations!
+Scrapling isn't just powerful—it's also blazing fast, and the updates since version 0.3 have delivered exceptional performance improvements across all operations.

 ### Text Extraction Speed Test (5000 nested elements)
@@ -302,14 +302,21 @@ Starting with v0.3.2, this installation only includes the parser engine and its
 ```
 Don't forget that you need to install the browser dependencies with `scrapling install` after any of these extras (if you didn't already)

+### Docker
+You can also get a Docker image with all extras and browsers installed via the following command:
+```bash
+docker pull scrapling
+```
+This image is automatically built and pushed to Docker Hub through GitHub Actions.
+
 ## Contributing

 We welcome contributions! Please read our [contributing guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.

 ## Disclaimer

 > [!CAUTION]
-> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect website terms of service and robots.txt files.
+> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files.

 ## License
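As a usage note for the Docker section added above: based on the commands that appear later in this commit, the image passes its arguments straight through to the Scrapling CLI, so a typical run looks like the sketch below. The image name `scrapling` is taken verbatim from the diff; verify the full Docker Hub path before pulling.

```bash
# Sketch based on commands shown elsewhere in this commit's docs.
# Run the MCP server from the image:
docker run -i --rm scrapling mcp

# Run the extract CLI, mounting a local directory to collect the output:
docker run -v $(pwd)/output:/output scrapling extract get "https://example.com" /output/page.md
```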
docs/ai/mcp-server.md

Lines changed: 38 additions & 3 deletions
@@ -17,20 +17,20 @@ The Scrapling MCP Server provides six powerful tools for web scraping:
 - **`bulk_fetch`**: An async version of the above tool that allows scraping of multiple URLs in different browser tabs at the same time!

 ### 🔒 Stealth Scraping
-- **`stealthy_fetch`**: Uses our modified version of the Camoufox browser to bypass Cloudflare Turnstile and other anti-bot systems with complete control over the request/browser!
+- **`stealthy_fetch`**: Uses our modified version of the Camoufox browser to bypass Cloudflare Turnstile/Interstitial and other anti-bot systems with complete control over the request/browser!
 - **`bulk_stealthy_fetch`**: An async version of the above tool that allows stealth scraping of multiple URLs in different browser tabs at the same time!

 ### Key Capabilities
 - **Smart Content Extraction**: Convert web pages/elements to Markdown, HTML, or extract a clean version of the text content
 - **CSS Selector Support**: Use the Scrapling engine to target specific elements with precision before handing the content to the AI
-- **Anti-Bot Bypass**: Handle Cloudflare Turnstile and other protections
+- **Anti-Bot Bypass**: Handle Cloudflare Turnstile, Interstitial, and other protections
 - **Proxy Support**: Use proxies for anonymity and geo-targeting
 - **Browser Impersonation**: Mimic real browsers with TLS fingerprinting, real browser headers matching that version, and more
 - **Parallel Processing**: Scrape multiple URLs concurrently for efficiency

 #### But why use Scrapling MCP Server instead of other available tools?

-Aside from its stealth capabilities and ability to bypass Cloudflare Turnstile, Scrapling's server is the only one that allows you to pass a CSS selector in the prompt to extract specific elements before handing the content to the AI.
+Aside from its stealth capabilities and ability to bypass Cloudflare Turnstile/Interstitial, Scrapling's server is the only one that allows you to pass a CSS selector in the prompt to extract specific elements before handing the content to the AI.

 The way other servers work is that they extract the content, then pass it all to the AI to extract the fields you want. This causes the AI to consume far more tokens than needed (on irrelevant content). Scrapling solves this problem by allowing you to pass a CSS selector to narrow down the content you want before passing it to the AI, which makes the whole process much faster and more efficient.
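To illustrate the selector-narrowing flow described above, a hypothetical prompt might look like this (the URL and selector are invented for the example):

```text
Use stealthy_fetch on https://example.com/products with the CSS selector
".product-card" and return each product's name and price as a Markdown table.
```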
@@ -48,6 +48,11 @@ pip install "scrapling[ai]"
 scrapling install
 ```

+Or use the Docker image directly:
+```bash
+docker pull scrapling
+```
+
 ## Setting up the MCP Server

 Here we will explain how to add the Scrapling MCP Server to [Claude Desktop](https://claude.ai/download) and [Claude Code](https://www.anthropic.com/claude-code), but the same logic applies to any other chatbot that supports MCP:
@@ -101,6 +106,20 @@ For me, on my Mac, it returned `/Users/<MyUsername>/.venv/bin/scrapling`, so the
   }
 }
 ```
+#### Docker
+If you are using the Docker image, then the configuration would look something like this:
+```json
+{
+  "mcpServers": {
+    "ScraplingServer": {
+      "command": "docker",
+      "args": [
+        "run", "-i", "--rm", "scrapling", "mcp"
+      ]
+    }
+  }
+}
+```

 The same logic applies to [Cursor](https://docs.cursor.com/en/context/mcp), [WindSurf](https://windsurf.com/university/tutorials/configuring-first-mcp-server), and others.
@@ -120,6 +139,22 @@ Here's the main article from Anthropic on [how to add MCP servers to Claude code

 Then, after you've added the server, you need to completely quit and restart the app you used above. In Claude Desktop, you should see an MCP server indicator (🔧) in the bottom-right corner of the chat input or see `ScraplingServer` in the `Search and tools` dropdown in the chat input box.

+### Streamable HTTP
+As of version 0.3.6, the MCP server can use the 'Streamable HTTP' transport mode instead of the traditional 'stdio' transport.
+
+So instead of using the following command (the 'stdio' one):
+```bash
+scrapling mcp
+```
+Use the following to enable the 'Streamable HTTP' transport mode:
+```bash
+scrapling mcp --http
+```
+By default, the server listens on host '0.0.0.0' and port 8000; both can be configured as follows:
+```bash
+scrapling mcp --http --host '127.0.0.1' --port 8000
+```
+
 ## Examples

 Now we will show you some examples of prompts we used while testing the MCP server, but you are probably more creative and better at prompt engineering than we are :)
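For MCP clients that accept URL-based server entries, a hypothetical configuration for the Streamable HTTP mode added above might look like the following. The `/mcp` endpoint path is an assumption (a common default for Streamable HTTP servers) and is not stated in this commit, so check the server's startup output:

```json
{
  "mcpServers": {
    "ScraplingServer": {
      "url": "http://127.0.0.1:8000/mcp"
    }
  }
}
```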

docs/cli/extract-commands.md

Lines changed: 4 additions & 1 deletion
@@ -36,6 +36,9 @@ The extract command is a set of simple terminal tools that:

 # Save a clean version of the text content of the webpage to the file
 scrapling extract get "https://example.com" content.txt
+
+# Or use the Docker image with something like this:
+docker run -v $(pwd)/output:/output scrapling extract get "https://blog.example.com" /output/article.md
 ```

 - **Extract Specific Content**
@@ -345,4 +348,4 @@ If you are not a Web Scraping expert and can't decide what to choose, you can us

 ---

-*Happy scraping! Remember to always respect website policies and comply with all applicable legal requirements.*
+*Happy scraping! Remember to always respect website policies and comply with all applicable laws and regulations.*
