
Commit e8a2dfe

Merge branch 'master' into docs/bump-api-plugin

2 parents 0eeeb93 + 9fd9a41

File tree

99 files changed: +9517, -2848 lines changed

.github/workflows/pre_release.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -37,7 +37,7 @@ jobs:
     name: Wait for code checks to pass
     runs-on: ubuntu-latest
     steps:
-      - uses: lewagon/[email protected].0
+      - uses: lewagon/[email protected].1
         with:
           ref: ${{ github.ref }}
           repo-token: ${{ secrets.GITHUB_TOKEN }}
```

.github/workflows/update_new_issue.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -14,7 +14,7 @@ jobs:

     steps:
       # Add the "t-tooling" label to all new issues
-      - uses: actions/github-script@v7
+      - uses: actions/github-script@v8
         with:
          script: |
            github.rest.issues.addLabels({
```

CHANGELOG.md

Lines changed: 16 additions & 2 deletions

```diff
@@ -3,7 +3,16 @@
 All notable changes to this project will be documented in this file.

 <!-- git-cliff-unreleased-start -->
-## 0.6.13 - **not yet released**
+## 1.0.1 - **not yet released**
+
+### 🐛 Bug Fixes
+
+- Fix memory leak in `PlaywrightCrawler` on browser context creation ([#1446](https://github.com/apify/crawlee-python/pull/1446)) ([bb181e5](https://github.com/apify/crawlee-python/commit/bb181e58d8070fba38e62d6e57fe981a00e5f035)) by [@Pijukatel](https://github.com/Pijukatel), closes [#1443](https://github.com/apify/crawlee-python/issues/1443)
+- Update templates to handle optional httpx client ([#1440](https://github.com/apify/crawlee-python/pull/1440)) ([c087efd](https://github.com/apify/crawlee-python/commit/c087efd39baedf46ca3e5cae1ddc1acd6396e6c1)) by [@Pijukatel](https://github.com/Pijukatel)
+
+
+<!-- git-cliff-unreleased-end -->
+## [1.0.0](https://github.com/apify/crawlee-python/releases/tag/v1.0.0) (2025-09-29)

 ### 🚀 Features

@@ -19,6 +28,9 @@ All notable changes to this project will be documented in this file.
 - Persist RequestList state ([#1274](https://github.com/apify/crawlee-python/pull/1274)) ([cc68014](https://github.com/apify/crawlee-python/commit/cc680147ba3cc8b35b9da70274e53e6f5dd92434)) by [@janbuchar](https://github.com/janbuchar), closes [#99](https://github.com/apify/crawlee-python/issues/99)
 - Persist `DefaultRenderingTypePredictor` state ([#1340](https://github.com/apify/crawlee-python/pull/1340)) ([fad4c25](https://github.com/apify/crawlee-python/commit/fad4c25fc712915c4a45b24e3290b6f5dbd8a683)) by [@Mantisus](https://github.com/Mantisus), closes [#1272](https://github.com/apify/crawlee-python/issues/1272)
 - Persist the `SitemapRequestLoader` state ([#1347](https://github.com/apify/crawlee-python/pull/1347)) ([27ef9ad](https://github.com/apify/crawlee-python/commit/27ef9ad194552ea9f1321d91a7a52054be9a8a51)) by [@Mantisus](https://github.com/Mantisus), closes [#1269](https://github.com/apify/crawlee-python/issues/1269)
+- Add support for NDU storages ([#1401](https://github.com/apify/crawlee-python/pull/1401)) ([5dbd212](https://github.com/apify/crawlee-python/commit/5dbd212663e7abc37535713f4c6e3a5bbf30a12e)) by [@vdusek](https://github.com/vdusek), closes [#1175](https://github.com/apify/crawlee-python/issues/1175)
+- Add RQ id, name, alias args to `add_requests` and `enqueue_links` methods ([#1413](https://github.com/apify/crawlee-python/pull/1413)) ([1cae2bc](https://github.com/apify/crawlee-python/commit/1cae2bca0b1508fcb3cb419dc239caf33e20a7ef)) by [@Mantisus](https://github.com/Mantisus), closes [#1402](https://github.com/apify/crawlee-python/issues/1402)
+- Add `SqlStorageClient` based on `sqlalchemy` v2+ ([#1339](https://github.com/apify/crawlee-python/pull/1339)) ([07c75a0](https://github.com/apify/crawlee-python/commit/07c75a078b443b58bfaaeb72eb2aa1439458dc47)) by [@Mantisus](https://github.com/Mantisus), closes [#307](https://github.com/apify/crawlee-python/issues/307)

 ### 🐛 Bug Fixes

@@ -30,6 +42,8 @@ All notable changes to this project will be documented in this file.
 - Include reason in the session rotation warning logs ([#1363](https://github.com/apify/crawlee-python/pull/1363)) ([d6d7a45](https://github.com/apify/crawlee-python/commit/d6d7a45dd64a906419d9552c45062d726cbb1a0f)) by [@vdusek](https://github.com/vdusek), closes [#1318](https://github.com/apify/crawlee-python/issues/1318)
 - Improve crawler statistics logging ([#1364](https://github.com/apify/crawlee-python/pull/1364)) ([1eb6da5](https://github.com/apify/crawlee-python/commit/1eb6da5dd85870124593dcad877284ccaed9c0ce)) by [@vdusek](https://github.com/vdusek), closes [#1317](https://github.com/apify/crawlee-python/issues/1317)
 - Do not add a request that is already in progress to `MemoryRequestQueueClient` ([#1384](https://github.com/apify/crawlee-python/pull/1384)) ([3af326c](https://github.com/apify/crawlee-python/commit/3af326c9dfa8fffd56a42ca42981374613739e39)) by [@Mantisus](https://github.com/Mantisus), closes [#1383](https://github.com/apify/crawlee-python/issues/1383)
+- Save `RequestQueueState` for `FileSystemRequestQueueClient` in default KVS ([#1411](https://github.com/apify/crawlee-python/pull/1411)) ([6ee60a0](https://github.com/apify/crawlee-python/commit/6ee60a08ac1f9414e1b792f4935cc3799cb5089a)) by [@Mantisus](https://github.com/Mantisus), closes [#1410](https://github.com/apify/crawlee-python/issues/1410)
+- Set default desired concurrency for non-browser crawlers to 10 ([#1419](https://github.com/apify/crawlee-python/pull/1419)) ([1cc9401](https://github.com/apify/crawlee-python/commit/1cc940197600d2539bda967880d7f9d241eb8c3e)) by [@vdusek](https://github.com/vdusek)

 ### Refactor

@@ -39,9 +53,9 @@ All notable changes to this project will be documented in this file.
 - [**breaking**] Replace `HttpxHttpClient` with `ImpitHttpClient` as default HTTP client ([#1307](https://github.com/apify/crawlee-python/pull/1307)) ([c803a97](https://github.com/apify/crawlee-python/commit/c803a976776a76846866d533e3a3ee8144e248c4)) by [@Mantisus](https://github.com/Mantisus), closes [#1079](https://github.com/apify/crawlee-python/issues/1079)
 - [**breaking**] Change Dataset unwind parameter to accept list of strings ([#1357](https://github.com/apify/crawlee-python/pull/1357)) ([862a203](https://github.com/apify/crawlee-python/commit/862a20398f00fe91802fe7a1ccd58b05aee053a1)) by [@vdusek](https://github.com/vdusek)
 - [**breaking**] Remove `Request.id` field ([#1366](https://github.com/apify/crawlee-python/pull/1366)) ([32f3580](https://github.com/apify/crawlee-python/commit/32f3580e9775a871924ab1233085d0c549c4cd52)) by [@Pijukatel](https://github.com/Pijukatel), closes [#1358](https://github.com/apify/crawlee-python/issues/1358)
+- [**breaking**] Refactor storage creation and caching, configuration and services ([#1386](https://github.com/apify/crawlee-python/pull/1386)) ([04649bd](https://github.com/apify/crawlee-python/commit/04649bde60d46b2bc18ae4f6e3fd9667d02a9cef)) by [@Pijukatel](https://github.com/Pijukatel), closes [#1379](https://github.com/apify/crawlee-python/issues/1379)


-<!-- git-cliff-unreleased-end -->

 ## [0.6.12](https://github.com/apify/crawlee-python/releases/tag/v0.6.12) (2025-07-30)

```
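One of the breaking 1.0.0 changes listed above replaces `HttpxHttpClient` with `ImpitHttpClient` as the default HTTP client, so projects that rely on httpx behavior must now opt in explicitly. A minimal sketch of that opt-in, assuming the `http_client` parameter and the `crawlee.http_clients` import path from the 1.0 docs:

```python
from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import HttpxHttpClient


async def main() -> None:
    # Since 1.0, crawlers default to ImpitHttpClient; pass an
    # HttpxHttpClient instance explicitly to keep the old behavior.
    crawler = ParselCrawler(http_client=HttpxHttpClient())
```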

README.md

Lines changed: 0 additions & 2 deletions

```diff
@@ -30,8 +30,6 @@

 Crawlee covers your crawling and scraping end-to-end and **helps you build reliable scrapers. Fast.**

-> 🚀 Crawlee for Python is open to early adopters!
-
 Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.

 > 👉 **View full documentation, guides and examples on the [Crawlee project website](https://crawlee.dev/python/)** 👈
```

docs/guides/code_examples/service_locator/service_storage_configuration.py

Lines changed: 11 additions & 3 deletions

```diff
@@ -1,7 +1,9 @@
 import asyncio
 from datetime import timedelta

+from crawlee import service_locator
 from crawlee.configuration import Configuration
+from crawlee.storage_clients import MemoryStorageClient
 from crawlee.storages import Dataset


@@ -11,10 +13,16 @@ async def main() -> None:
         headless=False,
         persist_state_interval=timedelta(seconds=30),
     )
+    # Set the custom configuration as the global default configuration.
+    service_locator.set_configuration(configuration)

-    # Pass the configuration to the dataset (or other storage) when opening it.
-    dataset = await Dataset.open(
-        configuration=configuration,
+    # Use the global defaults when creating the dataset (or other storage).
+    dataset_1 = await Dataset.open()
+
+    # Or pass a specific configuration explicitly if
+    # you do not want to rely on global defaults.
+    dataset_2 = await Dataset.open(
+        storage_client=MemoryStorageClient(), configuration=configuration
     )


```
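For context, a runnable variant of the pattern this diff introduces — register a global configuration once via `service_locator`, then open storages without passing it around. The pushed payload and the entry point are illustrative additions, not part of the diff:

```python
import asyncio

from crawlee import service_locator
from crawlee.configuration import Configuration
from crawlee.storages import Dataset


async def main() -> None:
    # Register a global default configuration once, early in the program.
    service_locator.set_configuration(Configuration(purge_on_start=False))

    # Every storage opened afterwards picks up the global defaults.
    dataset = await Dataset.open()
    await dataset.push_data({'status': 'ok'})


if __name__ == '__main__':
    asyncio.run(main())
```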

Lines changed: 12 additions & 0 deletions

```diff
@@ -0,0 +1,12 @@
+from crawlee.crawlers import ParselCrawler
+from crawlee.storage_clients import SqlStorageClient
+
+
+async def main() -> None:
+    # Create a new instance of the storage client.
+    # This will create an SQLite database file crawlee.db, or create tables
+    # in your database if you pass `connection_string` or `engine`.
+    # Use the context manager to ensure that connections are properly cleaned up.
+    async with SqlStorageClient() as storage_client:
+        # And pass it to the crawler.
+        crawler = ParselCrawler(storage_client=storage_client)
```
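The comment above names `connection_string` as an alternative to the default SQLite file. A minimal sketch of that path, assuming the standard SQLAlchemy async URL format; the database file name and crawl URL are illustrative:

```python
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import SqlStorageClient


async def main() -> None:
    # Point the client at a specific SQLite file instead of the default
    # crawlee.db; the aiosqlite URL is SQLAlchemy's async SQLite format.
    async with SqlStorageClient(
        connection_string='sqlite+aiosqlite:///my_project.db'
    ) as storage_client:
        crawler = ParselCrawler(storage_client=storage_client)
        await crawler.run(['https://crawlee.dev'])
```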
Lines changed: 33 additions & 0 deletions

```diff
@@ -0,0 +1,33 @@
+from sqlalchemy.ext.asyncio import create_async_engine
+
+from crawlee.configuration import Configuration
+from crawlee.crawlers import ParselCrawler
+from crawlee.storage_clients import SqlStorageClient
+
+
+async def main() -> None:
+    # Create a new instance of the storage client.
+    # On first run, also creates tables in your PostgreSQL database.
+    # Use the context manager to ensure that connections are properly cleaned up.
+    async with SqlStorageClient(
+        # Create an `engine` with the desired configuration.
+        engine=create_async_engine(
+            'postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres',
+            future=True,
+            pool_size=5,
+            max_overflow=10,
+            pool_recycle=3600,
+            pool_pre_ping=True,
+            echo=False,
+        )
+    ) as storage_client:
+        # Create a configuration with custom settings.
+        configuration = Configuration(
+            purge_on_start=False,
+        )
+
+        # And pass them to the crawler.
+        crawler = ParselCrawler(
+            storage_client=storage_client,
+            configuration=configuration,
+        )
```
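To exercise this example end to end, a handler can be attached before running. The handler below is a hypothetical addition, not part of the diff; `@crawler.router.default_handler`, `context.push_data`, and the parsel-backed `context.selector` are standard Crawlee idioms, and the crawl URL is illustrative:

```python
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def run_crawl(crawler: ParselCrawler) -> None:
    # A minimal handler: extract the page title and store it in the
    # dataset, which now lives in the SQL database configured above.
    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        await context.push_data({
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        })

    await crawler.run(['https://crawlee.dev'])
```

Setting `pool_pre_ping=True` and a `pool_recycle` interval is a common SQLAlchemy choice for long-running crawls, since it guards against stale pooled connections being handed to the crawler mid-run.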
Lines changed: 19 additions & 0 deletions

```diff
@@ -0,0 +1,19 @@
+import asyncio
+
+from crawlee.storages import Dataset
+
+
+async def main() -> None:
+    # Named storage (persists across runs)
+    dataset_named = await Dataset.open(name='my-persistent-dataset')
+
+    # Unnamed storage with alias (purged on start)
+    dataset_unnamed = await Dataset.open(alias='temporary-results')
+
+    # Default unnamed storage (both are equivalent and purged on start)
+    dataset_default = await Dataset.open()
+    dataset_default = await Dataset.open(alias='default')
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
```
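The same name/alias distinction should carry over to the other storage types. A short sketch under the assumption that `KeyValueStore` and `RequestQueue` accept the same `name`/`alias` arguments as `Dataset` (the storage names here are illustrative):

```python
import asyncio

from crawlee.storages import KeyValueStore, RequestQueue


async def main() -> None:
    # A named key-value store persists across runs, like a named dataset.
    kvs = await KeyValueStore.open(name='my-persistent-kvs')
    await kvs.set_value('last-run', 'ok')

    # An aliased request queue is unnamed and purged on start.
    rq = await RequestQueue.open(alias='crawl-frontier')


if __name__ == '__main__':
    asyncio.run(main())
```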
