
v0.3

@github-actions github-actions released this 01 Sep 04:26
d5aaeb9

Scrapling v0.3.0 Release Notes

🎉 Major Release: Complete Architecture Overhaul

Scrapling v0.3 is the most significant update in the project's history. It features a complete architectural rewrite, considerable performance improvements, and powerful new features, including AI integration and an interactive web scraping shell.

This release includes multiple breaking changes; please review the release notes carefully.

🚀 Major New Features

Session-Based Architecture

  • New Session Classes: Complete rewrite introducing persistent session support
    • FetcherSession - HTTP requests with persistent state management that works with both sync and async code
    • DynamicSession/AsyncDynamicSession - Browser automation that keeps the browser open until you finish
    • StealthySession/AsyncStealthySession - Stealth browsing that keeps the browser open until you finish
  • Async Browser Tabs Management: A new tab pool, enabled through the max_pages argument, rotates browser tabs for concurrent browser fetches
  • Concurrent Sessions: Run multiple isolated sessions simultaneously

Refer to the Fetching section on the website for more details.
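
The tab pool idea can be sketched with the standard library alone. This is a toy model under stated assumptions, not Scrapling's implementation: TabPool, fetch, and the sleep that stands in for page navigation are all hypothetical, and only the max_pages name comes from the notes above.

```python
import asyncio


class TabPool:
    """Toy stand-in for a browser: at most `max_pages` tabs are busy at once."""

    def __init__(self, max_pages: int) -> None:
        self._slots = asyncio.Semaphore(max_pages)
        self.in_use = 0
        self.peak = 0  # highest number of tabs busy at the same time

    async def fetch(self, url: str) -> str:
        # Wait for a free tab, use it, then hand it back to the pool.
        async with self._slots:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
            try:
                await asyncio.sleep(0.01)  # stands in for page navigation
                return f"fetched {url}"
            finally:
                self.in_use -= 1


async def main() -> tuple[list[str], int]:
    pool = TabPool(max_pages=2)
    urls = [f"https://example.com/{i}" for i in range(5)]
    # Five concurrent fetches share two tabs; gather keeps the input order.
    results = await asyncio.gather(*(pool.fetch(u) for u in urls))
    return results, pool.peak


results, peak = asyncio.run(main())
```

Here five fetches are issued concurrently, but no more than two ever run at the same time. For the real session classes and their arguments, refer to the Fetching section on the website.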

New Stealth and Anti-Bot Capabilities

  • 🤖 Cloudflare Solver: Automatic Cloudflare Turnstile challenge solving in StealthyFetcher and its session classes
  • Browser fingerprint impersonation: Mimic real browsers' TLS fingerprints and version-matched browser headers, with HTTP/3 support and more, in the all-new Fetcher class
  • Improved stealth mode: The stealth mode for DynamicFetcher (formerly PlayWrightFetcher) and its session classes is now more robust and reliable

AI Integration & MCP Server

  • Built-in MCP Server: Model Context Protocol server for AI-assisted web scraping
  • 6 Powerful Tools: get, bulk_get, fetch, bulk_fetch, stealthy_fetch, bulk_stealthy_fetch
  • Smart Content Extraction: Convert web pages/elements to Markdown, HTML, or extract a clean version of the text content
  • CSS Selector Support: Use the Scrapling engine to target specific elements with precision before handing the content to the AI
  • Anti-Bot Bypass: Handle Cloudflare Turnstile and other protections
  • Proxy Support: Use proxies for anonymity and geo-targeting
  • Browser Impersonation: Mimic real browsers with TLS fingerprinting, version-matched browser headers, and more
  • Parallel Processing: Scrape multiple URLs concurrently for efficiency
  • and more...

New Interactive Web Scraping Shell

  • A New Shell: Custom IPython shell with many smart built-in shortcuts like get, post, put, delete, fetch, and stealthy_fetch
  • Smart Page Management: New commands page and pages automatically store the current page and the history of all requests made through the shell
  • Curl Integration: Convert curl commands copied from browser DevTools into Fetcher requests with the uncurl and curl2fetcher functions
  • and more...
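
To illustrate what converting a curl command involves, here is a minimal standard-library sketch. It is not how uncurl or curl2fetcher are implemented; parse_curl is a hypothetical helper that handles only a few flags (-X, -H, -d/--data/--data-raw) and assumes every other flag takes no value.

```python
import shlex


def parse_curl(command: str) -> dict:
    """Split a curl command into method, URL, headers, and body (small subset of flags)."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")

    request = {"method": None, "url": None, "headers": {}, "data": None}
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ("-X", "--request"):
            request["method"] = tokens[i + 1].upper()
            i += 2
        elif token in ("-H", "--header"):
            # "Name: value" -> {"Name": "value"}
            name, _, value = tokens[i + 1].partition(":")
            request["headers"][name.strip()] = value.strip()
            i += 2
        elif token in ("-d", "--data", "--data-raw"):
            request["data"] = tokens[i + 1]
            i += 2
        elif token.startswith("-"):
            i += 1  # unknown flag such as --compressed: assumed to take no value
        else:
            request["url"] = token
            i += 1

    # Like curl, default to POST when a body is present and GET otherwise.
    if request["method"] is None:
        request["method"] = "POST" if request["data"] is not None else "GET"
    return request


parsed = parse_curl(
    "curl 'https://example.com/api' -H 'Accept: application/json' --compressed --data-raw 'a=1'"
)
```

The resulting dictionary holds the pieces a request needs. In the shell itself, the built-in functions named above do this work for you.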

Scrape from the terminal without programming

  • New Extract Commands: Terminal-based scraping without programming
    • scrapling extract get/post/put/delete - Simple HTTP requests
    • scrapling extract fetch - Dynamic content scraping
    • scrapling extract stealthy-fetch - Anti-bot bypass
  • Downloads web pages and saves their content to files.
  • Converts HTML to readable formats like Markdown, keeps it as HTML, or just extracts the text content of the page.
  • Supports custom CSS selectors to extract specific parts of the page.
  • Handles HTTP requests and fetching through browsers.
  • Highly customizable with custom headers, cookies, proxies, and the rest of the options. Almost all the options available through the code are also accessible through the terminal.
  • and more...

🔧 Technical Improvements

Performance Enhancements

  • Fetcher is now 4 times faster - Yes, you read that right!
  • DynamicFetcher is now ~60% faster - A much faster version depending on your config (especially stealth mode)
  • StealthyFetcher is now 20–30% faster - Using the new structure, and starting to use our own implementation instead of Camoufox's Python interface
  • 50%+ combined speed gains across core selection methods (find_by_text, find_similar, find_by_regex, relocate, etc.) 🚀
  • ~10% speed increase in the CSS/XPath first-match methods - css_first and xpath_first are now faster than css and xpath
  • 40% faster get_all_text() method for content extraction
  • 20% speed improvement in adaptive element relocation
  • Navigation properties optimization - Properties like next, previous, below_elements, and more are now noticeably faster
  • 5x faster text cleaning operations
  • Memory efficiency improvements with optimized imports and reduced overhead
  • ⚑ Lightning-fast imports: Reduced startup time with optimized module loading
  • Better benchmarks: All the speed improvements Scrapling got made it much faster than before, compared to other libraries (1775x faster than BeautifulSoup and 5.1x faster than AutoScraper, check benchmarks)

Architecture, Code Quality, and Quality of Life

  • Persistent Context: All browser-based fetchers now use persistent context by default. (Solves #64 too)
  • Validation with msgspec: All browser-based fetchers are now validated with msgspec, which is very fast, before running requests, so errors are easier to debug.
  • All cookies returned from fetchers now match the format accepted by the same fetcher, so you can retrieve cookies and pass them again to all fetchers and their session classes.
  • Faster linting and formatting due to migrating to ruff
  • Modern Build System: Migrated from setup.py to pyproject.toml 📦
  • Better GitHub actions and workflows for smoother development and testing
  • 🎨 Enhanced Type Hints: Complete type coverage with modern Python standards for better IDE support and reliability
  • Cleaner Codebase: Removed dead code and optimized core functions 🧹
  • πŸš€ Backward Compatibility: Added shortcuts to maintain compatibility with older code

Breaking Changes

Minimum Python Version

  • Python 3.10+ Required: Dropped support for Python 3.9 and below

Class and Method Naming

These renamings are intended to improve clarity and consistency, particularly for new users.

  • Adaptor → Selector: Core parsing class renamed (can still be imported as Adaptor for backward compatibility)
  • Adaptors → Selectors: Collection class renamed (can still be imported as Adaptors for backward compatibility)
  • auto_match → adaptive: Parameter renamed across all methods
  • adaptor_arguments → selector_config: Configuration parameter renamed
  • automatch_domain → adaptive_domain: Domain parameter renamed
  • additional_arguments → additional_args: Shortened parameter name
  • ⚠️ text/body → content: The Selector constructor parameter now accepts both str and bytes
  • PlayWrightFetcher → DynamicFetcher: Browser automation class renamed (can still be imported as PlayWrightFetcher for backward compatibility)
  • DynamicFetcher no longer has the NSTBrowser logic/arguments, since there is no point in keeping that logic now.
  • StealthyFetcher's headless argument no longer accepts 'virtual', since Camoufox's library is now used only to get the browser installation path and the rest of the launch options
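
When updating older code, the parameter renames listed above can be applied mechanically. The helper below is a hypothetical migration aid, not part of Scrapling: migrate_kwargs and RENAMED_KWARGS are illustrative names, and the mapping covers only the four keyword renames from this list (the text/body → content change is specific to the Selector constructor and is left out).

```python
# Old keyword name -> new keyword name, taken from the renames listed above.
RENAMED_KWARGS = {
    "auto_match": "adaptive",
    "adaptor_arguments": "selector_config",
    "automatch_domain": "adaptive_domain",
    "additional_arguments": "additional_args",
}


def migrate_kwargs(kwargs: dict) -> dict:
    """Return a copy of `kwargs` with pre-v0.3 keyword names replaced by the new ones."""
    migrated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_KWARGS.get(name, name)
        if new_name in migrated:
            # Both the old and the new spelling were passed; refuse to guess.
            raise TypeError(f"{new_name!r} given more than once (old and new name)")
        migrated[new_name] = value
    return migrated


old_style = {"auto_match": True, "automatch_domain": "example.com", "timeout": 30}
new_style = migrate_kwargs(old_style)
```

Keywords that were not renamed, such as the illustrative timeout here, pass through unchanged.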

🐛 Bug Fixes

  • Fixed nested children counting in ignored tags for get_all_text (#61)
  • Fixed the issue with installation due to spaces in Python's executable path (#57)
  • Resolved threading issues in storage with recursion handling while the adaptive feature is enabled
  • Fixed argument precedence issues using the Sentinel pattern in FetcherSession
  • Resolved proxy type handling in StealthyFetcher
  • Fixed referer and google_search argument conflicts
  • Fixed async stealth script injection problems
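
For readers unfamiliar with the Sentinel pattern mentioned in the FetcherSession fix, here is a general sketch of the technique. It is not Scrapling's code: ToySession, resolve, and the timeout/proxy arguments are hypothetical. The point is that a unique sentinel object lets a per-request call distinguish "argument not passed" from an explicit None.

```python
_UNSET = object()  # sentinel: means "the caller did not pass this argument"


class ToySession:
    """Hypothetical session holding defaults that individual requests may override."""

    def __init__(self, timeout=30, proxy=None):
        self.defaults = {"timeout": timeout, "proxy": proxy}

    def resolve(self, timeout=_UNSET, proxy=_UNSET):
        # Per-request values win whenever they were actually passed,
        # even if the passed value is None or another falsy value.
        given = {"timeout": timeout, "proxy": proxy}
        return {
            name: (self.defaults[name] if value is _UNSET else value)
            for name, value in given.items()
        }


session = ToySession(timeout=10, proxy="http://proxy.example:8080")
inherited = session.resolve()             # falls back to the session defaults
overridden = session.resolve(proxy=None)  # explicit None disables the proxy
```

With a plain `proxy=None` default, the second call could not be told apart from the first, which is the kind of precedence bug this pattern avoids.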

🙏 Special thanks to our Discord community for all the continuous testing, feedback, and contributions over the last four months


Big shoutout to our biggest Sponsors