A powerful web scraping tool that combines visual element selection with automated data extraction. Create scraping templates by clicking on elements in a real browser, then execute them programmatically.
- Point-and-Click Interface: Select elements visually in a real browser
- Container Recognition: Smart detection of repeating patterns (product cards, lists, etc.)
- Sub-Element Selection: Click inside containers to define structured data extraction
- Real-time Preview: See exactly what will be scraped as you build templates
- High-Performance Scraping: Built on Scrapling for robust data extraction
- AutoMatch Technology: Adapts to website changes automatically
- Multi-Format Export: JSON, CSV, and Excel output
- Batch Processing: Process multiple URLs with the same template
- Repeating Patterns: Perfect for product listings, profiles, articles
- Visual Sub-Elements: Click on names, prices, links inside each container
- Automatic Scaling: Handles hundreds of similar items efficiently
- Smart Selectors: Generates robust CSS selectors that survive design changes
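Batch processing can be scripted around the CLI by re-pointing one template at several URLs. A minimal sketch, assuming templates store their target page in a top-level `url` field; the `run` switch and output naming here are illustrative, not CLI features:

```python
import json
import subprocess
import tempfile
from pathlib import Path

def batch_scrape(template_path: str, urls: list[str], run: bool = True) -> list[list[str]]:
    """Re-point one template at several URLs and invoke the scraper per URL."""
    template = json.loads(Path(template_path).read_text())
    commands = []
    for i, url in enumerate(urls):
        template["url"] = url  # same selectors, different page
        tmp = Path(tempfile.gettempdir()) / f"batch_{i}.json"
        tmp.write_text(json.dumps(template))
        cmd = ["python", "-m", "src.core.main", "scrape", str(tmp),
               "--format", "json", "--output", f"results/batch_{i}.json"]
        commands.append(cmd)
        if run:
            subprocess.run(cmd, check=True)
    return commands
```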
Clone and setup:
| Unix / Mac | Windows |
|---|---|
| `python3 -m venv venv && source venv/bin/activate` | `python -m venv venv && venv\Scripts\activate` |
| `pip install -r requirements.txt && playwright install chromium` | `pip install -r requirements.txt && playwright install chromium` |
```bash
# Start interactive browser session for template creation
python -m src.core.main interactive https://example.com --output my_template.json

# Run in headless mode (for servers)
python -m src.core.main interactive https://example.com --headless --output my_template.json
```

```bash
# Execute scraping with your template (JSON output)
python -m src.core.main scrape templates/my_template.json --format json

# Export to different formats
python -m src.core.main scrape templates/my_template.json --format csv
python -m src.core.main scrape templates/my_template.json --format excel

# Custom output location
python -m src.core.main scrape templates/my_template.json --output results/data.json
```

```bash
# List all available templates
python -m src.core.main list

# Show help for any command
python -m src.core.main --help
python -m src.core.main interactive --help
python -m src.core.main scrape --help
```

```bash
# Run tests
pytest tests/ -v
pytest tests/test_models.py -v
pytest tests/test_scrapling_runner.py -v

# Quick testing scripts
python quick_test.py
python test_corrected_template.py
python test_fixed_template.py
python test_working_template.py
```

```
Scraper_V2/
├── src/                                    # Main source code
│   ├── core/                               # Core scraping functionality
│   │   ├── __init__.py                     # Package initialization
│   │   ├── __main__.py                     # Module entry point
│   │   ├── cli.py                          # Command-line interface
│   │   ├── interactive_cli.py              # Interactive CLI utilities
│   │   ├── main.py                         # Session management & Playwright integration
│   │   ├── scrapling_runner_refactored.py  # NEW: Main orchestrator (300 lines)
│   │   ├── context.py                      # NEW: Shared state management
│   │   ├── utils/                          # NEW: Utility modules
│   │   │   └── progress.py                 # Progress tracking & ETA
│   │   ├── analyzers/                      # NEW: Analysis modules
│   │   │   └── template_analyzer.py        # Directory & pattern detection
│   │   ├── selectors/                      # NEW: Selector modules
│   │   │   └── selector_engine.py          # Smart selector mapping
│   │   ├── extractors/                     # NEW: Extraction modules
│   │   │   └── data_extractor.py           # Multi-strategy element finding
│   │   ├── handlers/                       # NEW: Handler modules
│   │   │   └── pagination_handler.py       # Pagination logic
│   │   └── processors/                     # NEW: Processing modules
│   │       └── subpage_processor.py        # Subpage processing
│   ├── interactive/                        # Browser-based interactive system
│   │   ├── __init__.py                     # Package initialization
│   │   ├── index.js                        # Main entry point & orchestration
│   │   ├── core/                           # Core interactive functionality
│   │   │   ├── config.js                   # Configuration constants
│   │   │   ├── error-handler.js            # Error handling utilities
│   │   │   ├── event-manager.js            # Event delegation & handling
│   │   │   └── state-manager.js            # Centralized state management
│   │   ├── navigation/                     # Navigation & session management
│   │   │   └── state-persistence.js        # Save/restore session state
│   │   ├── selectors/                      # CSS selector generation
│   │   │   └── selector-generator.js       # Smart CSS selector creation
│   │   ├── tools/                          # Interactive selection tools
│   │   │   ├── base-tool.js                # Base tool interface
│   │   │   ├── element-tool.js             # Element selection functionality
│   │   │   ├── action-tool.js              # Action selection & handling
│   │   │   ├── container-tool.js           # Container selection logic (power feature)
│   │   │   └── scroll-tool.js              # Scroll/pagination handling
│   │   ├── ui/                             # User interface components
│   │   │   ├── control-panel.js            # Main control panel UI
│   │   │   ├── modal-manager.js            # Modal dialogs & prompts
│   │   │   ├── status-manager.js           # Status updates & feedback
│   │   │   └── styles.js                   # CSS injection & styling
│   │   └── utils/                          # Utility functions
│   │       ├── dom-utils.js                # DOM manipulation helpers
│   │       ├── python-bridge.js            # Python callback interface
│   │       └── template-builder.js         # Template generation logic
│   └── models/                             # Data models & validation
│       ├── __init__.py                     # Package initialization
│       └── scraping_template.py            # Pydantic models for templates
├── templates/                              # Generated JSON templates
│   └── template.json                       # Example/current template
├── output/                                 # Scraped data files (JSON/CSV/Excel)
│   ├── gibsondunn.com_*_*.json             # Sample output files
│   └── ...                                 # Additional scraped data
├── examples/                               # Example scripts & demonstrations
│   └── gibson_dunn_demo.py                 # Gibson Dunn scraping demo
├── tests/                                  # Test suite
│   ├── __init__.py                         # Package initialization
│   ├── test_models.py                      # Template model tests
│   └── test_scrapling_runner.py            # Scraping engine tests
├── requirements.txt                        # Python dependencies
├── CLAUDE.md                               # Claude Code guidance & project rules
├── README.md                               # This file - project documentation
├── CHECKLIST.md                            # Development checklist & roadmap
├── quick_test.py                           # Quick testing script
├── test_*.py                               # Additional test scripts
└── venv/                                   # Virtual environment (gitignored)
```
When you run the interactive command, it opens a browser with an overlay panel where you can:
- Select Containers: Click the "Container" tool and select repeating elements (product cards, profiles, etc.)
- Add Sub-Elements: Click inside containers to select specific data (names, prices, links)
- Add Actions: Click buttons, links, or scrollable areas for navigation
- Save Template: Generate a reusable JSON template automatically
```bash
# 1. Start interactive session
python -m src.core.main interactive https://www.gibsondunn.com/people/ --output law_firm.json

# 2. In browser: Click "Container" → Click on any lawyer profile card
# 3. In browser: Click inside containers to select name, title, email, profile link
# 4. In browser: Save template

# 5. Run automated scraping
python -m src.core.main scrape templates/law_firm.json --format excel --output lawyers.xlsx
```

The scraper operates in two distinct phases:
- Interactive Phase: Visual template creation through browser overlay with Playwright
- Automated Phase: Template execution with Scrapling engine
- ScraplingRunner Refactored (300 lines): Main orchestrator using composition
- TemplateAnalyzer: Smart detection of directory pages vs individual pages
- DataExtractor: Multi-strategy element finding with fallback mechanisms
- PaginationHandler: Infinite scroll, load-more buttons, URL-based pagination
- SubpageProcessor: Navigate to individual profile pages for detailed data
- SelectorEngine: Automatic mapping of generic selectors to robust ones
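The composition described above can be sketched as follows; the class and method bodies here are illustrative stand-ins, not the project's actual implementation:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the orchestrator-by-composition layout: a thin
# runner delegates each concern (analysis, extraction, ...) to its own module.

@dataclass
class ScrapingContext:
    """Shared state passed between modules (cf. context.py)."""
    url: str
    results: list = field(default_factory=list)

class TemplateAnalyzer:
    def is_directory_page(self, ctx: ScrapingContext) -> bool:
        # The real analyzer inspects the template and page; stub for the sketch.
        return ctx.url.rstrip("/").endswith("people")

class DataExtractor:
    def extract(self, ctx: ScrapingContext) -> None:
        # The real extractor tries multiple selector strategies per element.
        ctx.results.append({"source": ctx.url})

class ScraplingRunner:
    """Thin orchestrator: owns no scraping logic, only the control flow."""
    def __init__(self, ctx: ScrapingContext):
        self.ctx = ctx
        self.analyzer = TemplateAnalyzer()
        self.extractor = DataExtractor()

    def run(self) -> list:
        if self.analyzer.is_directory_page(self.ctx):
            self.extractor.extract(self.ctx)
        return self.ctx.results
```

Keeping the runner to control flow only is what allows each collaborator to stay small and independently testable.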
```bash
# Law firms, consulting companies, real estate agents
python -m src.core.main interactive https://firm.com/people/
# Extract: Names, titles, practice areas, contact info, bios
```

```bash
# Product catalogs, marketplace listings
python -m src.core.main interactive https://shop.com/products
# Extract: Product names, prices, descriptions, images, ratings
```

```bash
# News sites, blogs, article directories
python -m src.core.main interactive https://news.com/articles
# Extract: Headlines, authors, dates, summaries, full articles
```

```bash
# Property listings, rental sites
python -m src.core.main interactive https://realestate.com/listings
# Extract: Addresses, prices, features, photos, agent info
```

Templates are JSON files that define what to scrape:
```json
{
  "name": "template_name",
  "url": "https://example.com",
  "elements": [
    {
      "label": "products",
      "selector": ".product-card",
      "is_container": true,
      "is_multiple": true,
      "sub_elements": [
        {"label": "name", "selector": "h3", "element_type": "text"},
        {"label": "price", "selector": ".price", "element_type": "text"},
        {"label": "link", "selector": "a", "element_type": "link"}
      ]
    }
  ],
  "actions": [
    {"label": "load_more", "selector": ".load-more-btn", "action_type": "click"}
  ]
}
```

- Directory Detection: Automatically recognizes listing/directory pages
- Pagination Patterns: Detects infinite scroll, load-more buttons, URL pagination
- Smart Selector Enhancement: Upgrades generic selectors for better reliability
- Template Auto-Fixing: Corrects common template configuration issues
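Before running a template, its shape can be sanity-checked. A stdlib-only sketch mirroring the template format shown above (the project itself validates with the Pydantic models in `scraping_template.py`):

```python
import json

REQUIRED_ELEMENT_KEYS = {"label", "selector"}

def validate_template(raw: str) -> dict:
    """Light sanity check of a template's JSON against the shape shown above."""
    template = json.loads(raw)
    for key in ("name", "url", "elements"):
        if key not in template:
            raise ValueError(f"template missing required key: {key}")
    for element in template["elements"]:
        missing = REQUIRED_ELEMENT_KEYS - element.keys()
        if missing:
            raise ValueError(f"element missing keys: {missing}")
        for sub in element.get("sub_elements", []):
            if "selector" not in sub:
                raise ValueError(f"sub-element missing selector: {sub}")
    return template
```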
- JSON: Structured data with metadata for programmatic use
- CSV: Tabular format perfect for spreadsheet analysis
- Excel: Multi-sheet workbooks with data + extraction metadata
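Records can also be converted between formats after a run. A sketch assuming the JSON output holds a flat list of record objects:

```python
import csv
import json
from pathlib import Path

def json_to_csv(json_path: str, csv_path: str) -> int:
    """Flatten a list of scraped records into a CSV whose columns are the
    union of all record keys, in first-seen order."""
    records = json.loads(Path(json_path).read_text())
    fieldnames: list[str] = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

Taking the union of keys matters because scraped records are often ragged: a missing element in one container simply leaves that cell blank.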
- AutoMatch Technology: Adapts to website changes automatically
- Fallback Strategies: Multiple selector approaches for maximum reliability
- Error Recovery: Graceful handling of missing elements
- Session Persistence: Maintains browser state across navigations
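The fallback idea reduces to: try selectors from most to least specific and keep the first hit. A generic sketch, where `page_query` stands in for the engine's actual query method:

```python
def first_match(page_query, selectors: list[str]) -> list:
    """Try each selector in turn; return results from the first that matches.

    `page_query` is any callable mapping a CSS selector to a list of nodes,
    e.g. a thin wrapper around the scraping engine's query method.
    """
    for selector in selectors:
        found = page_query(selector)
        if found:
            return found
    return []
```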
```bash
# Browser fails to start
playwright install chromium --force

# Permission issues on Linux/Mac
sudo playwright install-deps chromium
```

- Elements not found: Check selectors in browser dev tools
- Empty results: Verify the `is_multiple` flag matches the expected element count
- Timeout errors: Increase the wait timeout in the template or add explicit waits
- Memory issues: Process smaller batches, clear browser cache between runs
- Check `scraper.log` for detailed error logs
- Examine template JSON structure in the `templates/` directory
- Use browser dev tools to verify selectors work correctly
- Test templates with small datasets before running full extraction
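Pulling failures out of the log can be scripted; this sketch assumes `scraper.log` uses conventional level names (`ERROR`, `WARNING`) in its lines:

```python
from pathlib import Path

def failed_lines(log_path: str,
                 levels: tuple[str, ...] = ("ERROR", "WARNING")) -> list[str]:
    """Return log lines that mention any of the given level names."""
    lines = Path(log_path).read_text(encoding="utf-8").splitlines()
    return [line for line in lines if any(level in line for level in levels)]
```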
| Metric | Before | After | Improvement |
|---|---|---|---|
| File Size | 5,213 lines | ~300 lines main | 94% reduction |
| Method Count | 102 methods | 10-15 per class | 85% reduction |
| Complexity | Extremely high | Low per module | 97% reduction |
| Maintainability | Impossible | Easy | ♾️ improvement |
| Testability | Very difficult | Simple | ♾️ improvement |
| Developer Experience | Frustrating | Delightful | ♾️ improvement |
| Task | Command |
|---|---|
| Create Template | `python -m src.core.main interactive https://example.com --output template.json` |
| Run Scraping | `python -m src.core.main scrape templates/template.json` |
| Export to Excel | `python -m src.core.main scrape templates/template.json --format excel` |
| Export to CSV | `python -m src.core.main scrape templates/template.json --format csv` |
| List Templates | `python -m src.core.main list` |
| Get Help | `python -m src.core.main --help` |