πŸ•·οΈ Interactive Web Scraper v2.0

A powerful web scraping tool that combines visual element selection with automated data extraction. Create scraping templates by clicking on elements in a real browser, then execute them programmatically.

🌟 Key Features

🎯 Visual Template Creation

  • Point-and-Click Interface: Select elements visually in a real browser
  • Container Recognition: Smart detection of repeating patterns (product cards, lists, etc.)
  • Sub-Element Selection: Click inside containers to define structured data extraction
  • Real-time Preview: See exactly what will be scraped as you build templates

πŸ€– Automated Execution

  • High-Performance Scraping: Built on Scrapling for robust data extraction
  • AutoMatch Technology: Adapts to website changes automatically
  • Multi-Format Export: JSON, CSV, and Excel output
  • Batch Processing: Process multiple URLs with the same template (see the batching sketch after this list)
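
Because the CLI commands shown later take one template at a time, a simple way to batch several URLs is to rewrite the template's url field and invoke the documented scrape command once per URL. A minimal sketch (the URL list and file names are illustrative, and it assumes the url field shown in the Template Structure section):

import json
import subprocess
from pathlib import Path

URLS = [
    "https://example.com/page/1",
    "https://example.com/page/2",
]

# Load an existing template and reuse it for every URL
base = json.loads(Path("templates/my_template.json").read_text())

for i, url in enumerate(URLS, start=1):
    base["url"] = url                          # point the template at the next URL
    tmp = Path(f"templates/batch_{i}.json")
    tmp.write_text(json.dumps(base, indent=2))

    # Call the documented scrape command for each generated template
    subprocess.run(
        ["python", "-m", "src.core.main", "scrape", str(tmp),
         "--output", f"output/batch_{i}.json"],
        check=True,
    )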

πŸ“¦ Container-Based Scraping

  • Repeating Patterns: Perfect for product listings, profiles, articles
  • Visual Sub-Elements: Click on names, prices, links inside each container
  • Automatic Scaling: Handles hundreds of similar items efficiently
  • Smart Selectors: Generates robust CSS selectors that survive design changes

πŸš€ Quick Start

Installation

Clone and set up the project:

Unix / Mac:

git clone https://github.com/masodori/Scraper_V2.git
cd Scraper_V2
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
playwright install chromium

Windows:

git clone https://github.com/masodori/Scraper_V2.git
cd Scraper_V2
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
playwright install chromium

πŸ“‹ Core Commands

🎯 Create Interactive Templates

# Start interactive browser session for template creation
python -m src.core.main interactive https://example.com --output my_template.json

# Run in headless mode (for servers)
python -m src.core.main interactive https://example.com --headless --output my_template.json

πŸš€ Execute Automated Scraping

# Execute scraping with your template (JSON output)
python -m src.core.main scrape templates/my_template.json --format json

# Export to different formats
python -m src.core.main scrape templates/my_template.json --format csv
python -m src.core.main scrape templates/my_template.json --format excel

# Custom output location
python -m src.core.main scrape templates/my_template.json --output results/data.json

πŸ“‹ Template Management

# List all available templates
python -m src.core.main list

# Show help for any command
python -m src.core.main --help
python -m src.core.main interactive --help
python -m src.core.main scrape --help

πŸ§ͺ Development & Testing

# Run tests
pytest tests/ -v
pytest tests/test_models.py -v
pytest tests/test_scrapling_runner.py -v

# Quick testing scripts
python quick_test.py
python test_corrected_template.py
python test_fixed_template.py
python test_working_template.py

πŸ“ Project Structure

Clean Modular Architecture (v2.0)

Scraper_V2/
β”œβ”€β”€ πŸ“ src/                              # Main source code
β”‚   β”œβ”€β”€ πŸ“ core/                         # Core scraping functionality
β”‚   β”‚   β”œβ”€β”€ __init__.py                  # Package initialization
β”‚   β”‚   β”œβ”€β”€ __main__.py                  # Module entry point
β”‚   β”‚   β”œβ”€β”€ cli.py                       # Command-line interface
β”‚   β”‚   β”œβ”€β”€ interactive_cli.py           # Interactive CLI utilities
β”‚   β”‚   β”œβ”€β”€ main.py                      # Session management & Playwright integration
β”‚   β”‚   β”œβ”€β”€ scrapling_runner_refactored.py # NEW: Main orchestrator (300 lines)
β”‚   β”‚   β”œβ”€β”€ context.py                   # NEW: Shared state management
β”‚   β”‚   β”œβ”€β”€ πŸ“ utils/                    # NEW: Utility modules
β”‚   β”‚   β”‚   └── progress.py              # Progress tracking & ETA
β”‚   β”‚   β”œβ”€β”€ πŸ“ analyzers/                # NEW: Analysis modules
β”‚   β”‚   β”‚   └── template_analyzer.py     # Directory & pattern detection
β”‚   β”‚   β”œβ”€β”€ πŸ“ selectors/                # NEW: Selector modules
β”‚   β”‚   β”‚   └── selector_engine.py       # Smart selector mapping
β”‚   β”‚   β”œβ”€β”€ πŸ“ extractors/               # NEW: Extraction modules
β”‚   β”‚   β”‚   └── data_extractor.py        # Multi-strategy element finding
β”‚   β”‚   β”œβ”€β”€ πŸ“ handlers/                 # NEW: Handler modules
β”‚   β”‚   β”‚   └── pagination_handler.py    # Pagination logic
β”‚   β”‚   └── πŸ“ processors/               # NEW: Processing modules
β”‚   β”‚       └── subpage_processor.py     # Subpage processing
β”‚   β”œβ”€β”€ πŸ“ interactive/                  # Browser-based interactive system
β”‚   β”‚   β”œβ”€β”€ __init__.py                  # Package initialization
β”‚   β”‚   β”œβ”€β”€ index.js                     # Main entry point & orchestration
β”‚   β”‚   β”œβ”€β”€ πŸ“ core/                     # Core interactive functionality
β”‚   β”‚   β”‚   β”œβ”€β”€ config.js                # Configuration constants
β”‚   β”‚   β”‚   β”œβ”€β”€ error-handler.js         # Error handling utilities
β”‚   β”‚   β”‚   β”œβ”€β”€ event-manager.js         # Event delegation & handling
β”‚   β”‚   β”‚   └── state-manager.js         # Centralized state management
β”‚   β”‚   β”œβ”€β”€ πŸ“ navigation/               # Navigation & session management
β”‚   β”‚   β”‚   └── state-persistence.js     # Save/restore session state
β”‚   β”‚   β”œβ”€β”€ πŸ“ selectors/                # CSS selector generation
β”‚   β”‚   β”‚   └── selector-generator.js    # Smart CSS selector creation
β”‚   β”‚   β”œβ”€β”€ πŸ“ tools/                    # Interactive selection tools
β”‚   β”‚   β”‚   β”œβ”€β”€ base-tool.js             # Base tool interface
β”‚   β”‚   β”‚   β”œβ”€β”€ element-tool.js          # Element selection functionality
β”‚   β”‚   β”‚   β”œβ”€β”€ action-tool.js           # Action selection & handling
β”‚   β”‚   β”‚   β”œβ”€β”€ container-tool.js        # Container selection logic (⭐ power feature)
β”‚   β”‚   β”‚   └── scroll-tool.js           # Scroll/pagination handling
β”‚   β”‚   β”œβ”€β”€ πŸ“ ui/                       # User interface components
β”‚   β”‚   β”‚   β”œβ”€β”€ control-panel.js         # Main control panel UI
β”‚   β”‚   β”‚   β”œβ”€β”€ modal-manager.js         # Modal dialogs & prompts
β”‚   β”‚   β”‚   β”œβ”€β”€ status-manager.js        # Status updates & feedback
β”‚   β”‚   β”‚   └── styles.js                # CSS injection & styling
β”‚   β”‚   └── πŸ“ utils/                    # Utility functions
β”‚   β”‚       β”œβ”€β”€ dom-utils.js             # DOM manipulation helpers
β”‚   β”‚       β”œβ”€β”€ python-bridge.js         # Python callback interface
β”‚   β”‚       └── template-builder.js      # Template generation logic
β”‚   └── πŸ“ models/                       # Data models & validation
β”‚       β”œβ”€β”€ __init__.py                  # Package initialization
β”‚       └── scraping_template.py         # Pydantic models for templates
β”œβ”€β”€ πŸ“ templates/                        # Generated JSON templates
β”‚   └── template.json                    # Example/current template
β”œβ”€β”€ πŸ“ output/                           # Scraped data files (JSON/CSV/Excel)
β”‚   β”œβ”€β”€ gibsondunn.com_*_*.json          # Sample output files
β”‚   └── ...                              # Additional scraped data
β”œβ”€β”€ πŸ“ examples/                         # Example scripts & demonstrations
β”‚   └── gibson_dunn_demo.py              # Gibson Dunn scraping demo
β”œβ”€β”€ πŸ“ tests/                            # Test suite
β”‚   β”œβ”€β”€ __init__.py                      # Package initialization
β”‚   β”œβ”€β”€ test_models.py                   # Template model tests
β”‚   └── test_scrapling_runner.py         # Scraping engine tests
β”œβ”€β”€ πŸ“„ requirements.txt                  # Python dependencies
β”œβ”€β”€ πŸ“„ CLAUDE.md                         # Claude Code guidance & project rules
β”œβ”€β”€ πŸ“„ README.md                         # This file - project documentation
β”œβ”€β”€ πŸ“„ CHECKLIST.md                      # Development checklist & roadmap
β”œβ”€β”€ πŸ“„ quick_test.py                     # Quick testing script
β”œβ”€β”€ πŸ“„ test_*.py                         # Additional test scripts
└── πŸ“ venv/                             # Virtual environment (gitignored)

🎨 Interactive Template Creation Process

When you run the interactive command, it opens a browser with an overlay panel where you can:

  1. 🎯 Select Containers: Click the "Container" tool and select repeating elements (product cards, profiles, etc.)
  2. πŸ“ Add Sub-Elements: Click inside containers to select specific data (names, prices, links)
  3. ⚑ Add Actions: Click buttons, links, or scrollable areas for navigation
  4. πŸ’Ύ Save Template: Generate a reusable JSON template automatically

Example: Law Firm Directory

# 1. Start interactive session
python -m src.core.main interactive https://www.gibsondunn.com/people/ --output law_firm.json

# 2. In browser: Click "Container" β†’ Click on any lawyer profile card
# 3. In browser: Click inside containers to select name, title, email, profile link
# 4. In browser: Save template

# 5. Run automated scraping
python -m src.core.main scrape templates/law_firm.json --format excel --output lawyers.xlsx
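
To sanity-check the exported workbook, any spreadsheet tool works; if pandas is available in your environment, a quick peek might look like this (reading Excel also requires openpyxl):

import pandas as pd

# Load the exported workbook and confirm the extraction looks right
df = pd.read_excel("lawyers.xlsx")
print(df.head())
print(f"{len(df)} rows; columns: {list(df.columns)}")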

πŸ—οΈ Architecture Highlights

Two-Phase System

The scraper operates in two distinct phases:

  1. Interactive Phase: Visual template creation through browser overlay with Playwright
  2. Automated Phase: Template execution with Scrapling engine

Core Components

  • ScraplingRunner (refactored, ~300 lines): Main orchestrator that composes the components below
  • TemplateAnalyzer: Smart detection of directory pages vs individual pages
  • DataExtractor: Multi-strategy element finding with fallback mechanisms
  • PaginationHandler: Infinite scroll, load-more buttons, URL-based pagination
  • SubpageProcessor: Navigate to individual profile pages for detailed data
  • SelectorEngine: Automatic mapping of generic selectors to robust ones
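
As a rough illustration of this composition, here is a stubbed-out sketch; the class and method bodies are simplified placeholders, not the project's actual interfaces (those live under src/core/):

# Illustrative stubs showing the composition pattern described above
class TemplateAnalyzer:
    def detect_page_type(self, template):
        # The real analyzer distinguishes directory/listing pages from single pages
        return "directory" if any(e.get("is_container") for e in template["elements"]) else "single"

class SelectorEngine:
    def enhance(self, template, page_kind):
        return template["elements"]      # real engine upgrades generic selectors

class DataExtractor:
    def extract(self, selectors):
        return [{"placeholder": True}]   # real extractor tries multiple strategies with fallbacks

class PaginationHandler:
    def collect_remaining(self):
        return []                        # infinite scroll, load-more buttons, URL pagination

class SubpageProcessor:
    def enrich(self, records):
        return records                   # real processor follows links to profile pages

class ScraplingRunner:
    """Orchestrator: delegates each phase to one focused collaborator."""

    def __init__(self, template):
        self.template = template
        self.analyzer = TemplateAnalyzer()
        self.selectors = SelectorEngine()
        self.extractor = DataExtractor()
        self.pagination = PaginationHandler()
        self.subpages = SubpageProcessor()

    def run(self):
        kind = self.analyzer.detect_page_type(self.template)
        selectors = self.selectors.enhance(self.template, kind)
        records = self.extractor.extract(selectors)
        records += self.pagination.collect_remaining()
        return self.subpages.enrich(records)

print(ScraplingRunner({"elements": [{"is_container": True}]}).run())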

πŸš€ Use Cases & Examples

🏒 Professional Directory Scraping

# Law firms, consulting companies, real estate agents
python -m src.core.main interactive https://firm.com/people/
# Extract: Names, titles, practice areas, contact info, bios

πŸ›’ E-commerce Product Extraction

# Product catalogs, marketplace listings
python -m src.core.main interactive https://shop.com/products
# Extract: Product names, prices, descriptions, images, ratings

πŸ“° News & Content Aggregation

# News sites, blogs, article directories
python -m src.core.main interactive https://news.com/articles
# Extract: Headlines, authors, dates, summaries, full articles

🏠 Real Estate Listings

# Property listings, rental sites
python -m src.core.main interactive https://realestate.com/listings
# Extract: Addresses, prices, features, photos, agent info

🎯 Template Structure

Templates are JSON files that define what to scrape:

{
  "name": "template_name",
  "url": "https://example.com",
  "elements": [
    {
      "label": "products",
      "selector": ".product-card",
      "is_container": true,
      "is_multiple": true,
      "sub_elements": [
        {"label": "name", "selector": "h3", "element_type": "text"},
        {"label": "price", "selector": ".price", "element_type": "text"},
        {"label": "link", "selector": "a", "element_type": "link"}
      ]
    }
  ],
  "actions": [
    {"label": "load_more", "selector": ".load-more-btn", "action_type": "click"}
  ]
}
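
Because templates are plain JSON, they are easy to inspect or post-process outside the tool. A small sketch that walks the structure shown above (assuming a template saved at templates/my_template.json):

import json
from pathlib import Path

template = json.loads(Path("templates/my_template.json").read_text())

print(f"Template: {template['name']} -> {template['url']}")
for element in template.get("elements", []):
    kind = "container" if element.get("is_container") else "single element"
    print(f"- {element['label']} ({kind}): {element['selector']}")
    for sub in element.get("sub_elements", []):
        print(f"    {sub['label']} [{sub['element_type']}]: {sub['selector']}")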

πŸ›‘οΈ Advanced Features

🧠 Auto-Detection & Enhancement

  • Directory Detection: Automatically recognizes listing/directory pages
  • Pagination Patterns: Detects infinite scroll, load-more buttons, URL pagination
  • Smart Selector Enhancement: Upgrades generic selectors for better reliability
  • Template Auto-Fixing: Corrects common template configuration issues

πŸ“Š Export Formats

  • JSON: Structured data with metadata for programmatic use
  • CSV: Tabular format perfect for spreadsheet analysis
  • Excel: Multi-sheet workbooks with data + extraction metadata
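
If you already have a JSON export and want it in another format after the fact, pandas can convert it directly (a sketch; the paths are illustrative, and nested records may need pandas.json_normalize first):

import pandas as pd

# Convert an existing JSON export into CSV and Excel copies
df = pd.read_json("output/data.json")
df.to_csv("output/data.csv", index=False)
df.to_excel("output/data.xlsx", index=False)   # Excel output requires openpyxl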

πŸ”§ Reliability Features

  • AutoMatch Technology: Adapts to website changes automatically
  • Fallback Strategies: Multiple selector approaches for maximum reliability
  • Error Recovery: Graceful handling of missing elements
  • Session Persistence: Maintains browser state across navigations

πŸ› Troubleshooting

Common Issues & Solutions

Playwright Browser Issues

# Browser fails to start
playwright install chromium --force

# Missing system dependencies (Linux)
sudo playwright install-deps chromium

Template Execution Problems

  • Elements not found: Check the selectors in your browser's dev tools
  • Empty results: Verify the is_multiple flag matches whether you expect one element or many
  • Timeout errors: Increase the wait timeout in the template or add explicit waits
  • Memory issues: Process smaller batches and clear the browser cache between runs

Debug Information

  • Check scraper.log for detailed error logs
  • Examine template JSON structure in templates/ directory
  • Use browser dev tools to verify selectors work correctly (or script the check as shown below)
  • Test templates with small datasets before running full extraction
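
To script that selector check outside the scraper, Playwright (already installed for the interactive phase) can count matches directly. A quick sketch with a placeholder URL and selector:

from playwright.sync_api import sync_playwright

URL = "https://example.com/products"   # page you are scraping
SELECTOR = ".product-card"             # selector taken from your template

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL)
    count = page.locator(SELECTOR).count()
    print(f"{SELECTOR!r} matched {count} element(s) on {URL}")
    browser.close()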

πŸ“ˆ Performance Benefits (v2.0)

Before → after the v2.0 refactor:

  • File size: 5,213 lines → ~300 lines in the main orchestrator (94% reduction)
  • Method count: 102 methods → 10-15 per class (85% reduction)
  • Complexity: extremely high → low per module (97% reduction)
  • Maintainability: impossible → easy
  • Testability: very difficult → simple
  • Developer experience: frustrating → delightful

🎯 Quick Commands Reference

  • Create template: python -m src.core.main interactive https://example.com --output template.json
  • Run scraping: python -m src.core.main scrape templates/template.json
  • Export to Excel: python -m src.core.main scrape templates/template.json --format excel
  • Export to CSV: python -m src.core.main scrape templates/template.json --format csv
  • List templates: python -m src.core.main list
  • Get help: python -m src.core.main --help
