ZeroXClem/enhanced-web-data-extractor

🕸️ Enhanced Web Data Extractor 🔍

A powerful and user-friendly web scraping tool built with Python and Streamlit.

🌟 Features

  • 🚀 Asynchronous web scraping for faster data collection
  • 🌐 Depth-limited crawling to control the scope of extraction
  • 🔑 Keyword filtering to focus on relevant content
  • 📊 Multiple export formats: CSV, Markdown, JSON, and XML
  • 🖥️ Interactive Streamlit UI for easy operation
  • 🛡️ Rate limiting to respect server resources
  • 📈 Real-time progress tracking
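As a rough illustration of how depth-limited crawling and keyword filtering can work, here is a minimal sketch. It is not taken from this project's source: the names `crawl`, `get_links`, and `matches_keywords` are hypothetical, and it assumes the start page counts as depth 0 and that keyword matching is a case-insensitive substring test.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int,
    max_pages: int,
) -> list:
    """Breadth-first crawl bounded by link depth and total page count.

    `get_links` is a hypothetical callable returning the outgoing links
    of a page; in a real scraper it would fetch and parse the page.
    """
    seen = {start}
    queue = deque([(start, 0)])  # (url, depth); start page is depth 0
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def matches_keywords(text: str, keywords: Iterable[str]) -> bool:
    """Return True if no keywords are given or any keyword occurs in text."""
    wanted = [k.lower() for k in keywords if k.strip()]
    if not wanted:
        return True  # the filter is optional
    lowered = text.lower()
    return any(k in lowered for k in wanted)
```

Tracking `seen` URLs keeps the crawl from revisiting pages that link back to each other, and the page cap stops the loop even when the depth limit alone would allow more.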

🛠️ Installation

  1. Clone this repository:

    git clone https://github.com/ZeroXClem/enhanced-web-data-extractor.git
    cd enhanced-web-data-extractor
    
  2. Create a virtual environment (optional but recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    
  3. Install the required packages:

    pip install -r requirements.txt
    

🚀 Usage

  1. Run the Streamlit app:

    streamlit run main.py
    
  2. Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).

  3. In the Streamlit interface:

    • Enter the base URL you want to scrape
    • Set the maximum number of pages to scrape (1-100)
    • Set the maximum depth for crawling (1-10)
    • (Optional) Enter keywords to filter content
    • Set the rate limit (requests per second)
    • Choose the desired export format(s)
    • Click "Start Scraping"
  4. Monitor the progress and download the extracted data when complete.
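To show what a "requests per second" setting can mean in an asynchronous scraper, the sketch below spaces out request starts with `asyncio`. It is an illustration under assumptions, not the app's actual code: `RateLimiter` and `fetch_all` are hypothetical names, and the HTTP request is replaced by a placeholder that only records a timestamp.

```python
import asyncio
import time


class RateLimiter:
    """Spaces out calls so that at most `rate` of them start per second."""

    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate
        self._lock = asyncio.Lock()
        self._next_time = 0.0  # earliest monotonic time for the next call

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            delay = self._next_time - now
            if delay > 0:
                await asyncio.sleep(delay)
                now = time.monotonic()
            self._next_time = max(now, self._next_time) + self._interval


async def fetch_all(urls, rate: float):
    """Run one placeholder 'fetch' per URL, limited to `rate` per second."""
    limiter = RateLimiter(rate)  # created inside the running event loop

    async def fetch(url):
        await limiter.wait()
        # Placeholder: a real scraper would issue the HTTP request here.
        return url, time.monotonic()

    return await asyncio.gather(*(fetch(u) for u in urls))
```

For example, `asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"], rate=2))` would start the second placeholder fetch about half a second after the first.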

🎯 Use Cases

  • 📚 Research: Gather data from academic websites or online journals
  • 💼 Business Intelligence: Collect product information from e-commerce sites
  • 📰 News Aggregation: Compile articles from various news sources
  • 🏢 Competitive Analysis: Extract data from competitor websites
  • 📊 Market Research: Gather consumer reviews and opinions

⚠️ Important Notes

  • This tool is for educational purposes only.
  • Always respect websites' terms of service and robots.txt files.
  • Be mindful of rate limiting and don't overload servers with requests.
  • Some websites may have measures in place to prevent scraping.
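One way to check a site's robots.txt rules before scraping is Python's standard `urllib.robotparser` module. The helper below is a minimal sketch and is not part of this tool as documented: `is_allowed` is a hypothetical name, and the robots.txt content is passed in as a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: everything under /private/ is off limits to all agents.
rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/private/page"))
print(is_allowed(rules, "https://example.com/public"))
```

In practice the rules would be downloaded from the site's `/robots.txt` path, for instance with `RobotFileParser.set_url` followed by `read`.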

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

🤝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

👨‍💻 Author

ZeroXClem


Happy Scraping! 🎉🕷️
