A powerful and user-friendly web scraping tool built with Python and Streamlit.
- 🚀 Asynchronous web scraping for faster data collection
- 🌐 Depth-limited crawling to control the scope of extraction
- 🔑 Keyword filtering to focus on relevant content
- 📊 Multiple export formats: CSV, Markdown, JSON, and XML
- 🖥️ Interactive Streamlit UI for easy operation
- 🛡️ Rate limiting to respect server resources
- 📈 Real-time progress tracking
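The crawling features above (depth-limited traversal with sleep-based rate limiting) can be sketched roughly as follows. This is a minimal, illustrative sketch, not the tool's actual implementation: the `crawl` function, the pluggable async `fetch` callable (which in practice would wrap an HTTP client), and the default parameter values are all assumptions.

```python
import asyncio
from collections import deque

async def crawl(start_url, fetch, max_depth=2, max_pages=10, rate_limit=2.0):
    """Breadth-first, depth-limited crawl (illustrative sketch).

    `fetch` is an async callable returning (text, links) for a URL --
    in a real scraper it would wrap an HTTP client.
    `rate_limit` is the maximum number of requests per second.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])   # (url, depth) pairs to visit
    results = {}
    delay = 1.0 / rate_limit
    while queue and len(results) < max_pages:
        url, depth = queue.popleft()
        text, links = await fetch(url)
        results[url] = text
        if depth < max_depth:         # only follow links within the depth limit
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        await asyncio.sleep(delay)    # simple sleep-based rate limiting
    return results
```

The depth check before enqueueing links is what bounds the crawl scope, while `max_pages` caps the total number of fetches regardless of depth.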
1. Clone this repository:

   ```bash
   git clone https://github.com/ZeroXClem/enhanced-web-data-extractor.git
   cd enhanced-web-data-extractor
   ```
2. Create a virtual environment (optional but recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```
3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```
4. Run the Streamlit app:

   ```bash
   streamlit run main.py
   ```
5. Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).
6. In the Streamlit interface:
   - Enter the base URL you want to scrape
   - Set the maximum number of pages to scrape (1-100)
   - Set the maximum depth for crawling (1-10)
   - (Optional) Enter keywords to filter content
   - Set the rate limit (requests per second)
   - Choose the desired export format(s)
   - Click "Start Scraping"
7. Monitor the progress and download the extracted data when complete.
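The parameters collected in the interface above could be modeled as a single config object with range validation. The `ScrapeConfig` dataclass below is a hypothetical sketch mirroring the UI fields, not code from this repository:

```python
from dataclasses import dataclass, field

@dataclass
class ScrapeConfig:
    """Hypothetical container for the scraping parameters set in the UI."""
    base_url: str
    max_pages: int = 10               # UI range: 1-100
    max_depth: int = 2                # UI range: 1-10
    keywords: list = field(default_factory=list)
    rate_limit: float = 1.0           # requests per second
    export_formats: tuple = ("csv",)  # any of: csv, markdown, json, xml

    def __post_init__(self):
        # Reject values outside the ranges the UI enforces.
        if not 1 <= self.max_pages <= 100:
            raise ValueError("max_pages must be between 1 and 100")
        if not 1 <= self.max_depth <= 10:
            raise ValueError("max_depth must be between 1 and 10")
        if self.rate_limit <= 0:
            raise ValueError("rate_limit must be positive")
```

Centralizing validation this way keeps the UI layer thin: Streamlit widgets only collect values, and the config object rejects anything out of range.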
- 📚 Research: Gather data from academic websites or online journals
- 💼 Business Intelligence: Collect product information from e-commerce sites
- 📰 News Aggregation: Compile articles from various news sources
- 🏢 Competitive Analysis: Extract data from competitor websites
- 📊 Market Research: Gather consumer reviews and opinions
- This tool is for educational purposes only.
- Always respect websites' terms of service and robots.txt files.
- Be mindful of rate limiting and don't overload servers with requests.
- Some websites may have measures in place to prevent scraping.
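To honor robots.txt as advised above, Python's standard-library `urllib.robotparser` can check whether a URL is fetchable before scraping it. The helper below is a minimal sketch; the function name `allowed` and the sample robots.txt content are illustrative:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from raw text
    return parser.can_fetch(user_agent, url)
```

In practice you would fetch `https://<host>/robots.txt` once per site (or use `RobotFileParser.set_url` and `.read`) and consult the parser before every request.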
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
ZeroXClem
- GitHub: @ZeroXClem
- LinkedIn: @ZeroXClem
Happy Scraping! 🎉🕷️