This repository provides a complete setup for a modern data engineering stack built on powerful open-source tools. The stack supports real-time and batch data processing, orchestration, and storage, with seamless containerized deployment.
## Tech Stack

### Apache Airflow
- Workflow orchestration and scheduling.
- Manages end-to-end data pipelines.
- DAG-based execution for automation.
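A DAG in this style can be sketched as a small pipeline definition. This is an illustrative sketch, not code from this repository: the `beehiiv_ingest` dag_id, the schedule, and the task are assumptions, and the Airflow imports are kept inside a factory function so the module also loads where Airflow is not installed.

```python
from datetime import datetime


def extract_events() -> list:
    # Placeholder extract step; a real task would read from Kafka or an API.
    return [{"event": "subscribe", "user_id": 1}]


def build_dag():
    # Airflow imports stay inside the factory so this file remains
    # importable in environments without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="beehiiv_ingest",          # assumed name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",      # assumed cadence
        catchup=False,
    ) as dag:
        PythonOperator(task_id="extract_events", python_callable=extract_events)
    return dag
```

Airflow discovers such a DAG once the file is placed in the `dags/` folder and the scheduler is running.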
### Python
- Primary programming language for data processing and orchestration.
- Used in ETL scripts, Kafka consumers, and data transformation tasks.
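As a concrete sketch of the kind of transformation such scripts perform (the field names `email` and `ts` are illustrative assumptions, not taken from this repository):

```python
import json


def transform_event(raw: bytes) -> dict:
    """Normalize one raw JSON event (e.g. read from Kafka) into a flat record.

    The field names ("email", "ts") are illustrative assumptions.
    """
    event = json.loads(raw)
    return {
        "email": event.get("email", "").strip().lower(),
        "ts": event.get("ts"),
        "source": "beehiiv",  # constant tag added during the transform
    }
```

Keeping the transformation a pure function like this makes it easy to unit-test before wiring it into a consumer or an Airflow task.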
### Apache Kafka
- Distributed event streaming platform for real-time data ingestion.
- Handles high-throughput and low-latency data streams.
- Integrates seamlessly with Spark and ClickHouse.
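A consumer loop for this stack might look like the sketch below; the `kafka-python` client, topic name, and port are assumptions (the repository may use a different client library), and the client import is deferred so the decoding helper can be tested without a running broker.

```python
import json


def decode_message(value: bytes) -> dict:
    # Deserialize one Kafka message payload, assumed to be UTF-8 JSON.
    return json.loads(value.decode("utf-8"))


def consume(topic: str = "beehiiv-events", servers: str = "localhost:9092"):
    # kafka-python is an assumption; imported lazily so decode_message
    # remains usable without a broker available.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(topic, bootstrap_servers=servers,
                             auto_offset_reset="earliest")
    for message in consumer:
        print(decode_message(message.value))
```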
### Apache ZooKeeper
- Manages and coordinates Kafka brokers.
- Provides leader election and distributed synchronization.
### Apache Spark
- Distributed data processing engine for real-time and batch workloads.
- Utilized for transformations, aggregations, and analytics.
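A batch aggregation in this style can be sketched as follows; the input path, `date` column, and app name are assumptions, and a plain-Python helper mirrors the Spark `groupBy` so the logic can be checked without a cluster.

```python
def daily_counts(rows: list) -> dict:
    """Count events per date in plain Python; mirrors the Spark groupBy below."""
    counts = {}
    for row in rows:
        counts[row["date"]] = counts.get(row["date"], 0) + 1
    return counts


def run_spark_job(input_path: str):
    # pyspark imports are deferred; the "date" column and app name
    # are assumptions, not taken from this repository.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("beehiiv-batch").getOrCreate()
    df = spark.read.json(input_path)
    df.groupBy("date").agg(F.count("*").alias("events")).show()
    spark.stop()
```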
### ClickHouse
- Columnar database optimized for fast analytical queries.
- Stores structured and semi-structured data efficiently.
### PostgreSQL
- Relational database used for transactional workloads.
- Acts as metadata storage for Airflow and other applications.
### Docker
- Containerization for seamless deployment of all components.
- Ensures portability and reproducibility of the data stack.
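The wiring between these services is expressed in Docker Compose. The fragment below is only an illustrative sketch (image versions, service names, and ports are assumptions); the repository's own `docker-compose.yml` is authoritative.

```yaml
# Illustrative fragment only; see this repository's docker-compose.yml
# for the real service definitions.
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
```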
## Prerequisites

Ensure you have the following installed:

- Docker & Docker Compose
- Python 3.x
- Kafka & Zookeeper dependencies
## Getting Started

- Clone this repository:

      git clone https://github.com/Theglassofdata/Beehiiv-realtime.git
      cd Beehiiv-realtime
- Start the services using Docker Compose:

      docker-compose up -d
- Verify the Airflow setup:

      docker-compose exec airflow-webserver airflow dags list
- Access services:
  - Airflow UI: http://localhost:8080
  - Kafka broker: `localhost:9092` (this is the broker port, not a web UI)
  - ClickHouse: connect via `clickhouse-client`
  - PostgreSQL: connect via `psql` or an admin tool
## Contributing

Feel free to open issues and contribute improvements to this stack.
## License

This project is licensed under the MIT License.