Awesome Data Engineering

A curated list of data engineering tools for software developers

Contents

[Databases](#databases)
Data Ingestion
[File System](#file-system)
File Format
Stream Processing
[Batch Processing](#batch-processing)
[Front End](#front-end)
[Frameworks](#frameworks)

Databases

Relational
- [MySQL](http://www.mysql.com/)
- [PostgreSQL](http://www.postgresql.org/)
- [Amazon RDS](http://aws.amazon.com/rds/)
Key-Value
- [Redis](http://redis.io/)
- [Riak](https://docs.basho.com/riak/latest/)
- [AWS DynamoDB](http://aws.amazon.com/dynamodb/)
Column
- [Cassandra](http://cassandra.apache.org/)
- [HBase](http://hbase.apache.org/)
- [Infobright](http://www.infobright.org)
- [AWS Redshift](http://aws.amazon.com/redshift/)
Document
- [MongoDB](https://www.mongodb.org/)
- [Elasticsearch](https://www.elastic.co/)
- [Couchbase](http://www.couchbase.com/)
Graph
- [Neo4j](http://neo4j.com/)
- [OrientDB](http://orientdb.com/orientdb/)
- [ArangoDB](https://www.arangodb.com/)
- [Titan](http://thinkaurelius.github.io/titan/)

Data Ingestion

[Kafka](http://kafka.apache.org/)
- Camus: LinkedIn's Kafka to HDFS pipeline
- BottledWater: Change data capture from PostgreSQL into Kafka
- kafkat: Simplified command-line administration for Kafka brokers
- kafkacat: Generic command-line non-JVM Apache Kafka producer and consumer
- pg-kafka: A PostgreSQL extension to produce messages to Apache Kafka
- librdkafka: The Apache Kafka C/C++ library
- kafka-docker: Kafka in Docker
- kafka-manager: A tool for managing Apache Kafka
- kafka-node: Node.js client for Apache Kafka 0.8
- [Secor](https://github.com/pinterest/secor): Pinterest's Kafka to S3 distributed consumer
[AWS Kinesis](http://aws.amazon.com/kinesis/)
RabbitMQ
Fluentd
Apache Sqoop
Luigi: Python module that helps you build complex pipelines of batch jobs

File System

[HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
[AWS S3](http://aws.amazon.com/s3/)
[Tachyon](http://tachyon-project.org/)

File Format

Apache Avro: Apache Avro™ is a data serialization system.
Apache Parquet: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
Apache Thrift: The Apache Thrift software framework, for scalable cross-language services development.
ProtoBuf: Protocol Buffers, Google's data interchange format.
SequenceFile: SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format.

Stream Processing

Spark Streaming: Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
Apache Flink: Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
Apache Storm: Apache Storm is a free and open source distributed realtime computation system.
- Pyleus: Pyleus is a Python framework for developing and launching Storm topologies.
- Streamparse: Parsely's streamparse lets you run Python code against real-time streams of data with Apache Storm.
Apache Samza: Apache Samza is a distributed stream processing framework.
Apache NiFi: Apache NiFi is an easy-to-use, powerful, and reliable system to process and distribute data.

Batch Processing

[Hadoop MapReduce](http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html): Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
[Spark](https://spark.apache.org/)
- Spark Packages: A community index of packages for Apache Spark
- Deep Spark: Connecting Apache Spark with different data stores
[AWS EMR](http://aws.amazon.com/elasticmapreduce/)
Flink
[Tez](https://tez.apache.org/)

Batch ML
- [H2O](http://h2o.ai/)
- [Mahout](http://mahout.apache.org/)
- [Spark MLlib](https://spark.apache.org/docs/1.2.1/mllib-guide.html)
Batch Graph
- [GraphLab](https://dato.com/products/create/)
- [Giraph](http://giraph.apache.org/)
- [Spark GraphX](https://spark.apache.org/graphx/)
Batch SQL
- [Presto](https://prestodb.io/docs/current/index.html)
- [Hive](http://hive.apache.org)
- [Drill](https://drill.apache.org/)

Front End

[Flask](http://flask.pocoo.org/)
[D3](http://d3js.org/)
- [D3Plus](http://d3plus.org): D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.
[AngularJS](https://angularjs.org/)
[Django](https://www.djangoproject.com/)
[Highcharts](http://www.highcharts.com/)
C3.js: D3-based reusable chart library

Frameworks

[Luigi](https://github.com/spotify/luigi): Luigi is a Python module that helps you build complex pipelines of batch jobs.
[Cascading](http://www.cascading.org/): Java-based application development platform.
[Airflow](https://github.com/airbnb/airflow): Airflow is a system to programmatically author, schedule and monitor data pipelines.

ELK Elasticsearch Logstash Kibana

docker-logstash

Docker

Flocker: Easily manage Docker containers and their data

Datasets

Realtime

Instagram Realtime

Data Dumps

[GitHub Archive](https://www.githubarchive.org/): GitHub's public timeline since 2011, updated every hour
[Common Crawl](https://commoncrawl.org/): Open source repository of web crawl data

Cheers to The Data Engineering Ecosystem: An Interactive Map

Inspired by the awesome list. Created by Insight Data Engineering fellows.

License

To the extent possible under law, Igor Barinov has waived all copyright and related or neighboring rights to this work.
