Very good article by Raul Estrada. Main points:
1. Data acquisition: pipeline for performance
In this step, data enters the system from diverse sources. The key focus of this stage is performance, since this step determines how much data the whole system can receive at any given point in time.
Technologies
For this stage, you should consider streaming APIs and messaging solutions such as the following (a minimal producer sketch appears after the list):
- Apache Kafka - open-source stream processing platform
- Akka Streams - open-source stream processing based on Akka
- Amazon Kinesis - Amazon data stream processing solution
- ActiveMQ - open-source message broker written in Java with a full JMS client
- RabbitMQ - open-source message broker written in Erlang that implements AMQP
- JBoss AMQ - lightweight message-oriented middleware (MOM) developed by JBoss
- Oracle Tuxedo - middleware message platform by Oracle
- Sonic MQ - messaging system platform by Sonic
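To make the acquisition step concrete, here is a minimal sketch of a producer pushing events into a Kafka topic with the kafka-python client. The broker address, topic name, and payload fields are assumptions made for the example, not part of the original article.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Assumes a Kafka broker running locally; address and topic are illustrative.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each incoming record is sent to the acquisition topic as soon as it arrives,
# so the throughput of this stage bounds what the rest of the pipeline can see.
event = {"sensor_id": "sensor-1", "value": 21.5}
producer.send("events", value=event)
producer.flush()  # block until the broker has acknowledged the message
```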
2. Data storage: flexible experimentation leads to solutions
There are many ways to approach the design of this layer, but all of them should consider two perspectives: logical data storage (i.e. the model) and physical data storage. The key focus of this stage is "experimentation" and flexibility.
Technologies
For this stage, consider distributed database storage solutions such as the following (a small write/read sketch appears after the list):
- Apache Cassandra - distributed NoSQL DBMS
- Couchbase - NoSQL document-oriented database
- Amazon DynamoDB - fully managed proprietary NoSQL database
- Apache Hive - data warehouse built on Apache Hadoop
- Redis - distributed in-memory key-value store
- Riak - distributed NoSQL key-value data store
- Neo4J - graph database management system
- MariaDB - MySQL-based relational database that, together with Galera, forms a replication cluster
- MongoDB - cross-platform document-oriented database
- MemSQL - distributed in-memory SQL RDBMS
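As an illustration of the physical side of this layer, here is a minimal sketch that writes and reads back an event in Apache Cassandra using the DataStax Python driver. The keyspace, table, and column names are assumptions made for the example.

```python
from datetime import datetime
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Assumes a Cassandra node reachable locally; keyspace and table are illustrative.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS fastdata
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("fastdata")
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        sensor_id text,
        ts timestamp,
        value double,
        PRIMARY KEY (sensor_id, ts)
    )
""")

# Write one event, then read it back partitioned by sensor.
session.execute(
    "INSERT INTO events (sensor_id, ts, value) VALUES (%s, %s, %s)",
    ("sensor-1", datetime.utcnow(), 21.5),
)
for row in session.execute("SELECT * FROM events WHERE sensor_id = %s", ("sensor-1",)):
    print(row.sensor_id, row.ts, row.value)
```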
3. Data processing: combining tools and approaches
Years ago, there was debate about whether big data systems should use (modern) stream processing or (traditional) batch processing. Today we know the correct answer for fast data: most systems must be hybrid, doing both batch and stream processing at the same time. The type of processing is now defined by the process itself, not by the tool. The key focus of this stage is "combination."
Technologies
For this stage, you should consider data processing solutions such as the following (a hybrid batch/stream sketch appears after the list):
- Apache Spark - engine for large-scale data processing
- Apache Flink - open-source stream processing framework
- Apache Storm - open-source distributed real-time computation system
- Apache Beam - open-source, unified model for batch and streaming data
- TensorFlow - open-source library for machine intelligence
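To show what "combination" can look like in practice, here is a minimal sketch that uses a single engine, Apache Spark, to run the same kind of aggregation in batch mode and in streaming mode (consuming the Kafka topic from the acquisition example). The file path, topic, and field names are assumptions made for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hybrid-processing").getOrCreate()

# Batch: aggregate historical events already landed in storage (path is illustrative).
history = spark.read.json("/data/events/history")
history.groupBy("sensor_id").agg(F.avg("value").alias("avg_value")).show()

# Stream: the same engine consumes live events from the Kafka topic used above
# (requires the spark-sql-kafka connector on the classpath).
live = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Count messages per key as they arrive and print rolling results to the console.
query = (
    live.groupBy("key").count()
    .writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```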
4. Data visualization
Visualization communicates data or information by encoding it as visual objects such as charts, graphs, and maps, so that users can grasp the information clearly and efficiently. This stage is not easy; it is both an art and a science. A small notebook-style plotting sketch follows the technology list below.
Technologies
- Notebook reports: Apache Zeppelin and Jupyter notebooks
- Charts, maps, and graphics: Tableau
- Customized charts, maps, and graphics: D3.js and Gephi
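As a minimal notebook-style example (the kind of cell you might run in Jupyter or Zeppelin), the sketch below plots a small, made-up hourly event count with pandas and matplotlib. The data and labels are purely illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical hourly event counts produced by the processing stage.
df = pd.DataFrame({
    "hour": list(range(6)),
    "events": [120, 98, 310, 450, 220, 175],
})

# A simple bar chart: one visual object per aggregated value.
df.plot(x="hour", y="events", kind="bar", legend=False)
plt.xlabel("Hour of day")
plt.ylabel("Events received")
plt.title("Event volume per hour")
plt.tight_layout()
plt.show()
```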