Friday, June 03, 2016

Thoughts on Lambda Architecture

Hi All,

I recently read "Big Data: Principles and best practices of scalable real-time data systems" by Nathan Marz & James Warren. The book is very informative on how to build scalable data systems using the Hadoop ecosystem.

Lambda Architecture

Regardless of which tools you use to implement it, the biggest takeaway of this book is its detailed description of the Lambda Architecture (LA). I am new to LA and to the way it is laid out for building highly scalable big data systems.

LA provides a separation of concerns for building large data systems, in particular by separating the batch layer from the serving and speed layers.

Lambda Architecture (LA) consists of three main layers:
1) Batch Layer: contains the original master dataset (immutable, append-only data) and precomputes functions over the master dataset.

Hadoop is the standard batch processing system used for most high-throughput architectures, with MapReduce as its computation framework. Recently, developers have been leaning toward Spark as a newer computation system for big data because of its high performance and in-memory processing.

2) Serving Layer: contains batch views that serve the precomputed results with low-latency reads.
Examples of serving layer technologies: Apache Cassandra, Apache HBase, ElephantDB, and Cloudera Impala.

3) Speed Layer: contains real-time views that fill the latency gap by querying recently obtained data. The speed layer is responsible for any data not yet available in the serving layer.

You can use Apache Storm to perform realtime computation in the speed layer.

It is recommended to use Apache Cassandra or Apache HBase for the speed layer output, and ElephantDB or Cloudera Impala for the batch layer output.
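
To make the three layers more concrete, here is a minimal in-memory sketch in Python of how they fit together for a simple page-view counter. This is only an illustration and not code from the book: the class name LambdaPageviews, its methods and the page-view example are all my own assumptions, and plain Python containers stand in for Hadoop, Storm and the databases mentioned above.

```python
from collections import Counter


class LambdaPageviews:
    """Toy page-view counter laid out along the three LA layers (hypothetical example)."""

    def __init__(self):
        self.master = []                # batch layer: immutable, append-only raw events
        self.batch_view = Counter()     # serving layer: precomputed batch view
        self.realtime_view = Counter()  # speed layer: counts not yet in the batch view

    def ingest(self, url):
        # Every event is appended to the master dataset and also
        # applied incrementally to the real-time view.
        self.master.append(url)
        self.realtime_view[url] += 1

    def run_batch(self):
        # Recompute the batch view from scratch over the whole master dataset.
        # Simplification: the batch run is assumed to be instantaneous, so the
        # whole real-time view can be discarded afterwards.
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, url):
        # A query merges the batch view with the real-time view.
        return self.batch_view[url] + self.realtime_view[url]


if __name__ == "__main__":
    views = LambdaPageviews()
    views.ingest("/home")
    views.ingest("/about")
    views.run_batch()       # "/home" and "/about" are now served from the batch view
    views.ingest("/home")   # this event is only visible through the speed layer
    print(views.query("/home"))
```

In a real system the batch run takes a long time, so the speed layer would only expire the data that the finished batch run has already covered, instead of clearing everything at once as this sketch does.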

I hope this article helps you get started with designing big data systems with high throughput and low latency.


a) Lambda Architecture website:

b) Cloudera Impala:
