Monday, December 14, 2015

Get Started with Apache Spark Resources in HDInsight

Hi All,

I am writing this blog post to share some foundational concepts and getting-started resources for the Apache Spark framework.

Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big data analytics applications.

The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory computation capabilities make it a good choice for iterative algorithms in machine learning and graph computation.
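As a quick illustration of why in-memory caching matters, here is a minimal PySpark sketch (the app name and the numbers are made up for illustration). The RDD is cached once and then reused by two separate actions, so the second action reads from executor memory instead of recomputing the data:

from pyspark import SparkContext

# Create a context; in an HDInsight notebook one is usually already available as `sc`.
sc = SparkContext(appName="CacheSketch")

# Keep the dataset in executor memory after it is first computed.
numbers = sc.parallelize(range(1, 1000001)).cache()

# Both actions reuse the cached partitions instead of rebuilding them,
# which is what makes iterative algorithms fast on Spark.
total = numbers.sum()
evens = numbers.filter(lambda x: x % 2 == 0).count()

print(total, evens)
sc.stop()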

You can write applications in Python, Scala, and R on Spark clusters. HDInsight ships with out-of-the-box notebooks (tools/dev IDEs) that allow data scientists to write Spark programs using the following (a minimal notebook cell is sketched after this list):

1) Python, using Jupyter (which also supports R, Julia, and Scala)
2) Scala, using Zeppelin
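For example, here is what a first cell in an HDInsight Jupyter (PySpark) notebook might look like. This is only a sketch: the PySpark kernel typically pre-creates a SparkContext named sc, and the file path below is a placeholder you would replace with a file that actually exists in your cluster's storage account.

lines = sc.textFile("wasb:///example/data/sample.log")   # placeholder path in the cluster's default storage
errors = lines.filter(lambda line: "ERROR" in line)      # lazy transformation
print(errors.count())                                    # action: triggers the actual job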

The most important libraries for Apache Spark:

1) Spark SQL: A module for structured data manipulation using SQL or the DataFrame API (a small sketch follows this list).

2) Spark Streaming: A module for building stream processing applications the same way you write batch jobs. It supports Java, Scala, and Python (see the streaming sketch below).

3) MLlib: A scalable machine learning library.
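Here is a small, self-contained Spark SQL sketch showing the same query expressed through the DataFrame API and through SQL. It is written against the Spark 1.x-style SQLContext that HDInsight shipped at the time, and the rows are made up inline purely for illustration:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="SparkSqlSketch")
sqlContext = SQLContext(sc)

# Made-up rows instead of a real table.
people = sc.parallelize([Row(name="Ann", age=34), Row(name="Bob", age=29)])
df = sqlContext.createDataFrame(people)

df.filter(df.age > 30).select("name").show()   # DataFrame API

df.registerTempTable("people")                 # the same query in SQL
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()

sc.stop()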


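And a minimal Spark Streaming sketch using the DStream API: it counts words arriving on a TCP socket in 5-second micro-batches, with the same map/reduce style you would use in a batch job. The host and port are placeholders for whatever source you actually have.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, 5)                    # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print each batch's word counts

ssc.start()
ssc.awaitTermination()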
Some useful links and resources for Apache Spark:

1) Apache Spark homepage: https://spark.apache.org/

2) Learning Python:

3) Learning Scala:

HDInsight Apache Spark provides tons of tools out of the box; check out this link to see why you would use Apache Spark in HDInsight on Azure:


Hope this helps! Enjoy crushing streams of data...
