Monday, December 14, 2015

Get Started with Apache Spark Resources in HDInsight

Hi All,

I am writing this blog post to share some foundational concepts and getting-started resources for the Apache Spark framework.

Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big data analytics applications.

The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory computation capabilities make it a good choice for iterative algorithms in machine learning and graph computation.
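As a quick illustration of why in-memory caching matters, here is a minimal PySpark sketch (the app name and the numbers are made up for illustration). The RDD is cached once and then reused by two separate actions, so the second action reads from executor memory instead of recomputing the data:

from pyspark import SparkContext

# Create a context; in an HDInsight notebook one is usually already available as `sc`.
sc = SparkContext(appName="CacheSketch")

# Keep the dataset in executor memory after it is first computed.
numbers = sc.parallelize(range(1, 1000001)).cache()

# Both actions reuse the cached partitions instead of rebuilding them,
# which is what makes iterative algorithms fast on Spark.
total = numbers.sum()
evens = numbers.filter(lambda x: x % 2 == 0).count()

print(total, evens)
sc.stop()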

You can write applications in Python, Scala, and R on Spark clusters. HDInsight ships with out-of-the-box notebooks (tools/dev IDEs) that allow data scientists to write Spark programs using the following (a minimal notebook cell is sketched after this list):

1) Python, using Jupyter (which also supports R, Julia, and Scala)
2) Scala, using Zeppelin
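For example, here is what a first cell in an HDInsight Jupyter (PySpark) notebook might look like. This is only a sketch: the PySpark kernel typically pre-creates a SparkContext named sc, and the file path below is a placeholder you would replace with a file that actually exists in your cluster's storage account.

lines = sc.textFile("wasb:///example/data/sample.log")   # placeholder path in the cluster's default storage
errors = lines.filter(lambda line: "ERROR" in line)      # lazy transformation
print(errors.count())                                    # action: triggers the actual job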

The most important libraries for Apache Spark:

1) Spark SQL: A module for structured data manipulation using SQL or the DataFrame API (a small sketch follows this list).

2) Spark Streaming: A module for building stream processing applications the same way you write batch jobs. It supports Java, Scala, and Python (see the streaming sketch below).

3) MLlib: A scalable machine learning library.
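Here is a small, self-contained Spark SQL sketch showing the same query expressed through the DataFrame API and through SQL. It is written against the Spark 1.x-style SQLContext that HDInsight shipped at the time, and the rows are made up inline purely for illustration:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="SparkSqlSketch")
sqlContext = SQLContext(sc)

# Made-up rows instead of a real table.
people = sc.parallelize([Row(name="Ann", age=34), Row(name="Bob", age=29)])
df = sqlContext.createDataFrame(people)

df.filter(df.age > 30).select("name").show()   # DataFrame API

df.registerTempTable("people")                 # the same query in SQL
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()

sc.stop()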


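And a minimal Spark Streaming sketch using the DStream API: it counts words arriving on a TCP socket in 5-second micro-batches, with the same map/reduce style you would use in a batch job. The host and port are placeholders for whatever source you actually have.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, 5)                    # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print each batch's word counts

ssc.start()
ssc.awaitTermination()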
Some useful links and resources for Apache Spark:

1) Apache Spark homepage: https://spark.apache.org/

2) Learning Python:

3) Learning Scala:

HDInsight Apache Spark provides tons of tools out of the box; check out this link to see why you would use Apache Spark in HDInsight on Azure:


Hope this helps! Enjoy crushing streams of data...
