In this blog post I share the basic concepts of Apache Spark for developers. My goal is to introduce developers and engineers with no big data experience to what Apache Spark is and why it matters when building big data solutions on Hadoop.
The target reader should have at least some experience building business applications or products (desktop, web, or mobile) in an object-oriented language such as C++, Java, or C#.
What is Apache Spark?
Apache Spark is a distributed computation framework for big data. It is an open source platform for processing large amounts of data within the Hadoop ecosystem.
Because it is a distributed platform, there are some important concepts to understand first:
1) Every Spark application contains a driver program, the main entry point of the application, which runs the main function and launches various parallel operations on a cluster.
2) Spark provides the resilient distributed dataset (RDD), a collection of data elements partitioned across the nodes of a cluster so they can be operated on in parallel.
3) We can persist RDD in memory to allow it to be reused efficiently across parallel operations.
4) RDDs can automatically recover from node failures.
Tip: To start working with RDDs in Spark, an RDD typically starts from a file in HDFS or any Hadoop-supported file system.
5) Spark supports shared variables in parallel operations. There are two types: broadcast variables, which cache a read-only value on every node, and accumulators, which are variables that workers can only add to (for example, counters and sums).
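To make these ideas concrete, here is a minimal pure-Python sketch, not the Spark API, of how a partitioned dataset, a broadcast-style value, and an accumulator-style counter fit together. All names here are illustrative.

```python
from functools import reduce

# Illustrative sketch only -- plain Python, not the Spark API.
# An "RDD" here is just a list of partitions (sub-lists), as if each
# partition lived on a different cluster node.
data = list(range(1, 11))          # the full dataset: 1..10
num_partitions = 3
partitions = [data[i::num_partitions] for i in range(num_partitions)]

# A broadcast-style shared variable: a read-only value every "node" sees.
broadcast_factor = 10

# An accumulator-style shared variable: workers only ever add to it.
records_processed = 0

def process_partition(part):
    """Map each element of one partition, as a worker node would."""
    global records_processed
    records_processed += len(part)
    return [x * broadcast_factor for x in part]

# "Parallel" map over partitions, then combine the partial results.
mapped = [process_partition(p) for p in partitions]
total = reduce(lambda a, b: a + b, (sum(p) for p in mapped))

print(total)              # 10 * (1 + 2 + ... + 10) = 550
print(records_processed)  # 10
```

In real Spark the partitions live on different machines and the map runs truly in parallel, but the mental model is the same.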
You can write Spark applications in the Scala, Python, or Java programming languages.
To connect to Spark, you need a Spark context object, which in turn requires a Spark configuration object. The configuration object holds information about your application.
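Assuming the PySpark package is installed, a minimal setup looks like the following sketch. The application name and the `local[*]` master URL are placeholder values you would adjust for your own cluster.

```python
# Minimal PySpark setup sketch; assumes the pyspark package is installed.
from pyspark import SparkConf, SparkContext

# The configuration object describes your application to the cluster.
conf = SparkConf().setAppName("MyFirstSparkApp").setMaster("local[*]")

# The context object is the entry point for talking to Spark.
sc = SparkContext(conf=conf)

# A first RDD: distribute a local collection across the cluster.
rdd = sc.parallelize(range(1, 11))
print(rdd.sum())  # 55

sc.stop()
```

`local[*]` runs Spark locally using all available cores, which is handy for learning before you point the same code at a real cluster.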
Spark also ships with Scala and Python shells, where you can write and execute code interactively against the cluster nodes.
More to come in the following posts...