Thursday, January 26, 2017

Mashing RDDs in Apache Spark from an RDBMS perspective

Hi,

Happy new year! This is my first post of 2017! 2016 was an amazing year for me: lots of work, projects, and achievements. Looking forward to 2017.

I am writing this blog post to cover the standard techniques for joining data with Resilient Distributed Datasets (RDDs) in Apache Spark.


I would like to share some insights from working with RDDs in Spark, specifically how to work with multiple RDDs the way we do in relational database management systems (RDBMSs).

Apache Spark supports joins on pair RDDs, so you can implement all the kinds of joins we know from an RDBMS. Below I list how you would implement each of them on this platform, with a short example after the list.

Apache Spark Join Transformations:

1) join: This is equivalent to an inner join in an RDBMS. It returns a new pair RDD whose elements contain all possible pairs of values from the first and second RDDs that share the same key. For keys that exist in only one of the two RDDs, the resulting RDD has no elements.

2) leftOuterJoin: This is equivalent to a left outer join in an RDBMS. The resulting RDD also contains elements for those keys that don't exist in the second RDD.

3) rightOuterJoin: This is equivalent to a right outer join in an RDBMS. The resulting RDD also contains elements for those keys that don't exist in the first RDD.

4) fullOuterJoin: This is equivalent to a full outer join in an RDBMS. The resulting RDD contains elements for every key that exists in either of the two RDDs.
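
To make this concrete, here is a minimal sketch in Scala showing all four transformations. The sample data, the app name, and the employees/departments variable names are made up for illustration; the expected outputs in the comments assume exactly this data, and the ordering of collected results may vary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddJoinExamples {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-joins").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Pair RDDs keyed by a department id (hypothetical sample data)
    val employees   = sc.parallelize(Seq((1, "Alice"), (2, "Bob"), (3, "Carol")))
    val departments = sc.parallelize(Seq((1, "Sales"), (2, "HR"), (4, "Finance")))

    // Inner join: only keys present in both RDDs (here: 1 and 2)
    employees.join(departments).collect()
    // -> (1,(Alice,Sales)), (2,(Bob,HR))

    // Left outer join: all keys from employees; a missing department becomes None
    employees.leftOuterJoin(departments).collect()
    // -> (1,(Alice,Some(Sales))), (2,(Bob,Some(HR))), (3,(Carol,None))

    // Right outer join: all keys from departments; a missing employee becomes None
    employees.rightOuterJoin(departments).collect()
    // -> (1,(Some(Alice),Sales)), (2,(Some(Bob),HR)), (4,(None,Finance))

    // Full outer join: keys from either RDD; the missing side becomes None
    employees.fullOuterJoin(departments).collect()
    // -> (1,(Some(Alice),Some(Sales))), (2,(Some(Bob),Some(HR))),
    //    (3,(Some(Carol),None)), (4,(None,Some(Finance)))

    sc.stop()
  }
}
```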

If the RDDs contain duplicate keys, those keys are joined multiple times: the join produces one output element for every matching pair of values, as in the sketch below.
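
A minimal sketch of that duplicate-key behavior, reusing the SparkContext sc from the example above (the data is again made up):

```scala
// Duplicate keys: join emits one element per matching pair (a per-key Cartesian product)
val left  = sc.parallelize(Seq((1, "a"), (1, "b")))
val right = sc.parallelize(Seq((1, "x"), (1, "y")))

left.join(right).collect()
// -> 4 elements: (1,(a,x)), (1,(a,y)), (1,(b,x)), (1,(b,y)) (order may vary)
```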


Hope this helps!