Mostafa's Blog: December 2015

Monday, December 21, 2015

Learn R for Data Science Applications and Machine Learning

Hi All,

I am writing this blog post to highlight some good resources for learning to program in R. R is a widely used programming language among data scientists and engineers who build programmable components in big data solutions.

Here are the resources to get started:

1) Quick-R: the easiest and fastest way to learn R
http://www.statmethods.net/

2) Official CRAN (Comprehensive R Archive Network) resource:
https://cran.r-project.org/manuals.html

3) Introduction to R Programming: Free Course (On-Demand)
https://www.edx.org/course/introduction-r-programming-microsoft-dat204x-0

Hope this helps.

Sunday, December 20, 2015

Apache Spark for developers - part 1

In this blog post I am sharing the basic concepts of Apache Spark for developers. My goal is to explain to developers and engineers with no big data experience what Apache Spark is and why this platform matters when working with Hadoop big data solutions.

Readers should have at least some experience building business applications or products (desktop, web or mobile) in an object-oriented language such as C++, Java or C#.

What is Apache Spark?
Apache Spark is a distributed computation framework for big data. It is an open source platform for processing large amounts of data in Hadoop ecosystem solutions.

Because it is a distributed platform, there are a few important concepts to understand:

1) Every Spark application contains a driver program. It is the main entry point for the application: it runs the main function and executes various parallel operations on a cluster.

2) Spark provides the resilient distributed dataset (RDD), a collection of data elements partitioned across the nodes of a cluster that can be operated on in parallel.

3) We can persist an RDD in memory so that it can be reused efficiently across parallel operations.

4) RDDs can automatically recover from node failures.

Tip: An RDD typically starts from a file in HDFS or any other Hadoop-supported file system.

5) Spark supports shared variables in parallel operations. There are two types of shared variables in Spark, the first is broadcast variables and the second is accumulators.
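
The concepts above can be illustrated with a toy model in plain Python. This is only a sketch of the ideas, not Spark's API: the names `partition`, `Accumulator`, `COUNTRY_NAMES` and `run_parallel`, and the sample data, are made up for illustration, and a thread pool stands in for the cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def partition(data, num_partitions):
    """Split a list into chunks, loosely like the partitions of an RDD."""
    return [data[i::num_partitions] for i in range(num_partitions)]


class Accumulator:
    """Toy accumulator: workers only add to it, the driver reads the total."""

    def __init__(self):
        self._value = 0
        self._lock = Lock()

    def add(self, n):
        with self._lock:
            self._value += n

    @property
    def value(self):
        return self._value


# Toy "broadcast variable": a read-only lookup table shared by every worker.
COUNTRY_NAMES = {"us": "United States", "eg": "Egypt"}


def process_partition(chunk, unknown_codes):
    """Work done on one partition: translate codes, count the unknown ones."""
    names = []
    for code in chunk:
        name = COUNTRY_NAMES.get(code)
        if name is None:
            unknown_codes.add(1)
        else:
            names.append(name)
    return names


def run_parallel(data, num_partitions=3):
    """Toy "driver": partitions the data, runs the workers, collects results."""
    unknown_codes = Accumulator()
    chunks = partition(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(lambda c: process_partition(c, unknown_codes), chunks))
    flat = [name for part in results for name in part]
    return flat, unknown_codes.value


names, unknown = run_parallel(["us", "eg", "xx", "us", "zz", "eg"])
```

In real Spark the partitions live on different machines and the framework handles scheduling and recovery; the sketch only shows how the driver, partitions, broadcast data and accumulators relate to each other.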

You can write Spark applications in Scala, Python and Java programming languages.

To connect to Spark, you need a Spark context object, which in turn requires a Spark configuration object. The Spark configuration object contains information about your application.

Spark ships with Scala and Python shells where you can write code and execute it against the Spark cluster nodes.

More to come in the following posts...

- ME

Tuesday, December 15, 2015

Data Science and Machine Learning Training Course for Free

Hi All,

To get started with data science and machine learning principles, I strongly recommend this free course from Microsoft Virtual Academy (MVA).

Some key points from the course:

Data science is about using data to make decisions that drive actions.

Data science involves:

Finding data

Acquiring data

Cleaning and transforming data

Understanding relationships in data

Delivering value from data

Predictive analytics is about using past data to predict future values.

Prescriptive analytics is about using those predictions to drive decisions.
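
As a rough illustration of the difference between the two, the sketch below fits a straight line to made-up monthly sales figures (predictive) and then applies a hypothetical restocking rule to the forecast (prescriptive). The data, the `fit_line`, `predict_next` and `decide_restock` helpers, and the stock level are assumptions for illustration, not material from the course.

```python
def fit_line(ys):
    """Least-squares fit of y = intercept + slope * x for x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_next(ys):
    """Predictive step: extrapolate the fitted line one period ahead."""
    intercept, slope = fit_line(ys)
    return intercept + slope * len(ys)


def decide_restock(forecast, stock_on_hand):
    """Prescriptive step (hypothetical rule): order enough to cover the forecast."""
    return max(0, round(forecast - stock_on_hand))


past_sales = [100, 110, 120, 130]          # made-up past data
forecast = predict_next(past_sales)        # predicted sales for the next period
order = decide_restock(forecast, stock_on_hand=90)
```

The first two functions only describe what is likely to happen; the last one turns that prediction into an action, which is the part the course calls prescriptive.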

Learn more in this course:

https://mva.microsoft.com/en-US/training-courses/data-science-and-machine-learning-essentials-14100?l=UyhoTxWdB_3505050723

Enjoy!

Monday, December 14, 2015

Get Started with Apache Spark Resources in HDInsight

Hi All,

I am writing this blog post to share some introductory topics and foundational points about the Apache Spark framework for people who are just starting out.

Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications.

The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory computation capabilities make it a good choice for iterative algorithms in machine learning and graph computations.

You can write applications for Spark clusters in Python, Scala and R. HDInsight includes out-of-the-box notebooks (tools/dev IDEs) that allow data scientists to write Spark programs using:

1) Python, using Jupyter. Jupyter also supports R, Julia, and Scala.

2) Scala, using Zeppelin.

Most important libraries for Apache Spark:

1) Spark SQL: A module for structured data manipulation using SQL or the DataFrame API.

http://spark.apache.org/sql/

2) Spark Streaming: A module for building stream processing apps the same way you write batch jobs. It supports Java, Scala and Python.

http://spark.apache.org/streaming/

3) MLlib: A library for machine learning.

http://spark.apache.org/mllib/

Some useful links and resources for Apache Spark:

1) Apache Spark homepage:

http://spark.apache.org/

2) Learning Python:

https://www.python.org/doc/

3) Learning Scala:

http://www.scala-lang.org/documentation/

HDInsight Apache Spark provides many tools out of the box. Check out this link to see why you would use HDInsight Apache Spark in Azure:

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-overview/

Hope this helps! Enjoy crushing streams of data...

Saturday, December 12, 2015

The best way to test website compatibility issues

Hi All,

I had a chance to attend one of the New York City JavaScript events (JS Open NYC), hosted at the Microsoft NYC office. I had the opportunity to chat with dozens of open source front-end developers about web compatibility and interoperability in modern web design and development.

While I was talking with the attendees, I introduced new scanning tools that any PM, BA or front-end engineer can use to test a website for compatibility issues.

Microsoft has developed tools that scan and test your website for free. The website below lists all the available tools.

Microsoft Edge Dev Center homepage:
https://dev.windows.com/en-us/microsoft-edge/

On this website there are four tools. I will go through each of them in this blog post.

1) Quick Scan tool: (my favorite one for technical analysis)
The best tool for a quick scan of your website. It points out out-of-date libraries, layout issues and other things to change so that your website is compatible with most modern browsers.

Url: https://dev.windows.com/en-us/microsoft-edge/tools/staticscan/

This tool is open source. Here is the source code repo on GitHub:
https://github.com/MicrosoftEdge/static-code-scan

2) Browser Screenshots tool: (my favorite one for UX)
This tool shows you how your website looks in different browsers. Very handy and useful.

Url: https://dev.windows.com/en-us/microsoft-edge/tools/screenshots/

3) Virtual Machine (VM):
You can download VMs covering the different IE versions, with different OS options including Linux and Mac. This is a useful tool for internal and intranet sites.

Url: https://dev.windows.com/en-us/microsoft-edge/tools/vms/

4) Remote App:
A way to test your site through a RemoteApp session from whatever OS you use.

Url: https://dev.windows.com/en-us/microsoft-edge/tools/remote/

I also had a discussion about Chakra, the JavaScript engine that powers Microsoft Edge. Microsoft has announced that the core of the engine will be open sourced as ChakraCore and will be available on GitHub in January 2016. Check out the official announcement from the Microsoft Edge Dev Team:

https://blogs.windows.com/msedgedev/2015/12/05/open-source-chakra-core/

Hope this helps!

Wednesday, December 09, 2015

Get Started with Apache Storm Resources in HDInsight

Hi All,

In this blog post I'd like to share some useful resources for getting started with Apache Storm.
Apache Storm is a distributed real-time computation system that allows engineers to process streams of data at scale.

Apache Storm is one of the major components of the Hadoop ecosystem. Engineers use it to process data from streaming sources on its way into Hadoop.

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Imagine you need to process an endless source of data (such as a Facebook news feed or a Twitter feed) and then store the results in Hadoop. In this case, you would build a Storm application and specify its topology by defining the sources of information (spouts) and the processing steps applied to the data (bolts).

Every Storm application contains a topology: a set of spouts and bolts, plus a specification file for the topology.
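
The spout-and-bolt idea can be sketched with Python generators. This is a toy model, not the Storm API: `sentence_spout`, `split_bolt`, `count_bolt`, `run_topology` and the sample sentences are hypothetical, and a real spout would read from an unbounded source rather than a fixed list.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream; emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count) tuples."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Topology: wires spout -> split bolt -> count bolt and keeps the latest counts."""
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest


word_counts = run_topology(["Storm processes streams", "streams of tuples"])
```

Each stage handles one tuple at a time and passes its output downstream, which is the shape a topology describes; Storm adds the distribution, parallelism and delivery guarantees on top.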

I compiled some useful resources to get started and work with Apache Storm:

1) Apache Storm main website:
http://storm.apache.org/index.html

2) HDInsight Hadoop documentation in Azure:
https://azure.microsoft.com/en-us/documentation/services/hdinsight/

3) SCP.NET: get started building .NET apps in C# on Storm:
https://github.com/hdinsight/hdinsight-storm-examples/blob/master/SCPNet-GettingStarted.md

4) Power of Storm with examples:
https://github.com/hdinsight/hdinsight-storm-examples/blob/master/README.md

5) edX free online course (Implementing Real-Time Analytics with Hadoop in Azure HDInsight):
https://www.edx.org/course/implementing-real-time-analytics-hadoop-microsoft-dat202-2x

Hope this helps.

Thursday, December 03, 2015

Working with HBase in Azure

I have published a video on how to work with HBase tables in an HDInsight HBase cluster. The video is a walk-through of basic CRUD operations in HBase.

The video covers the following topics:
1) How to connect to the hbase shell tool.
2) How to create tables in HBase.
3) How to select, insert and update records in HBase.
4) Understanding the create, put, delete and deleteall commands in HBase.

The video uses a basic "Order" table structure as an example and runs all of the above operations against it.
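
To make the difference between these commands concrete, here is a toy in-memory model in Python. It is not an HBase client and it ignores versions and timestamps; the `ToyTable` class, the column families chosen for the "Order" table, and the sample values are assumptions for illustration. It mirrors how put writes a single cell (covering both insert and update), delete removes one cell, and deleteall removes a whole row.

```python
class ToyTable:
    """Toy model of an HBase table: row key -> {"family:qualifier": value}."""

    def __init__(self, name, *families):
        # Like `create`: column families are fixed when the table is created.
        self.name = name
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, column, value):
        """Write one cell; the same call inserts a new cell or updates an existing one."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Read all cells of one row (empty dict if the row does not exist)."""
        return dict(self.rows.get(row_key, {}))

    def delete(self, row_key, column):
        """Remove a single cell; the row disappears once its last cell is gone."""
        cells = self.rows.get(row_key, {})
        cells.pop(column, None)
        if not cells:
            self.rows.pop(row_key, None)

    def deleteall(self, row_key):
        """Remove every cell of the row."""
        self.rows.pop(row_key, None)


orders = ToyTable("Order", "customer", "items")    # hypothetical families
orders.put("order1", "customer:name", "Sara")      # insert
orders.put("order1", "items:count", "2")           # insert
orders.put("order1", "items:count", "3")           # same cell again: update
orders.delete("order1", "customer:name")           # removes one cell only
after_delete = orders.get("order1")
orders.deleteall("order1")                         # removes the whole row
after_deleteall = orders.get("order1")
```

Note that there is no separate update command in this model: writing to an existing cell with put replaces its value, which is the main adjustment for engineers used to SQL's INSERT and UPDATE.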

Video Link:

https://channel9.msdn.com/Blogs/MostafaElzoghbi/Working-with-HBase-in-Azure

Enjoy!

Tuesday, December 01, 2015

HBase introduction in Azure

Hi,

I have published a new Channel 9 video introducing HBase in Azure.

The video covers what HDInsight clusters are, which cluster types are available, and which Hadoop ecosystem components Microsoft Azure offers. It focuses on the HDInsight HBase cluster type, the need for HBase in the Hadoop ecosystem to store NoSQL data, and the available tools (such as the hbase shell) and commands for manipulating data within HBase tables.

The video also covers the column families concept for engineers who come from an RDBMS background.

This video helps any engineer with no Hadoop experience understand the role of HBase in Hadoop and big data applications.

Channel 9 video URL:
https://channel9.msdn.com/Blogs/MostafaElzoghbi/HBase-Introduction-in-Azure

Enjoy!