Wednesday, August 31, 2016

Building Big Data Solutions in Azure Data Platform @ Data Science MD

Hi All,

Yesterday I was at Johns Hopkins University in Laurel, MD presenting how to build big data solutions in Azure. The presentation focused on the underlying technologies and tools needed to build end-to-end big data solutions in the cloud. I presented the capabilities Azure offers out of the box, in addition to the cluster types and tiers that are available for ISVs and developers.

The session covers the following:

1) What an HDInsight cluster offers in the Hadoop ecosystem technology stack.
2) HDInsight cluster tiers and types.
3) HDInsight developer tools in Visual Studio 2015.
4) Working with HBase databases and the Hive View, and deploying Hive apps from Visual Studio.
5) Building, debugging, and deploying Storm apps into Storm clusters.
6) Working with Spark clusters using Jupyter and PySpark.


Session Title: Building Big Data Solutions in Azure Data Platform

Session Details:
The session covers how to get started building big data solutions in Azure. Azure provides different cluster types for the Hadoop ecosystem. The session covers a basic understanding of HDInsight clusters, including Apache Hadoop HDFS, HBase, Storm, and Spark, and shows how to integrate with HDInsight from .NET using different Hadoop integration frameworks and libraries. The session is a jump start for engineers and DBAs with RDBMS experience who want to start working with and developing Hadoop solutions. The session is demo driven and covers the basics of the Hadoop open source products.

Friday, August 26, 2016

Study notes for exam 70-475: Designing and Implementing Big Data Analytics Solutions

Hi All,

Today I passed the "Designing and Implementing Big Data Analytics Solutions" Microsoft exam.

I have been preparing for this exam (70-475) for a couple of months, and I have been using Hadoop ecosystem tools and platforms for a while.

I wanted to master building big data analytics solutions using HDInsight clusters and the Hadoop ecosystem, which includes Storm, Spark, HBase, Hive, and HDFS. I also worked to cover any gaps in my understanding of Azure Data Lake, Python & R programming, and Azure Machine Learning.

This exam primarily covers the following four technologies (from most covered to least):

1) Hadoop ecosystem: Working with HDFS, HBase, Hive, Storm, Spark and understanding Lambda Architecture. If you want to know more about Lambda Architecture, read my blog post explaining it here.

2) Azure Machine Learning: building and training models, predictive models, classification vs. regression vs. clustering, recommender algorithms, building custom models, executing code in R and Python, ingesting data from Azure Event Hubs, and transformation in Stream Analytics.

3) Azure Data Lake: building pipelines, activities, and linked services; moving, transforming, and analyzing data; working with storage options in Azure (blob vs. block) and tools to transform data.

4) SQL Server and Azure SQL: security in transit and at rest, SQL Data Warehouse, and working with R in SQL Server 2016/Azure SQL.


My study notes while preparing to pass this test:

1) To protect data at rest as well as while querying in Azure SQL Database: use "Always Encrypted" to make sure data in transit is encrypted, and use "Transparent Data Encryption" to make sure that data at rest is encrypted. Read more about TDE here, and read more about the Always Encrypted feature here.

2) If you get an "out of memory" error when running an Azure ML experiment, here is how to fix it:
   a) Increase the memory settings for the map and reduce operations in the import module.
   b) Use a Hive query to limit the amount of data being processed in the import module.

3) The easiest way to manage Hadoop clusters in Azure is to assign every HDInsight cluster to a resource group and to apply tags to all related resources.

4) In Hadoop, when the data is row-based, self-describing with an embedded schema, and requires compact binary data serialization, it is recommended to use Avro.

5) Which Hadoop cluster type to use for query and analysis batch jobs:
     a) Spark: a cluster for in-memory processing, interactive queries, and micro-batch stream processing.
     b) Storm: real-time event processing.
     c) HBase: NoSQL data storage for big data systems.

6) Tips for importing data using Python in Azure ML:
    a) Missing values are converted to NA for processing; NA values are converted back to missing values when converted back to datasets.
    b) Azure ML datasets are converted to pandas data frames; the pandas module is used to work with data in Python.
    c) Columns with numeric names are not ignored; the str() function is applied to them.
    d) Duplicate column names are not ignored; the duplicate column names are modified to make sure they are unique.
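
To make item 6 concrete, here is a minimal sketch of an Execute Python Script module in Azure ML Studio; the column name and the fill logic are made up for illustration. The input dataset arrives as a pandas data frame, and the data frame you return becomes the output dataset.

# Minimal sketch of an Azure ML "Execute Python Script" module (illustrative only).
# The input dataset is handed to azureml_main as a pandas DataFrame, and the
# DataFrame(s) returned become the module's output dataset(s).
def azureml_main(dataframe1=None, dataframe2=None):
    # Missing values arrive as NA/NaN in pandas (tip 6a)
    dataframe1 = dataframe1.fillna(0)

    # Numeric column names have had str() applied (tip 6c), so "2016" is a string here
    if "2016" in dataframe1.columns:
        dataframe1 = dataframe1.rename(columns={"2016": "year_2016"})

    # Return the DataFrame(s) to pass downstream
    return dataframe1,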

7) The only Hadoop file storage format among these options that supports ACID transactions is Apache ORC.

8) You have three utilities you can use to move data from local storage to managed cluster blob storage. These tools are: Azure CLI, PowerShell & AzCopy.

9) How to improve Hive queries using static vs. dynamic partitioning: read more here.

10) Understand when to use Filter Based Feature Selection in Azure ML.

11) Azure ML requires Python visualizations to be stored as PNG files. Configure matplotlib to use the AGG backend for rendering and save charts as PNG files.
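
For example, a minimal setup sketch (the plotted data is just a placeholder):

# Configure matplotlib for Azure ML: select the AGG backend before importing pyplot,
# then save the chart as a PNG file so it can be surfaced as a visualization.
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [4, 2, 5])   # placeholder data
plt.savefig("chart.png")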

12) To detect potential SQL injection attempts on an Azure SQL database in an ADL cluster: enable Threat Detection.

13) To create synthetic samples of a dataset for classes that are underrepresented: use the SMOTE module in Azure ML.

14) D14 v2 virtual machines in Azure support 100 GB of in-memory processing.

15) You can add multiple contributors to AzureML workspace as users.

16) Understand the minimum requirements for each cluster type in HDInsight:
       a) At least 1 data node for Hadoop cluster type.
       b) At least 1 region server for HBase cluster type.
       c) Two Nimbus nodes for Storm cluster type.
       d) At least 1 worker role for Spark cluster type.

17) If you want to store a file larger than 1 TB, you need to use Azure Data Lake Store.

18) In Azure Data Factory (ADF), you can train, score and publish experiments to AzureML using:
      a) AzureML Batch execution: to train and score.
      b) AzureML Update resource activity: to update AzureML web services.

19) In Azure Data Factory (ADF), a pipeline is used to group several activities, including their sequence and timing, so the activities in a pipeline can be managed as a single unit.

20) Working with R models in SQL Server 2016/AzureSQL: read more here.

21) Apache Spark in HDInsight can read files from Azure Blob storage (WASB) but not from SQL Server.
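
As an illustration, here is a minimal PySpark sketch of reading a file over WASB; the container, storage account, and file path are placeholders (in an HDInsight Jupyter notebook, sc is already created for you):

# Minimal PySpark sketch: read a file from Azure Blob storage over WASB.
# Container, storage account, and path below are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="WasbReadSample")  # already available as "sc" in HDInsight notebooks
lines = sc.textFile("wasb://mycontainer@mystorageaccount.blob.core.windows.net/data/sample.csv")
print(lines.count())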

22) Always Encrypted protects data both in transit and at rest. This feature also allows you to store the encryption keys on premises.

23) Transparent Data Encryption (TDE): secures data at rest; it does not protect data in transit, and the keys are stored in the cloud.


24) DistCp is a Hadoop tool to copy data to and from HDInsight cluster blob storage and Azure Data Lake Store.

25) AdlCopy is a command-line utility to copy data from Azure Blob storage into an Azure Data Lake Store account.

26) AzCopy is a tool to copy data to and from Azure Blob storage.

27) When working with large binary files and you would like to optimize the speed of an Azure ML experiment, you can do the following:
      a) Write the data as block blobs.
      b) The blob format should be CSV or TSV.
      c) Do NOT turn off the cached results option.
      d) You cannot filter the data using SQL; use the R language instead.


28) SQL DB contributor role allows monitoring and auditing of SQL databases without granting permissions to modify security or audit policies.

29) To process data in HDInsight clusters in Azure Data Factory (ADF):
      a) Add a new item to the pipeline in the solution explorer.
      b) Select Hive Transformation.
      c) Construct JSON to process the cluster data in an activity.

30) Understanding Tumbling vs Hopping vs Sliding Windows in Azure Stream Analytics. (link)

Hope this helps you get ready to pass the test, and good luck everyone!
Let's all get certified, y'all data wranglers :-)


-- ME


References:
1) Microsoft Exam 70-475 details, skills measured, and more:
https://www.microsoft.com/en-us/learning/exam-70-475.aspx


Thursday, August 25, 2016

GIT 101 in Visual Studio Team Services (VSTS)

Hi All,

I have been working with multiple developers on sharing project code using Git. I found out that Git is new to a lot of developers who have been using Visual Studio Team Foundation Server (TFS), Visual Studio Online (VSO, now known as VSTS), or any other centralized source control system.

What is the difference between TFS/VSO/VSTS versus Git?

If you have been using TFS, VSTS, or VSO, those all fall under Team Foundation Version Control (TFVC), which is a centralized source control system.

Git, on the other hand, is a distributed version control system (DVCS), which means you have local and remote code repositories. You can commit your code to your local repo without pushing it to the remote repo (unless you want to). You can then share your code to the remote repo so other team members can get these changes.
This is a fundamental concept to understand when working with Git: Git is distributed, has local and remote repos, works offline, and is a great way to enable collaboration among developers.

**Popular git platforms: GitHub, VSTS, Bitbucket, GitLab, RhodeCode and others.

This article focuses on managing code when multiple developers work in a team, and the Git best practices around that; this applies to any other Git platform as well. For the sake of simplicity, this article focuses on using Git in VSTS.

Basic terminology and keywords to know when working with Git:

1) Branch: In Git, every developer should have their own branch. You write code and commit your changes to your local branch. To sync with other developers, get the latest from the master branch and merge it into yours, so you can make sure everything compiles and works before creating a new pull request to the master branch (which merges your code back into master).

2) Fetch: Downloads changes from the remote branch. Fetch downloads these commits and adds them to the local repo without updating your local branch. To update your local branch, execute a merge (or a pull) so it is up to date with its remote.

3) Pull: Gets updates from your remote branch into your local branch; it basically keeps your branch up to date with its remote one. Pull does a fetch and then a merge into your local branch, so just use Pull to update your local branch from its remote one.

4) Pull vs. Fetch: git pull does a git fetch first, so if you used git pull you have already executed a git fetch. Use fetch if you want to get the updates but do not want to merge them into your local branch yet.

5) Push: Sends your committed changes to the remote branch so they are shared with others. The command-line equivalents of these operations are sketched below.
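
Here is a rough sketch of the command-line equivalents of these operations (Team Explorer runs the same Git commands under the hood; the repo URL and branch name are just examples):

# clone the remote repo and create your own branch
git clone https://myaccount.visualstudio.com/DefaultCollection/_git/MyProject
git checkout -b dev1

# commit locally, then share your branch with the remote
git add .
git commit -m "Describe your change"
git push -u origin dev1

# stay up to date with the remote
git fetch                 # download remote commits without merging them
git pull origin master    # fetch + merge into the current branch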


Basic rules for working with Git in Visual Studio that everyone should be aware of before they start coding:
This section covers all the actions needed to work with Git in the Visual Studio Team Explorer window.




1) You need to click on Sync in Team Explorer to refresh the current branch from the remote branch, followed by Pull to get those changes merged into the current local branch. Sync just shows the status; to actually merge those changes you need to click the Pull link.

2) You need to click on Changes in Team Explorer every time you want to check in or get the latest updates of the current branch.

3) You need to click on Branches in Team Explorer every time you want to manage branches in Visual Studio.

4) You need to click on Pull Requests in Team Explorer every time you want to manage pull requests in Visual Studio.


A) Setup a project for your team using Git in VSTS:

1) Visit http://visualstudio.com.
2) Log in to your account.
3) Click on the New button to create a new Git project.



4) Once you hit the Create Project button, the project will be created in a few seconds, and then we will use Visual Studio to do some necessary steps.

5) To start using Visual Studio with the created project, click on the Code tab for the newly created project "MicrosoftRocks".

6) Click on the "Clone in Visual Studio" button.
7) This will open up VS and then open the Team Explorer window.
8) Click on "Clone this repository"; this will allow you to create a local repo of the remote repo we have just created in VSTS.



9) Select a local repo folder and click on Clone.



10) Now, VS shows a message that we can create a new project or solution.


11) You can go ahead and create any project in VS; the only thing to note is to uncheck the "Create new Git repository" checkbox when creating the new project, since we have already created our local repo.




12) First things first: you need to exclude the bin and debug folders from getting checked into Git. So, click on Settings in Team Explorer --> click on the Repository Settings link under Git --> click on the Add link to add a .gitignore file.



13) To edit the .gitignore file, click on the Edit link. Then, add the following at the bottom of the file:

# exclude bin and debug folders
bin
debug


14) Build the project and then we will do our first check-in to the master branch.

15) Click on the Home icon in Team Explorer to go back to the home page to manage source control options.

16) Click on Changes, type a check-in message, and then click on Commit Staged.



17) The Commit Staged action checks in all our changes to our local repo. These changes have not been shared to the remote yet, so we need to sync to share them with others.
You will notice that VS shows you a Sync link afterwards so you can sync the changes immediately, or you can click on Sync from Team Explorer and then click on Push.



18) Now the project, along with its .gitignore file, is ready in the master branch for everyone; each developer can now create their own branch and start developing.


B) Create your own branch in Visual Studio:

1) Every developer in a team should create their own branch and get the latest from master to start developing in the project.

2) From Visual Studio, click on the master branch in the bottom bar and click on New Branch.



3) Enter your branch name ("dev1"), choose which branch to create yours from ("master"), and then click on the Create Branch button. This step will create your own branch, get the latest from master, and switch to your branch so you can start coding in it.

4) You will notice that the name of the current branch has changed from master to dev1 in Visual Studio. Now you can start working in your branch.

5) Once you are done coding a feature, or are at a good point to check in some code, follow these steps to check in your changes:
  • Click on Changes in Team Explorer, write a message, and then click on the Commit All button.
  • You can also click on Sync to push these changes to the remote branch in VSTS online.
  • Remember, these changes are still in your branch; no one else has seen them until you submit them to the master branch.

6) Publish your branch: it is important to publish your branch to VSTS. Follow these steps:
  • From team explorer, click on branches.
  • Right click on your branch.
  • Click on Publish Branch.




C) How to submit your code to the master branch:

1) First, you need to make sure that your local master branch is up to date. To do that, switch to the master branch, click on Sync, and then click on Pull in the Team Explorer window.

2) Second, switch back to your branch "dev1" and then click on Branches in Team Explorer.

3) Click on Merge link.

4) Select to merge from "master" into "dev1" and then click on Merge. This step merges all master changes into your branch so your branch gets other people's work; fix any conflicts (if any) before submitting all your changes to master using a Pull Request (PR).



5) Now, after making sure there are no conflicts, we need to submit all these changes to the master branch. Click on Pull Requests in Team Explorer.

6) Click on New Pull Request link.



7) This will open the VSTS web page for submitting a new pull request.
8) Click on the New Pull Request button.

9) Submitted Pull Requests (PRs) will be either approved or rejected by the repository admins. Unless you are an admin, you will not be able to approve/reject and complete submitted PRs; once a PR is completed, the changes are committed/merged to the master branch.


10) Click on the Complete button to complete the pull request. You will be prompted with a popup window where you can add any notes; then click on the Complete Merge button. This is the last step to merge your changes into master after your PR has been approved.



11) Repeat the same steps every time you want to merge your changes into master using PRs.



Hope this article has provided a detailed walkthrough of how to work in a team using Git in Visual Studio Team Services and how to manage your check-ins, check-outs, merges, branching, and PRs in Git.

Enjoy!

-- ME








Tuesday, August 23, 2016

How to create websites with MySQL database in Azure

Hi,

Microsoft recently announced Azure App Service support for the In-App MySQL feature (still in preview).

What does "In-App MySQL" in App Service mean?

It means that a MySQL database is provisioned that shares resources with your web app. MySQL In-App enables developers to run a MySQL server side-by-side with their web application within the same environment, which makes it easier to develop and test PHP applications that use MySQL.

So, you can have your MySQL In-App database alongside your website in Azure App Service, with both sharing the same resources. There is no need to provision a separate VM for MySQL or purchase ClearDB for websites under development. The feature is available for new or existing web apps in Azure.

When moving to production, we definitely recommend moving off the In-App MySQL database, since the intention is to keep this feature for development and testing purposes only.

In-App MySQL is like hosting a SQL Server Express DB instance in your app before moving it to a full SQL Server instance.

How to provision MySQL In-App to Azure App Service?

Create a new web app or select an existing web app, and you will find the "MySQL In App (Preview)" option. Turn MySQL In App on and then save.


Current limitations of the MySQL In App feature:
1) The auto scaling feature is not supported.
2) Enabling local cache is not supported.
3) You can access your database only using the phpMyAdmin web tool or the Kudu debug console.
4) Web Apps and WordPress templates support MySQL In App when you provision them in the Azure Portal. The team is working to expand this to other services in the Azure portal.


Hope this helps.


Reference:
1) MySQL in-app for web apps: https://blogs.msdn.microsoft.com/appserviceteam/2016/08/18/announcing-mysql-in-app-preview-for-web-apps

Monday, August 22, 2016

Avro vs Parquet vs ORCFile as Hadoop storage files

While developing big data applications and systems in Hadoop, every time we store data in a Hadoop cluster we think about the best way to store it. There are tons of challenges when storing petabytes of data, including how much storage is required and how to read your data faster!

In Hadoop, you can store your files in many formats. I would like to share some of these options and when to use each of them.

How to store data files in Hadoop and what are the available options:

1) AVRO
Apache Avro™ is a data serialization system. Avro provides row-based data storage, the schema is encoded in the file, and it provides binary data serialization.

Use Case: Use Avro if you have a requirement to support binary data serialization for your data while maintaining a self-contained schema in row-based data files.

Read more about Avro: http://avro.apache.org/


2) PARQUET
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Use Case: You want to store data in column-based files and save on storage. Parquet uses efficient encoding and compression schemes to represent data on your Hadoop clusters, and it works with different processing frameworks and programming languages.

Read more about Apache Parquet: https://parquet.apache.org/



3) ORCFile
Apache Orc is the smallest, fastest columnar storage for Hadoop workloads. ORC is a self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written.

Use Case: Use ORC when you need columnar storage in Hadoop with efficient and fast data retrieval. An ORC file contains its own schema, which makes reading values very fast.

Read more about Apache ORC: https://orc.apache.org/
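
To make the comparison concrete, here is a minimal PySpark sketch that writes the same DataFrame in each of the three formats. The input and output paths are placeholders, Avro output assumes the spark-avro package is available on the cluster, and ORC output assumes Hive support is enabled:

# Minimal PySpark sketch: write one DataFrame as Avro, Parquet, and ORC.
# Paths are placeholders; spark-avro must be on the cluster for the Avro write.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="StorageFormatsSample")
sqlContext = HiveContext(sc)  # HiveContext enables the ORC writer on Spark 1.x

df = sqlContext.read.json("wasb:///example/data/sample.json")

df.write.format("com.databricks.spark.avro").save("/example/output/avro")  # row-based, binary
df.write.parquet("/example/output/parquet")                                # columnar
df.write.orc("/example/output/orc")                                        # columnar, type-aware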


Hope this helps!

Wednesday, August 17, 2016

How to read images from a URL in asp.net core

Hi,

I was building an ASP.NET Core Web API that is supposed to read images from an external URL. Even though I have done this dozens of times, I got stuck for a bit trying to bring the same code that reads an image from a URL into my ASP.NET Core project using Visual Studio 2015.

After a little bit of searching, I found out that before trying to read a static file such as an image from your controller, you first need to enable directory browsing and configure the routing path so you are able to view the image in a browser by hitting its URL.


So, follow the steps below to be able to read images from a URL (in my case, these images were part of the project):

1) Move the images folder (or any static files folder) under the wwwroot folder.
2) Open the startup.cs file and enable directory browsing.


C# code to enable directory browsing and serving static files:

// Requires: using Microsoft.Extensions.FileProviders; and using System.IO;
public void Configure(IApplicationBuilder app, IHostingEnvironment env, ILoggerFactory loggerFactory)
{
    loggerFactory.AddConsole(Configuration.GetSection("Logging"));
    loggerFactory.AddDebug();

    app.UseApplicationInsightsRequestTelemetry();
    app.UseApplicationInsightsExceptionTelemetry();

    // Serve static files and enable directory browsing for wwwroot\images
    app.UseStaticFiles();
    app.UseStaticFiles(new StaticFileOptions()
    {
        FileProvider = new PhysicalFileProvider(
            Path.Combine(Directory.GetCurrentDirectory(), @"wwwroot\images")),
        RequestPath = new PathString("/images")
    });

    app.UseDirectoryBrowser(new DirectoryBrowserOptions()
    {
        FileProvider = new PhysicalFileProvider(
            Path.Combine(Directory.GetCurrentDirectory(), @"wwwroot\images")),
        RequestPath = new PathString("/images")
    });

    app.UseMvc();
}

3) Run your app and try to load an image from the browser, for example:
http://localhost:12354/images/test1.jpg

4) You will be able to view the image in the browser. Now, let's read the image from a URL in C#.

// Read the image from a remote URL
using (HttpClient c = new HttpClient())
{
    using (Stream s = await c.GetStreamAsync(imgUrl))
    {
        // do any logic with the image stream: save it, store it, etc.
    }
}

If you haven't done step #3 (this is where I got stuck!), the GetStreamAsync method will throw an exception (404 Not Found) because the app hasn't been configured to serve static files.

Hope this helps!

References:
1) Working with static files in asp.net core:
https://docs.asp.net/en/latest/fundamentals/static-files.html#enabling-directory-browsing

Wednesday, August 03, 2016

Building Big Data Solutions using Hadoop in Azure


Hi All,

Today I am in New York City presenting how to build data solutions in Azure. The presentation is focused on the underlying technologies and tools that are needed to build big data solutions.

The session also covers the following:

1) What an HDInsight cluster offers in the Hadoop ecosystem technology stack.
2) HDInsight cluster tiers and types.
3) HDInsight developer tools in Visual Studio 2015.
4) Working with HBase databases and the Hive View.
5) Building, debugging, and deploying Storm apps.
6) Working with Spark clusters.


Session Title: Building Big Data Solutions in Azure.

Session Details:
The session covers how to get started building big data solutions in Azure. Azure provides different cluster types for the Hadoop ecosystem. The session covers a basic understanding of HDInsight clusters, including Apache Hadoop HDFS, HBase, Storm, and Spark, and shows how to integrate with HDInsight from .NET using different Hadoop integration frameworks and libraries. The session is a jump start for engineers and DBAs with RDBMS experience who want to start working with and developing Hadoop solutions. The session is demo driven and covers the basics of the Hadoop open source products.

Event Url:  http://www.html5report.com/conference/newyork/agenda.aspx?t=#D1-8




Hope this helps!




Tuesday, August 02, 2016

Working with Hive in HDInsight

Hi,

While working on building big data solutions with Azure HDInsight clusters, I found some really nice new tools that have been added to HDP (the Hortonworks Data Platform) to help you easily work with Hive and HBase data stores.

In this blog post, I would like to share that you can manage your Hive databases and queries using Hive View in HDInsight clusters.

I have provisioned a Linux-based Spark cluster in HDInsight. Spark clusters come with preloaded tools, frameworks, and services; the Hive service is preloaded and configured by default as well.


Follow these steps to work with Hive:

1) From the Azure Portal, select your HDInsight cluster.
2) Click on Dashboard.
3) Enter your admin username and password.
4) This opens the Ambari home page for your cluster.



5) From the top right corner, click on Hive View.




6) You will be able to write SQL statements as Hive queries, just as you are used to.



The Hive View also provides other capabilities such as defining UDFs and uploading tables to Hive. If you are on a Spark cluster, you can also query the same Hive tables from a Jupyter notebook with PySpark, as sketched below.
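
A minimal sketch of what that looks like, assuming the default hivesampletable that ships with HDInsight clusters is still present (in the cluster's Jupyter PySpark notebooks, sc and sqlContext are usually pre-created for you):

# Minimal PySpark sketch: run a HiveQL query against the cluster's Hive tables.
# Assumes the default "hivesampletable" sample table exists.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="HiveQuerySample")  # pre-created as "sc" in HDInsight notebooks
sqlContext = HiveContext(sc)

df = sqlContext.sql("SELECT clientid, market, devicemodel FROM hivesampletable LIMIT 10")
df.show()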


Hope this helps.