Friday, August 26, 2016

Study notes for exam 70-475: Designing and Implementing Big Data Analytics Solutions

Hi All,

Today I passed the "Designing and Implementing Big Data Analytics Solutions" Microsoft exam.

I have been preparing for this exam (70-475) for a couple of months, and I have been using Hadoop ecosystem tools and platforms for a while.

I wanted to master building big data analytics solutions with HDInsight clusters and the Hadoop ecosystem: Storm, Spark, HBase, Hive, and HDFS. I also worked to cover any gaps in my understanding of Azure Data Lake, Azure Machine Learning, and Python & R programming.

This exam primarily covers the following four technologies (from most covered to least):

1) Hadoop ecosystem: working with HDFS, HBase, Hive, Storm, and Spark, and understanding the Lambda Architecture. If you want to know more about the Lambda Architecture, read my blog post explaining it here.

2) Azure Machine Learning: building and training models, predictive models, classification vs. regression vs. clustering, recommender algorithms, building custom models, and executing code in R and Python. Also, ingesting data from Azure Event Hubs and transforming it in Azure Stream Analytics.

3) Azure Data Lake and Azure Data Factory: building pipelines, activities, and linked services; moving, transforming, and analyzing data; working with storage options in Azure (blob vs. block) and tools to transform data.

4) SQL Server and Azure SQL: security in transit and at rest, SQL Data Warehouse, and working with R in SQL Server 2016/Azure SQL.


Here are my study notes from preparing for this exam:

1) To protect data in Azure SQL Database both at rest and while it is being queried: use "Always Encrypted" to make sure data in transit is encrypted, and use "Transparent Data Encryption" (TDE) to make sure data at rest is encrypted. Read more about TDE here, and read more about the Always Encrypted feature here.

2) If you get an "Out of memory" error when running an Azure ML experiment, here is how to fix it:
   a) Increase the memory settings for the map and reduce operations in the import module.
   b) Use a Hive query to limit the amount of data being processed by the import module.

3) The easiest way to manage Hadoop clusters in Azure is to assign every HDInsight cluster to a resource group and to apply tags to all related resources.

4) In Hadoop, when the data is row-based, self-describing (the schema travels with the data), and needs compact binary serialization, it is recommended to use Avro. A short sketch follows below.
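
A minimal sketch of what Avro serialization looks like from Python, using the open-source fastavro package (the package choice, record type, and field names are my own illustration, not exam content):

    # Row-based, compact binary serialization; the schema is written into the
    # file header, which is what makes Avro data self-describing.
    from fastavro import parse_schema, writer, reader

    schema = parse_schema({
        "name": "Reading",
        "type": "record",
        "fields": [
            {"name": "sensor_id", "type": "string"},
            {"name": "value", "type": "double"},
        ],
    })

    records = [{"sensor_id": "s-1", "value": 21.5},
               {"sensor_id": "s-2", "value": 19.0}]

    with open("readings.avro", "wb") as out:
        writer(out, schema, records)

    # Any reader can recover both the schema and the rows from the file.
    with open("readings.avro", "rb") as inp:
        for rec in reader(inp):
            print(rec)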

5) Which Hadoop cluster type to use for query and analysis jobs:
     a) Spark: a cluster for in-memory processing, interactive queries, and micro-batch stream processing.
     b) Storm: real-time event processing.
     c) HBase: NoSQL data storage for big data systems.

6) Tips for importing data using Python in Azure ML (see the pandas sketch after this list):
    a) Missing values are converted to NA for processing; NA values are converted back to missing values when the data frame is converted back to a dataset.
    b) Azure ML datasets are converted to pandas data frames; the pandas module is used to work with data in Python.
    c) Numeric column names are not ignored; the str() function is applied to them.
    d) Duplicate column names are not ignored; they are modified to make sure every column has a unique name.
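
The sketch below reproduces these behaviours with plain pandas (an illustration of the rules above, not the Azure ML runtime itself; the column names and the renaming helper are made up):

    import pandas as pd

    # A frame with a missing value, a duplicate column name, and a numeric column name.
    df = pd.DataFrame([[1.0, None, 5], [2.0, 3.0, 6]], columns=["a", "a", 2021])

    # a) Missing values surface as NA/NaN for processing.
    print(df.isna())

    # c) Numeric column names are not ignored; str() is applied to them.
    df.columns = [str(c) for c in df.columns]

    # d) Duplicate column names are not ignored; they are renamed to be unique.
    def make_unique(names):
        seen, out = {}, []
        for name in names:
            count = seen.get(name, 0)
            out.append(name if count == 0 else "%s (%d)" % (name, count))
            seen[name] = count + 1
        return out

    df.columns = make_unique(df.columns)
    print(df.columns.tolist())   # ['a', 'a (1)', '2021']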

7) The only Hadoop file storage format that supports ACID transactions (via Hive transactional tables) is Apache ORC.

8) There are three utilities you can use to move data from local storage to a managed cluster's blob storage: Azure CLI, PowerShell, and AzCopy.

9) How to improve Hive queries using static vs. dynamic partitioning: read more here.

10) Understand when to use Filter-Based Feature Selection in Azure ML.

11) Azure ML requires Python visualizations to be stored as PNG files. To configure matplotlib in Azure ML, set it to use the AGG backend for rendering and save your charts as PNG files (see the sketch below).
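
A minimal sketch of an Execute Python Script entry point doing exactly that; the plotting call and file name are just examples:

    import matplotlib
    matplotlib.use("Agg")            # select the AGG backend before importing pyplot
    import matplotlib.pyplot as plt

    def azureml_main(dataframe1=None, dataframe2=None):
        # Render a chart and save it as a PNG so Azure ML can display it.
        dataframe1.plot(kind="hist")
        plt.savefig("chart.png")
        return dataframe1,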

12) To detect potential SQL injection attempts on an Azure SQL database: enable Threat Detection.

13) To create synthetic samples for classes that are under-represented in a dataset: use the SMOTE module in Azure ML (see the sketch below).
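
Azure ML's SMOTE is a drag-and-drop module, but the same technique is available in the open-source imbalanced-learn package; a rough sketch under that assumption (dataset and parameters are illustrative only):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # A deliberately imbalanced two-class dataset (95% / 5%).
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
    print("before:", Counter(y))

    # SMOTE synthesizes new samples of the minority class.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print("after:", Counter(y_res))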

14) D14 v2 virtual machines in Azure support 100 GB of in-memory processing.

15) You can add multiple contributors to an Azure ML workspace as users.

16) Understand the minimum requirements for each cluster type in HDInsight:
       a) At least 1 data node for Hadoop cluster type.
       b) At least 1 region server for HBase cluster type.
       c) Two Nimbus nodes for Storm cluster type.
       d) At least 1 worker role for Spark cluster type.

17) If you want to store a file larger than 1 TB, you need to use Azure Data Lake Store.

18) In Azure Data Factory (ADF), you can train, score, and publish experiments to Azure ML using:
      a) The Azure ML Batch Execution activity: to train and score.
      b) The Azure ML Update Resource activity: to update Azure ML web services.

19) In Azure Data Factory (ADF), a pipeline is used to configure several activities, including their sequence and timing; the activities in a pipeline can be managed as a unit.

20) Working with R models in SQL Server 2016/Azure SQL: read more here.

21) Apache Spark in HDInsight can read files from Azure blob storage (WASB) but not SQL Server.

22) Always Encrypted protects data both in transit and at rest. This feature also allows you to store the encryption keys on premises.

23) Transparent Data Encryption (TDE): secures data at rest; it does not protect data in transit, and the keys are stored in the cloud.


24) DistCp is a Hadoop tool to copy data between an HDInsight cluster's blob storage and Azure Data Lake Store.

25) AdlCopy: a command-line utility to copy data from Azure Blob storage into an Azure Data Lake Store account.

26) AzCopy: a tool to copy data to and from Azure Blob storage.

27) When working with large binary files and you would like to optimize the speed of an Azure ML experiment, you can do the following (a short upload sketch follows this list):
      a) Write the data as block blobs.
      b) The blob format should be CSV or TSV.
      c) Do NOT turn off the cached results option.
      d) You cannot filter the data using SQL, but you can filter it with R.
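
A hedged sketch of points (a) and (b) using the azure-storage-blob Python package (v12 API assumed; the connection string, container, and blob names are placeholders):

    import pandas as pd
    from azure.storage.blob import BlobClient

    df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.7, 0.4]})
    csv_data = df.to_csv(index=False)          # (b) store the data as CSV/TSV

    blob = BlobClient.from_connection_string(
        conn_str="<storage-connection-string>",
        container_name="experiments",
        blob_name="training-data.csv",
    )
    # (a) upload as a block blob (the default blob type for upload_blob).
    blob.upload_blob(csv_data, blob_type="BlockBlob", overwrite=True)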


28) The SQL DB Contributor role allows monitoring and auditing of SQL databases without granting permissions to modify security or audit policies.

29) To process data in HDInsight clusters in Azure Data Factory (ADF):
      a) Add a new item to the pipeline in the solution explorer.
      b) Select Hive Transformation.
      c) Construct JSON to process the cluster data in an activity.

30) Understand tumbling vs. hopping vs. sliding windows in Azure Stream Analytics (link). A pandas sketch of the three window types follows below.
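
The windows themselves are defined in the Stream Analytics query language, but the intuition can be shown with pandas on a timestamped event stream (an analogy of mine, not the Stream Analytics engine; the sizes and hop interval are arbitrary):

    import pandas as pd

    events = pd.DataFrame(
        {"value": range(12)},
        index=pd.date_range("2016-08-26 10:00:00", periods=12, freq="5s"),
    )

    # Tumbling: fixed 30-second buckets that never overlap.
    tumbling = events.resample("30s").count()

    # Hopping: 30-second windows that start every 10 seconds, so they overlap.
    hopping = [events[start:start + pd.Timedelta("30s")].count()
               for start in pd.date_range(events.index[0], events.index[-1], freq="10s")]

    # Sliding: for every event, aggregate over the preceding 30 seconds.
    sliding = events.rolling("30s").count()

    print(tumbling, sliding, sep="\n")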

Hope this helps you get ready to pass the test, and good luck everyone!
Let's all get certified, y'all data wranglers :-)


-- ME


References:
1) Microsoft Exam 70-475 details, skills measured and more:
https://www.microsoft.com/en-us/learning/exam-70-475.aspx

