R on Microsoft Azure: how to win the hackathon. Instructions for use

    The standard plan of any hackathon
    Microsoft Azure Machine Learning Hackathon
    R, one of the most popular programming languages among data scientists, is receiving more and more support, both in the open-source community and from private companies that have traditionally been developers of proprietary products. One such company is Microsoft, whose rapidly growing support for the R language in its products and services is what caught my attention.

    One of the “locomotives” of integrating R with Microsoft products is the Microsoft Azure cloud platform. Moreover, an excellent opportunity has come up to take a closer look at the R + Azure combination: the machine learning hackathon hosted by Microsoft this weekend (May 21-22).

    A hackathon is an event where time, like coffee, is an extremely valuable resource. With that in mind, I have previously written about best practices for training models in Azure Machine Learning. But Azure ML is not a prototyping tool; it is rather a service for building a product with an SLA, with all the ensuing costs in both development time and cost of ownership.

    R is perfect for building prototypes, digging into data, and quickly testing your hypotheses: everything we need in this kind of competition! Below I will show how to use the full power of R in Azure, from prototyping to publishing the finished model in Azure Machine Learning.



    Motivating off-topic
    As at the previous hackathon (yes, this is not the first ML hackathon from Microsoft), you will have the opportunity to code in your favorite Python / R / C#, tweak the knobs in Azure Machine Learning, chat with like-minded people and experts, not get enough sleep, drink free coffee and eat delicious cookies. And the most cunning will make the world a better place and receive well-deserved prizes!

    0. Microsoft loves R


    Let's start with the list of Microsoft products and services that let us work with R:
    1. Microsoft R Server / R Server for Azure HDInsight
    2. Data Science VM
    3. Azure Machine Learning
    4. SQL Server R Services
    5. Power BI
    6. R Tools for Visual Studio

    And (oh joy!) products 1-3 are available to us in Azure under the IaaS / PaaS model. We will consider them in turn.

    1. Microsoft R Server (+ for Azure HDInsight)


    After last year’s acquisition of the well-known Revolution Analytics, Revolution R Open (RRO) and Revolution R Enterprise (RRE) were renamed Microsoft R Open (MRO) and Microsoft R Server, respectively. Microsoft R Server is now a well-built ecosystem consisting of both open-source products and proprietary Revolution Analytics modules.


    Source

    A central place is occupied by R + CRAN: 100% compatibility is guaranteed, both with the R language itself and with existing packages. Another key component of R Server is Microsoft R Open, a runtime with improved performance for matrices and mathematical functions and with better multithreading support.
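
    The multithreading claim is easy to check for yourself. A quick sketch (assuming MRO with the MKL math libraries installed, in which case the RevoUtilsMath package is available; timings will of course vary by machine):

        library(RevoUtilsMath)       # ships with MRO when MKL is installed
        getMKLthreads()              # how many threads MKL currently uses
        m <- matrix(rnorm(4e6), nrow = 2000)
        system.time(m %*% m)         # multithreaded matrix multiply
        setMKLthreads(1)             # pin MKL to a single thread
        system.time(m %*% m)         # the same multiply, noticeably slower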

    The ConnectR module provides access to data stored in Hadoop, Teradata Database, and other sources.

    R Server for Azure HDInsight adds, on top of all this, the ability to run R scripts directly on a Spark cluster in the Azure cloud. This solves the problem of the data not fitting into the RAM of the machine on which the R script runs locally. Instructions are attached.
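
    To give a feel for how this looks, here is a minimal sketch of moving rx-computations onto the cluster (it assumes RevoScaleR running on an R Server for HDInsight edge node; the HDFS path and column names below are hypothetical):

        library(RevoScaleR)
        # switch the compute context: subsequent rx* calls run on the Spark cluster
        rxSetComputeContext(RxSpark())
        # point at data in the cluster's HDFS instead of local RAM
        hdfs <- RxHdfsFileSystem()
        txData <- RxTextData("/hackathon/transactions.csv", fileSystem = hdfs)
        # fit a distributed linear model over data that need not fit in memory
        fit <- rxLinMod(TransAmount ~ MccCode, data = txData)
        summary(fit)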

    Azure HDInsight itself is a cloud service that provides a Hadoop / Spark cluster on demand. Since it is a managed service, the only administrative tasks are deploying and deleting the cluster. That's it! Not a second spent on configuring the cluster, installing updates, setting up access, and so on.

    Creating / deleting an 8-node Hadoop cluster (HDI 3.3)


    To create the cluster, we either click three buttons (image above) or run the following simple PowerShell script [source]:

     Login-AzureRmAccount

    # Set these variables first (the values below are placeholders)
    $resourceGroupName = "<resource group name>"
    $storageAccountName = "<storage account name>"
    $storageAccountKey = "<storage account key>"
    $containerName = "<container name>"

    $clusterName = $containerName # As a best practice, use the same name for the cluster and the container
    $clusterNodes = 8 # The number of nodes in the HDInsight cluster
    $credentials = Get-Credential -Message "Enter Cluster user credentials" -UserName "admin"
    $sshCredentials = Get-Credential -Message "Enter SSH user credentials"

    # The location of the HDInsight cluster. It must be in the same data center as the Storage account.
    $location = Get-AzureRmStorageAccount -ResourceGroupName $resourceGroupName `
        -StorageAccountName $storageAccountName | %{$_.Location}

    # Create a new HDInsight cluster
    New-AzureRmHDInsightCluster -ClusterName $clusterName `
        -ResourceGroupName $resourceGroupName -HttpCredential $credentials `
        -Location $location -DefaultStorageAccountName "$storageAccountName.blob.core.windows.net" `
        -DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainer $containerName  `
        -ClusterSizeInNodes $clusterNodes -ClusterType Hadoop  `
        -OSType Linux -Version "3.3" -SshCredential $sshCredentials
                        
    To delete the cluster, either click one button and confirm, or execute the following one-line PowerShell command:

    Remove-AzureRmHDInsightCluster -ClusterName $clusterName


    2. Data Science VM


    If you suddenly want 32 CPU cores, 448 GB of RAM, and ~0.5 TB of SSD with the following preinstalled and configured:
    • Microsoft R Server Developer Edition,
    • Anaconda Python distribution,
    • Jupyter Notebooks for Python and R,
    • Visual Studio Community Edition with Python and R Tools,
    • Power BI desktop,
    • SQL Server Express edition.

    If you are going to write in R, Python, or C# and use SQL, and have also decided that you cannot do without xgboost, Vowpal Wabbit, or CNTK (the open-source deep learning library from Microsoft Research), then the Data Science Virtual Machine is what you need: all the products listed above, and more, come preinstalled and ready to work. Deployment is simple, and there are instructions for it.

    3. Azure Machine Learning


    Azure Machine Learning (Azure ML) is a cloud service for solving machine learning tasks. Almost certainly, Azure ML will be the central service you use if you want to train a model in the Azure cloud.

    A detailed account of Azure ML is beyond the scope of this post, especially since plenty has already been written about the service: Azure ML for Data Scientists, Best Practices for Training Models in Azure ML. We will focus on the following task: organizing team work with the most painless transfer of R scripts from the local computer to Azure ML Studio.



    3.1. Initial requirements


    For this, you will need the following free software products:
    • For conservatives: R (runtime), R Studio (IDE).
    • For democrats: R (runtime), Microsoft R Open (runtime), Visual Studio Community 2015 (IDE), R Tools for Visual Studio (IDE extension).

    To work with Azure, you will need an active Microsoft Azure subscription.

    3.2. Getting started: share everything


    One Azure ML workspace for everyone


    We create one (!) workspace in Azure ML for the entire team and share it among all team members.

    One code repository for everyone


    We create one cloud Team Project (TFS in Azure) / GitHub repository and likewise share it with the whole team.

    I think it is obvious what happens next: the part of the team working on a given hackathon task commits to one repository, develops features in branches, merges branches into master; in short, normal team work on the code proceeds.

    One set of initial data for everyone


    Go to Azure ML Studio (the web IDE), open the “Datasets” tab, and upload the initial dataset to the cloud. Generate a Data Access Code and send it to the team.

    This is how the data loading interface looks in Azure ML Studio:




    Listing 1. R script for loading the data
    library("AzureML")

    ws <- workspace(
      id = "",     # workspace ID (Azure ML Studio: Settings)
      auth = "",   # authorization token for the workspace
      api_endpoint = "https://europewest.studioapi.azureml.net")

    data.raw <- download.datasets(
      dataset = ws,
      name = "ML-Hackathon-2016-dataset")
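
    Right after pulling the dataset, it is worth a quick sanity check that everyone on the team sees the same thing:

        str(data.raw)    # structure and column types of the shared dataset
        head(data.raw)   # first rows, to confirm everyone is on the same data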

    3.3. Jupyter Notebook: running R scripts in the cloud and visualizing the results


    Once the data, the code, and the Azure ML project are shared across the whole team, it is time to learn how to share the visual results of our research.
    Traditionally, the data science community likes to use Jupyter Notebook for this task: a client-server web application that lets the developer combine, in a single document, code (R, Python), the results of its execution (including graphics), and rich-text explanations.

    Creating a Jupyter Notebook in Azure ML:
    1. Create a separate Jupyter Notebook document per participant.
    2. Load the single shared initial dataset from Azure ML (the code from Listing 1). The code also works when launched from local R Studio, so nothing new needs to be written for the Jupyter Notebook: we simply copy the code over from R Studio.



    3. Share the link to the Jupyter Notebook document with the team; we discuss and extend it directly in the Jupyter Notebook.

    As a result, each hackathon task should end up with several Jupyter Notebook documents:
    • containing R scripts and the results of their execution;
    • which the whole team has brainstormed over and thought about;
    • with the full flow: from loading the data to the result of applying the machine learning algorithm.

    This is how it looks for me:



    3.4. From prototype to production


    At this stage we have several studies that produced an acceptable result, and, corresponding to these studies:
    • in GitHub / the Team Project: branches with R scripts;
    • in Jupyter Notebook: documents with the results, already discussed within the team.

    The next step is to create experiments in Azure ML Studio (the “Experiments” tab), hereinafter AzureML experiments.

    At this stage, you should adhere to the following best practices when transferring R code into an AzureML experiment:

    Modules:
    1. If possible, do not use the built-in “Execute R Script” module as a container for executing R code: it has no versioning support (code changes made inside the module cannot be rolled back), and the module together with its R code cannot be reused in another experiment.
    2. Instead, use the ability to upload custom R modules (Custom R Module) to Azure ML (the upload process is described below). A Custom R Module has a unique name and a description, and can be reused across different AzureML experiments.

    R scripts:
    1. Organize the R scripts inside R modules as a set of functions with a single entry point.
    2. Transfer to Azure ML as R code only the functionality that is impossible or difficult to reproduce with the built-in Azure ML Studio modules.
    3. R code in the modules executes under the following restrictions: no access to persistent storage and no network connection.

    In accordance with the rules above, let's transfer our R code into an AzureML experiment. To do this we need a zip archive consisting of two files, both shown below (a quick local smoke test of the code follows them):
    1. An .R file containing the code we are about to transfer to the cloud.
      Example of finding / filtering outliers in the data
      PreprocessingData <- function(dataset1, dataset2, swap = F, color = "red") {
        # do something
        # ...

        # detect outliers
        range <- GetOutlinersRange(dataset1$TransAmount)
        ds <- dataset1[dataset1$TransAmount >= range[["Lower"]] &
                       dataset1$TransAmount < range[["Upper"]], ]

        return(ds)
      }

      # outlier detection for normally distributed values
      GetOutlinersRange <- function(values, na.rm = F) {
        # interquartile range: IQR = Q3 - Q1
        Q1 <- quantile(values, probs = c(0.25), na.rm = na.rm)
        Q3 <- quantile(values, probs = c(0.75), na.rm = na.rm)
        IQ <- Q3 - Q1

        # acceptable (non-outlier) interval: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
        range <- c(Q1 - 1.5*IQ, Q3 + 1.5*IQ)
        names(range) <- c("Lower", "Upper")

        return(range)
      }


    2. An .xml file containing the definition / metadata of our R function.
      Example (the Arguments section is included just for the breadth of the example)

      <!-- Reconstructed following the Custom R Module XML schema;
           sourceFile/entryPoint assume the .R file above is saved as PreprocessingData.R -->
      <Module name="Preprocessing Data">
        <Owner>Dmitry Petukhov</Owner>
        <Description>Preprocessing dataset for ML Hackathon Demo.</Description>
        <Language name="R" sourceFile="PreprocessingData.R" entryPoint="PreprocessingData" />
        <Ports>
          <Input id="dataset1" name="Transactions Log" type="DataTable">
            <Description>Transactions Log</Description>
          </Input>
          <Input id="dataset2" name="MCC List" type="DataTable">
            <Description>MCC List</Description>
          </Input>
          <Output id="dataset" name="Processed dataset" type="DataTable">
            <Description>Processed dataset</Description>
          </Output>
          <Output id="deviceOutput" name="View Port" type="Visualization">
            <Description>View the R console graphics device output.</Description>
          </Output>
        </Ports>
        <Arguments>
          <Arg id="swap" name="Swap" type="bool">
            <Description>Swap input datasets.</Description>
          </Arg>
          <Arg id="color" name="Color" type="DropDown">
            <Properties>
              <!-- the DropDown items below are illustrative -->
              <Item id="red" name="Red" />
              <Item id="green" name="Green" />
              <Item id="blue" name="Blue" />
            </Properties>
            <Description>Select a color.</Description>
          </Arg>
        </Arguments>
      </Module>

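    Before zipping the two files, it does not hurt to smoke-test the entry point locally in a plain R session. A sketch with made-up sample data (only the TransAmount column matters here):

      set.seed(42)
      # hypothetical sample: 100 ordinary transactions plus one extreme value
      d1 <- data.frame(TransAmount = c(rnorm(100, mean = 500, sd = 100), 5000))
      d2 <- data.frame(MccCode = sample(5000:6000, 101, replace = TRUE))
      clean <- PreprocessingData(d1, d2)
      nrow(clean)    # the extreme value 5000 should have been filtered out
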
    We upload the resulting archive through Azure ML Studio and run the experiment, making sure that the script works and the model gets trained.



    Now you can improve the existing module, upload a new one, arrange a competition between them; in short, enjoy the benefits of encapsulation and a modular structure.
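
    And when a trained model deserves an API, the same AzureML package from Listing 1 can publish an R function as an Azure ML web service. A hedged sketch (the scoring function and its schema are purely illustrative, not the hackathon model):

        library("AzureML")
        # imagine the trained model captured in this function's closure
        predictAmount <- function(mccCode) {
          mccCode * 1.0    # placeholder for real scoring logic
        }
        api <- publishWebService(
          ws, fun = predictAmount,
          name = "ML-Hackathon-2016-predict",
          inputSchema = list(mccCode = "numeric"),
          outputSchema = list(ans = "numeric"))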


    Conclusion


    In my opinion, R is extremely effective for prototyping, and for that reason it has proven itself in all kinds of data science hackathons. At the same time, between a prototype and a product there is an insurmountable gap in such things as scalability, availability, and reliability.

    Using Azure's tooling for R, we can balance for a long time on the edge between the flexibility of R and the reliability (and other benefits) that Azure ML gives us.

    One more thing…


    Come to the Azure Machine Learning hackathon (the one I mentioned at the beginning), try all of this yourself, chat with experts and like-minded people, and spend (some would say kill) a weekend there. You can also find me there (on the jury).

    In addition, for those who find offline communication not enough, I invite you to a warm and cozy Slack chat, where hackathon participants can ask questions and share experience with each other, and after the hackathon discuss their ML solutions and keep up professional contacts.

    Ping me for a Slack invite via private messages on Habr or through any of the contacts you can find on my blog (I won't post a link; it is easy to find through my Habr profile).

