R on Microsoft Azure: how to win the hackathon. Instructions for use

    The standard plan of any hackathon
    Microsoft Azure Machine Learning Hackathon
    R, one of the most popular programming languages among data scientists, is receiving more and more support, both in the open-source community and from private companies that have traditionally been developers of proprietary products. One such company is Microsoft, whose rapidly growing support for the R language in its products and services is what caught my attention.

    One of the “locomotives” of integrating R with Microsoft products is the Microsoft Azure cloud platform. Moreover, an excellent opportunity has come up to take a closer look at the R + Azure combination: the machine learning hackathon hosted by Microsoft this weekend (May 21-22).

    A hackathon is an event where time, like coffee, is an extremely valuable resource. With that in mind, I have previously written about best practices for training models in Azure Machine Learning. But Azure ML is not a prototyping tool; it is rather a service for building a product with an SLA, with all the ensuing costs in both development time and cost of ownership.

    R is perfect for building prototypes, digging into data, and quickly testing your hypotheses: everything we need in this kind of competition! Below I will show how to use the full power of R in Azure, from prototyping to publishing the finished model in Azure Machine Learning.



    Motivating off-topic
    As at the previous hackathon (yes, this is not the first ML hackathon from Microsoft), you will have the opportunity to code in your favorite Python / R / C#, tweak the knobs in Azure Machine Learning, chat with like-minded people and experts, not get enough sleep, drink free coffee and eat delicious cookies. And the most cunning will make the world a better place and receive well-deserved prizes!

    0. Microsoft loves R


    Let's start with the list of Microsoft products and services that let us work with R:
    1. Microsoft R Server / R Server for Azure HDInsight
    2. Data Science VM
    3. Azure Machine Learning
    4. SQL Server R Services
    5. Power BI
    6. R Tools for Visual Studio

    And (oh joy!) products 1-3 are available to us in Azure under the IaaS / PaaS model. We will consider them in turn.

    1. Microsoft R Server (+ for Azure HDInsight)


    After last year’s acquisition of the well-known Revolution Analytics, Revolution R Open (RRO) and Revolution R Enterprise (RRE) were renamed Microsoft R Open (MRO) and Microsoft R Server, respectively. Microsoft R Server is now a well-built ecosystem consisting of both open-source products and proprietary Revolution Analytics modules.


    Source

    A central place is occupied by R + CRAN: 100% compatibility is guaranteed, both with the R language itself and with existing packages. Another key component of R Server is Microsoft R Open, a runtime with improved performance for matrices and mathematical functions and with better multithreading support.
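
    The multithreading claim is easy to check for yourself. A quick sketch (assuming MRO with the MKL math libraries installed, in which case the RevoUtilsMath package is available; timings will of course vary by machine):

        library(RevoUtilsMath)       # ships with MRO when MKL is installed
        getMKLthreads()              # how many threads MKL currently uses
        m <- matrix(rnorm(4e6), nrow = 2000)
        system.time(m %*% m)         # multithreaded matrix multiply
        setMKLthreads(1)             # pin MKL to a single thread
        system.time(m %*% m)         # the same multiply, noticeably slower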

    The ConnectR module provides access to data stored in Hadoop, Teradata Database, and other sources.

    R Server for Azure HDInsight adds, on top of all this, the ability to run R scripts directly on a Spark cluster in the Azure cloud. This solves the problem of the data not fitting into the RAM of the machine on which the R script runs locally. Instructions are attached.
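
    To give a feel for how this looks, here is a minimal sketch of moving rx-computations onto the cluster (it assumes RevoScaleR running on an R Server for HDInsight edge node; the HDFS path and column names below are hypothetical):

        library(RevoScaleR)
        # switch the compute context: subsequent rx* calls run on the Spark cluster
        rxSetComputeContext(RxSpark())
        # point at data in the cluster's HDFS instead of local RAM
        hdfs <- RxHdfsFileSystem()
        txData <- RxTextData("/hackathon/transactions.csv", fileSystem = hdfs)
        # fit a distributed linear model over data that need not fit in memory
        fit <- rxLinMod(TransAmount ~ MccCode, data = txData)
        summary(fit)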

    Azure HDInsight itself is a cloud service that provides a Hadoop / Spark cluster on demand. Since it is a managed service, the only administrative tasks are deploying and deleting the cluster. That's it! Not a second spent on configuring the cluster, installing updates, setting up access, and so on.

    Creating / deleting an 8-node Hadoop cluster (HDI 3.3)


    To create the cluster, we either click three buttons (image above) or run the following simple PowerShell script [source]:

     Login-AzureRmAccount

    # Set these variables first (the values below are placeholders)
    $resourceGroupName = "<resource group name>"
    $storageAccountName = "<storage account name>"
    $storageAccountKey = "<storage account key>"
    $containerName = "<container name>"

    $clusterName = $containerName # As a best practice, use the same name for the cluster and the container
    $clusterNodes = 8 # The number of nodes in the HDInsight cluster
    $credentials = Get-Credential -Message "Enter Cluster user credentials" -UserName "admin"
    $sshCredentials = Get-Credential -Message "Enter SSH user credentials"

    # The location of the HDInsight cluster. It must be in the same data center as the Storage account.
    $location = Get-AzureRmStorageAccount -ResourceGroupName $resourceGroupName `
        -StorageAccountName $storageAccountName | %{$_.Location}

    # Create a new HDInsight cluster
    New-AzureRmHDInsightCluster -ClusterName $clusterName `
        -ResourceGroupName $resourceGroupName -HttpCredential $credentials `
        -Location $location -DefaultStorageAccountName "$storageAccountName.blob.core.windows.net" `
        -DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainer $containerName  `
        -ClusterSizeInNodes $clusterNodes -ClusterType Hadoop  `
        -OSType Linux -Version "3.3" -SshCredential $sshCredentials
                        
    To delete the cluster, either click one button and confirm, or execute the following one-line PowerShell command:

    Remove-AzureRmHDInsightCluster -ClusterName $clusterName


    2. Data Science VM


    If you suddenly want 32 CPU cores, 448 GB of RAM, and ~0.5 TB of SSD with the following preinstalled and configured:
    • Microsoft R Server Developer Edition,
    • Anaconda Python distribution,
    • Jupyter Notebooks for Python and R,
    • Visual Studio Community Edition with Python and R Tools,
    • Power BI desktop,
    • SQL Server Express edition.

    If you are going to write in R, Python, or C# and use SQL, and have also decided that you cannot do without xgboost, Vowpal Wabbit, or CNTK (the open-source deep learning library from Microsoft Research), then the Data Science Virtual Machine is what you need: all the products listed above, and more, come preinstalled and ready to work. Deployment is simple, and there are instructions for it.

    3. Azure Machine Learning


    Azure Machine Learning (Azure ML) is a cloud service for solving machine learning tasks. Almost certainly, Azure ML will be the central service you use if you want to train a model in the Azure cloud.

    A detailed account of Azure ML is beyond the scope of this post, especially since plenty has already been written about the service: Azure ML for Data Scientists, Best Practices for Training Models in Azure ML. We will focus on the following task: organizing team work with the most painless transfer of R scripts from the local computer to Azure ML Studio.



    3.1. Initial requirements


    For this, you will need the following free software products:
    • For conservatives: R (runtime), R Studio (IDE).
    • For democrats: R (runtime), Microsoft R Open (runtime), Visual Studio Community 2015 (IDE), R Tools for Visual Studio (IDE extension).

    To work with Azure, you will need an active Microsoft Azure subscription.

    3.2. Getting started: share everything


    One Azure ML workspace for everyone


    We create one (!) workspace in Azure ML for the entire team and share it among all team members.

    One code repository for everyone


    We create one cloud Team Project (TFS in Azure) / GitHub repository and likewise share it with the whole team.

    I think it is obvious what happens next: the part of the team working on a given hackathon task commits to one repository, develops features in branches, merges branches into master; in short, normal team work on the code proceeds.

    One set of initial data for everyone


    Go to Azure ML Studio (the web IDE), open the “Datasets” tab, and upload the initial dataset to the cloud. Generate a Data Access Code and send it to the team.

    This is how the data loading interface looks in Azure ML Studio:




    Listing 1. R script for loading the data
    library("AzureML")

    ws <- workspace(
      id = "",     # workspace ID (Azure ML Studio: Settings)
      auth = "",   # authorization token for the workspace
      api_endpoint = "https://europewest.studioapi.azureml.net")

    data.raw <- download.datasets(
      dataset = ws,
      name = "ML-Hackathon-2016-dataset")
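
    Right after pulling the dataset, it is worth a quick sanity check that everyone on the team sees the same thing:

        str(data.raw)    # structure and column types of the shared dataset
        head(data.raw)   # first rows, to confirm everyone is on the same data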

    3.3. Jupyter Notebook: running R scripts in the cloud and visualizing the results


    Once the data, the code, and the Azure ML project are shared across the whole team, it is time to learn how to share the visual results of our research.
    Traditionally, the data science community likes to use Jupyter Notebook for this task: a client-server web application that lets the developer combine, in a single document, code (R, Python), the results of its execution (including graphics), and rich-text explanations.

    Creating a Jupyter Notebook in Azure ML:
    1. Create a separate Jupyter Notebook document per participant.
    2. Load the single shared initial dataset from Azure ML (the code from Listing 1). The code also works when launched from local R Studio, so nothing new needs to be written for the Jupyter Notebook: we simply copy the code over from R Studio.



    3. Share the link to the Jupyter Notebook document with the team; we discuss and extend it directly in the Jupyter Notebook.

    As a result, each hackathon task should end up with several Jupyter Notebook documents:
    • containing R scripts and the results of their execution;
    • which the whole team has brainstormed over and thought about;
    • with the full flow: from loading the data to the result of applying the machine learning algorithm.

    This is how it looks for me:



    3.4. From prototype to production


    At this stage we have several studies that produced an acceptable result, and, corresponding to these studies:
    • in GitHub / the Team Project: branches with R scripts;
    • in Jupyter Notebook: documents with the results, already discussed within the team.

    The next step is to create experiments in Azure ML Studio (the “Experiments” tab), hereinafter AzureML experiments.

    At this stage, you should adhere to the following best practices when transferring R code into an AzureML experiment:

    Modules:
    1. If possible, do not use the built-in “Execute R Script” module as a container for executing R code: it has no versioning support (code changes made inside the module cannot be rolled back), and the module together with its R code cannot be reused in another experiment.
    2. Instead, use the ability to upload custom R modules (Custom R Module) to Azure ML (the upload process is described below). A Custom R Module has a unique name and a description, and can be reused across different AzureML experiments.

    R scripts:
    1. Organize the R scripts inside R modules as a set of functions with a single entry point.
    2. Transfer to Azure ML as R code only the functionality that is impossible or difficult to reproduce with the built-in Azure ML Studio modules.
    3. R code in the modules executes under the following restrictions: no access to persistent storage and no network connection.

    In accordance with the rules above, let's transfer our R code into an AzureML experiment. To do this we need a zip archive consisting of two files, both shown below (a quick local smoke test of the code follows them):
    1. An .R file containing the code we are about to transfer to the cloud.
      Example of finding / filtering outliers in the data
      PreprocessingData <- function(dataset1, dataset2, swap = F, color = "red") {
        # do something
        # ...

        # detect outliers
        range <- GetOutlinersRange(dataset1$TransAmount)
        ds <- dataset1[dataset1$TransAmount >= range[["Lower"]] &
                       dataset1$TransAmount < range[["Upper"]], ]

        return(ds)
      }

      # outlier detection for normally distributed values
      GetOutlinersRange <- function(values, na.rm = F) {
        # interquartile range: IQR = Q3 - Q1
        Q1 <- quantile(values, probs = c(0.25), na.rm = na.rm)
        Q3 <- quantile(values, probs = c(0.75), na.rm = na.rm)
        IQ <- Q3 - Q1

        # acceptable (non-outlier) interval: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
        range <- c(Q1 - 1.5*IQ, Q3 + 1.5*IQ)
        names(range) <- c("Lower", "Upper")

        return(range)
      }


    2. An .xml file containing the definition / metadata of our R function.
      Example (the Arguments section is included just for the breadth of the example)

      <!-- Reconstructed following the Custom R Module XML schema;
           sourceFile/entryPoint assume the .R file above is saved as PreprocessingData.R -->
      <Module name="Preprocessing Data">
        <Owner>Dmitry Petukhov</Owner>
        <Description>Preprocessing dataset for ML Hackathon Demo.</Description>
        <Language name="R" sourceFile="PreprocessingData.R" entryPoint="PreprocessingData" />
        <Ports>
          <Input id="dataset1" name="Transactions Log" type="DataTable">
            <Description>Transactions Log</Description>
          </Input>
          <Input id="dataset2" name="MCC List" type="DataTable">
            <Description>MCC List</Description>
          </Input>
          <Output id="dataset" name="Processed dataset" type="DataTable">
            <Description>Processed dataset</Description>
          </Output>
          <Output id="deviceOutput" name="View Port" type="Visualization">
            <Description>View the R console graphics device output.</Description>
          </Output>
        </Ports>
        <Arguments>
          <Arg id="swap" name="Swap" type="bool">
            <Description>Swap input datasets.</Description>
          </Arg>
          <Arg id="color" name="Color" type="DropDown">
            <Properties>
              <!-- the DropDown items below are illustrative -->
              <Item id="red" name="Red" />
              <Item id="green" name="Green" />
              <Item id="blue" name="Blue" />
            </Properties>
            <Description>Select a color.</Description>
          </Arg>
        </Arguments>
      </Module>

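    Before zipping the two files, it does not hurt to smoke-test the entry point locally in a plain R session. A sketch with made-up sample data (only the TransAmount column matters here):

      set.seed(42)
      # hypothetical sample: 100 ordinary transactions plus one extreme value
      d1 <- data.frame(TransAmount = c(rnorm(100, mean = 500, sd = 100), 5000))
      d2 <- data.frame(MccCode = sample(5000:6000, 101, replace = TRUE))
      clean <- PreprocessingData(d1, d2)
      nrow(clean)    # the extreme value 5000 should have been filtered out
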
    We upload the resulting archive through Azure ML Studio and run the experiment, making sure that the script works and the model gets trained.



    Now you can improve the existing module, upload a new one, arrange a competition between them; in short, enjoy the benefits of encapsulation and a modular structure.
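
    And when a trained model deserves an API, the same AzureML package from Listing 1 can publish an R function as an Azure ML web service. A hedged sketch (the scoring function and its schema are purely illustrative, not the hackathon model):

        library("AzureML")
        # imagine the trained model captured in this function's closure
        predictAmount <- function(mccCode) {
          mccCode * 1.0    # placeholder for real scoring logic
        }
        api <- publishWebService(
          ws, fun = predictAmount,
          name = "ML-Hackathon-2016-predict",
          inputSchema = list(mccCode = "numeric"),
          outputSchema = list(ans = "numeric"))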


    Conclusion


    In my opinion, R is extremely effective for prototyping, and for that reason it has proven itself in all kinds of data science hackathons. At the same time, between a prototype and a product there is an insurmountable gap in such things as scalability, availability, and reliability.

    Using Azure's tooling for R, we can balance for a long time on the edge between the flexibility of R and the reliability (and other benefits) that Azure ML gives us.

    One more thing…


    Come to the Azure Machine Learning hackathon (the one I mentioned at the beginning), try all of this yourself, chat with experts and like-minded people, and spend (some would say kill) a weekend there. You can also find me there (on the jury).

    In addition, for those who find offline communication not enough, I invite you to a warm and cozy Slack chat, where hackathon participants can ask questions and share experience with each other, and after the hackathon discuss their ML solutions and keep up professional contacts.

    Ping me for a Slack invite via private messages on Habr or through any of the contacts you can find on my blog (I won't post a link; it is easy to find through my Habr profile).

