Azure Machine Learning for Data Scientists

    This article was created by our friend from the community, Dmitry Petukhov, a Microsoft Certified Professional and developer at Quantum Art.
    The article is part of the Fraud Detection series; the rest of the articles can be found in Dmitry's profile.




    Azure Machine Learning is a cloud service for predictive analytics tasks. The service is represented by two components: Azure ML Studio, a development environment accessible through a web interface, and Azure ML web services.
    A typical sequence of actions a data scientist follows when searching for patterns in a data set using supervised learning algorithms is depicted and described in detail below.



    Projects in Azure ML Studio are called experiments. Let's create an experiment and look at the set of tools that Azure ML offers the data scientist for each stage of the sequence illustrated above.

    Data retrieval


    The Reader control allows you to load both structured and semi-structured data sets. It supports loading from relational DBMSs (Azure SQL Database) as well as from non-relational sources: NoSQL stores (Azure Table, queries to Hive), OData services, and documents in various text formats from Azure Blob Storage or a URL (over HTTP).

    Manual data entry is also possible (the Enter Data control). To convert data between formats, elements from the Data Format Conversions section are used. The following output formats are available: CSV, TSV, ARFF, SVMLight.
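    As a sketch of what such a conversion does outside Azure ML, the same CSV-to-TSV transformation can be expressed in a few lines of Python with the standard csv module (the sample data is made up):

```python
import csv
import io

def csv_to_tsv(csv_text: str) -> str:
    """Convert CSV-formatted text to TSV, one of the output formats Azure ML supports."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

# A tiny made-up fraud-detection dataset
sample = "id,amount,is_fraud\n1,250.0,0\n2,9100.5,1\n"
print(csv_to_tsv(sample))
```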

    Data preparation


    Incomplete data / duplicate data

    In the general case, the researcher deals with incomplete data: the training set has missing values. The Clean Missing Data control allows you to delete a row or column containing missing data, or to replace a missing value with a constant, the mean, the median, or the mode.
    It is also not uncommon for a set to contain duplicate data, which in turn can significantly reduce the prediction accuracy of the future model. To remove duplicate data, use the Remove Duplicate Rows control.
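    The effect of both controls can be sketched in plain Python: `clean_missing` below imitates the median-replacement strategy of Clean Missing Data, and `remove_duplicate_rows` imitates Remove Duplicate Rows (the dataset and column names are made up):

```python
from statistics import median

def clean_missing(rows, col):
    """Replace None values in `col` with the column median
    (one of the strategies offered by Clean Missing Data)."""
    fill = median(r[col] for r in rows if r[col] is not None)
    return [{**r, col: r[col] if r[col] is not None else fill} for r in rows]

def remove_duplicate_rows(rows):
    """Keep only the first occurrence of each identical row."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

data = [
    {"amount": 100.0, "n_ops": 3},
    {"amount": None,  "n_ops": 5},   # missing value
    {"amount": 100.0, "n_ops": 3},   # duplicate row
    {"amount": 300.0, "n_ops": 1},
]
cleaned = remove_duplicate_rows(clean_missing(data, "amount"))
```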

    Data exploration


    Transformation and data cleansing

    Data transformation is one of the stages requiring the most manual work, especially if the data for the training set come from various sources: local CSV files, a distributed file system (HDFS), Hive. The absence of tools for querying heterogeneous sources in a uniform way can significantly complicate the work of a data analyst.

    After loading data into Azure ML, the researcher does not face the problem of unified access to heterogeneous data sources, but works with data from various sources in a uniform manner. The Manipulation section contains controls that allow you to perform inner/left/full join operations, project, add, and delete columns, group data by predictors, and even run arbitrary SQL transformations on loaded datasets (the Apply SQL Transformation control).
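    The spirit of the Join and Apply SQL Transformation controls can be sketched with Python's built-in sqlite3 module: once the data sits in one place, joins and aggregations are expressed uniformly in SQL regardless of where the data originally came from (the tables and values here are invented for illustration):

```python
import sqlite3

# Load two small datasets into an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (id INTEGER, account TEXT, amount REAL)")
conn.execute("CREATE TABLE accounts (account TEXT, country TEXT)")
conn.executemany("INSERT INTO tx VALUES (?, ?, ?)",
                 [(1, "A", 250.0), (2, "B", 9100.5), (3, "A", 40.0)])
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("A", "US"), ("B", "DE")])

# Inner join plus grouping, in the vein of the Join and Group controls
rows = conn.execute("""
    SELECT a.country, COUNT(*) AS n, SUM(t.amount) AS total
    FROM tx t INNER JOIN accounts a ON t.account = a.account
    GROUP BY a.country ORDER BY a.country
""").fetchall()
```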

    Defining the structure (metadata) of a dataset

    The Metadata Editor control allows you to explicitly specify the type of data (string, integer, timestamp, etc.) contained in particular columns, to mark the contents of a column as a predictor (feature) or a response (label), and to specify the type of predictor scale: nominal (categorical) or absolute.

    Presence of patterns and anomalies

    Azure ML Studio offers numerous statistical analysis tools (the Statistical Functions section of the toolbar). One of the ones I use most often is the Descriptive Statistics control. With it, you can get the minimum (Min) and maximum (Max) values stored in a column, the median (Median), the arithmetic mean (Mean), the first (1st Quartile) and third (3rd Quartile) quartiles, the standard deviation (Sample Standard Deviation), etc.
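    For reference, the same indicators can be computed with Python's standard statistics module (the sample column is made up; note that exact quartile values depend on the interpolation method used):

```python
import statistics

values = [12.0, 7.5, 3.2, 19.8, 7.5, 11.1, 4.4, 15.0]

summary = {
    "Min": min(values),
    "Max": max(values),
    "Mean": statistics.mean(values),
    "Median": statistics.median(values),
    # quantiles() with n=4 returns the 1st, 2nd, and 3rd quartiles
    "1st Quartile": statistics.quantiles(values, n=4)[0],
    "3rd Quartile": statistics.quantiles(values, n=4)[2],
    "Sample Standard Deviation": statistics.stdev(values),
}
```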

    Splitting the data set

    When using supervised learning algorithms, at least once per experiment (in the general case) you will need to divide the data set into two subsets: the training sample (Training Dataset) and the test sample (Test Dataset).

    For a positive end result, an accurate model, it is very important that the training sample contain the widest possible range of values that training examples can take (in other words, the training set should cover the largest possible range of states of the system being predicted). To obtain the highest-quality training sample, strategies based on shuffling the initial data are most widely used.

    For splitting tasks, Azure ML Studio uses the Split control, which implements several data separation strategies and allows you to specify the proportion of the data that falls into each of the subsets.
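    A minimal sketch of one shuffle-then-split strategy of the kind Split implements (the 70/30 proportion is just an example):

```python
import random

def split_dataset(rows, train_fraction=0.7, seed=42):
    """Shuffle the data and divide it into a training and a test sample
    in the given proportion."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # deterministic shuffle for reproducibility
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

data = list(range(100))                 # stand-in for 100 training examples
train, test = split_dataset(data, train_fraction=0.7)
```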

    Model building


    Feature selection

    The selection of predictors (Feature Selection) is a stage that has a huge impact on the accuracy of the resulting model. To identify all the essential predictors within the model, while avoiding adding too many of them, the researcher needs knowledge both of mathematical statistics and of the subject area under study.

    The Filter Based Feature Selection control allows identifying predictors in a loaded dataset based on Pearson, Spearman, Kendall, or other statistical methods. Identifying predictors with mathematical methods helps at the early stages to quickly create an acceptable model. At the final stage of model refinement, the choice of predictors is often driven by expert opinion in the studied area. For explicit (manual) selection of predictors in Azure ML, the Metadata Editor tool is used, which allows you to mark a dataset column as a predictor.
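    A simplified sketch of filter-based selection using the Pearson coefficient, ranking candidate predictors by the absolute value of their correlation with the label (the toy dataset is made up):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric columns
    (assumes neither column is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: rank candidate predictors by |correlation| with the label
label = [0, 0, 1, 1, 1, 0]
features = {
    "amount":  [10.0, 12.0, 95.0, 88.0, 99.0, 11.0],  # strongly related to label
    "weekday": [1, 3, 2, 5, 4, 2],                    # weakly related to label
}
ranked = sorted(features, key=lambda f: abs(pearson(features[f], label)),
                reverse=True)
```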

    Feature Scaling / Dimension reduction

    Some machine learning algorithms do not work correctly without normalizing predictor values (Feature Scaling). In addition, reducing the number of variables/predictors in the model (Dimension Reduction) can improve resource utilization during training and help avoid overfitting the model. Both of these techniques reduce the time needed to find the objective function that describes the model.
    Elements from this functionality group are located in the Scale and Reduce section of the Azure ML Studio toolbar.
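    As an illustration of Feature Scaling, here is a minimal min-max normalization sketch that linearly maps a column of predictor values into a given range:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map a column of predictor values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:                        # a constant column carries no information
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

amounts = [10.0, 55.0, 100.0]
scaled = min_max_scale(amounts)
```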

    Application of machine learning algorithm

    The process of applying a machine learning algorithm in Azure ML goes through the following stages:
    initializing a model with a specific machine learning algorithm (subsection Machine Learning -> Initialize Model),
    training the model (Machine Learning -> Train),
    scoring the resulting model on the training and test samples (Machine Learning -> Score),
    evaluating the resulting algorithm (Machine Learning -> Evaluate).

    Regression, classification, and clustering algorithms are available in Azure ML. It is possible to configure the key parameters of the selected algorithm: for the Multiclass Neural Network algorithm, you can specify the number of hidden nodes, the number of training iterations, the initial weights, the type of normalization, etc. (list of all configurable parameters).

    A complete list of algorithms as of March 2015 is shown in the illustration below.



    Model Evaluation


    As mentioned above, to score a model in Azure ML Studio, the toolbar has a subsection called Machine Learning -> Score. The result is available both in the form of histograms and in the form of statistical indicators (minimum and maximum values, median, mean, mathematical expectation, etc.).

    The Evaluate Model control displays a confusion matrix containing the correctly recognized positive cases (True Positive, TP), the correctly recognized negative cases (True Negative, TN), and the recognition errors (False Positive, False Negative).

    The performance of the model is available both as a graph and as a table of metrics: Accuracy, Precision, Recall, F1 Score.

    The indicator of greatest (but not sole) interest is the prediction accuracy, Accuracy, calculated as the ratio of successful predictions to the total number of elements in the set: (TP + TN) / Total.
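    The Accuracy formula, together with the Precision, Recall, and F1 Score metrics mentioned above, can be computed directly from the confusion-matrix counts; a small sketch with made-up counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics shown by Evaluate Model from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1 Score": f1}

# Made-up counts for illustration
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```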
    The meaning of the remaining indicators is clearly demonstrated by the following illustration:

    The next most popular indicator after Accuracy is AUC (Area Under Curve). AUC lies in the range from 0 to 1; values close to 0.5 mean the model works about as well as tossing a coin and guessing the class of the event based on which side lands up. The closer AUC is to 1, the more accurate the model. Each Threshold level has its own graph.
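    A sketch of how AUC can be computed from a handful of ROC points with the trapezoidal rule (the ROC points are invented; note that the diagonal, coin-toss curve gives exactly 0.5):

```python
def auc(points):
    """Area under a ROC curve given as (FPR, TPR) points, via the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0
    return area

# A coin-toss classifier lies on the diagonal and gives AUC = 0.5
diagonal = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]

# A better-than-random ROC curve
roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (1.0, 1.0)]
```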
    You can read more about the performance indicators of algorithms in Azure ML  here .

    Publish Model


    Models built and calculated in Azure ML Studio can be deployed as a scalable, fault-tolerant web service.

    The service operates in two modes: batch mode (asynchronous response, SLA 99.9%) and a low-latency Request/Response mode (synchronous response, SLA 99.95%).
    The service receives and sends messages in application/json format over HTTPS. To access the service, an API Key is issued: an access key that is included in the request header.
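    A sketch of constructing such a request in Python with the standard library. The endpoint URL, API Key, and column names are placeholders; the Inputs/GlobalParameters JSON envelope shown here is the shape used by the classic Request/Response API, but the exact schema for your service is listed on its API documentation page:

```python
import json
import urllib.request

# Hypothetical endpoint URL and API Key; the real values appear
# on the service's API documentation page after publishing.
URL = "https://ussouthcentral.services.azureml.net/workspaces/<id>/services/<id>/execute"
API_KEY = "<your-api-key>"

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["amount", "n_ops"],
            "Values": [["250.0", "3"]],
        }
    },
    "GlobalParameters": {},
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + API_KEY,  # the API Key goes into the header
    },
)
# response = urllib.request.urlopen(request)  # uncomment to actually call the service
```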
    It is possible to add an arbitrary number of endpoints through which the service can be accessed. For each endpoint you can configure the Throttle Level, which is certainly a plus. The disadvantage is that there are only two of these levels, High and Low, and there is no way to set the level manually, say, at 10,240 requests/sec. Another oddity is that all endpoints share the same API Key.

    After the service is created, its API documentation page becomes available; in addition to a general description of the service and of the expected input and output message formats, it contains examples of calling the service in C#, Python, and R.
    In addition, a successful model can always be shared with the community in the Azure ML Gallery, which already has a lot of interesting experiments. If your model is of great public value, take the opportunity to publish the service providing access to it in the Microsoft Azure Marketplace, the store of SaaS applications. In turn, the Azure Marketplace already contains a large number of data services, available both for free and by subscription (for example, per 10K requests).

    Disadvantages


    In Azure ML, as in many services of the Azure cloud platform, there are several tiers of service. In Azure ML these are the Free and Standard tiers. Free will cost you a minimal (almost zero) amount and is perfect for an initial acquaintance with the service. The Standard tier is an enterprise tier free of the many artificial limitations the Free tier has. Therefore, below I will talk only about the Standard tier.

    I would not say that what I list below are limitations; rather, they are things that remained unclear to me.

    Fly in the ointment for Azure ML Experiment


    I did not find in the Azure ML documentation any indication of the maximum input size (in GB), or whether there are (and what) restrictions on the number of columns (predictors) and rows (training examples) that the Azure ML learning algorithms can handle. If such limits exist, the importance of knowing them when designing an analytical system can hardly be overestimated.

    Fly in the ointment for Azure ML Web Services


    Unknown: the maximum number of simultaneous requests to one endpoint and the maximum number of endpoints. In one place I found the following numbers (I cannot vouch for their currency): a maximum of 20 parallel requests per endpoint, and a maximum of 80 endpoints. I measured the call duration for one of my Azure ML web services located in the South Central US region (the client sending the requests was in the same data center). The response time in Request/Response mode was about 0.4 seconds.
    From this it can be calculated that, in my particular case, a throughput of more than 4K (20 * 80 / 0.4 = 4,000) requests per second should not be expected. This scalability limit must also be considered when designing an application.

    And, lastly, the ability to configure access rights for each endpoint individually is missing. To grant such rights per endpoint, each endpoint would need its own API Key (or some other means of authentication), and this feature is not yet available in Azure ML.

    Killer Feature (instead of conclusion)


    It is worth noting that if for some reason the functionality of the built-in Azure ML Studio tools is not enough, researchers have the opportunity to write and execute scripts in R (quickstart) and Python (quickstart), the most popular programming languages in the field of data science.
    And they say all this can be tried for free. For those to whom that seems too little, here are the prices for the Free and Standard tiers.

    Additional sources
