How to choose algorithms for Microsoft Azure Machine Learning

Original authors: Gary Ericson, Larry Franks, Brandon Rohrer
In this article you will find the Microsoft Azure Machine Learning cheat sheet, which helps you choose the right algorithm for your predictive analytics solution from the Microsoft Azure algorithm library. You will also learn how to use it.



The answer to the question "Which machine learning algorithm should I use?" is always "It depends." It depends on the volume, quality, and nature of the data. It depends on what you want to do with the result. It depends on how the algorithm's instructions for the computer were implemented, and also on how much time you have. Even the most experienced data analysts cannot tell you which algorithm is better until they have tried it.

Microsoft Azure Machine Learning Cheat Sheet


You can download the Microsoft Azure machine learning algorithm cheat sheet here.

It was created for beginning data scientists with some machine learning experience who want to choose an algorithm to use in Azure Machine Learning Studio. This means the information in the cheat sheet is generalized and simplified, but it will point you in the right direction for further work. It also does not list every algorithm; as Azure Machine Learning evolves and offers more methods, the cheat sheet will be expanded.

These recommendations are based on the feedback and advice of many data analysts and machine learning experts. We do not agree with each other on everything, but we tried to consolidate our opinions and reach a rough consensus. Most of the disputed points begin with the words "It depends..." :)

How to use the cheat sheet


Read the path and algorithm labels in the diagram as "For <path label>, use <algorithm>." For example, "For speed, use two-class logistic regression." Sometimes more than one branch applies. Sometimes none of them is a perfect fit. These are just recommendations, so don't worry about inaccuracies. Some data analysts I've been able to talk to say that the only sure way to find the best algorithm is to try them all.

Here is an example experiment from the Cortana Intelligence Gallery in which several algorithms are tried on the same data and the results are compared.

You can also download and print a diagram with an overview of the capabilities of Machine Learning Studio from this article.

Types of Machine Learning


Supervised learning


Supervised learning algorithms make predictions based on a set of examples. For instance, historical stock prices can be used to predict future prices. Each example used for training is labeled with the value of interest, in this case the stock price. A supervised learning algorithm looks for patterns in those value labels. It can use any information that might be relevant, such as the day of the week, the season, the company's financial data, the type of industry, or the presence of major geopolitical events, and each algorithm looks for different types of patterns. After the algorithm has found a suitable pattern, it uses it to make predictions for unlabeled test data, in this case future prices.

This is a popular and useful type of machine learning. With one exception, all modules in Azure Machine Learning are supervised learning algorithms. Azure Machine Learning provides several specific types of supervised learning: classification, regression, and anomaly detection.

  • Classification. When the data is used to predict a category, supervised learning is called classification. For example, an image is assigned the label "cat" or "dog". When there are only two choices, this is called two-class classification. When there are more categories, for example when predicting the winner of the NCAA March Madness tournament, this is called multi-class classification.
  • Regression. When a value is being predicted, as with stock prices, supervised learning is called regression.
  • Anomaly detection. Sometimes the goal is to identify unusual data points. In fraud detection, for example, any highly unusual credit card spending patterns are suspect. The possible variations are so numerous and the training examples so few that it is nearly impossible to learn what fraudulent activity looks like. Anomaly detection instead learns what normal activity looks like (using an archive of valid transactions) and flags anything that differs significantly.
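To make the idea of learning from labeled examples concrete, here is a minimal sketch in Python. It uses scikit-learn rather than Azure Machine Learning Studio, and the dataset and model choice are purely illustrative assumptions:

```python
# Minimal supervised-learning sketch (assumption: scikit-learn stands in for
# Azure ML Studio modules; the toy dataset and model choice are illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)           # features plus value labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)               # a two-class classifier
clf.fit(X_train, y_train)                             # learn patterns from labeled examples
print("accuracy on unseen data:", round(clf.score(X_test, y_test), 3))
```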

Unsupervised learning


In unsupervised learning, data points have no labels. Instead, the goal of an unsupervised learning algorithm is to organize the data or describe its structure. This can mean grouping the data into clusters so that it becomes more structured, or finding other ways to simplify complex data.
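As a sketch of how unsupervised learning might group unlabeled data, here is a small example using scikit-learn's k-means implementation on synthetic data (both choices are assumptions made for illustration, not part of the original article):

```python
# Minimal unsupervised-learning sketch: grouping unlabeled points into clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # labels are ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments for the first 10 points:", kmeans.labels_[:10])
```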

Reinforcement learning


In reinforcement learning, the algorithm chooses an action in response to each incoming data point. After some time it receives a reward signal indicating how good the decision was, and based on this it adjusts its strategy to earn the highest reward. There are currently no reinforcement learning modules in Azure Machine Learning. Reinforcement learning is common in robotics, where the set of sensor readings at a given moment is a data point and the algorithm must choose the robot's next action. It is also a natural fit for Internet of Things applications.
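As a toy illustration of that action-reward loop (not an Azure component; the reward probabilities and epsilon value below are invented for the example), here is a simple epsilon-greedy bandit in plain Python:

```python
# Toy reinforcement-learning sketch: an epsilon-greedy multi-armed bandit.
import random

true_reward = [0.2, 0.5, 0.8]          # hidden payoff probability of each action
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1

for step in range(1000):
    # explore occasionally, otherwise exploit the best current estimate
    if random.random() < epsilon:
        action = random.randrange(3)
    else:
        action = max(range(3), key=lambda a: estimates[a])
    reward = 1 if random.random() < true_reward[action] else 0   # reward signal
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print("estimated reward per action:", [round(e, 2) for e in estimates])
```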

Algorithm Selection Tips


Accuracy


Getting the most accurate answer possible isn't always necessary. Depending on the intended use, an approximation is sometimes good enough. If so, you can cut processing time dramatically by sticking with more approximate methods. Another advantage of approximate methods is that they are naturally less prone to overfitting.

Training time


The number of minutes or hours needed to train a model varies greatly between algorithms. Training time is often closely tied to accuracy; the two tend to go hand in hand. In addition, some algorithms are more sensitive to the size of the training set than others. When time is limited, this can drive the choice of algorithm, especially when the training set is large.

Linearity


Many machine learning algorithms rely on linearity. Linear classification algorithms assume that classes can be separated by a straight line (or its higher-dimensional analog); these include logistic regression and support vector machines (as implemented in Azure Machine Learning). Linear regression algorithms assume that data trends follow a straight line*. These assumptions are fine for some problems, but in other cases they reduce accuracy.


A non-linear class boundary: relying on a linear classification algorithm reduces accuracy


Data with a non-linear trend: a linear regression method produces larger errors than acceptable

Despite these drawbacks, linear algorithms are a popular first line of attack. They are algorithmically simple and fast to train.
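The trade-off is easy to see in a small experiment. The sketch below, which assumes scikit-learn and a synthetic crescent-shaped dataset, compares a linear classifier with a non-linear one on data whose classes cannot be separated by a straight line:

```python
# Linearity trade-off sketch: linear vs. non-linear classifier on non-linear data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)   # non-linear boundary
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_train, y_train)            # straight-line boundary
nonlinear = SVC(kernel="rbf").fit(X_train, y_train)            # curved boundary

print("linear model accuracy:    ", round(linear.score(X_test, y_test), 3))
print("non-linear model accuracy:", round(nonlinear.score(X_test, y_test), 3))
```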

Number of parameters


Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect the algorithm's behavior, such as the error tolerance, the number of iterations, or options for variants of how the algorithm behaves. The training time and accuracy of an algorithm can sometimes be quite sensitive to getting these settings just right. As a rule, a good combination of parameters can be found through trial and error.

Azure Machine Learning also offers a parameter sweeping module that automatically tries all parameter combinations at whatever granularity you choose. While this is a great way to make sure you have covered the parameter space, the more parameters there are, the longer it takes to train the model.

The upside is that having many parameters usually means an algorithm is very flexible, and it can often achieve excellent accuracy, provided you manage to find the right combination of parameter settings.
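For a sense of what a parameter sweep looks like in code, here is a sketch using scikit-learn's GridSearchCV as a stand-in for Azure's parameter sweeping module; the grid values are illustrative assumptions:

```python
# Parameter-sweep sketch: try all combinations of a small grid of settings.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}   # 9 combinations to try
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3)
search.fit(X, y)                                          # training time grows with the grid
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```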

Number of features


For certain types of data, the number of features can be very large compared to the number of data points. This is often the case with genetic or textual data. A large number of features can bog down some learning algorithms, making training time impractically long. Support vector machines are particularly well suited to this case (see below).
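A short sketch of that situation, assuming scikit-learn and a tiny made-up text corpus, shows how the feature count quickly outgrows the number of examples and how a linear support vector machine handles it:

```python
# High-dimensional features sketch: a linear SVM on text data, where the number
# of features (distinct words) exceeds the number of documents.
# (Assumption: a tiny invented corpus; real text would come from files.)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap meds buy now", "meeting at noon", "win cash now", "project status report"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam

X = TfidfVectorizer().fit_transform(docs)  # sparse matrix: many columns, few rows
clf = LinearSVC().fit(X, labels)
print("documents:", X.shape[0], "features:", X.shape[1])
```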

Special cases


Some learning algorithms make particular assumptions about the structure of the data or the desired results. If you can find one that fits your needs, it can give you more useful results, more accurate predictions, or faster training times.

Algorithm properties:

● - shows excellent accuracy, fast training time, and the use of linearity
○ - shows excellent accuracy and moderate training time

Algorithm | Accuracy | Training time | Linearity | Parameters | Notes

Two-class classification
Logistic regression | | | | 5 |
Decision forest | | | | 6 |
Decision jungle | | | | 6 | Low memory footprint
Boosted decision tree | | | | 6 | Large memory footprint
Neural network | | | | 9 | Additional customization is possible
Averaged perceptron | | | | 4 |
Support vector machine | | | | 5 | Good for large feature sets
Locally deep support vector machine | | | | 8 | Good for large feature sets
Bayes point machine | | | | 3 |

Multi-class classification
Logistic regression | | | | 5 |
Decision forest | | | | 6 |
Decision jungle | | | | 6 | Low memory footprint
Neural network | | | | 9 | Additional customization is possible
One-vs-all | - | - | - | - | See properties of the two-class method selected

Regression
Linear regression | | | | 4 |
Bayesian linear regression | | | | 2 |
Decision forest | | | | 6 |
Boosted decision tree | | | | 5 | Large memory footprint
Fast forest quantile regression | | | | 9 | Predicts distributions rather than point values
Neural network | | | | 9 | Additional customization is possible
Poisson regression | | | | 5 | Technically log-linear; for predicting counts
Ordinal regression | | | | 0 | For predicting rank ordering

Anomaly detection
One-class support vector machine | | | | 2 | Especially good for large feature sets
PCA-based anomaly detection | | | | 3 | Especially good for large feature sets
K-means | | | | 4 | A clustering algorithm

Algorithm Notes


Linear regression


As mentioned earlier, linear regression fits a line (or plane, or hyperplane) to the data. It is a convenient and fast workhorse, but for some problems it can be overly simplistic. You can find a guide to linear regression here.


Data with a linear trend
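As a minimal sketch (assuming scikit-learn and synthetic data with a roughly linear trend, much like the plot above):

```python
# Linear regression sketch: fit a line to noisy data and recover its slope.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=1.0, size=100)   # line plus noise

model = LinearRegression().fit(X, y)
print("estimated slope:", round(model.coef_[0], 2),
      "intercept:", round(model.intercept_, 2))
```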

Logistic Regression


Don't let the word "regression" in the name mislead you. Logistic regression is actually a powerful tool for two-class and multi-class classification. It is fast and simple. Because it uses an S-shaped curve instead of a straight line, it is a natural fit for dividing data into groups. Logistic regression produces linear class boundaries, so when you use it, make sure a linear approximation is something you can live with.


Logistic regression on two-class data with just one feature: the class boundary is at the point where the logistic curve is equally close to both classes
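A small sketch of the same situation, assuming scikit-learn and synthetic one-feature data, shows the S-shaped probability curve in action:

```python
# Logistic regression sketch: one feature, two classes, S-shaped probability curve.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(1.0, 1.0, 50), rng.normal(4.0, 1.0, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(x, y)
probs = clf.predict_proba([[0.0], [2.5], [5.0]])[:, 1]   # sample the S-curve at 3 points
print("P(class 1) at x = 0, 2.5, 5:", probs.round(2))
```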

Trees, forests and jungle


Decision forests (regression, two-class, and multi-class), decision jungles (two-class and multi-class), and boosted decision trees (regression and two-class) are all based on decision trees, a foundational machine learning concept. There are many variants of decision trees, but they all do the same thing: they subdivide the feature space into regions with mostly the same label. These can be regions of a consistent category or of a constant value, depending on whether you are doing classification or regression.


A decision tree subdivides the feature space into regions of roughly uniform values

Because the feature space can be subdivided into arbitrarily small regions, it is easy to imagine dividing it so finely that each region contains a single data point; this is an extreme example of overfitting. To avoid it, a large set of trees is built in such a way that the individual trees are not correlated with each other, so the resulting "decision forest" does not overfit. Decision forests can consume a lot of memory. Decision jungles consume less memory, but training takes a little longer.

Boosted decision trees avoid overfitting by limiting how many times a tree can split and how few data points are allowed in each region. The algorithm builds a sequence of trees, each of which learns to correct the errors of the trees before it. The result is a very accurate learner, at the cost of a large memory footprint. For the full technical description, see Friedman's original paper.
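To illustrate the difference in flavor between a forest of independent trees and a boosted sequence of trees, here is a sketch that assumes scikit-learn's RandomForestClassifier and GradientBoostingClassifier as rough stand-ins for the Azure modules, with illustrative hyperparameters:

```python
# Tree-ensemble sketch: a forest of decorrelated trees vs. a boosted sequence.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)       # many independent trees
boosted = GradientBoostingClassifier(n_estimators=100, random_state=0)  # each tree fixes earlier errors

for name, model in [("forest ", forest), ("boosted", boosted)]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", round(model.score(X_test, y_test), 3))
```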

Fast forest quantile regression is a variation of decision trees for cases when you want to know not only the typical (median) value of the data within a region, but also its distribution in the form of quantiles.

Neural networks and perceptrons


Neural networks are brain-inspired learning algorithms covering multi-class, two-class, and regression problems. They come in many varieties, but within Azure Machine Learning they take the form of directed acyclic graphs. This means that input features are passed forward through a sequence of layers and transformed into outputs. At each layer, the inputs are weighted in various combinations, summed, and passed on to the next layer. This combination of simple calculations makes it possible to learn complex class boundaries and data trends, seemingly by magic. Many-layered networks of this kind perform the "deep learning" that fuels tech reporting and science fiction.

But this performance doesn't come for free. Neural networks can take a long time to train, particularly for large data sets with many features. They also have more parameters than most algorithms, so parameter sweeping expands the training time considerably. And for those perfectionists who want to specify their own network structure, the possibilities are practically inexhaustible.


The boundaries learned by neural networks can be complex and irregular

The averaged perceptron is the neural networks' answer to skyrocketing training times. It uses a network structure that produces linear class boundaries. That may sound primitive by today's standards, but the algorithm has a long history of working well in practice, and it learns quickly.
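The contrast between the two approaches can be sketched with scikit-learn's MLPClassifier and Perceptron (rough stand-ins for the Azure modules; the network size and dataset are illustrative assumptions):

```python
# Neural network vs. perceptron sketch: non-linear vs. linear class boundaries.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=2000, random_state=0)
perceptron = Perceptron()                      # linear boundary, very fast to train

mlp.fit(X_train, y_train)
perceptron.fit(X_train, y_train)
print("neural network accuracy:", round(mlp.score(X_test, y_test), 3))
print("perceptron accuracy:    ", round(perceptron.score(X_test, y_test), 3))
```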

Support vector machines


Support vector machines (SVMs) find the boundary that separates classes by as wide a margin as possible. When the two classes cannot be cleanly separated, the algorithm finds the best boundary it can. In Azure Machine Learning, the two-class SVM does this with a straight line only (in SVM terms, it uses a linear kernel). Because of this linear approximation, training is fairly fast. Where it really shines is with feature-intensive data such as text or genomic data. In these cases, SVMs can separate classes quickly with a minimal risk of overfitting, and they do not require large amounts of memory.


A typical support vector machine class boundary maximizes the margin separating the two classes

Another Microsoft Research product, the two-class locally deep support vector machine, is a non-linear variant of SVM that retains most of the speed and memory efficiency of the linear version. It is ideal for cases where the linear approach does not give sufficiently accurate answers. To keep it fast, its developers break the problem down into a number of small linear SVM problems. Read the full description for the details of how they pulled off this trick.

Using a clever extension of non-linear SVMs, the one-class support vector machine draws a boundary that tightly encloses the entire data set. It is especially useful for anomaly detection: any new data points that fall well outside that boundary are unusual enough to be worth a closer look.
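A sketch of that idea, assuming scikit-learn's OneClassSVM and synthetic data, trains only on "normal" points and then flags points that fall far outside the learned boundary:

```python
# One-class SVM sketch for anomaly detection on synthetic data.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))      # archive of valid activity
outliers = np.array([[6.0, 6.0], [-5.0, 7.0]])              # clearly unusual points

detector = OneClassSVM(nu=0.05, kernel="rbf").fit(normal)   # learn the normal region only
print("normal points flagged:", (detector.predict(normal) == -1).sum(), "of", len(normal))
print("outliers flagged:     ", (detector.predict(outliers) == -1).sum(), "of", len(outliers))
```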

Bayesian methods


Bayesian methods have a highly desirable quality: they avoid overfitting. They do this by making some assumptions beforehand about the likely distribution of the answer. Another byproduct of this approach is that they have very few parameters. Azure Machine Learning offers Bayesian algorithms for both classification (the two-class Bayes point machine) and regression (Bayesian linear regression). Note that both assume the data can be split or fit with a straight line.
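As an approximate illustration (scikit-learn's BayesianRidge is used here as a stand-in for Azure's Bayesian linear regression, and the data is synthetic), note how the model returns an uncertainty estimate along with each prediction:

```python
# Bayesian linear regression sketch: predictions come with uncertainty estimates.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = 1.5 * X.ravel() + rng.normal(scale=1.0, size=60)

model = BayesianRidge().fit(X, y)
mean, std = model.predict([[5.0]], return_std=True)   # prediction plus standard deviation
print("prediction at x=5:", round(mean[0], 2), "+/-", round(std[0], 2))
```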

Incidentally, Bayes point machines were developed at Microsoft Research, and they rest on some particularly elegant theoretical work. If the topic interests you, read the JMLR article and Chris Bishop's blog.

Special algorithms


If you have a very specific goal, you may be in luck. The Azure Machine Learning collection includes algorithms that specialize in rank prediction (ordinal regression), count prediction (Poisson regression), and anomaly detection (one based on principal component analysis and another based on support vector machines). And there is a lone clustering algorithm (k-means).


PCA-based anomaly detection: the vast majority of the data falls into a stereotypical distribution, and points deviating dramatically from that distribution are suspect


A data set divided into five clusters using k-means

There is also a one-vs-all multi-class classifier, which breaks the N-class classification problem into N-1 two-class classification problems. Its accuracy, training time, and linearity properties are determined by the two-class classifiers used.


Two two-class classifiers form a three-class classifier
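Here is a sketch of the one-vs-all idea using scikit-learn's OneVsRestClassifier wrapped around a two-class logistic regression (an assumption standing in for the Azure module; note that scikit-learn fits one binary model per class rather than N-1):

```python
# One-vs-all sketch: a multi-class classifier built from two-class classifiers.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                     # three classes
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("underlying two-class models:", len(ova.estimators_))   # one per class in scikit-learn
print("training accuracy:", round(ova.score(X, y), 3))
```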

In addition, Azure provides access to a powerful machine learning framework called Vowpal Wabbit. VW defies categorization here, because it can learn both classification and regression problems, and it can even learn from partially labeled data. You can configure it to use any one of a number of learning algorithms, loss functions, and optimization algorithms. It was designed from the ground up to be efficient, parallel, and extremely fast, and it handles very large data sets with ease. VW was started and is led by John Langford of Microsoft Research; it is a Formula 1 car in a field of production vehicles. Not every problem fits VW, but if yours does, the effort of learning its interface will pay off. It is also available as stand-alone open source code in several languages.

The latest material from our blog on this topic


1. Azure in plain language (cheat sheet).
2. Trucks and refrigerators in the cloud (case).

We remind you that you can try Microsoft Azure here .

If you see an inaccuracy in the translation, please report this in private messages.

* UPD
Since there is an inaccuracy in the author's text, we are supplementing the material (thanks to @fchugunov).
Linear regression is used not only for dependencies described by a straight line (or plane), as stated in the article. The dependence can also be described by more complex functions. For example, polynomial regression (a form of linear regression) can be applied to the function in the second graph. To do this, the input data (for example, the value x) is transformed into a set of features [x, x², x³, ...], and the linear regression method then fits coefficients for them.
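A sketch of exactly that recipe, assuming scikit-learn: the input x is expanded into polynomial features and ordinary linear regression fits their coefficients.

```python
# Polynomial regression sketch: expand x into [x, x^2, x^3], then fit linearly.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * x.ravel() ** 3 - 2.0 * x.ravel() + rng.normal(scale=1.0, size=100)  # cubic trend

model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
model.fit(x, y)
coefs = model.named_steps["linearregression"].coef_
print("learned coefficients for [x, x^2, x^3]:", coefs.round(2))
```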
