Pattern recognition. The beginning of the theory
In this article, I set out to illuminate some of the fundamental results of machine learning theory in such a way that the concepts are understandable to readers who are somewhat familiar with the problems of classification and regression. The idea of writing such an article manifested itself more clearly in my mind with every book I read, in which the ideas of teaching recognition machines were told as if from the middle and it was completely unclear what the authors of this or that method relied on during its development. On the other hand, there are a number of books devoted to the basic concepts in machine learning, but the presentation of the material in them may seem too complicated for the first reading.
Consider this problem. We have two classes of apples - tasty and not tasty, 1 and 0. Apples have signs - color and size. Color will change continuously from 0 to 1, i.e. 0 is a fully green apple, 1 is completely red. Size can vary in the same way, 0 - small apple, 1 - large. We would like to develop an algorithm that would receive color and size at the input, and at the output it would give an apple class - whether it is tasty or not. It is highly desirable that the number of errors in this case be the smaller the better. At the same time, we have a final list, which contains historical data on the color, size and class of apples. How could we solve such a problem?
Solving our problem, the first method that may come to mind may be this: let's manually create rules like if-else and depending on the color and size values, we will assign a certain class to the apple. Those. we have prerequisites - this is color and size, and there is a consequence - the taste of the apple. It is quite reasonable when there are few signs and thresholds for comparison can be estimated by eye. But it may happen that it is not possible to come up with clear conditions, and it is not obvious from the data which thresholds to take, and the number of signs may increase in the future. But what if in our list with historical data, we find two apples with the same color and size, but one is marked as tasty and the other is not? Thus, our first method is not as flexible and scalable as we would like.
We introduce the following notation. We will denote the th apple as . In turn, each consists of two numbers - color and size. This fact will be denoted by a pair of numbers: . The class of each apple is denoted by . A list with historical data is denoted by a letter ; the length of this list is equal . The th element of this list is the value of the signs of the apple and its class. Those. . We will also call sampling. In capital letters , we denote variables that can take on the values of a particular attribute and class. We are moving a new concept - the decisive rule is a function that accepts an input value of color and size, and the output returns the class label:
Developing the idea of a logical method with assumptions and consequences, we ask ourselves the question - what is the likelihood that an apple that does not belong to our sample will be tasty, provided the measured values of color and size? In the notation of probability theory, this question can be written as follows:
In this expression, it can be interpreted as a premise, as a consequence, but the transition from a premise to a consequence will obey probabilistic laws, not logical ones. Those. instead of the truth table with Boolean values 0 and 1 for the class, there will be probability values that take values from 0 to 1. We apply the Bayes formula and get the following expression:
Consider the right side of this expression in more detail. The multiplier is called a priori probability and means the probability of meeting a delicious apple among all possible apples. A priori probability of meeting a tasteless apple is . This probability may reflect our personal knowledge of how delicious and tasteless apples are distributed in nature. For example, in our past experience, we know that 80% of all apples are delicious. Or we can estimate this value simply by calculating the proportion of delicious apples in our list with historical data S. The next factor shows how likely it is to get a specific color and size valuefor an apple of class 1. This expression is also called the likelihood function and may take the form of any particular distribution, for example, a normal one. We use the denominator as the normalization constant, so that the desired probabilityranged from 0 to 1. Our ultimate goal is not a search for probabilities, but a search for a decision rule that would immediately give us a class. The final form of the decision rule depends on what values and parameters are known to us. For example, we can only know the values of a priori probability, and the remaining values cannot be estimated. Then the decisive rule will be to give all apples the value of the class for which the a priori probability is greatest. Those. if we know that 80% of apples in nature are delicious, then we put class 1 for each apple. Then our error will be 20%. If, in addition, we can estimate the values of the likelihood function $ p (X = x_m | Y = 1) $, then we can also find the value of the desired probabilityaccording to the Bayes formula, as written above. The decisive rule here will be: put a label of the class for which the probability is maximum:
We will call this rule the Bayesian classifier. Since we are dealing with probabilities, even a large probability value does not guarantee that the apple does not belong to class 0. Let us estimate the probability of error on the apple as follows: if the decision rule returned the class value of 1, then the probability will be wrong and vice versa:
We are interested in the probability of a classifier error not only in this particular example, but in general for all possible apples:
This expression is a mathematical expect error . So, solving the original problem, we came to the Bayesian classifier, but what are its disadvantages? The main problem is to estimate the conditional probability from the data . In our case, we represent the object as a pair of numbers - color and size, but in more complex problems the dimension of features can be several times higher and the number of observations from our list with historical data may not be enough to estimate the probability of a multidimensional random variable. Next, we will try to generalize our concept of a classifier error, and also see if it is possible to choose some other classifier to solve the problem.
Losses from classifier errors
Suppose we already have some kind of decisive rule . Then it can make two types of errors - the first one is to classify the object as class 0, which has real class 1 and vice versa, to classify the object as class 1, which has real class 0. In some problems, it is important to distinguish between these cases. For example, we suffer more from the fact that an apple marked as tasty turned out to be tasteless and vice versa. We formalize the degree of our discomfort from deceived expectations in the concept of More generally - we have a loss function that returns a number for each classifier error. Let be the real label of the class. Then the loss function returns the loss value for the real class label and the value of our decision rule. An example of using this function - take out an apple to a known class , pass the apple to the input of our decision rule , we obtain a class evaluation of the decision rule if the values and match, then we assume that the classifier is not wrong, and no losses if the values do not match, then the value of losses our function will say
Conditional and Bayesian risk
Now that we have the loss function and we know how much we lose from the incorrect classification of an object , it would be nice to understand how much we lose on average in many objects. If we know the value - the probability that an apple will be tasty, provided the measured color and size values, as well as the real class value (for example, take an apple from the S sample, see the beginning of the article), we can introduce the concept of conditional risk . Contingent risk is the average value of losses at the facility for the decision rule :
In our case of binary classification, when it turns out:
When a (x) = 0
In order to calculate the average losses at all possible objects, the concept of Bayesian risk is introduced, which is the mathematical expectation of conditional risk:
Above, we described a decisive rule that assigns an object to the class that has the highest probability value.This rule gives a minimum to our average losses (Bayesian risk), therefore, the Bayesian classifier is optimal from the point of view of the risk functional that we introduced. This means that the Bayesian classifier has the smallest possible classification error.
Some typical loss functions
One of the most common loss functions is a symmetric function, when losses from the first and second types of errors are equivalent. For example, the loss function 1-0 (zero-one loss) is defined as follows:
Then the conditional risk for a (x) = 1 will simply be the probability value to get class 0 on the object :
Similarly for a (x) = 0:
The loss function 1-0 takes the value 1 if the classifier makes an error on the object and 0 if it does not. Now we will make sure that the value on the error is not 1, but another function Q, depending on the decision rule and the real class label:
Then the conditional risk can be written as follows:
The previous text was written according to the notation adopted in the book of Duda and Hart. In the original book of V.N. Vapnik considered the following process: nature selects an object according to the distribution of $ p (x) $, and then puts a class label according to the conditional distribution of $ p (y | x) $. Then the risk (expectation of losses) is defined as
Where is the function by which we are trying to approximate an unknown dependence, is the loss function for the real value and the value of our function . This notation is more obvious in order to introduce the following concept - empirical risk.
At this stage, we have already found out that the logical method is not suitable for us, because it is not flexible enough, and we cannot use the Bayesian classifier when there are a lot of signs, and there is a limited number of training data and we will not be able to restore the probability . We also know that the Bayesian classifier has the smallest possible classification error. Since we cannot use the Bayesian classifier, let's take something simpler. Let's fix some parametric family of functions H and select a classifier from this family.
Example: let the set of all functions of the form
All functions of this set will differ from each other only by coefficients. When we chose such a family, we assumed that in color-size coordinates between points of class 1 and points of class 0, you can draw a straight line with coefficients so that points with different classes are located in different sides of the line. It is known that for a straight line of this form, the coefficient vector is normal to the straight line. Now we do this - we take our apple, measure its color and size, and put a point with the received coordinates on the graph in the color-size axes. Next, measure the angle between this point and the vector $ w $. We notice that our point can lie either on one or on the other side of the line. Then the angle betweenand the point will be either sharp or dull, and the scalar product is either positive or negative. The decisive rule follows from this:
After we fixed the class of functions $ H $, the question arises - how to choose a function with the necessary coefficients from it? The answer is, let's choose the function that delivers the minimum to our Bayesian risk $ R () $. Again, the problem is that in order to calculate the Bayesian risk values, you need to know the distribution of $ p (x, y) $, but it is not given to us, and it is not always possible to restore it. Another idea is to minimize the risk, not at all possible objects, but only on the sample. Those. minimize function:
This function is called empirical risk. The next question is why did we decide that by minimizing empirical risk, we also minimize Bayesian risk? Let me remind you that our practical task is to make as few classification errors as possible. The fewer errors, the lower the Bayesian risk. The justification of the convergence of empirical risk to Bayesian risk with increasing data volume was obtained in the 70s by two scientists - V.N. Vapnik and A. Ya. Chervonenkis.
Convergence guarantees. Simplest case
So, we have come to the conclusion that the Bayesian classifier gives the smallest possible error, but in most cases we cannot train it and we are not able to calculate the error (risk) either. However, we can calculate an approximation to the Bayesian risk, which is called empirical risk, and knowing the empirical risk, choose an approximating function that would minimize empirical risk. Let's look at the simplest situation when minimizing empirical risk gives a classifier that also minimizes Bayesian risk. For the simplest case, we will have to make an assumption that is rarely implemented in practice, but which can be weakened in the future. We fix a finite class of functionsfrom which we will choose our classifier and assume that this function, which is used to mark the nature of our apples to taste is the finite set of hypotheses: . We also have a sample obtained from the distribution over objects . All objects of the sample are considered equally independently distributed (iid). Then the following will be true.
Choosing a function from a class by minimizing empirical risk, we are guaranteed to find such that it has a small value of Bayesian risk if the sample on which we perform minimization is of sufficient size.
What does “small value” and “sufficient size” mean, see the literature below.
Idea of proof
By the hypothesis of the theorem, we obtain a sample from the distribution , i.e. the process of selecting objects from nature is random. Each time we collect a sample, it will be from the same distribution, but the objects themselves in it may be different. The main idea of the proof is that we can get such an unsuccessful sample that the algorithm that we choose by minimizing the empirical risk in this sample will not be able to minimize the Bayesian risk but it is good to minimize the empirical risk, but the probability of getting such a sample is small and growth sample size, this probability decreases. Similar theorems exist for more realistic assumptions, but here we will not consider them.
Having evidence that the function found while minimizing empirical risk will not have a large error on previously unobserved data with a sufficient training sample, we can use this principle in practice, for example, as follows - we take the expression:
And we substitute different loss functions, depending on the problem being solved. For linear regression:
For logistic regression:
Despite the fact that behind the support vector method lies mainly geometric motivation, it can also be represented as a problem of minimizing empirical risk.
Many teaching methods with a teacher can be considered including as special cases of the theory developed by V. N. Vapnik and A. Ya. Chervonenkis. This theory gives guarantees regarding the error on the test sample, provided that the training sample is large enough and there are some requirements for the hypothesis space in which we are looking for our algorithm.
- The Nature of Statistical Learning Theory, Vladimir N. Vapnik
- Pattern Classification, 2nd Edition, Richard O. Duda, Peter E. Hart, David G. Stork
- Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz, Shai Ben-David
PS The request to write in a personal about all inaccuracies and typos
PPS Many thanks to the authors of this article for the TeX editor