
Pattern recognition. The beginning of the theory
Introduction
In this article, I set out to explain some of the fundamental results of machine learning theory in a way that is understandable to readers who are already somewhat familiar with classification and regression problems. The idea of writing such an article became clearer in my mind with every book I read in which the methods for training recognition machines were presented as if from the middle, with no explanation of what the authors of a given method relied on when developing it. On the other hand, there are a number of books devoted to the basic concepts of machine learning, but their presentation of the material may seem too complicated for a first reading.
Motivation
Consider the following problem. We have two classes of apples, tasty and not tasty, labeled 1 and 0. Apples have two features: color and size. Color varies continuously from 0 to 1, where 0 is a completely green apple and 1 is a completely red one. Size varies in the same way: 0 is a small apple, 1 is a large one. We would like to develop an algorithm that takes color and size as input and outputs the apple's class, tasty or not, making as few errors as possible. In addition, we have a finite list of historical data recording the color, size, and class of apples. How could we solve such a problem?
Logical approach
The first method that may come to mind is to manually create if-else rules that assign a class to an apple depending on its color and size values. That is, we have premises, the color and size, and a consequence, the taste of the apple. This is quite reasonable when there are few features and the thresholds for comparison can be estimated by eye. But it may happen that no clear conditions can be devised, that it is not obvious from the data which thresholds to choose, and the number of features may grow in the future. And what if our list of historical data contains two apples with the same color and size, but one is labeled tasty and the other is not? Thus, our first method is not as flexible and scalable as we would like.
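To make the logical approach concrete, here is a minimal sketch of such a hand-written rule; the 0.5 thresholds are invented purely for illustration and are exactly the kind of thing that is hard to justify from data:

```python
def is_tasty(color: float, size: float) -> int:
    """Hand-crafted rule: call an apple tasty (1) if it is red enough
    and large enough; the 0.5 thresholds are chosen by eye."""
    if color > 0.5 and size > 0.5:
        return 1
    return 0

print(is_tasty(0.9, 0.8))  # 1: red and large, so we guess "tasty"
print(is_tasty(0.2, 0.9))  # 0: green, so we guess "not tasty"
```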
Notation
We introduce the following notation. An apple will be described by its feature vector $x$ (here, color and size), and its class label by $y \in \{0, 1\}$. A decision rule (classifier) is a function $a(x)$ that maps a feature vector to a class label, and the historical data form a training sample of $N$ pairs $(x_i, y_i)$, $i = 1, \dots, N$.
Probabilistic approach
Developing the idea of the logical method with premises and consequences, we ask ourselves the question: what is the probability that an apple with given features $x$ belongs to class $y$, that is, what is $p(y \mid x)$? By Bayes' theorem,

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}.$$
In this expression, $p(y \mid x)$ can be interpreted as the posterior probability: the probability that an apple with features $x$ belongs to class $y$.
Consider the right-hand side of this expression in more detail. The factor $p(x \mid y)$ is the likelihood, i.e. the probability of observing an apple with features $x$ given that its class is $y$. The factor $p(y)$ is the prior probability of class $y$, and $p(x)$ in the denominator is the probability of observing an apple with features $x$ regardless of its class.
Given the posterior probabilities, a natural decision rule is to assign an object to the class whose posterior probability is highest: $a(x) = \arg\max_y p(y \mid x)$. We will call this rule the Bayesian classifier. Since we are dealing with probabilities, even a large value of $p(1 \mid x)$ does not guarantee that the apple does not belong to class 0, so the classifier can make an error on a particular apple.
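As a sketch of how such a classifier could look in code, assuming the class-conditional densities and the priors were somehow known or estimated (which, as discussed later, is rarely possible), one could write:

```python
import numpy as np

def bayes_classifier(x, likelihoods, priors):
    """Return the class with the highest posterior probability.

    likelihoods: a list of functions p(x | y), one per class y
    priors:      a list of prior probabilities p(y)
    """
    # Unnormalized posteriors p(x | y) * p(y); the denominator p(x) is
    # the same for every class, so it does not affect the argmax.
    scores = [lik(x) * prior for lik, prior in zip(likelihoods, priors)]
    return int(np.argmax(scores))

# Invented toy densities: class 0 ("not tasty") concentrates around
# green/small apples, class 1 ("tasty") around red/large ones.
# (Both bumps share the same normalizing constant, so comparing
# the products is valid.)
def p_x_given_0(x):
    color, size = x
    return np.exp(-((color - 0.2) ** 2 + (size - 0.3) ** 2) / 0.1)

def p_x_given_1(x):
    color, size = x
    return np.exp(-((color - 0.8) ** 2 + (size - 0.7) ** 2) / 0.1)

print(bayes_classifier((0.9, 0.8), [p_x_given_0, p_x_given_1], [0.5, 0.5]))  # 1
```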
We are interested in the probability of a classifier error not only on this particular apple, but in general, over all possible apples:

$$P(\text{error}) = \int P(\text{error} \mid x)\, p(x)\, dx.$$
This expression is the mathematical expectation of the error, i.e. the average error over all possible objects.
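For the rule that picks the class with the highest posterior probability, this quantity can be written out explicitly (this is the classical derivation found in Duda and Hart): at each point the classifier errs exactly when the true class is the less probable one, so

$$P(\text{error} \mid x) = \min\{\,p(0 \mid x),\ p(1 \mid x)\,\}, \qquad P(\text{error}) = \int \min\{\,p(0 \mid x),\ p(1 \mid x)\,\}\, p(x)\, dx.$$

Any other deterministic rule picks, at some points $x$, the class with the smaller posterior and therefore cannot make this integral smaller.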
Losses from classifier errors
Suppose we already have some decision rule $a(x)$. It can then make two kinds of errors: assigning an object of class 0 to class 1, and assigning an object of class 1 to class 0. In practice these errors may cost us different amounts (for example, selling a bad apple as tasty may be worse than discarding a good one), so we introduce a loss function $\lambda(a(x), y)$ that specifies the loss incurred when the classifier outputs $a(x)$ while the true class is $y$.
Conditional and Bayesian risk
Now that we have a loss function and know how much we lose from misclassifying an object, we can compute how much we lose on average when classifying a particular object $x$. This quantity is called the conditional risk:

$$R(a(x) \mid x) = \sum_{y} \lambda(a(x), y)\, p(y \mid x).$$
In our case of binary classification, when $a(x) = 1$ the conditional risk is

$$R(1 \mid x) = \lambda(1, 0)\, p(0 \mid x),$$

where $\lambda(1, 0)$ is the loss for predicting class 1 when the true class is 0 (we assume that correct decisions incur zero loss).

When $a(x) = 0$:

$$R(0 \mid x) = \lambda(0, 1)\, p(1 \mid x).$$
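As a small numerical illustration with an invented, asymmetric loss matrix (here, calling a truly bad apple tasty is five times as costly as the opposite error), the conditional risks of the two possible decisions can be computed directly:

```python
# loss[a][y]: the loss of predicting class a when the true class is y
# (values are invented for illustration).
loss = [[0.0, 1.0],   # a(x) = 0: no loss if y = 0, loss 1 if y = 1
        [5.0, 0.0]]   # a(x) = 1: loss 5 if y = 0, no loss if y = 1

def conditional_risk(a, posterior):
    """R(a | x) = sum over y of loss(a, y) * p(y | x)."""
    return sum(loss[a][y] * posterior[y] for y in (0, 1))

posterior = [0.3, 0.7]   # p(0 | x) = 0.3, p(1 | x) = 0.7
print([conditional_risk(a, posterior) for a in (0, 1)])   # [0.7, 1.5]
```

Even though class 1 is the more probable one here, the decision $a(x) = 0$ incurs the smaller conditional risk, which is exactly why losses are worth taking into account.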
In order to calculate the average losses over all possible objects, the concept of the Bayesian risk is introduced; it is the mathematical expectation of the conditional risk:

$$R(a) = \int R(a(x) \mid x)\, p(x)\, dx.$$
Above, we described a decision rule that assigns an object to the class with the highest posterior probability. When losses are taken into account, the natural generalization is to choose, for each object, the decision with the smallest conditional risk. This rule minimizes the Bayesian risk, and no other decision rule achieves a lower one; for the symmetric loss discussed below, it reduces to simply picking the most probable class.
Some typical loss functions
One of the most common loss functions is a symmetric one, in which the losses from the two types of errors are equal. For example, the zero-one loss (1-0 loss) is defined as follows:

$$\lambda(a(x), y) = \begin{cases} 1, & a(x) \neq y, \\ 0, & a(x) = y. \end{cases}$$
Then the conditional risk for $a(x) = 1$ is simply the probability of encountering class 0 at the object $x$:

$$R(1 \mid x) = p(0 \mid x).$$
Similarly, for $a(x) = 0$:

$$R(0 \mid x) = p(1 \mid x).$$
The zero-one loss takes the value 1 if the classifier makes an error on the object and 0 if it does not. Now let the value charged on an error be not 1 but some function $Q$ that depends on the decision rule and the true class label:

$$\lambda(a(x), y) = Q(a(x), y).$$
Then the conditional risk can be written as follows:

$$R(a(x) \mid x) = \sum_{y} Q(a(x), y)\, p(y \mid x).$$
Notation Notes
The preceding text follows the notation adopted in the book by Duda and Hart. The original book by V. N. Vapnik considers the following process: nature selects an object according to the distribution $p(x)$ and then assigns it a class label according to the conditional distribution $p(y \mid x)$. The risk (the expectation of the losses) is then defined as

$$R(a) = \int Q(a(x), y)\, p(x, y)\, dx\, dy,$$
where $Q(a(x), y)$ is the loss function, $a(x)$ is the decision rule being evaluated, and the integral is taken with respect to the joint distribution $p(x, y) = p(y \mid x)\, p(x)$.
Empirical risk
At this stage, we have already established that the logical method does not suit us, because it is not flexible enough, and that we cannot use the Bayesian classifier when there are many features: with a limited amount of training data we will not be able to reconstruct the probabilities $p(x \mid y)$ and $p(y)$. We will therefore proceed differently: we fix some class of functions $H$ and will look for our decision rule within it.
Example: let $H$ be the set of all linear decision rules over our two features, i.e. functions of the form $a(x) = \mathbb{I}[\,w_1 \cdot \text{color} + w_2 \cdot \text{size} + w_0 > 0\,]$. All the functions in this set differ from one another only in their coefficients $w_0, w_1, w_2$.
Now that we have fixed the class of functions $H$, the question arises: how do we choose a function with suitable coefficients from it? The answer: let us choose the function that minimizes our Bayesian risk $R(a)$. Again, the problem is that in order to compute the Bayesian risk we need to know the distribution $p(x, y)$, which is not given to us and cannot always be reconstructed. Another idea is to minimize the risk not over all possible objects, but only over the sample, i.e. to minimize the function

$$R_{emp}(a) = \frac{1}{N} \sum_{i=1}^{N} Q(a(x_i), y_i).$$
This function is called the empirical risk. The next question is: why did we decide that by minimizing the empirical risk we also minimize the Bayesian risk? Let me remind you that our practical task is to make as few classification errors as possible, and the fewer the errors, the lower the Bayesian risk. The justification for the convergence of the empirical risk to the Bayesian risk as the amount of data grows was obtained in the 1970s by two scientists, V. N. Vapnik and A. Ya. Chervonenkis.
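As a minimal sketch of this principle, here is empirical risk minimization over a small finite set of linear rules of the form introduced above; the toy sample and the coarse grid of coefficients are invented for illustration:

```python
import itertools
import numpy as np

# Toy training sample: rows are (color, size), labels are tasty (1) / not (0).
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.7, 0.6],
              [0.2, 0.3], [0.3, 0.1], [0.1, 0.4]])
y = np.array([1, 1, 1, 0, 0, 0])

def predict(w, x):
    """Linear rule a(x) = [w1 * color + w2 * size + w0 > 0]."""
    w1, w2, w0 = w
    return (w1 * x[:, 0] + w2 * x[:, 1] + w0 > 0).astype(int)

def empirical_risk(w):
    """Average zero-one loss of the rule w on the training sample."""
    return np.mean(predict(w, X) != y)

# A finite class H: all rules whose coefficients lie on a coarse grid.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
H = list(itertools.product(grid, grid, grid))

best_w = min(H, key=empirical_risk)
print(best_w, empirical_risk(best_w))  # a rule from H that fits the toy sample (risk 0.0)
```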
Convergence guarantees. Simplest case
So, we have come to the conclusion that the Bayesian classifier gives the smallest possible error, but in most cases we cannot train it, nor can we compute its error (risk). However, we can compute an approximation to the Bayesian risk, called the empirical risk, and then choose an approximating function that minimizes the empirical risk. Let us look at the simplest situation in which minimizing the empirical risk provably yields a classifier whose Bayesian risk is also nearly minimal. For this simplest case, we will have to make an assumption that rarely holds in practice but that can be weakened later: we fix a finite class of functions $H = \{a_1, \dots, a_M\}$ from which we will choose our classifier.
Theorem
Choosing a function from the class $H$ by minimizing the empirical risk on a training sample of sufficient size, we obtain, with high probability, a function whose Bayesian risk differs from the smallest Bayesian risk achievable within $H$ by only a small value.
For the precise meaning of “small value” and “sufficient size”, see the literature listed below.
Idea of proof
By the hypothesis of the theorem, we obtain a sample of $N$ independent pairs $(x_i, y_i)$ drawn from the distribution $p(x, y)$. For each fixed function $a \in H$, the empirical risk is the average of independent, identically distributed random variables whose expectation equals the true risk of $a$, so by the law of large numbers (more precisely, by a concentration inequality such as Hoeffding's) the empirical risk of $a$ is close to its true risk with high probability. Since $H$ is finite, the union bound over all of its functions shows that, with high probability, the empirical risk is close to the true risk for every function in $H$ simultaneously. Consequently, the function that minimizes the empirical risk has a true risk close to the smallest true risk achievable within $H$.
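Written out under the standard assumptions (a finite class $H$ of size $M$, a loss bounded in $[0, 1]$, and an i.i.d. sample of size $N$), the key quantitative step is the combination of the union bound with Hoeffding's inequality:

$$P\!\left(\sup_{a \in H} \big|R_{emp}(a) - R(a)\big| > \varepsilon\right) \;\le\; \sum_{a \in H} P\Big(\big|R_{emp}(a) - R(a)\big| > \varepsilon\Big) \;\le\; 2M e^{-2N\varepsilon^2}.$$

Requiring the right-hand side to be at most $\delta$ gives a sufficient sample size $N \ge \frac{1}{2\varepsilon^2}\ln\frac{2M}{\delta}$, and on the complementary event the empirical risk minimizer $\hat{a}$ satisfies $R(\hat{a}) \le \min_{a \in H} R(a) + 2\varepsilon$, which is the “small value” promised by the theorem.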
Practical results
Having a guarantee that, given a sufficiently large training sample, the function found by minimizing the empirical risk will not have a large error on previously unseen data, we can use this principle in practice, for example, as follows. We take the expression

$$\frac{1}{N} \sum_{i=1}^{N} Q(a(x_i), y_i) \;\to\; \min_{a \in H}$$
and substitute different loss functions into it, depending on the problem being solved. For linear regression, where $a(x) = \langle w, x \rangle$, we use the squared loss:

$$Q(a(x), y) = (a(x) - y)^2.$$
For logistic regression, where the model outputs the probability of class 1 as $a(x) = \sigma(\langle w, x \rangle)$ with the sigmoid $\sigma(z) = 1/(1 + e^{-z})$, we use the log loss:

$$Q(a(x), y) = -\,y \ln a(x) - (1 - y) \ln\big(1 - a(x)\big).$$
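As an illustration, and only as a sketch (the data are synthetic and plain gradient descent is just one possible optimizer), both cases reduce to minimizing the averaged loss over the sample:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))                      # features, e.g. color and size
X1 = np.hstack([X, np.ones((100, 1))])              # append a constant feature for the bias
y_reg = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)  # regression target
y_clf = (X.sum(axis=1) > 1.0).astype(float)         # binary labels 0 / 1

def minimize_empirical_risk(grad, n_features, lr=0.1, steps=2000):
    """Plain gradient descent on an empirical risk with gradient `grad`."""
    w = np.zeros(n_features)
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Linear regression: Q(a(x), y) = (a(x) - y)^2 with a(x) = <w, x>
grad_squared = lambda w: 2 * X1.T @ (X1 @ w - y_reg) / len(y_reg)
w_reg = minimize_empirical_risk(grad_squared, 3)

# Logistic regression: log loss with a(x) = sigmoid(<w, x>)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
grad_log = lambda w: X1.T @ (sigmoid(X1 @ w) - y_clf) / len(y_clf)
w_clf = minimize_empirical_risk(grad_log, 3)

print(w_reg)                                         # should be close to [1.0, 2.0, 0.0]
print(np.mean((sigmoid(X1 @ w_clf) > 0.5) == y_clf)) # training accuracy of the found rule
```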
Although the support vector machine is motivated mainly geometrically, it too can be represented as an empirical risk minimization problem.
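For instance, using the conventional $y \in \{-1, +1\}$ label encoding, the soft-margin linear SVM can be written as the regularized minimization

$$\frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \max\big(0,\ 1 - y_i \langle w, x_i \rangle\big) \;\to\; \min_{w},$$

where the sum is, up to scaling, exactly an empirical risk under the hinge loss $Q(a(x), y) = \max(0,\ 1 - y\,\langle w, x \rangle)$.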
Conclusion
Many supervised learning methods can be viewed, among other things, as special cases of the theory developed by V. N. Vapnik and A. Ya. Chervonenkis. This theory gives guarantees on the test-sample error, provided that the training sample is large enough and the hypothesis space in which we search for our algorithm satisfies certain requirements.
References
- The Nature of Statistical Learning Theory, Vladimir N. Vapnik
- Pattern Classification, 2nd Edition, Richard O. Duda, Peter E. Hart, David G. Stork
- Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz, Shai Ben-David
P.S. Please send me a private message about any inaccuracies and typos.
P.P.S. Many thanks to the authors of this article for the TeX editor.