Translation of Andrew Ng's Machine Learning Yearning, Chapters 20-27

20 Bias and Variance: The Two Major Sources of Error

Translator's note: Before this revision, the chapter was titled "Systematic and Random: Two Major Sources of Error"; that is, I translated bias and variance as "systematic error" and "random error". However, forum member robot @ Phaker rightly pointed out in the comments that Russian-language machine learning terminology has established equivalents for these terms, which translate literally as "offset" (bias) and "scatter" (variance). I checked the work of K. V. Vorontsov, who is deservedly regarded as one of the authorities on machine learning in Russia, as well as the resources of the professional community, and agreed with the comment. From my point of view there is a deep analogy between the bias and variance of a learning algorithm and the systematic and random errors of a physical experiment, and the two pairs are expressed by the same mathematics. Even so, it is more correct to use the terms established in the field. I have therefore revised the translation of this and the following chapters, replacing "systematic and random errors" with the established terms for bias and variance, and I will stick to this approach from now on.

Suppose your training, validation, and test sets all come from the same distribution. Then you should always try to get more training data, since that can only improve the algorithm's performance, right?

Although more data cannot hurt, unfortunately it does not always help as much as you might hope. In some cases, collecting additional data is a waste of effort. So how do you decide when to add data and when not to bother?

There are two major sources of error in machine learning: bias and variance. Understanding them will help you decide whether adding data is worthwhile, and will also help you choose other tactics for improving the classifier.

Suppose you hope to build a cat recognizer with a 5% error rate. Currently, your classifier has a 15% error rate on the training set and 16% on the validation set. In this case, adding training data is unlikely to help much. You should focus on other changes to the system. In fact, adding more examples to the training set only makes it harder for your algorithm to do well on that set (a later chapter explains why).

If your error rate on the training set is 15% (85% accuracy) but your target is 5% error (95% accuracy), then the first problem to solve is the algorithm's performance on the training set. Performance on the validation/test set is usually worse than on the training set. So an approach that reaches only 85% accuracy on examples the algorithm has already seen cannot be expected to reach 95% accuracy on examples it has never seen.

Suppose, as above, that your algorithm has a 16% error rate (84% accuracy) on the validation set. We break the 16% error into two components:

First, the algorithm's error rate on the training set. In this example, it is 15%. We think of this informally as the algorithm's bias.
Second, how much worse the algorithm does on the validation (or test) set than on the training set. In this example, it does 1% worse on the validation set than on the training set. We think of this informally as the algorithm's variance.
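
The two components above can be sketched as a small helper. This is only an illustration of the informal definitions in this chapter; the function name `decompose_error` and the use of percentage points as inputs are assumptions made for the example.

```python
def decompose_error(train_error, val_error):
    """Split validation error into the chapter's informal bias and variance.

    Both arguments are error rates in percent (hypothetical convention).
    """
    bias = train_error                   # error on the training set
    variance = val_error - train_error   # extra error on the validation set
    return {"bias": bias, "variance": variance}


# The cat recognizer from this chapter: 15% training error, 16% validation error.
parts = decompose_error(train_error=15.0, val_error=16.0)
print(parts)
```

By construction the two parts add back up to the validation error, which is what makes the split useful for deciding where to spend effort.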

Author's note: Statistics has more formal definitions of bias and variance, which we will not worry about here. Roughly, the bias is the error rate of your algorithm on the training set when you have a very large training set. The variance is how much worse the algorithm does on the test set than on the training set with the same parameter settings. When the error metric is mean squared error, you can write down formulas for these two quantities and prove that total error = bias + variance. But for the purpose of deciding how to improve a machine learning algorithm, the more informal definitions given here will suffice.
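
For readers who want to see the formal statement the note alludes to, here is the standard textbook decomposition for squared error. It is not taken from the book; the symbols are assumptions for this sketch: the data follow y = f(x) + ε with zero-mean noise of variance σ², and the expectation is over training sets and noise. In this formal version the bias enters squared and there is an extra irreducible noise term, so the note's "bias + variance" should be read as an informal summary.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```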

Some changes to a learning algorithm address the first component of error, bias, and improve its performance on the training set. Other changes address the second component, variance, and help the algorithm generalize better from the training set to the validation and test sets. To select the most effective changes to make to the system, it is extremely useful to understand how much each of these two components contributes to the overall error.

Author's note: There are also approaches that reduce bias and variance simultaneously by making major changes to the system architecture. But they are usually harder to find and implement.

Developing good intuition about how much bias and how much variance contribute to the error will help you choose effective ways to improve your algorithm.

21 Examples of Bias and Variance

Consider our cat classification task. An "ideal" classifier (for example, a human) can achieve nearly perfect performance on this task.

Suppose that the quality of our algorithm is as follows:

Error on the training set = 1%
Error on the validation set = 11%

What is wrong with this classifier? Applying the definitions from the previous chapter, we estimate the bias at 1% and the variance at 10% (= 11% - 1%). Thus, our algorithm has high variance. The classifier has a very low error rate on the training set but fails to generalize to the validation set. In other words, it is overfitting.

Now consider the following situation:

Error on the training set = 15%
Error on the validation set = 16%

We estimate the bias at 15% and the variance at 1%. This classifier fits the training set poorly, while its error on the validation set is only slightly higher than on the training set. Thus, this classifier has high bias but low variance. We say that this algorithm is underfitting.

Consider the following error rates:

Error on the training set = 15%
Error on the validation set = 30%

In this case, the bias is 15% and the variance is also 15%. This classifier has both high bias and high variance: it does poorly on the training set, and it does much worse still on the validation set. The overfitting/underfitting terminology is hard to apply here, since the classifier is simultaneously overfitting and underfitting.

Finally, consider the following situation:

Error on the training set = 0.5%
Error on the validation set = 1%

This classifier is doing well: it has low bias and low variance. Congratulations to the engineers on a great result!
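
The four cases in this chapter can be reproduced with a short sketch. The function name `diagnose` and the cutoff of 5 percentage points for calling a component "high" are assumptions made for illustration; the chapter itself does not give a numeric threshold.

```python
def diagnose(train_error, val_error, threshold=5.0):
    """Label a classifier using the informal bias/variance estimates.

    Errors are in percent. `threshold` is a hypothetical cutoff, not a
    value from the text.
    """
    bias = train_error
    variance = val_error - train_error
    high_bias = bias > threshold
    high_variance = variance > threshold
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "low bias and low variance"


# The four (training error, validation error) pairs from this chapter.
for train, val in [(1.0, 11.0), (15.0, 16.0), (15.0, 30.0), (0.5, 1.0)]:
    print(train, val, diagnose(train, val))
```

In a real project the cutoff would depend on the target error rate and, as the next chapter explains, on the optimal error rate for the task.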

22 Comparing to the Optimal Error Rate

In our cat recognition example, the "ideal" error rate, that is, the rate achievable by an "optimal" classifier, is close to 0%. A person looking at a picture can almost always tell whether it contains a cat, so we can hope that sooner or later a machine will do just as well.

Other problems are harder. For example, suppose you are building a speech recognition system and find that 14% of the audio clips have so much background noise, or speech so unintelligible, that even a human cannot make out what was said. In this case, even the most "optimal" speech recognition system might have an error rate of around 14%.

Suppose that on this speech recognition problem our algorithm achieves the following results:

Error on the training set = 15%
Error on the validation set = 30%

Performance on the training set is already close to the optimal error rate of 14%. Thus, there is not much room to reduce bias, that is, to improve performance on the training set. However, the algorithm does not generalize well to the validation set, so there is ample room to reduce the error due to variance.

This example is similar to the third example from the previous chapter, which also had a training error of 15% and a validation error of 30%. If the optimal error rate is about 0%, then a training error of 15% leaves much room for improvement, which suggests that bias-reducing changes could be very fruitful. But if the optimal error rate is 14%, then the same training error of about 15% tells us that there is almost no room left to reduce bias.

For problems where the optimal error rate is far from zero, here is a more detailed breakdown of an algorithm's error. Continuing with the speech recognition example above, the total validation error of 30% can be broken down as follows (a similar analysis applies to the test set error):

Optimal error rate ("unavoidable bias"): 14%. Suppose we decide that even the best possible speech recognition system in the world would still have a 14% error rate. We can think of this as the "unavoidable" part of a learning algorithm's bias.
Avoidable bias: 1%. This is calculated as the difference between the training error and the optimal error rate.

Author's note: If this number is negative, your algorithm is doing better on the training set than the "optimal" error rate. This means you are overfitting the training set: the algorithm has memorized the training examples and their labels. In this case, you should focus on variance reduction methods rather than on further reducing bias.

Variance: 15%. The difference between the validation error and the training error.

To relate this to our earlier definitions, bias and avoidable bias are related as follows:

Bias = Optimal error rate ("unavoidable bias") + Avoidable bias

Author's note: These definitions are chosen to convey insight into how to improve your learning algorithm. They differ from the formal definitions of bias and variance used in statistics. Technically, what I define here as "bias" should be called "error attributed to bias", and "avoidable bias" should be called "error attributed to the learning algorithm's bias in excess of the optimal error rate".

The avoidable bias reflects how much worse your algorithm performs on the training set than the "optimal classifier".

The concept of variance remains the same as before. In theory, we can always reduce variance to nearly zero by training on a sufficiently large training set. Thus, all variance is "avoidable" given enough data, and there is no such thing as "unavoidable variance".

Consider another example in which the optimal error is 14% and we have:

Error on the training set = 15%
Error on the validation set = 16%

In the previous chapter, we called this a high-bias classifier. Now we would say that the avoidable bias is 1% and the variance is about 1%. Thus, the algorithm is already doing well, with little room for improvement. Its error rate is only 2% worse than the optimal error rate.
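
The three-way breakdown used in this chapter can be written out as a short sketch. The function name `decompose_with_optimal` and the dictionary keys are assumptions for the example; the numbers are the two speech recognition cases discussed above, in percent.

```python
def decompose_with_optimal(optimal_error, train_error, val_error):
    """Break validation error into unavoidable bias, avoidable bias, variance."""
    return {
        "unavoidable_bias": optimal_error,
        "avoidable_bias": train_error - optimal_error,
        "variance": val_error - train_error,
    }


# Optimal error 14%, training error 15%, validation error 30%.
first = decompose_with_optimal(14.0, 15.0, 30.0)
# Optimal error 14%, training error 15%, validation error 16%.
second = decompose_with_optimal(14.0, 15.0, 16.0)
print(first)
print(second)
```

The three parts always sum to the validation error, and a negative `avoidable_bias` would signal overfitting to the training set, as the author's note in this chapter points out.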

These examples show that knowing the optimal error rate is helpful for deciding what to do next. In statistics, the optimal error rate is also called the Bayes error rate.

How do we find out what the optimal error rate is? For tasks that humans do well, such as recognizing images or transcribing audio clips, you can ask people to label the data and then measure the accuracy of the human labels on your training set. This gives an estimate of the optimal error rate. If you are working on a problem that even humans find hard (for example, predicting which film to recommend or which ad to show to a user), the optimal error rate can be hard to estimate.
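
Measuring the accuracy of human labels can be sketched as follows. This assumes that trusted reference labels are available for the same examples; the function name `human_error_rate` and the ten labels below are made up purely for illustration.

```python
def human_error_rate(human_labels, reference_labels):
    """Percentage of examples where the human label disagrees with the reference."""
    if len(human_labels) != len(reference_labels):
        raise ValueError("label lists must have the same length")
    if not reference_labels:
        raise ValueError("need at least one labeled example")
    mistakes = sum(h != r for h, r in zip(human_labels, reference_labels))
    return 100.0 * mistakes / len(reference_labels)


# Hypothetical data: 1 = cat, 0 = no cat.
reference = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
human = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]  # one disagreement, at index 5
estimate = human_error_rate(human, reference)
print(estimate)
```

With so few examples the estimate is very rough; a larger labeled subset would give a more reliable figure.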

In the section "Comparing to Human-Level Performance" (Chapters 33 to 35), I will discuss in more detail the process of comparing a learning algorithm's performance to human-level performance.

In the last few chapters, you learned how to estimate avoidable/unavoidable bias and variance by looking at the classifier's error rates on the training and validation sets. The next chapter discusses how to use the results of such an analysis to decide whether to prioritize techniques that reduce bias or techniques that reduce variance. The techniques for the two are very different, so what you should try in your project depends heavily on whether high bias or high variance is currently the problem.

Read on!

23 Addressing Bias and Variance

Here is the simplest formula for addressing bias and variance:

If you have high avoidable bias, increase the size of your model (for example, enlarge your neural network by adding layers and/or neurons).
If you have high variance, add data to your training set.

If you are able to increase the neural network size and the amount of training data without limit, you can do very well on many machine learning problems.
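
The simple formula above can be sketched as a decision helper. Comparing the two components directly and picking the larger one is a simplification made for this example, and the function name `suggest_next_step` is hypothetical; all errors are in percent.

```python
def suggest_next_step(train_error, val_error, optimal_error=0.0):
    """Pick the remedy for whichever error component is larger."""
    avoidable_bias = train_error - optimal_error
    variance = val_error - train_error
    if avoidable_bias > variance:
        return "increase the model size"
    if variance > avoidable_bias:
        return "add training data"
    return "both: increase the model size and add training data"


print(suggest_next_step(15.0, 16.0))  # avoidable bias dominates
print(suggest_next_step(1.0, 11.0))   # variance dominates
print(suggest_next_step(15.0, 30.0))  # both components are equally large
```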

In practice, increasing the size of a model will eventually run into computational problems, because training very large models is slow. You may also exhaust your ability to acquire more training data. (Even across the whole Internet, the number of cat pictures is finite!)

Different model architectures, for example different neural network architectures, will have different amounts of bias and variance on your problem. A wave of recent deep learning research has produced many innovative neural network architectures. So if you are using neural networks, the academic literature can be a great source of inspiration. There are also many excellent open-source implementations, for example on GitHub. However, the results of trying new architectures are much less predictable than the simple formula above: increase the model size and add data.

Increasing the model size generally reduces bias, but it may also increase variance and the risk of overfitting. However, this overfitting problem usually arises only when you are not using regularization. If you include a well-designed regularization method, you can usually increase the model size safely without increasing overfitting.

Suppose you are applying deep learning with L2 regularization or dropout (Translator's note: you can read about dropout here, for example: https://habr.com/company/wunderfund/blog/330814/ ), with the regularization parameter that performs best on the validation set. If you increase the model size, performance usually stays the same or improves; it is unlikely to worsen significantly. The only reason to avoid using a bigger model is the increased computational cost.
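
To make the effect of an L2 penalty concrete, here is a toy sketch. It is a one-parameter linear model fitted in closed form, not a neural network, and the function name `fit_slope` and the data are assumptions chosen for illustration. The penalty `l2 * w**2` is added to the squared error, which pulls the fitted weight toward zero.

```python
def fit_slope(xs, ys, l2=0.0):
    """Slope w minimizing sum((y - w*x)**2) + l2 * w**2 (line through origin)."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + l2
    return numerator / denominator


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(fit_slope(xs, ys))           # no regularization
print(fit_slope(xs, ys, l2=14.0))  # strong regularization shrinks the slope
```

The regularization strength plays the role of the parameter that the paragraph above suggests tuning on the validation set.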

24 The Bias vs. Variance Tradeoff

You may have heard of the "bias vs. variance tradeoff". Of the changes you could make to most learning algorithms, some reduce bias at the cost of increasing variance, and vice versa. This creates a "tradeoff" between bias and variance.

For example, increasing the size of your model (adding neurons and/or layers to a neural network, or adding input features) generally reduces bias but can increase variance. Conversely, adding regularization generally increases bias but reduces variance.

Today, we often have access to plentiful data and enough computing power to train very large neural networks (deep learning). So the tradeoff is less acute, and there are now more options for reducing bias without hurting variance much, and vice versa.

For example, you can usually increase the size of a neural network and tune the regularization so as to reduce bias without noticeably increasing variance. Adding training data also usually reduces variance without affecting bias.

If you select a model architecture that is well suited to your task, you may also reduce bias and variance simultaneously. But selecting such an architecture can be difficult.

In the next few chapters, we discuss additional specific techniques for addressing bias and variance.

25 Techniques for Reducing Avoidable Bias

If your learning algorithm suffers from high avoidable bias, you might try the following techniques:

Increase the model size (such as the number of neurons/layers): This technique reduces bias, since it should allow you to fit the training set better. If you find that this increases variance, use regularization, which will usually eliminate the increase in variance.
Modify input features based on insights from error analysis: Suppose your error analysis inspires you to create additional features that help the algorithm eliminate a particular category of errors (we discuss this further in the next chapters). These new features could help with both bias and variance. In theory, adding more features could increase the variance; if that happens, use regularization, which will usually eliminate the increase in variance.
Reduce or eliminate regularization (L2 regularization, L1 regularization, dropout): This technique reduces avoidable bias but increases variance.
Modify the model architecture (such as the neural network architecture) so that it is more suitable for your problem: This technique can affect both bias and variance.

One method that is not helpful:

Add more training data: This technique helps with variance problems, but it usually has no significant effect on bias.

26 Error Analysis on the Training Set

Your algorithm must perform well on the training set before you can expect it to perform well on the validation/test sets.

In addition to the techniques described earlier for addressing high bias, I sometimes also carry out an error analysis on the training data, following the same approach as error analysis on the Eyeball validation set. This can be useful if your algorithm has high bias, that is, if it is not fitting the training set well.

For example, suppose you are building a speech recognition system for an app and have collected a training set of audio clips from volunteers. If your system is not doing well on the training set, you might listen to a set of about 100 examples that the algorithm handles poorly, to understand the major categories of training set errors. As with validation set error analysis, you can count the errors in different categories:

Audio clip	Loud background noise	User spoke too fast	Too far from microphone	Comments
1	X			Car noise
2	X		X	Restaurant noise
3		X	X	User shouts across the room
4	X			Cafe noise
% of total	75%	25%	50%

In this example, you might realize that your algorithm is having a particularly hard time with training examples that have a lot of background noise. Thus, you might focus on techniques that allow it to do better on training examples with background noise.
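
The per-category percentages in the table can be computed with a short tally. The data structure and the category identifiers below are assumptions chosen for this sketch; the four clips mirror the rows of the table above.

```python
from collections import Counter

# Each misrecognized clip is tagged with every error category that applies.
clips = [
    {"id": 1, "tags": {"loud_background_noise"}},
    {"id": 2, "tags": {"loud_background_noise", "too_far_from_microphone"}},
    {"id": 3, "tags": {"user_spoke_too_fast", "too_far_from_microphone"}},
    {"id": 4, "tags": {"loud_background_noise"}},
]


def tally(clips):
    """Return the percentage of clips that fall into each error category."""
    counts = Counter(tag for clip in clips for tag in clip["tags"])
    return {tag: 100.0 * n / len(clips) for tag, n in counts.items()}


percentages = tally(clips)
for tag, pct in sorted(percentages.items()):
    print(f"{tag}: {pct:.0f}%")
```

Because a clip can carry several tags, the percentages need not sum to 100%.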

You might also double-check whether a person can transcribe these audio clips, by having them listen to the same recordings the learning algorithm was given. If there is so much background noise that no one can make out what was said, then it may be unreasonable to expect any algorithm to recognize such utterances correctly. We will discuss the benefits of comparing your algorithm to human-level performance in later chapters.

27 Techniques for Reducing Variance

If your learning algorithm suffers from high variance, you might try the following techniques:

Add more training data: This is the simplest and most reliable way to address variance, as long as you have access to significantly more data and enough computing power to process it.
Add regularization (L2 regularization, L1 regularization, dropout): This technique reduces variance but increases bias.
Add early stopping (that is, stop gradient descent early, based on the validation set error): This technique reduces variance but increases bias. Early stopping behaves much like a regularization method, and some authors classify it as one.
Select features to decrease the number/type of input features: This technique might help with variance problems, but it might also increase bias. Reducing the number of features slightly (say, from 1,000 features to 900) is unlikely to have a large effect on bias. Reducing it significantly (say, from 1,000 features to 100, a 10x reduction) is more likely to have a significant effect, as long as you are not excluding too many useful features. In modern deep learning, when data is plentiful, there has been a shift away from feature selection; we are now more likely to give the algorithm all the features we have and let it sort out which ones to use based on the data. But when your training set is small, feature selection can be very useful.
Decrease the model size (such as the number of neurons/layers): Use with caution! This technique could decrease variance while possibly increasing bias. However, I do not recommend it for addressing variance. Adding regularization usually gives better classification performance. The advantage of a smaller model is lower computational cost, which speeds up training. If speeding up training is useful, then by all means consider decreasing the model size. But if your only goal is to reduce variance and computational cost is not a concern, consider adding regularization instead.
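
Early stopping, mentioned in the list above, can be sketched as a rule applied to the sequence of validation errors recorded after each epoch. The function name `early_stopping_epoch`, the `patience` parameter, and the error values are assumptions for this illustration; real training loops usually also save the model weights from the best epoch.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the best epoch seen before training is stopped.

    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs.
    """
    best_epoch = 0
    best_error = float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop here
    return best_epoch


# Hypothetical validation errors (percent) after each epoch.
history = [30.0, 22.0, 18.0, 17.0, 17.5, 18.2, 16.0]
stop_at = early_stopping_epoch(history, patience=2)
print(stop_at)
```

In this made-up history the rule stops before the later drop to 16.0 is ever seen, which illustrates why early stopping trades some bias for lower variance.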

Here are two additional tactics, repeated from the previous chapter on addressing bias:

Modify input features based on insights from error analysis: Suppose your error analysis inspires you to create additional features that help the algorithm eliminate a particular category of errors. These new features could help with both bias and variance. In theory, adding more features could increase the variance; if that happens, use regularization, which will usually eliminate the increase in variance.
Modify the model architecture (such as the neural network architecture) so that it is more suitable for your problem: This technique can reduce both bias and variance.
