Translation of Andrew Ng's book “Machine Learning Yearning”, Chapters 15-19
15. Evaluating several ideas in parallel during error analysis.
Your team has several ideas for improving the cat detector in your application:
- Fix the problem of your algorithm recognizing dogs as cats
- Fix the problem of your algorithm recognizing big wild cats (lions, panthers, etc.) as domestic cats
- Improve the system's performance on fuzzy (blurry) images
You can evaluate all of these ideas in parallel. I usually create a table and fill it in while looking through about 100 misclassified images from the validation (dev) sample. I also jot down brief comments that help me recall specific examples later. To illustrate the process, here is a table you might produce from a small set of examples from your validation (dev) sample:
| Picture | Dogs | Big cats | Fuzzy | Comments |
|---|---|---|---|---|
| 1 | x | | | Pitbull with unusual coloring |
| 3 | | x | x | Lion; photo taken at the zoo on a rainy day |
| 4 | | x | | Panther behind a tree |
Image 3 in the table above falls into both the Big cats and the Fuzzy categories. Because one image can be assigned to several error categories, the percentages in the bottom row do not have to add up to 100%.
You might start with a fixed set of error categories (Dogs, Big cats, Fuzzy), but while manually assigning misclassified examples to them you may decide to add new ones. For example, suppose that after looking through a dozen images you notice that the classifier makes many mistakes on Instagram images with color filters applied. You can go back, add an “Instagram” column to the table and re-categorize the errors accordingly. By manually examining the examples the algorithm gets wrong and asking yourself how you, as a human, were able to label the image correctly, you will often spot new categories of errors and may be inspired to look for new solutions.
The most useful error categories are those for which you have an idea for improving the system. For example, the “Instagram” category is most useful if you have an idea for how to undo the filters and recover the original image. But you do not have to restrict yourself to categories you already know how to fix; the purpose of error analysis is to build your intuition about the most promising directions to focus on.
Error analysis is an iterative process. Do not worry if you start without a single category in mind. After looking at a couple of images you will have a few ideas for error categories. After manually categorizing some images you may think of new categories, go back and re-examine the errors in light of them, and so on.
Suppose you finish analyzing 100 misclassified examples from the validation sample and obtain the following:
| Picture | Dogs | Big cats | Fuzzy | Comments |
|---|---|---|---|---|
| 1 | X | | | Pitbull with unusual coloring |
| 3 | | X | X | Lion; photo taken at the zoo on a rainy day |
| 4 | | X | | Panther behind a tree |
Now you know that a project to stop dogs being misclassified as cats will eliminate at most 8% of the errors. Working on Big cats or Fuzzy images could eliminate significantly more. You might therefore pick one of these two categories and focus on it. If your team has enough people to pursue several directions at once, you can ask some engineers to work on Big cats and the rest on Fuzzy images.
Error analysis does not produce a rigid mathematical formula that tells you which task should have the highest priority. You also have to weigh how much progress you expect from each error category against the amount of work needed to address it.
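The tally behind such a table can be sketched in a few lines of Python. This is only an illustration: the `rows` structure and the `category_shares` helper are hypothetical names, and the data contain just the three images described in the text, so the resulting percentages are not the ones from the 100-example analysis.

```python
from collections import Counter

# Hypothetical hand-filled rows: (picture id, error categories, comment).
# Only the three images mentioned in the text are included here.
rows = [
    (1, {"Dogs"}, "Pitbull with unusual coloring"),
    (3, {"Big cats", "Fuzzy"}, "Lion; photo taken at the zoo on a rainy day"),
    (4, {"Big cats"}, "Panther behind a tree"),
]

def category_shares(rows):
    """Return {category: fraction of reviewed errors tagged with that category}."""
    counts = Counter(cat for _, cats, _ in rows for cat in cats)
    total = len(rows)
    return {cat: n / total for cat, n in counts.items()}

shares = category_shares(rows)
for cat, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{cat}: {share:.0%}")

# One image may belong to several categories, so the shares can exceed 100% in total.
print(f"sum of shares: {sum(shares.values()):.0%}")
```

Each share is an upper bound on how much of the reviewed error a project aimed at that category could remove, which is what makes the table useful for prioritization.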
16. Cleaning up mislabeled examples in the validation and test samples.
During error analysis you may notice that some examples in your validation sample are mislabeled (assigned to the wrong class). By “mislabeled” I mean that the images were already labeled incorrectly by a human labeler before the algorithm ever saw them. That is, in the labeled example (x, y), the value of y is wrong. For example, some pictures that contain no cat may be labeled as containing a cat, and vice versa. If you suspect that the fraction of mislabeled examples is significant, add a category to keep track of them:
| Picture | Dogs | Big cats | Fuzzy | Markup error | Comments |
|---|---|---|---|---|---|
| 98 | | | | X | Mislabeled: cat in the background |
| 100 | | | | X | Painted cat (not real) |
Is it worth correcting the mislabeled data in your validation sample? Recall that the purpose of the validation sample is to help you evaluate algorithms quickly so that you can tell whether algorithm A is better than B. If the fraction of mislabeled validation examples keeps you from making such judgments, then it is worth spending time on fixing the labels in the validation sample.
For example, imagine that the accuracy shown by your classifier is as follows:
- Overall accuracy on the validation sample: 90% (10% total error)
- Error due to markup errors: 0.6% (6% of the total error on the validation sample)
- Error due to other causes: 9.4% (94% of the total error on the validation sample)
Here, the 0.6% error caused by mislabeling is probably not significant next to the 9.4% of errors that you could be improving. Manually fixing the mislabeled examples in the validation sample will do no harm, but it is not crucial: it hardly matters whether your system's true total error is 9.4% or 10%.
Suppose you keep improving the cat classifier and reach the following performance:
- Overall accuracy on the validation sample: 98% (2% total error)
- Error due to markup errors: 0.6% (30% of the total error on the validation sample)
- Error due to other causes: 1.4% (70% of the total error on the validation sample)
Now 30% of your error comes from mislabeled images in the validation sample, which noticeably distorts your estimate of the system's accuracy. In this case it is worth improving the labels in the validation sample. Fixing the mislabeled examples will help you figure out whether your classifier's error is closer to 1.4% or to 2%, and that is a significant relative difference.
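The arithmetic behind the two scenarios is a single division. A minimal sketch, in which `mislabel_share` is a hypothetical helper name and the inputs are the figures from the text:

```python
def mislabel_share(total_error, mislabel_error):
    """Fraction of the total validation error that is due to mislabeled examples."""
    if not 0 < total_error <= 1:
        raise ValueError("total_error must be in (0, 1]")
    if not 0 <= mislabel_error <= total_error:
        raise ValueError("mislabel_error must be between 0 and total_error")
    return mislabel_error / total_error

# Same 0.6% of mislabeled examples, before and after the classifier improves.
for total_error in (0.10, 0.02):
    share = mislabel_share(total_error, 0.006)
    print(f"total error {total_error:.0%}: {share:.0%} of it comes from markup errors")
```

The absolute amount of label noise is the same in both cases; only its share of the remaining error changes, and that share is what tells you whether cleaning the labels is worth the effort.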
Mislabeled images in a validation or test sample often start to draw your attention only after your system has improved so much that the error due to mislabeled examples has grown relative to the total error on these samples.
The previous chapter explained how you can reduce errors in categories such as Dogs, Big cats and Fuzzy by improving your algorithms. In this chapter you have learned that you can also work on the “Markup error” category, and improve quality, by improving the data labels.
Whatever process you apply to fixing the labels of the validation sample, remember to apply it to the test sample too, so that your validation and test samples continue to come from the same distribution. Fixing both together prevents the problem discussed in Chapter 6, where your team optimizes the algorithm's performance on the validation sample only to realize later that it is being judged on a different test sample.
If you decide to improve label quality, consider double-checking both the labels of examples that your system misclassified and the labels of examples it classified correctly. The original label and your learning algorithm may both have been wrong on the same example. If you fix only the labels of examples that your system misclassified, you may introduce bias into your evaluation. With 1000 validation examples and a classifier that is 98.0% accurate, it is easier to check the 20 misclassified examples than all 980 correctly classified ones. Because in practice it is easier to check only the misclassified examples, bias does creep into some validation samples. This bias is acceptable if you are only interested in building an application, but it is a problem if you plan to use the result in an academic research paper or need a completely unbiased measure of accuracy on the test sample.
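The bias described above can be seen on a toy example. Everything in this sketch is made up for illustration: the ten labels and predictions, and the helper names `error_rate` and `relabel`. It compares fixing labels only where the classifier disagreed with the recorded label against fixing all labels.

```python
def error_rate(preds, labels):
    """Fraction of examples where the prediction disagrees with the label."""
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)

def relabel(labels, true_labels, indices):
    """Return a copy of labels with the given positions corrected."""
    fixed = list(labels)
    for i in indices:
        fixed[i] = true_labels[i]
    return fixed

# Hypothetical toy data (10 examples).
true_y = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]  # examples 8 and 9 are mislabeled
preds  = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]  # example 3 is a real mistake;
                                         # example 9 is wrong but matches its bad label

# Option A: review only the examples counted as misclassified.
misclassified = [i for i, (p, y) in enumerate(zip(preds, labels)) if p != y]
partially_fixed = relabel(labels, true_y, misclassified)

# Option B: review every example.
fully_fixed = relabel(labels, true_y, range(len(labels)))

print("error after fixing only misclassified examples:", error_rate(preds, partially_fixed))
print("error after fixing all labels:", error_rate(preds, fully_fixed))
```

In this toy case the partial cleanup reports a lower error than the full cleanup, because the example on which the label and the classifier were both wrong is never reviewed.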
17. If you have a large validation sample, split it into two subsets and look at only one of them.
Suppose you have a large validation sample of 5,000 examples on which the error rate is 20%. Your algorithm therefore misclassifies about 1,000 validation images. Examining 1,000 images by hand takes a long time, so you might decide not to use all of them for error analysis.
In this case, I would explicitly split the validation sample into two subsets, one that you look at and one that you do not. You will overfit more quickly to the subset that you analyze manually. You can use the subset you do not examine manually to tune the model's parameters.
Let's continue the example above, in which the algorithm misclassifies 1,000 of the 5,000 validation examples. Suppose you want to analyze about 100 errors (10% of all errors on the validation sample). You should randomly select 10% of the validation sample and place it in what we will call the “Eyeball validation sample” (Eyeball dev set). The name reminds us that we study these examples with our own eyes.
Translator's note: to my ear, the term “eyeball sample” sounds awkward (especially in Russian). But with all due respect to Andrew (and given that I could not come up with anything better), I will keep this term.
(For a speech recognition project, in which you would listen to audio clips, you might call it an “Ear validation sample” instead.) The Eyeball validation sample therefore contains 500 examples, of which about 100 should be misclassified. The second subset, which we call the Blackbox validation sample (Blackbox dev set), contains the remaining 4,500 examples. You can use the Blackbox sample to evaluate classifiers automatically by measuring their error rates. You can also use it to choose between algorithms or to tune hyperparameters. However, you should avoid looking at its examples with your own eyes. We use the term “Blackbox” because we use this subset only as a “black box” for assessing the quality of classifiers. (Translator's note: that is, as an object whose internal structure is unknown to us.)
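A random split of this kind can be sketched as follows. The function name `split_eyeball_blackbox`, the fixed seed and the use of integer ids in place of real examples are all assumptions made for the illustration.

```python
import random

def split_eyeball_blackbox(examples, eyeball_fraction=0.10, seed=0):
    """Shuffle the validation examples and split them into (eyeball, blackbox)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_eyeball = round(len(shuffled) * eyeball_fraction)
    return shuffled[:n_eyeball], shuffled[n_eyeball:]

# Hypothetical validation sample: 5,000 example ids standing in for the images.
validation_ids = range(5000)
eyeball, blackbox = split_eyeball_blackbox(validation_ids)
print(len(eyeball), len(blackbox))
```

With a 20% error rate, the 500 examples in the Eyeball part should contain roughly 100 misclassified images, while the 4,500 Blackbox examples are only ever scored automatically.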
Why do we explicitly split the validation sample into an Eyeball sample and a Blackbox sample?
As you gain intuition about the examples in the Eyeball sample, you become more likely to overfit to it. The Blackbox sample lets you monitor this overfitting. If you see performance on the Eyeball sample improving much faster than performance on the Blackbox sample, you have probably overfit to the Eyeball sample. In that case you may need to discard it and build a new Eyeball sample, either by moving more examples from the Blackbox sample into it or by acquiring a new batch of labeled data.
Explicitly splitting the validation sample into Eyeball and Blackbox subsets thus lets you tell when manual error analysis is causing you to overfit to the Eyeball portion of your data.
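One simple way to watch for this is to compare how much the error has dropped on each subset. The sketch below is only an assumption about how such a check might look: the helper names, the error histories and the 2-percentage-point `tolerance` are all hypothetical, and the text itself gives no numeric threshold.

```python
def improvement_gap(eyeball_errors, blackbox_errors):
    """Error reduction on Eyeball minus error reduction on Blackbox since the first run."""
    eyeball_gain = eyeball_errors[0] - eyeball_errors[-1]
    blackbox_gain = blackbox_errors[0] - blackbox_errors[-1]
    return eyeball_gain - blackbox_gain

def looks_overfit_to_eyeball(eyeball_errors, blackbox_errors, tolerance=0.02):
    """Flag the case where Eyeball improves noticeably faster than Blackbox."""
    return improvement_gap(eyeball_errors, blackbox_errors) > tolerance

# Hypothetical error rates recorded after successive model iterations.
eyeball_history = [0.20, 0.15, 0.10]
blackbox_history = [0.20, 0.18, 0.17]
print(looks_overfit_to_eyeball(eyeball_history, blackbox_history))
```

A flag like this is only a prompt to refresh the Eyeball sample, not a statistical test; small samples will show gaps of a few points by chance alone.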
18. How big should the Eyeball and Blackbox samples be?
Your Eyeball sample should be large enough to reveal your algorithm's major error categories. If you are working on a task that humans do well (such as recognizing cats in images), here are some rough guidelines:
- An Eyeball sample in which your classifier makes only 10 mistakes would be considered very small. With just 10 errors it is hard to estimate accurately how much each error category affects the classifier's quality. But if you have very little data and cannot add more examples to the Eyeball sample, it is still better than nothing and will help with prioritizing work on the project.
- If your classifier makes about 20 mistakes on the Eyeball sample, you will start to get a rough sense of the main sources of error.
- With about 50 mistakes, you will get a good sense of the main sources of error.
- With about 100 mistakes, you will get a very good sense of where the main errors come from. I have seen people manually analyze even more errors, sometimes as many as 500. There is no harm in this as long as you have enough data.
Suppose your classifier's error rate is 5%. To be reasonably sure of getting about 100 misclassified examples in the Eyeball sample, the sample needs to contain about 2,000 examples (since 0.05 * 2000 = 100). The lower your classifier's error rate, the larger the Eyeball sample must be to yield enough errors to analyze.
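The sizing rule above is the target number of errors divided by the error rate. A minimal sketch, where `eyeball_size_for` is a hypothetical helper and the 20% and 1% rows are extra illustrative inputs:

```python
import math

def eyeball_size_for(target_errors, error_rate):
    """Approximate Eyeball sample size expected to contain target_errors mistakes."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    # Round first so that floating-point noise does not push ceil() up by one.
    return math.ceil(round(target_errors / error_rate, 6))

for rate in (0.20, 0.05, 0.01):
    print(f"error rate {rate:.0%}: about {eyeball_size_for(100, rate)} examples")
```

This gives the expected count only; the actual number of errors in a random sample of that size will vary somewhat around the target.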
If you are working on a task that even humans find hard, examining an Eyeball sample will not be as helpful, because it is harder to figure out why the algorithm failed to classify an example correctly. In this case you might omit the Eyeball sample. We discuss guidelines for such problems in later chapters.
What about the Blackbox sample? We said earlier that validation samples of around 1,000-10,000 examples are common. To refine that statement: a Blackbox sample of 1,000-10,000 examples will often give you enough data to tune hyperparameters and select among models, though having more data does no harm. A Blackbox sample of 100 examples is small, but still useful (better than nothing).
If you have a small validation sample, you may not have enough data to split it into Eyeball and Blackbox samples that are both large enough to serve the purposes described above. In that case you may have to use your entire validation sample as the Eyeball sample. That is, you will manually examine all of the validation data.
I consider the Eyeball sample more important than the Blackbox sample (assuming you are working on a problem that humans solve well, so that examining the examples by hand helps you gain insight into your data). If you have only an Eyeball sample, you can perform error analysis, model selection and hyperparameter tuning all on that sample. The downside of having only an Eyeball sample is a greater risk of overfitting the validation sample.
If you have plenty of data, the size of the Eyeball sample will be determined mainly by how much time you can spend analyzing examples by hand. For example, I have rarely seen anyone manually analyze more than 1,000 errors.
19. Conclusions: Basic Error Analysis
- When you start a new project, especially in an area where you are not an expert, it is hard to guess the most promising directions correctly.
- So do not start by trying to design and build the perfect system. Instead, build and train a basic system as quickly as possible, perhaps in a few days. Then use error analysis to identify the most promising directions and iteratively improve your algorithm from there.
- Carry out error analysis by manually examining about 100 validation examples that the algorithm misclassifies and counting the major categories of errors. Use this information to prioritize which types of errors to work on.
- Consider splitting the validation sample into an Eyeball sample, which you examine manually, and a Blackbox sample, which you do not look at. If performance on the Eyeball sample is much better than on the Blackbox sample, you have overfit to the Eyeball sample and should consider acquiring more data for it.
- The Eyeball sample should be large enough that your algorithm misclassifies enough examples for you to analyze. A Blackbox sample of 1,000-10,000 examples is sufficient for many applications.
- If your validation sample is not big enough to split this way, use the entire validation sample as an Eyeball sample for manual error analysis, model selection and hyperparameter tuning.