Introduction to Neural Networks
Artificial neural networks are now at the peak of their popularity. One may wonder whether the impressive name played a role in the marketing and adoption of this model. I know business managers who happily mention that their products use "artificial neural networks" and "deep learning". Would they be as pleased if their products used "models of connected circles" or "make-a-mistake-and-get-punished machines"? But, without a doubt, artificial neural networks are a worthwhile thing, as their success in many application areas shows: image recognition, natural language processing, automated trading, and autonomous cars. I am a specialist in data processing and analysis, yet I did not really understand them, so I felt like a craftsman who had not mastered one of his tools.
The R code for the examples presented in this article can be found here, in the Machine Learning Problem Bible. In addition, after reading this article it is worth exploring Part 2, Neural Networks - A Worked Example, which details the creation and programming of a neural network from scratch.
We will start with a motivating task. We have a set of images in grayscale, each of which is a 2 × 2 pixel grid in which each pixel has a brightness value from 0 (white) to 255 (black). Our goal is to create a model that will find images with a “staircase” pattern.
At this stage, we are only interested in finding a model that could fit the data sensibly. The fitting procedure itself will interest us later.
Preliminary processing
In each image we label the pixels $x_1$, $x_2$, $x_3$, $x_4$ and generate the input vector $x = [x_1, x_2, x_3, x_4]$, which will be fed to our model. We expect our model to predict True (the image contains the staircase pattern) or False (the image does not contain the staircase pattern).
ImageId | x1 | x2 | x3 | x4 | IsStairs |
---|---|---|---|---|---|
1 | 252 | 4 | 155 | 175 | TRUE |
2 | 175 | 10 | 186 | 200 | TRUE |
3 | 82 | 131 | 230 | 100 | FALSE |
... | ... | ... | ... | ... | ... |
498 | 36 | 187 | 43 | 249 | FALSE |
499 | 1 | 160 | 169 | 242 | TRUE |
500 | 198 | 134 | 22 | 188 | FALSE |
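To make the setup concrete, here is a minimal sketch, in R (the language of the article's companion code), of how the first few training samples from the table above could be represented. The data frame name `train` and the pixel-position mapping in the comments are illustrative choices, inferred from how the pixels are referred to later in the text.

```r
# Illustrative representation of the first few training samples from the table above.
# Pixel layout assumed from the text: x1 = top-left, x2 = top-right, x3 = bottom-left, x4 = bottom-right.
train <- data.frame(
  ImageId  = c(1, 2, 3),
  x1       = c(252, 175, 82),
  x2       = c(4, 10, 131),
  x3       = c(155, 186, 230),
  x4       = c(175, 200, 100),
  IsStairs = c(TRUE, TRUE, FALSE)
)
train
```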
Single-layer perceptron (model iteration 0)
We can build a simple model consisting of a single-layer perceptron. The perceptron uses a weighted linear combination of the inputs to return a prediction score. If the prediction score exceeds a chosen threshold, the perceptron predicts True; otherwise it predicts False. More formally,

$$\hat{y} = \begin{cases} 1 & \text{if } w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 > threshold \\ 0 & \text{otherwise} \end{cases}$$

Let's put it another way, folding the threshold into a bias term $b = -threshold$:

$$\hat{y} = \begin{cases} 1 & \text{if } w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

Here $\hat{y}$ is our estimate of the target $y$.
Graphically, we can represent the perceptron as input nodes that transmit data to the output node.
For our example, we will construct the following perceptron:
Here's how the perceptron will work on some of the training images.
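Since the figure with the fitted weights is not reproduced here, the R sketch below uses placeholder weights; the only value anchored in the text is the coefficient on $x_3$ (0.002), chosen so that a +60 increase in $x_3$ raises the score by +0.12, as discussed in the cases below.

```r
# Illustrative single-layer perceptron for the 2x2 "staircase" images.
# The weights are placeholders, not the article's fitted values; w3 = 0.002 matches
# the "+60 in x3 gives +0.12" observation discussed below.
perceptron_score <- function(x, w = c(-0.002, -0.002, 0.002, 0.002), b = 0.1) {
  sum(w * x) + b
}
perceptron_predict <- function(x) {
  perceptron_score(x) > 0   # TRUE = "contains a staircase pattern"
}

perceptron_predict(c(252, 4, 155, 175))   # image 1: score 0.248, predicted TRUE (correct)
perceptron_predict(c(198, 134, 22, 188))  # image 500: score -0.144, predicted FALSE (correct)
```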
This is definitely better than random guessing, and it makes good sense. All staircase patterns have dark pixels in the bottom row, which is captured by the large positive coefficients on $x_3$ and $x_4$. However, there are obvious problems with this model.
- The model outputs a real number whose value correlates with the notion of likelihood (the larger the value, the more likely the image contains a staircase), but there is no basis for interpreting these values as probabilities, because they can lie outside the interval [0, 1].
- The model cannot capture nonlinear relationships between the variables and the target. To see this, consider the following hypothetical scenarios:
Case A
Let's start with the image x = [100, 0, 0, 125]. Increase $x_3$ from 0 to 60.
Case B
Let's start with the resulting image, x = [100, 0, 60, 125]. Increase $x_3$ from 60 to 120.
Intuitively, Case A should increase $\hat{y}$ much more than Case B. However, since our perceptron model is a linear equation, the +60 increase in $x_3$ produces the same +0.12 increase in $\hat{y}$ in both cases.
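Using the illustrative `perceptron_score()` sketch above (with its assumed $w_3 = 0.002$), the linearity problem shows up directly:

```r
# The same +60 change in x3 always moves the score by the same amount (+0.12 here),
# regardless of where the change starts.
perceptron_score(c(100, 0,  60, 125)) - perceptron_score(c(100, 0,   0, 125))  # Case A: +0.12
perceptron_score(c(100, 0, 120, 125)) - perceptron_score(c(100, 0,  60, 125))  # Case B: +0.12 again
```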
Our linear perceptron has other problems, but let's solve these two first.
Single-layer perceptron with a sigmoid activation function (model iteration 1)
We can solve problems 1 and 2 by wrapping our perceptron in a sigmoid function (and then refitting the weights). Recall that the sigmoid is an S-shaped curve bounded between 0 and 1 on the vertical axis, which is why it is often used to model the probability of a binary event.
Following this idea, we can update our model with the following picture and equation:

$$z = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b, \qquad \hat{y} = \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$
Looks familiar? Yes, this is our old friend, logistic regression. Still, it will serve us well to keep interpreting the model as a linear perceptron with a sigmoid activation function, because that leaves room for a more general approach. Also, since we can now interpret $\hat{y}$ as a probability, we need to adjust the decision rule accordingly (for example, predict True when $\hat{y} > 0.5$).
Continuing with our example problem, suppose we have the following fitted model:
Let's observe how this model behaves on the same example images from the previous section.
We have definitely managed to solve problem 1. Notice how it also solves problem 2.
Case A
Let's start with the image x = [100, 0, 0, 100]. Increase $x_3$ from 0 to 50.
Case B
Let's start with the resulting image, x = [100, 0, 50, 100]. Increase $x_3$ from 50 to 100.
Notice how the curvature of the sigmoid makes Case A "fire" ($\hat{y}$ increases rapidly) as $z$ grows, while the pace slows down as $z$ keeps increasing. This matches our intuition that Case A should reflect a larger increase in the probability of a staircase pattern than Case B.
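Here is a quick numeric sketch of this curvature effect. The fitted weights of model 1 are not shown in the text, so the score function below is purely illustrative:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Illustrative: suppose the fitted score is z(x3) = -1 + 0.04 * x3, with the other pixels held fixed.
z <- function(x3) -1 + 0.04 * x3
sigmoid(z(50))  - sigmoid(z(0))    # Case A: x3 from 0 to 50  -> jump of about 0.46
sigmoid(z(100)) - sigmoid(z(50))   # Case B: x3 from 50 to 100 -> smaller jump of about 0.22
```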
Unfortunately, this model still has problems.
- $\hat{y}$ has a monotonic relationship with each variable. But what if we want to recognize lighter-shaded staircases?
- The model does not account for interactions between variables. Suppose the bottom row of the image is black. If the top-left pixel is white, then darkening the top-right pixel should increase the probability of a staircase pattern. If the top-left pixel is black, then darkening the top-right pixel should decrease that probability. In other words, increasing $x_2$ should potentially lead to an increase or a decrease in $\hat{y}$, depending on the values of the other variables. Our current model cannot achieve this.
Multilayer perceptron with a sigmoid activation function (model iteration 2)
We can solve both of the above problems by adding another layer to our perceptron model. We will build several base models like those presented above, but then feed the output of each base model into the input of another perceptron. This model is in fact a "vanilla" neural network. Let's see how it might work in a few examples (a code sketch of such a two-layer network follows the examples below).
Example 1: recognizing the staircase pattern
- We'll build a model that fires when "left stairs" are recognized, $\hat{y}_{left}$
- We'll build a model that fires when "right stairs" are recognized, $\hat{y}_{right}$
- We add the scores of the base models so that the final sigmoid fires only if both values ($\hat{y}_{left}$, $\hat{y}_{right}$) are large
Another variant
- We'll build a model that fires when the bottom row is dark, $\hat{y}_1$
- We'll build a model that fires when the top-left pixel is dark and the top-right pixel is light, $\hat{y}_2$
- We'll build a model that fires when the top-left pixel is light and the top-right pixel is dark, $\hat{y}_3$
- We add the base models so that the final sigmoid fires only when $\hat{y}_1$ and $\hat{y}_2$ are large, or when $\hat{y}_1$ and $\hat{y}_3$ are large. (Note that $\hat{y}_2$ and $\hat{y}_3$ cannot both be large at the same time.)
Example 2: recognizing lighter-shaded staircases
- We'll build models that fire on a "shaded bottom row", "shaded x1 and white x2", and "shaded x2 and white x1": $\hat{y}_1$, $\hat{y}_2$, and $\hat{y}_3$
- We'll build models that fire on a "dark bottom row", "dark x1 and white x2", and "dark x2 and white x1": $\hat{y}_4$, $\hat{y}_5$, and $\hat{y}_6$
- We connect the models so that the "dark" identifiers are subtracted from the "shaded" identifiers before squashing the result with a sigmoid
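As promised above, here is a minimal R sketch of the forward pass of such a two-layer network (a "vanilla" neural network). All weights are illustrative placeholders; the two hidden nodes merely stand in for base models like the "left stairs" and "right stairs" detectors.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass of a two-layer perceptron (a.k.a. a one-hidden-layer neural network).
# W1: one row of weights per hidden node (base model); w2: weights on the hidden outputs.
forward <- function(x, W1, b1, w2, b2) {
  h     <- sigmoid(W1 %*% x + b1)         # hidden layer: one sigmoid perceptron per base model
  y_hat <- sigmoid(sum(w2 * h) + b2)      # output layer: sigmoid of a weighted sum of hidden outputs
  list(hidden = as.vector(h), y_hat = as.numeric(y_hat))
}

# Two illustrative hidden nodes, roughly "top-left dark staircase" and "top-right dark staircase"
W1 <- matrix(c( 0.002, -0.002, 0.002, 0.002,
               -0.002,  0.002, 0.002, 0.002),
             nrow = 2, byrow = TRUE)
forward(c(252, 4, 155, 175), W1, b1 = c(-0.5, -0.5), w2 = c(3, 3), b2 = -4)
```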
Terminology Note
A single-layer perceptron has a single output layer. Accordingly, the models we just built would be called two-layer perceptrons, because they have an output layer that feeds into another output layer. However, we can call the same models neural networks, and in that case the networks have three layers: an input layer, a hidden layer, and an output layer.
Alternative activation functions
In our examples we used the sigmoid activation function, but other activation functions can be used as well; tanh and ReLU are common choices. The activation function must be nonlinear, otherwise the neural network collapses into an equivalent single-layer perceptron.
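For reference, these activation functions are easy to write down in R (`tanh()` is already built in; the other definitions are standard):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))   # squashes to (0, 1)
relu    <- function(z) pmax(0, z)          # rectified linear unit: max(0, z), elementwise
# tanh(z) is provided by base R and squashes to (-1, 1)

curve(sigmoid, from = -5, to = 5)          # optional: visualize the S-shape
```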
Multi-class classification
We can easily extend our model to handle multi-class classification by using several nodes in the final output layer. The idea is that each output node corresponds to one of the $C$ classes we want to predict. Instead of squashing the output with a sigmoid, which maps an element of $\mathbb{R}$ to an element of the interval [0, 1], we can use the softmax function, which maps a vector in $\mathbb{R}^n$ to a vector in $[0, 1]^n$ whose elements sum to 1. In other words, we can design the network to output the vector $[prob(class_1), prob(class_2), \ldots, prob(class_C)]$.
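A minimal sketch of the softmax function described above, with the standard max-subtraction trick for numerical stability:

```r
# Softmax: maps a real-valued score vector to a probability vector whose elements sum to 1.
softmax <- function(z) {
  e <- exp(z - max(z))   # subtracting max(z) avoids overflow and does not change the result
  e / sum(e)
}

softmax(c(2.0, 1.0, 0.1))   # about 0.66, 0.24, 0.10 - one probability per class
```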
Using three or more layers (deep learning)
You may wonder: is it possible to extend our "vanilla" neural network so that its output layer feeds into a fourth layer (and then a fifth, a sixth, and so on)? Yes. This is what is usually called "deep learning". In practice it can be very effective. However, it is worth noting that any network with more than one hidden layer can be imitated by a network with a single hidden layer. Indeed, by the universal approximation theorem, any continuous function can be approximated by a neural network with one hidden layer. The reason deep architectures are often chosen over single-hidden-layer networks is that they usually converge to a solution faster during fitting.
Fitting the model to labeled training samples (backpropagation of the training error)
At last, we come to the fitting procedure. Up to this point we have discussed how neural networks can work effectively, but we have not discussed how a neural network is fitted to labeled training samples. An equivalent question is: "How do we choose the best weights for the network, given some labeled training samples?" The usual answer is gradient descent (although maximum likelihood estimation could also be suitable). Continuing with our example problem, the gradient descent procedure would look something like this (a minimal code sketch follows the list):
- We start with some labeled training data
- Choose a differentiable loss function to minimize, $L(\hat{y}, y)$
- Choose a network structure. Specifically, decide the number of layers and the number of nodes in each layer.
- Initialize the network with random weights
- Run the training data through the network to generate a prediction for each sample. Measure the overall error according to the loss function, $L(\hat{y}, y)$. (This is called forward propagation.)
- Determine how much the current loss changes in response to a small change in each of the weights. In other words, compute the gradient of the loss with respect to every weight in the network. (This is called backpropagation.)
- Take a small "step" in the direction of the negative gradient. For example, if the gradient of the loss with respect to some weight $w$ is positive, then decreasing $w$ by a small amount should slightly decrease the current loss, so we update $w := w - 0.001 \cdot \partial L / \partial w$ (where 0.001 is the specified "step size").
- We repeat this process (from step 5) a fixed number of times or until the losses converge
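As mentioned above, here is a minimal R sketch of this loop for the single-layer sigmoid model (model iteration 1), using a squared-error loss. The function name, loss choice, and hyperparameters are illustrative, not taken from the article.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Gradient descent for the single-layer sigmoid model (illustrative sketch).
# X: n x 4 matrix of pixel values, y: labels coded as 0/1, loss: mean squared error.
fit_sigmoid_perceptron <- function(X, y, step_size = 0.001, n_steps = 1000) {
  w <- rnorm(ncol(X), sd = 0.01)                     # initialize with small random weights
  b <- 0
  for (i in seq_len(n_steps)) {
    z      <- as.vector(X %*% w) + b                 # forward propagation
    y_hat  <- sigmoid(z)
    grad_z <- (y_hat - y) * y_hat * (1 - y_hat)      # chain rule through the sigmoid (up to a constant)
    grad_w <- as.vector(t(X) %*% grad_z) / nrow(X)   # backpropagation: gradient w.r.t. each weight
    grad_b <- mean(grad_z)
    w <- w - step_size * grad_w                      # small step along the negative gradient
    b <- b - step_size * grad_b
  }
  list(w = w, b = b)
}

# Usage, assuming the `train` data frame sketched earlier:
# fit_sigmoid_perceptron(as.matrix(train[, c("x1", "x2", "x3", "x4")]), as.numeric(train$IsStairs))
```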
At least that is the main idea. When implemented in practice, many difficulties arise.
Difficulty 1 - computational complexity
During fitting, among other things, we need to compute the gradient of the loss with respect to every weight. This is hard because $L$ depends on every node in the output layer, each of those nodes depends on every node in the layer before it, and so on. This means that computing $\partial L / \partial w$ turns into a real nightmare of chain-rule formulas. (Keep in mind that many real-world neural networks contain thousands of nodes across dozens of layers.) The key to solving this problem is to notice that, when the chain rule is applied, most of the $\partial L / \partial w$ expressions reuse the same intermediate derivatives. If you keep careful track of them, you can avoid recomputing the same quantities thousands of times.
Another trick is to use activation functions whose derivatives can be written as a function of their own output. For example, the derivative of $\mathrm{sigmoid}(z)$ is $\mathrm{sigmoid}(z)\,(1 - \mathrm{sigmoid}(z))$. This is convenient because during the forward pass, when computing $\hat{y}$ for each training sample, we already have to evaluate $\mathrm{sigmoid}(z)$ elementwise for some vector $z$. During backpropagation we can reuse these values to compute the gradient of the loss with respect to the weights, which saves time and memory.
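A small sketch of this reuse; the identity sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) is standard:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-2, 0, 3)           # pre-activation values for some layer
a <- sigmoid(z)            # computed anyway during the forward pass

# Backward pass: the sigmoid's derivative is a function of its own output,
# so we reuse `a` instead of evaluating exp() again.
d_sigmoid <- a * (1 - a)
d_sigmoid
```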
The third trick is to split the training samples into "mini-batches" and update the weights after each batch, one batch after another. For example, if we split the training data into {batch1, batch2, batch3}, then the first pass through the training data will be (see the sketch after this list):
- Change weights based on batch1
- Change weights based on batch2
- Change weights based on batch3
where the gradient $\partial L / \partial w$ is recomputed after each update.
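Here is a hedged sketch of one such pass (epoch) over the training data; `compute_gradient()` stands in for the backpropagation step described above, and the batch size is arbitrary:

```r
# One pass (epoch) of mini-batch gradient descent.
run_epoch <- function(w, X, y, compute_gradient, batch_size = 100, step_size = 0.001) {
  idx     <- sample(nrow(X))                                          # shuffle the samples
  batches <- split(idx, ceiling(seq_along(idx) / batch_size))         # batch1, batch2, batch3, ...
  for (batch in batches) {
    grad <- compute_gradient(w, X[batch, , drop = FALSE], y[batch])   # gradient recomputed per batch
    w    <- w - step_size * grad                                      # weights updated after every batch
  }
  w
}
```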
Finally, another technique worth mentioning is using a GPU instead of a CPU, since GPUs are better suited to performing large numbers of parallel computations.
Difficulty 2 - gradient descent may have problems finding the global minimum
This is not so much a problem with neural networks as with gradient descent itself. There is a chance that during gradient descent the weights get stuck in a local minimum. It is also possible for the weights to "jump over" a minimum. One way to handle this is to try different step sizes. Another is to increase the number of nodes and/or layers in the network (but then beware of overfitting). In addition, some heuristic techniques, such as momentum, can be effective.
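For example, here is a minimal sketch of the classic momentum update; the 0.9 coefficient is a common default, not something prescribed by the article:

```r
# Classic momentum: accumulate a velocity vector and step along it, which helps the weights
# roll through shallow local dips and damps oscillation around a minimum.
momentum_step <- function(w, velocity, grad, step_size = 0.001, momentum = 0.9) {
  velocity <- momentum * velocity - step_size * grad
  list(w = w + velocity, velocity = velocity)
}
```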
Difficulty 3 - how to develop a general approach?
How do we write a generic program that can fit the weights of any neural network with any number of nodes and layers? The correct answer is to use Tensorflow. But if you want to try it yourself, the hardest part will be computing the gradient of the loss function. The trick is to notice that the gradient can be expressed as a recursive function. A five-layer neural network is just a four-layer neural network feeding into some perceptrons; a four-layer neural network is just a three-layer neural network feeding into some perceptrons, and so on. More formally, this is called automatic differentiation.