# Introduction to Neural Networks

Original author: Ben Gorman

Artificial neural networks are at the peak of their popularity right now. One may wonder whether the catchy name played a role in the model's marketing and adoption. I know some business managers who happily mention the use of “artificial neural networks” and “deep learning” in their products. Would they be just as happy if their products used “connected circles models” or “make a mistake and be punished machines”? Still, artificial neural networks are without a doubt the real deal, as shown by their success in many application areas: image recognition, natural language processing, automated trading, and autonomous cars. I am a data scientist, but until recently I did not understand them, so I felt like a craftsman who had not mastered his own tool.

The R code for the examples presented in this article can be found in the Machine Learning Problem Bible. After reading this article, it is also worth exploring Part 2, Neural Networks - A Worked Example, which walks through building and coding a neural network from scratch.

We will start with a motivating problem. We have a set of grayscale images, each of which is a 2 × 2 pixel grid in which each pixel has an intensity value from 0 (white) to 255 (black). Our goal is to build a model that finds images with a “stairs” pattern.

At this stage, we are only interested in finding a model that can reasonably fit the data. The fitting methodology will be of interest to us later.

## Preprocessing

In each image, we label the pixels $x_1$, $x_2$, $x_3$, $x_4$ and generate an input vector $x = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 \end{bmatrix}$, which will be the input to our model. We expect the model to predict True (the image contains the stairs pattern) or False (the image does not contain the stairs pattern).

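| ImageId | x1 | x2 | x3 | x4 | IsStairs |
|---|---|---|---|---|---|
| 1 | 252 | 4 | 155 | 175 | TRUE |
| 2 | 175 | 10 | 186 | 200 | TRUE |
| 3 | 82 | 131 | 230 | 100 | FALSE |
| ... | ... | ... | ... | ... | ... |
| 498 | 36 | 187 | 43 | 249 | FALSE |
| 499 | 1 | 160 | 169 | 242 | TRUE |
| 500 | 198 | 134 | 22 | 188 | FALSE |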

## Single-layer perceptron (model iteration 0)

We can build a simple model consisting of a single-layer perceptron. A perceptron uses a weighted linear combination of the inputs to return a prediction score. If the prediction score exceeds a selected threshold, the perceptron predicts True. Otherwise, it predicts False. More formally,

$$f(x) = \begin{cases} 1 & \text{if } w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 > \text{threshold} \\ 0 & \text{otherwise} \end{cases}$$

Put another way, moving the threshold to the left-hand side as a bias term $b = -\text{threshold}$:

$$\widehat{y} = \mathbf{w} \cdot \mathbf{x} + b$$

$$f(x) = \begin{cases} 1 & \text{if } \widehat{y} > 0 \\ 0 & \text{otherwise} \end{cases}$$

Here $\hat{y}$ is our prediction score.

Graphically, we can represent the perceptron as input nodes that transmit data to the output node.

For our example, we will construct the following perceptron:

$$\hat{y} = -0.0019x_1 + 0.0016x_2 + 0.0020x_3 + 0.0023x_4 + 0.0003$$

Here's how the perceptron will work on some of the training images.

This is definitely better than random guessing, and it makes sense. All stairs patterns have dark pixels in the bottom row, which explains the larger positive coefficients on $x_3$ and $x_4$. However, there are some obvious problems with this model.

1. The model outputs a real number whose value correlates with the notion of likelihood (the larger the value, the more likely the image contains stairs), but there is no basis for interpreting these values as probabilities, because they can fall outside the interval [0, 1].
2. The model cannot capture nonlinear relationships between the variables and the target. To see this, consider the following hypothetical scenarios:

##### Case A

Let's start with the image x = [100, 0, 0, 125]. Increase $x_3$ from 0 to 60.

##### Case B

Let's start with the previous image, x = [100, 0, 60, 125]. Increase $x_3$ from 60 to 120.

Intuitively, Case A should increase $\hat{y}$ much more than Case B. However, since our perceptron model is just a linear equation, the +60 change in $x_3$ leads to the same +0.12 increase in $\hat{y}$ in both cases.
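The arithmetic behind the two cases can be checked with a short Python sketch. The weights and bias are those of the perceptron defined above; the function name `perceptron_score` is only an illustrative choice.

```python
# Weights and bias of the linear perceptron defined above.
WEIGHTS = [-0.0019, 0.0016, 0.0020, 0.0023]
BIAS = 0.0003

def perceptron_score(x):
    """Return y_hat = w . x + b for a 4-pixel image x."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS

start = perceptron_score([100, 0, 0, 125])
after_a = perceptron_score([100, 0, 60, 125])   # Case A: x3 goes from 0 to 60
after_b = perceptron_score([100, 0, 120, 125])  # Case B: x3 goes from 60 to 120

gain_a = after_a - start
gain_b = after_b - after_a
# Both gains equal 60 * 0.0020 = 0.12, up to floating-point rounding.
print(round(gain_a, 4), round(gain_b, 4))
```

Because the score is linear in $x_3$, the gain depends only on the size of the change, never on where the change starts.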

Our linear perceptron has other problems, but let's solve these two first.

## Single-layer perceptron with sigmoid activation function (model iteration 1)

We can solve problems 1 and 2 by wrapping our perceptron in a sigmoid (and subsequently choosing different weights). Recall that the sigmoid function is an S-shaped curve bounded on the vertical axis between 0 and 1, which is why it is often used to model the probability of a binary event.

$$\text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$

With this in mind, we can update our model with the following picture and equations.

$$z = w \cdot x = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4$$

$$\widehat{y} = \text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$

Look familiar? Yes, this is our old friend, logistic regression. However, it will serve us well to interpret the model as a linear perceptron with a sigmoid activation function, because that gives us more room to generalize. Also, since we can now interpret $\hat{y}$ as a probability, we need to change our decision rule accordingly.

$$f(x) = \begin{cases} 1 & \text{if } \widehat{y} > 0.5 \\ 0 & \text{otherwise} \end{cases}$$

Continuing with our example problem, suppose we arrive at the following fitted model:

$$\begin{bmatrix} w_1 & w_2 & w_3 & w_4 \end{bmatrix} = \begin{bmatrix} -0.140 & -0.145 & 0.121 & 0.092 \end{bmatrix}$$

$$b = -0.008$$

$$\widehat{y} = \frac{1}{1 + e^{-(-0.140x_1 - 0.145x_2 + 0.121x_3 + 0.092x_4 - 0.008)}}$$

Let's observe how this model behaves on the same sample images from the previous section.

We have clearly solved problem 1. Now let's see how it also addresses problem 2.

##### Case A

Let's start with the image [100, 0, 0, 100]. Increase $x_3$ from 0 to 50.

##### Case B

Let's start with the image [100, 0, 50, 100]. Increase $x_3$ from 50 to 100.

Notice how the curvature of the sigmoid makes Case A “fire” (increase rapidly) as $z = w \cdot x$ increases, while the pace slows down as $z$ continues to increase. This matches our intuition that Case A should produce a larger increase in the probability of a stairs pattern than Case B.
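The two cases can be reproduced numerically. The sketch below plugs in the fitted weights and bias given above; `predict_proba` is a hypothetical helper name, not something from the original R code.

```python
import math

# Fitted weights and bias of the sigmoid model given above.
WEIGHTS = [-0.140, -0.145, 0.121, 0.092]
BIAS = -0.008

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x):
    """Return y_hat = sigmoid(w . x + b) for a 4-pixel image x."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return sigmoid(z)

p_start = predict_proba([100, 0, 0, 100])
p_after_a = predict_proba([100, 0, 50, 100])    # Case A: x3 goes from 0 to 50
p_after_b = predict_proba([100, 0, 100, 100])   # Case B: x3 goes from 50 to 100

gain_a = p_after_a - p_start
gain_b = p_after_b - p_after_a
print(f"Case A gain: {gain_a:.3f}, Case B gain: {gain_b:.3f}")
```

With these weights the Case A gain comes out at roughly 0.77 and the Case B gain at roughly 0.22, even though $z$ grows by the same amount (50 × 0.121 = 6.05) in both cases.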

Unfortunately, this model still has problems.

1. $\widehat{y}$ has a monotonic relationship with each variable. But what if we need to recognize lightly shaded stairs?
2. The model does not account for interactions between variables. Suppose the bottom row of the image is black. If the top-left pixel is white, then darkening the top-right pixel should increase the probability of stairs. If the top-left pixel is black, then darkening the top-right pixel should decrease the probability of stairs. In other words, increasing $x_2$ should potentially increase or decrease $\widehat{y}$, depending on the values of the other variables. Our current model cannot achieve this.

## Multilayer perceptron with sigmoid activation function (model iteration 2)

We can solve both of the above problems by adding an extra layer to our perceptron model. We will build several base models like the one above, and then feed the output of each base model into another perceptron. This model is, in fact, a “vanilla” neural network. Let's see how it might work on some examples.

##### Example 1: recognize the stairs pattern

1. Build a model that fires when “left stairs” are recognized, $\widehat{y}_{left}$
2. Build a model that fires when “right stairs” are recognized, $\widehat{y}_{right}$
3. Add the scores of the base models so that the final sigmoid fires only if either $\widehat{y}_{left}$ or $\widehat{y}_{right}$ is large
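A minimal sketch of such a two-layer network is shown below. Every weight and bias here is hand-picked and hypothetical, chosen only to make the idea concrete on idealized images; nothing is fitted to the data. It assumes $x_1$, $x_2$ are the top row and $x_3$, $x_4$ the bottom row, and it rescales pixels to the range 0 to 1 for convenience. Since one image cannot be “left stairs” and “right stairs” at once, the output node fires when either hidden score is large.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One perceptron with a sigmoid activation."""
    return sigmoid(sum(w * v for w, v in zip(weights, inputs)) + bias)

def is_stairs_probability(pixels):
    # Rescale pixels from 0..255 to 0..1 (a convenience, not from the article).
    x = [p / 255 for p in pixels]
    # Hidden layer, hypothetical weights.
    # "Left stairs": x1, x3, x4 dark and x2 light.
    y_left = neuron(x, [10, -20, 10, 10], -25)
    # "Right stairs": x2, x3, x4 dark and x1 light.
    y_right = neuron(x, [-20, 10, 10, 10], -25)
    # Output layer: fires when either hidden score is large.
    return neuron([y_left, y_right], [10, 10], -5)

print(is_stairs_probability([255, 0, 255, 255]))    # idealized left stairs
print(is_stairs_probability([255, 255, 255, 255]))  # all black, no stairs
```

With these weights, darkening the top-right pixel raises the output when the top-left pixel is white and lowers it when the top-left pixel is black, which is the interaction a single sigmoid perceptron cannot express.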

Alternatively:

1. Build a model that fires when the bottom row is dark, $\widehat{y}_1$
2. Build a model that fires when the top-left pixel is dark and the top-right pixel is light, $\widehat{y}_2$
3. Build a model that fires when the top-left pixel is light and the top-right pixel is dark, $\widehat{y}_3$
4. Add the base models so that the final sigmoid function fires only when $\widehat{y}_1$ and $\widehat{y}_2$ are large, or when $\widehat{y}_1$ and $\widehat{y}_3$ are large. (Note that $\widehat{y}_2$ and $\widehat{y}_3$ cannot both be large at the same time.)

##### Example 2: recognize lightly shaded stairs

1. Build models that fire for a “shaded bottom row”, “shaded x1 and white x2”, and “shaded x2 and white x1”: $\widehat{y}_1$, $\widehat{y}_2$ and $\widehat{y}_3$
2. Build models that fire for a “dark bottom row”, “dark x1 and white x2”, and “dark x2 and white x1”: $\widehat{y}_4$, $\widehat{y}_5$ and $\widehat{y}_6$
3. Combine the models so that the “dark” identifiers are subtracted from the “shaded” identifiers before squashing the result with a sigmoid

#### Terminology Note

A single-layer perceptron has a single output layer. Consequently, the models we just built would be called two-layer perceptrons, because they have an output layer that serves as the input to another output layer. However, we could call the same models neural networks, and in that case the networks have three layers: an input layer, a hidden layer, and an output layer.

## Alternative activation functions

In our examples we used a sigmoid activation function, but other activation functions can be used as well. tanh and relu are common choices. The activation function must be nonlinear; otherwise the neural network simplifies to an equivalent single-layer perceptron.
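The sketch below defines the two alternatives and illustrates the last point on the smallest possible case: a network with one input, one hidden node and one output node. The weights are arbitrary, hypothetical numbers. With a linear (identity) activation, the two layers collapse into a single linear layer.

```python
import math

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

# Two layers with linear (identity) activations:
# hidden = w1 * x + b1, output = w2 * hidden + b2
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    return w2 * (w1 * x + b1) + b2

# The same function written as one linear layer.
w_single, b_single = w2 * w1, w2 * b1 + b2

def one_linear_layer(x):
    return w_single * x + b_single

for x in (-2.0, 0.0, 3.5):
    print(two_linear_layers(x), one_linear_layer(x))  # each pair matches
```

Replacing the identity activation in the hidden node with `tanh` or `relu` breaks this equivalence, which is what lets extra layers add expressive power.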

## Multi-class classification

We can easily extend our model to multiclass classification by using several nodes in the final output layer. The idea is that each output node corresponds to one of the $C$ classes we are trying to predict. Instead of squashing the output with a sigmoid, which maps an element of $\mathbb{R}$ to an element of the interval [0, 1], we can use the softmax function, which maps a vector in $\mathbb{R}^n$ to a vector in $\mathbb{R}^n$ such that the elements of the resulting vector sum to 1. In other words, we can design the network so that it outputs the vector [$prob(class_1)$, $prob(class_2)$, ..., $prob(class_C)$].
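For reference, softmax sends each score $z_i$ to $e^{z_i} / \sum_j e^{z_j}$. A minimal sketch follows; the input scores are made-up values for three classes, and subtracting the maximum is a common numerical safeguard that does not change the result.

```python
import math

def softmax(scores):
    """Map a list of real scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])  # hypothetical raw outputs for 3 classes
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted class is simply the output node with the highest probability.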

## Using three or more layers (deep learning)

You may wonder whether we can extend our “vanilla” neural network so that its output layer feeds into a fourth layer (and then a fifth, a sixth, and so on). Yes. This is usually called “deep learning”, and in practice it can be very effective. However, it is worth noting that any network with more than one hidden layer can be simulated by a network with just one hidden layer. Indeed, according to the universal approximation theorem, any continuous function can be approximated arbitrarily closely by a neural network with one hidden layer, given enough hidden nodes. The reason deep architectures are often chosen over single-hidden-layer networks is that they tend to converge to a solution faster during the fitting procedure.

## Fitting the model to labeled training samples (backpropagation)

Alas, we have arrived at the fitting procedure. So far we have discussed how neural networks can work effectively, but not how to fit a neural network to labeled training samples. An equivalent question is: “How do I choose the best weights for a network given some labeled training samples?” The usual answer is gradient descent (although maximum likelihood estimation, MLE, may also be suitable). Continuing with our example problem, the gradient descent procedure could look something like this:

1. Start with some labeled training data
2. Choose a differentiable loss function to minimize, $L(\mathbf{\widehat{Y}}, \mathbf{Y})$
3. Choose a network structure. In particular, determine the number of layers and the number of nodes in each layer.
4. Initialize the network with random weights
5. Run the training data through the network to generate a prediction for each sample. Measure the overall error according to the loss function, $L(\mathbf{\widehat{Y}}, \mathbf{Y})$. (This is called forward propagation.)
6. Determine how much the current loss changes with respect to a small change in each of the weights. In other words, compute the gradient of $L$ with respect to every weight in the network. (This is called backpropagation.)
7. Take a small “step” in the direction of the negative gradient. For example, if $w_{23} = 1.5$ and $\frac{\partial L}{\partial w_{23}} = 2.2$, then decreasing $w_{23}$ by a small amount should result in a small decrease in the current loss. So we update $w_{23} := w_{23} - 2.2 \times 0.001$ (where 0.001 is our predefined “step size”).
8. Repeat this process (from step 5) a fixed number of times or until the loss converges
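The steps above can be sketched for the simplest possible “network”, a single sigmoid perceptron. Everything specific here is an assumption made for illustration: the four training images (pixels already rescaled to 0 or 1), the choice of log loss, the step size of 0.5 and the 200 iterations. For this model and loss, the gradient with respect to weight $w_j$ is the average of $(\widehat{y} - y)\,x_j$ over the samples.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Step 1: hypothetical labeled data, pixels rescaled to 0..1.
X = [[1, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]]
Y = [1, 1, 0, 0]

def forward(weights, bias):
    """Step 5: a prediction for every training sample."""
    return [sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) for x in X]

def log_loss(preds):
    """Step 2: a differentiable loss L(Y_hat, Y), averaged over samples."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(preds, Y)) / len(Y)

def gradients(preds):
    """Step 6: dL/dw_j and dL/db for a sigmoid neuron with log loss."""
    n = len(Y)
    errors = [p - y for p, y in zip(preds, Y)]
    grad_w = [sum(e * x[j] for e, x in zip(errors, X)) / n for j in range(4)]
    grad_b = sum(errors) / n
    return grad_w, grad_b

# Step 3: the structure is one output node with four inputs.
random.seed(0)                                           # reproducible run
weights = [random.uniform(-0.5, 0.5) for _ in range(4)]  # Step 4
bias = 0.0
step_size = 0.5

initial_loss = log_loss(forward(weights, bias))
for _ in range(200):                                     # Step 8
    grad_w, grad_b = gradients(forward(weights, bias))
    weights = [w - step_size * g for w, g in zip(weights, grad_w)]  # Step 7
    bias -= step_size * grad_b
final_loss = log_loss(forward(weights, bias))
print(initial_loss, final_loss)
```

A single update is plain arithmetic: in the example from step 7, the new weight is 1.5 − 2.2 × 0.001 = 1.4978.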

At least that is the main idea. When implemented in practice, many difficulties arise.

### Difficulty 1 - computational complexity

During fitting, among other things, we need to compute the gradient of $L$ with respect to every weight. This is hard because $L$ depends on every node in the output layer, each of those nodes depends on every node in the layer before it, and so on. This means that computing $\frac{\partial L}{\partial w_{ab}}$ turns into a chain-rule nightmare. (Keep in mind that many real-world neural networks contain thousands of nodes across dozens of layers.) You can deal with this by noticing that, when you apply the chain rule, most of the $\frac{\partial L}{\partial w_{ab}}$ reuse the same intermediate derivatives. If you keep careful track of this, you can avoid recomputing the same thing thousands of times.

Another trick is to use special activation functions whose derivatives can be written as a function of their value. For example, the derivative of $sigmoid(x)$ is $sigmoid(x)(1 - sigmoid(x))$. This is convenient because during the forward pass, when computing $\widehat{y}$ for each training sample, we need to compute $sigmoid(\mathbf{x})$ element-wise for some vector $\mathbf{x}$. During backpropagation, we can reuse these values to compute the gradient of $L$ with respect to the weights, which saves time and memory.
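The sigmoid identity can be checked in a few lines. In this sketch the pre-activation values in `zs` are arbitrary; the point is that the backward pass needs only the activations already cached by the forward pass, and a finite-difference estimate confirms the result.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

zs = [-2.0, 0.0, 1.5]                    # hypothetical pre-activations
activations = [sigmoid(z) for z in zs]   # computed once in the forward pass

# Backward pass: the derivative uses only the cached activations.
derivs = [a * (1.0 - a) for a in activations]

# Sanity check against a central finite difference.
eps = 1e-6
numeric = [(sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps) for z in zs]
print(derivs)
print(numeric)
```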

The third trick is to divide the training data into “mini-batches” and update the weights with respect to each batch, one after another. For example, if we divide the training data into {batch1, batch2, batch3}, then the first pass over the training data would

1. Update the weights based on batch1
2. Update the weights based on batch2
3. Update the weights based on batch3

where the gradient of $L$ is recalculated after each update.
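The loop structure looks roughly like the sketch below. To keep it short it uses a deliberately tiny stand-in model, $\widehat{y} = w x$ with a mean squared error loss, on made-up data generated from $y = 3x$; none of this comes from the stairs example.

```python
def gradient(w, batch):
    """dL/dw of the mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical data generated from y = 3x, so the best weight is w = 3.
samples = [(0.1 * i, 0.3 * i) for i in range(1, 10)]
batches = [samples[i:i + 3] for i in range(0, len(samples), 3)]  # batch1..batch3

w, step_size = 0.0, 0.5
for epoch in range(50):          # each epoch is one pass over the training data
    for batch in batches:
        # The gradient is recomputed from the current w for every batch.
        w -= step_size * gradient(w, batch)
print(round(w, 6))
```

Each update touches only a third of the data, so the weights move three times per pass instead of once.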

Finally, another technique worth mentioning is using a GPU instead of a CPU, since a GPU is better suited to performing many computations in parallel.

### Difficulty 2 - gradient descent may have trouble finding the global minimum

This is not so much a neural network problem as a gradient descent problem. During gradient descent, the weights may get stuck in a local minimum. It is also possible for the weights to “jump over” the minimum. One way to handle this is to use different step sizes. Another way is to increase the number of nodes and/or layers in the network (but beware of overfitting). In addition, some heuristic techniques, such as using momentum, can be effective.
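As an illustration of the momentum idea, here is a sketch of one common formulation, in which each step blends the previous step with the new gradient. The toy loss $L(w) = (w - 2)^2$ and both hyperparameters are assumptions chosen for the example; this loss has a single minimum, so the sketch shows only the update rule, not an escape from a local minimum.

```python
def grad(w):
    """Derivative of the toy loss L(w) = (w - 2)^2."""
    return 2.0 * (w - 2.0)

w, velocity = 0.0, 0.0
step_size, momentum = 0.1, 0.9   # hypothetical hyperparameters

for _ in range(500):
    # The velocity is a decaying sum of past gradient steps.
    velocity = momentum * velocity - step_size * grad(w)
    w += velocity
print(round(w, 6))
```

Because the velocity carries over between steps, the weight can keep moving through flat regions or shallow dips where the current gradient alone would stall.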

### Difficulty 3 - how to generalize?

How do we write a generic program that can fit the weights of any neural network with any number of nodes and layers? The correct answer is that you should use TensorFlow. But if you want to try it yourself, the hardest part will be computing the gradient of the loss function. The trick is to recognize that the gradient can be represented as a recursive function. A five-layer neural network is just a four-layer neural network feeding into some perceptrons. But a four-layer neural network is just a three-layer neural network feeding into some perceptrons, and so on. More formally, this is called automatic differentiation.
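To give a feel for the recursion, here is a toy reverse-mode automatic differentiation sketch for scalars. It is an illustration only and makes no claim about how TensorFlow is implemented; the class name `Value` and the numbers at the bottom are hypothetical. Each value remembers which values it was computed from, and `backward` applies the chain rule from the output back to every input.

```python
import math

class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.data))
        return Value(s, (self,), (s * (1.0 - s),))

    def backward(self):
        """Apply the chain rule from this node back to every input."""
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)  # parents are always listed before children

        visit(self)
        self.grad = 1.0
        for node in reversed(order):  # children first, then their parents
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

# Two stacked perceptrons: hidden = sigmoid(w1*x + b1), output = sigmoid(w2*hidden + b2)
x, w1, b1 = Value(0.5), Value(-1.2), Value(0.3)
w2, b2 = Value(2.0), Value(-0.7)
hidden = (w1 * x + b1).sigmoid()
output = (w2 * hidden + b2).sigmoid()
output.backward()
print(w1.grad, w2.grad)
```

Adding another layer means wrapping `output` in one more expression; `backward` needs no changes, which is the sense in which the gradient computation is recursive.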