Neural networks and deep learning, chapter 2: how the backpropagation algorithm works

Original author: Michael Nielsen
  • Transfer

In the last chapter, we saw how neural networks can independently learn weights and offsets using the gradient descent algorithm. However, there was a gap in our explanation: we did not discuss the calculation of the gradient of the cost function. And this is a decent gap! In this chapter, I will introduce a quick algorithm for calculating such gradients, known as back propagation.

The backpropagation algorithm was first invented in the 1970s, but its importance was not fully understood until the famous work of 1986, written by David Rumelhart, Joffrey Hinton and Ronald Williams. The paper describes several neural networks in which backpropagation works much faster than in earlier approaches to learning, which is why it has since been possible to use a neural network to solve previously unsolvable problems. Today, the backpropagation algorithm is the workhorse for learning a neural network.

This chapter contains more math than everyone else in the book. If you don’t particularly like mathematics, you might be tempted to skip this chapter and simply treat back distribution as a black box, the details of which you are prepared to ignore. Why waste time studying them?

The reason, of course, is understanding. The back propagation is based on the expression of the partial derivative ∂C / ∂w of the cost function C with respect to the weight w (or bias b) of the network. The expression shows how quickly the value changes when weights and offsets change. And although this expression is quite complex, it has its own beauty, because each of its elements has a natural and intuitive interpretation. Therefore, backpropagation is not just a fast learning algorithm. It gives us a detailed understanding of how changing weights and offsets changes all network behavior. And it’s worth it to study the details.

Given all this, if you just want to flip through this chapter or jump to the next, that's okay. I wrote the rest of the book so that it is understandable, even if we consider the reverse distribution as a black box. Of course, later in the book there will be moments from which I make references to the results of this chapter. But at that moment, you should understand the basic conclusions, even if you did not follow all the arguments.

For warming up: a quick matrix approach for calculating the output of a neural network

Before discussing backpropagation, let's get some quick matrix algorithm to calculate the output of a neural network. We actually already met this algorithm by the end of the previous chapter, but I described it quickly, so it is worth considering again in more detail. In particular, it will be a good way to adapt to the record used in back distribution, but in a familiar context.

Let's start with a record that allows us to clearly indicate weights on the net. We will use w l jk to denote the connection weight of neuron No. k in layer No. (l-1) with neuron No. j in layer No. l. So, for example, the diagram below shows the connection weight of the fourth neuron of the second layer with the second neuron of the third layer:

At first, such a recording seems awkward, and requires some getting used to. However, soon it will seem simple and natural to you. One of its features is the order of indices j and k. You could decide that it would be more reasonable to use j to designate the input neuron, and k - for the output neuron, and not vice versa, as we have. I will explain the reason for this feature below.

We will use similar notations for offsets and network activations. In particular, b l j will denote the displacement of neuron No. j in layer No. l. a l j will denote the activation of neuron No. j in the layer No. l. The following diagram shows examples of the use of this entry:

With this entry, activation a l jneuron No. j in layer No. l is associated with activation in layer No. (l-1) by the following equation (compare with equation (4) and its discussion in the previous chapter):

$ a ^ l_j = \ sigma (\ sum_k w ^ l_ {jk} a ^ {l − 1} _k + b ^ l_j) \ tag {23} $

where the sum goes over all neurons k in the layer (l-1). To rewrite this expression in matrix form, we define a weight matrix w l for each layer l. The elements of the weight matrix are simply the weights connected to the layer No. l, that is, the element in row No. j and column No. k will be w l jk . Similarly, for each layer l we define the displacement vector b l . You probably guessed how this works - the components of the displacement vector will simply be the values ​​b l j , one component for each neuron in layer No. l. And finally, we define the activation vector a l , whose components are the activation a l j .

The last ingredient needed to rewrite (23) is the matrix form of the vectorization of the function σ. We casually encountered vectorization in the last chapter - the idea is that we want to apply the function σ to every element of the vector v. We use the obvious notation σ (v) to denote the element-wise application of the function. That is, the components σ (v) are simply σ (v) j = σ (v j ). For example, if we have a function f (x) = x 2 , then the vectorized form f gives us

$ f (\ begin {bmatrix} 2 \\ 3 \ end {bmatrix}) = \ begin {bmatrix} f (2) \\ f (3) \ end {bmatrix} = \ begin {bmatrix} 4 \\ 9 \ end {bmatrix} \ tag {24} $

that is, a vectorized f simply squares every element of the vector.

Given all these forms of writing, equation (23) can be rewritten in a beautiful and compact vectorized form:

$ a ^ l = \ sigma (w ^ la ^ {l − 1} + b ^ l) \ tag {25} $

Such an expression allows us to take a more global look at the relationship between the activations of one layer and the activations of the previous one: we simply apply the weight matrix to the activations, add the displacement vector, and then apply the sigmoid. By the way, it is this record that requires the use of the record w l jk. If we used the index j to denote the input neuron, and k for the output neuron, we would have to replace the weight matrix in equation (25) with the transposed one. This is a small but annoying change, and we would lose the simplicity of the statement (and reflection) about “applying the weight matrix to activations”. Such a global approach is simpler and more concise (and uses fewer indexes!) Than the poneuron one. This is just a way to avoid the index hell without losing the accuracy of what is happening. This expression is also useful in practice, since most matrix libraries offer quick ways to multiply matrices, add vectors, and vectorize. The code in the last chapter directly used this expression to calculate network behavior.

Using equation (25) to calculate a l, we calculate the intermediate value z l ≡ w l a l − 1 + b l . This value is quite useful for naming: we call z l the weighted input of neurons of layer No. l. Later we will actively use this weighted input. Equation (25) is sometimes written through a weighted input, as a l = σ (z l ). It is also worth noting that z l has components$ z ^ l_j = \ sum_k w ^ l_ {jk} a ^ {l − 1} _k + b ^ l_j $that is, z l j is just a weighted input of the activation function of neuron j in layer l.

Two essential assumptions about the cost function

The goal of backpropagation is to calculate the partial derivatives ∂C / ∂w and ∂C / ∂b of the cost function C for each weight w and the bias b of the network. For backpropagation to work, we need to make two main assumptions about the form of the cost function. However, before that it would be useful to imagine an example of a cost function. We use the quadratic function from the previous chapter (equation (6)). In the entry from the previous section, it will look like

$ C = \ frac {1} {2n} \ sum_x || y (x) - a ^ L (x) || ^ 2 \ tag {26} $

where: n is the total number of training examples; the sum goes for all examples x; y = y (x) is the required output; L denotes the number of layers in the network; a L = a L (x) is the output vector of network activations when x is at the input.

Okay, so what kind of assumptions do we need about the cost function C to apply backpropagation? First, the cost function can be written as the average C = 1 / n ∑ x C x of the cost function C x for individual training examples x. This is done in the case of a quadratic cost function, where the cost of one training example is C x = 1/2 || y - a L || 2. This assumption will be true for all other cost functions that we will meet in the book.

We need this assumption because in fact the back propagation allows us to calculate the partial derivatives ∂C / ∂w and ∂C / ∂b, averaging over the training examples. Accepting this assumption, we assume that the training example x is fixed, and stop specifying the index x, writing down the value of C x as C. Then we will return x, but for now it’s better to just mean it.

The second assumption regarding the cost function is that it can be written as a function of the output of a neural network:

For example, a quadratic cost function satisfies this requirement, since the quadratic cost of one training example x can be written as

$ C = 1/2 ||  y − a ^ L || ^ 2 = 1/2 \ sum_j (y_j - a ^ L_j) ^ 2 \ tag {27} $

which will be the function of output activations. Of course, this cost function also depends on the desired output y, and you may wonder why we do not consider C as a function also of y. However, recall that the input training example x is fixed, so the output y is also fixed. In particular, we cannot change it by changing weights and displacements, that is, this is not what the neural network learns. Therefore, it makes sense to consider C as a function of only the output activations a L , and y as just a parameter that helps to determine it.

The work of Hadamard s⊙t

The backpropagation algorithm is based on the usual operations of linear algebra - addition of vectors, multiplication of a vector by a matrix, etc. However, one of the operations is used less frequently. Suppose s and t are two vectors of the same dimension. Then by s⊙t we denote the elementwise multiplication of two vectors. Then the components s⊙t are simply (s⊙t) j = s j t j . For instance:

$ \ begin {bmatrix} 1 \\ 2 \ end {bmatrix} \ odot \ begin {bmatrix} 3 \\ 4 \ end {bmatrix} = \ begin {bmatrix} 1 * 3 \\ 2 * 4 \ end {bmatrix} = \ begin {bmatrix} 3 \\ 8 \ end {bmatrix} \ tag {28} $

Such a piecewise work is sometimes called the work of Hadamard or the work of Schur. We will call it the work of Hadamard. Good libraries for working with matrices usually have a quick implementation of the Hadamard product, and this can be convenient when implementing backpropagation.

Four fundamental equations for back propagation

Backpropagation involves understanding how changing weights and offsets of the network changes the cost function. In essence, this means calculating the partial derivatives ∂C / ∂w l jk and ∂C / ∂b l j . But for their calculation, we first calculate the intermediate value δ l j , which we call the error in neuron No. j in layer No. l. The back propagation will give us a procedure for calculating the error δ l j , and then associate δ l j with ∂C / ∂w l jk and ∂C / ∂b l j .

To understand how an error is determined, imagine that a demon has started up in our neural network:

He sits on neuron No. j in layer No. l. Upon receipt of the input data, the daemon disrupts the operation of the neuron. It adds a small change in Δz l j to the weighted input of the neuron, and instead of yielding σ (z l j ), the neuron will produce σ (z l j + Δz l j ). This change will also spread through the following layers of the network, which will ultimately change the total cost by (∂C / ∂z l j ) * Δz l j .

But our demon is good, and he is trying to help you improve the cost, that is, find Δz l j that reduces the cost. Suppose the value ∂C / ∂z l jgreat (positive or negative). Then the demon can seriously reduce the cost by choosing Δz l j with the sign opposite to ∂C / ∂z l j . But if ∂C / ∂z l j is close to zero, then the demon cannot greatly improve the cost by changing the weighted input z l j . So, from the point of view of the demon, the neuron is already close to optimum (this, of course, is true only for small Δz l j . Suppose these are the limitations of the actions of the demon). Therefore, in the heuristic sense, ∂C / ∂z l j is a measure of neuron error.

Under the motivation from this story, we define the error δ l j of the neuron j in the layer l, as

$ \ delta l_j \ equiv \ frac {\ partial C} {\ partial z ^ l_j} \ tag {29} $

By our usual convention, we use δ l to denote the error vector associated with layer l. Back propagation will give us a way to calculate δ l for any layer, and then correlate these errors with the quantities that really interest us, ∂C / ∂w l jk and ∂C / ∂b l j .

You may be wondering why the daemon changes the weighted input z l j . It would be more natural to imagine that the demon changes the output activation a l j so that we use ∂C / ∂a l jas a measure of error. In fact, if you do so, then everything turns out very similar to what we will discuss further. However, in this case, the representation of back propagation will be algebraically a little more complicated. Therefore, we dwell on the variant δ l j = ∂C / ∂z l j as a measure of error.

In classification problems, the term “error” sometimes means the number of incorrect classifications. For example, if a neural network correctly classifies 96.0% of the digits, then the error will be 4.0%. Obviously, this is not at all what we mean by the vectors δ. But in practice, you can usually easily understand what meaning is meant.

Attack plan: backpropagation is based on four fundamental equations. Together, they give us a way to calculate both the error δ l and the gradient of the cost function. I give them below. No need to expect their instant development. You will be disappointed. The backpropagation equations are so deep that a good understanding requires tangible time and patience, and a gradual deepening of the question. The good news is that this patience will pay off handsomely. Therefore, in this section, our reasoning is just beginning, helping you to follow the path of a deep understanding of equations.

Here is a diagram of how we will delve into these equations later: I will give a brief proof of them to help explain why they are true; we will rewrite them in an algorithmic form in the form of pseudocode, and see how to implement it in real python code; in the last part of the chapter we will develop an intuitive idea of ​​the meaning of the backpropagation equations, and how they can be found from scratch. We will periodically return to the four fundamental equations, and the deeper you understand them, the more comfortable and perhaps beautiful and natural they will seem to you.

The equation of the error of the output layer, δ L : the components of δ L are considered as

$ \ delta ^ L_j = \ frac {\ partial C} {\ partial a ^ L_j} \ sigma '(z ^ L_j) \ tag {BP1} $

Very natural expression. The first term on the right, ∂C / ∂a L j , measures how quickly the cost changes as a function of output activation No. j. If, for example, C is not particularly dependent on the particular output neuron j, then δ L j will be small, as expected. The second term on the right, σ '(z L j ), measures how quickly the activation function σ changes in z L j .

Note that everything in (BP1) is easy to count. In particular, we calculate z L j when calculating the behavior of the network, and it will take slightly more resources to calculate σ '(z L j ). Of course, the exact form ∂C / ∂a L jdepends on the form of the cost function. However, if the cost function is known, then there should be no problems with calculating ∂C / ∂a L j . For example, if we use the quadratic cost function, then C = 1/2 ∑ j (y j - a L j ) 2 , therefore ∂C / ∂a L j = (a L j - y j ), which is easy to calculate.

Equation (BP1) - it exploded expression delta of L . It is completely normal, but not recorded in the matrix form, which we need for back distribution. However, it is easy to rewrite in matrix form, as

$ \ delta ^ L = \ nabla_a C \ odot \ sigma '(z ^ L) \ tag {BP1a} $

Here ∇ a C is defined as a vector whose components are the partial derivatives ∂C / ∂a L j . It can be represented as an expression of the rate of change of C with respect to output activations. It is easy to see that equations (BP1a) and (BP1) are equivalent, therefore we will use (BP1) to refer to any of them below. For example, in the case of a quadratic value, we have ∇ a C = (a L - y), so the full matrix form (BP1) will be

$ \ delta ^ L = (a ^ L - y) \ odot \ sigma '(z ^ L) \ tag {30} $

Everything in this expression has a convenient vector form, and it is easy to calculate using a library such as Numpy.

The expression of the error δ l through the error in the next layer, δ l + 1 : in particular,

$ \ delta ^ l = ((w ^ {l + 1}) ^ T \ delta ^ {l + 1}) \ cdot \ sigma '(z ^ l) \ tag {BP2} $

where (w l + 1 ) T is the transposition of the weight matrix w l + 1 for layer No. (l + 1). The equation seems complicated, but each element is easy to interpret. Suppose we know the error δ l + 1 for the layer (l + 1). The transposition of the weight matrix, (w l + 1 ) T , can be imagined as moving the error backward through the network, which gives us some measure of error at the output of layer No. l. Then we consider the Hadamard product ⊙σ '(z l ). This pushes the error back through the activation function in layer l, giving us the error value δl in the weighted input for layer l.

By combining (BP2) with (BP1), we can calculate the error δ lfor any network layer. We start by using (BP1) to calculate δ L , then use equation (BP2) to calculate δ L-1 , then again to calculate δ L-2 , and so on, all the way to the back of the network.

The equation of the rate of change of cost in relation to any offset in the network : in particular:

$ \ frac {\ partial C} {\ partial b ^ l_j} = \ delta ^ l_j \ tag {BP3} $

That is, the error δ l j is exactly equal to the rate of change ∂C / ∂b l j . This is excellent because (BP1) and (BP2) have already told us how to calculate δ l j . We can rewrite (BP3) shorter as

$ \ frac {\ partial C} {\ partial b} = \ delta \ tag {31} $

where δ is estimated for the same neuron as the bias b.

The equation for the rate of change of value in relation to any weight in the network : in particular:

$ \ frac {\ partial C} {\ partial w ^ l_ {jk}} = a ^ {l-1} _k \ delta ^ l_j \ tag {BP4} $

From here we learn how to calculate the partial derivative ∂C / ∂w l jk through the values ​​of δ l and a l-1 , the calculation method of which we already know. This equation can be rewritten in a less loaded form:

$ \frac{\partial C}{\partial w} = a_{\rm in} \delta_{\rm out} \tag{32} $

where a in is the activation of the neural input for the weight w, and δ out is the error of the neural output from the weight w. If we look in more detail at the weight w and two neurons connected by it, then we can draw it this way:

A nice consequence of equation (32) is that when the activation a in is small, a in ≈ 0, the term term ∂C / ∂w also tends to to zero. In this case, we say that the weight is trained slowly, that is, it does not change much during the gradient descent. In other words, one of the consequences (BP4) is that the weighted output of neurons with low activation learns slowly.

Other ideas can be drawn from (BP1) - (BP4). Let's start with the output layer. Consider the term σ '(z L j) in (BP1). Recall from the graph of the sigmoid from the last chapter that it becomes flat when σ (z L j ) approaches 0 or 1. In these cases, σ '(z L j ) ≈ 0. Therefore, the weight in the last layer will be trained slowly if activation the output neuron is small (≈ 0) or large (≈ 1). In this case, it is usually said that the output neuron is saturated, and as a result, the weight has ceased to be trained (or is being trained slowly). The same remarks are valid for displacements of the output neuron.

Similar ideas can be obtained regarding earlier layers. In particular, consider the term σ '(z l ) in (BP2). This means that δ l jmost likely will be small as the neuron approaches saturation. And this, in turn, means that any weights at the input of a saturated neuron will be trained slowly (however, this will not work if w l + 1 T δ l + 1 will have sufficiently large elements that compensate for the small value of σ '(z L j )).

To summarize: we learned that weight will be trained slowly if either the activation of the input neuron is small or the output neuron is saturated, that is, its activation is small or large.

This is not particularly surprising. And yet, these observations help improve our understanding of what happens when we train the network. Moreover, we can approach these arguments from the other side. The four fundamental equations are valid for any activation function, and not just for the standard sigmoid (since, as we will see later, the properties do not use the sigmoid). Therefore, these equations can be used to develop activation functions with certain necessary learning properties. For example, suppose we choose an activation function σ that is different from a sigmoid, such that σ 'is always positive and does not approach zero. This prevents the learning slowdown that occurs when normal sigmoid neurons are saturated. Later in the book we will see examples where the activation function changes in a similar way.

Bottom line: backpropagation equations


  • Alternative notation of backpropagation equations. I wrote down the backpropagation equations using the Hadamard product. This can confuse people who are not used to this work. There is another approach, based on the usual matrix multiplication, which may be instructive for some readers. Show that (BP1) can be rewritten as

$ \delta^L = \Sigma'(z^L) \nabla_a C \tag{33} $

where Σ '(z L ) is a square matrix with σ' (z L j ) diagonally located and the other elements are 0. Note that this matrix interacts with ∇ a C through the usual matrix multiplication.

Show that (BP2) can be rewritten as

$ \delta^l = \Sigma'(z^l) (w^{l+1})^T \delta^{l+1} \tag{34} $

Combining the previous tasks, show that:

$ \delta^l = \Sigma'(z^l) (w^{l+1})^T \ldots \Sigma'(z^{L-1}) (w^L)^T \Sigma'(z^L) \nabla_a C \tag{35} $

For readers accustomed to matrix multiplication, this equation will be easier to understand than (BP1) and (BP2). I concentrate on (BP1) and (BP2) because this approach is faster to implement numerically. [here Σ is not the sum (∑), but the capital σ (sigma) / approx. perev. ]

Proof of the four fundamental equations (optional section)

Now we prove the four fundamental equations (BP1) - (BP4). All of them are consequences of the chain rule (the rule of differentiation of a complex function) from the analysis of functions of many variables. If you are familiar with the chain rule, I highly recommend trying to count the derivatives yourself before continuing with the reading.

Let's start with the equation (BP1), which gives us an expression for the output error delta of L . To prove it, we recall that, by definition:

$ \delta^L_j = \frac{\partial C}{\partial z^L_j} \tag{36} $

Applying the chain rule, we rewrite the partial derivatives through the partial derivatives of output activations:

$ \delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} \tag{37} $

where the summation goes over all neurons k in the output layer. Of course, the output activation a L k of neuron No. k depends only on the weighted input z L j for neuron No. j when k = j. Therefore, ∂a L k / ∂z L j disappears when k ≠ j. As a result, we simplify the previous equation to

$ \delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j} \tag{38} $

Recalling that a L j = σ (z L j ), we can rewrite the second term on the right as σ '(z L j ), and the equation turns into

$ \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \tag{39} $

that is, in (BP1) in an exploded view.

Then we prove (BP2), which gives the equation for the error δ l through the error in the next layer δ l + 1 . To do this, we need to rewrite δ l j = ∂C / ∂z l j through δ l + 1 k = ∂C / ∂z l + 1 k . This can be done using the chain rule:

$ \delta^l_j = \frac{\partial C}{\partial z^l_j} \tag{40} $

$ = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41} $

$ = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \tag{42} $

where in the last line we swapped the two terms on the right, and substituted the definition of δ l + 1 k . To calculate the first term on the last line, note that

$ z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k \tag{43} $

Differentiating, we get

$ \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \tag{44} $

Substituting this into (42), we obtain

$ \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \tag{45} $

That is, (BP2) in an exploded entry.

It remains to prove (BP3) and (BP4). They also follow from the chain rule, in approximately the same way as the two previous ones. I will leave them to you as an exercise.


  • Prove (BP3) and (BP4).

That's all the proof of the four fundamental backpropagation equations. It may seem complicated. But in reality, this is simply the result of careful application of the chain rule. Speaking less succinctly, back propagation can be imagined as a way of calculating the gradient of the cost function through the systematic application of the chain rule from the analysis of the functions of many variables. And that’s really all that back distribution is — the rest is just details.

Backpropagation algorithm

The backpropagation equations give us a method for calculating the gradient of the cost function. Let's write this explicitly as an algorithm:
  1. Input x: assign appropriate activation a 1 for the input layer.
  2. Direct distribution: for each l = 2,3, ..., L, calculate z l = w l a l − 1 + b l and a l = σ (z l ).
  3. Output error δ L : calculate vector δ L = ∇ a C ⊙ σ '(z L ).
  4. The reverse propagation of the error: for each l = L − 1, L − 2, ..., 2, calculate δ l = ((w l + 1 ) T δ l + 1 ) ⊙ σ '(z l ).
  5. Output: the gradient of the cost function is specified $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.

Looking at the algorithm, you will understand why it is called backpropagation. We calculate the error vectors δ l backwards, starting from the last layer. It may seem strange that we are going backward through the network. But if you think about the proof of back propagation, then the reverse movement is a consequence of the fact that cost is a function of the network output. To understand how cost varies with early weights and offsets, we need to apply the chain rule over and over, going back through the layers to get useful expressions.


  • Обратное распространение с одним изменённым нейроном. Допустим, мы изменили один нейрон в сети с прямым распространением так, чтобы его выход был f(∑j wjxj+b), где f – некая функция, не похожая на сигмоиду. Как нам поменять алгоритм обратного распространения в данном случае?
  • Обратное распространение с линейными нейронами. Допустим, мы заменим обычную нелинейную сигмоиду на σ(z) = z по всей сети. Перепишите алгоритм обратного распространения для данного случая.

As I explained earlier, the backpropagation algorithm computes the gradient of the cost function for one training example, C = C x . In practice, back propagation is often combined with a learning algorithm, for example, with stochastic gradient descent, when we calculate the gradient for many training examples. In particular, for a given mini-package of m training examples, the following algorithm applies gradient descent based on this mini-package:

  1. Entrance: A set of training examples.
  2. For each training example x, assign the corresponding input activation a x, 1 and perform the following steps:
    • Прямое распространение для каждого l=2,3,…,L вычислить zx,l = wlax,l−1+bl и ax,l = σ(zx,l).
    • Выходная ошибка δx,L: вычислить вектор δx,L = ∇a Cx ⋅ σ'(zx,L).
    • Обратное распространение ошибки: для каждого l=L−1,L−2,…,2 вычислить δx,l = ((wl+1)Tδx,l+1) ⋅ σ'(zx,l).
  3. Градиентный спуск: для каждого l=L,L−1,…,2 обновить веса согласно правилу $w^l \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, и смещения согласно правилу $b^l \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l}$.

Of course, to implement stochastic gradient descent in practice, you will also need an external cycle that generates mini-packages of training examples, and an external cycle that goes through several epochs of training. For simplicity, I omitted them.

Code for back distribution

Having understood the abstract side of backpropagation, we can now understand the code used in the previous chapter that implements backpropagation. Recall from that chapter that the code was contained in the update_mini_batch and backprop methods of the Network class. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batch method updates network weights and offsets by calculating the gradient for the current mini_batch training examples:

class Network(object):
    def update_mini_batch(self, mini_batch, eta):
        """Обновить веса и смещения сети, применяя градиентный спуск с использованием обратного распространения к одному мини-пакету. mini_batch – это список кортежей (x, y), а eta – скорость обучения."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the lines delta_nabla_b, delta_nabla_w = self.backprop (x, y), using the backprop method to calculate the partial derivatives ∂C x / ∂b l j and ∂C x / ∂w l jk. The backprop method almost repeats the algorithm of the previous section. There is one small difference - we use a slightly different approach to layer indexing. This is done in order to take advantage of the python feature, negative array indexes that allow you to count elements backwards from the end. l [-3] will be the third element from the end of the array l. The backprop code is given below, along with auxiliary functions used to calculate the sigmoid, its derivative and derivative of the cost function. With them, the code is complete and understandable. If something is unclear, refer to the first full listing code description.

class Network(object):
   def backprop(self, x, y):
        """Вернуть кортеж ``(nabla_b, nabla_w)``, представляющий градиент для функции стоимости C_x.  ``nabla_b`` и ``nabla_w`` - послойные списки массивов numpy, похожие на ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # прямой проход
        activation = x
        activations = [x] # список для послойного хранения активаций
        zs = [] # список для послойного хранения z-векторов
        for b, w in zip(self.biases, self.weights):
            z =, activation)+b
            activation = sigmoid(z)
        # обратный проход
        delta = self.cost_derivative(activations[-1], y) * \
        nabla_b[-1] = delta
        nabla_w[-1] =, activations[-2].transpose())
        """Переменная l в цикле ниже используется не так, как описано во второй главе книги. l = 1 означает последний слой нейронов, l = 2 – предпоследний, и так далее. Мы пользуемся преимуществом того, что в python можно использовать отрицательные индексы в массивах."""
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta =[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] =, activations[-l-1].transpose())
        return (nabla_b, nabla_w)
    def cost_derivative(self, output_activations, y):
        """Вернуть вектор частных производных (чп C_x / чп a) для выходных активаций."""
        return (output_activations-y) 
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))
def sigmoid_prime(z):
    """Производная сигмоиды."""
    return sigmoid(z)*(1-sigmoid(z))


  • Полностью основанный на матрицах подход к обратному распространению на мини-пакете. Наша реализация стохастического градиентного спуска использует цикл по обучающим примерам из мини-пакета. Алгоритм обратного распространения можно поменять так, чтобы он вычислял градиенты для всех обучающих примерах мини-пакета одновременно. Вместо того, чтобы начинать с одного вектора x, мы можем начать с матрицы X=[x1x2…xm], чьими столбцами будут векторы мини-пакета. Прямое распространение идёт через произведение весовых матриц, добавление подходящей матрицы для смещений и повсеместного применения сигмоиды. Обратное распространение идёт по той же схеме. Напишите псевдокод для такого подхода для алгоритма обратного распространения. Измените так, чтобы он использовал этот матричный подход. Преимуществом такого подхода будет использование всех преимуществ современных библиотек для линейной алгебры. В итоге он может работать быстрее цикла по мини-пакеты (к примеру, на моём компьютере программа ускоряется примерно в 2 раза на задачах классификации MNIST). На практике все серьёзные библиотеки для обратного распространения используют такой полноматричный подход или какой-то его вариант.

В каком смысле обратное распространение является быстрым алгоритмом?

In what sense is back propagation a fast algorithm? To answer this question, consider another approach to calculating the gradient. Imagine the early days of neural network research. Perhaps this is the 1950s or 1960s, and you are the first person in the world who came up with the idea of ​​using gradient descent for training! But for this to work, you need to calculate the gradient of the cost function. You recall algebra and decide to see if you can use the chain rule to calculate the gradient. Having played a little, you see that algebra seems difficult, and you are disappointed. You are trying to find a different approach. You decide to consider the cost as a function of only the weights C = C (w) (we will return to offsets a bit later). You number the weights w 1 , w 2 , ... and want to calculate ∂C / ∂w jfor weight w j . The obvious way is to use the approximation

$ \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon} \tag{46} $

Where ε> 0 is a small positive number, and e j is the unit direction vector j. In other words, we can approximately estimate ∂C / ∂w j by calculating the cost C for two slightly different values ​​of w j , and then apply equation (46). The same idea allows us to calculate the partial derivatives of ∂C / ∂b with respect to displacements.

The approach looks promising. Conceptually simple, easy to implement, uses only a few lines of code. It looks much more promising than the idea of ​​using a chain rule to calculate the gradient!

Unfortunately, although this approach looks promising, when implemented in code, it turns out that it works extremely slowly. To understand why, imagine that we have a million weights in the network. Then for each weight w j we need to calculate C (w + εe j ) to calculate ∂C / ∂w j . And this means that to calculate the gradient, we need to calculate the cost function a million times, which will require a million direct passes through the network (for each training example). And we also need to calculate C (w), so we get a million and one pass through the network.

The trick of backpropagation is that it allows us to simultaneously calculate all the partial derivatives ∂C / ∂w jusing only one forward pass through the network, followed by one return pass. Roughly speaking, the computational cost of the return pass is about the same as that of the direct one.

Therefore, the total cost of back propagation is approximately the same as that of two direct passes through the network. Compare this with the million and one direct pass necessary to implement method (46)! So, although backpropagation looks like a more complicated approach, in reality it is much faster.

For the first time this acceleration was fully appreciated in 1986, and this dramatically expanded the range of tasks solved with the help of neural networks. In turn, this has led to an increase in the number of people using neural networks. Of course, backpropagation is not a panacea. Even in the late 1980s, people already encountered its limitations, especially when trying to use back propagation to train deep neural networks, that is, networks with many hidden layers. Later we will see how modern computers and new tricky ideas make it possible to use backpropagation to train such deep neural networks.

Reverse distribution: in general

As I have already explained, back propagation reveals two mysteries to us. The first thing the algorithm actually does? We have developed a back propagation scheme for the error from the output. Is it possible to go deeper further, get a more intuitive idea of ​​what happens during all these multiplications of vectors and matrices? The second riddle is how did anyone even detect back propagation? It is one thing to follow the steps of the algorithm or the proof of its operation. But this does not mean that you understood the problem so well that you could invent this algorithm. Is there a reasonable line of reasoning that can lead us to the discovery of the backpropagation algorithm? In this section, I will cover both puzzles.

To improve understanding of the operation of the algorithm, imagine that we made a small change Δw l jksome weight w l jk :

This change in weight will lead to a change in the output activation of the corresponding neuron:

This will lead to a change in all activations of the next layer:

These changes will lead to changes in the next layer, and so on, right up to the last, and then to changes in the cost function:

Change ΔC is related to the change in Δw l jk by the equation

$ \Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk} \tag{47} $

It follows that a probable approach to calculating ∂C / ∂w l jk is to carefully monitor the propagation of a small change w l jk , leading to a small change in C. If we can do this, carefully expressing along the way everything in quantities that are easy to calculate , then we can calculate ∂C / ∂w l jk .

Let's try. A change in Δw l jk causes a slight change in Δa l j in the activation of neuron j in layer l. This change is set.

$ \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk} \tag{48} $

The change in activation Δa l j leads to changes in all activations of the next layer, (l + 1). We will concentrate only on one of these changed activations, for example, a l + 1 q ,

This will lead to the following changes:

$ \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \Delta a^l_j \tag{49} $

Substituting equation (48), we obtain:

$ \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk} \tag{50} $

Of course, a change in Δa l + 1 q will also change the activation in the next layer. We can even imagine a path along the entire network from w l jk to C, where each change in activation leads to a change in the next activation, and, finally, to a change in the output value. If the path goes through activations a l j , a l + 1 q , ..., a L − 1 n , a L m , then the final expression will be

$ \Delta C \approx \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk} \tag{51} $

That is, we choose a member of the form ∂a / ∂a for each of the next neuron we pass, as well as for the term ∂C / ∂a L m at the end. This is a representation of changes in C due to changes in activations on this particular path through the network. Of course, there are many ways in which a change in w l jk can go through and affect the cost, and we considered only one of them. To calculate the total change in C, it is reasonable to assume that we should summarize all the possible paths from weight to final cost:

$ \Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk} \tag{52} $

where we summed up all the possible choices for intermediate neurons along the way. Comparing this with (47), we see that:

$ \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}. \tag{53} $

Equation (53) looks complicated. However, it does have a nice intuitive interpretation. We count the change in C with respect to the network weights. It tells us that each edge between two neurons of the network is associated with a ratio factor, which is only a partial derivative of the activation of one neuron with respect to the activation of another neuron. For a rib from the first weight to the first neuron, the ratio factor is ∂a l j / ∂w l jk . The ratio coefficient for the path is simply the product of the coefficients along the path. And the total coefficient of change ∂C / ∂w l jk is the sum of the coefficients in all ways from the initial weight to the final cost. This procedure is shown below for one path:

So far, we have been giving a heuristic argument, a way to represent what happens when the network weights change. Let me outline a further way of thinking on this topic for the development of this argument. First, we can derive the exact expression for all individual partial derivatives in equation (53). This is easy to do using simple algebra. After that, you can try to understand how to write down all the sums by indices in the form of matrix products. This turns out to be a tedious task requiring patience, but not something extraordinary. After all this and maximum simplification, you will see that the exact same backpropagation algorithm is obtained! Therefore, the backpropagation algorithm can be imagined as a way of calculating the sum of the coefficients for all paths. Or, to reformulate,

Here I will not do all this. This business is unattractive, requiring careful study of the details. If you are ready for this, you may like to do this. If not, I hope that such thoughts will give you some ideas regarding the goals of backpropagation.

What about another riddle - how could back propagation be discovered at all? In fact, if you follow the path I have outlined, you will receive evidence of back propagation. Unfortunately, the proof will be longer and more complicated than what I described earlier. So how was that short (but even more mysterious) evidence discovered? If you write down all the details of a long proof, you will immediately notice a few obvious simplifications. You apply simplifications, get simpler proof, write it down. And then you again come across some obvious simplifications. And you repeat the process. After several repetitions, the proof that we saw earlier will be short, but a bit incomprehensible, since all the milestones have been removed from it! Of course, I suggest you take my word for it, however, there is no real mystery of the origin of the evidence. Just a lot of hard work to simplify the proof that I described in this section.

However, there is one clever trick in this process. In equation (53), the intermediate variables are activations of the type a l + 1 q . The trick is to switch to using weighted inputs, such as z l + 1 q , as intermediate variables. If you do not use this and continue to use activations, the evidence obtained will be a little more complicated than the one given earlier in this chapter.

Also popular now: