# Multilayer perceptron (with an example in PHP)

While reading Habr articles on neural networks and artificial intelligence in general, I came across a post on the single-layer perceptron. Out of curiosity I decided to start studying neural networks with it, and then to extend the experience to the multilayer perceptron, which is what this article is about.

#### Theory

The multilayer perceptron is well described on Wikipedia, but only its structure is covered there; we will try it out in practice, together with the learning algorithm. The algorithm is also described on Wikipedia, although, comparing it with several other sources (books and aiportal.ru), I found a few problematic places both here and there.
So, a multilayer perceptron is a neural network consisting of layers, each of which consists of elements: neurons (more precisely, their models). These elements come in three types: sensory (input, S), associative (the trainable "hidden" layers, A) and reactive (output, R). This kind of perceptron is called multilayer not simply because it consists of several layers (the input and output layers can even be omitted in the code), but because it contains several (usually no more than two or three) trainable (A) layers.
A neuron model (we will simply call it a neuron) is a network element with several inputs, each of which has a weight. On receiving a signal, the neuron multiplies each input signal by its weight and sums the resulting values, then passes the result on to another neuron or to the network output. Here, too, the multilayer perceptron has a distinctive feature: its activation function is a sigmoid, which yields values in the range from 0 to 1. Several functions belong to the sigmoid family; we will mean the logistic function. As you go through the method, you will see why this choice is so convenient.
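As a minimal sketch of what was just described (the function names are mine, not taken from the article's code), the logistic activation and the output of a single neuron could look like this in PHP:

```php
<?php
// Logistic (sigmoid) activation: maps any real input into the range (0, 1).
function sigmoid(float $x): float {
    return 1.0 / (1.0 + exp(-$x));
}

// Output of one neuron: the weighted sum of its inputs,
// passed through the sigmoid activation function.
function neuronOutput(array $inputs, array $weights): float {
    $sum = 0.0;
    foreach ($inputs as $i => $input) {
        $sum += $input * $weights[$i];
    }
    return sigmoid($sum);
}
```

With zero weights the weighted sum is 0, so the neuron outputs sigmoid(0) = 0.5, which is why freshly initialized networks start out "undecided".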
Several layers that can be trained (more precisely, adjusted) allow us to approximate very complex nonlinear functions; that is, the scope of multilayer perceptrons is wider than that of single-layer ones.

#### We try in practice

Let's move straight from theory to practice, so that it is better remembered and everyone can try it.
If you are not an expert on neural networks, I recommend reading the post mentioned above first.
So, let's take a simple task: recognizing digits without rotation or distortion. For such a task a multilayer perceptron is sufficient; moreover, it is less sensitive to noise.

##### Network structure

The network will have two hidden layers, each five times smaller than the input layer. That is, if we have 20 inputs, each hidden layer will have 4 neurons. For this task I allow myself the liberty of choosing the number of layers and neurons empirically: we take 2 layers, since increasing the number of layers does not improve the result.
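The sizing rule above can be written down directly (this is my illustration of the described structure, not the article's code; the output layer size depends on how the digits are encoded, so it is omitted here):

```php
<?php
// Layer sizes as described: each hidden layer is
// five times smaller than the input layer.
$inputSize  = 20;                    // number of network inputs
$hiddenSize = intdiv($inputSize, 5); // 4 neurons per hidden layer
$layers     = [$inputSize, $hiddenSize, $hiddenSize]; // input + two hidden layers
```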

##### Learning algorithm

Neural networks of the selected type are trained with the error backpropagation algorithm. That is, once the layers have propagated a signal forward to the network output, we compare the network's answer with the correct one and compute the error, which then travels "back" through the network, from the outputs to the inputs.

We will evaluate the network error as half the sum of the squared differences of the signals at the outputs. Put simply: take half the sum over i of the expressions (ti - oi)^2, where ti is the value of the i-th signal in the correct answer and oi is the value of the i-th output of the neural network. That is, we sum the squared errors at the outputs and divide the total in half. If this error (in the example code it is \$d) is large enough (does not fit into the accuracy we need), we correct the weights of the neurons.
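This error measure is a one-liner in code (a sketch under the same notation; the function name is mine, only the variable \$d matches the article's code):

```php
<?php
// Network error: half the sum of squared differences between the
// target outputs $targets (t_i) and the actual outputs $outputs (o_i).
function networkError(array $targets, array $outputs): float {
    $d = 0.0;
    foreach ($targets as $i => $t) {
        $d += ($t - $outputs[$i]) ** 2;
    }
    return $d / 2.0;
}
```

A perfect answer gives an error of 0; one output that is off by 1 contributes 0.5.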

The weight-correction formulas are given in full on Wikipedia, so I will not repost them here. I will just note that I tried to follow the formulas literally, so that it is clear how they work in practice. This is where the advantage of our activation function shows up: it has a simple derivative, σ'(x) = σ(x) * (1 - σ(x)), which is used to correct the weights. The weights of each layer are adjusted separately, layer by layer from the last to the first. And here I made a mistake: at first I adjusted the weights after each example separately, and the neural network learned to solve only one "example". In this algorithm it is correct to feed all the examples of the training set to the inputs in turn; this is called an epoch. Only at the end of the epoch do we compute the error (total over all examples of the set) and adjust the weights.
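To show how the sigmoid derivative enters the correction, here is a sketch of one gradient step for an output neuron (my simplified illustration without the momentum term, not the article's code; η is the learning rate from the next paragraph):

```php
<?php
// Derivative of the logistic function, expressed through its output:
// σ'(x) = σ(x) * (1 - σ(x)); here $o is already σ(x), so no exp() is needed.
function sigmoidDerivative(float $o): float {
    return $o * (1.0 - $o);
}

// One correction step for the weights of a single output neuron:
// delta = (t - o) * σ'(o), and each weight moves by η * delta * input.
function updateOutputWeights(array $weights, array $inputs, float $t, float $o, float $eta): array {
    $delta = ($t - $o) * sigmoidDerivative($o);
    foreach ($weights as $i => $w) {
        $weights[$i] = $w + $eta * $delta * $inputs[$i];
    }
    return $weights;
}
```

Note that the derivative is computed from the neuron's output value alone, which is exactly why the logistic function is such a cheap choice here.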

During training, jumps in the error are possible; this is normal. The choice of the coefficients α (which determines how strongly the previous weight changes influence training) and η (which scales the magnitude of the correction δ) is very important: the rate of convergence and the chance of getting stuck in local extrema depend on it. I consider α = 0.7 and η = 0.001 the most universal, although do try playing with them: increasing α and η speeds up learning, but we may overshoot a minimum.

Finally, the example in PHP. The code is far from ideal, but it does its job.