# Hacker's Guide to Neural Networks. Chapter 2: Machine Learning. Binary Classification


Content:


**Chapter 1: Real-Valued Circuits**

Part 1:

```
Introduction
Base Case: Single Gate in the Circuit
The Goal
Strategy #1: Random Local Search
```

Part 2:

```
Strategy #2: Numerical Gradient
```

Part 3:

```
Strategy #3: Analytic Gradient
```

Part 4:

```
Circuits with Multiple Gates
Backpropagation
```

Part 5:

```
Patterns in the "backward" flow
Example: Single Neuron
```

Part 6:

```
Becoming a Backpropagation Master
```

**Chapter 2: Machine Learning**

In the last chapter we looked at real-valued circuits that computed complex expressions of their inputs (the forward pass), and we were also able to compute the gradients of those expressions with respect to the original inputs (the backward pass). In this chapter we will see how useful this surprisingly simple mechanism is in machine learning.

**Binary Classification**

As before, let's start simple. The simplest, most standard and yet very common problem in machine learning is binary classification. Many interesting and important problems can be reduced to it. The setup is as follows: we are given a dataset of N vectors, and each of them is labeled +1 or -1. In two dimensions, our dataset might look like this:

vector -> label

```
---------------
[1.2, 0.7] -> +1
[-0.3, 0.5] -> -1
[-3, -1] -> +1
[0.1, 1.0] -> -1
[3.0, 1.1] -> -1
[2.1, -3] -> +1
```
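For concreteness, the dataset above can be written down directly. This is a small illustrative sketch in Python (the variable names are my own, not from the original):

```python
# The toy dataset from the text: six 2-D points, each labeled +1 or -1.
data = [
    ([1.2,  0.7], +1),
    ([-0.3, 0.5], -1),
    ([-3,  -1],   +1),
    ([0.1,  1.0], -1),
    ([3.0,  1.1], -1),
    ([2.1, -3],   +1),
]
```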

Here we have N = 6 datapoints, where each point has two features (D = 2). Three of the datapoints are labeled +1 and the other three are labeled -1. This is a toy example, but in practice a +1/-1 dataset can be genuinely useful: for example spam/no-spam classification of emails, where the vectors somehow measure various features of the email content, such as the number of mentions of certain suspicious products.

**Goal.** Our goal in binary classification is to learn a function that takes a two-dimensional vector and predicts a label. The function is usually parameterized by a set of parameters, and we want to tune those parameters so that the function's outputs agree with the labels in the given dataset. In the end we can throw away the dataset and use the learned parameters to predict labels for previously unseen vectors.

**Training protocol**

We will eventually build up to entire neural networks and complex expressions, but let's start simple and train a linear classifier very similar to the single neuron we saw at the end of Chapter 1. The only difference is that we drop the sigmoid, because it complicates things unnecessarily (I used it only as an example in Chapter 1, since sigmoid neurons are historically popular, even though modern neural networks rarely use sigmoid nonlinearities). In any case, let's use a simple linear function:

**f(x, y) = ax + by + c**

In this expression we treat **x** and **y** as the inputs (the 2D vectors) and **a, b, c** as the parameters of the function that we need to learn. For example, if a = 1, b = -2, c = -1, then the function applied to the first datapoint ([1.2, 0.7]) gives 1 * 1.2 + (-2) * 0.7 + (-1) = -1.2. Here is how the training will work:

1. We pick a random datapoint and feed it through the circuit.

2. We interpret the output of the circuit as a confidence that the datapoint has class +1 (i.e. very high values mean the circuit is very certain the datapoint has class +1, and very low values mean the circuit is very certain the datapoint has class -1).

3. We measure how well the prediction aligns with the provided label. Concretely, if a positive example produces a very low value, we will want to tug on the circuit in the positive direction, demanding that it output a higher value for this datapoint. Note that this is the case for the first datapoint: it is labeled +1, but our prediction function assigns it the value -1.2. We will therefore tug on the circuit in the positive direction; we want the value to be higher.

4. The circuit receives the tug and responds with backpropagation, computing the tugs on the inputs **a, b, c, x, y**.

5. Since we consider x, y as (fixed) datapoints, we will ignore the pull on x, y. If you like my physical analogies, think of these inputs as pegs driven into the ground.

6. The parameters a, b, c, on the other hand, we allow to respond to their tug (that is, we perform what is called a parameter update). This will, of course, make the circuit output a slightly higher value for this particular datapoint in the future.

7. Repeat! We return to step 1.
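The seven steps above can be sketched as a tiny training loop. This is a hedged illustration under my own assumptions, not the author's exact code: the names (`a`, `b`, `c`, `step_size`), the iteration count, and the simple "tug only when the score is on the wrong side of a margin" rule are choices I made for the sketch. The gradient of f = ax + by + c with respect to a, b, c is simply x, y, 1, which is what the update uses.

```python
import random

random.seed(0)  # deterministic for reproducibility

# Dataset from the text: ([x, y], label)
data = [([1.2, 0.7], +1), ([-0.3, 0.5], -1), ([-3, -1], +1),
        ([0.1, 1.0], -1), ([3.0, 1.1], -1), ([2.1, -3], +1)]

a, b, c = 1.0, -2.0, -1.0   # initial parameters, as in the worked example
step_size = 0.01

for iteration in range(400):
    (x, y), label = random.choice(data)   # step 1: pick a random datapoint
    f = a * x + b * y + c                 # forward pass through the circuit

    # Step 3: decide the tug. Pull up if a positive example scores too low,
    # pull down if a negative example scores too high; otherwise leave it alone.
    pull = 0.0
    if label == +1 and f < 1:
        pull = +1.0
    if label == -1 and f > -1:
        pull = -1.0

    # Steps 4-6: backpropagate the tug. df/da = x, df/db = y, df/dc = 1.
    # We ignore the tugs on x and y (they are pegs driven into the ground).
    a += step_size * (x * pull)
    b += step_size * (y * pull)
    c += step_size * (1 * pull)

# Fraction of training points the circuit now classifies correctly by sign(f)
accuracy = sum(1 for (x, y), label in data
               if (a * x + b * y + c > 0) == (label > 0)) / len(data)
print(f"a={a:.2f}, b={b:.2f}, c={c:.2f}, train accuracy={accuracy:.2f}")
```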

The training scheme I described above is commonly referred to as stochastic gradient descent. An interesting point that I would like to repeat is that **a, b, c, x, y** are all the same thing as far as the circuit is concerned: they are inputs to the circuit, and the circuit will tug on all of them in some direction. It does not know the difference between parameters and datapoints. However, after the backward pass is complete, we ignore all tugs on the datapoints (x, y) and keep swapping them in and out as we iterate over the examples in the dataset. The parameters (a, b, c), on the other hand, we keep around and keep tugging on every time we sample a datapoint. Over time, the pulls on these parameters will tune them so that the function outputs high values for positive examples and low values for negative examples.
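Once training is done, classifying a new vector is just a matter of checking the sign of f. A minimal sketch; the parameter values used below are the illustrative ones from the worked example (a = 1, b = -2, c = -1), not actually trained values:

```python
def predict(a, b, c, x, y):
    """Classify a 2-D point as +1 or -1 by the sign of f(x, y) = ax + by + c."""
    f = a * x + b * y + c
    return +1 if f > 0 else -1

# With the example parameters, the first datapoint [1.2, 0.7] gets
# f = -1.2, so it is (incorrectly) predicted as class -1.
print(predict(1.0, -2.0, -1.0, 1.2, 0.7))  # -1
```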