Mixture Density Networks

Hello!

Let's talk about, as you probably already guessed, neural networks and machine learning. From the title, it is clear that Mixture Density Networks will be discussed, then simply MDN, I don’t want to translate the name and leave it as it is. Yes, yes, yes ... it will be a bit boring mathematics and probability theory, but without it, unfortunately, or fortunately, then decide for yourself, it is difficult to imagine the world of machine learning. But I hasten to reassure you, it will be relatively small and it will not be very difficult. And in general it will be possible to skip it, but just look at a small amount of Python and PyTorch code, that's right, we will write the network using PyTorch, as well as various graphs with results. But the most important thing is that there will be an opportunity to understand a bit and understand what the MD network is.

Well, let's start!

Regression

Let's first refresh our knowledge a bit and remember, quite briefly, what a linear regression is .

We have a vector

$X = \ {x_1, x_2, ..., x_n \}$ we need to predict the value

$inline$ which somehow depends on

$inline$ using a certain linear model:

$\ hat {Y} = X ^ T \ hat {\ beta}$

As the error function, we will use the square deviation (Squared Error):

$SE (\ beta) = \ sum_ {i = 1} ^ n (y_i- \ hat {y} _i) ^ 2 = \ sum_ {i = 1} ^ N (y_i-x_i ^ T \ hat {\ beta} ) ^ 2$

This problem can be solved directly by taking the derivative of the SE and equating its value to zero:

$\ frac {\ delta SE (\ beta)} {\ delta \ beta} = 2X ^ T (\ mathbf {y} -X \ beta) = 0$

Thus, we will simply find its minimum, and SE is a quadratic function, which means that the minimum will always exist. After that, you can already easily find

$\ beta$ :

$\ hat \ beta = (X ^ TX) ^ {- 1} X ^ T \ mathbf {y}$

That's all, the problem is solved. This concludes the recall of what is linear regression.

Of course, the dependence inherent in the nature of data generation may be different and then we need to add some nonlinearity to our model. Solving the regression problem directly for large and real data is also a bad idea, since there is a matrix there

$inline$ dimensions

$n \ times n$ , and it is still necessary to find its inverse matrix, and it often happens that such a matrix simply does not exist. In this case, various methods based on gradient descent come to our rescue. And non-linearity of models can be implemented in different ways, including using neural networks.

But now, let's talk not about this, but about the error functions. What is the difference between SE and Log-Likelihood in the case where the data may have a non-linear relationship?

We deal with the zoo, namely: OLS, LS, SE, MSE, RSS

All this is essentially the same, RSS is the residual sum of squares, OLS is the least squares, LS is the least squares, MSE is the mean squared error, SE is squared error. In different sources you can find different names. The essence of this is only one: the square deviation . You can get confused of course, but you get used to it quickly.

It should be noted that the MSE is the average deviation, some average error value for the entire training data set. In practice, MSE is usually used. The formula is no different:

$MSE (\ beta) = \ frac {1} {N} \ sum_ {i = 1} ^ n (y_i- \ hat {y} _i) ^ 2$

$inline$ - dataset size,

$\ hat {y} _i$ - model prediction for

$inline$ .

Stop! Likelihood? This is something from the theory of probability. That's right - this is the theory of probability in its purest form. But how can the standard deviation be related to the likelihood function? And as it turns out connected. Associated with finding the maximum likelihood (Maximum Likelihood) and the normal distribution, to be more precise, with its average

$\ mu$ .

In order to realize that this is the case, let's look again at the square deviation function:

$RSS (\ beta) = \ sum_ {i = 1} ^ n (y_i- \ hat {y} _i) ^ 2 \ qquad \ qquad (1)$

Now suppose that the likelihood function has a normal appearance, that is, a Gaussian or normal distribution:

$L (X) = p (X | \ theta) = \ prod ^ X \ mathcal {N} (x_i; \ mu, \ sigma ^ 2)$

In general, what is the likelihood function and what meaning it contains I will not tell, you can read about this in another place, you should also get acquainted with the concept of conditional probability, Bayes theorem and much more, for a deeper understanding. It all goes into pure probability theory, which is studied both at school and at university.

Now, remembering the formula of normal distribution, we get:

$L (X; \ mu, \ sigma ^ 2) = \ prod ^ X \ frac {1} {\ sqrt {2 \ pi \ sigma ^ 2}} e ^ {- \ frac {(x_i- \ mu) ^ 2} {2 \ sigma ^ 2}} \ qquad \ qquad (2)$

And what if we put the standard deviation

$\ sigma ^ 2 = 1$ and remove all the constants in the formula (2), just remove it, do not reduce, because finding the minimum of the function does not depend on them. Then see this:

$L (X; \ mu, \ sigma ^ 2) \ sim \ prod ^ Xe ^ {- (x_i- \ mu) ^ 2}$

Still nothing like? Not? Well, what if you take the logarithm of a function? From the logarithm there are generally some pluses: multiplication will transform into a sum, degree into multiplication, and

$\ log {e} = 1$ - for this property it is worth clarifying that this is a natural logarithm and strictly speaking

$\ ln {e} = 1$ . And in general, the logarithm of a function does not change its maximum, and this is the main feature for us. About the connection with Log-Likelihood and Likelihood and why it will be useful to be discussed below, in a small digression. And so, what we did: removed all the constants, and took the logarithm of the likelihood function. They also removed the minus sign, thus turning the Log-Likelihood into the Negative Log-Likelihood (NLL), the connection between them will also be described as a bonus. As a result, we got the NLL function:

$\ log L (X; \ mu, I ^ 2) \ sim \ sum (X- \ mu) ^ 2$

Take another look at the RSS (1) function. Yes, they are the same! Exactly! It also shows that

$\ mu = \ hat {y}$ .

If we use the MSE standard deviation function, we get from this:

$\ operatorname {argmin} MSE (\ beta) \ sim \ operatorname {argmax} \ mathbb {E} _ {X \ sim P_ {data}} \ log P_ {model} (x; \ beta)$

Where

$\ mathbb {E}$ - expected value,

$\ beta$ - the parameters of the model, we will further denote them as:

$\ theta$ .

Conclusion: If, in the regression question, we use the LS family as error functions, then in essence we solve the problem of finding the maximum likelihood function in the case when the distribution is Gaussian. And the predicted value

$\ hat {y}$ equal to the average in the normal distribution. And now we know how all this is connected, how probability theory is related (with its likelihood function and normal distribution) and the methods of standard deviation or OLS. More details about this can be found in [2].

And here is the promised bonus. If we are talking about the links between the various functions of the error, then we will consider (it is not necessary to read):

Relationship between Cross-Entropy, Likelihood, Log-Likelihood and Negative Log-Likelihood

Suppose we have data

$X = \{x_1, x_2, x_3, x_4,...\}$ , each point belongs to a certain class, for example

$\{x_1\rightarrow1, x_2\rightarrow2, x_3\rightarrow n,...\}$ . There is a total

$inline$ классов, при этом класс 1 встречается

$inline$ раз, класс 2 —

$inline$ раз, а класс

$inline$ —

$inline$ раз. На этих данных мы обучили некоторую модель

$\theta$ . Функция правдоподобия (Likelihood) для нее будет выглядеть так:

$P(1|\theta)P(2|\theta)...P(n|\theta)=\prod^{c_1}\hat{y}_1\prod^{c_2}\hat{y}_2...\prod^{c_n}\hat{y}_n=\hat{y}_1^{c_1}\hat{y}_2^{c_2}...\hat{y}_n^{c_n}$

где

$P(n|\theta) = \hat{y}_n$ — предсказанная вероятность для класса

$inline$ .

Берем логарифм от функции правдоподобия и получаем Log-Likelihood:

$\log{P(data|\theta)}=\log{(\hat{y}_1^{c_1}...\hat{y}_n^{c_n})}=c_1\log{\hat{y_1}}+...+c_n\log{\hat{y_n}}=\sum_i^n{c_i\log{\hat{y_i}}}$

Вероятность

$\hat{y} \in [0, 1]$ лежит в пределах от 0 до 1, исходя из определения вероятности. Следовательно логарифм будет иметь отрицательное значение. И если умножить Log-Likelihood на -1 мы получим функцию Negative Log-Likelihood (NLL):

$NLL=-\log{P(data|\theta)}=-\sum_i^n{c_i\log{\hat{y_i}}}$

Если поделим NLL на количество точек в

$inline$ ,

$inline$ , то получим:

$-\frac{1}{N}\log{P(data|\theta)}=-\sum_i^n{\frac{c_i}{N}\log{\hat{y_i}}}$

при этом можно заметить, что реальная вероятность для класса

$inline$ равна:

$y_n=\frac{c_n}{N}$ . Отсюда получаем:

$Nll = - \ sum_i ^ n {y_i \ log {\ hat {y_i}}}$

теперь если посмотреть на определение кросс-энтропии

$H (p, q) = - \ sum {p \ log {q}}$ то получим:

$Nll = h (y_i, \ hat {y_i})$

В случае когда у нас всего два класса

$inline$ (бинарная классификация) получим формулу для binary cross entropy (так же можно встретить всем известное название Log-Loss):

$H (y, \ hat {y}) = - (y \ log {\ hat {y}} + (1-y) \ log {(1- \ hat {y})})$

Из ходя из всего этого можно понять, что в некоторых случаях минимизация Cross-Entropy эквивалентна минимизации NLL или нахождению максимума функции правдоподобия (Likelihood) или Log-Likelihood.

Пример. Рассмотрим бинарную классификацию. У нас есть значения классов:

y = np.array([0, 1, 1, 1, 1, 0, 1, 1]).astype(np.float32)

Реальная вероятность

$inline$ для класса 0 равна

$inline$ , для класса 1 равна

$inline$ . Пусть у нас есть бинарный классификатор который предсказывает вероятность класса 0

$\ hat {y}$ для каждого примера, соответственно для класса 1 вероятность равна

$(1- \ hat {y})$ . Построим график значений функции Log-Loss для разных предсказаний

$\ hat {y}$ :

На графике можно увидеть что минимум функции Log-Loss соответствует точке 0.75, т.е. если бы наша модель полностью «выучила» распределение исходных данных, $\ hat {y} = y$ .

Regression using neural networks

Here we come to the more interesting, to practice. Let's see how you can solve the regression problem using neural networks (neural networks). We will implement everything in the Python programming language; to create a network, we use the PyTorch deep learning library.

Source data generation

Input data

$\ mathbf {X} \ in \ mathbb {R} ^ N$ let's generate using uniform distribution (uniform distribution), we will take an interval from-15 to 15,

$\ mathbf {X} \ in U [-15, 15]$ . Points

$\ mathbf {Y}$ get using the equation:

$\ mathbf {Y} = 0.5 \ mathbf {X} + 8 \ sin (0.3 \ mathbf {X}) + noise \ qquad \ qquad (3)$

Where

$inline$ - vector noise dimension

$inline$ obtained using the normal distribution with parameters:

$\ mu = 0, \ sigma ^ 2 = 1$ .

Data generation

N = 3000# размер данных
IN_DIM = 1
OUT_DIM = IN_DIM
x = np.random.uniform(-15., 15., (IN_DIM, N)).T.astype(np.float32)
noise = np.random.normal(size=(N, 1)).astype(np.float32)
y = 0.5*x+ 8.*np.sin(0.3*x) + noise # формула 3
x_train, x_test, y_train, y_test = train_test_split(x, y) #разобьем на тренировочные и тестовые данные

Graph of the data.

Network building

Create a regular forward distribution network (feed forward neural network or FFNN).

Build FFNN

classNet(nn.Module):def__init__(self, input_dim=IN_DIM, out_dim=OUT_DIM, layer_size=40):
        super(Net, self).__init__()
        self.fc = nn.Linear(input_dim, layer_size)
        self.logit = nn.Linear(layer_size, out_dim)
    defforward(self, x):
        x = F.tanh(self.fc(x)) # формула 4
        x = self.logit(x)
        return x

Our network consists of one hidden layer with the dimension of 40 neurons and with the activation function - the hyperbolic tangent:

$\ tanh x = \ frac {e ^ xe ^ {- x}} {e ^ x + e ^ {- x}} \ qquad \ qquad (4)$

The output layer is a regular linear transformation without an activation function.

Training and getting results

We’ll use AdamOptimizer as an optimizer. Number of learning epochs = 2000, learning rate (learning rate or lr) = 0.1.

FFNN training

deftrain(net, x_train, y_train, x_test, y_test, epoches=2000, lr=0.1):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(net.parameters(), lr=lr)
    N_EPOCHES = epoches
    BS = 1500
    n_batches = int(np.ceil(x_train.shape[0] / BS))
    train_losses = []
    test_losses = []
    for i in range(N_EPOCHES):
        for bi in range(n_batches):
            x_batch, y_batch = fetch_batch(x_train, y_train, bi, BS)
            x_train_var = Variable(torch.from_numpy(x_batch))
            y_train_var = Variable(torch.from_numpy(y_batch))
            optimizer.zero_grad()
            outputs = net(x_train_var)
            loss = criterion(outputs, y_train_var)
            loss.backward()
            optimizer.step()
        with torch.no_grad():
            x_test_var = Variable(torch.from_numpy(x_test))
            y_test_var = Variable(torch.from_numpy(y_test))
            outputs = net(x_test_var)
            test_loss = criterion(outputs, y_test_var)
            test_losses.append(test_loss.item())
            train_losses.append(loss.item())
            if i%100 == 0:
                sys.stdout.write('\r Iter: %d, test loss: %.5f, train loss: %.5f'
                                 %(i, test_loss.item(), loss.item()))
                sys.stdout.flush()
    return train_losses, test_losses
net = Net()
train_losses, test_losses = train(net, x_train, y_train, x_test, y_test)

Now look at the learning outcomes.

The graph of the MSE function values versus the learning iteration, the graph of the values for the training data and the test data.

Real and predicted results on test data.

Inverted data

Complicate the task and invert the data.

Invert data

x_train_inv = y_train
y_train_inv = x_train
x_test_inv = y_train
y_test_inv = x_train

Inverted data graph.

To predict

$\ mathbf {\ hat Y}$ let's use the direct distribution network from the previous section and see how it handles it.

inv_train_losses, inv_test_losses = train(net, x_train_inv, y_train_inv, x_test_inv, y_test_inv)

The graph of the MSE function values versus the learning iteration, the graph of the values for the training data and the test data.

Real and predicted results on test data.

As you can see from the graphs above, our network did n’t cope at all with such data; it simply cannot predict them. And all this happened because in such an inverted task for one point

$inline$ several points can match

$inline$ . You ask, what about noise? He, too, created a situation in which for one

$inline$ could get multiple values

$inline$ . Yes, it's true. But the thing is that, despite the noise, it was all one definite distribution. And since our model essentially predicted

$inline$ , and in the case of MSE it was an average value for a normal distribution (why this is explained in the first part of the article), then it coped well with the “direct” task. In the opposite case, we get several different distributions for one

$inline$ and accordingly we can not get a good result using only one normal distribution.

Mixture Density Network

The fun begins! What is the Mixture Density Network (further MDN or MD network)? In general, this kind of model that is able to simulate several distributions at once:

$p (\ mathbf {y} | \ mathbf {x}; \ theta) = \ sum_k ^ K \ pi_k (\ mathbf {x}) \ mathcal {N} (\ mathbf {y}; \ mu_k (\ mathbf { x}), \ sigma ^ 2 (\ mathbf {x})) \ qquad \ qquad (5)$

What an incomprehensible formula, you say. Let's see. Our MD network is learning to model the average.

$\ mu$ and variance

$\ sigma ^ 2$ for several distributions. In the formula (5)

$\ pi_k (\ mathbf {x})$ - the so-called coefficients of significance of the individual distribution for each point

$x_i \ in \ mathbf {x}$ , a certain mixing coefficient or how much each of the distributions contributes to a certain point. There is a total

$inline$ distributions.

Just a couple of words about

$\ pi_k (\ mathbf {x})$ - in fact, this is also a distribution and represents the probability that for a point

$x_i \ in \ mathbf {x}$ there will be a condition

$inline$ .

Fuh, again this math, let's write something. And so, let's start implementing the network. For our network, take

$inline$ .

self.fc = nn.Linear(input_dim, layer_size)
self.fc2 = nn.Linear(layer_size, 50)
self.pi = nn.Linear(layer_size, coefs)
self.mu = nn.Linear(layer_size, out_dim*coefs) # mean
self.sigma_sq = nn.Linear(layer_size, coefs) # variance

Define the output layers for our network:

x = F.relu(self.fc(x))
x = F.relu(self.fc2(x))
pi = F.softmax(self.pi(x), dim=1)
sigma_sq = torch.exp(self.sigma_sq(x))
mu = self.mu(x)

Write the error function or loss function, formula (5):

defgaussian_pdf(x, mu, sigma_sq):return (1/torch.sqrt(2*np.pi*sigma_sq)) * torch.exp((-1/(2*sigma_sq)) * torch.norm((x-mu), 2, 1)**2)
losses = Variable(torch.zeros(y.shape[0]))  # p(y|x)for i in range(COEFS):
    likelihood = gaussian_pdf(y, mu[:, i*OUT_DIM:(i+1)*OUT_DIM], sigma_sq[:, i])
    prior = pi[:, i]
    losses += prior * likelihood
loss = torch.mean(-torch.log(losses))

Complete MDN Build Code

COEFS = 30classMDN(nn.Module):def__init__(self, input_dim=IN_DIM, out_dim=OUT_DIM, layer_size=50, coefs=COEFS):
        super(MDN, self).__init__()
        self.fc = nn.Linear(input_dim, layer_size)
        self.fc2 = nn.Linear(layer_size, 50)
        self.pi = nn.Linear(layer_size, coefs)
        self.mu = nn.Linear(layer_size, out_dim*coefs) # mean
        self.sigma_sq = nn.Linear(layer_size, coefs) # variance
        self.out_dim = out_dim
        self.coefs = coefs
    defforward(self, x):
        x = F.relu(self.fc(x))
        x = F.relu(self.fc2(x))
        pi = F.softmax(self.pi(x), dim=1)
        sigma_sq = torch.exp(self.sigma_sq(x))
        mu = self.mu(x)
        return pi, mu, sigma_sq
# функция плотности вероятности для нормального распределенияdefgaussian_pdf(x, mu, sigma_sq):return (1/torch.sqrt(2*np.pi*sigma_sq)) * torch.exp((-1/(2*sigma_sq)) * torch.norm((x-mu), 2, 1)**2)
# функция ошибкиdefloss_fn(y, pi, mu, sigma_sq):
    losses = Variable(torch.zeros(y.shape[0]))  # p(y|x)for i in range(COEFS):
        likelihood = gaussian_pdf(y,
                                  mu[:, i*OUT_DIM:(i+1)*OUT_DIM],
                                  sigma_sq[:, i])
        prior = pi[:, i]
        losses += prior * likelihood
    loss = torch.mean(-torch.log(losses))
    return loss

Our MD network is ready to go. Almost ready. It remains to train her and look at the results.

MDN Training

deftrain_mdn(net, x_train, y_train, x_test, y_test, epoches=1000):
    optimizer = optim.Adam(net.parameters(), lr=0.01)
    N_EPOCHES = epoches
    BS = 1500
    n_batches = int(np.ceil(x_train.shape[0] / BS))
    train_losses = []
    test_losses = []
    for i in range(N_EPOCHES):
        for bi in range(n_batches):
            x_batch, y_batch = fetch_batch(x_train, y_train, bi, BS)
            x_train_var = Variable(torch.from_numpy(x_batch))
            y_train_var = Variable(torch.from_numpy(y_batch))
            optimizer.zero_grad()
            pi, mu, sigma_sq = net(x_train_var)
            loss = loss_fn(y_train_var, pi, mu, sigma_sq)
            loss.backward()
            optimizer.step()
        with torch.no_grad():
            if i%10 == 0:
                x_test_var = Variable(torch.from_numpy(x_test))
                y_test_var = Variable(torch.from_numpy(y_test))
                pi, mu, sigma_sq = net(x_test_var)
                test_loss = loss_fn(y_test_var, pi, mu, sigma_sq)
                train_losses.append(loss.item())
                test_losses.append(test_loss.item())
                sys.stdout.write('\r Iter: %d, test loss: %.5f, train loss: %.5f'
                                 %(i, test_loss.item(), loss.item()))
                sys.stdout.flush()
    return train_losses, test_losses
mdn_net = MDN()
mdn_train_losses, mdn_test_losses = train_mdn(mdn_net, x_train_inv, y_train_inv, x_test_inv, y_test_inv)

Graph of values of loss function depending on the iteration of training, on the graph of values for training data and test data.

Since our network has learned the mean values for several distributions, let's look at it:

pi, mu, sigma_sq = mdn_net(Variable(torch.from_numpy(x_test_inv)))

The graph for the two most likely mean values for each point (left). The graph for the 4 most likely mean values for each point (right).

Graph for all mean values for each point.

To predict the data, we will randomly choose several values.

$\ mu$ and

$\ sigma ^ 2$ based on the value

$\ pi_k (\ mathbf {x})$ . And then based on them generate target data

$\ hat {y}$ using normal distribution.

Result prediction

defrand_n_sample_cumulative(pi, mu, sigmasq, samples=10):
    n = pi.shape[0]
    out = Variable(torch.zeros(n, samples, OUT_DIM))
    for i in range(n):
        for j in range(samples):
            u = np.random.uniform()
            prob_sum = 0for k in range(COEFS):
                prob_sum += pi.data[i, k]
                if u < prob_sum:
                    for od in range(OUT_DIM):
                        sample = np.random.normal(mu.data[i, k*OUT_DIM+od], np.sqrt(sigmasq.data[i, k]))
                        out[i, j, od] = sample
                    breakreturn out
pi, mu, sigma_sq = mdn_net(Variable(torch.from_numpy(x_test_inv)))
preds = rand_n_sample_cumulative(pi, mu, sigma_sq, samples=10)

Predicted data for 10 randomly selected values. $\ mu$ and $\ sigma ^ 2$ (left) and two (right).

The figures show that MDN did an excellent job with the “reverse” task.

Using more complex data

Let's see how our MD network will cope with more complex data, such as spiral data. The equation of a hyperbolic spiral in Cartesian coordinates:

$x = \ rho \ cos \ phi \\\ qquad \ qquad \ qquad \ qquad \ qquad \ qquad (6) \\ y = \ rho \ sin \ phi \\$

Spiral Data Generation

N = 2000
x_train_compl = []
y_train_compl = []
x_test_compl = []
y_test_compl = []
noise_train = np.random.uniform(-1, 1, (N, IN_DIM)).astype(np.float32)
noise_test = np.random.uniform(-1, 1, (N, IN_DIM)).astype(np.float32)
for i, theta in enumerate(np.linspace(0, 5*np.pi, N).astype(np.float32)):
    # формула 6
    r = ((theta))
    x_train_compl.append(r*np.cos(theta) + noise_train[i])
    y_train_compl.append(r*np.sin(theta))
    x_test_compl.append(r*np.cos(theta) + noise_test[i])
    y_test_compl.append(r*np.sin(theta))
x_train_compl = np.array(x_train_compl).reshape((-1, 1))
y_train_compl = np.array(y_train_compl).reshape((-1, 1))
x_test_compl = np.array(x_test_compl).reshape((-1, 1))
y_test_compl = np.array(y_test_compl).reshape((-1, 1))

Graph of the obtained spiral data.

For the sake of interest, let's see how the usual Feed-Forward network will cope with this task.

As expected, the Feed-Forward network is not able to solve the regression problem for such data.

We use the previously described and created MD network for training on spiral data.

Mixture Density Network did a great job in this situation.

Conclusion

At the beginning of this article, we recalled the basics of linear regression. Saw that the general between finding of an average for normal distribution and MSE. Disassembled as related NLL and cross entropy. And most importantly, we figured out the MDN model, which is capable of learning from data obtained from a mixed distribution. I hope the article was clear and interesting, despite the fact that there was a bit of mathematics.

The full code can be viewed on GitHub .

Literature

Mixture Density Networks (Christopher M. Bishop, Neural Computing Research Group, Dept. of Computer Science and Applied Mathematics, Aston University, Birmingham) - the theory of MD networks is fully described.
Least squares and maximum likelihood (MROsborne)

Tags: