Probabilistic Models: From Naive Bayes to LDA, Part 1
- Tutorial
We continue the conversation. The previous article was a transition from the earlier series on graphical models in general (part 1, part 2, part 3, part 4) to a new mini-series on topic modeling: there we discussed sampling as an inference method in graphical models. Now we begin the road to the latent Dirichlet allocation (LDA) model and to how all these wonderful sampling algorithms are applied in practice. Today is the first part, in which we will see in which directions it makes sense to generalize the naive Bayes classifier, and along the way we will talk a little about clustering.


Classification: Naive Bayes
Topic modeling addresses a classical problem of text analysis: how to build a probabilistic model of a large collection of texts that can then be used, for example, for information retrieval, classification, or, as in our case, for content recommendations. We have already discussed one of the simplest models for this problem, the well-known naive Bayes classifier. Of course, texts can be classified in other ways too, with support vector machines, say, or logistic regression, or any other classifier; but naive Bayes will be most useful to us now as a starting point for further generalization.
In the naive Bayes model, each document is assigned a hidden variable corresponding to its topic (taken from a predefined discrete set, for example, "finance", "culture", "sport"), and the words of the document are conditionally independent of each other given this topic. Each topic corresponds to a discrete distribution over words from which the words are generated; we simply assume that the words are conditionally independent given the topic:
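As a minimal sketch in standard notation (assuming a document with words w_1, ..., w_n and topic c), this assumption reads:

```latex
p(w_1, \ldots, w_n \mid c) = \prod_{i=1}^{n} p(w_i \mid c),
\qquad
p(c, w_1, \ldots, w_n) = p(c) \prod_{i=1}^{n} p(w_i \mid c).
```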

As a result, each topic is a discrete distribution over words. You can picture it as a huge loaded die that you roll to get the next word, or as a bag of words into which you reach without looking to pull out the next one. Admittedly, a "loaded bag" is harder to imagine; let's say each word is written on several pebbles in the bag, and different words appear on different numbers of pebbles (this intuition will still be useful to us, bear with me for a moment). You have one bag labeled "finance", another "culture", a third "sport":

And when you "generate" a new document from these bags in naive Bayes, you first select a bag (by rolling a die), and then begin drawing words from that bag one by one:
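Here is a minimal sketch of this generative process in Python; the topic weights, vocabulary, and per-topic word probabilities below are made-up illustrative numbers, not parameters from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model parameters: topic weights (the "die")
# and per-topic word distributions (the "bags").
topics = ["finance", "culture", "sport"]
alpha = np.array([0.5, 0.3, 0.2])                       # p(topic)
vocab = ["money", "bank", "film", "goal", "match"]
beta = {
    "finance": np.array([0.5, 0.4, 0.05, 0.03, 0.02]),  # p(word | topic)
    "culture": np.array([0.05, 0.05, 0.8, 0.05, 0.05]),
    "sport":   np.array([0.02, 0.03, 0.05, 0.45, 0.45]),
}

def generate_document(n_words=10):
    # First roll the die to choose a bag (topic)...
    topic = rng.choice(topics, p=alpha)
    # ...then draw words from that bag, one by one.
    words = rng.choice(vocab, size=n_words, p=beta[topic])
    return topic, list(words)

print(generate_document())
```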

The graphical model of this process is very simple (we discussed how to read such pictures in the first part of the previous series, and in the second part we even worked through a naive Bayes example). On the left is the full graphical model; in the center, the model of a single document in plate notation, where the plate hides the repetition of the same variable; and on the right, the same model for the whole dataset, with explicit variables α and β that hold the parameters of all these discrete distributions:



It also shows that naive Bayes consists of one discrete distribution (the die), which determines the relative "weight" of the topics, and several discrete distributions (the bags), one per topic, which give the probability of each word appearing in that topic.
And to train naive Bayes (as with any other classifier, there is no way around it), you need a labeled set of texts on which these distributions can be trained:

When training naive Bayes, we know the topic of each document, so it remains only to train the word distribution within each topic separately and the probabilities of the topics themselves:

To do this, simply count how many times each word occurs in each topic (and how many times each topic occurs in the dataset), and apply Laplace smoothing to the result.
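As a sketch in standard notation (the count symbols below are introduced here for illustration): with add-one Laplace smoothing over a vocabulary V and topic set C, the estimates are

```latex
\hat{\beta}_{cw} = \frac{n_{cw} + 1}{\sum_{w' \in V} n_{cw'} + |V|},
\qquad
\hat{\alpha}_{c} = \frac{n_{c} + 1}{n_{D} + |C|},
```

where n_{cw} is the number of occurrences of word w in documents of topic c, n_c is the number of documents with topic c, and n_D is the total number of documents.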
The naive Bayes classifier immediately suggests two directions in which it could be extended and improved:
- how necessary is it to have a labeled dataset? Labeling a dataset by hand is very expensive, and the dataset has to be large, because we are training a distribution over all words in each topic; can we perhaps manage without labels, if we are clever about it?
- in real datasets, it is rare for a document to have just one well-defined topic (short tweets, perhaps); for example, the news item "Chelsea and Brazil national team defender David Luis was sold to PSG for more than £40 million" would noticeably spoil our model, because it mixes words about sport and finance, and as a result the sport topic would be "contaminated" with financial words, or vice versa; can we let a single document have several topics?
From classification to clustering
It is logical to start with the first direction, moving to datasets without labels, because even if we build a model in which each document has several topics, labeling a large dataset by hand, and with several topics per document at that, would be extremely difficult. So let's imagine that we have a dataset consisting of texts, and assume, as in naive Bayes, that each document was generated from a single topic. The only difference is that now we do not know which one, and the task looks like this:

Let's assume (we will do the same in LDA; to get rid of this assumption we would need nonparametric methods, which are still far ahead of us) that we know the number of potential categories (different bags of words); only their contents are unknown. Thus, the probabilistic model is exactly the same:

and the distributions in it are all the same. The only difference is that before we knew the topic of each document during training, and now we do not. In other words, the joint distribution from which documents are generated stays the same: if we denote by w the vector of words and by c the category of a document D, its overall likelihood will be

where β_c is the word distribution in the bag corresponding to category c; and the overall likelihood that needs to be maximized during training is, accordingly,

and it must be maximized, as before, over α and β_c. But whereas earlier we did this with known c, and the maximization problem decomposed over separate topics and was essentially trivial, now we need to maximize with unknown c, i.e. effectively taking an expectation over them:
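One standard way to write these likelihoods, as a sketch (w_{dj} denotes the j-th word of document d, n_d its length, and N the number of documents in the dataset):

```latex
% Joint likelihood of one document with words w and category c:
p(w, c \mid \alpha, \beta) = \alpha_c \prod_{j=1}^{n} \beta_{c, w_j},
% and the log-likelihood of an unlabeled dataset, with the unknown categories summed out:
\log p(\mathcal{D} \mid \alpha, \beta)
    = \sum_{d=1}^{N} \log \sum_{c} \alpha_c \prod_{j=1}^{n_d} \beta_{c, w_{dj}}.
```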

How to do this?
What we have here is in fact a completely standard clustering problem: there are many objects (texts) in a multidimensional space (of words), and we need to split them into separate classes so that texts within one class are as similar to each other as possible and, at the same time, as dissimilar as possible from texts in other classes. So the standard tool for training clustering models can help here: the EM algorithm (short for Expectation-Maximization; you will see why in a moment). The EM algorithm is designed for situations when the model has hidden variables (as we have c), and we need to maximize the likelihood of the model over its parameters (for us, α and β), but in expectation over the hidden variables. The EM algorithm first initializes the model with some initial values and then iteratively repeats two steps:
- E-step (expectation): find the expected values of the hidden variables under the current model;
- M-step (maximization): maximize the likelihood, treating the hidden variables as fixed at the expectations found at the E-step.
In our case, at the E-step we simply compute, for fixed α and β, the probabilities that each document belongs to each category:
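As a sketch, writing γ_{dc} (a symbol introduced here) for the responsibility of category c for document d, the E-step computes

```latex
\gamma_{dc} = p(c_d = c \mid w_d, \alpha, \beta)
            = \frac{\alpha_c \prod_{j} \beta_{c, w_{dj}}}
                   {\sum_{c'} \alpha_{c'} \prod_{j} \beta_{c', w_{dj}}}.
```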

And then at the M-step we treat these membership probabilities as fixed and optimize α and β (preferably with smoothing terms α_0 and β_0, corresponding to certain prior distributions on the parameters α and β):
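A sketch of the corresponding M-step updates, assuming the smoothing enters as additive pseudo-counts α_0 and β_0 (the exact form may differ):

```latex
\alpha_c \propto \alpha_0 + \sum_{d} \gamma_{dc},
\qquad
\beta_{cw} \propto \beta_0 + \sum_{d} \gamma_{dc}\, n_{dw},
```

where n_{dw} is the number of occurrences of word w in document d, and each distribution is then normalized to sum to one.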


Then all of this is repeated until α and β converge; this is the EM algorithm for clustering, applied to the case of discrete features (words in documents).
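Here is a minimal runnable sketch of this EM loop in Python for a mixture of discrete (multinomial) word distributions; the function name, the toy document-term counts, and the smoothing values are illustrative assumptions:

```python
import numpy as np

def em_multinomial_mixture(counts, n_topics, n_iter=100,
                           alpha0=1.0, beta0=1.0, seed=0):
    """counts: (n_docs, vocab_size) matrix of word counts per document."""
    rng = np.random.default_rng(seed)
    n_docs, vocab = counts.shape
    # Initialize topic weights uniformly and topic-word distributions randomly.
    alpha = np.full(n_topics, 1.0 / n_topics)
    beta = rng.dirichlet(np.ones(vocab), size=n_topics)   # (n_topics, vocab)

    for _ in range(n_iter):
        # E-step: responsibilities gamma[d, c] proportional to
        # alpha[c] * prod_w beta[c, w]^n_dw, computed in log space for stability.
        log_gamma = np.log(alpha) + counts @ np.log(beta).T
        log_gamma -= log_gamma.max(axis=1, keepdims=True)
        gamma = np.exp(log_gamma)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: re-estimate alpha and beta with additive smoothing.
        alpha = alpha0 + gamma.sum(axis=0)
        alpha /= alpha.sum()
        beta = beta0 + gamma.T @ counts                    # (n_topics, vocab)
        beta /= beta.sum(axis=1, keepdims=True)

    return alpha, beta, gamma

# Tiny made-up example: 4 documents over a 5-word vocabulary, 2 topics.
counts = np.array([[3, 2, 0, 0, 0],
                   [2, 3, 1, 0, 0],
                   [0, 0, 1, 3, 2],
                   [0, 1, 0, 2, 3]])
alpha, beta, gamma = em_multinomial_mixture(counts, n_topics=2)
print(np.round(gamma, 2))
```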
The general idea of why the EM algorithm actually works is as follows: at each E-step, the EM algorithm in effect builds an auxiliary function of the model parameters that touches the likelihood function at the current approximation and stays below it everywhere else (it minorizes the likelihood function). Here is a picture from one of the sources on this topic (honestly, I forgot which one):

And then at the M-step, the EM algorithm finds the parameters that maximize this auxiliary function. Clearly, in doing so we move to a point at which the overall likelihood can only increase. I will not go deep into the proof that the EM algorithm really constructs such a bound and really works correctly; only the general idea matters to us here.
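As a sketch, the standard form of this auxiliary lower bound comes from Jensen's inequality, with q(c) being the distribution over hidden variables computed at the E-step:

```latex
\log p(w \mid \theta) = \log \sum_{c} p(w, c \mid \theta)
  \ge \sum_{c} q(c) \log \frac{p(w, c \mid \theta)}{q(c)},
```

with equality at the current parameters when q(c) = p(c | w, θ), which is exactly what the E-step computes.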
What's next
Today we took the first of two steps on the way from the naive Bayes classifier to LDA: we moved from the classification setting, in which every document in the training dataset must have a fixed category, to the clustering setting, in which documents do not have to be labeled at all. Note, by the way, that if you do have some documents labeled (this is called semi-supervised learning), it is very easy to add them to the resulting model: you just fix the corresponding variables c_i(D), setting them to one for the labeled topic and zero for all the others, and do not update these variables within the EM algorithm, leaving their values fixed. Such a "seed dataset" will, by the way, be very useful for further training of the model.
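In terms of the EM sketch above, this amounts to clamping the responsibilities of labeled documents after each E-step; the helper below is a hypothetical illustration, not part of the original model:

```python
def clamp_labeled(gamma, labeled_topics):
    """Clamp EM responsibilities for documents whose topic is known.

    gamma: (n_docs, n_topics) responsibilities from the E-step;
    labeled_topics: dict mapping document index -> known topic index.
    """
    gamma = gamma.copy()
    for d, c in labeled_topics.items():
        gamma[d, :] = 0.0          # zero out all topics...
        gamma[d, c] = 1.0          # ...except the labeled one
    return gamma

# Usage inside the EM loop sketched above, right after the E-step:
# gamma = clamp_labeled(gamma, labeled_topics={0: 0, 3: 1})
```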
Next time we will take the second step towards latent Dirichlet allocation: from a single category assigned to an entire text, we will move on to topics, of which each document may have several.