# Probabilistic Models: Bayesian Networks

- Tutorial

In this blog we talked a lot about what: there were brief descriptions of the main recommendation algorithms ( task setting , user-based and item-based , SVD: 1 , 2 , 3 , 4 ), several models for working with content ( naive Bayes , LDA , a review of text analysis methods ), there was a series of articles about a cold start ( task setting , text mining , tags ), there was a mini-series about multi-armed bandits ( part 1 , part 2 ).

To move on and put these and many other methods in a common context, we need to develop some common base, learn the language that modern data processing methods speak - the language of

First, briefly recall what we have already talked about . The main tool for machine learning is the Bayesian theorem:

Here D is the data, what we know, and θ are the model parameters that we want to train. For example, in the SVD model, data is the ratings that users put to products, and model parameters are factors that we train for users and products. - this is what we want to find, the probability distribution of the model parameters

Thus, we come to one of the main tasks of the Bayesian conclusion: to find the posterior distribution on hypotheses / parameters:

and to find the maximum posterior hypothesis .

But if everything is so simple - we multiplied two functions, maximized the result - what is the whole science about? The problem is that the distributions that interest us are usually too complex to be directly maximized analytically. There are too many variables in them; between the variables there are too complex relationships. But, on the other hand, they often have an additional structure that can be used, a structure in the form of

Recall our talk about naive Bayes : in the model there turned out to be an overwhelming number of parameters, and we made an additional assumption about the conditional independence of attributes (words), subject to the topic:

As a result, it was possible to rewrite the complex posterior distribution as

in this form it was much easier to deal with it (to train the parameters of each small distribution separately, and then choose

So, let's look at one of the most convenient ways to represent large and complex probability distribution -

For example, the graph corresponding to naive bayes:

It corresponds to the decomposition

,

We already know that in this case all attributes

Now let's look at all the simplest variants of the Bayesian network and see what conditions of independence between the variables they correspond to.

We start with a network of two variables,

And if there is an edge (let it go from

Now go to the networks of three variables,

As with the two variables, which means that

Another simple case is when

This case is also similar to that considered above; for example, let the edges go from

which, again, is always true, for any joint distribution. In this formula, one could select the variables in any order, nothing would change. Note that directional loops in Bayesian networks are forbidden, and as a result of the options how to draw all the edges, there are only six, not eight.

These observations, of course, are generalized to the case of any number of variables: if the variables are not connected at all with each other, they are independent, and the graph in which all the edges are drawn (with arrows in any directions, but so that directed cycles are not obtained) , corresponds to the decomposition, which is always true, and does not give us any information.

In this section, we will consider three more interesting cases - these will be those “bricks” from which any Bayesian network can be composed. Fortunately, for this it is enough to consider graphs on three variables - everything else will be generalized from them. In the examples below, I’ll be somewhat against the truth and intuitively interpret the edge, the arrow between the two variables, as “

We start with a sequential connection between the variables:

Such a graph depicts a decomposition.

Intuitively, this corresponds to a consistent causal relationship: if you run in winter without a hat, you will catch a cold, and if you catch a cold, your temperature will rise. Obviously,

Formally, this corresponds to the conditional independence of

where the first equality is the definition of conditional probability, the second is our expansion, and the third is the application of Bayes' theorem.

So, the consistent relationship between the three variables tells us that the extreme variables are conditionally independent under the condition of the average. Everything is very logical and fairly straightforward.

The next possible option is a divergent connection:

Such a graph depicts a decomposition.

Intuitively, this corresponds to two consequences from the same reason: if you catch a cold, you may have a fever and a runny nose. As in the previous case, it is obvious that

However, in such a network, like the previous case,

Formally, this corresponds to the conditional independence of

So, the diverging connection between the three variables tells us that the “consequences” are conditionally independent under the condition of their “common cause”. If the cause is known, then the consequences become independent; while the cause is unknown, the consequences through it are connected. Again, so far it seems nothing surprising.

We have only one possible variant of the connection between the three variables: a converging connection, when

The decomposition here is this:

This is a situation in which the same effect can have two different causes: for example, the temperature can be the result of a cold, or it can be a result of poisoning. Are cold and poisoning dependent? Not! In this situation, while the overall result of unknown causes two have nothing to do with each other, and it is very easy to test formally,

however, if the "general consequence»

Thus, with a converging connection, the two “causes” are independent, but only as long as the meaning of their “common effect” is unknown; if the general effect is signified, the causes become dependent.

But there is, as Vasily Ivanovich said, one nuance: when there is a converging connection on the path, it is not enough to look only at its “general effect” to determine independence. In fact, even if

Sequential, converging and diverging bonds are the three bricks that make up any acyclic directed graph. And our reasoning is quite enough to generalize the results on conditional dependence and independence to all such graphs. Of course, this is not the time or the place to formally prove the general theorems, but the result is predictable enough - suppose you need to check in a large graph whether two vertices (or even two sets of vertices) are independent. To do this, you look at all the paths in the graph (excluding the arrows) connecting these two sets of vertices. Each of these paths can be “broken” by one of the above constructions: for example, a serial connection will break if it has a meaning in the middle (the value of the variable is known), and a converging connection will break if, on the contrary, there is no signification (and not at the very top of the path, nor in its descendants). If as a result all the paths are broken, then the sets of variables are really independent; if not, no.

So, today we looked at a simple, convenient and important representation of factorization of complex distributions - Bayesian networks. Next time we will consider another (but, of course, similar) graphic representation and begin to develop output algorithms, i.e. solving Bayesian inference problems in complex probabilistic models.

To move on and put these and many other methods in a common context, we need to develop some common base, learn the language that modern data processing methods speak - the language of

*graphical probability models*. Today is the first part of this story, the simplest, with pictures and explanations.### Bayesian inference problems

First, briefly recall what we have already talked about . The main tool for machine learning is the Bayesian theorem:

Here D is the data, what we know, and θ are the model parameters that we want to train. For example, in the SVD model, data is the ratings that users put to products, and model parameters are factors that we train for users and products. - this is what we want to find, the probability distribution of the model parameters

*after*we take into account the data; this is called*posterior probability*. This probability, as a rule, cannot be directly found, and Bayes' theorem is needed here. - this is the so-called*likelihood*, likelihood of data subject to fixed model parameters; this is usually easy to find, in fact, the design of the model usually consists in setting the likelihood function. A is*a*prior probability, it is a mathematical formalization of our intuition about an object, a formalization of what we knew before, even before any experiments.Thus, we come to one of the main tasks of the Bayesian conclusion: to find the posterior distribution on hypotheses / parameters:

and to find the maximum posterior hypothesis .

### Factorization of complex distributions and Bayesian networks

But if everything is so simple - we multiplied two functions, maximized the result - what is the whole science about? The problem is that the distributions that interest us are usually too complex to be directly maximized analytically. There are too many variables in them; between the variables there are too complex relationships. But, on the other hand, they often have an additional structure that can be used, a structure in the form of

*independence*( ) and*conditional independence*( ) of some variables.Recall our talk about naive Bayes : in the model there turned out to be an overwhelming number of parameters, and we made an additional assumption about the conditional independence of attributes (words), subject to the topic:

As a result, it was possible to rewrite the complex posterior distribution as

in this form it was much easier to deal with it (to train the parameters of each small distribution separately, and then choose

*v*that gives the maximum of the product). This is one of the simplest examples of decomposition (factorization) of a complex distribution into a product of simple ones. Today and in subsequent texts, we will be able to generalize this method and bring it to unrivaled heights.So, let's look at one of the most convenient ways to represent large and complex probability distribution -

*Bayesian belief network*(Bayesian belief networks), which have recently been often referred to simply as*directed graphical models*(directed graphical models). A Bayesian network is a directed graph without directed cycles (this is a very important condition!) In which the vertices correspond to the variables in the distribution, and the edges connect the “connected” variables. In each node, a conditional distribution of the node is specified, provided that its parents provide it , and the graph of the Bayesian network means that a large joint distribution is decomposed into the product of these conditional distributions.For example, the graph corresponding to naive bayes:

It corresponds to the decomposition

,

*C has*no ancestors, so we take its unconditional (marginal) distribution, and each of*A*“grows” directly from_{i}*C*and is not connected with anyone else.We already know that in this case all attributes

*A*conditionally independent under the condition of category_{i are}*C*:Now let's look at all the simplest variants of the Bayesian network and see what conditions of independence between the variables they correspond to.

### Bayesian networks with two and three variables: trivial cases

We start with a network of two variables,

*x*and*y*. There are only two options: either there is no edge between*x*and*y*, or there is. If there is no edge, it simply means that*x*and*y are*independent, because such a graph corresponds to a decomposition .And if there is an edge (let it go from

*x*to*y*, it doesn’t matter), we get a decomposition that, literally by the definition of conditional probability, is always stupidly true for any distribution . Thus, a graph of two vertices with an edge does not give us new information.Now go to the networks of three variables,

*x*,*y*and*z*. The simplest case is when there are no edges at all.As with the two variables, which means that

*x*,*y*and*z*only independent: .Another simple case is when

*all the*edges are drawn between the variables .This case is also similar to that considered above; for example, let the edges go from

*x*to*y*and*z*, and also from*y*to*z*. We get a decompositionwhich, again, is always true, for any joint distribution. In this formula, one could select the variables in any order, nothing would change. Note that directional loops in Bayesian networks are forbidden, and as a result of the options how to draw all the edges, there are only six, not eight.

These observations, of course, are generalized to the case of any number of variables: if the variables are not connected at all with each other, they are independent, and the graph in which all the edges are drawn (with arrows in any directions, but so that directed cycles are not obtained) , corresponds to the decomposition, which is always true, and does not give us any information.

### Bayesian networks with three variables: interesting cases

In this section, we will consider three more interesting cases - these will be those “bricks” from which any Bayesian network can be composed. Fortunately, for this it is enough to consider graphs on three variables - everything else will be generalized from them. In the examples below, I’ll be somewhat against the truth and intuitively interpret the edge, the arrow between the two variables, as “

*x*affects*y*”, i.e. essentially as a causal relationship. In fact, this, of course, is not entirely true - for example, you can often expand an edge without losing meaning (remember: in a graph of two variables it did not matter which way the edge should be drawn). And in general this is a difficult philosophical question - what are “cause-effect relationships” and why are they needed. But for reasoning on the fingers will do.#### Serial communication

We start with a sequential connection between the variables:

*x*“affects”*y*, and*y*, in turn, “affects”*z*.Such a graph depicts a decomposition.

Intuitively, this corresponds to a consistent causal relationship: if you run in winter without a hat, you will catch a cold, and if you catch a cold, your temperature will rise. Obviously,

*x*and*y*, as well as*y*and*z*are connected to each other, between them even the arrows are drawn directly. Are*x*and*z*connected in a network like thisare these variables dependent? Of course! If you run in winter without a hat, the likelihood of getting a high temperature increases. However, in such a network,*x*and*z*are connected only through*y*, and if we already know the value of*y*,*x*and*z*become independent: if you already know that you have a cold, it does not matter what caused it, the temperature will now rise (or not rise) ) from a cold.Formally, this corresponds to the conditional independence of

*x*and*z*under the condition*y*; let's check this:where the first equality is the definition of conditional probability, the second is our expansion, and the third is the application of Bayes' theorem.

So, the consistent relationship between the three variables tells us that the extreme variables are conditionally independent under the condition of the average. Everything is very logical and fairly straightforward.

#### Diverging connection

The next possible option is a divergent connection:

*x*“affects” both*y*and*z*.Such a graph depicts a decomposition.

Intuitively, this corresponds to two consequences from the same reason: if you catch a cold, you may have a fever and a runny nose. As in the previous case, it is obvious that

*x*and*y*, as well as*x*and*z are*dependent, and the question is between*y*and*z*. Again, it is obvious that these variables are dependent: if you have a runny nose, this increases the likelihood that you have a cold, which means that the likelihood of high temperature also increases.However, in such a network, like the previous case,

*y*and*z*are connected only through*x*, and if we already know the meaning of the common reason*x*,*y*and*z*become independent: if you already know that you have a cold, a runny nose and temperature become independent. Disclaimer: my medical examples are extremely arbitrary, in fact, I have no idea if the common cold and temperature are independent in case of a cold, and I even suspect that it’s unlikely, because many other diseases can also cause both of these symptoms together. But in our conditional model in the universe there are no other diseases besides the common cold.Formally, this corresponds to the conditional independence of

*y*and*z*subject to*x*; checking this is even easier than for a serial connection:So, the diverging connection between the three variables tells us that the “consequences” are conditionally independent under the condition of their “common cause”. If the cause is known, then the consequences become independent; while the cause is unknown, the consequences through it are connected. Again, so far it seems nothing surprising.

#### Convergent connection

We have only one possible variant of the connection between the three variables: a converging connection, when

*x*and*y*together “affect”*z*.The decomposition here is this:

This is a situation in which the same effect can have two different causes: for example, the temperature can be the result of a cold, or it can be a result of poisoning. Are cold and poisoning dependent? Not! In this situation, while the overall result of unknown causes two have nothing to do with each other, and it is very easy to test formally,

however, if the "general consequence»

*the z*becomes famous, the situation is changing. Now, the general causes of a known consequence begin to influence each other. This phenomenon is called explaining away in English: suppose you know that you have a fever. This greatly increases the likelihood of both a cold and poisoning. However, if you now find out that you are poisoned, the likelihood of a cold will decrease — the symptom is “already explained” to one of the possible causes, and the second becomes less likely.Thus, with a converging connection, the two “causes” are independent, but only as long as the meaning of their “common effect” is unknown; if the general effect is signified, the causes become dependent.

But there is, as Vasily Ivanovich said, one nuance: when there is a converging connection on the path, it is not enough to look only at its “general effect” to determine independence. In fact, even if

*z*does not have a signification, but one of its descendants (possibly quite distant) has it, two reasons will still become dependent. Intuitively, this is also easy to understand: for example, let us not observe the temperature itself, but observe its descendant — the readings of the thermometer. Probably, there are faulty thermometers in the world, so this is also a certain probabilistic connection. However, observing the thermometer readings in the same way makes colds and poisoning dependent.### Putting it all together

Sequential, converging and diverging bonds are the three bricks that make up any acyclic directed graph. And our reasoning is quite enough to generalize the results on conditional dependence and independence to all such graphs. Of course, this is not the time or the place to formally prove the general theorems, but the result is predictable enough - suppose you need to check in a large graph whether two vertices (or even two sets of vertices) are independent. To do this, you look at all the paths in the graph (excluding the arrows) connecting these two sets of vertices. Each of these paths can be “broken” by one of the above constructions: for example, a serial connection will break if it has a meaning in the middle (the value of the variable is known), and a converging connection will break if, on the contrary, there is no signification (and not at the very top of the path, nor in its descendants). If as a result all the paths are broken, then the sets of variables are really independent; if not, no.

So, today we looked at a simple, convenient and important representation of factorization of complex distributions - Bayesian networks. Next time we will consider another (but, of course, similar) graphic representation and begin to develop output algorithms, i.e. solving Bayesian inference problems in complex probabilistic models.