Graphical models in machine learning. Yandex Workshop

    Despite the enormous popularity of graphical models as a tool for structured classification, the question of how to fit their parameters to a training sample remained open for a long time. In his talk, Dmitry Vetrov spoke about a generalization of the support vector machine and some peculiarities of applying it to tuning the parameters of graphical models. Dmitry heads the Bayesian methods group, is an associate professor at Moscow State University's VMK faculty, and teaches at the ShAD.

    Video recording of the report.

    Report outline:
    • Bayesian methods in machine learning.
    • Tasks with interdependent hidden variables.
    • Probabilistic graphical models.
    • The support vector machine and its generalization for tuning the parameters of graphical models.



    The very concept of machine learning is quite simple: figuratively speaking, it is a search for dependencies in data. In the classical setting the data are a set of objects drawn from the same general population; each object has observed variables and hidden variables. The observed variables (denoted X below) are often called features, and the hidden variables (T) are the ones we want to determine. To establish the relationship between observed and hidden variables, it is assumed that we have a training sample, i.e. a set of objects for which both the observed and the hidden components are known. Looking at it, we try to construct a decision rule that will let us estimate the hidden components of new objects from their features. The training procedure is roughly as follows: a family of admissible decision rules is fixed, usually parameterized by weights (W), and during training these weights are tuned in one way or another. The problem of overfitting inevitably arises: if the family of admissible decision rules is too rich, training can easily produce a rule that predicts the hidden component of the training sample perfectly but gives poor forecasts on new objects. Researchers in machine learning spent many years of effort removing this problem from the agenda, and at present it seems to have largely succeeded.
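    A minimal formalization of this setup (my notation, not taken from the slides): a decision rule maps features to hidden components, and training picks the weights that fit the training sample while a regularizer keeps the family of rules from being "too rich":

\[
\hat T = f(X; W), \qquad
W^{*} = \arg\min_{W} \sum_{n=1}^{N} L\bigl(T_n,\, f(X_n; W)\bigr) + \lambda\, R(W)
\]

    Here L is a loss measuring prediction errors on the N training objects, and R(W) penalizes overly complex rules, which is the usual remedy for overfitting.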

    The Bayesian paradigm is a branch of machine learning in which the relationship between the tunable parameters W, the observed variables X and the hidden variables T is modeled as a joint distribution over all three groups of variables. Accordingly, in the classical statement of machine learning problems, i.e. when the data are assumed to be a collection of independent objects drawn from the same general population, this joint distribution can be represented in the following relatively simple form:
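    Judging by the description of the three factors below, the formula shown on the slide (reconstructed here, since the image is not reproduced) is the factorization

\[
p(T, X, W) = p(T \mid X, W)\; p(X \mid W)\; p(W).
\]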



    The first factor is responsible for the distribution of the hidden variables given the observed features and a given decision rule. The second factor is responsible for the distribution of the observed variables given the decision rule; in so-called discriminative models it is of no interest, since the distribution of the observed variables is not modeled, but in the more general case of so-called generative models it is present. Finally, the last factor, which appears inevitably if we simply follow the rules for working with probabilities, is the prior distribution over the weights W, i.e. over the parameter values of the decision rule. It was precisely the possibility of setting this prior distribution that was probably the key reason why Bayesian methods have been booming for the past ten years. For a long time, until the mid-nineties, researchers were rather skeptical of the Bayesian approach to probability theory, because it requires somehow setting a prior distribution over the parameters. The traditional question was: where do we get it? The likelihood function is determined by the data, but are we to take the prior out of thin air? Clearly, with a different prior the Bayes formula leads to a different result. For a long time, therefore, the Bayesian approach seemed to be a kind of mental game.

    Nonetheless, it turned out that real problems have a lot of specifics. It may seem that all tasks reduce to standard settings, but in most cases there is some specificity, some a priori preference for one form of decision rule or another, i.e. in effect a preference for certain values of the weights W. The question then arose of how to combine this with the quality of the decision rule on the training sample. Traditionally, a machine learning procedure looks like minimization of the training error: we try to construct a decision rule so that the variables T are predicted as accurately as possible on the training sample for which they are known. That is, we can assess quality; this is one indicator. The other indicator is that we consider some values of W more preferable.

    How can all this be tied together in uniform terms? It turned out to be possible within the Bayesian paradigm. The quality on the training sample is characterized by the likelihood function, which has a probabilistic meaning, and the preference for parameters is characterized by the prior distribution. The likelihood and the prior combine naturally within the Bayes formula and yield a posterior distribution over the parameter values W that we tune during training. Thus the Bayesian paradigm made it possible to correctly combine, in uniform terms, both the accuracy on the training sample and our a priori preferences over the weights, which arise quite organically in many applied problems; introducing these a priori preferences allowed the quality of the resulting decision rules to be improved significantly. In addition, from the late nineties into the 2000s, more advanced tools appeared that allow the prior distribution itself to be parameterized and tuned from data in some clever way: you press a button, the prior most suitable for the given task is configured automatically, and then the machine learning procedure runs.
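    In symbols (notation consistent with the factorization above), the likelihood and the prior combine via the Bayes formula into the posterior over the weights:

\[
p(W \mid X_{\mathrm{tr}}, T_{\mathrm{tr}})
= \frac{p(T_{\mathrm{tr}} \mid X_{\mathrm{tr}}, W)\, p(W)}
       {\int p(T_{\mathrm{tr}} \mid X_{\mathrm{tr}}, W)\, p(W)\, dW},
\]

    so the training accuracy enters through the likelihood and the a priori preferences through p(W).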

    These methods are by now well developed. It would not be a strong exaggeration to say that the classical statement of the machine learning problem has been solved. And, as often happens, as soon as the classical setting was closed, it immediately turned out that we are more interested in non-classical settings, which proved to be much richer and more interesting. A non-classical setting here is a situation where the hidden variables of the objects turn out to be interdependent. That is, we cannot determine the hidden variable of each object by looking only at the features of that object, as was assumed in the traditional machine learning task. Moreover, it depends not only and not so much on the features of other objects as on their hidden variables. Consequently, we cannot determine the hidden variable of each object independently, only all of them together. Below I will give a large number of examples of problems with interconnected hidden components. Note that awareness of the fact that the hidden variables in our sample are interdependent opens up tremendous opportunities for additional regularization of such problems, i.e. ultimately for further improving the accuracy of the solution.

    Image segmentation




    The first and most classic example is image segmentation. This is the problem from which, in fact, the boom of graphical models as a tool for handling interdependent hidden variables began. The task became extremely popular with the spread of digital cameras. What is it? For each pixel we try to assign some class label. In the simplest case of binary segmentation, we try to separate the object from the background; in more complex cases we also try to introduce some semantics for the objects. In principle, this can be treated as a standard classification problem: each pixel is an object, each pixel can belong to one of K classes, and each pixel has a certain set of features. In the simplest case this is its color; one can also extract texture features and some contextual information. But it seems fairly obvious that there is some specificity here: neighboring pixels most likely belong to the same class, simply because in the segmentation problem we expect the result to consist of reasonably compact regions. As soon as we try to formalize this fact, it inevitably follows that we cannot classify or segment each pixel separately from everything else; we can only segment the whole image at once. This is the simplest problem with interdependencies between hidden variables.
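    A minimal sketch of how such pairwise dependencies are usually formalized, assuming a Potts-style grid model (the function names and the particular unary/pairwise terms are my illustration, not code from the talk):

```python
import numpy as np

def segmentation_energy(labels, unary, w_pair=1.0):
    """Energy of a labeling for a grid MRF.

    labels : (H, W) int array, the hidden variables T (class of each pixel)
    unary  : (H, W, K) array, per-pixel cost of each class (from color, texture, ...)
    w_pair : penalty for neighboring pixels taking different labels (Potts term)
    """
    h, w = labels.shape
    # unary (data) term: how well each pixel's label fits its own features
    e = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # pairwise (smoothness) term: neighboring pixels prefer the same label
    e += w_pair * np.sum(labels[1:, :] != labels[:-1, :])   # vertical neighbors
    e += w_pair * np.sum(labels[:, 1:] != labels[:, :-1])   # horizontal neighbors
    return e

# The segmentation is the labeling minimizing this energy, which is a MAP query
# in the corresponding graphical model and cannot be answered pixel by pixel.
```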

    Video tracking




    The next task is video tracking. We could process each frame independently, say, detect and identify pedestrians. Or track mice: the mice look very similar, and they must be tracked so that each individual mouse is identified accurately. Attempts to put colored marks on their fur are harshly suppressed by biologists, who argue that the mice would then be under stress. So the mice all look the same, and we need to know where each mouse is at every moment of time in order to analyze their behavior. Clearly, we cannot solve the problem from a single frame; we have to consider neighboring frames, preferably many of them. Again we have interdependencies between hidden variables, where the identity of the mouse acts as the hidden variable.
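    Schematically (my notation; the talk does not write this out), such tracking problems are usually modeled as a chain over frames:

\[
p(T_1, \dots, T_N \mid X) \;\propto\; \prod_{n=1}^{N} \psi(T_n, X_n) \prod_{n=2}^{N} \phi(T_{n-1}, T_n),
\]

    where T_n is the assignment of identities to detections in frame n, the factor psi measures how well it fits the observations of that frame, and phi rewards assignments under which each animal moves only a little between consecutive frames.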

    Social network analysis




    Analysis of social networks is by now an obvious task. Each user of a social network can, of course, be analyzed independently of the others, but we would obviously lose a significant layer of information. Suppose, for example, that we are trying to classify users by their opposition to the current regime. If I post a banner saying "Navalny for president!" or "Navalny for mayor!", I can of course be identified as an oppositionist. But if I am more careful and do not post such banners, then the only way to tell whether I sympathize with the opposition is to analyze my friends, the less careful ones who happily post these banners. If it turns out that I am friends with a large number of such people, the probability that I also sympathize with these views increases somewhat.

    Fitting 2D and 3D models




    The next task is image registration: building three-dimensional models from images, fitting an image to a two- or three-dimensional model. The task assumes that the images undergo some deformation during registration. Of course, we could try to find the deformation of each pixel independently of the others, but then we clearly would not obtain anything like an elastic deformation, and the result would be rather deplorable: formally every pixel would be mapped somewhere with minimal per-pixel error, yet the image as a whole would simply crumble if every pixel deformed independently. One way to account for these interdependencies is the apparatus of graphical models, which allows the interdependence of the hidden variables to be analyzed.
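    A typical way to write this down (the quadratic smoothness term is my illustrative choice, not quoted from the talk): the hidden variable of pixel p is its displacement d_p, and one minimizes

\[
E(d) = \sum_{p} \bigl\| I_1(p + d_p) - I_2(p) \bigr\|^2
     + \lambda \sum_{(p, q)\ \text{neighbors}} \bigl\| d_p - d_q \bigr\|^2 .
\]

    The second sum is exactly the coupling between hidden variables: it keeps neighboring displacements close and so makes the deformation elastic rather than letting the image crumble.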

    Decoding noisy messages




    There is another important task. It is not mentioned often, but in fact it is very relevant, and its relevance will most likely only grow in the coming years: decoding noisy messages. We are now surrounded by mobile devices, data transmission over them is not too reliable, and the channels are terribly overloaded. Demand for traffic obviously exceeds current capacity, so the task naturally arises of creating a protocol that is noise-resistant on the one hand and has high throughput on the other. Here we run into a curious phenomenon. Since the fifties of the last century, when Claude Shannon developed information theory, it has been known that every communication channel has a certain maximum throughput for a given noise intensity.

    A theoretical formula was derived: more information than Shannon's theorem predicts cannot be transmitted over a noisy channel in principle. But it turns out that the closer we approach this theoretical limit, i.e. the more efficiently we try to transmit information using clever compression schemes and control bits aimed precisely at correcting errors, the more complicated the decoding of the message becomes. That is, the most efficient error-correcting codes are extremely hard to decode; there are no simple decoding formulas, and as we approach the Shannon limit the complexity of decoding grows exponentially. We can encode and transmit the message, but we are no longer able to decode it. It turned out that modern codes operating near this limit can nevertheless be decoded approximately, and the decoding problem is naturally posed as inference over a large set of interdependent hidden variables, the transmitted bits.
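    The theoretical formula in question is the Shannon-Hartley channel capacity (a standard result, quoted here for reference rather than from the slides):

\[
C = B \log_2\!\left(1 + \frac{S}{N}\right),
\]

    where B is the channel bandwidth and S/N the signal-to-noise ratio; no coding scheme can transmit reliably at a rate above C.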

    Simulation




    In the late nineties another task became popular, which I will loosely call simulation modeling, or modeling of agent environments: an attempt to analyze the behavior of systems consisting of a large number of interacting agents. The task is extremely interesting, and there are no rigorous methods for solving it, so most such problems are tackled by simulation using Monte Carlo methods in various modifications. A simple example is traffic modeling. The agents are vehicles, and clearly we cannot model the behavior of a car out of context, since every driver makes decisions by looking at the nearest neighbors on the road. It turned out that introducing even the simplest interaction between drivers into the model yields a remarkable result. It was established, first in theory and then in practice, that there is a certain critical limit to the flow intensity. If the flow exceeds this limit, it becomes metastable, i.e. unstable: the slightest deviation from the norm (for example, one driver braking a little too sharply and then driving on) causes traffic jams tens of kilometers long to grow on the spot. This means that an overloaded road can still function without jams, but a jam can arise almost out of the blue.

    First this was obtained in theory, then confirmed in practice: for each road, depending on its interchanges, there is indeed a certain maximum throughput at which the flow still moves stably. This abrupt transition from stable motion to a dead traffic jam is one example of a phase transition. It turned out that in agent systems such phenomena occur quite often, and the conditions under which a phase transition occurs are by no means trivial. To be able to simulate them, appropriate tools are needed. Since these are multi-agent systems, the agents are the objects, and the objects interact with each other: we again have interdependencies between the variables describing each agent.
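    For concreteness, here is a minimal sketch of one classic toy model of this kind, the Nagel-Schreckenberg cellular automaton (my choice for illustration; the talk does not name a specific model). Even this simple interaction rule produces jams "out of the blue" once the car density exceeds a critical value:

```python
import random

def nasch_step(positions, speeds, road_length, v_max=5, p_slow=0.3):
    """One synchronous update of the Nagel-Schreckenberg traffic model on a ring road."""
    n = len(positions)
    order = sorted(range(n), key=lambda i: positions[i])   # cars in order along the road
    new_speeds = list(speeds)
    for k in range(n):
        i = order[k]
        ahead = order[(k + 1) % n]                          # nearest car in front
        gap = (positions[ahead] - positions[i] - 1) % road_length
        v = min(speeds[i] + 1, v_max)                       # try to accelerate
        v = min(v, gap)                                     # but do not hit the car ahead
        if v > 0 and random.random() < p_slow:
            v -= 1                                          # random slowdown: the source of jams
        new_speeds[i] = v
    new_positions = [(positions[i] + new_speeds[i]) % road_length for i in range(n)]
    return new_positions, new_speeds

# Usage: put 100 cars on a 300-cell ring (density 1/3) and run the automaton.
L, n_cars = 300, 100
pos = random.sample(range(L), n_cars)
vel = [0] * n_cars
for _ in range(1000):
    pos, vel = nasch_step(pos, vel, L)
print("mean speed:", sum(vel) / n_cars)
```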

    Deep learning




    The next example is deep learning. The image was taken from a talk by a Google representative at the ICML 2012 conference, where they made a splash by significantly improving the quality of classification of photographs from the Flickr database using a relatively young paradigm in machine learning, deep learning. In a sense this is a reincarnation of neural networks in a new form. Traditional neural networks had by then quietly passed away, but they in some way inspired the emergence of a different approach: deep Boltzmann networks, which are a collection of restricted Boltzmann machines. In essence these are layers of interconnected binary variables. These variables can be interpreted as hidden variables that are somehow adjusted to the observed variables when a decision is made. The peculiarity is that all the hidden variables here are interconnected in an intricate way, so both to train such a network and to make decisions within it we again have to apply methods based on the analysis of objects with interdependent variables.
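    For reference, a restricted Boltzmann machine over binary visible units v and hidden units h (a standard definition, not reproduced from the slides) is exactly such a layer of interconnected binary variables:

\[
p(v, h) = \frac{1}{Z}\, e^{-E(v, h)}, \qquad
E(v, h) = -a^{\top} v - b^{\top} h - v^{\top} W h,
\]

    with no connections inside a layer; stacking several such layers gives the deep networks mentioned above, and both training and inference in them are problems over interdependent hidden variables.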

    Collaborative filtering




    Finally, the last application that comes to mind is collaborative filtering, or recommender systems. This is also a relatively new direction, but for obvious reasons it is gaining more and more popularity. Online stores are actively developing and offer tens or hundreds of thousands of items, an amount no user can even take in at a glance. Clearly, a user will only see those products which have somehow been selected for them from this gigantic assortment. Hence the task of building a recommender system. It can be built by analyzing not only the specific user (which products they have already bought, their personal data, gender, age), but also the behavior of other users who bought similar products.
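    One standard way of exploiting "other users who bought similar products" (a common baseline, not something stated in the talk) is low-rank factorization of the user-item rating matrix:

\[
\hat r_{ui} = p_u^{\top} q_i, \qquad
\min_{P, Q} \sum_{(u, i)\ \text{observed}} \bigl(r_{ui} - p_u^{\top} q_i\bigr)^2
+ \lambda \bigl(\| p_u \|^2 + \| q_i \|^2\bigr),
\]

    where the latent vectors p_u and q_i are shared across all users and items, which is precisely what couples one user's predictions to the behavior of everyone else.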

    Graphical models


    I hope I have managed to show that we really do have to move away from the classical statements of machine learning, and that the world is in fact more complicated and interesting: we have to move on to tasks where the objects have interdependent hidden components. As a result, one has to model a very high-dimensional distribution, for example over the hidden components, and here lies the trouble. Imagine the simplest situation: say, a thousand objects, each with a hidden binary variable taking the values 0/1. If we want to specify a joint distribution in which everything depends on everything, then formally we need to define a thousand-dimensional distribution over binary variables. Question: how many parameters are required to specify such a distribution? Answer: 2^1000, or roughly 10^300. For comparison, the number of atoms in the observable universe is about 10^80. Obviously we cannot store that many parameters even if we combine the computing power of the whole world. Imagine a 30x30 picture in which each pixel can take the value object/background: we cannot specify the distribution over the set of such pictures in principle, since the distribution is discrete and requires on the order of 2^900 parameters.
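    The counting argument, spelled out (my arithmetic, consistent with the figures above): a full joint distribution over n binary variables has 2^n - 1 free parameters, and

\[
2^{1000} = 10^{1000 \log_{10} 2} \approx 10^{301} \gg 10^{80},
\]

    i.e. vastly more parameters than there are atoms in the observable universe.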

    What can be done? The first option that comes to mind is to assume that each pixel is distributed independently of the others. Then we need only a thousand parameters: for each pixel, the probability that it takes the value 1 (one minus that probability gives the value 0). This is how standard machine learning works. One small problem: all the variables have become independent, so none of the tasks above can be solved this way. Question: how can we limit ourselves to a small number of parameters and yet specify a distribution in which everything depends on everything? Here the paradigm of graphical models comes to the rescue: it suggests that the joint distribution be represented as a product of a certain number of factors defined on intersecting subsets of variables. To specify these factors, an undirected graph is traditionally used that connects some objects in the sample to each other, and the factorization system is determined by the maximal cliques of this graph.
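    In formulas (the standard form of such a clique factorization; notation mine), the distribution over the hidden variables factorizes over the maximal cliques C of the graph:

\[
p(T) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(T_C), \qquad
Z = \sum_{T} \prod_{C \in \mathcal{C}} \psi_C(T_C),
\]

    where each factor depends only on the small group of variables T_C, and Z is the normalization constant that reappears in the list of main tasks below.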



    In this graph the cliques are very small: the maximal cliques have size 2. This means that three parameters are enough to define a distribution over a clique of size 2. Each such factor can be specified by a small number of parameters, and, since there are not many such factors, we can describe the joint distribution by a relatively small number of parameters. On the other hand, even such a relatively simple factorization system is guaranteed to give us a situation where all of our variables depend on all the others. That is, we obtain full interdependence while limiting ourselves to a certain factorization system; the factorization system determines only conditional independences.

    Main tasks


    What are the main tasks that arise within the theory of graphical models? The apparatus was proposed in the late nineties: model the joint distribution as some factorization system. Suppose we have introduced such a system. We then return to the machine learning problems whose Bayesian formulation assumed setting a distribution over three groups of variables: the hidden variables T, the observed variables X and the parameters of the decision rule W. It turns out that within graphical models the tasks look very similar (they are summarized in the formulas after this list).
    1. Decision-making. The decision rule W is given; there is a sample in which we see the observed variables. The task is to estimate the hidden variables. In standard machine learning this task is solved trivially and is not even considered separately; in graphical models it becomes substantially non-trivial and is far from always solvable.
    2. Computing the normalization constant Z. Calculating it is not always easy.
    3. Supervised learning. A training sample is given in which we know not only the observed variables but also the hidden ones; it is a collection of interconnected objects. We need to find the vector W that maximizes the joint distribution.
    4. Unsupervised learning. A sample is given, but we only know its observed components, not the hidden ones, and we want to tune our decision rule. The task turns out to be extremely difficult, because there are very many configurations of T.
    5. Marginal distributions. A sample is given, the observed components X are known, the decision rule W is known, but instead of the most probable configuration of all the hidden variables of the sample we want the distribution of the hidden variable of a single object. The task is often relevant in applications and is non-trivial to solve.
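    In symbols (my summary, consistent with the clique factorization introduced above; X_tr, T_tr denote the training sample):

\[
\begin{aligned}
&\text{1. Decision-making (MAP inference):} && \hat T = \arg\max_{T}\; p(T \mid X, W) \\
&\text{2. Normalization constant:} && Z(X, W) = \sum_{T} \prod_{C} \psi_C(T_C; X, W) \\
&\text{3. Supervised learning:} && \hat W = \arg\max_{W}\; p(T_{\mathrm{tr}} \mid X_{\mathrm{tr}}, W)\, p(W) \\
&\text{4. Unsupervised learning:} && \hat W = \arg\max_{W}\; p(W) \sum_{T} p(X_{\mathrm{tr}}, T \mid W) \\
&\text{5. Marginals:} && p(t_i \mid X, W) = \sum_{T \setminus \{t_i\}} p(T \mid X, W)
\end{aligned}
\]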


    The rest of the report is devoted to the analysis of supervised learning problems. View the full report here.
