ScienceHub # 04: Random Graph Theory

    The film crew of PostNauka, headed by the editor-in-chief, didn’t go anywhere, but to Yandex to see what fundamental science has applied in the world of modern technologies. We met with Andrei Raigorodsky, doctor of physical and mathematical sciences, head of the theoretical and applied research department of Yandex, and a professor at Moscow State University and Moscow Institute of Physics and Technology.



    Introduction



    There is a suspicion that trying to tell in your own words Raigorodsky’s speech here on Habré is unheard of arrogance and absurdity. Probably, savvy readers understand this and do not need to be repeated. Therefore, in order not to overflow from empty to empty, I will quote Andrey's quotes, and you can add them in the comments.

    Andrei Raigorodsky: “For a person who does not know at all what a graph is, it is right to tell in very specific terms. Imagine that you have a large spatial molecule, which consists of hard balls, no matter what they are made of (cardboard, metal). And these balls are interconnected, that is, they are connected by wires, rectilinear segments. It turns out a rigid structure, which in mathematics is usually called a graph. This is a purely visual interpretation - in the form of a molecule. Although I think it seems convenient to anyone who is trying to imagine what a graph is.

    Abstractly, the graph is arranged in this way - it has many vertices and many edges. Actually, those balls that we see in space are the vertices of the graph. And the wires that connect them are the edges of the graph. Speaking in abstract terms, vertices are just some finite sets of some elements.

    It doesn't matter what nature these elements are. It can be people, for example, between whom there is some kind of relationship. It can be natural numbers, sites on the Internet, biological objects, for example, proteins. There may be completely abstract elements of some kind.

    The ribs that connect some pairs are the essence of a pair of these elements. That is, we say that two elements are connected by an edge. And that’s it. About every two elements we can say whether they are connected by an edge or not. But it depends on how we define.
    In an abstract situation, we can define as you like, just taking a finite set of elements, and some pairs of these elements, connecting with edges. The result is a structure in space, for example, located or on a plane. As you draw it, it will be so. ”



    “Back in the 19th century, there was a classically posed task that concerned the coloring of maps. Here you have a map of the world. Some countries are painted on it. Naturally, you need to paint each country in a certain color so that the countries that border each other are of different colors. If they are the same color, they simply merge together. And you cannot distinguish them. The question is, how many colors were required for an arbitrary card to carry out such painting?

    The hypothesis was that four colors would always suffice if some natural restrictions were fulfilled regarding the properties of the borders of these countries and their relative positions. If there are no enclaves such as Kaliningrad, then the map is considered correct, roughly speaking. And I want to somehow paint it in four colors. This is naturally associated with graph theory. We simply assign an abstract peak to each country (if you wish, its capital). And the top of the graph is obtained - the capital of these countries.

    Further, we connect by edge (if you want expensive) two capitals, which are the main cities of those countries that border each other. We get a graph. Its peaks correspond to countries. Ribs are pairs of peaks that correspond to the capitals of neighboring countries. And we want to color the vertices of this graph (that is, the country) in the minimum number of colors so that the vertices connected by an edge are painted in different colors. The task of the four colors is a great classic problem that has not yet been completely solved. It has been solved with the help of computer search, but there is still no manual, purely analytical proof that four colors are always enough. ”

    Randomness



    The direction of random graphs is one of the most interesting in mathematics now. It is connected with the Erdшаs-Renyi theory; these Hungarian mathematicians proposed introducing randomness on many graphs.

    AR: “They said this:“ Let's consider some fixed set from n. peaks, no matter what it consists of. Let it be abstract n. objects. For example, the natural numbers 1, 2 ... n. We want to draw ribs between them randomly. ”
    It is clear how many ribs there are. But every two vertices could potentially be connected by an edge. Let's take some two peaks - throw a coin. The coin, however, with a displaced center of gravity. It theoretically or practically (I do not know how to say it better, but the most correct way to say it is statistically) falls “eagle” more often than tails. When you throw it on the table, the probability that it will fall “eagle” up is not the same as the probability that it will fall up tails. That is, if the coin were made of perfectly homogeneous material, then this would certainly not have happened. On average, half the time it would fall tails up, half the time an eagle. This is quite natural to expect. But we have a bad coin shifted. The likelihood that she will fall tails up is some kind of number R. from the interval 0 - 1. And the likelihood that it will fall “eagle” upward is, accordingly, 1 - P, because, of course, we assume that the coin does not fall sideways - it does not stand on the edge and does not hang in the air. That is, there are only two possibilities: either it falls “eagle” up or it falls up tails.

    Let us assume that if it is tilted upward with probability R., then when a tails out, we draw, respectively, an edge between the vertices fixed by us. And if we were not lucky, she fell an eagle, then the ribs will not appear. And so we do with each pair of peaks that we have at our disposal. We have n. peaks. And we test each pair with the help of such a coin. This is called in science (in probability theory) a Bernoulli test scheme. We just throw a coin many times and fix it every time - it fell tails or “eagle” up. Depending on this, either we decide to draw another edge - this connection between the vertices, or we decide not to hold it. As a result, we, of course, get a random graph - that is, a graph in which edges arise as a result of tossing a coin, by chance.



    The power law

    “There are a great many problems in the theory of random graphs. We can talk about the same chromatic number of a random graph. There is a very beautiful interesting science about how the chromatic number of a random graph behaves according to the Erdдеs-Renyi model. Practical tasks that are very interesting from the point of view of random graphs are the tasks of constructing adequate models of random graphs that describe the behavior of some real networks.
    Real networks can be understood, for example, as the Internet, whose vertices are (depending on the interpretation) either sites, or pages, or some other structural units. And ribs are usually hyperlinks. It turns out a hefty object that changes over time. The graph is somehow evolving. New sites appear, new connections between them appear.
    If we return to the spatial interpretation, then we get a molecule that grows over time, something in it, on the contrary, disappears, collapses. Connections disappear, new ones appear.

    It turns out that a molecule that describes, for example, the Internet or describes some biological networks, transport networks, networks that arise in sociology, does not change completely randomly, but obeys some interesting statistical laws. You can, on the one hand, observe in practice how these patterns are structured, and on the other hand, try to come up with adequate models of random graphs that are different from the classical model (because the classical model still does not describe it) that would most likely have the same properties that we observed in practice.

    For example, it is known. What is the distribution of degrees of vertices on the Internet, that is, roughly speaking, the probability that a randomly selected vertex in the Internet graph has a predetermined degree of the number of edges that are connected to it. This distribution behaves in a very specific way. This is called a power law. ”



    What to do with spam



    “Imagine that you have some kind of random graph model that really very adequately and well describes the Internet in the sense that, by its probabilistic characteristics, it perfectly fits the empirically obtained data from the Internet. That is, we know that a web graph is developing according to certain laws. And, lo and behold, we came up with a model of a random graph in which, with a high probability, all these patterns are fulfilled. This is a difficult task, but partially solved. We already have some first approximation to how to properly model the behavior of the Internet. Imagine that it is completely resolved.

    For example, we have an ideal model that predicts how the Internet should behave if there are no external wrong influences. Now let's imagine that we need to catch some kind of spam that has what is called a link character. The link nature of spam is organized as follows - there is some kind of structure built by malicious optimizers.

    The structure is designed, naturally, to ensure that those sites that are inside this structure are easier to find in the search process. That is, you are asking some kind of search query. You want an adequate answer to it. And spammers are trying to make sure that you get an answer to it, consisting of those sites that are interesting to spammers. And they artificially wind up these sites in the issuance. In particular, an important issue factor is, of course, the number of links to this site or the number of some citations of those guys who quote you and so on.
    It is important to look at the local structure of the graph around the structure that the spammers may have built. As a rule, they build structures that are really tightly linked enough to fool the search engine. The search engine will think, “Oh, this is a cool site. There really are sites that are highly cited. Therefore, they should be given up to the top of the issue. ” And so the spammers win.

    What we can do? We can use some algorithm (of course, I'm not going to say which one) to catch many, many structures that could potentially turn out to be bad, that is, quite densely, densely interlinked. And then look: in this ideal model of the world, which we came up with using graph theory and probability theory, the appearance of each concrete structure (from those found using the algorithm) has a high probability or low?

    If, from the point of view of a random graph model, which is designed to describe the behavior of a reasonable Internet, the probability of such a structure is high, then the spammer’s design, probably, cannot reduce it in any way. There is no need to assign weighting places for her to fly away. And if her probability is extremely low within the framework of our ideal model, then we already need to suspect her somehow. Maybe put her some kind of flag in the ranking so that she does not appear, perhaps, at the highest position. Or maybe banned. You have to look with your eyes. If she is really bad, let's ban. There are not many such structures. Why not. But all this can be done automatically, using the expectation that we have within the framework of a good model, and reality. That is, indeed, the number of edges (links) and so on. Here I do not open any secrets. These are very natural things. ”



    Future Spam Catchers: Who Is It?

    To solve the problems that lie in the field of graph theory and the problems associated with the Internet and new technologies, we need scientists who are involved in combinatorics and graph theory, probability theory and random processes. Laboratories are now opening all over the world, and universities are placing great emphasis on this area. In addition, bioinformatics is also associated with this. We have already talked about the task of storing and processing data obtained by biologists, more precisely, the microbiologist Konstantin Severinov.
    Such specialists are needed not only on the Internet, not only in connection with the control and dissemination of information, but also in banking institutions, in the study of issues of financial crises, which are also associated with the theory of random graphs.


    Andrei Raigorodsky: “I will tell you briefly what exactly my department at Yandex does. This is a fairly large department, which now employs more than thirty researchers and developers. It is built in such a way that there is a rather significant piece of the department that is built into the Yandex product part, and he receives part of his tasks from direct developers - from those who are engaged in search, tasks, primarily interesting to Yandex as a search system, builds some services. These people are engaged in research not of a purely mathematical nature, but rather, on the contrary, by trying to apply some mathematical ideas to the implementation of meaningful projects.

    And there is a piece of the department that, in fact, generates such ideas. That is, everything is built in a fairly natural way. There is a piece that is directly on sale. There is a piece that deals with the generation of ideas, the production of scientific publications both at industrial conferences and in journals of purely mathematical content. And here are the main areas of research that we conduct? Of course, we are very interested in graphs. And we build models of various graphs that adequately describe the web. Probably, we have not yet managed to build the perfect model. But we have pretty good progress in this direction. We get models that really adequately describe the behavior of the web. They also manage to be put into practice, as I said.
    There is a direction related to the study of machine learning. There is a direction related to the problems of the abstract theory of random graphs.

    You can listen to Andrei and look at the interiors of Yandex here.

    Also popular now: