# Who is VKontakte the most important?

Hi, Habr!

We are already familiar in previous articles on data analysis . Now it is time to talk about one very practical task that we have learned to solve. Namely, we will find out

Many of you remember that not so long ago we opened one of the open projectsin which, under the guidance of experienced people, participants could solve real-world data analysis tasks and gain experience. Now it's time to talk about the results and show where we are moving next. To begin with, we recall the problem that we solved.

It is no secret that recently social networks have become one of the most effective tools for disseminating information. Not so long ago, the public saw examples when these tools are used for other purposes - you seem to be sitting on your own - you are not doing anything, when suddenly an unknown account is added to your friends and you unwittingly start receiving new obscure entries in the Facebook feed. Today, there are many methods for promoting groups and pages on social networks to generate traffic and then derive benefit from it. It is also far from a secret that Internet companies are actively struggling with this type of activity and are cooperating with relevant departments, as often the activities of “opinion leaders” pose a threat, including to the national security of countries.

Do you know how you can manage opinions on social networks? In fact, this is not so difficult - it is important to know through which channels to disseminate information. And here, as always, one cannot do without mathematics, especially graph theory.

That is why we decided to launch a series of tasks in which we will show

The first and simplest task that brings us closer to calculating opinion leaders is to

A social network in terms of mathematics can be represented by a

In general, nothing special

We’ll say right away that when we talk about a web graph, by vertices we mean certain units on the Internet, for definiteness we will consider sites (for example, we can also consider hosts). The edges will connect those vertices between which there are links. It is clear that the graph is oriented and multiple edges are allowed. Even loops (edges from the vertex into itself) and even multiple loops are allowed.

This is one of the most well-known properties of web-type graphs, which many people know as the

A web graph is a fairly

It turns out that the proportion of vertices of degree d in the web graph is estimated as

These are the properties of the Internet. Now, if we know how the process of such a graph works, we can do many things, starting from looking for spam structures (many of which are combo dense bipartite subgraphs), ending with effectively looking for vertices of the maximum degree (about this we will tell below).

Today, one of the most common ideas underlying the construction of a web graph is the idea of preferred joining, the logic of which is as follows.

It is clear that the idea described above is poorly formalized and clarifying many details, you can get many models of web graphs. So there were models of Bollobash - Riordan, Buckley - Ostgus, a copy model and many others. Not so long ago, a whole class of models was considered in Yandex , for which it was provedmany interesting results.

I hope the reader has an understanding of how the Internet works in terms of graph theory. It turns out that many other real networks, including social. Networks are similar. This is what will help us in our task.

Now back to our task. How can the properties of real networks described above be used to find vertices of maximum degree with sufficiently high accuracy? Let's take a number of arbitrary vertices of our graph (for example, from a uniform distribution). We denote the set of vertices of

And now the main point, which is very difficult to explain to the reader without serious mathematical preparation in more detail than in this article: due to the fact that the social graph has the properties described above, it turns out that exactly the vertices of the maximum degree in the original graph fall into the set

As you can see, the algorithm is quite simple and not demanding on computing resources. This is exactly what the participants of our study were supposed to realize for the Russian social network VKontakte.

As we said earlier, including at past conferences, when educating people, we try to explain complex things in simple language and give the most practical skills. In order for a person to have a certain image and understanding, sufficient for the practical application of data analysis skills. That is why, we take people from different subject areas, and exchanging experience with each other, we instill the missing skills to the participants and ourselves.

One of the study participants was Gleb Morozov , who for more than 6 years has been solving analytical problems in large federal retail chains, including, for example,

Below is a fairly simple and well-commented implementation of this algorithm in the

As you can see, the algorithm and its implementation are very simple. Now let's take a look at the results.

In just a few minutes, we will get the following results (TOP20 vertices of maximum degree, a more complete list - see here ).

In general, when we saw this list (and then looked at it further), we made 2 conclusions:

We began to doubt how correctly the algorithm works, but because we did not find people with the maximum number of subscribers in open sources, we decided to test the algorithm on groups (look for groups with the maximum number of subscribers), since there are lists for them, like this . And, indeed, having received such results here , and realizing that VKontakte is not at all the world of the infamous

Now, after we have obtained such results, we will move on, namely, using the same API, try to build a

If you want to connect to this study and get a very cool experience - join our Data Science community and write me an e-mail (

Well, in the meantime, we will continue our research and will delight you with new articles from the world of data analysis and good mathematics!

Yes, by the way, they found a lot of beautiful girls this way.

For example, vk.com/id109507300 , madame scored more subscribers than many Russian stars. Happy viewing!

We are already familiar in previous articles on data analysis . Now it is time to talk about one very practical task that we have learned to solve. Namely, we will find out

**who actually controls our opinion on the social network VKontakte**. Code katom a lot of unusual results and interesting mathematics.**Just a reservation - we don’t mean anything concrete and don’t draw any conclusions, we just love math and just use open data.**Many of you remember that not so long ago we opened one of the open projectsin which, under the guidance of experienced people, participants could solve real-world data analysis tasks and gain experience. Now it's time to talk about the results and show where we are moving next. To begin with, we recall the problem that we solved.

### Introduction

It is no secret that recently social networks have become one of the most effective tools for disseminating information. Not so long ago, the public saw examples when these tools are used for other purposes - you seem to be sitting on your own - you are not doing anything, when suddenly an unknown account is added to your friends and you unwittingly start receiving new obscure entries in the Facebook feed. Today, there are many methods for promoting groups and pages on social networks to generate traffic and then derive benefit from it. It is also far from a secret that Internet companies are actively struggling with this type of activity and are cooperating with relevant departments, as often the activities of “opinion leaders” pose a threat, including to the national security of countries.

Do you know how you can manage opinions on social networks? In fact, this is not so difficult - it is important to know through which channels to disseminate information. And here, as always, one cannot do without mathematics, especially graph theory.

That is why we decided to launch a series of tasks in which we will show

**how, using a rather limited API of social networks, learn how to find opinion leaders, measure the quality of advertising campaigns and monitor the dissemination of information on social networks using VKontakte as an example**.The first and simplest task that brings us closer to calculating opinion leaders is to

**identify users with the maximum number of subscribers within a limited API**. The peculiarity here is that the entire graph of a social network is almost impossible to view. Considering that the monthly VKontakte audience totals more than 300 million users, and API requests are allowed to be made no more than 3 per second, it is easy to figure out that it took about 3 years to calculate the number of subscribers of all people.**We will show how to calculate the TOP100 people with the maximum number of subscribers in a few minutes**. But first, a little dive into the formal definitions. After all, one cannot do without good mathematics.A social network in terms of mathematics can be represented by a

**directed graph**whose vertices are users, and the edges connecting them indicate the fact of friendship / follower / subscription. At the same time, our graph is oriented - the ribs have a direction (depending on who follows whom). For an undirected graph, the concept of**degree**(number of friends of the vertex) is defined . For the oriented - the incoming degree (number of “followers”) and the outgoing degree (number of people on whom the person is “signed”).In general, nothing special

**if many of the graphs that occur in life would not be arranged in a certain way**. To do this, there is even a whole science that is engaged in the study of so-called web graphs. It all started with the fact that in 1999 an article by A.-L. Barabashi and R. Albert, in which the authors experimentally investigated the properties of the Internet and proposed the idea of its formation, which was subsequently formalized by many authors and is very actively used in practice. So, let's start with the very features of social networks.### Real Network Properties

We’ll say right away that when we talk about a web graph, by vertices we mean certain units on the Internet, for definiteness we will consider sites (for example, we can also consider hosts). The edges will connect those vertices between which there are links. It is clear that the graph is oriented and multiple edges are allowed. Even loops (edges from the vertex into itself) and even multiple loops are allowed.

#### 1. The diameter of the web graph is small

This is one of the most well-known properties of web-type graphs, which many people know as the

**“theory of six handshakes,”**which means that any 2 people on Earth know each other through 6 handshakes. In the language of graph theory, this property means that the graph has a small diameter. This suggests that the graph should be sufficiently dense, however, this is not so.#### 2. Sparseness

A web graph is a fairly

**sparse**graph. More formally, at**n**vertices there are only about**const * n edges**. It would seem that a large graph of small diameter should be very dense, i.e. have an order of**n ^ 2**edges, however, the fact remains. The following property for a person who is not familiar with mathematics does not sound so informative, but it will tell in the future how the graph itself should be formed in order to satisfy all the properties of the real web.#### 3. The power law of the distribution of degrees of vertices

It turns out that the proportion of vertices of degree d in the web graph is estimated as

**const / d ^ a**, where**2 <a <3**(at the time of the study, the value of**a**was about**2.1**).These are the properties of the Internet. Now, if we know how the process of such a graph works, we can do many things, starting from looking for spam structures (many of which are combo dense bipartite subgraphs), ending with effectively looking for vertices of the maximum degree (about this we will tell below).

### The idea of preferred joining

Today, one of the most common ideas underlying the construction of a web graph is the idea of preferred joining, the logic of which is as follows.

**Let's assume that as soon as a new site appears, he would rather prefer to compete with those who are already popular at that time**. More formally, the probability with which a new site will link to an existing one is proportional to the number of links already available to that site.It is clear that the idea described above is poorly formalized and clarifying many details, you can get many models of web graphs. So there were models of Bollobash - Riordan, Buckley - Ostgus, a copy model and many others. Not so long ago, a whole class of models was considered in Yandex , for which it was provedmany interesting results.

I hope the reader has an understanding of how the Internet works in terms of graph theory. It turns out that many other real networks, including social. Networks are similar. This is what will help us in our task.

### Heuristic algorithm for determining vertices of maximum degree

Now back to our task. How can the properties of real networks described above be used to find vertices of maximum degree with sufficiently high accuracy? Let's take a number of arbitrary vertices of our graph (for example, from a uniform distribution). We denote the set of vertices of

**A**. It is approximately clear that with a high probability many of them will refer to one of some popular peaks. Let us then released from these vertices edges (see, to whom these peaks subscribe) - the set of vertices of the**Bed and**. It is understood that any vertex of**A**refer to the same vertex of**B**. For each vertex**V**from the set**B**We calculate the number of vertices in**A**, which were in the top**V**. Thus, we obtain a lower bound on the incoming degree (number of subscribers) of each vertex of the many**Bed and**. We call this "estimate incoming degree" vertex**V**.And now the main point, which is very difficult to explain to the reader without serious mathematical preparation in more detail than in this article: due to the fact that the social graph has the properties described above, it turns out that exactly the vertices of the maximum degree in the original graph fall into the set

**B.**Thus, it remains to sort the vertices from the set**B**according to the "assessment of the incoming degree" and to clarify for each of these peaks the real value of the subscribers in it. You can learn more about the algorithm and its justification here .As you can see, the algorithm is quite simple and not demanding on computing resources. This is exactly what the participants of our study were supposed to realize for the Russian social network VKontakte.

#### Small offtopic

As we said earlier, including at past conferences, when educating people, we try to explain complex things in simple language and give the most practical skills. In order for a person to have a certain image and understanding, sufficient for the practical application of data analysis skills. That is why, we take people from different subject areas, and exchanging experience with each other, we instill the missing skills to the participants and ourselves.

One of the study participants was Gleb Morozov , who for more than 6 years has been solving analytical problems in large federal retail chains, including, for example,

**Tander CJSC (Magnet)**or**Megapolis Trading House**. After a large number of jointly solved problems, which he will talk about a little later, Gleb learned to use machine learning in retail, and the MLClass team gained experience in the subject area. Gleb will tell about the tasks themselves a little later on his own, but for now I will give his code, which he used to solve the problem described above.### R algorithm implementation

Below is a fairly simple and well-commented implementation of this algorithm in the

**R**language . Any reader can easily reproduce the result.```
library("RCurl") # Библиотека для генерации запросов к API
library("jsonlite") # Библиотека для обработки JSON
library("stringr") # Библиотека для работы с текстом
# Шаблон запроса подписок аккаунта
url <- "https://api.vk.com/method/users.getSubscriptions?user_id="
# Шаблон запроса подписчиков аккаунта
url2 <- "https://api.vk.com/method/users.getFollowers?user_id="
# Генерируем id аккаунта с replace (выбираем из 2.8 млн. id 700 аккаунтов с повторениями)
id_num <- sample.int(280000000, size = 300, replace = T)
# Функция для получения id подписок для сгенерированных аккаунтов
get_id_account <- function(id) {
url_get <- str_c(url, id) # Составляем запрос к API
resp <- getURL(url_get) # Запрашиваем и получаем ответ
Sys.sleep(0.5) # Задержка для обхода ограничения на кол-во запросов в сек
# Обрабатываем ответ и получаем id подписок
fromJSON(resp)$response$users$items
}
# Функция для получения кол-ва подписчиков аккаунта
get_count <- function(id) {
url_get <- str_c(url2, id)
resp <- getURL(url_get)
Sys.sleep(0.5)
fromJSON(resp)$response$count
}
# Получаем, при помощи функции get_id_account, id подписок для аккаунтов в виде list
account_id <- sapply(id_num, get_id_account)
# Преобразовываем list в data.frame
account_id <- unlist(account_id)
# Подсчитываем кол-во попаданий для каждого id полученных аккаунтов
account_id <- table(account_id)
# Преобразовываем в data.frame
account_id <- as.data.frame(account_id)
# Сортируем в убывающем порядке
account_id <- account_id[order(-account_id$Freq),]
# Получаем, при помощи функции get_count, кол-во подписчиков
result <- sapply(as.numeric(as.character(account_id$account_id[1:300])), get_count)
# Объединяем полученные данные о кол-ве подписчиков с id аккаунтов
result <- cbind(as.numeric(as.character(account_id$account_id[1:300])), result)
# Сортируем в убывающем порядке
result <- result[order(-result[,2]),]
result <- data.frame(id = result[,1], count_of_followers = result[, 2])
write.csv(result, "result_subscribers.csv")
```

As you can see, the algorithm and its implementation are very simple. Now let's take a look at the results.

### results

In just a few minutes, we will get the following results (TOP20 vertices of maximum degree, a more complete list - see here ).

Human | Number of subscribers |

Pavel Durov | 6192543 |

Dmitry Medvedev | 2058871 |

Kate Clap | 1397818 |

Ivan Rudskoy | 1308458 |

Nyusha Shurochkina | 916338 |

Mariya Kozhevnikova | 908909 |

Sasha Spielberg | 837179 |

Mikhail Zadornov | 736109 |

Maxim Golopolosov | 734602 |

Victoria Bonya | 733385 |

Christina Good-natured | 607419 |

Maria Way | 598583 |

Tina Kandelaki | 541142 |

Katya Sambuka | 538611 |

Sofia Temnikova | 530157 |

Alexey Dolmatov | 518934 |

Mikhail Galustyan | 515012 |

Mariana Rozhkova | 508894 |

Vasya Vakulenko | 445594 |

Vladimir Zhirinovsky | 442578 |

In general, when we saw this list (and then looked at it further), we made 2 conclusions:

- Algorithm works
- The results are a bit amazing. We especially liked the schoolboy Ivan Rudskoy, who paid a lot of Russian pop stars

We began to doubt how correctly the algorithm works, but because we did not find people with the maximum number of subscribers in open sources, we decided to test the algorithm on groups (look for groups with the maximum number of subscribers), since there are lists for them, like this . And, indeed, having received such results here , and realizing that VKontakte is not at all the world of the infamous

**MDK**, they were satisfied.### TOTAL

Now, after we have obtained such results, we will move on, namely, using the same API, try to build a

**graph of “reposts” from the pages of these people**in order to evaluate a person not only by the number of subscribers, but also by how this person is able to disseminate opinions using his influence on the network. There are ideas on how to correctly apply**PageRank**.If you want to connect to this study and get a very cool experience - join our Data Science community and write me an e-mail (

**al.krot.kav@gmail.com**) with the topic**“VKontakte Opinion Leaders”**Well, in the meantime, we will continue our research and will delight you with new articles from the world of data analysis and good mathematics!

**UPD**Yes, by the way, they found a lot of beautiful girls this way.

For example, vk.com/id109507300 , madame scored more subscribers than many Russian stars. Happy viewing!