Who is VKontakte the most important?

    Hi, Habr!



    We are already familiar in previous articles on data analysis . Now it is time to talk about one very practical task that we have learned to solve. Namely, we will find out who actually controls our opinion on the social network VKontakte . Code katom a lot of unusual results and interesting mathematics.

    Just a reservation - we don’t mean anything concrete and don’t draw any conclusions, we just love math and just use open data.

    Many of you remember that not so long ago we opened one of the open projectsin which, under the guidance of experienced people, participants could solve real-world data analysis tasks and gain experience. Now it's time to talk about the results and show where we are moving next. To begin with, we recall the problem that we solved.

    Introduction


    It is no secret that recently social networks have become one of the most effective tools for disseminating information. Not so long ago, the public saw examples when these tools are used for other purposes - you seem to be sitting on your own - you are not doing anything, when suddenly an unknown account is added to your friends and you unwittingly start receiving new obscure entries in the Facebook feed. Today, there are many methods for promoting groups and pages on social networks to generate traffic and then derive benefit from it. It is also far from a secret that Internet companies are actively struggling with this type of activity and are cooperating with relevant departments, as often the activities of “opinion leaders” pose a threat, including to the national security of countries.

    Do you know how you can manage opinions on social networks? In fact, this is not so difficult - it is important to know through which channels to disseminate information. And here, as always, one cannot do without mathematics, especially graph theory.

    That is why we decided to launch a series of tasks in which we will show how, using a rather limited API of social networks, learn how to find opinion leaders, measure the quality of advertising campaigns and monitor the dissemination of information on social networks using VKontakte as an example .

    The first and simplest task that brings us closer to calculating opinion leaders is to identify users with the maximum number of subscribers within a limited API. The peculiarity here is that the entire graph of a social network is almost impossible to view. Considering that the monthly VKontakte audience totals more than 300 million users, and API requests are allowed to be made no more than 3 per second, it is easy to figure out that it took about 3 years to calculate the number of subscribers of all people. We will show how to calculate the TOP100 people with the maximum number of subscribers in a few minutes . But first, a little dive into the formal definitions. After all, one cannot do without good mathematics.

    A social network in terms of mathematics can be represented by a directed graphwhose vertices are users, and the edges connecting them indicate the fact of friendship / follower / subscription. At the same time, our graph is oriented - the ribs have a direction (depending on who follows whom). For an undirected graph, the concept of degree (number of friends of the vertex) is defined . For the oriented - the incoming degree (number of “followers”) and the outgoing degree (number of people on whom the person is “signed”).

    In general, nothing special if many of the graphs that occur in life would not be arranged in a certain way. To do this, there is even a whole science that is engaged in the study of so-called web graphs. It all started with the fact that in 1999 an article by A.-L. Barabashi and R. Albert, in which the authors experimentally investigated the properties of the Internet and proposed the idea of ​​its formation, which was subsequently formalized by many authors and is very actively used in practice. So, let's start with the very features of social networks.

    Real Network Properties


    We’ll say right away that when we talk about a web graph, by vertices we mean certain units on the Internet, for definiteness we will consider sites (for example, we can also consider hosts). The edges will connect those vertices between which there are links. It is clear that the graph is oriented and multiple edges are allowed. Even loops (edges from the vertex into itself) and even multiple loops are allowed.

    1. The diameter of the web graph is small


    This is one of the most well-known properties of web-type graphs, which many people know as the “theory of six handshakes,” which means that any 2 people on Earth know each other through 6 handshakes. In the language of graph theory, this property means that the graph has a small diameter. This suggests that the graph should be sufficiently dense, however, this is not so.

    2. Sparseness


    A web graph is a fairly sparse graph. More formally, at n vertices there are only about const * n edges . It would seem that a large graph of small diameter should be very dense, i.e. have an order of n ^ 2 edges, however, the fact remains. The following property for a person who is not familiar with mathematics does not sound so informative, but it will tell in the future how the graph itself should be formed in order to satisfy all the properties of the real web.

    3. The power law of the distribution of degrees of vertices


    It turns out that the proportion of vertices of degree d in the web graph is estimated as const / d ^ a , where 2 <a <3 (at the time of the study, the value of a was about 2.1 ).

    These are the properties of the Internet. Now, if we know how the process of such a graph works, we can do many things, starting from looking for spam structures (many of which are combo dense bipartite subgraphs), ending with effectively looking for vertices of the maximum degree (about this we will tell below).

    The idea of ​​preferred joining


    Today, one of the most common ideas underlying the construction of a web graph is the idea of ​​preferred joining, the logic of which is as follows. Let's assume that as soon as a new site appears, he would rather prefer to compete with those who are already popular at that time . More formally, the probability with which a new site will link to an existing one is proportional to the number of links already available to that site.

    It is clear that the idea described above is poorly formalized and clarifying many details, you can get many models of web graphs. So there were models of Bollobash - Riordan, Buckley - Ostgus, a copy model and many others. Not so long ago, a whole class of models was considered in Yandex , for which it was provedmany interesting results.

    I hope the reader has an understanding of how the Internet works in terms of graph theory. It turns out that many other real networks, including social. Networks are similar. This is what will help us in our task.

    Heuristic algorithm for determining vertices of maximum degree


    Now back to our task. How can the properties of real networks described above be used to find vertices of maximum degree with sufficiently high accuracy? Let's take a number of arbitrary vertices of our graph (for example, from a uniform distribution). We denote the set of vertices of A . It is approximately clear that with a high probability many of them will refer to one of some popular peaks. Let us then released from these vertices edges (see, to whom these peaks subscribe) - the set of vertices of the Bed and . It is understood that any vertex of A refer to the same vertex of B . For each vertex V from the set BWe calculate the number of vertices in A , which were in the top V . Thus, we obtain a lower bound on the incoming degree (number of subscribers) of each vertex of the many Bed and . We call this "estimate incoming degree" vertex V .

    And now the main point, which is very difficult to explain to the reader without serious mathematical preparation in more detail than in this article: due to the fact that the social graph has the properties described above, it turns out that exactly the vertices of the maximum degree in the original graph fall into the set B. Thus, it remains to sort the vertices from the set Baccording to the "assessment of the incoming degree" and to clarify for each of these peaks the real value of the subscribers in it. You can learn more about the algorithm and its justification here .

    As you can see, the algorithm is quite simple and not demanding on computing resources. This is exactly what the participants of our study were supposed to realize for the Russian social network VKontakte.

    Small offtopic


    As we said earlier, including at past conferences, when educating people, we try to explain complex things in simple language and give the most practical skills. In order for a person to have a certain image and understanding, sufficient for the practical application of data analysis skills. That is why, we take people from different subject areas, and exchanging experience with each other, we instill the missing skills to the participants and ourselves.

    One of the study participants was Gleb Morozov , who for more than 6 years has been solving analytical problems in large federal retail chains, including, for example, Tander CJSC (Magnet) or Megapolis Trading House. After a large number of jointly solved problems, which he will talk about a little later, Gleb learned to use machine learning in retail, and the MLClass team gained experience in the subject area. Gleb will tell about the tasks themselves a little later on his own, but for now I will give his code, which he used to solve the problem described above.

    R algorithm implementation


    Below is a fairly simple and well-commented implementation of this algorithm in the R language . Any reader can easily reproduce the result.

    library("RCurl") # Библиотека для генерации запросов к API
    library("jsonlite") # Библиотека для обработки JSON
    library("stringr") # Библиотека для работы с текстом
    # Шаблон запроса подписок аккаунта
    url <- "https://api.vk.com/method/users.getSubscriptions?user_id=" 
    # Шаблон запроса подписчиков аккаунта
    url2 <- "https://api.vk.com/method/users.getFollowers?user_id="
    # Генерируем id аккаунта с replace (выбираем из 2.8 млн. id 700 аккаунтов с повторениями)
    id_num <- sample.int(280000000, size = 300, replace = T)
    # Функция для получения id подписок для сгенерированных аккаунтов
    get_id_account <- function(id) {
            url_get <- str_c(url, id) # Составляем запрос к API
            resp <- getURL(url_get) # Запрашиваем и получаем ответ
            Sys.sleep(0.5) # Задержка для обхода ограничения на кол-во запросов в сек
            # Обрабатываем ответ и получаем id подписок
            fromJSON(resp)$response$users$items 
    }
    # Функция для получения кол-ва подписчиков аккаунта
    get_count <- function(id) {
            url_get <- str_c(url2, id)
            resp <- getURL(url_get)
            Sys.sleep(0.5)
            fromJSON(resp)$response$count
    }
    # Получаем, при помощи функции get_id_account, id подписок для аккаунтов в виде list
    account_id <- sapply(id_num, get_id_account)
    # Преобразовываем list в data.frame
    account_id <- unlist(account_id)
    # Подсчитываем кол-во попаданий для каждого id полученных аккаунтов
    account_id <- table(account_id)
    # Преобразовываем в data.frame
    account_id <- as.data.frame(account_id)
    # Сортируем в убывающем порядке
    account_id <- account_id[order(-account_id$Freq),]
    # Получаем, при помощи функции get_count, кол-во подписчиков
    result <- sapply(as.numeric(as.character(account_id$account_id[1:300])), get_count)
    # Объединяем полученные данные о кол-ве подписчиков с id аккаунтов
    result <- cbind(as.numeric(as.character(account_id$account_id[1:300])), result)
    # Сортируем в убывающем порядке
    result <- result[order(-result[,2]),]
    result <- data.frame(id = result[,1], count_of_followers = result[, 2])
    write.csv(result, "result_subscribers.csv")
    


    As you can see, the algorithm and its implementation are very simple. Now let's take a look at the results.

    results


    In just a few minutes, we will get the following results (TOP20 vertices of maximum degree, a more complete list - see here ).

    HumanNumber of subscribers
    Pavel Durov6192543
    Dmitry Medvedev2058871
    Kate Clap1397818
    Ivan Rudskoy1308458
    Nyusha Shurochkina916338
    Mariya Kozhevnikova908909
    Sasha Spielberg837179
    Mikhail Zadornov736109
    Maxim Golopolosov734602
    Victoria Bonya733385
    Christina Good-natured607419
    Maria Way598583
    Tina Kandelaki541142
    Katya Sambuka538611
    Sofia Temnikova530157
    Alexey Dolmatov518934
    Mikhail Galustyan515012
    Mariana Rozhkova508894
    Vasya Vakulenko445594
    Vladimir Zhirinovsky442578

    In general, when we saw this list (and then looked at it further), we made 2 conclusions:
    • Algorithm works
    • The results are a bit amazing. We especially liked the schoolboy Ivan Rudskoy, who paid a lot of Russian pop stars

    We began to doubt how correctly the algorithm works, but because we did not find people with the maximum number of subscribers in open sources, we decided to test the algorithm on groups (look for groups with the maximum number of subscribers), since there are lists for them, like this . And, indeed, having received such results here , and realizing that VKontakte is not at all the world of the infamous MDK , they were satisfied.

    TOTAL


    Now, after we have obtained such results, we will move on, namely, using the same API, try to build a graph of “reposts” from the pages of these people in order to evaluate a person not only by the number of subscribers, but also by how this person is able to disseminate opinions using his influence on the network. There are ideas on how to correctly apply PageRank .

    If you want to connect to this study and get a very cool experience - join our Data Science community and write me an e-mail ( al.krot.kav@gmail.com ) with the topic “VKontakte Opinion Leaders”

    Well, in the meantime, we will continue our research and will delight you with new articles from the world of data analysis and good mathematics!

    UPD
    Yes, by the way, they found a lot of beautiful girls this way.
    For example, vk.com/id109507300 , madame scored more subscribers than many Russian stars. Happy viewing!

    Also popular now: