Finding out a VK user's age, or what else the social graph can tell us

“Tell me who your friend is and I will tell you who you are.”
Euripides, 480-406 BC

For a long time I stared at the VK API like a cat at a washing machine: I was hypnotized by the chance to do some research on one of the largest social networks, one that has worked its way into many areas of our lives. Then one day a question came up: can you determine the age of a social network user from their social circle?

For those who wanted to learn a hidden age, a small hack already existed. You use people search, set the parameters narrowly enough that the desired profile shows up in the results, and then binary-search the age range. Or the contact information happens to list a graduation year. No scripts needed. But a hidden age and indirect information can both be distorted, and more importantly, this article is not about how to extract more personal information. It proposes to analyze one aspect of the social graph.

One of the first ideas that comes to mind when looking at a profile's connections is to check the age of the user's schoolmates and university classmates: in the vast majority of cases the user's own age will be within ±1 year of theirs, thanks to universal secondary education. There is only one catch: the classmates have to be identified. The more time passes after graduation, the more varied the circles we move in. School friends seem to belong to a past life, and by now they are nearly lost among a large number of new friends. For profiles of mature people, is it possible to work out which school cohort they belonged to and, therefore, their approximate age?

So let's treat the task of determining a user's age as the task of finding the subset of schoolmates and classmates among their friends. That is, we assume the user has a certain number of classmates among their friends and that their age roughly matches the user's. There are exceptions, of course, but they are rare. A person attends school from bell to bell for about 10 years, and many overlapping social ties form during that time. In short, everyone knows everyone, while the age spread in this social tangle is minimal. The groups a person joins later, whether at work, in sports, or in a hobby club, usually have a significant age spread. We will try to identify the right social groups based on this difference.

Let's look at one VK profile with a lot of friends. We get the user's friend list with a friends.get request. We consider only profiles with a specified age and place them on a timeline as a histogram by year. There is a small nuance in how to split the friends into annual intervals: we want classmates to land in a single interval rather than be spread over two neighboring ones. Experimentally, it turned out best to break the year in the fall and to let users born in the autumn fall into two adjacent intervals at once. That gives 15-month intervals, each running from September through November of the following year, in steps of 12 months.

X axis: user age; Y axis: number of users falling within a given interval.
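The interval splitting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: birth dates arrive as strings in the "D.M.YYYY" form (or "D.M" when the year is hidden), which is how I assume friends.get returns the bdate field; the function names and the convention that interval y runs from 1 September of year y through 30 November of year y+1 are hypothetical and chosen for the example.

```python
from collections import Counter
from datetime import date


def parse_bdate(raw):
    """Parse a 'D.M.YYYY' string; return None if the year is hidden or the value is invalid."""
    parts = raw.split(".")
    if len(parts) != 3:
        return None  # 'D.M' only: the year is not public
    try:
        day, month, year = (int(p) for p in parts)
        return date(year, month, day)
    except ValueError:
        return None


def intervals_for(born):
    """Labels of the 15-month intervals containing the date.

    Assumed convention: interval y spans 1 September of year y
    through 30 November of year y + 1.
    """
    if born.month >= 9:
        labels = [born.year]
        if born.month <= 11:
            # Autumn birthdays also belong to the previous interval.
            labels.append(born.year - 1)
        return labels
    return [born.year - 1]


def histogram(bdates):
    """Count friends per interval, skipping profiles without a full birth date."""
    counts = Counter()
    for raw in bdates:
        born = parse_bdate(raw)
        if born is not None:
            counts.update(intervals_for(born))
    return counts
```

With this convention a friend born in October 1990 is counted in both the 1989 and 1990 intervals, while one born in March 1991 is counted only in 1990.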

We see a five-year plateau where the annual number of friends is at its maximum. It is not at all obvious how to find the group of peers within this five-year period. In truth, such a picture is not typical: more often the birth year of the schoolmates and classmates stands out from the rest with a larger number of friends. But for this difficult case, let's take each friend in an annual group and compute the ratio of their friendships inside that group to the number of their connections with all friends of the original user (the one whose age we are determining), then average this ratio over each year. We call it the normalized coefficient of connectivity.

X axis: user age; Y axis: normalized coefficient of connectivity for a given interval.

The picture has changed, and now a single year leads. A group of uniform age accounts for a large share of it, so we can expect that since the user is part of this group, they are of a similar age. But what if the person plays some special role in this group, for example not a classmate but a teacher? Indeed, teachers and coaches may have subgroups with a high density of connections within a narrow age interval. This case can be partly handled by choosing not the group with the highest connectivity but the oldest group among those with sufficiently high connectivity. In other words, we use the logic that on their life path a person must first be an ordinary student and only later take on a distinguished role in “groups of uniform age”.
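One possible reading of the "oldest group among the sufficiently connected ones" rule is sketched below. The function name, the input shape (a dict from birth-interval label to connectivity) and the cutoff fraction are assumptions made for illustration, not the service's actual code.

```python
def oldest_well_connected(connectivity, threshold=0.7):
    """Pick the earliest birth interval whose connectivity is close to the maximum.

    connectivity: dict mapping a birth-interval label (a year) to its
    connectivity value. threshold is a hypothetical, tunable fraction
    of the maximum that counts as "sufficiently large".
    """
    if not connectivity:
        return None
    cutoff = threshold * max(connectivity.values())
    # The smallest birth year corresponds to the highest age.
    return min(y for y, c in connectivity.items() if c >= cutoff)
```

For example, if a teacher's profile shows strong groups born around both 1985 and 1990, this rule prefers 1985, the cohort the person more likely studied with.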

A more detailed description and some formulas
Let's express the phenomenon seen on the chart numerically. Let F0 denote the set of friends of the user whose age is being calculated, and Fi the set of friends of an arbitrary profile i. Fi,y is the set of friends of profile i whose specified date of birth falls in the annual interval y. Then Ci,y is the connectivity of profile i in the interval y:

$ C_{i,y} = \frac{|F_0 \cap F_{i,y}|}{|F_0 \cap F_i|} $

Cy is the non-normalized coefficient of connectivity in the interval y, summed over all profiles in F0,y (the friends of the original user born in the interval y):

$ C_y = \sum_{i \in F_{0,y}} C_{i,y} $

And finally, the desired year of birth:

$ \text{year\_of\_birth} = \operatorname*{argmax}_y \left( \frac{C_y}{|F_{0,y}|} \right) : C_y \geq 0.7 \max_{y \in Y} (C_y) $
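The three formulas can be put together in a short Python sketch. It is a minimal illustration, assuming the data has already been downloaded into plain dictionaries; the function name, argument names and data layout are hypothetical and not taken from the service's code.

```python
from collections import defaultdict


def estimate_birth_interval(f0, friends_of, intervals_of, threshold=0.7):
    """Return the interval label y chosen by the formulas above, or None.

    f0           -- set of friend ids of the target user (F_0)
    friends_of   -- dict: profile id i -> set of that profile's friend ids (F_i)
    intervals_of -- dict: profile id -> list of interval labels of its birth date
    threshold    -- fraction of max(C_y) a year must reach (0.7 in the formula)
    """
    # F_{0,y}: the target's friends grouped by birth interval.
    f0_by_year = defaultdict(set)
    for i in f0:
        for y in intervals_of.get(i, ()):
            f0_by_year[y].add(i)

    # C_y: sum of C_{i,y} over i in F_{0,y}.
    c = {}
    for y, group in f0_by_year.items():
        total = 0.0
        for i in group:
            mutual = f0 & friends_of.get(i, set())  # F_0 ∩ F_i
            if mutual:
                # F_0 ∩ F_{i,y} equals the mutual friends born in interval y.
                total += len(mutual & group) / len(mutual)
        c[y] = total

    if not c:
        return None
    cutoff = threshold * max(c.values())
    candidates = [y for y in c if c[y] >= cutoff]
    # argmax of the normalized coefficient C_y / |F_{0,y}|.
    return max(candidates, key=lambda y: c[y] / len(f0_by_year[y]))
```

A toy call: with friends 1, 2 and 3 born in interval 1990 and all friends with each other, and friends 4 and 5 born in 1985 with no ties inside their group, the function returns 1990.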

There was also an idea to take the type of each connection into account: give school or university friends a higher weight, and ignore colleagues, relatives and everything else altogether. However, requests that download this information increase the waiting time about fivefold. Besides, specifying the connection type is not a popular practice, so it was decided to request this information only for profiles with few friends.

The algorithm above implies natural limits on where this approach applies. If the user feels no nostalgia for their school years and has no schoolmates or classmates among their friends, another method is needed.

How about trying all this out in practice? A joke service was implemented in the VK group “Fortune Teller of the Age”. There, a friendly bot guesses the age with the algorithm above if you send it a link to an open (non-private) VK profile.

How the service works
The first link in the fortune teller's chain is the VK group messaging mechanism. In the group settings, the Callback API is pointed at our own server, with “Incoming message” selected as the event type to send. This way a message to the group turns into a request to our server. If, like me, you are not on friendly terms with the frontend, this is a great option. The server then calls the VK API: users.get for the profile in question and friends.get for the friends of that profile who have a known date of birth. These calls require a VK application access token. I did not use requests that require the user to grant rights, so as not to burden people with permission prompts. Once the estimated age is calculated, a reply to the group message is generated, and the fortune teller's user sees the answer in their dialogs.
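The server side of this chain might look roughly like the sketch below. It is framework-free on purpose and rests on assumptions about the Callback API payload: that events carry a "type" field, that a "confirmation" event must be answered with the group's confirmation string, that other events are acknowledged with "ok", and that the message text sits either in object.message or directly in object depending on the API version. The helper names and the estimate_age callback are hypothetical; sending the reply back to the user is not shown.

```python
import re


def extract_profile(text):
    """Return the id or short name from a vk.com link in the text, or None.

    The allowed character set is an assumption for this sketch.
    """
    match = re.search(r"vk\.com/([A-Za-z0-9_]+)", text)
    return match.group(1) if match else None


def handle_event(event, confirmation_code, estimate_age):
    """Process one decoded Callback API event and return the response body.

    estimate_age(profile, sender_id) is a hypothetical callback that runs
    the age estimation and replies to the sender.
    """
    if event.get("type") == "confirmation":
        return confirmation_code
    if event.get("type") == "message_new":
        obj = event.get("object", {})
        # Assumed: newer API versions nest the message, older ones do not.
        message = obj.get("message", obj)
        profile = extract_profile(message.get("text", ""))
        if profile is not None:
            estimate_age(profile, message.get("from_id"))
    return "ok"
```

Keeping the handler a pure function of the decoded JSON makes it easy to plug into any web framework and to test without network access.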

As for improving the algorithm itself, nothing stops you from going further: collect a training dataset from profiles with a specified age and train a regression model on, say, the adjacency matrix of the age graph among a profile's friends. I am sure that with a large enough sample the results would be more accurate than the heuristic. As I mentioned above, I was curious to test the basic idea, so I do not plan to develop this direction.

In conclusion, I would like to touch on ethics. In my opinion, the “Fortune Teller of the Age” sits on the border of private life but does not cross it, because it analyzes only open data. That is also why the service does not work for users with a private profile.

There is a feeling that all sorts of “age fortune tellers”, like search engines and SearchFace, are just the first signs of a socially transparent world. To some extent this is a return to basics. For a long time humans lived in small communities where everyone was in plain sight of everyone else, and an open reputation was an integral part of the mechanism of social regulation. Yes, new tools will gradually put a person's social interactions in full view again, only now on a global scale. Yes, like any tool, they can be used to do harm. Should they be made accessible to everyone? I do not know. But I am sure that if such tools are available only to a limited circle of people, the balance will definitely not shift toward constructive use.
