Who lives in social networks?
No matter how loud the scandals about PRISM and personal data leaks get, social networks still tempt you to tell everything about yourself: which kittens you like, who your friends are, and why you did not get enough sleep last night.
A whole encyclopedia of the behavior of most of the online-active public lies within arm's reach, and I have always wanted to get my hands on it. On the one hand, this data appears to be publicly available, but simply taking it and analyzing it is not so easy: everything is too unstructured and fragmented. Besides, as far as I know, there are practically no social network datasets suitable for machine analysis, and for Russia even fewer.
There was no choice: laughing sinisterly in the night, I had to write simple spiders for the social networks VKontakte, Odnoklassniki, MoyMir and the Russian segment of Facebook, which over several months slowly collected a more or less statistically sound sample. Only the information that people disclosed about themselves was collected. And they disclosed a lot.
This story is about what I managed to extract from that data.
I admit, this study is far from the first. Social networks (especially Facebook and VKontakte) have been openly studied many times. Even yours truly wrote an article about six handshakes, collecting a complete graph of friends from VKontakte for it.
But RuNet does not live by VKontakte alone. I wanted to look at what was going on in other, no less populated social networks, and to understand how their audiences differ.
This is not our first experience of collecting big data under cover of night. So, at a brisk pace, four spiders were written in Qt/C++ and Python; strolling unhurriedly through their respective social networks, they wrote everything they came across into a database.
Different social networks treat scraping differently. Problems arose with Odnoklassniki and Facebook, which turned out to have rather tricky systems for detecting suspicious bots. Fortunately, these are aimed mainly at spammers, and from that point of view our bots look harmless and fluffy, so we somehow managed to set up a more or less stable, albeit very slow, collection.
Downloading a lot of data is easy: it just takes two months of collection. But paranoia is spreading across the planet, and for most people the public profile looks very sparse. The lion's share of information is available only to friends. But the fact is that the friends themselves are most often open!
And from them you can infer quite a lot of interesting things: city, age, university, and much more. For starters, here is a graph of real age against the median age of friends:
As you might guess, real age is for the most part closely related to the median age of friends. So even if you are paranoid, your friends give away a lot about you simply by being there.
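For the curious, the idea can be sketched in a few lines of Python. This is only an illustration, not the code that was actually used: the function name, the `min_friends` threshold and the sample ages are all made up.

```python
from statistics import median

def estimate_age(friend_ages, min_friends=5):
    """Guess a hidden age as the median of the friends' known ages.

    friend_ages may contain None for friends who hide their age.
    min_friends is a hypothetical threshold: with too few known ages
    the guess is not worth making, so None is returned.
    """
    known = [age for age in friend_ages if age is not None]
    if len(known) < min_friends:
        return None
    return median(known)

# Made-up friend list: two friends hide their age, one is much older.
friends = [24, 25, None, 26, 25, 41, 23, None, 27]
print(estimate_age(friends))
```

The median is used rather than the mean so that a lone 41-year-old uncle in the friend list does not drag the estimate upwards.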
For storage and analysis we decided, like the big boys, to use HBase/Hadoop. It is stylish, fashionable and youthful, and besides, we already had experience taming such technologies. As a result, about 50 parameters were computed from what we collected (that is, either normalized to a common form or derived from social connections). Thumbs up. Then a random sample of one million users per social network was drawn from the full data set and carefully analyzed. This trick was needed to at least roughly normalize the audiences of social networks with different numbers of users.
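How exactly the sample was drawn is not essential here, but one common way to take a fixed-size uniform sample from a stream too large to hold in memory is reservoir sampling. The sketch below is an assumption about how it could be done, not a description of our Hadoop job; the function name and the toy "networks" are hypothetical.

```python
import random

def reservoir_sample(profiles, k, seed=None):
    """Return k items chosen uniformly at random from an iterable.

    Classic reservoir sampling: the first k items fill the reservoir,
    and item number i (0-based) then replaces a random slot with
    probability k / (i + 1).
    """
    rng = random.Random(seed)
    sample = []
    for i, profile in enumerate(profiles):
        if i < k:
            sample.append(profile)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = profile
    return sample

# Toy stand-ins for networks of very different sizes.
networks = {"big_network": range(100_000), "small_network": range(2_000)}
samples = {name: reservoir_sample(stream, 1000, seed=42)
           for name, stream in networks.items()}
print({name: len(s) for name, s in samples.items()})
```

Taking the same number of profiles from each network is what keeps a larger network from drowning out a smaller one in the comparisons.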
Next comes the tastiest part: what we actually found out.
For starters, it would be interesting to know the age structure of different social networks.
For age we used either the actual age, if the person was not shy about stating their year of birth, or an approximation based on the year of graduation from school or university. This maneuver was needed mostly because of VKontakte and Facebook, where the exact age is known for only 40% and 20% of users, respectively.
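The fallback logic looks roughly like the sketch below. The typical graduation ages (17 for school, 22 for university) are my assumptions for the sake of the example, as are the function and parameter names.

```python
# Assumed typical graduation ages; illustrative values only.
SCHOOL_GRADUATION_AGE = 17
UNIVERSITY_GRADUATION_AGE = 22

def approximate_age(current_year, birth_year=None,
                    school_grad_year=None, university_grad_year=None):
    """Return the exact age if the birth year is known, otherwise an
    approximation from the graduation year, otherwise None."""
    if birth_year is not None:
        return current_year - birth_year
    if school_grad_year is not None:
        return current_year - school_grad_year + SCHOOL_GRADUATION_AGE
    if university_grad_year is not None:
        return current_year - university_grad_year + UNIVERSITY_GRADUATION_AGE
    return None

print(approximate_age(2013, birth_year=1985))            # exact: 28
print(approximate_age(2013, school_grad_year=2003))      # approximate: 27
print(approximate_age(2013, university_grad_year=2008))  # approximate: 27
```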
The result is something like this.
It's funny. Looking at this, you can clearly notice the following features:
- Odnoklassniki hosts an older audience.
- The younger crowd is almost entirely on VKontakte. The strange spike around age 14 is explained by the fact that the latest year of birth that can be chosen in the profile is 1999, so everyone younger chooses it. Child protection and the 14+ label in action.
- People come to Facebook at a mature age; there is almost nobody under 18 there.
And what about the gender and age structure? Gender was either taken as is, if the social network allows specifying it at all, or inferred from the first name as follows: if the majority of people named “Alexander” are men, then all Alexanders of unknown gender are considered men. This approach works for the vast majority of names, but has some trouble with Zhenya and Sasha.
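In code, the name trick might look like the sketch below. The function names and toy data are hypothetical, and the `min_share` cutoff is my own addition for this example: a plain majority vote would happily assign a gender to every Sasha, while a stricter share leaves such names undecided.

```python
from collections import Counter, defaultdict

def build_name_gender_table(profiles, min_share=0.9):
    """Map each first name to its dominant gender.

    profiles is an iterable of (name, gender) pairs where gender may be
    None. A name gets into the table only if its most common gender
    covers at least min_share of the profiles with a known gender.
    """
    counts = defaultdict(Counter)
    for name, gender in profiles:
        if gender is not None:
            counts[name][gender] += 1
    table = {}
    for name, by_gender in counts.items():
        gender, n = by_gender.most_common(1)[0]
        if n / sum(by_gender.values()) >= min_share:
            table[name] = gender
    return table

def infer_gender(name, gender, table):
    """Keep the stated gender; otherwise fall back to the name table."""
    return gender if gender is not None else table.get(name)

# Made-up training data: Alexander is almost always male, Sasha is mixed.
known = ([("Alexander", "m")] * 19 + [("Alexander", "f")]
         + [("Sasha", "m")] * 6 + [("Sasha", "f")] * 4)
table = build_name_gender_table(known)
print(infer_gender("Alexander", None, table))  # taken from the table
print(infer_gender("Sasha", None, table))      # ambiguous name, stays None
```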
Excellent. I always suspected this:
- On Odnoklassniki, most of the audience is women. I suggest that the administration rename the network to Odnoklassnitsy (“female classmates”); it would be closer to the truth.
- On the other social networks there is also an overall skew towards women, but not such a dramatic one.
- The general dip in the number of men born before 1976 can be explained by the sad fact that middle-aged men die earlier and more often than women. I believe this pattern reflects the general demographic situation for this group of people. Take care of your men.
Next, an abstract indicator of a person’s activity on a social network was computed from a set of rules: “has an avatar”, “more than 50 friends”, “was online recently”, and so on. Each rule that fired added a few points to the profile's piggy bank. This is what the distribution of this indicator looks like across the social networks:
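A rule-based score of this kind is easy to sketch. The three rules below are the ones named above; the point values, the 7-day meaning of “recently” and the profile field names are invented for the example and are not the weights we actually used.

```python
from datetime import date, timedelta

def has_avatar(profile, today):
    return bool(profile.get("has_avatar"))

def many_friends(profile, today):
    return profile.get("friends", 0) > 50

def recently_online(profile, today):
    last_seen = profile.get("last_seen")
    # "Recently" is assumed to mean within the last 7 days.
    return last_seen is not None and today - last_seen <= timedelta(days=7)

# (rule, points) pairs; the point values are illustrative assumptions.
RULES = [(has_avatar, 2), (many_friends, 3), (recently_online, 5)]

def activity_score(profile, today):
    """Sum the points of every rule that fires for this profile."""
    return sum(points for rule, points in RULES if rule(profile, today))

today = date(2013, 10, 1)
active = {"has_avatar": True, "friends": 120, "last_seen": date(2013, 9, 29)}
sleepy = {"has_avatar": True, "friends": 12, "last_seen": date(2012, 1, 1)}
print(activity_score(active, today), activity_score(sleepy, today))
```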
Surprisingly, VKontakte is simply overflowing with young, hyperactive users whose fervor fades (or is gently redirected towards family life) only by the age of 35. On Odnoklassniki, the main activity does not begin until around 30. On MoyMir and Facebook the situation is more dismal: it is a swamp there.
Age and activity are all well and good, but rather boring. So, to keep from falling asleep, we counted the number of obscene words in the posts of each person in the sample. Compiling the dictionary of such words was especially fun. The vertical axis shows the number of such words in the last 10 posts.
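The counting itself is simple dictionary matching. Here is a minimal sketch; the function name is hypothetical, the posts are assumed to be ordered from oldest to newest, and the dictionary is a deliberately tame placeholder rather than the real one.

```python
import re

def count_flagged_words(posts, dictionary, last_n=10):
    """Count dictionary words in the last_n posts (case-insensitive).

    posts is assumed to be ordered from oldest to newest. Matching is by
    whole word only, so inflected forms must be listed in the dictionary.
    """
    flagged = {word.lower() for word in dictionary}
    total = 0
    for post in posts[-last_n:]:
        for token in re.findall(r"\w+", post.lower()):
            if token in flagged:
                total += 1
    return total

# Placeholder dictionary and made-up posts.
dictionary = {"darn", "heck"}
posts = ["Darn, this is a heck of a day", "nice weather", "darn darn"]
print(count_flagged_words(posts, dictionary))
```

Since `\w` is Unicode-aware in Python 3, the same tokenizer works for Cyrillic text.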
Evidently, from the age of 15 young people grow bold enough to swear right on their own page. Personally, I wrote my first “X * Y” in the schoolyard at 12, though anonymously. The outpouring of foul language continues to some extent until 23. Then, apparently, seriousness sets in and it is time to become an adult. A Captain Obvious statement, but now it is at least backed by facts.
Now it's time to dissect names. It always seemed to me that the popularity of names changes over time. Sometimes you can sense a fashion for giving children unusual names. And now you can see it with your own eyes.
The trend is obvious: previously popular names are rapidly losing their former popularity.
- For example, in the good old days every eighth boy was named Alexander, and now only every 50th. Alexanders have become scarce these days.
- The population of Vladimirs has been steadily declining for over 50 years.
- There are noticeable waves of fashion for the names Alex and Igor.
With female names, the situation looks similar:
In general, the same rapid decline, but a few features deserve attention:
- Popular female names have lost even more ground than male ones. Apparently, when a girl is born, parents strain their imagination as hard as they can and bring forth an Isolde, a Malvina or a Dazdraperma. But when a boy arrives, why bother: Alexander will do.
- The turning point in the decline came in the late 70s and early 80s. A breeze from the West, a thaw, and here is the result: every parent now competes in the originality of their children's names.
Moving on. Every post on a VKontakte wall carries a funny tag if it was made from an iOS or Android phone. This, too, can (and even should) be analyzed. Note that the Y axis of the graph shows the proportion of men. The proportion of women, as you might guess, depends on the proportion of men in a very simple way.
Interestingly, the iPhone is clearly skewed towards the fair sex, which is not surprising. “Dad, buy me an iPhone, I'm walking around like a fool without one” is a rather popular phrase that haunts the nightmares of many young dads. Android, meanwhile, comes into high demand among rugged men over 30.
My grandmother always told me that in her day, being a bachelor (or, even worse, an unmarried woman) at 25 was tantamount to disaster. This usually turned into a lecture on the themes “you need to get married, sir” and “everyone else is already married, and you are alone like an owl.” I always wanted arguments for this dispute, and now I have them.
I want to note that marital status was analyzed only among those who indicated it.
The following facts are very interesting:
- At 27, only half of those who indicated their marital status were married. And you were saying.
- Young people are more likely to carry the status “single” than “in a relationship”. They stay single for a while, and then, bang, they are “married”.
Now we can move on to bad habits: smoking and drinking. This parameter exists only on VKontakte, but many users fill it in diligently, which we will take advantage of.
Regrettably, the love of alcohol and smoking only grows with age. A plateau of sorts is reached only at around 30, which somewhat surprised me. Somewhere around 40 some people think it over and try to fix things, but by then it is too late to drink Borjomi.
Height and weight
MoyMir has some funny profile parameters: height and weight. I cannot explain what guided the great mind that added them. But the parameters are there, and it would be foolish not to look at them through the prism of our curiosity.
I expected a less contrasting graph, but this is how it turned out. I suppose the oddity can be explained by the fact that women are more often proud of being short and hide being tall. For men it is exactly the opposite: it is embarrassing to admit to the whole Internet that you are only 150 cm tall, but if you are two meters tall, everyone should know it.
On the other hand, women are on average shorter than men, so the explanation may be much simpler.
With weight, the situation is about the same as with height. Above 60 kg, women abruptly stop mentioning their weight. Men, on the other hand, are always happy to. One hundred and twenty? No problem, the more of a good person, the better.
In general, the relationship between height and weight is described in many medical sources, and it is evident on this graph. It is funny to note that short girls are usually heavier than short boys.
Even when I was little, I suspected that girls are more often friends with girls, and that boys, in turn, are friends mostly with boys. I suppose it is time to confirm this trend.
Yes, there is a clear tendency for girls to have girls as friends. If you are male and your friends are all ladies, then I have bad news for you.
Likes are an amazing thing. Six years ago they did not exist, and now they are an integral attribute of any social network.
I always suspected that the fair sex likes things much more often. This trend holds almost up to age 30 and then slowly fades. Fortunately, values change with age.
The like is a phenomenon of our time. Girls nervously count how many people have liked their new avatar with autumn leaves, and meanwhile a dialogue plays in my head: “Dad, how did you meet Mom?” - “Well, I liked her avatar, and it went from there.”
However much Habr tries to stay out of politics, politics is now oozing from every crack. Social networks even have a special field describing political views, which we will now dissect.
- Personally, I am surprised that so many people today proudly declare their indifference to politics. Their number decreases with age, but only slightly.
- The number of conservatives and liberals grows with age, apparently at the expense of the politically indifferent.
This is where the pretty graphs end. But the fun does not.
To create such a wonderful data set for analysis and not share it would be a crime against humanity. Therefore we decided to make it publicly available, but in a way that does not infringe the rights and privacy of the people who ended up in it:
- Profiles are anonymized: there is no name, surname, date of birth, university or profile address. Theoretically, some of the profiles can be deanonymized, but this would be quite difficult.
- The data set is a sample of exactly 4 million profiles, 1 million per social network.
- The collection and analysis of this material, it seems to me, complies with the law on personal data. Key points: publicly available data (users posted it themselves), anonymization (no name, date of birth, etc.), use for scientific non-commercial purposes.
- Data license: Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA).
- The data is a MySQL dump. A description of all available fields is included.
- There may be bugs in the data; in fact, I am sure of it. If you find one, you can proudly share it with me in a private message.
Archive with the data: 7z, 135 MB compressed, 1 GB unpacked.
Instead of an afterword
Be careful with the data that you upload to the network. What was once uploaded there will remain there for centuries. So take care of yourself and your privacy from a young age.