Identify meaningful profiles in VK
It’s really difficult to distinguish bots from people. I really can’t do it myself. But on the other hand, I came up with a good bike ... a method how to distinguish in VK “interesting people” from “not very interesting”. In terms of network communication, of course, and not in life.
If someone is knocking on your friends, but at first glance you can’t understand this is a normal person or who the hell knows, this method can provide some useful information about the user. It is unlikely to use it to identify relevant target groups, because VK has put a limit on the ability to download the contents of user walls , and it’s slow to hurt. Those. it is possible, but it is necessary to greatly refine, optimize and dodge in order to circumvent restrictions.
The main idea is that bots, dull (in network terms) personalities, all kinds of mass collectors of friends-subscribers do not really care about who they are friends with, although they can “write” quite a lot of meaningful posts on their walls. But dull personalities don’t especially read their tape, and bots don’t need it at all. Moreover, this is not necessary for mass collectors of subscribers and stars.
But for people who have at least some communicative interests regarding VK, it is very important who they are friends with. And, of course, they will not be able to collect 6,000 dudes in their friends who will only share reposts, pictures of naked women and advertisement of drain barrels at a discount from a warehouse in Novy Urengoy.
And on this basis, you can try to draw up a criterion by which to single out people who are interested in the content of their feed. Such people show the features of a real person. A person who, at a minimum, carries out a meaningful unilateral communicative act. Nowadays, this is not so small.
Two criteria immediately came to my mind:
I chose 50 random friends and 50 random followers who met some criteria that would cut off the very obvious fakes, children or people who don’t use it all. Type that the user should not be deactivated and at the same time should have more than 50 existing friends.
I looked through all these people and identified which of them is a “bot” and which is not. Naturally, most of the friends were real, and most of the subscribers offered to buy something (but a few real people were there).
Further, I took the first 100 posts from each of the friends of the person being checked, if there were so many on the wall. For each person, I considered two such factors:
Another factor appeared by chance. This is the logarithm from ID in VK log10 (ID)
On this, I trained logistic regression , and I got this:
log (OR) = 9.92-1.537 * log10 (ID) + 0.067 * SQRT (Dic) -0.023 * Percent
For the test part of the sample it turned out a very good classifier with AUC = 0.93. Here is such a ROC curve :
ROC curve of the classifier that determines the content of a person’s page.
Some questions are caused by the importance of VC ID for classifying the content of a person, but it seems, alas, it works. The further the ID is from 1, the more likely it is that it is just a bot that is made to advertise microloans. Without ID, the classifier also works, but worse. AUC = 0.78. This is not direct good, but also not direct useless.
In any case, the final decision on the character’s usefulness is up to the decision maker.
I took all of its 5,000 subscribers from one of my comrades, where, of course, 95% of the advertising slag was sent out and the regression was run without additional training. With a cutoff of 20%, the results came out such TP = 78%, FP = 11% . That is, in general, on an arbitrary person it also works more or less less.
Yes, it’s easy enough to generate a bot that will have some pseudo-meaningful posts surrounded by friends, but so far no one needs it. Well, it’s hard to bother with different content, because if all the bots generate the same thing, it is also easy to recognize.
Probably possible, but I’m breaking it to do hello VK. If anyone wants, let him do it. It seems that the method is described, its idea is simple.
Enough. But suddenly someone will come in handy as a base for their developments. This method can easily be complicated, for example, by considering not just the length of the dictionaries, but considering the content. Here you can already use the full power of NLP and train in content. You can still take more complex classifiers: trees, neural networks, etc. All this can be complicated, but it is important that even simple ones give something interesting.
If someone is knocking on your friends, but at first glance you can’t understand this is a normal person or who the hell knows, this method can provide some useful information about the user. It is unlikely to use it to identify relevant target groups, because VK has put a limit on the ability to download the contents of user walls , and it’s slow to hurt. Those. it is possible, but it is necessary to greatly refine, optimize and dodge in order to circumvent restrictions.
main idea
The main idea is that bots, dull (in network terms) personalities, all kinds of mass collectors of friends-subscribers do not really care about who they are friends with, although they can “write” quite a lot of meaningful posts on their walls. But dull personalities don’t especially read their tape, and bots don’t need it at all. Moreover, this is not necessary for mass collectors of subscribers and stars.
But for people who have at least some communicative interests regarding VK, it is very important who they are friends with. And, of course, they will not be able to collect 6,000 dudes in their friends who will only share reposts, pictures of naked women and advertisement of drain barrels at a discount from a warehouse in Novy Urengoy.
And on this basis, you can try to draw up a criterion by which to single out people who are interested in the content of their feed. Such people show the features of a real person. A person who, at a minimum, carries out a meaningful unilateral communicative act. Nowadays, this is not so small.
Two criteria immediately came to my mind:
- The average dictionary of a person’s friends for N last posts.
- The percentage of posts without texts from friends of the person being checked.
And how did I end up checking this?
I chose 50 random friends and 50 random followers who met some criteria that would cut off the very obvious fakes, children or people who don’t use it all. Type that the user should not be deactivated and at the same time should have more than 50 existing friends.
I looked through all these people and identified which of them is a “bot” and which is not. Naturally, most of the friends were real, and most of the subscribers offered to buy something (but a few real people were there).
Further, I took the first 100 posts from each of the friends of the person being checked, if there were so many on the wall. For each person, I considered two such factors:
- The average size of a person’s friends dictionary for their first 100 posts. Those. 50 friends, each with approximately 100 posts. For each friend, all words from 100 posts are raked into a heap, stamped and the number of unique words of a friend is considered. Further, the average for all 50 friends is considered. From this value, the root was taken - SQRT (Dic).
- If a friend has more than 60 out of 100 posts without words, he is marked as “lost”. The percentage of “lost” people in friends is the second factor - Percent.
Another factor appeared by chance. This is the logarithm from ID in VK log10 (ID)
On this, I trained logistic regression , and I got this:
log (OR) = 9.92-1.537 * log10 (ID) + 0.067 * SQRT (Dic) -0.023 * Percent
For the test part of the sample it turned out a very good classifier with AUC = 0.93. Here is such a ROC curve :
ROC curve of the classifier that determines the content of a person’s page.
Some questions are caused by the importance of VC ID for classifying the content of a person, but it seems, alas, it works. The further the ID is from 1, the more likely it is that it is just a bot that is made to advertise microloans. Without ID, the classifier also works, but worse. AUC = 0.78. This is not direct good, but also not direct useless.
In any case, the final decision on the character’s usefulness is up to the decision maker.
additional verification
I took all of its 5,000 subscribers from one of my comrades, where, of course, 95% of the advertising slag was sent out and the regression was run without additional training. With a cutoff of 20%, the results came out such TP = 78%, FP = 11% . That is, in general, on an arbitrary person it also works more or less less.
Can they make bots that pass this test?
Yes, it’s easy enough to generate a bot that will have some pseudo-meaningful posts surrounded by friends, but so far no one needs it. Well, it’s hard to bother with different content, because if all the bots generate the same thing, it is also easy to recognize.
Is it possible to make an application that checks people by ID?
Probably possible, but I’m breaking it to do hello VK. If anyone wants, let him do it. It seems that the method is described, its idea is simple.
Is it too commonplace?
Enough. But suddenly someone will come in handy as a base for their developments. This method can easily be complicated, for example, by considering not just the length of the dictionaries, but considering the content. Here you can already use the full power of NLP and train in content. You can still take more complex classifiers: trees, neural networks, etc. All this can be complicated, but it is important that even simple ones give something interesting.