Analysis of trends in Russian YouTube for 2018
The old-timers probably don’t remember, but at the end of 2017 there was a widespread discussion on the Internet that videos with artificially inflated counters often showed up in YouTube trends.
Therefore, on the eve of 2018, I wrote a utility that collects information about trending videos. For each video it requests the title, the list of tags, and the creation date, as well as the history of changes in likes / dislikes / views. It is written in TypeScript for Node.js, and the code is published on GitHub.
As a result, it is now possible to build nice charts:
It is also possible to build trend charts for keywords. In total, information on 29,271 videos was collected in 2018. Statistics are still being collected.
General principle of operation
- Every 5 minutes, the current list of trends is taken.
- For each new video, the main information is saved (title, tag list, creation date)
- Based on the title and tags, each video is assigned a keyword cloud.
- For each video, the history of likes / dislikes / views is requested. Statistics are collected for two days: at first the requests go out at 2-minute intervals, then the interval increases. If manipulation is suspected, the interval is reset to 2 minutes.
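The polling schedule from the last step could be sketched roughly as follows. Only the 2-minute starting interval, the two-day window, and the reset on suspicion come from the description above. The growth factor, the cap, and every name in the code are assumptions made for illustration, not the utility's actual values.

```typescript
// From the article: start at 2 minutes, track for 2 days.
const INITIAL_INTERVAL_MIN = 2;
const TRACKING_WINDOW_MIN = 2 * 24 * 60;
// Assumptions: how fast the interval grows and where it stops growing.
const GROWTH_FACTOR = 1.5;
const MAX_INTERVAL_MIN = 60;

interface PollState {
  elapsedMin: number;  // minutes since tracking of this video started
  intervalMin: number; // current delay between two requests
}

// Returns the state after the next request, or null once the window is over.
function nextPoll(state: PollState, suspicious: boolean): PollState | null {
  const elapsedMin = state.elapsedMin + state.intervalMin;
  if (elapsedMin >= TRACKING_WINDOW_MIN) {
    return null;
  }
  const intervalMin = suspicious
    ? INITIAL_INTERVAL_MIN // suspicion resets polling to the densest rate
    : Math.min(state.intervalMin * GROWTH_FACTOR, MAX_INTERVAL_MIN);
  return { elapsedMin, intervalMin };
}
```

Keeping the schedule as a pure function of the previous state makes it easy to replay against already collected data when the parameters change.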
If the graph of changes in the number of likes / dislikes is a straight line over some segment, only the first and last values of that segment are saved. This is done to reduce the size of the database. The statistics table currently holds 6,908,449 records and takes 458 MB on disk.
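A minimal sketch of this kind of compression, assuming each measurement is a (time, value) pair. The `Sample` type, the `compress` function, and the `tolerance` parameter are hypothetical names; the utility's real storage logic may differ.

```typescript
interface Sample {
  t: number;     // measurement time, e.g. seconds since tracking started
  value: number; // counter value (likes or dislikes) at that time
}

// Drops interior points that lie on a straight line with the last kept point
// and the following point, so each straight run keeps only its two ends.
function compress(samples: Sample[], tolerance = 0): Sample[] {
  if (samples.length <= 2) {
    return samples.slice();
  }
  const kept: Sample[] = [samples[0]];
  for (let i = 1; i < samples.length - 1; i++) {
    const a = kept[kept.length - 1];
    const b = samples[i];
    const c = samples[i + 1];
    // Cross product of (b - a) and (c - a); zero means a, b, c are collinear.
    const cross =
      (b.t - a.t) * (c.value - a.value) - (b.value - a.value) * (c.t - a.t);
    if (Math.abs(cross) > tolerance) {
      kept.push(b);
    }
  }
  kept.push(samples[samples.length - 1]);
  return kept;
}
```

For example, five measurements where the first four grow by exactly 10 per step collapse to three points: the start, the point where the slope changes, and the end.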
The principle of automatic manipulation detection
I formulated the task for myself as follows: flag the videos whose like / dislike chart contains a “ladder”. A step of this ladder is detected from three adjacent measurements. Two things are taken into account: the angle between two straight lines (one drawn between the first and second measurements, the other between the second and third) and the lengths of these segments. Charts with many small irregularities are also flagged.
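The step check on three adjacent measurements might look something like the sketch below. The article does not give the thresholds or say how time and counter values are scaled against each other, so `MIN_ANGLE`, `MIN_LENGTH`, and the assumption that both axes are already scaled to comparable units are mine. The separate rule for many small irregularities is not covered here.

```typescript
interface Point {
  x: number; // time, pre-scaled to units comparable with y (assumption)
  y: number; // likes or dislikes
}

// Angle in radians between segment p0->p1 and segment p1->p2, in [0, PI].
function turnAngle(p0: Point, p1: Point, p2: Point): number {
  const a1 = Math.atan2(p1.y - p0.y, p1.x - p0.x);
  const a2 = Math.atan2(p2.y - p1.y, p2.x - p1.x);
  let diff = Math.abs(a2 - a1);
  if (diff > Math.PI) {
    diff = 2 * Math.PI - diff;
  }
  return diff;
}

function segmentLength(a: Point, b: Point): number {
  return Math.hypot(b.x - a.x, b.y - a.y);
}

// Hypothetical thresholds, chosen only to make the example work.
const MIN_ANGLE = Math.PI / 4;
const MIN_LENGTH = 5;

// Counts triples of adjacent measurements that look like a corner of a step.
function countSteps(points: Point[]): number {
  let steps = 0;
  for (let i = 0; i + 2 < points.length; i++) {
    const p0 = points[i];
    const p1 = points[i + 1];
    const p2 = points[i + 2];
    const sharpTurn = turnAngle(p0, p1, p2) >= MIN_ANGLE;
    const longEnough =
      segmentLength(p0, p1) >= MIN_LENGTH && segmentLength(p1, p2) >= MIN_LENGTH;
    if (sharpTurn && longEnough) {
      steps++;
    }
  }
  return steps;
}
```

A chart would then be flagged when `countSteps` exceeds some limit. Requiring a minimum segment length keeps ordinary noise between close measurements from counting as a step.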
An example of a suspicious chart:
I tuned all the parameters of the algorithm by hand and checked them against the videos collected up to that point, and the algorithm was changed during the year. Therefore, it is probably not worth taking its verdict too seriously for any individual video. In my defense, whenever the parameters changed, the calculation was rerun for all videos already collected, so the same algorithm was applied to every video.
In general, it is impossible to tell from one (or several) changes in likes / dislikes whether manipulation took place. Any suspicious jump can be explained by the workings of CQRS or by solar flares. Yes, one chart is smooth and another is stepped, but perhaps all videos show similar behavior from time to time? That is why, to get the overall picture, information was collected on all the videos that reached trends.
For 2018, the algorithm showed the following results:
- Suspected like manipulation: 180 videos (0.61% of the total number of videos)
- Suspected dislike manipulation: 1,303 videos (4.45% of the total number of videos)
There are few videos with suspicious like charts, but this was not always the case: in the first month of 2018, 96 such videos were recorded (more than 50% of all videos with suspicious likes for the year). In February, however, there were far fewer such videos, only 8.
Here I should probably turn to the old-timers again, who may (or may not) remember what happened on January 10, 2018, when YouTube blocked many channels. For my part, I can say that the blocked channels included some for which my utility had managed to gather information. The chart of one of the deleted videos:
If we assume that manipulation really was taking place, then it seems that YouTube did a lot of work: videos with suspicious like charts no longer appear in trends every day, and those that do often look like an accident or an error. On the other hand, the drop can also be explained by the fact that, unlike with dislikes, there is no point in inflating likes on videos that have already reached trends.
And some more statistics. On average, a tracked video gains 21,479 likes and 2,863 dislikes. For the flagged videos, the averages (likes / dislikes) are:
- Suspected like manipulation: 15,502 / 4,250
- Suspected dislike manipulation: 16,868 / 22,087
So, judging by these results, inflating likes brings no benefit, while the share of dislikes can indeed be pushed up.
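To make the comparison concrete, the averages above can be turned into dislike shares. This sketch assumes that each pair of numbers is average likes followed by average dislikes; `dislikeShare` is a hypothetical helper, not part of the utility.

```typescript
// Share of dislikes among all reactions, as a percentage.
function dislikeShare(likes: number, dislikes: number): number {
  return (100 * dislikes) / (likes + dislikes);
}

// Averages quoted in the article, read as likes / dislikes (assumption).
const groups = [
  { name: "all tracked videos", likes: 21479, dislikes: 2863 },
  { name: "suspected like manipulation", likes: 15502, dislikes: 4250 },
  { name: "suspected dislike manipulation", likes: 16868, dislikes: 22087 },
];

for (const g of groups) {
  console.log(`${g.name}: ${dislikeShare(g.likes, g.dislikes).toFixed(1)}% dislikes`);
}
```

Under that reading, the dislike share is about 11.8% for all tracked videos, 21.5% for videos flagged for likes, and 56.7% for videos flagged for dislikes.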
Videos with suspicious dislike charts are distributed unevenly across channels. For example, on the Yevgeny Roizman channel, out of 21 videos that reached trends, more than half were flagged by the algorithm as having inflated dislikes.
About the chart at the top of this article. Suppose there is a pool of 5–10 thousand accounts that was first ordered to post dislikes and then, before that job had finished on the same pool, ordered to post likes. That would probably produce a similar chart.
The strangest chart I have come across:
I would be grateful if someone could explain what is going on here. Note, by the way, that for this chart the statistics were collected for almost a week rather than two days.
The principle of measuring keyword popularity
As already mentioned, the title and the set of tags are saved for each video. The title and each tag are then split into separate words, which are run through a stemmer and saved as the keyword cloud for the video.
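A rough sketch of building such a keyword cloud. The article does not say which stemmer is used, so `toyStem` below is a deliberately crude placeholder that strips a few English suffixes; a real setup would plug in a proper stemmer for the language of the videos. The function names and the list of separator characters are also assumptions.

```typescript
// Placeholder stemmer (assumption): strips a few common English suffixes.
function toyStem(word: string): string {
  return word.replace(/(ing|ed|es|s)$/, "");
}

// Builds the set of stemmed keywords from a video's title and tags.
function keywordCloud(
  title: string,
  tags: string[],
  stem: (word: string) => string = toyStem
): Set<string> {
  const cloud = new Set<string>();
  for (const text of [title].concat(tags)) {
    // Split on whitespace and common punctuation; letters of any alphabet survive.
    const words = text.toLowerCase().split(/[\s.,;:!?"'()\[\]\/\\|#\-]+/);
    for (const word of words) {
      if (word.length > 0) {
        cloud.add(stem(word));
      }
    }
  }
  return cloud;
}

const cloud = keywordCloud("Trending videos 2018", ["trends", "video"]);
console.log(cloud.size); // 3: "trend", "video", "2018"
```

Using a set means a word that appears both in the title and in a tag is counted once per video.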
Knowing when each video entered trends and when it left them, together with the video's set of words, you can plot how the popularity of each keyword changes over time. At the moment, the keyword popularity chart is built with a resolution of one day. The measure is the total time (in hours) that all videos with the given keyword spent in trends.
Example: only two videos matching the keyword were in trends. One spent 5 hours in trends, the other 10 hours. The popularity of the keyword is then 10 + 5 = 15.
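The measure from this example can be sketched as a simple aggregation. The `TrendSpan` shape and the function name are hypothetical, and the sketch sums over whole stays in trends without splitting them into the per-day buckets used for the charts.

```typescript
interface TrendSpan {
  keywords: string[]; // stemmed keywords of the video, each listed once
  enteredAt: number;  // time the video entered trends, ms since epoch
  leftAt: number;     // time the video left trends, ms since epoch
}

const HOUR_MS = 60 * 60 * 1000;

// For every keyword, sums the hours that videos carrying it spent in trends.
function keywordPopularity(spans: TrendSpan[]): Map<string, number> {
  const hours = new Map<string, number>();
  for (const span of spans) {
    const duration = (span.leftAt - span.enteredAt) / HOUR_MS;
    for (const keyword of span.keywords) {
      hours.set(keyword, (hours.get(keyword) ?? 0) + duration);
    }
  }
  return hours;
}

// The example from the text: two videos, 5 and 10 hours in trends.
const popularity = keywordPopularity([
  { keywords: ["football"], enteredAt: 0, leftAt: 5 * HOUR_MS },
  { keywords: ["football"], enteredAt: 0, leftAt: 10 * HOUR_MS },
]);
console.log(popularity.get("football")); // 15
```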
Sample keyword popularity charts
According to the algorithm described above, the most resonant and most noticeable event of 2018 was not the elections and not even football, but the tragedy in Kemerovo:
Unlike all the other events, the tragedy in Kemerovo affected everyone, and videos about it pushed everything else out of trends.
Well, a bit of politics:
How to try it
The system currently runs in the Amazon cloud on two instances:
- t2.micro - web server
- t3.small - server with MySQL. Utilities for collecting statistics are executed on the same server.
Under load, the web server will probably go down first, while the second server will keep collecting statistics. I mention this so that you are not surprised if everything stops working.
The database as of 01/23/2019 can be downloaded from the link.