Big data: does size matter?

Every web developer faces the task of personalizing content for users. As the volume of data grows and its variety increases, accurate content selection becomes an ever more important task, one that strongly affects how attractive a project looks in the eyes of its users. If this falls within your area of interest, you may find a few new ideas in this post.
Every era in the IT industry has had its buzzwords: words everyone has heard, everyone knows they are the future, but only a few understand what actually stands behind them and how to take advantage of them. At various times the buzzwords were "waterfall", "XML", "Scrum", and "web services". Today one of the main contenders for the title of Buzzword No. 1 is "big data". With the help of big data, British scientists diagnose pregnancy from a supermarket receipt with an accuracy approaching that of an hCG test. Large vendors build platforms for big data analysis that cost millions of dollars, and there is no doubt that by 2020 every pixel in any self-respecting Internet project will be rendered with big data in mind.
At the same time, a rare article about big data analysis algorithms goes without a comment like "Well, show me an example working at industrial scale!" So we won't beat around the bush and will start with an example: www.ok.ru/music. Most of the content in the music section of Odnoklassniki is selected from "big data" individually for each user. Is it worth it? Here are a few simple numbers:
- + 300% plays and subscriptions
- + 200% song uploads
- + 1000% click-through rate for music ad targeting
But that is not even the main thing. The lively, unbiased opinion of real users is far more valuable. A year ago, as part of the "Outside the Window" project, people who had never used Odnoklassniki before spent two weeks on the site, reporting on their impressions in detail. One of the reviews of the music section read: "It somehow guesses what I like. I don't understand how, but it's nice."
In fact, of course, there is no magic: it is all in the data. The very data that our users generate by listening to and downloading music and browsing the music catalog. Information about all user actions flows into a classic MS SQL relational database, where primary processing, filtering, and aggregation take place (yes, good old SQL can handle big data too). The data prepared in SQL is uploaded to a small Hadoop cluster for further analysis, which produces a compact but informative extract that is then used in real time (part of it is imported into Cassandra, part is loaded straight into memory). For better responsiveness, the most recent user actions are written to a separate database (Tarantool) and are also taken into account online.
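To make the pipeline a bit more concrete, here is a minimal sketch (in Python, not the actual production code) of the kind of primary filtering and aggregation step described above: raw listen events are cleaned up and rolled into per-user, per-track play counts before going off for deeper analysis. All names and thresholds here (ListenEvent, aggregate_plays, min_listen_seconds) are hypothetical.

```python
# Illustrative only: roll raw listen events up into per-user, per-track
# play counts, dropping accidental clicks. Names and thresholds are assumptions.
from collections import Counter
from dataclasses import dataclass

@dataclass
class ListenEvent:
    user_id: int
    track_id: int
    seconds_played: int

def aggregate_plays(events, min_listen_seconds=30):
    """Filter out too-short listens and count plays per (user, track) pair."""
    counts = Counter()
    for e in events:
        if e.seconds_played >= min_listen_seconds:
            counts[(e.user_id, e.track_id)] += 1
    return counts

events = [ListenEvent(1, 42, 180), ListenEvent(1, 42, 5), ListenEvent(2, 42, 200)]
print(aggregate_plays(events))  # Counter({(1, 42): 1, (2, 42): 1})
```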

The extract used for content selection includes various correlations between objects of different types. For music tracks, this is information about how often they are listened to within a small time window (temporal similarity). For artists, it is information about how often the same user likes both of them (collaborative similarity) and how similar the music lists of their nearest neighbors are (second-order collaborative similarity). For users, it is information about which tracks and artists they listen to and how often (user ratings). For ease of processing, all the correlations are stored in a single structure: the graph of tastes.
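As an illustration of what such a structure might look like, here is a toy sketch of a taste graph: typed nodes (user / track / artist) connected by weighted edges, one edge per correlation. This is a simplified representation for the sake of the explanation, not the format actually used in production.

```python
# Toy "graph of tastes": typed nodes connected by weighted, undirected edges.
from collections import defaultdict

class TasteGraph:
    def __init__(self):
        # node -> list of (neighbor, weight); a node is a (type, id) pair
        self.edges = defaultdict(list)

    def add_edge(self, a, b, weight):
        self.edges[a].append((b, weight))
        self.edges[b].append((a, weight))

g = TasteGraph()
g.add_edge(("track", 1), ("track", 2), 0.8)    # temporal similarity
g.add_edge(("artist", 7), ("artist", 9), 0.5)  # collaborative similarity
g.add_edge(("user", 42), ("track", 1), 3.0)    # user rating (e.g. play count)
```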

Thanks to its relatively compact size, the graph of tastes lets us solve a wide range of content-personalization tasks in real time. Given the list of the most popular tracks across the system, you can:
- evaluate their relevance for a particular user (the number and weight of paths of length at most N between the user and the tracks; see the sketch after this list),
- break the user's tastes into connected blocks (clustering by the density of the subgraph of common neighbors using affinity propagation) and select recommendations for each block (personalized PageRank).
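The first item in the list can be illustrated on the toy graph above: the relevance of a candidate track for a user is taken as the total weight of all paths of length at most N between them, multiplying edge weights along each path. The exact scoring formula used in production is not described in this post, so treat this as an assumption.

```python
# Sum the weights of all paths of length <= max_len from `user` to `track`,
# multiplying edge weights along each path (toy scoring, reusing TasteGraph).
def path_relevance(graph, user, track, max_len=3):
    score = 0.0

    def walk(node, weight, depth):
        nonlocal score
        if node == track:
            score += weight
            return
        if depth == max_len:
            return
        for neighbor, w in graph.edges[node]:
            walk(neighbor, weight * w, depth + 1)

    walk(user, 1.0, 0)
    return score

print(path_relevance(g, ("user", 42), ("track", 2)))  # ~2.4 on the toy graph
```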
Given a collection of songs compiled by the user, you can pick similar interesting tracks (again personalized PageRank for the collection, with the result personalized for the user). Technical details on the how and the why can be found here.
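Here is a rough power-iteration sketch of personalized PageRank over the same toy graph, seeded with the user's collection. The damping factor and iteration count are illustrative defaults; the production version certainly differs.

```python
# Personalized PageRank on the toy TasteGraph: a random walk with restart to
# the seed set (the user's collection). Purely illustrative parameters.
def personalized_pagerank(graph, seeds, alpha=0.85, iterations=20):
    nodes = list(graph.edges.keys())
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = graph.edges[n]
            total_w = sum(w for _, w in out)
            for neighbor, w in out:
                new_rank[neighbor] += alpha * rank[n] * (w / total_w)
        rank = new_rank
    return rank

collection = {("track", 1)}
scores = personalized_pagerank(g, collection)
# Recommend the highest-scoring tracks the user does not already have.
suggestions = sorted((n for n in scores if n[0] == "track" and n not in collection),
                     key=scores.get, reverse=True)
print(suggestions)  # [('track', 2)] on the toy graph
```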
An attentive reader will not fail to notice that none of the solutions used here can be called new / breakthrough / unique (underline as appropriate), either in terms of algorithms or in terms of technology. Why, then, do truly high-quality solutions based on big data appear so rarely on the Russian market?
Many spears have been broken (and are still being broken) in disputes over what size of data can be considered truly "big". But is it really about size? Hundreds of gigabytes / terabytes / petabytes (underline as appropriate) of data are not valuable in themselves; their main purpose is to help understand the past and predict the future. Obviously, data alone is not enough for that: you also need analysis algorithms, technologies, and people able to apply them.
Many companies have data sets large enough to benefit the business when used properly. The processing algorithms are widely known and actively developed, and processing technologies are available in every price category (from open source software running on commodity hardware to multi-million-dollar integrated systems). What is usually missing is the last and most important component: experienced people able to put all the pieces together.
It is easy enough to find a programmer who knows every nuance of garbage collection in Java, has experience with a dozen DBMSs of various types, and is intimately familiar with Spring / Trove / Hibernate and another fifty libraries and packages. However, most of them are technology-oriented and not geared toward working with the literature, mastering new methods of statistical processing, or setting up experiments. Finding a mathematician capable of that is harder, but still possible; in that case, though, it will be extremely difficult to move beyond a shapeless cloud of Matlab code. The probability of finding a person who can take the best of both worlds is so small that many doubt such people exist at all.
It would seem that many university graduates should be eager to fill such a valuable ecological niche, but even yesterday's students show the same split into "techies" and "mathematicians". In data mining problems, the former tend toward a cavalier "what's the big deal" attitude, while the latter drift off into the nirvana of mathematics and do not always return. Still, their ability to learn is not yet as dulled as that of seasoned specialists, although developing them requires serious additional investment.
Despite its complexity and capital intensity, an effective data mining system can make a project far more attractive and convenient for users, driving audience growth.