How we use our data processing infrastructure

    A year ago, we abandoned all public counters in favor of private services and our own data processing infrastructure. Gathering 10 million hits a day during the Olympics, we hit the limits of Google Analytics' generosity, beyond which the free service is no longer viable. But now we have our own amusement park with aggregation and graphs, so we can easily dial back our use of GA, keeping it only for audit purposes. Below is a longread about how we collect data and how we use it in our work, with funny pictures inside.

    We don't have a room like this with monitors and nice bearded guys. This image was found on the Internet and was taken at Walmart's NOC. They can afford it :)

    “I don’t notice the code at all. I see blondes, brunettes, redheads. By the way, do you ... want a drink? ”

    It is not as if we knew what we were doing at the beginning of the journey. We assumed that data would add a little more meaning to our intuitive creativity. And we really wanted lots of beautiful graphs and a better understanding of how our users live. We did not expect at all that within a year we would build a complete data processing system and the company would start breathing to a data-driven rhythm.

    And we really had room to maneuver: every day, our main site and Tribuna.com publish 700 news items and 500 longer texts (97% of them user-generated), which attract 30 thousand comments, 120 thousand ratings, and 2 thousand statuses. All this content not only settles on the sites themselves, with their 750 thousand unique pages and 5 million hits per day, but also spreads via 100 thousand tags across 160 mobile applications (where 200 thousand unique pages generate 8 million hits per day) and 1,200 thematic groups on social networks.

    Sorry, I couldn't resist: crunching numbers is my hobby!

    So, those are our normal numbers. The figures become paranormal on busy days of sporting events, like those happening now in Sochi. Every Olympic day we break the mark of one million unique visitors. Instantaneous traffic can spike tenfold, and the total volume of hits per day grows 1.5-2x.

    We knew that we would have to process a lot of data: during the first months of operating the data processing infrastructure, we accumulated 500 gigabytes of raw material and aggregates. But committing to our own Hadoop cluster or compute farm right away would have been too risky, and we didn't have the resources for it anyway, so we found another way.

    Turquoise indicates components of our own IT infrastructure; orange marks everything we use as an external service
    At the beginning of last year, we didn't know even half of the words in this diagram

    We kept everything simple: page-impression data is collected with the open-source counter Piwik, or more precisely its front-end part; on the backend we have an nginx cluster whose nodes write the counter's calls to access.log. Raw log data is loaded in batches through Amazon S3 into Amazon Redshift, where client sessions are computed from the clickstream of hits. Redshift additionally pulls data from the site's SQL storage to enrich the resulting structures. On top of these structures we have built a set of SQL queries used to generate graphs and reports, as well as for ad-hoc data analysis.
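The sessionization step (computing client sessions from clickstream hits) can be sketched in a few lines. This is a minimal illustration assuming a 30-minute inactivity timeout, which is a common convention, not necessarily our exact Redshift logic:

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity window

def sessionize(hits):
    """Group (user_id, timestamp) hits into sessions per user.

    A new session starts when the gap since the user's previous hit
    exceeds SESSION_TIMEOUT. Returns {user_id: [[ts, ...], ...]}.
    """
    sessions = {}
    last_seen = {}
    # Process hits in chronological order so gaps are measured correctly.
    for user_id, ts in sorted(hits, key=lambda h: h[1]):
        if user_id not in sessions or ts - last_seen[user_id] > SESSION_TIMEOUT:
            sessions.setdefault(user_id, []).append([])  # open a new session
        sessions[user_id][-1].append(ts)
        last_seen[user_id] = ts
    return sessions
```

In the real pipeline the same grouping is expressed as a SQL window query over the raw log table rather than in application code.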

    In addition, we record individual user actions in a NoSQL store for quick computation of the recommendation matrix for our users, process access.log entries for site pages with Okmeter agents, and analyze user content from the SQL stores. As a result, we didn't build the heavy BigData parts (data storage, complex computation, charting and reporting) as our own hardware farm and software stack, but moved them entirely to SaaS. We kept only the collection of raw data and the final consumption of aggregates.

    Who needs boiling water?

    We hope to cover the technical details of the whole analytical infrastructure in a separate post; for now we want to share how we managed to improve our product and business by adding just a small portion of data.

    User Recommendations

    We take this opportunity to send greetings to perfectionists with this illustration.
    We now know a lot about our readers, and not only what they tell us during registration (which teams and athletes they like or dislike), but also what we learn from their behavioral profile: which fan sites or search queries they come from, news with which tags they read and comment on, which posts and comments they upvote or downvote, and what their friends find interesting. We build a preference matrix for each visitor, overlay the recommendation matrix on it, and give each of them relevant hints.
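A minimal sketch of the matching step, assuming preferences are reduced to per-tag weights and items to tag sets (the real matrices are of course richer than this):

```python
def recommend(preferences, items, top_n=3):
    """Rank items by overlap between a user's tag preferences and item tags.

    preferences: {tag: weight} built from explicit likes and behavior.
    items: {item_id: set of tags}. Returns the top_n item_ids by score.
    """
    def score(tags):
        # Sum the user's weights for every tag the item carries.
        return sum(preferences.get(t, 0.0) for t in tags)

    ranked = sorted(items, key=lambda i: score(items[i]), reverse=True)
    return ranked[:top_n]
```

Even this toy version captures the idea: the heavier a user's interest in a tag, the higher the items carrying that tag rank in their hints.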

    Business Metrics Monitoring

    When we get tired of watching fire burn, water flow, and our colleagues work, we watch these graphs

    Building a large mission control room with monitors, graphs, and nice bearded guys is the most obvious thing that comes to mind after launching a data processing infrastructure. We built the charts but skipped the monitors and the bearded men: we certainly don't need emergency response teams.

    But we did set alerts on some business metrics using Okmeter. Although the service is designed mostly for monitoring technical metrics (uptime, traffic, system metrics, etc.), it can build metrics from any data in any SQL store. In six months of operation, the service twice notified us that comments had stopped appearing on the site: on New Year's Eve, and when a broken banner blocked JavaScript on the pages.
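The idea of such a business-metric alert can be sketched as follows; the function name and threshold here are illustrative, not Okmeter's actual API:

```python
def check_comment_rate(counts, floor=1):
    """Return alert messages for windows where comments fell below floor.

    counts: list of (window_label, comment_count) pairs, e.g. produced
    by a periodic SQL query against the comments table.
    """
    alerts = []
    for label, n in counts:
        if n < floor:
            # A healthy site never has a silent window, so this fires
            # exactly in the "comments stopped appearing" scenario.
            alerts.append("ALERT: only %d comments in window %s" % (n, label))
    return alerts
```

The point is that once comments flow through SQL, "did commenting break?" becomes a one-line threshold check rather than a manual glance at the site.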

    But the coolest thing in Okmeter is Play mode, where you can work with a chart (add metrics, apply functions to the data, change display parameters) in real time.

    We plot the rate of new comments on the site in real time

    The service is built by great guys from Russia; registration is still closed for now, but you can ask them for an invite.

    Content Distribution and Production

    Incomprehensible graphs

    Incomprehensible tables

    More incomprehensible tables

    We, of course, keep an eye on what our users write and how they react to materials and news; we look for popular comments and photos, and take the best of them to the editorial office. The user community is, in a sense, a Petri dish in which cool jokes and biting remarks emerge (we spot them by a sharp surge of upvotes or downvotes); there we also test our editorial and social hypotheses, and spread the best material through all our channels: applications, social networks, tag streams.

    At first glance, this activity looks like blatant exploitation of the population and undisguised crowdsourcing, and indeed it is. But! The same mechanism lets young, unknown authors of the Tribune become popular. We notice genuinely bright characters and take them into rotation in the social edition: we post them to the main page, scatter links in streams, and advise them on design and presentation.

    Ad-hoc analysis

    We don't have many funny pictures about analytics, so we're starting to post cats

    With the BigData infrastructure, we finally have a tool that lets us reliably answer questions like “Why did this happen?”, “How can we improve the product?”, “What color button works best here?”, and so on. Every day we use the magic of numbers for estimates and for finding dependencies and correlations.

    For example, we recently launched a personalized prompt to register: we figure out in various ways which team a visitor is interested in, pick a popular player from that team, put the athlete's picture on the registration invitation, and show that visitor a special block. We have already found that such personalized treatment works many times better than cute cats or Guus Hiddink. But we decided to choose the text for this invitation with a simple ABCD test. Here is how the test variants look with our experimental subject, Alexander Amisulashvili of Krylia Sovetov:

    Try to guess the best option yourself and compare it with our results (the sample covered 100,000 impressions).
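For the curious, comparing variants boils down to computing click-through rates. A minimal sketch with made-up numbers, not the actual test data:

```python
def best_variant(results):
    """Pick the test variant with the highest click-through rate.

    results: {variant: (impressions, clicks)}.
    Returns (winning_variant, its_ctr).
    """
    # CTR = clicks / impressions for each variant.
    ctr = {v: clicks / imps for v, (imps, clicks) in results.items()}
    winner = max(ctr, key=ctr.get)
    return winner, ctr[winner]
```

With real traffic you would also want a significance check before declaring a winner, but at 100,000 impressions per test the differences we saw were well outside the noise.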

    CTR results for A. Amisulashvili

    Mailing lists

    For mailings like this once a week, no BigData is needed, of course

    You can measure the effectiveness of email newsletters without special skills or tools. We use the data infrastructure for targeting: selecting users to whom we send personalized messages. We identify users who haven't visited the site for several weeks to tell them by email about the interesting things that happened while they were gone. We remind players about the Fantasy team or prediction tournament they have abandoned. We invite fans to thematic forums on their interests.
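Selecting lapsed users for such a campaign is a simple filter over last-visit timestamps; a sketch with an assumed three-week cutoff, which is an illustrative threshold rather than our exact rule:

```python
from datetime import datetime, timedelta

def inactive_users(last_visits, now, weeks=3):
    """Select users whose last visit was more than `weeks` weeks ago.

    last_visits: {user_id: datetime of last visit}.
    Returns the user_ids eligible for a win-back email.
    """
    cutoff = now - timedelta(weeks=weeks)
    return [u for u, ts in last_visits.items() if ts < cutoff]
```

In practice this is one query against the session aggregates in Redshift, with the resulting list handed to the mailing system.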

    Business analytics

    Take a peek at the full weekly report

    Once a week, the heads of all departments gather for a retrospective of the past week and discuss plans for the current one. Each participant talks about their area, relying on the data in a report generated weekly from the data collected in our infrastructure. We discuss the structure of traffic, sales, competitors and the market as a whole, editorial and SMM activity, development, product, finance, and IT. The value of these meetings lies solely in synchronizing knowledge and priorities across the whole company; there is no bureaucratic reporting under the hood.

    ...And this is only a small part of what BigData already gives us. We have shown you our infrastructure through the keyhole with only one goal: to inspire you to process data in your own project. It really isn't as scary or expensive as it seems at first glance: we avoided capital expenditure on hardware at the start, although we did spend 4 man-months of development. We spend no more than 60 thousand rubles per month on server rental and external services. And we assure you: the insights are worth it.

    Pffff! A post about BigData without a single mention of MapReduce, or at least the word PETABYTE. Cats, get out of here

    Go for it!

    UPD: the continuation and technical part of the post is here
