Neither GA nor YM: how we built our own clickstream
We collect more than two billion analytics events per day. Thanks to this, we can find out a lot of useful things: whether users click on hearts more than on stars, at what time of day they write more detailed descriptions, and in which regions they most often miss the green buttons.
The system for collecting and analyzing events can collectively be called the clickstream. I'll tell you about the technical side of the clickstream at Avito: how events are structured, how they are sent and delivered, and how analytics and reports are built on top of them. I'll also cover why you would want your own system when Google Analytics and Yandex.Metrica exist, whose lives clickstream developers make harder, and why Go programmers cannot forget PHP.
About me: Dmitry Khasanov, ten years in web development, three years at Avito. I work in the platform team, developing common infrastructure tools. I love hackathons.
The business needs a deep understanding of the processes occurring on the site. For example, when a user registers, we want to know from which region, from which device and through which browser the user came; how the form fields were filled in; whether the form was submitted or the user gave up; and if the user gave up, at which step and after how long.
We want to know whether a button is pressed more often if it is repainted green. Who clicks the green button more often: users of the mobile applications or of the site, in Murmansk or in Vladivostok, by day or by night; users coming from the main page or from search; users who have bought on Avito before or who came for the first time?
All of these attributes (operating system, user ID, request time, device, browser, field values) must be made available for analysis. The data has to be collected, structured, and made quickly accessible.
In addition, it is often necessary to split the flow of events: projects need to take action when certain events occur. This is how, for example, feedback is obtained for further training of the pattern recognition and auto-moderation models, and how real-time statistics are built.
With the clickstream as a product, programmers should be able to easily send events from a project, and analysts should be able to manage the collected events and build a variety of reports that show trends and confirm hypotheses.
Reports based on the flow of events.
We know about Yandex.Metrica and Google Analytics, and we use them for some tasks. They make it easy and fast to collect analytical data from frontends. But exporting data from backends to external analytical systems requires elaborate integrations.
With external tools, we would have to solve the problem of splitting the flow of events ourselves.
Analytical information is very valuable. We have been collecting it for years, and it lets us know in the smallest detail how our users behave. We don't want to share such knowledge with the outside world.
Legislation requires us to store the data in Russia.
These reasons were quite enough to develop our own solution as the main tool for collecting and processing analytical data.
Events are sent via a high-performance transport (Event Streaming Processing, ESP) to the storage (Data Warehouse, DWH). Analytical reports are built on the data in the storage.
The event is the central entity. In itself, it denotes a fact: something specific happened at a specific moment in time.
One event must be distinguishable from another. For this, each event has a unique identifier.
We are also interested in when events occur. Every event carries its time with microsecond precision. In events arriving from frontends, we additionally record the time on the client device in order to reconstruct the sequence of actions more accurately.
An event consists of fields. A field is the smallest semantic unit of the analytical system. The previous paragraphs already contain examples of fields: the event identifier and the time of sending.
A field has the following attributes: its type (string, number, array) and whether it is required.
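To make the vocabulary concrete, here is a minimal sketch of an event and a field description. The class names, the field names, and the use of a UUID as the identifier are assumptions made for illustration; the article does not describe the actual format.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class FieldSpec:
    """Directory description of one field: its type and whether it is required."""
    name: str
    kind: type
    required: bool = False


def now_micros() -> int:
    """Current Unix time in microseconds."""
    return time.time_ns() // 1_000


@dataclass
class Event:
    """A fact: a unique identifier, the time it happened, and named fields."""
    name: str
    fields: Dict[str, Any] = field(default_factory=dict)
    # Assumption: a random UUID serves as the unique event identifier.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    time_micros: int = field(default_factory=now_micros)


# Hypothetical field description and event.
REGION = FieldSpec("region", str, required=True)
ev = Event("user_registered", {"region": "Murmansk", "browser": "Firefox"})
```

Each `Event` gets its own identifier and timestamp at creation time, so two events with the same name and fields remain distinguishable.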
The same event can occur in different parts of the system: for example, authorization is possible on the website or in a mobile application. In this case we send the same event, but always add the unique identifier of the event source.
Sources differ markedly from each other. They can be internal daemons and cron jobs, a frontend or backend service, or a mobile application. Some fields must be sent with every event of a specific source.
This gives rise to the concept of an “environment”: a logical grouping of events by source, with the ability to define fields common to all events of that source.
Examples of environments: “backend of service A”, “frontend of service A”, “ios-application of service A”.
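The relation between sources, environments, and common fields might be sketched as follows. The `Environment` class, the numeric `source_id` values, and the field names are hypothetical and only illustrate the idea of attaching environment-wide fields to every event.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Environment:
    """A group of events from one source, with fields common to all of them."""
    name: str
    source_id: int  # assumption: sources are identified by an integer
    common_fields: Dict[str, Any] = field(default_factory=dict)

    def build(self, event_name: str, **fields: Any) -> Dict[str, Any]:
        """Combine the source id, the common fields, and event-specific fields."""
        payload: Dict[str, Any] = {"event": event_name, "source_id": self.source_id}
        payload.update(self.common_fields)
        payload.update(fields)
        return payload


ios_app = Environment(
    "ios-application of service A",
    source_id=3,
    common_fields={"platform": "ios", "app_version": "1.2.3"},
)
web = Environment("frontend of service A", source_id=2, common_fields={"platform": "web"})

# The same event sent from two different sources.
login_ios = ios_app.build("authorization", user_id=42)
login_web = web.build("authorization", user_id=42)
```

Both payloads describe the same event, but each carries its own source identifier and the common fields of its environment.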
All existing events are described in a directory that developers and analysts can edit. Events are logically grouped by environment, each event has an owner, and the directory keeps a change log.
At the moment, the directory describes several hundred fields, several dozen environments, and more than a thousand events.
We stopped the torture and no longer force developers to write event-sending code by hand. Instead, based on the directory, we generate a set of files for each of the server languages supported in the company: PHP, Go, and Python. This generated code is called a “langpack”.
The files in a langpack are as simple as possible and know nothing about the business logic of projects. A langpack is a set of field getters and setters for each event, plus the code for sending the event.
One langpack is created per environment. It is published to a package repository (Satis for PHP, PyPI for Python) and is updated automatically when changes are made to the directory.
You can't just stop writing PHP. The langpack-generating service is written in Go. The company has plenty of PHP projects, so I had to recall my favorite three-letter programming language and generate PHP code from Go. If you get a little carried away, you can also generate tests and use them to verify the generated code.
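The idea of generating setters, getters, and a send method from a directory entry can be sketched in a few lines. This is not the actual generator (which is written in Go and emits PHP, Go, and Python); here Python generates Python purely for illustration, and the function name, the event name, and the shape of the generated class are assumptions.

```python
from typing import List


def generate_event_class(event_name: str, version: int, fields: List[str]) -> str:
    """Return Python source for a class with a setter and getter per field."""
    class_name = "".join(part.capitalize() for part in event_name.split("_"))
    lines = [
        f"class {class_name}:",
        f"    NAME = {event_name!r}",
        f"    VERSION = {version}",
        "",
        "    def __init__(self, send):",
        "        self._send = send",
        "        self._data = {}",
    ]
    for name in fields:
        lines += [
            "",
            f"    def set_{name}(self, value):",
            f"        self._data[{name!r}] = value",
            "        return self",
            "",
            f"    def get_{name}(self):",
            f"        return self._data.get({name!r})",
        ]
    lines += [
        "",
        "    def send(self):",
        "        self._send({'event': self.NAME, 'version': self.VERSION, **self._data})",
    ]
    return "\n".join(lines) + "\n"


# Generate the code for a hypothetical event, load it, and use it.
source = generate_event_class("item_view", 3, ["item_id", "user_id"])
namespace: dict = {}
exec(source, namespace)

sent: list = []  # stands in for the real transport
namespace["ItemView"](sent.append).set_item_id(10).set_user_id(42).send()
```

The generated class contains no business logic: it only knows the event name, its version, and its fields, which matches the description of a langpack above.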
The directory can be edited. Production code must not break. Yet we generate production code from the directory. Dangerous.
After each change to an event, a new version of it is created in the directory. Every version of every event ever created lives in the directory forever. This solves the problem of the immutability of specific events. Projects always state which version of an event they work with.
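A minimal sketch of such an append-only store, assuming an in-memory dictionary and 1-based integer version numbers (the real directory is a service whose storage is not described in the article):

```python
from typing import Dict, List, Tuple


class EventDirectory:
    """Append-only store: saving a change creates a new version, nothing is edited."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Tuple[str, ...]]] = {}

    def save(self, event_name: str, fields: List[str]) -> int:
        """Store a new version of the event and return its 1-based number."""
        versions = self._versions.setdefault(event_name, [])
        versions.append(tuple(fields))  # a tuple, so a stored version cannot be mutated
        return len(versions)

    def get(self, event_name: str, version: int) -> Tuple[str, ...]:
        """Return the fields of a specific version of the event."""
        return self._versions[event_name][version - 1]


directory = EventDirectory()
v1 = directory.save("item_view", ["item_id"])
v2 = directory.save("item_view", ["item_id", "user_id"])
```

A project pinned to `v1` keeps seeing exactly the fields it was built against, no matter how many later versions are added.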
If the langpack code itself changes (for example, there were only setters, and then we decided to add getters), a new version of the langpack is created. It, too, will live forever. Projects always request a specific version of the langpack for their environment. This solves the problem of keeping the langpack interface stable.
We use semver-style version numbers. Each langpack version consists of three numbers. The first is always zero, the second is the version of the langpack code, and the third is an increment. The third number changes most often: after every change to events.
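The numbering scheme can be written down as a small value type. The class and method names are hypothetical, and resetting the increment to zero when the langpack code version changes is an assumption; the article does not say what happens to the third number in that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LangpackVersion:
    """Version of the form 0.<langpack code version>.<increment>."""
    code_version: int
    increment: int

    def __str__(self) -> str:
        return f"0.{self.code_version}.{self.increment}"

    def after_event_change(self) -> "LangpackVersion":
        """Events in the directory changed: bump the third number."""
        return LangpackVersion(self.code_version, self.increment + 1)

    def after_code_change(self) -> "LangpackVersion":
        """The generated code itself changed: bump the second number.

        Assumption: the increment restarts from zero.
        """
        return LangpackVersion(self.code_version + 1, 0)


current = LangpackVersion(code_version=4, increment=17)
```

Here `str(current)` is `0.4.17`, `str(current.after_event_change())` is `0.4.18`, and `str(current.after_code_change())` is `0.5.0`.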
Two-level versioning makes it possible to edit the directory without breaking code in production. It rests on two principles: nothing can be deleted, and created entities cannot be edited; you can only create modified copies alongside them.
For the transport, we hid NSQ behind a thin layer of Go code, deployed collectors to every node of the Kubernetes cluster using DaemonSets, and wrote consumers that can write events to different destinations.
At the moment the transport delivers about two billion events per day. Thirty collectors handle this load with some headroom. Each consumes a little more than one CPU core and a little more than one GB of memory.
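For a sense of scale, the figures above can be turned into average rates. This is back-of-the-envelope arithmetic only: it assumes a perfectly even load across the day and across collectors, which real traffic does not have.

```python
EVENTS_PER_DAY = 2_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60
COLLECTORS = 30

# Average rates, ignoring daily peaks and uneven distribution between nodes.
avg_rate = EVENTS_PER_DAY / SECONDS_PER_DAY
per_collector = avg_rate / COLLECTORS

print(f"{avg_rate:,.0f} events/s in total, {per_collector:,.0f} events/s per collector")
# prints: 23,148 events/s in total, 772 events/s per collector
```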
Event senders can be projects that live inside or outside our cluster. Inside the cluster, these are service backends, cron jobs, daemons, infrastructure projects, and intranet applications. Events from the frontends of public projects, from mobile applications, and from partner projects arrive from outside.
To receive events from outside the cluster, we use a proxy: a common entry point that does light filtering of the event flow and can enrich events. From there, events are sent to the transport according to the general scheme.
The general routing scheme: each event can have a set of recipients. Potential recipients include the common analytical storage (DWH) and the RabbitMQ or MongoDB instances of projects interested in certain events. The latter case is used, for example, for additional training of the ad auto-moderation models: the models listen to certain events and receive the feedback they need.
Projects know nothing about routing. They send events using langpacks, which have the addresses of the common collectors built in.
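The routing idea, one incoming event fanned out to a set of recipients, can be sketched as a small subscription table. The `Router` class is hypothetical, recipients are plain callables standing in for DWH, RabbitMQ, or MongoDB writers, and delivering every event to the analytical storage is an assumption.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Sink = Callable[[Event], None]


class Router:
    """Fans each event out to the common storage and to interested projects."""

    def __init__(self, storage: Sink) -> None:
        self._storage = storage  # assumption: every event goes to the DWH
        self._subscribers: Dict[str, List[Sink]] = defaultdict(list)

    def subscribe(self, event_name: str, sink: Sink) -> None:
        """Register a project-specific recipient for one kind of event."""
        self._subscribers[event_name].append(sink)

    def route(self, event: Event) -> None:
        self._storage(event)
        for sink in self._subscribers.get(event["event"], []):
            sink(event)


dwh: List[Event] = []
moderation_queue: List[Event] = []

router = Router(dwh.append)
router.subscribe("item_rejected", moderation_queue.append)

router.route({"event": "item_view", "item_id": 10})
router.route({"event": "item_rejected", "item_id": 11})
```

The sender never sees this table: it hands the event to a collector, and recipients can be added or removed without touching the projects.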
The main event storage is HP Vertica, holding several tens of terabytes. It is a columnar database whose characteristics suit our analysts. The interface is Tableau, used for building reports.
It is more efficient to write events to the storage in large batches, so a buffer in the form of MongoDB sits in front of it. A collection is created automatically for every hour and later deleted automatically. Collections are kept for several days so that the load into Vertica can be rerun if something goes wrong.
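Hourly buffer collections with automatic cleanup can be sketched with two small functions. The naming pattern `events_YYYYMMDD_HH`, the use of UTC, and the three-day retention period are assumptions; the article only says "every hour" and "several days".

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

RETENTION = timedelta(days=3)  # assumption: "several days"


def collection_name(event_time: datetime) -> str:
    """Name of the hourly buffer collection an event belongs to (UTC)."""
    return event_time.astimezone(timezone.utc).strftime("events_%Y%m%d_%H")


def expired(names: Iterable[str], now: datetime) -> List[str]:
    """Collections older than the retention period, safe to delete.

    Names are zero-padded and share one format, so comparing them as
    strings is the same as comparing them by time.
    """
    cutoff = collection_name(now - RETENTION)
    return sorted(name for name in names if name < cutoff)


now = datetime(2019, 3, 5, 14, 30, tzinfo=timezone.utc)
existing = ["events_20190301_10", "events_20190302_14", "events_20190305_13"]
to_delete = expired(existing, now)
```

With these inputs, events arriving at `now` go to `events_20190305_14`, and only `events_20190301_10` is old enough to be deleted.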
Loading from the MongoDB buffer into the storage is done by Python scripts. The scripts rely on the directory, and we try not to keep business logic in them. Events can be enriched at this stage.
Manual dancing in the dark
The need to log events arose much earlier than the realization that a directory was needed. Developers in each project invented their own way of sending events and looked for their own transport. This produced a lot of code in different languages, spread across different projects, all solving the same problem.
Pieces of business logic often lived inside the event-sending code. Code with such knowledge cannot be ported to other projects. When refactoring, the business logic has to be moved back into the project, leaving in the event code only conformance to the specified data format.
At this stage there was no directory of events. The only way to find out which events were already logged and which fields they had was to read the code. The only way to learn that a developer had accidentally stopped writing data to a required field was while building a report, and only if you looked closely.
There were not many events. Buffer collections in MongoDB were added as needed. As the number of events grew, events had to be redirected to other collections by hand, and new collections had to be set up. The decision about which buffer collection an event went to was made at sending time, on the project side. Fluentd served as the transport, with td-agent as the client.
We decided to create a directory of all existing events. We parsed the backend code and extracted part of the information from it. We obliged developers to record every change to event code in the directory.
Events arriving from frontends and mobile applications were described by hand, sometimes by picking the necessary information out of the event flow at the transport level.
Developers tend to forget. This led to the directory and the code drifting apart, but the directory still showed the overall picture.
The number of buffer collections grew significantly, and so did the manual work of maintaining them. An irreplaceable person appeared, holding a pile of secret knowledge about the buffer storage.
We created a common transport, ESP, which knows about all event delivery points, and made it the single point of reception. This allowed us to control all event streams. Projects stopped accessing the buffer storage directly.
We started generating langpacks from the directory. They do not allow invalid events to be created.
We implemented automatic validation of events arriving from frontends and mobile applications. In this case we do not stop writing the events, so as not to lose data, but we log the errors and notify the developers.
The few backend events that are difficult to refactor and are still not sent through langpacks are validated by a separate library according to the rules from the directory. On errors it throws an exception, which blocks the rollout.
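The two validation policies, "log but keep writing" for frontends and "throw and block the rollout" for backends, can share one check. This is a sketch only: the schema format, the function names, and the error messages are hypothetical and do not reproduce the actual validation library.

```python
from typing import Any, Dict, List, Tuple

# Field name -> (expected type, required?), as it might come from the directory.
Schema = Dict[str, Tuple[type, bool]]


class InvalidEvent(Exception):
    """Raised for backend events that do not match the directory."""


def find_errors(event: Dict[str, Any], schema: Schema) -> List[str]:
    """Compare an event with its directory description."""
    errors = []
    for name, (kind, required) in schema.items():
        if name not in event:
            if required:
                errors.append(f"missing required field: {name}")
        elif not isinstance(event[name], kind):
            errors.append(f"wrong type for field: {name}")
    errors += [f"unknown field: {name}" for name in event if name not in schema]
    return errors


def validate_backend(event: Dict[str, Any], schema: Schema) -> None:
    """Backend policy: an invalid event is an error that stops the rollout."""
    errors = find_errors(event, schema)
    if errors:
        raise InvalidEvent("; ".join(errors))


def validate_frontend(event: Dict[str, Any], schema: Schema, log: List[str]) -> Dict[str, Any]:
    """Frontend policy: record the errors, but still pass the event on."""
    log.extend(find_errors(event, schema))
    return event


schema: Schema = {"item_id": (int, True), "query": (str, False)}
log: List[str] = []
kept = validate_frontend({"item_id": "10"}, schema, log)
```

The frontend event with a string `item_id` is still returned for writing, while `log` now holds the type error; the same event passed to `validate_backend` would raise `InvalidEvent`.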
We ended up with a system that strives to match the directory. The bonuses: transparency, controllability, and speed of creating and changing events.
The main difficulties and lessons were organizational. It is difficult to coordinate initiatives that involve multiple teams. It is not easy to change the code of a large old project. What helps: well-established communication with other teams, splitting tasks into relatively independent parts, and designing integrations in advance so that they can be rolled out independently. Product teams stop loving the clickstream developers once the integration phase of a new solution begins. If interfaces change, everyone gets more work.
Creating the directory was a very good idea. It became the single source of truth that can always be appealed to when the code disagrees. A lot of automation is tied to the directory: checks, event routing, code generation.
Infrastructure does not need to know about business logic. Signs that business logic has crept in: events change on the way from the project to the storage, or changing the transport without changing the projects becomes impossible. The infrastructure side should know the composition of events, the types of fields, and whether they are required. The product side should know the logical meaning of these fields.
There is always room to grow. Technically, this means handling more events, reducing the time from creating an event to the start of data recording, and eliminating manual work at all stages.
There are also a couple of bold ideas: getting a detailed graph of user transitions, and configuring events on the fly without rolling out the service. But more on that in the following articles.