How we built a click-fraud protection system
Our company works in online advertising. About two years ago we finally became disillusioned with the click-fraud protection built into the ad networks and decided to build our own system, at that point for internal use only.
Below the cut are plenty of technical details about how the system works, along with the problems we ran into and how we solved them. If you just want a look at the system itself, the main picture is clickable.
The first task that needed to be solved was the identification of unique users.
That is, we need to identify a user even if he switches browsers or clears his cookies.
After some deliberation and a series of experiments, we started writing identifiers not only to cookies, but also to the storage of every browser plugin that provides one, plus the smaller things: third-party cookies and the various JS storages.
As a result, we not only identify the user in most cases, but also obtain a kind of digital fingerprint of his computer (OS, screen resolution, color depth, which plugins are present, which JS storages and third-party cookies the browser supports). This lets us identify the user with a high degree of probability even if he manages to wipe everything we have stored on his machine.
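As a rough illustration, such fingerprint attributes can be collapsed server-side into a single stable identifier. This is a minimal sketch with made-up attribute names, not our actual identification code:

```python
import hashlib

def fingerprint(attrs: dict) -> str:
    """Collapse reported browser/OS attributes into a stable fingerprint ID.

    `attrs` holds whatever the client managed to report (OS, screen
    resolution, color depth, plugin list, storage support, ...).
    """
    # Sort keys so the same attribute set always yields the same hash.
    canonical = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

fp = fingerprint({
    "os": "Windows 7", "screen": "1280x1024", "color_depth": 24,
    "plugins": "flash,silverlight", "third_party_cookies": True,
})
```

Even if the user clears every storage, a visitor whose reported attributes match an existing fingerprint can be linked to it with high probability.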
At this stage, there were no particular problems to write about.
The second task was transferring all the user data to our server.
To get the most complete data, we use two scripts: a server-side one (PHP/Python/ASP.NET) and a client-side JS one. This way we receive information even about users who closed the page before it fully loaded and before the client-side JS could run. On teaser advertising, such clicks usually make up no less than 30% of the total, and we have not found other systems that account for them. As a result, we get significantly more data than Metrica, Analytics and all the other statistics systems based on JS counters.
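The server-side half of that idea can be sketched as follows: every field is taken from the raw HTTP request, so the visit gets recorded even when the visitor leaves before the client-side JS runs. Function and field names here are illustrative assumptions:

```python
# Minimal sketch (names are assumptions) of the server-side half of data
# collection: everything comes from the raw HTTP request itself, so the
# visit is recorded even if the client-side JS never executes.

def record_visit(remote_addr: str, headers: dict) -> dict:
    return {
        "ip": remote_addr,
        "user_agent": headers.get("User-Agent", ""),
        "referer": headers.get("Referer", ""),
        "accept_language": headers.get("Accept-Language", ""),
        "js_data": None,  # filled in later only if the client JS reports back
    }

visit = record_visit(
    "203.0.113.9",
    {"User-Agent": "Mozilla/5.0", "Referer": "http://teaser.example/ad"},
)
```

The client-side script then enriches this record when (and if) it gets the chance to run; the roughly 30% of visits where it never does are still counted.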
And so we come smoothly to the third task: choosing the hardware.
Architecturally, the system currently consists of four segments:
- Frontend
- Data collection and processing
- Landing page indexing
- Storage of usernames/passwords for third-party services
All domains are managed through Amazon Route 53 with a TTL of 60 seconds, so that in case of any server problems they can quickly be migrated to backups. There is nothing special to say about the frontend: the load on it is small, and almost any VPS can handle it.
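For illustration, this is the shape of a record update sent to the Route 53 ChangeResourceRecordSets API (e.g. via boto3) to repoint a domain at a backup server; the domain and IP below are made up:

```python
# Route 53 change batch repointing a domain at a backup server.
# With a 60-second TTL, resolvers pick up the switch within a minute.
change_batch = {
    "Comment": "fail over to backup server",
    "Changes": [{
        "Action": "UPSERT",  # create the record or overwrite the existing one
        "ResourceRecordSet": {
            "Name": "tracker.example.com.",      # hypothetical domain
            "Type": "A",
            "TTL": 60,                            # low TTL = fast failover
            "ResourceRecords": [{"Value": "203.0.113.7"}],  # backup IP
        },
    }],
}
```

This payload would be passed to `change_resource_record_sets` on a boto3 Route 53 client along with the hosted zone ID.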
With the collection and processing of data, everything is somewhat more complicated, since it is necessary to work with large volumes of data. Today, we process about 200 requests every second.
Thanks to the correct initial choice of hardware and software, one server copes with this volume perfectly.
As for hardware: an 8-core AMD CPU, RAID10 over SAS disks, 16 GB of RAM.
Data collection is handled by a tuned nginx + php-fpm + MySQL stack; processing is done by C++ programs.
At first we hit a problem: the data-collection script was consuming too much CPU. The solution was completely unexpected: replacing all of PHP's ereg_* functions with their preg_* equivalents (e.g. `ereg_replace('^[0-9]+', '', $s)` becomes `preg_replace('/^[0-9]+/', '', $s)`) cut CPU consumption roughly eightfold, which surprised us a great deal.
In case of problems with the current server, or if scaling is needed, a second server of similar configuration waits in the wings in another DC and can be brought into service within an hour.
Landing pages are indexed by a separate server with a dedicated IP block; it is quite hungry for CPU and RAM but makes almost no demands on the disk subsystem. Indexing is done by a "search bot" written in Python.
This node is not duplicated; however, replacing or expanding it takes less than a day, and it does not directly affect the quality of traffic analysis. In the worst case, a few advertising campaigns will not be paused if a client's site goes down or our code disappears from it.
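One of the bot's checks can be sketched like this: verify that a fetched landing page still contains our tracking script, so the system knows whether the code has disappeared from the site. The tracker URL and the check itself are assumptions for illustration:

```python
# Toy version of one indexing-bot check: does the fetched landing page
# still contain our tracking <script> tag?
from html.parser import HTMLParser

TRACKER_SUFFIX = "/t.js"  # hypothetical tracker script filename

class TrackerCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Look for <script src="...t.js"> anywhere on the page.
        if tag == "script" and dict(attrs).get("src", "").endswith(TRACKER_SUFFIX):
            self.found = True

def page_has_tracker(html: str) -> bool:
    checker = TrackerCheck()
    checker.feed(html)
    return checker.found
```

The real bot fetches pages itself; here the HTML would simply be passed in as a string.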
Storing usernames and passwords for third-party services is a rather delicate matter and, generally speaking, not a good thing from a security point of view.
However, for most advertising networks the API does not provide all the necessary functionality, so we have to parse their web interfaces, which is hard to do without the password. In Google AdWords, for example, banning IP addresses is possible only through the web interface. As a bonus, users can open their ad-network accounts from our system's interface with a single click.
Hence the fourth task: ensuring the security of data that has to be stored in recoverable form.
To store this data as securely as possible, we devised the following scheme:
- If the password is entered by the user through our web interface:
  - it is placed in the frontend database, symmetrically encrypted with the user's password for our service;
  - it is also placed in the frontend database, asymmetrically encrypted with the storage server's public key;
  - the storage server periodically polls the frontend database, fetches the encrypted ad-network passwords, decrypts them with its private key and puts them into its own database.
- If the password is generated by us on the storage server:
  - it is placed in the storage database;
  - on the user's next login, his service password is placed in the frontend database, asymmetrically encrypted with the public key;
  - the storage server periodically polls the frontend database, fetches the encrypted passwords and decrypts them with its private key;
  - it then symmetrically encrypts the ad-network passwords from its database with the corresponding users' service passwords and pushes them, encrypted, back to the frontend.
- When a user logs in to our service, his password is kept (by a specific method) inside the JS and is used to decrypt the ad-network passwords on the client side and for one-click login.
- Access to the storage server is allowed only from a short list of IPs that we control.
- The storage server's IP is kept secret, and it accepts no incoming requests.
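The data flow of the scheme above can be sketched in a few lines. The "encryption" here is a toy XOR keystream standing in for real asymmetric (RSA-style) and symmetric ciphers; it only shows which side holds which key and must not be mistaken for real cryptography:

```python
# Toy stand-in for real encryption: a SHA-256-derived XOR keystream.
# NOT real cryptography; it only illustrates the data flow of the scheme.
import hashlib

def _keystream(key: bytes, n: int) -> bytes:
    out = b""
    i = 0
    while len(out) < n:
        out += hashlib.sha256(key + i.to_bytes(4, "big")).digest()
        i += 1
    return out[:n]

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))

toy_decrypt = toy_encrypt  # XOR is its own inverse

# Frontend side: user enters an ad-network password via the web UI.
public_key, private_key = b"toy-keypair", b"toy-keypair"  # toy: one key plays both roles
user_service_password = b"user-master-pass"
adwords_password = b"secret-adwords-pass"

frontend_db = {
    # symmetric copy: decryptable client-side with the user's own password
    "sym": toy_encrypt(user_service_password, adwords_password),
    # asymmetric copy: decryptable only on the storage server
    "asym": toy_encrypt(public_key, adwords_password),
}

# Storage side: poll the frontend DB and decrypt with the private key.
storage_db = {"adwords": toy_decrypt(private_key, frontend_db["asym"])}
```

The point of the two copies is that the frontend never holds a usable plaintext: one copy needs the user's password (which only lives in his browser session), the other needs the private key (which only lives on the hidden storage server).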
Since we still cannot parse some web interfaces without full browser emulation, the storage server is demanding on RAM and CPU. Here, too, a backup server in another DC stands ready to take over within an hour in case of unforeseen circumstances.
The fifth and final task was integrating with the ad networks to automatically ban "bad" IPs and sites.
There were no problems with the relatively small networks such as Begun and MarketGuide: all interaction works through their APIs, and when some method is missing, the partners add it quickly.
But with Yandex.Direct, and especially AdWords, there are problems aplenty. Getting API access in AdWords turns into a whole quest: first you spend a month obtaining it, then it turns out half the functions are missing and you still have to parse the web interface. Then it turns out that even the functions that do exist are rigidly limited by units, which cannot be purchased at the Basic API level, and a new quest begins for the next access level with its expanded unit quota. As you can see, the search giants do everything they can to make it hard for advertisers to optimize their advertising costs. Nevertheless, at the moment we successfully analyze their traffic and clean it automatically.
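The final ban step can be sketched as a small planner that, for each network, chooses between an API call and driving the web interface. The network names match the article; the action format and function name are assumptions:

```python
# Hypothetical sketch of the automatic-ban step: given fraudulent IPs and
# sites found by the analyzer, plan one ban action per item, routed either
# through a network's API or through web-interface automation.

NETWORKS_WITH_API = {"Begun", "MarketGuide"}        # banned via API calls
# Yandex.Direct and AdWords lack the needed API methods, so they are
# handled by driving the web UI.

def plan_bans(network: str, bad_ips: list, bad_sites: list) -> list:
    method = "api" if network in NETWORKS_WITH_API else "web-ui"
    actions = [{"network": network, "method": method, "ban_ip": ip}
               for ip in bad_ips]
    actions += [{"network": network, "method": method, "ban_site": s}
                for s in bad_sites]
    return actions

plan = plan_bans("AdWords", ["198.51.100.4"], ["spammy-site.example"])
```

A worker process would then execute each action, batching them where the network's unit limits require it.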
The bottom line: at the moment our system has no real competitors with similar capabilities and, most importantly, comparable quality of low-quality-traffic detection. In some cases we see 40-45% more traffic than other analytics systems do.
On average, a traffic audit costs about 100 times less than the advertising being audited, and for some advertising systems the service is completely free. Meanwhile, the savings run from 10% to 50% of the advertising budget, and sometimes up to 90%.
Currently the system works fully automatically with Yandex.Direct, Google AdWords, Begun and MarketGuide. For any other advertising system, the service runs in traffic-audit mode, with fraudulent IPs and sites subsequently added to that network's blacklist manually.