MT_FREE day to day: several stories about how third-party services affect public Wi-Fi

    The Internet is a large, dynamic environment in which everything is connected in one way or another and everything can influence everything else. This kind of relationship, where a small change in one part of a system leads to a complete change in another, is popularly called the “butterfly effect”. It perfectly illustrates how one pair of “well-placed boots on the console” can bring down a major service and, along with it, a couple of unrelated ones. That is what we will talk about.

    Five years ago, when Wi-Fi in the subway had just appeared ...

    ... it was a phenomenon that divided the life of Muscovites into “before” and “after”. At the time, the project was the only one of its kind in the world, and everything about it was just as unique: the network structure, the monetization model, the user services, and the approaches to construction and operation.

    Almost from the launch of the first Wi-Fi segment in the metro, we had authorization and our own media portal. We experimented freely with integrating third-party services into the portal, in effect exploring the capabilities of our business model ("what if we sell coffee in the subway with delivery to the lobby entrance?!").

    At first, we actively brought in partners from various fields. But almost every launch of a new partner service ended with that service collapsing under load and an emergency rollback of the changes. Few services can survive thousands of new requests per minute, and some cannot in principle because their architecture does not scale. This problem forced us to monitor the performance of partner services, on which the user experience directly depends, and to develop mechanisms that reduce this dependence (proxying, caching).

    There was a time when a loud cry of "Five hundred!" in the office set the whole company in motion; now such situations practically never occur. The screenshot from July 2015 shows the result of launching a flower delivery service on our subdomain.
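
    The idea of reducing dependence on a partner through a cache can be illustrated with a minimal sketch. It is not our production code: the class name `CachedPartner`, the `fetch` callback and the TTL value are all assumptions made for the example. The wrapper serves fresh answers from memory and, if the partner fails, falls back to the last good answer instead of passing the error on to the user.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachedPartner:
    """Hypothetical wrapper that shields users from partner failures."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # call to the partner service (assumed)
        self._ttl = ttl_seconds      # how long an answer counts as fresh
        self._clock = clock          # injectable clock, handy for tests
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        now = self._clock()
        cached = self._store.get(key)
        # Fresh entry: answer from memory, the partner sees no request.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        try:
            value = self._fetch(key)
        except Exception:
            # Partner is down or overloaded: serve the stale copy if any,
            # otherwise report "nothing to show" instead of an error.
            return cached[1] if cached is not None else None
        self._store[key] = (now, value)
        return value
```

    With such a wrapper, the number of requests reaching the partner is bounded by the number of distinct keys per TTL window rather than by the number of users on the portal.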

    But evolution is never fast. Before we built the current system, we had to take our lumps and live through a whole series of outages ourselves. And the process never stops: the deeper we dig, the more unexpected dependencies we uncover. Looking back, we understand how important it sometimes is to have an example of how these things happen. That is what we want to share.

    A new iOS release cut traffic by 20%

    MaximaTelecom specializes in building networks in transport. The vast majority of subscriber devices on our network are mobile: smartphones and tablets running Android and iOS. Both vendors, Google and Apple, release updates to their operating systems on their own roadmaps, and new versions often change the modules responsible for connecting to Wi-Fi. In the best case, traffic grows on release day because devices download the update over Wi-Fi. But there are catastrophic cases too.

    Just last year, Apple released iOS 10.3.1, after which network traffic fell by almost 20%. It turned out that the new version “broke” the process of connecting to the network: the captive portal authorization mechanism stopped working, and devices could not log into MT_FREE. We had to release an emergency fix to correct the situation. The underlying problem was fixed three minor updates later, after we opened a case in Apple's bug tracker.

    Requests per minute to the auth.wi-fi.ru authorization page. The graph clearly shows the numbers falling well behind those of the previous period.

    The situation is aggravated by the fact that Wi-Fi is a rather old and extremely widespread technology that was never designed to be used on the scale we have in the Moscow Metro. As a result, we deal with a whole "salad" of devices, each of which behaves on the network in its own way. Flat metrics such as abstract megabytes or “spherical subscribers on the network” do not work for us. Any service, whether basic Internet access, the media portal or a mobile application, has to be examined in the context of specific devices and operating systems, because a problem may affect one specific and fairly narrow group.

    ... and a few dozen of the most exotic options.
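
    Why flat totals are not enough can be shown with a small sketch. Everything in it is an assumption made for illustration: the function name, the thresholds and the idea of keying counts by an (operating system, version) pair. It compares authorization counts per segment with a baseline period and reports only the segments that fell sharply.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # (operating system, version), hypothetical key


def degraded_segments(baseline: Dict[Segment, int],
                      current: Dict[Segment, int],
                      max_drop: float = 0.3,
                      min_baseline: int = 100) -> List[Tuple[Segment, float]]:
    """Return segments whose count fell by more than max_drop, worst first."""
    result = []
    for segment, before in baseline.items():
        if before < min_baseline:
            continue  # too few devices to judge, skip the noise
        after = current.get(segment, 0)
        drop = (before - after) / before  # relative drop, may be negative
        if drop > max_drop:
            result.append((segment, drop))
    return sorted(result, key=lambda item: item[1], reverse=True)
```

    With invented numbers, a total that falls from 10,000 to 8,200 authorizations (18%) can hide a single segment that fell from 2,000 to 200 (90%) while all the others stayed flat; the per-segment view points straight at that group.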

    This is not a DDoS: a mobile operator's outage raised traffic by almost a third

    Two years ago, one of the mobile operators had a major outage. In such cases, users look for an alternative means of communication. As for the metro, there were no alternative means of communication on the trains at all at that time.

    Even now, only a few operators provide service in areas equipped with radiating cable. This technology is very limited in capacity and cannot provide a comparable level of service to a significant share of users, not to mention the cost of traffic on capped tariff plans.

    At the stations, however, cellular coverage is quite well developed, not to mention the above-ground segments, where Wi-Fi competes with it directly.

    We learned about the outage on the mobile operator’s network from our dispatch service, which reported that we were under attack. The growth in users and traffic was such that at first we thought we were being DDoSed. We learned the real reason for the surge later, when we found out that a third of our employees had no working mobile service.

    This is how it looked among our above-ground Wi-Fi users.

    What is specific to our situation is that we run Wi-Fi networks, so it makes no difference to us which operator's SIM card is installed in the user's device.

    It is worth mentioning that the outage also had a partly negative effect on our service. Some segments of the MT_FREE network, in particular the networks in city buses and commuter trains, use cellular links as their backbone, so an outage on a cellular network degrades service in those segments.

    Wi-Fi in the subway without ads? YES!

    Advertising is the foundation of free access to the MT_FREE network: it is what makes the service possible and makes it pay for itself. We have used AdFox as our base ad server for many years. Interestingly, the ad server itself has not changed significantly in all the time we have worked with it. One of its peculiarities is its system for collecting impression statistics, which works in hourly intervals. This causes rhythmic peaks in the service's response time: every hour, exactly on the hour, the ad rotator starts to “act up” and ponders every response. We did not notice this nuance right away!

    AdFox response time for an ad request over time. Spikes and dips at the hour boundary are clearly visible.
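
    A periodic effect like this is easy to miss by eye but simple to check numerically. The sketch below is only an illustration under assumed inputs: samples are (Unix timestamp in seconds, response time in milliseconds) pairs, and the function name and the two-minute window are hypothetical. It compares the mean response time near the hour boundary with the mean for the rest of the hour.

```python
from statistics import mean
from typing import Iterable, Tuple


def hour_boundary_ratio(samples: Iterable[Tuple[int, float]],
                        window_minutes: int = 2) -> float:
    """Mean response time near the hour boundary divided by the mean elsewhere.

    Minute-of-hour is taken from the Unix timestamp, which matches local
    hour boundaries for time zones with whole-hour offsets.
    """
    near, rest = [], []
    for timestamp, response_ms in samples:
        minute = (timestamp // 60) % 60
        if minute < window_minutes or minute >= 60 - window_minutes:
            near.append(response_ms)
        else:
            rest.append(response_ms)
    if not near or not rest:
        raise ValueError("need samples both near and away from the hour boundary")
    return mean(near) / mean(rest)
```

    A ratio that stays well above 1 over several days suggests a rhythm tied to the hour rather than random noise, which is a reason to look at what the third-party service does on that schedule.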

    In fact, we observed the same characteristic hourly “peaks” in the number of impressions in other monitoring tools, for example in Yandex.Metrica. But we want to talk about a more extreme situation. Last winter, AdFox suffered a serious outage: the service did not respond for a long time. In our metrics, this showed up as a lack of user authorizations and a sharp drop in portal performance. At the same time, the AdFox management interface was unavailable and showed a certificate error.

    Illustration of adfox.ru certificate error.

    After running a couple of tests and calling AdFox directly, we learned about the outage, and we had no choice but to let all identified users onto the network without advertising.

    And this is how the outage looked in Yandex.Metrica on our portal.
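
    The decision to let users in without advertising was made by hand, but the same "fail open" behavior can be sketched in code. This is an assumption about one possible way to automate it, not a description of our authorization service: `authorize`, `fetch_ad` and the timeout value are hypothetical names and numbers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

# Shared pool for calls to the ad server (size chosen arbitrarily).
_executor = ThreadPoolExecutor(max_workers=4)


def authorize(user_id: str,
              fetch_ad: Callable[[str], str],
              timeout_seconds: float = 0.5) -> Dict[str, object]:
    """Grant access even if the ad server is slow or broken (fail open)."""
    future = _executor.submit(fetch_ad, user_id)
    ad: Optional[str]
    try:
        ad = future.result(timeout=timeout_seconds)
    except Exception:
        # Covers both the timeout and any error raised by the ad call,
        # for example a certificate failure: the user gets in without an ad.
        ad = None
    return {"user": user_id, "access": True, "ad": ad}
```

    The trade-off is explicit: during an ad server outage the network loses revenue, but users do not lose access.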

    Faster downloads sometimes produce unexpected results

    The perceived quality of our service depends not only on other people's infrastructure, OS updates and outages at mass-market resources, but also on the behavior of specific browsers on specific devices. Here we have far more room for influence, so we work constantly to improve our products. On average, we publish one update per day. But sometimes a seemingly simple update that should improve the user experience leads to unpredictable consequences.

    Since we can influence how services work at the network level (for example, by changing the priority of one type of traffic relative to another), we had the idea of speeding up authorization by prioritizing its traffic. We published the changes and, to our amazement, began to see numerous errors and a 20% drop in advertising revenue. Technical tests showed that the scheme worked absolutely correctly from a network point of view. Rolling back the changes, however, confirmed that the new settings were indeed the cause.

    In the end, we found that by raising the priority of some scripts over others, we had changed the order in which functions execute while the authorization page loads in the browser, and this significantly affected the user experience. In effect, authorization scripts began to load and run before the ad scripts. Because the two depend on each other, situations arose in which one function waited for the result of another whose file had not even been downloaded to the device yet.

    Social networks vs Media

    User behavior on the Internet follows standard patterns. People are used to communicating through messengers, looking for content on media portals, and reading news through social networks and news aggregators. It is fairly obvious, but worth stressing, that social networks are an alternative to news media, and vice versa. When something suddenly happens to one source of information, users' attention is redistributed among the remaining ones, usually the most accessible. In 2017, for example, VKontakte suffered a global outage. From our side, this looked like a sharp increase in users and in time spent on our news portal wi-fi.ru. In effect, users who realized that their favorite social network was down came to us to read the news.

    The moment VK went down was marked by a 30% increase in load on the wi-fi.ru portal.

    This case illustrates how important it is for mass services to have a safety margin for "digesting" the consequences of an outage at an informational "neighbor".

    Green - no accidents

    The situations described keep pushing us to improve the monitoring of third-party services in MT_FREE. This is what our network operations dashboard looks like.

    Network operations dashboard in St. Petersburg.

    The dashboard consists of many “traffic light” indicators: green means everything is normal, red means alarm. The colors of the indicators change over time, which can be either normal behavior or a sign of an anomaly. But if you “stretch” all the indicators into a single line and stack one such line on the board for each measurement step, you get a two-dimensional, constantly growing picture that describes the evolution of the network as a whole. This picture can easily be “fed” to standard machine learning algorithms designed to recognize graphic patterns (a kind of FindFace, only for sensor patterns).

    The time-based color chart of indicators is nothing more than a picture describing the evolution of the network.
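
    Turning dashboard snapshots into such a picture takes only a few lines. The sketch below is illustrative only: the numeric color codes, the extra `NO_DATA` state for missing indicators and the function names are assumptions, not the format our dashboard actually uses. Each snapshot becomes one row with a fixed indicator order, and the rows stacked over time form the two-dimensional picture.

```python
from typing import Dict, List

GREEN, RED, NO_DATA = 0, 1, 2  # hypothetical numeric codes for colors


def snapshot_row(states: Dict[str, int], indicator_order: List[str]) -> List[int]:
    """Stretch one dashboard snapshot into a row with a fixed column order."""
    return [states.get(name, NO_DATA) for name in indicator_order]


def build_picture(snapshots: List[Dict[str, int]],
                  indicator_order: List[str]) -> List[List[int]]:
    """Stack one row per measurement step: rows are time, columns are indicators."""
    return [snapshot_row(s, indicator_order) for s in snapshots]


def changed_indicators(picture: List[List[int]], step: int,
                       indicator_order: List[str]) -> List[str]:
    """Names of indicators whose color differs from the previous step."""
    if step < 1 or step >= len(picture):
        raise IndexError("step must have a previous row in the picture")
    previous, current = picture[step - 1], picture[step]
    return [name for name, a, b in zip(indicator_order, previous, current)
            if a != b]
```

    The resulting list of rows can be passed to any image-oriented model as a matrix; keeping the column order fixed matters, because the same indicator must always land in the same column for patterns to be comparable.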

    Next come self-learning algorithms (AI-type) that can automatically classify patterns and identify the causes of deviations or incomplete data. It all looks simple, but how many telecom operators do you think actually use it?

    Few, and we are not among them

    In fairness, the application of this technology within MaximaTelecom itself is at a fairly early stage, largely because it is not clear where the line lies between what has to be obtained from outside the network and what can be obtained from inside it. Our advantage is that we began developing the necessary algorithmic base from the very beginning, as part of our platform for monetizing the network through advertising.

    Maxima is, first of all, an operator of a free Wi-Fi access service. Moreover, unlike a good many “social” Wi-Fi networks, we are a full-fledged commercial communications operator. In fact, this is our corporate idea: we strive to make communication free and profitable at the same time, and we have already proved that this is possible. Almost no telecom operator in the world can do this (or wants to), and therefore none develops technology for it. This gives us hope that in the future we will be able to bring our technologies to the point where the MT_FREE user experience does not differ from what traditional paid carriers provide, while the level of reliability is higher thanks to a more advanced intelligent control and operations system.

    Unfortunately, not all problems can be solved by one company alone, if only because there are many manufacturers of subscriber and network Wi-Fi equipment, and the level of unification is significantly lower than in cellular networks. We have been solving problems with various devices connecting to the network since the day we launched. The "root of evil" here is the absence of any standard; as a result, each manufacturer builds something of its own.

    International associations exist to solve such industry-wide problems. For example, we are currently leading a project to standardize the user experience of connecting to Wi-Fi networks monetized through advertising. But that is a topic for another article.

    By the way, we are constantly expanding our development team; current vacancies can be found on our careers page.
