How we kept the MegaFon.TV CDN from falling over during the 2018 World Cup

    In 2016, we described how MegaFon.TV coped when everyone rushed to watch the new season of "Game of Thrones". The development of the service did not stop there, and by mid-2017 we were dealing with loads several times higher. In this post we will describe how that rapid growth pushed us to drastically change our approach to organizing the CDN, and how the new approach was put to the test by the World Cup.

    Briefly about MegaFon.TV

    MegaFon.TV is an OTT service for watching various video content: movies, TV series, TV channels, and catch-up programs. MegaFon.TV is available on almost any device: phones and tablets running iOS and Android; LG, Samsung, Philips, and Panasonic smart TVs of various model years with a whole zoo of operating systems (Apple TV, Android TV); desktop browsers on Windows, macOS, and Linux; mobile browsers on iOS and Android; and even exotic devices like STBs and children's Android projectors. There are practically no device restrictions: all you need is an internet connection with at least 700 Kbps of bandwidth. How we organized support for so many devices deserves a separate article, which is coming.
    Most of the service's users are MegaFon subscribers, which is explained by attractive (and most often free) offers included in subscribers' tariff plans, although we also see noticeable growth among users of other operators. In line with this distribution, 80% of MegaFon.TV traffic is consumed inside the MegaFon network.

    Architecturally, content has been distributed via a CDN since the service launched. We have a separate post dedicated to how this CDN works; in it, we told how it handled the peak traffic that hit the service at the end of 2016, during the release of the new season of Game of Thrones. In this post we will talk about the further development of MegaFon.TV and the new adventures that the 2018 World Cup brought to the service.

    Service growth. And problems

    Compared with the events of the previous post, by the end of 2017 the number of MegaFon.TV users had grown several times over, and the catalog of movies and TV series had grown by an order of magnitude. New functionality was launched and new subscription packages became available. Traffic peaks of Game of Thrones proportions were now an everyday occurrence, and the share of movies and TV series in the overall stream grew steadily.

    Along with this, problems with traffic distribution began. Our monitoring, which downloads test chunks for the different traffic types and delivery formats, increasingly began to report chunk download timeouts. In MegaFon.TV, a chunk is 8 seconds long; if a chunk does not manage to download within 8 seconds, playback errors may occur.
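    A monitoring probe of this kind fits in a few lines. The sketch below is only an illustration of the idea (the helper name and the test URL are hypothetical, not our actual monitoring):

```python
import urllib.request

CHUNK_DURATION = 8  # seconds; a chunk that downloads slower than its own duration will stall playback

def probe_chunk(url: str, timeout: float = CHUNK_DURATION) -> bool:
    """Download one test chunk; False counts as a timeout/download error in monitoring."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
        return True
    except OSError:  # URLError and socket.timeout are both OSError subclasses
        return False
```

A real probe would also record the response time and status code per format, but the pass/fail decision is exactly this timeout check.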

    Errors peaked, predictably, during the hours of maximum load. How did this affect users? At a minimum, they could see degraded video quality, although thanks to the fairly large number of adaptive-bitrate profiles this was not always noticeable to the naked eye. In the worst case, the video froze.

    The search for the cause began. Almost immediately it became clear that the delivery errors occurred on the CDN's EDGE servers. Here we need to make a small digression and explain how the servers handle live and VOD traffic; the schemes differ slightly. A user who comes to an EDGE server for content (a playlist or a chunk) gets it from the cache if it is there. Otherwise, the EDGE server fetches the content from Origin, loading the backbone link. Along with the playlist or chunk, Origin returns the header Cache-Control: max-age, which tells the EDGE server how long to cache that content item. The difference between live and VOD lies precisely in the chunk caching time. Live chunks get a short caching time, usually from 30 seconds to a few minutes, matching the short lifetime of live content; this cache is kept in RAM, since chunks must constantly be served and the cache constantly rewritten. VOD chunks get a longer time, from several hours to weeks or even months, depending on the size of the content library and how its views are distributed among users. Playlists, as a rule, are cached for no more than two seconds, or not cached at all. To be clear, all of this concerns only the so-called PULL mode of the CDN, which is what our servers used.
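    In code, the caching policy above boils down to the EDGE server honoring the max-age value Origin sends. A minimal sketch; the numeric values are the illustrative ranges from the text, not our production settings:

```python
import re

def parse_max_age(cache_control: str) -> int:
    """Extract the caching time, in seconds, that Origin prescribes via Cache-Control."""
    m = re.search(r"max-age=(\d+)", cache_control)
    return int(m.group(1)) if m else 0  # no max-age directive: do not cache

# Illustrative Origin responses for the three content classes described above:
live_ttl = parse_max_age("Cache-Control: max-age=60")       # live chunk: seconds to minutes, kept in RAM
vod_ttl = parse_max_age("Cache-Control: max-age=604800")    # VOD chunk: hours to months, kept on disk
playlist_ttl = parse_max_age("Cache-Control: no-cache")     # playlist: a couple of seconds at most, or none
```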

    But back to the search. As we noted, all servers served both types of content at once, and the servers themselves had different configurations. As a result, some machines were overloaded on IOPS: chunks did not manage to be written or read in time because of insufficient disk performance, disk count, and disk capacity combined with a large content library. More powerful machines, on the other hand, which received more traffic, began to hit CPU limits: processor resources were spent servicing SSL traffic, since chunks are delivered over HTTPS, while disk IOPS barely reached 35%.

    We needed a scheme that would make optimal use of the available capacity at minimal cost. Moreover, the World Cup was six months away, and by preliminary calculations live traffic peaks were expected to grow sixfold...

    A new approach to the CDN

    After analyzing the problem, we decided to separate VOD and live traffic onto different PADs made up of servers of different configurations, and to create a module for distributing traffic and balancing it across the server groups. In total there were three such groups:

    • Servers with many high-performance disks, best suited for caching VOD content. Ideally these would have been read-intensive SSDs of maximum capacity, but none were available, and buying the required number would have taken too large a budget, so we used the best of what we had: each server held eight 1 TB 10k SAS disks in RAID 5. These servers formed VOD_PAD.
    • Servers with plenty of RAM for caching live chunks in all possible delivery formats, with processors capable of handling SSL traffic and "fat" network interfaces. We used the following configuration: 2 processors with 8 cores each, 192 GB of RAM, and four 10 Gbit/s interfaces. These servers formed EDGE_PAD.
    • The remaining servers, unable to serve VOD traffic but suitable for small volumes of live content, which could be used as a reserve. These servers formed RESERVE_PAD.
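    The resulting routing decision can be summarized in a few lines. This is a simplified sketch: the PAD names come from the text, but the URL convention and the overload flag are assumptions:

```python
VOD_PAD, EDGE_PAD, RESERVE_PAD = "VOD_PAD", "EDGE_PAD", "RESERVE_PAD"

def choose_pad(url: str, edge_overloaded: bool) -> str:
    """VOD requests go to the disk-heavy PAD; live requests go to EDGE_PAD,
    spilling over to RESERVE_PAD when the live group is at its limit."""
    if "/vod/" in url:  # assumed URL convention for VOD content
        return VOD_PAD
    return RESERVE_PAD if edge_overloaded else EDGE_PAD
```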

    The distribution went according to the following scheme:

    A dedicated logic module was responsible for choosing which PAD a user should receive content from. Its tasks were:
    • Analyze the URL, apply the scheme above to each request stream, and return the appropriate PAD.
    • Poll the load on the EDGE_PAD interfaces every 5 minutes (and this was our mistake) and, when the limit is reached, switch the excess traffic to RESERVE_PAD. A small Perl script collected the load and returned the following data:
      - timestamp - date and time of the load-data update (in RFC 3339 format);
      - total_bandwidth - current interface load (total), Kbps;
      - rx_bandwidth - current interface load (incoming traffic), Kbps;
      - tx_bandwidth - current interface load (outgoing traffic), Kbps.
    • Direct traffic manually to any PAD or to the Origin servers in unforeseen situations, or when work had to be done on one of the PADs. A YAML config on the server allowed directing either all traffic, or traffic selected by the following attributes, to the required PAD:
      - Content type
      - Traffic encryption
      - Traffic payability
      - Device
      - Playlist type
      - Region
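    Putting these pieces together, the balancing step reads the script's output and compares the outgoing load against the limit. A sketch in Python (the original script was in Perl; the threshold value here is an assumption for illustration):

```python
import datetime

LIMIT_KBPS = 30_000_000  # assumed switching threshold, Kbps; not our production value

def load_snapshot(rx_kbps: int, tx_kbps: int) -> dict:
    """The four fields the monitoring script reports for an interface."""
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "total_bandwidth": rx_kbps + tx_kbps,
        "rx_bandwidth": rx_kbps,
        "tx_bandwidth": tx_kbps,
    }

def should_spill_to_reserve(snapshot: dict, limit_kbps: int = LIMIT_KBPS) -> bool:
    """True when outgoing EDGE_PAD traffic exceeds the limit and excess should go to RESERVE_PAD."""
    return snapshot["tx_bandwidth"] > limit_kbps
```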

    The Origin servers were equipped with SSDs. Unfortunately, the HIT_RATE for VOD chunks when traffic was switched to Origin left much to be desired (about 30%), but they did their job, and we saw no problems with the packagers during peak-load hours.

    Since there were few servers of the EDGE_PAD configuration, we decided to place them in the regions with the largest share of traffic: Moscow and the Volga region. Using GeoDNS, traffic from the regions of the Volga and Ural federal districts was directed to the Volga node; the Moscow node served everything else. We did not like the idea of delivering traffic to Siberia and the Far East from Moscow, but together those regions generate about 1/20 of all traffic, and MegaFon's links were wide enough for such volumes.
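    The GeoDNS split described above amounts to a region-to-node map with Moscow as the default. A sketch (the node names are hypothetical):

```python
# Federal districts served by the Volga node; everything else falls through to Moscow.
VOLGA_DISTRICTS = {"Volga", "Ural"}

def edge_node_for(district: str) -> str:
    """Pick the EDGE_PAD node for a user's federal district."""
    return "edge.volga" if district in VOLGA_DISTRICTS else "edge.moscow"
```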
    After developing the plan, we carried out the following work:

    • In two weeks, we developed the CDN switching functionality.
    • Installing and configuring the EDGE_PAD servers, and expanding the links to them, took a month.
    • Splitting the current server group into two parts took two weeks, plus another two weeks to apply the settings to all network and server equipment.
    • Finally, a week went to testing (unfortunately not under load, which came back to bite us later).

    Some of the work could be parallelized, so in the end everything took six weeks.

    First results and future plans

    After the tuning, the overall system capacity was 250 Gbit/s. Moving VOD traffic to separate servers proved effective immediately after rollout to production: since the start of the World Cup there have been no problems with VOD traffic. Several times, for various reasons, we had to switch VOD traffic to Origin, and in general it coped too. This scheme is perhaps not very efficient given the very low cache utilization, since we force the SSDs to constantly rewrite the content, but it works.

    As for live traffic, volumes large enough to test our solution appeared with the start of the World Cup. The problems began during the Russia vs. Egypt match, the second time we hit the traffic-switching limit. When the switch fired, all the excess traffic poured onto the backup PAD at once. Within those five minutes the number of requests grew so fast that the backup CDN was completely saturated and began returning errors, while the main PAD drained and started to sit partly idle:

    We drew three conclusions from this:

    1. Five minutes is far too long. We decided to reduce the load-polling period to 30 seconds, after which traffic on the backup PAD stopped growing so abruptly:

    2. Each time the switch fires, as few users as possible should be moved between PADs; this makes the transition smoother. We decided to assign a cookie to each user (or rather, each device) by which the distribution module decides whether to keep the user on the current PAD or move them. The exact technique here is up to whoever implements it. As a result, we no longer dump traffic off the main PAD.
    3. The switching threshold was set too low, which is why traffic on the backup PAD grew like an avalanche. In our case this was overcaution: we were not completely sure we had tuned the servers correctly (the tuning ideas, by the way, were taken from Habr). The threshold has since been raised to the physical capacity of the network interfaces.
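    Conclusion 2 can be illustrated with a sticky-assignment sketch: a device keeps whatever PAD its cookie records, and a spill affects only devices without a cookie (new sessions). The names are hypothetical; the post deliberately leaves the technique to the implementer:

```python
def assign_pad(cookie_pad, spilling: bool) -> str:
    """Sticky PAD choice per device: existing sessions stay where they are,
    so a switch moves only new sessions and the main PAD no longer drains."""
    if cookie_pad in ("EDGE_PAD", "RESERVE_PAD"):
        return cookie_pad  # device already placed: keep it there
    return "RESERVE_PAD" if spilling else "EDGE_PAD"
```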

    The improvements took three days, and at the Russia vs. Croatia match we checked whether our optimizations worked. On the whole, we were pleased with the result: at the peak, the system handled 215 Gbit/s of mixed traffic. That was not the theoretical limit of the system; we still had a solid margin. If necessary, we can now connect any external CDN and offload the excess traffic there. Such a model is good when you do not want to pay substantial money every month for using someone else's CDN.

    We have plans to develop the CDN further. First, we would like to extend the EDGE_PAD scheme to all federal districts, which will reduce backbone usage. We are also testing a VOD_PAD redundancy scheme, and some of the results already look quite impressive.

    Overall, everything done over the past year convinces me that a CDN is a must-have for a service distributing video content. Not even because it saves a lot of money, but because the CDN becomes part of the service itself and directly affects its quality and functionality. Under such circumstances, putting it in the wrong hands is unwise, to say the least.
