How we moved Miss Russia to the cloud

    On April 15, the Miss Russia 2017 contest was held. After a complete rebuild of the site, page load time stayed under one second, even at peak moments. Our partners from Byndyusoft, in the person of Alexander Byndyu (@alexanderbyndyu), the architect of the whole system, explain how they did it, share the details of moving the platform to the cloud, and tell why they had to replace the project's entire internal infrastructure.



    Company information: Byndyusoft is a company that delivers projects on the .NET platform for a variety of subject areas around the world.

    About the contest


    The national contest "Miss Russia" is the country's largest beauty project. Currently, the competition is supported by the Ministry of Culture of the Russian Federation. This year the contest celebrated its 25th anniversary.

    Miss Russia qualifying rounds are held in 85 constituent entities of the Russian Federation, and more than 75,000 girls take part in the castings every year.

    The Miss Russia contest holds the exclusive right to represent the country at the largest international beauty contests, Miss World and Miss Universe.



    The peculiarity of the contest website is that 97% of the traffic and 100% of the voting happen within two weeks in April. Compared to the April peak, the rest of the year passes with virtually no load.



    What was the problem



    Over the past few years, the Miss Russia website had been constantly refined and remade, updated every year for the new contest. By the 25th anniversary, Miss Russia arrived with a design from Lebedev Studio, running on Cubique CMS (revision of May 10, 2017: Cubique CMS, not Plesk as originally stated).

    During the contest, users constantly complained that pages opened slowly, returned 500 errors, and that voting did not work. There had been attempts to speed the site up, but they all ended in failure.

    The site was hosted on a dedicated server, and adding more hardware to it had stopped paying off. When the customer ran a stress test, the site went down at 150 requests per second.


    Load tests of the first version

    First attempt

    First, the customer tried to move the site from the dedicated server to the cloud as-is: the virtual machine was copied and launched on Azure. Despite the increased capacity and high hopes for the "clouds", performance dropped, and the site now went down at 90 requests per second.


    First version on Azure

    The contest site could not be scaled horizontally. Add five times more hardware and throughput grows by 30%; double it again and it grows by another 2%. A typical problem of monolithic systems.


    Old Version Architecture

    Of course, we could have kept tinkering with the old Miss Russia site, for example by adding a CDN or putting several virtual machines behind a load balancer. But every such move would run into the problems of the old code and the CMS, and it still would not deliver the scaling flexibility we wanted.
    It became clear that the site had to be rebuilt completely, on a new architecture and cloud infrastructure.

    The second attempt

    The customer came to us with the task: move Miss Russia to the cloud and achieve sufficient performance. We looked under the hood and realized that the business goals could not be reached with the existing architecture, so we decided to redo everything from scratch.

    Layout

    First, we reworked the markup. The site became lighter and opened faster. Previously, a tiny avatar on a page could in fact be a full high-resolution image weighing 3 MB. For the new site we achieved the following:

    1. 2 to 13 times fewer requests per page.
    2. 5 to 16 times less traffic.
    3. 8 times shorter full page load time.

    We analyzed the Metrica statistics and it turned out that 60% of visitors come from mobile devices, so we reworked the site to be fully adaptive and responsive.

    The page before and after the redesign

    Architecture


    Instead of a monolithic backend, we introduced a distributed microservice architecture, so that increasing capacity no longer meant throwing more servers at the whole site: it was enough to add capacity to the right service at the right time.

    We based the new architecture on ideas that would lead to the business goals:

    1. Divide the application into (micro)responsibilities.
    2. Each part does its own job perfectly.
    3. Each part takes care of its own scaling.
    4. Total automation.

    As a result, we came to this architecture:



    New architecture



    Previously, all requests from the site landed in the monolithic block, which was responsible both for processing votes and for generating content. When one module was overloaded, the others slowed down as well. Now each part works and scales independently.

    The new stress-test results were encouraging:

    1. The load was generated over a network with 1 Gbit/s of bandwidth.
    2. The first problems with server responses appeared only after ~5450 RPS.
    3. Response times did not exceed 1000 ms.




    Technology


    Azure offers a choice of technologies and services. For example, which CDN should we take: Akamai or Verizon? We ran experiments and picked the most suitable tools, finding a couple of critical problems along the way.

    .NET Core and Kestrel

    The new contest site is written in .NET Core. By then we had been running it in production on other projects for six months, and we see no problems with it.
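
    For reference, a minimal ASP.NET Core 1.x entry point hosting an application on Kestrel looks roughly like this. It is a generic sketch of the era's template, not code from the contest site; Startup stands for the usual ASP.NET Core startup class:

        using System.IO;
        using Microsoft.AspNetCore.Hosting;

        public class Program
        {
            public static void Main(string[] args)
            {
                // Kestrel is the cross-platform web server that ships with ASP.NET Core.
                var host = new WebHostBuilder()
                    .UseKestrel()
                    .UseContentRoot(Directory.GetCurrentDirectory())
                    .UseStartup<Startup>()
                    .Build();

                host.Run();
            }
        }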

    The only unpleasant problem arose with Kestrel, which under load began to respond with code 502.3. When that happened, the application crashed and did not come back to life until it was restarted.

    The problem was in Kestrel version 1.1.0; it is described in Issue 323 and Issue 311. We were lucky: two weeks before the start of the contest, version 1.1.1 of the Microsoft.AspNetCore.Server.Kestrel package was released, and the problem went away.

    CDN

    We chose between Akamai and Verizon and went with Akamai, because it has cache servers in Russia, which matters for the contest's audience.

    For the CDN we took a generally standard approach:

    1. Images are cached for 7 days; HTML is refreshed once an hour (see the header sketch after this list).
    2. New versions of JavaScript and CSS land in the CDN automatically, and each version is cached separately.
    3. Compression is enabled.
    4. The cache can be flushed manually.
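
    On the application side, such lifetimes can be expressed as ordinary Cache-Control headers that the CDN and browsers then respect. A minimal ASP.NET Core sketch; the 7-day value mirrors the image policy above, while the real site's exact configuration is not shown in the article:

        using Microsoft.AspNetCore.Builder;

        // Inside Startup.Configure: serve static assets with a cache lifetime,
        // so the CDN and browsers keep them without re-requesting.
        app.UseStaticFiles(new StaticFileOptions
        {
            OnPrepareResponse = ctx =>
            {
                // 604800 seconds = 7 days, matching the image caching policy above.
                ctx.Context.Response.Headers["Cache-Control"] = "public,max-age=604800";
            }
        });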

    If you want to cache HTML, keep in mind that the Akamai CDN only supports third-level domains. To enable caching, we had to redirect from missrussia.ru to www.missrussia.ru.
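
    The redirect itself fits in a few lines of middleware. A hedged sketch of one way to do it in ASP.NET Core of that era (later versions added a built-in AddRedirectToWww rewrite rule):

        // Inside Startup.Configure, before the other middleware:
        // send apex-domain requests to the www subdomain so Akamai can cache the HTML.
        app.Use(async (context, next) =>
        {
            if (string.Equals(context.Request.Host.Host, "missrussia.ru",
                              System.StringComparison.OrdinalIgnoreCase))
            {
                var target = "https://www.missrussia.ru"
                             + context.Request.Path
                             + context.Request.QueryString;
                context.Response.Redirect(target, permanent: true);
                return;
            }

            await next();
        });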

    WebApp

    We deployed the main site and the API as separate Web Apps. As the load changed, we could choose between two ways of scaling:

    • Scale up: increase or decrease capacity by changing the pricing tier.
    • Scale out: increase the number of instances, with Azure itself balancing the load between the running copies of the service.

    During the contest, both Web Apps ran on the S3 tier; after the contest we dropped them to S1 so as not to burn money when there is no load.

    Service Bus

    For the queue we chose between Service Bus and Storage Queues. Here is what we needed:

    1. Small messages with short processing times.
    2. No need for transactions or priority-based message processing.
    3. A client library for .NET Core.

    We chose Service Bus with the .NET Standard client library for Azure Service Bus.

    If the queue is slow or fails when sending messages, check that:

    • The queues are deployed in the same region as the services that read and publish the messages.
    • The Service Bus client is registered as a singleton, so the connection is not re-created for every message (see the sketch after this list).
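
    A sketch of that registration, assuming the Microsoft.Azure.ServiceBus package and a hypothetical queue named "votes" (the real queue name and configuration keys are not given in the article):

        using Microsoft.Azure.ServiceBus;                 // IQueueClient, QueueClient
        using Microsoft.Extensions.DependencyInjection;

        // Inside Startup.ConfigureServices: one QueueClient for the whole application,
        // so every message reuses the existing connection instead of opening a new one.
        // Configuration is the Startup's IConfiguration; the key name is illustrative.
        services.AddSingleton<IQueueClient>(
            new QueueClient(Configuration["ServiceBus:ConnectionString"], "votes"));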

    WebJob and checking votes

    When designing the voting system, we had to make sure that an influx of voters would not affect the response speed of the main site. It was also important to strengthen the bot-filtering algorithms, because vote rigging had been a problem the previous year. In other words, the voting system had to work faster while performing a more complex analysis.

    We separated vote validation from incrementing the counter in time. When people vote, they always get the answer: "Thank you, your vote has been counted!" At that moment a vote message is generated and sent to the queue, where it waits to be analyzed by the vote-processing service. Processed votes go into the database and reach the site a few hours later, so the vote counter jumps by hundreds or thousands at a time.
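
    Schematically, the receiving side might look like the controller below. The names (VoteController, VoteDto, the "votes" queue) are illustrative, not taken from the project; the important part is that the response never depends on whether the vote will survive the analysis:

        using System.Text;
        using System.Threading.Tasks;
        using Microsoft.AspNetCore.Mvc;
        using Microsoft.Azure.ServiceBus;
        using Newtonsoft.Json;

        [Route("api/votes")]
        public class VoteController : Controller
        {
            private readonly IQueueClient _queue;

            public VoteController(IQueueClient queue)
            {
                _queue = queue;
            }

            [HttpPost]
            public async Task<IActionResult> Post([FromBody] VoteDto vote)
            {
                // Queue the raw vote together with request metadata for the later bot analysis.
                var payload = JsonConvert.SerializeObject(new
                {
                    vote.ContestantId,
                    Ip = HttpContext.Connection.RemoteIpAddress?.ToString(),
                    UserAgent = Request.Headers["User-Agent"].ToString()
                });
                await _queue.SendAsync(new Message(Encoding.UTF8.GetBytes(payload)));

                // Always the same answer, whether the vote is genuine or from a bot.
                return Ok("Thank you, your vote has been counted!");
            }
        }

        public class VoteDto
        {
            public int ContestantId { get; set; }
        }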

    This approach removed the feedback loop for those who were trying to tune the parameters of their API calls to rig the votes: it became impossible for them to tell how the system reacted to a hand-crafted POST request.

    The solution is also great for horizontal scaling. The Service Bus queue scales horizontally, and to speed up the heavy vote analysis it is enough to spin up a few dozen vote-processing services. In Azure, you can start several WebJobs either manually, with a couple of mouse clicks, or automatically.
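
    The processing side, sketched as a WebJobs function. The ServiceBusTrigger attribute comes from the WebJobs SDK; the bot check and the save step below are placeholders, since the real filtering rules are, understandably, not public:

        using System;
        using System.Threading.Tasks;
        using Microsoft.Azure.WebJobs;
        using Newtonsoft.Json;

        public class VoteProcessor
        {
            // The WebJobs runtime calls this for every message in the "votes" queue.
            // Starting more WebJob instances drains the same queue in parallel.
            public static async Task ProcessVote([ServiceBusTrigger("votes")] string rawVote)
            {
                var vote = JsonConvert.DeserializeObject<QueuedVote>(rawVote);

                // Placeholder for the real bot filter.
                if (string.IsNullOrEmpty(vote.UserAgent))
                    return;

                await SaveVoteAsync(vote); // accepted votes end up in the shared database
            }

            private static Task SaveVoteAsync(QueuedVote vote)
            {
                // Placeholder: the real service writes to the project database.
                Console.WriteLine($"Counted a vote for contestant {vote.ContestantId}");
                return Task.CompletedTask;
            }
        }

        public class QueuedVote
        {
            public int ContestantId { get; set; }
            public string Ip { get; set; }
            public string UserAgent { get; set; }
        }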

    There is a technical nuance behind choosing WebJob rather than Service Fabric for the vote-processing service.

    To work with Service Fabric under .NET Core, you have to install an SDK from a special Ubuntu repository, which creates problems for both deployment and development. WebJobs work with .NET Core without any extra ceremony.

    Pure PaaS

    The whole project was done by two of our developers in four weeks. It went that fast partly because the entire infrastructure is set up with mouse clicks in the Microsoft Azure web interface: this is pure PaaS. We did not create or configure a single virtual machine.

    Vertical and horizontal scaling were also done with the mouse.

    Microservices

    Although the project is small and there are only three (micro)responsibilities, we stuck to the basic ideas of microservice architecture. We identified three microservices: a vote processor (service), a vote receiver (API), and a content generator (Web).

    The microservices are completely independent. If any one of them goes down, the rest keep working; if one of them comes under load and starts to slow down, the others never notice.

    If, after processing a vote, we wanted to send the contestant an SMS congratulating her on the vote, another microservice connected to the Service Bus would appear on the diagram. It would consume the events produced when vote processing completes. In this way the architecture can be extended almost endlessly.

    The only thing that distinguishes the new architecture of the Miss Russia site from canonical microservices is the database shared by all the services. We deliberately went for this simplification to save time and money: the database is small, there is not much data, and it is partitioned so that the services' data hardly overlaps. If the project's logic ever gets more complicated, which is unlikely, we will give each microservice its own storage.

    Result


    Site speed

    The site works quickly and smoothly, even on mobile. All content is cached on the CDN, which handled 5,500 requests per second. Caching in the browser, in the CDN, and in the web application allowed 99.7% of users to open the Miss Russia contest page in less than one second.



    Load flexibility

    Thanks to the flexibility of allocating capacity in Azure, the cost of the new infrastructure during the voting (two weeks a year) equals the cost of the dedicated server that hosted the previous version of the site. After the voting, we removed the extra capacity in a couple of clicks and the cost dropped threefold.

    On large projects we usually build an automated system that adds service instances under load and reduces their number when the load is gone. Peaks occur not only around known events (Black Friday, March 8), but also daily (a lull at night, a peak during the day), weekly (a lull on weekends, peaks on weekdays), and at random (someone mentions the site on a popular forum), so automation is a must.

    Voting

    Voting served 100% of requests: not a single user received a 500 error. Out of 750K votes, the algorithm screened out roughly 500K as bot traffic, and the remaining votes were counted for the contestants.

    The contest organizers received transparent reporting on the voting process: who received how many votes and who tried to inflate the results.

    Conclusions

    A competent architecture allowed us to:

    - Give more resources to loaded parts and fewer to idle ones.
    - Eliminate the influence of the system's parts on one another.
    - Allocate resources to each part only when they are needed.


    About the author


    Alexander Byndyu (@alexanderbyndyu) is the owner of Byndyusoft, an expert in Agile and Lean, and an IT architect.

    The project was completed with expert support from the CSP provider InfoboxCloud, which promptly answered all our questions about Azure.
