"Tester Calendar" for May. Load service
Load testing is in many ways similar to civil defense and emergency drills: it is better to understand in advance what a given situation will look like than to try to get your bearings in a panic. Besides your own tests and the problems collected in production, you can learn from the experience of industry colleagues. Especially for the “Tester Calendar” project, Dmitry Vorotnikov, a tester at Kontur, has drawn up several simple but important rules for load testing a service, using outages at large IT companies as examples.

Changed load profile
When people talk about load testing, they usually mean capacity testing. Online stores have Black Friday and Cyber Monday: a time of sales and an extreme increase in the load on all services. At Kontur, similar traffic spikes happen in the last days of reporting periods, when customers submit reports to regulatory authorities. Whatever the reason for the influx of visitors, you cannot allow operations to become unavailable, errors to appear, or response times to grow. By testing the capacity of the service, we make sure that users are not angrily clicking their mouse or leaving for competitors, but can work comfortably and productively.
If you test with a load profile that simply repeats a typical month, year, or two, you may run into the kind of problem Amazon Simple Storage Service had on February 15, 2008. Access to data in S3 is controlled by the AWS authentication service. Requests to it are cryptographically protected and expensive to process. Amazon kept as many servers as were needed to handle the load of the previous two years. That day at 3:30 AM, engineers noticed that the number of authentication requests had grown. The requests overloaded the AWS infrastructure, making it impossible to serve them all, and additional capacity had to be brought online to cope with the increased load. Until 6:48 AM, every system that used S3 was unavailable.
Consider possible changes in the load profile when planning service capacity and during load tests.
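For example, a load-testing tool such as Locust lets you describe a custom load shape that deliberately ramps past the historical peak instead of merely repeating it. A minimal sketch, where the /report endpoint, user counts, and stage timings are made-up assumptions:

```python
# Minimal Locust sketch: ramp past last year's peak to probe headroom.
# The endpoint, user counts and timings are assumptions, not real figures.
from locust import HttpUser, LoadTestShape, task, between

HISTORICAL_PEAK_USERS = 400  # assumed peak concurrency from past monitoring


class ReportUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def submit_report(self):
        # Hypothetical endpoint standing in for a real business operation.
        self.client.post("/report", json={"period": "Q1"})


class BeyondPeakShape(LoadTestShape):
    """Hold the historical peak, then push to 150% of it."""

    stages = [
        (120, HISTORICAL_PEAK_USERS // 2),      # warm-up
        (300, HISTORICAL_PEAK_USERS),           # last year's peak
        (600, HISTORICAL_PEAK_USERS * 3 // 2),  # 150% of the peak
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end_time, users in self.stages:
            if run_time < end_time:
                return users, 10  # (target user count, spawn rate per second)
        return None  # stop the test
```

Run it with `locust -f loadfile.py --host https://staging.example.com` (the host is, again, a placeholder); watching where response times and error rates start to bend shows how much headroom is left above the historical profile.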
Irregularity
A single or occasional test is not enough, even one that takes a possible change in the load profile into account, especially if the number of your users and integrations keeps growing.
On Wednesday, December 22, 2010, the Skype messenger began to fail. Establishing a connection took longer and longer, until the service stopped working entirely. The problem was triggered by an overload of a group of servers that process instant messages: processing began to take noticeably more time. The slowdown ran into a bug in the Windows client that made it crash. A significant share of clients acted as supernodes, relaying P2P traffic between other clients. With 25-30% of the supernodes gone, the remaining ones became overloaded and failed as well, driving the load even higher. As a result of this cascading failure, the Skype network was unavailable for about a day.
Test and review your service regularly.
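To keep regular runs honest, it helps to compare each run against a stored baseline and fail loudly on degradation. A minimal sketch of such a regression gate; the file names, metric names, and 20% tolerance are assumptions:

```python
# Regression gate sketch for a scheduled load test: compare this run's
# metrics against a stored baseline and fail the job on degradation.
import json
import sys

TOLERANCE = 1.20  # assumption: allow up to 20% drift before failing


def check_against_baseline(baseline_path: str, current_path: str) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"p95_ms": 180, "error_rate": 0.001}
    with open(current_path) as f:
        current = json.load(f)

    failures = []
    for metric in ("p95_ms", "error_rate"):
        if current[metric] > baseline[metric] * TOLERANCE:
            failures.append(f"{metric}: {baseline[metric]} -> {current[metric]}")

    if failures:
        print("Load test regression:", "; ".join(failures))
        return 1
    print("Within baseline tolerance.")
    return 0


if __name__ == "__main__":
    sys.exit(check_against_baseline("baseline.json", "current.json"))
```

Wired into a scheduled CI job, this turns "we tested it once" into a recurring check that notices drift as users and integrations accumulate.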
Side effect
When you plan regular testing and write its scenarios, keep in mind that the load can also grow because of force majeure. Splitting a service across several data centers adds no fault tolerance if, when one fails, the rest cannot handle its load; capacity planning must account for such scenarios. The load can also grow because of deliberate changes to the system, such as software updates or maintenance work.
On February 24, 2009, Gmail ran into problems because of a new feature that allowed mail to be stored geographically closer to its senders. During maintenance, users of one data center were redirected to another, overloading it. This set off a cascade of failures from data center to data center, each taking on an ever-increasing load. The service was unavailable for two and a half hours. The story was nicknamed Gfail.
Two related conclusions can be drawn from this postmortem. First, the service and its environment change constantly, which means you need to test maintenance scenarios and the shutdown of individual services and servers. Second, take test results into account during maintenance and update them before planned changes and outages. The data-center scenario can even be checked with simple arithmetic, as sketched below.
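Here is a back-of-the-envelope N-1 check in Python, with made-up capacity numbers: if one data center fails, can the remaining ones absorb its traffic?

```python
# N-1 capacity check sketch with made-up numbers: after losing one
# data center, the survivors must carry the whole load.
def survives_dc_loss(per_dc_capacity_rps: float,
                     total_load_rps: float,
                     dc_count: int) -> bool:
    remaining_capacity = per_dc_capacity_rps * (dc_count - 1)
    return total_load_rps <= remaining_capacity


# Three data centers at 1000 rps each, 2400 rps total load: each runs at
# a comfortable 80% normally, but after one failure the remaining two
# would need 1200 rps apiece, which is over capacity, so a cascade like
# Gfail is likely.
print(survives_dc_loss(1000, 2400, 3))  # False
```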
Unpreparedness for failure
To avoid downtime and add nines to your availability metric, you need to know what load your service can withstand, keep this knowledge up to date and accessible, and use it whenever you make changes. When designing and developing a service, we apply many measures that protect against overload, cascading failures, power outages, hardware failure, and data loss. Unfortunately, this variety is not a panacea for implementation errors, misconfiguration, and the human factor.
On July 19, 2010, the website of the major American online retailer American Eagle Outfitters became unavailable because its primary storage failed. No big deal, there are backups! However, the switch to the backup storage caused it to fail too. Still not scary, because there were also backups on magnetic tape. Restoring them took a long time, after which the team tried to bring the site up at a standby site, but that failed as well: the standby site was not ready, although it should have been prepared in advance. Despite the wide range of protective measures, it took 4 days to restore the ability to take orders, and another 4 days for a full recovery.
After running load tests and establishing the limits of the service, do not forget to test the protection mechanisms, failover, and your backups.
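A backup only counts once it has actually been restored. Below is a minimal drill sketch in Python; the restore.sh script, scratch environment, and health endpoint are all hypothetical placeholders for whatever your infrastructure provides:

```python
# Backup drill sketch (hypothetical script and endpoint): restore the
# latest backup into a scratch environment and smoke-test it, so
# "we have backups" is verified rather than assumed.
import subprocess
import urllib.request


def restore_and_smoke_test() -> bool:
    # Assumption: restore.sh provisions a scratch instance from the latest backup.
    subprocess.run(["./restore.sh", "--target", "scratch"], check=True)
    # Smoke check: the restored instance should answer a basic health request.
    with urllib.request.urlopen("http://scratch.internal:8080/health",
                                timeout=10) as resp:
        return resp.status == 200


if __name__ == "__main__":
    print("backup restorable:", restore_and_smoke_test())
```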
And finally, keep information about every failure of the service in an accessible place: the timeline of the incident and an analysis of the causes that led to it. This will help the whole team of developers and testers study the failure and find new approaches to preventing it.
List of calendar articles:
Try a different approach
Reasonable pair testing
Testing: how it happens
Optimize tests
Read a book
Testing analytics
A tester should catch a bug, read Kaner and organize traffic
Load a service
Metrics in the service of QA
Test security
Find your client
Sort backlog