Our tanks. The history of load testing at Yandex
Today I want to recall how load testing appeared at Yandex, how it developed, and how it is organized now.
By the way, if you like this story, come to Test Wednesday at our St. Petersburg office on November 30 (register) - there I will talk more about game mechanics in testing and will be happy to chat with you in person. So, here goes.
In 2005-2006, part of Yandex's non-search infrastructure began to feel the strain of the Runet's explosive growth. We needed to test the performance of the services adjacent to search, first of all the banner rotation system. Timur Khayrullin, who was in charge of load testing at the time, set out to find a suitable tool.
The open solutions that existed at the time were either too primitive (ab, siege) or not performant enough (JMeter). Among commercial tools HP LoadRunner stood out, but the high license cost and the lock-in to proprietary software did not appeal to us. So Timur, together with Zhenya Mamchits, the developer of the high-performance phantom web server, came up with a clever trick: they taught the server to work in client mode. That is how the phantom-benchmark module came about. The phantom code is now open and can be downloaded from here, and the story of phantom can be watched in the presentation video here.
Back then phantom was very simple: it could only measure the maximum server performance, and the only thing we could limit was the number of threads. But even then our utility was far more performant than its analogues, so more and more services began to turn to us for load testing. From 2006 to 2009 the load testing team grew to ten people. The nickname “tankers” stuck to us very quickly - we load “belts” with “cartridges” and “fire” from “tanks” at “targets”. The tank theme is with us to this day. To save resources we set up a special “firing range”, or “chicken coop”, where we kept the virtual machines for load testing. The virtualization platform back then was OpenVZ; by now we have switched entirely to LXC because of its better support for new kernels and Ubuntu Server releases. Respect to the LXC community!
In parallel with the influx of services and the growing popularity of load testing - or rather, with the growing maturity of the service teams - came an understanding of the tool's limitations.
With the help of the developers and under the leadership of the “tanker” Andrei baabaka Kuzmichev, we began to grow phantom into a real framework for load testing - Lunapark. Before that, results were scattered haphazardly across the Wiki, JIRA, mail and so on, because reporting was poorly organized. It was very inconvenient; we put a lot of work into this sore spot and gradually got a real web interface with dashboards and graphs, where every test was tied to a JIRA ticket and all reports finally acquired a uniform, clear layout. The web interface learned to display percentiles, timings, average times, response codes, the amount of data received and transmitted, and about 30 different graphs and tables. In addition, Lunapark was hooked up to mail, Jabber and other services. The phantom load generator itself did not escape changes either: it learned to do a lot of things it could not do before, for example, send requests according to a schedule - linearly, stepwise, decreasing the load, and even (!) at zero and fractional rates. Aggregated output with percentiles and monitoring of data volume, errors and responses was added to the console. This is what the console output looked like in 2009.
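To give a feel for such schedules, here is a minimal sketch in the load.ini format of today's open-source Yandex.Tank (the 2009-era phantom used a different syntax, and the target address below is made up): a linear ramp from 1 to 100 requests per second over ten minutes, a constant plateau, a stepped climb and a zero-load pause.

[phantom]
# hypothetical target host and port
address=target.example.com:80
# load profile: linear ramp, plateau, +20 rps steps of 30 s each, then a pause
rps_schedule=line(1, 100, 10m) const(100, 5m) step(100, 200, 20, 30s) const(0, 1m)
# cap on the number of concurrent connections (the old "threads" limit)
instances=1000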
Gradually we came to understand that we needed not only to test services constantly, but also to accompany them year after year, from release to release, and to be able to compare tests with one another. This is how two Lunapark pages important for a load tester's work appeared: test comparison and regression. Here is what their first versions looked like.
At any moment a tester, developer or manager could find out how a product was evolving in terms of performance. We are now working on improving these pages and putting them at the heart of the framework: at the current stage, what matters most to us is not individual tests but performance trends.
In 2011 an important event happened: we became the first team at Yandex to launch real gamification of the workflow, about which I will give a separate report at Test Wednesday. Even now this is a rarity even in the most advanced IT companies. We placed a “Hall of Fame” on the Lunapark page and are very proud of this part of the framework. For a specific test you can say a real, friendly “Thank you!” to the load tester. Each tester earns various badges for particular events, can become a “tank commander” (a la mayor on Foursquare) and even gets a “rank”, awarded for the number of tests performed. Amid routine work, any achievement or thank-you is worth its weight in gold. It motivates really well.
We consider 2011 a turning point in our load testing. The development team was headed by Andrei undera Pohilko, whom the testing community knows as the author of excellent JMeter plugins. Andrei brought fresh ideas and approaches that help us a lot in our work today.
Firstly, we realized that developing the tool under the old paradigm of “no time to explain, just code” was failing, and we switched to modular development, where the big monolith is broken into components that are developed separately without threatening the project as a whole. Secondly, since requests for load testing began coming from services whose traffic is not HTTP, we needed a tool able to test SMTP, POP3, FTP, DNS and other protocols. Writing phantom support for each such protocol looked expensive, so we decided to embed regular JMeter into Lunapark. Thus, with little effort, we gained support for several dozen new protocols in load testing. The embedding let us keep the usual web interface without switching to the JMeter GUI.
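Roughly speaking, such an embedding boils down to launching JMeter in non-GUI mode and aggregating its results log ourselves, so nobody ever has to open the JMeter GUI. Below is a simplified Python sketch of that idea - not Lunapark's actual code; the plan and file names are invented, and it assumes JMeter's default CSV results format with a header row.

# Simplified sketch: run JMeter headless and aggregate its JTL results log.
# Not Lunapark's real code; plan and file names are invented.
import csv
import subprocess
from collections import Counter

def run_jmeter(plan="smtp_plan.jmx", results="results.jtl"):
    # -n: non-GUI mode, -t: test plan, -l: results log
    subprocess.run(["jmeter", "-n", "-t", plan, "-l", results], check=True)
    codes = Counter()
    timings = []
    with open(results, newline="") as f:
        for row in csv.DictReader(f):            # one row per sample
            codes[row["responseCode"]] += 1      # response-code table
            timings.append(int(row["elapsed"]))  # response time, ms
    timings.sort()
    p95 = timings[int(len(timings) * 0.95) - 1] if timings else 0
    return codes, p95

if __name__ == "__main__":
    codes, p95 = run_jmeter()
    print("response codes:", dict(codes))
    print("95th percentile, ms:", p95)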
At a certain point so many requests for load testing were pouring in that it became obvious: we could not keep doubling the staff every year without taking routine and regression tests off the load testers' hands. To solve this we analyzed all of our day-to-day work and realized that different services call for different depths of testing. We came up with the following scheme of work:
- To test prototypes or experimental builds, we made a standalone version of the load generator and, so as not to pull load testers away from more mature projects, began recommending that developers or system administrators use it to debug the load tests of their prototypes themselves. Its results are compatible with the Lunapark framework, so there are no situations like “I tested my prototype with ab and it gave 1000 rps, but Lunapark shows only 500!”
- To test one-off or event-driven services, where the load can suddenly grow severalfold (Sports, the Unified State Exam, News, promo projects), we keep quickly spun-up virtual machines onto which ready-made service bundles are rolled out. Tests are run manually, with remote telemetry collected from the target under load. Sometimes the process is coordinated in dedicated chat rooms with up to 5-10 people: “tankers”, developers, managers. According to our internal statistics, 50% of manual tests (!) catch some kind of performance problem - incorrectly built indexes, “spikes” on file operations, too few workers and so on. The final results are recorded in JIRA.
An example of the online page for a test:
For analysis, optional graphs can be enabled, such as response times:
HTTP and network errors:
Times at different stages of interaction and flows:
For regression tests, or for tests whose results are accepted by developers or administrators, we created so-called automatic and semi-automatic tests. They deserve a separate discussion.
At the end of 2011 we realized that essentially all test operations can be performed by scripts and calls - in other words, by some executing mechanism. Ideologically closest to this are CI frameworks, which can build projects, run test suites, send notifications about events and issue a pass/fail verdict. We surveyed the available tools and found that there are not many open frameworks. Jenkins seemed the most convenient to extend with plug-ins, so after trying it out alongside Lunapark we put it into production testing. Using external calls to a dedicated API and the built-in scheduler, we were able to shift all of the tester's routine work onto Jenkins. Developers got the coveted “Test my service now!” button, load testers got dozens and even hundreds of tests per day that run without their involvement, and managers and system administrators got regression performance graphs from build to build. At the moment automatic tests make up about 70% of the total flow, and this share keeps growing. That saves us dozens of staff positions and lets us focus the tester's intelligence on manual and exploratory tests.
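The “button” itself is essentially one call to Jenkins's remote API. Here is a minimal sketch of such a trigger in Python; the Jenkins URL, job name, parameters and credentials are invented for illustration and do not describe our actual setup.

# Minimal sketch: trigger a parameterized Jenkins job via its remote API.
# URL, job name, parameters and credentials are invented.
import requests

JENKINS = "https://jenkins.example.com"
JOB = "load-test-my-service"

resp = requests.post(
    f"{JENKINS}/job/{JOB}/buildWithParameters",
    params={
        "TARGET_HOST": "backend-01.example.com",  # what to shoot at
        "RPS_SCHEDULE": "line(1, 100, 10m)",      # which load profile to apply
    },
    auth=("robot-user", "robot-api-token"),       # Jenkins user + API token
    timeout=30,
)
resp.raise_for_status()
print("queued:", resp.headers.get("Location"))    # Jenkins returns the queue item URL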
An attentive reader will notice that Lunapark had gradually become a modular structure: a standalone load generator, a backend for statistics and telemetry, and a separate automation framework. Looking at this, and knowing that ever since YAC'10, where baabaka talked about Lunapark, testers all over the Runet had been trolling us at every opportunity about open-sourcing it, we decided to release part of Lunapark as open source. In the summer of 2012, at one of the Yandex.Subbotniks in Moscow, we presented the load generator to the testing community. Now Yandex.Tank, with its lightweight graphics and built-in support for JMeter and ab, is developed only on the external GitHub; we answer user questions in the club and accept pull requests from outside developers.
We know that the community of load testers in the Runet is small and the available knowledge is scarce and superficial, yet interest in the topic of performance keeps growing. So we will gladly share our experience and knowledge in this area, and we promise to publish articles about load, tools and testing methods from time to time.