Production Tests: Chaos Automation Platform at Netflix
Our June Test in Production meetup was devoted to chaos engineering. Lead Software Engineer Norah Jones opened with how Netflix runs tests in production.
“Chaos engineering ... is running production experiments to find vulnerabilities in the system before they make the service unusable for customers. At Netflix, we do it with a tool called ChAP ... [it] catches vulnerabilities and allows us to inject failures into services in production. These failures validate the assumptions about these services before they lead to full-scale outages.”
Watch the presentation (or read the transcript) on how her team helps its users, Netflix engineers, safely test in production and proactively identify vulnerabilities in their systems.
I am very glad to be here today.
Netflix makes extensive use of tests in production. We run them through chaos engineering, and we recently renamed our team Resilience Engineering, because chaos engineering is only one of the means of achieving overall resilience. This is what I’m going to talk about today.
Our goal is to increase uptime by proactively searching for vulnerabilities in services. We do this through production experiments. The team is confident that a certain class of vulnerabilities and problems can only be detected on real traffic.
First of all, you need to take care of safety and monitoring, otherwise you will not be able to run proper tests in production. Such tests can be scary, and if they are, you should listen to that voice and find out why. Maybe it is because you do not have good safety mechanisms, or because you do not have good monitoring. In our tools, we care a great deal about both.
If we put chaos engineering in one sentence, it is the discipline of running production experiments to find vulnerabilities in the system before they make the service unusable for customers. At Netflix, we do this with a tool called ChAP, which stands for Chaos Automation Platform. ChAP catches vulnerabilities and allows users to inject failures into services in production. These failures validate the users' assumptions about those services before they lead to full-scale outages.
I will explain how the platform works at a high level. This is a hypothetical set of microservice dependencies. There is a proxy. It sends a request to service A, which then fans out to services B, C and D, and there is also a persistence layer. Service D accesses Cassandra, and service B accesses the cache.
I am going through this quickly, because the substance comes later. We want to make sure that service B is resilient to a failure of its cache. The user logs in to the ChAP interface and selects service B as the service to observe and the cache as the service to fail. ChAP then clones service B into two replicas, which we use as the control and experiment clusters; in a way they work like A/B tests or canary tests. These replicas are much smaller than service B. We send only a very, very small percentage of customers to these clusters, because we obviously do not want a full-scale outage. We calculate this percentage from the number of users on the service at that moment.
ChAP then instructs the failure injection system to tag requests that match our criteria. This is done by adding information to the request headers. Two sets of tags are created. The first set carries instructions to fail the call and to route the request to the experiment (canary) replica. The second carries only instructions to route the request to the control replica.
When the RPC client in service A sees the routing instructions on a request, it sends that traffic to the control cluster or the experiment cluster. Then the failure injection system, at the RPC layer of the experiment cluster, sees that the request is flagged for failure and returns a failed response. The experiment cluster, seeing a failed response from the cache, executes its failure-handling code. We do this on the assumption that the service is fault tolerant, right? But sometimes we see that it is not. From the point of view of service A, everything looks like normal behavior.
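The tagging and injection flow described above can be sketched roughly as follows. This is a minimal illustration, not ChAP's implementation: the header names, the `Request` class and every function name here are hypothetical, since the talk does not give them.

```python
from dataclasses import dataclass, field

# Hypothetical header names; the talk does not say what ChAP actually uses.
ROUTE_HEADER = "x-chaos-route"
FAIL_HEADER = "x-chaos-fail"


@dataclass
class Request:
    customer_id: int
    headers: dict = field(default_factory=dict)


def tag_request(request, experiment_ids, control_ids, fail_target):
    """Add routing tags, plus a failure tag for the experiment group only."""
    if request.customer_id in experiment_ids:
        request.headers[ROUTE_HEADER] = "experiment"
        request.headers[FAIL_HEADER] = fail_target
    elif request.customer_id in control_ids:
        request.headers[ROUTE_HEADER] = "control"
    return request


def pick_cluster(request):
    """RPC client in service A: route on the tag, default to the main cluster."""
    return request.headers.get(ROUTE_HEADER, "main")


class CacheUnavailable(Exception):
    pass


def call_cache(request, key):
    """RPC layer in front of the cache: fail the call if the request is flagged."""
    if request.headers.get(FAIL_HEADER) == "cache":
        raise CacheUnavailable(key)
    return f"cached:{key}"


def service_b(request, key):
    """Service B with a fallback path that handles cache failures."""
    try:
        return call_cache(request, key)
    except CacheUnavailable:
        return f"fallback:{key}"
```

In this sketch only customers in the experiment group ever see the injected failure; everyone else, including the control group, gets the normal cache response.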
We monitor chaos experiments closely, because they can go really badly. When Netflix first started running such experiments, we did not have good controls. We would inject an artificial failure and sit in a room, crossing our fingers and checking that everything was still working. Now we pay much more attention to safety.
We look at key business metrics. One of them is SPS, stream starts per second, that is, the number of video streams started each second. If you think about it, the most important thing for Netflix's business is that users can start any show any time they want.
On the graphs you see a real experiment. They show the difference in SPS between the experiment and control clusters during the test. You may notice that the graphs deviate strongly from each other, which should not happen, because the same percentage of traffic is sent to both clusters.
For this reason, the test uses automated canary analysis. It signals when the graphs have diverged significantly. When that happens, the test is stopped immediately so that people can keep using the site normally. From the user's point of view, it looks more like a short-lived glitch.
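The abort decision can be illustrated with a very simplified comparison of the two SPS series. This is only a sketch: real automated canary analysis uses proper statistical tests, and the function name, the 5% threshold and the minimum sample count below are all assumptions made for the example.

```python
def should_abort(control_sps, experiment_sps, max_relative_drop=0.05, min_samples=3):
    """Return True when experiment SPS falls too far below control SPS.

    Both arguments are lists of per-interval SPS readings of equal length.
    The threshold and minimum sample count are illustrative assumptions.
    """
    if len(control_sps) != len(experiment_sps):
        raise ValueError("series must have the same length")
    if len(control_sps) < min_samples:
        return False  # not enough data to judge yet
    control_mean = sum(control_sps) / len(control_sps)
    experiment_mean = sum(experiment_sps) / len(experiment_sps)
    if control_mean == 0:
        return False  # no traffic in the control cluster, nothing to compare
    relative_drop = (control_mean - experiment_mean) / control_mean
    return relative_drop > max_relative_drop
```

Because both clusters receive the same share of traffic, comparing them directly cancels out ordinary fluctuations in load, which is why a control cluster is used instead of a fixed SPS threshold.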
We have many other safeguards. We limit the amount of test traffic in each region, so we are not running experiments only in US West 2. We run them everywhere and limit the number of experiments that can run in a region at a time. Tests run only during working hours, so we do not wake engineers up if something goes wrong. If a test fails, it cannot be started again automatically until someone explicitly resolves it by hand and confirms: “Hey, I know that the test did not pass, but I fixed everything I needed.”
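These safeguards amount to a set of checks that run before an experiment starts. A minimal sketch, assuming a hypothetical experiment record (a plain dict with `failed` and `resolved` flags), a 9:00 to 17:00 weekday window and a limit of one experiment per region; none of these specifics come from the talk:

```python
from datetime import datetime


def may_start(experiment, region, now, running_by_region,
              max_per_region=1, open_hour=9, close_hour=17):
    """Return (allowed, reason) for starting an experiment in a region.

    `now` is assumed to be local time for the team that owns the service.
    """
    # A failed experiment stays blocked until a person marks it resolved.
    if experiment.get("failed") and not experiment.get("resolved"):
        return False, "previous failure not resolved"
    # Only run on weekdays during working hours, so nobody gets paged at night.
    if now.weekday() >= 5 or not (open_hour <= now.hour < close_hour):
        return False, "outside working hours"
    # Cap the number of concurrent experiments per region.
    if running_by_region.get(region, 0) >= max_per_region:
        return False, "region at experiment limit"
    return True, "ok"
```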
You can apply custom properties to clusters. This is useful if the service is sharded, as many Netflix services are. In addition, you can inject failures based on device type. If we suspect problems on Apple devices or on a certain type of TV, we can run tests specifically for them.
ChAP has found a lot of bugs. Here is one of my favorites. We ran an experiment to test a service's fallback path, which is crucial for its availability, and found a bug there. The problem was fixed before it led to an availability incident. This is a really interesting case, because the fallback path was not executed often, so the user did not really know whether it worked correctly, and we were able to simulate it. We actually induced a failure in the service and checked whether it took the fallback path and whether that path worked properly. In this case, the user thought the service was non-critical, or secondary, but in fact it was critical.
Here is another example. We ran an experiment to reproduce a problem in the sign-up flow that appeared at night on some servers. Something strange was happening with the service. We were able to reproduce the problem after injecting a delay of 500 milliseconds. During the test, the problem showed up in the logs uploaded to the Big Data Portal. That helped us understand why sign-up did not work in some cases. Only the ChAP experiment made it possible to see what was happening and why.
Setting up a ChAP test requires a lot of information. You need to work out the right injection point. Teams need to decide whether they want failure or latency. It all depends on the injection point. You can fail Cassandra, Hystrix (our fallback system), an RPC service, an RPC client, S3, SQS, or our cache, or add latency to any of them. Or do both. You can even combine different experiments.
What you need to do is get together with the service team and come up with a good test, and that takes a lot of time. When setting up an experiment, you also have to define the ACA (Automated Canary Analysis) configurations.
We had several ready-made ACA configurations: one ChAP configuration for SPS, one for system metrics, one that checked RPS failures, and one that made sure the service was actually working and injecting failures as it should. We realized that designing a test can be very time-consuming, and that is how it turned out: not many tests were being created. It is hard for a person to keep in mind everything a good experiment needs, so we decided to automate part of this in ChAP. We looked at the signals: where calls come from and who makes them, timeout files, retries. It became clear that this information lives in different places and had to be aggregated.
We brought this analysis into ChAP, where the information is much easier to work with, in a tool called Monocle. Now all the information about an application and its cluster can be studied in one place. Here, each row represents a dependency, and these dependencies are the breeding ground for chaos engineering experiments.
We gathered all this information in one place in order to design experiments, without realizing that the aggregation would be very useful in itself, so that was an interesting side effect. You can go in and actually see the anti-patterns associated with a particular service. For example, you may discover a dependency that was not considered critical but has no fallback path. Obviously, that makes it critical. People could see timeout mismatches and retry mismatches. We use this information to assess how critical a particular type of experiment is and feed it into an algorithm that sets priorities.
Each line represents a dependency, and these lines can be expanded. Here is an interesting example.
Here, the blue line at the top shows someone's timeout, and the purple line at the bottom shows the normal execution time. As you can see, it is very, very far from the timeout. Much of this information was not available before. What happens if we test just below the timeout? What do you think? Will it pass? This is an interesting question. We try to give users this level of detail before they run tests, so that they can draw conclusions and change their settings.
I want to play a little game. There is a vulnerability in this Netflix service, try to detect it. Take a second and see.
To give you some context, the remote Hystrix command wraps both sample-rest-client and sample-rest-client.GET. The Hystrix timeout is set to 500 milliseconds. Sample-rest-client.GET has a 200 ms timeout with one retry, which is good: the total is 400 milliseconds, which fits within the Hystrix limit. The second client has timeouts of 100 and 600 with one retry.
In this case, the retry can never complete within the timeout of the Hystrix wrapper; that is, Hystrix gives up on the request before the client can receive a response. This is where the vulnerability lies. We surface this information to users. Interestingly, most of the logic behind these settings lives in different places, so previously users could not compare them. They thought everything was working fine, but here is a bug.
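The arithmetic behind this example can be checked mechanically. The sketch below is an illustration, not Monocle's code: the function names are hypothetical, and it assumes that a client's timeouts add up per attempt (for example, connect plus read) and that each retry repeats the full attempt.

```python
def worst_case_ms(timeouts_ms, retries):
    """Worst-case time for one call: every attempt runs into every timeout."""
    return sum(timeouts_ms) * (1 + retries)


def find_timeout_mismatches(hystrix_timeout_ms, clients):
    """Return (client name, worst-case ms) for clients that exceed the wrapper.

    `clients` maps a client name to (tuple of timeouts in ms, retry count).
    """
    mismatches = []
    for name, (timeouts_ms, retries) in clients.items():
        total = worst_case_ms(timeouts_ms, retries)
        if total > hystrix_timeout_ms:
            mismatches.append((name, total))
    return mismatches


# Values from the example above; which client name carries which settings
# is an assumption based on the order in which the talk mentions them.
clients = {
    "sample-rest-client.GET": ((200,), 1),     # 200 ms * 2 attempts = 400 ms
    "sample-rest-client": ((100, 600), 1),     # (100 + 600) ms * 2 attempts = 1400 ms
}

mismatches = find_timeout_mismatches(500, clients)
```

Under these assumptions the second client's worst case is 1400 ms against a 500 ms Hystrix timeout, and even its first attempt (700 ms) can outlast the wrapper, so the configured retry can never finish.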
Why did this happen? Of course, it is easy for a developer to see the conflict and change the timeout, isn't it? But we want to find the cause. We can change the timeout, but how do we make sure this does not happen again? We also help find out the reasons.
We also use Monocle to create tests automatically. Creating an experiment requires many kinds of input from the user. We take all of that and automate the creation of such tests so that users do not have to bother. We automatically create and prioritize Hystrix experiments and RPC experiments with latency and with failures caused by latency. ACA configurations are added by default: SPS, system metrics and request statistics. The experiments run automatically, and priorities are assigned to them. Here is how the algorithm works at a high level. We use RPS statistics. We use the number of retries and the related Hystrix commands. The whole set is weighted appropriately.
In addition, we take into account the number of commands without fallback paths and any curated impact that a customer adds to their dependency. Curated impact means a known effect on login, sign-up or SPS. We weigh that impact and do not run the experiment if the effect is negative. Then the tests are ranked and run in decreasing order of criticality. The higher the criticality score, the earlier and more often the test runs.
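A ranking of this kind could look roughly like the sketch below. The talk names the inputs (RPS, retries, commands without fallbacks, curated impact) but not the formula, so the linear score and every weight here are made-up placeholders, and the `Dependency` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    name: str
    rps: float                       # requests per second to this dependency
    retries: int                     # configured retry count
    commands_without_fallback: int   # Hystrix commands lacking a fallback path
    has_curated_impact: bool = False # known impact on login, sign-up or SPS


def criticality(dep, w_rps=1.0, w_retries=10.0, w_no_fallback=50.0):
    """Weighted score; the weights are illustrative, not Netflix's values."""
    return (w_rps * dep.rps
            + w_retries * dep.retries
            + w_no_fallback * dep.commands_without_fallback)


def plan_experiments(deps):
    """Drop dependencies with known customer impact, rank the rest."""
    runnable = [d for d in deps if not d.has_curated_impact]
    return sorted(runnable, key=criticality, reverse=True)


deps = [
    Dependency("cache", rps=100, retries=1, commands_without_fallback=0),
    Dependency("auth", rps=500, retries=2, commands_without_fallback=1,
               has_curated_impact=True),
    Dependency("ratings", rps=20, retries=0, commands_without_fallback=2),
]
order = [d.name for d in plan_experiments(deps)]
```

With these placeholder weights, a low-traffic dependency with two commands lacking fallbacks outranks a busier one that has fallbacks everywhere, and the dependency with curated impact is never scheduled.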
Ironically, Monocle gave us feedback that lets us run fewer tests in production. We ran so many tests that a feedback loop emerged: we saw connections between tests. Now you can look at certain configuration files and spot certain anti-patterns. Even without running tests, this information tells you what exactly will cause a failure, which we did not understand before.
This led to a new level of safety. Previously, a failed experiment simply had to be marked as resolved. It still has to be marked as resolved before it can run again, but now we can also explicitly add curated impact to a dependency. The user goes into Monocle and indicates that this factor definitely affects login, and this one affects SPS. We are also working on a feedback loop so that such curated impact is added automatically when an experiment fails.
So Monocle in ChAP is an important tool that gathers all the information in one place, automatically generates experiments, automatically prioritizes them, and finds vulnerabilities before they lead to full-scale outages. To summarize, it is important to remember why we do chaos engineering and run all these experiments in production. We do it to understand how customers use the service and to never lose sight of them. You want to give people the best possible experience, so monitoring and safety are paramount. Netflix should always be able to play video.