Unsuccessful Software Deployment Crashes Cloudflare Service
This is a short interim post; a full analysis with complete details of what happened today will follow.
Today, for about 30 minutes, visitors to Cloudflare sites saw 502 errors caused by a sharp spike in CPU utilization across our network. The spike was the result of a bad software deployment. We rolled the change back, the service has returned to normal operation, and all domains using Cloudflare are back to normal traffic levels.
This was not an attack, and we sincerely apologize for the incident. Our engineers are already conducting a detailed post-mortem to determine what must change so that an incident like this cannot happen again.
Posted at 20:09 UTC:
Today at 13:42 UTC a failure occurred in our network that caused visitors to Cloudflare-proxied domains to see 502 errors ("Bad Gateway"). The cause was the deployment of a single misconfigured rule in the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF managed rules.
The new rules were intended to improve the blocking of inline JavaScript used in attacks. They were deployed in simulation mode, in which matches are identified and logged without blocking any customer traffic, allowing us to measure false positive rates and confirm that the new rules behave correctly before they are rolled out fully.
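As a minimal sketch (not Cloudflare's actual WAF implementation), the key property of simulation mode is that the rule's pattern is still executed against every request; only the resulting action changes. The class and method names below are purely illustrative:

```python
import re
import logging

# Illustrative sketch only, not Cloudflare's WAF code: a rule in "simulate"
# mode still compiles and runs its pattern on every request, so a pathological
# pattern consumes CPU whether or not anything is blocked.
class WafRule:
    def __init__(self, rule_id: str, pattern: str, mode: str = "simulate"):
        self.rule_id = rule_id
        self.regex = re.compile(pattern)   # compiled and executed in both modes
        self.mode = mode                   # "simulate" logs matches; "block" rejects requests

    def evaluate(self, request_body: str) -> bool:
        """Return True if the request should be blocked."""
        if self.regex.search(request_body):        # CPU cost is paid here regardless of mode
            if self.mode == "simulate":
                logging.info("rule %s matched (logged only, not blocked)", self.rule_id)
                return False
            return True
        return False

rule = WafRule("example-js-rule", r"<script[^>]*>", mode="simulate")
print(rule.evaluate("<script>alert(1)</script>"))   # False: match is logged, traffic passes
```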
Unfortunately, one of these rules contained a regular expression that caused CPU usage to spike to 100% on our machines worldwide. That spike is what produced the 502 errors our users saw, and at its worst traffic dropped by 82%.
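This preliminary post does not include the offending expression itself. As an illustration of how a single regular expression can pin a backtracking engine at 100% CPU, the sketch below uses a classic catastrophic pattern, `^(a+)+$` (an assumption for demonstration, not the rule from the incident), whose match time grows exponentially on inputs that almost match but ultimately fail:

```python
import re
import time

# Illustrative pattern only; this is NOT the rule involved in the incident.
# Nested unbounded quantifiers force a backtracking engine to try
# exponentially many ways to split the input before it can report failure.
PATTERN = re.compile(r"^(a+)+$")

for n in (18, 20, 22, 24):
    payload = "a" * n + "!"                 # trailing "!" guarantees the match fails
    start = time.perf_counter()
    PATTERN.match(payload)
    elapsed = time.perf_counter() - start
    print(f"n={n:2d}  match time={elapsed:.3f}s")  # roughly doubles with each extra character
```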
The graph below shows the CPU load jump on one of our PoPs:
This was the first time we had faced complete exhaustion of CPU resources across the network, and it caught us entirely by surprise.
We deploy software to our network constantly and have built automated test suites and a staged rollout procedure precisely to prevent incidents like this. Unfortunately, these WAF rules were deployed globally in a single step, which caused today's outage.
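For illustration, a staged rollout of this kind typically pushes a change to progressively larger slices of the network, with a health check gating each step. The stage names and error threshold below are assumptions for the sketch, not Cloudflare's actual procedure:

```python
# Hypothetical stages and error budget, for illustration only; not Cloudflare's
# real rollout stages or thresholds.
STAGES = [("canary PoP", 0.01), ("one region", 0.10), ("half of network", 0.50), ("global", 1.00)]
MAX_ERROR_RATE = 0.001   # abort if the observed error rate exceeds this

def observed_error_rate(stage: str) -> float:
    """Placeholder: a real system would read live 5xx and CPU metrics here."""
    return 0.0

def rollout(change_id: str) -> bool:
    for stage, traffic_fraction in STAGES:
        print(f"deploying {change_id} to {stage} ({traffic_fraction:.0%} of traffic)")
        if observed_error_rate(stage) > MAX_ERROR_RATE:
            print(f"error budget exceeded at {stage}; rolling back {change_id}")
            return False
    return True

if __name__ == "__main__":
    rollout("waf-managed-rules-update")
```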
At 14:02 UTC we understood what was happening and decided to disable the WAF managed rule sets entirely, which immediately brought CPU usage back to normal and restored traffic. The disable was completed at 14:09 UTC.
We then reviewed the offending pull request, rolled back the specific rules, tested the change to be completely certain we had the correct fix, and re-enabled the WAF managed rule sets at 14:52 UTC.
We know how much damage an incident like this causes our users. Our testing process failed us in this case, and we are already improving it and tightening our deployment process to make sure mistakes like this do not happen again.