thevar1able March 4, 2013 at 00:20

Why lay CloudFlare

Transfer

Today at 09:47 UTC, CloudFlare actually dropped out of the Internet. The fall affected all CloudFlare services, including DNS and proxy services. Everyone who tried to open any site using CloudFlare services during the fall received a DNS error. Ping and traceroute to CloudFlare hosts also generated a “No Route to Host” error.

The cause of the fall was an error on the border routers. CloudFlare now has 23 data centers around the world. They are connected to the Internet through routers. These routers made it so that packets sent to us from anywhere on the Internet usually reached our servers. When the router stops working, the network behind it ceases to be accessible to the Internet.

We regularly turn off one or more of our wonderful routers, for example, during any work. Due to the fact that we use Anycast, traffic is redirected to the nearest data center. However, this morning we ran into an error that all our routers laid out.

Flowspec

All edge routers that were error prone were from Juniper. One of the reasons we like Juniper routers is to support the Flowspec protocol. It allows you to effectively extend routing rules to a large number of routers. Here at CloudFlare, we are constantly updating our routing rules. This is necessary to protect against attacks and redirect traffic for the fastest possible service.

This morning, we noticed a DDoS attack aimed at one of our customers. The attack was aimed exclusively at the DNS server. We have a special tool for creating attack signatures that are equally well understood both for automated systems and for employees. Typically, these signatures are used to create routing rules that will reduce the number of “bad” requests.
In this case, our attack profiler determined that the “bad” packets were between 99.971 and 99.985 bytes long. This is rather strange, because the length of a regular packet does not exceed 600 bytes, and the largest are up to 1,500 bytes. A restriction of 4.470 bytes was set in our network, but the profiler said that the attacker's packets have exactly this length.

Fatal rule

Someone from our team is always monitoring the network, 24/7. As usual, one of the operators took the profiler’s output and added a rule according to which all packets from 99.971 to 99.985 bytes in size should “drop”. This is what it looked like in Junos, the Juniper operating system:

Flowspec adopted this rule and distributed it across the border network. In theory, no packet should have come up to this rule, because there could not be such large packets on the network. In fact, all routers adopted the rule and began to consume all available RAM until they hung.

In the usual case, the router should automatically reboot, but then for some reason this did not happen. We were also unable to access through the management ports. Even if some kind of data center suddenly rose itself, it immediately went back down, because all the traffic of the entire network began to go through it.

Sam Bowne, a professor at City College San Francisco, using BGPlay got this video, which shows how the routers fall one by one:

Incident response

The CloudFlare network team has been aware of the incident from the start. The reason for the crash of the routers was unclear, but it was obvious that the packets could not find the way to our network. We were able to access several routers and found out that they were falling because of that rule. We deleted it and then called the operators in other data centers to reboot the routers.

23 CloudFlare data centers are located in 14 countries, so the response time was somewhere around 30 minutes. At 10:49 UTC, all CloudFlare services were already running. We continue to investigate cases that our customers are still complaining about. Usually they are connected with the fact that a bad DNS response was cached.

We have already contacted Juniper specialists to find out if they know this bug or our first case. We have to conduct several Flowspec tests and find out whether it is possible to limit the application of the rules to several data centers. We also plan to return money to accounts protected by SLA. We are categorically against the arbitrarily short time of unavailability of services and the CloudFlare team apologizes for this case.

Tags:

Why lay CloudFlare

Flowspec

Fatal rule

Incident response

Also popular now: