How Verizon and a BGP Optimizer knocked large parts of the Internet offline
A major route leak affected large parts of the Internet, including Cloudflare
What happened?
On June 24 at 10:30 UTC, the Internet had a serious breakdown: a small company in northern Pennsylvania became the preferred path for many routes that passed through Verizon (AS701), a major transit provider. It was as if a navigation app had redirected traffic from a multi-lane highway onto a narrow side street. As a result, many websites on Cloudflare, and many other providers, had access problems. This should never have happened, because Verizon was not supposed to forward these routes to the rest of the Internet. Read on to find out how it happened.
We have written about incidents like this before; they happen from time to time, but this time the effects were felt around the world. The problem was made worse by Noction's BGP Optimizer. One of its features splits received IP prefixes into smaller, more specific ones. For example, our IPv4 route 104.20.0.0/20 was split into 104.20.0.0/21 and 104.20.8.0/21. It is as if a road sign for “Pennsylvania” were replaced by two signs, “Pittsburgh, PA” and “Philadelphia, PA”. Splitting large IP blocks into smaller ones lets a network steer traffic internally, but these split routes should never have been announced publicly. When they are, trouble like this follows.
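The split described above can be reproduced with Python's standard `ipaddress` module. This is only an illustration of the prefix arithmetic, not of how the BGP Optimizer itself is implemented; the variable names are chosen for this example.

```python
import ipaddress

# The /20 from the text; splitting it one bit deeper yields two /21s.
covering = ipaddress.ip_network("104.20.0.0/20")
more_specifics = list(covering.subnets(prefixlen_diff=1))

for net in more_specifics:
    # Each /21 covers exactly half of the addresses of the original /20.
    print(net, "-", net.num_addresses, "addresses")
```

Together the two /21 prefixes cover exactly the same addresses as the /20, which is why announcing them publicly competes with the original route.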
To explain what happened next, let's first recall how the Internet works. The Internet is a network of networks called autonomous systems. Each autonomous system has its own unique identifier. These networks are interconnected using the Border Gateway Protocol (BGP). BGP ties them together into a structure over which traffic can travel, for example, from your Internet provider to a popular website on the other side of the world.
Through BGP, networks exchange routing information, namely how to reach them from wherever you are. These routes can be specific (like a particular city on a map) or general (like a whole region). This is where things went wrong.
An Internet service provider in Pennsylvania (AS33154 - DQE Communications) was using a BGP Optimizer in its network, which meant its network held many more-specific routes. Specific routes take precedence over general ones (in a navigation app, for example, directions to Buckingham Palace are more specific than directions to London).
DQE announced these specific routes to its customer (AS396531 - Allegheny Technologies Inc), and from there they reached the transit provider (AS701 - Verizon), which propagated these “optimal” routes across the Internet. They appear optimal because they are more granular and more specific.
None of this should have made it past Verizon. Although there are effective ways to protect against such failures, Verizon’s lack of filtering turned this into a major incident affecting many services, such as Amazon, Linode, and Cloudflare.
As a result, Verizon, Allegheny, and DQE were hit by a flood of users trying to reach these services through their networks. Those networks were not built for that much traffic, which caused disruptions. And even if they had had enough capacity, DQE, Allegheny, and Verizon should never have told everyone that they had the best route to Cloudflare, Amazon, Linode, and others.
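The rule that specific routes win over general ones is longest-prefix matching. Here is a minimal sketch of it, assuming a toy routing table; the `best_route` helper and the "legitimate path" and "leaked path" labels are hypothetical names used only for illustration.

```python
import ipaddress

def best_route(table, destination):
    """Return the (prefix, next_hop) entry with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table if addr in net]
    if not matches:
        return None
    # The entry with the largest prefix length is the most specific one.
    return max(matches, key=lambda entry: entry[0].prefixlen)

# Hypothetical table: the /20 is the original route, the /21s are the split ones.
table = [
    (ipaddress.ip_network("104.20.0.0/20"), "legitimate path"),
    (ipaddress.ip_network("104.20.0.0/21"), "leaked path"),
    (ipaddress.ip_network("104.20.8.0/21"), "leaked path"),
]

prefix, next_hop = best_route(table, "104.20.9.1")
print(prefix, "->", next_hop)  # 104.20.8.0/21 -> leaked path
```

Because the /21 matches more bits of the destination than the /20, a router holding both will send traffic along the /21, regardless of which path is actually better.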
BGP leak process with BGP Optimizer.
At the worst point of the incident, we observed a loss of about 15% of our global traffic.
Cloudflare traffic levels during the incident.
How could the leak have been prevented?
There are several ways.
A BGP session can be configured with a hard limit on the number of prefixes it accepts; if the number of prefixes exceeds the threshold, the router shuts the session down. Had Verizon set such a limit, none of this would have happened. For a provider like Verizon, setting it up would cost nothing. So why were there no limits? I can think of only one explanation: negligence and laziness.
Another way to prevent such leaks is IRR filtering. The IRR (Internet Routing Registry) is a distributed database of Internet routes to which networks add entries. Other network operators use these entries to build lists of permitted prefixes for their BGP sessions with those networks. Had IRR filtering been in place, none of the networks involved would have accepted the erroneous more-specific routes. Incredibly, Verizon had no such filtering at all on its BGP sessions with Allegheny Technologies, even though IRR filtering has been in use (and well documented) for more than 24 years. IRR filters would not have cost Verizon anything and would not have limited its service in any way. Once again: negligence and laziness.
Last year, we implemented and deployed RPKI, a framework that prevents exactly this kind of leak. It enables filtering based on the origin network and the prefix size. Cloudflare announces prefixes with a maximum length of /20. RPKI then indicates that more specific prefixes must not be accepted, whatever the path. For this mechanism to work, a network has to enable BGP Origin Validation. Many providers, AT&T for example, already use RPKI successfully in their networks.
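The prefix limit described above can be modelled in a few lines. This is a toy simulation, not a real BGP implementation or any vendor's configuration syntax: the `BgpSession` class, the limit of 10, and the simulated prefixes are all assumptions made for the example.

```python
class BgpSession:
    """Toy model of a BGP session with a maximum-prefix limit."""

    def __init__(self, peer, max_prefixes):
        self.peer = peer
        self.max_prefixes = max_prefixes
        self.accepted = set()
        self.established = True

    def receive(self, prefix):
        """Accept a prefix; tear the session down if the limit is exceeded."""
        if not self.established:
            return False
        self.accepted.add(prefix)
        if len(self.accepted) > self.max_prefixes:
            # Limit exceeded: drop everything learned from this peer and close.
            self.accepted.clear()
            self.established = False
            return False
        return True

# Assumed values: a customer session expected to carry only a handful of prefixes.
session = BgpSession(peer="AS396531", max_prefixes=10)
for i in range(50):  # simulate a sudden flood of 50 distinct prefixes
    session.receive(f"10.{i}.0.0/16")

print(session.established, len(session.accepted))  # False 0
```

The point of the design is that a customer who normally announces a few prefixes cannot suddenly inject a large number of routes: the session closes as soon as the threshold is crossed, and nothing learned from it is kept.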
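As a rough sketch of the idea, an IRR-based filter boils down to building an allow-list from registered route entries and rejecting everything else. The registry entries below are invented for illustration (documentation prefixes and a private-use AS number), and the helper names are hypothetical; real filters are generated from actual IRR data with dedicated tooling.

```python
import ipaddress

# Hypothetical IRR route entries registered for a customer: (prefix, origin AS).
irr_route_objects = [
    ("192.0.2.0/24", "AS64500"),
    ("198.51.100.0/24", "AS64500"),
]

def build_prefix_list(route_objects):
    """Turn registered route entries into a set of permitted prefixes."""
    return {ipaddress.ip_network(prefix) for prefix, _origin in route_objects}

def filter_announcements(announcements, prefix_list):
    """Split announcements into those on the prefix list and those that are not."""
    accepted, rejected = [], []
    for prefix in announcements:
        net = ipaddress.ip_network(prefix)
        (accepted if net in prefix_list else rejected).append(prefix)
    return accepted, rejected

allowed = build_prefix_list(irr_route_objects)
accepted, rejected = filter_announcements(
    ["192.0.2.0/24", "104.20.0.0/21", "104.20.8.0/21"], allowed
)
print("accepted:", accepted)  # accepted: ['192.0.2.0/24']
print("rejected:", rejected)  # rejected: ['104.20.0.0/21', '104.20.8.0/21']
```

With a filter of this kind on a customer session, the split /21 routes would be dropped simply because the customer never registered them.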
If Verizon had used RPKI, they would have seen that the advertised routes were invalid, and their routers would have rejected them automatically.
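The origin-validation check can be sketched as follows, in the spirit of RFC 6811. The single authorization entry is an assumption made for this example: it pairs the /20 from this post with a maximum length of 20 and AS13335 as the origin (Cloudflare's AS number, which is not stated elsewhere in this post). The `validate` helper is hypothetical and far simpler than a real validator.

```python
import ipaddress

# Illustrative route origin authorization: (prefix, max_length, origin ASN).
ROAS = [(ipaddress.ip_network("104.20.0.0/20"), 20, 13335)]

def validate(prefix, origin_asn, roas=ROAS):
    """Classify a route as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_asn in roas:
        # A ROA covers the route if the route lies inside the ROA prefix.
        if route.version == roa_prefix.version and route.subnet_of(roa_prefix):
            covered = True
            # Valid only if the length is within max_length and the origin matches.
            if route.prefixlen <= max_length and origin_asn == roa_asn:
                return "valid"
    return "invalid" if covered else "not-found"

print(validate("104.20.0.0/20", 13335))  # valid
print(validate("104.20.0.0/21", 13335))  # invalid
print(validate("192.0.2.0/24", 64500))   # not-found
```

Note that the /21 is invalid even with the correct origin AS: it is longer than the maximum length allowed, which is exactly the property that catches routes split by an optimizer.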
Cloudflare advises all network operators to deploy RPKI right now!
Route leak prevention using IRR, RPKI and prefix limits.
All of these recommendations are well described in MANRS (Mutually Agreed Norms for Routing Security).
How the problem was resolved
The Cloudflare network team reached out to the networks involved, AS33154 (DQE Communications) and AS701 (Verizon). That was not easy, perhaps because it was early morning on the US East Coast when the incident began.
Screenshot of a letter to Verizon.
One of our network engineers quickly got in touch with DQE Communications, and after a short delay we were put through to someone who could fix the problem. With our support over the phone, DQE stopped announcing “optimized” routes to Allegheny Technologies Inc. We are grateful for their help. After that, everything stabilized and returned to normal.
Screenshot of attempts to contact DQE and Verizon Support Services
Unfortunately, despite all our attempts to reach Verizon by phone and email, at the time of writing (more than 8 hours after the incident) no one has responded, and we do not know whether they are doing anything about it.
We at Cloudflare do not want this to happen again, but unfortunately very little is being done to prevent it. It is time for the industry to take more effective measures to secure routing, for example with systems such as RPKI. We hope that large providers will follow the lead of Cloudflare, Amazon, and AT&T and start validating routes. This applies especially to you, Verizon. We are still waiting for a reply.
Although what happened was outside our control, we apologize for the interruption in service. We care about our customers, and our engineers in the US, UK, Australia, and Singapore were online within minutes of the problem being identified.