A quality takeoff after the "fall", or why MnogoByte went "down"
On August 23, 2012, from 02:00 to 16:30, part of the MnogoByte network was not functioning correctly, which left roughly a third of the company's customers with a partial or complete loss of connectivity. To dispel the rumors while the trail is still hot, we decided to describe what happened and what has been done so that it does not happen again.
Let me start with some background. About a year ago, traffic began to grow rapidly across the entire MnogoByte network. This was driven by reasonable tariffs for bandwidth and traffic, good connectivity within Russia, and an overall increase in the amount of client equipment hosted in our data centers. With the increased traffic, the Cisco Catalyst 6500 and 7600 series switches and routers we had installed in 2007 and 2008 became too small for further growth. The reason is simple: 2×20 Gbit/s of fabric capacity per slot, and therefore only 4 full-speed 10 Gbit/s ports per slot, is the ceiling. So in early 2012 we planned to move the network core to Juniper routers and to upgrade the network in general: to rebuild the “ring” connecting our nodes at MMTS-9, MMTS-10 and the data centers, and to offer customers connectivity at 10 Gbit/s.
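For readers who like to see the arithmetic, here is a quick back-of-the-envelope sketch (Python, purely for illustration) of why a slot with 2×20 Gbit/s of fabric capacity tops out at four line-rate 10 Gbit/s ports:

```python
# Back-of-the-envelope check of the slot limit cited above; the numbers
# describe the fabric generation of the classic Catalyst 6500/7600.
FABRIC_CHANNELS_PER_SLOT = 2
CHANNEL_SPEED_GBPS = 20
PORT_SPEED_GBPS = 10  # 10GE customer/uplink ports

slot_capacity_gbps = FABRIC_CHANNELS_PER_SLOT * CHANNEL_SPEED_GBPS
line_rate_ports = slot_capacity_gbps // PORT_SPEED_GBPS

print(f"per-slot fabric capacity: {slot_capacity_gbps} Gbit/s")  # 40 Gbit/s
print(f"full-speed 10GE ports per slot: {line_rate_ports}")      # 4
```

A fifth 10GE port in such a slot would have to oversubscribe the fabric, which is exactly what growing traffic does not forgive.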

Juniper MX960 3D
Having obtained the necessary equipment (DWDM multiplexers, DWDM SFP+ transceivers, 10 Gbit/s switches, Juniper routers), we migrated the “ring” to the new hardware. On July 5, 2012 we successfully replaced the router at our MMTS-9 site, and almost no data center clients noticed it. The work was nerve-wracking all the same: it was the core router, after all!
For August 23, 2012 we planned another router replacement. This time the task was much harder: more than a dozen access switches and about 130 client connections terminated directly on the router had to be moved. We prepared quite thoroughly: a separate switch was inserted into our ring, and customers were moved onto it in several stages, with their routing handled by another router. On the night of August 23 we planned to move the access switches onto that same “piece of the ring” and hand the customers over to other routers, so total customer downtime would be under an hour. The 130 direct connections, however, had nowhere to go; they would have to wait until the new Juniper came online. Note for the reader: those 130 connections include not only 1 Gbit/s ports but 10 Gbit/s ports as well.

Juniper EX8216
At 02:00 we began the work according to plan, transferring client routing to another router and moving the access switches. However, once the connections had been transferred and dismantling of the Cisco Catalyst router had begun, strange problems started on the backup switch: it was running out of memory, and its CPU was periodically under heavy load. We tried to fix this, but only partially succeeded. As a result, some of the access switches were left without network access, and we could not roll everything back. We are still investigating, because that very switch, in that very configuration, had previously passed about 15 Gbit/s of traffic without straining at all.
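For the curious: keeping an eye on a switch's CPU and free memory during such work is typically done over SNMP. Below is a minimal sketch of such a poll in Python with pysnmp. This is not our actual tooling: the address and community string are placeholders, and the OIDs are the commonly used CISCO-PROCESS-MIB and CISCO-MEMORY-POOL-MIB ones, quoted from memory, so verify them against your own MIBs.

```python
# Minimal SNMP poll of a Cisco switch's CPU load and free memory.
# Placeholder host/community; OIDs quoted from memory -- verify first.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

OIDS = {
    "cpu_5min_percent": "1.3.6.1.4.1.9.9.109.1.1.1.1.8.1",  # cpmCPUTotal5minRev
    "mem_free_bytes":   "1.3.6.1.4.1.9.9.48.1.1.1.6.1",     # ciscoMemoryPoolFree
}

for name, oid in OIDS.items():
    err_ind, err_stat, err_idx, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),       # SNMPv2c
        UdpTransportTarget(("192.0.2.10", 161)),  # placeholder switch address
        ContextData(),
        ObjectType(ObjectIdentity(oid)),
    ))
    if err_ind or err_stat:
        print(f"{name}: query failed ({err_ind or err_stat.prettyPrint()})")
    else:
        print(f"{name}: {var_binds[0][1]}")
```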
Because of the access switch problems, our website and our telephony were temporarily knocked out as well, which is why customers complained that they could not reach us. The problem was fixed fairly quickly, though, and both came back up.
The new Juniper came online at the scheduled time, and the switchover of the access switches and customer connections began. Along with the connections came new problems, which we are also still studying. For example, a bridging “loop” formed that was not caught immediately; tracking it down took extra time, and the loop also dragged down some clients in our other data centers.
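The post above does not detail how exactly the loop was caught, but for readers wondering how one even confirms a suspected layer-2 loop: one blunt trick is to send a broadcast frame with a unique payload and see whether copies of it come back to the port you sent it from. A sketch of that idea in Python follows (Linux-only, needs root; the interface name is a placeholder, and this is an illustration rather than production tooling):

```python
import os
import socket
import struct
import time

IFACE = "eth0"                        # placeholder: port facing the suspect segment
MAGIC = b"LOOPPROBE" + os.urandom(8)  # unique payload to recognize our frame

# Raw layer-2 socket (Linux only, requires root).
sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(0x0003))
sock.bind((IFACE, 0))
our_mac = sock.getsockname()[4]

frame = (b"\xff" * 6                  # broadcast destination MAC
         + our_mac                    # our source MAC
         + struct.pack("!H", 0x88B5)  # IEEE "local experimental" EtherType
         + MAGIC)
sock.send(frame)

sock.settimeout(0.5)
copies = 0
deadline = time.time() + 3.0
while time.time() < deadline:
    try:
        pkt, addr = sock.recvfrom(2048)
    except socket.timeout:
        continue
    if addr[2] == socket.PACKET_OUTGOING:
        continue                      # the kernel's echo of our own send
    if MAGIC in pkt:
        copies += 1

print(f"{copies} copies of the probe came back; any at all means something "
      f"is forwarding our own broadcast back to us -- most likely a loop")
```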
While connecting the access switches, it also turned out that the Cisco Catalysts really did not want to be friends with the Juniper gear, and we had to do the proverbial dance with a tambourine at the console of each switch and reconfigure it. By 11:00, five hours behind schedule, most of the data center's clients were up and running without problems.
But that was not the end of the day's problems. The differing worldviews of Juniper Networks, Extreme Networks and Cisco Systems regarding the seemingly fully standardized STP and MPLS protocols left some customers without connectivity. We chased a glitch with the passage of large packets until 16:30, by which time slightly fewer than 100 connections to the data center's access switches were still affected. Some client switches connected to our access switches turned out to be misconfigured as well and were affecting our network. After clarifying conversations with those customers, reconfiguration of their equipment and the installation of a number of filters on those ports, the problem was finally resolved, and at around 18:30 the last affected customers regained network access without further issues.
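A few words of context on the large-packet glitch. A classic way vendors disagree here is MTU accounting: every MPLS label adds 4 bytes on the wire, and platforms even differ on whether the configured “mtu” includes the Ethernet header (Junos counts it, IOS does not), so a packet that fits on one box can silently die on the next. The sketch below just shows the arithmetic with standard header sizes; the post does not state the exact values from our incident, so treat it as an illustration, not a diagnosis.

```python
# Why full-size packets can vanish once MPLS labels appear: each label
# stack entry adds 4 bytes. Standard header sizes; not measured values
# from the incident.
MPLS_LABEL = 4  # bytes per MPLS label stack entry
VLAN_TAG = 4    # bytes per 802.1Q tag, if present

def required_payload_mtu(ip_mtu: int, labels: int, vlan_tags: int = 0) -> int:
    """Payload MTU an interface must accept to carry an IP packet of
    ip_mtu bytes under `labels` MPLS labels (Ethernet header excluded,
    IOS-style accounting)."""
    return ip_mtu + labels * MPLS_LABEL + vlan_tags * VLAN_TAG

for labels in range(4):
    need = required_payload_mtu(1500, labels)
    verdict = "fits" if need <= 1500 else "dropped"
    print(f"{labels} label(s): needs {need} bytes -> {verdict} on a 1500-byte port")
```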
What's next
All affected customers will receive compensation and some pleasant bonuses; there is no doubt about that. As I said, traffic on the MnogoByte network keeps growing, and the completed upgrade will let us keep meeting the needs of our existing and new customers. Incidentally, we are one of the few data centers in Moscow that offer 10 Gbit/s connectivity for client servers. Expect new attractive tariffs and a more flexible tariff policy. Every cloud has a silver lining, as they say!
Thanks to all our customers who have been with us for years and who showed patience on that difficult day, for us and for them!