Building and operating a fault-tolerant anycast network

    Hello, Habr! Below is a transcript of the talk by Evgeny error2407 Bogomazov (network R&D engineer) and Dmitry h8r Shemonaev (head of NOC) from the recent UPTIMEDAY. The video is at the end of the post.


    Today we would like to talk about the problems that arise when building an anycast network: why we got into this in the first place and what we are doing there.


    At Qrator Labs we build our own anycast network to solve problems that differ from those of "ordinary" telecom operators. Accordingly, we have points of presence in the regions shown on the slide - we just forgot to add Russia to it. This means we have accumulated many stories about how things should and should not be done. Today we will share some of them with you.


    What are we going to cover, given that the topic is quite broad? At first we only wanted to do a Q&A session, but we were asked to give a full talk anyway. So if we don't get to something - and we certainly won't have time for everything - catch us afterwards in the hallway.

    As planned, we will try to cover the difference between balancing with DNS and with BGP, how to choose new sites and what to pay attention to in order to avoid pain later, and how to operate all of this - Dmitry will tell you just how hard that task is.

    To begin with, let's find out how familiar you are with the topic.
    - How many of you know what anycast is and why it is needed? (about a third of the hands in the hall go up)
    - And who is familiar with DNS and has configured a DNS server? (about the same number of hands)
    - And BGP? (two hands visible in the frame)

    Quite a few, then.

    - Well, one last question: who is familiar with NOC work? Who has had problems with upstream providers and tried to solve them? (the hand of Habr's system administrator is visible in the frame)

    Excellent. In that case, I hope what we are about to tell you will be useful.

    Before moving on to anycast, let's see why it is needed at all. You have an application that serves customer requests. You are hosted somewhere - at this point you don't think much about where. You buy a DNS name, set up resolution, and so on. Then you get a certificate, because HTTPS. Your application grows.

    First, you have to handle the load. If your application "takes off" - that is, suddenly becomes popular - many more users come to you. You have to buy hardware and balance the load across it.

    In addition, especially demanding customers may show up saying: "Guys, for that kind of money you should be available always and everywhere!" So you start provisioning redundant computing resources not only for peak load, but simply as a reserve.

    Also, as has already been mentioned in other talks today, you cannot put all your hardware in a single data center: a natural disaster can happen, the application will go down, and that means financial and other losses. So once your application has grown enough, you should already be present in several data centers, otherwise things will end badly.

    The problem has a flip side: if your application is time-sensitive, as in financial analytics or trading, it is important to deliver responses to your users as quickly as possible. The notorious latency, which involves two points. First, if you want to serve a response as quickly as possible, the number of round trips should be as small as possible. The catch is that when a user connects to you for the very first time, nothing is cached yet and he has to go through all the circles of hell. Second, the speed of light: a packet from Western Europe to Russia cannot travel faster than a certain number of milliseconds, and there is nothing you can do about it.

    So you need to be present in several data centers because you want redundancy and you need to stand closer to potential customers. If, for example, your main client region is America, you will place your equipment there so that traffic does not travel through other countries and continents.

    It turns out that from a certain point on you have to be present at different sites all over the world. And this is still not anycast.

    So you need sites. You have to choose them somehow - both the initial ones and new ones - and understand how they can scale, because under overload you will have to buy additional hardware.

    Once you have several sites, you need to learn how to distribute users between them. There are two options: BGP and DNS.

    Of the two, we will start with the latter. Here again there are two main approaches. In the first, different sites have different IPs, so when a user makes a request, he gets the IP of a specific site and is mapped to it.

    What problem are we solving here? We want a user from a given region to land on a site located in the same region. The simplest, most naive solution is GeoDNS. You know which regions prefixes belong to, you load this data into a DNS server, and you map users accordingly: if the source IP falls into the right prefix, you return the right site. But there is a problem - resolvers. Around 15-20% of requests come from public resolvers, so the source IP will be something like 8.8.8.8. Where do you map that?
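
    To make the idea concrete, here is a minimal GeoDNS-style sketch in Python. The prefix-to-region table and the site addresses are invented, and real GeoDNS implementations work off full GeoIP databases rather than a hand-written dictionary.

```python
# A minimal GeoDNS-style mapping sketch (illustration only; real GeoDNS
# servers do this with full GeoIP databases). The prefix-to-region table
# and the site IPs below are made-up examples.
import ipaddress

REGION_PREFIXES = {           # region -> client prefixes we believe live there
    "eu": [ipaddress.ip_network("185.0.0.0/8")],
    "us": [ipaddress.ip_network("23.0.0.0/8")],
}
SITE_FOR_REGION = {"eu": "192.0.2.10", "us": "198.51.100.10"}  # A-record answers
DEFAULT_SITE = "192.0.2.10"

def pick_site(source_ip: str) -> str:
    """Return the site IP for the region whose prefix matches the query source."""
    addr = ipaddress.ip_address(source_ip)
    for region, prefixes in REGION_PREFIXES.items():
        if any(addr in p for p in prefixes):
            return SITE_FOR_REGION[region]
    return DEFAULT_SITE   # e.g. a public resolver like 8.8.8.8 ends up here

print(pick_site("185.1.2.3"))   # -> 192.0.2.10 (eu)
print(pick_site("8.8.8.8"))     # -> 192.0.2.10 (default; region unknown)
```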

    For this there is EDNS Client Subnet, which allows the resolver to pass along the original client subnet the query came from. As you know, on February 1, 2019, DNS Flag Day happened - from that day on, all DNS servers are expected to support EDNS.
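
    For illustration, here is a small sketch of sending a query with an EDNS Client Subnet option using the dnspython library; the client subnet and the resolver address are just examples.

```python
# Send a query carrying an EDNS Client Subnet option (dnspython), so the
# authoritative server can see the client's subnet even when the query
# arrives via a public resolver. Addresses are illustrative.
import dns.edns
import dns.message
import dns.query

ecs = dns.edns.ECSOption("203.0.113.0", srclen=24)      # original client subnet
query = dns.message.make_query("example.com", "A",
                               use_edns=0, options=[ecs])
response = dns.query.udp(query, "8.8.8.8", timeout=3)   # ask a public resolver
print(response.answer)
```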

    In this scheme you can have one or several sites hosting the DNS servers that map users - and the servers themselves can be distributed around the world. And within DNS itself you can already use anycast - we will get to that a bit later.

    In the second scheme, you map the user to the site closest to him by handing out the address of that particular site. It is used less often.

    The third approach stems from the fact that even if the user is in the same region as the site, the latency problem is not necessarily solved. It may actually be better to send the user to another site - for example, when the regional site is overloaded and alternative routes are available. Wouldn't it be nice to use that? Unfortunately, there are almost no off-the-shelf solutions for it. Facebook once gave a talk about doing this - but there is nothing packaged, you would have to build it all yourself.

    So what do we get with DNS?

    The pluses: different users can be given different addresses, and a specific user can be sent to a specific site - that is, you can work at the level of individual users. And DNS is easy to configure.

    What are the downsides? If you do fine-grained steering, the config grows very quickly, and maintaining it by hand becomes impossible. You need automation. And if the automation is done badly, everything breaks: if DNS is down, the application is unavailable.

    On the other hand, with DNS balancing the user is mapped to a specific site, and that site's IP becomes exposed. This is why we do not use DNS balancing ourselves: all the attack traffic can then be aimed at exactly one point and take it out.

    And as already mentioned, DNS does not support latency-based balancing out of the box, and doing it yourself is very hard.

    Let's finally get to the good stuff, namely BGP anycast.

    This is exactly our case. What is the idea? All sites have the same IP, or rather, they all announce the same prefix. The user is mapped to the site "closest" to him - "closest" from the point of view of BGP. The same prefix is announced along different routes, and if an operator has several routes to the announced prefix, it will most often pick the shortest one. Again, shortest in BGP terms. We will explain shortly why this can be bad.

    BGP also works with reachability at the prefix level, so you always operate on a whole subnet and cannot manipulate individual IPs.

    As a result, since the same prefix is announced from all sites, all users from one region are directed to the same site. An attacker has no way to shift his load from one region to another, so you need to build up enough capacity in each location to handle whatever the operators who chose those routes send your way. And even if you cannot build up that much, it can still be defended.

    You announce the same prefix everywhere - what could be simpler? But there are problems here too.
    The first is that, because you need to announce the same prefix all over the world, you are forced to buy provider-independent address space, which is several times more expensive.

    The second is that users from one region cannot simply be thrown over to another - say, if some of them turn out to be illegitimate, or if you want to spread attack traffic across other sites because some sites are struggling. There are simply no knobs for that.

    The third problem is that with BGP it is very easy to choose the "wrong" site and the "wrong" providers. It will look like you have redundancy and availability, while in reality you have neither.

    So you have several sites across which you want to spread users. What knobs are there for constraining a region, for pulling users toward a specific site?

    There are geo communities. Why do they exist? Let me remind you: the route chosen is the closest one from BGP's point of view. Say there is a Tier-1 operator, for example Level3, with its own backbone around the world. A Level3 customer, if you are connected to Level3 directly, is two hops away from you, while some local operator is three hops away. So an operator in America can end up "closer" to you than an operator in Russia or Europe, because that is how it looks from BGP's point of view.

    With geo communities you can limit the region within which such a large international operator will announce your route. The problem is that they are not always available.

    We have a few stories of running into this ourselves. Dima?

    (Dmitry Shemonaev takes the floor)


    Out of the box, many operators do not offer this and say they will not restrict anything, in the name of network neutrality, freedom, and so on. We have to spend a long time explaining who we are, why we want this and why it matters so much to us, and also educating them on why it has nothing to do with the neutrality they have in mind. Sometimes it works - and sometimes it does not, and we simply decline to work with otherwise interesting operators, because such cooperation would lead to pain in operating our service later on.

    We also often run into the following. There are the operators Evgeny already mentioned - the Tier-1s, which do not buy transit from anyone and only exchange traffic among themselves. But besides them there are at least a couple of dozen operators that are not Tier-1 - they buy transit, yet they too have networks deployed around the world. You don't have to look far: closest to us there are Rostelecom and RETN, and a bit further away the wonderful Taipei Telecom, China Unicom, Singtel and so on.

    In Asia we ran into this situation quite often: it would seem we have several points of presence in Asia and are connected to several operators that are fairly large for that region. Yet we constantly found that traffic from Asia reaches our site via Europe, or even takes a transatlantic trip. From BGP's point of view this is perfectly normal, because BGP does not take latency into account. But the application suffers under these conditions, its users suffer - basically everyone suffers, while from BGP's point of view everything is fine.

    So you have to make changes by hand, reverse-engineer how this or that operator's routing is set up, sometimes negotiate, ask, beg, get down on your knees - in short, do whatever it takes to solve these problems. Our NOC faces this with enviable regularity.

    As a rule, operators do meet you halfway and in some cases are ready to provide a certain set... By the way, can those who have worked with BGP communities raise their hands? (Smiles) Great! So, operators are ready to provide a set of action communities that let you, for example, lower local preference in a certain region, add prepends, stop announcing to someone, and so on.

    So there are two ways to balance load with BGP. The first, as the slide says, is so-called prepends. You can think of the BGP path as a short string listing the autonomous systems a packet traverses from sender to receiver. You can append your own AS number to this path N times, which makes the path longer and therefore less preferred. This is a blunt instrument and it does not solve everything: a prepend is not granular - everyone in the cone of the operator toward which you apply it will see it.
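
    A toy illustration of what prepending does to route selection, ignoring every other BGP attribute; the AS numbers are made up and this models only the decision logic, not a router configuration.

```python
# How AS-path prepending shifts BGP route selection (decision logic only;
# AS numbers and paths are invented for illustration).

def prepend(as_path, own_asn, n):
    """Announce the route with our ASN repeated n extra times."""
    return [own_asn] * n + as_path

def prefer(routes):
    """Pick the route with the shortest AS path (other BGP criteria ignored)."""
    return min(routes, key=len)

own_asn = 64500
via_site_a = [own_asn]                          # path seen at site A's upstream
via_site_b = prepend([own_asn], own_asn, 3)     # site B announces with 3 prepends

print(prefer([via_site_a, via_site_b]))   # -> [64500]; traffic prefers site A
```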

    On the other hand, there are BGP communities. Some are informational (marking) communities, used to understand where a given prefix came from and what it is to the operator - a peer, a customer or an upstream - and also where it was learned, and so on. And there are action communities: you attach one, it reaches the operator's router, and the router performs certain actions on that prefix.

    Most operators have such control communities. Take an abstract Russian operator, spherical in a vacuum, connected to a number of other Russian operators. With some of them it has peering relationships, which imply a parity exchange of traffic, and from some it buys transit. Accordingly, it provides communities that let you add prepends in a given direction, extending the AS path, or not announce at all, or change local preference. If you work with BGP, look at the communities a candidate offers and learn what it can do before it becomes your provider. Sometimes the communities are not published, and you have to talk to the operator's managers or engineers to get them to show you the supported set.

    By default, in the European region, communities are documented in the RIPE DB. That is, you do a whois lookup on the autonomous system number, and the remarks field usually describes what the operator offers in terms of marking and action communities. Not everyone has this, so you often have to dig around in various interesting places.
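
    As a sketch of such a lookup, the snippet below pulls the remarks lines of an AS object straight from the RIPE whois service; the AS number (RETN's AS9002, mentioned above) is used purely as an example.

```python
# Fetch the "remarks:" lines for an AS object from the RIPE database over the
# whois protocol; operators often document their BGP communities there.
import socket

def ripe_remarks(asn: str, server: str = "whois.ripe.net") -> list[str]:
    with socket.create_connection((server, 43), timeout=10) as s:
        s.sendall(f"{asn}\r\n".encode())
        data = b""
        while chunk := s.recv(4096):
            data += chunk
    return [line for line in data.decode(errors="replace").splitlines()
            if line.lower().startswith("remarks:")]

for line in ripe_remarks("AS9002"):   # e.g. RETN's AS object, as an example
    print(line)
```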

    As soon as you start operating BGP, you are essentially saying that the network is part of your application rather than something abstract, so you have to account for its risks.

    For example, we had a case with a Latvian financial institution whose prefix, once routed through our network, became unreachable in roughly half of Latvia. It would seem nothing had changed: the same prefix, announced to Tier-1 operators, in Europe, everything in place, including redundancy. But we could not have imagined that about half of the Latvian operators had border devices that could not digest a full view (the entire BGP routing table), which at the time was around 650 thousand prefixes. If anyone knows what a Catalyst 3550 is - that is exactly what was sitting there, and it could hold only about 12,000 prefixes. So they took a certain number of prefixes from the IX, where there was, of course, no default route.

    As a result, traffic ended up at a device that did not know where to route it, and everything went down the drain. Fixing this took us about two days of persistent correspondence with the Latvian operators, until they finally showed us the output from their border device and we simply spotted the hostname there. Greetings to everyone - it really is that much fun sometimes.

    There are many operators with old hardware. There are many operators with a strange idea of how a network should work. And now that is your problem too, if you are going to play with BGP. Finally, many operators are single-homed (one upstream provider of connectivity), so they have crutch sets of their own.


    (Evgeny Bogomazov continues)



    As you can see, even this topic alone could be developed for a long time, and it is hard to fit into 40 minutes.

    So, you have knobs with which you can constrain a region. Now let's figure out what to look at and what is important to consider when you want to set up at a new site.

    The best case is not to buy your own hardware at all, but to arrange hosting with a cloud. Later you can agree with them that you will connect to particular providers on your own.

    If, however, you do go down this path, you should roughly understand which region, with or without the knobs, will be pulled toward this site. For that you need modeling - or rather, you need to understand which of several routes from different sites will be chosen as the best one. For that you should have some idea of how BGP works and how routes propagate in the current situation.

    The two main factors are the path length, which is influenced by prepends, and local preference, which says that routes from customers are preferred over routes from anywhere else. In principle, these two are enough to understand which region will be pulled toward which site and where you should set up.
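
    A rough sketch of just these two steps of best-path selection (higher local preference first, then shorter AS path); real routers apply many more tie-breakers, and the routes below are invented.

```python
# The part of BGP best-path selection the talk relies on: higher local
# preference wins first, then shorter AS path. Route data is invented.
from dataclasses import dataclass

@dataclass
class Route:
    site: str
    local_pref: int          # customer routes usually get a higher value
    as_path: list[int]

def best_route(routes: list[Route]) -> Route:
    # max local_pref first, then min AS-path length
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))

candidates = [
    Route("frankfurt", local_pref=100, as_path=[64500, 64501]),
    Route("moscow",    local_pref=200, as_path=[64500, 64502, 64503, 64504]),
]
# Moscow wins despite the longer path, because it came from a customer session.
print(best_route(candidates).site)
```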

    Beyond that, there are a couple more things to consider: what kind of connectivity your provider has, the fact that some providers do not talk to each other (peering wars), and that even if you connect to a regional Tier-1, this does not mean all local users will be able to see you.

    Another thing that is often forgotten: IPv4 and IPv6 connectivity are completely different and do not carry over to each other.

    And here we come to the main question: "Where should you set up?" The choice seems obvious. If you have users in a region, you connect to the IX in that region and there is nothing more to think about. There is connectivity, most users should in theory be reachable through it, and most content players - companies like Yandex and others - connect to IXs first and only then to transit providers. But providers may have customers you cannot reach any other way, some providers are not present at the IX at all, and as a result you cannot pull those users toward yourself - their traffic will reach you in strange ways.

    When choosing providers you cannot switch your brain off - we had a couple of cases where the wrong approach led to problems. Providers are our choice, because if you do not have many resources for connecting, then by going to the largest regional players you end up with roughly the same connectivity as at the IX.

    Dima, tell us how to choose the right providers.

    (and again Dmitry Shemonaev)

    Okay. Let's imagine we have one site and the region of interest is Russia. We have a site in a reasonably good data center in Moscow, we run our own autonomous system with our own set of prefixes, and we have decided to scale using BGP anycast - stylish and trendy.

    The business, together with the engineers, decided that the RTT from Vladivostok to Moscow is very large, and that is bad, that is, not good. So let's keep Moscow and add a site in Novosibirsk - everything will improve, the RTT will of course drop. No sooner said than done.

    This raises the question of where to physically place the equipment, but that is a bit outside today's conversation; the question of choosing an operator, however, is very much in scope.
    It would seem the choice is obvious: in Moscow we are connected to the notional Moskvatelecom, so let's connect to it in Novosibirsk too. In principle we can rely on the provider's internal routing, but that is not always wise - we are putting all our eggs in one basket, and we have to understand that routing over the operator's IGP may be far from optimal, to put it mildly, because it is not always clear what drives it. Sometimes it is clear, sometimes not so much - but that is not the point now, and besides, management forbade me to swear, so I simply cannot go through some examples in detail.

    Modern trends being what they are, even Moskvatelecom may decide that the time of SDN has come and deploy a wonderful controller to manage the network. And at some point such a controller can simply take that network down. Offhand I cannot recall such a case specifically with an SDN controller, but just recently in America a large operator (CenturyLink) lost one network card to the network gods and its entire network was unstable across the United States. Because of one network interface card. The operator's NOC took three or four days to resolve the problem. Because of one network card.

    If you are connected to a single operator - my sincere congratulations.

    Fine, so we decide not to use the same notional operator in both Moscow and Novosibirsk: here Moskvatelecom, there Novosibirsktelecom (any resemblance is coincidental). But the customer cones of these two telecoms differ like a tortoise and an elephant, and all your traffic will land where the larger customer cone is, that is, at the Moscow-based Moskvatelecom. It is always preferable that the operators be comparable in size and have peerings with each other within the region you care about. In Russia, a few years ago, the largest operators, such as Rostelecom and TTK, had peerings in Moscow, St. Petersburg, Nizhny Novgorod, Novosibirsk and, it seems, Vladivostok, so traffic between these operators flowed more or less optimally.

    Still, the operator has to be chosen correctly: does it publish communities, does it have a NOC. All of this really matters. Last year there was a wonderful case when a fairly large Russian operator was testing one of its services and, at night, announced a large number of St. Petersburg prefixes, with its own autonomous system inserted into the path, to the second route server at DE-CIX in Frankfurt. And it announced them there with a blackhole community.

    As a result, a lot of St. Petersburg operators and data centers became unreachable from, for example, the TTK network. It affected us too, but we were able to work around it, because we have our own network between our sites - overlay in some places, physical in others - and we steered the return traffic away from the problem operator onto one that had no problems. We coped, in short. But I am telling you this so you understand that an operator's NOC must be adequate: in that case the offender's NOC could not be reached on the night from Friday to Saturday and only woke up on Monday. Three days of partial unreachability for a number of operators. Better to think three times.

    Let's get back to the NOC. Network Operations Center: the division of the company responsible for operating the network, network operations, and so on, and for handling the tickets that come in about the network. What is there to add? The IT-raised specialists in this room probably already know everything good about monitoring. It really is important. And in some cases you will have to monitor very specific things.

    Some users complain that "everything is somehow bad" but cannot provide the diagnostics needed to start fixing the situation. There is a signal that something is bad - but what and where is unclear. In such cases we try to work with the NOC of the operator in whose customer cone this user sits. If that does not work out, we look at what we can correlate - for example, whether there is a RIPE Atlas probe inside that cone. In short, we gather what we can, and we are prepared for the fact that they cannot always give us anything.

    In some cases it makes sense to monitor which communities a given prefix arrives with at your border router and keep a historical record, pardon the tautology. Take three operators, say Megafon, Rostelecom and TransTeleCom. Suppose they all peer within Russia, and you are connected to the notional Rostelecom - it doesn't really matter which. You see the prefixes your users sit in, tagged with some marking community. You can collect and record them, and when something happens, the community changes. For example, you receive a prefix with a community saying "this is a peer in Russia". Fine, recorded. Then that community changes to "this is a peer in Frankfurt". What does that mean? That these operators have broken off their peering, and your latency is now not great - the traffic takes the European loop. In that case you can do something proactively, but it is time-consuming and requires determination, among other qualities.
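
    A minimal sketch of that kind of monitoring: remember which communities each prefix was last seen with and alert when they change. How the (prefix, communities) pairs are obtained - BMP, a route collector, scraping the router - is left out, and the community values are invented.

```python
# Record the marking communities seen per prefix and alert on change.
# The input feed and community values below are invented examples.
import json
from pathlib import Path

STATE = Path("prefix_communities.json")

def check(feed: dict[str, list[str]]) -> None:
    """feed maps prefix -> communities currently seen at the border router."""
    previous = json.loads(STATE.read_text()) if STATE.exists() else {}
    for prefix, communities in feed.items():
        old = previous.get(prefix)
        if old is not None and set(old) != set(communities):
            print(f"ALERT: {prefix} communities changed {old} -> {communities}")
    STATE.write_text(json.dumps(feed, indent=2))

# Yesterday the prefix was tagged as learned from a peer in Russia,
# today the tag says Frankfurt - worth a proactive look at latency.
check({"192.0.2.0/24": ["64500:3001"]})   # first run just records state
check({"192.0.2.0/24": ["64500:3009"]})   # second run raises the alert
```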

    And if possible, automate everything. Ten years ago this was hard; now there are plenty of tools, such as Ansible, Chef and Puppet, that can talk to network devices. Why does automation matter? I have been configuring BGP for a very long time, and the first rule goes something like this: "Whoever you bring up a BGP session with, assume there is not a very nice person on the other end." From that person's point of view, the rule applies to you as well.

    I personally had a case where, at a Samara operator we will not name, I was moving all the peerings from one border router to another. I had an interconnect with a major content provider - an online cinema - and an interconnect with the local subsidiary of Rostelecom. With the content provider the link was a gigabit, with the subsidiary a hundred megabits. Being the pleasant person I am, I did all of this at night. I look at the graphs, at the hundred-megabit link, and think: "Oh, what the hell!" Then I look again - traffic has piled onto this one, and onto that one - and I realize (slaps his forehead): I forgot to set up the filters. You need to fence yourself off from that kind of thing, because everyone else accepted those announcements; from a blow like that the only real protection is automation. Automation is the enemy of the careless and a friend of the good.
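
    As a sketch of the kind of guard rail automation can provide, the snippet below checks announcements against an explicit per-session allow-list, which is what the forgotten filters would have done; in practice this lives in router prefix-lists or route-maps, and the session names and prefixes here are made up.

```python
# Only allow prefixes that are explicitly expected on a given session,
# instead of leaking everything learned elsewhere. Data is invented.
import ipaddress

EXPECTED = {                      # session name -> prefixes allowed on it
    "rostelecom-subsidiary": [ipaddress.ip_network("203.0.113.0/24")],
}

def allowed(session: str, prefix: str) -> bool:
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(exp) for exp in EXPECTED.get(session, []))

# A prefix learned from the content-provider session gets dropped here:
print(allowed("rostelecom-subsidiary", "203.0.113.0/24"))   # True
print(allowed("rostelecom-subsidiary", "198.51.100.0/24"))  # False -> filter it
```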

    Over to you, Zhenya.

    (Evgeny Bogomazov continues)

    So we have covered all the initial points. But things do not end with anycast; besides it, there are other, additional things to keep track of.

    Let's see what else there is. You need to look at how well the application fits a distributed setup: if there are several sites, you must be able to spread the content across them. If that is not possible, then however distributed your system is, all users will end up going to the site where the application actually lives, and you will not save any RTT.

    On the other hand, if you do not have user data as such, you can simply place the application at every site - just do it. And if the application supports all of this but you do not want to build your own infrastructure, use anycast clouds; it will pay off handsomely.

    If you already have several sites and a user arrives at one of them, something can happen there - say it goes down or its links break - and users will move to another site. But they should not notice. So you should be able to shift that traffic as quickly as possible within your internal anycast network, and in general you should treat this problem as inevitable - it is worth building something into the application to prepare for exactly this turn of events.

    Ideally, if you have application business metrics, then when they drop you would immediately query the network monitoring and generate a status report on the internal or external network, or better yet both. Business metrics usually drop because something happened somewhere - but of course this is a utopia; even we have not come close to it yet.

    You have the external network, but the sites also have to talk to each other internally. You do not necessarily need your own physical infrastructure for that - you can use third-party operators' networks; the main thing is to set up virtual tunnels. You also have to configure the internal network's routing, because things do not end with BGP. Due to the nature of our traffic processing, we have our own protocols, communication patterns and scripts.

    If you have more than a few sites, you should be able to update configs in different places at the same time. You get a new prefix - it has to be announced everywhere; you update DNS - same thing. In SDN terms, you have to collect data from the sites, aggregate it somewhere, and push the changes derived from that data back out to the sites.
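
    A tiny sketch of pushing one change to all sites in parallel, in the spirit of tools like Ansible; push_config is a placeholder for whatever transport is actually used (SSH, an API, a config repository), and the site names are invented.

```python
# Apply the same config change to all sites concurrently (illustration only).
from concurrent.futures import ThreadPoolExecutor

SITES = ["moscow", "novosibirsk", "frankfurt", "singapore"]   # example names

def push_config(site: str, change: str) -> str:
    # Placeholder: in reality this would template and apply the change on
    # the site's routers / DNS servers and verify the result.
    return f"{site}: applied '{change}'"

def rollout(change: str) -> None:
    with ThreadPoolExecutor(max_workers=len(SITES)) as pool:
        for result in pool.map(lambda s: push_config(s, change), SITES):
            print(result)

rollout("announce 203.0.113.0/24")
```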

    The last item is DNS. The Dyn case was telling: as you remember, in 2016 they were hit by an attack they could not cope with, and a very large number of resources popular in the USA became unavailable. DNS also needs protection, otherwise users will not find your application on the network. DNS caches help, in part, and there is interesting work on this topic at the IETF, but it always comes down to whether particular DNS resolvers will support it.

    In any case, your DNS must be protected. It is the first stage a user has to get through before he can even reach your application. On the very first page load, on top of the RTT we already mentioned, there are extra delays for DNS queries, and you have to be prepared for that first load to take a long time - some users will not tolerate it. If a slow first load is critical for you, you will have to speed up DNS as well.

    With anycast you can cache DNS responses at your points of presence - you already have those sites anyway - and the answers will come back quickly enough.

    So what are the problems? Latency, and balancing.
    And, as we already mentioned, there is also nature. Partly because of it you need to be present in different places. This should not be forgotten, even though the risk seems small.

    There is also the human factor and plain chance. So you should automate as much as possible, test the configs you push and monitor the changes. Even if something does go wrong, it helps a lot to be able to make changes quickly and locally.

    That covers 90% of the cases. The remaining 10% is when competitors decide to take you out. Then you have serious problems. Why is "redundancy" highlighted on the slide in large type? If someone decides to take you down, you will need a very large amount of link capacity at your sites, which means negotiating with a large number of providers. Otherwise, at the current average level of attacks, you simply will not cope.

    So it is better to delegate than to buy your own hardware. Even the part of the functionality we described today around anycast, and the problems that come with it, are easy to get wrong. So if you have the option of not solving these problems yourself and handing them off to someone else, it is probably worth taking. Otherwise you need a precise answer to the question of why you need to build all of this yourself.

    And in the event of an attack, turn to the clouds that specialize in solving such problems. Or, you know, you can always talk to us.

    Thanks! Questions?

