Inter-AS routing: can you save money on a BGP router?
As a preface: yesterday I presented the ideas below at a local meetup of network administrators. After the talk, a representative of a network equipment vendor came up to me and asked: "Have you published this anywhere? Share the slides, I'd like to show them to my colleagues." And really, why not publish it? As we say in Ukraine, "we are people too, Khimka." If even one vendor is at least remotely interested, then someone in the community will probably find the ideas interesting as well. Besides, I plan to use this solution myself. I should say up front that there is no 100% finished result here, only an intermediate one: enough for ersatz routing, plus some information for continuing work in this direction. Let's go!
To be clear from the start: the title of the article refers to the router, the physical device, and not to the BGP protocol. And no, you will not be able to live without BGP, at least for now, but you know that yourself. Why the protocol is needed, what is good about it, what it lets you do and how to configure it are all covered in plenty of books and articles, both on Habr and elsewhere, so I will skip that topic. If static routes are enough for you, or if the neighboring autonomous system is run by someone you can simply ask to change the announcement policies, you can safely close the post: you do not need this information.
If you are still reading, let me start with the design, since the article is more about designing a solution than about the solution itself. So, we are the service provider ISP A, which has a pool of addresses and needs to announce those addresses to the Internet. The general scheme of logical connections is shown in the figure.

At the edge of our network sits the router RTR A, which has an eBGP session with the neighboring peer. Over that session we receive a FullView from the neighbor, with the IP address of RTR B as the next hop. In return, we announce our internal networks with RTR A as the next hop. I should note right away that this is only one of many possible ways to organize BGP peering: there can be several border routers and several neighbors, the neighbors do not have to be directly connected, we can receive more than one FullView, have backup links, and so on. However, I will allow myself to skip the analysis of the various peering schemes and stick to the simplest case: this does not change the essence, but it does make things easier to follow. Inside our autonomous system runs an IGP, through which we distribute reachability of the ISP A networks.
User traffic flows through the core (CoreA), then through RTR A, and on into the Internet. As I understand it, this (with all its possible variations) is the classic way to organize the network perimeter and BGP peering. Now let's see how much this solution costs in hardware.
I will assume a required throughput of 10 Gb/s. At a minimum, the solution must have 10 Gb/s interfaces with room for a later upgrade. Juniper offers the MX104-40G (or, as an option, the MX80) for 40 thousand dollars, with two (four on the MX80) onboard 10 Gb/s interfaces and a routing performance of 40 Gb/s (80 for the MX80). Cisco answers with the ASR 1001-X, with a base throughput of 2.5 Gb/s and two onboard 10 Gb/s interfaces, for $17,000 plus the price of a license to raise performance (up to 20 Gb/s) and to activate additional interface slots. I should say right away that I did not set out to compare the devices rigorously; the post is not about that. But some numbers are needed, since our main goal is to reduce the cost of the solution.
So, at least 17 thousand dollars. And what does our router RTR A actually do? Not much, really: it runs a BGP session (or several) with a neighbor and forwards traffic to the Internet. Can we do without it? To answer that, let's look at the following topology.

We removed the physical router and set up BGP peering on the core device. Is that possible? Yes: fortunately, many L3 switches support running BGP. However, this design has at least two weak points. First, most switches were not designed for full routing and therefore have a limited routing table. For example, the Juniper EX4550 has a limit of 14,000 IPv4 unicast routes, and the Cisco Nexus3k has 16,000. Second, you need to buy a license to run BGP, which costs 8 (Cisco Nexus3k) or 10 (Juniper EX4550) thousand dollars. If the switches have to be redundant, those numbers double. In addition, you will need to ask the upstream provider to summarize the networks for you, or simply to send you a default route. Nevertheless, this design lets you avoid buying a dedicated router and still enjoy the benefits of BGP. Another possible variation on this theme is shown below.
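As a rough illustration of the arithmetic above, here is a minimal sketch that compares the router prices with the BGP license costs. The prices are the ones quoted in this article; the function names are made up for the example, and the extra ASR performance license is deliberately left out because no price is given for it.

```python
# Prices in thousands of USD, taken from the figures quoted in the article.
ROUTER_PRICE = {"Cisco ASR 1001-X": 17, "Juniper MX104-40G": 40}
BGP_LICENSE = {"Cisco Nexus3k": 8, "Juniper EX4550": 10}


def license_cost(switch: str, redundant: bool = False) -> int:
    """BGP license cost for one switch, doubled for a redundant pair."""
    return BGP_LICENSE[switch] * (2 if redundant else 1)


def saving(router: str, switch: str, redundant: bool = False) -> int:
    """Price of one router minus the license cost (hypothetical helper)."""
    return ROUTER_PRICE[router] - license_cost(switch, redundant)


if __name__ == "__main__":
    pairs = [
        ("Cisco ASR 1001-X", "Cisco Nexus3k"),
        ("Juniper MX104-40G", "Juniper EX4550"),
    ]
    for router, switch in pairs:
        for redundant in (False, True):
            label = "redundant pair" if redundant else "single switch"
            print(f"{router} vs {switch} ({label}): "
                  f"saves {saving(router, switch, redundant)}k USD")
```

Note that the comparison is against a single router; a design with redundant switches would probably be compared against redundant routers as well, which only widens the gap.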

We run the BGP process on a physical server or a virtual machine, which holds an eBGP session with RTR B and an iBGP session with the core device. On the virtual machine we install one of the available BGP packages, for example Quagga, Vyatta or BIRD.
One of the great features of BGP is the ability to change the next hop when announcing updates. We will use it to avoid a situation where user traffic has to be forwarded through the BGP speaker. In other words, inside the autonomous system we separate the device that holds the routing information (the virtual machine) from the device that forwards traffic (CoreA). Accordingly, RTR B receives the address of CoreA as the next hop, and vice versa. This is effectively a control plane versus forwarding plane split. The idea itself is not new and is actively used at Internet exchange points, only there it is done over eBGP sessions.
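The next-hop substitution can be modeled in a few lines. This is only a toy sketch of the idea, not a BGP implementation: the `Route` class, the `advertise` helper and all addresses (taken from documentation ranges) are assumptions made for the example, and it presumes that RTR B, CoreA and the virtual machine share one segment so that each next hop is directly reachable.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str


# Hypothetical addresses on one shared segment (documentation range).
CORE_A = "192.0.2.1"   # forwards user traffic
RTR_B = "192.0.2.2"    # upstream eBGP neighbor
VM = "192.0.2.10"      # BGP speaker, control plane only


def advertise(routes, next_hop):
    """Return copies of the routes with the next hop rewritten."""
    return [replace(route, next_hop=next_hop) for route in routes]


# Our own networks, as originated on the BGP speaker.
own_networks = [Route("198.51.100.0/24", VM)]
# Routes learned from the upstream (a one-route stand-in for a FullView).
full_view = [Route("203.0.113.0/24", RTR_B)]

# Towards RTR B: traffic for our networks must be sent to CoreA, not the VM.
to_rtr_b = advertise(own_networks, CORE_A)
# Towards CoreA: outbound traffic goes straight to RTR B, not through the VM.
to_core_a = advertise(full_view, RTR_B)

if __name__ == "__main__":
    print(to_rtr_b)
    print(to_core_a)
```

In neither direction does the virtual machine appear as a next hop, which is the whole point: it only exchanges routing information.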
This is already a more interesting scenario, since we can now receive a FullView, or even several, and filter and summarize routes locally without having to call the provider. Another interesting property of this solution is that we do not even need to populate the kernel routing table on the virtual machine running BGP. Anyone who has configured, say, Quagga knows that you first have to enable the "ip forwarding" option and then push the routes received by the daemon into the kernel (that is, the host routing table) for traffic to be forwarded correctly. Here all of that is unnecessary: the virtual machine only announces BGP information and takes no part in forwarding traffic, and filling the table inside Quagga takes only as long as transferring the route table itself, about 10 seconds.
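Local filtering and summarization can be sketched with Python's standard `ipaddress` module. This is an illustration under assumptions: the `summarize` helper, the /24 length cutoff and the sample prefixes are invented for the example, and merging prefixes like this is only safe when all of them point to the same next hop, as they do with a single upstream.

```python
import ipaddress


def summarize(prefixes, max_len=24):
    """Drop prefixes longer than max_len, then merge the rest where possible."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    kept = [n for n in networks if n.prefixlen <= max_len]
    return list(ipaddress.collapse_addresses(kept))


if __name__ == "__main__":
    received = [
        "10.0.0.0/24",
        "10.0.1.0/24",
        "10.0.2.0/23",
        "192.0.2.128/25",  # longer than /24, filtered out
    ]
    for network in summarize(received):
        print(network)
```

The fewer routes survive this step, the more likely the result fits into a switch table limited to 14,000 or 16,000 entries.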
This is closer to the solution we are looking for, but the licensing question remains, because the virtual machine and CoreA still talk to each other over BGP. Are there any other options? Can we avoid the license fee altogether? And here we come to the main point of this post. Take a look at the topology.

The basic idea is the same: run eBGP on a virtual machine, but inside the autonomous system use an IGP instead, for example OSPF, as in the figure. The part with the eBGP session is unchanged and still causes no problems. With the IGP, however, there are problems: none of these protocols was designed to carry a next hop that is not directly connected (pardon the jargon). Incidentally, the Nexus3k also requires a license for OSPF, but that is a detail: I have a Juniper network, and on a Nexus you can use RIP :). One way or another, we need to pass along a different next hop, because otherwise user traffic will go through the virtual machine, and such a solution will not work. So we need a crutch that allows the "impossible": announcing a route with a different, non-local next hop. While testing the idea, I tried the following options:
- Changing the next hop during BGP -> OSPF redistribution
- Changing the next hop in the outbound OSPF policy
- Changing the next hop in the inbound OSPF policy
- Changing the next hop when exporting to the forwarding table on a Juniper device
Speaking of the last item, export to the forwarding table: it can be used to do per-flow BGP ECMP, at least on Juniper. If anyone is interested, I can share the config as thanks for reading the post.
Unfortunately, none of the above works. Quagga and Juniper silently ignored my tinkering with the policies, and BIRD complained immediately when I tried to change the "next-hop" parameter in the announcement. That is how, tritely and frustratingly, my idea was wrecked on the rocks of vendor misunderstanding. Along the way I even googled the problem, and it turned out I was not the only one to have this clever idea. But there was no solution, apart from pointers to Cisco's "forwarding address" feature (you can read about it here), and that is not quite it.
Almost in despair, I turned to colleagues for help. Andrushko Dmitry, Kovalenko Alexander (@alk0v) and Simonenko Dmitry, thank you: the country should know its heroes! So, there are options.
First, there is an off-the-shelf solution for software-defined networks called the Atrium project (read). In addition, if I heard correctly, Mellanox makes devices with Quagga / BIRD inside. SDN really is a cool thing: do what you want, the way you want. But that means SDN and new equipment, and my task is to solve everything on the existing hardware.
Second, if I understood correctly ("if" is the key word here, since I am not strong in *NIX systems), the daemons in Quagga (for example, ospfd) talk to the kernel through the iproute2 module, so in theory you could intercept a packet on its way out of ospfd and modify it. I do not know whether my reasoning is correct or whether this is possible at all, but there it is.
And finally, the surefire option is Scapy, which lets you generate packets with arbitrary content. After all, we know the structure of an OSPF packet, and we know which value to change. All that remains is to implement it. This is where I have stopped for now.
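To make the "known structure, known value" remark concrete, here is a minimal sketch that builds an OSPFv2 AS-external LSA (type 5, RFC 2328) and rewrites one field in it. It uses the standard `struct` module rather than Scapy, to stay dependency-free. Two things are assumptions and not something the article confirms: that the field to change is the LSA's forwarding address, and all function names and addresses. The checksum is left at zero, so the result is not a valid LSA until the Fletcher checksum is computed.

```python
import socket
import struct

# LSA header (RFC 2328, A.4.1): age, options, type, link state ID,
# advertising router, sequence number, checksum, length. 20 bytes.
LSA_HEADER = struct.Struct("!HBB4s4sIHH")
# AS-external-LSA body (A.4.5): network mask, E bit + 24-bit metric,
# forwarding address, external route tag. 16 bytes.
EXT_BODY = struct.Struct("!4sI4sI")
# The forwarding address follows the header, the mask and the metric word.
FWD_OFFSET = LSA_HEADER.size + 8


def build_external_lsa(network, mask, adv_router, fwd_addr, metric=20):
    """Pack a type 5 LSA; the checksum is a zero placeholder."""
    body = EXT_BODY.pack(
        socket.inet_aton(mask),
        (1 << 31) | metric,          # E bit set: type 2 external metric
        socket.inet_aton(fwd_addr),
        0,                           # external route tag
    )
    header = LSA_HEADER.pack(
        1,                           # LS age
        0x02,                        # options: E bit
        5,                           # LS type: AS-external
        socket.inet_aton(network),   # link state ID
        socket.inet_aton(adv_router),
        0x80000001,                  # initial sequence number
        0,                           # checksum placeholder, not computed
        LSA_HEADER.size + len(body),
    )
    return header + body


def get_forwarding_address(lsa):
    return socket.inet_ntoa(lsa[FWD_OFFSET:FWD_OFFSET + 4])


def set_forwarding_address(lsa, new_addr):
    """Return a copy with the forwarding address replaced (checksum untouched)."""
    return lsa[:FWD_OFFSET] + socket.inet_aton(new_addr) + lsa[FWD_OFFSET + 4:]


if __name__ == "__main__":
    lsa = build_external_lsa("203.0.113.0", "255.255.255.0",
                             "192.0.2.10", "0.0.0.0")
    patched = set_forwarding_address(lsa, "192.0.2.2")
    print(get_forwarding_address(lsa), "->", get_forwarding_address(patched))
```

A real implementation would also have to wrap the LSA in a Link State Update packet and keep sequence numbers and checksums consistent, which this sketch does not attempt.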
As I see it, the solution must above all be dynamic. Otherwise, why all this song and dance? In my opinion, you could even bring up one virtual machine per eBGP peer: the price of a virtual machine is negligible, and that simplification would let you simply modify all outgoing OSPF packets, replacing one next hop with another.
But until I get around to implementing such a solution, I have decided that for my own task I will run eBGP on a virtual machine and use static routes on the core (CoreA). Inelegant, yes, but it lets me do without buying a router, at least at first.
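For the static-route stopgap, the configuration for a Juniper core can be generated from a prefix list. This is a sketch under assumptions: the helper name and the addresses are hypothetical, and the output uses the standard Junos `set routing-options static route ... next-hop ...` form.

```python
import ipaddress


def junos_static_routes(prefixes, next_hop):
    """Build Junos 'set' commands for static routes via one next hop."""
    hop = ipaddress.ip_address(next_hop)  # validates the address
    networks = ipaddress.collapse_addresses(
        ipaddress.ip_network(p) for p in prefixes
    )
    return [
        f"set routing-options static route {net} next-hop {hop}"
        for net in networks
    ]


if __name__ == "__main__":
    # Hypothetical example: a default route towards RTR B.
    for line in junos_static_routes(["0.0.0.0/0"], "192.0.2.2"):
        print(line)
```

For a stub AS with a single upstream, a default route like the one in the usage example may be all that CoreA needs.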
I understand that such a solution is not suitable for transit autonomous systems or for places that need additional services such as MPLS. There may also be problems with geo-filtering, or rather with prioritizing a particular peer that has non-contiguous address blocks, where optimal summarization is difficult. You also have to take into account the relatively slow propagation of routing information through the IGP. However, for a stub AS and simpler tasks, the solution is quite suitable.
Those are the ideas. I hope someone finds them interesting and puts them to use.