Juniper: composite-next-hop


    In an article on EVPN I mentioned that composite-next-hop must be enabled for EVPN to work, after which at least ten people asked me the same question: what is composite-next-hop? It seems that for many engineers composite-next-hop is a mysterious technology that can dramatically reduce the number of next-hops. The topic is covered very well in the book “MPLS in the SDN Era”, and based on a chapter of that book I will briefly describe how it works.

    I think every engineer who works with routers knows that there is a routing information base (RIB) and a forwarding information base (FIB). The RIB is built from the information received from dynamic routing protocols, as well as from static routes and connected interfaces. Each protocol, be it BGP or IS-IS, installs routes into its own database; of those, the best routes (chosen by protocol preference, administrative distance in Cisco terms) are installed into the routing table (RIB), and, very importantly, it is from the RIB that routes are advertised onward. Here is an example RIB entry for the prefix 10.0.0.0/24:

    bormoglotx@RZN-PE2> show route table VRF1.inet.0 10.0.0.0/24           
    VRF1.inet.0: 8 destinations, 13 routes (8 active, 0 holddown, 0 hidden)
    + = Active Route, - = Last Active, * = Both
    10.0.0.0/24        *[BGP/170] 15:42:58, localpref 100, from 62.0.0.64
                          AS path: I, validation-state: unverified
                          to 10.0.0.0 via ae0.1, Push 16
                        > to 10.0.0.6 via ae1.0, Push 16, Push 299888(top)
                        [BGP/170] 15:42:58, localpref 100, from 62.0.0.65
                          AS path: I, validation-state: unverified
                          to 10.0.0.0 via ae0.1, Push 16
                        > to 10.0.0.6 via ae1.0, Push 16, Push 299888(top)

    The RIB gives us complete information about each route: the protocol that installed it, its metric, the presence of equal-cost paths, communities and so on, as well as the reason a route is currently inactive. But the RIB is pure control plane, and it is not convenient for the router to forward traffic using this table. Therefore, based on the RIB, the router (more precisely, the RE) builds the FIB and pushes it to all the PFEs. The FIB no longer carries redundant information about protocols and metrics: all the PFE needs to know is the prefix itself, the next-hop through which the prefix is reachable, and the labels that must be pushed when sending a packet:

    bormoglotx@RZN-PE2> show route forwarding-table vpn VRF1 destination 10.0.0.0/24                  
    Routing table: VRF1.inet
    Internet:
    Destination        Type RtRef Next hop           Type Index    NhRef Netif
    10.0.0.0/24        user     0                    indr  1048576     4
                                                     ulst  1048575     2
                                  0:5:86:71:49:c0   Push 16      572     1 ae0.1
                                  0:5:86:71:9d:c1   Push 16, Push 299888(top)      583     1 ae1.0

    Note: usually only one route makes it into the FIB, but here ECMP load balancing is in use, so the RE sends both routes to the PFE when equal-cost paths exist.

    Today we will talk about next-hops. There are several types of next-hop on Juniper equipment:

    VMX(RZN-PE2 vty)# show nhdb summary detail    
          Type              Count
        ---------         ---------
         Discard          18
          Reject          17
         Unicast          47
         Unilist          4
         Indexed          0
        Indirect          4
            Hold          0
         Resolve          5
           Local          20
            Recv          17
        Multi-RT          0
           Bcast          9
           Mcast          11
          Mgroup          3
        mdiscard          11
           Table          17
            Deny          0
         Aggreg.          18
          Crypto          0
          Iflist          0
          Sample          0
           Flood          0
         Service          0
        Multirtc          0
          Compst          7
       DmxResolv          0
          DmxIFL          0
         DmxtIFL          0
           LITAP          0
            Limd          0
              LI          0
          RNH_LE          0
            VCFI          0
            VCMF          0

    Many of them are self-explanatory, and some of them you will never encounter in practice at all. We will dwell on a few of them, starting with the simplest, the direct next-hop, which in Juniper terms is called unicast.

    Unicast next-hop

                    604(Unicast, IPv4->MPLS, ifl:340:ge-0/0/2.0, pfe-id:0)      <<<<<<
                    605(Unicast, IPv4->MPLS, ifl:341:ge-0/0/3.0, pfe-id:0)      <<<<<<
                    606(Unicast, IPv4->MPLS, ifl:342:ge-0/0/4.0, pfe-id:0)      <<<<<<

    The simplest form of next-hop is the direct next-hop. It points straight at the physical interface through which the prefix is reachable. If this were the only type of next-hop, a separate next-hop would be created for every prefix, regardless of which routing table the prefix sits in, a VRF or the GRT. Yes, this is very simple and understandable, but not everything that looks clear to an engineer at first glance is good. Let's take an example: if we have 100 VRFs with 100 prefixes each, we end up with 10,000 physical next-hops (and that is for the VRF prefixes alone). Add to that the IS-IS, LDP, RSVP routes and so on.

    Note: for simplicity of reasoning, we assume there are no equal-cost paths and no aggregated interfaces. If there are, the next-hop hierarchy comes into play, which we will discuss a little later.

    As a result, the next-hop limit can be reached very quickly. But that is not even the main problem: modern hardware can hold more than 1M IPv4 prefixes in the FIB. The real issue is that when one of the interfaces goes down and routes are recalculated, the router has to rewrite all the next-hops currently installed in the forwarding table (in our case, all 10,000). IGP routes will be rewritten quickly, there are not that many of them, but vpnv4/l2vpn/evpn routes usually number in the tens of thousands (sometimes hundreds of thousands). Naturally, rewriting that many next-hops takes time, during which you may lose some traffic. And we have not even counted a possible full BGP view on the box, which these days is about 645K routes.

    The most interesting thing about direct next-hops is that even if all these 10,000 prefixes are learned from the same PE (that is, share the same protocol next-hop), you still have to update all 10,000 of them. But if you think about it, in this situation there are really only 100 unique next-hops (assuming per-vrf service label allocation), which differ only in the service label: the transport label and the outgoing interface are exactly the same. These days you will not find direct next-hops for prefixes in a VRF (on Junos, anyway); to be precise, on cards with the TRIO chipset you could not enable direct next-hops for L3VPN and other services even if you wanted to, it is simply not supported. But you cannot get rid of them entirely either: a unicast next-hop points directly at the interface the packet should leave through, and when a hierarchical next-hop is used (more on that later), the last level of the hierarchy is always a unicast next-hop. There is no other way. It should also be mentioned that besides the outgoing interface, this next-hop type includes the label stack and the encapsulation, but more on that later.
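    To make the scaling problem concrete, here is a tiny Python sketch (a toy model, not Junos internals; the `flat_next_hops` helper and its numbers are purely illustrative):

```python
# Toy model of a FIB that uses only direct (unicast) next-hops:
# every prefix in every VRF gets its own next-hop entry, so nothing
# is shared and an egress interface failure touches every entry.

def flat_next_hops(num_vrfs: int, prefixes_per_vrf: int) -> int:
    """One unicast next-hop per prefix: no aggregation at all."""
    return num_vrfs * prefixes_per_vrf

# The example from the text: 100 VRFs with 100 prefixes each means
# 10,000 next-hops, all of which must be rewritten if the link fails.
print(flat_next_hops(100, 100))  # 10000
```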

    A unicast next-hop for a route learned via IS-IS looks like this:

    bormoglotx@RZN-PE2> show route forwarding-table destination 10.0.0.2/31 table default          
    Routing table: default.inet
    Internet:
    Destination        Type RtRef Next hop           Type Index    NhRef Netif
    10.0.0.2/31        user     0 10.0.0.6           ucst      693    19 ae1.0
    bormoglotx@RZN-PE2> show route forwarding-table destination 10.0.0.2/31 table default extensive    
    Routing table: default.inet [Index 0] 
    Internet:
    Destination:  10.0.0.2/31
      Route type: user                  
      Route reference: 0                   Route interface-index: 0   
      Multicast RPF nh index: 0             
      Flags: sent to PFE, rt nh decoupled  
      Nexthop: 10.0.0.6
      Next-hop type: unicast               Index: 693      Reference: 19   
      Next-hop interface: ae1.0 

    Aggregate next-hop

                584(Aggreg., IPv4, ifl:326:ae0.1, pfe-id:0)      <<<<<<<<
                    585(Unicast, IPv4, ifl:337:ge-0/0/0.1, pfe-id:0)
                    586(Unicast, IPv4, ifl:339:ge-0/0/1.1, pfe-id:0)
                603(Aggreg., IPv4->MPLS, ifl:327:ae1.0, pfe-id:0)      <<<<<<<
                    604(Unicast, IPv4->MPLS, ifl:340:ge-0/0/2.0, pfe-id:0)
                    605(Unicast, IPv4->MPLS, ifl:341:ge-0/0/3.0, pfe-id:0)
                    606(Unicast, IPv4->MPLS, ifl:342:ge-0/0/4.0, pfe-id:0)

    Everything is very simple here: as you have probably guessed, this hierarchy appears when a prefix is reachable through an aggregated interface. An aggregate next-hop is essentially a list of the real next-hops (physical interfaces) that make up the aggregate. If you use aggregated interfaces, the number of next-hops grows in proportion to the number of links in the aggregate. In the output above you see two aggregate next-hops, each of which in turn points to the physical next-hops that belong to these aggregates.

    Unilist-next-hop

            1048574(Unilist, IPv4, ifl:0:-, pfe-id:0)      <<<<<<<<
                584(Aggreg., IPv4, ifl:326:ae0.1, pfe-id:0)
                    585(Unicast, IPv4, ifl:337:ge-0/0/0.1, pfe-id:0)
                    586(Unicast, IPv4, ifl:339:ge-0/0/1.1, pfe-id:0)
                603(Aggreg., IPv4->MPLS, ifl:327:ae1.0, pfe-id:0) 
                    604(Unicast, IPv4->MPLS, ifl:340:ge-0/0/2.0, pfe-id:0)
                    605(Unicast, IPv4->MPLS, ifl:341:ge-0/0/3.0, pfe-id:0)
                    606(Unicast, IPv4->MPLS, ifl:342:ge-0/0/4.0, pfe-id:0)

    Actually, this is also a very simple hierarchy, somewhat similar to aggregate. It appears only when there are equal-cost paths and is, in essence, simply a list of all of them. In our case we have two equal-cost paths, both through aggregated interfaces.
    Note: in our case the stars aligned so that the unicast IDs (585, 586) immediately follow the aggregate ID (584) numerically (not just in hierarchy order), but this is not always the case.

    None of the next-hop types listed so far reduce the number of physical next-hops; if anything, they increase it. The following two types are designed to optimize the FIB and reduce the number of unicast next-hops.

    Indirect next-hop

        1048577(Indirect, IPv4, ifl:327:ae1.0, pfe-id:0, i-ifl:0:-)      <<<<<<
            1048574(Unilist, IPv4, ifl:0:-, pfe-id:0)
                584(Aggreg., IPv4, ifl:326:ae0.1, pfe-id:0)
                    585(Unicast, IPv4, ifl:337:ge-0/0/0.1, pfe-id:0)
                    586(Unicast, IPv4, ifl:339:ge-0/0/1.1, pfe-id:0)
                603(Aggreg., IPv4->MPLS, ifl:327:ae1.0, pfe-id:0)
                    604(Unicast, IPv4->MPLS, ifl:340:ge-0/0/2.0, pfe-id:0)
                    605(Unicast, IPv4->MPLS, ifl:341:ge-0/0/3.0, pfe-id:0)
                    606(Unicast, IPv4->MPLS, ifl:342:ge-0/0/4.0, pfe-id:0)

    This type of next-hop is used to reduce the number of physical next-hops. After all, the 10,000 next-hops we arrived at by simple arithmetic in the unicast discussion are clearly too many. Let's look at our next-hops again. We have 100 VRFs with 100 prefixes each (labels are allocated per-vrf), all advertised from the same PE. It turns out that in this scenario all these prefixes share the same protocol next-hop (the loopback of the remote PE) and the same outgoing interface (and, consequently, the same transport label). The only difference is the service label, and since labels are allocated per-vrf, there are only 100 of them. As a result, the 10,000 direct next-hops can be aggregated into just 100 next-hops.

    The concept of indirect next-hop allows all prefixes reachable through the same protocol next-hop to share the same indirect next-hop. I would like to draw the reader's attention to the fact that aggregation is done on the protocol next-hop, since there may be no service label at all (for example, Internet routes), but when a service label is present, it strongly affects indirect next-hops.
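    As a rough illustration, this aggregation can be modelled in a few lines of Python (a toy model, not Junos internals; the PE name and label values are made up):

```python
# Toy model: an indirect next-hop can be shared by all prefixes that
# resolve through the same protocol next-hop AND carry the same label
# stack, so counting distinct (protocol-NH, service-label) pairs gives
# the number of next-hops actually needed.

def count_shared_next_hops(prefixes):
    """prefixes: iterable of (protocol_next_hop, service_label) tuples."""
    return len({(pnh, label) for pnh, label in prefixes})

# Per-vrf labels: 100 VRFs x 100 prefixes from one remote PE, every
# prefix in VRF i carries that VRF's label -> only 100 groups remain.
per_vrf = [("PE1-lo0", 1000 + vrf)
           for vrf in range(100) for _prefix in range(100)]
print(count_shared_next_hops(per_vrf))     # 100

# Per-prefix labels: every prefix has its own label -> no aggregation.
per_prefix = [("PE1-lo0", 20000 + p) for p in range(10000)]
print(count_shared_next_hops(per_prefix))  # 10000
```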

    Alas, the main problem with indirect next-hops is that they refer to a unicast next-hop that carries the full label stack, including the service label:

    bormoglotx@RZN-PE2>show route forwarding-table table VRF1 destination 10.2.0.0/24 extensive    
    Routing table: VRF1.inet [Index 9] 
    Internet:
    Destination:  10.2.0.0/24
      Route type: user                  
      Route reference: 0                   Route interface-index: 0   
      Multicast RPF nh index: 0             
      Flags: sent to PFE 
      Next-hop type: indirect              Index: 1048587  Reference: 2    
      Nexthop: 10.0.0.6
      Next-hop type: Push 24008, Push 299920(top) Index: 706 Reference: 2    
      Load Balance Label: None              
      Next-hop interface: ae1.0

    This line describes the complete label stack:

      Next-hop type: Push 24008, Push 299920(top) Index: 706 Reference: 2    

    As you can see, label 24008 is the service label and is pushed onto the stack at the last level of the next-hop hierarchy. Because of this, several indirect next-hops cannot point to the same physical next-hop: each one has a different service label. In addition, L2CKT and VPLS, for example, use different encapsulations. So under the conditions described above, indirect next-hops may bring no benefit at all.

    It is not hard to guess that if per-prefix label allocation is used (for reasons unknown to me, this is the default label allocation method on Cisco and Huawei), indirect next-hops will not help us much, since every prefix now gets its own service label. As a result, we cannot combine several prefixes into one next-hop: although they are reachable through the same protocol next-hop, they carry different service labels, which in the worst case again gives us 10,000 next-hops, just indirect instead of direct. As the Russian proverb goes, “horseradish is no sweeter than radish”... On top of that, all L2CKTs will have different labels even if they terminate on the same pair of PEs (and there is nothing to be done about that: you cannot allocate one label for several L2CKTs). How the developers defeated this problem is described below.

    Of course, in real life indirect next-hops allow us to significantly reduce the number of next-hops (since few people use vrf-table-label or per-vrf label allocation). Moreover, on Juniper MX indirect next-hop is enabled by default and cannot be turned off. In addition, if the router carries a full view, those prefixes have no service label at all (unless you put the full view into a VRF, of course), so all Internet prefixes share the same indirect next-hop.

    But there is no limit to perfection, and we want an even more scalable solution. Besides, I repeat: L2CKTs will always have different service labels and therefore different indirect next-hops. The solution to this problem is called chained-composite-next-hop (in Juniper terms; Cisco has a slightly different approach).

    Chained-composite-next-hop

    607(Compst, IPv4->MPLS, ifl:0:-, pfe-id:0, comp-fn:Chain)      <<<<<<<
        1048577(Indirect, IPv4, ifl:327:ae1.0, pfe-id:0, i-ifl:0:-)
            1048574(Unilist, IPv4, ifl:0:-, pfe-id:0)
                584(Aggreg., IPv4, ifl:326:ae0.1, pfe-id:0)
                    585(Unicast, IPv4, ifl:337:ge-0/0/0.1, pfe-id:0)
                    586(Unicast, IPv4, ifl:339:ge-0/0/1.1, pfe-id:0)
                603(Aggreg., IPv4->MPLS, ifl:327:ae1.0, pfe-id:0)
                    604(Unicast, IPv4->MPLS, ifl:340:ge-0/0/2.0, pfe-id:0)
                    605(Unicast, IPv4->MPLS, ifl:341:ge-0/0/3.0, pfe-id:0)
                    606(Unicast, IPv4->MPLS, ifl:342:ge-0/0/4.0, pfe-id:0)

    As we found out, an indirect next-hop is a matryoshka of next-hops. Chained-composite-next-hop is exactly the same matryoshka, but with one more level in the hierarchy. How else can prefixes be grouped and associated with the same next-hop? What else do all L3VPNs, or all L2CKTs, have in common? Right: the address family. At the very top of the next-hop hierarchy now sits the composite next-hop, which groups routes by service label; unlike with indirect next-hop, it is the composite next-hop that carries this label. That is, the service label is now attached not at the very last level of the hierarchy (unicast), but at the first. This is what defeats the problem we identified when discussing indirect next-hops. As an example, let's look at the FIB entry for the same prefix 10.2.0.0/24:

    bormoglotx@RZN-PE2> show route forwarding-table table VRF1 destination 10.2.0.0/24 extensive    
    Routing table: VRF1.inet [Index 9] 
    Internet:
    Destination:  10.2.0.0/24
      Route type: user                  
      Route reference: 0                   Route interface-index: 0   
      Multicast RPF nh index: 0             
      Flags: sent to PFE 
      Nexthop:  
      Next-hop type: composite             Index: 608      Reference: 2    
      Load Balance Label: Push 24008, None  
      Next-hop type: indirect              Index: 1048578  Reference: 3    
      Nexthop: 10.0.0.6
      Next-hop type: Push 299920           Index: 664      Reference: 3    
      Load Balance Label: None              
      Next-hop interface: ae1.0

    The Load Balance Label line shows the service label:

    Load Balance Label: Push 24008, None

    By simple reasoning we arrive at the following conclusion: there will be as many composite next-hops as there are service labels. At the next level of the hierarchy sits the indirect next-hop, although it differs slightly from the one we discussed earlier. When composite next-hops are used, the job of the indirect next-hop is to aggregate by address family and protocol next-hop. That is, as you understand, all vpnv4 prefixes with the same protocol next-hop will share the same indirect next-hop. The indirect next-hop in turn points to the real next-hop (usually either unilist or aggregate). Most importantly, several indirect next-hops can now point to the same unilist next-hop, since the next-hops below them now carry an incomplete label stack (only the transport label).

    Now back to our case with 100 VRFs. In the worst-case scenario, with indirect next-hops we ended up with 10,000 indirect next-hops and, as a result, just as many real next-hops. Now let's see what composite-next-hop gives us. At the top of the hierarchy are the composite next-hops; with per-prefix labels we get 10,000 different service labels and therefore just as many composite next-hops. But unlike the previous case, a composite next-hop refers not to a real next-hop but to an indirect next-hop, which aggregates vpnv4 prefixes by address family and protocol next-hop. And this sharply reduces the number of real next-hops. In our scenario there is only one address family (vpnv4) and one protocol next-hop, which means all 10,000 composite next-hops will refer to one single indirect next-hop, which in turn points to one real next-hop. In the end we got just one real next-hop!
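    The whole chain can be modelled with one more level of grouping (again a toy model, not Junos internals; the family name and label values are invented):

```python
# Toy model of the chained-composite hierarchy: composite next-hops
# are per service label, indirect next-hops are per (address family,
# protocol next-hop), and only the latter consume real forwarding
# next-hops toward the remote PE.

def composite_counts(prefixes):
    """prefixes: iterable of (family, protocol_next_hop, service_label).
    Returns (composite, indirect, real) next-hop counts."""
    composite = len({label for _f, _p, label in prefixes})
    indirect = len({(fam, pnh) for fam, pnh, _l in prefixes})
    real = len({pnh for _f, pnh, _l in prefixes})  # one path per remote PE here
    return composite, indirect, real

# Worst case from the text: 10,000 per-prefix labels, one remote PE,
# one address family -> 10,000 composites but a single real next-hop.
worst = [("vpnv4", "PE1-lo0", 24000 + p) for p in range(10000)]
print(composite_counts(worst))  # (10000, 1, 1)
```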

    From my own practice I can say that enabling composite-next-hop for ingress LSPs can reduce the total number of next-hops by a factor of 5 to 8 (a real figure, for example: a drop from 1.1M next-hops before enabling the feature to 170K after, that is, a 6.5x reduction; you will agree, a good result).

    Note: with composite-next-hop enabled you will not see the label stack in the plain forwarding-table output, since it is split across two levels of the hierarchy and is shown only in extensive output. For example:

    Indirect-next-hop:

    bormoglotx@RZN-PE2> show route forwarding-table table VRF1 destination 10.0.0.0/24                
    Routing table: VRF1.inet
    Internet:
    Destination        Type RtRef Next hop           Type Index    NhRef Netif
    10.0.0.0/24        user     0                    indr  1048578     4
                                                     ulst  1048577     2
                                  0:5:86:71:49:c0   Push 16      699     1 ae0.1
                                  0:5:86:71:9d:c1   Push 16, Push 299888(top)      702     1 ae1.0

    Composite-next-hop:

    bormoglotx@RZN-PE2> show route forwarding-table table VRF1 destination 10.0.0.0/24                   
    Routing table: VRF1.inet
    Internet:
    Destination        Type RtRef Next hop           Type Index    NhRef Netif
    10.0.0.0/24        user     0                    comp      608     2

    Note: if a Juniper MX chassis contains DPC boards (other than service ones), composite-next-hop cannot simply be enabled, as this note on the Juniper website explains:

    On MX Series 3D Universal Edge Routers containing both DPC and MPC FPCs, chained composite next hops are disabled by default. To enable chained composite next hops on the MX240, MX480, and MX960, the chassis must be configured to use the enhanced-ip option in network services mode.

    It does not say outright that the feature will not turn on, and this may come as a surprise to some, but DPC cards do not support enhanced-ip mode:

    Only Multiservices DPCs (MS-DPCs) and MS-MPCs are powered on with the enhanced network services mode options. No other DPCs function with the enhanced network services mode options.
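    So on an MX240/480/960 with MPC-only line cards, the prerequisite would look roughly like this (a sketch; note that changing the network-services mode may require a chassis reboot, so plan a maintenance window):

```
chassis {
    network-services enhanced-ip;
}
```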

    But you should not think that composite next-hops are needed exclusively on PE routers, although that is where they are most useful (a P router usually just does not carry that many routes). Composite-next-hop can be enabled for ingress LSPs (on PEs) and for transit LSPs (on Ps). Besides, you may have beefed-up PEs that also act as P routers (or your network design has no BGP-free core at all), or your core (the P level) may learn routes to other autonomous systems (Option C) not through redistribution of BGP-LU routes into the IGP at the borders, but through BGP labeled-unicast sessions with route reflectors.

    For ingress LSPs, composite-next-hop can be enabled for services such as L3VPN, L2VPN, EVPN, as well as BGP-LU:

      evpn                 Create composite-chained nexthops for ingress EVPN LSPs
      fec129-vpws          Create composite-chained nexthops for ingress fec129-vpws LSPs
      l2ckt                Create composite-chained nexthops for ingress l2ckt LSPs
      l2vpn                Create composite-chained nexthops for ingress l2vpn LSPs
      l3vpn                Create composite-chained nexthops for ingress l3vpn LSPs
      labeled-bgp          Create composite-chained nexthops for ingress labeled-bgp LSPs
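    For example, turning the feature on for ingress L3VPN and EVPN LSPs would look something like this (a sketch mirroring the transit configuration shown further on; remember that enabling the ingress variant flaps BGP sessions):

```
routing-options {
    forwarding-table {
        chained-composite-next-hop {
            ingress {
                evpn;
                l3vpn;
            }
        }
    }
}
```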

    For transit LSPs, composite-next-hop options are available for LDP, RSVP and even static LSPs:

      l2vpn                Create composite-chained nexthops for transit l2vpn LSPs
      l3vpn                Create composite-chained nexthops for transit l3vpn LSPs
      labeled-bgp          Create composite-chained nexthops for transit labeled BGP routes
      ldp                  Create composite-chained nexthops for LDP LSPs
      ldp-p2mp             Create composite-chained nexthops for LDP P2MP LSPs
      rsvp                 Create composite-chained nexthops for RSVP LSPs
      rsvp-p2mp            Create composite-chained nexthops for RSVP p2mp LSPs
      static               Create composite-chained nexthops for static LSPs

    This can significantly reduce the number of next-hops for transit LSPs. For example, I have only 5 devices in my lab, so the mpls.0 table on the P router is, to put it mildly, small:

    bormoglotx@RZN-P2> show route table mpls.0 | find ^2[0-9]+ 
    299872             *[LDP/9] 1w1d 12:59:13, metric 1
                        > to 10.0.0.5 via ae0.0, Pop      
    299872(S=0)        *[LDP/9] 1w1d 12:59:13, metric 1
                        > to 10.0.0.5 via ae0.0, Pop      
    299888             *[LDP/9] 1w1d 12:58:30, metric 1
                        > to 10.0.0.5 via ae0.0, Swap 299792
    299904             *[LDP/9] 1w1d 12:55:57, metric 1
                        > to 10.0.0.7 via ae1.0, Pop      
    299904(S=0)        *[LDP/9] 1w1d 12:55:57, metric 1
                        > to 10.0.0.7 via ae1.0, Pop      
    299920             *[LDP/9] 1w1d 12:47:06, metric 1
                        > to 10.0.0.5 via ae0.0, Swap 299824

    But the effect of enabling composite-next-hop for LDP is visible even in such a small lab. Here is the total number of next-hops before enabling composite-next-hop:

    VMX(RZN-P2 vty)# show nhdb summary    
     Total number of  NH = 116
    VMX(RZN-P2 vty)# show nhdb summary detail    
          Type              Count
        ---------         ---------
         Discard          12
          Reject          11
         Unicast          32      <<<<<<<<<<<<<<
         Unilist          0
         Indexed          0
        Indirect          0
            Hold          0
         Resolve          2
           Local          13
            Recv          8
        Multi-RT          0
           Bcast          4
           Mcast          7
          Mgroup          1
        mdiscard          7
           Table          11
            Deny          0
         Aggreg.          8      <<<<<<<<<<<<<<
          Crypto          0
          Iflist          0
          Sample          0
           Flood          0
         Service          0
        Multirtc          0
          Compst          0      <<<<<<<<<<<<<<
       DmxResolv          0
          DmxIFL          0
         DmxtIFL          0
           LITAP          0
            Limd          0
              LI          0
          RNH_LE          0
            VCFI          0
            VCMF          0

    Now enable chained-composite-next-hop for ldp and check the result:

    bormoglotx@RZN-P2> show configuration routing-options 
    router-id 62.0.0.65;
    autonomous-system 6262;
    forwarding-table {
        chained-composite-next-hop {
            transit {
                ldp;
            }
        }
    }

    The routing table looks the same, although the routes have been refreshed, which can be seen from their age (keep this in mind when enabling composite-next-hop on transit devices):

    bormoglotx@RZN-P2> show route table mpls.0 | find ^2[0-9]+ 
    299872             *[LDP/9] 00:00:57, metric 1
                        > to 10.0.0.5 via ae0.0, Pop      
    299872(S=0)        *[LDP/9] 00:00:57, metric 1
                        > to 10.0.0.5 via ae0.0, Pop      
    299888             *[LDP/9] 00:00:57, metric 1
                        > to 10.0.0.5 via ae0.0, Swap 299792
    299904             *[LDP/9] 00:00:57, metric 1
                        > to 10.0.0.7 via ae1.0, Pop      
    299904(S=0)        *[LDP/9] 00:00:57, metric 1
                        > to 10.0.0.7 via ae1.0, Pop      
    299920             *[LDP/9] 00:00:57, metric 1
                        > to 10.0.0.5 via ae0.0, Swap 299824

    Now let's check the total number of next-hops:

    VMX(RZN-P2 vty)# show nhdb summary    
     Total number of  NH = 94
    VMX(RZN-P2 vty)# show nhdb summary detail    
          Type              Count
        ---------         ---------
         Discard          12
          Reject          11
         Unicast          10      <<<<<<<<<<<<<<
         Unilist          0
         Indexed          0
        Indirect          0
            Hold          0
         Resolve          2
           Local          13
            Recv          8
        Multi-RT          0
           Bcast          4
           Mcast          7
          Mgroup          1
        mdiscard          7
           Table          11
            Deny          0
         Aggreg.          2      <<<<<<<<<<<<<<
          Crypto          0
          Iflist          0
          Sample          0
           Flood          0
         Service          0
        Multirtc          0
          Compst          6      <<<<<<<<<<<<<<
       DmxResolv          0
          DmxIFL          0
         DmxtIFL          0
           LITAP          0
            Limd          0
              LI          0
          RNH_LE          0
            VCFI          0
            VCMF          0

    Since we had several transit labels, each had its own unicast next-hop, and Junos did not really care that there were only two interfaces transit traffic could be sent out of; as a result we got 8 aggregate next-hops, which naturally increased the number of unicast next-hops as well. With composite-next-hop enabled, instead of generating its own next-hop chain for each label, each composite next-hop simply refers to the two existing aggregate next-hops.

    I would like to add that when you enable composite-next-hop for ingress LSPs, all BGP sessions flap; when you enable it for transit LSPs, the sessions do not flap (even with BGP-LU enabled), but all MPLS labels are withdrawn from the forwarding table and reinstalled.

    In conclusion, I would like to compare indirect-next-hop and composite-next-hop visually.
    Three L3VPNs were set up in the lab, with the PE3 prefixes (10.2.0.0/24 and 10.3.0.0/24) advertised with per-prefix labels and the PE1 prefixes with per-vrf labels:



    and three L2CKTs: two to PE1 and one to PE3:



    In addition, the topology is built on aggregated interfaces, and there are equal-cost paths toward PE1.

    This illustration shows the next-hop “tree” when indirect-next-hop is used:



    The prefixes 10.0.0.0/24, 10.0.1.0/24 and 10.0.2.0/24 share one service label, and likewise the prefixes 20.0.0.0/24, 20.0.1.0/24 and 20.0.2.0/24 share one service label: they are all advertised from PE1. As you can see, these prefixes are reachable through the same indirect next-hops. But 10.2.0.0/24 and 10.3.0.0/24 have different labels (their labels are allocated per-prefix), which means different indirect next-hops. As for the L2CKTs, I think everything is clear: each has its own service label and therefore its own indirect next-hop. The result is 29 unicast next-hops.

    Now the same thing, but with composite-next-hop enabled:



    There are fewer next-hops here. Prefixes that share a service label are reachable through the same composite next-hop. As you recall, the service label now lives at the composite level of the hierarchy. All composite next-hops in turn refer to indirect next-hops. In the diagram above we have two protocol next-hops (PE1 and PE3) and two services, L3VPN and L2CKT. As a result we get four indirect next-hops: L3VPN/PE1, L3VPN/PE3, L2CKT-LDP/PE1 and L2CKT-LDP/PE3. And since the unicast level of the hierarchy now carries only the transport label, several indirect next-hops can refer to the same unicast next-hop. As a result, the number of unicast next-hops dropped from 29 to 8.
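    The four indirect next-hops can be sanity-checked with the same toy counting approach used earlier (a model, not Junos internals; only the grouping logic matters here):

```python
# Toy model: with composite next-hops, indirect next-hops are keyed by
# (service, protocol next-hop). The lab above has two services (L3VPN,
# L2CKT) terminating on two remote PEs (PE1, PE3).

routes = [
    ("l3vpn", "PE1"), ("l3vpn", "PE1"), ("l3vpn", "PE1"),  # per-vrf labels
    ("l3vpn", "PE3"), ("l3vpn", "PE3"),                    # per-prefix labels
    ("l2ckt", "PE1"), ("l2ckt", "PE1"),                    # two circuits to PE1
    ("l2ckt", "PE3"),                                      # one circuit to PE3
]
indirect = {(family, pnh) for family, pnh in routes}
print(len(indirect))  # 4
```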

    Thank you for your attention.
