
EXTREME'al LACP
The devil is in the details. That is what I always think when dealing with something new. A new piece of software or hardware may be wonderful both technically and economically, but there is bound to be some little thing that looks insignificant yet manages to drink plenty of your blood. Below the cut I want to tell you about a few such bloodthirsty trifles in Extreme Networks equipment.


Appetizer
In our Moscow network we use Extreme Summit X670V-48x switches for aggregation. For resilience they are stacked via VIM4-40G4X modules (4 ports of 40G Ethernet each). Right now there are 2 chassis in each stack, and there are already tickets to add a third.

Accordingly, all links are distributed evenly (i.e., half and half) between the two chassis and aggregated with LACP. If something happens to one chassis, we lose half the bandwidth, but service continues.
First
The first thing we ran into when moving load onto the Extremes was this: we were switching providers, the physical link came up, but the LACP bundle refused to form. Everything looked correct, yet nothing worked. The exact same settings toward another provider worked fine; here, nothing helped. We could have burned a lot of time on a support case, but colleagues from AS8359, Andrei and Pavel, came to the rescue and quickly recommended one line:
configure sharing 1 lacp system-priority 1
It helped. The strangest part is that the problem does not occur with every partner. With our Cisco boxes there is no problem, but with an ASR9000 on the other side it will most likely appear. But not always. We were too lazy to get to the bottom of it; just remember the line and repeat it like a mantra. (Presumably it works because in LACP the system with the lower system priority gets to decide which links are selected into the aggregate, and priority 1 puts that decision on our side.)
Second
Our switches are 10-gigabit, and our traffic is much more than that, so we have many aggregated links (LAGs). Inside a LAG you need to balance traffic so that all member links ("legs") are loaded as evenly as possible; it is very unpleasant when one leg does all the heavy lifting. While there were 8 legs in a LAG everything was fine, but then one lambda blinked and went down. We always keep headroom in the total bandwidth, so losing one 10G link is not a problem by itself. But in a strange way the evenly loaded legs stratified: some rose (as expected), others fell. Oops!
We did not get to investigate the first time: the victim was repaired and everything returned to normal, but a bad aftertaste remained. We came back to the question when we added two more lambdas. We looked: stratification again. Turn the two "extra" links off, everything is smooth; turn them on, trouble again. Experimentally we established that the stratification occurs whenever the number of active legs in the LAG is not a power of 2.
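My guess at the mechanism (an assumption on my part, not something Extreme confirmed): the ASIC hashes each flow into a fixed, power-of-two number of buckets and then maps the buckets onto member links. When the member count does not divide the bucket count, some links own more buckets than others. A toy Python model of the effect:

```python
import random

def lag_load(num_links, num_buckets=8, flows=100_000):
    """Hash flows into a fixed power-of-two bucket space, then map
    buckets onto member links round-robin. A simplified model of an
    ASIC picking a LAG member; the real hardware may differ."""
    bucket_to_link = [b % num_links for b in range(num_buckets)]
    load = [0] * num_links
    rng = random.Random(42)
    for _ in range(flows):
        flow_hash = rng.getrandbits(32)  # stand-in for an L3_L4 header hash
        load[bucket_to_link[flow_hash % num_buckets]] += 1
    return load

print(lag_load(8))  # 8 legs, one bucket each: even load
print(lag_load(7))  # 7 legs: leg 0 owns two buckets and runs about twice as hot
```

Any power-of-2 leg count divides the bucket space evenly, which would explain why only non-power-of-2 LAGs stratify.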
We thought about it and talked to support. They said: use the custom algorithm. Frankly, I did not believe it, because according to the documentation custom with default settings is exactly equivalent to L3_L4. Colleagues from MSK-IX, who had already wrestled with this problem, convinced me.
We reconfigured our LAGs to the custom method, and now the evenness of balancing no longer depends on the number of active legs. To do this, however, we had to delete the old configuration and create each LAG anew, because the balancing algorithm can only be set at LAG creation time. But when has that ever scared us?
enable sharing 1:1 grouping 1:1-5, 2:1-5 algorithm address-based custom lacp
Dessert
And for dessert, my favorite :) Port-channel. How is port aggregation done on any Cisco? A special port is created, the Port-channel; it is configured and enabled, then physical ports are added to it (that is not quite how the configuration works, but it does not matter now). Need to add a port? Add it. One became unnecessary? Delete it. Any of them. Convenient, damn it!
In XOS the story is different: the LAG configuration is bound to one of the physical ports, which becomes the config master. That port must be a member of the LAG. I can add and remove ports as much as I like, as long as the config master stays put. But if you need to move that master port, that's it: the only option is to delete the sharing and create it anew. Oops...
Like any decent shop, we love to rearrange things to suit ourselves, and for various reasons we have plenty of such housekeeping. So we have already stepped on this rake several times. I will not speak for the rest of the team, but I personally did not enjoy it.
Judge for yourself: you have to reconfigure the port's VLANs (by the way, in Extreme's ideology VLANs are not hung on a port; ports are added to VLANs, so you cannot assign a port's whole VLAN list in one line), STP (gentlemen officers, silence in the ranks!) and other such delights. With that volume of settings, making a mistake is all too easy. So where are you, my beloved Cisco-style port-channels? Still, it is not all bad: if the config master port goes down, the LAG itself stays alive (otherwise it would be completely miserable). Even when we powered off the chassis holding the config master, everything kept working. Well, at least that...
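To make the pain concrete, here is roughly what moving the config master entails today. This is an illustrative sketch: the port numbers and the VLAN name v100 are invented, and in real life every VLAN the LAG belongs to has to be re-attached by hand.

```
# the whole load-sharing group has to be destroyed first
disable sharing 1:1
# ...then recreated with the new master port
enable sharing 1:2 grouping 1:2-4, 2:1-4 algorithm address-based custom lacp
# ...and the new master re-added to each VLAN (repeat per VLAN)
configure vlan v100 add ports 1:2 tagged
```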
Out of desperation I filed feature request FR4-4584728621 last year. As you might guess, it has not been implemented, and there is no certainty the vendor is thinking about it at all.
The rescue of drowning men is the drowning men's own job. One day I was looking at a port-channel on a Cisco:
#sh int po1 eth
Port-channel1 (Primary aggregator)
Age of the Port-channel = 174d:06h:35m:37s
Logical slot/port = 2/1 Number of ports = 4
HotStandBy port = null
Port state = Port-channel Ag-Inuse
Protocol = LACP
Port security = Disabled
And I noticed the beautiful line Logical slot/port. "Oho," I said. Where did a second slot come from on this non-stackable, non-expandable Cisco?! When I realized it was a logical slot, I had a PLAN!
What if we use a port that will never carry traffic as the config master? No, we would feel bad wasting ports we paid for; they will all be used. But what about a nonexistent port, one on a slot (a stack member) that is not there? Then we would not care which physical ports belong to a given LAG: we do all the configuration on this virtual slot and add or remove real ports as needed. Yes, it reduces the maximum number of chassis in one stack, but have you ever seen a stack of 8 chassis? I have never seen one taller than 3.
Then everything becomes simple. We declare a virtual slot with as many ports as possible (all LAGs will be declared on its ports) and off we go!
configure slot 8 module X670V-48x
enable sharing 8:1 grouping 8:1, 1:1-5, 2:1-5 algorithm address-based custom lacp
To add new ports:
configure sharing 8:1 add ports 1:48, 2:48
enable ports 1:48, 2:48
Now you can safely move traffic to the new ports; LACP stays up the whole time. Then we clean out the old ports so they can be reused for something else:
disable ports 1:1, 2:1
configure sharing 8:1 delete ports 1:1, 2:1
A quick lab test confirmed the idea is viable. Fair warning, though: we have not yet run it in production, although there is already a ticket to roll it out.
I tried to reach the developers and started a topic on the Extreme forum; you can read the replies yourself.
It seems to me that adding a CLI that lets you work with LAGs conveniently should be relatively simple, given that the underlying mechanisms already exist. I suspect every networker suffers from this problem in splendid isolation, so I propose we turn up the heat. If you are an Extreme user, contact support, refer to the FR above and ask them to top up the glass once the foam settles. Write in the forum thread what you think of the vendor. Let it be a little network flash mob.

Our previous publications:
» We implement a secure VPN protocol
» We implement an even more secure VPN protocol
» Extra elements or how we balance between servers
» Blowfish on guard ivi
» Non-personalized recommendations: association method
» By city and by weight or how we balance between CDN nodes
» I am Groot. We do our analytics on events
» All for one or how we built CDN