Inhuman networks

    Once upon a time, driving was almost an art. In the days of the glorious classics (the "sixes" and "fives") you had to know how to clean a carburetor, how to replace a fuel pump, and what a choke was.
    Computers were once huge, and the word "debug" was used in its most literal sense. When the first PCs began to appear in our homes, it was important to understand what the north and south bridges were, how to install a video card driver, and which registry value to change to get a finicky game to start.

    Today, whether for personal use or for business, you come to the store, buy the machine, press the "ON" button and start using it.
    Yes, there are nuances - this trick will not work with a System p5 595 or a Bagger 293; there you need technical specialists. But in essence, even for a company with several branches: you buy a GAZelle, a few PCs, give them Internet access - and you can work.

    Some time ago I had an argument with a person far removed from networking. He had a reasonable question: why is it impossible to automate the creation and configuration of small corporate networks? Why can't he buy one box for each of his branches, press five buttons on each, and get a stable working network?
    The question went even deeper and touched a personal chord: why do companies, providers and vendors need such large technical support staffs? Are most problems really impossible to find and fix automatically?

    Yes, I know the many counterarguments that immediately pop into your head. At first such a framing of the question seemed utopian to me too. And it is understandable that today it looks impossible in the world of communications - at most it works within a single home router, and even there not everything is smooth.
    Still, the question is not without a rational core, and it took hold of my mind for a long time. True, I set aside corporate SOHO and SMB networks and devoted my thoughts to provider networks.
    To a non-technical person it may seem that automatic configuration is more important and easier to implement than troubleshooting. But it should be obvious to any engineer that it is troubleshooting that lays the yellow brick road - if we do not know what can go wrong, how can we even try to configure anything?



    In this opening article, under the cut, I want to share my thoughts on the various obstacles on the way to this goal and on how to overcome them.



    In my opinion, the primary task is to teach the equipment to find problems and their causes automatically. That is what I would like to talk about first today.
    The next step is to learn how to fix them. That is, knowing the cause of a problem, in most cases it is quite possible to solve it without human participation.
    A degenerate case of fixing a problem is configuration from a template. We have an LLD (Low Level Design - a detailed network design), and based on it the Control System configures all the equipment, from access switches to high-end routers in the network core: IP addresses, VLANs, routing protocols, QoS policies.

    In principle, in one form or another this already exists - template-based auto-configuration is nothing unheard of. It is just that it usually covers not the whole network but some homogeneous segment, for example the access switches. And there is a tight binding to the command interface and, accordingly, to the manufacturer.
    For now it is all implemented bluntly - the script does not check for consistency: if there is an error in the design, it will end up on the equipment. There are, of course, configuration validators, but that is yet another "manual" step.

    The task is ambitious to the maximum - automatic generation of the topology, the IP plan, the switching tables and, in fact, the configuration of everything and everyone. The most a human should do is approve the design, rack the equipment and run the cables.

    Automatic troubleshooting


    There are many mechanisms that increase fault tolerance and reduce service loss and downtime when a problem occurs on the network - IGP, VRRP, Graceful Restart, BFD, MPLS TE FRR and much more. But these are all scattered pieces. People try to glue them together with varying success, but that does not stop them from being disparate entities.
    This is reminiscent of the search for a theory of everything in physics, according to which the four known fundamental interactions are of the same nature - a universal, unified theory that explains everything simply and clearly. But so far the picture does not come together.

    Here is a nice illustration of how the configured protocols interact:



    On such a network, IS-IS is running as the IGP. On top of it sits MPLS TE with the FRR function activated.
    R1 uses BFD to monitor the state of R2 and the state of the TE tunnel / LSP. When a line card on R2 reboots due to a software error, BFD on R1 instantly reports this to all processes that need to know. MPLS TE invokes FRR and traffic is redirected to R3 along a temporary path.
    Moreover, thanks to the GR functionality, all of R2's routes, all the corresponding FIB entries, and even the neighbor relationship are preserved on R1. Meanwhile R2 returns to normal: the card boots, the interfaces come up. On R1 everything is already in place, and in the shortest possible time it is ready to forward traffic through R2 again. As a result, services return to their previous path - everyone is happy, everyone is satisfied, and none of the customers ever felt the crooked hand of a programmer.
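
    To make this chain of reactions concrete, here is a toy Python sketch - purely illustrative, not any vendor's implementation - of the publish/subscribe idea behind it: BFD detects the failure once, and every process that registered an interest reacts on its own.

```python
# Toy model of the event fan-out described above: BFD detects the failure once
# and notifies every process that registered an interest in that session.
# The EventBus, the callbacks and the messages are illustrative, not a real API.

class EventBus:
    def __init__(self):
        self.subscribers = {}                      # event name -> list of callbacks

    def subscribe(self, event, callback):
        self.subscribers.setdefault(event, []).append(callback)

    def publish(self, event, **details):
        for callback in self.subscribers.get(event, []):
            callback(**details)

bus = EventBus()

# MPLS TE reacts to the BFD failure by switching the LSP onto the backup path (FRR).
bus.subscribe("bfd_session_down",
              lambda neighbor, **_: print(f"MPLS TE: FRR switchover, bypassing {neighbor} via R3"))

# The IGP acts as a graceful-restart helper: it keeps the neighbor's routes and
# FIB entries for the hold time instead of flushing them immediately.
bus.subscribe("bfd_session_down",
              lambda neighbor, **_: print(f"IS-IS: GR helper mode, keeping routes via {neighbor}"))

# R1's BFD process detects that R2 stopped answering and publishes a single event;
# every interested process reacts without any of them polling the others.
bus.publish("bfd_session_down", neighbor="R2", reason="line card reboot")
```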

    But can you imagine how much configuration is required to organize this interaction? How often is that configuration wrong, and how often does the failover work out quite differently from what we wanted? How often do engineers simply lack the competence to configure such things, so that many corporate networks run with a bare-minimum service configuration - it works, and thank God! And how many problems could be detected at an early stage and prevented before they turn into three hours of downtime and a damaged reputation?

    Therefore, let us carry out a rough classification of problems in order to understand which approach each of them needs.

    1. Real-time critical situations
    2. Problems that already exist but do not yet affect services
    3. Hardware problems
    4. Potential software issues
    5. Incorrect configuration


    1. Real-time critical situations


    The first type is problems that have just occurred due to circumstances - a cable break, a software or hardware failure. This is essentially the only type that developers are already fighting in one way or another. This is what we looked at above. We already have more than a dozen protocols that monitor the state of links and services and can rebuild the topology based on the real situation. The problem is that, as I have already noted, these are all pieces of different puzzles. Each protocol, each mechanism is configured individually. And it takes extraordinary ability to grasp all of this as a whole and have a fundamental understanding of how large networks operate.

    Well, one way or another we can cope with these - there are tools for that. What is the problem here, besides the complexity itself? Let me explain: after everything has happened, we either notice nothing at all (50 ms is not always visible to the naked eye), or we dig through tons of logs and crash dumps trying to establish a causal relationship between a series of events. And that, you know, is no easy task: the high-level logs may not be enough, while the detailed ones contain a huge amount of uninformative data - for example, a record for every dropped LSP, the card reboot process, and so on. And this has to be done not on one box but on every box along the traffic path, and often on the ones standing aside as well. You have to separate the wheat from the chaff - the logs related to this incident from the ones related to other problems. And that is assuming the network is mono-vendor to begin with.

    What am I actually getting at? Logs are good, they are wonderful, they are needed. But they are not readable. Even if you have a properly built network, with NTP configured and a SYSLOG server that lets you view all the events on all devices in truly chronological order, finding the problem will still take a lot of time.

    Besides, each device knows what happened on it. Returning to the last example: the PE sees the tunnels and VPNs go down and the IGP reconverge. It could tell the Control System in human terms something like: "At 16:20:12 on 01/01/2013 all tunnels and VPNs in such-and-such direction, through such-and-such interface, went down. In addition, the routing was rebuilt. I am not sure what happened there, but OSPF told me that the link between devices A and B disappeared. RSVP also reported a problem."
    The intermediate P router, on which the SFP module burned out, would say: "At 16:20:12 on 01/01/2013 the SFP module in my port 1/1/1 was damaged. I checked everything - it is a hardware fault, a replacement is needed. OSPF and RSVP have sent notifications to all neighbors."

    Jokes aside, why not develop a standard, some protocol that lets a device perform a minimal analysis and send unambiguous information to the Control System? Having received data from all devices, collated and analyzed it, the Control System could give a very specific message:
    "At 16:20:12 the SFP module in port 1/1/1 on device B failed (here is a reference to the module type, serial number, uptime, average signal level, number of errors on the interface, reason for failure). This caused the following tunnels to go down (list, with links to the tunnel parameters) and the following VPNs (list, with links to the VPN parameters). At 16:20:12 traffic was redirected along the temporary path A-C-E (link to the path parameters: interfaces, MPLS labels, VPNs, etc.). At 16:20:14 a new LSP A-B was built."

    2. Problems that already exist but do not yet affect services


    What kind of problems are these? Errors on interfaces that are still lightly loaded, so they do not make themselves felt. Flapping interfaces or routes on backup links, passwords that are too weak, missing ACLs on VTY or on the external interface, a large number of broadcast messages, behavior that looks like an attack (floods of ARP or DHCP requests), high CPU utilization by some process, missing blackhole routes where route aggregation is configured.

    One way or another, many such situations are already monitored and informational messages are written to the logs. But who actually reads them? Such things get no attention until you are called onto the carpet because thousands of subscribers have lost connectivity. And in automatic mode the equipment does not try to find the cause or correct the situation - at best, forwarding to a SYSLOG server is configured.
    Of course, sometimes certain actions are taken - suppressing broadcast packets if their number exceeds a threshold, for example, or shutting down a port on which flapping is observed. But all of this is treating the symptoms - the equipment does not try to figure out what is causing this behavior.

    What are my thoughts on this situation? First of all, of course, the standardization of logs and traps. Standardization at a global level, at the level of a committee. All manufacturers must adhere to it strictly, as they do to the IP standard.

    Yes, this is a huge amount of work. You have to provide for every possible situation and message for every protocol. But one way or another each vendor already does this individually, inventing its own ways to report a problem. So maybe it is better to get together once and agree once and for all? After all, Martini L2VPN was also once Cisco's private development.
    The message could be sent to the Control System in a form like this, for example:
    "Message_Number.Parent_Message_Number.Device_ID.Date Time/Time range.Alarm_ID.Optional parameters"

    Message_Number - the serial number of the failure on the network.
    Parent_Message_Number - the number of the parent alarm that caused this one.
    Device_ID - the unique identifier of the device on the network.
    Date - the date of the alarm.
    Time / Time range - the time of the alarm or the period of its duration.
    Alarm_ID - the unique identifier of the alarm type in the standard.
    Optional parameters - additional parameters specific to this alarm type.
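
    To make the format more tangible, here is a minimal parsing sketch in Python. The delimiters, the field types and the idea of splitting on '.' are my assumptions - a real standard would have to pin down the encoding precisely, if only because optional parameters such as IP addresses can themselves contain dots.

```python
# Minimal parsing sketch for the proposed format, assuming '.' separates fields
# and a space separates Date from Time/Time range. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class AlarmMessage:
    message_number: int     # serial number of the failure on the network
    parent_number: int      # 0 means "not caused by another alarm"
    device_id: int          # unique identifier of the device on the network
    date: str               # date of the alarm
    time_range: str         # moment of the alarm or its duration
    alarm_id: int           # alarm type identifier from the global standard
    optional: list          # parameters specific to this alarm type

def parse_alarm(raw: str) -> AlarmMessage:
    # e.g. "2374698220.2374698214.8422.10/29/2013 10:00:01.182.GE0/0/0.Abnormal Power flow"
    fields = raw.split(".")
    date, time_range = fields[3].split(" ", 1)
    return AlarmMessage(
        message_number=int(fields[0]),
        parent_number=int(fields[1]),
        device_id=int(fields[2]),
        date=date,
        time_range=time_range,
        alarm_id=int(fields[4]),
        optional=fields[5:],
    )
```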

    Secondly, the equipment should be able to perform a minimal analysis of the situation and of its own logs. It should know which event is the cause and which is the consequence, and, along with the detailed logs, send the result of that analysis.

    For example, if BFD, the IGP and other protocols went down because an interface physically went down, the device should present this as a dependency branch: the port going down entailed this and that.

    Thirdly, the intelligent Network Monitoring System should present alarms in human terms.

    After the analysis, a standardized message could be sent to the monitoring server, for example:
    "2374698214.0.8422.10/29/2013 09:00:00-10:00:01.65843927456.GE0/0/0"
    "2374698219.2374698214.8422.10/29/2013 10:00:00-10:00:05.50.90.R2D2.70"
    "2374698220.2374698214.8422.10/29/2013 10:00:01.182.GE0/0/0.Abnormal Power flow.Power threshold is reached.Abnormal power timer is expired"


    The Monitoring System parses these messages into their components:
    Alarm No. 2374698214. Not a consequence of anything else. Occurred on the device with ID 8422 on 10/29/2013, lasting from 09:00:00 to 10:00:05. The universal alarm identifier is 65843927456. Additional parameters: GE0/0/0.

    Alarm No. 2374698219. Caused by alarm No. 2374698214. Occurred on the device with ID 8422 on 10/29/2013, lasting from 10:00:00 to 10:00:05. The universal alarm identifier is 50. Additional parameters: 90, R2D2, 70.

    Alarm No. 2374698220. Caused by alarm No. 2374698214. Occurred on the device with ID 8422 on 10/29/2013 at 10:00:05. The universal alarm identifier is 182. Additional parameters: GE0/0/0; Abnormal Power flow; Power threshold is reached; Abnormal power timer is expired.

    Then it consults the network device database and retrieves the description of the device with ID 8422.
    In the online or local copy of the global alarm database it finds the description and meaning of alarm 65843927456 - an abnormally high power flow. The parameter is the source interface, GE0/0/0.
    50 - high CPU utilization. Parameters: overall load (90), the most loaded process (R2D2) and the CPU utilization of that process (70).
    182 - interface shutdown. The parameters give the interface number and the reason the interface was shut down.

    Then the Control System generates an understandable and comprehensive message:

    "An external device connected to interface 10GE 0/0/0 of switch C3PO generated an abnormally high power flow from 09:00:00 to 10:00:05.
    Because of this, the R2D2 process utilized 70% of the CPU from 10:00:00 to 10:00:05. The port was shut down at 10:00:05.
    Abnormal Power flow. Power threshold is reached. Abnormal power timer is expired."
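
    Purely for illustration, here is a sketch of the Monitoring System's side of this pipeline, reusing the parse_alarm sketch above. The alarm database entries, the message templates and the report wording are invented, not part of any real standard.

```python
# Hypothetical fragment of the global alarm database: Alarm_ID -> message template.
ALARM_DB = {
    65843927456: "abnormally high power flow on interface {0}",
    50: "high CPU utilization: total {0}%, busiest process {1} at {2}%",
    182: "interface {0} shut down: {1}",
}

raw_messages = [
    "2374698214.0.8422.10/29/2013 09:00:00-10:00:01.65843927456.GE0/0/0",
    "2374698219.2374698214.8422.10/29/2013 10:00:00-10:00:05.50.90.R2D2.70",
    "2374698220.2374698214.8422.10/29/2013 10:00:01.182.GE0/0/0.Abnormal Power flow",
]

def render(alarm, children):
    """Turn one root-cause alarm and the alarms it caused into a readable report."""
    text = ALARM_DB[alarm.alarm_id].format(*alarm.optional)
    report = f"{alarm.date} {alarm.time_range}: device {alarm.device_id}: {text}"
    for child in children:
        report += f"\n  caused: {child.time_range}: " + ALARM_DB[child.alarm_id].format(*child.optional)
    return report

# Group alarms by Parent_Message_Number and print each root cause with its effects.
alarms = [parse_alarm(raw) for raw in raw_messages]
for root in (a for a in alarms if a.parent_number == 0):
    effects = [a for a in alarms if a.parent_number == root.message_number]
    print(render(root, effects))
```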

    3. Hardware problems


    I probably do not need to repeat that nothing lasts forever and nobody is perfect - interface cards fail, memory develops bad sectors, boards spontaneously reboot, programmers suddenly disappear.

    It seems to me that whenever a hardware problem shows up one way or another, the equipment can determine the cause unambiguously on its own - loss of synchronization, a fault on the control or monitoring bus, a failure of the power supply board.

    Some problems accumulate, others are sudden, but it seems to me that all of them can be traced. Take even a complete shutdown of a line card: the control board, even if the card has lost its main power supply, should be able to poll it and identify the problem. If it cannot poll the card, then either the poller itself is faulty - which is easy to check - or the card needs replacing.

    Again, the Control System should receive a message about this:

    "The line card in slot 4 has lost synchronization with the switching fabrics due to damage to the L43F network chip. The card needs to be replaced." And right there - a link to an auto-generated equipment replacement request.
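
    As a sketch of that last step, here is how the Control System might pre-fill a replacement request from such an alarm. The field names, the inventory lookup and the ticket structure are hypothetical.

```python
# Hypothetical sketch: turning a confirmed hardware alarm into a pre-filled
# replacement (RMA) request. Field names and the inventory lookup are invented.
from dataclasses import dataclass

@dataclass
class HardwareAlarm:
    device_id: int
    slot: int
    component: str          # e.g. "L43F network chip"
    diagnosis: str          # e.g. "lost synchronization with the switching fabrics"

def build_rma_request(alarm: HardwareAlarm, inventory: dict) -> dict:
    """Pre-fill a replacement ticket; a human only has to approve it."""
    device = inventory[alarm.device_id]            # site, serial number, contract
    return {
        "site": device["site"],
        "chassis_serial": device["serial"],
        "faulty_part": f"line card in slot {alarm.slot}",
        "diagnosis": f"{alarm.diagnosis} ({alarm.component} damaged)",
        "recommended_action": "replace line card",
        "status": "awaiting human approval",
    }

inventory = {8422: {"site": "Branch 3", "serial": "SN-0042", "contract": "standard"}}
alarm = HardwareAlarm(device_id=8422, slot=4, component="L43F network chip",
                      diagnosis="lost synchronization with the switching fabrics")
print(build_rma_request(alarm, inventory))
```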

    4. Potential software issues


    Everything is simple here. Either the vendor has a good database of software releases and patches, with descriptions, lists of available functionality and resolved problems - or it does not. Naturally, if not, one is needed.

    The Control System simply tracks all updates and, when necessary, downloads and installs them.
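
    A rough sketch of that logic, assuming - and it is only an assumption, no such universal feed exists today - that the vendor publishes a machine-readable list of releases with the problems each one fixes.

```python
# Rough sketch, assuming the vendor publishes a machine-readable release feed
# (no such universal feed exists today; the structure below is invented).
installed = {"router-R2": "V100R002", "switch-C3PO": "V100R001"}

vendor_feed = [
    {"version": "V100R003", "fixes": ["line card reboot under high packet rate"], "recommended": True},
    {"version": "V100R002", "fixes": [], "recommended": False},
]

def pending_updates(installed: dict, feed: list) -> dict:
    """Per device, list the recommended releases newer than what is running."""
    recommended = [r for r in feed if r["recommended"]]
    result = {}
    for device, version in installed.items():
        # naive string comparison; a real system would parse the version format
        newer = [r for r in recommended if r["version"] > version]
        if newer:
            result[device] = newer
    return result

for device, releases in pending_updates(installed, vendor_feed).items():
    for release in releases:
        print(f"{device}: schedule upgrade to {release['version']} (fixes: {release['fixes']})")
```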

    5. Incorrect configuration


    This is perhaps the most difficult aspect. The variety of variations is enormous. Even plain IP will cause a storm of emotions if you try to implement automatic debugging of its configuration.

    Formalizing the configuration rules means creating a universal language of interaction between the Control System and the equipment. You cannot seriously try to pull scattered data from Juniper, Cisco, ZTE and D-Link onto one server. You cannot write a parser that adapts itself to data from different devices.

    That is, at the very least, the storage of the configuration and its transfer to the Control System will have to be standardized.

    The way I see it: there should be a block describing the capabilities of the system - what type it is (switch, router, firewall, etc.) and what functionality it supports (OSPF, MPLS, BGP). Then come the sections of the configuration itself. Such a structure should be supported by any equipment, from an access switch to a VoIP gateway in the IMS core.

    Then it becomes easy to find various inaccuracies: mismatched parameters on facing devices (for example, BFD discriminators, IS-IS network levels, BGP neighbors, IP addresses), duplicate Router-IDs, PIM not enabled between two multicast routers, and so on.
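
    A toy example of such checks over a hypothetical standardized configuration model - the data layout is invented; the point is only that once every vendor exposes its configuration in one schema, checks like "duplicate Router-ID" or "one-sided BGP session" become trivial.

```python
# Toy consistency checks over a hypothetical standardized configuration model.
from collections import defaultdict

configs = {
    "R1": {"router_id": "10.0.0.1", "bgp_neighbors": {"10.0.12.2": {"remote_as": 65001}}},
    "R2": {"router_id": "10.0.0.1", "bgp_neighbors": {}},   # duplicate Router-ID, missing peer
}
addresses = {"10.0.12.1": "R1", "10.0.12.2": "R2"}           # interface IP -> device

def check_duplicate_router_ids(configs: dict) -> list:
    seen = defaultdict(list)
    for device, cfg in configs.items():
        seen[cfg["router_id"]].append(device)
    return [f"Router-ID {rid} is used by {devs}" for rid, devs in seen.items() if len(devs) > 1]

def check_bgp_symmetry(configs: dict, addresses: dict) -> list:
    """Flag BGP sessions configured on one side only."""
    issues = []
    for device, cfg in configs.items():
        for peer_ip in cfg["bgp_neighbors"]:
            peer = addresses.get(peer_ip)
            if peer and not any(addresses.get(ip) == device
                                for ip in configs[peer]["bgp_neighbors"]):
                issues.append(f"{device} peers with {peer} ({peer_ip}), but {peer} has no session back")
    return issues

print(check_duplicate_router_ids(configs) + check_bgp_symmetry(configs, addresses))
```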

    But, frankly, these are already non-trivial things that can only be achieved with proper standardization of topologies or a formalized LLD (Low Level Design).

    I have already given real-life examples of everything described above in this article.

    Tech support


    In my opinion, in this area (as in many others) there is now a huge amount of unnecessary work and an overuse of human resources.

    We are talking about carrier-grade networks; SOHO and SMB have entirely different subtleties.

    Take the procedure for replacing a failed board as an example.
    Today it looks like this (with some variation between vendors):
    1) The board fails, reboots, or starts sending strange messages. The customer sees errors and alarms in the logs but cannot identify the problem unambiguously.
    2) The customer calls the vendor's support hotline, describes the problem in words or fills out a standard form, and provides the data, logs and files collected by slave labor or entirely on his own.
    3) The hotline operator opens a ticket and assigns it to a group of technical specialists.
    4) The team leader assigns the ticket to an engineer.
    5) The engineer analyzes the data and eventually sees the same alarm, connects to the equipment, runs a series of tests, collects information.
    6) Often the engineer has no way to establish the true cause and cannot recommend a replacement on his own authority, so he escalates the ticket to the next level.
    7) Depending on the competence of the higher-level engineers, the ticket may wander there for a while, until, by entering certain commands and analyzing logs and diagnostic information according to a certain algorithm, a hardware fault is finally established.
    8) The recommendation travels back down the chain to the responsible engineer and on to the customer.
    9) Then follows the procedure of confirming the closure of the ticket and assorted bureaucracy.
    10) The customer opens a new ticket for the replacement - fills out the form again, describes the problem again. The call center hands the request to the appropriate department, responsible people are appointed, and only then does the replacement procedure actually begin.

    This is a rather pessimistic scenario, but one way or another the whole procedure takes a long time and requires the effort of at least 4-5 people: the customer's engineer, the call-center operator, the team leader, the support engineer, higher-level engineers, spare-parts department staff.

    But in fact, there are algorithms for checking the physical parameters of boards. Yes, there are a lot of them, but let us not pretend otherwise - they can be built into the software or even into the hardware of the boards and chassis.
    The equipment itself should run this analysis, and in the event of a hardware problem the Control System should issue an unambiguous recommendation to replace the part (and possibly file the replacement request itself, from a template). If none of the known hardware problems is confirmed, the Control System should offer to open a ticket with technical support. And better still, again, fill out the template itself and register the ticket - the human's task is only to confirm the request.

    The same goes for many other issues.
    I cannot judge all vendors, but questions often arise about which software versions are currently relevant, which patches should be installed, and what functionality they contain.
    I believe the Control System should handle all of this: upload software and patches, track currently known hardware problems, install patches and update firmware. I will describe in more detail how I see such a system working in one of the following articles.

    Questions about configuration, about some service not working? Some of these are quite obvious and come down either to incorrect application of the setup instructions or to inconsistent configuration on different devices. A support engineer tracks down such situations easily by entering certain commands. Can the Control System not do the same - analyze the configuration, understand the problem, and even fix it?

    Socio-psychological aspects


    Yes, many engineers, myself included, have a substantial question: what will we all do if we can be replaced by automation?
    Let me hasten to reassure you: we will all become obsolete, like chimney sweeps and the young ladies at the switchboards.

    In fact, this is an eternal question and a cause for hand-wringing. Where did the coachmen go with the advent of the automobile? Where did the huge staff serving the first computers go with the advent of compact PCs?
    The modern world offers us more and more diverse jobs. In the end, you can always become a fuel cell for the Matrix.

    But maintenance staff and technical support are not going anywhere - there are plenty of problems that cannot be solved automatically for one reason or another (administrative, for example). I have considered these and other questions in another article.

    Networks still have to be designed, cables laid, the Monitoring System watched, problems solved.

    It is just that our life needs to be made somewhat more reasonable.

    A much more important issue is vendor support.
    I completely agree with the comments on the article on Nag.ru: right now nobody needs such a system, together with its standards and super-protocols.

    Vendors have their own NMSs, which they sell for serious money (enormous money, I must say). And if such standards existed, the equipment of one vendor could simply be swapped for another's and nobody would notice. Do they need that?

    Large operators (and not so large ones) often have home-grown systems: configuration validators, auto-configuration scripts, superficial problem analyzers.

    Engineers are often inert and lazy, or, on the contrary, hyperactive, manually sculpting thousands of lines that will vanish when the next generation of admins formats the hard drive.

    Be that as it may, all of this is not it. Not it at all.

    After talking with colleagues, I realized that a misunderstanding of the idea was emerging: supposedly I want to propose creating some Control System software that will parse logs and configurations with scripts and issue a solution, with 33 thousand templates for different vendors and different software versions - someone's proprietary solution created by the will of one enterprising person.
    NO. We are talking about something bigger - the global standardization of the system of messages between devices. It is not the Control System that should take care of being able to recognize the logs of Huawei, Cisco, Extreme, F5 and Juniper - the equipment itself should send logs in a strictly defined format. It should not be a bunch of clumsy scripts collecting information about configuration, crashes and parameters over assorted protocols (FTP, TFTP, Telnet, SSH) - it should be a single, flexible, vendor-independent system.

    The other extreme is the SDN paradigm. That is not it either. SDN concentrates in itself not just the monitoring functions - it takes over almost all the tasks of the equipment except the actual forwarding of data; it makes every decision about how that data is transferred. No channel to the SDN brain - no network.
    What I am talking about is still a flexible network of independent, self-sufficient devices, while the Control System lets you keep a finger on the pulse: know everything that happens on the network, take care of all problems with minimal human participation, and present important information in an accessible form.

    P.S.
    I do not claim to have considered the issue exhaustively - my level of knowledge is clearly not enough to embrace it fully. These are just reflections.
    But I am sure that this is the vector of development of network technologies in terms of operation and support. In 50-80 years everything will change: networks will cover not only computers, tablets and phones - everything will be the network. Complete convergence: Wi-Fi, fixed networks, 5G, 6G, telephony, video, Internet, M2M. Things are clearly not moving along the path of simplification, and more and more effort and money will be spent on traditional maintenance.
    Most importantly, such standards must arrive on time. Their time is not now, but it is time to talk about them.

    While writing this article, which was originally planned as a mere note, I came to the conclusion that the topic is too interesting to me, so there will be a series of articles devoted to this problem:
    • The Control System: its capabilities and principles of operation.
    • Protocols for interaction and exchange of service information between the devices and the Intelligent Control System.
    • Detecting and eliminating configuration errors.
    • Automating equipment configuration.
