NOC: An Integrated Approach to Network Management
Complex networks require an integrated approach to management. If the whole network is a dozen switches run by one engineer, a handful of simple scripts, a few spreadsheets and any basic monitoring system are enough to keep it working. Larger networks, built from equipment of different vendors and supported by dozens of engineers scattered across cities and countries, bring very specific problems: a pile of home-grown scripts becomes uncontrollable and unpredictable, more effort goes into integrating the various management systems with one another than it would take to develop and deploy one from scratch, and so on. It soon becomes clear that the problem of managing a complex network can only be solved comprehensively.
Back in the early 1980s, ISO identified the main components of a network management system. The model is called FCAPS. According to ISO, to manage a network successfully one must be able to manage faults (F), manage the configuration of equipment and services (C), collect and process statistics on service consumption, that is, accounting (A), evaluate performance (P) and centrally manage security (S). The past three decades have added nothing fundamentally new, and all network management tasks still revolve around these components in one way or another.
Commercial suites of this kind are very expensive and far from flawless, and among open-source systems there was an obvious gap, which practically invited us to reinvent the wheel. NOC appeared as a generalization of our own experience in building and operating networks, after much trial and error.
It should be noted that NOC is not a monitoring system and not an alternative to Zabbix, Nagios, Cacti and the like.
Its main task is to automate the daily work of a network operations center.
When developing the system, we proceeded from several premises:
First, there should be a single source of information, and it should be convenient to use.
Second, delegation of authority. Data in the system is entered by different people from different departments.
Third, there is no point in fussing over each individual piece of hardware: it is expendable. A Catalyst 6500 may stand here today and a Force10 tomorrow. To manage the network, you need a higher-level interface that abstracts away the specific vendor and model as much as possible.
Fourth, there are always gremlins around. A significant share of outages is caused by the human factor. You need to be able to understand quickly what was done and what led to the outage. To do this, you need to correlate configuration changes, syslog and SNMP events, and much more.
Currently, NOC consists of several modules:
Address Space Management (IPAM) - management of the IP address space. The main differences from other solutions are support for independent address spaces in separate VRFs, hierarchical allocation of address blocks, and delegation of authority. For example, you can allocate a block to a city and give the city branch the right to manage it, and then let them do whatever they want within the allotted limits. At the same time, reports let you track how well an individual city's activity complies with the general policies. The subsystem works comfortably with tens of thousands of allocated blocks and hundreds of thousands of addresses, and supports both IPv4 and IPv6.
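The combination of per-VRF address spaces, hierarchical blocks and delegation can be illustrated with a minimal sketch built on Python's standard ipaddress module. The AddressSpace class, its methods and the owner names are hypothetical and only illustrate the idea; they are not NOC's actual data model.

```python
import ipaddress


class AddressSpace:
    """Hypothetical registry of allocated blocks, kept separately per VRF."""

    def __init__(self):
        self.blocks = {}  # VRF name -> {network: owner}

    def allocate(self, vrf, prefix, owner, requested_by):
        net = ipaddress.ip_network(prefix)
        table = self.blocks.setdefault(vrf, {})
        if net in table:
            raise ValueError(f"{net} is already allocated in VRF {vrf!r}")
        parent = self.closest_parent(vrf, net)
        # Delegation: only the owner of the enclosing block may carve it up.
        # Top-level blocks (no parent) are not restricted in this sketch.
        if parent is not None and table[parent] != requested_by:
            raise PermissionError(f"{requested_by} does not manage {parent}")
        table[net] = owner
        return net

    def closest_parent(self, vrf, net):
        """Return the most specific allocated block that contains net, if any."""
        parents = [
            block
            for block in self.blocks.get(vrf, {})
            if block.version == net.version and block != net and net.subnet_of(block)
        ]
        return max(parents, key=lambda block: block.prefixlen, default=None)


space = AddressSpace()
space.allocate("default", "10.0.0.0/8", owner="hq", requested_by="hq")
space.allocate("default", "10.1.0.0/16", owner="city-a", requested_by="hq")
space.allocate("default", "10.1.5.0/24", owner="city-a", requested_by="city-a")
# The same prefix can live independently in another VRF.
space.allocate("customer-x", "10.1.5.0/24", owner="hq", requested_by="hq")
```

In this sketch a request from another branch to allocate inside 10.1.0.0/16 raises PermissionError, which is the "within the allotted limits" rule in its simplest form.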
DNS Management - once the addresses are in order, why not synchronize everything with DNS? The result is a single interface for managing zones and provisioning them to various DNS servers according to the logic you describe. For example, zone data can be generated automatically, and the zones of a city and of its clients can go to the servers located in that city. There is no need for slave zones, and you can easily migrate to other DNS servers. Along the way, registrar databases are monitored to detect when a domain expires.
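As an illustration of generating zone data from address records, here is a minimal sketch that turns an address-to-name mapping into PTR records using the standard ipaddress module. The ptr_records function and the host names are hypothetical; how NOC actually builds its zones is not described here.

```python
import ipaddress


def ptr_records(addresses):
    """Build PTR record lines from a mapping of IP address -> host name."""
    records = []
    for ip, fqdn in addresses.items():
        # reverse_pointer yields the in-addr.arpa / ip6.arpa name for the address.
        reverse_name = ipaddress.ip_address(ip).reverse_pointer
        records.append(f"{reverse_name}. IN PTR {fqdn.rstrip('.')}.")
    return records


records = ptr_records({
    "192.0.2.10": "sw1.example.net",
    "2001:db8::1": "core1.example.net",
})
for line in records:
    print(line)
```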
Service Activation - an interface for working with equipment. A wide range of hardware is supported. The main idea is a set of building-block scripts that share a common interface, each performing one action and completely hiding the peculiarities of a particular piece of hardware. Examples: get the config, get the software version, create a VLAN, and so on. The blocks can be combined in different ways to solve a very wide range of tasks. A map/reduce task mechanism is also implemented, which lets you perform the same action on a large number of devices and analyze the results.
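The idea of building-block scripts with a common interface, combined with a map/reduce run, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the two vendor profiles, their command names and output formats are invented, and fake_cli stands in for a real device session. It is not NOC's API.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class VendorAProfile:
    """Made-up vendor whose CLI prints 'Software Version 1.2.3'."""

    def get_version(self, cli):
        return re.search(r"Version (\S+)", cli("show version")).group(1)


class VendorBProfile:
    """Made-up vendor whose CLI prints 'release=4.5'."""

    def get_version(self, cli):
        return cli("display release").split("=", 1)[1].strip()


def fake_cli(responses):
    """Stand-in for a device session: returns canned output per command."""
    return lambda command: responses[command]


def map_reduce(devices, script, reduce_fn):
    """Run the same script on every device in parallel, then reduce the results."""

    def run(device):
        name, profile, cli = device
        return name, getattr(profile, script)(cli)

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(run, devices))
    return reduce_fn(results)


devices = [
    ("sw1", VendorAProfile(), fake_cli({"show version": "Software Version 1.2.3"})),
    ("sw2", VendorAProfile(), fake_cli({"show version": "Software Version 1.2.3"})),
    ("sw3", VendorBProfile(), fake_cli({"display release": "release=4.5"})),
]

# Reduce step: count how many devices run each software version.
version_counts = map_reduce(
    devices, "get_version", lambda results: Counter(version for _, version in results)
)
print(version_counts)
```

The caller only knows the script name, get_version; which commands are sent and how the output is parsed stays inside each profile.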
Configuration Management - tracks what changed, where, and when. It started as an interface to Mercurial, and the module now does much more. In particular, when a switch config is changed in a city, the engineer responsible for that city receives a notification. In many cases he will have time to react quickly to local improvisation, reprimand the culprit and prevent an outage. The system can check collected configs for compliance with established policies and can take active measures in suspicious situations.
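Change tracking and policy checks of this kind can be sketched with the standard difflib and re modules. The POLICY list, its rule format and the sample configs are illustrative assumptions, not NOC's own rule syntax.

```python
import difflib
import re

# Illustrative policy: (description, regex, whether the config must match it).
POLICY = [
    ("default SNMP community must not be used", r"^snmp-server community public\b", False),
    ("password encryption must be enabled", r"^service password-encryption$", True),
]


def added_lines(old, new):
    """Return the config lines that appear in new but not in old."""
    diff = difflib.unified_diff(
        old.splitlines(), new.splitlines(), fromfile="before", tofile="after", lineterm=""
    )
    return [line[1:] for line in diff if line.startswith("+") and not line.startswith("+++")]


def policy_violations(config):
    """Return descriptions of all policy rules the config breaks."""
    violations = []
    for description, pattern, must_match in POLICY:
        found = re.search(pattern, config, flags=re.MULTILINE) is not None
        if found != must_match:
            violations.append(description)
    return violations


old_config = "hostname sw1\nservice password-encryption\n"
new_config = old_config + "snmp-server community public RO\n"

print(added_lines(old_config, new_config))
print(policy_violations(new_config))
```

A notification to the responsible engineer would be triggered from the non-empty diff, and an active response from the non-empty list of violations.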
VC Management - VLAN management. When fully deployed, it is enough to add VLANs to the database or delete them from it, and they automatically appear on or disappear from the necessary switches, regardless of vendor. For example, one installation had to manage simultaneously a pile of Cisco Catalyst 6500 and 4500, Nexus, 3750 / CBS3120, Force10 E-, C- and S-series, HP ProCurve and GbE2c, and small Alcatel switches.
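The vendor-independent part of such provisioning is a comparison of what the database says with what each switch actually has. A minimal sketch, assuming both sides are already available as sets of VLAN IDs per switch (the vlan_plan function and switch names are hypothetical):

```python
def vlan_plan(desired, actual):
    """Return, per switch, the VLAN IDs to create and to remove."""
    plan = {}
    for switch in sorted(desired.keys() | actual.keys()):
        want = desired.get(switch, set())
        have = actual.get(switch, set())
        create, remove = want - have, have - want
        if create or remove:
            plan[switch] = {"create": sorted(create), "remove": sorted(remove)}
    return plan


desired = {"sw1": {10, 20, 30}, "sw2": {10}}  # from the database
actual = {"sw1": {10, 40}, "sw2": {10}}       # as read from the switches
plan = vlan_plan(desired, actual)
print(plan)
```

The resulting plan would then be handed to vendor-specific scripts that know how to create or delete a VLAN on each platform.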
Fault Management - collection, analysis and correlation of syslog and SNMP trap events from equipment. NOC takes an original and flexible approach to event handling. FM is a topic for a separate discussion; suffice it to say that there are simply no sane open-source implementations, and the sane commercial ones can be counted on the fingers of one hand. The current FM implementation in NOC can process hundreds of events per second and pick out abnormal and emergency situations among them. The correlator finds relationships between alarms and tries to establish the root cause. For example, one failed link can spawn hundreds of alarms of various types in different parts of the network. Using its knowledge of the network topology and a built-in set of rules, the correlator can establish that the true cause of many alarms is the failed link and point clearly to where to look.
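Topology-based root cause analysis can be illustrated with a deliberately simplified sketch: devices that can no longer be reached from the monitoring point once the failed links are removed are treated as consequences of the link failure. The topology, node names and the correlate function are hypothetical, and NOC's real correlator also relies on rules that are not modeled here.

```python
from collections import deque


def reachable(topology, start, down_links):
    """Breadth-first search over the topology, skipping failed links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in topology.get(node, ()):
            if neighbor in seen or frozenset((node, neighbor)) in down_links:
                continue
            seen.add(neighbor)
            queue.append(neighbor)
    return seen


def correlate(topology, monitor, link_alarms, device_alarms):
    """Split device alarms into consequences of link failures and independent ones."""
    down = {frozenset(link) for link in link_alarms}
    alive = reachable(topology, monitor, down)
    consequences = [device for device in device_alarms if device not in alive]
    independent = [device for device in device_alarms if device in alive]
    return consequences, independent


topology = {
    "noc": ["core"],
    "core": ["noc", "dist1", "dist2"],
    "dist1": ["core", "acc1", "acc2"],
    "dist2": ["core", "acc3"],
    "acc1": ["dist1"],
    "acc2": ["dist1"],
    "acc3": ["dist2"],
}

consequences, independent = correlate(
    topology,
    monitor="noc",
    link_alarms=[("core", "dist1")],
    device_alarms=["dist1", "acc1", "acc2", "acc3"],
)
print("caused by the failed link:", consequences)
print("needs a separate look:", independent)
```

Here the alarms on dist1, acc1 and acc2 collapse into the single core-dist1 link failure, while the alarm on acc3 remains independent because that device is still reachable.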
Peering Management - everything related to peering and BGP. It lets you store a peer database, generate BGP filters, update the RIPE database and much more. When the number of peers runs into the tens and hundreds, it is indispensable.
Knowledge Base - a regular built-in wiki with a set of additional useful macros. For example, the rack macro lets you draw the layout of a row of racks. In the KB you can store instructions, certificates, contracts, rules and policies, useful recipes and so on.
Performance Management - active collection of performance parameters (including over SNMP). The module is quite interesting and will continue to be actively developed.
Inventory - a common database of physical equipment. It lets you work with objects at different levels, from the city and the communication node down to the rack and the switch's power cord. The module is under active development.
To sum up, NOC is first and foremost a highly specialized tool for managing complex networks. Viewed outside this context, it is easy to end up like the three blind men who felt an elephant: the first took it for a hose, the second for a tree, and the third for a rope.
NOC is open source, distributed under the BSD license, and has been running successfully for several years in a number of large Russian and foreign networks. The main programming language is Python. The databases are PostgreSQL and MongoDB in combination. The web interface is built on Django. We invite competent specialists to take part in the project; we have many interesting areas of work for bright minds.
Project website: http://redmine.nocproject.org
IRC: #nocproject.org at irc.freenode.net