Monitoring system for mining

Published on June 29, 2018

    One large mining company came to us with an interesting challenge: it has many sites with IT systems, located both in cities and out in the mining fields. There are several dozen regional offices plus the mining operations themselves. A site 500 kilometers into the taiga with no road is nothing unusual! Each facility has equipment that had to be “folded” into a common infrastructure, so that we could determine what is running there and in what condition.

    Here we needed not just a technical inventory of all devices on the network (serial numbers, software versions, etc.), but a complete monitoring system. What for? To identify the root causes of accidents and warn about them promptly, build network maps, draw the links between equipment, monitor the state of hardware and communication channels, raise a warning as soon as new unaccounted equipment is connected or switched on, etc. In addition, integration with a CMDB (configuration item accounting) was required, so that all the hardware the monitoring system “found”, i.e., what is actually on the network, could be compared with what is registered at a specific branch.

    It was also necessary to “make friends” with the Asterisk telephony system so that, in a serious emergency such as a power cut at the site in Krasnoyarsk, it could automatically and promptly call the responsible people. There was also a task to restrict the visibility of monitored objects according to the permissions of user groups: operators watch the equipment, Moscow sees Moscow, and field engineers see only their own field.

    The customer chose between several monitoring systems: 1) a shareware product; 2) one of the commercial solutions; 3) the Infosim StableNet system. Testing made the disadvantages of the shareware product clear to the customer: it takes long and is difficult to set up, and it lacked some of the required functionality (for example, drawing links between devices on the network). Out of the box it cannot do this, and with plugins the result is so-so. The commercial product had no distributed monitoring agents, which are installed at a specific site and monitor only their own “bush”. So the customer settled on Infosim: it covered the whole wish list. Here is why.

    This is what the main Infosim StableNet administrator screen looks like (this is not the mining project, but a test infrastructure).

    The main screen, which displays the current network status:



    On the left is the control panel, where we can configure the system and display the statistics we need. For example, the Analyzer button displays statistics on any parameter we collect, such as the round-trip time over an hour for a specific device.



    The Inventory button displays the inventory data of the monitored objects, their neighbors, and the MAC table for each device in the system. Incredibly convenient: it makes it easy to find any equipment on the network by serial number, equipment type, operating system version, etc.



    When local staff somewhere deep in the taiga install, say, a new switch and tell no one about it, it shows up in the system immediately. The device lands in a special "New devices" branch of the device tree and, automatically, in the CMDB.



    Monitored objects are polled not only for serial numbers and models, but also for memory usage, interface load, etc. Many vendors are supported, in particular for servers, storage systems, telecom equipment and end-user machines. If something is missing, the customer writes to us or to the vendor directly and the new hardware is added. It's simple.



    The system integrates with MS Active Directory and RADIUS servers for common authorization and application of group policies. This is what the system architecture looks like:


    The central server is responsible for processing and displaying statistics collected from hardware.

    The second important component is the agent, which polls the equipment and checks hardware availability. There can be several agents (remote software); in our case the setup is geographically distributed, with an agent at each site. This is needed so that raw telemetry traffic is not driven to the head office: the customer has a large number of sites connected via expensive satellite channels, so only the measurement results are sent. The third component is a database that stores everything that is collected.

    If a remote site loses its connection to the center, on-site employees can connect directly to the agent and see the status of their own part of the network even without access to the central server.

    The agent can be an x64 / x86 server running Red Hat, CentOS, Ubuntu or Windows Server (for large sites) or a micro agent based on a small ARM computer like the Raspberry Pi (for small sites). We do not load the channel with pings to the hardware: the agent pings locally and sends on already aggregated statistics.
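
    To illustrate why sending only the measurement result saves channel capacity, here is a minimal sketch of the kind of aggregation an agent could do before reporting. It is not StableNet's actual format: the function name `summarize_rtt` and the field names are assumptions made for this example.

```python
from statistics import mean


def summarize_rtt(samples_ms):
    """Collapse raw ping results into one compact record.

    `samples_ms` holds round-trip times in milliseconds;
    None marks a probe that got no answer.
    """
    answered = [s for s in samples_ms if s is not None]
    lost = len(samples_ms) - len(answered)
    summary = {
        "probes": len(samples_ms),
        "loss_pct": round(100.0 * lost / len(samples_ms), 1) if samples_ms else 0.0,
    }
    if answered:
        # Only three numbers describe the RTT of the whole polling interval.
        summary.update(
            rtt_min=min(answered),
            rtt_avg=round(mean(answered), 1),
            rtt_max=max(answered),
        )
    return summary
```

    However many probes the agent sends during an interval, only this one small record has to cross the satellite link.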



    We can also capture delay, jitter and jitter variation for Cisco (IP SLA) and Huawei (NQA) equipment. And if the customer adds other hardware in the future, that will not be a problem: we will still be able to measure channel quality indicators, run synthetic tests, and load-test the communication channels between agents.



    The monitoring system can receive syslog messages and SNMP traps from hardware, filter them and generate alarm messages. It automatically builds the topology at the L2 and L3 levels, and based on this it automatically sets up the dependencies between alarms (root cause analysis). This is very cool because it tells administrators the root cause of an accident and so reduces the time needed to fix it. For example, if the middle switch in a chain of five goes down, we get a message that the third switch is down (the root cause) and that the fourth and fifth are unreachable because of it.
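
    The five-switch example can be sketched in a few lines. This is only an illustration of topology-based correlation, not the vendor's algorithm: it assumes the topology is a tree, and `uplink` and `find_root_causes` are names invented for the sketch.

```python
def find_root_causes(uplink, down):
    """Split unreachable devices into root causes and dependent outages.

    `uplink` maps each device to the neighbor through which the monitoring
    agent reaches it (None for the device attached to the agent).
    `down` is the set of devices that stopped answering.
    Assumes a loop-free (tree) topology.
    """
    roots, dependent = [], {}
    for device in sorted(down):
        # Walk towards the agent; the unreachable device closest to the
        # agent on that path is the root cause.
        cause, hop = device, uplink.get(device)
        while hop is not None:
            if hop in down:
                cause = hop
            hop = uplink.get(hop)
        if cause == device:
            roots.append(device)
        else:
            dependent[device] = cause
    return roots, dependent


# Chain of five switches; the third one fails and hides the last two.
chain = {"sw1": None, "sw2": "sw1", "sw3": "sw2", "sw4": "sw3", "sw5": "sw4"}
roots, dependent = find_root_causes(chain, {"sw3", "sw4", "sw5"})
```

    Here `roots` contains only `sw3`, while `sw4` and `sw5` are reported as consequences of it, so the operator gets one alarm to act on instead of three.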



    The solution works out of the box, but the process can be customized. For example, to make life easier for our technical support, we “added” UPS and power supply statuses: if the power goes out at a site, instead of 30 alarms we get one, about the power supply. Correlation works by topology, users and rules.

    There is group configuration of equipment: you can not only passively poll the hardware, but also roll out configs, such as switch settings. Need to set up a VLAN or NTP on 40 switches? Easy!



    It is also very cool that the system lets the customer back up hardware configurations on a schedule (collecting configs once a day) or on an event (for example, when a syslog message about a configuration change arrives, you can configure a task that fires at that moment and collects the changed config). The same works for traps and for alarm events. This helps a lot in post-incident reviews and in finding who is responsible for a configuration change. Plus, in effect, an up-to-date database of all device configurations in the network is created.
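
    Once configuration snapshots are stored, finding what changed comes down to comparing two versions. A minimal sketch with Python's standard `difflib`, assuming the snapshots are available as plain text; the function name and the `previous`/`current` labels are made up for the example and say nothing about how StableNet stores configs.

```python
import difflib


def config_diff(old_config, new_config, device):
    """Return a unified diff between two saved configuration snapshots."""
    diff = difflib.unified_diff(
        old_config.splitlines(),
        new_config.splitlines(),
        fromfile=f"{device}/previous",
        tofile=f"{device}/current",
        lineterm="",
    )
    # An empty list means the configuration did not change.
    return list(diff)
```

    Lines starting with `-` were removed and lines starting with `+` were added, which is usually enough to start the conversation about who changed what.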

    There is an API for integration. In our project, monitoring was integrated with the CMDB in 1C:ITIL Enterprise Information Technology Management, which stores all information about equipment (tangible assets). The polling results are compared with what is listed in the assets, and when unaccounted equipment is detected, the system says: "Here is an unidentified switch". People find out what it is and fill in all the necessary fields: installation location, name, etc. The serial number, name, part number and firmware version are obtained from the hardware. Next, a task is sent to the monitoring system: the device is renamed in the system and moved to the correct position in the location tree, monitoring settings are applied depending on the device type (for example, edge equipment must be polled more often than the rest), the host name on the device itself is changed, and so on.
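
    The comparison step itself is simple set arithmetic. The sketch below assumes both sides can be exported as lists of records with a `serial` field; the record layout and the function name `reconcile` are hypothetical and do not reflect the real StableNet or 1C APIs.

```python
def reconcile(discovered, registered):
    """Compare devices found on the network with CMDB records by serial number."""
    found = {d["serial"]: d for d in discovered}
    known = {r["serial"]: r for r in registered}
    return {
        # On the network, but nobody registered it: "an unidentified switch".
        "unaccounted": sorted(found.keys() - known.keys()),
        # Registered in the CMDB, but not seen by the monitoring system.
        "missing": sorted(known.keys() - found.keys()),
        "matched": sorted(found.keys() & known.keys()),
    }
```

    Everything in `unaccounted` is a candidate for the manual step described above, where someone fills in the location and name.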

    Process on the ground


    First of all, we set up integration with AD. This made life easier for us during implementation and in subsequent operation: there is no need to create and delete user accounts every time. The system automatically receives all active accounts from AD. If someone quits, the system deactivates the account by itself and no one can log in with it any more.

    For admins and middle management, a pressing need was a large set of reports. During the launch we set up reports on channel utilization and availability, hardware availability at sites, top emergency situations, specific types of accidents, OS versions, equipment configuration changes, etc.





    Reports can be viewed in HTML format or received by e-mail in PDF and XLSX format at the required frequency (once a day, week, month, etc.). For each report, the frequency and the individual recipients were configured.

    The system is also flexible in how it notifies and which custom actions it performs in an emergency: it can send e-mail and text messages (through an external SMS gateway), and you can write your own scripts for it to launch. For example, in our cloud monitoring service we made a Telegram bot that notifies the responsible employees in our service department about emergencies. The bot can also be queried for various parameters: “CPU, 10.1.1.100” returns “95%”. Given that there is a mobile application, this may seem a bit redundant, but it is convenient.

    Next, we wrote a script for integration with the telephone exchange. Now, in a truly critical situation (power outages at critical sites or data centers), the system calls the responsible people on their mobile phones and says, in a Siri-like voice: “The voltage at such-and-such a site is below the critical level”. It is done quite simply: the alarm is duplicated into a certain folder on the telephone exchange, where the telephony service processes it. You only need to specify in advance the numbers the machine should call. In effect, we automated the process of notifying the responsible administrators or management about an accident. In other words, we replaced the person who would otherwise have to call and report the accident.
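
    The article does not say exactly what is written to that folder. One common way to trigger a call in Asterisk by dropping a file is a call file placed in its outgoing spool directory (by default `/var/spool/asterisk/outgoing`), and the sketch below assumes that mechanism. The trunk name `SIP/provider`, the sound file name and the retry values are placeholders, not the project's settings.

```python
import os
import tempfile


def build_call_file(number, sound, trunk="SIP/provider"):
    """Build the text of an Asterisk call file that dials and plays a recording."""
    lines = [
        f"Channel: {trunk}/{number}",  # whom to call and through which trunk
        "MaxRetries: 2",               # extra attempts if nobody answers
        "RetryTime: 60",               # seconds between attempts
        "WaitTime: 30",                # seconds to wait for an answer
        "Application: Playback",       # what to do once the call is answered
        f"Data: {sound}",              # recording with the alarm text
    ]
    return "\n".join(lines) + "\n"


def queue_call(spool_dir, staging_dir, number, sound):
    """Write the call file in a staging directory, then move it into the spool.

    The move keeps Asterisk from picking up a half-written file;
    both directories must be on the same filesystem.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".call", dir=staging_dir)
    with os.fdopen(fd, "w") as f:
        f.write(build_call_file(number, sound))
    final_path = os.path.join(spool_dir, os.path.basename(tmp_path))
    os.replace(tmp_path, final_path)
    return final_path
```

    A notification script launched by the monitoring system would call `queue_call` once per responsible person on the list.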

    The search for users and hardware is very convenient. A user calls and says: "My network does not work." By the user's IP address you can immediately see where the machine is connected (which switch, which port, and which MAC address) and where it was connected before:



    You can build different types of graphical topologies that make life easier for engineers. Say we need to see where a particular switch sits. It's simple: find it in the right branch (or use the search) and open its neighbors. Several neighborhood levels are supported (the first is immediate neighbors, the second is neighbors of neighbors, etc.). You immediately see where the switch is located in the topology, which ports it is connected through and to what, and which MAC addresses are on its ports. Or you can view a map of the OSPF, BGP, EIGRP, STP, PIM or MPLS protocols: the system processes and draws all of this by itself.
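
    Neighborhood levels map directly onto a breadth-first search over the link list. A minimal sketch, assuming the links are available as pairs of device names; `neighbors_by_level` is a name made up for this example.

```python
from collections import deque


def neighbors_by_level(links, start, max_level):
    """Group devices by hop distance from `start`, up to `max_level` hops."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    level_of = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if level_of[node] == max_level:
            continue  # do not expand beyond the requested depth
        for peer in adjacency.get(node, ()):
            if peer not in level_of:
                level_of[peer] = level_of[node] + 1
                queue.append(peer)

    levels = {}
    for node, level in level_of.items():
        if level > 0:
            levels.setdefault(level, []).append(node)
    return {level: sorted(nodes) for level, nodes in levels.items()}
```

    Level 1 holds the immediate neighbors and level 2 the neighbors of neighbors, which is exactly what the map shows as the view is expanded.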



    Or you can see at a glance how the network “feels” at one of the sites. For convenience, we separated the WAN and LAN parts of the sites and draw them as separate maps. All indicators and links are interactive: when you hover over them you can see the current status and drill down into any particular device. Separately, I would like to note that the Microsoft Visio diagram the engineer drew himself is used as the background for such a map. He has seen this diagram many times as a static picture on paper or on screen. Now it comes to life and gives feedback in real time. Very convenient.



    In accordance with the customer's requirement, user access rights were separated. There are many roles, and they are flexible. Given the difference in time zones between the sites, the working hours feature in the roles was very useful: at what time, for which alarms, and to whom an SMS is sent, and so on.
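
    The working hours check has to be done in the site's local time, not the server's. A minimal sketch, assuming fixed UTC offsets (Moscow is UTC+3, Krasnoyarsk is UTC+7); the function `in_working_hours` and its parameters are illustrative and are not the product's role settings.

```python
from datetime import datetime, time, timedelta, timezone

MOSCOW = timezone(timedelta(hours=3))
KRASNOYARSK = timezone(timedelta(hours=7))


def in_working_hours(event_utc, site_tz, start, end):
    """True if the event falls inside the working hours in the site's local time."""
    local = event_utc.astimezone(site_tz).time()
    if start <= end:
        return start <= local < end
    # The window crosses midnight, e.g. a night shift from 22:00 to 06:00.
    return local >= start or local < end


# 03:00 UTC is 06:00 in Moscow but already 10:00 in Krasnoyarsk.
event = datetime(2018, 6, 29, 3, 0, tzinfo=timezone.utc)
notify_moscow = in_working_hours(event, MOSCOW, time(9), time(18))
notify_krasnoyarsk = in_working_hours(event, KRASNOYARSK, time(9), time(18))
```

    The same alarm therefore produces an SMS for the Krasnoyarsk day shift, while the Moscow day shift is not yet on duty.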

    Infosim StableNet collects incident statistics. In our experience, planned work causes problems here: it spoils the reports and triggers unnecessary alarms. In this system you can mark in advance where and when work is planned: the alarms then go into silent mode, and the report marks this downtime in a different color as planned. And no, planned work cannot be declared after the fact.
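
    The rule, including the ban on declaring work after the fact, can be written down as a small check. This is a sketch of the logic only; the window fields `announced`, `start` and `end` are assumptions made for the example.

```python
from datetime import datetime


def classify_alarm(alarm_time, windows):
    """Return 'planned' if the alarm falls into a window announced in advance."""
    for w in windows:
        # A window registered after it began does not count as planned work.
        announced_in_advance = w["announced"] <= w["start"]
        if announced_in_advance and w["start"] <= alarm_time < w["end"]:
            return "planned"
    return "unplanned"
```

    Alarms classified as planned can be silenced and colored differently in the report, while everything else goes through the normal notification chain.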



    If the out-of-the-box capabilities are not enough, you can create custom templates. For example, the project had Motorola access points, and there were no ready-made templates for them. Using the built-in “wizard”, we created templates and started monitoring the parameters the customer wanted to see (signal level, signal-to-noise ratio).

    There was another case when the system “did not understand” one Russian manufacturer and showed the manufacturer’s code instead of the name. For such cases the system has a feature that lets you add new vendors and hardware models in a matter of seconds.

    Here is what the monitoring system currently allows the customer to do:

    1. Monitor availability using ICMP ping.
    2. Collect info using SNMP.
    3. Scan subnets for new hardware.
    4. Send reports on a schedule.
    5. Back up configurations.
    6. Analyze accessibility.
    7. Raise an alarm when equipment is unavailable or indicators go outside the normal range.
    8. Use SNMP traps, syslog data and any other input as triggers for scripts.
    9. Integrate with AD.
    10. Automatically determine the connectivity of devices (CDP, LLDP, L3 neighborhood) and based on this automatically draw a network map.
    11. Create "weather maps" to visualize the state of the network with the ability to use graphic backgrounds.
    12. Create working screens (dashboards) to display real-time information about the status of the network and devices.
    13. Make an inventory of equipment (type of equipment, manufacturer, model, software version, EoS / EoL dates, etc.).
    14. Integrate deeply with CMDB 1C and other external systems through the REST API.
    15. Perform group configuration of equipment from the monitoring system.
    16. Check device configurations for compliance with company policies.

    Links


    - Tales of the first support line.
    - Channels of communication for mineral deposits.
    - My mail: DDrozhzhin@croc.ru