What is IT monitoring, or why admins started sleeping more

What is IT monitoring for?
To let administrators learn about infrastructure issues before users do. In essence, it is a rapid diagnostic toolkit: it notifies about problems in a timely manner and pinpoints exactly where and what went wrong.
An example: at 15:05 something goes wrong with mail. Thanks to the monitoring system, by 15:07 the administrator already sees that a specific Windows service failed to start on the server, Exchange did not come up as a result, and users will not receive their mail. Without monitoring, a manager would be calling around 17:00 to ask where the letter from the partner is, the one the partner had already re-sent three times over the past half hour.
How was it before?
Previously, information about the entire infrastructure (servers, network devices, etc.) was simply collected. The role of the "intelligent processor" fell to the administrator: like a pilot in a cockpit, he had to scan all the devices to piece the picture together. Understandably, not everyone could do that.
Now everything is more automated, and somewhat more complex from the system's point of view. Statuses are tied closely to business services, so that monitoring data does not hang "in a vacuum".
Monitoring from the end user's point of view was also added, where user actions are emulated: a robot runs a special script at regular intervals, clicking through the menus and pressing buttons as a user would. If the robot fails at some step, a real person will fail there too.
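The "robot user" idea can be sketched in a few lines of Python. This is a minimal illustration, not a real product: the scenario steps here are hypothetical stubs, while a real robot would drive an actual UI or API.

```python
def run_user_scenario(steps):
    """Run a scripted 'user' through a list of (name, action) steps.

    Returns (True, None) if every step succeeds, otherwise
    (False, failed_step_name): the same place a real user would get stuck.
    """
    for name, action in steps:
        try:
            action()
        except Exception:
            return (False, name)
    return (True, None)


# Hypothetical scenario: "open main menu, open mailbox, send a letter".
scenario = [
    ("open main menu", lambda: None),   # stub: always succeeds
    ("open mailbox",   lambda: None),   # stub: always succeeds
    ("send letter",    lambda: 1 / 0),  # stub: simulated failure
]

ok, failed_at = run_user_scenario(scenario)
print(ok, failed_at)  # False send letter -> the robot would raise an alert
```

A real scheduler would run such a scenario every few minutes and feed the failing step name into the event stream.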
Plus, a configuration database (CMDB) is now used: information about monitored objects is represented as a set of configuration items. Each server and each network device is a separate item, and all of them are stored in a centralized database. This representation then lets you integrate the monitoring system with a service desk or an asset management system and extend the functionality further.
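The idea of configuration items in a central store can be illustrated with a toy in-memory CMDB. The class names, IDs, and attributes below are invented for the sketch; a real CMDB is a product in its own right.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigurationItem:
    """One unit in the CMDB: a server, a network device, an application..."""
    ci_id: str
    ci_type: str                       # e.g. "server", "network_device"
    attributes: dict = field(default_factory=dict)


class CMDB:
    """A centralized store of configuration items (illustrative only)."""

    def __init__(self):
        self._items = {}

    def register(self, ci):
        self._items[ci.ci_id] = ci

    def find_by_type(self, ci_type):
        return [ci for ci in self._items.values() if ci.ci_type == ci_type]


cmdb = CMDB()
cmdb.register(ConfigurationItem("srv-01", "server", {"os": "Windows"}))
cmdb.register(ConfigurationItem("sw-17", "network_device", {"ports": 48}))
print([ci.ci_id for ci in cmdb.find_by_type("server")])  # ['srv-01']
```

A service desk or asset management integration would then reference the same `ci_id` values, which is what ties tickets and alerts to concrete pieces of infrastructure.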
Virtualization
Previously, the entire infrastructure was physical: every server was a separate piece of hardware sitting in a rack, something the admin could walk up to and touch. Now the infrastructure often consists of virtual machines: physically there is one server, but it runs, say, a dozen VMs. This calls for some subtlety in configuration, but it brings many advantages. For us, as developers of monitoring systems, it is a clear plus: everything can be placed in a virtual environment. A monitoring system is software made up of several modules, and each module used to require a separate server. With several physical boxes, a customer could object that the system demanded too much hardware. Now those servers can be made virtual and placed on a single physical machine.
An example of how this works
Here is one example from real life (names and faces changed). HP Operations is deployed at the customer. Users accustomed to exchanging files over FTP discover at some point that a file cannot be uploaded. The first user tried: the upload did not go through. He assumed the failure was temporary and sent the file by mail. Then a couple more people tried, also without success, and someone filed a ticket with support. Support started digging. On the surface everything looked fine: the server was up, and yet the service was unavailable. Hunting for such a problem "live" (given that you cannot stop the other services) is in principle a standard task, but a dreary one without a monitoring system. Here, the admin simply looked at the list of monitoring events and saw a flood of alerts from the firewalls, with a huge number of connection attempts recorded from outside. Very quickly (surprise!) a DDoS attack on this FTP server was identified and cut off. I think that without monitoring the search would have taken three to four hours longer, which could have led to further complications.
Automation
Moreover, monitoring systems can perform service actions automatically. A typical situation: the server runs out of space because of temporary files, and applications start to slow down. The admin logs in, cleans out the temp files, leaves, and everything is fine until the next time. Monitoring can catch the moment when, say, the disk reaches 90% full, generate an event, and kick off the cleanup automatically.
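This check-then-act loop can be sketched as follows. The 90% threshold, the scratch directory, and the "force the cleanup" call are all demo assumptions; a real agent would watch the actual filesystem on a schedule.

```python
import os
import shutil
import tempfile


def disk_usage_fraction(path):
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def clean_temp_files(tmp_dir):
    """Delete regular files in `tmp_dir`; return how many were removed."""
    deleted = 0
    for name in os.listdir(tmp_dir):
        full = os.path.join(tmp_dir, name)
        if os.path.isfile(full):
            os.remove(full)
            deleted += 1
    return deleted


def check_and_clean(path, tmp_dir, threshold=0.90):
    """The automated action: if the disk is too full, clean the temp files."""
    if disk_usage_fraction(path) >= threshold:
        return clean_temp_files(tmp_dir)
    return 0


# Demo on a scratch directory standing in for the server's temp folder;
# threshold=0.0 forces the cleanup branch so the demo always fires.
demo_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(demo_dir, f"tmp{i}.log"), "w") as f:
        f.write("old temp data")

removed = check_and_clean(demo_dir, demo_dir, threshold=0.0)
print(f"removed {removed} temp files")  # removed 3 temp files
```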
Since the monitoring system can integrate with a service desk, it can automatically create tickets for problems. In other words, a support ninja can quietly solve the problem before the first call even comes in.
How to implement it at home?
A monitoring system, like any other large-scale system, is a fairly complicated thing. Implementation is usually done in stages, regardless of whether the customer does it on their own or with the help of an integrator.
First, the monitored objects are identified (network equipment, servers, applications, etc.). Then the critical indicators for each object are selected. Take too much data and the admins will drown in a stream of threshold alerts; take too little and something important may slip by. After that, you need to settle on the architecture and choose a product, a solution, and a vendor. Then configuration can begin. Sometimes a pilot zone is set up first, and the mock-up is later expanded to the entire infrastructure.
Finished products
Monitoring systems target customers of different sizes. Large, complex, and expensive solutions require enormous effort to deploy and implement, but for a large business it pays off. For medium and small businesses there are smaller and simpler options, boxed products that are easy to roll out. The best-known low-cost solution is Microsoft SCOM. There are also a number of open-source options; they are generally free but demand fairly painstaking setup.
For what size company is the system useful?
The cutoff is where the system administrator can no longer cope with the volume of work and keep track of every server. In small companies such systems usually make little sense (or partial solutions can be adopted), while in medium and large companies a more or less serious monitoring system is practically a must. Such systems began to develop about 10 years ago, and by now nearly all major consumers of IT services have implemented something of the kind.
What else can monitoring do?
- Build reports, for example on resource usage. You can measure the load on the CPU, memory, disk, and so on. The administrator can see that one server is overloaded, so some tasks should be moved off it, while another is underloaded and some services can be moved there. This is capacity planning: the rational use of resources.
- Visualize problems. There can be some representation of the IT systems, for example a large screen with a map of the company's branches and status indicators for the systems in each of them, or a large application map. Industrial monitoring systems can build dashboards where you display the desired indicators, draw maps, and so on. This gives clarity: the administrator does not click through menus hunting for the right information, but looks at the big screen and sees everything at once. Such an engineering interface is very useful on high-load projects or in particularly critical business areas.
- Find "bottlenecks". The first time the system names the specific broken switch that you need to go and replace, you will realize how good it is to have problem-localization algorithms.
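The capacity-planning report from the first bullet can be sketched like this. The hostnames, CPU samples, and thresholds are invented for illustration; a real system would pull weeks of metrics from its database.

```python
from statistics import mean

# Hypothetical CPU-load samples (percent) collected by monitoring agents
samples = {
    "srv-app-01": [92, 95, 97, 90],   # candidate to move tasks away from
    "srv-app-02": [12, 8, 15, 10],    # candidate to take on more services
    "srv-db-01":  [55, 60, 58, 62],   # comfortable
}

HIGH, LOW = 85, 20  # illustrative report thresholds, percent


def capacity_report(samples, high=HIGH, low=LOW):
    """Classify each host by its average load: the raw material for
    capacity planning."""
    report = {}
    for host, values in samples.items():
        avg = mean(values)
        if avg >= high:
            verdict = "overloaded"
        elif avg <= low:
            verdict = "underloaded"
        else:
            verdict = "ok"
        report[host] = (round(avg, 1), verdict)
    return report


for host, (avg, verdict) in capacity_report(samples).items():
    print(f"{host}: avg CPU {avg}% -> {verdict}")
```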
Code monitoring
Code-level monitoring appeared relatively recently. It applies mainly to J2EE and .NET applications. Such modules can detect delays in system calls, memory leaks, slow SQL queries, and so on.
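The core trick behind such agents, timing individual calls and flagging slow ones, can be sketched with a decorator. The 100 ms threshold and the `run_query` stub are assumptions for the demo; real APM products instrument bytecode rather than asking you to decorate functions.

```python
import functools
import time

SLOW_CALL_MS = 100  # illustrative threshold for flagging a "slow" call


def monitored(fn):
    """Instrument a function the way a code-level (APM) agent would:
    time every call and record the ones that exceed the threshold."""
    slow_calls = []

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            if elapsed_ms > SLOW_CALL_MS:
                slow_calls.append((fn.__name__, round(elapsed_ms, 1)))

    wrapper.slow_calls = slow_calls  # the "agent's" findings
    return wrapper


@monitored
def run_query():
    time.sleep(0.15)  # stand-in for a slow SQL query


run_query()
print(run_query.slow_calls)  # one slow call recorded, ~150 ms
```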
Training
Initially, these systems required a lot of effort to set thresholds (is a disk 90% full an emergency, or 95%?). Naturally, with a large number of monitored objects this was laborious. Now monitoring systems can analyze historical data, study how objects behave, and build so-called "dynamic thresholds" on that basis. That is, the monitoring system "learns" what normal behavior looks like for an object and what signals an incident.
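A minimal sketch of one common baseline technique behind dynamic thresholds, mean plus a few standard deviations; real products use more elaborate models, and the response-time samples below are invented.

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Learn a threshold from historical data: mean + k standard
    deviations. Observations above it are treated as anomalous."""
    return mean(history) + k * stdev(history)


# A week of hypothetical response-time samples (ms) at the same hour of day
history = [120, 118, 125, 130, 122, 119, 127]
limit = dynamic_threshold(history)  # about 136 ms for this history

for observed in (128, 240):
    status = "alert" if observed > limit else "normal"
    print(f"{observed} ms -> {status}")
# 128 ms -> normal
# 240 ms -> alert
```

The point is that nobody had to hand-pick "136 ms" for this object; the system derived it from how the object normally behaves.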
What will this change for the IT department?
Administrators will be able to free themselves from routine work and concentrate on more important and interesting tasks. They will know exactly what is happening in the system at any moment, i.e. the infrastructure becomes transparent. Gone is the mode of constantly firefighting and patching up breakage; problems can be sidestepped before they hit. Routine fixes can be automated. Of course, unforeseen incidents will still have to be handled manually, but even that becomes easier, since there will be an accurate diagnosis.
All that remains is to read Habr and convince the accounting department that an admin who does not look busy is a stroke of incredible luck.