How to monitor a business process without getting distracted by noise




    By my estimate, there are one or two hundred monitoring systems in the world (if you have a more accurate, well-argued figure, please share it in the comments). They include cloud and on-premise systems, commercial and free ones, systems for networks, for infrastructure, and so on across every front. Some of them support building service-resource models. These are tree-like structures whose nodes are mapped to the elements of a business system: web servers, databases, application servers, switches, and many other scary words. Each child affects its parent. For example, if the RAM usage threshold is exceeded on some server, an event is generated (say, with Critical severity, shown in red), and it affects the objects above it in the service tree.
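    Below is a minimal sketch of such a tree, assuming the simplest propagation rule, where a parent takes the worst severity of itself and its children; real monitoring products offer other rules. The `Severity` levels and all node names are hypothetical. It shows how one memory event on a single cluster node turns the whole service red.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    """Event severities, ordered so that max() returns the worst one."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Node:
    """An element of a service-resource model: a service, server, database, etc."""
    name: str
    own_severity: Severity = Severity.OK  # worst open event on this element itself
    children: list[Node] = field(default_factory=list)

    def status(self) -> Severity:
        # Naive rule: a node is as bad as the worst of itself and all its descendants.
        return max([self.own_severity] + [child.status() for child in self.children])


# Hypothetical model: a content platform on a two-node application cluster plus a database.
node_a = Node("app-node-a")
node_b = Node("app-node-b")
cluster = Node("app-cluster", children=[node_a, node_b])
database = Node("billing-db")
platform = Node("content-platform", children=[cluster, database])

# The RAM usage threshold is exceeded on one node, so a Critical event is raised there.
node_a.own_severity = Severity.CRITICAL

print(platform.status().name)  # CRITICAL, even though app-node-b may still serve traffic
```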


    Many organizations tie the availability of a business process to the availability of a business system. If I calculated the SLA that way, it would turn out that the availability of the IT system dropped to zero (because a memory threshold was exceeded on some server somewhere) and the business process stopped. But is that really so? At the bottom of the tree there may be a cluster, or filling memory to the brim may simply be normal operation for that system. In short, the task is this: how do you correctly calculate the availability of a business process without reacting to non-critical events from the components of business systems? Let's work it out.


    For business and IT to communicate, they need some way to assess the availability of services and business processes. There is a very simple method: when the business reports an incident, start counting downtime; when the fix is done, stop the counter. This is a correct way to calculate it. But I want more. Consider an example:


    Picture a mobile carrier and its content-selling platform. The user, an alpha male, comes across an SMS from the operator showing his balance. At the end of the message he sees an offer to use the "Dating" service for only 99.99 rubles per day. In a passing rush of testosterone, he dials a short number and subscribes. The money, of course, is debited from his account immediately. Half an hour passes, then an hour, and the dating offers still do not arrive. The impulse fades, the loss is not large, and the user simply gives up on it. From now on he is convinced that such services are just a way to lose money, and the operator loses revenue.


    The story is made up, but the situation itself is quite realistic. To reduce such losses and give IT the proverbial proactivity, it is desirable to see the availability and performance of business processes not only from the business side, but also from another angle.


    A user trying a service for the first time is not always motivated to tell the provider that it does not work.
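    The incident-driven counter described earlier (start counting when the business reports an incident, stop when the work is finished) can be sketched as follows. This is a minimal illustration, assuming incidents arrive as hypothetical (opened, resolved) timestamp pairs; overlapping reports of the same outage are merged so downtime is not counted twice. The last line shows the weakness discussed above: an outage nobody reports never reaches the calculation.

```python
from datetime import datetime, timedelta


def availability(period_start, period_end, incidents):
    """Share of the period not covered by (opened, resolved) incident intervals."""
    # Clip each incident to the reporting period and drop those outside it.
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in incidents
        if start < period_end and end > period_start
    )
    downtime = timedelta(0)
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # A separate outage: bank the previous interval and open a new one.
            if current_end is not None:
                downtime += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping reports of the same outage are counted only once.
            current_end = max(current_end, end)
    if current_end is not None:
        downtime += current_end - current_start
    return 1 - downtime / (period_end - period_start)


# Hypothetical day with two overlapping incident reports.
day_start = datetime(2024, 1, 1)
day_end = day_start + timedelta(days=1)
reported = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 1, 10, 30), datetime(2024, 1, 1, 12, 0)),
]
print(f"{availability(day_start, day_end, reported):.4f}")  # 0.9167: two hours down
print(availability(day_start, day_end, []))  # 1.0: an unreported outage is invisible
```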

    The first idea for monitoring a business process is to monitor the related business systems and assess how events on them affect the process. Indeed, in general each business process depends on several business systems. If the process is long, one set of systems may affect one part of it and another set may affect another part. The range of possible process states then widens to include the case where the input works and the output is dead. For example, a bank can process loan applications but cannot issue loans. How do you determine the state of the process in such a situation? Is it working or not?
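    One way to express "the input works, the output is dead" is a third state besides up and down. The sketch below derives it from system failures, which is exactly the system-centric first idea; the stage and system names are hypothetical, and treating any failed system as a failed stage is an assumption made for illustration.

```python
from enum import Enum


class ProcessState(Enum):
    UP = "up"
    PARTIAL = "partial"  # some stages work, others do not
    DOWN = "down"


# Hypothetical lending process: each stage depends on its own set of systems.
STAGE_SYSTEMS = {
    "accept_application": {"web_portal", "crm"},
    "score_application": {"crm", "scoring_engine"},
    "issue_loan": {"core_banking", "payment_gateway"},
}


def process_state(stage_systems, failed_systems):
    """Derive the process state from the set of currently failed systems."""
    working = [
        stage for stage, systems in stage_systems.items()
        if not systems & failed_systems  # a stage works if none of its systems failed
    ]
    if len(working) == len(stage_systems):
        return ProcessState.UP
    if not working:
        return ProcessState.DOWN
    return ProcessState.PARTIAL


# Applications are still accepted, but loans cannot be paid out.
print(process_state(STAGE_SYSTEMS, {"payment_gateway"}).name)  # PARTIAL
```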


    The second idea is more complicated. It took us a while to understand how to separate two entities from each other: the process and the system. We tried adding influence factors, tuning the weights of the links between a process and its systems, and a few other tricks. In the end we were convinced that assessing the state of a business process does not require taking into account the load on some processor somewhere; it needs a completely different set of metrics.


    Only metrics that characterize the success and availability of the stages of a business process give a real picture of how it is working. The result is two isolated systems, each with its own events and availability, but shown in a single interface, and that is the main insight. If we see that one of the stages of the business process is not working properly, that is a reason to look into the related IT systems. We consider the influence of a system on the process unreliable, but we left the duty shift and the process or system owner the option to view this link for diagnostics. The whole point of this "flies separately, cutlets separately" approach (a Russian saying about keeping distinct things apart) is that the business does not get nervous about infrastructure events. The dashboard lights up red only in genuinely critical situations, and the technical staff, when needed, know where to dig. Everybody wins: the wolves are fed and the sheep are unharmed.
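    As an illustration of stage-level metrics, here is a minimal sketch that judges each stage by the share of successful business transactions in a time window. The stage names, the counts (loosely modelled on the "Dating" story), and the 95% threshold are all hypothetical; the source does not say which metrics or thresholds it uses. Notice that charging looks healthy while delivery is clearly broken, which is precisely what the subscriber in the story experienced.

```python
from dataclasses import dataclass


@dataclass
class StageWindow:
    """Business transactions observed for one process stage in a time window."""
    stage: str
    attempted: int
    succeeded: int


def stage_status(window, min_success_ratio=0.95):
    """Red if the stage's success ratio falls below the threshold, green otherwise."""
    if window.attempted == 0:
        # No traffic: nothing to judge; whether silence is itself an alarm is a separate rule.
        return "grey"
    ratio = window.succeeded / window.attempted
    return "green" if ratio >= min_success_ratio else "red"


# Hypothetical hour of the "Dating" service.
windows = [
    StageWindow("subscribe_and_charge", attempted=1200, succeeded=1194),
    StageWindow("deliver_offers", attempted=1194, succeeded=37),
]
for w in windows:
    print(w.stage, stage_status(w))
```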


    Build two independent monitoring loops: one for business processes and one for business systems. However, those responsible for a business process should still be able to look at the systems associated with it.
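    A minimal sketch of the two loops sharing one view: the process status is computed only from stage results, while events from related systems are attached as diagnostic links that never change that status. The class, function, and system names are hypothetical, and the rule "any red stage makes the process red" is an assumption chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessProcess:
    """Process loop: the status comes from stage metrics only."""
    name: str
    stage_status: dict[str, str]  # stage -> "green"/"red", from business metrics
    related_systems: list[str] = field(default_factory=list)  # diagnostic links only

    def status(self):
        return "red" if "red" in self.stage_status.values() else "green"


def diagnostics(process, system_events):
    """System-loop events for the related systems, shown next to the process
    but never used to compute its status."""
    return {system: system_events.get(system, []) for system in process.related_systems}


# System loop: events from infrastructure monitoring (hypothetical data).
system_events = {
    "sms_gateway": ["CRITICAL: RAM usage above threshold"],
    "content_db": [],
}

dating = BusinessProcess(
    name="Dating subscription",
    stage_status={"subscribe_and_charge": "green", "deliver_offers": "green"},
    related_systems=["sms_gateway", "content_db"],
)

print(dating.status())  # green: the RAM event does not paint the process red
print(diagnostics(dating, system_events))  # but the duty shift can still see it
```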

    Now let me describe what is needed, in general, to implement this approach:


    ● determine the set of processes in the company (what exactly we want to monitor);
    ● determine how key IT metrics affect these processes (for example, the availability of a channel without which 50% of the business stops working and the retail director starts calling);
    ● decompose that impact down to individual systems, and their impact down to the infrastructure, and so on;
    ● implement the two-loop model described above: monitoring of the business process and key transaction events, plus diagnostic information from IT systems and infrastructure.
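    The first three steps of this list can be captured as a small declarative model, and the fourth as two queries over it. This is a sketch under stated assumptions: every process, metric, system, and host name is hypothetical, the 0.5 impact mirrors the channel example from the list, and combining several failures by taking the worst single impact is a deliberately simple rule, not the author's method.

```python
from dataclasses import dataclass, field


@dataclass
class KeyMetric:
    """Step 2: a key IT metric and the share of the process that stops without it."""
    name: str
    business_impact: float  # e.g. 0.5 means half of the business stops
    systems: list[str] = field(default_factory=list)  # step 3: systems behind the metric


@dataclass
class Process:
    """Step 1: a business process we want to monitor."""
    name: str
    key_metrics: list[KeyMetric]


# Step 3 continued: each system mapped to the infrastructure it runs on (hypothetical).
INFRASTRUCTURE = {
    "store_channel": ["wan_router_1", "wan_router_2"],
    "pos_backend": ["app_server_3", "db_cluster_1"],
}

retail_sales = Process(
    name="Retail sales",
    key_metrics=[
        KeyMetric("store_channel_availability", 0.5, ["store_channel"]),
        KeyMetric("pos_transaction_success", 1.0, ["pos_backend"]),
    ],
)


def affected_share(process, failed_metrics):
    """Process loop: worst single impact among the failed key metrics."""
    return max(
        (m.business_impact for m in process.key_metrics if m.name in failed_metrics),
        default=0.0,
    )


def diagnostic_path(process, metric_name):
    """IT loop: systems and infrastructure to inspect when a key metric fails."""
    for metric in process.key_metrics:
        if metric.name == metric_name:
            return {s: INFRASTRUCTURE.get(s, []) for s in metric.systems}
    raise KeyError(metric_name)


print(affected_share(retail_sales, {"store_channel_availability"}))  # 0.5
print(diagnostic_path(retail_sales, "store_channel_availability"))
```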


    If you manage to set up a similar scheme in your IT department, consider it the first step toward fine-tuning your relationship with the business. If not, keep in mind that business process analysis will take up most of the time needed to implement this approach. We will talk about our experience with that part next time.


    Author of the article: Anton Kasimov, monitoring systems architect at Technoserv.

