Don Jones. “Creating a unified IT monitoring system in your environment.” Chapter 5. Turning problems into solutions
- Translation
In this chapter, the author shares his vision of how to store and keep current the knowledge accumulated from stepping on the same rake again and again. The main difficulty in storing and maintaining a body of knowledge is finding people who combine seemingly incompatible qualities: thoroughness, creativity, diligence, a sharp analytical mind, intuition
Contents
Chapter 1. Managing your IT environment: four things that you do wrong
Chapter 2. Eliminating management practices for individual sites in IT management
Chapter 3. Connecting everything into a single IT management cycle
Chapter 4. Monitoring: looking beyond the data center
Chapter 5. Turning problems into solutions
Chapter 6. Unified management with examples
Chapter 5: Turning Problems into Solutions
The satirical magazine The Onion recently ran a story about a special kind of scientist, called a historian, who was promoting the novel idea of looking at the past. “Sometimes,” said one pseudo-historian, “we can look at how people tried to solve problems like the ones we have today. We can study and understand how their solutions worked then, and this can give us an idea of whether the same solution will work for us.” Ha!
Although this was aimed more at politicians who keep making the same mistakes over and over, The Onion’s taunt applies just as well to IT. “Look, if this problem happened three months ago and we solved it then, perhaps we can solve it much faster if it suddenly appears again. What, by the way, did we do last time? Maybe if we do the same thing, we will get the same result.”
To put it another way: perhaps you have children, or at least know someone who does. Have you ever told a child not to touch a hot pot on the stove? Of course. Did they touch it? Of course. How many times? Usually only once. That is because human learning is based primarily on the mistakes we make. We remember the mistake, and knowing how we avoided it or dealt with its consequences gives us confidence that we can do so quickly in the future. Memory becomes the key factor. As we get older, we stop reaching for hot pots and start playing with computers at work, and there it gets harder and harder to remember. This chapter is devoted to the last aspect of unified management: analyzing solved problems and turning them into solutions for future use.
Closing the loop: connecting the service desk with monitoring
Before we dive into memory as an aspect of problem solving, we must first close the operational loop in our unified monitoring toolset. Earlier in this book, we discussed that one aspect of an integrated monitoring system is the ability to monitor the status of devices and services, such as a DBMS server. When a problem state is detected, the monitoring system raises an alert, usually displayed on a console, and may also notify someone via email or SMS. A truly unified system can also create a ticket for the problem in the ticket tracking system. The ticket lets management see the status of the problem and how long it has existed; it also lets the problem pass between different employees working on a joint solution. The ticket can be filled in automatically with information related to the problem, helping the employee solve it faster. Figure 5.1 shows this first step: an alert is displayed on the console and a ticket is generated from it.
Figure 5.1: Receiving an alarm message and opening a ticket.
Eventually, we hope, the problem is resolved. At that point, the specialist who worked on the ticket usually closes it and adds a note saying so.
But what about our alert?
Of course, the part of the monitoring system that does real-time tracking will notice that the problem no longer exists, but that does not mean the alert will disappear.
Usually you want to keep an alert in place until the problem is completely fixed, which means that when you close the ticket, you also need to clear the alert somehow.
This double handling is very common in organizations that lack a unified monitoring system: you close the ticket in one system, then go into the monitoring system and mark the alert as handled. In a fully unified system, closing the ticket usually also resets the original alert. Figure 5.2 shows how this loop closes within a single system.
Figure 5.2: Closing a ticket resets the original alert.
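The loop from Figures 5.1 and 5.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `UnifiedConsole`, `Alert`, `Ticket`, and the sample server name are hypothetical and do not correspond to any particular product.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Alert:
    alert_id: int
    source: str          # the monitored device or service
    message: str
    active: bool = True  # stays True until the linked ticket is closed


@dataclass
class Ticket:
    ticket_id: int
    alert_id: int        # link back to the alert that opened the ticket
    summary: str
    resolution: Optional[str] = None
    is_open: bool = True


class UnifiedConsole:
    """Hypothetical in-memory stand-in for a unified monitoring/ticketing system."""

    def __init__(self) -> None:
        self.alerts: Dict[int, Alert] = {}
        self.tickets: Dict[int, Ticket] = {}

    def raise_alert(self, source: str, message: str) -> Ticket:
        # Figure 5.1: the alert appears and a ticket is generated from it,
        # pre-filled with what the monitoring system already knows.
        alert = Alert(len(self.alerts) + 1, source, message)
        self.alerts[alert.alert_id] = alert
        ticket = Ticket(len(self.tickets) + 1, alert.alert_id, f"[{source}] {message}")
        self.tickets[ticket.ticket_id] = ticket
        return ticket

    def close_ticket(self, ticket_id: int, resolution: str) -> None:
        # Figure 5.2: closing the ticket also resets the original alert.
        ticket = self.tickets[ticket_id]
        ticket.resolution = resolution
        ticket.is_open = False
        self.alerts[ticket.alert_id].active = False


console = UnifiedConsole()
ticket = console.raise_alert("db-server-01", "Query latency above threshold")
console.close_ticket(ticket.ticket_id, "Rebuilt fragmented index")
```

The design choice worth noting is the `alert_id` stored on the ticket: without that link, someone has to clear the alert by hand in a second system.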
There is a good reason to keep alerts and tickets separate.
A ticket is for internal use. It contains technical information meant for solving the problem and notes on the progress of the work. An alert, however, is suitable for a wider audience. It can feed many of the indicators used across the company, for example to show users that a system is currently malfunctioning. You should not reset the alert merely because the monitoring system no longer sees bad readings: temporary relief does not mean the problem has been resolved. You may want the alert to remain in place as a high-level indicator (“we know that not everything is okay there right now”), but at some point you still need to reset it and return the system's outward status to “working fine”. If this happens automatically as part of closing the ticket, it becomes a convenient way to notify two different audiences.
Preserving knowledge means faster resolution of problems in the future
Once a problem is solved, the information about it should not disappear. At least, you hope it does not. As I said at the beginning of this chapter, each solved problem is a potential shortcut to resolving future problems, whether you meet exactly the same problem again or just a similar one. In other words, you need to save information about the problem and how it was solved for future use.
Knowledge bases
Probably the oldest way to save information is a knowledge base (KB).
In the early days, these were standalone databases of articles describing where to look for a way out of a given situation. When you had a problem, you first searched the knowledge base to see whether it held any hints toward a solution.
One of the earliest knowledge bases to become widespread was Microsoft's KB, which shipped on CD in the early 90s. Today it is a large collection of online articles, so large that the knowledge base contains a separate article on how to query it correctly (shown in Figure 5.3, in case you do not believe me).
Figure 5.3: article from the Microsoft Knowledge Base.
This points to one problem with knowledge bases: people need to learn how to work with them and must keep remembering to do so. Unfortunately, IT professionals are not necessarily the kind of audience that reaches for the manual (or the knowledge base) the moment an incident looms on the horizon.
More likely, they will dive into investigating what happened and try to solve the problem with their own skills. Searching the knowledge base usually happens only after their own knowledge has been exhausted. This is partly a matter of professional pride, partly because most knowledge bases are awkward to use, and partly because knowledge bases become outdated very quickly.
This points to another serious problem: the need to keep knowledge bases up to date. Your articles are useful only as long as you carefully tag them (which product versions the solution applies to, and so on); otherwise they become a source of misinformation. Consider a business application at version 1.5 that has a specific problem. You document it in a KB article and rely on that article whenever the problem reappears. Eventually your developers fix the bug in version 1.6. Did anyone bother to go back and correct the KB article? No. Even if the article states that it applies to version 1.5, it tells you nothing more. Was the problem fixed in 1.6? Will the repair procedure work in 1.5? If you are running 1.6 and the problem arises again, should you follow the procedure written for 1.5, or report it as a new problem, since the developers believe everything was fixed and works as it should?
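To make the version-tagging point concrete, here is a minimal sketch of an article record that carries both the version it was written for and the version in which the bug was fixed. The field names (`applies_to`, `fixed_in`) and the policy of treating a recurrence after the fix as a new problem are assumptions made for illustration; the text leaves that question open.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn '1.5' into (1, 5) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


@dataclass
class KbArticle:
    title: str
    applies_to: str                 # version the article was written for
    fixed_in: Optional[str] = None  # filled in once developers ship a fix


def applicability(article: KbArticle, running: str) -> str:
    """Classify how far an article can be trusted for the running version."""
    run = parse_version(running)
    if article.fixed_in is not None and run >= parse_version(article.fixed_in):
        # Assumed policy: the bug is supposed to be gone, so a recurrence is new.
        return "supposedly fixed: report a recurrence as a new problem"
    if run == parse_version(article.applies_to):
        return "applies: follow the documented procedure"
    return "unverified: article was written for another version"


article = KbArticle("Export job hangs", applies_to="1.5", fixed_in="1.6")
status_on_15 = applicability(article, "1.5")
status_on_16 = applicability(article, "1.6")
```

Without the `fixed_in` field, the same article on version 1.6 can only be classified as "unverified", which is exactly the ambiguity described above.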
And all of this assumes you have overcome the main problem of knowledge bases: getting articles written and posted in the first place. Vendors like Microsoft spend millions of dollars a year on the salaries of people who do nothing but write documentation and knowledge base articles. Are you willing to make that kind of investment? I have seen many companies create knowledge bases that were used enthusiastically for a few months; then work on them began to stall and, in the end, their use dwindled to nothing.
Tickets as Knowledge Base Articles
The first answer to many of the problems inherent in knowledge bases is to stop using a separate database and use closed tickets as the knowledge store instead. Practically all modern ticket tracking systems have this capability. The approach solves the biggest KB problem, initial population with content, since tickets already are content. A good ticketing system also helps answer the question of what applies to what, because tickets are usually filed against specific products or services. If you are reading an old ticket, you at least know which product or version it is associated with, although you may not know whether the information in it still applies to a particular product or device.
Using help desk tickets as a KB does not solve the problem of getting people to search it for answers. In fact, the sheer mass of help desk tickets can make that problem worse. Every time a problem arises, a new ticket is created, so when you search the knowledge base by keyword, or simply by product or device, you get far more results, because every ticket that matches your criteria shows up.
Help desk tickets also do not always make good self-help documentation. Not all IT employees are the best writers in the world, and tickets tend to be written in, let's call it, “informal” language that you most likely will not want to surface and show to your end users. For example, a user who comes to your knowledge base to solve a problem on their own instead of calling the help desk may not appreciate finding something like “Restart the dumb user's computer.” Technicians may also leave out the details. For example, a ticket's resolution is often recorded as just “fixed”, which is of no use for solving anything. Nonetheless,
Knowledge Base Unification
There are two things you can do to turn help desk tickets into useful KB articles. First, some automation is required. When a new ticket is created, the system should automatically look through earlier tickets and present them as candidate solutions to the technician who is about to work on the problem.
A great example of how this can work when you ask a question is StackOverflow.com, which is itself a combination of ticket system and knowledge base. It automatically searches earlier questions and presents them prominently: they are inserted below your question but above the field where you enter the details, as shown in Figure 5.4. This forces you to look at the suggestions from the knowledge base, so you can quickly see whether your question already has an answer.
Figure 5.4: Suggested answers to the question.
As you begin to type the details of your question, irrelevant suggestions drop out of the list, again helping you use the database of past answers without requiring you to run an explicit search as an extra step.
A unified system helps with this extra step too (Figure 5.5), either by including potentially related tickets in search results or by linking them to the newly created ticket. The technician thus starts with an advantage: an understanding of similar situations in the past and of what their solutions looked like.
Figure 5.5: Using old tickets to solve new problems.
In fact, for automatically generated tickets, the system can potentially do a very good job of finding old tickets relevant to the problem. Because the system never forgets to take the extra step, it can include additional search criteria, such as the source of the problem, the affected devices or services, and so on. A technician may not bother to include all of those details, and so gets a very large number of results, which is often what discourages people from searching in the first place. By starting from a narrow result set, a system that links tickets automatically makes it more likely that relevant information is at hand. Such a system can be made even better if the help desk software lets you set a couple of check boxes on each ticket. When closing a ticket, the technician should be able to answer two questions:
- Does this ticket contain a genuine solution? Sometimes a technician solves a problem by looking at the contents of an old ticket, in which case the current ticket does not contain enough information about how the problem was resolved. But if the technician solved the current problem and filled in the ticket with a detailed description of the solution, more complete than what was recorded before, the current ticket can be marked as a “solution”, which moves it to the top of search results.
- Does this ticket contain a solution that is suitable for end users and can serve as self-service material in the future? Most ticket tracking systems today have “private” and “public” fields, which helps ensure that end users do not see notes they might take the wrong way (administrators do sometimes write such things in tickets). With an unambiguous indication that a ticket is suitable for use outside the IT department and contains a solution end users can apply on their own, you can build a self-service knowledge base that really works.
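The automatic lookup and the two flags can be combined in one small ranking function. The sketch below is an assumption-laden illustration: the `ClosedTicket` fields, the scoring weights, and the rule that higher ticket numbers are more recent are all invented for the example rather than taken from any ticketing product.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ClosedTicket:
    ticket_id: int
    product: str
    source: str                # what raised the problem, e.g. an alert type
    devices: FrozenSet[str]    # devices or services that were affected
    is_solution: bool = False  # flag 1: contains a detailed solution
    is_public: bool = False    # flag 2: safe to show to end users


def related_tickets(product: str, source: str, devices: FrozenSet[str],
                    history: List[ClosedTicket],
                    public_only: bool = False) -> List[ClosedTicket]:
    """Return old tickets that share attributes with a new one, best first."""
    scored = []
    for old in history:
        if public_only and not old.is_public:
            continue  # the self-service view hides internal tickets
        score = 0
        score += 2 if old.product == product else 0
        score += 2 if old.source == source else 0
        score += len(old.devices & devices)  # one point per shared device
        if score > 0:
            scored.append((old.is_solution, score, old.ticket_id, old))
    # Flagged solutions first, then the closest match, then the newest ticket.
    scored.sort(key=lambda item: item[:3], reverse=True)
    return [item[3] for item in scored]


history = [
    ClosedTicket(1, "crm", "disk-full", frozenset({"srv1"}), True, True),
    ClosedTicket(2, "crm", "disk-full", frozenset({"srv1", "srv2"})),
    ClosedTicket(3, "mail", "cert-expired", frozenset({"mx1"})),
]
for_technician = related_tickets("crm", "disk-full", frozenset({"srv1", "srv2"}), history)
for_end_user = related_tickets("crm", "disk-full", frozenset({"srv1", "srv2"}), history,
                               public_only=True)
```

Here ticket 1 ranks above ticket 2 even though ticket 2 matches more devices, because the "solution" flag takes priority; the end-user view contains only ticket 1.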
Figure 5.6 shows how a system can implement this. In this case it is not a check box: the system uses a “visibility” drop-down menu to change the ticket status from “on hold” to “published”.
Figure 5.6: Managing ticket visibility
The mere presence of these check boxes (or other indicators) can remind technicians that documented solutions are a must. From a management point of view, organizations can set quotas: for example, at least 75% of closed tickets must contain a detailed solution or refer to a ticket that describes the detailed resolution procedure. Such metrics are tracked through internal reports in the ticket management system and are an additional way to verify that closed tickets really do serve as the basis for preserving knowledge.
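A quota like this is straightforward to compute from closed-ticket data. The following is a rough sketch; the record fields are hypothetical, and the 75% figure is simply the example used in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClosedTicketRecord:
    ticket_id: int
    has_detailed_solution: bool
    refers_to_ticket: Optional[int] = None  # ticket that holds the procedure


def documented_share(tickets: List[ClosedTicketRecord]) -> float:
    """Fraction of closed tickets that contain or reference a solution."""
    if not tickets:
        return 1.0  # nothing was closed, so nothing missed the quota
    documented = sum(
        1 for t in tickets
        if t.has_detailed_solution or t.refers_to_ticket is not None
    )
    return documented / len(tickets)


def meets_quota(tickets: List[ClosedTicketRecord], quota: float = 0.75) -> bool:
    return documented_share(tickets) >= quota


closed_this_month = [
    ClosedTicketRecord(101, has_detailed_solution=True),
    ClosedTicketRecord(102, has_detailed_solution=True),
    ClosedTicketRecord(103, has_detailed_solution=False, refers_to_ticket=101),
    ClosedTicketRecord(104, has_detailed_solution=False),  # closed as just "fixed"
]
share = documented_share(closed_this_month)
```

In this sample, three of the four tickets count as documented, so the share is exactly at the 75% quota.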
Turning a ticket into an asset
The general idea is to stop thinking of tickets as suitable only for tracking problems and to build a complete problem-solving cycle around them. To deliver that benefit, tickets-as-solutions must overcome several common human habits and implementation problems that have held them back in the past:
- Technicians do not always search the ticket database, so as far as possible the search should happen automatically, with tickets offered as potential solutions.
- Technicians' search skills are not always ideal, so a unified system should automate this activity to some extent and use the information it already has to make the first attempt at finding relevant tickets.
- Writing clearly is not always a well-developed skill among technicians, so the system should emphasize the need for complete solutions wherever possible, and management should track this as a metric. Technicians, in turn, should be able to provide both an “internal” version of a solution and one suitable for “external use” when the need arises.
With the right system, especially one that integrates with the monitoring system to create a truly unified environment, turning a problem into a solution can be done at the click of a mouse.
Past performance is an indicator of future results.
Another use of historical data is to form correct service level expectations. I deliberately avoid the term “service level agreement”, because an SLA is a formal document that often incorporates elements of organizational policy. A service level expectation, by contrast, is the level of service that past performance shows you can realistically expect in the future. Ideally, an SLA should be based on these real-world expectations, but only if you are able to provide them.
There is one problem common to the SLAs of many organizations: they are divorced from reality. Some set an ambitious goal in order to “look good”, promising 99.999% availability and then cheerfully explaining that they are simply “trying to reach that figure”. Others are too cautious when setting terms, forcing the organization to accept a lower level of service than it could actually get.
When this goes wrong, we should not forget the tools we use. It all goes back to the first chapter of this book, where I wrote about the separate silos, or “towers”, that IT is so inclined to work in, and about the specialized tools we use in each of them to find and solve problems. We end up using the same specialized tools to measure performance levels. Because each toolset uses its own “conceptual language” and its own set of metrics, it is quite difficult to bring everything together into a single picture with a single set of reference values. Naturally, that also makes it hard to understand what our service levels really are.
The bottom line: you have an existing environment. Setting all political and internal issues aside, your existing infrastructure can deliver a certain technically measurable level of performance and uptime. You just need to understand and define it, writing it down as a clear, easily explained set of metrics based on the current capacity of your infrastructure. This is difficult if you have a mishmash of specialized tools, and even more difficult once outsourced elements appear in your infrastructure. Start adding cloud computing platforms, co-located servers, SaaS platforms, and so on, and you will find that your specialized tools cannot give you enough information to do it.
This brings us back to the previous chapters in this book. For example, you have an excellent set of services and applications - who does not have it now? Figure 5.7 shows the infrastructure that offers many different elements, some inside the data center, some outside.
Figure 5.7: A modern environment includes many components.
Start measuring in the single most important place: the end user. Deploy sensors, agents, synthetic transactions, and whatever else you need to understand what users actually see, in terms of performance, at any given moment. Monitor for several days that reflect a real workload; do not pick a weekend, when load is significantly lower and unrepresentative. Now you know what your infrastructure can really deliver. Take it as a given: you should not expect anything much better, but you should not expect anything much worse either. If this service level expectation is not as good as your SLA, that is all right: you can start looking for areas to improve in order to bring it up to the levels written into the SLA.
You may also need to collect performance information for each individual component, and this is where difficulties can arise. It is important that at this level of monitoring you gather everything in one console, use one language to describe what is happening, and use a single set of metrics. You need to find the range of performance values for each component under a normal working-day load.
Once you know that each component is operating within its observed range, you can relate those measurements to the end-user experience. These ranges become the basis for your monitoring thresholds: you should be notified promptly about anything that falls outside them.
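One way to turn several days of measurements into a "normal range" is to take percentiles of the observed values and flag anything outside them. The sketch below uses Python's standard `statistics.quantiles`; the choice of the 5th and 95th percentiles is an assumption for illustration, not something the text prescribes.

```python
import statistics
from typing import List, Tuple


def baseline_range(samples: List[float],
                   low_pct: int = 5, high_pct: int = 95) -> Tuple[float, float]:
    """Observed normal range of a metric, as two percentiles of the samples."""
    # quantiles(n=100) returns 99 cut points; cut point i-1 is the i-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]


def outside_baseline(value: float, normal: Tuple[float, float]) -> bool:
    """True when a fresh measurement should trigger a notification."""
    low, high = normal
    return value < low or value > high


# Made-up response times in ms, standing in for several working days of data.
response_ms = [float(v) for v in range(1, 102)]
normal = baseline_range(response_ms)
```

With these sample values the range works out to 6 ms through 96 ms, so a reading of 150 ms would be reported while 50 ms would not.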
Once you have established expected service levels, you can begin measuring at different levels of workload. See how things look on busy days and how they look on days with a low workload (a weekend, for example). You will start to get a feel for how the user experience changes under different loads and how your infrastructure and its components respond to that load.
Of course, it is extremely important to make sure that all outsourced elements are included here as well. As I pointed out in previous chapters, monitoring them is slightly different from monitoring things in your own data center. You will need either a unified monitoring solution capable of hybrid monitoring or a special set of tools to collect performance information from the parts of the infrastructure that sit outside your company's physical boundaries.
Note that monitoring covers two sets of metrics: performance and workload. Too often I come across SLAs that do not take load into account. “We will provide response times within 100 ms.” Fine, but under what load? Perhaps I can give you a hundred-millisecond response time under what I consider normal load, but if you start adding users and extra tasks, the response time will obviously start to slip. Here again monitoring solutions can help, by measuring not only the performance of things like processor, memory, and disks, but also the workload, expressed for example as the number of transactions processed, the number of network packets routed, and so on. It is important that your performance expectations also include the notion of workload; you will need this to formulate service level agreements later.
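Pairing performance with workload can be as simple as grouping response-time samples by the load under which they were taken. In this sketch the unit of load (transactions per minute), the bucket width, and the use of the median are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def expectations_by_load(samples: Iterable[Tuple[int, float]],
                         bucket_size: int = 100) -> Dict[int, float]:
    """Median response time (ms) per workload bucket.

    Each sample is (transactions per minute, response time in ms).
    The dictionary key is the lower bound of the workload bucket.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for load, response_ms in samples:
        buckets[(load // bucket_size) * bucket_size].append(response_ms)
    return {start: median(times) for start, times in sorted(buckets.items())}


# Made-up measurements: light load first, then heavier load.
measurements = [(50, 80.0), (90, 90.0), (150, 120.0), (180, 140.0), (190, 130.0)]
expected = expectations_by_load(measurements)
```

For this sample data the result is 85 ms for 0 to 99 transactions per minute and 130 ms for 100 to 199, which makes a statement such as "within 100 ms" meaningful only for the first bucket.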
The performance database
All performance data must not only be collected but also stored somewhere. This is exactly the functionality many monitoring systems lack: they monitor in real time and report problems, but they do not always retain the information that passes through them. Let us extend our example with a performance database (Figure 5.8).
Figure 5.8: Adding a database with performance information to the environment.
The point of this illustration is that you need to collect information from every component, even the outsourced ones, into this database. Why? There are two reasons:
- The database lets you understand what a system’s uptime and performance look like on a normal day. This is where service level expectations come from, and it may allow you to write more realistic, better-considered SLAs.
- The database tells you when performance starts drifting away from the established baseline. I am not talking about a component suddenly falling out of line because of a problem; continuous monitoring and notification will tell you about that. The database is there to show long-term trends: “Hey, are you aware that performance fell by 1% over the past month, and by 0.75% the month before? At this pace, you will stop meeting the SLA in 6 months.”
And frankly, a good monitoring solution should not make you start from a raw “trend line” of your performance. A simple indicator with an arrow is enough: “You are meeting your SLA, and based on current trends you will continue to do so for the foreseeable future.” Or: “You are meeting your SLA, but based on current data you will no longer be able to in one or two months.”
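The kind of projection behind such an indicator can be sketched with a simple linear extrapolation. Everything here is an assumption for illustration: performance is treated as an index from 0 to 100, the monthly drops are in points of that index, and the SLA floor of 95 is invented; a real product may use a more sophisticated model.

```python
import math
from typing import List, Optional


def months_until_breach(current: float, sla_floor: float,
                        monthly_drops: List[float]) -> Optional[int]:
    """Whole months until `current` falls below `sla_floor`.

    Assumes the average of the recent monthly drops simply continues.
    Returns None when there is no downward trend to extrapolate.
    """
    if not monthly_drops:
        return None
    avg_drop = sum(monthly_drops) / len(monthly_drops)
    if avg_drop <= 0:
        return None
    headroom = current - sla_floor
    if headroom < 0:
        return 0  # already below the floor
    return math.floor(headroom / avg_drop) + 1


def sla_indicator(current: float, sla_floor: float, monthly_drops: List[float],
                  horizon_months: int = 12) -> str:
    """Collapse the trend into the one-line message described above."""
    months = months_until_breach(current, sla_floor, monthly_drops)
    if months == 0:
        return "SLA is not being met"
    if months is None or months > horizon_months:
        return "Meeting SLA; expected to continue for the foreseeable future"
    return f"Meeting SLA, but at the current pace it will be missed in about {months} months"


# Drops of 1 and 0.75 points over the last two months, as in the example above.
months = months_until_breach(100.0, 95.0, [1.0, 0.75])
message = sla_indicator(100.0, 95.0, [1.0, 0.75])
```

With an index of 100, a floor of 95, and an average drop of 0.875 points per month, the index first falls below the floor in month 6.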
From there, you can dig into the charts and graphs that provide the details, find the component or components that are the bottleneck, and plan capacity increases before an SLA violation occurs.
Summary
Embrace the past and your future will be better: that is what this chapter has been about. Whether you collect information from tickets to solve future problems faster and better, or collect system performance data to set reasonable service expectations and plan resource capacity correctly, it all comes down to preserving and managing historical data so that the organization stands on firmer footing in the future.
In the next chapter ...
In the last chapter of this book, we will walk the whole path again from the very beginning and look at unified management through case studies. I will draw on my practical and consulting experience to build a composite case study that pulls the elements of unified management together, showing what a modern, truly unified environment can look like. I will point out specific problems in each environment and explain how integrated management helps solve them better and more efficiently.