Don Jones. “Creating a unified IT monitoring system in your environment.” Chapter 6. Unified management using examples
We have finally reached the last chapter of the book. It looks at some practical examples; for ethical reasons the author names almost no specific systems apart from very well-known ones. The examples compare the state of affairs before and after the introduction of a unified management system.
Chapter 1. Managing your IT environment: four things that you do wrong
Chapter 2. Eliminating management practices for individual sites in IT management
Chapter 3. Connecting everything into a single IT management cycle
Chapter 4. Monitoring: looking beyond the data center
Chapter 5. Turning Problems into Solutions
Chapter 6. Unified Management with Examples
Chapter 6. Unified Management with Examples
In this final chapter, I would like to recap the contents of the previous five chapters once more, but in the form of case studies. I have been lucky enough to talk with several clients I advised who were working through problems like those discussed earlier. Not long ago they adopted solutions built on an approach that matches the one described in this book. They agreed to let me retell their stories without naming the companies or the people involved, so we can see what they had before, what they have now, and how “unified management” should actually work. I will also share some of the obstacles they ran into along the way and the challenges they had to overcome.
This chapter also includes practical information on unified management that was not covered in previous chapters. I will summarize the functions a unified management solution should have, so that you can keep the list in front of you when evaluating specific products. We will also look at the various purchasing models that vendors offer, so that you have an idea of the flexibility available when choosing and implementing a solution.
A unified management solution needs functionality that I group into two fairly broad areas. The first helps you respond to problems; the second helps you service requests that are not problems as such, for example requests for changes in the environment. I have a story for each area. Both come from the same customer, although in each one you will meet different people from different parts of the organization.
Problem detection and resolution
Lisa is a senior system administrator who is mainly responsible for the Windows-based systems in her environment. Her colleague Peter is responsible for the Unix and Linux server infrastructure. Their areas overlap significantly, because most of the company's business applications depend on both Windows and *nix resources. “Of course, it's not only the servers,” Lisa told me. “It's also everything that runs on those servers: databases, web services, you know how it is. Different people support all these different parts, so we sometimes spent a lot of time arguing about where the fault was and whose it was.”
I asked her for an example of how things worked before they implemented a unified management system. She laughed and showed me a file where she used to keep notes. It looked like a collection of notes from help desk tickets. I reproduce the content here with the names changed, and I have added a few [editorial] notes where I had to go back to Lisa for clarification.
OPENED by HelpDesk 2009-06-14 13:34
The user insists that the BOS [business application] is extremely slow.
There are already several e-mail messages on the same subject. Server BOSDB02 responds slowly to pings.
ASSIGNED LHeirt [this is Lisa]
NOTES LHeirt 2009-06-14 15:26
BOSDB02 works fine, except that SQL is eating 100% of the CPU. Passed to the database administrator.
ASSIGNED DShields [this is the database administrator]
NOTES DShields 2009-06-14 16:53
Probably the indexes again; SQL is taking longer than it should to execute queries. We plan to rebuild the indexes this evening.
NOTES HelpDesk 2009-06-15 10:44
We are still receiving calls about this.
NOTES DShields 2009-06-15 11:12
ASSIGNED HelpDesk 2009-06-15 11:34
Still getting calls about BOSDB02 - slow ping.
ASSIGNED DShields 2009-06-15 13:12
SQL is still running slowly - looks like disk I/O issues. Disk fragmentation? Need server support.
NOTES LHeirt 2009-06-15 13:47
Server disk fragmentation is under 2%, so the problem is not there. I/O is slow because SQL is hitting the disks very often. Perhaps the databases are fragmented. I'll call you back.
At this point the conversation went offline, because the next entry simply says “problem resolved.” Unfortunately, there is no official record of what went wrong or what was done to fix it, but Lisa explained: “We kept passing the problem back and forth. Peter saw something in Performance Monitor that was making the server slow, so he threw it to me, and I told him that his SQL Server was to blame and passed it back. But I had no permission to see what was happening inside SQL Server, and he kept wanting to push the ticket out of his queue.”
“In the end, it all came down to a SAN issue, which was Peter's responsibility. Something had happened to our primary link to the SAN, so we were running over a slower backup connection, and that link also had a configuration problem that kept it from working at full speed. We saw slow disk throughput because Windows, of course, treats the SAN as just one large, logically attached drive. We ran every test we could on the server and on SQL Server trying to find the source of the problem, but none of our tools could show that the real problem lay somewhere else entirely.”
Peter also remembered the incident. “The strange thing was that from the outside everything worked as it should, and none of the systems I use to monitor the SAN raised an alarm. The problem was in the configuration of several of our hosts. The utilities reported no faults at all, even though server access to the SAN was much slower than usual.”
“The real problem was that it showed up on several servers at once. We did not connect the incidents right away, because each host used the SAN in its own way. The storage network served not only a large DBMS but also a small web farm and a file server. The symptoms users experienced were all different, so the problems kept landing with different specialists. It reached me from the guys who run the file servers. They saw how fast the disk queues were growing, they knew it could somehow be related to the SAN, and so they brought me in.”
“After a lot of lost time, that turned out to be the source of the problem,” Lisa said. “Each of us thinks first about what we are responsible for, but the systems now have so many overlaps and interdependencies that when a problem occurs we cannot see it from our own level, because we are tied to our own tools.”
I also spoke with Kevin, who runs the company's help desk. He said that cases like this are especially hard on his team: users keep calling, and the help desk has no idea where to send them and cannot tell them anything about the cause of the fault or the current state of affairs. “Users describe the problem in different words, and each help desk operator opens a new ticket. Of course we slowed down every specialist by distracting them with tickets about the same thing, but we really had no way to link them. Normally, when you answer a call, you check whether a similar ticket is already open, but we had no single place where we could see all the currently open problems. In the end I even put up a board listing the issues that needed special attention.”
I asked Lisa how things work now that the company has introduced a unified management system. “We have been working with it for about a year,” she told me, “and everything has changed.” She showed me a ticket from a recent problem: “This is what we see now.”
ALARM 2011-06-14 12:13:42
NODE Windows Server BOSDB02
Instance of SQL Server = DEFAULT
SYMPTOM: SQL Server response time is outside acceptable limits
IP: 10.10.15.212
SQL Server database shows 34% free
SQL Server database fragmentation <5%
Disk Queue <1
Network Utilization <40%
CPU Utilization <60%
Memory Utilization <75%
RELATED ALARM 2011-06-14 12:10:52
NODE MBS3667 Router Interface
“Look, here you can start guessing right away what the matter might be.” She showed me the monitoring console that the whole IT department now uses, with information similar to Figure 6.1. “You can see a simple network diagram. It shows not only servers and services but also network elements such as switches and routers. If a server signals that something is wrong, the system also gathers the alarms from every element the server depends on, such as a router. In this case the culprit was a router interface that had started dropping packets. The system itself sent the problem to the specialist responsible for that area and also raised alarms on all the servers connected to that router, because both the clients and the monitoring system saw response times starting to climb. Having this data saved us a lot of time in finding the source of the problem. The system is configured to run basic checks automatically, so when a problem occurs it does the preliminary data collection on its own, without our involvement.”
Figure 6.1: Visual trace of alarms.
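The correlation Lisa describes, where a server alarm pulls in recent alarms from the elements it depends on, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's actual logic: the dependency map, the 15-minute window, and the names Alarm, upstream and related_alarms are all hypothetical. Only the node names, timestamps and symptoms come from the ticket and from Lisa's description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alarm:
    node: str
    raised_at: datetime
    symptom: str


# Hypothetical dependency map: each node lists the elements it depends on.
DEPENDS_ON = {
    "BOSDB02": ["MBS3667"],  # the database server sits behind this router
    "MBS3667": [],
}


def upstream(node, depends_on):
    """Return every element that `node` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen


def related_alarms(alarm, all_alarms, depends_on, window=timedelta(minutes=15)):
    """Alarms on upstream elements raised no more than `window` before `alarm`."""
    nodes = upstream(alarm.node, depends_on)
    return [
        other
        for other in all_alarms
        if other.node in nodes
        and timedelta(0) <= alarm.raised_at - other.raised_at <= window
    ]


router = Alarm("MBS3667", datetime(2011, 6, 14, 12, 10, 52),
               "router interface dropping packets")
server = Alarm("BOSDB02", datetime(2011, 6, 14, 12, 13, 42),
               "SQL Server response time outside acceptable limits")
alarms = [router, server]

for related in related_alarms(server, alarms, DEPENDS_ON):
    print(f"RELATED ALARM {related.raised_at} NODE {related.node}: {related.symptom}")
```

The point of the sketch is the design choice rather than the code itself: the dependency map is what lets a server alarm and a router alarm appear on the same ticket instead of landing with two different specialists.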
Lisa also said that the team now spends significantly less time passing tickets back and forth. When the environment is viewed as a whole, it becomes clearer where a failure has occurred. “The problem gets huge if it is outside the data center. We run a lot of applications through SalesForce.com, and if something happens on their side or, more often, one of the providers slows down, our users see ‘our’ application getting slower. But the monitoring system knows about the dependencies and has usually already told us that problems are starting. We send out a notice about the applications that depend on those services and start calling the service provider to open a ticket with them.”
Kevin says that these notices help the help desk a lot. “We have a web portal where users can open tickets, and it also shows the current system status. Before they open a ticket, they can look and see what we already know about the problem. Once we trained them to use the system and trust it, they stopped opening duplicate tickets.”
He acknowledged that the training was a big step forward. “We didn't do it at first, but once users realized that we were being honest with them and had a good picture of the state of each problem, they began to trust us more. We put in a lot of effort, and now we even have mailing lists that users can add themselves to, so they get a message if something happens to a system. Being proactive takes a very large burden off us.”
The advantages of the unified management system for this team were quite clear: faster problem resolution, less passing of tickets back and forth, and more active communication with end users. What were the biggest problems they had to face?
“Trust,” Lisa told me. “We had to trust the new monitoring system the same way we trusted the tools we already knew. At first, when something went wrong, we went back to those tools to solve the problem, but once we realized we were seeing the same data, we began to trust the new system more, and at some point we started relying on it alone. From time to time we still dig out our good old tools when we need to go deep into a misbehaving system, but by then we already know exactly where the problem is, and we don't have to spend much time on it. By then there is no need to play football with the ticket: you are already in the right problem area, and all that remains is to pin down the cause.”
Fulfilling user requests
Kevin talked about another side of unified management. “We don't only open tickets for problems. We also open tickets for routine change operations.” I asked him for an example of how this was done before the unified management system was introduced. He showed me a ticket from the archive:
OPENED HelpDesk 2010-08-12 15:50
User BDOUDS needs a new SharePoint site located at intranet/projects/universitybid. The user will be the site administrator.
NOTES Jholz 2010-08-13 08:27
A message has been sent to Bill's manager for approval. A message was also sent to the special projects department.
NOTES Jholz 2010-08-16 11:12
Bill's manager, KHiki, approved the request. Still waiting for approval from the special projects department.
NOTES Jholz 2010-08-18 11:05
Still waiting for a response from special projects. For now I have put the work on the virtual machine on hold.
NOTES HelpDesk 2010-08-20 10:34
User is asking for status.
NOTES Jholz 2010-08-20 11:34
Tell him to contact the special projects department himself. I need their approval, as this comes out of their budget.
NOTES Jholz 2010-08-22 13:11
Approval from special projects received. Brought up the site and assigned user BDOUDS as the site user.
STATUS SET TO COMPLETED 2010-08-22 13:12
“This happened all the time. Someone would call us for access or something else. We would assign the ticket to someone in IT, and then they would start working out who was responsible for it. In the end we had to put together a thick book,” he added, pointing to a thick three-ring binder on his shelf, “where we could look up who was responsible for what. And then you had to try to get an answer out of them and wait... How long could it take? This particular request took us two weeks. It's idiotic, of course, but all that time users were calling us for status, and we couldn't tell them anything because we didn't know anything. The work itself, after receiving approval from Jeff, took only 10 minutes.”
And what does it look like now, with unified management in place?
“Actually, it's quite good,” said Kevin. “Now we have a large online catalog with everything a user might need. It works like an online store: the user places a request, and the system automatically opens a ticket. Each catalog item is tied to a workflow, so IT hears nothing about the request until it has passed through the people who review and approve that kind of work. By the time we see it, that part is already done, and all we have to do is start and finish our own work. For some items we had to rework our original scripts at first, so a good deal of the load is now off us.” The organization designed and documented the desired workflow for each product (internal service). Kevin gave an example of the documentation, shown in Figure 6.2. “This kind of process documentation is important because we put a lot of effort into implementing the workflows. Business (process) owners can use these diagrams on their own once they are tied to the relevant products in the catalog.”
Figure 6.2: The documented workflow used for automated reviews and approvals when a catalog item is requested.
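The behavior Kevin describes, where a catalog request reaches IT only after every approver in its workflow has signed off and the requester can check its status at any time, might look roughly like the sketch below. It is a minimal illustration, not the workflow engine of any particular product: the CatalogRequest class, its method names and the status strings are all hypothetical, and the approver list simply mirrors the SharePoint request from the archived ticket.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogRequest:
    item: str
    requester: str
    approvers: list                      # approval steps, in the required order
    approved_by: list = field(default_factory=list)

    def pending_approver(self):
        """Return the next approver in line, or None when all have approved."""
        if len(self.approved_by) == len(self.approvers):
            return None
        return self.approvers[len(self.approved_by)]

    def approve(self, approver):
        """Record an approval; only the approver currently in line may approve."""
        if approver != self.pending_approver():
            raise ValueError(f"{approver} is not the pending approver")
        self.approved_by.append(approver)

    def status(self):
        """Status text the requester could see in a self-service portal."""
        pending = self.pending_approver()
        if pending is not None:
            return f"waiting for approval: {pending}"
        return "ready for IT"


request = CatalogRequest(
    item="SharePoint site",
    requester="BDOUDS",
    approvers=["requester's manager", "special projects department"],
)
print(request.status())
request.approve("requester's manager")
print(request.status())
request.approve("special projects department")
print(request.status())
```

Because the status names the approver who is currently holding the request, the requester knows whom to chase, and the help desk does not have to answer status calls for work that has not reached IT yet.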
As an example, we discussed access permissions, and I asked how access used to be handled once someone had been granted it. “Nothing was done,” Kevin admitted. “Once a user got access, they kept it until they left the company. We didn't track it. Now it is visible in the catalog. If you no longer need something, you can ‘return it to the store’; the return goes through its own approval workflow, and we get a ticket saying which access to remove and from where. Managers periodically review who has access to their resources and then tell us what should be removed or kept. IT is no longer involved in that review work.”
I noted that an automated workflow does not necessarily guarantee a quick turnaround. “Oh, yes, for some things users still sometimes wait two weeks for approval, but if they place the request through the catalog, they can check its status themselves. They can see that it hasn't reached us yet, and they can try to speed it up on their own by chasing their managers or whoever is responsible for the resource. We stay out of the approval cycle itself, and users know that; besides, the status shows that we don't have the request yet.” Systems like this keep users better informed and help them understand where, and at what stage, their request has stalled.
A checklist for choosing a unified IT management system
I would like to use this section to list what are, in my opinion, the mandatory capabilities of a unified system. As you evaluate candidate solutions, make sure this functionality is present, that it works as expected, and that it is useful in your environment.
- Workflow. Unified management solutions should offer workflows that help automate service coordination and management. Building a workflow should be done with ordinary mouse actions as far as possible, so that programming is kept to a minimum.
- Agents. I know there is a huge gap between people who are happy to deploy agents and those who categorically refuse to; I would recommend choosing a solution that supports both approaches. In some cases agentless data collection works perfectly well, although compared with an installed agent it can cause problems with performance and with the number of queries being made. I think a hybrid approach is best for most organizations, and a monitoring solution should support it.
- Alert integration. When a problem arises, a unified management solution should explicitly notify the right people. It should also open a help desk ticket and automatically look for similar alerts in the past. This significantly shortens the time to resolution, and this kind of “knowledge automation” is really important.
- Approvals. As I pointed out earlier, tickets are not only for problems; they are also needed for other work, such as change requests. Unified management should support a review and approval workflow for these requests, so that IT can step away from its traditional role as “gatekeeper” and instead simply process the tickets the business has approved for execution.
- Discovery and deployment. A unified management solution should help discover nodes and services and deploy monitoring agents to them. Discovery should run more or less continuously, or at least periodically on a regular schedule, so that every change in your environment is captured.
- Routing. Tickets, whether for problems or for requests, should be routed automatically according to business rules you define. In other words, tickets should reach the right specialists as quickly as possible.
- Scheduling. A unified management system should have a built-in calendar that lets you schedule tasks. This helps resolve conflicts over maintenance windows and lets you perform maintenance at the right time.
- Catalog. This is a key part of a unified management solution and lets it work as a managed self-service system. The catalog also supports compliance with the business processes of whatever model you have adopted, such as ITIL. It presents users with a list of “products” to order, much as an online store does, except that users' “purchases” are turned into tickets that pass through the necessary reviews and approvals and are then handed to IT for execution.
- Communications. Users should be able not only to create tickets but also to check their status in an agreed place where your team posts status changes promptly. A web portal is the traditional way to do this, but systems that can also deliver this information through users' system (mail) inboxes are even better, since that is where users already spend their time.
- Interface. A unified system should not have many interfaces. Whatever solution you choose, it must have a web interface as well as a version of it for mobile devices.
- Metering. If you serve customers who pay for your services, you need to be able to bill for their usage. Even if you only work with internal “customers”, the ability to charge for consumption of IT resources can be very important if your business managers want to improve how they manage. IT should not be treated as a pure cost center when consumed resources can (and should) be tracked and allocated to the services that actually consume them.
- SLA. A unified management system should help you both define and monitor service level agreements (SLAs) based on actual numbers.
- Trends. A unified management solution should include a database that stores performance measurements for components, so that historical trends can be kept and analyzed properly. This database can help you define and track SLAs, as well as plan capacity for your environment.
- Surveys. Closing the loop with your users is important, and technical SLAs are not the only measure of success, whether you realize it or not. The ability to survey your users helps you define SLAs in their language and set terms that are more acceptable and closer to what they expect.
- Reports. Look for reporting and dashboards that give management and executives information such as workload, SLA compliance, and so on. Even indicators that simply let end users see that their environment is working properly can, in the long run, help IT demonstrate how closely it follows and responds to the needs of the business.
- Visualization. The ability to visualize your environment helps with root-cause analysis and problem solving.
- All in one place. As I have said several times in this guide, the main value of a unified management system is its unity: the ability to collect and track in one place all the factors affecting performance, using uniform sets of metrics, alarms, identifiers, and so on. A single approach and a single view of problems helps get rid of the traditional siloed management that IT is built on, and when a problem arises it lets all staff focus on it and find the root cause more quickly.
- Knowledge retention. A unified management system should help your organization retain critical knowledge by turning the tickets collected at the help desk into an automated, searchable knowledge base.
- Pre-populated information. If an alarm creates a ticket, the ticket should automatically include as much detail as possible: IP address, response time, and so on. The more information the ticket contains, the less time specialists need to understand the problem and the faster a solution will be found.
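Two items from the list above, rule-based ticket routing and tickets pre-populated from the alarm, fit together naturally. The following sketch shows one way they could be combined. It is an assumption-laden illustration: the Ticket class, the rule list, the queue names and the detail keys are hypothetical, and a real product would let you define such rules in its own configuration rather than in code.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    summary: str
    details: dict       # everything the alarm already knew, copied into the ticket
    assigned_to: str


# Hypothetical business rules: (predicate over alarm details, target queue).
# Rules are checked in order and the first match wins.
ROUTING_RULES = [
    (lambda d: d.get("node_type") == "router", "network team"),
    (lambda d: d.get("service") == "SQL Server", "database team"),
    (lambda d: d.get("os") == "Windows", "Windows server team"),
]
DEFAULT_QUEUE = "help desk"


def route(details, rules=ROUTING_RULES, default=DEFAULT_QUEUE):
    """Return the queue of the first rule that matches, or the default queue."""
    for matches, queue in rules:
        if matches(details):
            return queue
    return default


def ticket_from_alarm(symptom, details):
    """Create a ticket that carries the alarm details and is already routed."""
    return Ticket(summary=symptom, details=dict(details), assigned_to=route(details))


ticket = ticket_from_alarm(
    "SQL Server response time is outside acceptable limits",
    {"node": "BOSDB02", "ip": "10.10.15.212", "os": "Windows", "service": "SQL Server"},
)
print(ticket.assigned_to)
print(ticket.details["ip"])
```

Keeping the rules as ordered data rather than scattered conditionals is one possible design choice; it keeps the rules in one place where they can be reviewed and changed without touching the ticket-creation code.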
Obviously this list is not exhaustive, but it gives you a starting point. If a candidate solution offers this functionality and meets the specific needs of your organization, it probably deserves a closer look and a live trial. Do not just tick the box next to each item: make sure you understand in detail how the functionality is implemented in the specific system, and check that it meets your organizational requirements.
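The SLA and Trends items in the checklist say that SLAs should be defined and monitored from actual stored measurements. As a rough illustration of what “based on actual numbers” can mean, the sketch below computes the share of response-time samples that fall within a threshold and compares it with a target. The sample values, the 200 ms threshold and the 95% target are invented for the example, and the function names are hypothetical.

```python
def sla_compliance(response_times_ms, threshold_ms):
    """Return the fraction of measurements at or below the threshold."""
    if not response_times_ms:
        raise ValueError("no measurements to evaluate")
    within = sum(1 for t in response_times_ms if t <= threshold_ms)
    return within / len(response_times_ms)


def meets_sla(response_times_ms, threshold_ms, target):
    """True when the compliant fraction reaches the agreed target."""
    return sla_compliance(response_times_ms, threshold_ms) >= target


# Hypothetical response-time samples in milliseconds, as they might be read
# from the monitoring system's measurement database.
samples = [120, 180, 95, 640, 150, 210, 175, 130, 160, 140]

compliance = sla_compliance(samples, threshold_ms=200)
print(f"{compliance:.0%} of samples within 200 ms")
print("SLA met" if meets_sla(samples, 200, 0.95) else "SLA missed")
```

When evaluating a product, the useful question is whether it can produce this kind of figure directly from its own history, without exporting data to a spreadsheet first.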
Ways to purchase a unified IT management system
I would like to briefly outline the different approaches vendors take to selling unified management solutions. I want to stress right away that I do not consider any of them “right” or “wrong”. The right one is the one that suits you, and what suits you is for you to decide.
The price of such solutions is usually based on the number of nodes you need to manage, and possibly also on the number of users in your organization. A “node” usually means any managed device: a router, a server, and so on. Some vendors are more creative with their licensing models than others, but don't let that scare you. In some cases more complex licensing rules work in your favor, because the vendor is trying to adapt to the widely varying situations of its potential customers. Pay more attention to what exactly you are licensing.
For example, at one end of the spectrum are what I call monolithic solutions. Here you receive, and pay for, every function regardless of whether you need it right now. It is very important to know that you are getting a solution that does everything you need, but I am not sure you want to pay for everything on the list. Sometimes a solution has to be rolled out in stages, licensing only the functionality needed for a particular stage of the project. That lets you grow the product's capabilities over time and avoid paying for full licensing up front. The advantage of monolithic products is that they often have good internal integration, because everything is built as one system.
Then there are modular (pluggable) frameworks. I would put solutions such as HP OpenView in this group. With these systems you buy a base product and then buy additional parts and modules for it. They offer great flexibility, and if you work with a large vendor you will be able to find something in its catalog for almost every task. These solutions carry the risk of turning into massive projects that take a lot of time and effort, and the modules may not be as well integrated as you need. The licensing scheme can be very, very complicated, because plugins are licensed separately from the base product.
Another licensing model is pay as you go. In this model the solution offers all the functionality you might need, but you do not turn it all on at once. Instead, you activate only what you need and pay only for that. As your needs grow, you pay a little more. This is closer to a “cloud” model, where your needs grow gradually but you pay only for what you actually use. Separately purchased plugins, where they exist at all, usually come from the same vendor. I see a growing number of my clients favoring this approach.
The last thing to think about is where the solution will be deployed. In the age of “clouds” you have a choice: host your monitoring and management solution inside your own data center, or buy it as a service hosted in a provider's data center. Either way, software agents are installed in your environment. I will not go into the “on-premises versus hosted” debate; you may already have decided what works for you and what does not, but you will certainly need to weigh it for each specific solution. Whichever strategy you choose, it is good if your solution can support both options.
This is what unified IT management looks like. The idea running through this entire book is quite simple: “bring everything together in one place and on one page.” The only revolutionary thing about it, compared with the fragmented approach, is that our existing technologies are, one way or another, pushing us in this direction.
Of course, I do not expect you to drop everything and start implementing a new monitoring and management solution right away. You can do this in small steps that have little impact on your organization but let you learn the relevant approaches and techniques in a natural, non-disruptive way.
The main goal is to stop wasting time constantly switching between tools and to bring everything into a single top-level monitoring view of your organization. Integrate it all with the help desk, which lets you keep every stakeholder up to date and gives you the metrics you need for an objective analysis of how your IT infrastructure is performing.