How I implemented Quest (now Dell) Foglight
An opus on how not to choose and implement a monitoring system
Hello, dear Khabrovites.
Let me tell you the long story of one company whose very small hosting team suddenly wanted to upgrade its monitoring system. It is a story about a long and thorny path, one that only now, after almost two years, is approaching that remarkable and controversial state known as maintenance mode. If this sounds interesting, welcome under the cut.
So, two years ago it was decided that SolarWinds ipMonitor, which we had been using successfully for many years, had exhausted its capabilities. The company was growing, the number of servers in each cell was growing, the number of cells was growing as well, and it was decided that ping, telnet and searching for a word in the page source were no longer enough. Alongside this system there were also a great many scripts written by various engineers, naturally without documentation. The scripts broke regularly, sometimes in non-obvious ways, and in the end the quality of the service we provided suffered.
At a VMware presentation, my boss noticed a monitoring system with “huge potential”. A bunch of indicators, buttons, graphs, analysis tools: in general, plenty of pretty and tempting things for the inexperienced head of a five-person hosting department. This marvel was called Quest Foglight Monitoring System (FMS from here on). Without delay, the senior engineer was asked to contact the vendor and do a test deployment. After several weeks of “hard work”, the engineer gave the go-ahead. Of course, the boss suggested that we all get acquainted with the system before buying and asked for our opinion. That was the point of no return: we blindly agreed with the senior engineer's arguments, since nobody had relieved us of our main work, and spending time on something the senior engineer had already approved did not seem to make much sense. Then the price was announced. Naturally, we wanted absolutely all the functionality available, and the price bit pretty hard. The vendor tried to persuade us to buy several months of their professional services, but to someone those services seemed too expensive. After all, we had somehow coped with what we already had, so we could handle this too, right? O Great Vishnu, how wrong that opinion turned out to be. A three-day training package was purchased for the entire group, along with a week of professional services (PS) and “some customizations”. Experienced IT specialists from fairly large mid-sized businesses will probably giggle and twirl a finger at their temple. Hosters will probably just sigh and perhaps marvel at the immense short-sightedness of all of the above.
Problems started a minute after the consultant's time was up and we were handed over to customer support. It began with the fact that our senior engineer had given the vendor a deployment plan that described his test sandbox. The vendor must have been happy to sell a top-end monitoring system to people with three dozen virtual machines and one database, but in reality we had several hundred virtual machines on several chassis, with several database server clusters, located at opposite ends of the continent. At that point we could not imagine how resource-hungry FMS would be. After creating all the database, vCenter, and infrastructure agents, we suddenly realized that the system had hung solid. Support rubbed our noses in the deployment plan and declared that if we had told them the real size of our environment in advance, we would have been talking about very different requirements. Two days later, the senior engineer quit. This is where I come on stage: I am still far from senior and have no say in choosing my projects.
My first thought was "Should I quit now?" But Russians don’t give up, right? First, I secured dedicated servers for this fun: two old Dell 2950s running ESXi. I could not get a separate server for the database, so it had to live in a virtual machine on those same hosts.
A brief description of the FMS architecture
FMS consists of:
1. Management Server. There can be several of these servers in an active/passive cluster (the vendor's own implementation); this is the central point that commands everything.
2. Foglight Agent Manager. The Agent Manager is a Windows service (or a daemon, if you know how and want to). You can install several of them for different needs. We split ours into VMware, SQL staging, SQL production, and OS, so that a problem with one type of agent does not force you to interrupt all monitoring.
3. Foglight Agent. There are agents for every occasion, either purchased from the vendor or written in-house.
4. Database. Nothing special here: we use SQL Server 2008.
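To make the split described in item 2 more concrete, here is a minimal Python sketch of the idea, not of Foglight's actual API. All class names, the `manager-<kind>` naming scheme, and the agent names are hypothetical and exist only for illustration. It shows why giving each agent type its own Agent Manager limits the blast radius of a restart.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str  # hypothetical agent name
    kind: str  # e.g. "vmware", "sql-staging", "sql-production", "os"


@dataclass
class AgentManager:
    name: str
    agents: list = field(default_factory=list)


@dataclass
class ManagementServer:
    # One Agent Manager per agent kind (illustrative model only).
    managers: dict = field(default_factory=dict)

    def register(self, agent: Agent) -> AgentManager:
        """Place the agent on the manager dedicated to its kind."""
        if agent.kind not in self.managers:
            self.managers[agent.kind] = AgentManager(f"manager-{agent.kind}")
        manager = self.managers[agent.kind]
        manager.agents.append(agent)
        return manager

    def affected_by_restart(self, kind: str) -> list:
        """Names of agents that stop collecting while one manager restarts."""
        manager = self.managers.get(kind)
        return [a.name for a in manager.agents] if manager else []


server = ManagementServer()
for name, kind in [
    ("vcenter-1", "vmware"),
    ("stg-db-1", "sql-staging"),
    ("prod-db-1", "sql-production"),
    ("web-1", "os"),
]:
    server.register(Agent(name, kind))

# Restarting the staging SQL manager touches only the staging agents.
print(server.affected_by_restart("sql-staging"))
```

With a single shared Agent Manager, the same restart would have paused every agent at once, which is exactly what the split is meant to avoid.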
Pretty quickly, I realized that working with what we had was simply impossible. First, the system was slow even with adequate resources. The rules manager page could take anywhere from five to fifteen minutes to load the list of rules. The support call had an unexpected result: they knew about the problem and promised to fix it in the next version ... in a quarter. Meanwhile, management demanded results, and no excuse about our version being slow was accepted; after all, a considerable amount of money had been spent. Gritting my teeth and inventing workarounds, I got everything more or less working after another six weeks, and then the clocks changed. What does DST have to do with it, you may rightly ask? The fact is that this rather mature system had a bug.
Having used the system for some time, we began to understand that the customizations we had ordered, first, simply did not work and, second, were simply not needed. We needed different ones, but here was the problem: Dell had bought the vendor, and the pricing policy had changed a bit. Management demanded that I urgently write the required customizations myself. The idea that it would be nice to quit visited me again, because I have never been a programmer. My heart is just not in it, and that's that. But Russians don’t give up, right? So I started learning Groovy, the scripting language all of this runs on. In the process, I realized that almost half of the functionality we had bought could work better if I simply rewrote it for our specific needs. I kept rewriting, and at the same time I stopped telling management that I hated this product, because by then it was already 30% my own product.
And then the cherished hour came: a new version was released in which, O Great Vishnu, both the slow page loading and the hated DST bug were fixed. I confess, I celebrated that day. It was the end of the constant nervous tic and the trips for coffee “while the page loads”. This was the event that finally brought the cherished maintenance mode within reach. Now I only occasionally change alert thresholds at my colleagues' request and occasionally write new agents that have nothing to do with the infrastructure but simply report purely client-side problems, such as locked user accounts in our product. Now I am a lead, and now I know exactly how to choose and implement software.
I will try to present my seemingly obvious conclusions.
1. You can’t buy the full functionality right away without a firm conviction that you need it. Finding out whether you really need it is not difficult: you can hire a consultant with experience in this particular software. Believe me, this is much cheaper than what we paid for cartridges that are no longer in use.
2. You can’t rush. Nothing terrible would have happened if we had spent another half a year on what we already had. A few old servers can always be found, and no one except the vendor's sales manager is pushing you to pay here and now.
3. You need to understand the specifics of the staff you have. Do not entrust the evaluation to just one person, especially if that person is poorly motivated.
4. Do not skimp on the cost of implementation. Really, it is not worth it. The vendor usually wants to get you to production as soon as possible, because that is when they get paid in full, and the consultants also benefit if everything goes well. If the vendor says it will take months with their staff, then it most likely will. If there is no money in the budget for this, stop, because you will pay anyway, only more.