Monitoring projects with a messenger, using Nagios and Telegram as an example, with an analysis of mistakes from 24x7 highload life
Figure: Margarita Zakieva
What you will find under the cut:
- Basic Nagios settings for working together with Telegram.
- The general concept of how my colleagues and I monitor our projects.
- An analysis of the rakes we managed to step on while working with this system.
This article will be useful for those who:
- Are dissatisfied with how informative their current monitoring is.
- Suffer a daily pain in the neck from alerts about problems.
This article is not about the Telegram Bot API
We started setting up the combination discussed here a month before the public release of the API, so from the very beginning we used Telegram CLI for Linux to send alerts from the monitoring server. The article is devoted primarily to this console client. At the end of the article we explain in detail why we did not abandon it in favor of the innovations from the world of bots.
Who we are and what we do
We are a friendly “Operations” team that administers dozens of servers: both VPS and bare-metal machines, some of them in colocation, scattered around the world. Correct and effective monitoring is our top priority.
General concept
We have no one on staff whose job is to stay awake at night and watch the monitoring, but we do have one account registered to a throwaway SIM card, on whose behalf we send messages, and a certain number of:
- Nagios instances. This has no bearing on how notifications are sent; we only want to emphasize that everything works without glitches with several Nagios instances at once.
Mistake No. 0: not monitoring the monitoring
Sooner or later you will find that monitoring can break too, and you will want to know about it right away, not on Monday after the weekend. At the same time, it makes sense to check some services “from the inside” and others, such as your website's HTTP response status, “from the outside”. To kill two birds with one stone, set up a second Nagios at another provider and distribute the checks between the two monitors, remembering to configure a check_nagios check from one instance to the other and a mirrored check in the opposite direction. We hope that for you, as for us, two providers in different countries going down simultaneously is a highly unlikely scenario.
What our monitoring of the monitoring looks like
- Configured notifications for services. The key point is to send only the most important notifications to the messenger; most likely these will be CRITICAL notifications for the key metrics on the most important hosts. Everything else, for example WARNING notifications or sandbox hosts, is delivered outside the scheme described in this article, for example by mail or in a private chat with the robot in the same Telegram.
Mistake No. 1: sending notifications that require immediate intervention to the same chat as alerts that can wait or that will soon disappear once the service recovers automatically.
If you do this, everyone who watches the chat will soon stop paying attention to it altogether, especially if they have to wake up at 4 in the morning because of a false positive. The opposite extreme is switching off monitoring of an important web server's log completely for the night. Do not do that: it may be precisely at night that very important information turns up, which will need to be dealt with during the day. It is enough to send such messages to mail, which you will read during working hours. Divide and rule.
A typical channel that the administrator on duty responds to
- System administrators, who take turns standing a monitoring “watch” that lasts 24 hours, from 23:00 to 23:00. The administrator on duty must enable (or not disable) notifications for the channel that is configured as the default destination for critical alerts from Nagios.
Mistake No. 2: responding to notifications on a “whoever sees it first” basis.
If you do not appoint someone to be on duty, then one night no one will wake up, and in the morning no one will be to blame. To avoid sleeping through a single notification while on duty, we recommend configuring notifications on your mobile device as shown in the picture below.
Set up notifications on your phone or tablet
- Backup channels. The idea is simple: if no one has responded to a specific failure within half an hour, monitoring automatically switches from the regular chat to an “emergency” one, which, like the regular one, contains all the administrators. The difference is that no one may mute it; notifications must always be on for everyone. You can also create one more chat that includes not only administrators but also, for example, directors, in case a service has been down for an hour and none of the people whose job it is to watch it has responded. How exactly this is implemented technically is described at the very end of the article.
Mistake No. 3: relying only on the person on duty.
Bitter experience has shown us that an accident in your data center can coincide with the on-duty administrator's home Internet connection going down. Although everyone has mobile Internet, a smartphone is connected to home Wi-Fi by default, and it does not care that there is no access to the global network: the indicator still shows “all three bars”. The administrator may also be unavailable for simpler, more mundane reasons.
Backup channels, where notifications are always enabled for everyone
- Thematic channels. A system administrator cannot fix every malfunction that monitoring detects, for example errors in application logs or specific deadlocks in a database. The concept of “wake the system administrator so that he wakes the backend developer” seems wrong to us, so separate “thematic” channels are created for such notifications, and responsibility for them lies not with system administrators but with the relevant specialists.
Mistake No. 4: sending notifications from the robot to chats where work discussions are held.
It may seem that this will draw more attention to the problem and get it solved faster, but it will not; you will only annoy people with incomprehensible messages in the middle of an important discussion of the quarterly report. If necessary, forward the message describing the problem from the dedicated channel to the work chat yourself.
As an example, here is a screenshot with the “backup” channels and one thematic channel dedicated to databases.
Database Theme Channel
A short summary: after these agreements were adopted, the system administrators' work became much easier. They are distracted by smartphone notifications less often and have learned to spend working time on improving the company's infrastructure. The admins sleep better, and top management no longer worries that a night-time incident will bring down services vital to the company and undermine its reputation.
Sending Nagios notifications to Telegram
Installation and first launch of the console client
Even if you find telegram-cli in your distribution's repositories (for example, the RPM Fusion repository for CentOS) or as a ready-made package on the Internet, we strongly recommend that you compile it yourself, since the procedure is described in detail for many *nix systems directly on the project's GitHub page.
Note for lovers of Fedora and CentOS
For Fedora 20 and CentOS 6, you must first compile libjansson yourself, since it was not in the standard repositories.
After the binary has been compiled successfully, create a user with the login telegramd, so that after the client's first start the directory /home/telegramd/.telegram-cli appears. Once authorization is confirmed, the client stores its service files there, for example the private key received from the Telegram servers.
Why the username is exactly 'telegramd'
telegramd is the default username the client uses when it is run as the superuser. We did not find this in the documentation but spotted it in main.c.
How not to lose access to the account registered on the throwaway SIM card
It is enough to back up the “.telegram-cli” directory mentioned earlier. Once it is transferred to another server, Telegram will start there immediately with the authorization and settings you need.
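A minimal sketch of such a backup, assuming GNU tar is available. The function name backup_tg_state and the example archive path are hypothetical; because the archive contains the account's authorization key, it is created under a restrictive umask.

```shell
#!/bin/bash
# Hypothetical helper: archive the client's state directory so the account
# can be restored on another server.

backup_tg_state() {
    local home_dir="$1"   # e.g. /home/telegramd
    local archive="$2"    # e.g. /root/telegram-cli-backup.tar.gz (assumed path)
    (
        # The archive holds the authorization key: keep it private
        umask 077
        # -C stores paths relative to the home directory
        tar -czf "$archive" -C "$home_dir" .telegram-cli
    )
}
```

To restore, unpack the archive into the home directory of telegramd on the new server, for example with tar -xzf and the same -C option, and then reapply the ownership settings.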
So, you have in your hands a phone with the SIM card on which Telegram will be registered, and a console on the monitoring server is open on your computer.
adduser telegramd # --disabled-login
./bin/telegram-cli -k tg-server.pub
Follow the instructions on the screen and you will end up in the telegram-cli console.
Now you can add someone to the contact list ("contact_list") by phone number. As far as we know, this is the only way to get a user into the contacts so that Nagios notifications can later be sent to them. You can do this from the console or from any other client, including the Telegram web version, provided you have logged in there with the same phone number you just used. To send messages to a group chat or a channel, nothing needs to be done on the “robot” side; just make sure it is an administrator if you are sending messages to a channel.
add_contact +79991112233 My Contact
quit
Configuring the client for sending alerts
We now have a configured console client with one contact to send notifications to. For ease of use, wrap it in a bash script.
/usr/local/bin/telegram.sh
#!/bin/bash
# This script helps integrate Nagios instances
# with Telegram chats or channels.

# Path setup
tgSendCmd="msg"
tgDir="/usr/local/bin"
tgBinPath="$tgDir/telegram-cli"
tgKeyPath="$tgDir/tg-server.pub"
logDir="/var/log/telegram"
# Don't forget to set up log rotation
mesLogFile="$logDir/telegram.log"

# Parse arguments
contactName="$1"
messageText="$2"

# Send a single message and append the client's output to the log
sendFunc()
{
    "$tgBinPath" \
        --rsa-key "$tgKeyPath" \
        --wait-dialog-list \
        --exec "$tgSendCmd $contactName $messageText" \
        --disable-link-preview \
        --logname "$mesLogFile" \
        >> "$mesLogFile"
}

sendFunc # send the Telegram message
exit $?
Configuring system rights (tested on Debian 8 jessie)
mkdir -p /var/log/telegram
chown nagios:telegramd /var/log/telegram -R
chmod 755 /var/log/telegram -R
chown telegramd:nagios /usr/local/bin/t*
chmod +x /usr/local/bin/t*
chown telegramd:nagios /home/telegramd/ -R
chmod 770 /home/telegramd/ -R
ln -s /home/telegramd/.telegram-cli/ /var/lib/nagios/.telegram-cli
Send the message “foo” to “My Contact”
/usr/local/bin/telegram.sh My_Contact foo # note the underscore
Send “bar” to the “Monitoring” channel
/usr/local/bin/telegram.sh Monitoring bar
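As the first example suggests, the console client refers to a contact by its name with the space replaced by an underscore. A small hypothetical helper (peer_name is not part of telegram-cli) could do this conversion before calling the script; it is a sketch that assumes spaces are the only characters that need replacing.

```shell
#!/bin/bash
# Hypothetical helper: turn a display name into the form used on the
# command line above ("My Contact" -> "My_Contact").

peer_name() {
    local display_name="$1"
    # Bash pattern substitution: replace every space with an underscore
    printf '%s\n' "${display_name// /_}"
}
```

It could then be used as /usr/local/bin/telegram.sh "$(peer_name "My Contact")" foo.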
Sending a notification from Nagios
The command definitions are based on the classic template for Jabber. The name of the monitoring instance is included in the message body prefixed with # (Nagios_Instance_Name in the commands below), so it becomes a hashtag in the message, which is convenient for us.
Contact definition for Nagios config
define contact{
name telegram-contact
service_notification_period 24x7
host_notification_period 24x7
service_notification_options u,c,r,f ; Note: "Warning" notifications will not be sent
host_notification_options d,u,r,f
service_notification_commands service-notify-by-telegram
host_notification_commands host-notify-by-telegram
register 0
}
define contact{
contact_name telegramonlycrucial
use telegram-contact
alias Telegram OnlyCrucial
address1 Monitoring ; Channel name
}
Definition of commands for Nagios config
define command{
command_name host-notify-by-telegram
command_line /usr/local/bin/telegram.sh $CONTACTADDRESS1$ "***** #Nagios_Instance_Name ***** Host $HOSTNAME$ is $HOSTSTATE$ - Info: $HOSTOUTPUT$"
}
define command{
command_name service-notify-by-telegram
command_line /usr/local/bin/telegram.sh $CONTACTADDRESS1$ "***** #Nagios_Instance_Name ***** $NOTIFICATIONTYPE$ $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$ $SERVICEOUTPUT$ $LONGDATETIME$"
}
The final touch is to monitor Telegram itself
For us, monitoring is the most important and critical thing in the entire infrastructure, and since notifications are one of its main components, it is necessary to monitor telegram-cli itself according to the following metrics:
- Every minute we start the client and request the contact list, then check the client's exit code; if everything is fine, it is always zero. (This is done as a separate bash script; we think you will have no trouble writing your own implementation of such a check.)
- We check that the message sending log contains no lines with “FAIL”; this keyword indicates that something went wrong while sending notifications. (We use check_logfiles for this check.)
- We check that telegram-cli instances have not hung and that more and more instances of this process are not piling up on the system, which tends to leave your server without RAM. (For this we use the standard check_procs.)
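The first check in the list could look roughly like the sketch below. It is an illustration under assumptions, not the exact script in use: the function name check_exit_code is hypothetical, the time limit is arbitrary, and GNU coreutils timeout is assumed to be installed so that a hung client cannot block the check. The return values follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```shell
#!/bin/bash
# Hypothetical Nagios plugin sketch: run a command with a time limit and
# translate its exit code into Nagios plugin states (0 = OK, 2 = CRITICAL).

check_exit_code() {
    local limit="$1"; shift   # time limit in seconds
    local rc=0
    # 'timeout' (GNU coreutils) kills a hung command and returns 124
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "OK - client exited with code 0"
        return 0
    elif [ "$rc" -eq 124 ]; then
        echo "CRITICAL - client did not finish within ${limit}s"
        return 2
    else
        echo "CRITICAL - client exited with code $rc"
        return 2
    fi
}
```

With the paths from the script above, the check might be called as check_exit_code 30 /usr/local/bin/telegram-cli --rsa-key /usr/local/bin/tg-server.pub --wait-dialog-list --exec "contact_list", with the function's return value used as the plugin's exit code.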
Mistake No. 5: not monitoring the local agent that sends notifications to Telegram.
Almost immediately after we started using this increasingly popular messenger on our Nagios servers, it turned out that Telegram can break, leaving us without notifications for many hours, and partially even for a couple of days. If monitoring detects any problem with sending notifications via Telegram, it is reported by email.
Why a local unofficial client instead of the official API in the cloud?
1. telegram-cli is regularly updated, so it works stably and has all the functionality we need.
2. The API still needs to be monitored too; for example, failures were observed during the release of Bot API 2.0, while the regular client worked properly.
3. Since we do not use any communication with our robot and do not control monitoring with its help, we are simply satisfied with the current solution. It works - do not touch.
Untapped possibilities of Telegram in conjunction with monitoring
When an alert is triggered by an error in a log, you often want to look at the offending part without turning on your work computer, or to see a nice graph illustrating the scale of the problem next to the critical alert, for example to forward it promptly to your colleagues.
Sending images and other types of documents to Telegram works out of the box, so the possibilities of such monitoring are limited only by your imagination.
Here, for example, is how we implemented the mechanism of “backup” channels; a simplified version of the code is presented so that it is easier to understand.
The previously promised software part responsible for the backup channel mechanism.
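One hypothetical way to express the idea in bash is sketched here, under stated assumptions rather than as the exact implementation. It applies the thresholds described earlier (half an hour for the emergency chat, an hour for the chat with directors) and assumes the caller passes the problem duration in seconds, for instance from the Nagios $SERVICEDURATIONSEC$ macro. The function name pick_channel and every channel name except Monitoring are invented for illustration.

```shell
#!/bin/bash
# Hypothetical sketch: choose the destination channel by how long the
# problem has lasted without anyone reacting to it.

pick_channel() {
    local duration_sec="$1"
    # Fall back to the regular channel if the argument is not a number
    case "$duration_sec" in
        ''|*[!0-9]*) echo "Monitoring"; return 0 ;;
    esac
    if [ "$duration_sec" -ge 3600 ]; then
        echo "Monitoring_Directors"    # assumed name: admins plus directors
    elif [ "$duration_sec" -ge 1800 ]; then
        echo "Monitoring_Emergency"    # assumed name: nobody may mute it
    else
        echo "Monitoring"              # regular channel for the person on duty
    fi
}
```

A wrapper around telegram.sh could call pick_channel with the duration and pass the result as the first argument. This relies on Nagios continuing to resend notifications for unacknowledged problems, so the notification interval has to be shorter than the thresholds.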
Good luck with monitoring your projects and great uptime to you, colleagues!