How I wrote my own monitoring

I decided to share my story. Perhaps someone will find this budget solution to a well-known problem useful.

When I was young and full of energy and did not know where to direct it, I decided to do a bit of freelancing. I quickly built up a rating and found a couple of regular customers who asked me to maintain their servers on an ongoing basis.

The first thing I thought about was monitoring. I decided to do what smart people do: not reinvent the wheel, but look at ready-made options such as Munin or Zabbix. It quickly turned out, however, that their web interfaces need a good Internet connection, especially the first time you open them from a phone. When you are relaxing in nature, far from the city, a stable connection is hard to come by. So I chose console-based monitoring.

atop and its log reader atopsar helped me a lot here. Both have already been mentioned on Habr, and atop has even been examined in detail, but almost nothing has been said about atopsar.

Installation


Installation is very simple, only three commands.

# CentOS

yum install atop

# Debian / Ubuntu

apt-get install atop


Next, you can tune the monitoring to your needs or keep the default settings.

# Debian / Ubuntu / CentOS

/etc/default/atop

Standard file:

# cat /etc/default/atop
INTERVAL=60                    # Interval between load snapshots, in seconds (the default is every 10 minutes)
LOGPATH="/var/log/atop"        # Directory where the logs are stored
OUTFILE="$LOGPATH/daily.log"   # Name of the log file for the current day

Add to startup
# Debian / Ubuntu / CentOS

systemctl enable atop

Run atop as a daemon
# Debian / Ubuntu / CentOS

systemctl start atop

For the lazy, everything in one command
# CentOS

yum install atop && systemctl enable atop && systemctl start atop

# Debian / Ubuntu

apt-get install atop && systemctl enable atop && systemctl start atop
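
After starting the daemon it is worth checking that it is actually writing logs. A minimal sketch, assuming the log directory from the config above; the helper name latest_atop_log is made up for this example:

```shell
# Print the name of the most recently modified file in the atop log
# directory (defaults to /var/log/atop). Prints nothing if the
# directory is missing or empty.
latest_atop_log() {
    local dir="${1:-/var/log/atop}"
    ls -t "$dir" 2>/dev/null | head -n 1
}

# Typical use on the server:
#   systemctl is-active atop   # prints "active" when the daemon is running
#   latest_atop_log            # newest log file written by the daemon
```

If the second command prints nothing a few minutes after the start, the daemon is probably not writing where the config says it should.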

Atopsar


Along with atop, atopsar is also installed: a convenient console analyzer for the binary logs written by the atop daemon. You can, of course, read the logs with atop itself, but that is less convenient when you want to cover a long time interval.

A quick primer on how atopsar works.

When atopsar is started without keys, it opens today's log and displays the load on each core individually, plus a summary line for all cores combined.

The keys that I use:

-A = output all information from the log
-c = CPU load per core (this is the default key)
-m = RAM and swap usage
-d = disk activity
-O = top-3 processes by CPU load
-G = top-3 processes by RAM usage
-D = top-3 processes by disk load
-N = top-3 processes by network load
-r = path to the log to read, for when you need to see the load on previous days
-b = time from which to start the output
-e = time at which to finish the output
-M = adds a column at the end that marks how critical each row is (+ marks a load that is close to critical, * marks a critical load)
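
The keys above combine freely. As an illustration, here is a small wrapper that builds an atopsar call for one report key and a time window. It is only a sketch: the name atopsar_window and the DRY_RUN switch are invented for this example, and the times are given as hh:mm.

```shell
# Run atopsar for one report key and a time window. An optional fourth
# argument is passed to -r to read a log other than today's.
# With DRY_RUN=1 the command is printed instead of executed.
atopsar_window() {
    local flag="$1"
    local begin="$2"
    local end="$3"
    local log="${4:-}"
    set -- atopsar "$flag" -b "$begin" -e "$end"
    if [ -n "$log" ]; then
        set -- "$@" -r "$log"
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Examples:
#   atopsar_window -c 10:00 11:00                   # CPU load today, 10 to 11
#   atopsar_window -O 03:00 04:00 /path/to/old/log  # top-3 CPU processes from an older log
```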

Thanks to this monitoring, we can find the cause of a server's misbehavior at any point in time.

Notifications


So, load monitoring is in place, but on its own it does not let you find and fix problems quickly. We need notifications about problems.

I am the only one looking after the servers, so notifications have to arrive somewhere I will always see them and can react at least somehow.

At first there was SMS: fast, reliable, free. But then the mobile operators shut down free SMS sending through their gateways.
Mail: slow, and there may be delivery problems.
Messengers: have to be installed on the phone, and you need to create bots.

After some searching, I chose the Telegram messenger for its simplicity and its convenient apps for phone and desktop.

I created a bot using BotFather.
Then I put several scripts on the server that track the server load (idle, smartctl, etc.), errors such as “oom killer”, backup failures, and other operations that need to be watched.
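
To give an idea of what such a check can look like, here is a sketch of an “oom killer” detector. It is not the original script: the function name and the matched strings are assumptions based on the usual kernel log messages.

```shell
# Count kernel log lines on stdin that mention the OOM killer.
# One OOM event usually produces more than one matching line,
# so treat the result as "zero or not zero".
count_oom_lines() {
    # grep -c exits with 1 when nothing matches; keep the exit status at 0
    grep -c -i -E 'out of memory|oom-killer' || true
}

# Typical use:
#   dmesg | count_oom_lines
#   journalctl -k --since "10 minutes ago" | count_oom_lines
```

A non-zero result is the signal to send a notification.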

The scripts are fairly simple and written in bash. For example, this one checks LA and notifies me when the load average has exceeded the number of cores on the server.

# LA and LAd are set earlier in the script.
# The Telegram Bot API path has the form /bot<token>/, so $bot_id is expected to carry the "bot" prefix.
if [ "${LA[0]}" -gt 2000 ] || [ "${LA[1]}" -gt 3000 ] || [ "${LA[2]}" -gt 4000 ]
    then
        wget -O /dev/null "https://api.telegram.org/$bot_id:$bot_key/sendMessage?chat_id=$chat_id&text=Server $ip LA $LAd"
        wget -O /dev/null "https://api.telegram.org/$bot_id:$bot_key/sendMessage?chat_id=$chat_id&text=$(top -b -n 1 | grep Cpu)"
        wget -O /dev/null "https://api.telegram.org/$bot_id:$bot_key/sendMessage?chat_id=$chat_id&text=Top 5 processes $(top -b -n 1 | grep -A 5 'PID USER' | tail -5)"
    fi
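
The fragment relies on values prepared elsewhere in the script. As a self-contained variant of the same idea, the sketch below compares the 1-minute load average directly with the number of cores. The function names are invented for this example; it assumes Linux (/proc/loadavg) and the nproc utility.

```shell
# Succeed (exit 0) when the load average is strictly greater than the
# number of cores. Both values are passed in, so the comparison can be
# tried with sample numbers. awk does the fractional comparison.
la_exceeds_cores() {
    local la="$1"
    local cores="$2"
    awk -v la="$la" -v cores="$cores" 'BEGIN { exit !(la > cores) }'
}

# Read the live values and print a message when the threshold is crossed.
check_load() {
    local la
    local cores
    la=$(cut -d ' ' -f 1 /proc/loadavg)
    cores=$(nproc)
    if la_exceeds_cores "$la" "$cores"; then
        echo "LA $la exceeds $cores cores"
    fi
}
```

The output of check_load, when there is any, is what would go into the notification text.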

The simple syntax allows for a lot of use cases, and anyone who knows even a little programming can write or extend such scripts.

The only caveat: if the server is located in Russia (and has no IPv6), you need to use a proxy. To do this, set the proxy connection string at the beginning of the script:

export https_proxy=http://login:password@IP.address:port
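
When there are several such scripts, the sending part is worth moving into one helper. Below is a possible sketch that uses curl instead of wget, so the message text is URL-encoded properly. The function names are hypothetical, and the token is assumed to be the full string issued by BotFather.

```shell
# Build the Bot API URL for a method: /bot<token>/<method>.
tg_api_url() {
    printf 'https://api.telegram.org/bot%s/%s\n' "$1" "$2"
}

# Send a text message to a chat. --data-urlencode takes care of spaces,
# newlines and non-ASCII characters in the text.
tg_send() {
    local token="$1"
    local chat="$2"
    local text="$3"
    curl -sS --max-time 10 -o /dev/null \
        --data-urlencode "chat_id=$chat" \
        --data-urlencode "text=$text" \
        "$(tg_api_url "$token" sendMessage)"
}

# Example:
#   tg_send "$token" "$chat_id" "Server $ip: backup failed"
```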

This is not the end


You are walking calmly through the mountains with a backpack on your back, taking a break from civilization, and then the phone, having accidentally caught a signal, throws up a notification about a problem on your server. What to do? The serene mood is gone with the wind. Call my wife and dictate commands to her? Haha.

I urgently needed a way to fix problems quickly and without a good Internet connection. Here the messenger saved me once again (#telegrammivi). I taught my bot to talk only to me and ignore everyone else. Now, along with the problem notification, I get a bit more data that tells me what the source of the problem is, so I can try to fix it remotely. It is enough to write a message to the bot, toss the phone up higher so the message gets sent, and voila: the bot goes off to do the work for you. This way I can, for example, kill an unwanted process, restart a daemon, block an IP, and so on.

I also moved the requests customers tend to need here, for example urgent user password resets (for the “Ahhh, we can't get onto the server, we're losing millions!” cases), finding the users who have access to a given folder, switching a site on and off, and others. Of course, I keep extending the bot, because customers' imagination sometimes comes up with unexpected requests I had not provided for. But the basic ones are covered.

There is also a version for VK, but it somehow did not catch on.

Now I travel calmly and explore the world without fearing that something will break and I will be unable to find out about it or fix it.
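
The core of such a bot fits in a few lines. The sketch below is not my production code, only an illustration of the idea: the command names (/restart, /kill, /block), the OWNER_CHAT_ID variable and the DRY_RUN switch are all invented for this example, and receiving the messages from Telegram is left out.

```shell
# Chat ID of the only person the bot obeys (placeholder value).
OWNER_CHAT_ID="${OWNER_CHAT_ID:-123456789}"

# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Handle one incoming message: $1 is the sender's chat ID, $2 the text.
handle_message() {
    local chat_id="$1"
    local text="$2"
    local cmd
    local arg
    # Silently ignore everyone except the owner
    if [ "$chat_id" != "$OWNER_CHAT_ID" ]; then
        return 0
    fi
    cmd=${text%% *}
    arg=""
    if [ "$cmd" != "$text" ]; then
        arg=${text#* }
    fi
    # Accept a single plain argument only: no spaces, no leading dash
    case "$arg" in
        ''|-*|*[!A-Za-z0-9._:-]*)
            echo "bad or missing argument"
            return 1
            ;;
    esac
    case "$cmd" in
        /restart) run systemctl restart "$arg" ;;
        /kill)    run kill "$arg" ;;
        /block)   run iptables -I INPUT -s "$arg" -j DROP ;;
        *)        echo "unknown command: $cmd" ;;
    esac
}
```

The fixed list of commands is deliberate: the bot never executes arbitrary text from a chat, only the few actions it was taught.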
