MySQL process monitoring script in Perl

Hello.

For more than five years I have been working as a system administrator in a hosting company, serving more than a hundred servers with freebsd and centos. During this time, many self-written scripts have accumulated that make my life easier. I want to share these scripts with the community, and listening to healthy criticism will never hurt.

Background.

Many years ago, a friend of mine, having a surplus of money in his wallet, bought server hardware in the form of a single-unit Intel server, placed it in the cheapest data center (as it is fashionable to call it now - on colocation), and began to post his sites there first, then started to distribute hosting to friends, friends began to post their pages, then pulled up the sites of their friends and employers. Since my friend was poorly versed in installing and configuring LAMP, without further ado, he installed Cpanel WHM, but I was attracted to administer all this economy, which is called “for food,” as a free labor force, I also needed hosting.

For several years, a lot of sites took root on the server, the load grew, with which I fought with varying success. Periodically, problems began with mysql. Some users generated slow requests that blocked subsequent ones, while others produced multi-story requests with a bunch of JOINs, which, due to the still not fixed bug, hung in the statistics state, while consuming processor resources. Ultimately, mysql gobbled up processes and stopped responding. There was a need for a monitoring script to look at the list of mysql processes and sound the alarm in case of an emergency.

First I wrote a script in bash. Then, when I had to get to know pearl by the nature of my main job, I rewrote it in Perl.

As practice shows, during normal operation, the mysql server simultaneously processes less than five "slow" queries, even for highly loaded servers, and if there are any, then this is an occasion to study these queries. Of course, you also need to analyze the log of slow queries periodically, but this is the topic for the next article, about a script that analyzes and even automatically builds simple indexes.

The logic of the monitoring script is simple. We will assume that the mysql server is at risk if at the time of verification more than ten (for example) “slow” queries are simultaneously running - more than one second in duration. We call this condition “critical”. If the condition is critical, then you need to sound the alarm.

Further practice has shown that it would be nice to sound the alarm if mysql has been in a precritical state for some time. That is, there are no 10 processes at the same time, but with every minute the number of long-running requests grows. For the precritical state, we take a figure, say, 5.

Run the script once a minute. We look at the list of processes, we consider everything that is not in the Sleep status and takes longer than one second. If the number is more than 10, then we send a letter to the administrator along with a list of processes. Save the resulting number to a file. We read the values ​​from this file for the 5 last values ​​- for the last 5 minutes, and if there were 5 pre-critical states in this period of time, then we send an email to the administrator.

Subsequently, a block was inserted into the script that nailed multi-storey queries that were suspended in the statistics state.

Actually, the script.

#!/usr/bin/perl
#use strict;
use DBI;
use DBD::mysql;
use POSIX;
($sysname, $hostname, $release, $version, $machine) = POSIX::uname();
my $slowtime=1; #  сколько секунд считать за медленный запрос
my $warnlevel=5; # сколько одновременных медленных запросов считать подозрительным
my $warncounter=5; # если  $warncounter подряд то высылаем письмо
my $alarmcounter=10; # если  одновременных медленных запросов >= $alarmcounter то высылаем письмо сразу
my $socket='/tmp/mysql56.sock'; # коннектимся по сокету
my $email="admin\@myemail.net"; # получатель письма
my $wrkdir='/tmp/';
my $procfile=$wrkdir.'alarm.proclist'; # фременный файл для списка процессов
my $datfile=$wrkdir.'alarm.dat'; # файл куда пишем количество одновременных медленных запросов 
my $pidfile=$wrkdir.'alarm.pid'; # pid файл. если mysql повиснет то скрипты мониторинга хотябы плодиться не будут
if (-e "$pidfile") {
  printf("pid file found. Exit.\n");
  exit(255);
}
open (PIDFILE,">$pidfile") || die "cant create $pidfile\n";
print PIDFILE  "$$\n";
close PIDFILE;
open (PROCFILE,">$procfile") || die "cant create $procfile\n";
my ($proc, $dbh, $sth, $totalcounter, $slowcounter, $sleepcounter, $user, $time, $state, $command, $info, $i);
until ($dbh = DBI->connect("DBI:mysql:mysql_socket=$socket", "user", "password")){
  unlink($pidfile);
  die("Can't connect: $DBI::errstr\n");
}
$sth = $dbh->prepare("SHOW FULL PROCESSLIST");
$sth->execute;
my @proclist=();
$totalcounter=$slowcounter=$sleepcounter=0;
while (my $row = $sth->fetchrow_hashref()) {
	$user=$row->{'User'}; $time=$row->{'Time'}; $state=$row->{'State'};
	$command=$row->{'Command'}; $info=$row->{'Info'};
	$totalcounter++;
	next if ($user =~ m/root/);
	if ($command =~ m/(Sleep|Delayed|Binlog)/){ $sleepcounter++; next; };
### убиваем зависшие запросы в состоянии statistics
	if ($state =~ m/statistics/ && $time > 5){
			$statinfo="$user: killed $mid: $dbuser | $db | $time | $state | $command | $info\n\n";
			$sth2 = $dbh->prepare("kill $mid");
			$sth2->execute;
			$sth2->finish;
			open (MAIL,"|/usr/sbin/sendmail -F$hostname $email");
			print MAIL "To:$email\nSubject:".$subj."Hanged query in the statistics state: $hostname, user $user \n\n";
			print MAIL $statinfo;
			close (MAIL);
	};
###
  if ($time>$slowtime) { $slowcounter++; }
  $info =~ s/[\r\n\t]+/ /g;
  push (@proclist,sprintf("%-24s | %4d | %s | %s |  %s  \n", $user, $time, $state, $command, $info));
  printf PROCFILE ("%-24s | %4d | %s | %s |  %s  \n", $user, $time, $state, $command, $info);
}
$sth->finish;
close PROCFILE;
#print "--- $slowcounter slow queries from total $totalcounter ($sleepcounter are sleep) ---- \n";
my @data=();
### read slowcounter timings from dat file
open (DATFILE,"<$datfile");
while(){
  my($line) = $_;
  chomp($line);
  push (@data,$line);
}
close(DATFILE);
### if dat file is smaller than warnlevel then fill timings by zeros
if (scalar(@data)<$warnlevel) {
  for $i ( 0 .. $warnlevel-scalar(@data) ) {
    push (@data,0);
  }
}
### shift timings with last slowcounter
push (@data,$slowcounter);
shift(@data);
### dumping slowcounter timings to dat file
open (DATFILE,"+>$datfile") || die "cant create $datfile\n";
foreach (@data) {
  print DATFILE "$_\n";
}
close(DATFILE);
### get number of bad states for last minutes
my $cnt=0;
foreach (@data) {
  if($_ >= $warnlevel) { $cnt++; }
}
my $subj=" ";
if ($slowcounter>=$alarmcounter) { # very critical state 
    $subj=" VERY ";
}
my $warnmessage="Critical state of $hostname! There was a $warncounter checks with at least $warnlevel long queries!\n";
if ($slowcounter>=$alarmcounter) { # very critical state 
    $warnmessage=$warnmessage."--- !!! Last check shows $slowcounter long queries!\n";
}
if (($cnt >= $warncounter) || $slowcounter>=$alarmcounter){
        open (MAIL,"|/usr/sbin/sendmail -F$hostname $email");
        print MAIL "To:$email\nSubject:".$subj."Critical state of $hostname\n\n";
        print MAIL $warnmessage;
        print MAIL "---------------------------------------------------------------------------------------------------\n";
        print MAIL "---    $slowcounter slow queries from total $totalcounter ($sleepcounter are sleep)                \n";
        print MAIL "---------------------------------------------------------------------------------------------------\n";
        foreach (@proclist) {
                print MAIL "$_";
        }
        close (MAIL);
}
unlink($pidfile);


I apologize for the comments in Nizhny Novgorod English.

Subsequently, the script was implemented on the mysql combat servers in the hosting company where I work, and helped to prevent the Mysql server from being denied service many times, and just reported a useless resource was wasted by some user.

the script works:

- when spam bots attack the forums forgotten by God. Under load, performance drops, forum tables begin to crash, requests are piled in the queue in the “Locked” status. From the script comes a very characteristic and intuitive list of processes;

- when a benchmark attack is performed on user sites in blind SQL injections;

- when mysql hangs stupidly, and it happens occasionally under load (one process works indefinitely, all the others just hang without any status, and accumulate until the connection limit is chosen) - the script works faster than the monitoring system that polls the mysql port ;

- when users have impressive tables in terms of data volume and so unoptimized queries that one query is executed for several seconds, or even minutes. The remaining queries to the table accumulate and wait for the queue in the Locked status. An unoptimized query is immediately visible in the letter, you can quickly see the explain query and build an index if necessary. If innodb is used, then the monitoring script also works on intensive slow queries, because their bundle is hanging in the status of “Sending data” or “Copying to tmp table”. Such requests in large numbers are very dangerous, since they greatly reduce the performance of the server as a whole;

- when multi-story JOIN requests hang. The script nails them automatically, but sometimes they are not killed - a reason to restart mysql;

- several times the script caught mysql hangs on specific queries, further study of which led to an update, after detecting a bug with similar queries on the bugtracker.

Occasionally, of course, false positives come if someone repairs, optimizes, or dumps a large table.

I would be glad if this script is useful to anyone.

Also popular now: