ESergey April 12, 2016 at 11:58

Free CRC error monitoring

Storage networks often develop unpleasant problems, such as a growing number of port errors or rising signal attenuation on SFP modules. Because a SAN infrastructure built from two or more fabrics is highly reliable, the probability of an outage is low, but a combination of adverse factors can still lead to data loss or degraded performance. Imagine, for example, that FOS is being upgraded on one fabric, so all traffic runs through the second one, and there, between the switch the disk array is connected to and the switch the servers are connected to, CRC errors start growing rapidly on one of the trunk ports. Or worse, the link drops because the signal level has fallen as the SFP module heated up, which in turn was caused by heavier utilization of that link. In such cases people usually say "Well, who could have known" or "No system is 100% reliable".

Competent architecture + proper monitoring = fault tolerance

With the problem identified, we need a set of measures to increase the fault tolerance of the storage network. They fall into two stages:

Bringing the storage architecture in line with SAN best practices
Deploying a monitoring system

There is plenty of literature and training on SAN best practices, and you can always invite experienced specialists from an integrator to review your setup. Choosing the right way to build a good SAN monitoring system is harder. The reason is tight vendor lock-in: the monitoring software is written by the switch manufacturer. I am not saying that Cisco Fabric Manager or Brocade Network Advisor is bad, but in my opinion they do not let you do everything that is needed to make the SAN more resilient.

What to do

So the task is set and a solution has to be found. This is often complicated by a lack of money in this year's budget, or by the integrator not knowing that suitable software exists. Neither is a real obstacle, because all the necessary components are freely available and you only need to make them work together.
Let us look at how to monitor CRC errors on the SAN ports of Brocade switches; most other parameters can be monitored in the same way.

Step One, Data Acquisition Protocol

The number of CRC errors can be obtained from the switches in several ways (snmp, https, telnet and ssh). I chose ssh: telnet is insecure and is better disabled, specific values are hard to extract over https, and the snmp tree can differ significantly between switch models and after an upgrade to a new FOS.

Step Two, Data Collection Method

Linux with bash + expect is best suited to working over ssh: this combination lets you connect via ssh and answer interactive prompts from a script.

Step Three, Where to Store

It makes little difference where the data is stored. Plain text files would do, but the example here uses mysql. All monitoring is implemented in two scripts:

porterrshow.sh - collects the data and looks for increases in the CRC error counters
expect.tcl - connects via ssh

and three txt files:
temp.txt - data buffer
switches.txt - list of SAN switches, one per line, in the format: name login password
crc.txt - report on the CRC errors found
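
Since every later step depends on switches.txt being well formed, it can help to check the list before using it. The following is only a minimal sketch: the function name list_switches is hypothetical, and it assumes that fields are separated by whitespace and that a password contains no spaces.

```shell
#!/bin/bash
# Sketch: validate a switches.txt-style list (one "name login password" per line).
# list_switches is a hypothetical helper, not part of the original scripts.

list_switches() {
    local file=$1 sw user pass extra lineno=0
    # "|| [ -n "$sw" ]" also handles a last line without a trailing newline
    while read -r sw user pass extra || [ -n "$sw" ]; do
        lineno=$((lineno + 1))
        [ -z "$sw" ] && continue                 # skip blank lines
        if [ -z "$user" ] || [ -z "$pass" ] || [ -n "$extra" ]; then
            echo "line $lineno: expected 'name login password'" >&2
            continue
        fi
        echo "$sw $user"                         # the password is deliberately not printed
    done < "$file"
}
```

Running list_switches /var/scripts/switches.txt prints the name and login of every valid entry and reports malformed lines on stderr.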

The SELECT query looks for growth in the CRC error counters compared with the data collected an hour earlier, so the script must be run once per hour, and each run must start and finish within the same clock hour. Because the query also requires both rows to have the same date, the first run after midnight has nothing to compare against. These restrictions are easy to get around by adding a column with the sequence number of the script run, or, at some cost in performance, by using a more complex condition on the time values. The expect, mysql and ssh client packages must be installed on the server. The dbname database must have a user named user with read and write permissions on the tablename table. The tablename table ends up holding data similar to the output of the porterrshow command on the switch, plus the date and time.
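
The same comparison can be tried on plain text files, which is the other storage option mentioned above. This is a sketch under stated assumptions, not the query used by the script: the function name crc_delta and the variable CRC_FIELD are hypothetical, both files are assumed to use the comma-separated temp.txt layout (switch,port,date,time,counters...), the CRC counter is assumed to sit in field 8, and counters are assumed to be plain integers.

```shell
#!/bin/bash
# Sketch: report ports whose CRC counter grew between two snapshots.
# CRC_FIELD is an assumption: field 8 is taken to hold crc_err in this layout.
CRC_FIELD=${CRC_FIELD:-8}

crc_delta() {
    local old=$1 new=$2
    awk -F, -v f="$CRC_FIELD" '
        NR == FNR { prev[$1 "," $2] = $f; next }   # first file: remember old counters
        {
            key = $1 "," $2                        # switch,port identifies a row
            if (key in prev && $f + 0 > prev[key] + 0)
                print $1 "," $2 "," ($f - prev[key])
        }
    ' "$old" "$new"
}
```

Unlike the SQL condition, which reports any change, this sketch reports only increases, so a counter that was cleared on the switch between two runs is not flagged.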

porterrshow.sh

#!/bin/bash
rm -f /var/scripts/temp.txt     # Remove the temp.txt left over from the previous run
while read -r sw user pass      # Read a line of switches.txt and split it into variables
do
n=0                             # Reset the line counter
while read -r line              # Read a line of expect.tcl output
do array[n]="$line"; n=$((n+1)) # Store the output lines in an array
done < <(/var/scripts/expect.tcl "$sw" "$user" "$pass" porterrshow)   # Feed the command output into the loop
if echo "${array[4]}" | grep -q '='   # Find the line where the useful output starts
then k=5
else k=4
fi
for i in $(seq $k $((n-1)))           # The last line contains no data
do read -r a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 <<< "${array[i]}"   # Split the line into values
# Build the load file (a18 and a19 only absorb any extra columns)
echo "$sw,${a1%:},$(date +%F),$(date +%T),$a2,$a3,$a4,$a5,$a6,$a7,$a8,$a9,$a10,$a11,$a12,$a13,$a14,$a15,$a16,$a17" >> /var/scripts/temp.txt
done
done < /var/scripts/switches.txt      # The list of switches
# Load the data into the database
mysql -uuser -ppass dbname << EOF
LOAD DATA LOCAL INFILE "/var/scripts/temp.txt" INTO TABLE tablename FIELDS TERMINATED BY ',';
EOF
# Check the CRC increment per port and write the report to a file
(mysql -uuser -ppass dbname << EOF
select new.switch, new.port, new.crcerr-old.crcerr from tablename new, tablename old where new.switch=old.switch and new.port=old.port and new.date=old.date and new.crcerr!=old.crcerr and new.crcerr!=0 and new.date=curdate() and hour(new.time)=hour(now()) and hour(old.time)=hour(now())-1;
EOF
) > /var/scripts/crc.txt
# Send the report to the administrator if the query returned any rows
if grep -q 'switch' /var/scripts/crc.txt
then
mailx -r SAN_Switch_CRC_Tester -s "CRC errors have increased" sanadmin1@mywork.com < /var/scripts/crc.txt
fi
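
The script above finds the first data row by a fixed line offset. An alternative that may be less sensitive to changes in the header is to pick out port rows by their leading "number:" field. The sketch below assumes that every port row starts that way; the function name rows_to_csv is hypothetical, and unlike the script it passes through every column it finds instead of a fixed number of them.

```shell
#!/bin/bash
# Sketch: turn raw porterrshow-style output on stdin into CSV rows.
# rows_to_csv is a hypothetical helper; port rows are assumed to start with "<number>:".

rows_to_csv() {
    local sw=$1 stamp_date=$2 stamp_time=$3
    # Output captured through a pty usually ends lines with CR LF, so drop the CR first
    tr -d '\r' | awk -v sw="$sw" -v d="$stamp_date" -v t="$stamp_time" '
        $1 ~ /^[0-9]+:$/ {
            port = $1
            sub(/:$/, "", port)                  # "12:" -> "12", like ${a1%:} in the script
            line = sw "," port "," d "," t
            for (i = 2; i <= NF; i++) line = line "," $i
            print line
        }
    '
}
```

It could be used as /var/scripts/expect.tcl "$sw" "$user" "$pass" porterrshow | rows_to_csv "$sw" "$(date +%F)" "$(date +%T)" >> /var/scripts/temp.txt, which also stamps all rows from one switch with the same date and time.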

expect.tcl

#!/usr/bin/expect
# Set the connection timeout to 10 seconds
set timeout 10
# Check the number of arguments passed to the script
if {$argc != 4} {
    puts "Usage: $argv0 host user pass command"
    exit 1
}
# Assign the arguments to variables
set host    [lindex $argv 0]
set user    [lindex $argv 1]
set pass    [lindex $argv 2]
set command [lindex $argv 3]
# Connect over SSH and run the command
spawn ssh -oStrictHostKeyChecking=no -oCheckHostIP=no $user@$host $command
# Answer the password prompt, then wait for the command to finish
expect "*assword:"
send "$pass\r"
expect eof
