Migrating from Nagios to Icinga2 in Australia

Hello.


I am a Linux sysadmin. I moved from Russia to Australia on an independent professional visa in 2015, but this article is not about how to pull off the move itself (there are already enough of those; still, if there is interest, I will write about that too). I want to talk about how, working in Australia as a Linux ops engineer, I initiated a migration from one monitoring system to another. Specifically: Nagios => Icinga2.


The article is partly technical and partly about communicating with people and the problems that come with differences in culture and ways of working.


Unfortunately, the "code" tag does not highlight Puppet and YAML code, so I had to use "plaintext".


Nothing boded ill on the morning of December 21, 2016. As usual, I spent the first half hour of the working day reading Habr as an unregistered anonymous user, absorbing coffee, and came across this article.


Since my company used Nagios, I created a ticket in Redmine without thinking twice and dropped the link into the general chat, because I thought it was important. Initiative is punishable even in Australia, so the lead engineer hung this problem on me, since I was the one who discovered it.


Screen from Redmine

In our department, before stating an opinion, it is customary to propose at least one alternative, even if the choice is obvious, so I started by googling which monitoring systems are currently relevant. In Russia, at my previous job, I had my own self-written monitoring system: very primitive, but quite functional and covering every task assigned to it. Python, the St. Petersburg Polytechnic and the Metro rule. No, the Metro sucks. That is personal (11 years of work) and worthy of a separate article, but not now.


A little about the rules for making changes to the infrastructure configuration at my current job. We use Puppet, Gitlab and the Infrastructure as Code principle, so:


  • No manual changes over SSH, no editing files directly on virtual machines. In three years I have been told off for this many times, the last time a week ago, and I do not think it was the last. After all: fix one line in the config, restart the service, see whether the problem is solved: 10 seconds. Create a new branch in Gitlab, push the changes, wait for r10k to do its job on the Puppet master, run Puppet with --environment=mybranch and wait a couple of minutes until it all applies: 5 minutes minimum.
  • Any change is made by creating a Merge Request in Gitlab and needs approval from at least one team member. Major changes require two or three approvals, including the team lead's.
  • All changes are textual in one way or another (Puppet manifests and Hiera data are text); binary files are strongly discouraged and need good reasons to be approved.

So, the options I looked at:


  • Munin - if there are more than 10 servers in the infrastructure, administration turns into hell (according to this article; I had no great desire to verify that, so I took their word for it).
  • Zabbix - I had had my eye on it for a long time, back in Russia, but at the time it was overkill for my tasks. Here it had to be dropped because we use Puppet as the configuration manager and Gitlab as the version control system. At that time, as I understood it, Zabbix stored its entire configuration in a database, so it was unclear how to manage the configuration under those conditions and how to track changes.
  • Prometheus - judging by the mood in the department, this is what we will end up with eventually, but at the time I could not master it and could not demonstrate a genuinely working Proof of Concept, so I had to pass on it.
  • There were several other options that either required a complete redesign of the system or were in their infancy / abandoned, and were rejected for the same reasons.

In the end, I settled on Icinga2 for three reasons:


1 - compatibility with NRPE (the client service that runs checks on command from Nagios). This was very important, because at the time we had 135 virtual machines (165 of them now, in 2019) with a bunch of self-written services / checks, and redoing all of that would have been a massive pain. A minimal sketch of such a check is shown right after this list.
2 - all configuration files are text, which makes it easy to edit and to create merge requests where you can see what was added or removed.
3 - it is a lively and growing open-source project. We are very fond of open source and contribute what we can by creating Pull Requests and Issues to solve problems.
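
To give a feel for point 1, here is a minimal sketch (not our production code) of how an existing NRPE check could be exposed as an Icinga2 service through the icinga2 Puppet module used throughout this article; the check name check_raid and the target path are made up for illustration:


icinga2::object::service { 'raid_status':
  apply          => true,
  assign         => ['host.vars.type == linux'],
  check_command  => 'nrpe',                  # reuse the NRPE agent already deployed for Nagios
  check_interval => '5m',
  vars           => {
    'nrpe_command' => 'check_raid',          # the same command name NRPE was already running for Nagios
  },
  target         => '/etc/icinga2/conf.d/services.conf',
}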


So let's go, Icinga2.


The first thing I had to face was my colleagues' inertia. Everyone was used to Nagios / Najios (even here they could not agree on how to pronounce it) and the CheckMK interface. The Icinga interface looks completely different (a minus), but with filters you can flexibly configure what you want to see using literally any parameter (a plus, though I had to fight hard for that).


Filters

Note the ratio of the scroll bar size to the scroll area.


Second, everyone was used to seeing the entire infrastructure on one monitor, because CheckMK can work with several Nagios hosts, while Icinga could not (actually it can, but more on that below). An alternative was a thing called Thruk, but its design made every team member want to vomit, except one: the one who proposed it (not me).


Thruk Firebox - Unanimous Team Decision

After a couple of days of brainstorming, I proposed the idea of cluster monitoring: one master host in the production zone and two subordinates, one in dev/test and one external host located at another provider, so that we could monitor our services from the point of view of a client or an outside observer. This configuration let us see all the problems in one web interface and worked quite well, but Puppet... The problem with Puppet was that the master host now had to know about all the hosts and services / checks in the system and distribute them between zones (dev-test, staging-prod, ext). Sending changes through the Icinga API takes a couple of seconds, but compiling the Puppet catalog of all services for all hosts takes a couple of minutes. I still get blamed for this, although I have explained several times how it all works and why it takes so long.
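
Roughly, the master-plus-two-satellites layout described above can be declared through the same Puppet module; the sketch below is illustrative only (endpoint names and addresses are placeholders, and the real configuration is of course driven from Hiera):


# Hypothetical sketch of the monitoring cluster: one master in the production zone,
# one satellite in dev/test and one external satellite at another provider.
icinga2::object::endpoint { 'icinga-master.example':
  host   => '10.0.0.10',
  target => '/etc/icinga2/zones.conf',
}
icinga2::object::endpoint { 'icinga-devtest.example':
  host   => '10.0.1.10',
  target => '/etc/icinga2/zones.conf',
}
icinga2::object::endpoint { 'icinga-ext.example':
  host   => '203.0.113.10',
  target => '/etc/icinga2/zones.conf',
}
icinga2::object::zone { 'staging-prod':
  endpoints => ['icinga-master.example'],
  target    => '/etc/icinga2/zones.conf',
}
icinga2::object::zone { 'dev-test':
  endpoints => ['icinga-devtest.example'],
  parent    => 'staging-prod',
  target    => '/etc/icinga2/zones.conf',
}
icinga2::object::zone { 'ext':
  endpoints => ['icinga-ext.example'],
  parent    => 'staging-prod',
  target    => '/etc/icinga2/zones.conf',
}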


Third: a bunch of SnowFlakes (snowflakes), things that fall outside the general system because there is something special about them, so the general rules do not apply. This was solved by frontal assault: if there is an alert but everything is actually fine, you need to dig deeper and understand why it fires when it should not. Or the other way around: why Nagios is panicking but Icinga is not.


Fourth: Nagios had been working here for three years, and initially there was more trust in it than in my newfangled hipster system, so every time Icinga raised an alarm, nobody did anything until Nagios got worked up about the same issue. In practice Icinga very rarely raised real alarms earlier than Nagios did, and I consider this a serious flaw, which I will discuss in the "Conclusions" section.


As a result, going live was delayed by more than 5 months (planned for June 28, 2018; actual date December 3, 2018), mainly because of the "parity check": that nuisance where Nagios has a handful of services nobody had heard a peep from for the last couple of years, but which NOW suddenly went critical for no reason, and I had to explain why they were not on my dashboard and add them to Icinga so that the "parity check is complete" (every service / check in Nagios has a corresponding service / check in Icinga).


Implementation:
First, the Code vs Data war, Puppet style. All data, absolutely all of it, must live in Hiera and nowhere else. All code lives in .pp files. Variables, abstractions, functions: everything goes into pp.
As a result, we have a bunch of virtual machines (165 at the time of writing) and 68 web applications whose health and SSL certificate validity need to be monitored. But because of historical baggage, the information for monitoring the applications is taken from a separate gitlab repository, and the data format has not changed since Puppet 3, which creates additional configuration difficulties.
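
Before the full application code below, a tiny schematic of what the Code vs Data split looks like in practice: the class only declares typed parameters, Hiera supplies their values through automatic data binding (the keys follow the class name, just like the profiles::services::monitoring::config::services key in the Data section further down), and the .pp code only iterates over that data. The class name here is invented for illustration; the real class is shown later:


# Schematic only; the real class appears further below.
class profiles::services::monitoring::example_config (
  Hash $services,          # comes from Hiera: profiles::services::monitoring::example_config::services
  Hash $service_defaults,  # also from Hiera, never hard-coded in .pp
) {
  # Code, not data: walk the structure and declare resources.
  $services.each | String $service_group, Hash $s_list | {
    notice("would declare the checks of service group ${service_group} here")
  }
}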


Puppet code for the applications, beware
define profiles::services::monitoring::docker_apps(
  Hash $app_list,
  Hash $apps_accessible_from,
  Hash $apps_access_list,
  Hash $webhost_defaults,
  Hash $webcheck_defaults,
  Hash $service_overrides,
  Hash $targets,
  Hash $app_checks,
  )
{
#### APPS ####
  $zone = $name
  $app_list.each | String $app_name, Hash $app_data |
  {
    $notify_group = { 'notify_group' => ($webcheck_defaults[$zone]['notify_group'] + pick($app_data['notify_group'], {} )) } # adds notifications for default group (systems) + any group defined in int/pm_docker_apps.eyaml
    $data = merge($webhost_defaults, $apps_accessible_from, $app_data)
    $site_domain = $app_data['site_domain']
    $regexp = pick($app_data['check_regex'], 'html')        # Pick a regex to check
    $check_url = $app_data['check_url'] ? {
      undef   => { 'http_uri' => '/' },
      default => { 'http_uri' => $app_data['check_url'] }
    }
    $check_regex = $regexp ?{
      'absent' => {},
      default  => {'http_expect_body_regex' => $regexp}
    }
    $site_domain.each | String $vhost, Hash $vdata | {        # Split an app by domains if there are two or more
      $vhost_name = {'http_vhost' => $vhost}
      $vars = $data['vars'] + $vhost_name + $check_regex + $check_url
      $web_ipaddress = is_array($vdata['web_ipaddress']) ? {  # Make IP-address an array if it's not, because askizzy has 2 ips and it's an array
        true  => $vdata['web_ipaddress'],
        false => [$vdata['web_ipaddress']],
      }
      $access_from_zones = [$zone] + $apps_access_list[$data['accessible_from']] # Merge default zone (where the app is defined) and extra zones if they exist
      $web_ipaddress.each | String $ip_address | {            # For each IP (if we have multiple)
        $suffix = length($web_ipaddress) ? {                  # If we have more than one - add IP as a suffix to this hostname to avoid duplicating resources
          1       => '',
          default => "_${ip_address}"
        }
        $octets = split($ip_address, '\.')
        $ip_tag = "${octets[2]}.${octets[3]}" # Using last octet only causes a collision between nginx-vip 203.15.70.94 and ext. ip 49.255.194.94
        $access_from_zones.each | $zone_prefix |{
          $zone_target = $targets[$zone_prefix]
          $nginx_vip_name = "${zone_prefix}_nginx-vip-${ip_tag}" # If it's a host for ext - prefix becomes 'ext_' (ext_nginx-vip...)
          $nginx_host_vip = {
            $nginx_vip_name => {
              ensure        => present,
              target        => $zone_target,
              address       => $ip_address,
              check_command => 'hostalive',
              groups        => ['nginx_vip',],
            }
          }
          $ssl_vars = $app_checks['ssl']
          $regex_vars = $app_checks['http'] + $vars + $webcheck_defaults[$zone] + $notify_group
          if !defined( Profiles::Services::Monitoring::Host[$nginx_vip_name] ) {
            ensure_resources('profiles::services::monitoring::host', $nginx_host_vip)
          }
          if !defined( Icinga2::Object::Service["${nginx_vip_name}_ssl"] ) {
            icinga2::object::service {"${nginx_vip_name}_ssl":
              ensure         => $data['ensure'],
              assign         => ["host.name == $nginx_vip_name",],
              groups         => ['webchecks',],
              check_command  => 'ssl',
              check_interval => $service_overrides['ssl']['check_interval'],
              target         => $targets['services'],
              apply          => true,
              vars           => $ssl_vars
            }
          }
          if $regexp != 'absent'{
            if !defined(Icinga2::Object::Service["${vhost}${$suffix} regex"]){
              icinga2::object::service {"${vhost}${$suffix} regex":
                ensure          => $data['ensure'],
                assign          => ["match(*_nginx-vip-${ip_tag}, host.name)",],
                groups          => ['webchecks',],
                check_command   => 'http',
                check_interval  => $service_overrides['regex']['check_interval'],
                target          => $targets['services'],
                enable_flapping => true,
                apply           => true,
                vars            => $regex_vars
              }
            }
          }
        }
      }
    }
  }
}

The host and service configuration code looks just as awful:


monitoring / config.pp

class profiles::services::monitoring::config(
  Array $default_config,
  Array $hostgroups,
  Hash $hosts = {},
  Hash $host_defaults,
  Hash $services,
  Hash $service_defaults,
  Hash $service_overrides,
  Hash $webcheck_defaults,
  Hash $servicegroups,
  String $servicegroup_target,
  Hash $user_defaults,
  Hash $users,
  Hash $oncall,
  Hash $usergroup_defaults,
  Hash $usergroups,
  Hash $notifications,
  Hash $notification_defaults,
  Hash $notification_commands,
  Hash $timeperiods,
  Hash $webhost_defaults,
  Hash $apps_access_list,
  Hash $check_commands,
  Hash $hosts_api = {},
  Hash $targets = {},
  Hash $host_api_defaults = {},
)
{
  # Profiles::Services::Monitoring::Hostgroup <<| |>> # will be enabled when we move to icinga completely
#### APPS ####
  case $location {
    'int', 'ext': {
      $apps_by_zone = {}
    }
    'pm': {
      $int_apps         = hiera('int_docker_apps')
      $int_app_defaults = hiera('int_docker_app_common')
      $st_apps          = hiera('staging_docker_apps')
      $srs_apps         = hiera('pm_docker_apps_srs')
      $pm_apps          = hiera('pm_docker_apps') + $st_apps + $srs_apps
      $pm_app_defaults  = hiera('pm_docker_app_common')
      $apps_by_zone = {
        'int' => $int_apps,
        'pm'  => $pm_apps,
      }
      $app_access_by_zone = {
        'int' => {'accessible_from' => $int_app_defaults['accessible_from']},
        'pm'  => {'accessible_from' => $pm_app_defaults['accessible_from']},
      }
    }
    default: {
      fail('Please ensure the node has $location fact set (int, pm, ext)')
    }
  }
  file { '/etc/icinga2/conf.d/':
    ensure  => directory,
    recurse => true,
    purge   => true,
    owner   => 'icinga',
    group   => 'icinga',
    mode    => '0750',
    notify  => Service['icinga2'],
  }
  $default_config.each | String $file_name |{
    file {"/etc/icinga2/conf.d/${file_name}":
      ensure => present,
      source => "puppet:///modules/profiles/services/monitoring/default_config/${file_name}",
      owner  => 'icinga',
      group  => 'icinga',
      mode   => '0640',
    }
  }
  $app_checks = {
    'ssl' => $services['webchecks']['checks']['ssl']['vars'],
    'http' => $services['webchecks']['checks']['http_regexp']['vars']
  }
  $apps_by_zone.each | String $zone, Hash $app_list | {
    profiles::services::monitoring::docker_apps{$zone:
      app_list             => $app_list,
      apps_accessible_from => $app_access_by_zone[$zone],
      apps_access_list     => $apps_access_list,
      webhost_defaults     => $webhost_defaults,
      webcheck_defaults    => $webcheck_defaults,
      service_overrides    => $service_overrides,
      targets              => $targets,
      app_checks           => $app_checks,
    }
  }
####    HOSTS    ####
  # Profiles::Services::Monitoring::Host <<| |>> # This is for spaceship invasion when it's ready.
  $hosts_has_large_disks = query_nodes('mountpoints.*.size_bytes >= 1099511627776')
  $hosts.each | String $hostgroup, Hash $list_of_hosts_with_settings | {           # Splitting site lists by hostgroups - docker_host/gluster_host/etc
    $list_of_hosts_in_group = $list_of_hosts_with_settings['hosts']
    $hostgroup_settings     = $list_of_hosts_with_settings['settings']
    $merged_hostgroup_settings = deep_merge($host_defaults, $list_of_hosts_with_settings['settings'])
    $list_of_hosts_in_group.each | String $host_name, Hash $host_settings |{  # Splitting grouplists by hosts
      # Is this host in the array $hosts_has_large_disks ? If so set host.vars.has_large_disks
      if ( $hosts_has_large_disks.reduce(false) | $found, $value| { ( $value =~ "^${host_name}" ) or $found } ) {
        $vars_has_large_disks = { 'has_large_disks' => true }
      } else {
        $vars_has_large_disks = {}
      }
      $host_data = deep_merge($merged_hostgroup_settings, $host_settings)
      $hostgroup_settings_vars = pick($hostgroup_settings['vars'], {})
      $host_settings_vars = pick($host_settings['vars'], {})
      $host_notify_group = delete_undef_values($host_defaults['vars']['notify_group'] + $hostgroup_settings_vars['notify_group'] + $host_settings_vars['notify_group'])
      $host_data_vars = delete_undef_values(deep_merge($host_data['vars'] , {'notify_group' => $host_notify_group}, $vars_has_large_disks)) # Merging vars separately
      $hostgroups = delete_undef_values([$hostgroup] + $host_data['groups'])
      profiles::services::monitoring::host{$host_name:
        ensure             => $host_data['ensure'],
        display_name       => $host_data['display_name'],
        address            => $host_data['address'],
        groups             => $hostgroups,
        target             => $host_data['target'],
        check_command      => $host_data['check_command'],
        check_interval     => $host_data['check_interval'],
        max_check_attempts => $host_data['max_check_attempts'],
        vars               => $host_data_vars,
        template           => $host_data['template'],
      }
    }
  }
  if !empty($hosts_api){                                                                # All hosts managed by API
    $hosts_api.each | String $zone, Hash $hosts_api_zone | {                            # Split api hosts by zones
      $hosts_api_zone.each | String $hostgroup, Hash $list_of_hosts_with_settings | {   # Splitting site lists by hostgroups - docker_host/gluster_host/etc
        $list_of_hosts_in_group = $list_of_hosts_with_settings['hosts']
        $hostgroup_settings     = $list_of_hosts_with_settings['settings']
        $merged_hostgroup_settings = deep_merge($host_api_defaults, $list_of_hosts_with_settings['settings'])
        $list_of_hosts_in_group.each | String $host_name, Hash $host_settings |{        # Splitting grouplists by hosts
          # Is this host in the array $hosts_has_large_disks ? If so set host.vars.has_large_disks
          if ( $hosts_has_large_disks.reduce(false) | $found, $value| { ( $value =~ "^${host_name}" ) or $found } ) {
            $vars_has_large_disks = { 'has_large_disks' => true }
          } else {
            $vars_has_large_disks = {}
          }
          $host_data = deep_merge($merged_hostgroup_settings, $host_settings)
          $hostgroup_settings_vars = pick($hostgroup_settings['vars'], {})
          $host_settings_vars = pick($host_settings['vars'], {})
          $host_api_notify_group = delete_undef_values($host_defaults['vars']['notify_group'] + $hostgroup_settings_vars['notify_group'] + $host_settings_vars['notify_group'])
          $host_data_vars = delete_undef_values(deep_merge($host_data['vars'] , {'notify_group' => $host_api_notify_group}, $vars_has_large_disks))
          $hostgroups = delete_undef_values([$hostgroup] + $host_data['groups'])
          if defined(Profiles::Services::Monitoring::Host[$host_name]){
            $hostname = "${host_name}_from_${zone}"
          } else {
            $hostname = $host_name
          }
          profiles::services::monitoring::host{$hostname:
            ensure             => $host_data['ensure'],
            display_name       => $host_data['display_name'],
            address            => $host_data['address'],
            groups             => $hostgroups,
            target             => "${host_data['target_base']}/${zone}/hosts.conf",
            check_command      => $host_data['check_command'],
            check_interval     => $host_data['check_interval'],
            max_check_attempts => $host_data['max_check_attempts'],
            vars               => $host_data_vars,
            template           => $host_data['template'],
          }
        }
      }
    }
  }
#### END OF HOSTS ####
####   SERVICES   ####
  $services.each | String $service_group, Hash $s_list |{             # Service_group and list of services in that group
    $service_list = $s_list['checks']                                 # List of actual checks, separately from SG settings
    $service_list.each | String $service_name, Hash $data |{
      $merged_defaults = merge($service_defaults, $s_list['settings']) # global service defaults + service group defaults
      $merged_data = merge($merged_defaults, $data)
      $settings_vars = pick($s_list['settings']['vars'], {})
      $this_service_vars = pick($data['vars'], {})
      $all_service_vars = delete_undef_values($service_defaults['vars'] + $settings_vars + $this_service_vars)
      # If we override default check_timeout, but not nrpe_timeout, make nrpe_timeout the same as check_timeout
      if ( $merged_data['check_timeout'] and ! $this_service_vars['nrpe_timeout'] ) {
        # NB: Icinga will convert 1m to 60 automatically!
        $nrpe = { 'nrpe_timeout' => $merged_data['check_timeout'] }
      } else {
        $nrpe = {}
      }
      # By default we use nrpe and all commands are run via nrpe. So vars.nrpe_command = $service_name is a default value
      # If it's server-side Icinga command - we don't need 'nrpe_command'
      # but there is no harm to have that var and the code is shorter
      if $merged_data['check_command'] == 'nrpe'{
        $check_command = $merged_data['vars']['nrpe_command'] ? {
          undef   => { 'nrpe_command' => $service_name },
          default => { 'nrpe_command' => $merged_data['vars']['nrpe_command'] }
        }
      }else{
        $check_command = {}
      }
      # Assembling $vars from Global Default service settings, servicegroup settings, this particular check settings and let's not forget nrpe settings.
      if $all_service_vars['graphite_template'] {
        $graphite_template = {'check_command' => $all_service_vars['graphite_template']}
      }else{
        $graphite_template = {'check_command' => $service_name}
      }
      $service_notify = [] + pick($settings_vars['notify_group'], []) + pick($this_service_vars['notify_group'], []) # pick is required everywhere, otherwise becomes "The value '' cannot be converted to Numeric"
      $service_notify_group = $service_notify ? {
        []      => $service_defaults['vars']['notify_group'],
        default => $service_notify
      } # Assign default group (systems) if no other groups are defined
      $vars = $all_service_vars + $nrpe + $check_command + $graphite_template + {'notify_group' => $service_notify_group}
      # This needs to be merged separately, because merging it as part of MERGED_DATA overwrites arrays instead of merging them, so we lose some "assign" and "ignore" values
      $assign = delete_undef_values($service_defaults['assign'] + $s_list['settings']['assign'] + $data['assign'])
      $ignore = delete_undef_values($service_defaults['ignore'] + $s_list['settings']['ignore'] + $data['ignore'])
      icinga2::object::service {$service_name:
        ensure             => $merged_data['ensure'],
        apply              => $merged_data['apply'],
        enable_flapping    => $merged_data['enable_flapping'],
        assign             => $assign,
        ignore             => $ignore,
        groups             => [$service_group],
        check_command      => $merged_data['check_command'],
        check_interval     => $merged_data['check_interval'],
        check_timeout      => $merged_data['check_timeout'],
        check_period       => $merged_data['check_period'],
        display_name       => $merged_data['display_name'],
        event_command      => $merged_data['event_command'],
        retry_interval     => $merged_data['retry_interval'],
        max_check_attempts => $merged_data['max_check_attempts'],
        target             => $merged_data['target'],
        vars               => $vars,
        template           => $merged_data['template'],
      }
    }
  }
#### END OF SERVICES ####
#### OTHER BORING STUFF ####
  $servicegroups.each | $servicegroup, $description |{
    icinga2::object::servicegroup{ $servicegroup:
      target       => $servicegroup_target,
      display_name => $description
    }
  }
  $hostgroups.each| String $hostgroup |{
    profiles::services::monitoring::hostgroup { $hostgroup:}
  }
  $notifications.each | String $name, Hash $settings |{
    $assign = pick($notification_defaults['assign'], []) + $settings['assign']
    $ignore = pick($notification_defaults['ignore'], []) + $settings['ignore']
    $merged_settings = $settings + $notification_defaults
    icinga2::object::notification{$name:
      target       => $merged_settings['target'],
      apply        => $merged_settings['apply'],
      apply_target => $merged_settings['apply_target'],
      command      => $merged_settings['command'],
      interval     => $merged_settings['interval'],
      states       => $merged_settings['states'],
      types        => $merged_settings['types'],
      assign       => delete_undef_values($assign),
      ignore       => delete_undef_values($ignore),
      user_groups  => $merged_settings['user_groups'],
      period       => $merged_settings['period'],
      vars         => $merged_settings['vars'],
    }
  }
  # Merging notification settings for users with other settings
  $users_oncall = deep_merge($users, $oncall)
  # Magic. Do not touch.
  create_resources('icinga2::object::user', $users_oncall, $user_defaults)
  create_resources('icinga2::object::usergroup', $usergroups, $usergroup_defaults)
  create_resources('icinga2::object::timeperiod',$timeperiods)
  create_resources('icinga2::object::checkcommand', $check_commands)
  create_resources('icinga2::object::notificationcommand', $notification_commands)
  profiles::services::sudoers { 'icinga_runs_ping_l2':
    ensure            => present,
    sudoersd_template => 'profiles/os/redhat/centos7/sudoers/icinga.erb',
  }
}

I am still working on this noodle code and improving it whenever possible. However, it is this code that made a simple and clear syntax in Hiera possible:


Data
profiles::services::monitoring::config::services:
  perf_checks:
    settings:
      check_interval: '2m'
      assign:
        - 'host.vars.type == linux'
    checks:
      procs: {}
      load: {}
      memory: {}
      disk:
        check_interval: '5m'
        vars:
          notification_period: '24x7'
      disk_iops:
        vars:
          notifications:
            - 'silent'
      cpu:
        vars:
          notifications:
            - 'silent'
      dns_fqdn:
        check_interval: '15m'
        ignore:
          - 'xenserver in host.groups'
        vars:
          notifications:
            - 'silent'
      iftraffic_nrpe:
        vars:
          notifications:
            - 'silent'
  logging:
    settings:
      assign:
        - 'logserver in host.groups'
    checks:
      rsyslog: {}
      nginx_limit_req_other: {}
      nginx_limit_req_s2s: {}
      nginx_limit_req_s2x: {}
      nginx_limit_req_srs: {}
      logstash: {}
      logstash_api:
        vars:
          notifications:
            - 'silent'

All checks are divided into groups; each group has default settings such as where and how often to run the checks, which notifications to send and to whom.


In each check you can override any option, and all of it eventually layers on top of the default settings for all checks. That is why such noodles are written in config.pp: the global defaults are merged with the group settings and then with each individual check.
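
For example, for the disk check from the Hiera sample above, the layering works out roughly like this (a simplified sketch; the global default values here are invented for illustration, not the exact production numbers):


# Simplified illustration of the merge order used in config.pp:
$service_defaults = { 'check_interval' => '1m', 'vars' => { 'notify_group' => ['systems'] } }    # global defaults (values invented)
$group_settings   = { 'check_interval' => '2m', 'assign' => ['host.vars.type == linux'] }        # perf_checks group settings
$check_data       = { 'check_interval' => '5m', 'vars' => { 'notification_period' => '24x7' } }  # the disk check itself

$merged_defaults = merge($service_defaults, $group_settings)  # group settings override the global defaults
$merged_data     = merge($merged_defaults, $check_data)       # the individual check wins last
# => check_interval ends up '5m' and assign comes from the group; vars are merged separately in the
#    real code so that notify_group and other vars are combined rather than overwritten.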


Another very important change was the ability to use functions in the settings, for example a function that picks the port, address and URL for the http_regexp check.


http_regexp:
  assign:
    - 'host.vars.http_regex'
    - 'static_sites in host.groups'
  check_command: 'http'
  check_interval: '1m'
  retry_interval: '20s'
  max_check_attempts: 6
  vars:
    http_port: '{{ if(host.vars.http_port) { return host.vars.http_port } else { return 443 } }}'
    notification_period: 'host.vars.notification_period'
    http_vhost: '{{ if(host.vars.http_vhost) { return host.vars.http_vhost } else { return host.name } }}'
    http_ssl: '{{ if(host.vars.http_ssl) { return false } else { return true } }}'
    http_expect_body_regex: 'host.vars.http_regex'
    http_uri: '{{ if(host.vars.http_uri) { return host.vars.http_uri } else { return "/" } }}'
    http_onredirect: 'follow'
    http_warn_time: 8
    http_critical_time: 15
    http_timeout: 30
    http_sni: true

This means: if the host definition has an http_port variable, use it, otherwise use 443. For example, the Jabber web interface sits on 9090 and Unifi on 7443.
http_vhost means ignoring DNS and using this address instead.
If a URI is specified on the host, follow it; otherwise take "/".


A funny story came out of http_ssl: the damn thing refused to turn off on demand. I stupidly banged my head against this line for a long time until it dawned on me that the variable in the host definition:


http_ssl: false

is substituted into the expression


if(host.vars.http_ssl) { return false } else { return true }

as false, and in the end it turns into


if(false) { return false } else { return true }

that is, the SSL check is always active. It was solved by changing the syntax:


http_ssl: no

Conclusions:


Pros:


  • We now have a single monitoring system instead of the two we ran in parallel for the last 7-8 months, or the one outdated and vulnerable system before that.
  • The data structure for hosts / services (checks) is now (in my opinion) much more readable and understandable. For others this turned out to be less obvious, so I had to write a couple of pages in the local wiki explaining how it all works and what to edit where.
  • Checks can be configured flexibly with variables and functions; for example, for the http_regexp check the pattern to look for, the return code, the URL and the port can all be set in the host settings.
  • There are several dashboards, each with its own configurable list of displayed alerts, all of it managed through Puppet and merge requests.

Cons:


  • Team inertia: Nagios just worked, worked and worked, while "this Icinga of yours" constantly glitches and lags. And how do I look at the history here? Oh damn, it does not refresh... (A real problem: the alert history does not refresh automatically, only on F5.)
  • System inertia: when I click "check now" in the web interface, the result depends on the weather on Mars, especially for complex services that take tens of seconds to run. A result like that is business as usual.
  • Overall, across six months of statistics from running the two systems side by side, Nagios always reacted faster than Icinga, and that annoyed me a lot. It seems to me something is off with the timers there, and a check scheduled every five minutes actually runs every 5:30 or something like that.
  • If you restart the service at any point in time (systemctl restart icinga2), all checks that were running at that moment raise a critical alert on screen, and from the outside it looks as if absolutely everything is down (a confirmed bug).

But on the whole, it works.

