Fighting Redundant Logging in Openstack

    Content: The heartbreaking growth rate of auth.log on hosts with neutron-plugin-openvswitch-agent. Analysis of the causes, method of elimination. A little about the work of sudo, PAM and its session.

    What will we talk about? Openstack is a cloud building platform. Neutron is the name of its subsystem responsible for the network, a fashionable hipster webdvanol , which is considered more perfect and functional than the first attempt called nova-networking. openvswitch-plugin is a neutron plugin that implements its functionality using Open vSwitch - a software switch that allows you to do smart things like GRE tunnels, bonding and port mirroring, imposing port rules inside an iptables-style virtual switch, etc.

    neutron-openvswitch-plugin-agent is one of the components of this plugin, working on all hosts that have at least some real relation to the transmission of network traffic of virtual machines. In other words, these are all compute nodes (where virtual machines work), networking nodes (which make the Internet for virtual machines). Only API servers and other service servers drop out of the list. Considering that the majority of the cloud consists of compute + networking, we can say a little coarser that this neutron-openvswitch-plugin-agent is installed on all hosts. Logstash is a centralized logging system; Elasticsearch is a database for working with these logs.

    For a timely response to software problems, all the logs of all applications should be collected and analyzed by the monitoring system. More about this we have already written. However, there may be too much good. It was quickly discovered that most of what was being collected from the hosts was ridiculous messages of the following form:

    server sudo: neutron: TTY = unknown; PWD = /; USER = root; COMMAND = / usr / bin / neutron-rootwrap /etc/neutron/rootwrap.conf ovs-vsctl --timeout = 2 --format = json - --columns = name, external_ids list Interface
    server sudo: pam_unix (sudo: session): session opened for user root by (uid = 108)
    server sudo: pam_unix (sudo: session): session closed for user root
    server sudo: nova: TTY = unknown; PWD = /; USER = root; COMMAND = / usr / bin / nova-rootwrap /etc/nova/rootwrap.conf chown 107 / var / lib / nova / instances / _base / 421f0808ac5dd178bc574eff6abe8df949edc319
    server sudo: pam_unix (sudo: session): session opened for user root by (uid = 107)
    server sudo: pam_unix (sudo: session): session closed for user root

    These messages were repeated every two (!) Seconds. From each server that has an agent. It turns out that from every hundred servers we will receive 730 GB of garbage per year. Logs are collected in Elasticsearch, that is, this, sorry, terabytes in the database. Good luck with query queries, so to speak.

    On the one hand, all this stuff is asking for a shutdown.

    On the other hand, completely disabling auth.log or its hard local rotation is a bad decision, because information about suspicious activity should be collected and stored as carefully as possible, moreover, preferably, away from the greedy handles of a compromised host (if such is announced).

    In addition to neutron-plugin-openvswitch-agent, nova-compute (control of virtualizers in Openstack) also appeared nearby, although not so often.

    PAM and its sessions

    Most likely, many readers use sudo. Most likely, many of those who use sudo do not really think about what happens when it starts.

    When sudo asks for a password, it can request authentication from the user. Most often it’s a password, but, for example, on my laptop it’s a fingerprint scanner (which is not used when logging in, because the password decrypts the disk when logging in, but it is great for safe sudo). In order not to implement code for working with passwords and other things in each program, PAM was created - a user authentication system that allows you to verify that the user knows the password without writing the / etc / shadow parser and supporting all other types of authorization (including LDAP, kerberos , OpenID, OTP, and any other crazy things you can think of).

    PAM sessions are a set of rituals that PAM performs with a user login. Among other things, the entry in auth.log is included in the ritual.

    Generally speaking, PAM sessions are not necessary for normal operation. Requesting a user’s verification and creating a session for him is three big differences (the third is the “close session” function). All self-respecting programs use them. But in our case, this causes some problems.

    More information about PAM, its history and principles of operation is available on the IBM website , for the meticulous - Linux-PAM System Administration Guide .

    But why so often?

    Let me remind you that the problem is that the entries in the logs from sudo and pam_session occurred almost constantly - every two seconds.

    What kind of messages about pam_session we have already found out. It remains to find out why neutron (and a little nova) should run something so often, and even with sudo?

    In the Openstack model, components are not particularly keen to learn the intricacies of underlying technologies. Instead of receiving information about changes from ovs-vswitchd, libvirtd, etc., and understanding for a long time “what could it mean and should I react to this?” A much more crude and effective approach is used: when is the api server - either the subsystem wants to know the status of the service, it sends a request through the message queue to it - and the service, in turn, requests their current state from the programs used.

    At the output, the server has a bunch of information: firstly, that the service is still alive (heartbeats), and secondly, information about the current configuration. Without any nuances, "missed a Very Important State Update."

    For neutron to work properly, it needs to know the state of the ports. In the context of openvswitch-plugin-agent, it means running ovs-vsctl show and the like to get the current status of all ports.

    Unfortunately, this requires privileges.

    To manage privileges, Openstack uses its own wrapper, the root-wraper, which performs a finer check of the arguments to run than regular sudo. But the root-wraper itself is written in Python (like everything in Openstack), and you cannot put suid on interpreted files. As a simple and reasonable solution, root-wraper is called as sudo root-wraper.

    And what to do?

    Our goal:
    1. Do not save root-warp execution information by nova / neutron users via sudo
    2. Save all other information.

    The first point comes down to two independent questions: disconnect recording information from sudo, and disconnect recording information from pam_session.

    Preliminary: if you are using old distributions, upgrade sudo to version 1.8.7 or later (Ubuntu 12.04 comes with 1.8.3, everything is fine with Ubuntu 14.04).

    Next - for neutron, in the /etc/sudoers.d/neutron_sudoers file we replace the lines:
    - Defaults:neutron !requiretty
    + Defaults:neutron !requiretty, !syslog, !pam_session

    For nova, the changes are similar in the /etc/sudoers.d/nova_sudoers file.

    A brief reference on sudoers syntax, for those who do not like BNF from man pages - an exclamation mark inverts the value, Defaults: apply to a specific section.
    • syslog - enables syslog logging !syslog, respectively, disables.
    • pam_session - indicates to create a new PAM session !pam_session- says not to create a new session. It was for pam_session that we needed a new version of sudo.

    Neutron polling rate control

    The obvious thought is, let’s lower the polling rate? From a practical point of view, a decrease in frequency means an increase in the interval between “made” and “seen changes”. If two seconds seem too frequent, then you can do, for example, ten. But putting a minute there is already too much (trouble will be with interactivity). At the same time, ten seconds will not save us much - instead of 43 thousand calls per day (for each server), we will have almost 9 thousand.

    For the reduction in frequency, this is the response of the config variable of neutron-server: report_interval(in seconds).

    Also popular now: