Automate disk replacement with Ansible

    Hello! I work as a lead system administrator at OK and am responsible for the stable operation of the portal. I want to talk about how we built a process for automatically replacing disks, and then gradually excluded the administrator from this process and replaced him with a bot.

    This article is a sort of transcript of a talk at HighLoad++ 2018.

    Building a disk replacement process

    First a few numbers

    OK is a gigantic service used by millions of people. It is served by about 7 thousand servers located in 4 different data centers. These servers hold more than 70 thousand drives. If you stacked them on top of each other, you would get a tower more than 1 km high.

    Hard drives are the server component that fails most often. At such volumes we have to replace about 30 disks per week, and this procedure had become a rather unpleasant routine.


    We have introduced full-fledged incident management in our company. We record every incident in Jira, then resolve and analyze it. If an incident affected users, we always think about how to respond faster in such cases, how to reduce the impact, and, of course, how to prevent a recurrence.

    Drives are no exception. Their status is monitored by Zabbix. We watch Syslog messages for write/read errors, analyze the status of HW/SW RAIDs, monitor SMART, and calculate wear for SSDs.

    How disks were replaced before

    When a trigger fires in Zabbix, an incident is created in Jira and automatically assigned to the appropriate engineers in the data centers. We do this with all HW incidents, that is, those that require some kind of physical work on equipment in the data center.
    A data center engineer is a person who solves hardware issues and is responsible for installing, maintaining, and dismantling servers. Having received a ticket, the engineer gets to work. In disk shelves he replaces disks on his own. But if he does not have access to the desired device, the engineer turns to the system administrators on duty for help. First of all, the disk has to be removed from rotation: the necessary changes are made on the server, the application is stopped, the disk is unmounted.

    The system administrator on duty is responsible for the operation of the entire portal during his shift. He investigates incidents, fixes things, and helps developers with small tasks. He does not deal only with hard drives.

    Previously, data center engineers communicated with the system administrator via chat. Engineers sent links to Jira tickets, the administrator followed them and kept a log of work in some notepad. But chats are inconvenient for such tasks: information there is unstructured and quickly lost. And the administrator could simply step away from the computer and not respond to requests for a while, leaving the engineer standing at the server with a stack of disks, waiting.

    But the worst thing was that the administrators did not see the whole picture: which disk incidents existed, where a problem could potentially arise. This is because we hand all HW incidents over to engineers. Yes, it was possible to display all incidents on the admin dashboard. But there are a lot of them, and the administrator was involved in only some of them.

    In addition, the engineer could not prioritize correctly, because he knows nothing about the purpose of specific servers or the distribution of data across drives.

    New replacement procedure

    The first thing we did was move all disk incidents into a separate type, “HW-disk”, and add the fields “block device name”, “size”, and “disk type” to it, so that this information would be saved in the ticket and would not have to be constantly passed around in chat.

    We also agreed that we would replace only one disk per incident. This greatly simplified automation, statistics collection, and the work itself.

    In addition, we added a “responsible administrator” field. The system administrator on duty is filled in there automatically. This is very convenient, because now the engineer always sees who is responsible; there is no need to check the calendar and search. It was this field that allowed us to put tickets on the administrator’s dashboard in which his help might be needed.

    To make sure all participants got the maximum benefit from the innovations, we created filters and dashboards and told everyone about them. When people understand the changes, they do not distance themselves from them as something unnecessary. It is important for an engineer to know the rack number where the server is located, and the size and type of the disk. The administrator needs, first of all, to understand which server group this is and what the impact of replacing a disk might be.

    Having the fields displayed is convenient, but it did not free us from the need to use chats. For that, we had to change the workflow.

    It used to be like this:

    Today, engineers continue to work like this when they don’t need administrator help.

    The first thing we did was introduce a new Investigate status. A ticket is in this status when the engineer has not yet decided whether he will need an administrator. Through this status the engineer can pass the ticket to the administrator. In addition, we mark tickets with this status when a disk replacement is required but the disk itself is not on site. This happens with CDNs and remote sites.

    We also added a Ready status. The ticket is moved to it after replacing the disk: everything has already been done, but the HW/SW RAID is still synchronizing on the server, which can take quite a while.

    If an administrator is involved, the scheme is a bit more complicated.

    From the Open status a ticket can be moved by both a system administrator and an engineer. In the In progress status the administrator removes the disk from rotation so the engineer can simply pull it out: he turns on the drive indicator light, unmounts the disk, and stops applications, depending on the specific server group.

    Then the ticket is moved to Ready to change: this is the signal to the engineer that the disk can be pulled out. All the fields in Jira are already filled in; the engineer knows the type and size of the disk. This data is entered either automatically at the previous status or by the administrator.

    After replacing the disk, the ticket is moved to the Changed status. We check that the correct disk has been inserted, do the partitioning, launch the application, and run some data recovery tasks. The ticket can also be moved to the Ready status; in that case the administrator remains responsible, because he put the disk into rotation. The full scheme looks like this.

    Adding the new fields made our life much easier. People began to work with structured information; it became clear what to do and at what stage. Priorities became much more relevant, since they are now set by the administrator.

    The need for chats has disappeared. Of course, the administrator can still write to the engineer “this one needs to be replaced faster” or “it’s already evening, will you have time to replace it?”. But we no longer discuss these issues in chat daily.

    Disks began to be replaced in batches. If the administrator comes to work a bit early, has free time, and nothing has happened yet, he can prepare a number of servers for replacement: fill in the fields, remove disks from rotation, and hand the tasks over to the engineer. The engineer arrives at the data center later, sees the tasks, takes the necessary drives from the warehouse, and replaces them all at once. As a result, the replacement speed has increased.
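    The status flow described above can be sketched as a small transition map. The status names come from our workflow; the code itself is only an illustration, not part of the bot.

```python
# Hypothetical sketch of the disk-replacement workflow described above.
# Status names are from the article; the transition map itself is illustrative.
TRANSITIONS = {
    "Open":            {"Investigate", "In progress"},
    "Investigate":     {"In progress"},
    "In progress":     {"Ready to change"},   # admin removes the disk from rotation
    "Ready to change": {"Changed"},           # engineer physically swaps the disk
    "Changed":         {"Ready"},             # disk back in rotation, RAID syncing
    "Ready":           {"Closed"},            # synchronization finished
}

def can_transition(current: str, new: str) -> bool:
    """Return True if the workflow allows moving a ticket from current to new."""
    return new in TRANSITIONS.get(current, set())
```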

    Lessons Learned in Building Workflow

    • When building a procedure, you need to collect information from different sources.
      Some of our administrators did not know that engineers replaced disks on their own. Some thought that engineers monitored MD RAID synchronization, although some of them did not even have access to it. Some lead engineers did it, but not always, because the process was not described anywhere.
    • The procedure should be simple and straightforward.
      It is hard for a person to keep many steps in his head. The most important neighboring statuses in Jira should be displayed on the main screen. You can rename them; for example, we call In progress Ready to change. The remaining statuses can be hidden in a drop-down menu so they do not clutter the view. But it is better not to restrict people, and to leave them the ability to make any transition.
    • Explain the value of the innovations.
      When people understand them, they accept the new procedure more readily. It was very important for us that people did not bypass the process, but followed it. We then built automation on top of it.
    • Wait, analyze, understand.
      It took us about a month to build the procedure: the technical implementation, meetings, and discussions. And adoption took more than three months. I watched people slowly start using the innovation. In the early stages there was a lot of negativity, but it was completely unrelated to the procedure itself or its technical implementation. For example, one administrator used not Jira itself but the Jira plugin in Confluence, and some things were not available to him. We showed him Jira, and his productivity increased, both on general tasks and on disk replacement.

    Drive Replacement Automation

    We approached the automation of disk replacement several times. We already had some groundwork and scripts, but they all worked either interactively or in manual mode and had to be launched by hand. Only after introducing the new procedure did we realize that it was exactly what we had been missing.

    Since the replacement process is now divided into stages, each with an executor and a list of actions, we can turn on automation in stages rather than all at once. For example, the simplest stage, Ready (checking RAID/data synchronization), can easily be delegated to the bot. When the bot learns a little, it can be given a more responsible task: putting the disk into rotation, and so on.

    Zoo setups

    Before talking about the bot, let us take a short tour of our zoo of setups. First of all, it exists because of the gigantic size of our infrastructure. Secondly, we try to choose the optimal hardware configuration for each service. We have about 20 hardware RAID models, mainly LSI and Adaptec, but there are also HP and DELL of different versions. Each RAID controller has its own management utility, and the set of commands and their output may differ from version to version of each RAID controller. Where HW RAID is not used, there may be mdraid.

    Almost all new installations are done without disk redundancy. We are trying to stop using hardware and software RAID, since we achieve redundancy at the level of data centers rather than servers. But of course there are many legacy servers that must be supported.

    Somewhere the disks behind the RAID controllers are passed through as raw devices; somewhere JBOD is used. There are configurations with a single system disk in the server, and if it needs to be replaced, the server has to be re-deployed with the OS and applications of the same versions installed, then configuration files added and applications launched. There are also many server groups where redundancy is provided not at the level of the disk subsystem, but directly in the applications themselves.

    In total, we have more than 400 unique server groups running about 100 different applications. To cover such a huge number of options, we needed a multifunctional automation tool, preferably with a simple DSL, so that it could be supported by more than just the person who wrote it.

    We chose Ansible because it is agentless: no infrastructure had to be prepared, and we could start quickly. In addition, it is written in Python, which is the accepted standard in our team.

    General scheme

    Let's look at the general automation scheme using one incident as an example. Zabbix detects that drive sdb has failed, the trigger fires, and a ticket is created in Jira. The administrator looks at it, sees that it is neither a duplicate nor a false positive, that is, the disk must be replaced, and moves the ticket to In progress.

    The DiskoBot application, written in Python, periodically polls Jira for new tickets. It notices that a new In progress ticket has appeared; the corresponding thread is triggered, which launches a playbook in Ansible (this is done for each status in Jira). In this case, Prepare2change is launched.

    Ansible goes to the host, removes the disk from rotation, and reports the status to the application through Callbacks.

    Based on the results, the bot automatically moves the ticket to Ready to change. The engineer receives a notification and goes to replace the disk, after which he moves the ticket to Changed.

    Following the scheme above, the ticket gets back to the bot, which launches another playbook, goes to the host, and puts the disk back into rotation. The bot closes the ticket. Hurray!

    Now let's talk about some of the components of the system.


    This application is written in Python. It selects tickets from Jira using JQL. Depending on its status, each ticket goes to the corresponding handler, which in turn launches the Ansible playbook matching that status.

    JQL and polling intervals are defined in the application configuration file.

        jql: '… status = Open and "Disk Size" is EMPTY'
        interval: 180

        jql: '…  and "Disk Size" is not EMPTY and "Device Name" is not EMPTY'

        jql: '… and (labels not in ("dbot_ignore") or labels is EMPTY)'
        interval: 7200

    For example, among tickets in the In progress status, only those with the Disk size and Device name fields filled in are selected. Device name is the name of the block device needed to run the playbook. Disk size is needed so the engineer knows what size disk to bring.

    And among tickets in the Ready status, tickets with the dbot_ignore label are filtered out. By the way, we use Jira labels both for such filtering, and for marking duplicate tickets and collecting statistics.

    If a playbook fails, the bot assigns the dbot_failed label in Jira so that it can be investigated later.
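    A minimal sketch of how such polling could be organized (the handler names, abbreviated JQL strings, and structure here are assumptions, not the real application code). Each Jira status gets its own JQL filter, polling interval, and handler that launches the Ansible playbook for that status:

```python
# Hypothetical sketch of Diskobot's per-status polling configuration.
POLLERS = [
    # (status, jql, poll interval in seconds, handler name)
    ("Open",        'status = Open and "Disk Size" is EMPTY',                   180,  "investigate"),
    ("In progress", '"Disk Size" is not EMPTY and "Device Name" is not EMPTY', 180,  "prepare2change"),
    ("Ready",       'labels not in ("dbot_ignore") or labels is EMPTY',        7200, "check_sync"),
]

def handler_for(status, handlers):
    """Return the handler registered for the given ticket status."""
    for st, _jql, _interval, name in POLLERS:
        if st == status:
            return handlers[name]
    raise KeyError("no handler for status %r" % status)
```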

    Interaction with Ansible

    The application interacts with Ansible through the Ansible Python API. We pass playbook_executor the file name and a set of variables. This allows us to keep the Ansible project in the form of regular yml files, rather than describing it in Python code.

    We also pass to Ansible, via extra_vars, the name of the block device, the ticket status, and callback_url with the issue key embedded in it; the latter is used for the HTTP callback.

    For each launch, a temporary inventory is generated, consisting of a single host and the group that host belongs to, so that group_vars are applied.
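    A minimal sketch of such inventory generation (the exact format we use is not shown in the article; plain INI is assumed here). The group name decides which group_vars Ansible applies:

```python
# Illustrative sketch: render a one-host temporary inventory so that the
# host's group_vars are applied during the playbook run.
def make_inventory(host: str, group: str) -> str:
    """Render a minimal INI inventory placing the host into its server group."""
    return "[{0}]\n{1}\n".format(group, host)
```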

    Here is an example of a task in which HTTP callback is implemented.

    We get the results of playbook runs via callbacks. They come in two types:

    • An Ansible callback plugin, which provides data on the results of a playbook run: which tasks were launched and whether they succeeded or failed. This callback is called at the end of the playbook.
    • An HTTP callback, to get information while the playbook is running. In Ansible, we perform a POST/GET request to our application.

    Via HTTP callbacks we pass variables that were defined during the playbook run and that we want to save and use in subsequent runs. We store this data in SQLite.
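    A hedged sketch of how such variables might be persisted (the table layout and function names are assumptions): variables arriving via the HTTP callback are stored in SQLite keyed by the Jira issue, so later runs can read them back.

```python
# Hypothetical sketch: persist callback variables per Jira issue in SQLite.
import json
import sqlite3

def save_vars(conn, issue_key, data):
    """Store a dict of playbook variables for the given issue."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incident_vars (issue TEXT PRIMARY KEY, vars TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO incident_vars VALUES (?, ?)",
        (issue_key, json.dumps(data)),
    )

def load_vars(conn, issue_key):
    """Load previously saved variables, or an empty dict if none exist."""
    row = conn.execute(
        "SELECT vars FROM incident_vars WHERE issue = ?", (issue_key,)
    ).fetchone()
    return json.loads(row[0]) if row else {}
```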

    We also use the HTTP callback to leave comments and change the ticket status.

    HTTP callback
    # Make callback to Diskobot App
    # Variables:
    #    callback_post_body: # A dict with the following keys. All keys are optional
    #       msg: If present, it will be posted to Jira as a comment
    #       data: If present, it will be saved in Incident.variables
    #       desire_state: Set desire_state for the incident
    #       status: If present, move the issue to that status
      - name: Callback to Diskobot app (jira comment/status)
        uri:
          url: "{{ callback_url }}/{{ devname }}"
          user: "{{ diskobot_user }}"
          password: "{{ diskobot_pass }}"
          force_basic_auth: True
          method: POST
          body: "{{ callback_post_body | to_json }}"
          body_format: json

    Like many tasks of the same type, we put it in a separate common file and include it when needed, so as not to repeat it constantly in playbooks. The callback_url used here has the issue key and host name embedded in it. When Ansible executes this POST request, the bot understands that it arrived as part of a particular incident.

    And here is an example from a playbook in which we remove a disk from an MD device:

      # Save mdadm configuration
      - include: common/callback.yml
        vars:
          callback_post_body:
            status: 'Ready to change'
            msg: "Removed disk from mdraid {{ mdadm_remove_disk.msg | comment_jira }}"
            data:
              mdadm_data: "{{ mdadm_remove_disk.removed }}"
              parted_info: "{{ parted_info | default() }}"
        when:
          - mdadm_remove_disk | changed
          - mdadm_remove_disk.removed

    This task moves the Jira ticket to the “Ready to change” status and adds a comment. The mdadm_data variable stores the list of md devices from which the disk was removed, and parted_info stores the partition table dump from parted.

    When the engineer inserts a new disk, we can use these variables to restore the partition dump and re-add the disk to the md devices it was removed from.
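    As an illustration only (the article does not show the restore playbook), the saved variables could be turned back into restore commands roughly like this. The shape of the md entries and the use of sfdisk for the partition dump are assumptions:

```python
# Hypothetical sketch: rebuild restore commands for a freshly inserted disk
# from the variables saved when the old disk was removed.
def restore_commands(devname, md_devices, dump_file):
    """Return shell commands restoring partitions and re-adding md members."""
    cmds = ["sfdisk /dev/%s < %s" % (devname, dump_file)]
    for md in md_devices:
        # each entry is assumed to look like {"array": "/dev/md2", "part": "/dev/sdb2"}
        cmds.append("mdadm %s --add %s" % (md["array"], md["part"]))
    return cmds
```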

    Ansible check mode

    Turning on automation was scary, so we decided to run all playbooks in dry run (check) mode, in which Ansible performs no actions on the servers and only emulates them.

    Such a run goes through a separate callback module, and the playbook result is saved in Jira as a comment.

    First, this allowed us to validate the bot and the playbooks. Second, it increased administrators' trust in the bot.

    When we passed validation and realized that Ansible could be run not only in dry run mode, we made a Run Diskobot button in Jira that launches the same playbook with the same variables on the same host, but in normal mode.

    In addition, the button is used to restart the playbook in the event of its failure.

    Playbooks Structure

    I already mentioned that depending on the status of the Jira ticket, the bot launches different playbooks.

    First, it is simply easier to organize things this way.
    Second, in some cases it is strictly necessary.

    For example, when replacing a system disk, you first need to go to the deployment system and create a task, and only after a correct deployment will the server be accessible via ssh so the application can be rolled onto it. If we did all this in one playbook, Ansible would not be able to execute it because the host is unreachable.

    We use Ansible roles for each server group. Here you can see how the playbooks are organized in one of them.

    This is convenient because it is immediately clear where each task is located. In main.yml, which is the entry point for the Ansible role, we can simply include tasks by ticket status, or general tasks needed everywhere, such as authentication or obtaining a token.


    Runs for tickets in the Investigate and Open statuses. The most important thing for this playbook is the block device name, and this information is not always available.

    To get it, we analyze the Jira summary, which contains the last value from the Zabbix trigger. If it contains the block device name, we are lucky. Or it may contain a mount point; then we have to go to the server, parse it, and work out the desired drive. The trigger can also pass a SCSI address or some other information. But sometimes there are no clues at all, and we have to analyze.
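    The first, lucky case can be sketched with a simple pattern match (the real trigger texts vary; the patterns and function name here are illustrative):

```python
# Hypothetical sketch: try to extract a block device name from a ticket summary.
import re

def guess_device(summary: str):
    """Return a block device name if the summary mentions one, else None."""
    # e.g. "I/O errors on /dev/sdb"
    m = re.search(r"/dev/(sd[a-z]+|nvme\d+n\d+)", summary)
    if m:
        return m.group(1)
    # bare device name without the /dev/ prefix
    m = re.search(r"\b(sd[a-z]+)\b", summary)
    return m.group(1) if m else None
```

    When this returns None, the bot has to fall back to the mount point or SCSI address, or leave the field for the administrator.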

    Having found the block device name, we collect information about the type and size of the disk to fill in the Jira fields. We also retrieve the vendor, model, firmware, ID, and SMART data, and insert all of this into a comment in the Jira ticket. The administrator and engineer no longer need to look for this data. :)


    Removing the disk from rotation and preparing it for replacement. This is the most difficult and crucial stage. This is where you could stop an application that must not be stopped, or pull out a disk that did not have enough replicas, thereby affecting users or losing data. This is where we have the most checks and chat notifications.

    In the simplest case, we are talking about removing a drive from HW / MD RAID.

    In more complex situations (in our storage systems), when redundancy is handled at the application level, we need to go to the application via its API, report the disk removal, deactivate it, and start recovery.

    We are now migrating massively to the cloud, and if the server is a cloud one, Diskobot calls the cloud API, says it is going to work with this minion (the server on which the containers run), and asks it to migrate all containers off this minion. At the same time, it turns on the drive indicator light so the engineer immediately sees which disk to pull out.


    After replacing a disk, we first check its availability.

    Engineers do not always install brand-new disks, so we added a check of the SMART values that we consider acceptable.

    The attributes we look at:
    Reallocated Sectors Count (5) < 100
    Current Pending Sector Count (197) == 0

    If the drive fails the check, the engineer is notified to replace it. If everything is in order, the indicator light turns off, the partitioning is applied, and the disk is put into rotation.
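    The acceptance check above boils down to a couple of comparisons. As a sketch (the thresholds are from the article; 5 and 197 are the standard SMART attribute IDs for Reallocated Sectors Count and Current Pending Sector Count):

```python
# Illustrative SMART acceptance check for a replacement disk.
def disk_acceptable(smart_attrs: dict) -> bool:
    """smart_attrs maps SMART attribute id -> raw value."""
    reallocated = smart_attrs.get(5, 0)       # Reallocated Sectors Count
    pending = smart_attrs.get(197, 0)         # Current Pending Sector Count
    return reallocated < 100 and pending == 0
```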


    The simplest case: checking HW/SW RAID synchronization or waiting for data synchronization in the application to finish.
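    For mdraid, such a check can be as simple as looking at /proc/mdstat. A minimal sketch (keywords are from the mdstat output format; the function itself is illustrative):

```python
# Illustrative check for the Ready stage: is any md array still rebuilding?
def raid_syncing(mdstat_text: str) -> bool:
    """True while a resync/recovery/reshape is in progress in /proc/mdstat."""
    return any(word in mdstat_text for word in ("resync", "recovery", "reshape"))
```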

    Application API

    I have mentioned several times that the bot often calls application APIs. Of course, not all applications had the necessary methods, so we had to add them. Here are the most important methods we use:
    • Status. The status of a cluster or disk, to understand whether it can be worked with;
    • Start/stop. Activate or deactivate a disk;
    • Migrate/restore. Migrate and recover data during and after a replacement.
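    A hypothetical sketch of a client for such an API (the method names follow the list above; the URLs and payloads are assumptions, and the HTTP function is injected so it can be faked in tests):

```python
# Illustrative client for the per-application disk API described above.
class AppClient:
    def __init__(self, base_url, http_post):
        self.base_url = base_url
        self.http_post = http_post  # e.g. a thin wrapper around requests.post

    def status(self, disk_id):
        """Cluster/disk status: can this disk be worked with right now?"""
        return self.http_post("%s/disk/%s/status" % (self.base_url, disk_id))

    def stop(self, disk_id):
        """Deactivate the disk before physical replacement."""
        return self.http_post("%s/disk/%s/stop" % (self.base_url, disk_id))

    def migrate(self, disk_id):
        """Kick off data migration/recovery around the replacement."""
        return self.http_post("%s/disk/%s/migrate" % (self.base_url, disk_id))
```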

    Lessons Learned by Ansible

    I really love Ansible. But often, when I look at various open source projects and see how people write playbooks, I get a little scared: complex logic woven out of when/loop, and a lack of flexibility and idempotency due to heavy use of shell/command.

    We decided to simplify everything as much as possible, taking advantage of Ansible's modularity. At the highest level are playbooks; they can be written by any administrator or third-party developer who knows a little Ansible.

    - name: Blink disk
      become: True
      register: locate_action
      disk_locate:      # our custom in-house module; the name here is illustrative
          locate: '{{ locate }}'
          devname: '{{ devname }}'
          ids: '{{ locate_ids | default(pd_id) | default(omit) }}'

    If some logic is hard to implement in playbooks, we move it into an Ansible module or filter. Modules can be written in Python or in any other language.

    They are quick and easy to write. For example, the disk highlighting module, whose usage is shown above, is 265 lines long.

    At the lowest level is a library. For this project, we wrote a separate application, a kind of abstraction over hardware and software RAIDs that performs the corresponding requests.

    Ansible's greatest strengths are its simplicity and comprehensible playbooks. I believe you should take advantage of this, rather than generate scary yaml files with a huge number of conditions, shell code, and loops.

    If you want to repeat our experience with the Ansible API, keep in mind two things:

    • Playbook_executor, and playbooks in general, cannot be given a timeout. There is a timeout on the ssh session, but none on the playbook. If we try to unmount a drive that no longer exists in the system, the playbook will run indefinitely, so we had to wrap its launch in a separate wrapper and kill it by timeout.
    • Ansible is based on forked processes, so its API is not thread safe. We run all of our playbooks single-threaded.
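    The timeout wrapper from the first point can be sketched like this (the command run here is generic; our actual wrapper around the playbook launch is not shown in the article):

```python
# Illustrative timeout wrapper: run a command in a child process and kill it
# when the limit is exceeded, since a playbook run has no timeout of its own.
import subprocess

def run_with_timeout(cmd, timeout):
    """Return the exit code, or None if the command was killed on timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return None
```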

    As a result, we automated the replacement of about 80% of drives. Overall, the replacement speed has doubled. Today, the administrator only looks at the incident, decides whether the disk needs to be replaced, and then makes one click.

    But now we are starting to face another problem: some new administrators do not know how to change drives. :)
