It doesn't take long when you know how, or mass-launching servers with minimal effort
In this, our first article, we want to share our experience with a rarely discussed topic: rapidly deploying hundreds of servers for a high-load project.
How to deploy several hundred servers in a geographically remote data center with no physical access to equipment? How does Badoo solve this problem?
We will tell you about this in the following example.
Below we cover only the very first stage: configuring the server hardware. The focus is on how we completed a specific task quickly and on schedule, not on writing optimal scripts. If this topic interests you, we will be happy to follow up with a piece on installing an OS on the servers and setting up the working environment, which has subtleties of its own.
So, we were tasked with deploying several hundred new servers that had just arrived at our two data centers.
Upon successful completion, we should end up with:
- a description of the full hardware configuration of each server;
- confirmation that the delivered equipment matches what was ordered (it happens that a configuration arrives that nobody ordered);
- servers ready for OS and software installation (RAID and ROM firmware updated, RAID arrays configured, no hardware problems).
Our starting conditions were as follows:
- the servers are mounted in racks/cabinets and powered on;
- the factory logins and passwords are known;
- every server has an IPMI (management) interface;
- the servers have no presets from the manufacturer (in particular, RAID is not configured and no power settings are applied);
- every server is connected to the network equipment by at least two interfaces and sits in a previously known VLAN;
- new servers obtain their IP addresses dynamically, so they are reachable immediately after power-on;
- it is known how many servers of each configuration should have been delivered.
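That last point is what makes an automated sanity check possible: once we know how many servers of each configuration were ordered, the delivered counts can be diffed against the order. A toy sketch, with invented file names and configuration labels:

```shell
# Sketch: compare how many servers of each configuration were ordered
# against what actually arrived. File names and labels are made up.
cat > ordered.txt <<'EOF'
db-config
db-config
web-config
EOF
cat > delivered.txt <<'EOF'
db-config
web-config
web-config
EOF

# uniq -c turns each list into per-configuration counts; any diff output
# is a delivery that does not match the order.
if ! diff <(sort ordered.txt | uniq -c) <(sort delivered.txt | uniq -c); then
    echo "MISMATCH: delivered configurations differ from the order"
fi
```

Here uniq -c reduces each list to per-configuration counts, so any diff output pinpoints a configuration that was over- or under-delivered.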
Tasks like this are usually tackled with dd, a PXE server and rsync, plus ticketing and help from data-center staff. We approached the problem differently.
The solution we found involves some automation. Please note that all of the scripts below are for informational purposes only and do not claim to be perfect.
To complete the task we needed:
- several text files, deleted once the work was done;
- a few very simple scripts using expect;
- a configured network-boot server (in our case, xCAT);
- a configured, working OS image (any one will do, as long as it includes all the utilities we need);
- a customized hardware inventory system (in our case, the GLPI project).
We collected the iLO hostname and factory password of each server from the sticker on each machine, using a barcode scanner; the default login, Administrator, was known in advance. The resulting nodes file contained lines of the form:

ILOHOSTNAME1 ILOPASSWORD1
ILOHOSTNAME2 ILOPASSWORD2

Sticker example:
Now we could run a command that collected the hostname-to-IP correspondences for us:
for i in $(cat nodes | awk {'print $1'}); do j=$(cat nodes | grep $i | awk {'print $2'}); ssh DHCPD_SERVER_FQDN "sudo cat /var/log/messages | grep $i | tail -1 | sed 's/$/ '$j'/g'"; done
As a result, we got lines like the ones below and added them to the nodeswip file:
Jul 1 10:31:23 local@DHCPD_SERVER dhcpd: DHCPACK on 10.10.10.213 to 9c:8e:99:19:3a:68 (ILOUSE125NDBF) via 10.10.10.1 W3G554L7
Jul 1 10:31:35 local@DHCPD_SERVER dhcpd: DHCPACK on 10.10.10.210 to 9c:8e:99:19:b6:aa (ILOUSE125NDBA) via 10.10.10.1 BJCP691P
Jul 1 10:31:47 local@DHCPD_SERVER dhcpd: DHCPACK on 10.10.10.211 to 9c:8e:99:19:58:7c (ILOUSE125NDBG) via 10.10.10.1 67MG91SV
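For reference, these DHCPACK lines split on whitespace in a fixed way, which is what the field numbers in the later commands rely on: the iLO IP is field 8, the MAC is field 10, the hostname (in parentheses) is field 11, and the appended factory password is field 14. A quick sketch using one of the sample lines:

```shell
# Extract IP, MAC, hostname and factory password from one nodeswip line.
line="Jul 1 10:31:23 local@DHCPD_SERVER dhcpd: DHCPACK on 10.10.10.213 to 9c:8e:99:19:3a:68 (ILOUSE125NDBF) via 10.10.10.1 W3G554L7"

ip=$(echo "$line"   | awk '{print $8}')                # 10.10.10.213
mac=$(echo "$line"  | awk '{print $10}')               # 9c:8e:99:19:3a:68
host=$(echo "$line" | awk '{print $11}' | tr -d '()')  # ILOUSE125NDBF
pass=$(echo "$line" | awk '{print $14}')               # W3G554L7

echo "$ip $mac $host $pass"
```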
Now we needed to create a standard user, with a fixed set of rights, on the IPMI interfaces of all the new servers. At this point we also had to collect the MAC addresses of the net and mgm interfaces for later bookkeeping. To do this, we ran the command
for i in $(cat nodeswip | awk {'print $8'}); do j=$(grep $i nodeswip | awk {'print $14'}); expect expwip.sh $i $j | grep Port1NIC_MACAddress; done;
where the expect script expwip.sh looked like this:
#!/usr/bin/expect
set timeout 600
set ip [lindex $argv 0]
set pass [lindex $argv 1]
spawn ssh Administrator@$ip
set answ "$pass"
set comm1 "create /map1/accounts1 username=deployer password=PASSWORD name=deployer group=admin,config,oemhp_vm,oemhp_power,oemhp_rc"
expect "Administrator@$ip's password:"
send "$answ\r"
expect "hpiLO->"
send "$comm1\r"
expect "hpiLO->"
send "show /system1/network1/Integrated_NICs\r"
expect "hpiLO->"
send "exit\r"
expect eof
We added the resulting list of net-interface MAC addresses to a spreadsheet, which let us see all the correspondences at a glance.
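As an illustration of how such a correspondence table can be put together without manual copy-pasting, here is a minimal sketch. The file names, formats and sample values are invented for the example; the real data came from the DHCP log and the expwip.sh output.

```shell
# Illustrative input files: "IP MAC" pairs from the expect run and
# "IP HOSTNAME" pairs from the DHCP log. Real column sources will differ.
cat > ip_mac.txt <<'EOF'
10.10.10.213 9c:8e:99:19:3a:68
10.10.10.210 9c:8e:99:19:b6:aa
EOF
cat > ip_host.txt <<'EOF'
10.10.10.213 ILOUSE125NDBF
10.10.10.210 ILOUSE125NDBA
EOF

# join(1) needs sorted input; emit "ip,hostname,mac" CSV for the spreadsheet.
join <(sort ip_mac.txt) <(sort ip_host.txt) | awk '{print $1","$3","$2}' > servers.csv
cat servers.csv
```

join(1) matches the two files on the shared IP column; the awk step just reorders the fields into ip,hostname,mac.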
Then we configured the DHCP server and directed the servers to network boot. After that we had to reboot the IPMI interfaces so that they would take the addresses assigned to them. This was done with the command
expect reset_ilo.sh $i $j
where $i is the server address obtained earlier and $j is the factory administrator password.
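Run over the whole fleet, this amounts to a loop over the nodeswip file. The sketch below echoes the command instead of actually spawning expect, and uses a two-line sample file, but the field numbers are the real ones:

```shell
# Sketch: drive reset_ilo.sh from the nodeswip file built earlier.
# For illustration we only echo the command instead of spawning expect.
cat > nodeswip <<'EOF'
Jul 1 10:31:23 local@DHCPD_SERVER dhcpd: DHCPACK on 10.10.10.213 to 9c:8e:99:19:3a:68 (ILOUSE125NDBF) via 10.10.10.1 W3G554L7
Jul 1 10:31:35 local@DHCPD_SERVER dhcpd: DHCPACK on 10.10.10.210 to 9c:8e:99:19:b6:aa (ILOUSE125NDBA) via 10.10.10.1 BJCP691P
EOF

while read -r line; do
    i=$(echo "$line" | awk '{print $8}')    # iLO IP address
    j=$(echo "$line" | awk '{print $14}')   # factory administrator password
    echo expect reset_ilo.sh "$i" "$j"      # real run: drop the echo
done < nodeswip
```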
The reset_ilo.sh script looked like this:
#!/usr/bin/expect
set timeout 600
set ip [lindex $argv 0]
set pass [lindex $argv 1]
spawn ssh Administrator@$ip
set answ "$pass"
set comm1 "reset /map1"
expect "Administrator@$ip's password:"
send "$answ\r"
expect "hpiLO->"
send "$comm1\r"
expect eof
Next we moved on to creating the RAID arrays automatically, updating every piece of firmware we could, and collecting comprehensive information about each server's configuration in a convenient form. All of these operations were performed during network boot.
First, an init script ran that "prepared" the RAID array:
LD=`/usr/sbin/hpacucli ctrl slot=0 logicaldrive all show | awk '$0 ~ /RAID 5/ || /RAID 0/ || /RAID 1/ {print $1" "$2}'`
LD=${LD:-NULL}
if [ "$LD" != "NULL" ]; then
    /usr/sbin/hpacucli ctrl slot=0 $LD delete forced
fi
/usr/sbin/hpacucli ctrl slot=0 create type=ld drives=`/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show | awk '$1 ~ /physicaldrive/ {split($2,arr,":"); print $2}' | tr "\n" "," | sed 's/,$//'` raid=1+0
if [ `/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show | grep physicaldrive | wc -l` -gt 1 ]; then
    r=`/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show | grep physicaldrive | wc -l`
    let t=$r%2
    if [ $t -ne 0 ]; then
        let tl=$r-1
        /usr/sbin/hpacucli ctrl slot=0 create type=ld drives=`/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show | grep physicaldrive | head -$tl | awk '$1 ~ /physicaldrive/ {split($2,arr,":"); print $2}' | tr "\n" "," | sed 's/,$//'` raid=1+0
        /usr/sbin/hpacucli ctrl slot=0 array all add spares=`/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show | grep physicaldrive | tail -1 | awk '$1 ~ /physicaldrive/ {split($2,arr,":"); print $2}' | tr "\n" "," | sed 's/,$//'`
    fi
fi
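The parity logic in that script deserves a note: RAID 1+0 needs an even number of drives, so with an odd count the script builds the array from n-1 drives and adds the remaining one as a hot spare. The decision in isolation (a sketch, with the hpacucli drive count replaced by a hard-coded value):

```shell
# Decide how many drives go into the RAID 1+0 and whether one is left over
# as a hot spare. In the real script ndrives comes from counting
# "physicaldrive" lines in hpacucli output; here it is hard-coded.
ndrives=7
if [ $((ndrives % 2)) -ne 0 ]; then
    in_array=$((ndrives - 1))   # drop one drive to get an even count
    spares=1                    # the leftover drive becomes the hot spare
else
    in_array=$ndrives
    spares=0
fi
echo "$in_array drives in RAID 1+0, $spares hot spare(s)"
```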
As a result, we got either a RAID 1+0 or a simple mirror. Then an agent was launched that sent information about the hardware to our inventory system. We use FusionInventory Agent, and the only setting we changed was the address of the collection server. The result is visible in the FusionInventory interface:
The last script to run updated all the firmware on the hardware. For this we used several Puppet classes that were applied to the new servers. Below is an example of a class that "looks at" the current server configuration and, if necessary, updates the RAID controller firmware to the required version. The remaining hardware updates were performed in the same way.
class hp_raid_update_rom {
    exec { "updateraid":
        command => "wget -P /tmp/ http://WEBSERVER/install/soft/firmware/hp/raid/5_12/CP015960.scexe; wget -P /tmp/ http://WEBSERVER/install/soft/update_hp_raid_firmware_512.sh; chmod +x /tmp/CP015960.scexe; chmod +x /tmp/update_hp_raid_firmware_512.sh; /tmp/update_hp_raid_firmware_512.sh; echo '5.12' > /tmp/firmware_raid",
        onlyif  => "/usr/bin/test `/sbin/lspci | grep -i 'Hewlett-Packard Company Smart Array G6' | wc -l` != '0' && /usr/bin/test `/usr/sbin/hpacucli ctrl all show detail | grep -i firmware | awk {'print \$3'}` != '5.12' && ([ ! -f /tmp/firmware_raid ] || [ `cat /tmp/firmware_raid` != '5.12' ])",
        path    => "/usr/bin:/bin",
        require => Exec["remove_report_file", "remove_empty_report_file"],
    }
    exec { "remove_report_file":
        command => "/bin/rm /tmp/firmware_raid",
        onlyif  => "[ -f /tmp/firmware_raid ] && [ `cat /tmp/firmware_raid` == `/usr/sbin/hpacucli ctrl all show detail | grep -i firmware | awk {'print \$3'}` ]",
        path    => "/usr/bin:/bin",
    }
    exec { "remove_empty_report_file":
        command => "/bin/rm /tmp/firmware_raid",
        onlyif  => "[ -f /tmp/firmware_raid ] && [ `cat /tmp/firmware_raid | wc -l` == '0' ]",
        path    => "/usr/bin:/bin",
    }
}
Thus we completed the task using only our own resources. All our machines were ready for the production OS and software to be installed and to start serving Badoo users.
Everything above describes only the preparatory stage of setting up the hardware; installing the OS and configuring the working environment were left outside the scope of this article. If you are interested in the topic, we will be happy to prepare material on xCAT and Puppet and share how we solve specific problems with these tools.
Feel free to leave your suggestions, questions and comments - we are always open to dialogue!
Badoo Company