The sysadmin's gentleman's kit

Admins are the people without whom nothing in an IT company will work. With a happy and productive admin, things move better and faster, so a comfortable working atmosphere is in the company's interest. Anton Turetsky's (banuchka) talk at Highload++ 2017 was about which tools help make a team productive.

Anton likes infrastructure tasks and automating everything that can be automated, so his story is built around the example of setting up infrastructure in a data center and the related technologies (Docker, Consul, Puppet ...). But the factors that get in the way of quality work, and the ways to deal with them, are as general as possible and apply to almost any team of individual contributors. Read on for the transcript of the talk.

Badoo grows every year, and a few numbers reflect this: 350 million messages per day, 364 million registered users worldwide, 300 thousand new users per day. But this is far from the most important thing. For a person who works at Badoo, what matters most is a way of thinking and the team. Badoo is a family, it's about people, and it's cool!

I want to start with a provocation that perhaps not everyone will support:

The admin is the most important person in the company!

I think you will agree with me: the admin is the person without whom nothing in the company will work. The equipment arrives to him, he installs the system, and new equipment is again allocated by him. That's why I think he is the most important one.

I will give an example from personal practice at Badoo. Judge the situation for yourself: we had a new project called ReThink. We renewed our logo: changed the font and the color of the letters from multi-colored to purple and added a heart shape - monochrome and cool. But the administrators were warned that ReThink was coming, and that we would simply flip the switch, only the evening before, almost as they were leaving for home. And then a rather unpredictable load hit one of the clusters. Thanks go to the person who was on duty and helped the rest of the team quickly find additional servers and add them to the cluster. The project took off, we did not go down, the rollout went normally, and everyone was happy.

To back up my words, I want to say that a happy and productive administrator is, among other things, beneficial to the company. I would like to ask all companies to make their admins happy. Then you will be fine!

Let's think about what makes an admin sad. Many will think that the admin is sad because of a crashed server and lost backups. This is all true, but if the admin sank into sadness every time he did something wrong - and he does something wrong every day - his nerves would not hold out.

So I will name a problem that lies in the human factor, namely context switching.

Context switching

There is a fairly large body of research on what happens when a person is interrupted and why it is bad. One of the most recent good studies is the work of Chris Parnin of the Georgia Institute of Technology. He collected a lot of data on this topic and drew quite a few conclusions, the main one being:

A person who has been interrupted while working on a task takes 10-15 minutes to get back into it.

This is an average figure. For some it is more, for others less, depending on the kind of switch. By simple addition, if you are distracted 4-5 times within one hour, the whole hour of working time is likely to be lost, and you are unlikely to get your task done.
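The addition is easy to check. The sketch below is only an illustration of that arithmetic: the function name and the one-hour window are assumptions made for the example, and the 10-15 minute recovery cost is the average figure quoted above, not a measurement of any particular person.

```python
def focused_minutes(interruptions: int, recovery_minutes: float,
                    window_minutes: float = 60) -> float:
    """Minutes left for focused work in the window after paying
    the recovery cost of every interruption (never below zero)."""
    lost = interruptions * recovery_minutes
    return max(0.0, window_minutes - lost)


# A few combinations from the ranges mentioned in the talk.
for count, recovery in [(4, 15), (5, 10), (5, 15)]:
    left = focused_minutes(count, recovery)
    print(f"{count} interruptions x {recovery} min -> {left} focused minutes left")
```

With four interruptions at 15 minutes each, nothing is left of the hour; even in the mildest case listed here only about ten minutes remain.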

That is the theory: someone did the research and came to conclusions. In practice, you have probably been in this situation: you come to work and spend the whole working day busy - you did something all day, skipped lunch, did not reply to messengers or mail. By the end of the day you are exhausted, and it seems to you that you have done a lot. But at best, in the evening you realize that you did not do even half of what was planned for the day. It is worse when a manager or a colleague comes to you and asks: “What did you actually do today?” And you realize that you ran, ran, ran - and there is nothing to show for it.

In many ways, this comes from context switching and the inability to concentrate on a task. For the admin - an ordinary individual contributor - that is how it is.

But there is also the other side: managers and team leads. The team lead's trick is that, like maniacs, they do not just survive context switching, they sometimes even increase it in order to reduce it. That is, they pack many meetings, with all the switching, into a few hours, and then rest in the evening by working on one task. Their switching skill is developed to the point that it takes only 5 minutes to dive into a new task. This is very cool, and managers can be valued and respected for the mere fact that they know how to do it. But for the admin and the individual contributor, it is better to get rid of switching.

Process opacity

The second important problem is the opacity of the processes, which can be divided into two zones:

opacity of processes within the team;
opacity of processes outside the team.

Inside the team is what we can influence: things left unsaid, or a lack of agreement between team members. The worst thing that opaque processes within a team can lead to is duplicated work. In principle, this is not terrible, apart from the fact that you most likely lose the working time of one of the employees.

You can even find an upside here and say: “Perhaps Vasya did it better than Petya! Let's take his solution.” But they could have talked to each other, and only one of them would have done the work. That matters.

If the opaque processes are outside the team, for example when something incomprehensible is going on in the company as a whole, then inside the team this can lead to incorrect prioritization of tasks.

For example, a developer from the mobile web team comes to me and says that it is important for him to get a certain service up today, one that will serve something for the new API. I have many other tasks, and it does not seem to me at all that his task is a priority. He has been waiting a week for his release, so he can wait two more days; I will do it later. For the business, that is not always the right call. If word comes down from above that the current task has a high priority because it is part of a very large upcoming task, then it is important not just that the manager is told, but that every team member understands it without further explanation.

My talk today is built around solving these two main problems within the team, from the point of view of the individual contributor and the admin. I will talk about how we arrived at a few rules to minimize context switching and make processes as transparent as possible.

How to solve the context switch problem

The admin came to work, drank a cup of coffee, read the mail; the backups ran, nothing crashed - sit and work. What could get in the way?

Consider an ordinary situation. A person arrives fresh, everything is fine, he opens his work tools and replies in chat and mail, and then the phone rings - someone asks what went down during the night - and he is distracted. Then his wife or girlfriend posts a cool picture - he has to go and like it, and Facebook is moving too. Then friends come by to discuss yesterday's football match and invite him for a beer in the evening or for tea right now. And all of this comes at the person from all sides, little by little.

What can be done about this problem? We have a person; there is his social life in general, and there is its working part. Here we can consider and optimize only the part that concerns his working tools. We cannot forbid him to go for a beer after work or to use social networks, because we are not in prison after all.

So we decided to look at which working tools the administrator has, where the interruptions most often come from, and what we can do to reduce them.

The first idea is rather strange, but we tried it: allow the admin simply not to use chat, because people write to chat a lot. You are working on a task, and one person writes that his thing matters, another that his thing matters. So we allowed admins not to use chat - not to respond and not to write anything there.

The idea, of course, did not take off, because besides the fact that you still need to read the chat, chat is the fastest way to communicate. You simply have to write there. Just a week later it became clear that the idea was utopian; we abandoned it and moved on.

We made another somewhat strange decision - we singled out one team member and told him: “Dude, you will be a sort of leader! This is not a promotion; you simply know quite well which of your colleagues is good in which area, you know the general flow of tasks and, more or less, the priorities. So you will work according to the following scenario. There is a pool of tasks that land on all the admins in the team. You can see who is busy with what, you know the deadlines for each task, and you can always give it to the person who will cope with it fastest; or, if there is plenty of time, you can assign it to a junior. The junior will need the basics explained, but you know that if he gets help, he will level up and everything will be cool.” In principle, the idea is quite sensible.

One of the reasons it did not work out is that all our administrators like to work on what they like. When everything is on fire and a task simply must be done, we do not argue - we take it and do it, no matter who. It is another matter when you have a choice: “I am working on one task now and I want to set up replication in MySQL; I don't want to touch Puppet - let someone else do it.”

People started to grumble: some had few tasks, some had many, someone got the uninteresting ones - something confusing and hard to explain. Perhaps it was our miscalculation, but this approach did not work.

At about the same time, we tried to load this conditional leader, the Arbitrator, with one more duty. Other teams file tasks for the admin team to do something - a backup, a restore, and so on. A person with such a request is, in effect, a client, and he is always waiting for feedback. Suppose that, having filed the task, he sees it move in the general pool from “not assigned” to “assigned” to a specific person; then 2-3 hours pass, one working day, another, and there is no movement on the task. It is not clear whether anyone is working on it at all.

There are admins who do not like to document their tasks in writing. So the Arbitrator now had to hold one-to-one meetings with each member of his team, follow almost every task, ask whether there were any difficulties and how to help, and write up a summary of the collected information every 1-2 days.

The tasks did start to be tracked somehow. But everything stalled, because our Arbitrator simply got buried under that much knowledge. After all, to summarize something you need to understand each subject area, figure out what stage the employee has reached and what is blocking him, and write it all down. When there are a lot of such tasks, the Arbitrator simply gives up writing anything, and the tasks go untracked just as before. So we had to move on and change something again.

Eisenhower Matrix

Perhaps you have already seen this matrix and just did not know its name. The idea is that we divide the sheet of tasks into 4 parts according to two parameters:

urgent / not urgent;
important / not important.

We simply sort all of our tasks into this wonderful table and begin to work.

It should be noted right away that the most productive and comfortable cell for an individual contributor is B: tasks that are important and not urgent. It is a great motivator when your task is important for the team, for the project, or just for you. You understand that you are working not on some nonsense but on something people will use, and that adds incentive. The advantage of no urgency is that you are left to yourself. You have time to read, test, and make some calculations.
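The sorting itself is mechanical, which a short sketch can show. The cell letters follow the usual layout of the matrix, and the talk itself uses B for important but not urgent and C for urgent but not important; the `Task` type and the example tasks are invented for illustration and are not taken from Badoo's tracker.

```python
from dataclasses import dataclass


@dataclass
class Task:
    title: str
    urgent: bool
    important: bool


def cell(task: Task) -> str:
    """Return the Eisenhower cell for a task.

    A: urgent and important      B: important, not urgent
    C: urgent, not important     D: neither
    """
    if task.important:
        return "A" if task.urgent else "B"
    return "C" if task.urgent else "D"


# Hypothetical tasks, for illustration only.
tasks = [
    Task("cluster is overloaded", urgent=True, important=True),
    Task("design new backup scheme", urgent=False, important=True),
    Task("tidy up old dashboards", urgent=False, important=False),
]
for t in tasks:
    print(cell(t), t.title)
```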

We sat down, thought, and came up with the idea of splitting all the tasks coming into the operations department, moving the tasks that are not very important and not very urgent into a separate project, ITGROOVE. Here we put tasks that may someday really turn into a problem but are not a problem now, and that it would be nice to do in the foreseeable future - a week or two.

After that, we introduced the role of day duty administrator, which works as follows. We have a first line of support that responds to alarms, triggers, and monitoring. If the first line cannot cope with a problem and decides it needs to escalate, then the first person involved during the daytime is the day duty administrator.

Earlier I told you that we are getting rid of the effects of context switching, and here we are simply throwing a person into the line of fire and telling him to do everything and switch as fast as he can.

Actually, this is not entirely true, because the day duty administrator does one of two things: either he escalates the problem and passes it to the best available specialist in that subject area, or he fixes the problem himself almost automatically. This is not mental work - wake the person up at night and he will go and fix it.

As an added bonus, we suggested that the day duty administrator, if he has nothing to do and is bored, work on the ITGROOVE project. Not only does this person cover for the rest of the team, he also closes unimportant and non-urgent tasks!

By introducing the role of day duty administrator and separating the unimportant tasks from the project ones, we allowed the rest of the team to work in the most comfortable zone, B, on non-urgent but important tasks. People simply surfaced from cell A, looked around, and there was cell B - comfortable, all is well, cool! Let's work!

I will not ignore the tasks in cell C. It sounds a bit crazy: “urgent, but not important” - a task is either urgent or unimportant. In our case, there is usually no work in this segment. Tasks marked “not important, but urgent” either become “neither important nor urgent” or simply disappear, and we do not work on them.

Since I have mentioned that we introduced the role of day duty administrator, let's briefly look at what kinds of admins we have:

The ordinary admin. In principle, everyone does a bit of everything, but the ordinary administrator mainly works on tasks in Jira.
The day duty administrator mainly responds to phone calls and to escalations from monitoring.
The night duty administrator is a kind of mixture of the ordinary and day duty admins: he responds to calls and escalations at night and works as an ordinary administrator during the day.

How to make processes transparent

The difficulty of our particular team is that one part of it is located in London and the other in Moscow, which is a fairly large time-zone shift. In Moscow, the guys start work much earlier; in London people are just coming to work, and Moscow has already done something. In turn, we in the London office, working on into the evening, do things that the people in Moscow, having gone home, do not know about. To coordinate processes within the team, we have a weekly Monday meeting.

It looks like this:

We have one meeting room in Moscow and one in London.
The time is set so that in London people have just come to work, and in Moscow they have already returned from lunch. Everyone needs about 40 minutes to get into a working mood. So we gather over video in an informal setting, take the agenda, and begin the discussion.
This is a many-to-many discussion. We tell each other which important projects we have finished, what we expect, and what we plan to do, and we agree on things with each other.

But the problem is that by Tuesday evening or Wednesday morning, coordination starts to slip. For example, I started working on a task and got pulled away, I have various tasks for this week, and my colleague in Moscow is going through something similar. We will be out of sync until next Monday, until the next agenda - something needs to be done about it.

Status Hero

There is a cool tool called Status Hero. The idea is that when you come to work, you plan certain tasks for yourself. In Status Hero there are 3 fields to fill in. It is not a mandatory tool; we are free not to fill it in and not to use it.

The trick is this: I have come to work fresh, and I know that today I want to fix some DNS, set up sending metrics to Prometheus, see how the new graphs will work, and maybe close some current tasks. I put all this in the plan for today.

But above the plan for today there is a blinking line that says: yesterday you promised yourself to do this, so first write what you actually did yesterday out of what you promised, and only then what you will do today.

There is also a wonderful third field. It is used to note external events that block your tasks. For example, someone from another team has not given you some information, a patch, a fix, or the data you need to do the work, and you are a shy guy and cannot call and demand it. Now you can write it here, it will be highlighted in red, and either the manager or the guys from the team will help you. That is, you voice your problem instead of sitting silently and waiting until a problem outside your control is solved and you can do your job.

In turn, the team sees this too. We have a dedicated group in HipChat where, after someone fills out the form, it is shown to the whole team. A quick glance at the chat is enough for a person to understand what his colleagues are going to do. If there is a blocker that he can resolve and thereby help a colleague, he does so. That's cool!
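To make the three fields concrete, here is a minimal sketch of such a check-in and of the message a team chat might show. It is not Status Hero's data model or API: the `CheckIn` type, its field names, and the message layout are assumptions made only to illustrate the idea of "done yesterday, plan for today, blockers".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CheckIn:
    author: str
    done_yesterday: List[str] = field(default_factory=list)
    plan_today: List[str] = field(default_factory=list)
    blockers: List[str] = field(default_factory=list)


def to_chat_message(check_in: CheckIn) -> str:
    """Render a check-in as plain text; blockers are marked loudly
    so that teammates notice them at a glance."""
    lines = [f"{check_in.author} checked in"]
    lines += [f"  done: {item}" for item in check_in.done_yesterday]
    lines += [f"  today: {item}" for item in check_in.plan_today]
    lines += [f"  BLOCKED: {item}" for item in check_in.blockers]
    return "\n".join(lines)


# Hypothetical entry, for illustration only.
entry = CheckIn(
    author="admin-1",
    done_yesterday=["fixed DNS records"],
    plan_today=["send metrics to Prometheus", "check new graphs"],
    blockers=["waiting for a patch from another team"],
)
print(to_chat_message(entry))
```

Keeping blockers as a separate field, rather than free text inside the plan, is what lets a reader of the chat spot them without reading every line.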

Why does Status Hero work?

The most important aspect is that you make a promise to yourself. From practice I can say that if you keep promising yourself something from Monday to Friday, then, most likely, by Thursday you will have done at least one of the items you wrote down on Monday. Every day Status Hero will nag you and say: "You promised - and did not do it!" Your colleagues also know that you promised, so you just take it and do it, by sheer force.
The next positive point is that the resulting transparency allows us to help each other. When I see that, for example, my colleague is going to take on a task where I probably know more, or where I can simply help with something, I say: “Come on, I will help you. I know where to point you in the documentation and what to read, or we can do it right away so you don't lose a couple of days of work. It will be better for you.”
Now the quiet ones, who used to sit and not say that something was preventing them from working, can also quietly write it down, and they will be helped. Perhaps some problem will be solved that otherwise would never have been dealt with.

Status Hero revealed the problem

Status Hero did more than help us organize this activity; it also revealed one rather strange problem. At that time, operational documentation either did not exist or there was not enough of it.

You could see this once you began to see what your colleagues were working on, to help them, and to tell them how to do something. When you explain the same thing for the sixth time, you realize that if you had written it down once, a colleague would have walked through the instructions, made edits and comments, and you would not have been distracted those six times. The colleague, in turn, would not need to ask about trivial things that can simply be read.

There was documentation, but not enough of it, as it turned out. As soon as we started using Status Hero, there really were more articles in the internal Wiki; articles began to be edited and commented on, likes even appeared in Confluence, and people began to add detail to the triggers that fire in the monitoring systems. We began to write more clearly, in human terms, about what is actually happening, whom to call, and where to look.

And that is not all. There is another aspect in which Status Hero helps us.

Team Contribution

Alexey Rybak spoke at HighLoad++ about the Review process at Badoo. This is a cool thing, mostly for managers, because they need to evaluate their staff: how we work, how the team works. From a manager's point of view, it is a cool tool that makes all the information structured.

From the point of view of the administrator - an ordinary employee - the opposite is true. It is almost like cramming for exams at the end of the semester. You are given a week to complete the Review, in which you need to write what you have been doing for the last six months. But it usually comes down to the last day, most of which is spent rereading your old tickets, getting back into them, and writing something about your achievements.

To make writing the Review less painful, we are encouraged to fill in snippets. This can be done either at the end of the working day or at the end of the working week.

But since we have already talked about context switching, it is obviously not always possible on Friday to recall what I did on Monday or Tuesday. At best I will write what I did on Thursday and Friday; at worst it will be the last 3 hours of work on Friday. As for daily snippets, working days vary, and in the evening I want to go home or to the pub - anything but write about what I did today.

Here Status Hero comes in again. Every day we wrote in it what we promised to do and what we did. For whatever period we need, we can simply select the positive items - what we actually got done.
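The selection for a review period can be pictured as a simple filter over daily entries. The sketch below assumes the "done" items have already been exported into a dictionary keyed by date; that export format, the function name, and the sample entries are hypothetical and are not a description of what Status Hero actually provides.

```python
from datetime import date
from typing import Dict, List


def done_in_period(log: Dict[date, List[str]], start: date, end: date) -> List[str]:
    """Collect, in chronological order, everything marked as done
    between start and end (both inclusive)."""
    result: List[str] = []
    for day in sorted(log):
        if start <= day <= end:
            result.extend(log[day])
    return result


# Hypothetical daily "done" entries, for illustration only.
log = {
    date(2017, 3, 6): ["fixed DNS records"],
    date(2017, 5, 15): ["set up MySQL replication"],
    date(2017, 9, 1): ["documented Puppet bootstrap"],
}
for item in done_in_period(log, date(2017, 1, 1), date(2017, 6, 30)):
    print(item)
```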

Not only is this selection positive, there is one more plus: in Status Hero we wrote for ourselves, so when we make a selection for the semi-annual report and read what we wrote for ourselves, we get back into the context quickly. You do not need to dig into the ticket and remember what you did there, whether it was long ago or not.

It is beautiful and wonderful, but

“Theory without practice is dead, practice without theory is blind”
A. Suvorov

One day in the life of the admin

So that my claims that Status Hero is cool are not unfounded, let's look at one day in the life of an admin at Badoo. The situation is half invented, but quite realistic.

A person comes to work in the morning, say, after a weekend. He is rested and knows that he has a big project ahead. The first thing he needs to do to start working is plan the working day. Suppose he has decided to set up the infrastructure in a new data center.

We all remember about conscience, and that if he promised on Monday, then by Friday he will probably do it. But let's consider the ideal situation, where the task is closed within one working day.

The person wrote this down and figured that to bring up a new data center he needs to set up the xCAT infrastructure.

Then the guys who have meanwhile come to work in London join in, and each of them adds something: you still need to install Puppet, you can't start without it; Consul is needed too; and how could you do without Docker, and GLPI, and so on. There is not enough time to go into much detail about each of these systems, so let's look at them briefly.

Our data center consists of five puzzle pieces, on top of which we can start building further.

The first and main thing to manage is the hardware, which has just arrived from the factory. It has been placed in the data center and mounted in a rack. The admin needs to update the firmware, install the OS, build the RAID, and move the machine to the place where it will work from then on.

We have used and continue to use a product called xCAT, which contains a PXE server and a DHCP server. There we store the database of all our subnets, DNS addresses, dynamic ranges, and other information. For us it is the server database, in the format: server, name, MAC address, network interfaces, and the permanent IP in the cluster where the server will live once we move it into a cluster.

It is important that xCAT makes it possible to follow what is happening on the server consoles. If some kind of kernel panic happens, we get a capture of the screen as plain text and can then use it. xCAT works in a scheme with a management node, which knows everything about everything but can offload part of its work to a service node, on which, in turn, we run the console server. If the data center is small - say 100 machines - everything fits on one management node and we do not bring up a second node. If the data center is large and there are many servers with consoles, we simply scale the number of service nodes (SNs) horizontally and attach them all to the master. That is why xCAT SN is shown in square brackets in the diagram.

In practice, the person who brings up the DC and sets up xCAT runs one container with the management node, enters the information about the new subnets that will exist in this DC, generates the DHCP configuration file, and, if necessary, tells the network engineers that the DHCP helper for these subnets will be the new container.

If the console server needs to be brought up in a separate container, we simply launch one more, and everything is great: at least we have basic hardware provisioning.

Docker

I wouldn't be me if I hadn't said the word Docker here - otherwise I would have to take my cap off. But I will not go deep into Docker; its infrastructure in any of our data centers looks roughly like this.

The essence of Docker for us is not in Docker itself but in how the registry is arranged, because we need the registry to pull the containers of our services from it. This scheme went through several iterations while we were adopting Docker, but at the moment the working registry scheme at Badoo is as shown above. All images, all layers, and everything else are stored in Ceph through the Swift API.

To cache data from our registry, we use Redis. The HTTP nodes, which are Docker distribution services, can be scaled horizontally as much as we like; the only condition is that all docker-registry nodes must point to the same address of the Redis caching service and, accordingly, to the same Ceph endpoint.

In front of the HTTP service, nginx acts as a balancer; it terminates SSL and performs basic auth. Beyond it are our target servers, which access the registry to pull or push.
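The one condition on horizontal scaling, that every registry node shares the same cache and the same storage endpoint, is easy to verify mechanically. The check below is a sketch under assumptions: the node descriptions are plain dictionaries with made-up keys (`redis_addr`, `storage_endpoint`) and made-up addresses, not the actual configuration format of the Docker distribution service or Badoo's tooling.

```python
from typing import Dict, List


def shared_backends_ok(nodes: List[Dict[str, str]]) -> bool:
    """True if all registry nodes point to exactly one Redis address
    and exactly one storage (Ceph/Swift) endpoint."""
    redis_addrs = {node["redis_addr"] for node in nodes}
    storage_endpoints = {node["storage_endpoint"] for node in nodes}
    return len(redis_addrs) == 1 and len(storage_endpoints) == 1


# Hypothetical node descriptions, for illustration only.
good = [
    {"name": "registry-1", "redis_addr": "redis.example.internal:6379",
     "storage_endpoint": "https://ceph.example.internal/swift/v1"},
    {"name": "registry-2", "redis_addr": "redis.example.internal:6379",
     "storage_endpoint": "https://ceph.example.internal/swift/v1"},
]
# Same nodes, but the second one has drifted to a different cache.
bad = [good[0], dict(good[1], redis_addr="redis-2.example.internal:6379")]

print(shared_backends_ok(good))
print(shared_backends_ok(bad))
```

A check like this could run whenever a new registry node is added, since a node with its own cache would still serve requests and the mismatch would otherwise go unnoticed.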

Consul

In today's reality, a new data center will definitely need Consul, which at the moment is used not so much as service discovery for the whole of Badoo but as service discovery for the infrastructure part.

There is probably no point in showing what a basic Consul installation looks like in any of the data centers. It is usually at least 3 master servers plus synchronization with all the other available data centers.

Why does the infrastructure, especially in a new data center, need Consul?

Puppet

Let's take a look at our wonderful Puppet infrastructure.

The role of Consul here is that we bring up the infrastructure from top to bottom (if you look at the slide above):

To get started, you need PostgreSQL, which in turn is required by PuppetDB.
Having brought up PostgreSQL, we register it in Consul. When bringing up PuppetDB, we ask Consul where PostgreSQL is, connect to it, and put the information about PuppetDB back into Consul.
Next we bring up the required number of Puppet server nodes on Java. The information they need we take from Consul, and the information about them we put into Consul.
The final step is to bring up the load balancer on nginx, which handles SSL termination and serves 3 ports:
1. a port for the Puppet agents themselves;
2. a port for PuppetDB;
3. a port for statistics.

All other clients go through the load balancer.
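The bring-up order described above is a chain of "register yourself, then look up what you depend on". The sketch below imitates that chain with a tiny in-memory registry. It is only a stand-in for the idea: the `Registry` class is not Consul's API, and the service names, addresses, ports, and the number of Puppet server nodes are assumptions made for the example.

```python
from typing import Dict, List


class Registry:
    """A toy stand-in for a service-discovery catalog."""

    def __init__(self) -> None:
        self._services: Dict[str, List[str]] = {}

    def register(self, name: str, address: str) -> None:
        self._services.setdefault(name, []).append(address)

    def lookup(self, name: str) -> List[str]:
        if name not in self._services:
            raise LookupError(f"{name} is not registered yet")
        return self._services[name]


def bring_up(registry: Registry, puppet_servers: int = 2) -> Dict[str, object]:
    # 1. PostgreSQL comes first and announces itself.
    registry.register("postgresql", "10.0.0.10:5432")
    # 2. PuppetDB finds PostgreSQL via the registry, then announces itself.
    postgres = registry.lookup("postgresql")[0]
    registry.register("puppetdb", "10.0.0.11:8081")
    # 3. Puppet server nodes find PuppetDB and announce themselves.
    puppetdb = registry.lookup("puppetdb")[0]
    for i in range(puppet_servers):
        registry.register("puppetserver", f"10.0.0.{20 + i}:8140")
    # 4. The balancer learns its upstreams from the registry.
    upstreams = registry.lookup("puppetserver")
    return {"puppetdb_uses": postgres, "servers_use": puppetdb,
            "balancer_upstreams": upstreams}


print(bring_up(Registry()))
```

The point of the ordering is visible in `lookup`: asking for a service before it has registered fails, which is why the stack is raised from the database upward.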

GLPI

We have a thing called GLPI, and any data center needs it. It is all rather plain and simple: this is an inventory service.

It works as follows:

Each server runs a simple FusionInventory Agent, which collects information on hardware, software, antivirus, and file systems - it all depends on the settings. We are usually interested in the hardware figures: how much memory, which drives, controller, cache, and so on.
This information is sent at a set interval (in our case, once a day) to a PHP endpoint, where the data is processed and written to the GLPI database.

Another advantage of GLPI and FusionInventory is that we can inventory not only servers but also network equipment, and get information about ports and speeds and, most importantly, which server with which serial number sits in which rack and is connected to which network node and which ports. The result of all this is a web page where all the information can be viewed.

We have gone over the 5 tools that are described in our Wiki. Our hypothetical admin looked through them and launched no more than 3-5 containers for each - and the infrastructure is ready. We got a house full of happy people working productively: one person defined the task, the others helped with it, everyone got acquainted with the details, read up - and brought up this whole thing.

The Badoo admin team has more people than the little figures with balloons on the slide, but we are productive, and most of us are happy. We were able to build a friendly team of professionals because we identified three problems and learned to cope with them.

So, here is what individual contributors need (not only admins, it seems to me):

Reduce context switching. Let the person work - if he is a techie, let him sit and work; do not pull him away!
Make the processes transparent. If deadlines are being missed and there is a suspicion that something is wrong with task prioritization, give the team information about why a particular task is important. A person must see beyond his monitor and know that his part in a project matters. Then he will work differently: he will understand the urgency and usefulness of his work.
Write good documentation. It is good if this documentation is divided into parts. It can be detailed and deep for those who want to read and dig. But you should also have a summary of each service that fits on one page and contains a set of 5-6 actions to be taken before escalating. Moreover, it is always important to keep the documentation up to date.

When you increase the transparency of work in the department, the problem of keeping documentation up to date solves itself, because you see which iterations are happening, and you are constantly asked: "Update, update, update."

References

These are links to various studies on context switching, on how to work well, how not to get distracted and get more done, as well as links to all the products I have talked about, which form the basis of any Badoo data center.

Highload++ Siberia, the Siberian edition of the conference for developers of high-load projects, starts on Monday and takes place on June 25 and 26. There Anton will talk about the evolution of the tools and services used by the Badoo operations team, and another 30 recognized experts and representatives of industry leaders will present their achievements and share their experience - see the program.
