Oh, my code. How to become a system administrator

Tatiana Bakharevskaya, Deputy Technical Director at Mail.Ru Group, talks about the path of a system administrator, the advantages of working as a sysadmin, and the specifics of operations in a large company. Tatiana has been, and still is, responsible for the services of Russia's two largest portals.

The host of the program is Pavel Shcherbinin.

- Tell us a little about yourself.

- I came to the profession a long time ago. I got a job as a junior system administrator at a small startup that was developing its own search engine and a number of other Internet projects. It was Yandex, where I worked for many years. I grew into a senior system administrator and then headed the system administration department. In 2005 the department had 5 people, and 10 years later it had 250; it had become a large structure with several divisions. We learned to hire and grow engineers, and ran events such as Root and CIT. At Yandex I was responsible for the uninterrupted operation of the company, and for almost a year now I have been doing the same at Mail.Ru Group. At first the tasks seemed similar, but on closer examination it turned out that there is a lot in common and also plenty of differences, and that is what makes it interesting.

- There are many different terms for service operations: simply operations, system administrator, SRE, SE, DevOps. Tell us more about each. Or are they all the same thing? What is the difference?

- In fact, "system administrator" is a fairly broad concept. It ranges from a person responsible for a small office infrastructure serving a few employees to someone responsible for the continuous operation of a high-load service. At some point the role splits into different directions. In companies such as Mail.Ru Group, Yandex and Google, the system administrator is closer to what is now called by the fashionable term SRE, Site Reliability Engineer, that is, the person responsible for the availability of the site.

Our work requires knowledge of many different technologies: Linux / Unix, networks, databases, web servers, cloud technologies, the hardware we use to build services (processors, memory, disks) and a lot more. For each technology, you need to understand how to apply it and how it differs from the alternatives. There is always a lot of routine work that needs to be automated, so we write code too. Modern system administrators / SREs for the most part program. Currently the main language for automation is Python plus, of course, bash. Knowledge of C has always been a plus as well. For example, the best documentation on Linux is to open the kernel code and see how it works.

It is also important to understand how to build high-load, fault-tolerant systems. Quite a lot has already been said about this at conferences and written on the Internet.

To sum up, a modern engineer responsible for a high-load service needs to be able to program, know and apply different technologies, and have an idea of how to build reliable, scalable services.

- Let's go back to the past. The initial stage is very interesting. Why did you choose operations?

- It's a funny story. In those years, all decent girls wanted to become accountants. I wanted that too, so I signed up for courses. They said that to become an accountant you need to master the abacus and the Felix arithmometer. I decided that was too difficult, and that "computer knowledge" (as job advertisements put it) would make my life and job search easier. As a result, I went to "study computers" at the nearest place, the Moscow Engineering Physics Institute, at the Faculty of Cybernetics, Department of Electronic Computers. It turned out that a computer, in addition to Word and Excel, contains a lot more: the processor, memory, pipelines, input-output devices. By the end of my studies I wanted to become a programmer. In the first years programming was rather hard for me, but by the final years I was hooked on writing code. I could do it all day long. I would sit down in the evening to write code and look up the next evening. Everything went pretty well, and the programs worked. But I realized that I get carried away easily, and decided to choose something simpler. So I went into operations, but it turned out that it is not easy here either, and in some places it is even much harder. But I stayed, and I have been doing it for more than 20 years.

- I wonder, at what point does a person decide whether to be a programmer or an admin?

- It varies. For many years now I have been working with students, both at Yandex and at Mail.Ru. People come as students to try themselves in programming and administration. Some stay in operations and realize it is for them. Some, having worked a little, move to development. Some, having worked in development, realize they want to dig deeper into certain problems, to learn the stack that lies beneath their program, how it is operated, how it lives, and they plunge into operations. There are also borderline cases, now called by the buzzword DevOps. These people need to know a lot about hardware, about operations, and about code.

It all depends on the person, on what they like and dislike. These professions are very similar and largely overlap.

- There are legends about you at Yandex: that at one time you had a special switch that could turn off a data center at any moment to test the stability of the system. Tell us more.

- This story began many, many years ago with a major incident: Yandex lost almost all of its data centers. More precisely, only one went down, but it housed all of the company's network equipment. Yandex was down for several hours. After that, the task was set to make everything reliable and fault-tolerant, so that everything keeps working if one data center goes down. Today this problem is less acute, especially for commercial data centers. Reliability has become much higher; there are examples of modern data centers running on diesel for several days. But back then it was all different.

For several years we analyzed the architecture of all applications and wrote plans of what needed to be done, and how, to ensure complete fault tolerance. Where that was impossible or too difficult, we negotiated an SLA (service level agreement). We focused mainly on popular, high-load services. The first test outage was very scary. Half of the employees were watching the monitoring. We switched the data center off and quickly back on, recorded all the bugs, and reworked a number of systems. And so on for several iterations.

After some time, we got to the point where we could easily live for an hour or two with one data center turned off. Everyone understood that the skill had to be maintained through regular shutdown exercises. It's like plumbing: if you don't open and close a tap for a long time, it will seize up and you won't be able to open it at the right moment. So we regularly opened and closed the "taps". And it worked. I consider it an achievement that once I was called at night and told that a data center had gone down, and I asked why they had woken me up :-)

- Where do you think the line between programmers and system administrators lies? At what point can a programmer say that this is not his responsibility, that he does not know what kind of database is in there, that this is for the admins? Or is there no such line?

- It seems to me that the admin is responsible for the application "from the tip of the nose to the tip of the tail." Ideally, he can get into the code, see how it works, and figure out how to fix it. He takes part in choosing technologies, because some technologies are good for programmers and very convenient to write with, but impossible to run 24/7.

Programmers can focus more on the product features they need: additional functionality, design, additional code that lets the project scale better. So the separation still exists. In international practice, these are the Site Reliability Engineer and the Software Engineer. There are different theories about where and how the roles should be divided. It seems to me that the paradigm adopted at Mail.Ru Group, where operations and development both exist and are done by different people, works quite well.

- Probably not everyone knows how this is organized at Mail.Ru Group. Tell us more.

- We have a service operations division, which is responsible for keeping the services running. It consists of several departments. Each department is responsible for a specific product or group of products, depending on scale. For example, several departments deal with Mail: one handles the storage, another the web part. And there are departments that work on several smaller projects.

Our household includes Mail, Search, Portal, Delivery Club, Yula, My World, ICQ and many others. There are projects that were launched long ago and are our core products, for example Mail and Portal. There are projects we acquired, which we host on our infrastructure and with which we exchange operating practices. And there are those that were born here and grew up very quickly, for example Yula. The household is quite diverse :-)

- What is the architecture of a typical Mail.Ru Group service?

- We have several data centers, both our own and commercial ones, along with the equipment and networks in them. The total channel capacity is measured in terabits.

We host each project's servers in several data centers so that turning one of them off does not affect the service. Most of our projects are websites. The architecture is standard: a load balancer, web servers behind it, then application servers, and then a DBMS and/or storage.
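As an editorial illustration of the placement rule just described, here is a minimal Python sketch. It assumes a made-up inventory (the tier names follow the chain above; the data center names `dc1` to `dc3` and both function names are hypothetical) and checks that every tier still has at least one server when any single data center is switched off. It is not Mail.Ru Group tooling.

```python
from collections import defaultdict

# Hypothetical inventory: (tier, data center) pairs for one service.
# The data center names are invented for the example.
SERVERS = [
    ("balancer", "dc1"), ("balancer", "dc2"),
    ("web", "dc1"), ("web", "dc2"), ("web", "dc3"),
    ("app", "dc1"), ("app", "dc2"),
    ("storage", "dc1"), ("storage", "dc3"),
]


def tiers_lost_without(servers, dead_dc):
    """Return the sorted tiers that have no server left once dead_dc is off."""
    alive = defaultdict(int)
    tiers = set()
    for tier, dc in servers:
        tiers.add(tier)
        if dc != dead_dc:
            alive[tier] += 1
    return sorted(t for t in tiers if alive[t] == 0)


def survives_any_single_outage(servers):
    """True if every tier keeps at least one server after losing any one data center."""
    data_centers = {dc for _, dc in servers}
    return all(not tiers_lost_without(servers, dc) for dc in data_centers)
```

A check like this only covers presence, not capacity: a real review would also ask whether the remaining servers can carry the full load.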

Then the details begin.

Basically, we live on bare-metal servers, but there are clouds too. For example, development and testing use an OpenStack cloud, where they can get resources at the touch of a button.

We are rolling out Kubernetes, but this requires changing a lot in both operations and development processes. It is not going fast. We try to do everything carefully so as not to break anything.

Let's return to what happens with users. First, the user reaches the load balancer. For load balancing, the BGP and RIP network protocols are used, along with traditional software: ipvs, haproxy and nginx. After that, web servers, mostly nginx and Apache, show users beautiful pages.
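To make the balancing step concrete, here is a toy Python sketch of one common strategy, round-robin with health marking. The class name and backend names are hypothetical, and this is not how ipvs, haproxy or nginx are implemented; it only illustrates the general idea of spreading requests over web servers and skipping the ones that are down.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that skips backends marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.down = set()
        self._next = 0  # index of the backend to try next

    def mark_down(self, backend):
        self.down.add(backend)

    def mark_up(self, backend):
        self.down.discard(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none is left."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend not in self.down:
                return backend
        raise RuntimeError("no healthy backends")
```

In a real setup the health marks would come from periodic checks rather than manual calls.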

Behind them are the application servers. Since, as I said, we have both legacy and fairly new projects, quite a lot of programming languages are in use.

MySQL, PostgreSQL and our in-house Tarantool are used as DBMSs for new projects. Users should not notice the loss of a storage system's servers, or of a part of it, so we try to back up and replicate data to neighboring data centers.

Basically, we use open source, since the company has many programmers and engineers who can fix something at any time. There is also in-house development. For example, the storage that holds users' emails is our own development.

- How many people report to you?

- About 70 now, but the number keeps growing. We are actively expanding, and there are many open positions.

- How many servers do they maintain?

- Several tens of thousands of servers located in our data centers. They are mainly in Moscow, but we also have servers in other cities, in the USA and in Europe. This whole server fleet has to be monitored and maintained. We ourselves, of course, do not go to the data centers, except perhaps on excursions.

- What channel capacity is needed?

- Several terabits. The entire Mail.Ru Group shares a common network that carries a lot of data. Take just "VK" and "OK", which serve a huge amount of video, and there are also Mail, Search, analytics and many other high-load services. So the network is an important component.

- What do you need to know to become a good system administrator?

- Linux, of course. Many commercial companies now use this OS. Within a company, people mostly try not to mix different distributions; everyone tends toward a single one, since that makes the systems easier to update and maintain. Everyone has their own preferred distribution; we use CentOS. So first of all you need to know Linux: how it is structured, how interprocess communication works, how everything boots and runs.

Next comes specialization: whatever is closer to you and whatever your heart is in :-). Some specialize in automation, some in web servers, some in networks, some in databases, and some in cloud technologies. For example, at one time I really liked databases. You need to understand how applications work: be able to configure them, understand the pros and cons of using an application for a given task, and, of course, be able to fix it very quickly when problems arise.

Examples of such specializations: network engineers understand the protocols and know where each is best used, know how to configure global and local routing, and know how to ensure the reliability and resiliency of the network.

Database specialists know how to shard, replicate and back up a database so that information is stored reliably and served quickly. These people know how to read query plans, and they know why indexes are needed and what kinds there are.
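As an editorial sketch of what "shard and replicate" can mean in practice, the Python below maps a key to one of several shards with a stable hash and falls back to a replica when the primary is unavailable. The host names, the three-shard layout and both function names are assumptions made for the example, not a description of any Mail.Ru Group system.

```python
import hashlib

# Hypothetical layout: each shard has a primary and a replica
# placed in a different (made-up) data center.
SHARDS = [
    {"primary": "db1.dc1", "replica": "db1.dc2"},
    {"primary": "db2.dc2", "replica": "db2.dc3"},
    {"primary": "db3.dc3", "replica": "db3.dc1"},
]


def shard_for(key, shards=SHARDS):
    """Map a key to a shard using a stable hash (not Python's randomized hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def host_for_read(key, down_hosts=frozenset()):
    """Prefer the primary; fall back to the replica if the primary is down."""
    shard = shard_for(key)
    for role in ("primary", "replica"):
        if shard[role] not in down_hosts:
            return shard[role]
    raise RuntimeError("both copies of the shard are unavailable")
```

Simple modulo sharding like this moves most keys when the number of shards changes, which is one reason production systems often use other schemes.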

A typical task: to work out why a query takes a long time to execute, you need to look at its plan and check whether there are problems with server load (memory, CPU, I/O).
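Looking at a plan can be tried on any laptop. The sketch below uses SQLite from the Python standard library purely as a stand-in for the DBMSs named in the interview; the `letters` table, its columns and the index name are invented for the example. It asks the database for the plan of the same query before and after an index is created.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the plan detail strings that SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column is the readable detail


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE letters (id INTEGER PRIMARY KEY, user_id INTEGER, subject TEXT)"
)
sql = "SELECT subject FROM letters WHERE user_id = 42"

plan_before = query_plan(conn, sql)  # expected to describe a full table scan
conn.execute("CREATE INDEX idx_letters_user ON letters (user_id)")
plan_after = query_plan(conn, sql)   # expected to mention the new index
```

The exact wording of the plan differs between SQLite versions, and MySQL and PostgreSQL have their own `EXPLAIN` output, but the habit is the same: check whether the query scans the whole table or uses an index.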

- In the public mind, admins are guys with big beards in stretched-out sweaters. Is it hard for you to work in a male team?

- A difficult question, because I have been working for many years. First, I got used to it. Second, if we talk about the industry as a whole, there have long been quite a lot of women in operations.

This myth comes from the distant days when a lot of physical work was required. A friend and I still remember how the two of us dragged a large, heavy, multi-unit server and put it on the floor, because we could not carry it to the dedicated maintenance area. And we sat on the floor in the middle of the data center with screwdrivers, changing disks. There were no rails back then :-)

Now there is nothing like that. We work at desks in a comfortable office. Our work today is no different from that of a programmer, which has never been a purely male job either: women programmers are quite common.

- Now our quiz. What laptop do you have?

- Apple.

- Which is better, bash or perl?

- bash.

- Startup or big company?

- Startup in a big company.

- What was the last thing you did not have enough money for?

- A yacht.

- Excellent answer. Everyone will immediately understand the salary level at Mail.Ru Group.

- Right.

- ICQ or TamTam?

- ICQ.

- "VK" or "Odnoklassniki"?

- VK.

- Who is your idol?

- I have no idol. I believe that many people on the Russian and the foreign Internet have done a great deal for this industry. Thanks to them, it is developing at such a pace. I am lucky to know many of them personally.

- Name the major Russian ones.

- There are quite a few; again, I'm afraid I won't be able to list them all. If I have to single someone out, I am glad that life gave me the chance to work with Ilya Segalovich.
