High-load application architecture. Scaling distributed systems. Part one

    Some time ago, Alexey Rybak, Deputy Head of Badoo's Moscow development office, and the IT-Kompot hosts recorded a podcast episode, "Architecture of High-Load Applications. Scaling Distributed Systems."

    We have now transcribed the podcast, edited it into readable form, and split it into two parts.

    What we talked about in the first part:
    • General information about the Badoo project: technology stack, nature and volume of workload, traffic.
    • Horizontal scaling of the project:

    - web servers, caching, monitoring, etc.;
    - pitfalls when scaling a project;
    - scaling databases, how to do sharding.


    Presenters: Hello everyone, you are listening to episode 45 of the IT-Kompot podcast; your hosts are Anton Kopylov and Anton Sergeyev.
    Today we decided to talk about back-ends, about web development, and more specifically about the architecture of highly loaded applications and how to scale distributed systems. Our guest, Alexey Rybak, will help us with this. Alexey, hello!


    Alexey Rybak: Hello!

    Presenters: According to the My Circle website, Alexey is Head of Platform Development and Deputy Head of Badoo's Moscow office. Alexey has been developing complex distributed high-load systems for a very long time, using a variety of technologies, including the Pinba monitoring server and a few other things. Maybe I'm missing something — Alexey will add to or correct me. In addition, our guest takes an active part in conferences: he is on the organizing committee of HighLoad++, a large and powerful conference, speaks at PHPConf, and probably elsewhere.

    AR: I'd like to make a small correction. Pinba was built by my colleague Anton Dovgal, and the very first version by Andrei Nigmatulin. I mostly just suggested and came up with a few ideas.
    And yes, I really am Deputy Head of Development at Badoo; I mainly deal with platform projects and oversee large open-source projects. In general, most of what I describe was done not by me personally but by our guys. We have a fairly large team of engineers, and I have worked with many of them since the Mamba days. So it would be wrong to say I develop this gigantic system single-handedly — I do mostly administrative work, and recently I have been leading several development teams.

    Presenters: I see. We weren't planning to elevate you above the other developers, so to speak, and claim you are the one star at Badoo. We will get to questions about the team — listeners asked about that in the comments.

    I will now announce a short plan to our listeners. Alexey will briefly tell us about himself — how he started, what he has done, where he ended up, and so on. We'll talk about what Badoo is, in case someone doesn't know the project. We will touch on horizontal scaling: how to scale in general, what technologies exist, what the problems are, and we will look at all of this through the experience of a large service. And another very interesting topic today is asynchronous tasks in high-load systems (job queues, message queues).

    Alexey, please tell us briefly how you got into development, what you are doing now, and what your plans are for the future.


    AR: All right. In short, I wasn't planning to become a programmer in school and didn't do any programming at all. In the 9th and 10th grades I attended university courses to study Fortran, but before we made it to class we would often go off somewhere to drink beer or wine. So programming just didn't work out for me at all; I entered the Physics Department of Moscow State University and wanted to do physics. But it turned out that doing physics in the 1990s didn't make much sense — all my friends had left — and I decided to go into web development. The first thing I did wasn't really web programming: I was a webmaster. Besides writing some Perl scripts and automating things, I drew banners, cut things up in Photoshop, and so on. Then I worked at various companies, and I got lucky: in 2004 I ended up at Mamba. It was a very large project that instantly became popular in Russian-speaking countries.

    Presenters: Everyone probably knows that dating site, the first big well-known social network. I remember it from the dial-up days of the Internet.

    Presenters: Yes, I remember it was very popular to "partner" with Mamba: you attached your own domain to the system and earned a commission from your users.

    AR: Mamba did indeed grow thanks to its large and small partners. It was a giant step forward and a revolution in my career. It was a very small but very professional team. To some extent I had to learn from my own mistakes there: we made a huge number of mistakes, specifically around planning development several years ahead. But the project survived — it is still alive, although most likely everything has been rewritten by now. From around 2006, or the end of 2005, we were no longer involved in that project — it has its own team now — and we started working on Badoo.

    From the very beginning, Badoo had a very small team — literally a dozen engineers, including remote employees and system administrators. Since then the company has grown. I was responsible for all server-side back-end development, with the exception of the C daemons — that is, for shipping the main features, for the development of the site itself. In fact, there was only one other "techno-manager": the technical director, who also headed the administrators and the C programmers.

    Presenters: I wanted to ask about that. You just mentioned that part of your back end is written in C — do you really have a lot of C code in your application?

    AR: There actually is quite a lot of C code, but if you don't mind, I'll explain a little later why we have this technology stack.

    Presenters: Good. Alexey, so you already had solid experience in developing dating services and some idea of what should and shouldn't be done there, and that's how you came to Badoo?

    AR: Yes. We have grown a great deal over the past few years, so I had to step away from hands-on development. But the experience we have accumulated here is quite interesting, and we are turning it into articles and talks. We even have a seminar — my own seminar on developing large projects. So this topic is very interesting to me, and I'll be happy to talk about what we've learned and which methods we use.

    Regarding the technology stack: we use Linux and a lot of open-source software. Even if you don't develop anything in C or C++ yourself, it often happens that something starts knocking under the hood and you don't know what it is. You need in-house competence: take the source code, read it, understand it, and fix it. Beyond a certain project scale and a certain load, that competence has to exist. Good system administrators, as a rule, can read C code; but if you also have staff who can read and fix it, that's wonderful. So one of the areas our C programmers cover is refining and extending various kinds of open-source software.

    Presenters: Yeah. And as far as I know, Badoo is mostly written in our good old PHP?

    AR: Yes, it's mostly PHP, and we use MySQL as the database — seemingly very simple technologies. Nevertheless, you can build everything on them, and do it well.

    Presenters: Yes. And how large is your daily active audience now?

    AR: Good question. I probably don't remember the exact numbers; I could be off by tens of percent. Are you interested in daily or monthly? I think the daily audience is in the region of 10 million — that's logged-in users. If we count those who just drop by occasionally and look at profiles, I think it's a good 20 million.

    Presenters: And at the same time your application is spread across quite a few platforms. I mean it has a strong mobile presence — phones, tablets, the web version... You're not, by any chance, on Xbox or PlayStation consoles?

    AR: That's an interesting question. At one point development really was done so that it would work, roughly speaking, on a TV, where JavaScript, if present at all, is severely cut down, and so on. As for what the applications are: Badoo is a hybrid of a social network and a dating site. But we also have separate applications that simply replicate parts of Badoo's functionality, and they can be completely standalone. They can be Facebook applications, or phone applications distributed separately through their own channels, not necessarily under the Badoo brand. There is the main Badoo site, and there are applications that are just a "vote": is the photo attractive or not, good or not. All sorts of things. There really are a lot of applications — I'm afraid there are already dozens of them. But anyway...

    Presenters: That's how a project always starts. If it's a really successful project that's interesting to users, the audience begins to grow rapidly. Accordingly, you need to support the whole thing properly so that the service doesn't go down and responds adequately. And here we come to the very banal thing that only the lazy aren't talking about now: horizontal scaling. There are different approaches and technologies — Amazon or not Amazon, clouds, physical hosting; all kinds of stacks — PHP, Python, Ruby, Java, I don't know... even Erlang, maybe someone goes straight to that, God forbid. How did you approach this, what are you using, and why exactly these technologies?

    AR: It's hard for me to answer in the most general terms, so I'll mostly talk about what we use. It seems to me the "technology zoo" out there is such that many teams, having chosen one technology or another, are sooner or later forced to reinvent the wheel. But since the community around any given language is large enough, that wheel eventually becomes a wheel for the whole community, and whichever language you choose, the scaling problem will most likely get solved.

    Scaling the application servers themselves was something we learned quite easily. To scale an application server, roughly speaking, all you need is to keep as little state in the application as possible, so that request processing can easily move from one application server to another, and to keep the CPU load as low as possible. Scripting languages are not great at the latter.
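
    One common way to keep PHP application servers stateless, for example, is to store session data in an external cache instead of on the local disk, so any server can handle any request. A minimal sketch, assuming the memcached PHP extension is installed; the host names are hypothetical:

        <?php
        // Keep sessions out of the local filesystem so requests can land on any app server.
        // Requires the "memcached" PHP extension; host names here are hypothetical.
        ini_set('session.save_handler', 'memcached');
        ini_set('session.save_path', 'cache1.example.local:11211,cache2.example.local:11211');

        session_start();

        // The session now lives in the shared cache, not on this particular server.
        $_SESSION['last_seen'] = time();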

    The problems we solved here were related not so much to scaling as to reliability. Since 2005, I believe, we had been patching PHP. Eventually that work turned into the PHP FastCGI process manager, PHP-FPM, which entered the PHP core in 2008 or 2009. It is actually our development: Andrei Nigmatulin did all of it initially, and it was done not so much for scalability as for operational convenience, so that you could restart gracefully and so on.

    Scaling the application servers themselves is fairly simple: you just need to distribute traffic competently. The difficulties come with scaling databases: how do you actually scale a database horizontally? Caches are also fairly simple, because a cache is something you can spread across servers in some simple way, and if you add new servers it's fine even if that redistribution moves data entirely from one server to another. If your system can't survive such a reshuffling of cache data, most likely something is wrong with the architecture: adding cache servers or redistributing data across them should be a non-event.
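
    For instance, the memcached client in PHP can spread keys across a pool of servers with consistent hashing, so adding a node only remaps a fraction of the keys and a miss simply falls back to the database. A minimal sketch, assuming the memcached extension; hosts and the loader function are hypothetical:

        <?php
        // Spread cache keys across several memcached nodes (hypothetical host names).
        $cache = new Memcached();
        $cache->setOption(Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT);
        $cache->setOption(Memcached::OPT_LIBKETAMA_COMPATIBLE, true);
        $cache->addServers([
            ['cache1.example.local', 11211],
            ['cache2.example.local', 11211],
            ['cache3.example.local', 11211],
        ]);

        // The client decides which node holds this key; adding a node later
        // only remaps a small share of keys, and a cache miss is not fatal.
        $profile = $cache->get('profile:42');
        if ($profile === false) {
            $profile = load_profile_from_db(42); // assumed application function
            $cache->set('profile:42', $profile, 300);
        }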

    With databases it doesn't work like that. There are several ways to scale them, but all of them, one way or another, have to be designed around operational convenience. At the moment we have — I don't remember exactly, I could be off — about 500 database servers. It's a rather large installation; it lives in several data centers and is managed, in effect, by one person. One person is in charge of the databases. We have arranged everything so that this is as convenient as possible.

    Presenters: And you achieved this one-person operation through automation — bringing up new nodes, backups, and so on?

    AR: Yes.

    Presenters: Do you have your own tools for this, or do you use off-the-shelf solutions?

    AR: The thing is, there probably are solutions now that would be worth considering. But that kind of research is expensive; we designed everything in 2005, and back then nothing of the sort existed. So it's now more cost-effective for us to work within what we already have. I wouldn't recommend that people invent everything from scratch, but we work within the technologies laid down in 2005. And I must say we laid them down well: we moved from one data center to two, and if a third data center appears we will scale further without any trouble — most likely this stack won't change. But here's the point: one of the main mistakes people make when they first start scaling databases is in how they scatter data across servers.

    Presenters: Yeah.

    AR: Remainder of division, first letter of the login — there are all sorts of schemes, and none of them work operationally. They are simple and beautiful, but they don't work in operations. Because as soon as some node crashes and for some reason you suddenly have to run on eight servers instead of ten — two nodes have to be taken down temporarily and their data moved somewhere — you're ambushed, because the formula breaks. Then the people doing the operational work realize the method has to be modified somehow, come up with all sorts of tricks, and it ends up only semi-deterministic.
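
    To illustrate the problem (this is not Badoo's code, just a sketch): with naive modulo sharding, the placement of nearly every user changes as soon as the server count changes, which is exactly why it hurts in operations.

        <?php
        // Naive modulo sharding: simple and pretty, but fragile in operations.
        function shard_for_user(int $userId, int $numServers): int
        {
            return $userId % $numServers;
        }

        // With 10 servers, user 123456 lives on shard 6...
        echo shard_for_user(123456, 10), "\n"; // 6
        // ...but if two servers are pulled out, almost every user "moves".
        echo shard_for_user(123456, 8), "\n";  // 0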

    The most convenient way is so-called virtual buckets, virtual baskets, virtual "shards". The idea is that an identifier is still mapped to a number, but the number is virtual and there are many of them. These virtual numbers, or virtual shards, are then mapped by a special configuration file to physical tables or servers. That solves two problems at once. On the one hand, the mapping still routes data — where to load it from, where to save it — and it stays deterministic in the sense that the function is easy to compute. On the other hand, if a node fails, the administrator only has to change the configuration file, restore from a backup if there is one, move some data, and put it all back into production. Almost all projects stop there, but this still doesn't solve the problem of multiple data centers and connectivity.
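
    A minimal sketch of the virtual-bucket idea (again illustrative, not Badoo's implementation): the hash function maps a user to one of many fixed buckets, and a config map assigns buckets to physical servers, so a dead server only requires editing the config.

        <?php
        // Virtual buckets: many fixed buckets, mapped to physical servers by config.
        const NUM_BUCKETS = 4096; // stays constant even when hardware changes

        // Hypothetical config: bucket ranges -> physical DB hosts.
        // In real life this would live in a configuration file the admins edit.
        $bucketMap = [
            ['from' => 0,    'to' => 2047, 'host' => 'db1.example.local'],
            ['from' => 2048, 'to' => 4095, 'host' => 'db2.example.local'],
        ];

        function bucket_for_user(int $userId): int
        {
            // Deterministic and cheap to compute; never changes for a given user.
            // Assumes 64-bit PHP, where crc32() is non-negative.
            return crc32('user' . $userId) % NUM_BUCKETS;
        }

        function host_for_user(int $userId, array $bucketMap): string
        {
            $bucket = bucket_for_user($userId);
            foreach ($bucketMap as $range) {
                if ($bucket >= $range['from'] && $bucket <= $range['to']) {
                    return $range['host'];
                }
            }
            throw new RuntimeException("No server configured for bucket $bucket");
        }

        // If db2 dies, the admin points buckets 2048-4095 at another host in the map;
        // the hash function and the user's bucket number never change.
        echo host_for_user(123456, $bucketMap), "\n";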

    Say a user moves from one country to another. As long as we only have the deterministic part, we can only ever "map" that user to one specific shard — the function decides for us. We can't just say: this particular user is now moving from one shard to another. And we do have that problem, because we sometimes migrate users between data centers so that everything works faster. To do that, we have to use the most complicated kind of mapping: we bind each user to a specific shard through, roughly speaking, a separate store — a very fast store — so that we can relocate an individual user between shards.
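
    The per-user binding can be pictured as a fast key-value lookup sitting in front of the deterministic scheme. This is an illustrative sketch only; the lookup store, key scheme, and function names are hypothetical:

        <?php
        // Per-user shard mapping: an explicit assignment wins over the deterministic rule,
        // so an individual user can be relocated between shards / data centers.
        function shard_for_user_with_lookup(int $userId, Memcached $lookup, array $bucketMap): string
        {
            // 1. Explicit assignment kept in a very fast store (hypothetical key scheme).
            $assigned = $lookup->get('user_shard:' . $userId);
            if ($assigned !== false) {
                return $assigned; // e.g. "db7.us-east.example.local"
            }

            // 2. Fall back to the deterministic virtual-bucket rule from the sketch above.
            return host_for_user($userId, $bucketMap);
        }

        // Migrating one specific user then means updating the lookup record
        // (plus copying the data), without touching the hash function or the config.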

    That probably covers the whole range of approaches to horizontal database scaling: simple deterministic schemes, virtual buckets, and binding each user to their own shard.

    Presenters: By the way, a listener named Andrei asks about exactly this: "Alexey, I'm curious about your data storage architecture — cache and database?" The database part is clear now. But how do you determine that a user needs to be moved, that is, why would a particular user need to move to another shard? How do you decide?

    AR: I'll give a very simple case; there is nothing secret here. There are many examples, but this one is glaring and a little funny. We have two data centers, one in America and one in Europe. When we launched the American data center, we thought about whom we wanted to "land" there — obviously Americans, Canadians, and Latin Americans, in general everyone living in North and South America. Excellent! We lived like that for a while and then suddenly realized that Asia was very slow for us. We did some research and realized that it would be great if certain Asian countries were served from America rather than Europe. Now imagine: you have a country with several million active users already, and exactly those users have to be relocated from one data center to another...

    Presenters: Great, and now the system itself, with automated tools, lets you redistribute users flexibly and easily through this layer whenever you need to. That's excellent.

    AR: We designed it in from the start, yes.

    Presenters: Here's another scaling question. Let's get back to the comments, because listeners want to learn a lot from you, Alexey. Vyacheslav Fedotov asks: "Share your success stories — what kind of load have you handled?" And he's interested in a lot more: data sizes, query rates, processing speed, collection, storage, how long the system runs stably under that load, and so on.

    AR: All right, I'll try to list some key things; if I forget anything, just prompt me. What matters here is that in a large system some part is always failing — simply because there are so many subcomponents, and with some probability something breaks. It would probably be wrong to claim we maintain a whole string of nines, but our service is quite stable. Over a year the main features might be unavailable to users for maybe a few hours in total. As for timings, I'll talk about the average processing time of a single request.

    Requests differ, of course. A native iPhone or Android application generally "talks" over HTTP but, roughly speaking, transfers data using special binary protocols, so it receives data not as HTML but in a binary protocol as well. That is a very lightweight request — somewhere around milliseconds, maybe tens of milliseconds, of which roughly a third is CPU time and the rest is requests to various storage servers, authorization servers, and so on.

    For the site itself the requests are of course heavier, but we consider it normal if server-side script processing time is around 0.1-0.2 seconds, and 0.2 seconds is already a lot for us. Again, I'm not talking about CPU time but total time, which includes database queries where needed, cache requests, and so on. If that time exceeds 0.5 seconds on the server side, that's a signal for us to dig in and figure out what's wrong; until that threshold is crossed, we don't dig. Using that same Pinba mentioned above, we collect from almost every application node all the information about each request: the URL, how much CPU time, how much total time, what additional timers were recorded, and so on.
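
    For context, the Pinba PHP extension collects this kind of per-request data automatically and lets you wrap the expensive parts of a request in custom timers. A minimal sketch; the tag names and the loader function are just examples:

        <?php
        // Pinba records script time, CPU time and memory per request automatically
        // (assuming the pinba extension is loaded and pinba.server points at the Pinba host).
        // Custom timers mark out the expensive parts of a request.
        $timer = pinba_timer_start(['group' => 'db', 'op' => 'load_profile']); // example tags

        $profile = load_profile_from_db(42); // assumed application function

        pinba_timer_stop($timer);
        // At the end of the request the extension sends the aggregated data
        // to the Pinba server over UDP; pinba_flush() can force it earlier.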

    Presenters: Alexey, I also wanted to ask a related question: when there are a lot of database servers, how do you organize backups? How do you do it, what tools do you use, and do you have any automation for when a node fails — on-the-fly replacement or something else?

    AR: Several things are mixed up there; let's take them in order. First, backups. As a rule, when decisions like this are made for hundreds of servers, the choice is the following: do we want, in the event of some server failing, to let users instantly continue working with the system on another server? Or are we prepared to wait a while — say a disk died and the rebuild takes some time, or something just has to be replaced quickly after a power outage or a node failure? Why is this an important business question? Imagine a social network with 100 million users living on, say, a thousand servers. I'm making the numbers up, maybe too big, just so they're easy to count with. In that case one server hosts 0.001 of 100 million, i.e. 100,000 users. Then there are two options. Either we need not 1,000 servers but some considerably larger number, with hot-swap capability and standbys, so that a new server fully takes over the load and users are not affected at all. Or we are prepared to tolerate one-thousandth of our cluster being unavailable for a few hours. We decided that we are prepared to tolerate it. So if the database where some user lives suddenly fails completely, that user gets a polite message that we are resting right now, but in literally a few hours everything will be fine. The probability of such an event is rather small. Because here it is precisely a question of probabilities...

    Presenters: Well, yes, and there's also the question of cost — standby systems are expensive.

    AR: I wasn't really talking about backups there, but about hot swapping. Backups can be done by other means, with some lag. You can configure replication, but when we built our system we were essentially forced to make our own replacement for standard replication.

    In 2005 we had a lot of different ideas, and one of them was to make our replication truly our own, statement-based. So it happened that we now move data between data centers and between nodes through our own replication. It works well, but it can be somewhat slower than native replication. It is very convenient for debugging and everything else, but at the same time we lag behind by some number of minutes, sometimes hours under heavy load. So if a server suddenly fails and the data has to be fully restored from somewhere, the snapshot we restore from may be somewhat behind.
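
    The general idea of statement-based replication — as a toy illustration, not Badoo's actual system — is to record each write statement in a log and replay that log, in order, on the other side, accepting some lag:

        <?php
        // Toy illustration of statement-based replication: record write statements
        // in a log table, ship the log, replay it on the replica (not Badoo's code).
        function execute_and_log(PDO $master, string $sql, array $params): void
        {
            $master->beginTransaction();
            $master->prepare($sql)->execute($params);

            // Record the statement so another node or data center can replay it later.
            $log = $master->prepare(
                'INSERT INTO replication_log (sql_text, params, created_at)
                 VALUES (?, ?, NOW())'
            );
            $log->execute([$sql, json_encode($params)]);
            $master->commit();
        }

        function replay_log(PDO $replica, array $logRows): void
        {
            foreach ($logRows as $row) {
                // Replaying in the original order keeps the replica consistent,
                // but it naturally lags behind the master.
                $replica->prepare($row['sql_text'])
                        ->execute(json_decode($row['params'], true));
            }
        }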

    Again, since we are talking about data like user profile updates — I'm not talking about money now, money is a separate subsystem — we are prepared to tolerate that. In a sense this replication is a backup that runs continuously, all the time. On top of that, from time to time we take a hot backup, but the administrators do that with specialized MySQL tools.

    Presenters: Is that something basic like mysqldump, or something more sophisticated?

    AR: I don't want to give you wrong information now. If you're interested, ask me a bit later and I'll find out from the admins.

    Presenters: I see. The point is that it's done by hand, at specific moments when it's needed, right?

    AR: Yes, and the point is to avoid hauling data across the ocean, for example, and instead quickly get a local copy of it.

    Presenters: Here's another question: what is your data growth rate? In other words, how fast do you have to buy "truckloads" of hard drives?

    AR: In terms of data in the databases, we are not growing all that fast. Photos are a different story: we upload, I believe, several million photos to each data center per day, so photos eat up much more space than the databases do. We probably buy a few dozen machines — roughly 10-15 per year for each data center.

    Presenters: Alexey, coming back to the technology stack: on the front end, on the web servers, you mainly use PHP — and which web server did you choose, nginx?

    AR: Of course, nginx.

    Presenters: By the way, separate respect for PHP-FPM. I honestly didn't know — apparently I was still very young and green at the time. I remember when I joined the company where I work now, FlySoft, there was talk that FPM would be included in the default PHP distribution, and for me it was still unfamiliar territory that I was only starting to learn. But I remember that even then we switched completely away from Apache, used nginx and PHP-FPM, and request processing hummed along. What else is interesting: do you use any cloud services — say, Amazon S3 for storing static content, EC2 for auxiliary things? Some people spin up instances under peak load to give themselves at least some short-term headroom...

    AR: We do everything in our own data centers. Generally speaking, I'm not a great expert on clouds. All I can say is that if you use someone else's cloud, you will most likely be throttled on resources in one way or another, and you will never find out where and how. That's the whole problem. If you're running something that lives on one or two servers and then suddenly need ten for a short time, it may well work. But if you have thousands of servers, the need for cloud solutions is doubtful to me — unless what you run is very simple, you don't need staff at all, and you're willing to pay. As for where we have actually touched the clouds, oddly enough it's two things: first, CDN, and second, our development platform.

    The thing is, right at the entrance to our office there is a large server room blinking with lights. There are several racks — a small data center inside the office. We need it to replicate our entire infrastructure, including deployment, at least on some micro scale; otherwise we can't develop normally. And of course we actively use virtualization for this, so we have a small cloud of several dozen — soon probably more than a hundred — virtual machines playing different roles.

    Presenters: I see.

    AR: That's our development environment. As for CDNs, yes, we use some solutions there, but we now understand it is very hard to influence them. Say we want to do some very tricky research into connectivity — for example, why the average image load time for users in South Korea is what it is. It's quite difficult to get detailed statistics out of a big CDN and process them, simply because everyone wants to make money by providing a reasonably stable service. And when the fight is over fractions of a second of load time, you need a dedicated engineer on their side doing research for you — which, of course, nobody wants to do. So currently only part of our cluster goes through a CDN. We experiment and compare: we serve some content through our own points of presence, in some places we build local CDNs, elsewhere we use a third-party CDN, and we do all sorts of interesting comparisons.

    Presenters: Well, it's great that you have that micro-infrastructure inside your office. We're also weighing whether it's worth it; I've been looking at Vagrant, which seems like a good thing. What do you use for caching? You mentioned memcached, right?

    AR: Yes, the beautiful and "dumb" memcached is our everything.

    Presenters: Yeah, but have you looked at these newfangled things like Redis and so on?

    AR: Of course we have. But we haven't used them in production — Redis lived temporarily on some side projects. In production we are completely happy with memcached.

    Presenters: So you know it well, inside and out, all the subtleties of configuration and operation, which is why you chose it and keep using it — and you're happy with it, as I understand?

    AR: Yes — and honestly, it's so dumb, what is there to know? Its dumbness leaves nothing to misuse. That's generally one of the great engineering principles for building tools: make the tool so dumb that even an engineer with a great and rich imagination cannot use it in a way that (laughter)... that breaks everything.

    Presenters: Yeah, and that brings us (laughter) to another interesting question. Roman Squage asks — and I've had the same thoughts: "What approaches do you use in development? Do you try to write elegant, maintainable code, or dirty but faster code?" Let me rephrase a little: do you use things like an ORM for the database, all kinds of nice wrappers that add syntactic sugar and save keystrokes, or do you try to write lower-level code that runs faster? What's your view? I know you gave a talk where you said some unkind things about questionable architectural choices like ORMs. Tell us about it — how you see it, what you use...

    AR: Let's split the question into two parts — the ORM goes into the second part, because it is a completely separate and fundamental story that has nothing to do with the elegant-versus-not-elegant development methodology. The question can't be answered unambiguously, because what does "make it beautiful" even mean? Beautiful takes a long time. So if something breaks in production, the approach most often used is this: if something very important has broken — and yes, the code should still be beautiful, because new developers join all the time and should be able to figure it out easily — we fix it dirty, but the cleanup immediately goes into the medium-term refactoring backlog. It is the manager's job to "smooth out" the situation so that on the one hand the business doesn't suffer, and on the other hand technical debt doesn't accumulate in the long run.
    An engineer in that situation will often say: "No, let it all stay broken, I'll do it properly." I'm exaggerating, of course, but it often happens that an engineer wants to do everything well from the start and doesn't want to do it crookedly. Sometimes you first need to do it crookedly, but agree at the same time to schedule the refactoring for the near future — that's the best way.

    To be continued ...
    We'd be glad to hear your feedback on whether this format is interesting and whether we should keep publishing transcripts.

    Listen to the full podcast

    Download the podcast issue
