Scalable nginx configuration

    Igor Sysoev (isysoev)


    My name is Igor Sysoev, I am the author of nginx and co-founder of the company of the same name.

    We continue to develop the open source product. Since the company was founded, the pace of development has increased significantly, because many people now work on it. Alongside the open source product, we provide paid support.

    I will talk about the scalable configuration of nginx, but this is not about how to serve hundreds of thousands of simultaneous connections, because nginx does not need special configuration for that: you set an adequate number of worker processes or put it in "auto" mode, set worker_connections to 100,000, and after that tuning the kernel matters far more than tuning nginx itself. So I will talk about a different kind of scalability: the scalability of the nginx configuration, i.e. how to let the configuration grow from hundreds of lines to several thousand while spending minimal (ideally constant) time maintaining it.
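
    For reference, a minimal sketch of the kind of tuning meant here (the numbers are illustrative, not a recommendation):

        worker_processes auto;              # one worker process per CPU core

        events {
            worker_connections 100000;      # per worker; after this, kernel limits (ulimit -n etc.) matter more
        }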



    Why did such a topic arise at all? About 15 years ago I started working at Rambler, administering servers, Apache in particular. And Apache has an unpleasant property that is well illustrated by the following two configurations:
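
    The original slides are not reproduced in this transcript; a hypothetical reconstruction of the kind of pair of Apache configurations meant (my own sketch, not the actual slides) could be:

        # Variant 1
        <Location "/admin/stat">
            DirectoryIndex index.html
        </Location>
        <Location "/admin">
            DirectoryIndex index.php
        </Location>

        # Variant 2: the same sections in the opposite order
        <Location "/admin">
            DirectoryIndex index.php
        </Location>
        <Location "/admin/stat">
            DirectoryIndex index.html
        </Location>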



    There are two locations, and they appear in a different order. The same request, depending on which configuration is used, will be served by different files: either a php file or an html file. That is, with an Apache configuration the order matters. And this cannot be turned off: when processing a request, Apache walks through all the locations, finds every one that matches the request in some way, and collects configuration from all of them. It merges them and, in the end, uses the result.

    This is convenient when the configuration is small; it lets you make it even smaller. But as you grow, you run into problems. For example, adding a new location at the end works fine, but sooner or later you need to change something in the middle or throw out a location that is no longer relevant. Then you have to review the entire configuration after those locations to make sure everything still works as before. The configuration turns into a house of cards: pull out one card and you can bring down the whole structure.

    To add some extra hell to the configuration, Apache has several kinds of sections that work the same way; they are processed in a particular order, and a single resulting configuration is assembled from all of them. All of this happens at runtime, i.e. if you have many modules, each module merges its configuration (this partly explains why nginx is faster than Apache in some tests: nginx does not merge configuration at runtime). Some of these sections and most directives can also be placed in .htaccess files scattered across the site, and to make your and your colleagues' lives even more interesting, these files can be renamed, so go and find that configuration...

    And the cherry on top is RewriteRule, which lets you make the configuration look like sendmail's. Few people appreciate the joke because, fortunately, most do not know what that is.

    RewriteRule is generally a nightmare. Many administrators come not so much with an Apache background as with a background of administering Apache on shared hosting, where the only administration tool was .htaccess. And in it they build very intricate RewriteRules that are very hard to understand, both because of the syntax and because of the logic.

    This was one of Apache's drawbacks that annoyed me a lot: it did not allow building large enough configurations. While developing nginx I wanted to change this; I fixed many of Apache's annoying features and added some of my own. This is what the previous example looks like in nginx:
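
    Again the slide is not reproduced; an approximate nginx counterpart of the hypothetical Apache sketch above would be:

        location /admin/stat {
            index index.html;
        }

        location /admin {
            index index.php;
        }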



    Unlike Apache, regardless of the order of the locations the request will be processed the same way, because nginx looks for the longest match among the prefix locations (those not given by a regular expression) and selects that location. The configuration of the selected location is used, and all other locations are ignored. This approach lets you write configurations with hundreds of locations without thinking about how each one affects the rest, i.e. you get a kind of containers.

    Let's look at how nginx selects the configuration it will use to process a request. The first step is to find a suitable server block. The selection is based first on the address and port, and then on the server names associated with that address and port.

    If you want, say, to put a server on several addresses and give it many server names, and you want all of those names to work on all of the addresses, then you need to duplicate the names for all the addresses. After the server is selected, a suitable location is searched for inside it. First all prefix locations are checked and the longest match is found; then the locations given by regular expressions are checked. Since there is no notion of a "longest" match for regular expressions, the location whose regular expression matches first is selected, and its configuration is used. If no regular expression matches, the configuration found earlier, the one with the longest matching prefix, is used.
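
    As a minimal illustration of this selection order (the names and addresses are hypothetical):

        server {
            listen 80;
            server_name example.com;

            location / {                      # prefix location, matches everything
                root /data/www;
            }

            location /images/ {               # longer prefix: wins over "/" for /images/...
                root /data/static;
            }

            location ~ \.php$ {
                # regexes are checked only after the longest prefix is found;
                # the first matching regex overrides the prefix match
                fastcgi_pass 127.0.0.1:9000;
            }
        }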

    Regular expressions add order dependence and thus create configurations that are hard to maintain. But life is more complicated than theory: very often the site structure is a dump of static files, scripts and so on, and in that case the only way to route all the requests is regular expressions.
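
    The slide is not reproduced here; the kind of regex-only configuration being criticized probably looked something like this (my own sketch):

        # an order-dependent pile of regular expressions: how not to do it
        location ~ ^/(images|css|js)/    { root /data/static; }
        location ~* \.(gif|jpe?g|png)$   { root /data/img; }
        location ~ \.php$                { fastcgi_pass 127.0.0.1:9000; }
        location ~ ^/download/.+\.zip$   { root /data/files; }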



    That is an illustration of how not to do it. If you already have sites like that, you should rework them.

    What I described as the configuration processing order was the original design. Later it became possible to describe locations inside a location, i.e. nested locations, and the order was adjusted a bit. That is, first the longest matching prefix location is found, then the longest matching prefix location inside it, and so on. This recursive search continues until we reach a location that has nothing nested inside.

    After that, locations given by regular expressions are checked in reverse order of nesting: having entered the most deeply nested location, we see whether there are regular expressions there; if none matches, we go back up a level, and so on. Again, the first matching regular expression location "wins". This approach makes the following processing possible:
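
    The slide is not reproduced; a sketch of the kind of configuration meant (the backend addresses are hypothetical):

        location /admin/ {
            location ~ \.php$ {
                fastcgi_pass 127.0.0.1:9001;   # a separate "admin" backend
            }
        }

        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;       # php handling for everything else
        }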



    Here we have two locations with regular expressions, but for the request /admin/index.php the nested (first) location will be selected, not the second.

    In addition, the regular expression stage of the search can be disabled by marking a prefix location with the ^~ modifier:
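
    A minimal sketch of the ^~ modifier (the paths are hypothetical):

        location ^~ /img/ {
            # if /img/ is the longest matching prefix, regular expression
            # locations are not checked at all, so /img/x.php is served as a plain file
            root /data/www;
        }

        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;
        }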



    This means that if this location gives the longest prefix match, regular expressions will not be checked after it.

    Very often people try to make the configuration smaller, i.e. they factor out some common part of the configuration and simply redirect requests into it. Here, for example, is a very bad way to funnel everything into php processing:
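
    The slide is not reproduced; an approximation of this "shortcut" (my own sketch) is to funnel every location through one shared php block with rewrites:

        location ~ \.php$ {
            # the single "common part" everything below is funneled into
            root          /data/www;
            include       fastcgi_params;
            fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
            fastcgi_pass  127.0.0.1:9000;
        }

        location /forum/ {
            rewrite ^ /forum/index.php last;
        }

        location /shop/ {
            rewrite ^ /shop/index.php last;
        }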



    Nginx has other ways to factor out common parts of a configuration. First of all, configuration is inherited from the enclosing level. For example, we can enable sendfile at the http level for all servers and all locations:
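
    A minimal sketch of this inheritance (the names are hypothetical):

        http {
            sendfile on;                    # inherited by every server and location below

            server {
                listen 80;
                server_name example.com;

                location / {
                    root /data/www;
                }

                location /nfs/ {
                    root     /mnt/nfs;
                    sendfile off;           # overridden where the filesystem does not support it
                }
            }
        }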



    This setting is inherited by all the servers and locations below. If we need to cancel sendfile somewhere, because, for example, the file system does not support it, or for some other reason, we can turn it off in a specific location or a specific server.

    Or, for example, we can set a common root for a server and override it only where needed.

    This approach differs from Apache in that we know the specific places to look for the common parts that may affect our location.

    The only thing that cannot be shared this way: a location cannot be described at the http level. This was done deliberately. In Apache it is possible, but it causes a lot of problems in practice.

    Personally, I prefer to describe locations explicitly in the configuration. If you do not want to do this, you can pull them in from an external file with include.
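
    For example (the file name is hypothetical):

        server {
            listen 80;
            server_name example.com;

            include /etc/nginx/common-locations.conf;   # a file containing the shared location { ... } blocks
        }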



    Now I would like to talk about why people want to write less, i.e. why they factor out shared configuration. They believe they will spend less effort. In reality, people want to write less rather than spend less time. They do not think about the future; they assume that if they write less now, it will stay that way...

    The right approach is to use copy-paste. That is, a location should contain all the directives needed to process it.

    The usual argument of DRY (Don't Repeat Yourself) fans is that if you need to fix something, you fix it in one place and everything is fine.

    In fact, modern editors have find-and-replace. If you need, for example, to correct the backend's name or port, or change the root, or the header passed to the backend, and so on, you can safely do it with find-and-replace.

    To understand whether a parameter needs to be changed in a given place, a couple of seconds is enough. Say you have 100 locations and you spend 2 seconds on each: 200 seconds in total, about 3 minutes. That is not a lot. But when you later have to detach some location from the shared part, it will be much harder: you will need to figure out what to change, how it affects the other locations, and so on. So for the nginx configuration, use copy-paste.

    Generally speaking, administrators do not like to spend a lot of time on their configurations. I am like that myself. An administrator may have 2-3 favorite products he is happy to tinker with a lot, while there are a dozen other products he does not want to spend time on. For example, I have mail on my personal site: Exim and Dovecot. I do not like administering them. I just want them to work, and if something needs to be added, it should take no more than a couple of minutes. I am simply too lazy to learn their configuration, and I think most nginx administrators are the same: they want to administer nginx as little as possible; what matters is that it works. If you are that kind of administrator, use copy-paste.

    Here are examples of how to turn short, non-scalable configurations into what you need:
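
    The slide is not reproduced; a sketch of the kind of "short" regex configuration meant (the paths are hypothetical):

        location ~ ^/(img|css|js)/ {
            root /data/www;
        }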



    Here a person thinks: I wrote a regular expression, there are only a few of them, everything is fine. In fact, precisely because it is a regular expression it is bad: it can affect everything else. So personally I do this:
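
    A sketch of the same thing with explicit prefix locations (the same hypothetical paths):

        location /img/ {
            root /data/www;
        }

        location /css/ {
            root /data/www;
        }

        location /js/ {
            root /data/www;
        }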



    If this root is common to all locations, or at least used in most of them, you can go even further:
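
    A sketch of that variant: the root moves to the server level and the locations stay empty:

        server {
            listen 80;
            server_name example.com;
            root /data/www;

            location /img/ {
            }

            location /css/ {
            }

            location /js/ {
            }
        }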



    This is, by the way, a perfectly legal configuration: a completely empty location.

    The second way to avoid copy-paste is this example:
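
    The slide is not reproduced; a sketch of the kind of nested configuration meant (the paths and backend are hypothetical):

        location /admin/ {
            auth_basic           "admin";
            auth_basic_user_file conf/htpasswd;

            location ~ \.php$ {
                # inherits the auth_basic settings of the enclosing location
                fastcgi_pass 127.0.0.1:9000;
            }
        }

        location ~ \.php$ {
            # php handling for the rest of the site, without authorization
            fastcgi_pass 127.0.0.1:9000;
        }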



    Administrators who used to work with Apache expect that /admin/index.php will require authorization. In nginx this does not happen by itself, because index.php is handled by one location, and location /admin is a completely different one. But you can make a nested configuration, as above, and then index.php will naturally require authorization.

    Often regular expressions are needed to "bite off" some parts of the URL and use them in processing. This is the bad way to do it:
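
    The slide is not reproduced; a sketch of the bad variant (the URL scheme is hypothetical), with a site-wide regular expression capturing pieces of the URL:

        location ~ ^/img/(\d+)/(.+)$ {
            # the captured directory ($1) and file name ($2) are used to build the path
            root      /data/storage;
            try_files /$1/$2 =404;
        }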



    The right way is to use nested locations; this isolates the regular expressions from the configuration of the rest of the site, i.e. control will never leave the location /img/ shown here:
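
    A sketch of the same thing with the regular expression isolated inside the prefix location:

        location /img/ {
            # requests that do not start with /img/ never reach this regular expression
            location ~ ^/img/(\d+)/(.+)$ {
                root      /data/storage;
                try_files /$1/$2 =404;
            }
        }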



    Another place where it is safe to use regular expressions in nginx is map, i.e. forming variables from other variables, using regular expressions and so on:
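
    A sketch of a map (the variable, the patterns and the upstream names are hypothetical):

        map $uri $upstream_pool {
            default                  web;
            ~^/api/                  api;
            ~*\.(gif|jpe?g|png)$     static;
        }

    With upstream blocks named web, api and static defined elsewhere, the resulting variable can then be used, for example, as proxy_pass http://$upstream_pool;.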



    I have said nothing about using rewrites, because they should not be used at all. If you truly cannot avoid them, then do them on the backend side.

    if ("if is evil") is also not a recommended construct in nginx, because maybe ten people in the world know how if works inside, and you are hardly one of them.

    Here is a configuration with two ifs, both of which are true:
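
    The slide is not reproduced; a sketch in the same spirit (my own construction, using directives that are allowed inside if):

        location / {
            set $true 1;

            if ($true) {
                gzip off;
            }

            if ($true) {
                access_log off;
            }

            # only the configuration of the last matching if is applied:
            # access_log ends up off, while gzip stays on
        }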



    You would expect both gzip and etag to be turned off. In fact, only the last if takes effect.

    There is one safe use of if: when you use it to return a response to the client. You can use rewrite for this, but I do not like it; I use return (it lets you specify the response code, and so on):
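
    A minimal sketch of this safe pattern (the condition is hypothetical):

        if ($http_user_agent ~* "badbot") {
            return 403;
        }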



    Let's summarize:

    • It is advisable to use only prefix locations;
    • avoid regular expressions; if they are still needed in the configuration, it is better to isolate them;
    • use maps;
    • do not listen to people who say DRY is a universal paradigm. It is good when you love a product or you are programming it. If you just need to make your administrative life easier, copy-paste is for you; your friend is an editor with good find-and-replace;
    • do not use rewrites;
    • use if only to return some kind of response to the client.

    Question from the audience: If I rewrite http to https, where is it better to do it: in nginx or on the backend?

    Answer: Do it in nginx. Ideally, you make two servers. One is a plain-text server that does nothing but redirects. It is literally a few directives: listen on the port, server_name if needed, and a return 301 or 302 to https with the request URI appended. Even rewrite is not needed there; use return.
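
    A sketch of such a plain-text server (the name is hypothetical):

        server {
            listen      80;
            server_name example.com;
            return      301 https://example.com$request_uri;
        }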

    If you want something more complicated, you can put an if somewhere. Suppose you serve some of the locations over plain text: describe them, for example, with regular expressions in a map, and redirect everything else to https. Or, conversely, put one if inside each location.

    Question from the audience: Thanks for nginx. I have a somewhat playful question. Are you planning to add a startup or compile-time option that would forbid the use of the include directive, of if, and of regular expressions in locations?

    Answer: No, hardly. We usually add directives and improve them, and sometimes deprecate them: for a while they keep working but print a warning to the log, before disappearing completely. We are unlikely to do what you suggest; we would rather write a good user guide, perhaps based on the material of this talk.

    Question from the audience: A usual wish that comes up when using if: if can be used in server and in location, while map cannot. Why is that?

    Answer: All variables in nginx are computed on demand, i.e. if a map is described at the http level, it does not mean that the variable will necessarily be computed while processing a request. Map is there to map one thing into another, and the resulting variable can then be used in an if, inside some expression, in a proxy_pass somewhere, and so on. Map is more like a declaration... Maybe it would make sense to allow them inside server, to make them local to the server, if you want the same variable name in different servers. It was simply harder to program that way, so they were kept global at the http level. There are no variables in nginx that are local to a server.

    Performance-wise there is no problem, it is just an inconvenience. You would have to make, say, three servers and three maps, and prefix the variable names with "server such-and-such"... You can, in principle, describe each map right before its server: one map before the first server, the next before the second, and so on. Then you will not have to jump up and down the config; they will be close to their servers.

    Question from the audience: I am not clear on the logic of how return works. Could you tell us where it is worth using return instead of rewrite, perhaps some specific use cases?

    Answer: In general, a rewrite is replaced by this construction: a location with a regular expression, in which you can make captures (selections), plus the return directive. That is, the left part of the rewrite becomes the location, and the right part is what goes into return after the response code. Return can send back different response codes, whereas rewrite can only return 301 or 302 to the client. Return can return 404 with some kind of body, or 200, or 500, and it can return a redirect. In the body you can use variables, write something. If it is 301 or 302, then it is not a body but the URL to redirect to. In general, return has richer functionality.
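
    A sketch of the substitution described (the URLs are hypothetical):

        # with rewrite
        rewrite ^/old/(.*)$ /new/$1 permanent;

        # the same with a regular expression location and return
        location ~ ^/old/(?<rest>.*)$ {
            return 301 /new/$rest;
        }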

    Question from the audience: I have a practical question. Nginx can be used as a mail proxy. Is it possible to give a mail client SMTP access, send a letter through that mail client, and have nginx intercept the data and hand it to a script, bypassing the mail server? Right now we solve this task with postfix: it intercepts the message and then hands it to a script where the processing happens.

    Answer: I doubt this can be done with nginx. I can briefly describe the functionality of the SMTP proxy in nginx. It can do the following: an SMTP client connects to it and presents its authentication; nginx calls an external auth script, checks the username and password, and the script either says to let the client through to certain servers (and tells it which ones) or not to let it through. That is all it can do. If it decides to let the client through, nginx connects to that server via SMTP and hands the client over to it. Whether that fits your scenario, I cannot say. Unlikely.

    The SMTP proxy with authorization appeared because Rambler has a dedicated server for mail clients, through which those clients send mail. It turned out that about 90% of the connections were not Rambler's clients but spam and viruses. In order not to load the postfixes and not to spawn unnecessary processes, nginx was put in front of them to check whether a client presents valid authentication data. That is really what it was made for: simply to fend off the "garbage" clients.

    Question from the audience: You mentioned containers today. This is certainly a promising approach, but it implies a changing topology and dynamic configuration. Today that leads people to build external "crutches" that react to topology-change events, generate an up-to-date nginx config from some template, swap it in and kick nginx to re-read the config. Does the company have any development plans toward containerization, i.e. toward providing more convenient and natural tools for this trend?

    Answer: It depends on what you mean by containers here. When I spoke about containers, I was making a comparison: I said that these locations look isolated from one another.

    Question: I mean Docker, the ability to run backends in containers that are dynamically started on different hosts, so that, roughly speaking, we need to add a new host to the load balancing...

    Answer: In NGINX Plus, one part of the advanced load balancing lets you add servers to an upstream dynamically. You do not need to reload the nginx config; it is all done on the fly, there is an API for it.

    Active health checks are also included. When regular open source nginx connects to a backend and the backend does not respond, nginx stops sending requests to it for some time, so there is a kind of health check there too, but clients suffer. If 50 clients go to one backend at the same time and it is down, or drops off after a 5-10 second timeout, those clients will see it, and only after that will they be sent to another upstream server. In NGINX Plus we have proactive backend probing: the backends themselves are tested, and clients are simply not sent to backends that have failed.

    Question: And since there is an active health check, maybe you have already built a nice JSON status page that can be parsed?

    Answer: Yes, we have monitoring; it is available in JSON form and also as a nice HTML page.

    Contacts


    » Isysoev
    » Nginx Blog

    This report is a transcript of one of the best talks at HighLoad++, the conference for developers of highly loaded systems. We are now actively preparing the 2016 conference: this year HighLoad++ will be held in Skolkovo on November 7 and 8.

    This year the DevOps section was prepared by a separate program committee supervised by Express42. There are a dozen talks, including one by Maxim Dunin on news from the world of nginx.

    Some of these materials are also used in HighLoad.Guide, our online training course on developing highly loaded systems: a series of specially selected emails, articles, materials and videos. The course already contains more than 30 unique materials. Join in!
