NGINX - a story of rebirth under Windows

    Since we seem to be having an nginx “week” here, for example here or here, I’ll try to make my own contribution, so to speak. It will be about nginx for Windows, namely about a more or less official build for this proprietary and, for some, not particularly beloved platform.

    Why Windows? It's simple: in the corporate sector Windows is the norm on servers, and on workstations it is often a mandated one. And when the platform is a requirement, for example one voiced by the client in ultimatum form, there is no getting around it.
    And since we are stuck with Windows, but I have no desire to torment myself with IIS, Apache and the like, and I want to use my favorite tools, and nginx is definitely one of them, I sometimes have to put up with certain limitations on this platform. Or rather, I used to...

    Although it should be noted that even with these limitations, nginx beats almost any other web server under Windows on many counts, including stability, memory consumption and, most importantly, performance.

    I hasten to share the good news right away: there are now practically no restrictions left that are critical for high performance when running nginx under Windows, and the last critical one will, in all likelihood, soon disappear as well. But first things first...

    The known problems of nginx for Windows are described here, namely:

    • A worker process can handle no more than 1024 concurrent connections.
    • The cache and other modules that require shared memory support do not work under Windows Vista and later, because address space layout randomization is enabled on these versions of Windows.
    • Although several worker processes can be started, only one of them actually does any work.

    I have changed the order a bit, because it was in this sequence that I tackled these limitations; they are sorted, so to speak, “historically”.

    1024 concurrent connections


    In fact, this is not true, or rather, not quite true: since time immemorial nginx could be built under Windows without this restriction; you just had to define FD_SETSIZE at build time equal to the number of connections you need.
    For example, building with VS and adding the directive --with-cc-opt=-DFD_SETSIZE=10240, an nginx worker will be able to manage 10K simultaneous connections, provided you specify worker_connections 10240; in the configuration.
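    To make the mechanics concrete, here is a minimal sketch (my illustration, not nginx source) of what the flag achieves: on Windows, FD_SETSIZE only takes effect if it is defined before winsock2.h is included, which is exactly what passing it on the compiler command line guarantees.

    /* Minimal sketch: on Windows, fd_set is a fixed-size array of SOCKET
     * handles whose capacity is FD_SETSIZE (64 by default in winsock2.h).
     * Defining it before <winsock2.h>, as -DFD_SETSIZE=10240 does for every
     * translation unit, lets a select()-based loop track 10240 sockets. */
    #define FD_SETSIZE 10240
    #include <winsock2.h>

    int main(void)
    {
        fd_set rset;

        FD_ZERO(&rset);
        /* FD_SET()/select() can now operate on up to 10240 sockets. */
        return 0;
    }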

    Cache and other modules requiring shared memory support


    Until recently, none of these functions and modules really worked under Windows, starting with the x64 versions, or wherever the entire system runs with ASLR enabled by default.
    Moreover, disabling ASLR for nginx alone changes nothing, because the shared memory machinery is wired deep into the kernel, i.e. ASLR (and with it apparently DEP, which for some reason did not get along with it either) would have to be disabled for the entire system.

    This is actually no small amount of functionality: the cache, zones of any kind and, accordingly, limit_req, etc., etc. Incidentally, without shared memory support it would have been much harder to remove the third restriction, i.e. to implement support for multiple workers under Windows.

    I will not bore the reader with how I struggled with this; suffice it to say that together with Max (thanks, mdounin) we got it into the release version. Those interested can find a little about it under the spoiler, or in the sources at hg.nginx.org or github...
    A bit of theory ...
    Shared memory as such can be used even with address space randomization; one does not really interfere with the other. It is just that with ASLR enabled you are almost guaranteed to get a view of the “same memory” in another process at a different address. That in itself is not critical, as long as the contents of the region do not hold raw pointers: offsets relative to the start address of the shmem are fine, but direct pointers as such are not.
    Accordingly, short of rewriting all the nginx functionality that keeps pointers inside shared memory, there is only one option: to trick Windows into handing out the shmem view at a constant address in all worker processes. After that, everything is actually not that difficult.
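    For illustration, a minimal sketch of this trick under stated assumptions: the base address constant and the function name below are mine, not nginx's, but the mechanism, a named file mapping created with CreateFileMapping and then mapped with MapViewOfFileEx at an explicit address, is the one described above.

    #include <windows.h>

    /* Hypothetical fixed base address; the real implementation must also
     * handle the case where the range is already occupied in some process
     * (the remapping problem mentioned below). */
    #define SHM_BASE_ADDR ((LPVOID) 0x2EFE0000)

    /* Map a named shared zone at the same address in every process. */
    static void *map_shared_zone(const char *name, DWORD size, HANDLE *out)
    {
        HANDLE  h;
        void   *addr;

        /* Named mapping: master and workers open the same kernel object. */
        h = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                               0, size, name);
        if (h == NULL) {
            return NULL;
        }

        /* The last argument pins the base address, so raw pointers stored
         * inside the zone stay valid across processes despite ASLR. */
        addr = MapViewOfFileEx(h, FILE_MAP_ALL_ACCESS, 0, 0, 0, SHM_BASE_ADDR);
        if (addr == NULL) {
            CloseHandle(h);
            return NULL;
        }

        *out = h;
        return addr;
    }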
    You can read the beginning of the discussion about this here. Maxim, by the way, fixed a problem I had missed (remapping), which sometimes occurs after the workers are restarted (reload on the fly).
    Viva open source!

    I.e. officially, this restriction no longer applies as of release 1.9.0 dated Apr 28, 2015:
    Feature: shared memory can now be used on Windows versions with 
    address space layout randomization.
    

    Only one worker process really works


    Nginx has a master process and child processes called workers.
    Under Windows, nginx can start several worker processes, i.e. specifying "worker_processes 4;" in the configuration will make the master start four child workers. The problem is that only one of them, having "stolen" the listening socket from the master (via SO_REUSEADDR), will actually listen on that socket, i.e. accept incoming connections. As a result, the other workers get no incoming connections and hence no work.
    This limitation stems from the technical implementation of Winsock, and the only way to get a listening socket shared across all worker processes on Windows is to clone the socket from the master process, i.e. to use a handle inherited from its socket.
    Those who are interested in the implementation details can find them under the spoiler or in the source, so far only at my github.
    More details
    To begin with, even if you start the child processes (CreateProcess) with bInheritHandle=TRUE and create the socket with SECURITY_ATTRIBUTES::bInheritHandle set to TRUE, you will most likely not succeed: using this handle in the worker process yields “failed (10022: An invalid argument was supplied)”. And if you “successfully” duplicate this socket with DuplicateHandle, the duplicated handle will likewise be rejected by every socket function (typically with error 10038, WSAENOTSOCK).
    Why this happens is revealed by a quote from MSDN on DuplicateHandle:
    Sockets. No error is returned, but the duplicate handle may not be recognized by Winsock at the target process. Also, using DuplicateHandle interferes with internal reference counting on the underlying object. To duplicate a socket handle, use the WSADuplicateSocket function.
    The problem is that to duplicate a handle via WSADuplicateSocket you need to know the pid of the target process in advance, i.e. it cannot be done before that process has been started.
    As a result, to pass the child process the information that the master obtains from WSADuplicateSocket, which is needed to create a clone of the socket in the worker, we have two options: either use some kind of IPC, for example as described in MSDN for WSADuplicateSocket, or transfer it via shared memory (which, fortunately, we have already fixed above).
    I went with the second option, since I believe it is the less laborious of the two and the fastest way to implement listener inheritance.
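    Here is a minimal sketch of the two sides of this cloning (the function names and the shared-memory plumbing are simplifications of mine; the Winsock calls themselves are the real ones):

    #include <winsock2.h>

    /* Master side: serialize the listening socket for one specific worker.
     * The resulting WSAPROTOCOL_INFOW is what gets stored in shared memory
     * for that worker (the "shinfo" of the algorithm below). Returns 0 on
     * success; fails if worker_pid is unknown, hence the requirement to
     * spawn the worker first. */
    static int share_listener(SOCKET ls, DWORD worker_pid,
                              WSAPROTOCOL_INFOW *shinfo)
    {
        return WSADuplicateSocketW(ls, worker_pid, shinfo);
    }

    /* Worker side: rebuild a fully functional listening socket from the
     * protocol info read out of shared memory. */
    static SOCKET clone_listener(WSAPROTOCOL_INFOW *shinfo)
    {
        return WSASocketW(FROM_PROTOCOL_INFO, FROM_PROTOCOL_INFO,
                          FROM_PROTOCOL_INFO, shinfo, 0, WSA_FLAG_OVERLAPPED);
    }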
    Below are the changes to the worker startup algorithm under Windows (new steps are marked with *); a rough sketch of the ready-event handshake follows the list:
    • The master process creates all the listening sockets;
    • [cycle] The master process spawns a worker process;
    • *[win32] The master calls a new function ngx_share_listening_sockets: for each listening socket, inheritance information is requested specifically for this new worker (“cloned” via WSADuplicateSocket for its pid) and stored in shared memory as shinfo (a protocol structure);
    • The master process waits until the worker signals readiness via the event “worker_nnn”;
    • *[win32] The worker process calls a new function ngx_get_listening_share_info to obtain the shinfo inheritance information, which will be used to create a new socket descriptor for the master's shared listening socket;
    • *[win32] The worker process creates all its listening sockets using the shinfo information from the master;
    • The worker process signals the event “worker_nnn”;
    • The master process stops waiting and spawns the next worker, repeating [cycle].
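    The sketch below shows what the master's side of that handshake could look like in Win32 terms; the event name format and error handling are assumptions of mine, not the actual nginx code.

    #include <windows.h>
    #include <stdio.h>

    /* Master side: create the per-worker ready event and wait on it after
     * publishing the shinfo structures; the worker opens the same named
     * event and signals it once its listening sockets have been cloned. */
    static int wait_for_worker_ready(DWORD worker_pid, DWORD timeout_ms)
    {
        char    name[64];
        HANDLE  ev;
        DWORD   rc;

        /* Assumed name format "worker_nnn", derived from the worker's pid. */
        sprintf(name, "worker_%lu", (unsigned long) worker_pid);

        ev = CreateEventA(NULL, TRUE, FALSE, name);  /* manual-reset, unset */
        if (ev == NULL) {
            return -1;
        }

        rc = WaitForSingleObject(ev, timeout_ms);    /* worker: SetEvent(ev) */
        CloseHandle(ev);

        return (rc == WAIT_OBJECT_0) ? 0 : -1;
    }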

    If anyone needs it, here is a link to the discussion of the fix, just so it's on record.

    As a result, nginx under Windows now starts N worker processes that are “full-fledged” in terms of listening and, most importantly, accepting connections, and they handle incoming connections truly in parallel.

    Admittedly, this fix is still sitting as a pull request (I sent the changeset to nginx-dev), but you can already try it, for example by downloading it from my github and building it yourself under Windows. If there is interest, I will post a binary somewhere.

    I tortured my hardware for quite a while, hammering it with tests and load scripts; the result: all the workers are loaded more or less evenly and really do work in parallel. I also reloaded nginx on the fly (reload) and randomly “killed” individual workers to simulate a crash; everything works without the slightest complaint.
    So far only one “flaw” has surfaced, IMHO: if you run
    netstat /abo | grep LISTEN
    
    then you will see only the master process in the list of “listeners”, although in reality it never accepts a connection itself; only its child worker processes do.

    By the way, my experience so far suggests that on the Windows platform you probably want to disable accept_mutex ("accept_mutex off;"), because at least on my test systems the workers ran noticeably slower with accept_mutex enabled than with it disabled. But I think everyone should verify this experimentally, since it depends on a host of parameters: the number of cores, workers, keep-alive connections, etc., etc.

    And what would this be without pretty tables of performance comparison numbers, before (the first column, marked **NF) and after.
    The test was run on Windows 7, i5-2400 CPU @ 3.10GHz (4 cores).
    Request: static content, 452 bytes (+ headers), small gif icons.
    Workers x Concurrency    1 x 5 (**NF)       2 x 5              4 x 5              4 x 15
    Transactions             5624 hits          11048 hits         16319 hits         16732 hits
    Availability             100.00 %           100.00 %           100.00 %           100.00 %
    Elapsed time             2.97 secs          2.97 secs          2.97 secs          2.96 secs
    Data transferred         2.42 MB            4.76 MB            7.03 MB            7.21 MB
    Response time            0.00 secs          0.00 secs          0.00 secs          0.00 secs
    Transaction rate         1893.60 trans/sec  3719.87 trans/sec  5496.46 trans/sec  5645.07 trans/sec
    Throughput               0.82 MB/sec        1.60 MB/sec        2.37 MB/sec        2.43 MB/sec
    Concurrency              4.99               4.99               4.99               14.92
    Successful transactions  5624               11048              16319              16732
    Failed transactions      0                  0                  0                  0
    Longest transaction      0.11               0.11               0.11               0.11
    Shortest transaction     0.00               0.00               0.00               0.00

    And may nginx be with you, under Windows too.
