Translation: How GitLab uses Unicorn and unicorn-worker-killer

Original author: Jacob Vosmaer
This is a translation of a short article in which GitLab engineers explain how their application runs on Unicorn and how they deal with leaking memory. It can be read as a simplified version of another article by the same author that has already been translated on Habr.

Unicorn


To handle HTTP requests from Git and from users, GitLab uses Unicorn, a prefork Ruby HTTP server. Unicorn is a daemon written in Ruby and C that can load and run a Ruby on Rails application, in our case GitLab Community Edition or GitLab Enterprise Edition.

Unicorn has a multi-process architecture, both to make use of multi-core systems (processes can run in parallel on different cores) and for fault tolerance (a crashed process does not bring down GitLab). On startup, the Unicorn master process loads Ruby and GitLab into memory and then launches a number of worker processes that inherit this "initial" memory image. The master process does not handle incoming requests itself; the worker processes do. The operating system's network stack accepts incoming connections and distributes them among the worker processes.
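
To make the prefork model more concrete, here is a toy sketch in plain Ruby (not Unicorn's actual code): a master process opens a listening socket and loads some "application state" once, then forks workers that inherit both and serve connections, while the master only supervises.

require 'socket'

server   = TCPServer.new('127.0.0.1', 8080)   # listening socket opened by the master
app_data = 'expensive application state'      # "initial" memory shared with workers via fork

3.times do
  fork do
    # Each worker inherits `server` and `app_data` from the master process.
    loop do
      client = server.accept
      client.write "HTTP/1.1 200 OK\r\n\r\nhandled by worker #{Process.pid} (#{app_data})\n"
      client.close
    end
  end
end

Process.waitall   # the master never serves requests; it only supervises its workers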

In an ideal world, the master process would start a pool of workers once, and those workers would then handle incoming network connections until the end of time. In practice, worker processes can crash or be killed when they exceed the timeout: if the Unicorn master notices that a worker has been handling a request for too long, it kills that worker with SIGKILL (kill -9). Regardless of how a worker ended, the master replaces it with a new one that inherits the same "initial" state. One of Unicorn's strengths is that it can replace defective workers without dropping users' requests.

An example of a worker timeout can be found in unicorn_stderr.log. The master process has PID 56227:

[2015-06-05T10:58:08.660325 #56227] ERROR -- : worker=10 PID:53009 timeout (61s > 60s), killing
[2015-06-05T10:58:08.699360 #56227] ERROR -- : reaped #<Process::Status: pid 53009 SIGKILL (signal 9)> worker=10
[2015-06-05T10:58:08.708141 #62538]  INFO -- : worker=10 spawned pid=62538
[2015-06-05T10:58:08.708824 #62538]  INFO -- : worker=10 ready


The main Unicorn settings for managing processes are the number of worker processes and the timeout after which a stuck worker is terminated. A description of these settings can be found in this section of the GitLab documentation.
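
For reference, here is a minimal sketch of what these settings look like in a generic Unicorn configuration file; the file name and values are illustrative, and GitLab ships its own configuration:

# config/unicorn.rb (illustrative values)
worker_processes 3      # number of worker processes forked by the master
timeout 60              # seconds a worker may spend on one request before SIGKILL
preload_app true        # load the Rails application in the master before forking
listen '127.0.0.1:8080'

after_fork do |server, worker|
  # Runs in each worker after forking; per-process resources such as
  # database connections are usually re-established here.
end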

unicorn-worker-killer


GitLab has memory leaks. These leaks show up in long-running processes, in particular in the worker processes created by Unicorn (the master process does not suffer from them, since it does not handle requests).

To combat these memory leaks, GitLab uses the unicorn-worker-killer gem, which modifies Unicorn workers so that they check their memory usage every 16 requests. If a worker's memory usage exceeds the configured limit, the worker terminates and the Unicorn master automatically replaces it with a new one.
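
As an illustration, this is roughly how the gem is wired up as Rack middleware according to its documentation; the exact integration and limits used by GitLab may differ:

# config.ru (sketch based on the unicorn-worker-killer documentation)
require 'unicorn/worker_killer'

memory_limit_min = 200 * (1024**2)   # 200 MB in bytes
memory_limit_max = 250 * (1024**2)   # 250 MB in bytes

# Kill a worker once its memory use exceeds a limit chosen at random
# between the two bounds; the check runs every 16 requests by default.
use Unicorn::WorkerKiller::Oom, memory_limit_min, memory_limit_max

require ::File.expand_path('../config/environment', __FILE__)
run Rails.application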

This is actually a good way of dealing with memory leaks, because Unicorn's design makes it possible not to lose a user's request when a worker goes away. Moreover, unicorn-worker-killer terminates a worker between requests, so request handling is not affected at all.

This is what a worker restart caused by memory growth looks like in the unicorn_stderr.log file. As you can see, the process with PID 125918 decides to terminate after inspecting its own memory usage. The memory threshold in this case is 254802235 bytes, roughly 250 megabytes; GitLab uses a random number between 200 and 250 megabytes as the threshold. The Unicorn master process with PID 117565 then spawns a new worker with PID 127549:

[2015-06-05T12:07:41.828374 #125918]  WARN -- : #: worker (pid: 125918) exceeds memory limit (256413696 bytes > 254802235 bytes)
[2015-06-05T12:07:41.828472 #125918]  WARN -- : Unicorn::WorkerKiller send SIGQUIT (pid: 125918) alive: 23 sec (trial 1)
[2015-06-05T12:07:42.025916 #117565]  INFO -- : reaped # worker=4
[2015-06-05T12:07:42.034527 #127549]  INFO -- : worker=4 spawned pid=127549
[2015-06-05T12:07:42.035217 #127549]  INFO -- : worker=4 ready


What also stands out in this log is that the worker had been serving requests for only 23 seconds before it was terminated for exceeding its memory limit. This is currently normal for gitlab.com.

Such frequent restarts of worker processes on GitLab servers can be a source of concern for system administrators and devops engineers, but in practice this is most often normal behavior.
