Rethinking PID 1. Part 4

Original author: Lennart Poettering
  • Translation




Regarding Upstart


First of all, let me emphasize that I actually like the Upstart code: it is very well documented and easy to navigate. In general, other projects (including my own) could learn from its example.

That said, I cannot say that I agree with the general approach taken in Upstart. But first, a little more about Upstart itself.

Upstart does not share code with sysvinit; its functionality is a superset of it, and it provides some degree of compatibility with the well-known SysV init scripts. Its main feature is an event-based approach: the starting and stopping of processes is tied to events happening in the system, where many different things can act as an event, such as a network interface becoming available or a program being started.

Upstart serializes services via these events: if the syslog-started event is emitted, this is used as an indication to start D-Bus, since D-Bus can now make use of Syslog. Then, when the dbus-started event is emitted, NetworkManager is started, since it can now use D-Bus, and so on and so on.

Some might say that in this way the actual dependency graph that exists and that the administrator understands is simply converted and encoded into a set of events and behavior rules: every logical rule "a needs b" that the administrator or developer cares about becomes "start a when b starts" plus "stop a when b stops". In some ways this is indeed a simplification, especially for the code in Upstart itself. Nevertheless, I would argue that this simplification is actually harmful. First of all, the logical dependencies do not go away: the person writing Upstart files now has to translate those dependencies manually into a set of event/behavior rules (in fact, two rules for each dependency). So instead of letting the machine figure out what to do based on the dependencies, the user has to convert the dependencies into simple event/action rules by hand. Also, because the dependency information is never encoded, it is not available at runtime, which means that an administrator who tries to figure out why something happened, for example why a was started when b was started, has no chance of finding out.
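
As an illustration (a hypothetical sketch, not taken from any actual distribution), an Upstart job file encoding the logical dependency "NetworkManager needs D-Bus" as such a pair of event rules might look roughly like this:

    # /etc/init/network-manager.conf (hypothetical Upstart job)
    description "NetworkManager"

    # the single logical dependency becomes two separate event rules:
    start on started dbus
    stop on stopping dbus

    exec /usr/sbin/NetworkManager

Note that the dependency itself is written down nowhere; only the two rules derived from it are.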

Moreover, the event logic turns all dependencies upside down. Instead of minimizing the amount of work (which a good init system should do, as noted at the beginning of this series), it actually increases the amount of work to be done. Or, in other words, instead of having a clear goal and doing only the things that are really needed to reach that goal, it does one step and then, after finishing it, performs all the steps that could possibly follow it.

To put it simply: the fact that the user just started D-Bus is in no way an indication that NetworkManager should be started too (but that is what Upstart would actually do). It is exactly the other way around: when the user asks for NetworkManager, that is a clear indication that D-Bus should be started too (which is certainly what most users would expect, right?).

A good init system should start only what is needed, and only on demand, either lazily or parallelized and in advance. It should not start more than necessary; in particular, it should not start everything that is installed and could make use of a given service.

Finally, I do not see any real benefit in the event logic. It seems to me that most of the events exposed in Upstart are not actually punctual in nature but have a duration: a service starts, is running, and stops; a device is plugged in, is available, and is unplugged again; a mount point is being mounted, is fully mounted, and is being unmounted; a power plug is plugged in, the system runs on AC, and the plug is pulled. Only a minority of the events an init system or process supervisor has to deal with are actually punctual; most of them are tuples of start, condition, and stop. Again, this information is not available in Upstart, because it focuses on single events and ignores dependencies that have a duration.

Now, I am aware that some of the problems I pointed out above have to some degree been addressed by recent changes in Upstart, in particular condition-based syntax such as start on (local-filesystem and net-device-up IFACE=lo) in Upstart rule files. However, to me this looks more like an attempt to patch up a system whose core design is inherently flawed.

Leaving all that aside, Upstart is a fine babysitter for services, even if some of its design decisions seem questionable (see above), and it misses out on many of the features described earlier (see above as well).

There are other init systems besides Upstart, sysvinit, and launchd. Most of them offer little more than Upstart or sysvinit do. The most interesting contender is Solaris SMF, which maintains proper dependencies between services. However, in many ways it is overly complex and, let's say, a bit academic in its excessive use of XML and new terminology for well-known things. It is also closely tied to Solaris-specific features such as the contract system.

Putting it all together


So, now is a good time for a second break before I explain what I think a good PID 1 should look like, what most current systems do instead, and where the real problem lies. So go and refill your coffee mug. It will be worth it.

You have probably guessed it: what I suggested above as requirements and features of an ideal init system is actually available now, in an (as yet experimental) init system called systemd, which I hereby want to announce right here and now! And again, here is the code. Here is a quick overview of its features and the rationale behind them.

systemd starts up and supervises the entire system (hence the name...). It implements all of the features pointed out above and a few more. Everything is centered around the notion of units. Units have a name and a type. Since their configuration is usually loaded directly from the file system, unit names are actually file names. For example: the unit avahi.service reads its configuration from a file of the same name, and of course it is also the unit that encapsulates the Avahi daemon. There are several types of units (a brief sketch of what such unit files look like follows this list):

  1. service: the most obvious type of unit: daemons that can be started, stopped, restarted, and reloaded. For SysV compatibility, we not only support our own configuration files but can also read classic SysV init scripts; in particular, we parse the LSB header if it is present. /etc/init.d is therefore just another source of configuration files.
  2. socket: this unit type encapsulates a socket in the file system or on the Internet.
    We currently support AF_INET, AF_INET6 and AF_UNIX sockets of the types stream, datagram and sequential packet. We also support classic FIFOs as transport. Each socket unit has a matching service unit, which is started as soon as the first connection comes in on the socket or FIFO. For example: nscd.socket starts nscd.service on an incoming connection.
  3. device: this unit type encapsulates a device in the Linux device tree. If a device is tagged for this purpose via udev rules, it is exposed as a device unit in systemd. Properties set with udev rules can be used as a configuration source to establish dependencies for device units.
  4. mount: this unit type encapsulates a mount point in the file system. systemd monitors all mount points as they come and go, and can also be used to mount and unmount them. /etc/fstab is used as an additional configuration source for mount points, much like SysV init scripts can be used as an additional configuration source for service units.
  5. automount: this unit type encapsulates an automount point in the file system. Each automount unit has a matching mount unit, which is started (i.e. mounted) as soon as the automount directory is accessed.
  6. target: this unit type is used for logical grouping of units: instead of actually doing anything itself, it simply references other units, so that they can be controlled together. Examples: multi-user.target, which basically plays the role of runlevel 5 on a classic SysV system, or bluetooth.target, which is requested as soon as a bluetooth dongle is plugged into the system and which simply pulls in the bluetooth-related services that otherwise would not be running: bluetoothd and obexd, for example.
  7. snapshot: similar to target units, snapshots do not actually do anything themselves; their only purpose is to reference other units. Snapshots can be used to save and roll back the state of all services and units of the init system. They are mainly intended for two cases: to let the user temporarily enter a specific state such as an "Emergency Shell", stopping the currently running services, with an easy way to return to the previous state and start again all the services that were temporarily stopped; and as an easy aid for system suspend: many services do not behave correctly when the system is suspended, and it is often a good idea to simply stop those services before suspend and start them again after resume.

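To give a rough feel for what such unit files look like on disk, here is a minimal sketch of a service unit and a matching socket unit. The contents are invented for illustration and some directive names have varied between systemd versions, so treat this as an outline rather than exact configuration:

    # /etc/systemd/system/nscd.service (illustrative sketch)
    [Unit]
    Description=Name Service Cache Daemon

    [Service]
    ExecStart=/usr/sbin/nscd

    # /etc/systemd/system/nscd.socket (illustrative sketch)
    [Unit]
    Description=Name Service Cache Daemon Socket

    [Socket]
    # AF_UNIX stream socket; the first connection starts nscd.service
    ListenStream=/var/run/nscd/socket
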
All these units can have dependencies on each other (both positive and negative, i.e. "Requires" and "Conflicts"): a device can have a dependency on a service, meaning that as soon as the device becomes available the corresponding service is started. Mount points get an implicit dependency on the devices they are mounted from. Mount points also get implicit dependencies on mount points that are their prefixes (for example, the mount point /home/lennart implicitly depends on the mount point /home), and so on.
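
In a unit file such dependencies are expressed with a handful of directives. A hypothetical fragment (the referenced unit names are made up for illustration):

    [Unit]
    Description=Some Daemon
    # positive dependency: pull in the message bus and order ourselves after it
    Requires=dbus.service
    After=dbus.service
    # negative dependency: never run at the same time as this unit
    Conflicts=some-other.service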

Here is a short list of features:

  1. For every process that is spawned, you can control: the environment, resource limits, working and root directory, umask, OOM killer adjustment, nice level, IO class and priority, CPU scheduling policy and priority, CPU affinity, user id, group id, supplementary group ids, readable/writable and inaccessible directories, shared/private/slave mount flags, capabilities/bounding set, secure bits, CPU scheduler reset-on-fork, a private /tmp namespace, and cgroup membership for various subsystems. Also, you can easily connect the stdin/stdout/stderr of services to syslog, /dev/kmsg or an arbitrary TTY. If connected to a TTY for input, systemd makes sure the process gets exclusive access to it, optionally waiting for it or enforcing it. (A sketch of such settings in a unit file follows this list.)
  2. Every executed process gets its own cgroup (currently only in the debug subsystem, since that subsystem is not otherwise used and does little more than provide basic process grouping), and it is very easy to configure systemd so that each service is placed in cgroups that have been configured externally, for example via the libcgroups utilities.
  3. The configuration files use a syntax that closely follows the well-known .desktop files. It is a simple syntax for which parsers already exist in many software frameworks. It also allows us to rely on existing tools for i18n of service descriptions. Administrators and developers do not need to learn a new syntax.
  4. As mentioned, we maintain compatibility with SysV init scripts. We take advantage of LSB and Red Hat chkconfig headers if they are present. If they are not, we try to make the best of the otherwise available information, such as the start priorities in /etc/rc.d. SysV init scripts are simply another source of configuration, so the migration path to systemd is eased. Optionally, we can also read classic PID files of services to identify the main process of a daemon. Note that we take the LSB header dependency information and translate it into native systemd dependencies. Side note: Upstart is unable to harvest and use this kind of information. Boot-up with Upstart on a system where LSB SysV scripts dominate will therefore not be parallelized, while it will be on the same system with systemd. In fact, for Upstart all SysV scripts together form one single job that is executed, while for systemd they are just another configuration source and are all managed and supervised individually, much like native systemd services.
  5. In a similar way, we read the existing /etc/fstab and consider it just another configuration source. Using the comment= fstab option we can even mark entries in /etc/fstab as systemd-controlled automount points. (A hypothetical example line follows this list.)
  6. If the same unit is configured in multiple configuration sources (e.g. there is both /etc/systemd/system/avahi.service and /etc/init.d/avahi), then the native configuration always takes precedence and the legacy one is ignored, allowing a package to carry both a SysV init script and a systemd service file for a while.
  7. We support a simple templating/instance mechanism. For example: instead of having six configuration files for six gettys, we have only one getty@.service file, from which instances such as getty@tty2.service are created. The interface part can even be inherited by dependency expressions, i.e. it is easy to encode that the service dhcpcd@eth0.service pulls in avahi-autoipd@eth0.service, while leaving the eth0 string wildcarded. (See the unit sketch after this list.)
  8. For socket activation we support full compatibility with the traditional inetd modes, as well as a very simple mode that tries to mimic launchd socket activation and is the recommended mode for new services. The inetd mode only allows passing a single socket to the started daemon, while the natively supported mode allows passing arbitrarily many file descriptors. We also support one-instance-per-connection as well as one-instance-for-all-connections modes. In the former mode, we name the cgroup the daemon will be started in after the connection parameters, and use the templating logic mentioned above for this. For example: sshd.socket might spawn a service sshd@192.168.0.1-4711-192.168.0.2-22 with the cgroup sshd@.service/192.168.0.1-4711-192.168.0.2-22 (i.e. the IP addresses and port numbers are used in the instance name; for AF_UNIX sockets, the PID and user id of the connecting client are used). This provides the administrator with a nice way to identify the various instances of a daemon and to control their runtime individually. The native socket-passing mode is very easy to implement in applications: if $LISTEN_FDS is set, it contains the number of sockets passed, and the daemon will find them sorted as listed in the .service file, starting with file descriptor 3 (a nicely written daemon could also use fstat() and getsockname() to identify the sockets in case it receives more than one). In addition, we set $LISTEN_PID to the PID of the daemon that shall receive the file descriptors, because environment variables are normally inherited by sub-processes and could hence confuse processes further down the chain. Even though this socket-passing logic is very easy to implement in daemons, we will provide a BSD-licensed reference implementation that shows how to do it. We have already ported a couple of daemons to this scheme. (A small C sketch of the daemon side of this protocol follows this list.)
  9. To a certain extent we provide compatibility with /dev/initctl. This compatibility is in fact implemented as a FIFO-activated service which simply translates these legacy requests into D-Bus requests. Effectively, this means the old shutdown, poweroff and similar commands from Upstart and sysvinit continue to work with systemd.
  10. We also provide compatibility with utmp and wtmp, perhaps even to an extent that is more than healthy, given how crufty utmp and wtmp are.
  11. systemd supports several kinds of dependencies between units. After/Before can be used to fix the ordering in which units are activated. Completely orthogonal to that are Requires and Wants, which express a positive requirement dependency, either mandatory or optional. Then there is Conflicts, which expresses a negative requirement dependency. Finally, there are three further, less commonly used dependency types.
  12. systemd has a minimal transaction system. Meaning: if a unit is requested to start or stop, we add it and all of its dependencies to a temporary transaction. Then we verify that the transaction is consistent (i.e. that the ordering via After/Before of all units is free of cycles). If it is not, systemd tries to fix it up by removing non-essential jobs from the transaction, which may get rid of the loop. Also, systemd tries to suppress non-essential jobs in the transaction that would stop a running service. Non-essential jobs are those that were not directly requested but were pulled in via dependencies of the Wants type. Finally, we check whether the jobs of the transaction contradict jobs that have already been queued, in which case the transaction is aborted. If everything worked out, the transaction is consistent, and its impact has been minimized, it is merged with the already outstanding jobs and added to the run queue. Effectively, this means that before executing a requested operation we verify that it makes sense at all, fix it if possible, and only give up if it really cannot be made to work.
  13. We record the start and exit time, as well as the PID and exit status, of every process we spawn and supervise. This data can be used to cross-link services with their data in abrtd, auditd and syslog. Imagine a user interface that highlights crashed daemons and lets you easily navigate to the respective interfaces of syslog, abrtd and auditd, which show the data generated for that daemon during the current run.
  14. We support re-execution of the init process itself at any time. The daemon state is serialized before the re-execution and deserialized afterwards. In this way we provide a simple path for init system upgrades, as well as for handing over from an initrd daemon to the final daemon. Open sockets and autofs mount points are serialized properly, so that they remain connectable at all times and clients will not even notice that the init system re-executed itself. Also, the fact that a big part of the service state is encoded anyway in the cgroup virtual file system would even allow us to resume execution without access to the serialized data. The re-execution code path is actually mostly the same as the code path used for reloading the init system's configuration, which guarantees that re-execution (which is probably triggered more rarely) gets similar testing to configuration reloading (which is probably more common).
  15. Starting to remove shell scripts from the boot process, we have rewritten part of the basic system setup in C and moved it directly into systemd. This includes mounting the API file systems (i.e. virtual file systems such as /proc, /sys and /dev) and setting the hostname.
  16. Server state is introspectable and controllable via D-Bus. This is not complete yet, but already quite extensive.
  17. Since we want to emphasize socket-based and bus-name-based activation, and therefore support dependencies between sockets and services, we also support several ways for such services to signal their readiness: by forking and having the start process exit (i.e. the traditional daemonize() behavior), as well as by watching the bus until the configured service name appears.
  18. There is an interactive mode that asks for confirmation each time a process is spawned by systemd. You can enable it by passing systemd.confirm_spawn=1 on the kernel command line.
  19. With the systemd.default= kernel command line parameter you can specify which unit systemd should start on boot-up. Normally you would specify something like multi-user.target here, but you can even specify a single service instead of a target. For example, out of the box we ship a service emergency.service that is similar in its usefulness to init=/bin/bash, however with the advantage of actually running the init system, hence offering the option to boot up the full system from the emergency shell. (An example command line follows this list.)
  20. There is also a minimal user interface that allows you to start/stop/inspect services. It is far from a complete UI, but useful as a debugging tool. It is written in Vala (yay!) and goes by the name systemadm.
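
To illustrate points 1 and 3, here is a rough sketch of how such per-process execution settings look in the .desktop-style unit syntax. The daemon and all values are invented, and a few directive names have varied between systemd versions, so consider this an approximation rather than canonical configuration:

    # /etc/systemd/system/example.service (illustrative, not a real service)
    [Unit]
    Description=Example Daemon

    [Service]
    ExecStart=/usr/sbin/exampled
    User=example
    Group=example
    Nice=5
    OOMScoreAdjust=-100
    IOSchedulingClass=best-effort
    CPUSchedulingPolicy=batch
    CPUAffinity=0 1
    UMask=0027
    LimitNOFILE=16384
    WorkingDirectory=/var/lib/example
    PrivateTmp=yes
    # route the daemon's stdout/stderr to syslog
    StandardOutput=syslog
    StandardError=syslog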
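
For point 5, an /etc/fstab line marked as a systemd-controlled automount point could look something like the following; the device, mount point and the exact option name are illustrative (later systemd versions use x-systemd.* mount options for this instead):

    # /etc/fstab fragment (hypothetical)
    /dev/sda2  /home  ext4  defaults,comment=systemd.automount  0  2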
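
Points 7 and 8 combine naturally: a socket unit in one-instance-per-connection mode together with a template service unit. Again a sketch with invented values:

    # /etc/systemd/system/sshd.socket (illustrative sketch)
    [Socket]
    ListenStream=22
    # spawn one instance of sshd@.service per incoming connection
    Accept=yes

    # /etc/systemd/system/sshd@.service (illustrative sketch)
    [Unit]
    Description=SSH Per-Connection Server

    [Service]
    # the instance name (e.g. 192.168.0.1-4711-192.168.0.2-22) carries the
    # connection parameters; the connection socket is passed on stdin
    ExecStart=-/usr/sbin/sshd -i
    StandardInput=socket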
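
And for the daemon side of point 8, here is a minimal C sketch of how a service might pick up the file descriptors passed to it; the BSD-licensed reference implementation mentioned above covers this properly, this only shows the idea:

    /* sketch of the $LISTEN_FDS / $LISTEN_PID hand-off (illustrative) */
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define LISTEN_FDS_START 3   /* passed fds begin right after stderr */

    /* Returns the number of sockets passed by the init system, or 0. */
    static int listen_fds(void) {
            const char *pid_str = getenv("LISTEN_PID");
            const char *fds_str = getenv("LISTEN_FDS");

            if (!pid_str || !fds_str)
                    return 0;

            /* ignore the variables if they were meant for a different process */
            if ((pid_t) atol(pid_str) != getpid())
                    return 0;

            return atoi(fds_str);
    }

    int main(void) {
            int n = listen_fds();

            if (n >= 1) {
                    /* socket-activated: serve connections on fd 3 (and following) */
                    int fd = LISTEN_FDS_START;
                    /* ... accept() on fd and handle clients ... */
                    (void) fd;
            } else {
                    /* started the classic way: create, bind and listen on our own socket */
            }
            return 0;
    }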
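
Finally, for points 18 and 19, the kernel command line additions mentioned there look something like this (the chosen unit is just an example):

    # appended to the kernel command line in the boot loader
    systemd.default=multi-user.target systemd.confirm_spawn=1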


I must say that systemd uses many Linux-specific features and does not limit itself to POSIX. This unlocks a lot of functionality that a system designed for portability to other operating systems cannot provide.

To be continued…
