Why data centers need operating systems

Original author: Benjamin Hindman

Developers today are creating a new class of applications. These applications are no longer built for a single server; they run across many servers in a data center. Examples include analytics frameworks such as Apache Hadoop and Apache Spark, message brokers like Apache Kafka, key-value stores like Apache Cassandra, and applications that end users interact with directly, such as those run by Twitter and Netflix.

These new applications are more than just applications; they are distributed systems. Just as developers once learned to write multi-threaded applications for individual machines, they are now learning to design distributed systems for data centers.

But it is difficult for developers to build such systems, and difficult for administrators to operate them. Why? Because we use the wrong level of abstraction for both developers and administrators: the machine.

Machines are the wrong level of abstraction

Machines are a poor abstraction for building and operating distributed applications. Building against this abstraction needlessly complicates software by coupling developers to the characteristics of specific computers, such as IP addresses and the amount of memory on the local machine. This makes migrating and resizing applications difficult, and sometimes impossible, and turns operating a data center into a time-consuming and painful chore.

Treating machines as the abstraction also forces administrators to deploy applications anticipating the failure of individual pieces of hardware, which leads to the simplest and most conservative approach: one application per machine. This almost always leaves hardware underutilized, because in most cases we do not buy machines (physical or virtual) for a specific application, and we do not size applications to fit the capabilities of a particular machine.

Running one application per machine turns the data center into an extremely static, inflexible collection of machine groups: one group runs analytics, another runs databases, a third serves web traffic, yet another handles message queues, and so on. And the number of such groups only grows as companies replace monolithic architectures with service-oriented and microservice architectures.

What happens when a machine in one of these static groups fails? We hope we can replace it with a new one (a waste of money) or repurpose some of the available machines (a waste of time and effort). And what happens when web traffic drops to its daily minimum? We size static groups for peak load, so whenever traffic falls off, the extra capacity sits idle. This is why a typical data center runs at only 8-15% utilization.


Finally, with machines as the abstraction, organizations are forced to hire large teams of people to manually configure and maintain each individual application on each individual machine. People become the bottleneck of such a system: new applications cannot be launched because of this human factor, even when the company has already purchased plenty of resources that currently sit unused.

If my laptop were a data center

Imagine if we ran applications on our laptops the way we run them in data centers. Every time we launched a web browser or a text editor, we would have to specify which hardware resources should handle the process. Fortunately, our laptops have operating systems that let us abstract away the complexity of managing a machine's resources manually.

In fact, we have operating systems for our workstations, servers, mainframes, supercomputers, and mobile devices, each optimized for the unique characteristics and form factor of its device.

We have already begun to think of the data center itself as one giant, warehouse-scale computer. But we still have no operating system that works at that level of abstraction and manages the data center's hardware the way an OS manages the hardware in our computers.

It's time to create an OS for data centers

What would an operating system that manages a data center look like?

From an administrator's point of view, it would span all the machines in the data center (or cloud) and aggregate them into one giant pool of resources on which applications run. You would no longer configure individual machines for individual applications; any application could run on any available resources, on any machine, even one already running other applications.

From a developer's point of view, the data center operating system would act as an intermediary between applications and machines, providing common primitives and simplifying the development of distributed applications.

A data center operating system would not replace Linux or any other host operating system we use in data centers today. Rather, it would be a software stack installed on top of the host OS. Continuing to rely on the host OS for standard operations is critical to supporting existing applications.

The data center OS would provide functionality analogous to that of a single machine's OS, namely resource management and process isolation. Just as a traditional OS lets many users run multiple applications concurrently (multiprogramming) against a shared pool of resources, with running applications apparently isolated from one another, so should the data center OS.

An API for the data center

Perhaps the defining characteristic of a data center OS would be the interface it provides for building distributed applications. Analogous to the system call interface of a traditional OS, the data center OS API would let distributed applications reserve and release resources; start, monitor, and terminate processes; and much more. The API would provide primitives implementing the common functionality that all distributed systems need, so developers would no longer have to reimplement the fundamental primitives of distributed systems over and over (and inevitably suffer the same bugs and performance problems).
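As a sketch of what such an API might feel like, consider the following. Every name here (DatacenterOS, Offer, launch_task, and so on) is hypothetical, and the resource pool is simulated in memory; the point is only that a developer asks for resources and launches processes without ever naming a machine.

```python
from dataclasses import dataclass

@dataclass
class Offer:
    """A reservation of resources carved out of the data center's shared pool."""
    offer_id: int
    cpus: float
    mem_mb: int

class DatacenterOS:
    """In-memory stand-in for a data-center-wide resource manager."""

    def __init__(self, total_cpus, total_mem_mb):
        self.free_cpus = total_cpus
        self.free_mem_mb = total_mem_mb
        self._next_id = 0
        self.tasks = {}

    def reserve(self, cpus, mem_mb):
        """Reserve resources from the shared pool (the analogue of a syscall)."""
        if cpus > self.free_cpus or mem_mb > self.free_mem_mb:
            return None  # not enough capacity anywhere in the data center
        self.free_cpus -= cpus
        self.free_mem_mb -= mem_mb
        self._next_id += 1
        return Offer(self._next_id, cpus, mem_mb)

    def launch_task(self, offer, command):
        """Start a process somewhere on the reserved resources."""
        self.tasks[offer.offer_id] = command
        return offer.offer_id

    def kill_task(self, task_id, offer):
        """Terminate a process and release its resources back to the pool."""
        self.tasks.pop(task_id, None)
        self.free_cpus += offer.cpus
        self.free_mem_mb += offer.mem_mb

# The developer never names a machine: reserve, then launch.
dc = DatacenterOS(total_cpus=64, total_mem_mb=256_000)
offer = dc.reserve(cpus=4, mem_mb=8_000)
task = dc.launch_task(offer, "run-analytics-job")
```

A real data center OS would, of course, place the task on a concrete machine behind the scenes; the key property is that machine placement never appears in the application's code.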


Centralizing common functionality within API primitives would let developers build new distributed applications more easily, safely, and quickly. This is reminiscent of the addition of virtual memory to traditional operating systems. Indeed, one of the pioneers of virtual memory wrote that "it was very clear to operating system designers in the early 1960s that automatic storage allocation could greatly simplify programming."

Examples of primitives

Two primitives characteristic of a data center OS that would immediately simplify building distributed applications are service discovery and coordination. Unlike on a single machine, where few of the applications running within one system need to discover each other, discovery is the natural state of affairs for distributed applications. Moreover, most distributed applications achieve availability and fault tolerance through some form of coordination and/or consensus, which is notoriously difficult to implement.

A data center OS would provide these primitives through a software interface rather than through ad hoc tooling. Today, developers are forced to choose among existing service discovery and coordination tools, such as Apache ZooKeeper and CoreOS's etcd. This forces organizations to deploy multiple tools for different applications, significantly increasing operational complexity and maintenance burden.

Providing discovery and coordination primitives as part of the data center OS would not only simplify development but also make applications portable: organizations could change the underlying implementation without modifying their applications, much as you can choose a file system implementation in a local OS today.
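To make the discovery primitive concrete, here is a minimal in-memory sketch of the ephemeral-registration pattern that tools like ZooKeeper (with its ephemeral znodes) and etcd (with leases) implement: a service registers itself under a well-known name, and the registration disappears when its owner's session ends. The ServiceRegistry class and its method names are hypothetical; real systems add sessions, watches, and replication.

```python
class ServiceRegistry:
    """Minimal sketch of ephemeral service registration and discovery."""

    def __init__(self):
        self._entries = {}  # service name -> {instance_id: address}

    def register(self, service, instance_id, address):
        # An ephemeral entry, tied to the liveness of the process that made it.
        self._entries.setdefault(service, {})[instance_id] = address

    def discover(self, service):
        # Clients look up a well-known service name, not a machine.
        return sorted(self._entries.get(service, {}).values())

    def session_closed(self, service, instance_id):
        # When the owning process dies, its registration vanishes,
        # so clients never discover a dead instance.
        self._entries.get(service, {}).pop(instance_id, None)

reg = ServiceRegistry()
reg.register("kafka", "broker-1", "10.0.0.5:9092")
reg.register("kafka", "broker-2", "10.0.0.6:9092")
print(reg.discover("kafka"))        # both brokers are visible
reg.session_closed("kafka", "broker-1")
print(reg.discover("kafka"))        # only the live broker remains
```

The portability argument is visible in the shape of this interface: an application that codes against register/discover does not care whether the implementation underneath is ZooKeeper, etcd, or something else.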

A new way to deploy applications

A data center OS would let software interfaces replace the people developers must currently go through to deploy their applications. Instead of asking someone to allocate machines and configure them, a developer would launch an application through the data center OS (via a CLI or GUI), and the application would run using the data center OS API.

This approach provides a clean separation of concerns between administrators and users: administrators decide how many resources each user is allotted, and users launch whatever applications they want on whatever resources are available. Because the administrator specifies the type and quantity of resources (but not the specific resources themselves), the data center OS and the distributed applications running on it know more about the resources they can use, and so can run more efficiently and tolerate failures better. And since most distributed applications have particular scheduling requirements (Apache Hadoop, for example) and particular recovery needs (as is typical of databases), exposing this information lets each application schedule and recover in the way that suits it best.
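The separation of concerns above could be sketched as follows. The names (QuotaAllocator, set_quota) are hypothetical; the point is that the administrator caps each user's resources by type and quantity, while users consume any available capacity up to their cap, never naming specific machines.

```python
class QuotaAllocator:
    """Sketch: admins set per-user limits by resource type;
    users allocate freely within those limits."""

    def __init__(self):
        self._quota = {}  # user -> {resource_type: limit}
        self._used = {}   # user -> {resource_type: amount in use}

    def set_quota(self, user, **limits):
        # Administrator's side: quantities of each resource type,
        # never specific machines.
        self._quota[user] = dict(limits)
        self._used.setdefault(user, {k: 0 for k in limits})

    def allocate(self, user, **request):
        # User's side: launch anything, anywhere, within the quota.
        quota, used = self._quota[user], self._used[user]
        if any(used.get(r, 0) + amt > quota.get(r, 0)
               for r, amt in request.items()):
            return False  # would exceed this user's cap
        for r, amt in request.items():
            used[r] = used.get(r, 0) + amt
        return True

alloc = QuotaAllocator()
alloc.set_quota("analytics-team", cpus=32, mem_mb=64_000)
ok = alloc.allocate("analytics-team", cpus=16, mem_mb=32_000)      # fits
denied = alloc.allocate("analytics-team", cpus=32, mem_mb=1_000)  # over CPU cap
```

A real data center OS would layer scheduling and failure recovery on top of this, but the division of responsibility stays the same: the administrator sets the envelope, the user fills it.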

The cloud is not an OS

But why do we need a new OS? Don't IaaS (infrastructure as a service) and PaaS (platform as a service) already solve these problems?


IaaS does not solve the problem, because it still focuses on machines. Applications do not get a software interface to run against; instead, IaaS is aimed at people provisioning virtual machines that other people then use to deploy applications. IaaS makes machines virtual, but it offers no primitives that would help a developer build distributed applications spanning the resources of many virtual machines at once.

PaaS, on the other hand, does abstract away machines, but it is built primarily for human interaction. Many PaaS offerings bundle third-party services and integration mechanisms that simplify building distributed applications, but those mechanisms are not portable across PaaS platforms.

Distributed computing has become the rule, not the exception, and we now need a data center OS that provides the right level of abstraction and portable APIs for distributed applications. Its absence is holding our industry back. Developers should be able to build distributed applications without reimplementing core functionality over and over, and distributed applications built in one organization should run easily in another.
