The fundamental problem of package managers for programming languages
Why are there so many different package managers? They exist both for operating systems (apt, yum, pacman, Homebrew) and for programming languages (Bundler, Cabal, Composer, CPAN, CRAN, CTAN, EasyInstall, Go Get, Maven, npm, NuGet, OPAM, PEAR, pip, RubyGems, and so on). “It is a truth universally acknowledged that every programming language must have its own package manager.” What fatal attraction makes programming languages, one after another, jump off this cliff? Why don't we just reuse an existing package manager?
You can probably already think of a few reasons why using apt to manage Ruby packages is a bad idea. “System package managers and language package managers are completely different things. Centrally curated distribution of packages is great, but completely unrealistic for most libraries posted on GitHub. Centralized distributions move too slowly. Every programming language is different, and their communities do not talk to each other. System package managers install packages globally, and I want control over which library versions I use.” These objections are all valid, but they miss the essence of the problem.
The fundamental problem is that package management for programming languages is decentralized.
This decentralization is implied by the very definition of a package manager: a program that installs programs and libraries from remote sources because they are not available locally. Even with an ideal, centralized package manager there are still two parties involved: the server that hosts a library, and the programmer who builds an application on top of that library locally. In reality the library ecosystem is far more fragmented than that, because it consists of many libraries written by many different developers. Of course, all libraries can be uploaded and indexed in one place, but that does not mean that any given author knows how, or by whom, their library is used. And then there is what the Perl world calls DarkPAN: the countless lines of code that probably exist but that we know nothing about, because they are locked away in proprietary code or running somewhere on corporate servers. Decentralization can only be avoided when you control absolutely all of the code in your application. But in that case you hardly need a package manager, do you? (By the way, my colleagues tell me that this is practically mandatory for very large projects, such as the Windows operating system or the Google Chrome browser.)
Decentralized systems are hard. Really, really hard. If you do not design the architecture of such a system carefully, you will certainly end up in dependency hell. Nor is there one “correct” solution: I can name at least three different approaches used by various package managers, and each of them has its pros and cons.
Pinned versions. Perhaps the most popular view is that developers should strictly pin the versions of the packages they use. This approach is advocated by Bundler for Ruby, Composer for PHP, pip together with virtualenv for Python, and any other package manager inspired by the Ruby / Node.js approach (for example, Gradle for Java or Cargo for Rust). Reproducible builds are king here: these package managers solve the decentralization problem by simply pretending that the rest of the package ecosystem ceases to exist once you have pinned your versions. The main advantage of this approach is that you always control exactly which library code you run. Of course, this is also its downside: you always have to manage those versions yourself. All too often versions are pinned and then safely forgotten, even when an important security update comes out. Keeping every dependency up to date takes development cycles, and that time is often spent on other things (for example, developing new features).
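The "fixed and then forgotten" failure mode can be made concrete with a small sketch. It assumes a lockfile reduced to a name-to-version mapping and an advisory list that names the first fixed version of each affected package. The package names, the data, and the `stale_pins` helper are all hypothetical and do not correspond to the format or API of any real package manager.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse a 'MAJOR.MINOR.PATCH' string into a comparable tuple."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def stale_pins(lock: Dict[str, str], fixed_in: Dict[str, str]) -> List[str]:
    """Return the names of pinned packages that predate a known security fix.

    lock     -- hypothetical lockfile contents: package name -> pinned version
    fixed_in -- hypothetical advisory data: package name -> first fixed version
    """
    stale = []
    for name, pinned in sorted(lock.items()):
        fix = fixed_in.get(name)
        if fix is not None and parse_version(pinned) < parse_version(fix):
            stale.append(name)
    return stale


# Hypothetical data, invented for illustration only.
lock = {"webframework": "2.3.1", "yamlparser": "1.0.4", "httpclient": "0.9.0"}
fixed_in = {"yamlparser": "1.0.7", "httpclient": "0.8.2"}

print(stale_pins(lock, fixed_in))  # prints ['yamlparser']
```

The build stays perfectly reproducible in this sketch, which is exactly the point: nothing in the pinned set changes unless someone spends the time to run a check like this and act on it.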
A stable distribution. If pinning requires every individual application developer to spend time and effort keeping all dependencies up to date and checking that they still work with the application and with each other, we might wonder whether this work can be centralized. This leads us to the second approach: create a centralized repository of approved packages that are known to work together, and issue bug fixes and security updates for them while maintaining backward compatibility. Implementations of this approach exist for programming languages; the two that I know of are Anaconda for Python and Stackage for Haskell. But if we look closely, exactly the same model is used by operating system package managers. As a system administrator, I often recommend that users prefer libraries distributed in the operating system's repositories. They will not break backward compatibility until we upgrade to a new OS release, and in the meantime you always get the latest bug fixes and security updates. (Yes, you will not get the features of the newest versions, but wanting them contradicts the very idea of stability.)
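A minimal sketch of this model, under stated assumptions: the curated repository is reduced to a snapshot holding exactly one approved version per package, and "maintains backward compatibility" is approximated by "stays within the same major version". The package names and the `resolve` and `accept_update` helpers are hypothetical; this is not how Anaconda, Stackage, or any OS distribution is actually implemented.

```python
from typing import Dict, Iterable

# Hypothetical curated snapshot: exactly one approved version per package.
SNAPSHOT: Dict[str, str] = {
    "textlib": "1.4.2",
    "netlib": "3.0.5",
    "jsonlib": "2.2.0",
}


def resolve(requested: Iterable[str], snapshot: Dict[str, str]) -> Dict[str, str]:
    """Look up each requested package; the application never picks a version."""
    names = list(requested)
    missing = [name for name in names if name not in snapshot]
    if missing:
        raise KeyError(f"not part of this snapshot: {missing}")
    return {name: snapshot[name] for name in names}


def accept_update(snapshot: Dict[str, str], name: str, new_version: str) -> bool:
    """Accept an update only if it is newer and keeps the same major version.

    Assumption: an unchanged major version stands in for backward compatibility.
    """
    old = tuple(int(part) for part in snapshot[name].split("."))
    new = tuple(int(part) for part in new_version.split("."))
    if new[0] != old[0] or new <= old:
        return False
    snapshot[name] = new_version
    return True


print(resolve(["textlib", "jsonlib"], SNAPSHOT))
print(accept_update(SNAPSHOT, "netlib", "3.0.6"))  # bug fix within the same major version
print(accept_update(SNAPSHOT, "netlib", "4.0.0"))  # new major version, left for the next release
```

The application only names packages. Deciding which versions work together, and when a fix is safe to ship, is done once by whoever maintains the snapshot.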
Embracing decentralization. Up to this point, both approaches have discarded decentralization and required a central authority for updates: either the application developer or the maintainer of the central repository. But are we not throwing the baby out with the bathwater? The main disadvantage of centralization is the huge amount of work needed to keep all packages working together and up to date. Besides, no one expects absolutely all packages to be compatible with each other, yet that does not stop particular subsets of packages from being useful together. An ideal decentralized system distributes the task of determining which packages work together across everyone who takes part in it, which brings us back to the fundamental question: how can we build a decentralized package ecosystem that actually works?
Here are a few principles that can help us:
- Strict encapsulation of dependencies. One of the reasons dependency hell is such an insidious problem is that a package's dependencies are often an inseparable part of its public API: choosing a dependency is then not a local choice but a global one that affects the entire application. If a library uses a dependency only internally, and that choice is purely an implementation detail, it should not impose any global restriction. npm for Node.js takes this principle to its logical extreme: by default it does not deduplicate dependencies at all, so each library can load its own copy of the packages it depends on. Although I doubt it is worth duplicating absolutely every package (this also happens in the Maven ecosystem for Java), I certainly agree that this approach improves the composability of dependencies.
- Promotion of semantic versioning. In decentralized systems it is especially important that library developers provide accurate information about their libraries, so that users and package tools can make informed decisions. Invented version formats and wishful version ranges only complicate an already difficult task (as I wrote in a previous post). If you can enforce semantic versioning or, better still, replace semantic versions with the more accurate approach of recording type-level dependencies on interfaces, our tools will be able to make better choices. The “gold standard” of information in a decentralized system is “Is package A compatible with package B?”, and this is often very difficult to determine (or impossible, for dynamically typed systems).
- Centralization as a special case. One of the principles of a decentralized system is that each participant can assemble the environment that suits them best. This includes the freedom to rely on someone else's central source or to create and use one's own: centralization is just a special case. If we expect users to create their own repositories in the style of operating system distributions, we must give them tools that make creating and using such repositories easy and painless.
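The first two principles can be sketched together. The check below is a simplified reading of semantic versioning: a version is acceptable if it has the same major version as the one a library was written against and is not older than it (the special rules for 0.x versions are ignored). The library names, the version data, and the `resolve_global` and `resolve_local` functions are hypothetical; `resolve_local` only imitates the per-library copies described for npm above, not its actual algorithm.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse a 'MAJOR.MINOR.PATCH' string into a comparable tuple."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def satisfies(available: str, minimum: str) -> bool:
    """Simplified semantic-versioning check: same major version, not older."""
    have, need = parse_version(available), parse_version(minimum)
    return have[0] == need[0] and have >= need


# Hypothetical data: library -> {dependency -> version it was written against}.
NEEDS: Dict[str, Dict[str, str]] = {
    "liba": {"json": "1.2.0"},
    "libb": {"json": "2.0.1"},
}
AVAILABLE: Dict[str, List[str]] = {"json": ["1.2.0", "1.4.3", "2.0.1", "2.1.0"]}


def resolve_global(needs, available) -> Dict[str, str]:
    """One shared copy of each dependency for the whole application."""
    chosen = {}
    for dep in sorted({d for reqs in needs.values() for d in reqs}):
        minimums = [reqs[dep] for reqs in needs.values() if dep in reqs]
        candidates = [v for v in available[dep]
                      if all(satisfies(v, m) for m in minimums)]
        if not candidates:
            raise ValueError(f"no single version of {dep} satisfies every library")
        chosen[dep] = max(candidates, key=parse_version)
    return chosen


def resolve_local(needs, available) -> Dict[str, Dict[str, str]]:
    """Each library gets a private copy of each of its dependencies."""
    result: Dict[str, Dict[str, str]] = {}
    for lib, reqs in needs.items():
        result[lib] = {}
        for dep, minimum in reqs.items():
            candidates = [v for v in available[dep] if satisfies(v, minimum)]
            if not candidates:
                raise ValueError(f"no version of {dep} satisfies {lib}")
            result[lib][dep] = max(candidates, key=parse_version)
    return result


try:
    resolve_global(NEEDS, AVAILABLE)
except ValueError as error:
    print("global resolution failed:", error)

print(resolve_local(NEEDS, AVAILABLE))
```

With a single shared copy the two libraries conflict, while with private copies each one gets the newest compatible version. As the first principle notes, the private-copy result is only safe when the dependency is an implementation detail and does not leak through a library's public API.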
For a long time, the source control ecosystem was built entirely around centralized systems. The spread of distributed version control systems such as Git fundamentally changed the situation: although Git may be harder than Subversion for non-technical people to master, the benefits of decentralization are many and varied. But no one has yet managed to create the Git of package management. If someone assures you that package management is a solved problem and that all that remains is to reimplement Bundler, I ask you: think properly about decentralization as well.