How our mirror works


    A mirror is a copy of one information resource's data hosted on another. Mirrors make the same content available from several sources. This is how *nix distributions are delivered, for example: copies of the repositories are stored on numerous mirrors around the world. Mirrors spread the load sensibly and keep package download speeds high.

    Our company also runs its own package mirror, which stores copies of the repositories of popular Linux distributions. In this article we describe in detail how it is built.

    When we launched the cloud server project in 2010, we chose the net-install model for it: distributions are installed by their “native” installer from one of the official mirrors. With this model you always get the latest software versions with all the recent changes made by the distribution maintainers. Another advantage of net-install is that it avoids a number of problems associated with cloned instances (the need to regenerate SSH keys, file system UUIDs, and so on).

    As the main mirror we chose the Yandex mirror, because it is located close to us and contains all the repositories our customers need. At first it suited us perfectly. But then the unexpected happened. The number of installations grew, and our engineers were busy testing templates; in the end Yandex, outraged by the huge number of identical requests, simply closed access to its mirror for our subnets.

    We began to look for a solution that would ensure stability and minimize the likelihood of emergencies. Our idea was to set up nginx as a proxy server in front of several mirrors (uplinks). This seemed reasonable and reliable: even if one of the uplinks goes down, we can still download files from another. However, we immediately ran into the problem that mirrors are structured differently: the CentOS repository could be in /centos on one uplink, in /CentOS on another, and in /www/mirror/srv/pub/centos on a third.

    Universal mirrors that carry the repositories of every distribution we need (CentOS, Debian, Ubuntu, OpenSUSE) can be counted on the fingers of one hand, so we had to compile a separate list of mirrors for each distribution.
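
    A per-distribution mirror list of this kind can be sketched in Python as follows. This is only an illustration: the host names are placeholders on example.org, and the helper candidate_urls is a hypothetical name, not part of our actual nginx configuration.

```python
from urllib.parse import urljoin

# Hypothetical uplinks: host names are placeholders, only the differing
# path prefixes mirror the situation described in the text.
MIRRORS = {
    "centos": [
        "http://mirror-a.example.org/centos/",
        "http://mirror-b.example.org/CentOS/",
        "http://mirror-c.example.org/www/mirror/srv/pub/centos/",
    ],
    "debian": [
        "http://mirror-a.example.org/debian/",
    ],
}


def candidate_urls(distribution, relative_path):
    """Return the URLs to try, in order, for a path inside a repository."""
    bases = MIRRORS[distribution]
    # Each base ends with "/", so urljoin appends the relative path to it.
    return [urljoin(base, relative_path.lstrip("/")) for base in bases]


if __name__ == "__main__":
    for url in candidate_urls("centos", "6/os/x86_64/repodata/repomd.xml"):
        print(url)
```

    Keeping the full base path per uplink hides the differences in directory layout from the code that requests a file.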

    When we put this idea into practice, we ran into far more serious difficulties:
    • uplink speed is unstable: very often the same host delivers 5-10 Mb/s and a couple of hours later no more than 5-20 Kb/s. Since the installer downloads packages one at a time, such swings in speed can stretch an installation out indefinitely;
    • some uplinks were misconfigured: in response to a package request we sometimes received, instead of an RPM package, an HTML page saying “It works!”;
    • on some uplinks packages listed in the catalog were missing, or were present but had incorrect checksums. This could happen, for example, because synchronization with upstream ran in the wrong order: the index files were copied first and the packages afterwards, rather than the other way round. Errors could also be caused by a misconfigured rsync that wrote files in place instead of saving the contents to a temporary file and then replacing the original atomically.
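
    Two of the failure modes above, an HTML page served instead of an RPM and a file overwritten in place, can be guarded against as in the following Python sketch. It is not how rsync or our mirror does it; the function names are hypothetical. The only external fact assumed is that an RPM file starts with the four magic bytes ED AB EE DB.

```python
import hashlib
import os
import tempfile

RPM_MAGIC = b"\xed\xab\xee\xdb"  # first four bytes of an RPM file


def looks_like_rpm(payload):
    """Cheap sanity check that rejects an HTML error page sent as a package."""
    return payload.startswith(RPM_MAGIC)


def store_atomically(payload, destination, expected_sha256):
    """Write payload to destination only if its SHA-256 digest matches."""
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch for " + destination)
    directory = os.path.dirname(os.path.abspath(destination))
    # The temporary file lives in the same directory so that the final
    # rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
        os.replace(tmp_path, destination)  # readers see old or new, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

    A reader that opens the destination during the update gets either the complete old file or the complete new one, which is exactly what in-place writing cannot guarantee.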

    Because of all these difficulties, automatic installation failed more than once. To get rid of the failures once and for all, we created our own mirror. It is available only from Selectel's IP addresses: we pay for outgoing traffic and do not risk opening the mirror to the public, because the load could easily reach 10-20 gigabits.

    By creating our own mirror we solved all of the problems mentioned above. Running our own mirror also brought the following advantages:
    • synchronization with uplinks happens without interrupting customer service and does not affect the working copy that is served to customers;
    • a synchronized copy replaces the current one only if the checksums of all new packages match;
    • if an uplink is unavailable for some reason or returns erroneous data, the mirror keeps serving data from the old but working copy;
    • synchronization is configured per distribution: for some distributions it can run less often than for others. Some repositories can also be cloned only partially.
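
    The rule that a fresh copy is promoted only when every new package passes its checksum can be sketched as below. This is a simplified illustration rather than the code of our scripts: failed_packages and its arguments are hypothetical names, and the catalog is assumed to be already parsed into a mapping from relative path to SHA-256 digest.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large packages do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_packages(shadow_root, catalog, changed_paths):
    """Return changed packages that are missing or do not match the catalog.

    catalog maps a path relative to shadow_root to the expected SHA-256;
    changed_paths lists only the files touched by the last synchronization.
    """
    failed = []
    for relative in changed_paths:
        full = os.path.join(shadow_root, relative)
        expected = catalog.get(relative)
        if (expected is None or not os.path.isfile(full)
                or sha256_of(full) != expected):
            failed.append(relative)
    return failed
```

    An empty result means the synchronized copy may replace the current one; anything else leaves the old copy in service.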

    Operating systems on our dedicated servers are installed from this mirror as well.

    How do repositories work?

    As a rule, a repository consists of two main parts: a catalog (index) and a pool (package storage).

    The catalog stores information about every package in the repository: name, description, architecture, version, checksums, and in some cases also dependencies and package contents. The catalog also records where in the pool the file for each version of each package is located.

    The pool stores the package files themselves. They can be laid out in any hierarchy or simply placed in a single directory.

    RPM repositories

    At the root of each RPM repository is the repodata directory, which holds the catalog files. All sections of the catalog are described in the repomd.xml file. Each section is a separate file in the repodata directory. The description gives the path to the file containing the section, as well as its checksum.

    The contents of the repomd.xml file may look, for example, like this:
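
    As an illustration, the Python sketch below embeds a shortened, hypothetical repomd.xml and lists its sections with the standard library XML parser. The checksums and file names are placeholders, not values from a real repository, and list_sections is a made-up helper name; the XML namespace is the one used by repomd.xml files.

```python
import xml.etree.ElementTree as ET

REPO_NS = {"repo": "http://linux.duke.edu/metadata/repo"}

# Shortened illustrative document: digests and hrefs are placeholders.
SAMPLE_REPOMD = """\
<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary">
    <checksum type="sha256">PLACEHOLDER-PRIMARY-DIGEST</checksum>
    <location href="repodata/primary.xml.gz"/>
  </data>
  <data type="filelists">
    <checksum type="sha256">PLACEHOLDER-FILELISTS-DIGEST</checksum>
    <location href="repodata/filelists.xml.gz"/>
  </data>
  <data type="group">
    <checksum type="sha256">PLACEHOLDER-GROUP-DIGEST</checksum>
    <location href="repodata/comps.xml"/>
  </data>
  <data type="other">
    <checksum type="sha256">PLACEHOLDER-OTHER-DIGEST</checksum>
    <location href="repodata/other.xml.gz"/>
  </data>
</repomd>
"""


def list_sections(repomd_text):
    """Map section type to (path, checksum algorithm, checksum value)."""
    root = ET.fromstring(repomd_text)
    sections = {}
    for data in root.findall("repo:data", REPO_NS):
        checksum = data.find("repo:checksum", REPO_NS)
        location = data.find("repo:location", REPO_NS)
        sections[data.get("type")] = (
            location.get("href"), checksum.get("type"), checksum.text)
    return sections


if __name__ == "__main__":
    for name, (href, algorithm, value) in list_sections(SAMPLE_REPOMD).items():
        print(name, href, algorithm, value)
```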

    The RPM catalog consists of the following sections:
    • primary - contains a description of all packages stored in the repository, the paths to the files of these packages and their checksums;
    • filelists - contains lists of files included in each package;
    • group - contains descriptions of the groups of packages that are installed using yum groupinstall;
    • other - contains additional information (for example, change logs - changelogs).

    The structuring and grouping of packages for different operating systems is organized differently. For example, CentOS stores all package files in the Packages directory located in the root of the repository. In addition, a separate repository has been created for each of the available architectures.

    OpenSUSE stores packages for all architectures in a single repository, with separate pools in the directories i686, x86_64, and so on.

    DEB repositories

    In DEB repositories all packages are stored in a common pool. This avoids duplicating packages that are included in several releases. Each release has its own catalog in the repository.

    Parsing of the catalog starts with the file /dists/[distribution]/Release (distribution here is the release codename: squeeze, wheezy, jessie). It contains a list of the release components, as well as the sizes and checksums of all index files. The Release file is signed by the archive maintainers; the signature is stored in the Release.gpg file (sometimes the contents of Release together with the signature are in the InRelease file).

    The contents of the pool are described in two types of index files: Packages (binary packages) and Sources (source packages).

    The path to the Packages file is /dists/[distribution]/[component]/binary-[architecture]/Packages, and the path to the Sources file is /dists/[distribution]/[component]/source/Sources.

    Note: index files are sometimes compressed with gzip or bzip2; in that case the extension .gz or .bz2 is appended to the file name. Some clients also support LZMA (.lzma), XZ (.xz), and LZIP (.lz).
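
    The path scheme above is easy to express in code. The following Python sketch builds index paths for a given release; the function names and the order in which compressed variants are tried are assumptions made for the example, not something the repository format prescribes.

```python
# Assumed preference order; "" stands for the uncompressed file.
EXTENSIONS = (".xz", ".bz2", ".gz", "")


def packages_index_path(distribution, component, architecture, extension=""):
    """Path of the binary package index for one component and architecture."""
    return "/dists/{}/{}/binary-{}/Packages{}".format(
        distribution, component, architecture, extension)


def sources_index_path(distribution, component, extension=""):
    """Path of the source package index for one component."""
    return "/dists/{}/{}/source/Sources{}".format(
        distribution, component, extension)


def packages_index_candidates(distribution, component, architecture):
    """All variants of the Packages index a client might look for."""
    return [packages_index_path(distribution, component, architecture, ext)
            for ext in EXTENSIONS]


if __name__ == "__main__":
    for path in packages_index_candidates("wheezy", "main", "amd64"):
        print(path)
```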

    Here is an example of an entry from the Packages file:
    Package: openssh-server
    Source: openssh
    Version: 1:6.2p2-6
    Installed-Size: 747
    Maintainer: Debian OpenSSH Maintainers 
    Architecture: amd64
    Replaces: openssh-client (<= 2.16), libcomerr2 (>= 1.01), libgssapi-krb5-2 (>= 1.10+dfsg~), libkrb5-3 (>= 1.6.dfsg.2), libpam0g (>= 0.99.7.1), libselinux1 (>= 1.32), libssl1.0.0 (>= 1.0.1), libwrap0 (>= 7.6-4~), zlib1g (>= 1:1.1.4), openssh-client (= 1:6.2p2-6), sysv-rc (>= 2.88dsf-24) | file-rc (>= 0.8.16), libpam-runtime (>= 0.76-14), libpam-modules (>= 0.72-9), adduser (>= 3.9), dpkg (>= 1.9.0), lsb-base (>= 4.1+Debian3), procps
    Recommends: xauth, ncurses-term
    Suggests: ssh-askpass, rssh, molly-guard, ufw, monkeysphere, openssh-blacklist, openssh-blacklist-extra
    Conflicts: rsh-client (<< 0.16.1-1), sftp, ssh (<< 1:3.8.1p1-9), ssh-krb5 (<< 1:4.3p2-7), ssh-nonfree (<< 2), ssh-socks, ssh2
    Description: secure shell (SSH) server, for secure access from remote machines
    Multi-Arch: foreign
    Description-md5: 842cc998cae371b9d8106c1696373919
    Tag: admin::login, implemented-in::c, interface::daemon, network::server,
     protocol::ssh, role::program, security::authentication,
     security::cryptography, use::login, use::transmission
    Section: net
    Priority: optional
    Filename: pool/main/o/openssh/openssh-server_6.2p2-6_amd64.deb
    Size: 257438
    MD5sum: 1f18e568c17d81cc2c493ee48c93a03f
    SHA1: 207f131bbd4d709a47bcb69c997520c998ed7593
    SHA256: 242b7f041292dea0702b24e19dc6355f47147796b227f1024665920a493641f2
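
    An entry like this is a set of "Name: value" fields, where a line that starts with whitespace continues the previous field. A minimal Python sketch of reading one entry might look as follows; parse_stanza is a hypothetical helper, and the sample is an abridged version of the entry above.

```python
# Abridged sample entry; continuation lines start with a space.
SAMPLE_STANZA = """\
Package: openssh-server
Version: 1:6.2p2-6
Architecture: amd64
Tag: admin::login, implemented-in::c,
 protocol::ssh, role::program
Filename: pool/main/o/openssh/openssh-server_6.2p2-6_amd64.deb
Size: 257438
SHA256: 242b7f041292dea0702b24e19dc6355f47147796b227f1024665920a493641f2
"""


def parse_stanza(text):
    """Parse one Packages entry into a dict of field name to value."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and current is not None:
            # Continuation line: append to the field being read.
            fields[current] += "\n" + line.strip()
        else:
            # Split on the first colon only, so "1:6.2p2-6" stays intact.
            name, _, value = line.partition(":")
            current = name.strip()
            fields[current] = value.strip()
    return fields


if __name__ == "__main__":
    entry = parse_stanza(SAMPLE_STANZA)
    print(entry["Filename"], entry["Size"], entry["SHA256"])
```

    The Filename, Size, and checksum fields are what a mirror needs in order to locate a package in the pool and verify it.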

    How our mirror works

    The repository of each distribution is stored on the mirror in two copies: a shadow (background) copy and a working (foreground) copy. Both copies reside on a separate LVM volume, which lets us add disk space on the fly. The working copy holds a verified state of the mirror and is served by nginx. The shadow copy is synchronized with the upstream mirror and then goes through thorough validation.

    Validation includes checking the catalog and its digital signature (if there is one), as well as the checksums of all index files. Verifying the checksums of every package is expensive: the pools of some repositories hold tens or even hundreds of gigabytes of packages. Therefore checksums are verified only for new packages that rsync has touched. After verification the shadow and working copies are swapped. This is done with a plain mv. The swap is thus practically atomic (three quick mv calls are enough to exchange the directories), and possible downtime is kept to a minimum. Files that are already open continue to be served during the swap.

    After the two copies have been swapped, the shadow copy is brought up to date locally from the working copy.
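
    The three-step exchange can be sketched in Python with os.rename, which does the same thing as mv within one file system. This is an illustration under assumptions, not an excerpt from our scripts: swap_copies and the ".swap" suffix are made-up names, and both directories are assumed to be on the same file system.

```python
import os


def swap_copies(working, shadow):
    """Exchange two directories on the same file system with three renames."""
    parking = working + ".swap"  # hypothetical temporary name
    os.rename(working, parking)  # 1. move the current working copy aside
    os.rename(shadow, working)   # 2. the validated copy becomes the working one
    os.rename(parking, shadow)   # 3. the old working copy becomes the shadow
```

    Between the first and second rename the working path briefly does not exist, which is why the swap is only practically atomic. A process that already holds an open file keeps reading it, because renaming a directory does not invalidate open file descriptors.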


    The algorithm described above is implemented in our set of scripts called mirror-sync, recently published on GitHub under the GNU GPL. We hope that our work will be useful to a wide audience and that some of our readers will draw on our experience when building their own mirror. We will take all comments and suggestions for improving the mirror into account in future work.

    Readers who cannot comment on posts on Habr are welcome to comment on our blog.
