
VexorCI under the hood
Hello, Habr! Time for the long-awaited post about the internals of Vexor, a cloud continuous-integration service for developers that lets you test projects efficiently and pay only for the resources you actually use.

History
The project grew out of internal tooling at the Evrone company. Initially we used Jenkins to run tests, but when it became obvious that the CI service needed further development for our specific tasks, we switched to GitlabCI. There were several reasons for this:
- GitlabCI is written in Ruby, the team's native language;
- it is small and simple, which makes it easy to modify.
During use, as often happens, GitlabCI mutated quite a lot. By the time it had little in common with the original, we simply rewrote everything from scratch. That is how the first version of Vexor appeared, used only within the team.
Evrone develops several projects in parallel. Some of them are very large, and every commit triggers a large test run. This means you always need to keep many worker servers ready, and pay for them in full.
But if you think about it, two things become clear:
- At night and on weekends, test workers are not needed at all.
- If the team is large and the process is arranged so that many commits land at the same time, many parallel workers are needed. For example, with weekly iterations, several features are usually finished at the end of an iteration, and 5-20 pull requests arrive at once, each accompanied by a test run. This creates situations where you need, say, 20+ workers.
Obviously, workers need to be started and shut down automatically, based on current demand.
The first version of auto-scaling was written in a couple of hours on top of Amazon EC2. The implementation was very naive, but even so, our server bills dropped immediately. CI also became much more stable, because we eliminated the situation where a sudden influx of test runs left us short of workers. The cloud integration was reworked several times after that.
Now the cloud holds a pool of servers, managed by a separate application to which the workers connect. The application monitors their statuses (alive / crashed / failed to start / idle) and automatically resizes the pool based on the current state of the workers, the size of the task queue, and a rough estimate of how long the queued tasks will take.
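The pool-sizing logic can be sketched roughly like this. This is a simplified illustration only: the class name, constants and scaling formula are our own guesses for the article, not Vexor's actual code.

```ruby
# Rough sketch of autoscaling: derive a desired pool size from the queue
# length and an estimate of job duration. All names and constants here
# are illustrative, not Vexor's real implementation.
class PoolSizer
  JOB_ESTIMATE_SEC = 300   # assumed average build duration
  TARGET_WAIT_SEC  = 600   # how long a job may sit in the queue
  MIN_WORKERS      = 1
  MAX_WORKERS      = 50

  # queue_size: jobs waiting; busy: workers currently running a job
  def desired_size(queue_size, busy)
    # enough extra workers to drain the queue within TARGET_WAIT_SEC
    needed = busy + (queue_size * JOB_ESTIMATE_SEC / TARGET_WAIT_SEC.to_f).ceil
    needed.clamp(MIN_WORKERS, MAX_WORKERS)
  end
end
```

The controller would periodically compare `desired_size` with the number of live workers and boot or terminate servers to close the gap.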
Initially we used Amazon EC2 as the cloud. But on Amazon the disks attached to the servers are not physically located on the host; they live in separate storage connected over the network. Under heavy disk load (and test-run speed depends heavily on disk speed), throughput is limited by the bandwidth of the channel to the storage and the allocated quota. Amazon solves this problem only for extra money, which we did not want to pay at all. We considered other options: Rackspace, DigitalOcean, Azure and GCE. After comparing them, we settled on Rackspace.
Architecture
Now a little about architecture.

VexorCI is not a monolithic application but a set of related applications, which communicate mostly through RabbitMQ. Why RabbitMQ is a good fit in our case:
- It supports message acknowledgements and publisher confirms. This solves a lot of problems; in particular, it lets you write applications in the "Let it crash" style popular in Erlang: when a problem occurs the process simply crashes, and as soon as the service returns to normal operation all tasks get processed and none are lost.
- RabbitMQ is a broker that lets you build branched topologies of queues and exchanges and configure routing between them. This makes it possible, for example, to easily test new versions of services in production on real production tasks.
- RabbitMQ handles large messages reliably. Our record so far is 120 MB in a single message. VexorCI does not need to process millions of messages per minute, but a single message can weigh tens of megabytes or more (for example, when transferring logs).
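To illustrate why acknowledgements enable the "Let it crash" style, here is a toy in-memory queue. This is not RabbitMQ or its client API, just the semantics: a message is removed only after an explicit ack, so a consumer crash simply means redelivery.

```ruby
# Toy illustration of at-least-once delivery via acknowledgements.
# Not the RabbitMQ API -- only the semantics that make "Let it crash" safe.
class AckQueue
  def initialize
    @ready   = []  # waiting for delivery
    @pending = []  # delivered but not yet acknowledged
  end

  def publish(msg)
    @ready << msg
  end

  # Deliver one message; it stays "in flight" until acked.
  def pop
    msg = @ready.shift
    @pending << msg unless msg.nil?
    msg
  end

  # Consumer finished successfully: drop the message for good.
  def ack(msg)
    @pending.delete(msg)
  end

  # Consumer crashed: everything unacked goes back to the front.
  def requeue_unacked
    @ready = @pending + @ready
    @pending = []
  end

  def size
    @ready.size
  end
end
```

If a worker dies between `pop` and `ack`, the task is redelivered rather than lost, which is exactly why a crashed service can simply restart and carry on.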
RabbitMQ also has well-known shortcomings that we have to deal with:
- It requires a perfectly working network between client and server; ideally the server should sit on the same physical host as the clients. Otherwise RabbitMQ clients behave like canaries in a submarine: they keel over at any network problem that no other service even notices.
- With RabbitMQ it is difficult to provide high availability. There are no fewer than three solutions for this, but only federation and shovel provide real high availability. Unlike clustering (which you can read about here ), they are not so easy to integrate into an existing application architecture, since they do not guarantee data consistency.
Since our servers are physically located in several data centers, and the worker pool can fail over to another data center if Rackspace has problems, we use federation to keep RabbitMQ running stably.
Logs
SOA brings one more difficulty: collecting logs becomes a non-trivial task. With only a couple of applications you don't have to think about it: the logs live on a few hosts that you can always log into and grep. But when there are many applications, and a single event is processed by several services, you need a single place where all the logs are stored.
In Vexor the elasticsearch + logstash + logstash-forwarder stack is responsible for this. All our applications write logs directly in JSON format; every application event is logged, and PostgreSQL, RabbitMQ, Docker and other system messages (dmesg, mail and so on) are collected as well. We try to log as much as possible, because workers live only for a limited time: after a worker's server is shut down, the logs are all we have to diagnose a problem.
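Writing logs as JSON from the start keeps them machine-parseable for logstash without extra grok filters. A minimal sketch of such a logger; the field names are illustrative, not Vexor's actual schema:

```ruby
require "json"
require "logger"
require "time"

# Build a Logger that emits one JSON object per line, so logstash can
# parse events directly. Field names here are illustrative only.
def build_json_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, msg|
    JSON.generate(
      "@timestamp" => time.utc.iso8601,
      "level"      => severity,
      "message"    => msg
    ) + "\n"
  end
  logger
end

build_json_logger($stdout).info("build started")
```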
Container
Workers run tests in Docker. It is an excellent solution for working with isolated containers and provides all the necessary tooling. Docker is now very stable and causes a minimum of problems (especially on a fresh OS kernel), although bugs do still turn up, for example like that .
Tests in Vexor run in a container based on Ubuntu 14.04, with the popular services and libraries needed for work preinstalled. A complete list of packages and their versions can be found here . We update the image periodically, so the preinstalled software is always fresh.
To use one image for all supported languages without making it too large, we install the required language versions (Ruby, Python, Node.js, Go; a complete and current list of supported languages is here ) from packages while a build is being prepared. This takes only a few seconds, and it lets us support a large set of language versions without bloating the image.
We rebuild the deb packages for the image at least once a week. They are always available in the public repository at "https://mirror.pkg.vexor.io/trusty main". If, for example, you run Ubuntu 14.04 amd64, then by adding this repository you immediately get 12 versions of Ruby, already compiled with the latest versions of bundler and gem, ready to install.
To avoid running apt-get update when installing packages at runtime, and to get fuzzy matching of versions, we wrote a utility that installs packages of the required versions from our repository very quickly, for example:
$ time vxvm install go 1.3
Installed to /opt/vexor/packages/go-1.3
...
real 0m3.765s
Configuration
Ideally, Vexor itself figures out what needs to run for your project to work. We are working on automatically recognizing which technologies you need and starting them, but this is not always possible. So for unrecognized cases we ask users to create a configuration file for the project.
To avoid reinventing the wheel, we use the .travis.yml configuration format. Travis CI is one of the most popular and well-known services today, so it is good if users have a minimum of difficulty switching from it: if the project's root directory already has a .travis.yml, everything starts instantly, and the team gets the joys of fast CI at our modest per-minute rates :)
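For reference, a minimal .travis.yml for a Ruby project looks something like this (the versions and commands are illustrative, not requirements of our service):

```yaml
language: ruby
rvm:
  - 2.1.2
services:
  - postgresql
before_script:
  - bundle exec rake db:setup
script: bundle exec rspec
```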
Servers
We administer many servers running many different tasks, so we actively use tools such as Ansible, Packer and Vagrant. Ansible handles server provisioning and configuration and does its job perfectly. Packer and Vagrant are used to build and test the Docker images and worker servers. Image builds are automated with VexorCI itself, which automatically rebuilds everything needed.
Who needs our project?
- Small projects that run tests infrequently and do not want to pay much or think about system administration and deployment, while still enjoying the delights of continuous integration.
- Large projects with many tests, to which we give an unlimited amount of resources and the ability to parallelize tests, speeding up runs several times over.
- Projects with large teams, for which we solve the problem of waiting in line for test runs: any number of builds can run simultaneously, eliminating long waits.
Friends, in conclusion, we invite everyone to try Vexor . Connect your projects and enjoy the benefits.