Testing Jet9 - Fail-Safe Website Hosting with Geographic Optimization

Published on July 10, 2015

    We have created Jet9, a platform for running web applications, and are currently running a public beta test of web hosting built on it. Here we describe what it is, what problems it solves, and how it is organized.

    In subsequent articles, we will describe the internals of Jet9 in more detail: the technical solutions used for its components, the pitfalls we encountered, and how we fixed or worked around them.

    The purpose of these publications is to attract specialists to the testing and collect bug reports, to inform potential clients about the project, and to share experience with colleagues. As articles appear here, we will also publish them on our website.

    What is Jet9


    We called the service “Fail-Safe Website Hosting with Geographic Optimization”. Despite its length, the name covers only a small part of what the platform can do, though the most noticeable part.

    The main features of Jet9 are increased fault tolerance, an integrated CDN / ADN, and guaranteed resource allocation across a wide range of sizes. All this comes as one ready-made solution: the customer does not have to wire together a large stack of components or change the architecture of the site. The result is stable, fast operation of the site with minimal downtime or degradation. The solution is aimed at web projects that already have such requirements but find implementing and maintaining them independently too difficult or too expensive.

    A private installation of the platform (Private Jet9) is designed for small and medium-sized projects that need anywhere from several servers to several hundred. Web hosting (PaaS Jet9) offers both minimal tariffs for small, low-traffic sites and large tariffs that take up almost all the resources of a powerful server, for demanding sites with a high load: up to several hundred requests per second and hundreds of thousands of users per day.

    Jet9 web hosting runs on standard TrueVirtual V8 and TrueVirtual T4 servers with network storage and a local SSD cache.

    Development drew on several pieces of know-how and patent-pending inventions, but most of the work was extensive research; painstaking engineering to connect and refine many components; programming the missing pieces; long testing of behavior under various combinations of conditions; documenting everything at every stage; and drawing up procedures for routine and emergency operations.

    How Jet9 Works


    Web environment

    Control panels and web stacks

    Control panels designed for shared web hosting can serve as the user interface. Currently the ISPManager 5 control panel is used. SSH, SFTP, and FTPS are available for automated deployment.

    The web environment currently corresponds to the generally accepted LAMP stack: Linux, Apache, MySQL, PHP. In addition to regular CGI, you can run FastCGI applications (Perl, Python). That is, everything available on conventional web hosting. Private Jet9 can also run the Unicorn, Thin, and Puma application servers for Ruby on Rails, Tomcat and Jetty for Java / Java EE, Python WSGI applications, and the PostgreSQL, MongoDB, and CouchDB databases. At the current stage of testing, however, web hosting offers only LAMP; these other stacks are not available.

    Resource Accounting and Load Isolation

    Long ago, we implemented resource accounting and isolation of hosting clients on FreeBSD 4.1. We had to add a lot of patches to the kernel, Apache, and some system utilities; a lot could be written about that. Now it can be summed up briefly: cgroups for the different classes of memory, for CPU, and for disk I/O; rlimits on processes and open files. Each user has their own Apache instance, which simplifies the handling of user privileges for the web server and scripts, and simplifies the control of resource consumption.
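
    As a rough illustration of the cgroup and rlimit combination described above, here is a minimal Python sketch that turns a per-user tariff into limit values. Everything specific in it is an assumption: the `Tariff` fields, the cgroup v2 control-file names (`memory.max`, `cpu.max`, `io.max`), and the block device `8:0` are chosen for illustration, and the article does not say which cgroup version or values Jet9 actually uses.

```python
import resource
from dataclasses import dataclass


@dataclass(frozen=True)
class Tariff:
    """Hypothetical per-user limits; names and units are illustrative."""
    memory_bytes: int
    cpu_percent: int        # 100 = one full CPU core
    io_bytes_per_sec: int
    max_processes: int
    max_open_files: int


def cgroup_v2_settings(tariff, io_device="8:0", period_us=100_000):
    """Return cgroup v2 control-file contents for one user's cgroup."""
    # cpu.max is "<quota> <period>" in microseconds.
    quota_us = period_us * tariff.cpu_percent // 100
    bps = tariff.io_bytes_per_sec
    return {
        "memory.max": str(tariff.memory_bytes),
        "cpu.max": f"{quota_us} {period_us}",
        # io.max is keyed by the block device's major:minor number (assumed here).
        "io.max": f"{io_device} rbps={bps} wbps={bps}",
    }


def rlimit_settings(tariff):
    """Return (soft, hard) pairs in the form resource.setrlimit() expects."""
    return {
        resource.RLIMIT_NPROC: (tariff.max_processes, tariff.max_processes),
        resource.RLIMIT_NOFILE: (tariff.max_open_files, tariff.max_open_files),
    }
```

    In such a setup, each cgroup value would be written to the matching file in the user's cgroup directory, and each rlimit pair would be passed to `resource.setrlimit()` in the user's web server process before it starts serving requests.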

    Kernel tweaks are required only for more flexible control of access rights and additional isolation of users, and are used only for Jet9 web hosting.

    cgroups do not divide load perfectly. To compensate for these inaccuracies, at least 10% of all resources are always set aside in a shared reserve, which lets each user exceed their tariff limits by up to 10%.
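
    The arithmetic behind the 10% reserve can be checked with a short sketch. This is our reading of the scheme, not a description of Jet9's accounting code: if at most 90% of a server is sold as tariffs and every user is allowed 10% above their limit, then even with all users at their maximum the total is 0.9 × 1.1 = 0.99 of the server. The function names and the sample numbers are hypothetical.

```python
def sellable_capacity(total, reserve_percent=10):
    """Capacity that may be sold as tariffs after setting aside the reserve."""
    return total * (100 - reserve_percent) // 100


def burst_limit(tariff_limit, overload_percent=10):
    """Hard ceiling for one user: the tariff limit plus the allowed overload."""
    return tariff_limit * (100 + overload_percent) // 100


def worst_case_usage(tariff_limits, overload_percent=10):
    """Total usage if every user hits their burst ceiling at the same time."""
    return sum(burst_limit(t, overload_percent) for t in tariff_limits)


# Example: a server with 64000 MB of RAM split into three equal tariffs.
total_mb = 64000
tariffs = [sellable_capacity(total_mb) // 3] * 3
assert worst_case_usage(tariffs) <= total_mb
```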

    Backups

    Backups are stored in an independent long-term archive storage system located in a third, geographically remote data center. An incremental backup is performed every day. Backups are file-level copies made with rsync. Block device snapshots are not used because, if the file system metadata is damaged, both the replica in the cluster and the backup copy would be damaged.

    Archives are rotated according to a multi-cycle scheme that thins out old copies, so that fewer old copies and more recent copies are kept. For example, a set of seven archives will contain copies of approximately the following ages: 1 year, 6 months, 3 months, 1 month, 6 days, 2 days, 1 day.
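
    One common way to make daily incremental file-level copies with rsync is the `--link-dest` option, which hard-links unchanged files to the previous copy so that only changed files take new space. The article does not say whether Jet9 uses this option, so the sketch below is an assumption, and the paths and date-named directories are hypothetical. It only builds the command line and does not run it.

```python
from pathlib import Path
from typing import List, Optional


def rsync_incremental_command(source: Path, archive_root: Path,
                              today: str, previous: Optional[str] = None) -> List[str]:
    """Build an rsync command that copies `source` into archive_root/today."""
    cmd = ["rsync", "--archive", "--delete"]
    if previous is not None:
        # Unchanged files become hard links into the previous day's copy.
        cmd.append(f"--link-dest={archive_root / previous}")
    # The trailing slash makes rsync copy the directory's contents.
    cmd += [f"{source}/", str(archive_root / today)]
    return cmd


print(rsync_incremental_command(Path("/home/u1"), Path("/backup/u1"),
                                "2015-07-10", "2015-07-09"))
```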
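
    The thinning idea can be sketched as follows. This is a deliberately simple stand-in for the multi-cycle scheme, whose exact rules the article does not give: for each target age from the example above, keep the existing backup whose age is closest to it. The month and year lengths in days are rounded assumptions.

```python
from datetime import date, timedelta

# Target ages from the seven-archive example, in days (months rounded to 30).
TARGET_AGES_DAYS = (1, 2, 6, 30, 90, 180, 365)


def select_backups_to_keep(backup_dates, today, target_ages=TARGET_AGES_DAYS):
    """Return the backup dates to keep, oldest first; the rest can be deleted."""
    if not backup_dates:
        return []
    keep = set()
    for age in target_ages:
        wanted = today - timedelta(days=age)
        # Keep the backup closest in time to the wanted age.
        keep.add(min(backup_dates, key=lambda d: abs((d - wanted).days)))
    return sorted(keep)


today = date(2015, 7, 10)
daily = [today - timedelta(days=n) for n in range(1, 401)]
kept = select_backups_to_keep(daily, today)
print([(today - d).days for d in kept])
```

    With 400 daily backups available, the kept copies have exactly the target ages. With a shorter history, several targets map to the same oldest backup, so fewer than seven copies are kept until the archive fills up.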

    Failover Cluster

    Each web backend runs on top of an HA cluster with replicated storage. The cluster has two sides, a master and a backup, located in two independent data centers. At any given moment only one side can be active: either the master or the backup. For web hosting, the cluster policy forbids split brain, the situation where the master and the backup are both active at the same time. This policy follows from the requirement to keep data consistent. Private installations can use other policies that allow split brain, to maximize service availability even at the cost of data inconsistency.

    Each side of the cluster has its own storage, which handles all I/O and whose changes are replicated in real time to the other data center, in another availability zone, from master to backup. For us, this is more convenient than the alternative, a shared storage distributed across both data centers. Building a cluster on replicated storage is generally much more complicated than on shared storage, but it gives a significant advantage: lower latency requirements and significantly lower bandwidth requirements for the links between data centers and, as a result, the ability to build more efficient systems. We now use three data centers, two of which have a direct connection, and both are connected to the third via the Internet. HA clusters are used both on master-backup pairs connected via the direct connection and on pairs connected via the Internet.

    When we first started using Pacemaker for internal services, it ran on top of Heartbeat, and we added our own arbitration mechanisms to protect against split brain. In Jet9, we switched to Pacemaker and Corosync with quorum. Pacemaker is a good, powerful product, but it has many inconveniences and quirks that complicate its use with a large number of clusters and on unreliable or complex networks. We have therefore developed our own cluster controller, better suited to our tasks. It has not yet been running long enough, so in production we continue to use Pacemaker.

    The master and the backup use different IP addresses from differently routed networks, so that a routing failure does not make both the master and the backup unavailable. This is more reliable than a migrating IP address, since the latter puts two points of failure in series: external routing (BGP) and internal routing (OSPF).

    On Jet9 web hosting, local storage uses a fast SSD and bcache in writeback mode for caching.
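
    The reason a quorum rule rules out split brain can be shown in a few lines. The sketch below is not Pacemaker, Corosync, or the Jet9 controller; it only illustrates the majority rule, and the three-vote layout (master, backup, and an arbiter in the third data center) is an assumption made for the example.

```python
def has_quorum(reachable_votes, total_votes):
    """A partition has quorum only with a strict majority of all votes."""
    return 2 * reachable_votes > total_votes


def partitions_allowed_to_run(partitions, total_votes):
    """Given groups of nodes that can see each other, return those with quorum."""
    return [p for p in partitions if has_quorum(len(p), total_votes)]


# The link between the master and the other two nodes is lost:
split = [{"master"}, {"backup", "arbiter"}]
print(partitions_allowed_to_run(split, total_votes=3))
```

    Because two disjoint partitions cannot both hold a strict majority, at most one side is ever allowed to run the service. If every node is isolated, no side runs, which trades availability for consistency, in line with the web hosting policy described earlier.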

    Geographic Optimization

    All sites are served through a network of web accelerators: geographically distributed caching web servers that deliver website content at maximum speed and cache it for repeat delivery. Requests to a site are served by the web accelerator closest to the visitor, so all pages open much faster.

    Unlike a CDN for static files, which requires modifying the site's scripts so that files are uploaded to it, the Jet9 CDN works with websites transparently and without modifications, fetching and distributing all content itself. CDNs and web accelerators are connected to the HA cluster automatically when the site is created, with no DNS or site configuration required.

    An additional advantage over foreign services is proper coverage in Russia. That is, the mirror closest to Tyumen is not in Holland but in Yekaterinburg. The test installation of Jet9 uses a small network: the UK, Moscow, St. Petersburg, and Novosibirsk. In production, Rostov-on-Don, Samara, Yekaterinburg, and Holland are added. The Far East remains a hard problem so far: given the high cost of connectivity relative to the population, it is not economically justified, but we will keep working on it.

    Geographic balancing uses a hybrid scheme: DNS anycast plus a calculation of speed and distance on the DNS servers. Squid is used as the reverse proxy, with Nginx in addition for SSL.
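
    The distance part of such a calculation can be sketched as a nearest-point lookup by great-circle distance. This is only an illustration: the speed component and the anycast layer are left out, the coordinates are approximate, and placing the UK and Holland points in London and Amsterdam is our assumption. The real algorithm on the Jet9 DNS servers is not described in this article.

```python
import math

# Approximate (latitude, longitude) of accelerator locations; illustrative only.
ACCELERATORS = {
    "UK": (51.51, -0.13),             # assumed: London
    "Holland": (52.37, 4.90),         # assumed: Amsterdam
    "Moscow": (55.75, 37.62),
    "St. Petersburg": (59.94, 30.31),
    "Rostov-on-Don": (47.23, 39.72),
    "Samara": (53.20, 50.15),
    "Yekaterinburg": (56.84, 60.61),
    "Novosibirsk": (55.03, 82.92),
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_accelerator(visitor, accelerators=ACCELERATORS):
    """Return the name of the accelerator closest to the visitor's location."""
    return min(accelerators, key=lambda name: haversine_km(visitor, accelerators[name]))


tyumen = (57.15, 65.53)
print(nearest_accelerator(tyumen))
```

    For a visitor in Tyumen this picks Yekaterinburg from the production network, matching the example in the text.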

    Testing


    Our main objective in this testing is to find problems in the API for managing domains on frontends and backends, in the integration of ISPManager with our web environment on backends, and in the integration of ISPManager with web accelerators and geo-balancing on frontends. Planned testing period: 1 - 2 months.

    We do not expect new defects in the HA clusters or the web accelerators on their own, since we have been running them in production for quite some time. But support for website control panels, the frontend and backend domain management APIs, and the frontend-backend link were built only this spring. So there is a considerable chance that problems will surface that we did not find ourselves.

    During testing we will regularly, with prior notice, trigger failures in various areas to check that the automation handles them correctly and to see how they affect the functioning of sites.

    Test participants will receive an additional 10% discount for 2 years on all Jet9 products, both web hosting and licenses for private installations.

    To join the test, simply place an order on jet9.ru with the required amount of resources and your details. There is no need to mention testing separately; for now, all orders are automatically processed as test orders.

    Besides direct testing, we would also be very interested in tricky technical questions: if there is a question we cannot answer, a bug may be hiding behind it. In the following articles, where we will talk more about each component, we will also include the questions asked here along with the answers.