Big migration: how we built a private cloud on RISC

    In a previous post, we started the story of our private cloud. In large companies, projects of this scale mean legacy systems and unexpected surprises during migration. Today we want to share our experience of migrating different systems and show a small piece of our infrastructure, which is heavily covered with "for official use only" stamps and all kinds of NDAs.



    Why x86 did not suit us


    In a small company, an administrator can deploy a server on Gentoo in whatever exotic configuration is most convenient to work with. When you are responsible for VTB's infrastructure, every detail is evaluated at numerous meetings by a large number of people. Otherwise, the deployment and operation of new systems can turn into big problems.

    RISC systems are widespread in the banking sector, and we are no exception. Some of our services already run on RISC architecture or are being migrated to it. At the same time, cloud providers for projects of our scale work mainly with x86. Having weighed all the pros and cons, we decided not to change the architecture and not to move all services to x86, leaving on RISC what had worked successfully on RISC before the cloud was introduced. Moreover, some x86 services were, on the contrary, moved to RISC during the project.

    Why did we decide to do this? Migration itself carries risks that we could not afford. A number of mission-critical systems meet the required operating parameters only on RISC, where their stability is higher. With comparable RISC and x86 configurations, our ABS and Oracle databases show higher performance on RISC. Finally, RISC lets us spend less on maintenance, which also matters in the banking sector. And in any case, neither our strict security department nor the legislation would allow us to entrust internal business-critical services to third parties for migration.

    A private cloud can not only take on existing mission-critical systems but also offers broad functionality for working with RISC-based systems. Previously, deploying a test or production system for the tasks described above took many months of planning, purchasing and commissioning equipment. Now an engineer can request, for example, a new Veritas InfoScale cluster of two LPAR pairs with resource groups configured for a DR scenario. The cloud has dozens of such flexible templates and scripts, from allocating a simple virtual machine or physical server for a specific task to deploying a cluster on the required technology.

    Of course, no one has abandoned x86: a huge number of tasks are still solved on x86 systems. The cloud has a single self-service portal based on HP technologies and products, which serves as a single entry point for managing both x86 and RISC systems.

    What do VTB and IBM Watson have in common?


    The core of the cloud system is the IBM P1-80A servers, built around the IBM POWER8 processor. IBM uses such servers in its Watson supercomputer. Their key advantage is a large number of cores and support for SMT8 (eight hardware threads per core), an analog of Intel Hyper-Threading. Services that parallelize well perform very well on systems with these CPUs.

    Each server has 16 POWER8 processors with a clock frequency of 4.02 GHz, 4 processors per system node. Each processor has 12 cores, which gives 192 cores per server in total. To use processor resources efficiently, the P1-80A physical servers are combined into a Power Enterprise Pool, and the Mobile Capacity on Demand (CoD) licensing scheme is also used. Each server has 8 TB of RAM installed. The built-in disk subsystem consists of 24 x 387 GB SFF SSDs in an expansion shelf. Logical partitioning (LPAR) is used to run applications, and Live Partition Mobility is used to migrate virtual partitions between physical servers.
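
    The per-server totals follow directly from the figures above. The short sketch below just does the arithmetic; the constant and function names are ours, and it assumes every core runs in SMT8 mode (8 hardware threads per core), which the text does not state explicitly.

```python
# Figures quoted in the text for one server.
PROCESSORS_PER_SERVER = 16
PROCESSORS_PER_NODE = 4
CORES_PER_PROCESSOR = 12
SSD_COUNT = 24
SSD_SIZE_GB = 387

# Assumption: all cores run in SMT8 mode (8 hardware threads per core).
SMT_THREADS_PER_CORE = 8


def server_totals():
    """Derive per-server totals from the per-component figures."""
    cores = PROCESSORS_PER_SERVER * CORES_PER_PROCESSOR
    return {
        "system_nodes": PROCESSORS_PER_SERVER // PROCESSORS_PER_NODE,
        "cores": cores,
        "hardware_threads": cores * SMT_THREADS_PER_CORE,
        "raw_ssd_gb": SSD_COUNT * SSD_SIZE_GB,
    }


if __name__ == "__main__":
    for name, value in server_totals().items():
        print(f"{name}: {value}")
```

    The raw SSD figure is capacity before any RAID or mirroring overhead, which the text does not describe.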



    A great advantage for us when concluding implementation and support contracts was that the equipment supplier is from Russia. Our partner is the Russian company Yadro, which actively implements solutions based on RISC architecture and has everything necessary to work with our domestic cryptography (which is critical for banks). Yadro is the first IBM OEM partner in Russia to hold the certificates required for assembling computing and storage systems based on IBM solutions.

    Migration


    Deploying a cloud system on RISC was not an easy task - as far as we know, no one in Russia has done this before at this scale. The migration began in the summer of 2017 and is now in full swing: more than 60 different IT systems are affected. We planned everything carefully so that the transfer would not affect the operation of our services.

    We approach everything carefully. We make active use of test environments and avoid unnecessary experiments with moving to x86. In the course of the migration, we switched to newer generations of POWER processors and newer versions of Oracle DBMS, and began using arrays with FMD and SSD media as external storage. We abandoned HP-UX and SPARC. Overall, we reduced the number of vendors and equipment types and moved off end-of-life platforms.



    Progress


    Following the upgrade and migration, we collected statistics for a number of critical systems. For example, in the centralized automated banking system (TsABS) “New Athena”, the time required to complete a ruble payment order decreased. Broken down by business-process stage:

    • Automatic processing from DBE - from 107 ms to 32 ms;
    • Reading - from 88 ms to 20 ms;
    • File cabinet processing - from 158 ms to 31 ms.

    In addition, the execution speed of most settlement system procedures doubled. The average processing time for one document decreased from 92 to 84 milliseconds.

    Finally, the Mbank system for servicing legal entities and individuals has also shown noticeable progress: the processing time for one branch transaction in Novosibirsk has decreased from 464 to 227 milliseconds, in Yekaterinburg - from 179 to 130 milliseconds.  
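
    To make the before/after figures in this section easier to compare, here is a small sketch that converts each pair into a speedup factor and a percentage reduction. The timings are the ones quoted above; the labels and function names are ours.

```python
# (before_ms, after_ms) pairs quoted in this section.
MEASUREMENTS_MS = {
    "Automatic processing from DBE": (107, 32),
    "Reading": (88, 20),
    "File cabinet processing": (158, 31),
    "Average processing time per document": (92, 84),
    "Mbank branch transaction, Novosibirsk": (464, 227),
    "Mbank branch transaction, Yekaterinburg": (179, 130),
}


def improvement(before_ms, after_ms):
    """Return (speedup factor, percentage reduction in time)."""
    speedup = before_ms / after_ms
    reduction_pct = (before_ms - after_ms) / before_ms * 100
    return speedup, reduction_pct


def report():
    """Print one line per measurement."""
    for name, (before_ms, after_ms) in MEASUREMENTS_MS.items():
        speedup, reduction_pct = improvement(before_ms, after_ms)
        print(f"{name}: {before_ms} -> {after_ms} ms, "
              f"{speedup:.2f}x faster, {reduction_pct:.0f}% less time")


if __name__ == "__main__":
    report()
```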

    Fault tolerance and conclusions


    When designing the cloud, we took disaster recovery (DR) requirements into account: the cloud is deployed at two geographically dispersed sites. Front-end load is geo-balanced using the F5 BIG-IP platform. At the storage level, disaster recovery is implemented using HDS global-active device (GAD) technology, with a cluster arbiter at a third site. Each LUN presented to a host is, if necessary, replicated to both sites and has independent paths from arrays located at different sites.
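
    The role of the arbiter at the third site can be illustrated with a deliberately simplified model. This is a generic quorum sketch, not the actual algorithm of the HDS arrays: the site names, the `LinkState` type and the rule that site A is checked first when both sites can reach the arbiter are all our assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkState:
    """Connectivity between the two storage sites and the arbiter (hypothetical model)."""
    a_b: bool  # replication link between site A and site B
    a_q: bool  # site A can reach the arbiter at the third site
    b_q: bool  # site B can reach the arbiter at the third site


def surviving_sites(a_up, b_up, links):
    """Return the set of sites allowed to keep serving I/O for a replicated LUN."""
    alive = {site for site, up in (("A", a_up), ("B", b_up)) if up}

    # Normal operation: both sites are up and can replicate to each other.
    if alive == {"A", "B"} and links.a_b:
        return {"A", "B"}

    # Site failure or partition: only one site may continue, and only if it
    # can reach the arbiter. Site A is checked first (an arbitrary tie-break).
    reachable = {"A": links.a_q, "B": links.b_q}
    for site in ("A", "B"):
        if site in alive and reachable[site]:
            return {site}

    # Nobody can prove it holds the quorum: stop I/O rather than risk split-brain.
    return set()


if __name__ == "__main__":
    # Site B is lost; site A still sees the arbiter and keeps serving.
    print(surviving_sites(True, False, LinkState(a_b=False, a_q=True, b_q=False)))
```

    The point of the third site is the last branch: without an arbiter, two sites that lose sight of each other cannot tell a dead peer from a broken link, so neither can safely continue alone.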

    The rollout of the cloud is now in full swing. As a result of the move, we expect to get an even more reliable system for serving the VTB group, a few extra gray hairs for the engineers responsible and very tangible financial savings. Already today, the cloud lets us increase the capacity of RISC-based computing systems, adds the ability to scale highly loaded and critical systems, and makes it easier to migrate existing services from standalone RISC platforms to a new one - flexible, scalable and fault-tolerant.
