Ceph as pluggable storage: 5 practical insights from a large project

    With data volumes growing, software-defined and distributed storage comes up more and more often, and the open Ceph platform traditionally gets a lot of that attention. Today we want to share the conclusions we came to while implementing a data storage project for one large Russian government agency.



    When it comes to storing data of many different types, a distributed storage system naturally comes to mind. In theory, such solutions have plenty of advantages: you can use any disks, the system runs on any servers (even very old ones), and there are practically no limits to scaling. That is why, several years ago, the rollout of such a system began in one of the largest Russian government agencies, which has offices not only in every region of the Russian Federation but in every more or less large city.

    After analyzing the available solutions, the choice was made in favor of Ceph. There were a number of reasons for this decision:
    • Ceph is a fairly mature product, and there are production installations holding petabytes of data.
    • A large community (including us) is involved in its development, which means new features and improvements will keep appearing.
    • Ceph already has a good API with bindings for various programming languages (see the short sketch after this list). This was important, because the product clearly needed to be extended to meet the customer's requirements and expectations.
    • Licenses cost nothing. Of course, the system still needs development work, but for this customer's specific tasks additional development would have been required anyway, so why not do it on top of a free product?
    • Finally, sanctions. State organizations have to be prepared for someone deciding to impose restrictions on them again, so relying on a foreign, and especially an American, product is risky. Open source is another matter.
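    To illustrate the point about the API: below is a minimal sketch of working with the storage through the official python-rados bindings. The pool name and object key are invented for the example; real names depend on the installation.

        import rados

        # Connect to the cluster using the standard config and keyring locations
        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()
        try:
            ioctx = cluster.open_ioctx('documents')  # hypothetical pool name
            try:
                # Store a binary object and read it back
                ioctx.write_full('contract-2018-001.pdf', b'%PDF-1.4 ...')
                size, _mtime = ioctx.stat('contract-2018-001.pdf')
                data = ioctx.read('contract-2018-001.pdf', length=size)
            finally:
                ioctx.close()
        finally:
            cluster.shutdown()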

    Practical Conclusions
    The rollout of Ceph took place gradually over several months. First the storage was launched in the central region, then we replicated the solution by connecting regional data centers. With each new node the overall storage performance grew, even though internal traffic also grew as information was transferred from region to region.
    Any large organization has to store heterogeneous information, much of it in the form of binary files. In practice, employees simply have no time to figure out what these files are, categorize them and process them in a timely manner; the information accumulates faster than that. To avoid losing data that may be important for day-to-day operations, it has to be stored properly, for example in a distributed storage system.
    In the course of the project we came to several conclusions about using Ceph:

    Conclusion 1: Ceph completely replaces all backup solutions
    As practice shows, most unstructured information is not backed up at all, because doing so is extremely difficult. With Ceph, backup comes "as a bonus": during setup we simply define the replication parameters, i.e. the number of copies and where they are placed. If the customer has several data centers, the result is a disaster-tolerant configuration that simply does not need separate backups, as long as there are 3-4 copies of the data on different disks and servers. Such a system works better than any hardware solution, at least as long as we are talking about large volumes of data and geographically distributed systems.
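    The replication parameters mentioned above are ordinary pool settings. Here is a rough sketch of setting them from Python via mon_command; the pool name and values are only an example, and the same can be done with "ceph osd pool set" on the command line.

        import json
        import rados

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()

        def pool_set(pool, var, val):
            # Same effect as "ceph osd pool set <pool> <var> <val>"
            cmd = json.dumps({'prefix': 'osd pool set',
                              'pool': pool, 'var': var, 'val': str(val)})
            ret, _out, err = cluster.mon_command(cmd, b'')
            if ret != 0:
                raise RuntimeError(err)

        # Keep three copies of every object and accept writes
        # while at least two of them are available
        pool_set('documents', 'size', 3)
        pool_set('documents', 'min_size', 2)

        cluster.shutdown()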

    Conclusion 2: In large installations, Ceph performance is 99% equal to network performance
    When we transferred data from a PostgreSQL database (more on that below) to the Ceph-based storage, the upload speed in most cases matched the bandwidth of the data network. Where it did not, reconfiguring Ceph brought it up to that level. We are not talking about 100 Gbit/s links, of course, but on the channels typical for geographically distributed infrastructures it is quite possible to get Ceph throughput of 10 Mbit/s, 100 Mbit/s or 1 Gbit/s; it is enough to distribute the disks correctly and configure data placement.

    Conclusion 3: The main thing is to configure Ceph correctly, taking the organization's specifics into account
    Speaking of settings, most of the expertise in working with Ceph is needed at the configuration stage. Besides replication parameters, the solution also lets you set access levels, data retention rules and so on. For example, with mini data centers all over Russia we can provide fast access to documents and files created in one's own region, as well as access to all corporate documents from anywhere. The latter works with somewhat higher latency and lower speed, but such "concentration" of information at its place of ownership creates optimal conditions for the organization.
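    This "concentration" of information at its place of ownership is achieved with CRUSH rules: a regional pool is tied to the regional branch of the CRUSH tree. A hedged sketch, assuming a CRUSH bucket for the region already exists; all names here (region-msk, msk-local, msk-documents) are invented for the example.

        import json
        import rados

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()

        def mon(prefix, **kwargs):
            cmd = json.dumps(dict(prefix=prefix, **kwargs))
            ret, _out, err = cluster.mon_command(cmd, b'')
            if ret != 0:
                raise RuntimeError(err)

        # A rule that places replicas only on hosts under the "region-msk" bucket
        mon('osd crush rule create-replicated',
            name='msk-local', root='region-msk', type='host')

        # The regional documents pool follows that rule, so reads and writes
        # for locally created files stay inside the regional data center
        mon('osd pool set', pool='msk-documents', var='crush_rule', val='msk-local')

        cluster.shutdown()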

    Conclusion 4: Once it is configured, any Linux administrator can manage Ceph
    Perhaps one of the most pleasant features of Ceph is that, once configured, the system runs without needing much human involvement. In practice it turned out that in the remote mini data centers it is enough to keep an ordinary Linux administrator on staff, since supporting another Ceph segment requires no additional knowledge.

    Conclusion 5: Supplementing Ceph with an external indexing system makes the storage convenient for contextual search
    As you know, Ceph itself has no index that could be used for contextual search. So, when an object is written to the storage, we also save metadata that serves as an index. Its volume is quite small, so an ordinary relational DBMS copes with it easily. This is an extra system, of course, but the approach makes it possible to quickly find information by context within huge volumes of unstructured data.
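    A minimal sketch of this approach, assuming psycopg2 on the relational side: the object itself goes into Ceph, while a small record with searchable fields goes into PostgreSQL. The table, columns and connection parameters are hypothetical.

        import psycopg2
        import rados

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()
        ioctx = cluster.open_ioctx('documents')

        pg = psycopg2.connect('dbname=docindex')

        def store_document(key, payload, title, department):
            # The binary object lives in Ceph...
            ioctx.write_full(key, payload)
            # ...and only a small searchable record lives in the relational index
            with pg, pg.cursor() as cur:
                cur.execute(
                    "INSERT INTO doc_index (rados_key, title, department, uploaded_at) "
                    "VALUES (%s, %s, %s, now())",
                    (key, title, department),
                )

        store_document('act-2018-042.pdf', b'%PDF-1.4 ...', 'Acceptance act 42', 'Logistics')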



    A few words about data transfer
    A large project has many stages, but the most interesting one for us was probably the transfer of huge amounts of data from PostgreSQL to the new storage. Once Ceph was up, we had to migrate data from multiple databases without stopping services or business processes, while preserving the integrity of the information.
    To do this, we had to contribute to the Ceph open-source effort and create the pg_rbytea migration module, whose source code is available at https://github.com/val5244/pg_rbytea. The essence of the solution is to transfer data from the specified database to the Ceph storage while it keeps running. The module migrates data on the fly, without stopping the database, relying on the RADOS object storage abstraction that Ceph supports natively. We also gave a talk about this at PGConf at the beginning of 2018 ( https://pgconf.ru/2018/107082 ).
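    The general idea can be illustrated with a rough sketch. To be clear, this is not the pg_rbytea module itself, only the principle: stream bytea rows out of PostgreSQL and write each one as a RADOS object while the database keeps running. Table and column names are invented.

        import psycopg2
        import rados

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()
        ioctx = cluster.open_ioctx('documents')

        pg = psycopg2.connect('dbname=legacy')
        # Server-side (named) cursor, so the whole table is not pulled into memory
        cur = pg.cursor(name='migration')
        cur.itersize = 1000
        cur.execute('SELECT id, payload FROM attachments')

        for row_id, payload in cur:
            # Each bytea value becomes a separate RADOS object
            ioctx.write_full('attachment-%d' % row_id, bytes(payload))

        cur.close()
        ioctx.close()
        cluster.shutdown()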
    At the first stage, we moved into the storage the various binary data needed for the day-to-day work of the agency's departments: essentially all the files and objects that were hard to store anywhere else because of their huge total volume and fuzzy structure. Next, it is planned to move various media content into Ceph, including original document images kept before recognition and attachments from corporate e-mail.
    To make all of this work on top of the storage, we developed RESTful services that allow Ceph to be integrated into the customer's systems. Here again the convenient API played its part, making it possible to build a plug-in service for various information systems. As a result, Ceph has become the main storage, taking on ever larger volumes and more types of information within the organization.
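    What such a plug-in service might look like in its simplest form, as a sketch with Flask rather than the actual services developed in the project (routes, pool name and port are illustrative):

        from flask import Flask, Response, abort, request
        import rados

        app = Flask(__name__)
        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()
        ioctx = cluster.open_ioctx('documents')

        @app.route('/objects/<key>', methods=['PUT'])
        def put_object(key):
            # Store the request body as a RADOS object
            ioctx.write_full(key, request.get_data())
            return '', 204

        @app.route('/objects/<key>', methods=['GET'])
        def get_object(key):
            try:
                size, _mtime = ioctx.stat(key)
                data = ioctx.read(key, length=size)
            except rados.ObjectNotFound:
                abort(404)
            return Response(data, mimetype='application/octet-stream')

        if __name__ == '__main__':
            app.run(host='0.0.0.0', port=8080)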

    Conclusion
    There are various distributed storage systems on the market, both commercial solutions and other open-source products. Some rely on special optimizations, others on compression or erasure coding. In practice, however, we became convinced that Ceph is ideally suited to truly distributed environments and huge storage volumes: system performance is limited only by the speed of the communication channels, and you save a great deal on licenses priced per server or per volume of data (depending on which product you compare against). A well-tuned Ceph installation delivers good performance with minimal supervision from local administrators in the field, and that is a serious advantage in a geographically distributed deployment.


    Which Ceph benefits do you find most important?

    • Open Source: 45% (18 votes)
    • Active community: 55% (22 votes)
    • No royalties: 42.5% (17 votes)
    • Availability of ready-made use cases in various fields: 25% (10 votes)
    • None: 30% (12 votes)
