Lossless ElasticSearch data migration


    Academic data warehouse design recommends keeping everything in normalized form, with links between entities. Rolling changes forward under the rules of relational algebra then yields a reliable repository with transaction support. Atomicity, Consistency, Isolation, Durability - and that's it. In other words, the storage is built specifically for updating data safely. But it is not at all optimal for searching, especially when sweeping broadly across tables and fields. Searching needs indexes, lots of indexes. Volumes grow, writes slow down. SQL LIKE cannot be indexed, and JOIN with GROUP BY sends the query planner off to meditate.


    The growing load on a single machine forces it to scale, either vertically up to the ceiling or horizontally, by buying a few more nodes. Resiliency requirements spread the data across multiple nodes. And the requirement of immediate recovery after a failure, without denial of service, forces the cluster to be configured so that at any moment any machine can handle both writes and reads. That is, every node must either already be a master, or become one automatically and instantly.


    The problem of quick search was solved by installing a second storage system alongside, one optimized for indexing. Full-text search, faceted search, with stemming and blackjack. The second store takes records from the first one as input, analyzes them and builds an index. Thus the data storage cluster was supplemented with another cluster, for searching that data. With a similar master configuration, to match the overall SLA. Everything is fine, business is delighted, admins sleep at night... until the master-master cluster grows beyond three machines.


    Elastic


    The NoSQL movement has significantly expanded the scaling horizon for small and big data alike. NoSQL cluster nodes can distribute data among themselves so that the failure of one or more of them does not lead to denial of service for the whole cluster. The price for the high availability of distributed data is the impossibility of guaranteeing its complete consistency on write at every point in time. Instead, NoSQL speaks of eventual consistency: sooner or later all the data spreads across the cluster nodes, and it eventually becomes consistent.


    Thus, the relational model was supplemented with a non-relational one, giving rise to many database engines that solve the problems of the CAP triangle with varying degrees of success. Developers got fashionable tools into their hands to build their own perfect persistence layer - for every taste, budget and load profile.


    ElasticSearch is a representative of clustered NoSQL with a RESTful JSON API on the Lucene engine, open source, written in Java, which can not only build a search index but also store the original document. This trick helps to rethink the role of a separate database management system for storing the originals, or even abandon it entirely. End of the introduction.
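
    For instance - a minimal sketch, with an assumed index name and document - a document written to ElasticSearch is returned on read together with its original JSON under _source, alongside whatever was indexed:

    PUT products_v2/_doc/1
    {
      "title": "Cherry pie",
      "price": 9.99
    }

    GET products_v2/_doc/1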


    Mapping


    Mapping in ElasticSearch is something like a schema (a table structure, in SQL terms) that tells exactly how incoming documents (records, in SQL terms) should be indexed. Mapping can be static, dynamic, or absent. Static mapping does not allow itself to be changed. Dynamic mapping allows new fields to be added. If no mapping is specified, ElasticSearch builds one itself upon receiving the first document to be written: it analyzes the structure of the fields, makes some assumptions about the types of data in them, runs them through the default settings, and writes the mapping down. At first glance such hands-off behavior seems very convenient. In fact it is better suited for experiments than for production, where surprises are unwelcome.
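
    As a rough sketch, an explicitly defined, effectively static mapping might look like this. The index name test and the field text_entry are assumptions for illustration; speech_number appears in the cloning example later on. Setting "dynamic" to "strict" makes ElasticSearch reject documents with fields unknown to the mapping (syntax for ElasticSearch 7+; on older versions the properties block is wrapped in a type name):

    PUT test
    {
      "mappings": {
        "dynamic": "strict",
        "properties": {
          "speech_number": { "type": "long" },
          "text_entry":    { "type": "text" }
        }
      }
    }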


    So, the data is indexed, and this is a one-way process. Once created, a mapping cannot be changed on the fly the way ALTER TABLE does it in SQL. An SQL table stores the original document, onto which a search index can be bolted. In ElasticSearch it is the other way around: ElasticSearch is itself a search index, onto which the original document can be fastened. That is why the index schema is static. In theory, one could either create a field in the mapping or delete one. In practice, ElasticSearch only allows fields to be added. Attempting to delete a field leads nowhere.
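
    Adding a field to a live index looks roughly like this (a sketch; the field name play_name is hypothetical, and on pre-7 clusters the document type is appended to the path):

    PUT test/_mapping
    {
      "properties": {
        "play_name": { "type": "keyword" }
      }
    }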


    Alias


    An alias is an optional extra name for an ElasticSearch index. There can be several aliases for a single index, or one alias for several indexes - then the indexes are logically combined and look like one from the outside. An alias is very convenient for services that communicate with an index throughout its life. For example, the alias products can hide either products_v2 or products_v25 behind it, with no need to change the name in the service. An alias is indispensable for data migration, when the data has already been transferred from the old schema to the new one and the application needs to be switched to the new index. Switching an alias from one index to another is an atomic operation, that is, it is performed in a single step, without loss.
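
    Such a switch goes through the _aliases endpoint, which applies all of its actions atomically; a sketch with the products example from above:

    POST _aliases
    {
      "actions": [
        { "remove": { "index": "products_v2",  "alias": "products" } },
        { "add":    { "index": "products_v25", "alias": "products" } }
      ]
    }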


    Reindex API


    The data schema - the mapping - tends to change from time to time. New fields are added, unneeded fields are deleted. If ElasticSearch plays the role of the single repository, a tool is needed to change the mapping on the fly. For this there is a special command that transfers data from one index to another, the so-called _reindex API. It works with a prepared or empty mapping of the recipient index, entirely on the server side, indexing quickly in batches of 1000 documents at a time.


    Reindexing can do simple field type conversions. For example, long to text and back to long, or boolean to text and back to boolean. But -9.99 to boolean it cannot do - this is not PHP. Then again, type conversion is an insecure thing in any case. A service written in a dynamically typed language might, perhaps, forgive such a sin. But if reindex cannot convert a type, the document is simply not written. In general, a data migration should happen in 3 stages: add the new field, release a service that uses it, clean up the old one.


    A field is added like this: the schema of the source index is taken, the new property is written into it, and an empty index is created from the result. Then reindexing is started:


    POST _reindex
    {
      "source": {
        "index": "test"
      },
      "dest": {
        "index": "test_clone"
      }
    }

    A field is removed in a similar way: the schema of the source index is taken, the field is removed from it, and an empty index is created. Then reindexing is started with the list of fields to copy:


    POST _reindex
    {
      "source": {
        "index": "test",
        "_source": ["field1", "field3"]
      },
      "dest": {
        "index": "test_clone"
      }
    }

    For convenience, both cases are combined into the cloning function of Kaizen, a desktop client for ElasticSearch. Cloning can adapt to the mapping of the recipient index. The example below shows how a partial clone is made from an index with three collections (types, in ElasticSearch terms): act, line, scene. Only line remains, with two fields, static mapping is switched on, and the speech_number field turns from text into long.
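
    Under the hood, such a partial clone boils down to an ordinary reindex request, roughly like the sketch below. The second field name text_entry is an assumption; the type filter in source applies to pre-7 ElasticSearch, where types still exist, and test_clone is created beforehand with the static mapping in which speech_number is already long:

    POST _reindex
    {
      "source": {
        "index": "test",
        "type": "line",
        "_source": ["speech_number", "text_entry"]
      },
      "dest": {
        "index": "test_clone"
      }
    }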



    Migration


    The reindex API has one unpleasant trait: it cannot track changes in the source index. If something changes there after reindexing has started, the changes are not reflected in the recipient index. To solve this problem, the ElasticSearch FollowUp Plugin was developed, which adds logging commands. The plugin can follow an index, returning, in JSON format, the actions performed on documents in chronological order. It remembers the index, type, document ID and the operation on it - INDEX or DELETE. The FollowUp Plugin is published on GitHub and compiled for almost all versions of ElasticSearch.


    So, for a lossless data migration you need FollowUp installed on the node where reindexing will be launched. It is assumed that the index already has an alias and all applications work through it. Immediately before reindexing, the plugin is switched on. When reindexing completes, the plugin is switched off and the alias is moved to the new index. Then the recorded actions are replayed on the recipient index, catching its state up. Despite the high speed of reindexing, two kinds of collisions may occur during replay:


    • the new index no longer contains a document with this _id : the document was deleted after the alias was switched to the new index.
    • the new index contains a document with the same _id , but with a higher version number than in the source index: the document was updated after the alias was switched to the new index.

    In these cases the action must not be replayed on the recipient index. The remaining changes are replayed.
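
    Both checks map naturally onto the ElasticSearch API itself. A sketch, with a hypothetical document ID, version and fields (on pre-7 clusters, _doc is replaced with the document type):

    # Collision 1: skip the replay if the document no longer exists.
    HEAD test_clone/_doc/1

    # Collision 2: replay an INDEX action with external versioning -
    # ElasticSearch rejects the write with a version conflict unless
    # the supplied version is higher than the one already stored.
    PUT test_clone/_doc/1?version=5&version_type=external
    {
      "speech_number": 42,
      "text_entry": "To be, or not to be"
    }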


    Happy coding!

