divanikus July 12, 2018 at 15:21

Small tricks with Elasticsearch

A short note, mostly for myself, on small tricks for recovering data in Elasticsearch: how to fix a red index when there is no backup, and what to do if you have deleted documents and no copy is left. Unfortunately, the official documentation is silent about these techniques.

Backups

The first thing to do is set up backups of important data. How to do this is described in the official documentation.

All in all, nothing complicated. In the simplest setup, create a share on another server and mount it on all Elasticsearch nodes in any convenient way (nfs, smbfs, whatever). Then use cron, your application, or anything else to send requests for periodic snapshots.
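As a rough sketch of those periodic requests, the Python below builds the two calls involved: registering an fs repository on the mounted share and taking a snapshot. The cluster address, the mount point passed as location, and the helper names are assumptions for illustration; the mount point would also have to be listed under path.repo in elasticsearch.yml. The functions only build the requests; nothing is sent until one is passed to urllib.request.urlopen.

```python
import json
import urllib.request

ES = "http://localhost:9200"   # assumed cluster address
REPO = "yourbackuprepo"        # repository name, same as in the _cat example


def es_request(method, path, body=None):
    """Build (not send) a request carrying the Content-Type header ES 6.x requires."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        ES + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def register_repo(location):
    """PUT /_snapshot/<repo>: an "fs" repository on the share mounted on every node."""
    return es_request(
        "PUT",
        f"/_snapshot/{REPO}",
        {"type": "fs", "settings": {"location": location}},
    )


def take_snapshot(name):
    """PUT /_snapshot/<repo>/<name>: snapshot all indices and wait until it finishes."""
    return es_request("PUT", f"/_snapshot/{REPO}/{name}?wait_for_completion=true")
```

From cron, a small wrapper could call urllib.request.urlopen(take_snapshot("snap-20180712")) once a day and log the response.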

The first snapshot will take a long time; subsequent ones contain only the delta between index states. Keep in mind that if you periodically run a forcemerge, the delta will be huge and, accordingly, creating a snapshot will take about as long as the first time.

What to consider:

Check the status of your backups, for example via _cat: curl localhost:9200/_cat/snapshots/yourbackuprepo/. PARTIAL or FAILED snapshots are not your friends.
Starting with ES 6.x, Elasticsearch is very strict about request headers. If you craft requests by hand (rather than through a client API), check that you set Content-Type: application/json; otherwise all your requests will simply be rejected and no backup will be made.
A snapshot cannot be restored into an open index. The index must be closed or deleted first. However, you can restore the snapshot side by side using rename_pattern and rename_replacement (see the example in the docs). In addition, restoring a snapshot also restores its settings, including aliases, the number of replicas, and so on. If you do not want this, add index_settings with the necessary changes to the restore request (see the docs for an example).
A repository (share) with snapshots can be connected to more than one cluster, and snapshots taken on one cluster can be restored on any other. The main thing is that the Elasticsearch versions are compatible.
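Two of the points above can be sketched in Python: spotting snapshots that did not succeed, and building a restore body that puts the data next to the live index instead of on top of it. The function names, the _restored suffix and the replica count are illustrative assumptions; the body is meant for POST /_snapshot/<repo>/<snapshot>/_restore, and the rows are assumed to come from GET /_cat/snapshots/<repo>?format=json.

```python
import json


def bad_snapshots(cat_rows):
    """Return ids of snapshots whose status is anything other than SUCCESS.

    Note that a snapshot still IN_PROGRESS is reported here as well.
    """
    return [row["id"] for row in cat_rows if row.get("status") != "SUCCESS"]


def restore_body(indices, suffix="_restored", replicas=0):
    """Restore request body that renames indices and overrides some settings."""
    return {
        "indices": indices,
        # Capture the whole index name and append a suffix to it
        "rename_pattern": "(.+)",
        "rename_replacement": "$1" + suffix,
        # Do not bring the aliases back together with the index
        "include_aliases": False,
        # Override settings stored in the snapshot
        "index_settings": {"index.number_of_replicas": replicas},
    }
```

For example, json.dumps(restore_body("data.0")) would restore data.0 as data.0_restored with no replicas, leaving the open data.0 untouched.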

In general, read the documentation; this topic is covered there reasonably well.

Elasticdump

A small Node.js utility that lets you copy data from an index to another index, another cluster, a file, or stdout.

By the way, output to a file or stdout can serve as an alternative backup method: the output is ordinary valid JSON (something like an SQL dump) that you can reuse however you like. For example, you can pipe the output into a script of your own that transforms the data and sends it to another store, such as ClickHouse. The simplest JS transformations can be done by elasticdump itself via the --transform option. In short, the possibilities are limited only by your imagination.
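A minimal sketch of such a pipe script in Python, assuming elasticdump writes one JSON document per line with the original fields under _source (which is how its data dumps usually look). The field names timestamp and message are hypothetical; substitute those of your own index.

```python
import json
import sys


def to_rows(lines, fields):
    """Yield one flat dict per dumped document, keeping only the wanted fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        source = doc.get("_source", {})
        row = {"id": doc.get("_id")}
        row.update({name: source.get(name) for name in fields})
        yield row


def main():
    # Hypothetical field names; the output is again one JSON object per line.
    for row in to_rows(sys.stdin, ["timestamp", "message"]):
        sys.stdout.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Saved as transform.py with a final main() call, it could be used roughly as elasticdump --input=http://localhost:9200/data.0 --output=$ | python3 transform.py, with the result fed to whatever loader the target store provides.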

Pitfalls:

As a backup method, it is much slower than snapshots. The dump is also spread out over time, so the result for a frequently changing index may be inconsistent. Keep this in mind.
Do not use the nodejs package from the Debian repository: the version there is too old, which hurts the tool's stability.
Stability may vary, especially if one of the sides is overloaded. Do not try to copy from one server to another by running the tool on an office machine: all the traffic will flow through it.
It copies mappings poorly. If your mappings are at all complicated, create the index manually and only then load the data into it.
Sometimes it makes sense to change the chunk size (the --limit option). It directly affects copy speed.

To dump a large number of indices at once there is multielasticdump, which has a simplified set of options but dumps all indices in parallel.

Note: the author of the utility has said that he no longer has time to support it, so the project is looking for a new maintainer.

From personal experience: a useful utility that has rescued me more than once. Speed and stability are so-so; I would like a decent replacement, but so far nothing is on the horizon.

Checkindex

So, we are approaching the dark side. The situation: the index has gone red. The logs say something went wrong, a checksum does not match, and you probably have a memory or disk problem:

org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)

Of course, seasoned admins never let this happen, because they have top-end hardware with triple replication, super-ECC memory that corrects absolutely every kind of error on the fly, and snapshots configured to run every second.

Unfortunately, reality sometimes deals a different hand: the last backup was taken relatively long ago (if you index gigabytes per hour, is a 2-hour-old backup too old?), there is nowhere to restore the data from, replication did not catch up in time, and so on.

Of course, if there is a snapshot, a backup, or the like, great: roll it out and do not worry. And if not? Fortunately, at least part of the data can still be saved.

First of all, close the index and/or shut down Elasticsearch, and make a backup copy of the failed shard.
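A small Python sketch of this first step, under the assumption that the shard lives in a directory laid out as .../indices/<index uuid>/<shard number>. The helper names and the backup location are hypothetical; the close request is only built, not sent.

```python
import shutil
import urllib.request
from pathlib import Path


def close_index_request(index, es="http://localhost:9200"):
    """Build (not send) POST /<index>/_close."""
    return urllib.request.Request(f"{es}/{index}/_close", method="POST")


def backup_shard(shard_dir, backup_root):
    """Copy a shard directory to backup_root/<index uuid>/<shard number>.

    copytree refuses to overwrite, so an earlier backup is never clobbered.
    """
    src = Path(shard_dir)
    dst = Path(backup_root) / src.parent.name / src.name
    shutil.copytree(src, dst)  # raises FileExistsError if dst already exists
    return dst
```

Refusing to overwrite is deliberate: the first copy taken before any repair attempt is the one worth keeping.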

Lucene (which is the backend of Elasticsearch) has a wonderful tool called CheckIndex. We just need to run it on the broken shard. Lucene will check all of its segments and remove the damaged ones. Yes, data will be lost, but at least not all of it. How much survives is a matter of luck.

There are at least 2 ways.

Method 1: In place, on the server

A simple script like this will help.

#!/bin/bash
pushd /usr/share/elasticsearch/lib
  java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex "$@"
popd

Calling it without parameters, we get something like this:

ERROR: index path not specified
Usage: java org.apache.lucene.index.CheckIndex pathToIndex [-exorcise] [-crossCheckTermVectors] [-segment X] [-segment Y] [-dir-impl X]
  -exorcise: actually write a new segments_N file, removing any problematic segments
  -fast: just verify file checksums, omitting logical integrity checks
  -crossCheckTermVectors: verifies that term vectors match postings; THIS IS VERY SLOW!
  -codec X: when exorcising, codec to write the new segments_N file with
  -verbose: print additional details
  -segment X: only check the specified segments.  This can be specified multiple
              times, to check more than one segment, eg '-segment _2 -segment _a'.
              You can't use this with the -exorcise option
  -dir-impl X: use a specific FSDirectory implementation. If no package is specified the org.apache.lucene.store package will be used.
**WARNING**: -exorcise *LOSES DATA*. This should only be used on an emergency basis as it will cause
documents (perhaps many) to be permanently removed from the index.  Always make
a backup copy of your index before running this!  Do not run this tool on an index
that is actively being written to.  You have been warned!
Run without -exorcise, this tool will open the index, report version information
and report any exceptions it hits and what action it would take if -exorcise were
specified.  With -exorcise, this tool will remove any segments that have issues and
write a new segments_N file.  This means all documents contained in the affected
segments will be removed.
This tool exits with exit code 1 if the index cannot be opened or has any
corruption, else 0.

So we can either simply check the index, or have CheckIndex “fix” it by cutting out everything that is damaged.

The Lucene index lives at a path like /var/lib/elasticsearch/nodes/0/indices/str4ngEHashVa1uE/0/index/, where the first 0 is the node number on the server and the second 0 is the shard number. The scary value between them is the internal name (UUID) of the index; you can get it from the output of curl localhost:9200/_cat/indices.
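The path layout described above can be put together programmatically. This is a sketch that assumes the default data directory /var/lib/elasticsearch and rows obtained from GET /_cat/indices?format=json; the function names are made up for illustration.

```python
from pathlib import Path


def uuid_for(cat_rows, index_name):
    """Look up the internal UUID of an index in parsed _cat/indices rows."""
    for row in cat_rows:
        if row["index"] == index_name:
            return row["uuid"]
    raise KeyError(index_name)


def shard_index_dir(index_uuid, shard=0, node=0, data_path="/var/lib/elasticsearch"):
    """Directory holding the Lucene files of one shard."""
    return (
        Path(data_path) / "nodes" / str(node) / "indices"
        / index_uuid / str(shard) / "index"
    )
```

The resulting path is what would be passed as pathToIndex to CheckIndex.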

I usually make a copy in another directory and repair in place. Then I restart Elasticsearch. As a rule, everything is picked up, albeit with data loss. Sometimes the index still refuses to load because of the corrupted_* files in the shard's directory. Move them to a safe place for a while.
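Moving those marker files aside can be scripted. The sketch below assumes the markers sit in the shard's index directory and have names starting with corrupted_; the quarantine directory is whatever safe place you choose.

```python
import shutil
from pathlib import Path


def quarantine_markers(shard_index_dir, quarantine_dir):
    """Move corrupted_* marker files out of the way; return the moved names."""
    quarantine = Path(quarantine_dir)
    quarantine.mkdir(parents=True, exist_ok=True)
    moved = []
    for marker in sorted(Path(shard_index_dir).glob("corrupted_*")):
        shutil.move(str(marker), str(quarantine / marker.name))
        moved.append(marker.name)
    return moved
```

The files are moved rather than deleted so they can be put back if the repaired shard turns out to be worse than the original.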

Method 2: Luke

There is a wonderful utility called Luke for working with Lucene.

Here it is even simpler. Find out the Lucene version from your Elasticsearch:

$ curl localhost:9200
{
  "name" : "node00",
  "cluster_name" : "main",
  "cluster_uuid" : "UCbEivvLTcyWSQElOipgTQ",
  "version" : {
    "number" : "6.2.4",
    "build_hash" : "ccec39f",
    "build_date" : "2018-04-12T20:37:28.497551Z",
    "build_snapshot" : false,
    "lucene_version" : "7.2.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Take the same version of Luke. Open the index in it (a copy, of course) with the checkbox "Do not open IndexReader (when opening corrupted index)" ticked. Then click Tools / Check Index. I recommend a dry run first, and only then the repair mode. The remaining steps are the same: copy the index back to Elasticsearch, then restart it / open the index.
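If the version lookup needs to be scripted, the Lucene version can be read from the root response shown above. A tiny sketch; the function name is an assumption.

```python
import json


def lucene_version(root_response_text):
    """Extract version.lucene_version from the body of GET / on the cluster."""
    return json.loads(root_response_text)["version"]["lucene_version"]
```

For the response above this yields the string 7.2.1, which is the Luke release to look for.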

Recover Deleted Documents

Situation: you ran a destructive query that deleted a lot of, or all of, the data you need. And there is nowhere to restore it from, or doing so is very expensive. Of course, having no backups is your own fault, but it happens.

Fortunately or unfortunately, Lucene never deletes anything directly. Its philosophy is close to copy-on-write, so deleted data is not actually removed, only marked as deleted. The real deletion happens during index optimization (segment merging): live documents are copied from old segments into newly created ones, and the old segments are simply dropped. In general, as long as docs.deleted for the index is not 0, there is a chance to get the data back.

$ curl localhost:9200/_cat/indices?v
health status index                   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   data.0                  R0fgvfPnTUaoI2KKyQsgdg   5   1    7238685      1291566     45.1gb         22.6gb

After forcemerge there is no chance.
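The docs.deleted check can be automated as a quick triage step. The sketch assumes rows parsed from GET /_cat/indices?format=json, where the counters arrive as strings and may be missing for closed indices; the function name is made up.

```python
def recoverable_candidates(cat_rows):
    """Return (index, deleted_count) pairs for indices with docs.deleted above zero."""
    candidates = []
    for row in cat_rows:
        deleted = int(row.get("docs.deleted") or 0)  # null or absent counts as 0
        if deleted > 0:
            candidates.append((row["index"], deleted))
    return candidates
```

An index that does not show up in the result has nothing left to recover by the method below.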

So, first of all, close the index, stop Elasticsearch, and copy the index files to a safe place.

It is not possible to pull out an individual deleted document. You can only recover all deleted documents in the specified segment.

For Lucene versions below 4, everything is very simple. The Lucene API has a function called undeleteAll. You can call it directly from Luke, described in the previous section.

In newer versions, alas, this functionality was removed. But there is still a way. Information about live documents is stored in *.liv files. However, simply deleting them will make the index unreadable. You have to edit the segments_N file so that it completely forgets about their existence.

Open the segments_N file (N is an integer) in your favorite hex editor. The official documentation will help us navigate it:

segments_N: Header, LuceneVersion, Version, NameCounter, SegCount, MinSegmentLuceneVersion, <SegName, SegID, SegCodec, DelGen, DeletionCount, FieldInfosGen, DocValuesGen, UpdatesFiles>^SegCount, CommitUserData, Footer

Of all this, we need the values DelGen (Int64) and DeletionCount (Int32). The first must be set to -1 and the second to 0.

They are not hard to find: they come immediately after SegCodec, which is a very conspicuous string such as Lucene62. In this screenshot you can see that DelGen has the value 3 and DeletionCount is 184614. We replace the first with 0xFFFFFFFFFFFFFFFF and the second with 0x00000000. Repeat for all the segments you need, then save.
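The same edit can be sketched in Python instead of a hex editor. It relies on the layout described above: the 8-byte DelGen and 4-byte DeletionCount follow the codec name directly and are stored big-endian. It also assumes the codec name appears in the file only as SegCodec, once per segment, which is worth verifying on your own file. The function name is made up, and it patches every segment, not a selected few.

```python
import struct


def clear_deletions(data: bytes, codec: bytes = b"Lucene62") -> bytes:
    """Set DelGen to -1 and DeletionCount to 0 after every occurrence of the codec name."""
    patched = bytearray(data)
    reset = struct.pack(">qi", -1, 0)  # FF x 8 followed by 00 x 4
    pos = patched.find(codec)
    while pos != -1:
        start = pos + len(codec)
        if start + len(reset) > len(patched):
            raise ValueError("codec name found too close to the end of the file")
        patched[start:start + len(reset)] = reset
        pos = patched.find(codec, start + len(reset))
    return bytes(patched)
```

Run it on a copy of segments_N, for example by reading the file with Path.read_bytes() and writing the result to a new file. The output still carries the old footer checksum, so it will not load as is.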

However, the edited index will refuse to load, complaining about a checksum error. This is even easier to deal with. Take Luke, load the index with IndexReader disabled, and go to Tools / Check Index. Do a test run and you will immediately see that segments_N is reported as damaged: one checksum was expected, but another was found.

Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=51fbdb5c actual=6e964d17

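No big deal! Take the checksum that Lucene actually computed for the edited file (the actual= value; expected= is the stale one still stored in the footer) and write it into the last 4 bytes of the file.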
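Rather than copying a value out of the error message, the footer can also be recomputed. The sketch below assumes the usual Lucene footer layout: a 4-byte magic 0xC02893E8, a 4-byte algorithm id, then an 8-byte big-endian field whose low 4 bytes hold a CRC32 of everything before it. The function name is made up; check the result with CheckIndex before trusting it.

```python
import struct
import zlib

FOOTER_MAGIC = 0xC02893E8  # assumed: bitwise complement of Lucene's codec magic


def fix_footer_checksum(data: bytes) -> bytes:
    """Return a copy of data whose trailing 8-byte checksum matches its content."""
    if len(data) < 16:
        raise ValueError("too short to contain a footer")
    body = data[:-8]  # everything up to, but not including, the stored checksum
    (magic,) = struct.unpack(">I", body[-8:-4])
    if magic != FOOTER_MAGIC:
        raise ValueError("footer magic not found, refusing to patch")
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return body + struct.pack(">Q", crc)  # high 4 bytes are zero
```

The magic check is there so the function fails loudly on a file that does not look like a Lucene file instead of silently overwriting its last 8 bytes.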

Save. We run CheckIndex again to make sure everything is OK and the index is loading.

Et voilà!
