ElasticSearch - mapping and search without surprises

In this article we will look at how and why to use mapping: whether it is needed at all, and in which cases. I will give examples of how to set it up and share some useful tricks that can help you improve search on your site.

If you are interested in the modern search engine ElasticSearch, read on.


This topic was chosen by popular vote in the previous article. There is another vote at the end of this article, so please take part. If readers find it interesting, I will try to write as complete a series of articles on ES as I can.

Why do we need mapping?


Mapping is similar to a table definition in SQL databases. We explicitly specify the type of each field along with additional parameters such as the analyzer, the default value, whether the source is stored, and so on. More details below.

We can specify the mapping when creating an index, thereby defining it for all types in the index in a single request:
$ curl -XPOST 'http://localhost:9200/test' -d '{
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "type1" : {
            "_source" : { "enabled" : false },
            "properties" : {
                "field1" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'


We can also specify mapping directly for a certain type in the index:
$ curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '{
    "tweet" : {
        "properties" : {
            "message" : {"type" : "string", "store" : true }
        }
    }
}'


And we can specify mapping for several indices at once:
$ curl -XPUT 'http://localhost:9200/kimchy,elasticsearch/tweet/_mapping' -d '{ ... }'
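If you call ES from code rather than from the shell, the same request can be put together with the Python standard library. This is only a sketch: the helper name put_mapping_request is made up for illustration, and the mapping body is borrowed from the single-index example above, since the body in the multi-index call is elided. The request is built but not sent.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # same local node as in the curl examples


def put_mapping_request(indices, doc_type, properties):
    """Build (but do not send) a PUT _mapping request for one or more indices.

    Hypothetical helper, not part of any ES client library.
    """
    # Several indices are joined with commas, exactly as in the curl call.
    path = "/{}/{}/_mapping".format(",".join(indices), doc_type)
    body = {doc_type: {"properties": properties}}
    return urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = put_mapping_request(
    ["kimchy", "elasticsearch"],
    "tweet",
    {"message": {"type": "string", "store": True}},
)
print(req.get_method(), req.full_url)
# To actually send it to a running node: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to log or inspect the URL and body before anything touches the index.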


Is it really needed?


ES does not require data types to be defined explicitly. In most simple cases it detects the type of each field correctly.
So why define them at all?
First, an explicit mapping keeps things clean and gives you confidence about what is actually stored in the index.
Second, and more importantly, mapping lets you fine-tune how data is stored and processed: you can specify whether a field should be analyzed and whether the source should be stored. Let's look at most of the options with an example.
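To get a feel for where automatic detection can surprise you, here is a rough Python sketch of the guessing rules. It is a simplification, not how ES implements detection: the function name guess_es_type and the sample document are made up for illustration, and date detection inside strings is ignored.

```python
def guess_es_type(value):
    """Rough approximation of the type assigned to a new, unmapped field."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        for item in value:
            if item is not None:
                return guess_es_type(item)
        return None
    return None  # null: nothing to infer a type from


# Hypothetical document: note that "priority" arrives as a quoted number.
doc = {"user": "kimchy", "priority": "3", "rank": 4.5, "retweeted": False}
for field, value in doc.items():
    print(field, "->", guess_es_type(value))
# user -> string
# priority -> string
# rank -> double
# retweeted -> boolean
```

The "priority" field is the kind of surprise an explicit mapping prevents: a client that sends "3" instead of 3 gets a string field unless the mapping says otherwise.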

Basic data types


You have probably already guessed what this section is about. There are only 7 basic types: string, integer / long, float / double, boolean, null

Example:
$ curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '{
    "tweet" : {
        "_source" : {"enabled" : false},
        "properties" : {
            "user" : {"type" : "string", "index" : "not_analyzed"},
            "message" : {"type" : "string", "null_value" : "na", "store": true},
            "postDate" : {"type" : "date"},
            "priority" : {"type" : "integer"},
            "rank" : {"type" : "float", "index_name" : "rating"}
        }
    }
}'


Here we specified additional parameters:
  1. "_source" : {"enabled" : false} - tells ES not to store the source document for this type. When might this be needed? For example, when you have a very heavy document with a lot of information that only needs to be indexed, not returned in the response.
  2. "store": true for the message field - tells ES to store the original value of this field in the index.
  3. "index" : "not_analyzed" - tells ES not to run this field through an analyzer, i.e. to index it as is.
  4. "null_value" : "na" - the value to index in place of a null in this field.
  5. "index_name" : "rating" - sets an alias for the field. Now we can refer to it both as "rank" and as "rating".


Note: by default _source is enabled, so the entire document is stored in the index in its original form and returned with search results. Unless your document is huge, this is faster than storing individual fields in the index; for a huge document, storing only the fields you need can pay off. Therefore, I do not recommend touching this setting without good reason.

Types array / object / nested

For an array field we can specify not only the field itself, but also the type of each field inside its elements. Here is an example:
#source
{
    "tweet" : {
        "message" : "some arrays in this tweet...",
        "lists" : [
            {
                "name" : "prog_list",
                "description" : "programming list"
            },
            {
                "name" : "cool_list",
                "description" : "cool stuff list"
            }
        ]
    }
}
#mapping
{
    "tweet" : {
        "properties" : {
            "message" : {"type" : "string"},
            "lists" : {
                "properties" : {
                    "name" : {"type" : "string"},
                    "description" : {"type" : "string"}
                }
            }
        }
    }
}

Objects work the same way, except that an object can be dynamic (and is by default), i.e. you can add a new field to the object at any time and it will be accepted without errors.
Dynamic behaviour can be disabled with "dynamic" : false.
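The effect of the dynamic setting can be sketched in a few lines of Python. This is a toy model under my own assumptions, not ES code: index_fields is a made-up function, the "<guessed>" placeholder stands in for real type detection, and only the true and false values of dynamic are covered.

```python
def index_fields(doc, mapping):
    """Return the fields of `doc` that would end up searchable."""
    properties = mapping["properties"]
    dynamic = mapping.get("dynamic", True)  # dynamic is on by default
    indexed = {}
    for name, value in doc.items():
        if name in properties:
            indexed[name] = value
        elif dynamic:
            # Unknown field: extend the mapping on the fly, then index it.
            properties[name] = {"type": "<guessed>"}
            indexed[name] = value
        # dynamic is False: the unknown field is skipped without an error
    return indexed


doc = {"message": "hello", "lang": "en"}

open_mapping = {"properties": {"message": {"type": "string"}}}
closed_mapping = {"dynamic": False, "properties": {"message": {"type": "string"}}}

print(index_fields(doc, open_mapping))    # {'message': 'hello', 'lang': 'en'}
print(index_fields(doc, closed_mapping))  # {'message': 'hello'}
print(sorted(open_mapping["properties"]))  # ['lang', 'message']
```

In both cases the document is accepted; the difference is whether the extra field becomes part of the mapping and can be searched.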

Nested type

Essentially, we define a document inside a document. Why is this needed? Great example from the documentation:
{
    "obj1" : [
        {
            "name" : "blue",
            "count" : 4
        },
        {
            "name" : "green",
            "count" : 6
        }
    ]
}


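If we search for name = blue && count > 5, this document will be found, even though no single object in the array satisfies both conditions: the fields of the inner objects are flattened into per-field lists of values, so the link between a name and its count is lost. To avoid this, use the nested type.
Example: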
{
    "type1" : {
        "properties" : {
            "obj1" : {
                "type" : "nested",
                "properties": {
                    "name" : {"type": "string", "index": "not_analyzed"},
                    "count" : {"type": "integer"}
                }
            }
        }
    }
}


It is not necessary to specify properties for the object's elements; ES will do this automatically.
To search within a nested type, use the nested query or the nested filter.
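The difference between the two behaviours is easy to reproduce in plain Python. This is a toy model of the matching logic only, with made-up function names; it does not talk to ES. It uses the obj1 document from the documentation example above.

```python
doc = {
    "obj1": [
        {"name": "blue", "count": 4},
        {"name": "green", "count": 6},
    ]
}


def matches_as_object(doc):
    """Plain object array: inner fields are flattened into per-field lists."""
    names = [item["name"] for item in doc["obj1"]]    # ['blue', 'green']
    counts = [item["count"] for item in doc["obj1"]]  # [4, 6]
    # Each condition is checked against its own list, independently.
    return "blue" in names and any(c > 5 for c in counts)


def matches_as_nested(doc):
    """Nested type: both conditions must hold inside the same inner object."""
    return any(
        item["name"] == "blue" and item["count"] > 5 for item in doc["obj1"]
    )


print(matches_as_object(doc))  # True  (the unwanted match)
print(matches_as_nested(doc))  # False
```

The first function matches because "blue" comes from one element and 6 from another, which is exactly the scenario the nested type is meant to rule out.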

Multi-fields


Starting with version 1.0, this handy parameter is available for all basic types (but not for nested and object).
What does it do? It lets you specify several different mapping settings for a single field.
Why might this be necessary? Suppose you have a field that you want to both search and group by. If you turn the analyzer off, search will not work to its full potential; if you leave it on, grouping will use the processed data instead of the raw values. For example, the analyzer turns St. Petersburg into “St.” and “Petersburg” (the exact tokens may differ, but this will do as an illustration). If we group by this field, we will not get what we wanted.

Example:
"title": {
    "type": "string",
    "fields": {
        "raw":   { "type": "string", "index": "not_analyzed" }
    }
}

Now we can use “title” for search and “title.raw” for grouping and sorting.
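Here is a small Python illustration of why grouping on an analyzed field goes wrong. The analyze function is a deliberately crude stand-in for a real analyzer (lowercase, split on anything that is not a letter or digit), and the list of titles is made up.

```python
import re
from collections import Counter


def analyze(text):
    """Crude stand-in for an analyzer: lowercase and keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


titles = ["St. Petersburg", "St. Petersburg", "St. Louis"]

# Grouping on the analyzed field counts individual tokens...
by_analyzed = Counter(token for title in titles for token in analyze(title))
# ...while grouping on the not_analyzed sub-field counts whole values.
by_raw = Counter(titles)

print(dict(by_analyzed))  # {'st': 3, 'petersburg': 2, 'louis': 1}
print(dict(by_raw))       # {'St. Petersburg': 2, 'St. Louis': 1}
```

The token counts are fine for full-text search but useless as groups, which is why the same field is kept in both forms.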

Other types

ES supports 4 more data types:
  1. ip type - stores IP addresses as numbers
  2. geo point type - stores coordinates (useful when searching for the objects nearest to a given point)
  3. geo shape type - a rather specific type for storing polygons
  4. attachment type - stores files in the index as base64-encoded data. It is usually used together with its own analyzer. (Though, if you ask me, it is a dubious pleasure.)

I did not cover these types in detail because they are either quite specific or do not differ fundamentally from those described above (for example, ip).
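As a quick illustration of what "IP as a number" buys you, here is a Python sketch using the standard ipaddress module. It only shows the idea and assumes IPv4 addresses; the helper name ip_to_number is mine, and this says nothing about how ES encodes the value internally.

```python
import ipaddress


def ip_to_number(ip):
    """Convert a dotted IPv4 address to its integer value."""
    return int(ipaddress.IPv4Address(ip))


print(ip_to_number("192.168.0.1"))  # 3232235521

# Once addresses are numbers, "is this IP inside a range?" is a plain
# numeric range check, just like for an integer field.
low = ip_to_number("192.168.0.0")
high = ip_to_number("192.168.0.255")
print(low <= ip_to_number("192.168.0.42") <= high)  # True
print(low <= ip_to_number("10.0.0.1") <= high)      # False
```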

I hope I have managed to explain the main mapping features in ES clearly. If you have questions, I will be glad to answer.

Other ES articles:
ElasticSearch - Aggregate
ElasticSearch data and search the other way around. Percolate API - Achieving Goals



Theme of the next article

  • 56.5% ElasticSearch and river. We transfer data from SQL / NoSQL database to ES 69
  • 42.6% Warmer - warm up and speed up the ES before the battle 52
  • 0.8% Other (in the comments) 1
