OneArt June 23, 2014 at 12:21

ElasticSearch - data aggregation

Tutorial

In this article, we will look at how to implement data aggregation correctly and why you might need it, and walk through a number of working examples.

If you want to make your ES queries more interesting and look at ordinary search from a different angle, read on.

In the poll under the previous article, the votes were split evenly between a simpler topic and a more complex one, so I chose a topic that is not very difficult but fairly fresh: it was added to ES relatively recently (v1.0) and offers quite interesting functionality.

Aggregation module

This module replaced Facets in ES, and quite insistently so: Facets are now considered deprecated and will be removed in upcoming releases. Aggregations were added back in v1.0.0.RC1 and the current version is already above 1.2, so I do not recommend using Facets any more.
Why replace a tool that worked?
Probably the main feature of aggregations is that they can be nested. Here is the general syntax of a request:

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}

As the structure shows, a request can contain any number of aggregations, and each of them can contain nested aggregations with no limit on depth.
Using nesting, we can get very interesting statistics (an example is at the end of the article).
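
To make the nesting rule concrete, here is a minimal Python sketch that assembles such a request body as a plain dictionary. The helper `agg` and the aggregation name `by_sport` are hypothetical names chosen for illustration; only the resulting JSON shape follows the syntax above.

```python
import json

def agg(name, agg_type, body, sub=None):
    """Return {name: {agg_type: body[, "aggregations": sub]}}."""
    node = {agg_type: body}
    if sub:
        # Nested aggregations live next to the type key, under "aggregations".
        node["aggregations"] = sub
    return {name: node}

# A terms bucket per sport with an avg metric nested inside each bucket.
request_body = {
    "size": 0,
    "aggregations": agg(
        "by_sport", "terms", {"field": "sport"},
        sub=agg("rating_avg", "avg", {"field": "rating"}),
    ),
}
print(json.dumps(request_body, indent=2))
```

Because `sub` is itself the output of `agg`, the same call can be repeated to any depth.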

Aggregation types

There are many types of aggregations, but all of them fall into 2 main families:

- Bucketing
For ease of understanding, this can be compared with the familiar “GROUP BY”. Of course, this is a rather simplified comparison, but the principle is similar. Aggregations of this family work like filters: they group the documents that match some criterion into buckets; a good example is the terms aggregation.

- Metric
These are aggregations that calculate a value over a specific set of documents, for example the sum aggregation.
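
The difference between the two families can be sketched locally in plain Python, without ES. This is only an emulation of the idea, assuming four documents copied from the test dump used later in the article; `bucket_by` and `sum_metric` are hypothetical helper names.

```python
from collections import defaultdict

# Four documents taken from the test dump used later in the article.
docs = [
    {"name": "Michael", "sport": "Baseball", "rating": [5, 4]},
    {"name": "Bob", "sport": "Baseball", "rating": [3, 4]},
    {"name": "Bank", "sport": "Golf", "rating": [6, 4]},
    {"name": "James", "sport": "Basketball", "rating": [10, 8]},
]

def bucket_by(documents, field):
    """Bucketing: split documents into groups by a field value."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[doc[field]].append(doc)
    return dict(buckets)

def sum_metric(documents, field):
    """Metric: compute a single number over a set of documents."""
    return sum(value for doc in documents for value in doc[field])

buckets = bucket_by(docs, "sport")
doc_counts = {key: len(group) for key, group in buckets.items()}
rating_sums = {key: sum_metric(group, "rating") for key, group in buckets.items()}
print(doc_counts)
print(rating_sums)
```

A bucketing step yields sets of documents, while a metric step reduces one set to a number, which is why a metric is typically nested inside a bucket.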

I think that is enough theory to start with; anyone interested in more fundamental information about this module can find it at this link.

Simple example

For those who want to try everything hands-on, I suggest using this dump.

Structure and data for the test

The dump was shamelessly borrowed from this excellent article.

curl -XPUT "http://localhost:9200/sports/" -d'
{
   "mappings": {
      "athlete": {
         "properties": {
            "birthdate": {
               "type": "date",
               "format": "dateOptionalTime"
            },
            "location": {
               "type": "geo_point"
            },
            "name": {
               "type": "string"
            },
            "rating": {
               "type": "integer"
            },
            "sport": {
               "type": "string"
            }
         }
      }
   }
}'
curl -XPOST "http://localhost:9200/sports/_bulk" -d'
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Michael", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["5", "4"],  "location":"46.22,-68.45"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bob", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["3", "4"],  "location":"45.21,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Jim", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["3", "2"],  "location":"45.16,-63.58" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Joe", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"],  "location":"45.22,-68.53"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Tim", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["3", "3"],  "location":"46.22,-68.85"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Alfred", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["2", "2"],  "location":"45.12,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Jeff", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["2", "3"], "location":"46.12,-68.55"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Will", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["4", "4"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Mick", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["3", "4"],  "location":"46.22,-68.45"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Pong", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["1", "3"],  "location":"45.21,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Ray", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["2", "2"],  "location":"45.16,-63.58" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Ping", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"],  "location":"45.22,-68.53"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Duke", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["5", "2"],  "location":"46.22,-68.85"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Hal", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["4", "2"],  "location":"45.12,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Charge", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["3", "2"], "location":"46.12,-68.55"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Barry", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["5", "2"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bank", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["6", "4"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bingo", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["10", "7"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"James", "birthdate":"1988-3-1", "sport":"Basketball", "rating": ["10", "8"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Wayne", "birthdate":"1988-3-1", "sport":"Hockey", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Brady", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Lewis", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }'

Let's group the athletes by their sport and find out how many of them are in each sport:

curl -XPOST "http://localhost:9200/sports/athlete/_search?pretty" -d'
{
   "size": 0, 
   "aggregations": {
      "the_name": {
         "terms": {
            "field": "sport"
         }
      }
   }
}'

Here we use the “terms” aggregation, which groups the documents by the “sport” field.
The top-level "size": 0 tells ES to return no search hits, only the aggregation results. Do not confuse it with the "size" parameter inside the terms aggregation itself: there, 0 is automatically replaced by Integer.MAX_VALUE and means that we need all the buckets without exception. In our case speed is not important, but keep in mind that a more accurate result takes more time.

Answer:

{
  ...
  "aggregations" : {
    "the_name" : {
      "buckets" : [ {
        "key" : "baseball",
        "doc_count" : 16
      }, {
        "key" : "golf",
        "doc_count" : 2
      }, {
        "key" : "basketball",
        "doc_count" : 1
      }, {
        "key" : "football",
        "doc_count" : 1
      }, {
        "key" : "hockey",
        "doc_count" : 1
      } ]
    }
  }
}
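
The same request can also be prepared from a script, and the buckets read out of the response. Below is a minimal sketch using only the Python standard library; `build_search_request` and `bucket_counts` are hypothetical helper names, the Content-Type header is an addition that the curl call does not send, and the sample response is simply the one shown above.

```python
import json
import urllib.request

def build_search_request(body, base_url="http://localhost:9200"):
    """Build (but do not send) the POST request that the curl call makes."""
    return urllib.request.Request(
        url=base_url + "/sports/athlete/_search?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def bucket_counts(response, agg_name):
    """Map each bucket key to its doc_count for one named aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}

terms_query = {
    "size": 0,
    "aggregations": {"the_name": {"terms": {"field": "sport"}}},
}
request = build_search_request(terms_query)
print(request.get_method(), request.full_url)

# The response shown above, trimmed to the aggregation part.
sample_response = {"aggregations": {"the_name": {"buckets": [
    {"key": "baseball", "doc_count": 16},
    {"key": "golf", "doc_count": 2},
    {"key": "basketball", "doc_count": 1},
    {"key": "football", "doc_count": 1},
    {"key": "hockey", "doc_count": 1},
]}}}
print(bucket_counts(sample_response, "the_name"))
```

With a running node and the test data loaded, passing `request` to `urllib.request.urlopen` and decoding the body with `json.loads` should give a dictionary of the same shape as `sample_response`.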

Great, baseball players are the most numerous.
Now let's sort the athletes by their average rating, from highest to lowest:

curl -XPOST "http://localhost:9200/sports/athlete/_search?pretty" -d'
{
   "size": 0, 
   "aggregations": {
      "the_name": {
         "terms": {
            "field": "name",
            "order": {
               "rating_avg": "desc"
            }
         },
         "aggregations": {
            "rating_avg": {
               "avg": {
                  "field": "rating"
               }
            }
         }
      }
   }
}'

Here you can clearly see what a nested aggregation is and how it helps us select documents as flexibly as possible.
First, we say that athletes should be grouped by name, then sorted by “rating_avg”, which is calculated by the “avg” aggregation over the “rating” field. Notice how elegantly ES works with arrays ("rating" : [10, 9]) and easily calculates the average value.
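
What the request computes can be emulated locally. The sketch below is plain Python, not ES: it assumes four documents from the test dump, lowercases the names only to mirror the keys ES returns, and uses the hypothetical helper name `avg_by_name`.

```python
# Four documents from the test dump (only the fields the request touches).
docs = [
    {"name": "Bob", "rating": [3, 4]},
    {"name": "Bingo", "rating": [10, 7]},
    {"name": "Brady", "rating": [10, 10]},
    {"name": "James", "rating": [10, 8]},
]

def avg_by_name(documents):
    """One bucket per lowercased name, each with the mean of its rating values."""
    buckets = {}
    for doc in documents:
        buckets.setdefault(doc["name"].lower(), []).extend(doc["rating"])
    return {key: sum(values) / len(values) for key, values in buckets.items()}

averages = avg_by_name(docs)
# Sort buckets by the metric, highest first, like "order": {"rating_avg": "desc"}.
ordered = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for key, value in ordered:
    print(key, value)
```

Every value of the array contributes to the average, so [10, 7] gives 8.5 and [3, 4] gives 3.5.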

Answer:

{
  ...
  "aggregations" : {
    "the_name" : {
      "buckets" : [ {
        "key" : "brady",
        "doc_count" : 1,
        "rating_avg" : {
          "value" : 10.0
        }
      }, {
        "key" : "wayne",
        "doc_count" : 1,
        "rating_avg" : {
          "value" : 10.0
        }
      }, {
        "key" : "james",
        "doc_count" : 1,
        "rating_avg" : {
          "value" : 9.0
        }
      }, {
        "key" : "bingo",
        "doc_count" : 1,
        "rating_avg" : {
          "value" : 8.5
        }
      },
      ... {} ...
      {
        "key" : "duke",
        "doc_count" : 1,
        "rating_avg" : {
          "value" : 3.5
        }
      }, {
        "key" : "bob",
        "doc_count" : 1,
        "rating_avg" : {
          "value" : 3.5
        }
      } ]
    }
  }
}

Another great feature of aggregations is support for "script". For example:

curl -XPOST "http://localhost:9200/sports/athlete/_search?pretty" -d'
{
   "size": 0,
   "aggregations": {
      "age_ranges": {
         "range": {
            "script": "DateTime.now().year - doc[\"birthdate\"].date.year",
            "ranges": [
               {
                  "from": 22,
                  "to": 25
               }
            ]
         }
      }
   }
}'

Starting with version 1.2.0, dynamic script execution is disabled by default. You can enable it, provided that users do not have direct access to ES (I hope this is so; otherwise I advise you to close that access immediately for the security of your data).
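
For readers who cannot enable scripts, the logic of the script-based range above can be checked locally. This is a sketch under two assumptions: the current year is fixed to 2014 for reproducibility, and the range is treated as including "from" and excluding "to". Like the script, it subtracts years only, so it is not an exact age. The name `age_range_count` is hypothetical.

```python
def age_range_count(birth_years, current_year, age_from, age_to):
    """Count ages with age_from <= age < age_to, as a range bucket does."""
    ages = [current_year - year for year in birth_years]
    return sum(1 for age in ages if age_from <= age < age_to)

# Birth years that occur in the test dump; 2014 is fixed for reproducibility.
birth_years = [1988, 1989, 1990, 1992]
print(age_range_count(birth_years, 2014, 22, 25))
```

Here the years 1990 and 1992 give 24 and 22 and fall into the bucket, while 1989 gives exactly 25 and is excluded by the upper bound.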

Aggregation in all its glory or something more complicated

Let's find all the athletes within a 20-mile radius of the point "46.12,-68.55",
group them by sport, and display detailed statistics on the athletes' ratings in each sport.
Sounds good; here is the example.

curl -XPOST "http://localhost:9200/sports/athlete/_search?pretty" -d'
{
   "size": 0,
   "aggregations": {
      "baseball_player_ring": {
         "geo_distance": {
            "field": "location",
            "origin": "46.12,-68.55",
            "unit": "mi",
            "ranges": [
               {
                  "from": 0,
                  "to": 20
               }
            ]
         },
         "aggregations": {
            "sport": {
               "terms": {
                  "field": "sport"
               },
               "aggregations": {
                  "rating_stats": {
                     "stats": {
                        "field": "rating"
                     }
                  }
               }
            }
         }
      }
   }
}'

Answer:

{
  ...
  "aggregations" : {
    "baseball_player_ring" : {
      "buckets" : [ {
        "key" : "*-20.0",
        "from" : 0.0,
        "to" : 20.0,
        "doc_count" : 13,
        "sport" : {
          "buckets" : [ {
            "key" : "baseball",
            "doc_count" : 8,
            "rating_stats" : {
              "count" : 14,
              "min" : 2.0,
              "max" : 5.0,
              "avg" : 3.357142857142857,
              "sum" : 47.0
            }
          }, {
            "key" : "golf",
            "doc_count" : 2,
            "rating_stats" : {
              "count" : 4,
              "min" : 4.0,
              "max" : 10.0,
              "avg" : 6.75,
              "sum" : 27.0
            }
          }, {
            "key" : "basketball",
            "doc_count" : 1,
            "rating_stats" : {
              "count" : 2,
              "min" : 8.0,
              "max" : 10.0,
              "avg" : 9.0,
              "sum" : 18.0
            }
          }, {
            "key" : "football",
            "doc_count" : 1,
            "rating_stats" : {
              "count" : 1,
              "min" : 10.0,
              "max" : 10.0,
              "avg" : 10.0,
              "sum" : 10.0
            }
          }, {
            "key" : "hockey",
            "doc_count" : 1,
            "rating_stats" : {
              "count" : 1,
              "min" : 10.0,
              "max" : 10.0,
              "avg" : 10.0,
              "sum" : 10.0
            }
          } ]
        }
      } ]
    }
  }
}
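
The three nested steps (distance ring, bucket per sport, stats per bucket) can be followed in a local Python sketch. It is an emulation under stated assumptions: distances use the haversine formula with a mean Earth radius, which may differ slightly from what ES computes internally, and only four documents from the dump are used. The helper names `distance_miles` and `stats` are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

def distance_miles(a, b):
    """Great-circle (haversine) distance between two "lat,lon" strings."""
    lat1, lon1 = (math.radians(float(x)) for x in a.split(","))
    lat2, lon2 = (math.radians(float(x)) for x in b.split(","))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def stats(values):
    """The same five numbers the "stats" aggregation reports."""
    return {"count": len(values), "min": min(values), "max": max(values),
            "avg": sum(values) / len(values), "sum": sum(values)}

docs = [
    {"name": "Michael", "sport": "baseball", "rating": [5, 4], "location": "46.22,-68.45"},
    {"name": "Jim", "sport": "baseball", "rating": [3, 2], "location": "45.16,-63.58"},
    {"name": "Bank", "sport": "golf", "rating": [6, 4], "location": "46.25,-68.55"},
    {"name": "Bingo", "sport": "golf", "rating": [10, 7], "location": "46.25,-68.55"},
]

origin = "46.12,-68.55"
ratings_by_sport = {}
for doc in docs:
    # Step 1: keep only documents inside the 0-20 mile ring.
    if 0 <= distance_miles(origin, doc["location"]) < 20:
        # Step 2: one bucket per sport, collecting every rating value.
        ratings_by_sport.setdefault(doc["sport"], []).extend(doc["rating"])

# Step 3: a stats metric per bucket.
rating_stats = {sport: stats(values) for sport, values in ratings_by_sport.items()}
print(rating_stats)
```

Jim is far outside the ring and is dropped, and the golf figures (count 4, min 4, max 10, avg 6.75, sum 27) agree with the golf bucket in the response above.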

Conclusion

I hope I was able to convey the general capabilities of this wonderful module. If you are interested in the topic, I advise you to read the entire list of filters at this link.
I welcome any useful comments and additions on the topic.

You can also read my previous article on ES: "ElasticSearch and search the other way around. Percolate API".
And vote in the poll at the bottom of the article.


Theme of the next article

Mapping - why it is needed and how to use it correctly: 65.2% (62 votes)
Warmer - warm up and speed up ES before the battle: 33.6% (32 votes)
Other (in comments): 1% (1 vote)
