ElasticSearch and reverse search. Percolate API


    The question of how to categorize things intelligently comes up sharply when developing many kinds of sites. Of course, you can always hand this over to a person, and at first the result will be much better than a machine's, but what if you need to categorize hundreds or thousands of "goods" in real time?
    You have to hand it over to the machine. There are not many options here, and writing your own AI is, for 99.9% of tasks, a waste of time.

    If you are interested in how to solve this with ElasticSearch, read on.


    If you are not familiar with ElasticSearch, I recommend the excellent article "ElasticSearch Fast Full-Text Search" by brujeo.

    General idea


    At SmartProgress we implemented categorization into groups that gather user goals by common interest. But how do we match these groups (there are already more than 100 of them) to a user's goal so that at most 3 groups are suggested, and those are the most relevant ones?

    The simplest option would be to use tags, for example, to bind a goal to a particular group, but in practice this does not work as well as we would like; besides, forcing users to fill in tags is only justified in the IT sector.

    Suppose we have the category "Programming in Ruby on Rails"; then a search query following the simple query string rules will look something like this:
    Ruby | RoR | "Ruby on Rails" | "программирование Ruby"~4 | "вставать на рельсы" -php -java -net
    A few words about the syntax: | means "or", "..." matches the whole phrase, ~N allows the phrase to be diluted with up to N extra words (slop), and a leading - excludes a term.

    If you need to find all the "products" (in our case, goals) that match this query, a regular search is enough. But what if you need to find all the categories that match a particular product? This is where the Percolate API comes to the rescue.
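
    For contrast, here is what the "forward" direction looks like. This is only a minimal sketch: the "goals" index and "goal" type are hypothetical names chosen for illustration, assuming documents with "name" and "description" fields:

    curl -XGET 'http://localhost:9200/goals/goal/_search?pretty' -d '
    {
       "query" : {
          "simple_query_string" : {
             "query" : "Ruby | RoR | \"Ruby on Rails\" | \"программирование Ruby\"~4 | \"вставать на рельсы\" -php -java -net",
             "fields" : ["name^5", "description"],
             "default_operator" : "and"
          }
       }
    }'

    This returns all goals matching one category; the Percolate API answers the opposite question for a single goal.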

    Percolate API


    I admit honestly, my acquaintance with ElasticSearch began with this feature; before that I had only worked with Sphinx, which cannot do this kind of reverse search.
    So after reading the documentation I did not really understand what it was or how to work with it, and there was very little information on Google, especially for versions > 1.X. But persistence won out (there is life beyond page 5 of Google).

    I'll try to explain with my fingers how this works:
    1. We create an index or take an existing one.
    2. We add one or more documents to it with the special type .percolator, any unique id, and a body containing our query (example below).
    3. Then we send a request to _percolate and see which categories the "product" fits into.


    Working example


    Let's try it in action:

    - Create the "test" index (we will not need a mapping for it):
    curl -XPUT 'http://localhost:9200/test'


    - Create .percolator
    
    curl -XPUT 'http://localhost:9200/test/.percolator/simple-search' -d '
    {
       "query" : {
             "simple_query_string" : {
                "query" : "Ruby | RoR | \"Ruby on Rails\" | \"программирование Ruby\"~4 | \"вставать на рельсы\" -php -java -net",
                "analyzer" : "simple",
                "fields" : ["name^5", "description"],
                "default_operator" : "and"
             }
       },
       "language" : "ru",
    }'
    


    Details:
    test - the index
    .percolator - the type
    simple-search - the ID (can be either an int or a string)
    "query" - the search query itself
    simple_query_string - the query type used for the search (see the full list of query types in the ElasticSearch documentation)
    "fields" : ["name^5", "description"] - here we specified which fields to search in and gave the "name" field a boost of 5, because it usually carries the most important information.
    "language" : "ru" - an additional parameter; it is optional, can be of any type, and is used for filtering the result.

    In essence, a .percolator is an object like any other in the index, so a mapping can be applied to it as well.
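
    For example, if you want the "language" field used for filtering to be stored as an exact, non-analyzed value, you could define a mapping for the .percolator type. A minimal sketch, assuming ElasticSearch 1.x and the "test" index created above:

    curl -XPUT 'http://localhost:9200/test/.percolator/_mapping' -d '
    {
       ".percolator" : {
          "properties" : {
             "language" : { "type" : "string", "index" : "not_analyzed" }
          }
       }
    }'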

    - We are looking for:
    
    curl -XPOST 'http://localhost:9200/test/category/_percolate?pretty' -d '
    {
       "doc" : {
           "name" : "Изучить Ruby on Rails максимально быстро",
          "description" : "Я хочу программировать на Ruby"
       },
       "filter" : {
          "term" : {
             "language" : "ru"
          }
       }
    }'
    


    Answer:
    
    {
      "took" : 5,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "total" : 1,
      "matches" : [ {
        "_index" : "test",
        "_id" : "simply-search"
      } ]
    }
    

    There may be several matches.

    This method works well if for some reason you do not want to store all your data in ElasticSearch, or if you just want to test your percolator.
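
    To see several matches, you can register a second percolator. A sketch with a hypothetical id "programming-search" and deliberately broad terms, so that the same document should match both queries:

    curl -XPUT 'http://localhost:9200/test/.percolator/programming-search' -d '
    {
       "query" : {
          "simple_query_string" : {
             "query" : "программировать | программирование | coding",
             "fields" : ["name^5", "description"]
          }
       },
       "language" : "ru"
    }'

    Re-running the _percolate request above should now return two entries in "matches": "simple-search" and "programming-search".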

    Searching existing data (ElasticSearch >= 1.0)

    Let's add one document to the "test" index:
    
    curl -XPUT 'http://localhost:9200/test/category/1' -d '
    {
          "name" : "Изучить Ruby on Rails максимально быстро",
          "description" : "Я хочу программировать на Ruby"
    }'
    


    And look at what categories this entry fits into:
    
    curl -XGET 'http://localhost:9200/test/category/1/_percolate?pretty'
    


    Answer:
    
    {
      "took" : 4,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "total" : 1,
      "matches" : [ {
        "_index" : "test",
        "_id" : "simply-search"
      } ]
    }
    


    So there we have it: a "reverse search". Where can it be used? There are plenty of applications, from tricky content selections to event reminders; everything is limited only by your imagination and RAM.

    A fly in the ointment


    Unfortunately, not everything is as wonderful as it looks from the outside. Yes, it works, but there are downsides:
    • All .percolator queries are kept in RAM
    • Each percolated document is indexed in an in-memory index
    • Query time grows linearly with the number of registered .percolator queries

    Yes, percolators are replicated like any other ElasticSearch object, but you must still use this mechanism with caution.

    A couple of simple tips to avoid running out of memory:
    1. If your selection can be expressed with a simpler query, for example a match/bool query, use that (see the sketch after this list); the query_string syntax is rather slow compared to plain value comparisons.
    2. Use filters and narrow the search down as much as your application logic allows; this will save you some memory.
    3. Do not create too many .percolator queries; if you expect thousands of them, you should either revise your logic or stock up on RAM.
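
    A sketch of tip 1: the same "Ruby on Rails" category expressed as a bool/match query instead of simple_query_string (the id "simple-match-search" and the exact terms are illustrative, not a drop-in replacement for the query above):

    curl -XPUT 'http://localhost:9200/test/.percolator/simple-match-search' -d '
    {
       "query" : {
          "bool" : {
             "should" : [
                { "match" : { "name" : "Ruby Rails RoR" } },
                { "match" : { "description" : "Ruby Rails RoR" } }
             ],
             "must_not" : [
                { "match" : { "name" : "php java" } }
             ],
             "minimum_should_match" : 1
          }
       },
       "language" : "ru"
    }'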


    Helpful information



    P.S. See also my other ES article: "ElasticSearch - data aggregation".

    I am not an ES guru, so I welcome any comments and additions.
