Comparison of products using Elasticsearch for a competitor price monitoring service

Back in 2017, the idea arose to develop a service for monitoring competitors' prices. What was meant to set it apart from similar services was daily automatic product matching. Apparently because there was almost no information on how to do this, existing price monitoring services offered only manual matching, done either by the customers themselves or by service operators at a cost of 0.2 to 1 ruble per match. A realistic case of, say, 10 sites with 20,000 products each inevitably calls for automation, since manual matching is too slow and expensive.

Below, an approach to automatic matching with Elasticsearch is described, using several competing pharmacies as the example.

Environment description


  1. OS: Windows 10
  2. Database: Elasticsearch 6.2
  3. Client for requests: Postman 6.2

Elasticsearch setup


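The product mapping and the analyzer settings are configured in a single request. The # annotations in the request are explanatory only; remove them before sending, because JSON does not allow comments.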

PUT http://localhost:9200/app 
{
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "name_analyzer" # the analyzer defined in the settings below, used for the product name
        },
        "manufacturer": {
          "type": "text"
        },
        "city_id": {
          "type": "integer"
        },
        "company_id": {
          "type": "integer"
        },
        "category_id": {
          "type": "integer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "name_analyzer": {
            "type": "custom",
            "tokenizer": "standard", # described in detail in the documentation; overall it fits this task
            "char_filter": [
              "html_strip", # strip HTML tags that accidentally ended up in product names
              "comma_to_dot_char_filter" # replace commas with dots so that decimal numbers are parsed
            ],
            "filter": [
              "word_delimeter_filter", # custom term delimiters
              "synonym_filter", # synonym groups
              "lowercase" # convert everything to lowercase
            ]
          }
        },
        "filter": {
          "synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [
              "тюб, тюбик",
              "кап, капельница",
              "капс, капсула",
              "амп, ампула, ампулы",
              "офтальмол, офтальмологический",
              "таб, тбл, табл, таблетки",
              "увл, увлажняющий",
              "наз, назальный",
              "доз, дозированный, дозировка",
              "жев, жеват, жевательные",
              "раств, раствор, растворимые, р-ра, р-р",
              "ин, инъекций, инъекция",
              "покр, покрытый, покрытая, покрытые",
              "инд, индивидуальная",
              "конт, контурная",
              "уп, упак, упаковка",
              "расс, рассас, рассасывания",
              "подъязыч, подъязычные",
              "шип, шипучие",
              "пор, порошек",
              "приг, приготовления",
              "шт, штук, ном, номер",
              "тр, трава",
              "г, g",
              "ml, мл"
            ]
          },
          "word_delimeter_filter": {
            "type": "word_delimiter",
            "type_table": [
              ". => DIGIT", # so that decimal numbers end up in the terms
              "- => ALPHANUM",
              "; => SUBWORD_DELIM",
              "` => SUBWORD_DELIM"
            ]
          }
        },
        "char_filter": {
          "comma_to_dot_char_filter": {
            "type": "mapping",
            "mappings": [
              ", => ."
            ]
          }
        }
      }
    }
  }
}

For example, let's see which tokens the "name_analyzer" analyzer splits the drug name "Гиоксизон 10мг+30мг/г мазь для наружного применения туба 10г" (Hyoxysone 10 mg + 30 mg/g ointment for external use, 10 g tube) into. For this we use the _analyze API.

POST http://localhost:9200/app/_analyze
{
  "analyzer" : "name_analyzer",
  "text" : "Гиоксизон 10мг+30мг/г мазь для наружного применения туба 10г"
}

Result:
{
    "tokens": [
        {
            "token": "гиоксизон",
            "start_offset": 0,
            "end_offset": 9,
            "type": "<ALPHANUM>", 
            "position": 0
        },
        {
            "token": "10",
            "start_offset": 10,
            "end_offset": 12,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "мг",
            "start_offset": 12,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "30",
            "start_offset": 15,
            "end_offset": 17,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "мг",
            "start_offset": 17,
            "end_offset": 19,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "g",
            "start_offset": 20,
            "end_offset": 21,
            "type": "SYNONYM", # "g" was tagged as SYNONYM, so it will match any member of its synonym group "г, g"
            "position": 5
        },
        {
            "token": "г",
            "start_offset": 20,
            "end_offset": 21,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "мазь",
            "start_offset": 22,
            "end_offset": 26,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "для",
            "start_offset": 27,
            "end_offset": 30,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "наружного",
            "start_offset": 31,
            "end_offset": 40,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "применения",
            "start_offset": 41,
            "end_offset": 51,
            "type": "<ALPHANUM>",
            "position": 9
        },
        {
            "token": "туба",
            "start_offset": 52,
            "end_offset": 56,
            "type": "<ALPHANUM>",
            "position": 10
        },
        {
            "token": "10",
            "start_offset": 57,
            "end_offset": 59,
            "type": "<ALPHANUM>",
            "position": 11
        },
        {
            "token": "g",
            "start_offset": 59,
            "end_offset": 60,
            "type": "SYNONYM",
            "position": 12
        },
        {
            "token": "г",
            "start_offset": 59,
            "end_offset": 60,
            "type": "<ALPHANUM>",
            "position": 12
        }
    ]
}
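
Tokens that share the same "position" in the response are alternatives for one another, which is how the synonym group "г, g" shows up above. As a minimal sketch (the helper name group_by_position and the shortened sample response are assumptions made for illustration), the response can be grouped by position to make this easier to read:

```python
from collections import defaultdict


def group_by_position(analyze_response):
    """Group tokens of an _analyze response by their position.

    Tokens emitted at the same position (an original term and its
    synonyms) end up in the same inner list.
    """
    groups = defaultdict(list)
    for tok in analyze_response["tokens"]:
        groups[tok["position"]].append(tok["token"])
    return [groups[pos] for pos in sorted(groups)]


# Shortened sample: the last three tokens of the response shown above.
sample = {
    "tokens": [
        {"token": "10", "type": "<ALPHANUM>", "position": 11},
        {"token": "g", "type": "SYNONYM", "position": 12},
        {"token": "г", "type": "<ALPHANUM>", "position": 12},
    ]
}

print(group_by_position(sample))  # [['10'], ['g', 'г']]
```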

Filling with test data


_bulk request (the bulk API expects newline-delimited JSON: each action and each document must fit on a single line, and the body must end with a newline)

POST http://localhost:9200/_bulk
{"index": {"_index": "app", "_type": "product", "_id": 195111}}
{"name": "Гиоксизон 10мг+30мг/г мазь для наружного применения туба 10г", "manufacturer": "Муромский приборостроительный завод АО", "city_id": 1, "company_id": 2, "category_id": 1}
{"index": {"_index": "app", "_type": "product", "_id": 195222}}
{"name": "ГИОКСИЗОН мазь для наружнего применения 10 мг+30 мг/г: 10 г", "manufacturer": "МПЗ", "city_id": 1, "company_id": 3, "category_id": 1}
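
Since the bulk API expects newline-delimited JSON, it may be easier to generate the body than to write it by hand. The following is only a sketch: the function name build_bulk_body and the (id, document) input shape are assumptions, while the index and type names are taken from the requests above.

```python
import json


def build_bulk_body(products, index="app", doc_type="product"):
    """Build a newline-delimited _bulk body from (id, document) pairs.

    Each product produces two lines: the action line and the document.
    The body ends with a newline, as the bulk API requires.
    """
    lines = []
    for product_id, doc in products:
        action = {"index": {"_index": index, "_type": doc_type, "_id": product_id}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"


products = [
    (195111, {"name": "Гиоксизон 10мг+30мг/г мазь для наружного применения туба 10г",
              "manufacturer": "Муромский приборостроительный завод АО",
              "city_id": 1, "company_id": 2, "category_id": 1}),
    (195222, {"name": "ГИОКСИЗОН мазь для наружнего применения 10 мг+30 мг/г: 10 г",
              "manufacturer": "МПЗ",
              "city_id": 1, "company_id": 3, "category_id": 1}),
]

body = build_bulk_body(products)
print(body)
```

The resulting string can be sent as the body of POST http://localhost:9200/_bulk with the Content-Type header set to application/x-ndjson.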

Searching for matches


Suppose our client's product, for which we want to find all matching competitor products, has the following characteristics:

{
  "name": "Гиоксизон мазь для наружного применения 10 мг+30 мг/г туба алюминиевая 10 г",
  "manufacturer": "Муромский приборостроительный завод АО",
  "city_id": 1,
  "company_id": 1,
  "category_id": 1
}

Using a drug reference list, we extract the drug name from the product name. In this case it is the word "Гиоксизон". This word is a mandatory criterion.

We also extract all the numbers from the name, "10 30 10"; they are a mandatory criterion as well. Moreover, if a number occurs twice in the name, it must also occur twice in the matched product; otherwise the chance of matching the wrong product increases.
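
The splitting step described above could be sketched roughly as follows. This is an illustration under assumptions, not the service's actual code: the function names and the known_drugs list (standing in for the drug reference list) are hypothetical.

```python
import re
from collections import Counter

# Integers and decimals; a decimal comma is accepted as well as a dot.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def split_product_name(name, known_drugs):
    """Split a product name into (drug name, numbers, remaining text)."""
    drug = None
    rest = name
    for candidate in known_drugs:
        match = re.search(re.escape(candidate), name, flags=re.IGNORECASE)
        if match:
            drug = match.group(0)
            rest = name[:match.start()] + " " + name[match.end():]
            break
    numbers = [n.replace(",", ".") for n in NUMBER_RE.findall(rest)]
    # Drop the numbers and collapse the whitespace that is left behind.
    rest = " ".join(NUMBER_RE.sub(" ", rest).split())
    return drug, numbers, rest


def same_numbers(numbers_a, numbers_b):
    """True if both names contain the same numbers the same number of times."""
    return Counter(numbers_a) == Counter(numbers_b)


name = "Гиоксизон мазь для наружного применения 10 мг+30 мг/г туба алюминиевая 10 г"
drug, numbers, rest = split_product_name(name, ["Гиоксизон"])
print(drug)     # Гиоксизон
print(numbers)  # ['10', '30', '10']
print(rest)     # мазь для наружного применения мг+ мг/г туба алюминиевая г
```

The three parts correspond to the three name clauses of the search request: the drug name and the numbers go into the mandatory clauses, the remaining text into the optional one.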

Query _search

GET http://localhost:9200/app/product/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "company_id": [
              2,
              3,
              4,
              5,
              6,
              7,
              8
            ]
          }
        },
        {
          "term": {
            "city_id": {
              "value": 1,
              "boost": 1
            }
          }
        },
        {
          "term": {
            "category_id": {
              "value": 1,
              "boost": 1
            }
          }
        }
      ],
      "must": [
        {
          "bool": {
            "should": [
              {
                "match": {
                  "name": {
                    "query": "мазь для наружного применения мг+ мг/г туба алюминиевая г",
                    "boost": 1,
                    "operator": "or",
                    "minimum_should_match": 0,
                    "fuzziness": "AUTO"
                  }
                }
              }
            ],
            "must": [
              {
                "match": {
                  "name": {
                    "query": "Гиоксизон",
                    "boost": 2,
                    "operator": "or",
                    "minimum_should_match": "70%",
                    "fuzziness": "AUTO"
                  }
                }
              },
              {
                "match_phrase": {
                  "name": {
                    "query": "10 30 10",
                    "boost": 2,
                    "slop": 100
                  }
                }
              }
            ]
          }
        }
      ],
      "should": [
        {
          "bool": {
            "should": [
              {
                "match": {
                  "manufacturer": {
                    "query": "Муромский приборостроительный завод АО",
                    "boost": 1,
                    "operator": "or",
                    "minimum_should_match": "70%",
                    "fuzziness": "AUTO"
                  }
                }
              },
              {
                "match": {
                  "manufacturer": {
                    "query": "Валента Фармацевтика ОАО",
                    "boost": 1,
                    "operator": "or",
                    "minimum_should_match": "70%",
                    "fuzziness": "AUTO"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "name": {}
    }
  },
  "size": 50
}
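
For daily automatic matching, a request like this has to be generated for every client product. Below is a minimal sketch of such a builder; the function name, its parameters and the shorthand form of the term filters are assumptions, while the clause structure, boosts and thresholds mirror the request above.

```python
import json


def build_match_query(drug, numbers, rest, manufacturers,
                      competitor_ids, city_id, category_id, size=50):
    """Assemble the _search body for one client product."""

    def fuzzy_match(field, text, boost, min_match):
        return {"match": {field: {
            "query": text,
            "boost": boost,
            "operator": "or",
            "minimum_should_match": min_match,
            "fuzziness": "AUTO",
        }}}

    return {
        "query": {"bool": {
            # Hard constraints: competitors only, same city and category.
            "filter": [
                {"terms": {"company_id": competitor_ids}},
                {"term": {"city_id": city_id}},
                {"term": {"category_id": category_id}},
            ],
            "must": [{"bool": {
                # The rest of the name is optional and only affects the score.
                "should": [fuzzy_match("name", rest, 1, 0)],
                # The drug name and all numbers are mandatory.
                "must": [
                    fuzzy_match("name", drug, 2, "70%"),
                    {"match_phrase": {"name": {
                        "query": " ".join(numbers), "boost": 2, "slop": 100}}},
                ],
            }}],
            # A matching manufacturer raises the score but is not required.
            "should": [{"bool": {"should": [
                fuzzy_match("manufacturer", m, 1, "70%") for m in manufacturers
            ]}}],
        }},
        "highlight": {"fields": {"name": {}}},
        "size": size,
    }


query = build_match_query(
    drug="Гиоксизон",
    numbers=["10", "30", "10"],
    rest="мазь для наружного применения мг+ мг/г туба алюминиевая г",
    manufacturers=["Муромский приборостроительный завод АО"],
    competitor_ids=[2, 3, 4, 5, 6, 7, 8],
    city_id=1,
    category_id=1,
)
print(json.dumps(query, ensure_ascii=False, indent=2))
```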

As output, we get the IDs of the matched products, along with their names with highlighted fragments and a score for analysis.

  • Гиоксизон 10мг+30мг/г мазь для наружного применения туба 10г - score: 69.84
  • ГИОКСИЗОН мазь для наружнего применения 10 мг+30 мг/г: 10 г - score: 49.79

Conclusion


The described method certainly will not give 100% matching accuracy, but it greatly simplifies manual product matching. It is also suitable for tasks that do not require absolute accuracy.
In general, refining the search query with additional heuristics and extending the synonym list can bring the results close to satisfactory.
In addition, performance tests run on an old i7 showed good results: 10 search queries against an index of 200,000 products complete within a couple of seconds. A live version of this example on drug data can be found here.

Share your own approaches to product matching in the comments.

Thanks for your attention!
