ViArt November 24, 2014 at 11:13

InterSystems iKnow. Part one. iKnow and beach holidays

I have long wanted to write an article about iKnow technology. Three years have passed since it was introduced, yet so far nothing has been published about applying it in Russian-language solutions. The explanation is simple: there was no full support for the Russian language. With each new release, starting with Cache 2013.1, the situation has improved, and we finally decided to implement our first project on iKnow. Read on to learn how it went, what worked and what did not.

So, as I said, until now there have been no real Russian-language applications built with iKnow, although support for the Russian semantic model first appeared in version 2013.1. At some point it became clear that what works for Latin languages does not suit Russian (or any other Slavic language). The culprit is the variety of forms a single word can take. When iKnow analyzes a text, it counts concepts (for our purposes, a concept can be thought of as a noun), and the various inflected forms of the word “apple” (singular, plural, different cases) are treated as completely different terms and counted separately. For this reason iKnow may, for example, interpret a recipe for charlotte (an apple cake) as an article about servicing electric ovens, because the term “oven” (turn on the oven, open the oven, turn off the oven, etc.) occurs in the text more often than any single form of the word “apple”. That was the difficulty. What solves it is lemmatization, a tool that reduces a word to its normal (dictionary) form.
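The effect can be shown with a toy count. The sketch below is only an illustration: the lemma table and the list of concepts are made up, and it does not use iKnow or Hunspell. It counts the same concepts twice, once as raw word forms and once after mapping each form to its dictionary form.

```python
from collections import Counter

# Hypothetical toy lemma table: inflected form -> dictionary form.
# A real system would use a morphological dictionary instead.
LEMMAS = {
    "яблоки": "яблоко",    # apples (nominative plural)
    "яблок": "яблоко",     # apples (genitive plural)
    "яблоками": "яблоко",  # apples (instrumental plural)
    "духовку": "духовка",  # oven (accusative singular)
}

# Concepts as they might appear in an apple cake recipe (made-up data).
concepts = ["яблоки", "духовку", "яблок", "духовку", "яблоками", "духовку", "яблоко"]

raw_counts = Counter(concepts)
lemma_counts = Counter(LEMMAS.get(c, c) for c in concepts)

top_raw = raw_counts.most_common(1)[0]
top_lemma = lemma_counts.most_common(1)[0]
print(top_raw)    # ('духовку', 3): without lemmatization the oven wins
print(top_lemma)  # ('яблоко', 4): with lemmatization the apple wins
```

The word “oven” always appears in the same form, while “apple” is spread across four forms, so only the lemmatized count reflects what the text is about.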
Lemmatization support, built on the Hunspell library, was implemented in Cache 2015.1 FieldTest. This means it has become possible to create full-fledged applications for analyzing texts in Russian and Ukrainian. I immediately wanted to build something that would be a good practical example of using iKnow rather than a useless analogue of “Hello world”. And such a task was found!
We were given a database of 27,000 reviews of the 100 most popular hotels in Turkey and Egypt. The set of immediate tasks was determined right away. But first things first.
What is a tourist review? First of all, it is unstructured data, or simply text (the term “unstructured text”, which many people like to use, seems meaningless to me). People returning from vacation (we considered beach vacations) go to the portal and rate the hotel they stayed in, or individual categories of the hotel such as service, food and hospitality. Then they describe the vacation in their own words, noting what was good and what was bad. Numerical ratings (for example, on a five-point scale) are metadata that the portal administration can easily use to calculate hotel rankings. But people often just write the text and forget to give a rating. There are many such cases: in our project more than half of the reviews contain only text, without numerical ratings. Such reviews contribute nothing to the hotel's rating. So the first task was to teach iKnow to calculate a hotel rating solely from the text of the review.
The remaining tasks were also formulated quite quickly:

calculate ratings for individual hotel categories (comfort, service, food, hospitality, territory, location);
evaluate how consistent these calculated ratings are with the ratings the reviewers gave themselves;
synthesize a summary phrase about staying at the hotel (for example, “Of the 653 guests of the hotel, 278 (43%) note the courtesy and friendliness of the staff, 220 (34%) liked the food in the restaurants, and 76 (12%) would like to come back here again”);
learn to identify the most useful reviews so that they can be shown to portal visitors first;
find suspicious and commissioned reviews, which are written for a reward as advertising and often have little to do with the sad reality.

There were other tasks, but I will focus on how I solved the ones listed above.
Now I need to explain what iKnow is and what it offers without significant effort. iKnow is a technology for analyzing text. The iKnow API is a set of functions for working with unstructured data. There is also a GUI for visualizing the results of text indexing and extracting useful information from the data. When we load something into iKnow, we get the same text back, but divided into concepts and the relationships between them. Concepts in a sentence, as a rule, are the subject and the objects. The relationships between concepts are in most cases verbs, verb forms or prepositions. In addition, iKnow can count exactly how many times the terms “hotel”, “sea”, “beach” or “food poisoning” are mentioned in a tourist review.

An example of dividing a sentence into concepts and relationships. Concepts are highlighted in yellow, relationships are underlined, insignificant words are marked in gray.

What else we can get from the text depends mainly on our imagination and, to a lesser extent, on the breadth of the iKnow API.
How can a hotel be rated from the text of a review? Below is one approach to deriving numerical characteristics from text.
The first thing to do is break the text into pieces. The iKnow API allows splitting into sentences or into paths. Sentences are simple: a sentence is a part of the text delimited by periods, question marks or exclamation marks, and also by semicolons. A path is a part of a sentence that describes interconnected concepts. In practice, paths and sentences usually coincide; only compound and complex sentences consist of several paths.

In the room we found the ants promised in the reviews, called the reception and informed them about our problem.

Taken as a whole, this is a sentence; each of its parts is a path.
We split the text into sentences.
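Sentence splitting by the delimiters named above can be sketched in a few lines. This is a naive regular-expression splitter for illustration only, not the way iKnow does it; the function name `split_sentences` is an assumption, and it will mis-split abbreviations such as “e.g. this”.

```python
import re

def split_sentences(text: str) -> list:
    """Split after '.', '?', '!' or ';' when followed by whitespace."""
    parts = re.split(r"(?<=[.?!;])\s+", text.strip())
    return [p for p in parts if p]

review = "The beach was clean; the sea was warm. Would we come back? Definitely!"
for sentence in split_sentences(review):
    print(sentence)
```

Splitting into paths is a different matter: it requires knowing which concepts are connected, which a regular expression cannot provide.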
The second task is to understand what the sentence is about. In other words, we need to determine which hotel category the sentence refers to. This requires dictionaries of so-called functional markers. For example, if the sentence contains the terms “restaurant”, “juice” or “tea”, then it is about the category “food”, and these terms belong to the marker dictionary for that category.

The drinks in the restaurant and bars are powder juices, carbonated drinks, tea, instant coffee, wine and beer, strong alcohol in bars.
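A minimal sketch of category detection with a marker dictionary might look like this. The dictionary contents are illustrative (only “restaurant”, “juice” and “tea” come from the text above), the function name `detect_categories` is an assumption, and plain word matching stands in for iKnow concepts.

```python
import re

# Hypothetical marker dictionary: category -> marker terms in dictionary form.
FUNCTIONAL_MARKERS = {
    "food": {"restaurant", "juice", "tea", "coffee"},
    "territory": {"garden", "pool"},
    "service": {"reception", "staff"},
}

def detect_categories(sentence: str) -> set:
    """Return every category that has at least one marker in the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return {cat for cat, markers in FUNCTIONAL_MARKERS.items() if words & markers}

print(detect_categories(
    "The drinks in the restaurant and bars are powder juices, tea and instant coffee."
))  # {'food'}
```

Note that “juices” does not match the marker “juice” here; the sentence is still recognized only because other markers are present. This is exactly the gap that lemmatization closes.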

Our dictionary of functional markers contains about 300 terms specific to the 6 hotel categories being rated. An important point: without lemmatization, every form of each of these 300 words would have to be entered in the dictionary for the system to work correctly. At first this is not an impossible task. But what if the dictionary grows?
At the first stage, the marker dictionaries were compiled manually. They included the terms encountered while reading through the first 200 reviews. At the second stage, the dictionary was expanded automatically using a dictionary learning algorithm built on iKnow features. As a result, the size of the dictionary grew on average by an order of magnitude.
Now we know what the sentence is about. It remains to understand whether the reviewer liked it or not. In other words, we need to determine the emotional tone of the sentence. For this, a dictionary of emotional markers was compiled. As a rule, in Russian the emotional tone is set by adjectives (delicious coffee, convenient entry to the sea, etc.). We can also take into account unambiguous nouns (dirt, mass alcohol poisoning, joy).
And now the “magic” begins: numerical ratings, charts and tables are produced from the text. We can count the number of positive and negative terms related to each hotel category in each individual review, and then determine the share of positive ones. For the calculation I used the following formula:

Rating = N_positive / (N_positive + N_negative)

The result is a number from 0 to 1, and the better the hotel, the closer the value is to 1. Multiplying this number by a factor, for example 5, gives a hotel rating on a five-point scale, which is what we did.
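As a small worked example, the formula can be written as a function. The function name `review_rating` and the handling of a review with no emotional markers (returning `None`) are assumptions of this sketch; the article does not say how that case was treated.

```python
from typing import Optional

def review_rating(n_positive: int, n_negative: int, scale: float = 5.0) -> Optional[float]:
    """Share of positive markers, scaled to a rating from 0 to `scale`."""
    total = n_positive + n_negative
    if total == 0:
        return None  # assumption: no emotional markers means no rating
    return scale * n_positive / total

print(review_rating(3, 1))  # 3.75
print(review_rating(0, 4))  # 0.0
print(review_rating(0, 0))  # None
```

A review with three positive markers and one negative marker therefore gets 0.75 on the 0 to 1 scale, or 3.75 out of 5.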
The next task was to make sure that all of this makes sense. This was perhaps the key point of the work. To understand how the ratings calculated with iKnow correlate with the authors' own ratings, we built a chart covering all the hotels we rated.

Figure 1. Correlation of the authors' ratings and the calculated ratings.

The average ratings given by review authors are shown in blue, and the ratings calculated by iKnow are shown in green. As you can see, the correlation is clearly present. Although it is too early to draw final conclusions, it is already clear that this approach works and can be developed further. Of course, such an algorithm for the quantitative assessment of hotels and their parameters needs a statistically large number of reviews: in our case, ratings were computed only for hotels with at least 20 reviews. By the way, to build the analytics I used another InterSystems technology, DeepSee.
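The article judges the correlation visually from the chart. One common way to put a number on it is the Pearson correlation coefficient; the sketch below is an assumption about how that could be done, not the author's method, and the per-hotel averages in it are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up per-hotel averages, for illustration only.
author_avg = [3.0, 4.0, 5.0]
computed_avg = [2.5, 3.5, 4.5]
r = pearson(author_avg, computed_avg)
print(r)  # 1.0: the computed ratings are shifted but perfectly aligned
```

A value close to 1 means the calculated ratings rank hotels the same way the authors do, even if the absolute values differ.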
The next tasks were to find useful reviews and commissioned reviews. Everything is quite simple here: you just need to formulate suitable criteria. Here, for example, are the criteria for a useful review:

The review covers as many hotel categories as possible. That is, it tells the reader about the level of service, room comfort, food quality, etc.
The review should be emotionally balanced. It should not consist only of bare criticism or only of admiring description. As a rule, people like some things and are unhappy with others. That is, the review must contain both positive and negative emotional markers.
The hotel rating calculated from such a review should be close to the hotel's average rating.
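The three criteria above can be combined into a simple check. This is a sketch under assumptions: the `ReviewStats` structure, the function name `is_useful` and both thresholds (`min_categories`, `max_deviation`) are hypothetical, since the article gives the criteria only in words.

```python
from dataclasses import dataclass

@dataclass
class ReviewStats:
    categories: set       # hotel categories mentioned in the review
    n_positive: int       # positive emotional markers found
    n_negative: int       # negative emotional markers found
    rating: float         # rating computed from the review text, 0..5
    hotel_average: float  # average rating of the hotel, 0..5

def is_useful(r: ReviewStats, min_categories: int = 4, max_deviation: float = 0.5) -> bool:
    covers_many = len(r.categories) >= min_categories           # criterion 1
    balanced = r.n_positive > 0 and r.n_negative > 0            # criterion 2
    near_average = abs(r.rating - r.hotel_average) <= max_deviation  # criterion 3
    return covers_many and balanced and near_average

balanced_review = ReviewStats({"food", "service", "comfort", "location"}, 6, 2, 3.75, 4.0)
gushing_review = ReviewStats({"food", "service", "comfort", "location"}, 8, 0, 5.0, 4.0)
print(is_useful(balanced_review))  # True
print(is_useful(gushing_review))   # False
```

Instead of a yes/no answer, the same three quantities could be turned into a score for sorting reviews, so that the most useful ones are shown first.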

Similar criteria were formulated for assessing how suspicious a review is. Note that stating unambiguously that a review is commissioned would be too presumptuous for iKnow, but flagging a review as suspicious and drawing the portal administrator's attention to it is quite feasible.
Separately, I want to mention DeepSee, which can use dimensions and measures based on iKnow data. Beyond static ratings, reviews contain a lot of interesting and, more importantly, useful information for tourists. Using the analyzer, we found that ratings vary by month and from year to year, and all of this is clearly visible on DeepSee charts.

Figure 2. Change in hotel rating by month

Let me summarize the work done. We managed to implement the first meaningful Russian-language project on iKnow. With enough imagination and the rich iKnow API, you can build quite complex solutions. In the future we plan to develop the hotel review analyzer into a universal tool, because the Internet is full of reviews: of movies, cars, phones and so on.
But it would be wrong to talk only about the merits without mentioning the problems we faced. There are two of them:

handling negated sentences;
imperfect lemmatization.

Negation handling first appeared in iKnow in version 2014.1. But using it in a review analyzer is very problematic. How, for example, should the system evaluate this phrase:

I cannot say that we did not like the beach.

or

Bottom line, our personal vacation, went WELL, but I would never have gone to this hotel again!

iKnow can detect the presence of negation in a sentence, but it is hard to determine with certainty what exactly is being negated. This problem is especially characteristic of Russian, which has no rigid word order. How to handle such sentences is up to you. The options are: flip the score of the emotional markers (multiply by -1), ignore such sentences altogether, or leave everything as is. In any case, this leads either to a loss of information or to a loss of accuracy in the analysis of individual sentences. And it is not at all clear how to analyze sarcasm.
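The three options can be written down as a small function. This is a sketch: the function name `apply_negation`, the strategy names and the +1/-1 marker scores are assumptions made for illustration, and the `negated` flag stands in for whatever negation information the analyzer provides.

```python
from typing import Optional

def apply_negation(marker_score: int, negated: bool, strategy: str) -> Optional[int]:
    """Adjust an emotional marker score (+1 or -1) found in a sentence.

    strategy: "flip" multiplies by -1, "ignore" drops the marker (None),
    "keep" leaves the score unchanged.
    """
    if not negated or strategy == "keep":
        return marker_score
    if strategy == "flip":
        return -marker_score
    if strategy == "ignore":
        return None
    raise ValueError(f"unknown strategy: {strategy}")

# "did not like the beach": the marker "like" is +1 and the sentence is negated.
print(apply_negation(1, True, "flip"))    # -1
print(apply_negation(1, True, "ignore"))  # None
print(apply_negation(1, True, "keep"))    # 1
```

For the double negation “I cannot say that we did not like the beach”, a single flip yields -1 although the meaning is positive, ignoring the sentence loses the information, and keeping the score happens to be right only by accident.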
There is also a problem with lemmatization. It exists and it works, but sometimes it produces absolutely wonderful results. For example, the prepositional case of the noun “hotel” (“about the hotel”) is reduced to the livestock term “calving”, which is spelled the same way in that form in Russian, and “green wild plum” comes out as a plumbing term.
However, I think we should be optimistic and believe that over time these problems will be resolved. Work on improving iKnow continues.
This article contains no technical details. I will be happy to cover them, and how to build your own iKnow applications, in the sequel.
