Why can't search do without linguistics?

    Today we will talk about the role linguistics plays in Internet search. To put this in context, I'll start with how linguists fit into a large search company, for example Yandex (more than 5,000 people), Google (more than 50,000) or Baidu (more than 20,000). Between a third and half of these people work directly on search. The linguists inside these companies are roughly divided between search and other areas: news, translation, and so on.



    Today I will talk about the linguists whose work intersects with search; in the diagram they are marked with hatching. Perhaps at Google and other companies things are arranged a little differently than at ours, but the general picture is roughly this: linguistics is an important but not decisive direction in the work of a search company. Another important caveat: in real life the boundaries are, of course, blurry; it is impossible to say, for example, where linguistics ends and machine learning begins. Every linguist working on search does a little programming and a little machine learning.

    Since the audience here is mostly people from the world of science, I would like to briefly describe the difference between the world of science and the world of industry. I have drawn a rough chart: the x-axis shows the complexity of the tasks being solved, the y-axis the return on those tasks, whether measured in money or in overall benefit to humanity. People in industry love to pick tasks from the upper left quadrant, simple and with a large return, while people in science pick tasks from the right edge, complex and not yet solved by anyone, but with a fairly arbitrary distribution of returns. Somewhere in the upper right quadrant they meet. I would very much like to hope that this is where the tasks we work on are located.



    The last thing I want to mention in this introduction: science and industry live on two completely different timelines. To make clear what I mean, I wrote down a few dates for the appearance of companies, ideas and technologies that are now considered important. Search companies appeared about fifteen years ago, Facebook and Twitter less than ten years ago. At the same time, Noam Chomsky's work dates back to the 1950s, the "meaning-text" model, if I understand correctly, to the 1960s, and latent semantic analysis to the 1980s. By the standards of science this is quite recent, but on the other hand it is a time when not only Internet search but the Internet itself did not yet exist. I do not want to say that science has stopped developing; it simply has its own timeline.

    Now let's move on to the topic itself. To understand the role of linguists in search, let's first try to build our own search engine and see how it works. Along the way we will see that the task is not as simple as many people think, and we will get acquainted with the data structures that will be useful to us later.

    Intuitively, users imagine that modern search engines search "simply" by keywords: the system finds the documents in which all the keywords from the query occur and orders them somehow, for example by popularity. This approach is sometimes contrasted with "smart", "semantic" search, which supposedly turns out much better thanks to some special knowledge. This picture is not very close to the truth, and to understand why, let's try to mentally perform such a "keyword search" and see what it can and cannot find.

    For clarity, we take only three documents: an informational one (an excerpt from Wikipedia), a text about shops, and a random joke:

    1. Novokosino is a district of Moscow and the station of the same name of the Moscow metropoliten, the terminus of the Kalininskaya line. It follows the Novogireevo station.
    2. "Crossroads" stores. Sale of food products. Assortment overview. Store addresses. Information for customers.
    3. The police caught a group of scammers selling diplomas in the metro. "We had to let them go," said Sergeant Ivanov, Doctor of Economics.

    A real search engine has not three such documents but three billion, so we cannot afford, every time a user asks a query, to simply look through each of them. We need to put them into a data structure that lets us search faster: an inverted index, which records, for each word, the documents in which it occurs. For example, from a few words of these three documents we get the following index:
    shops      2.1.1
    products   2.2.2
    in         1.1.3, 3.1.7
    Moscow     1.1.5
    metro      3.1.8
    sergeant   3.2.9

    The first digit indicates the number of the document in which the word occurs, the second indicates the sentence, and the third indicates the position in it.
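    To make the structure concrete, here is a minimal sketch in Python of building such an index with (document, sentence, position) coordinates for each word. The toy documents, the English wording and the naive whitespace tokenization are assumptions for illustration only, so the coordinates differ slightly from the Russian example above.

```python
from collections import defaultdict

# Toy documents, one string per sentence (illustrative only).
documents = [
    ["Novokosino is a district of Moscow and a station of the metropoliten",
     "It follows the Novogireevo station"],
    ["Crossroads stores", "Selling food products"],
    ["Police caught scammers selling diplomas in the metro",
     "We had to let them go said sergeant Ivanov"],
]

def build_index(docs):
    """Map each word to a list of (document, sentence, position) coordinates."""
    index = defaultdict(list)
    for doc_id, sentences in enumerate(docs, start=1):
        for sent_id, sentence in enumerate(sentences, start=1):
            # Naive tokenization: lowercase and split on whitespace.
            for pos, word in enumerate(sentence.lower().split(), start=1):
                index[word].append((doc_id, sent_id, pos))
    return index

index = build_index(documents)
print(index["metro"])    # [(3, 1, 8)] -- "metro" occurs only in the joke document
print(index["selling"])  # [(2, 2, 1), (3, 1, 4)]
```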

    Now let's try to answer users' queries. First query: [grocery stores]. Everything works: the search finds the second document. Let's take a more complex query: [grocery stores in Moscow]. The search finds these words in all three documents, and it is not even clear which of them should win. Query [Moscow metro]. Now nothing matches properly: the word "Moscow" is not found in the exact form used in the query (in the Russian original the documents contain only an inflected form of it), and the word "metro" occurs only in the third document, the one with the joke, while the first document uses the full word "metropoliten" instead. It turns out that in the first case we found what we were looking for; in the second we found too much and now have to order it correctly; and in the third we found nothing useful, because we did not know that "metro" and "metropoliten" mean the same thing. Here we have met the two typical problems of Internet search: recall (completeness) and ranking.

    It often happens that a query matches a huge number of documents and the search engine reports something like "10 million answers found". That in itself is not a problem; the problem is to order them properly and put first the answers most relevant to the query. This is the ranking problem, the main problem of Internet search. Recall is also a serious problem: if we fail to find the only relevant document for a query containing the word "metro" just because the document uses the full word "metropoliten" instead, that is not good.

    Now consider the role of linguistics in solving these two problems. There are no numbers in the diagram below; it shows my general impressions: linguistics helps ranking only to a limited extent. The main roles in ranking are played by the so-called ranking factors, by machine learning and by antispam. Let me explain these terms. What is a "ranking factor"? We may notice that some characteristic of the query, of the document, or of the query-document pair clearly ought to influence the order in which we present documents to the user, but it is not known in advance exactly how and by how much. Such characteristics include, for example, the popularity of the document; whether all the query words occur in the text of the document, and where exactly; how many times and with what link text the document is referred to; the length of the URL; the number of special characters in it; and so on. "Ranking factor" is not a universal term; at Google, if I am not mistaken, they are called signals.



    One can come up with several thousand such ranking factors, and the system is able to compute them, for each query, for all the documents found for that query. Intuitively we may decide that one factor should probably raise the relevance of a document (for example, its popularity) and another should probably lower it (for example, the number of slashes in the page address). But no human can combine thousands of factors into a single procedure or formula that works best, and this is where machine learning comes in. What is it? Specially trained assessors select a large set of queries in advance and assign grades to the documents found for each of them. After that, the machine's goal is to choose a formula combining the ranking factors so that its output is as close as possible to the grades the people gave. That is, the machine learns to rank the way people do.
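    Here is a deliberately tiny sketch of this idea in Python: a pointwise learning-to-rank setup in which plain gradient descent fits a linear formula to assessor grades. Real systems use thousands of factors and far more sophisticated models; the factor names, numbers and grades below are made up purely for illustration.

```python
import random

# Hypothetical assessor data: each query-document pair is described by a few
# ranking factors and labelled with a human relevance grade from 0 to 3.
# Factors (invented for the example): popularity, share of query words found
# in the document, number of slashes in the URL.
samples = [
    # (popularity, query_word_share, url_slashes), grade
    ((0.9, 1.0, 1), 3),
    ((0.2, 1.0, 2), 2),
    ((0.8, 0.3, 1), 1),
    ((0.1, 0.0, 6), 0),
    ((0.5, 0.7, 3), 2),
]

weights = [0.0, 0.0, 0.0]
bias = 0.0
lr = 0.01

def predict(factors):
    return sum(w * f for w, f in zip(weights, factors)) + bias

# Fit the formula so that its output is as close as possible to the human
# grades (stochastic gradient descent on squared error).
for _ in range(5000):
    random.shuffle(samples)
    for factors, grade in samples:
        error = predict(factors) - grade
        for i, f in enumerate(factors):
            weights[i] -= lr * error * f
        bias -= lr * error

# The learned formula can now order unseen documents for a query.
candidates = {"doc_a": (0.7, 1.0, 1), "doc_b": (0.9, 0.1, 5)}
ranked = sorted(candidates, key=lambda d: predict(candidates[d]), reverse=True)
print(ranked)  # most likely ['doc_a', 'doc_b']
```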

    Another important thing in ranking is antispam. As you know, the amount of spam on the Internet is monstrous. Lately it catches the eye less often than a few years ago, but this did not happen by itself; it is the result of active work by the antispam teams of various companies.

    Why doesn't linguistics really help in ranking? Maybe there is no good answer to this question, and the industry simply has not yet seen its genius. But it still seems to me that there is a more substantial reason. It would seem that the deeper a machine understands the meaning of documents and queries, the better it can answer, and it is linguistics that could give it such deep understanding. The problem is that life is very diverse, and people ask questions that cannot be forced into any framework. Any attempt to find some scheme, an exhaustive classification, to build a unified ontology of queries or a universal parser for them usually fails.

    Now I will show a few examples of real queries that even a person stumbles over, let alone a machine. They were actually asked by real people. For example: [dating in Moscow without registration]. It is not clear which registration is meant: registration on a website or official residence registration in Moscow. Or a person used to asking queries like [how to assemble a table with your own hands] produces the query [how to clean the sewer with your own hands]. Often there are queries that even a person cannot understand, and yet we have to rank the answers to them.

    Let us turn to the problem of recall. Here the picture is strikingly different, because for the most part it is solved precisely by linguistic means. Some ranking factors can help too. For example, a search engine may include the following rule: if a certain factor for a query-document pair has a value above a certain threshold, show this document for this query even if not all the query words were found in it. Say, a query begins with the word "Facebook", and then comes something completely incomprehensible that is found nowhere else. We will still show the facebook.com home page, hoping that this is what the user meant and that next to it he wrote, say, an unknown Facebook account name in Arabic. Machine learning also affects recall in its own way, for example by helping to formulate such rules.
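    A rule of this kind might look roughly as follows; the function, the "navigational score" factor and the threshold value are hypothetical, sketched only to show the shape of such a rule.

```python
def should_show(query_words, doc_words, nav_score, threshold=0.8):
    """Hypothetical recall rule: normally require every query word to occur in
    the document, but if some navigational factor (say, confidence that the
    query points at this site's home page) exceeds a threshold, show the
    document even when some query words were not found in it."""
    if all(w in doc_words for w in query_words):
        return True
    return nav_score > threshold

# The query starts with "facebook" followed by an unknown word; the home page
# does not contain that word, but a strong navigational score rescues it.
print(should_show(["facebook", "xxyyzz"], {"facebook", "log", "in"}, nav_score=0.93))  # True
```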



    I will give a few examples of why recall cannot be achieved without linguistics. First of all, we need to be able to generate the different forms of a word: to understand, for example, that "left" and "gone" can be forms of the same verb. We need to know that the noun "Moscow" and the adjective "Moscow" (as in [Moscow metro]) are related words, that "surfing" and "surfer" are related to each other, and that the letters "ё" and "е" are often interchangeable. We need to be able to understand queries written in Internet jargon or with non-letter characters, and to understand which Russian words they correspond to.
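    A sketch of the simplest normalization this requires: folding "ё" and "е", lowercasing, and mapping word forms to a shared lemma. The tiny lemma table below is a stand-in for a real morphological dictionary and contains only a couple of hand-picked entries.

```python
# Toy lemma table: maps word forms (and derivationally related words) to one key.
LEMMAS = {
    "москва": "москва", "москве": "москва", "москву": "москва",
    "серфинг": "серфинг", "серфер": "серфинг",
}

def normalize(word: str) -> str:
    """Lowercase, fold ё/е and reduce a form to its lemma if we know it."""
    word = word.lower().replace("ё", "е")
    return LEMMAS.get(word, word)

# Query and document now meet at the same index key.
print(normalize("Москве") == normalize("МОСКВА"))  # True
print(normalize("ёлка") == normalize("елка"))      # True
```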

    It is important to be able to match words, phrases and texts across languages. A user can ask [Head of the Verkhovna Rada of Ukraine], while in the document he needs, the name of this state body occurs only in Ukrainian. It would be good to be able to answer a query asked in one language with results in other languages, related or not. In fact, the more we look at user queries, the more we realize that any search is a translation from the language of the query into the language of the document.

    Now let's talk in more detail about how linguists help solve these search problems. I will use the notion of the linguistic pyramid, in which the analysis of a text is divided into several levels: lexical (splitting the text into words), morphological (word inflection), syntactic (how words combine into phrases and sentences), semantic (the meaning of words and sentences) and pragmatic (the purpose of producing the utterance and its external context, including the non-linguistic context).



    Let's start with the lexical level. What is a word? Search companies in 1995 had to answer this question at least somehow in order to get started at all. For example: a word is a sequence of alphabetic characters bounded on the left and right by spaces or punctuation marks. Unfortunately, this is often not true, especially if we mean not words in the usual sense but the indivisible pieces of text found on the Internet. Common among them are Internet addresses, dates, phone numbers and so on. It is often hard to say where one word ends and another begins. Take, for example, an amount of money together with its currency symbol: is it one word or two? We have to decide somehow. Currency symbols, phone number formats in different countries and the like may seem very far from linguistics, but in a search engine someone has to deal with them.

    Besides the frequent cases already mentioned (email addresses, phone numbers, dates), there are other examples of complex, non-trivial "words". Here is a phrase that in search it would be convenient to treat as a single word: "the Frunze plant". In that case it would also be good to treat "the plant named after Frunze" as a form of the same "word". The query [New York] is often written without the hyphen, but that does not turn it into two separate words: it would be wrong to return for it documents where "York" appears on its own and "New" appears only as part of, say, "New Hampshire".

    It happens that punctuation marks and special characters are an important part of a word. Users may write the name of a radio station as "Europa+", and a TV channel may be called "1+1". There is even a band in the world with the wonderful name "#####", which no search engine can find in that form. Similar problems constantly arise when searching for variable and function names and other programmers' queries, "C++" or "C#" being the obvious examples. Why does this happen? Because someone at the dawn of search engines decided that the characters %, _, # and + would not be considered letters.
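    As an illustration, here is a sketch of a tokenizer that keeps email addresses, dates, phone numbers and names containing special characters ("1+1", "C++", "#####") as single indivisible "words". The regular expressions are simplified assumptions, not the rules of any real search engine.

```python
import re

TOKEN_RE = re.compile(r"""
    [\w.+-]+@[\w-]+\.[\w.]+      # email addresses
  | \d{1,2}\.\d{1,2}\.\d{2,4}    # dates like 09.05.1945
  | \+?\d[\d\s()-]{6,}\d         # phone numbers
  | \#{2,}                       # the band "#####"
  | \w+(?:[+#]\w*)*              # ordinary words, plus "1+1", "C++", "C#"
""", re.VERBOSE)

def tokenize(text: str):
    """Split text into 'words', keeping special tokens in one piece."""
    return TOKEN_RE.findall(text)

print(tokenize("Write to ivanov@example.com about C++ and the 1+1 channel"))
# ['Write', 'to', 'ivanov@example.com', 'about', 'C++', 'and', 'the', '1+1', 'channel']
```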



    It would seem that by now we have accumulated the necessary experience, we know about all these special cases, and we could split text into words much more correctly, but the problem is that the search index is already built. If we now change our idea of what a word is, we will have to modify all the programs that work with this index and re-index the entire Internet, that is, billions of documents. This means that fixing any lexical analysis error is very expensive. So the lexical analysis of documents has to be implemented as correctly as possible from the very first version, and that requires a wisdom that nobody had fifteen years ago.

    Let us move on to morphological analysis. What do we want from it? First of all, we want to know what forms dictionary words have. Fortunately, for the major languages of the world this problem has already been solved: their words have been classified into parts of speech, declensions, conjugations and similar categories, and we know how they change by case, gender, tense and so on. The situation is worse with proper nouns: there are many of them, and practically no dictionaries of proper names exist. Yet it would be good to know that "Moscow" and "to Moscow" are the same thing; it would be good to know how a place name like Nizhniye Sovy declines, not to mention first names and surnames, of which there are a great many. Surnames are especially difficult, and this is an important case: it is a particular sin for a search engine when a person searches for himself and cannot find himself. In addition, there are words that are not in any dictionary and that keep appearing: names of brands, organizations, companies. So we need to be able to automatically assign an unfamiliar word to one of the inflection models known to us.
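    One common trick is to guess the inflection model of an out-of-vocabulary word from its ending. The sketch below does this for a few invented Latin-transliterated endings; a real morphology module would use thousands of paradigms learned from a dictionary.

```python
# Tiny illustrative fragments of inflection paradigms, keyed by word ending.
PARADIGMS = {
    "ov":  ["ov", "ova", "ovu", "ove", "ovym"],   # surnames like Ivanov
    "ina": ["ina", "inoy", "inu", "ine"],         # surnames like Nikitina
    "sk":  ["sk", "ska", "ske", "skom"],          # toponyms like Novosibirsk
}

def guess_forms(word: str):
    """Return hypothetical forms of an unfamiliar word, longest ending first."""
    for ending in sorted(PARADIGMS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[: -len(ending)]
            return [stem + e for e in PARADIGMS[ending]]
    return [word]  # no model matched: leave the word as is

print(guess_forms("petrov"))
# ['petrov', 'petrova', 'petrovu', 'petrove', 'petrovym']
```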

    Next we face another interesting problem, one that rarely arises outside search: which forms of a word are close to each other and which are far apart? For example, if the query contains the word "go", should we also look for documents with the words "going" or "went"? Most likely the person who issued the query meant the word "go" in exactly this form. Another example is the Russian verb "рыть" (to dig) and the form "рой". The word "рыть" in a query will most likely occur in a context like "how to dig a ditch", while "рой" can be a form of the verb "рыть", the noun "рой" (a swarm) or the name Roy. Even if it really is a form of "рыть", its use in a text quite possibly has nothing to do with the user's query. It is not even certain that we should look for the base form of a word used in the query. For instance, the query contains the word "houses": should we also search for documents with the word "house"? There is the TV show "Dom-2", the magazine "Your Home"...

    And we are not dealing with just one language, which only makes things more complicated. There is, for example, the Ukrainian noun "мета" (goal), from which one can form "мети" and "мету", which coincide with Russian forms of the verb "мести". The result is a complete mess. The problem is real, because the Ukrainian word "мета" is also the name of a Ukrainian Internet portal, so it can occur in Russian text as well. That is, we need to be able to recognize languages and understand what language a document is written in, and ideally every word in it.

    Sometimes, from context or by other means, one can resolve the homonymy and understand which word we are dealing with. But one must understand that this process is never 100% accurate and complete, and it cannot be implemented without bringing in some knowledge about the real world. For example, it is very hard for a computer to understand the query "How many goals does Pavlyuchenko have?": in Russian, to figure out that the form "голов" here comes from the word "гол" (goal) and not "голова" (head), you need to know who Pavlyuchenko is. Most homonymy-resolution algorithms stumble over this example.



    A separate topic is synonyms. It is not clear which level of the linguistic pyramid they belong to, because they are close to semantics and to morphology at the same time. We can treat synonyms as an extension of morphology: if we accept that a word can appear in one form in the query and in another form in a found document, why not do the same with synonyms? For example, return documents with the word "old ladies" for a query with the word "grannies", answer "hippopotamus" for "hippo", "face" for "mug", and so on. The question is how to find such pairs automatically, because you cannot list them all by hand.

    One way is to look for words that often occur in the same contexts. Here we run into problems that are not so much linguistic as engineering. For example, to find all pairs of words that occur in similar contexts, we essentially have to multiply by itself a matrix of size on the order of 1,000,000,000 by 1,000,000,000 that records which word occurs in which contexts and how often. This is already a non-trivial operation; data of this size simply cannot fit on a single computer. In addition, we want these computations to run as fast as possible, because to reach the required quality we have to try many variants of them that differ only slightly from one another.
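    The distributional idea itself fits in a few lines; what is hard is the scale. Below is a toy sketch: each word gets a vector of the context words around it, and words with similar vectors (high cosine similarity) become candidate synonyms. The five-sentence "corpus" and the window size are illustrative assumptions.

```python
from collections import defaultdict
from math import sqrt

corpus = [
    "моя бабушка печет пироги",
    "моя старушка печет пироги",
    "бабушка вяжет носки",
    "старушка вяжет носки",
    "кот спит на диване",
]

# Count context words within a window of +/-2 positions around each word.
vectors = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if i != j:
                vectors[w][words[j]] += 1

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine(vectors["бабушка"], vectors["старушка"]))  # 1.0: identical contexts
print(cosine(vectors["бабушка"], vectors["кот"]))       # 0.0: no shared contexts
```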

    Most synonym pairs do not look like "grannies" and "old ladies", but consist of words similar in spelling or sound. It may not be entirely correct to call such pairs "synonyms", but for lack of a better term I will use this one. For example, there are different admissible spellings of the same word, there are transliterations, there are words that can be written either as one word or as two. Treating such patterns specially helps find more synonyms and find them more precisely.

    The next problem: strictly speaking, there are no synonyms, because there are no two words that can replace each other everywhere. "My grandmother" and "my old lady" are quite different things, "the cat Behemoth" cannot be replaced with "the cat Hippopotamus", and the Russian word "рожа" in the sense of the illness (erysipelas) cannot be replaced with "лицо" (face). So synonyms must be context-dependent: the synonymy relation holds not between a pair of words but between a pair of words in some context. The full task of taking context into account in synonymy has not yet been solved. Sometimes a rather complex context is needed: in one query the Roman numeral II means "the Second", as in a monarch's name, and in another it is simply the number two. The words "eighty-six" can mean either the number 86 or the year 1986.



    A separate problem is typos. Is it correct to treat misspelled words as synonyms of correctly spelled ones, or should they be handled in a special way? There is no definite answer, but apparently typos do need to be handled differently, for example by explicitly correcting the user's error. At the same time, even correcting typos in common words is a hellishly difficult task. For example, many correctly spelled Ukrainian words can easily be mistaken for typos in a Russian-language query. There are even queries of the form [X or Y], where the user is asking which of two spellings is correct, and their meaning is completely lost if the spelling in them is corrected. Search engines also like to "correct" rare names into the similar-looking names of celebrities, and such actions can easily spoil the user's mood.
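    The simplest correction scheme, sketched below, already shows both sides of the problem: among dictionary words within one edit of the query word it picks the most frequent one, and it refuses to touch a word that is itself known. The frequency dictionary is a tiny illustrative stand-in for real query and document statistics.

```python
FREQ = {"магазин": 900, "магнит": 400, "москва": 800, "вряд": 50, "ли": 500}

ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

def edits1(word):
    """All strings one edit away: deletions, replacements, insertions, swaps."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    return set(deletes + replaces + inserts + swaps)

def correct(word):
    if word in FREQ:  # a known word is left alone
        return word
    candidates = [w for w in edits1(word) if w in FREQ]
    return max(candidates, key=FREQ.get) if candidates else word

print(correct("магазн"))  # 'магазин' -- the missing letter is restored
print(correct("вряд"))    # 'вряд'    -- a known word is not "corrected"
```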

    Let us move on to syntactic parsing. I was being a little sly when I illustrated the notion of the linguistic pyramid with a trapezoid, because in a real search engine the pyramid does not look like that: the syntactic level is very thin, and almost nothing is done on it. I have a few guesses as to why. First: a conflict of data structures. Parsing produces a tree-like structure with complex relations between the parts of a sentence, while at query time the system has only an inverted index, in which a word is mapped to document numbers and some other simple information such as sentence numbers. It is not at all clear how to fit such a complex structure in there so that it does not take too much space and the representation remains convenient and extensible. This is not impossible, but as far as I know nobody has really solved this problem. Second guess: it is all about the structure of user queries. They usually consist of a small number of words, often not even grammatically agreed with one another, and parsing in principle cannot tell us much about them.

    However, there is one case where this level helps search a lot: processing a specific class of queries in which the so-called stop words matter, such as "if", "and", "the" and other articles, prepositions and conjunctions that are very frequent in natural text. The first search engines ignored them altogether, and even now they are treated in a special way. Say, the Russian preposition "с" ("with") occurs in almost every document of the Russian-language Internet, and it is expensive to walk through all the documents where it occurs. But there are kinds of queries in which stop words are very important. For example, [roof for a car] and [car without a roof] are two different queries. Without proper handling of stop words you cannot process, say, a search for the band "The The" or a wonderful query consisting entirely of prepositions and conjunctions.
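    This is exactly where the positions stored in the inverted index pay off. A sketch, on a hypothetical mini-index of two documents ("roof for a car" and "car without a roof"): a phrase match requires the words, stop words included, to stand in consecutive positions, so the two queries select different documents.

```python
# word -> {doc_id: [positions]}; document 1 is "roof for a car",
# document 2 is "car without a roof" (toy data).
index = {
    "roof":    {1: [1], 2: [4]},
    "for":     {1: [2]},
    "a":       {1: [3], 2: [3]},
    "car":     {1: [4], 2: [1]},
    "without": {2: [2]},
}

def phrase_match(query, index):
    """Return documents where the query words occur as a consecutive phrase."""
    words = query.split()
    docs = set.intersection(*(set(index.get(w, {})) for w in words))
    hits = []
    for doc in docs:
        for start in index[words[0]][doc]:
            if all(start + k in index[words[k]][doc] for k in range(1, len(words))):
                hits.append(doc)
                break
    return hits

print(phrase_match("roof for a car", index))      # [1]
print(phrase_match("car without a roof", index))  # [2]
```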



    Let's move on to semantic analysis. Traditionally this is considered a hard task, but in search some things that can be attributed to semantic analysis are routine. The most common one: all major search engines can, one way or another, determine the "genre" of a text; they distinguish articles, store pages, blog posts, news feeds and so on from one another. This is an entirely linguistic task: build a classifier that learns to assign a text to one of these genres. Another task is to determine from the text of a document or query what topic it is about, for example "computers", "cars", "family", "astrology". The match between the topic of the document and the topic of the query can then be used, for instance, as a ranking factor. A separate important topic is "adult" sites and queries. When the user meant nothing of the kind and the search gives him links to "adult" sites, this is traditionally irritating, so the corresponding classifiers have to be built as well.
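    The kind of classifier meant here can be sketched as a tiny naive Bayes model over words; the training texts, topics and test queries below are toy assumptions, and a production classifier would of course use vastly more data and features.

```python
from collections import Counter, defaultdict
import math

training = [
    ("cars",      "engine oil change wheels brakes engine"),
    ("cars",      "buy used car engine diesel"),
    ("computers", "install linux driver video card"),
    ("computers", "laptop memory upgrade ssd"),
    ("astrology", "horoscope aries love forecast stars"),
]

word_counts = defaultdict(Counter)
class_counts = Counter()
for topic, text in training:
    class_counts[topic] += 1
    word_counts[topic].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Naive Bayes with add-one smoothing: pick the most probable topic."""
    scores = {}
    for topic in class_counts:
        total = sum(word_counts[topic].values())
        score = math.log(class_counts[topic] / len(training))
        for w in text.split():
            score += math.log((word_counts[topic][w] + 1) / (total + len(vocab)))
        scores[topic] = score
    return max(scores, key=scores.get)

print(classify("engine brakes repair"))     # 'cars'
print(classify("video card driver crash"))  # 'computers'
```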

    But here is a text of the kind that ten years ago could be seen on the Internet with great frequency, and which now you will not find in the results for any normal query:
    Before answering, to palm off a helping hand to a coral buddy - service. And here the unspeakable rescue tarzan saw that on an empty stomach it would never leave, center. But his inlaid mind, legal with math, phones. That after hitting down from the shoulder, he groups more slowly, paralyzing the work of all microbots within it, samsung.

    In fact, generated texts now make up about half of the Internet, and to fool the search engine one can also use normal texts in which the words have been shuffled or some words replaced with synonyms. The goal is to lure the user to a site with this nonsense and extract some benefit from it. It is semantic analysis that helps us tell normal text from text generated in this way. An interesting property is that it does not matter much how exactly we do it: as soon as we find any regularity that all natural texts obey, be it n-gram statistics or patterns of word usage, this regularity exposes almost all unnatural texts, until the people behind them find a way to imitate it. It is a kind of arms race.
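    A sketch of the "any regularity works" observation: estimate how natural a text is by the share of its word bigrams ever seen in a reference corpus. The reference text below is a toy stand-in for real web-scale statistics; shuffled text immediately stands out.

```python
reference = (
    "we went to the store to buy some food "
    "the store near the metro sells cheap food "
    "we had to let them go said the sergeant"
)

# All word bigrams ever observed in the reference corpus.
known_bigrams = set(zip(reference.split(), reference.split()[1:]))

def naturalness(text: str) -> float:
    """Share of the text's bigrams that were seen in the reference corpus."""
    words = text.split()
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    return sum(b in known_bigrams for b in bigrams) / len(bigrams)

natural = "we went to the store near the metro to buy some food"
shuffled = "store we food metro some the to went near buy the to"
print(naturalness(natural))   # about 0.9: almost all bigrams are familiar
print(naturalness(shuffled))  # 0.0: shuffling destroys the natural patterns
```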



    Now about pragmatic analysis. It would seem impossible to use it in text search, since by definition it concerns circumstances outside the text. Nevertheless, in query processing there are procedures that can loosely be attributed to pragmatic analysis. There are words that users regularly use in queries; it is clear what they mean, but their presence in a query does not signal a desire to find them in the text of the document, it is an indication of the user's pragmatics. For example, "free" or "in good quality". One must also keep in mind that, say, "inexpensive" means different things in Moscow and in the provinces. It would be good to handle each such frequent pragmatic marker in its own special way, and this is a typical project in search, everyday work.

    Interestingly, such clarifying words may be absent not only from the text of the document but also from the text of the query, and we still have to guess what the user means. For example, "Harry Potter" can mean either the film or the book; "Jaguar" is an animal, a drink and a car. We would like to understand this and show all these things in the results, and in the right proportions at that. The query [elections] sometimes implies news about current events, and at other times it is informational and the right thing is to show links to legislation, a Wikipedia article and so on. All of this can be learned: all the necessary information about user interests is there, users themselves report it to us through their queries. One must be able to find connections between different queries and have some kind of model of reality.



    We have discussed what linguists in search are doing right now; now a little about the future. First, examples of things familiar to linguists that are not yet used in search, or not used to the full. For example, sentiment analysis: we have the text of a review or a blog post, and we need to understand whether the review is positive or negative and which advantages and drawbacks it mentions. If we learn to do this, we will be able to show users that there were, say, forty reviews of the product they are interested in, of which so many were positive and so many negative, that so many people noted such-and-such a flaw, and to give them the opportunity to read exactly the reviews describing that particular flaw. This is not rocket science, but in search it has not been done yet. It needs to be done; it will benefit people.

    Another thing is fact extraction. We would like to understand queries like [who was the first to fly across the Atlantic] or [height of Everest], find complete short answers to them and show them to the user in a convenient form. There is the American Watson system, which so far does this better than anyone and even defeated human champions at Jeopardy!, the American original of "Svoya Igra". All search engines have the beginnings of this, but Wikipedia remains their main source of such information.

    Another topic is monitoring mentions of yourself or your company on the Internet. Those who do this want to receive in response not a continuous stream of messages full of garbage and spam, but more structured information. There is corresponding software on the market, but nobody has completely solved this task yet, although it does not look incredibly complex.

    Now, what can search and a search company give linguistics? Here I must first dispel some misconceptions and say what we really do not have. We do not have, and cannot have, accurate statistics on how often a particular word or expression is used on the Internet or in blogs. From time to time, including in the linguistic community, one comes across arguments based on the number of results reported by a search engine, but in reality these numbers mean little; at best they are accurate to within a factor of 10. They are not even the number of documents the engine can actually show you: try any query and look at the documents on the hundredth results page. The search engine will refuse to show you more than about a thousand documents and will ask you to refine your query, mainly because the reported total is only a rough estimate made during query processing, not a real count.

    Now about what search really can give linguistics. There is a language that people use when asking search queries. You can see examples of frequent "utterances" in this language in the search suggestions that appear when you start typing a query. What is interesting about it? First, this language has its own grammar: for example, one word order of the phrase "listen to music online" is perfectly acceptable (it is a frequent query), while a different order of the same words is never asked at all, even though from the point of view of literary Russian that second order is the more correct one.

    That is, the query language is a small natural language, and it is not simply a part of Russian. A detailed account of it is given in my previous post; here I will state only the main ideas directly related to today's topic. This new language split off from Russian recently and consists mostly of the same words (although some of them are used in a very particular way: for example, the word "download" in the query language is not a verb but some special part of speech that has no analogue in Russian). The query language develops naturally. People learn it, and this process is in many ways similar to learning "real" natural languages, although it is mediated by a computer. A person learns to produce utterances in this language in the same way: examples of other people's utterances are available to him, he can reuse the most successful constructions, and he meets with communicative successes and communicative failures.

    The query language is natural in the sense that it passes many tests; for example, it obeys Zipf's law and other laws of natural languages. And what is, I am not afraid of the word, stunning for linguists: this language has its own grammar. It is very simple, there is almost no recursion, but it is still a grammar. You can even build an automatic query generator in the spirit of the famous phrase "colorless green ideas sleep furiously". The generated queries will be grammatically correct, although meaningless: [American brooms buy in China] and so on.
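    The Zipf test mentioned above is easy to illustrate: count how often each query occurs in a log and check that frequency falls off roughly as one over rank. The toy "log" below is constructed for illustration; the interesting results, of course, come only from real logs with billions of queries.

```python
from collections import Counter

query_log = (
    ["download music"] * 500 + ["weather"] * 250 + ["news"] * 170 +
    ["translate"] * 125 + ["movies online"] * 100 + ["recipes"] * 85 +
    ["horoscope"] * 70 + ["maps"] * 60
)

counts = Counter(query_log).most_common()
for rank, (query, freq) in enumerate(counts, start=1):
    # Under Zipf's law, rank * frequency stays roughly constant,
    # i.e. log(frequency) is close to linear in log(rank).
    print(f"{rank:>2}  {query:<15} freq={freq:<4} rank*freq={rank * freq}")
```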

    This is a real language that has arisen and is developing before our eyes, and, importantly, we know an almost complete set of all the utterances ever produced in it. This is a unique property that no other natural language has. Moreover, we know and can compare the frequencies with which particular utterances are produced, and how these frequencies change with the time of day, the season, the region. Such a language is excellent material for all kinds of research. There are obvious obstacles here. Users' privacy must be respected: even given that all queries are anonymous, stronger guarantees are needed that no information about individual users can be extracted from such a large number of queries. One must also take into account the difference in mentality between science and industry that I spoke about at the very beginning.

    I hope that science will keep moving forward, as slowly and as unstoppably as ever.
