Latent semantic analysis and artificial intelligence (LSA and AI)

I would like this post to be more philosophical than mathematical (more precisely, algebraic): not what kind of terrible beast LSA is, but what use it can be to “our collective farm”, that is, to AI.

It is no secret that AI consists of many disjoint or weakly overlapping areas: pattern recognition, speech recognition, motor control in space, and so on. But one of the main goals of AI is to teach hardware to think, which includes not only processes of understanding but also the generation of new information: free, or creative, thinking. In this regard, the questions that arise are not so much about developing methods for training systems as about understanding the processes of thinking and whether they can be implemented.

As mentioned at the beginning of the article, I will not dwell here on the basics of how LSA works (I plan to cover that in the next post); for now I refer the reader to Wikipedia, preferably the English version (LSA). But I will try to describe the main idea of the method in words.

LSA is used to identify latent (hidden) associative-semantic relationships between terms (words, n-grams) by reducing the dimensionality of the term-by-document factor space. Terms can be single words or word combinations, the so-called n-grams; documents are, ideally, sets of thematically homogeneous texts, or simply any sufficiently large text (several million word forms) split arbitrarily into pieces, for example paragraphs.

In plain terms:
The main idea of latent semantic analysis is as follows: if, in the original space made up of word vectors (vector = sentence, paragraph, document, etc.), no dependence can be observed between two words from two different vectors, then after a certain algebraic transformation of that vector space the dependence may appear, and its magnitude determines the strength of the associative-semantic connection between the two words.
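The term-by-document space described above can be sketched in a few lines. This is only a minimal illustration, not the method of any particular library: the helper names `ngrams` and `term_document_matrix`, the whitespace tokenization, and the choice of unigrams plus bigrams are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_document_matrix(paragraphs, max_n=2):
    """Count terms (1-grams up to max_n-grams) in each paragraph.

    Returns (vocab, matrix), where matrix[i][j] is how many times
    term vocab[i] occurs in paragraph j. Tokenization is a naive
    lowercase whitespace split (an assumption of this sketch).
    """
    counts = []
    for paragraph in paragraphs:
        tokens = paragraph.lower().split()
        terms = []
        for n in range(1, max_n + 1):
            terms.extend(ngrams(tokens, n))
        counts.append(Counter(terms))
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


# A large text "arbitrarily broken into pieces": here, split on blank lines.
text = "the phone has a battery\n\nthe battery is good"
vocab, matrix = term_document_matrix(text.split("\n\n"))
```

Each row of `matrix` is a term and each column a paragraph; it is this matrix that LSA then transforms.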

For example, consider two simple messages from different sources (the example is purely illustrative):
  • Source 1, advertising: “This wonderful XXX phone has a powerful accumulator!”
  • Source 2, blogs: “By the way, the XXX device has a good battery.”

Since the vocabularies of blogs and advertising overlap little, the words “accumulator” and “battery” will receive different weights: say, the first small and the second, on the contrary, large. These messages can then be linked only by the name “XXX” (a strong criterion), while the detail about the battery (let's call it a weak criterion) is lost.
However, if we apply LSA, the weights of “accumulator” and “battery” even out, and the messages can also be linked by the weak criterion, which is in fact the one that matters most for the product.
Thus, LSA “pulls together” words that differ in spelling but are close in meaning.
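The “pulling together” can be reproduced on a toy corpus. The sketch below assumes NumPy and uses plain SVD with the top `k = 2` factors kept; the four tiny documents, the variable names and the value of `k` are invented for illustration and are not taken from any real data.

```python
import numpy as np

# Toy corpus: "accumulator" and "battery" never occur in the same document,
# but they share the context words "xxx" and "phone".
docs = [
    "xxx phone accumulator".split(),
    "xxx phone battery".split(),
    "pizza oven".split(),
    "pizza cheese".split(),
]
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

# Term-by-document count matrix: rows are terms, columns are documents.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        A[index[w], j] += 1.0


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# SVD, then keep only the k largest singular values (the reduced space).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]  # term coordinates in the reduced space

raw_sim = cosine(A[index["accumulator"]], A[index["battery"]])
lsa_sim = cosine(term_vecs[index["accumulator"]], term_vecs[index["battery"]])
print(f"raw: {raw_sim:.2f}, after LSA: {lsa_sim:.2f}")
```

In the raw matrix the two words look unrelated, because their rows share no document; in the reduced space their vectors nearly coincide, since both words occur only alongside “xxx” and “phone”.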

The question is why this is needed, and what associative-semantic connections have to do with AI. Let us turn to history.

One of the questions posed by great thinkers since the time of Plato is that of our ability to know the world. In the 20th century, the famous American linguist Noam Chomsky formulated the so-called Plato's problem: why does an individual know so much more than he could have learned from everyday experience? In other words, how can information obtained from a sequence of events with relatively little variability be used correctly and adapted to a potentially infinite number of situations?

For example, a child's vocabulary grows by 3-8 words a day on average. Moreover, as linguists put it, a denotatum does not always have a strictly defined referent of its own; in plain language, not every word corresponds to a really existing thing or a performed action (for example, abstract concepts, words that carry no informational load, etc.).
The question arises: how does a child determine each new meaning of a word and its relationship to other meanings, or why do new “meanings” (denotata) form, and how do they relate to each other?

The work of “semantic” mechanisms can conceptually be compared with the processes of categorization or clustering. With this approach, the problem arises of determining the initial concepts or primary clusters, their boundaries and their number.

LSA, its variants (PLSA, GLSA) and related methods (LDA, the well-known latent Dirichlet allocation) make it possible to model associative-semantic relationships between words. On the one hand, this lets us abandon the rigid binding of a lexical unit to any one cluster; on the other, it presents a holistic system of connections between words.

This means that the words in our brains are not classified by concept (they do not lie on cluster shelves), but form a complex system of connections among themselves, and these connections can change dynamically depending on many things: context, emotions, knowledge of the outside world, and so on. Algorithms like LSA give us the opportunity to simulate the simplest elements of “understanding”. But, some will object, how can one prove that the brain works on the LSA principle? Most likely one cannot, and there is no need to: airplanes fly too, yet they do not flap their wings. LSA is only one of the methods that make it possible to simulate the simplest systems of “thinking”, both for practical purposes (intelligent systems) and for further research into human cognitive functions.

An obvious drawback of LSA is that the probability distribution of words in any natural language is non-normal (non-Gaussian). This can be addressed by smoothing the sample (for example, by using phonetic words, which makes the distribution more “normal”), or by using probabilistic LSA, the so-called PLSA, which is based on a multinomial distribution.
Other, less obvious drawbacks of LSA (and similar methods) when applied to unstructured information include the vagueness of the method itself (in particular, the choice of how many singular values of the diagonal matrix to keep) and of the interpretation of the result, not to mention the problem of balancing the training text.

As a rule, fewer than 1-2 percent of the diagonal values (obtained after SVD, but more on that in the next post) are kept to build a good model. And, as practice shows, increasing the number of factors degrades the result. Yet at around 10 percent of the diagonal values there can again be a peak comparable to the result obtained at 1 percent.
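Keeping a fixed share of the diagonal values might look like the sketch below. It assumes NumPy; the function name `truncate_svd` and its `fraction` parameter are hypothetical, introduced only for this illustration, and the random matrix merely stands in for a real term-by-document matrix.

```python
import numpy as np


def truncate_svd(A, fraction=0.02):
    """Keep roughly the given share of the largest singular values of A.

    Returns (U_k, s_k, Vt_k); (U_k * s_k) @ Vt_k is the rank-k
    approximation of A. At least one factor is always kept.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = max(1, int(round(fraction * len(s))))
    return U[:, :k], s[:k], Vt[:k, :]


# Stand-in data: a random 50 x 100 matrix instead of real term counts.
rng = np.random.default_rng(0)
A = rng.random((50, 100))

U_k, s_k, Vt_k = truncate_svd(A, fraction=0.1)  # keeps 5 of 50 values
A_k = (U_k * s_k) @ Vt_k                        # reduced-rank approximation
```

Since the best share is found empirically, in practice one would try several values of `fraction` (for example around 0.01 and 0.1) and compare the quality of the resulting models on the task at hand.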

Corpus balance is an eternal problem that has no good solution to date. It is therefore customary to keep silent about it.

Interpreting the results of LSA (as well as LDA) is also difficult: a person can still tell what subject a topic obtained from the analysis covers, but a machine cannot understand (or annotate) the topic without drawing on a large number of good and varied thesauri.

Thus, despite the complexity and opacity of LSA, it can be used successfully for a variety of tasks where it is important to capture the semantics of a message, or to generalize or expand the “meanings” of a search query.

Since this post was about the idea (why is this needed?), I would like to devote the next one to practical matters (how does it work?).

1. Landauer T.K., Dumais S.T. A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge // Psychological Review. 1997. Vol. 104. P. 211-240.
2. Landauer T.K., Foltz P., Laham D. An Introduction to Latent Semantic Analysis // Discourse Processes. 1998. Vol. 25. P. 259-284.
3. Readings in Latent Semantic Analysis for Cognitive Science and Education. - A collection of articles and links about LSA.
4. - site dedicated to the modeling of LSA.
