
Fact extraction. Synonymy and homonymy
This post grew out of a conversation with one naive person and out of my own reflections on a subject as complex and controversial as language (in this case, Russian).
About the conversation: the gist was that this person (let's call him Someone) claimed that extracting facts from natural-language text is a fairly simple and easy thing. Supposedly, we look for verbs (words ending in “um / ute / él / ...”) and adjacent nouns (words longer than 4 letters), compose triplets, and load them into an ontology database, and there is your fact-extraction engine.
On my own private scale of intelligence, this person immediately received one of the lowest scores, but the exchange made me think about some aspects of how information is represented in natural language and the difficulties that arise when extracting information from it.
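Taken literally, Someone's recipe might look like the sketch below. It is only an illustration: the function name `naive_triplets` is made up, and because the original list of verb endings is elided, the English suffixes used here are stand-in assumptions chosen just for the demo.

```python
import re

# Hypothetical verb endings: the list in the conversation is elided ("..."),
# so these English suffixes are stand-ins used only for this demo.
VERB_ENDINGS = ("ed", "es")
MIN_NOUN_LENGTH = 5  # "words longer than 4 letters"


def naive_triplets(text):
    """Return (noun, verb, noun) triplets found by the naive heuristic."""
    words = re.findall(r"[a-z]+", text.lower())
    triplets = []
    for i in range(1, len(words) - 1):
        left, verb, right = words[i - 1], words[i], words[i + 1]
        # A "verb" is any word with a listed ending; a "noun" is any
        # neighbouring word that is long enough. Nothing else is checked.
        if (verb.endswith(VERB_ENDINGS)
                and len(left) >= MIN_NOUN_LENGTH
                and len(right) >= MIN_NOUN_LENGTH):
            triplets.append((left, verb, right))
    return triplets


print(naive_triplets("Workers smelted steel"))
print(naive_triplets("Tired horses wanted water"))
```

The first sentence happens to give a sensible triplet, while the second also yields the spurious ("tired", "horses", "wanted"): the heuristic has no way to tell a plural noun from a verb, which is exactly the kind of ambiguity discussed in the rest of this post.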
Today we will talk about synonymy and homonymy.
Synonymy
Synonymy is the property of a language (here, Russian) whereby the same meaning can be expressed in different ways. For example, Russian has two different words that both mean “cavalry” (morphological synonymy), and the meaning of the sentence “Smith was unable to translate this text just because there were many special terms in it” can be expressed by more than a million synonymous paraphrases (syntactic synonymy)! Indeed, “was unable to = could not = did not manage to = failed to ...”, “just = only = exclusively = solely = ...”, “because = since = due to the fact that ...”, and so on. Each of these positions offers several alternatives, and their direct (Cartesian) product is huge: an n-dimensional set of variants.
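To get a feel for how quickly the variants multiply, here is a minimal sketch that treats the sentence as a template with interchangeable slots. The slot contents and the names `TEMPLATE`, `SLOTS`, `count_paraphrases` and `generate_paraphrases` are illustrative assumptions, not a real synonym dictionary.

```python
from itertools import product
from math import prod

# Sentence template with three interchangeable positions (illustrative only).
TEMPLATE = "Smith {} translate this text {} {} there were many special terms in it"

# Hand-picked alternatives for each position; a real system would take
# them from a synonym dictionary.
SLOTS = [
    ["was unable to", "could not", "did not manage to", "failed to"],
    ["just", "only", "exclusively", "solely"],
    ["because", "since", "due to the fact that"],
]


def count_paraphrases(slots):
    """Size of the Cartesian product: the product of the slot sizes."""
    return prod(len(options) for options in slots)


def generate_paraphrases(template, slots):
    """Materialise every combination of slot choices as a sentence."""
    return [template.format(*choice) for choice in product(*slots)]


print(count_paraphrases(SLOTS))                      # 4 * 4 * 3 = 48
print(generate_paraphrases(TEMPLATE, SLOTS)[0])
print(count_paraphrases([["a", "b", "c", "d"]] * 10))  # 4 ** 10 = 1048576
```

Three slots already give 48 sentences; with ten positions of four alternatives each, the count passes one million, so a fact extractor cannot simply enumerate surface forms.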
Homonymy
Homonymy, in contrast to synonymy, hides several, sometimes opposite, meanings behind the same word (morphological homonymy) or expression (syntactic homonymy). For example, the Russian word form “стали” is both a case form of the noun “сталь” (steel) and the plural past tense of the verb “стать” (to become), so it appears both in “Workers smelted a lot of steel per shift” and in “Over the summer the children became stronger”, with completely different meanings and roles in the sentence. Syntactic homonymy is easily demonstrated by the Russian statement rendered literally as “Husband must not be changed”, which can be read either as “one must not be unfaithful to a husband” or as “a husband must not be unfaithful”. A more complex example, which everyone studies at school, is “He brought the fruits of scholarship from Germany foggy” (A.S. Pushkin). In the Russian original the adjective “foggy” agrees grammatically with both nouns, so the line can be about “foggy Germany” (this is how most people read it, but is Germany really considered a foggy country?) or about “foggy scholarship” (and that Lensky's learning was foggy is hardly in doubt).
We must not forget another subspecies of homonymy: polysemy. This is when one and the same word (rather than different words whose forms happen to coincide in spelling and pronunciation, as with “steel”) has several meanings. Take the word “nose”: “the nose of the boat stuck into the sandy shore” and “Vasya's nose is full of boogers”. A person easily understands which meaning is intended, but what about a computer?
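One crude way for a program to guess the intended meaning is to compare the words around “nose” with cue words for each sense. The sketch below is a toy illustration of that idea only; the cue lists and the names `SENSE_CUES` and `guess_sense` are hand-made assumptions, not an established resource or algorithm from the literature.

```python
import re

# Hand-made cue words for two senses of "nose"; purely illustrative.
SENSE_CUES = {
    "bow of a vessel": {"boat", "ship", "shore", "sandy", "stern", "deck"},
    "part of the face": {"vasya", "boogers", "face", "sneeze", "runny", "smell"},
}


def guess_sense(sentence, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no cue word occurs at all.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_sense("The nose of the boat stuck into the sandy shore"))
print(guess_sense("Vasya's nose is full of boogers"))
print(guess_sense("He has a big nose"))
```

The first two sentences are resolved by their neighbours, while the third contains no cue word and stays undecided, which shows how much such an approach depends on the coverage of its dictionaries.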
Methods for resolving homonymy have been developed and refined for a very long time, and each has its pros and cons. They include hidden Markov models, dependency trees, context analysis, set phrases, collocation dictionaries, and more. Unfortunately, a detailed (or even approximate) description of them does not fit within the scope of this article, so I will leave it for next time.
Literature:
- Gladky A.V. Syntactic Structures of Natural Language in Automated Communication Systems. Moscow: Nauka, 1985.