New Year's dataset: open semantics of the Russian language
New Year is a time of miracles and gifts. The main miracle that nature has given us, of course, is natural language and human speech. And we, in turn, want to make a New Year's gift to all researchers of this phenomenon and share a dataset on the open semantics of the Russian language.
In this article we allow ourselves a short discussion of meaning, explain how we arrived at the need for open semantic markup, and describe the current results and future directions of this large project. And, of course, we give a link to the dataset, which you can download and use in your own experiments and research.
TL;DR
In the Habr article "Teach the bot! - marking up emotions and semantics of the Russian language" we described the start of a large effort to create open semantic markup for the Russian language. We now have the first results and want to share them with the community.
First of all, we focused on marking up objects of the material world, as well as the emotional-evaluative coloring of Russian words and expressions. These are the two areas of semantics that are most valuable in practice and most tractable to mark up.
Link to GitHub: open semantics of the Russian language (dataset).
About semantics and meanings
Semantics, the science of meaning, is generally regarded as one of the most difficult branches of linguistics. This is not surprising, given that even the concept of meaning is hard to define. (Try explaining in plain words what meaning is.)
The text that we analyze with computational methods is devoid of these very meanings. That is, the text sets a certain outline, but the actual meaning materializes only when a person reads the text and the brain forms a mental image or mental scene of what is written.
The significance of a word is a fiction, an empty phrase, if it is not based on something that is not just a relation. We can talk about the meaning of the word “mouton” (French for “ram” and “lamb”), on the one hand, and “ram” and “lamb”, on the other, only insofar as we know what, in fact, we are talking about, that is, to which segment of extra-linguistic reality these words belong.
Morkovkin V.V. Ideographic dictionaries. Moscow: Moscow State University Press, 1970.
This is an insurmountable difficulty for machines: they have no interpreter of human language of their own and can effectively solve only those tasks that need no interpretation and can be handled at the level of the text and the statistics computed over it.
NB Strictly speaking, when you apply machine learning to a labeled corpus of texts (labeled by sentiment, for example), the markup is your semantics for that particular task. The problem is the low resolution of such markup and the inability to generalize the acquired "knowledge" and apply it to a fundamentally new task.
Let us illustrate. To a computer, our language looks like this:
Ezo Kmetdzafpäez I have grown and lemagiruyu dvozhlodz, Ph.D. Granted oligophilia reme le ovataza ilzemkmezazomor hepofehedno yagina and erysipelas ebbensiflo mestiza food ze gatakhi, nozomye lezmuyuz and zemzmezmezatsii, and meschaera la umofle zendza and fykhidpellich kofzemzino.
Ego klebshgafryaeg brya nasima melaslesyny shrokhmoshg, g.t. Shana Omi mi fotoy nele me odrabaeg imgelklegagolon zherofegeshtopo yachty and noheg evvetgifmo forest be ge mesmerizing, toogo meleduyug imgelklegatsii, and forestry maulofme getshga and fyzhishremmiz kofelzgepo mep.
(Several variants of the paragraph above, in which vowels are preserved and consonants are shuffled within groups of similar letters.)
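The scrambling described in the note above can be sketched roughly as follows. This is only an illustration: the consonant groups, the function names and the choice of one fixed permutation per variant are assumptions, since the article does not say which groups or procedure were actually used.

```python
import random

# Hypothetical groups of "similar" consonants; the article does not list
# the groups that were actually used.
CONSONANT_GROUPS = ["бпвф", "дтзс", "гкх", "жшщчц", "лмнр"]


def build_substitution(groups, rng):
    """Build a letter-to-letter map that permutes consonants within each group."""
    mapping = {}
    for group in groups:
        shuffled = list(group)
        rng.shuffle(shuffled)
        for src, dst in zip(group, shuffled):
            mapping[src] = dst
            mapping[src.upper()] = dst.upper()
    return mapping


def scramble(text, seed=0):
    """Swap consonants for group-mates; vowels, й, ь, ъ and punctuation stay."""
    mapping = build_substitution(CONSONANT_GROUPS, random.Random(seed))
    return "".join(mapping.get(ch, ch) for ch in text)


if __name__ == "__main__":
    source = "Это представляет для машины неразрешимую сложность"
    for seed in (1, 2):
        print(scramble(source, seed=seed))
```

Different seeds give different variants of the same text. Word boundaries, word lengths and the vowel skeleton survive, so the result still looks like language while carrying no meaning for a human reader.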
From such a fragment you can compute various statistics: co-occurrence, n-grams, systems of endings, and so on. You can build an algorithm that uses these statistics to imitate a question-answering system, that is, to find the sentences in the text most similar to the question, or even to assemble an answer from several fragments. Given a large amount of data and a well-built model, such a system can mimic a person quite convincingly.
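As a minimal sketch of what "find the sentences most similar to the question" can mean at the purely statistical level, the snippet below ranks sentences by bag-of-words cosine similarity. The function names and the choice of cosine similarity are our assumptions for illustration; the article does not prescribe a particular method.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_sentence(question, sentences):
    """Return the sentence whose word counts are closest to the question's."""
    q = Counter(tokens(question))
    return max(sentences, key=lambda s: cosine(q, Counter(tokens(s))))


if __name__ == "__main__":
    corpus = [
        "The cat sleeps on the sofa.",
        "Snow fell in Moscow yesterday.",
        "The tomato is red.",
    ]
    print(best_sentence("Where does the cat sleep?", corpus))
```

Nothing here depends on what the words mean: the same code would work equally well on a scrambled text, because it only compares strings.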
But real work with meaning, which requires extralinguistic knowledge of the world (for example, answering questions whose answers must be inferred), is hardly feasible in a purely statistical paradigm.
The essence of our work is to create a simplified model of the world and to mark up the language in terms of this model. That is, we try to anchor elements of the language to extra-linguistic reality.
NB In fairness, people became interested in grouping vocabulary by semantic similarity as far back as Ancient Rome. If you are interested in the history of the question, we recommend V. Morkovkin's book “Ideographic Dictionaries”, whose second chapter gives a detailed historical overview.
What we do: philosophy
The worlds in which humans live are incredibly complex and diverse, especially the world that exists in our heads: emotions, feelings and experiences, abstract concepts, creativity, ethics and morality.
Whole teams of eminent scholars have been working on the semantics of the intangible spheres for many years. We deliberately stay out of these areas. For now. More precisely, we will get to them, but not in such detail and not as a first priority.
Our main focus is on the material world and on the small part of the intangible world that concerns evaluations and emotions. First, most applications of NLU (natural language understanding) lie in these areas, so they are the most interesting from a practical point of view. Second, you need to start with something simpler and less ambiguous, and in this light the choice of the material sphere is well justified.
The sphere of emotions certainly belongs to the intangible world, but it is hard to find a more important aspect of the human psyche. Moreover, it is directly tied to a useful practical task: sentiment (tonality) analysis of text. In addition, written language conveys very little information about emotions. For example, words with opposite emotional polarity often occur in nearly identical contexts, so purely statistical methods cannot tell positively charged words from negatively charged ones.
What we do: specifics
We divide all words into two large classes: physical objects / phenomena and everything else. We set the latter aside for now; it is a secondary concern.
We divide physical entities into four large classes: living things, places, objects and substances.
Weather and food occupy a somewhat separate place in the human mind: they do not fit neatly into any of the classes above. Accordingly, it makes sense to mark them up separately.
The second major part of the first stage of our work is marking up the emotional-evaluative component of linguistic signs. Here all entities (tangible and intangible) are divided into three classes: positive, negative and neutral. For the polar classes, the strength of the charge is also estimated. Evaluations, however, deserve a long separate discussion: they are ephemeral and elusive, yet even here human ingenuity finds a way.
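The classification described in this section could be represented in code roughly as below. This is a sketch, not the dataset's actual schema: the type names, the treatment of weather and food as classes of their own, the 0..1 strength scale and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PhysicalClass(Enum):
    LIVING = "living thing"
    PLACE = "place"
    OBJECT = "object"
    SUBSTANCE = "substance"
    # Marked up separately in the article; modelled here as extra classes.
    WEATHER = "weather"
    FOOD = "food"


class Polarity(Enum):
    POSITIVE = 1
    NEUTRAL = 0
    NEGATIVE = -1


@dataclass(frozen=True)
class Entry:
    word: str
    physical: Optional[PhysicalClass]  # None means "everything else"
    polarity: Polarity
    strength: float = 0.0  # hypothetical 0..1 scale, polar classes only

    def __post_init__(self):
        if not 0.0 <= self.strength <= 1.0:
            raise ValueError("strength must lie in [0, 1]")
        if self.polarity is Polarity.NEUTRAL and self.strength != 0.0:
            raise ValueError("neutral entries carry no charge strength")


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the dataset.
    print(Entry("подарок", PhysicalClass.OBJECT, Polarity.POSITIVE, 0.7))
    print(Entry("арбуз", PhysicalClass.FOOD, Polarity.NEUTRAL))
```

Keeping the physical class and the polarity as two independent fields mirrors the text: the evaluative markup covers both tangible and intangible entities, while the physical class applies only to the former.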
Two key principles (assumptions)
The two key principles we follow when marking up are a naive picture of the world and the rejection of context.
The world around us may change depending on what we know about it. More precisely, the world most likely stays the same, but our perception of it, and with it our system for classifying objects and phenomena, is flexible. For example, we are surprised to learn that biologically a watermelon is classified as a berry, although it hardly seems like one, being so huge. And in a number of scientific systems a tomato is a fruit, which matches neither our everyday view of it nor the way produce is laid out in a grocery store. Nevertheless, it is precisely the everyday, or naive, picture of the world that we want to capture.
The second important markup principle is the rejection of context. Language units are considered apart from the flow of speech and their natural environment, in a certain average, most frequent and most obvious sense. Sometimes this backfires. For example, the word “minus” can be completely neutral when read as an arithmetic operation, but as a synonym for “drawback” it acquires a negative connotation. In general, though, if you build your system competently and respect the laws of statistics, such rough edges should be smoothed out at the level of machine learning methods.
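One way to read "the most frequent and most obvious sense" is sketched below with the “minus” example. The sense frequencies are invented for illustration, and both aggregation rules (dominant sense versus frequency-weighted mean) are our assumptions rather than the procedure the annotators followed.

```python
from collections import namedtuple

# polarity: -1 negative, 0 neutral, +1 positive; frequency: share of uses.
Sense = namedtuple("Sense", "gloss polarity frequency")


def dominant_polarity(senses):
    """Context-free label: the polarity of the most frequent sense."""
    return max(senses, key=lambda s: s.frequency).polarity


def mean_polarity(senses):
    """Alternative: frequency-weighted average polarity over all senses."""
    total = sum(s.frequency for s in senses)
    return sum(s.polarity * s.frequency for s in senses) / total


if __name__ == "__main__":
    # Hypothetical frequencies, not measured on any corpus.
    minus = [
        Sense("arithmetic operation", 0, 0.6),
        Sense("drawback", -1, 0.4),
    ]
    print(dominant_polarity(minus))
    print(mean_polarity(minus))
```

With the dominant-sense rule the minority reading is lost entirely, while the weighted mean keeps a trace of it; either way a single context-free value has to stand in for all uses of the word.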
Rejecting context is, at the very least, a debatable decision. But at the first stage it was important for three reasons. First, taking context into account greatly increases the complexity and volume of the markup, with far from obvious benefits. Second, there is the perennial question of how to record context in a machine-readable form and how to tie a token to a specific sense when the data is used. Third, each sense of a word has its own frequency of use, which also varies from topic to topic. Explanatory dictionaries record this parameter only with the terse label “rare” for seldom-used senses, and it is completely inaccessible to a machine.
Our decision, which we made more as engineers than as scientists, was borne out by the first experiments: in most cases machine learning methods really can compensate for averaging the labels over different contexts.
Additional markup
At the first stage of the work we tried to cover what seem to us the most practically important areas: the material world and the emotional-evaluative component of linguistic signs. In parallel with the main markup, we also launched several experimental slices that will help us plan future work more meaningfully:
- food and drink;
- dishes and household appliances;
- weather and times of day;
- verbs of movement;
- thesaurus, i.e. relations between words and expressions.
We will not go into details for now; a more detailed description is in the repository.
Future plans
In the very near future, we plan to launch work in the following areas:
- division of verbs into material and intangible;
- detailed marking of living entities: plants, animals, people;
- topology: the ability of a physical object to be a container, flat surfaces, spatial relationships;
- buildings and constructions;
- substances: solid, liquid, gaseous;
- colors.
But our world is not limited to the tangible sphere, so our longer-term plans include:
- marking up intangible entities: events, processes, feelings / emotions / experiences, meronymy, properties, actions, interactions, text and information, sports and mass events;
- grading material entities by size, softness (ability to resist external force) and perceived temperature.
Experiment ideas, or what can be done with the dataset
As usual, we share not only the data but also ideas for experiments and research directions that seem to us worth attention.
- SUPER HOT! build a tonality analyzer that does not require pre-labeled data;
- train a semantic markup system, for example a classifier of the materiality / immateriality of nouns based on the surrounding context;
- explore the semantic constructions inherent in humor, sarcasm and irony;
- teach a computer to automatically detect contexts where a word is used in a figurative sense;
- conduct a study of the style of famous authors in terms of using emotionally colored words.
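As a starting point for the first idea in the list above, here is a minimal lexicon-based tonality scorer that needs no labeled training texts. The lexicon entries and scores are invented placeholders, not values from the dataset, and the function name is hypothetical. The sketch also skips lemmatization, which real Russian text would need because of inflection.

```python
import re

# Hypothetical excerpt of a polarity lexicon: word -> score in [-1, 1].
# In practice this would be loaded from the dataset.
LEXICON = {
    "подарок": 0.7,
    "чудо": 0.9,
    "недостаток": -0.6,
    "минус": -0.3,
}


def tonality(text, lexicon=LEXICON):
    """Average the scores of known words; 0.0 if no word is in the lexicon."""
    words = re.findall(r"\w+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(tonality("Новый год - это чудо и подарок"))
    print(tonality("У этого решения есть один недостаток"))
```

Averaging context-free scores is the simplest possible aggregation; negation, intensifiers and figurative use are exactly the cases where it breaks down and where the other ideas in the list become relevant.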
Interesting results can be obtained by combining the semantics dataset with the associations dataset. We already do this to refine the tonality markup; the associations dataset sits alongside it in the same repository, at the link.
A use case from everyone: the dataset benefits
Think back and describe in the comments any case where you needed explicit semantic markup in your work but had none at hand. This will give us valuable food for thought about how to develop the dataset further.
Download link and license
Dataset: open semantics of the Russian language
The dataset is licensed under CC BY-NC-SA 4.0.