Teach the bot! - markup of emotions and semantics of the Russian language

    Prospects of a bright robotic future are pouring in on us from all sides. Or of a not-so-bright one, in the spirit of The Matrix and The Terminator. Machines already cope confidently with translation, recognize faces and everyday objects no worse than we do and much faster, and are learning to understand and synthesize speech. Cool? That's putting it mildly!

    But the matter is seriously complicated by the fact that computers still have not learned to find their way around our world. Everything they do so well, they do by analogy, without getting to the essence or burdening themselves with the meaning of what is happening. Maybe that's for the best - we'll live longer without being enslaved by a soulless tribe of machines.

    But curiosity pushes us toward risky steps, namely attempts to introduce the computer to our world, including the inner one - feelings, emotions and experiences.

    Read on to find out how we plan to level up machine consciousness and teach machines emotions, feelings and value judgments, and where you can freely download the marked-up data.

    I do not want to read, show the result!


    You can immediately try to train the bot at the link: Teach the bot!

    If you enjoy answering, create your own Card so that your results are saved.

    Limitations of distributional semantics


    meme about distribution semantics, word2vec, robot, coffee

    What, exactly, is the problem with computer understanding of text? After all, a machine can study the entire written cultural heritage and learn everything from it. The output of word2vec speaks louder than words.

    For the token “man”:
    woman 0.650
    married 0.594
    middle-aged 0.542
    anti-man 0.538
    ...
    pregnant 0.519
    nulliparous 0.516
    girl 0.498
    ...

    Or for the word "hot":
    warm 0.510
    ...
    cold 0.498
    cool 0.486
    hot 0.467
    ...

    And for a strongly positive emotion, "delight":
    admiration 0.715
    ...
    indignation 0.609
    rage 0.597
    horror 0.586
    despair 0.584
    ...
    awe 0.531
    confusion 0.523
    perplexity 0.522
    ...
    rage 0.472
    ...

    Or for the broad concept of "technique":
    ...
    technology 0.569
    art 0.451
    skill 0.410
    ...
    aircraft construction 0.393
    industry 0.392
    medicine 0.379
    craft 0.375
    ...
    industry 0.370
    ...
    knowledge 0.360
    science 0.358
    ...

    These examples clearly show how much information context provides. A lot, but clearly not enough to separate antonyms, part-whole and general-specific relations, or to distinguish vertical links from horizontal ones.
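
    To make the effect concrete, here is a minimal sketch of how such neighbour lists are typically ranked: by cosine similarity between word vectors. The three-dimensional vectors below are invented purely for illustration (they are not taken from any real model), and the helper names cosine and nearest are hypothetical. The point is only that a word sharing the same contexts, including an antonym, ends up with a high score.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only;
# real word2vec vectors have hundreds of dimensions.
TOY_VECTORS = {
    "hot": [0.9, 0.8, 0.1],
    "warm": [0.8, 0.9, 0.2],
    "cold": [0.7, 0.9, 0.3],
    "science": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, top_n=3):
    """Return the top_n other words ranked by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


neighbours = nearest("hot", TOY_VECTORS)
for other, score in neighbours:
    print(f"{other} {score:.3f}")
```

    In this toy setup "cold" ranks right behind "warm", far ahead of "science": the similarity score says the words occur in similar contexts, but nothing about whether they are synonyms or opposites.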

    It is therefore quite reasonable that many researchers use thesauri alongside distributional semantics approaches (read: word2vec). For English, such a resource is WordNet; for Russian, RuThes (RuTez) and Wiktionary.

    The obvious is not so obvious


    meme about semantics, The Lion King

    Every researcher who makes a daring attempt to explain meanings to a machine sooner or later runs into the fact that the things that seem most trivial to us are far from obvious to a computer. Moreover, not a word is written about them even in children's books. We come to know many aspects of the world through our sense organs - through sight, hearing, smell, touch, taste and others.

    When we then communicate with each other, we pass along only an extremely brief, compressed description of a situation, which unfolds into a detailed picture in each individual head. Moreover, every person unfolds the situation differently, depending on personal experience, cultural background, character traits and worldview.

    Emotions, feelings, experiences


    Words and phrases carry much more meaning than is recorded in explanatory dictionaries. This is primarily due to such elusive, hard-to-pin-down properties as evaluation and the emotional coloring that accompanies it. For example, the phrase "heavy torment" carries a strong negative emotion, and the phrase "wild joy" a strong positive one. "Not a gift" is something negative, but not too much so. And "virtuoso", for example, has a fairly strong positive evaluation.

    The difficulty with recording such subtle characteristics of words is that they are extremely subjective and hard to formalize. Take the word "strategy": is it positive or neutral? One can only agree that it is not negative.

    Nevertheless, emotional and evaluative attributes are an integral part of linguistic units and play a rather important role in human communication. Therefore, if we want to make the machine more human and more pleasant to talk to, it too must be taught these subtle matters.



    What to do?


    Creating such a dictionary by hand would be extremely time-consuming, because we want to mark up not only words but also phrases. In addition, all ratings would be strongly tied to the subjective opinion of the researcher.

    Good news! We live in 2017 and have access to such wonderful technologies as the Internet and crowdsourcing. The latter lets us tackle both the labor intensity and the subjectivity of the ratings at once. Of course, this produces an "average temperature across the hospital" effect, but as a first approximation we allow ourselves to turn a blind eye to rough edges of this kind.
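
    As a rough illustration of how crowd answers can smooth out individual subjectivity, here is a minimal sketch of per-phrase averaging. Everything in it is an assumption made for the example: the answer format, the score scale from -2 to +2, the min_votes threshold and the function name aggregate are hypothetical and do not describe how the platform actually stores or processes answers.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical raw answers: (respondent_id, phrase, score), where score is
# an assumed scale from -2 (strongly negative) to +2 (strongly positive).
ANSWERS = [
    ("u1", "wild joy", 2), ("u2", "wild joy", 2), ("u3", "wild joy", 1),
    ("u1", "strategy", 0), ("u2", "strategy", 1), ("u3", "strategy", 0),
    ("u1", "heavy torment", -2), ("u2", "heavy torment", -2),
]


def aggregate(answers, min_votes=3):
    """Average scores per phrase; skip phrases with too few votes."""
    by_phrase = defaultdict(list)
    for _respondent, phrase, score in answers:
        by_phrase[phrase].append(score)

    result = {}
    for phrase, scores in by_phrase.items():
        if len(scores) < min_votes:
            continue  # not enough opinions to average yet
        result[phrase] = {
            "mean": mean(scores),      # the "average for the hospital"
            "spread": pstdev(scores),  # how much respondents disagree
            "votes": len(scores),
        }
    return result


summary = aggregate(ANSWERS)
for phrase, stats in summary.items():
    print(phrase, stats)
```

    Keeping the spread next to the mean is one simple way to flag contentious items such as "strategy", where respondents split between neutral and positive.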

    Teach the bot! - markup of emotions and semantics of the Russian language


    The idea is implemented on the Word Map language platform. Work will proceed in several directions:

    • Evaluation markup. The task is to mark up the words and expressions of the Russian language as positive / neutral / negative, along with how strongly that polarity is expressed.
    • Emotional markup. The task is to mark up emotionally colored words and expressions by polarity and by the strength of the emotional coloring.
    • Thesaurus markup. The task is to mark vertical and horizontal links between words and to assign semantic tags to words and expressions.
    • Experimental markup of relations according to the theory “Meaning ⇔ Text” proposed by I. A. Melchuk: MAGN (coffee) = strong coffee, MAGN (feeling) = strong feeling, etc.
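
    The four kinds of markup listed above could be held together in a single record per word or phrase. The sketch below is only one possible shape: the class name Entry, its field names, the 0..2 strength scale and the "drink" tag are all hypothetical, chosen for illustration rather than taken from the project's data format. The MAGN values are the ones given in the list.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One marked-up word or phrase (all field names are illustrative)."""
    text: str
    polarity: str = "neutral"  # "positive" | "neutral" | "negative"
    strength: int = 0          # assumed scale: 0 (none) .. 2 (strong)
    semantic_tags: set = field(default_factory=set)
    # Lexical functions in the "Meaning ⇔ Text" sense: name -> values
    lexical_functions: dict = field(default_factory=dict)


def apply_function(entry, name):
    """Look up a lexical function; an empty list means 'not marked up yet'."""
    return entry.lexical_functions.get(name, [])


coffee = Entry("coffee", semantic_tags={"drink"})  # tag is a made-up example
coffee.lexical_functions["MAGN"] = ["strong coffee"]

feeling = Entry("feeling")
feeling.lexical_functions["MAGN"] = ["strong feeling"]

torment = Entry("heavy torment", polarity="negative", strength=2)

print(apply_function(coffee, "MAGN"))
print(apply_function(torment, "MAGN"))
```

    Storing lexical functions as a mapping from function name to a list of values leaves room for several values per function and for functions other than MAGN.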

    To get the most out of human labor and keep the tasks interesting for respondents, we apply distributional semantics and machine learning approaches. As the basis for the system of semantic categories, we took the classification used in the Russian National Corpus (NKRYa).

    How to take part?




    An important goal of our initiative is to fill in the missing linguistic resources for the Russian language and make them open for use by researchers, linguists and practicing engineers. We expect that the markup data will lead to interesting studies, scientific papers and articles on Habr, as well as engineering products and open technologies.

    You can help the project in the following ways:

    • Participate in bot training. It is easy and fun, and it also lets you sharpen your linguistic awareness and notice interesting features of the Russian language.
    • Like, share, Alisher! Share links to the project on social networks, and tell people about it on your blog or website.
    • Constructive criticism helps us develop and keeps us from sinking into the swamp of our own illusions. Discussion is very important for adjusting course in time and creating a really useful resource. The only request: if you criticize, propose an alternative.
    • Semantics and cognitive linguistics. We are trying to deepen our understanding of modern approaches to semantics and to building such resources. We would welcome advice and recommendations on what to read, what to study and whom to consult.
    • Spreading the word. Your advice on where else we could talk about the project will come in handy - it could be your favorite tech blog, an online technology magazine, a group on VKontakte / Slack / Telegram, or something else.

    Open data


    Aggregated markup results will be open for download and licensed under CC BY-NC 4.0.

    We expect to obtain and publish the first results by mid-to-late July - everything will depend on how active the respondents are. So as not to miss anything, star and follow our GitHub repository:

    Open data on Word Map

    Where is the money, Zin?


    It is great to try combining crowdsourcing and crowdfunding in one project, which is what we did by launching a fundraising campaign on Planeta.ru:

    To teach a computer to understand our world and emotions



    Important. We are already working on the project and will see it through to a result on our own with the resources we have. The collected data, as promised, will be open and available to everyone. The only question is the timing and volume of the markup. We currently expect to reach a basic result (the 10,000 most frequent words) within three months; marking up the full volume will take about two years.

    Additional resources will help us get results significantly faster. We need help from developers to build and improve the markup system, add new semantic categories and conduct research. Funds are also needed to promote the project and run contests.

    You can donate any amount to the campaign. You will know that you have contributed to the overall success, and every ruble invested will be spent on a cool and useful cause.

    Do not forget that you can help the initiative without money. Like the project and talk about it on social networks - this is a very simple, completely free, yet very effective way to promote it.

    And remember ...
    The choice is always yours.


    Corporate sponsorship


    Do you represent an established business with an interest in the development of open linguistic data in Russia? Become a corporate sponsor of the project! You get a permanent graphic link from the project page, additional exposure to thousands of people and unearthly respect from the community.

    We will spend every ruble invested with incredible efficiency: for the cost of a few months' salary of a single programmer at a large company, we will deliver the entire project, whose results will be used by thousands of researchers, scientists and engineers.

    Commercial use


    For commercial use or business-specific markup, write to kartaslov@mail.ru or send a private message to the author of the article.

    Acknowledgments


    I would like to express my deep gratitude to the organizers and participants of Dialogue 2017, the 23rd International Conference on Computational Linguistics and Intellectual Technologies.

    It was in the hallway discussions at the event that the need for this kind of markup became clear, and a group of like-minded people came together to discuss experimental markup of relations according to the "Meaning ⇔ Text" theory. Hopefully, next year the collected data will make it possible to launch a new and interesting competition within Dialogue Evaluation.

    References


    1. Teach the bot! on Word Map
    2. RusVectōrēs: ready-made word2vec models for the Russian language
    3. Russian language thesaurus RuThes (RuTez, RuWordNet)
    4. Wiktionary
    5. About lexical and semantic information in the Russian National Corpus (NKRYa)
