Friday format: “language” developments - research at the intersection of IT and linguistics
In today's article, we talk about several technology projects directly related to natural language processing: working with dictionaries and databases built on text corpora, and studying what users write on social networks, drawing on both international research and developments at ITMO University.
Semantic technology
A number of areas of work with natural language involve semantic technologies. Here the work is built primarily around ontologies, which define the relations between objects in a domain and help make interaction with a machine more “human”.
The “Semantic Web”, as a direction in the development of the Internet and machine-to-machine interaction, is a well-known idea that has been evolving for a long time. Nevertheless, new applications for semantic data keep appearing, and semantic technology projects are also under way at ITMO University.
For example, VISmart, a resident company of ITMO University's Technopark, is developing Ontodia, a project that puts semantic technologies to applied use, including for the needs of developers. The user uploads semantic data to Ontodia and receives a visualization of it in the form of a graph.
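To illustrate the general idea (this is not Ontodia's actual code, just a minimal sketch under that assumption), semantic data such as RDF triples can be loaded with rdflib and re-expressed as a labeled graph with networkx, ready for rendering:

```python
# Illustrative sketch only: Ontodia's internals are not described in this
# article, so this just shows the general idea of turning semantic (RDF)
# data into a graph structure that a visualizer could draw.
import rdflib
import networkx as nx

TTL = """
@prefix ex: <http://example.org/> .
ex:Patient ex:hasDiagnosis ex:Diagnosis .
ex:Diagnosis ex:treatedBy ex:Therapy .
ex:Therapy ex:prescribedBy ex:Doctor .
"""

g = rdflib.Graph()
g.parse(data=TTL, format="turtle")

# Each RDF triple (subject, predicate, object) becomes a labeled edge.
nxg = nx.DiGraph()
for s, p, o in g:
    nxg.add_edge(str(s), str(o), label=str(p))

for u, v, data in nxg.edges(data=True):
    print(f"{u} --[{data['label']}]--> {v}")
```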
As an example of how such visualizations are used, the developers cite search and comparison of information in unstructured medical data at the V.A. Almazov Northwest Medical Research Center.
Another implemented project based on semantic technologies is an extension for the Open edX system that makes it possible to personalize the learning process in online courses. ITMO University staff from the international laboratory “Intelligent Information Processing Methods and Semantic Technologies”, together with a colleague from Yandex, created an ontology describing all the components of a MOOC: content, usage scenarios, process participants, and so on. As a result, developers can identify interdisciplinary connections between courses published on the edX platform.
From the point of view of NLP algorithms, the mechanism is as follows: we take the text content of the course (for video lectures, the subtitles) and use the algorithms to extract keywords from it: the so-called domain concepts.
We then mark these concepts on the prepared ontology. This gives us semantic units of content in each course, with whose help we can link different courses, on different topics and from different subject areas, with one another.
- Dmitry Volchek, graduate student, Department of Informatics and Applied Mathematics, ITMO University
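A toy sketch of the pipeline Volchek describes; the actual extraction algorithms and ontology are not specified in the article, so TF-IDF keywords and a flat set of concept labels stand in for them here:

```python
# Toy sketch of the pipeline described above; the real extraction algorithms
# and ontology are not given in the article, so both are stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer

# Subtitles of two hypothetical video lectures from different courses.
lectures = {
    "ml_course_lecture_3": "gradient descent minimizes the loss function ...",
    "optimization_lecture_1": "the gradient points toward steepest ascent ...",
}

# Stand-in "ontology": a flat set of domain concept labels.
ontology_concepts = {"gradient", "loss function", "descent", "entropy"}

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(lectures.values())
terms = vectorizer.get_feature_names_out()

for row, name in zip(tfidf.toarray(), lectures):
    # Take the highest-weighted terms as candidate keywords ...
    top = sorted(zip(row, terms), reverse=True)[:5]
    keywords = {t for w, t in top if w > 0}
    # ... and keep only those that match a concept in the ontology.
    print(name, "->", keywords & ontology_concepts)
```

Matched concepts then serve as the shared semantic units through which courses from different subject areas can be linked.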
Thanks to this, both students and MOOC creators can track how and in what capacity a particular concept is used across courses, what it means within different subject areas, and, ultimately, build a well-rounded picture of the concept they are interested in.
Text Processing Algorithms and Big Data
Another area of work with natural language is the use of algorithms to count and evaluate particular characteristics of large bodies of text. Although this may look like a trivial big data exercise, it has nuances of its own.
According to Dmitry Muromtsev, head of the Department of Informatics and Applied Mathematics at ITMO University and head of the international laboratory “Intelligent Information Processing Methods and Semantic Technologies”, work on such projects usually follows a similar scenario: developers analyze a large corpus of texts and evaluate its linguistic characteristics: morphology, syntax, nuances in the use of particular words and phrases, and so on.
The underlying idea and the algorithms of such services are roughly the same. They use a set of text processing approaches that have become standard. The uniqueness lies in the fact that these algorithms have to be tuned very precisely for each specific language. In our laboratory, in particular, we do this kind of work as well.
After all, when we speak in everyday life, we use rules that we learn practically from birth: at school, in daily communication, and so on. The same has to be done with a machine: it has to be taught these rules essentially from scratch, and taught them very well.
- Dmitry Muromtsev
Such work sometimes leads to unexpected results. For example, not long ago a similar method allowed scholars to carry out a more detailed analysis of Shakespeare's legacy. It turned out that 17 of his 44 plays were co-authored (a 1986 study had identified only 8 “collaborations”). The practice of borrowing and reworking one another's texts was nothing out of the ordinary for English poets of the 16th century.
Moreover, in some cases it was difficult until recently to determine the exact authorship of a work or part of it, since the writers not only exchanged ideas but also tried to imitate each other's style.
The key was an analysis of so-called function words: words that have no nominative function of their own and instead express the relations between “independent” words. The analysts were able to identify patterns in their use that point unambiguously to a particular author and make up his “unique linguistic portrait”. For example, one of Shakespeare's distinguishing traits was the “and with” construction (as in “With mirth in funeral and with dirge in marriage”).
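The underlying stylometric idea is easy to reproduce in miniature: count how often a text uses a fixed list of function words and compare the resulting frequency profiles. A sketch (the real study used far larger texts and richer statistics; the word list and snippets below are illustrative only):

```python
# Minimal stylometry sketch: compare texts by function-word frequencies.
# The real Shakespeare study used much larger corpora and richer models;
# the word list and text snippets here are illustrative only.
from collections import Counter
import math
import re

FUNCTION_WORDS = ["and", "with", "the", "of", "to", "in", "that", "but"]

def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

known_author = profile("with mirth in funeral and with dirge in marriage ...")
disputed = profile("and with his spirit sadly I survive ...")
print(f"similarity: {cosine(known_author, disputed):.3f}")
```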
Pinning down exactly which poets were involved in creating the famous plays, the scholars say, to some extent debunks the myth of Shakespeare's exclusivity. For example, Shakespeare, as it turned out, wrote the “heavyweight” first part of the Henry VI trilogy himself (it had previously been attributed to possible co-authors), while Thomas Middleton had a hand in “All's Well That Ends Well”.
Another unusual example of a linguistic project built on big data is the De-Jargonizer. This project by Israeli scientists evaluates a number of characteristics of a scientific text (based on an analysis of a corpus of 500 thousand scientific articles) and determines how well a broad audience will understand it. The service counts the words belonging to specialized vocabulary, as well as rare words, and uses this data to determine the accessibility of the text (we wrote more about this project here).
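The scoring principle lends itself to a toy reconstruction: classify each word of a text by its frequency in a reference corpus and reward common vocabulary. The frequencies and thresholds below are invented; the De-Jargonizer's real corpus and calibration differ:

```python
# Toy accessibility score in the spirit of the De-Jargonizer. The real tool
# uses a 500,000-article corpus and calibrated thresholds; these are made up.
import re

# Stand-in corpus frequencies (occurrences per million words).
CORPUS_FREQ = {"cell": 900.0, "protein": 400.0, "study": 1200.0,
               "phosphorylation": 2.0, "kinase": 3.5, "the": 60000.0}

COMMON, MID = 1000.0, 50.0  # illustrative frequency thresholds

def accessibility(text: str) -> float:
    """Percentage score: common words earn 1 point, mid-frequency 0.5, rare 0."""
    words = re.findall(r"[a-z]+", text.lower())
    score = 0.0
    for w in words:
        f = CORPUS_FREQ.get(w, 0.0)  # unknown words count as rare
        score += 1.0 if f >= COMMON else 0.5 if f >= MID else 0.0
    return 100.0 * score / max(len(words), 1)

print(accessibility("The study measured kinase phosphorylation in the cell"))
```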
Text sentiment analysis
A number of studies (including those conducted at ITMO University) involve several natural language analysis technologies at once. Opinion mining (text sentiment analysis) projects are an example: sentiment analysis involves building an ontology of the subject area, using statistical tools for natural language analysis and machine learning algorithms, and (in some cases) bringing in experts for a more accurate assessment of texts.
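How the statistical and machine learning parts typically fit together can be sketched as follows (the article does not publish ITMO's actual model, so this is a generic bag-of-words classifier on toy data):

```python
# Minimal sentiment-analysis sketch: TF-IDF features plus logistic regression.
# The training data is toy; the domain ontology and expert labeling mentioned
# above would be layered on top in a real system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = ["great event, loved the speakers",
         "awful service, very disappointed",
         "really enjoyed the concert",
         "terrible traffic ruined the evening"]
labels = ["pos", "neg", "pos", "neg"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

print(model.predict(["the speakers were terrible"]))  # likely ['neg']
```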
At ITMO University, such a project was implemented as part of the task of analyzing public opinion on the Internet. To analyze opinions, employees of the Laboratory of Advanced Computing Technologies at the NKT Research Institute use data from social networks (VKontakte, Twitter, Instagram, LiveJournal) as the basis for further processing. Each publication is then tagged with a given set of characteristics (number of likes, reposts, comments, shares), and the data is combined into a graph of links through which the spread of information can be tracked.
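The link graph described here can be modeled very simply: posts are nodes, reposts are directed edges, and a publication's spread is the set of nodes reachable from it. A sketch with hypothetical data:

```python
# Sketch of a repost graph: nodes are posts, edges point from the original
# post to each repost, so information spread is simple graph reachability.
import networkx as nx

g = nx.DiGraph()
# Hypothetical posts with engagement attributes from the article's feature set.
g.add_node("post_1", likes=120, comments=14)
g.add_node("repost_1a", likes=30, comments=2)
g.add_node("repost_1b", likes=5, comments=0)
g.add_node("repost_1b_x", likes=1, comments=0)

g.add_edge("post_1", "repost_1a")
g.add_edge("post_1", "repost_1b")
g.add_edge("repost_1b", "repost_1b_x")

# All posts that post_1's content eventually reached.
print(nx.descendants(g, "post_1"))  # {'repost_1a', 'repost_1b', 'repost_1b_x'}
```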
The project is used to study social processes on the Internet and continues to develop. Several studies based on social network data analysis and natural language processing have already been carried out at the NKT Research Institute.
One of them monitors the network activity of informal communities, which makes it possible to study how information spreads and how problem-oriented communities with informational influence emerge. Another project builds an “emotional map” of a given area: based on geotagged publications and an assessment of their content, analysts can get an idea of how people feel in a particular place.
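In its simplest form, an emotional map is an aggregation of sentiment scores over a spatial grid. A minimal sketch with invented coordinates and scores:

```python
# Toy "emotional map": average sentiment of geotagged posts per grid cell.
# The coordinates and sentiment values are invented for illustration.
from collections import defaultdict

# (latitude, longitude, sentiment in [-1, 1]) for each geotagged post.
posts = [(59.934, 30.335, 0.8), (59.935, 30.337, 0.4),
         (59.950, 30.310, -0.6), (59.951, 30.312, -0.2)]

CELL = 0.01  # grid step in degrees

cells = defaultdict(list)
for lat, lon, sentiment in posts:
    key = (int(lat // CELL), int(lon // CELL))  # bin posts into grid cells
    cells[key].append(sentiment)

for key, scores in sorted(cells.items()):
    print(key, sum(scores) / len(scores))  # average mood of each cell
```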
There are more and more projects related to natural language processing every year, and they are becoming more ambitious. Scientists from the UK, for example, note that “the computing power of computers is increasingly being turned to solving linguistic problems, because these are among the most complex and time-consuming tasks facing modern developers.”