Linguistic technology testing: competitions on automatic resolution of coreference and anaphora

    So, as promised, here is our report: the results of the competitions on automatic anaphora and coreference resolution have recently been announced. Such competitions were held for the Russian language for the first time, and they were organized by a joint HSE–MSU team.

    We are sure that many of our readers are linguists who know perfectly well what anaphora and coreference are even without us; for everyone else, here is a short explanation. The same real-world object can be mentioned in a text several times and in different ways. "Vasya is a millionaire; he wants to buy an island." In this sentence, the pronoun "he" and the noun "Vasya" refer to the same person (that is, they share a referent). If a text-analysis system understands that "he" is "Vasya", then it knows how to resolve anaphora.

    It is more difficult when Vasya appears in the text several more times, for example as "Ivanov", "the client", "the head of the company", or "a soccer player". Then we are no longer talking about pronominal anaphora but about the coreference of nouns. The system's task in this case is to combine all the words behind which this person is hiding into one coreference chain. Here are a few examples, which also show how our Compreno technology handles them.
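The chain-building step described above can be sketched as a union-find over pairwise coreference decisions: each "these two mentions corefer" verdict merges two sets, and each resulting set is one chain. This is a minimal illustrative sketch; the mention strings and the pairwise links are made up, not Compreno's actual representation.

```python
# Hypothetical sketch: merging pairwise coreference decisions into chains.
# Mention names and links are illustrative, not Compreno's real format.

class CorefChains:
    """Union-find over mentions; each disjoint set is one coreference chain."""
    def __init__(self):
        self.parent = {}

    def find(self, m):
        self.parent.setdefault(m, m)
        while self.parent[m] != m:
            # Path halving keeps lookups fast on long chains.
            self.parent[m] = self.parent[self.parent[m]]
            m = self.parent[m]
        return m

    def link(self, a, b):
        """Record that mentions a and b corefer."""
        self.parent[self.find(a)] = self.find(b)

    def chains(self):
        groups = {}
        for m in self.parent:
            groups.setdefault(self.find(m), []).append(m)
        return [sorted(g) for g in groups.values()]

chains = CorefChains()
chains.link("Vasya", "he")           # anaphora: pronoun -> antecedent
chains.link("Vasya", "millionaire")  # noun coreference
print(chains.chains())  # one chain containing all three mentions
```

The point of the structure is that pairwise decisions made by different modules (syntax, semantic hierarchy, anaphoric rules) all land in the same set automatically, without any module knowing about the others.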

    1. Evgeni Plushenko is the only skater in the world who has won medals at four Winter Olympics. The athlete gained his first Olympic experience in 2002 at the Games in Salt Lake City, USA.

    Thanks to the syntax, the system understands that "Plushenko" and "skater" are the same person; this person is then merged with the one behind "athlete" through their connection in the semantic hierarchy; and in addition, the anaphoric rules replace the possessive pronoun "his" in the parse tree with that same "athlete". The result is a coreference chain.

    2. Darrell Lance Abbott was born in Arlington, Texas, a suburb of Dallas and Fort Worth, into the family of musician and producer Jerry Abbott. His father owned Pantego Sound Studios in Pantego, where Darrell saw and heard many blues guitarists, but after he heard Ace Frehley of the band Kiss, he wanted to start playing the guitar.

    Here the system immediately parses the name Darrell Lance Abbott into its parts and then matches it part by part. Therefore, Abbott's father Jerry Abbott did NOT get into the coreference chain: the surname matches, but the first name differs. In the next sentence, though, the system recognizes Darrell by first name alone, without the surname.

    3. Rosneft may gain control of all airports in Kyrgyzstan. The Russian company has signed a memorandum on the acquisition of at least 51% of OJSC Manas International Airport. Novaport, the company of Roman Trotsenko, who previously acted as Rosneft's partner in the project, is likely to become the operator of the Kyrgyz airports.

    Here again, because in the semantic hierarchy the semantic class "ROSNEFT" is a descendant of the class "COMPANIES", Compreno understands that the second sentence also refers to Rosneft. This example shows how coreference resolution helps to correctly extract event participants: we know who signed the memorandum, even though the sentence simply says "the Russian company".
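The hierarchy lookup in this example boils down to an ancestor check: a proper name can corefer with a common noun if the name's semantic class lies below the noun's class in the hierarchy. The sketch below illustrates that check; the class names and parent links are invented for illustration and are not Compreno's actual hierarchy.

```python
# Illustrative fragment of a semantic hierarchy: child class -> parent class.
# Class names are hypothetical, not Compreno's real semantic classes.
PARENT = {
    "ROSNEFT": "OIL_COMPANIES",
    "OIL_COMPANIES": "COMPANIES",
    "COMPANIES": "ORGANIZATIONS",
}

def is_descendant(cls, ancestor):
    """Walk up from cls toward the root; True if ancestor is on the path."""
    while cls in PARENT:
        cls = PARENT[cls]
        if cls == ancestor:
            return True
    return False

# "Rosneft" and "the Russian company" may corefer because the semantic
# class of the name is a descendant of the semantic class of the noun.
print(is_descendant("ROSNEFT", "COMPANIES"))  # True
print(is_descendant("COMPANIES", "ROSNEFT"))  # False: the check is directed
```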

    But back to the competition. Its goal was to evaluate the quality of methods developed for anaphora and coreference analysis in Russian. Seven developers took part in the competition: ABBYY, RCO, SemSyn, OpenCorpora (St. Petersburg), the Institute for Systems Analysis of the Russian Academy of Sciences, and Sergey Ponomarev. We emphasize once again: the goal was to compare algorithms, not companies' products. The results were announced at Dialogue, the largest computational-linguistics conference in Russia.

    On the first track, systems had to find complete coreference chains; on the second, to resolve anaphora, that is, to find for every pronoun the expression it points to. Both tasks are harder than syntactic and morphological analysis (competitions on those topics were held several years ago), yet most systems use syntax and morphology to mark up the text collection before resolving anaphora.
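To make the second track concrete, here is a deliberately naive baseline: link each pronoun to the nearest preceding noun that agrees with it in gender and number. This is a toy sketch of the task, not any participant's method; the token representation is invented for illustration.

```python
# Toy baseline for the anaphora track: nearest preceding agreeing noun.
# The Token fields are illustrative, not any competitor's actual format.
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    pos: str      # "NOUN" or "PRON"
    gender: str   # "m" / "f" / "n"
    number: str   # "sg" / "pl"

def resolve_pronouns(tokens):
    """Map each pronoun's index to its guessed antecedent's index."""
    links = {}
    for i, tok in enumerate(tokens):
        if tok.pos != "PRON":
            continue
        # Scan backward: recency plus gender/number agreement.
        for j in range(i - 1, -1, -1):
            cand = tokens[j]
            if (cand.pos == "NOUN" and cand.gender == tok.gender
                    and cand.number == tok.number):
                links[i] = j
                break
    return links

# "Vasya is a millionaire; he wants to buy an island."
sent = [
    Token("Vasya", "NOUN", "m", "sg"),
    Token("millionaire", "NOUN", "m", "sg"),
    Token("he", "PRON", "m", "sg"),
    Token("island", "NOUN", "m", "sg"),
]
print(resolve_pronouns(sent))  # {2: 1}: "he" -> nearest agreeing noun
```

Note that recency alone picks "millionaire" rather than "Vasya"; this is exactly why the competing systems bring syntax and semantics to bear before resolving pronouns.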

    Three participants competed on the first track and seven on the second, with seventeen "runs" in total on the second track. The systems varied widely, from experimental ones (built to test specific anaphora-resolution algorithms) to complex ones in which the module that determines referential links is just one of the components.

    How did the competition go?

    At first, participants were given the opportunity to train their systems on a small hand-annotated text collection. It included 100 texts, each containing from 5 to 100 sentences (the longest ran to 170 sentences). In this corpus, 2,000 anaphoric pairs of pronoun and antecedent (the word the pronoun refers to) were annotated. The systems then had to analyze a large text corpus. It was assembled specially for the competition and included excerpts from texts of various genres: news articles, scientific articles, blog posts, and fiction. All texts were taken from open sources: the OpenCorpora corpus of Russian, an online library, a news publication, Wikipedia, and other resources, 1,342 texts in all.

    The results were evaluated by comparison with a "gold standard": a manually annotated portion of the same corpus. Evaluation was semi-automatic (disputed cases were double-checked by experts).

    Competition Results

    The competition showed that existing systems resolve anaphora well (for example, Compreno, which took first place, achieved an F-measure of 76% with precision above 80%), while full coreference analysis fares worse. For Russian, the methods used for English are insufficient: the free word order, some other features of the language, and the acute shortage of open annotated corpora (the one created by the organizers is apparently the first resource of its kind) all get in the way. Developers can use the new corpus to test their algorithms independently, and the annotation guidelines the organizers formulated while building it will help researchers create new corpora for the same purposes.
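The F-measure quoted above is computed from a system's predicted pronoun–antecedent pairs against the gold standard. The sketch below shows the standard precision/recall/F1 arithmetic; the pair IDs and scores are made up for illustration.

```python
# Scoring predicted anaphoric pairs against a gold standard.
# Pairs are (pronoun_id, antecedent_id); all data here is invented.

def prf(gold, predicted):
    """Precision, recall, and F1 over sets of anaphoric pairs."""
    tp = len(gold & predicted)  # pairs the system got exactly right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(3, 1), (7, 5), (12, 9), (20, 15)}
predicted = {(3, 1), (7, 5), (12, 10)}   # one wrong antecedent, one miss
p, r, f = prf(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.67 R=0.50 F1=0.57
```

A system can trade precision for recall by linking pronouns more aggressively, which is why the organizers report the F-measure rather than either number alone.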

    An important result for ABBYY: our Compreno won on both tracks. Under the competition rules, we cannot disclose the full list of winners and losers on our blog. The point of these rules is that the competition (or, more precisely, the testing) exists not for PR but for the benefit of developers, who compare their algorithms with their colleagues' work and gain both scores (which can be cited in scientific publications) and experience. In addition, every such competition produces an annotated test corpus, the gold standard, on which anyone (students, for example) can later run their own algorithms and compare against the level achieved in the industry.

    We cannot name winners and losers in blogs or the media, but a detailed article analyzing the competition results, including the final ranking, will soon be posted on the Dialogue website. Read the organizers' article about the preparation of the competition and the evaluation methodology here.
