FactRuEval: a competition for extracting named entities and facts
Competitions on various aspects of text analysis are held every year at the Dialog international conference on computational linguistics. The competitions themselves usually take place in the few months before the event, and the results are announced at the conference itself. Three competitions are planned this year:
- extraction of named entities and facts: FactRuEval;
- sentiment analysis: SentiRuEval;
- spelling correction: SpellRuEval.
The article you are reading has three goals. First, we would like to invite developers of automatic text analysis systems to take part in the competitions. Second, we are looking for volunteers who could help mark up the text collections on which the participants' systems will be evaluated (this is both interesting and a real contribution to science). And third, since a competition for extracting named entities and facts is being held at Dialog for the first time, we want to tell all interested readers how it will work.
Factographic Search Competition
Western computational linguists have long paid attention to extracting facts from texts. The first competition-driven conference, the Message Understanding Conference (MUC), was held in 1987. The event was sponsored by the military (DARPA), and the texts initially reflected their interests: reports on naval operations and on terrorism in Latin American countries. Later came news articles on economic topics and articles about rocket launches and plane crashes.
Since 1999, the competitions have continued under the Automatic Content Extraction (ACE) program and are no longer limited to English (Chinese and Arabic were added). Participants were offered the following tasks:
- Entity Detection and Tracking - seven entity types were distinguished (person, organization, location, facility, weapon, vehicle, and geo-political entity), each with subtypes.
- Relation Detection and Characterization - spatial relations, family and business relations between persons, employment, membership in organizations, ownership, citizenship, and others.
- Event Detection and Characterization - interaction, movement, transfer, creation, and destruction.
Detailed instructions for the ACE tasks in different years are available on the Linguistic Data Consortium website.
Since 2009, similar tasks have been offered in the Knowledge Base Population (KBP) track of the Text Analysis Conference (TAC). In 2015, KBP included the following tracks:
- Cold Start KBP - given a database schema and a large collection of texts, populate the database with information about the objects mentioned in the texts and the relations between them.
- Tri-Lingual Entity Discovery and Linking - given a non-empty database and a collection of texts in three languages (English, Chinese, and Spanish), find in the texts the mentions of objects present in the database and link these mentions to the database objects; objects missing from the database must be added to it.
- Event Track - extracting information about events and their attributes.
- Validation/Ensembling Track - improving the results of a system that extracts object attributes from text by combining the responses of several such systems or by additional linguistic processing.
Publications of the TAC workshops are available on the NIST website.
What about us?
The first factographic search competitions were held in 2004-2006 as part of the ROMIP workshop.
In 2004, participants in the factographic search task were given a collection of texts and a list of persons (for example: Sting, an English pop singer). They had to find the facts (events) associated with each person in the collection and provide a list of documents and the coordinates of the relevant fragments (the offset of the fragment's start and its length) where these events are mentioned.
In 2005 and 2006, several tasks in this direction were proposed:
- extraction of named entities (person, organization, geographical object, etc.);
- extraction of facts of several types (place of work, ownership of an organization).
About ten years have passed since then. During this time, experts in computational linguistics, data mining, and related areas, both at large companies and in small research groups, have accomplished quite a lot. However, there is little reliable information about their results in the public domain. Now, as part of the Dialog conference, an independent comparative evaluation of information extraction systems for the Russian language will be held, and its results will be available to everyone.
The FactRuEval-2016 competition will include three tracks: two on extracting named entities and one on extracting facts. All three tracks will be evaluated on the same collection of modern news texts. Below, I will explain the task of each track using one short example. The example text is:
Глава села Мартышкино Иван Петров заявил, что в Мартышкино …
0         1         2         3         4         5         6
01234567890123456789012345678901234567890123456789012345678901
Track 1: Named Entities
The task of the first track is to mark each occurrence of a named entity in the text and determine its type. That is, in the text above, three entities should be marked: the location “села Мартышкино”, the person “Иван Петров”, and the location “Мартышкино” again. As the answer, you need to generate a text file that specifies, for each entity, its type, the offset of the first character of the selected fragment from the beginning of the text, and the fragment's length:
LOC 6 15
PER 22 11
LOC 48 10
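To make the offset arithmetic concrete, here is a minimal sketch (our own illustration, not the official format writer) that locates each mention in the example sentence with str.find and prints the track 1 answer lines:

```python
# Compute the "TYPE offset length" lines for track 1 of the example.
# The mention list below is hard-coded for illustration only.
text = "Глава села Мартышкино Иван Петров заявил, что в Мартышкино ..."

# (type, mention, search-from) triples; the second "Мартышкино" is
# searched for starting past the first occurrence inside the LOC span.
mentions = [
    ("LOC", "села Мартышкино", 0),
    ("PER", "Иван Петров", 0),
    ("LOC", "Мартышкино", 22),
]

lines = []
for etype, mention, search_from in mentions:
    offset = text.find(mention, search_from)  # 0-based character offset
    lines.append(f"{etype} {offset} {len(mention)}")

print("\n".join(lines))
```

Note that the offsets count characters, not bytes, so Python's str type handles the Cyrillic text correctly without any extra work.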
Track 2: Entity Identification and Attributes
In this track, entities no longer need to be bound to positions in the text. Instead, all mentions of the same entity within the text must be linked into one object, and the attributes of that object must be determined. For example, Мартышкино is mentioned twice in the text under discussion, but it should appear only once in the output. For persons, the last name, first name, patronymic, and nickname must be indicated separately. The final result will look like this:
PER
Firstname:Иван
Lastname:Петров
LOC
Name:село Мартышкино
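The linking step can be sketched as follows; the containment heuristic here is our own simplification for the example (real systems need proper coreference resolution), not part of the track's rules:

```python
# Merge mentions of the same type when one name contains the other,
# keeping the longer name as canonical. A toy heuristic only.
mentions = [
    ("LOC", "село Мартышкино"),
    ("PER", "Иван Петров"),
    ("LOC", "Мартышкино"),
]

objects = []  # (type, canonical name) pairs, one per identified entity
for mtype, name in mentions:
    for i, (otype, oname) in enumerate(objects):
        if mtype == otype and (name in oname or oname in name):
            objects[i] = (otype, max(oname, name, key=len))
            break
    else:
        objects.append((mtype, name))

print(objects)  # [('LOC', 'село Мартышкино'), ('PER', 'Иван Петров')]
```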
Track 3: Fact Extraction
A fact is a relation between several objects. A fact has a type and a set of fields. For example, a fact of type Occupation has the fields Who, Where, Position, and Phase (start, finish, or indefinite). This year, several types of facts will be extracted:
- Occupation (a person's employment in an organization)
- Deal (a transaction between several parties, without specifying its subject or terms)
- Ownership
- Meeting (a meeting of several persons)
From our example, one fact should be extracted: Иван Петров works as the head of the village of Мартышкино.
Occupation
Who:Иван Петров
Where:село Мартышкино
Position:глава
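In code, such a fact is naturally modeled as a typed record with named fields; the dictionary below is our own in-memory representation for illustration, not the official output format:

```python
# The Occupation fact from the example as a type plus a field mapping.
fact = {
    "type": "Occupation",
    "fields": {
        "Who": "Иван Петров",
        "Where": "село Мартышкино",
        "Position": "глава",
    },
}

# Print it in the same "Field:value" layout used in the example above.
print(fact["type"])
for field, value in fact["fields"].items():
    print(f"{field}:{value}")
```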
Factographic search competitions always come with quite extensive markup guidelines, and this competition is no exception. Participants should study the “Description of the tracks” and “Format of the output results” documents.
Evaluation of the results
The competition will be held in January 2016. Before it starts, participants will receive a demonstration collection of marked-up texts and a comparator program with which they can independently evaluate their results. The comparator will be published as Python source code. Participants will have a few weeks to fine-tune their systems and bring their output into the expected format.
After that, to evaluate the quality of the participants' systems, a test collection will be provided that includes several hundred pre-annotated documents. Since participants could, in theory, mark up several hundred documents manually, tens of thousands of documents from the same sources as the pre-annotated ones will be added to the test collection. Two days will be given to process all these documents, and the systems' results in the described format must be sent to the organizing committee.
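The comparator itself had not yet been published at the time of writing, but span-level evaluation of this kind is conventionally reported as precision, recall, and F1 under strict matching; the sketch below rests on that assumption rather than on the official code:

```python
def prf(gold, predicted):
    """Precision, recall, and F1 for sets of (type, offset, length) spans.

    Strict matching: a predicted span counts as correct only if it is
    identical to a gold span (our assumption about the metric).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Gold spans from the track 1 example; the prediction misses one span.
gold = [("LOC", 6, 15), ("PER", 22, 11), ("LOC", 48, 10)]
pred = [("LOC", 6, 15), ("PER", 22, 11)]
p, r, f1 = prf(gold, pred)
print(p, round(r, 3), round(f1, 3))  # 1.0 0.667 0.8
```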
Text Collection
The competition corpus consists of news and analytical texts on socio-political topics in Russian. The texts come from the following sources:
- “Chastny Korrespondent” (Private Correspondent) www.chaskor.ru
- Russian Wikinews ru.wikinews.org
- “Lentapedia” ru.wikisource.org/wiki/Lentapedia
The corpus is divided into two parts: demonstration and test. The proportion of texts from each source is the same in both parts. Balance with respect to any other characteristics is not guaranteed.
Markup of this text collection is currently underway at OpenCorpora.org. We invite everyone interested to join in. How the markup process is organized is described in a separate article, “How, by reading the news, to benefit science?”. Detailed markup instructions are here.
The annotation task is to find in the text people's first names, last names, and patronymics, the names of organizations, and geographical names; select them with the mouse; and choose the type of the selected object. For organizations and geographical names, a descriptor (a word or phrase denoting the generic term) must also be specified. The selected text fragments (spans) are then combined into references to objects. For example, a first and last name are combined into a reference to an object of type Person, while an organization's descriptor (“Research Institute”) and its name (“Research Institute of Transport and Road Facilities”) are combined into a reference to an object of type Org. The list of object references that should result is shown in the picture below. The instructions cover examples and tricky cases that arise during markup.
How to take part in the competition?
You can participate in any of the announced tracks or in all of them at once. Teach your system to output results in the described format. Then, using the comparator, evaluate its performance on the demonstration part of the collection (we will publish it as soon as its markup is complete), and make the changes suggested by any discrepancies found.
In the very near future, we ask potential participants to register (we will send you news and notify you when the results evaluation procedure begins) and to help mark up the corpus (the task page is available after logging in on OpenCorpora). We would like to publish the demonstration part as soon as possible.
We also welcome any comments and suggestions here or by letter.