Yandex Tomita parser for everyone

Yandex continues to open up its tools to developers, and here is the latest result: a new parser. The Tomita parser is a tool for extracting structured data (facts) from natural-language text. Facts are extracted using context-free grammars and keyword dictionaries. The parser lets you write your own grammars, add your own dictionaries, and run them on your texts.

The Tomita parser extracts chains of words, or facts broken down into fields, from text according to templates written by the user (context-free grammars). For example, you can write templates to extract addresses. Here the fact is the address, and its fields are "city name", "street name", "house number", and so on. The parser includes three standard linguistic processors: a tokenizer (splitting text into words), a segmenter (splitting text into sentences), and a morphological analyzer (mystem). The main components of the parser are a gazetteer, a set of context-free grammars, and a set of descriptions of the fact types that these grammars produce as a result of the interpretation procedure.
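
To make the "fact with fields" terminology concrete, here is a minimal sketch, not the Tomita parser's actual API: the AddressFact class and its field names are hypothetical, chosen only to mirror the address example above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fact type: an address broken down into fields.
# The real Tomita parser describes fact types declaratively in its own
# configuration files; this is only a Python illustration of the idea.
@dataclass
class AddressFact:
    city: Optional[str] = None
    street: Optional[str] = None
    house: Optional[str] = None

# A grammar that matched "Moscow, Tverskaya street, 7" would fill the fields:
fact = AddressFact(city="Moscow", street="Tverskaya", house="7")
print(fact)  # AddressFact(city='Moscow', street='Tverskaya', house='7')
```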

The parser algorithm for one sentence and one grammar

1. Search for occurrences of all keys from the gazetteer. If a key consists of several words (for example, "Nizhny Novgorod"), a new artificial word is created, which we call a "multiword".

2. Of all the gazetteer keys found, those that are mentioned in the grammar are selected.

3. Among the selected keys there may be multiwords that overlap with each other or contain single keywords. The parser tries to cover the sentence with non-overlapping keys so that the largest possible chunks of the sentence are covered by them (see the coverage sketch after this list).

4. A linear chain of words and multiwords is fed to the GLR parser. Grammar terminals are mapped onto the input words and multiwords.

5. On this sequence of terminal sets, the GLR parser builds all possible parse variants. Of all the variants constructed, those that cover the sentence as widely as possible are selected.

6. Then the parser runs the interpretation procedure on the constructed syntax tree. It selects specially labeled subnodes, and the words that correspond to them are written into the fact fields generated by the grammar (a toy sketch of this step also follows the list).
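
Steps 1–3 can be illustrated with a small sketch. This is not the parser's real code; the gazetteer contents, the greedy longest-first strategy, and all names here are assumptions made purely for illustration of matching multiword keys and covering the sentence with non-overlapping ones.

```python
from typing import List, Tuple

# Toy gazetteer: keys known to the grammar, some of them multiword.
GAZETTEER = {("nizhny", "novgorod"), ("novgorod",), ("street",)}

def find_key_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Steps 1-2: find every occurrence (start, end) of a gazetteer key."""
    spans = []
    lowered = [t.lower() for t in tokens]
    for start in range(len(lowered)):
        for key in GAZETTEER:
            end = start + len(key)
            if tuple(lowered[start:end]) == key:
                spans.append((start, end))
    return spans

def cover_with_disjoint_keys(spans: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Step 3 (greedy variant): pick non-overlapping spans, preferring longer
    ones, so that as much of the sentence as possible is covered."""
    chosen: List[Tuple[int, int]] = []
    for span in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        if all(span[1] <= c[0] or span[0] >= c[1] for c in chosen):
            chosen.append(span)
    return sorted(chosen)

tokens = "He moved to Nizhny Novgorod last year".split()
print(cover_with_disjoint_keys(find_key_spans(tokens)))
# [(3, 5)] -> the multiword "Nizhny Novgorod" wins over the single key "Novgorod"
```

The interpretation step (6) can likewise be sketched in a few lines. Again, this is a toy illustration under assumed data structures, not Tomita's internal representation: tree nodes carry optional fact-field labels, and the words under each labeled node are written into the corresponding field.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    """A toy parse-tree node; `label` marks which fact field (if any)
    the words under this node should be written into."""
    word: Optional[str] = None
    label: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

def words_under(node: Node) -> List[str]:
    """Collect the surface words spanned by a node, left to right."""
    if node.word is not None:
        return [node.word]
    return [w for child in node.children for w in words_under(child)]

def interpret(node: Node, fact: Dict[str, str]) -> Dict[str, str]:
    """Step 6: walk the tree and fill fact fields from labeled subnodes."""
    if node.label is not None:
        fact[node.label] = " ".join(words_under(node))
    for child in node.children:
        interpret(child, fact)
    return fact

# Hand-built toy parse tree for "Moscow, Tverskaya street, 7":
tree = Node(children=[
    Node(label="city", children=[Node(word="Moscow")]),
    Node(label="street", children=[Node(word="Tverskaya"), Node(word="street")]),
    Node(label="house", children=[Node(word="7")]),
])
print(interpret(tree, {}))
# {'city': 'Moscow', 'street': 'Tverskaya street', 'house': '7'}
```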
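In the real parser the field labels come from interpretation annotations attached to grammar rules, and the fact layout comes from the fact type descriptions mentioned above; the sketches only show the shape of the data flow.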

What tasks can it solve? For example, extracting structured information about famous people: dates of birth, places of birth, the educational institutions they attended, and so on. Arguably, this is the first serious-grade text analyzer to become freely available for solving new applied linguistic problems in text processing. Developers have yet to grasp the full power of this toolkit, but it is already clear that these capabilities will breathe new life into the way websites are built.
