A study of syntax parsers for the Russian language

    Hello! My name is Denis Kiryanov, I work at Sberbank on natural language processing (NLP) problems. At some point we needed to choose a parser for working with Russian. To do this we delved into the thickets of morphology and tokenization, tested various options and evaluated how they performed. We share our experience in this post.

    Preparing for the selection

    Let's start with the basics: how does it all work? We take a text, tokenize it, and get an array of tokens — pseudo-words. The further stages of analysis form a pyramid:

    Everything begins with morphology — analyzing the form of a word and its grammatical categories (gender, case, etc.). Syntax builds on top of morphology: it deals with relations beyond the boundaries of a single word, i.e. between words. The syntactic parsers discussed below analyze a text and output the structure of dependencies between its words.

    Dependency grammar and constituency grammar

    There are two main approaches to syntactic analysis, roughly equal in standing in linguistic theory.

    In the first line, the sentence is parsed according to dependency grammar — the approach taught in school. The example is the classic primer sentence «Моя мама мыла грязную раму» ("My mother washed the dirty frame"). Each word in the sentence is somehow related to the others. "Washed" is the predicate, on which the subject "mother" depends (here dependency grammar differs from school grammar, where the predicate depends on the subject). The subject has a dependent attribute, "my". The predicate has a dependent direct object, "frame", and the direct object "frame" has a dependent attribute, "dirty".

    In the second line, the analysis follows constituency grammar. Under it, the sentence is divided into groups of words (constituents). Words within one group are linked more closely: "my" and "mother" are tightly connected, as are "dirty" and "frame", while "washed" stands on its own.

    The second approach is poorly suited for automatic parsing of Russian, because closely related words (members of one constituent) very often do not stand next to each other. We would have to join them with strange discontinuous brackets spanning a word or two. That is why automatic parsing of Russian customarily relies on dependency grammar. It is also convenient because everyone knows this framework from school.

    Dependency tree

    A set of dependencies can be converted into a tree structure. At the top is the word "washed"; some words depend on it directly, others depend on its dependents. Here is the definition of a dependency tree from Jurafsky and Martin's textbook:

    Dependency tree:

    • There is a single designated root node that has no incoming arcs.
    • With the exception of the root node, each vertex has exactly one incoming arc.
    • There is a unique path from the root node to each vertex in V.
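These three properties are straightforward to verify mechanically. A minimal sketch — the head-list encoding used here (each token stores the index of its head, with 0 marking the root, as in the CoNLL-U convention) is our assumption:

```python
def is_dependency_tree(heads):
    """Check the three tree properties for a head list,
    where heads[i] is the 1-based head of token i+1 and 0 marks the root."""
    n = len(heads)
    # property 1: exactly one designated root
    if heads.count(0) != 1:
        return False
    # property 2 (one incoming arc per vertex) is guaranteed by the encoding;
    # property 3: every vertex must be reachable from the root exactly once
    children = {i: [] for i in range(n + 1)}
    for child, head in enumerate(heads, start=1):
        children[head].append(child)
    seen, stack = set(), [0]
    while stack:
        v = stack.pop()
        for c in children[v]:
            if c in seen:  # revisiting a vertex would mean a cycle
                return False
            seen.add(c)
            stack.append(c)
    return len(seen) == n

# "Моя мама мыла грязную раму": heads of (Моя, мама, мыла, грязную, раму)
print(is_dependency_tree([2, 3, 0, 5, 3]))  # True: a valid tree
print(is_dependency_tree([2, 1, 0, 5, 3]))  # False: tokens 1 and 2 form a cycle
```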

    There is a single top-level node — the predicate. From it you can walk to any word. Each word depends on some other word, but on exactly one. The dependency tree looks like this:

    In this tree the edges are labeled with particular types of syntactic relations. Dependency grammar analyzes not only the fact of a connection between words, but also its nature. For example, "is" and "taken" together form almost a single verb form, while "inventory" is the subject of "is taken". Accordingly, edges run from "taken" both to "is" and to "inventory", but these are not identical connections — they are of a different nature, and they must be distinguished.

    From here on we consider simple cases, where the members of the sentence are overtly present rather than implied. There are structures and labels for handling ellipsis: a node appears in the tree that has no surface expression — no word. But that is a topic for another article; we need to stay focused on our own task.

    Universal Dependencies Project

    To make choosing a parser easier, we turned to the Universal Dependencies project and the recent CoNLL Shared Task competition.

    Universal Dependencies is a project that unifies the annotation of syntactic corpora (treebanks) within the framework of dependency grammar. In Russian the set of syntactic relation types is limited — subject, predicate, and so on. English has the same situation, but with a different set: for instance, the article appears there, and it too must somehow be labeled. If we wanted to write a magic parser that could handle all languages, we would rather quickly run into the problem of reconciling different grammars. The heroic creators of Universal Dependencies managed to agree among themselves and annotate all the treebanks at their disposal in a single format. Exactly how they agreed is not so important; the main thing is that we got a uniform format for representing all of this — more than 100 treebanks for 60 languages.

    The CoNLL Shared Task is a competition between developers of syntactic parsing algorithms, conducted as part of the Universal Dependencies project. The organizers take a number of treebanks and divide each into three parts — training, validation and test. The first part is given to the participants so that they can train their models on it. The second part is also used by the participants, to evaluate the algorithm after training; training and evaluation can be repeated iteratively. The participants then hand their best algorithm to the organizers, who run it on the test part, which is closed to the participants. The models' results on the test parts of the treebanks are the results of the competition.

    Quality metrics

    We have connections between words and their types. We can evaluate whether the head of each word was found correctly — the UAS (unlabeled attachment score) metric. Or we can evaluate whether both the head and the dependency type were found correctly — the LAS (labeled attachment score) metric.
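With gold and predicted analyses aligned token by token, both metrics reduce to a couple of lines. A minimal sketch (the perfect token alignment is an assumption here; the next section explains what happens when it does not hold):

```python
def uas(gold_heads, pred_heads):
    """Share of tokens whose head was predicted correctly."""
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

def las(gold, pred):
    """Share of tokens whose (head, relation type) pair is fully correct."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

gold_heads = [2, 3, 0, 5, 3]
pred_heads = [2, 3, 0, 3, 3]           # one wrong head out of five
print(uas(gold_heads, pred_heads))      # 0.8

gold = [(2, 'det'), (3, 'nsubj'), (0, 'root'), (5, 'amod'), (3, 'obj')]
pred = [(2, 'det'), (3, 'nsubj'), (0, 'root'), (3, 'amod'), (3, 'nmod')]
print(las(gold, pred))                  # 0.6: wrong head on one token, wrong label on another
```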

    It would seem that plain accuracy suggests itself: count the number of hits out of the total number of cases. If we have 5 words and correctly identified the head for 4 of them, we get 80%.

    But in fact, evaluating the parser in isolation is problematic. Developers solving automatic parsing tasks often take raw text as input, which, in accordance with the analysis pyramid, first goes through tokenization and morphological analysis. Errors from these earlier stages can affect the measured quality of the parser. This particularly concerns tokenization — splitting into words. If we identified the wrong word units, we cannot correctly evaluate the syntactic links between them — after all, the units in the gold annotated corpus were different.

    Therefore, the evaluation formula in this case is the F1 score, where precision is the fraction of correct predictions relative to the total number of predictions, and recall is the fraction of correct predictions relative to the number of links in the gold data.
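In numbers, F1 is the harmonic mean of those two fractions. A minimal sketch with made-up counts:

```python
def f1(correct, predicted_total, gold_total):
    """F1 over dependency links: precision = correct/predicted,
    recall = correct/gold, F1 = their harmonic mean."""
    precision = correct / predicted_total
    recall = correct / gold_total
    return 2 * precision * recall / (precision + recall)

# the parser produced 6 links, the gold markup has 5, and 4 of them coincide
print(round(f1(4, 6, 5), 3))  # precision ~0.667, recall 0.8 -> F1 ~0.727
```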

    When we quote scores below, keep in mind that the metrics reflect not only the syntax but also the quality of token splitting.

    Russian language in Universal Dependencies

    For the parser to syntactically annotate sentences it has not yet seen, it needs to be fed an annotated corpus for training. There are several such corpora for Russian:

    The second column indicates the number of tokens — words. The more tokens, the larger the training corpus and (given good data) the better the resulting algorithm. Naturally, all experiments are conducted on SynTagRus (developed at IITP RAS), which contains more than a million tokens. All the algorithms discussed below are trained on it.

    Parsers for the Russian language in the CoNLL Shared Task

    According to the results of last year's competition, models trained on that same SynTagRus achieved the following LAS scores:

    The results of parsers for Russian are impressive — better than those of parsers for English, French and other, less-resourced languages. We got lucky twice over: first, the algorithms handle Russian well; second, we have SynTagRus — a large annotated corpus.

    By the way, the 2018 competition has already taken place, but we conducted our research in the spring, so we rely on the results of last year's track. Looking ahead, we note that the new version of UDPipe (Future) scored even higher this year.

    Syntaxnet, Google's parser, is not on the list. What's wrong with it? The answer is simple: Syntaxnet starts only from the morphological analysis stage. It took perfect, gold tokenization and built its processing on top of it. Therefore it would be unfair to evaluate it alongside the others — the rest split text into tokens with their own algorithms, which could worsen results at the subsequent syntactic stage. The 2017 Syntaxnet result is better than everything in the list above, but a direct comparison is not fair.

    Two versions of UDPipe made the table, in 12th and 15th place. The parser is developed by the same people who took an active part in the Universal Dependencies project.

    Updates to UDPipe appear periodically (somewhat less frequently, by the way, the corpus annotation is updated too). After last year's competition UDPipe was updated: these were commits toward the not-yet-released version 2.0; for simplicity we will loosely call the commit we took "UDPipe 2.0", although strictly speaking that is not accurate. Naturally, these updates are not in the competition table. The result of "our" commit lands somewhere around seventh place.

    So, we need to choose a parser for Russian. As a starting point we have the table above, with Syntaxnet in the lead and UDPipe 2.0 somewhere around 7th place.

    Choose a model

    We keep it simple: we start with the parser with the highest score; if something is wrong with it, we move down the list. "Something wrong" is judged by the following criteria — they may not be perfect, but they suited us:

    • Speed of operation. Our parser should work fast enough. Syntax is, of course, far from the only module "under the hood" of a real-time system, so we should not spend more than about ten milliseconds on it.
    • Quality of work. At a minimum, the parser itself should perform well on Russian data — an obvious requirement. For Russian we have fairly good morphological analyzers that can slot into our pyramid, so if the parser itself works well even without morphology, that will satisfy us — we will add the morphology later.
    • Availability of training code and, preferably, of a publicly available model. Given the training code, we can reproduce the results of the model's authors, so the code must be open. Besides, we need to watch the distribution terms of corpora and models: if we use them inside our algorithms, will we have to buy a license?
    • Launching without heroic effort. This item is very subjective, but important. It means that if we spend three days trying to get something to run and it still does not start, we cannot choose that parser, even if its quality is ideal.

    Nothing in the chart above UDPipe 2.0 suited us. Our project is in Python, and some parsers on the list are not written in Python; integrating them into a Python project would require exactly those heroic efforts. In other cases we ran into closed source code and academic or industrial in-house developments — in short, you cannot get at them.

    Stellar Syntaxnet deserves a separate mention for quality of work, but it failed us on speed. Its response time on simple phrases common in chats starts at 100 milliseconds. If we spend that much on syntax, we have no time left for anything else. Meanwhile, UDPipe 2.0 parses sentences in ~3 ms. As a result, the choice fell on UDPipe 2.0.

    UDPipe 2.0

    UDPipe is a pipeline that learns tokenization, lemmatization, morphological tagging and dependency parsing. We can train it on all of this, or on parts separately — for example, build another morphological analyzer for Russian with it, or train and use UDPipe just as a tokenizer.

    UDPipe 2.0 is documented in detail. There is a description of the architecture, a repository with the training code, and a manual. Most interesting are the ready-made models, including ones for Russian: download and run. The same resource also publishes the training parameters chosen for each language corpus — about 60 parameters per model — with which you can independently reproduce the quality figures from the table. They may not be optimal, but at least we can be sure the pipeline works correctly, and having such a reference lets us experiment with the model on our own.

    How UDPipe 2.0 works

    First, the text is split into sentences, and sentences into words. UDPipe does both at once with a joint module — a neural network (a single-layer bidirectional GRU) that predicts for each character whether it is the last one in a sentence or in a word.

    Then the tagger goes to work — the component that predicts the morphological properties of a token: which case the word is in, which number. From the last four characters of each word the tagger generates hypotheses about the part of speech and the morphological tags, and then a perceptron selects the best option.

    UDPipe also has a lemmatizer that selects the base form for each word. It learns roughly the way a non-native speaker would try to guess the lemma of an unfamiliar word: cut off the prefix and the ending, append the ending typical of the initial form of a verb, and so on. Candidates are generated this way, and a perceptron selects the best one.

    The schemes for morphological tagging (determining number, case and everything else) and for predicting lemmas are very similar. They could be predicted jointly, but separately works better — Russian morphology is too rich. You can also plug in your own list of lemmas.

    Now to the most interesting part — the parser. There are several dependency-parser architectures. UDPipe uses a transition-based architecture: it works fast, passing over all the tokens once, in linear time.

    Parsing in such an architecture begins with a stack (containing only root at the start) and an empty configuration. There are three default operations for changing it:

    • LeftArc — applicable if the second item on the stack is not root. Adds a dependency between the token on top of the stack and the second token, and pops the second one off the stack.
    • RightArc — the same, but the dependency goes in the other direction, and the top is popped.
    • Shift — moves the next word from the buffer onto the stack.

    Below is an example of the parser at work (source). We take the phrase "book me the morning flight" and recover the dependencies in it:

    Here is the result:
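The same transition system fits into a dozen lines of Python. A toy sketch: the action sequence below is a hand-written "oracle" for this particular phrase, not the output of a trained model, and relation labels are omitted:

```python
# Arc-standard transitions on "book me the morning flight".
def parse(tokens, actions):
    stack, buffer, arcs = ['root'], list(tokens), []
    for act in actions:
        if act == 'shift':                 # buffer -> stack
            stack.append(buffer.pop(0))
        elif act == 'left':                # head = top of stack, dependent = second
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == 'right':               # head = second, dependent = top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

actions = ['shift', 'shift', 'right',      # book -> me
           'shift', 'shift', 'shift',
           'left', 'left',                 # flight -> morning, flight -> the
           'right', 'right']               # book -> flight, root -> book
print(parse("book me the morning flight".split(), actions))
# [('book', 'me'), ('flight', 'morning'), ('flight', 'the'),
#  ('book', 'flight'), ('root', 'book')]
```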

    The classic transition-based parser has the three operations listed above: an arc in one direction, an arc in the other, and shift. There is also a Swap operation; base transition-based architectures do not use it, but it is enabled in UDPipe. Swap returns the second element of the stack to the buffer so that the next element can be taken from the buffer (useful when the related words are far apart). It lets the parser skip a certain number of words and recover the correct connection.

    The link leads to a good article by the inventor of the swap operation. One takeaway from it: although the parser now passes over the original token buffer repeatedly (so the time is no longer strictly linear), these operations can be optimized so that the runtime stays very close to linear. In other words, swap is not only linguistically sensible but also a tool that does not slow the parser down.

    In the example above we showed the operations that lead to a certain configuration — the token buffer and the connections between tokens. The transition-based parser receives the current configuration and must predict the configuration at the next step. By comparing the input vectors and configurations at each step, the model is trained.

    So, we chose a parser that fits all our criteria and even understood how it works. On to the experiments.

    UDPipe problems

    Let's feed it a short sentence: «Переведи маме сто рублей» ("Transfer a hundred rubles to mom"). The result makes you clutch your head.

    «Переведи» ("transfer") turned out to be a preposition — which is actually rather logical. We determine the grammar of a word form by its last four characters, and «веди» looks like something from «посреди» ("amid"), so the choice is relatively logical. «Маме» ("to mom") is more interesting: it was tagged as prepositional case and became the head of the whole sentence.

    If you try to interpret the parse result, you get something like "amid a mother's (whose mother? which mother?) hundred rubles" — not exactly what we started with. We need to do something about this. And we figured out what.

    In the analysis pyramid, syntax is built on top of morphology, relying on morphological tags. Here is the textbook example from the linguist L. V. Shcherba on this point:

    "Globally kuzdra shteko boshanula bokra and kurdyachit bokryonka."

    Parsing this sentence causes no problems. Why? Because we, like the UDPipe tagger, look at the ending of a word and understand what part of speech it is and what form it takes. The story of «переведи» as a preposition completely contradicts our intuition, but it turns out to be logical the moment we try to do the same with unfamiliar words. A person might reason the same way.

    Let us evaluate the UDPipe tagger separately. If it does not suit us, we take another tagger and build syntactic parsing on top of a different morphological markup.

    Tagging from plain text (CoNLL17 F1 score)

    • gold forms: 301639
    • upostag: 98.15%
    • xpostag: 99.89%
    • feats: 93.97%
    • alltags: 93.44%
    • lemmas: 96.68%

    The quality of UDPipe 2.0's morphology is quite good, but for Russian better is achievable. The Mystem analyzer (a Yandex development) determines parts of speech better than UDPipe. Besides, the remaining analyzers are harder to integrate into a Python project and run slower, with quality comparable to Mystem. By the way, a couple of interesting articles are devoted to comparing morphological analyzers for Russian.

    One can try to use Mystem's output morphological markup as input for the UDPipe parser. But there are problems. Many know that Mystem does not fully resolve morphological homonymy. It knows that in the sentence «Мама мыла раму» the word «мыла» comes from «мыть» ("to wash") and not from «мыло» ("soap"). But that is not enough for us. We also need, for words like «директора», where the lemma is absolutely obvious («директор», "director"), to understand which case it is. It may be:

    • "No director" - the genitive singular
    • “I see the director” - i.e. accusative singular
    • “These are some directors” - the nominative plural form (we do not have the accent on the letter)

    In such cases Mystem honestly returns the whole chain of variants. But we cannot feed UDPipe the entire chain; we must point to one best tag. How to choose it? If we touch nothing, the first one gets taken — maybe that will work. But the tags are sorted alphabetically by their English names, so the choice would be close to random, and some analyses would practically lose any chance of coming first.

    m.analyze("нет директора")
    [{'analysis': [{'lex': 'нет', 'gr': 'PART='}], 'text': 'нет'},
    {'text': ' '},
    {'analysis': [{'lex': 'директор', 'gr': 'S,муж,од=(вин,ед|род,ед|им,мн)'}],
    'text': 'директора'},
    {'text': '\n'}]
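In the output above, the ambiguous part of the 'gr' string sits in parentheses, with variants separated by '|'. Expanding such a chain into separate full tags takes a few lines — a sketch whose format details are inferred from the example output above:

```python
import re

def expand_gr(gr):
    """Expand 'S,муж,од=(вин,ед|род,ед|им,мн)' into a list of full tags."""
    m = re.match(r'^(.*)=\((.*)\)$', gr)
    if not m:                      # no parenthesized part: a single variant
        return [gr]
    common, variants = m.groups()
    return [f'{common}={v}' for v in variants.split('|')]

print(expand_gr('S,муж,од=(вин,ед|род,ед|им,мн)'))
# ['S,муж,од=вин,ед', 'S,муж,од=род,ед', 'S,муж,од=им,мн']
print(expand_gr('PART='))
# ['PART=']
```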

    There is an analyzer that can output a single best option — pymorphy2. But its morphological analysis is worse. Moreover, it gives the best parse without context: pymorphy2 will return the same single parse of «директора» for "there is no director", "I see the director" and "these are directors". It will not be random — it really is the most probable parse, with probabilities estimated on a separate corpus of texts — but a certain percentage of incorrect parses on production texts is guaranteed, simply because those texts may well contain phrases with all the different real forms: "I see the director", "the director came to the meeting", and "there is no director". Non-contextual parse probability does not suit us.

    How do we get a contextually best set of tags? With the RNNMorph analyzer. Few people have heard of it, but last year it won a competition of morphological analyzers held as part of the Dialogue conference.

    RNNMorph has a problem of its own: it has no tokenization. Mystem can tokenize raw text, but RNNMorph requires a list of tokens as input. So to get to syntax, we first have to apply some external tokenizer, then pass the result to RNNMorph, and only then feed the resulting morphology to the syntactic parser.

    These are the options we have. For now we will not discard the context-free disambiguation with pymorphy2 over the ambiguous cases in Mystem output — perhaps it will not lag far behind RNNMorph. Although, if we compare them purely on morphological markup quality (data from MorphoRuEval-2017), the loss is significant — about 15% in per-word accuracy.

    Next, we need to convert Mystem output into the format UDPipe understands, CoNLL-U. And that is again a problem — two problems, in fact. A purely technical one: the lines do not match up. And a conceptual one: it is not always entirely clear how to map them. Facing two differently annotated sets of language data, you will almost certainly run into the tag-matching problem; see the examples below. Answers to the question "which tag is correct here" may differ, and the right answer probably depends on the task. Because of this inconsistency, mapping between markup systems is not an easy task in and of itself.

    How to convert? There is the russian_tagsets package for Python, which can convert between different formats. It has no translation from Mystem output to the CoNLL-U format adopted in Universal Dependencies, but it can convert to CoNLL-U from, for example, the markup format of the Russian National Corpus (and back). The package's author (who, by the way, also wrote pymorphy2) put a wonderful line right in the documentation: "If you can avoid using this package, don't use it." He wrote this not because he is a bad programmer (he is an excellent one!), but because whenever you convert one format into another, you risk running into problems caused by the linguistic inconsistency of the markup conventions.

    Here is an example. School taught us the "category of state" (words like "cold", "necessary"). Some call it an adverb, others an adjective. You need to convert it, so you add rules — but you still never achieve a one-to-one correspondence between one format and the other.

    Another example: voice (whether someone did something, or something was done to someone). "Petya killed someone" versus "Petya was killed"; "Vasya photographs" versus "Vasya is being photographed". SynTagRus also has a medial voice — we will not even go into what it is and why. Mystem has no such thing. If you need to map one format onto the other, this is a dead end.

    We more or less honestly followed the advice of the russian_tagsets author: we did not use his package, because the format pair we needed was not on the list of supported conversions. Instead, we wrote our own converter from Mystem to CoNLL-U and moved on.
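Our converter is not public, but its core is a tag-mapping table plus assembly of CoNLL-U lines. A heavily simplified sketch: the mapping below covers only a handful of Mystem tags, while real coverage needs dozens of rules and exactly the judgment calls described above:

```python
# Toy Mystem -> CoNLL-U conversion: illustration only; the POS and feature
# tables cover a tiny fraction of the real tagsets.
POS_MAP = {'S': 'NOUN', 'V': 'VERB', 'PART': 'PART'}
FEAT_MAP = {'муж': 'Gender=Masc', 'ед': 'Number=Sing', 'мн': 'Number=Plur',
            'род': 'Case=Gen', 'вин': 'Case=Acc', 'им': 'Case=Nom'}

def to_conllu_row(index, token, lemma, gr):
    """Build one CoNLL-U row (morphology only, syntax columns left empty)."""
    pos_part, _, feat_part = gr.partition('=')
    pos_tokens = pos_part.split(',')
    pos = POS_MAP.get(pos_tokens[0], 'X')
    raw_feats = pos_tokens[1:] + feat_part.split(',')
    feats = sorted(FEAT_MAP[f] for f in raw_feats if f in FEAT_MAP)
    feats_col = '|'.join(feats) or '_'
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return '\t'.join([str(index), token, lemma, pos, '_', feats_col,
                      '_', '_', '_', '_'])

print(to_conllu_row(2, 'директора', 'директор', 'S,муж,од=род,ед'))
```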

    Connecting a third-party tagger to the UDPipe parser

    After all these adventures, we ended up with the three algorithms described above:

    • baseline UDPipe
    • Mystem with tag disambiguation via pymorphy2
    • RNNMorph

    We lost some quality, for a quite understandable reason: we took a UDPipe model trained on one morphology and slipped it a different morphology as input. The classic mismatch between training and test data — hence the drop in quality.

    We tried to align our automatic morphological markup tools with SynTagRus's manual markup and did not succeed. So in the SynTagRus training corpus we replace all the manual morphological markup with markup produced by Mystem + pymorphy2 in one case and by RNNMorph in the other. In the manually annotated validation corpus we are likewise forced to replace the manual markup with automatic markup, because "in battle" we will never receive manual markup.

    As a result, we trained the UDPipe parser (only the parser) with the same hyperparameters as the baseline. What is responsible for syntax — the head ID and the dependency type — we kept; everything else was replaced.


    Next, let us compare ourselves with Syntaxnet and the other algorithms. The CoNLL Shared Task organizers released their SynTagRus split (train/dev/test 80/10/10). We had initially taken a different one (train/test 70/30), so our data does not always coincide with theirs, although it was obtained from the same corpus. In addition, we took the latest (as of February–March) release from the SynTagRus repository, which differs slightly from the competition version. Figures for the runs we did not perform ourselves are taken from papers where the split was the same as in the competition; such algorithms are marked with an asterisk in the table.

    Here are the final results:

    RNNMorph really did turn out better — not in an absolute sense, but as an auxiliary tool judged by the final syntax metric (compared with Mystem + pymorphy2). That is, the better the morphology, the better the syntax, but the "syntactic" gap is much smaller than the morphological one. Note also that we did not move far from the baseline model, which means morphology was in fact holding back less than we had supposed.

    An interesting question: how much does syntax actually rest on morphology? Can a fundamental improvement in the syntactic parser be achieved through ideal morphology? To answer it, we ran UDPipe 2.0 on perfectly calibrated (gold manual) tokenization and morphology. A gap appeared relative to what we had (see the Gold Morph line in the table: +1.54% over RNNMorph_reannotated_syntax), including in the correctness of the dependency types. If someone ever writes an absolutely perfect morphological analyzer of Russian, the results of the syntactic parser will very likely grow too. And we now roughly understand the ceiling (at least, the ceiling for the architecture and the combination of parameters that we used for UDPipe).

    Interestingly, we almost reached the LAS of the Syntaxnet run. True, our data differs slightly, but in principle the figures are comparable. Syntaxnet uses "gold" tokenization, while ours comes from Mystem; we wrote the aforementioned converter around Mystem, but the analysis is still automatic, and Mystem surely errs somewhere too. The "UDPipe 2.0 gold tok" line of the table shows that default UDPipe with gold tokenization still loses slightly to Syntaxnet-2017 — but it works much faster.

    The one nobody caught up with is the Stanford parser. It is built along the same lines as Syntaxnet, so it is likewise slow. In UDPipe we simply pass over the stack. The Stanford parser and Syntaxnet follow a different concept: first they generate a complete directed graph, then an algorithm keeps the spanning tree that is most probable. To do this it enumerates combinations, and this search is no longer linear, because each word is visited more than once. Although it is slow, for Russian at least it is the more effective architecture in terms of quality. We spent two days trying to get this academic system running — alas, without success. But from its architecture alone it is clear that it will not be fast.

    As for our approach: although formally we barely moved the metrics, everything is now in order with "mom".

    In the phrase «переведи маме сто рублей», «переведи» ("transfer") is now really a verb in the imperative mood. «Маме» got its dative case. And most importantly for us, it carries the iobj label — indirect object (recipient). Although the gain in the numbers is insignificant, we coped well with the problem the task began with.

    Bonus track: punctuation

    Back on real data, it turns out that syntax depends on punctuation. Take the phrase «казнить нельзя помиловать» ("execute cannot pardon"). What exactly cannot be done — executing or pardoning — depends on where the comma stands. Even a linguist marking up the data needs punctuation as an auxiliary tool; without it, he cannot cope.

    Let's take the phrases «Петя привет» ("Petya hi") and «Петя, привет» ("Petya, hi") and look at how the baseline UDPipe model parses them. Let us leave aside the facts that, according to this model:
    1) «Петя» is a feminine noun;
    2) «Петя» is (judging by the tag set) in its initial form, yet the lemma is supposedly not «Петя».

    Here is how the result changes because of the comma; with it, we get something close to the truth.

    In the second case, «Петя» is the subject and «привет» is a verb. Recall the prediction of a word's form from its last four characters: in the algorithm's interpretation this is not "Petya, hi" but something like "Petya sings" or "Petya will come", with «привет» read as a verb form. The analysis is quite understandable: in Russian, a comma cannot stand between the subject and the predicate. So if there is a comma, this word is the greeting «привет»; if there is no comma, it may well be a predicate, something like "Petya is in love".

    We will encounter this in production quite often: spell checkers correct spelling, but not punctuation. Worse, users may place commas incorrectly, and our algorithm will take them into account when interpreting natural language. What are the possible solutions? We see two options.

    The first option is to do what is sometimes done in speech-to-text. Such text initially has no punctuation, so it is restored with a model. The output is material reasonably correct by the rules of Russian, which helps the syntactic parser work properly.

    The second idea is somewhat bolder and contradicts school lessons of Russian. It assumes working without punctuation altogether: if punctuation appears in the input, we remove it, and from the training corpus we also remove absolutely all punctuation. We pretend that Russian exists without punctuation — only periods to split it into sentences.

    Technically this is quite simple, because we do not change any internal nodes of the syntax tree: a punctuation mark can never be a head. It is always a terminal node — except for the % sign, which for some reason in SynTagRus is the head of the preceding numeral ("50%" in SynTagRus is annotated with "%" as the head and "50" as the dependent).
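Stripping punctuation from a training corpus then comes down to dropping PUNCT tokens and renumbering the remaining ids and heads. A simplified sketch over CoNLL-U-style rows (it assumes punctuation is never a head, which — apart from the "%" quirk just mentioned — holds):

```python
def strip_punct(sentence):
    """sentence: list of (id, form, upos, head) tuples, CoNLL-U style.
    Removes PUNCT tokens and renumbers the remaining ids and heads."""
    kept = [row for row in sentence if row[2] != 'PUNCT']
    new_id = {row[0]: i + 1 for i, row in enumerate(kept)}
    new_id[0] = 0  # root stays root
    return [(new_id[i], form, upos, new_id[head])
            for i, form, upos, head in kept]

sent = [(1, 'Петя', 'PROPN', 3),
        (2, ',',    'PUNCT', 1),
        (3, 'привет', 'NOUN', 0)]
print(strip_punct(sent))
# [(1, 'Петя', 'PROPN', 2), (2, 'привет', 'NOUN', 0)]
```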

    We run the experiments with the Mystem (+ pymorphy2) model.

    It is crucial not to feed punctuated text to the punctuation-free model. On the other hand, if we always feed text without punctuation, we stay within the bounds of the upper line of the table and get at least acceptable results. If the text has no punctuation and a punctuation-free model processes it, the drop relative to ideal punctuation with a punctuation-aware model is only about 3%.

    What to do with this? We can settle for these figures — obtained with a punctuation-free model and punctuation stripping. Or we can devise some classifier to restore punctuation. We will not reach the ideal numbers (punctuated text on the punctuation-aware model), because any punctuation-restoration algorithm works with some error, while the "ideal" numbers were computed on perfectly clean SynTagRus. But if we write a model that restores punctuation, will the gain pay back our costs? The answer is not yet obvious.

    We can ponder parser architecture for a long time, but we must remember that in reality there is still no large syntactically annotated corpus of web texts. Its existence would help solve real problems better. For now we train on corpora of perfectly literate, edited texts — and lose quality when we get user texts in battle, which are often written illiterately.


    We reviewed various dependency-based syntactic parsing algorithms as applied to Russian. In speed, convenience and quality of work, UDPipe turned out to be the best tool. Its baseline model can be improved by handing the tokenization and morphological analysis stages over to other, third-party analyzers: this trick corrects the misbehavior of the tagger — and consequently of the parser — in cases important for analysis.

    We also analyzed the relationship between punctuation and parsing and concluded that in our case it is better to remove punctuation before syntactic parsing.

    We hope the practical points discussed in this article will help you use syntactic parsing for your own tasks as effectively as possible.

    The author thanks Nikita Kuznetsov and Natalia Filippova for their help in preparing the article, and Anton Alekseev, Nikita Kuznetsov, Andrey Kutuzov, Boris Orekhov and Mikhail Popov for their assistance with the study.
