NLP Notes (Part 10)
(Previous parts: 1 2 3 4 5 6 7 8 9.) As the famous ad put it, "you weren't expecting us, but here we are" :)
In the time since the ninth part was published, I have read one good book on the subject (a couple more are in the to-read queue), many articles, and talked to several experts. Accordingly, a new batch of material has accumulated that deserves a separate note. As usual, I am sharing it with others while structuring the knowledge for myself.
I'll apologize right away: this part is rather hard to read and digest. Well, as the Russian saying goes, every day isn't Shrovetide for the cat; complex problems make for complex texts :)
A little bit about machine translation: statistics versus rules
I'll start with my favorite holy war: which is better, statistical models or models based on explicit grammar rules? Yorick Wilks writes about this well in Machine Translation: Its Scope and Limits. The book is about machine translation, but it is clear that you won't get far in translation without analyzing sentence structure, so parsing and translation are closely related topics.

Wilks sketches the evolution of the field roughly as follows. Initially everything was built on rules. Then, in the early 90s, a group at IBM tried pure statistics (with no rules at all) and got results that were much better than expected. However, Wilks notes that good initial results do not guarantee further improvement, since the "theoretical ceiling" of a technology may not be all that high. More generally, according to "Wilks's law", any theory, even the craziest one, lets you get decent results in machine translation. There is also a "second law of Wilks": a successful machine translation system usually does not actually work on its declared principles.
So if we take the current state of the SYSTRAN dinosaur, calling it a "rule-based" system is not entirely accurate: it relies on a pile of different "crutches" and ad-hoc algorithms. Likewise, the well-known systems from Google and IBM quickly moved out of the "purely statistical" category. IBM, for instance, initially rejected even per-language morphology modules, assuming that absolutely everything could be derived "on the fly" from a corpus of texts. They no longer do that. Note also that a "pure" statistical analysis of text essentially reduces to statistics on which words follow which. If we take "trigrams", that is, three consecutive words, as the basic unit of analysis, a modern computer can more or less cope (although in the early 90s trigrams were a real challenge).
The book also points out that Google does not push things to the point of fanaticism. SYSTRAN, for example, has been refined for some domains over decades, and catching up with it using any alternative algorithm will not happen quickly. So for certain language pairs Google actually uses SYSTRAN rather than its newfangled algorithms. There are a few more good observations on this subject. For instance, the author notes that the statistical English-French translator rests on a good parallel corpus of English-French texts (the proceedings of the Canadian Parliament). For rare language pairs, though (and in this context almost all other pairs are rare), it is very hard to find such an extensive corpus. Of course, "War and Peace" can be found in dozens of languages, but those translations are literary and are unlikely to serve as material for a parallel corpus.
Overall, Wilks believes that: (1) bare statistics, like bare rules, are not very promising (good machine translation requires a knowledge base about the objects in the text and the outside world); (2) future systems will be hybrid, though it is not yet clear how to build such hybrids; (3) even purely statistical algorithms must get smarter. Here is one more observation. Speech recognition systems today are also built on statistics: the system first "trains" on examples of input data and only then can be used in practice. It has been estimated that if children needed as much training data to learn speech as computer recognizers do, teaching a child a language would take more than 100 years of round-the-clock lessons.
Toward Hybridization
Machine translation is a big, multifaceted topic, so let's get back to parsing, that is, to the analysis of sentence structure. I have already been accused of an unjustified dislike for statistical methods. I think that is not entirely fair: in my view, the ideal direction for developing a syntactic (or syntactic-semantic) analyzer is the automatic, statistics-based extraction of parsing rules from an existing treebank. It is precisely in this direction that my understanding of the subject has noticeably advanced over the last month.
First, a little about treebanks themselves (recall that we are discussing almost exclusively dependency treebanks). Building a bank of sample parsed sentences seems to me a very worthwhile task. Obviously, annotated sentences allow much more interesting kinds of text analysis than unannotated ones. Moreover, the amount of work needed to create a treebank is not that large. In essence, compiling a treebank is the same parsing of sentences into their parts that all of us remember from school. The authors of the Finnish Turku dependency treebank believe that a treebank of ten thousand sentences is enough to build a full-fledged parser. Fine: if you parse ten sentences a day, without straining, such a treebank can be put together in about three years. Working together, in about a year. Is that really so much? The number of treebanks in the world is growing, that is a fact. Moreover, many of them are available to anyone, often for free.
With some bitterness one has to admit that with Russian, as usual, everything is complicated. There is a fairly large treebank, SynTagRus (about 42 thousand parsed sentences). I have not contacted its authors, but from an outside observer's point of view everything about it is somehow opaque: "Write us a letter, and maybe we'll answer. Or maybe we won't." Compare that with the freely available Czech and Finnish corpora! I am not saying the treebank must be free, but surely stating the distribution terms clearly is easier than annotating 42 thousand sentences? Why has the hard work been done while the simple part is still "hanging"?.. On the site ruscorpora.ru you can search for individual words in the corpus and inspect the parse trees, which are delivered as PDF. The phrase that is "especially pleasing": "No offline versions of the corpus are available yet, but work in this direction is underway." I can vividly picture that "ongoing work": apparently, on an old 80286, the RAR archiver is running around the clock, packing gigabytes of trees for later upload to the site. What else could it be? The already mentioned Finnish treebank is simply posted as an archive with explanations, and nobody complains.
It makes sense, I think, to talk about two main problems in treebank-based parsing: how to annotate a treebank, and how to automatically build a parser from a ready-made treebank. Let's start with the second task. Suppose a treebank already exists and is "somehow" annotated. That is, for every sentence in the treebank there is a ready parse graph; in other words, the links between words are listed (which word is connected to which), and for each link its type is given (for example, "subject").
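To make the structure concrete, here is a minimal sketch of how one annotated sentence from such a treebank could be stored and sanity-checked. The field layout (token, head position, relation type) is my own simplification, loosely modeled on the CoNLL-style columns many dependency treebanks use, and the relation names are illustrative.

```python
# A minimal sketch (my own field layout, loosely CoNLL-like) of one annotated
# treebank sentence: (token, head position, relation type); head 0 marks the root.
Sentence = list[tuple[str, int, str]]

# "I see a bow" (three words in the original Russian, article dropped here):
sample: Sentence = [
    ("I",   2, "subject"),
    ("see", 0, "root"),
    ("bow", 2, "object"),
]

def is_single_tree(sent: Sentence) -> bool:
    """Check the annotation really forms one tree: exactly one root,
    every head index points at an existing word, and there are no cycles."""
    n = len(sent)
    roots = [i for i, (_, head, _) in enumerate(sent) if head == 0]
    if len(roots) != 1 or any(not (0 <= head <= n) for _, head, _ in sent):
        return False
    for i in range(n):                  # walk up from every word toward the root
        seen, cur = set(), i + 1
        while cur != 0:
            if cur in seen:
                return False            # found a cycle
            seen.add(cur)
            cur = sent[cur - 1][1]
    return True

assert is_single_tree(sample)
```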
In one of the previous parts, I mentioned a work in which the treebank was converted into XDG grammar rules for further analysis. Unfortunately, that direction has stalled. The author explains the reason as follows: the treebank does not provide enough constraints on the trees. That is, the output consists of rules by which ten different trees can be generated for a single input sentence. The author believes the problem can be solved by ranking the resulting trees by "goodness". Indeed, there are such studies; let's see what comes of them (or maybe we will take part :)). By the way, on the subject of Google and pure statistics: one of the people advancing this area of tree ranking, Liang Huang, himself worked at Google for a while.
Generating rules à la XDG is not the only way to build a parser, though. There are more successful projects to date, for example MaltParser. It promises something close to science fiction: give it a treebank of any language as input, and it will generate a corresponding parser. And the system apparently works rather well. I remember needling someone somewhere: if statistics is so great, why hasn't anybody tried to write a statistical Pascal analyzer? :) After all, Pascal is clearly simpler than any natural language! Well, the MaltParser authors actually did solve that problem: they generated a C++ parser using MaltParser!
I have not yet dug into the details of the algorithms that turn treebanks into parsers, but in the most general form the idea is this: the parsing process is represented as a procedure that depends on numerical parameters. Different parameter values steer the parsing procedure toward one scenario or another. The treebank is used as a training set for some standard machine learning algorithm, which selects the required parameter values. The result is "a set of parameters + a fixed algorithm = a parser".
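For the curious, here is a toy skeleton of that "fixed algorithm plus learned parameters" recipe. It follows the general shape of transition-based parsing (a shift-reduce loop in which each step is chosen by a scoring function trained on the treebank), but it is my own simplification with unlabeled arcs, not MaltParser's actual transition system or feature model.

```python
# Toy sketch: the fixed algorithm is a greedy shift-reduce loop over a stack
# and a buffer; the learned parameters live inside `score`, which rates each
# possible action in the current configuration.

def parse(words, score):
    """Greedy shift-reduce parsing; `score` is the part learned from a treebank."""
    stack, buf, arcs = [], list(range(len(words))), []

    def legal_actions():
        acts = ["SHIFT"] if buf else []
        if len(stack) >= 2:
            acts += ["LEFT-ARC", "RIGHT-ARC"]
        return acts

    while buf or len(stack) > 1:
        # The fixed algorithm asks the learned model which action looks best.
        action = max(legal_actions(), key=lambda a: score(a, stack, buf, words))
        if action == "SHIFT":
            stack.append(buf.pop(0))
        elif action == "LEFT-ARC":            # second-from-top depends on the top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        else:                                 # RIGHT-ARC: top depends on second-from-top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs                               # (head index, dependent index) pairs

# Training on a treebank would produce a real `score`; the constant scorer below
# only demonstrates that the control loop runs end to end.
print(parse("Ivan came from guests".split(), lambda a, *_: {"SHIFT": 1.0}.get(a, 0.5)))
```

Training would then amount to fitting `score` so that, on treebank sentences, the chosen actions reproduce the annotated trees; everything else stays fixed.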
Here I want to pause and note some important, inherent limitations of MaltParser:
- Statistical methods always try to parse the input sentence, even an incorrect one. In principle, this can be seen either way: if the task is to parse any phrase, even an ill-formed one, it is a plus; if the goal is a spell-checking system, it is a clear minus.
- A statistical method is, in essence, a "black box". If we are unhappy with the analysis of some particular phrase, there is no way to examine why it happens and somehow "fix" the algorithm; all you can do is change the treebank and rerun the training.
- The parser outputs exactly one parse tree, which is taken to be the solution of the parsing task.
The first two limitations are clear enough, but the third point deserves a closer look. Here we face yet another incarnation of the basic question: what exactly should the parser do?
Facets of Parsing
I warn you right away that I have not had time to study my colleagues' thinking on this issue, so what follows is my personal opinion. One and only one tree: is that good or bad? I think we need to distinguish the tree as a structure of relations "in general" from the tree as a structure of clearly specified relation types. In other words, a tree as (1) a graph with unlabeled edges and as (2) a graph with labeled edges.
Consider the phrase Ivan came from the guests. It corresponds to a tree with the edges (came, Ivan), (came, from) and (from, guests). The phrase Ivan came out of politeness is syntactically arranged the same way and corresponds to a similar parse tree. However, if we add edge labels describing the "connection types", the trees are no longer identical: Ivan came (from where?) from the guests, but Ivan came (why?) out of politeness.
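A small sketch of this distinction: the two sentences share the same tree shape (who depends on whom) but differ once relation labels are attached. The label names are my own shorthand, not a standard tagset, and "out of" is treated as one token (in the Russian original it is the single word "из" in both phrases).

```python
# (token, head position, relation type); head 0 marks the root.
from_guests = [("Ivan", 2, "subject"), ("came", 0, "root"),
               ("from", 2, "place"), ("guests", 3, "prep-object")]

out_of_politeness = [("Ivan", 2, "subject"), ("came", 0, "root"),
                     ("out-of", 2, "reason"), ("politeness", 3, "prep-object")]

shape = lambda sent: [head for _, head, _ in sent]    # unlabeled structure
labels = lambda sent: [rel for _, _, rel in sent]     # connection types

assert shape(from_guests) == shape(out_of_politeness)    # same unlabeled tree
assert labels(from_guests) != labels(out_of_politeness)  # different labeled trees
```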
Now let's think about how important it is for a parser to be able to produce trees with different structures and with different labels; in other words, how likely it is that a single sentence corresponds to several different parse trees (different in structure, or different only in labels).
My feeling is this. A sentence should correspond to exactly one unlabeled graph. The case of two or more graphs is possible, but then we are dealing with a verbal pun, as if with a text and a subtext. Whether we need such wordplay in practice is a big question; honestly, I don't think we do. In essence, we are talking about examples of this kind:
He saw her in front of his eyes (in the Russian original, the word перед can be either the preposition "in front of" or the noun "front") => he saw (whom?) her (where?) in front of his eyes;
he saw (what?) her front (with what?) with his own eyes.
The countess was riding in a carriage with a raised rear => the countess was riding in a carriage (what kind of carriage?) with a raised rear;
riding in a carriage (who?) the countess, with her rear raised.

Of course, it is hard for me to say for sure, but my feeling is that a sentence should correspond to only one parse tree (with unlabeled edges). Whether a parser can build such a tree on its own, and what information it would need to do so, is a difficult question. For example, let's slightly change the phrase about the countess (forgive me, I am only reporting what came to mind): The countess was riding in a carriage with her backside raised. Here one could notice that a countess has a backside and a carriage does not; therefore, in the correct parse tree "backside" should attach to "countess" rather than to "carriage". But this requires the parser to have a fair knowledge of anatomy, so whether it is worth analyzing such subtleties remains an open question.
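Purely for illustration, here is a toy sketch of the kind of knowledge such an attachment decision would lean on; the miniature "knowledge base" below is of course invented, not part of any real parser.

```python
# Invented world knowledge: which nouns can plausibly possess the body part.
can_have_backside = {"countess": True, "carriage": False}

def attach_raised_backside(candidate_heads):
    """Pick which noun 'with her backside raised' should attach to."""
    plausible = [n for n in candidate_heads if can_have_backside.get(n, False)]
    return plausible[0] if len(plausible) == 1 else None   # None = still ambiguous

print(attach_raised_backside(["carriage", "countess"]))    # -> countess
```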
So, for unlabeled graphs MaltParser does fine. The situation gets noticeably more complicated for graphs with edge labels. In theory, an ideal parse of a phrase still corresponds to a single tree; in practice, as our appetites grow, the parser's abilities shrink sharply. The real scale of the problem depends on the set of relation types used between words in the treebank: the more elaborate the relation inventory, the harder the parser's job. After all, the parser tries to analyze sentences "in the image and likeness" of the treebank, that is, using every way of connecting words that it knows. (And here we come to the first problem mentioned above: the task of annotating a treebank.)
Don't think I am discussing purely theoretical matters. Treebank markup is a non-trivial question; there is no single standard for it, only individual well-developed approaches that you may adopt or not. The MaltParser authors, for instance, report that training on different treebanks gives different parsing quality: the Prague Dependency Treebank generally yields worse results, precisely because its markup is more detailed and has more relation types. A parser that operates with such a large inventory of relations has more opportunities to make a mistake.
So let me sharpen the question a little: how detailed should the treebank be for the parser to remain useful? If we limit ourselves to a "subject-predicate-object" scheme, MaltParser (and other statistical analyzers) is quite sufficient. If we attach a whole set of semantic attributes to every word, the problem of underspecification inevitably arises, that is, a genuine lack of data for building a single tree.
For example, in a simple markup scheme the phrase I see a bow (in Russian, я вижу лук, where лук can mean either a bow or an onion) is understood trivially: subject - predicate - object.
In a more detailed scheme, two equally valid trees arise: "I see the bow (weapon)" and "I see the onion (food)".
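A toy sketch of this effect, with made-up labels: the same coarse tree expands into one candidate per sense once we demand sense-level attributes on the object.

```python
# My own data and labels: refining the markup multiplies the candidate trees.
from itertools import product

senses = {"luk": ["weapon", "food"]}    # Russian "лук": bow or onion

coarse_tree = [("I", 2, "subject"), ("see", 0, "root"), ("luk", 2, "object")]

def refine(tree, senses):
    """Expand a coarse tree into every way of attaching sense labels."""
    options = [
        [(tok, head, f"{rel}:{s}") for s in senses[tok]] if tok in senses
        else [(tok, head, rel)]
        for tok, head, rel in tree
    ]
    return [list(combo) for combo in product(*options)]

for t in refine(coarse_tree, senses):
    print(t)    # two trees: luk-as-weapon and luk-as-food
```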
In principle, one can parse according to the "simple scenario" and push the responsibility for identifying precise relations and word senses onto subsequent modules. But won't it turn out that all the work we prefer not to think about has simply been moved elsewhere, and will still have to be done somehow?..

I don't know yet. I keep digging :)