
NLP Notes (Part 6)
(First parts: 1 2 3 4 5.) I hope the conversation about natural language has not yet bored the readers! In my opinion, the topic is really interesting (although the popularity of the posts is clearly declining :)). Well, let's see how many more parts I have in me. I think we have already passed the equator, but three or four topics can still be covered.
This time the article is entirely devoted to the XDG/XDK project, which I have been studying at my leisure. I can't call myself an XDG specialist yet, but I am making progress, slowly.
So, we settled on doing dependency parsing and, if possible, supporting non-projective constructions and multiple output trees. This leads naturally to the XDG/XDK project, which satisfies all of these requirements. I came to the topic with a sympathy for lexicalized dependency parsing, but otherwise I started almost from scratch, and the choice of XDG was dictated by a cold screening of the alternatives: I studied what exists and picked what looked best. For now, XDG appears to be the best option (at least within the given concept).
What is it? XDG stands for Extensible Dependency Grammar, and XDK, accordingly, for the XDG Development Kit. XDG is a formalism for describing the permissible relationships between words in a sentence. It is a lexicalized formalism: for each concrete word of the language, you have to describe how and with what it can combine. The XDK includes a set of tools, the most important of which is the Solver, a.k.a. the parser. The Solver takes as input a grammar of a language and a sentence in that language, and outputs the set of trees corresponding to the parsed sentence. As you can see, in words everything is very simple :)
Grammar Composition
Where to get the grammar? Here I would like to emphasize: yes, in the simplest scenario (though for the developer it is clearly not the easiest :)) the grammar is written by hand. However, let's not forget that grammar rules are just a precise formal description of language constructions. Where they came from, whether written by experts or somehow derived from existing texts, is a separate question. Be that as it may, rules are needed for precise analysis of sentences, and a grammar is just one way of writing them down.

As for the way these rules are written, it seems quite natural and adequate to me. I believe that XDG could describe not only a natural language but also, say, Pascal. Of course, such a parser would be far from perfect, but in theory it should work :)
I don't even know why no one has built such a dependency parsing kit before. The project is quite fresh: the current version is dated 2007 (although, as I have already said, I don't notice much enthusiasm around the XDK so far, and at the moment the project is not being developed).
Perhaps it makes sense to give a feel for the proposed grammar language. It rests on three basic principles:
Lexicalization. The elements of the grammar are the words of the language. Each word can be assigned a set of attributes, which in effect determine what the word can attach to itself and what it can attach itself to. Each attribute has a name, a type, and a value. The most commonly used types are "string" and "set of strings".
Graph-construction principles. In addition to the "gluing" restrictions specified at the level of individual words, you can impose restrictions on the entire parse graph of the phrase. Such restrictions are called "principles". The simplest example is the tree principle, which requires the generated graph to be a tree. We will talk about some other useful principles a bit later.
Multidimensionality. Each word attribute is defined in a certain (user-defined) namespace. Graph-construction principles are also declared within a particular namespace. Each such space is called a "dimension". A parse graph of the phrase is built separately for each dimension. Thus you can obtain several parse graphs at once, each reflecting certain structural features of the sentence (that is, the user can look at the sentence from different angles, as the sketch after this list shows). For example, along with a dependency graph, you can build a degenerate list-like graph showing the order of words in the sentence. Or, say, a graph connecting all the noun phrases.
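To make the multidimensionality idea a bit more tangible, here is a toy Python sketch (my own illustration, not XDK code): the same words carry a separate edge set per dimension, for example a dependency dimension and a degenerate word-order dimension:

    # Toy illustration of multidimensionality (not XDK code): the same
    # nodes get a separate graph per dimension, and each dimension's
    # principles only ever see that dimension's edges.
    WORDS = ["Peter", "eats", "spaghetti", "today"]

    graphs = {
        # dependency dimension: labeled head -> daughter edges
        "syn": [("eats", "subj", "Peter"),
                ("eats", "obj", "spaghetti"),
                ("eats", "adv", "today")],
        # degenerate "list" dimension: each word points to the next one,
        # capturing nothing but the linear order of the sentence
        "order": [(WORDS[i], "next", WORDS[i + 1])
                  for i in range(len(WORDS) - 1)],
    }

    for dim, edges in graphs.items():
        print(dim, edges)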
Valency principle
Now a few words about what the principles are. First of all, the valency principle deserves a mention: an edge from word w1 to word w2 may be established only if its label is allowed by the out attribute of w1 and by the in attribute of w2. Let me explain with an example:

    defentry {
      dim lex {word: "eats"}
      dim syn {in: {root} out: {subj obj adv*}}
    }
    defentry {
      dim lex {word: "Peter"}
      dim syn {in: {subj? obj?} out: {}}
    }
    defentry {
      dim lex {word: "spaghetti"}
      dim syn {in: {subj? obj?} out: {}}
    }
    defentry {
      dim lex {word: "today"}
      dim syn {in: {adv?} out: {}}
    }
The out specification of the verb "eats" says that the word can attach one subject (subj), one object (obj), and an arbitrary number of adverbial modifiers (adv*). The in specification of the word "Peter" allows it to act as a subject or an object. The specification of "spaghetti" is the same. The word "today" is described as a potential adverbial modifier. Thus, if we feed the phrase "Peter eats spaghetti today" to the parser, the resulting tree will have "eats" as its root, "Peter" and "spaghetti" attached to the root as subject and object, and "today" attached as an adverbial modifier.
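To show the valency check itself, here is a rough Python sketch (my reading of the principle, not the XDK's actual algorithm): an edge with label l from head w1 to daughter w2 is admissible only if l is licensed both by the out specification of w1 and by the in specification of w2:

    # Toy illustration of the valency principle (not XDK code).
    # In the grammar above, an unmarked label in "out" means exactly one
    # such edge, "?" means at most one, "*" means any number.
    OUT = {"eats": {"subj": "1", "obj": "1", "adv": "*"}}
    IN = {"Peter": {"subj": "?", "obj": "?"},
          "spaghetti": {"subj": "?", "obj": "?"},
          "today": {"adv": "?"}}

    def edge_ok(head, label, daughter):
        # an edge head --label--> daughter is licensed only if the label
        # occurs both in the head's out spec and in the daughter's in spec
        return label in OUT.get(head, {}) and label in IN.get(daughter, {})

    print(edge_ok("eats", "subj", "Peter"))   # True
    print(edge_ok("eats", "adv", "today"))    # True
    print(edge_ok("eats", "subj", "today"))   # False: "today" cannot be a subject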
There is also an "order principle" (which I will sacrifice here to save space), with which you can express word-order constraints, for example that the subject comes earlier in the sentence than the object (so "spaghetti" will not become the subject).
Consistency principle
This is another important principle that I cannot pass by. It establishes the following requirement: words to be linked must agree on some attribute. For example, you can specify that the verb "eats" requires a third person singular subject, while the verb "eat" will accept anything but third person singular:

    defentry {
      dim lex {word: "eats"}
      dim syn {agrs: {["3" sg]} agree: {subj}}
    }
    defentry {
      dim lex {word: "eat"}
      dim syn {agrs: {["1" sg] ["2" sg] ["1" pl] ["2" pl] ["3" pl]} agree: {subj}}
    }
For the word "Peter" to attach now, we have to specify its person and number attributes:
defentry {dim lex {word: "Peter"} dim syn {agrs: {["3" sg]} agree: {}}}
As these examples show, the grammar has no direct support for morphology: the words "eat" and "eats" are simply different words to it. Admittedly, I don't quite see how support for "morphology in general" could be crammed into a universal formalism. The problem, however, is solved quite simply: run the input sentence through a morphological analyzer and generate the grammar rules (with the correct endings and word attributes) automatically, "on the fly".
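A minimal sketch of such on-the-fly generation, assuming a hypothetical analyze() function standing in for a real morphological analyzer (the emitted text follows the defentry fragments shown above):

    # Sketch of generating lexicon entries "on the fly" (analyze() is a
    # hypothetical stand-in for a real morphological analyzer).
    def analyze(token):
        # returns (person, number) hypotheses; hard-coded for the demo
        table = {"eats": [("3", "sg")], "Peter": [("3", "sg")]}
        return table.get(token, [])

    def make_defentry(token):
        agrs = " ".join('["%s" %s]' % pn for pn in analyze(token))
        return ('defentry {\n'
                '  dim lex {word: "%s"}\n'
                '  dim syn {agrs: {%s}}\n'
                '}' % (token, agrs))

    print(make_defentry("eats"))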
The analysis ends when the entire structure has been assembled and attached to the word carrying the "top of the tree" attribute. For example, the period can be described as such a word (since the sentence ends with a period):
defentry { dim lex {word: "."} dim syn {in: {} out: {root} agrs: top }}
Start!
If the resulting grammar (of course, only fragments of it are shown here!) is fed to the analyzer together with the input line "Peter eats spaghetti today.", the expected tree is generated at the output:
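(The picture is not reproduced here; schematically, the expected result looks something like this, with the period as the top node:)

    "."
     └─ root ── eats
                 ├─ subj ── Peter
                 ├─ obj ─── spaghetti
                 └─ adv ─── today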
Here I propose to put a semicolon, if not a period, because I am failing to continue "briefly". The next "block of thoughts" deserves a separate post (tomorrow, probably?..). We will most likely continue with XDG.
Except that, to have a reason to return to this question, I will list the most striking problems of the XDK. First, the current build targets ASCII with support for Western European languages, so Cyrillic turns into gibberish (it will still be parsed, but debugging is inconvenient). Still, this problem seems minor to me. Second, the author defended his dissertation and left for industry; he no longer maintains the project, and no one else has picked it up, even though it is open source. Third, "getting into the project" is not so simple, because everything is written in the relatively obscure Mozart/Oz programming environment, in the constraint programming paradigm. It is hard to criticize the author for this: the task is very specific. I have already said that full parsing amounts to an NP-complete problem. The author treated parsing as a constraint satisfaction problem: the nodes of the graph are given, along with a set of conditions constraining the edges, and a graph satisfying the conditions must be found (a toy sketch of this formulation follows below). That is, the task was posed in a more general form, and consequently a specialized tool for this kind of problem was used. The right call. There is also something I personally find missing, but more on that next time.
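To give a flavor of that formulation, here is a toy brute-force sketch in Python (the real solver relies on constraint propagation in Mozart/Oz, not on enumeration): pick a (head, label) pair for every word and keep the assignments that satisfy the constraints:

    # Toy "parsing as constraint satisfaction" (not XDK code): enumerate
    # candidate head assignments and filter by constraints. A real solver
    # would also enforce treeness, agreement and order, and would prune
    # the search space by constraint propagation instead of brute force.
    from itertools import product

    WORDS = ["Peter", "eats", "spaghetti", "today", "."]
    CANDS = {  # candidate (head, label) pairs; "." is the top, no head
        "Peter": [("eats", "subj"), ("eats", "obj")],
        "eats": [(".", "root")],
        "spaghetti": [("eats", "subj"), ("eats", "obj")],
        "today": [("eats", "adv")],
        ".": [(None, None)],
    }

    def valency_ok(assignment):
        # "eats" must receive exactly one subj and exactly one obj
        labels = [lab for (head, lab) in assignment.values() if head == "eats"]
        return labels.count("subj") == 1 and labels.count("obj") == 1

    for choice in product(*(CANDS[w] for w in WORDS)):
        assignment = dict(zip(WORDS, choice))
        if valency_ok(assignment):
            # prints two mirror solutions; the order principle would
            # rule out the one with "spaghetti" as subject
            print(assignment)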