
Strutext, a C++ word-processing library: implementation of the lexical level
Basic principles
This text continues an earlier post on the Strutext C++ text processing library. Here we describe the implementation of the lexical level of language representation, in particular the implementation of morphology.
In the author's view, a software model of the lexical level of language representation has to solve the following main tasks:
- Extracting meaningful chains of characters from the source text, i.e., representing the text as a sequence of words.
- Identifying the extracted chains as elements of lexical types.
- Determining the lexical attributes of an extracted chain (more on them below).
Lexical types are usually represented as finite sets of character chains that have the same meaning in sentences of a language. The elements of a lexical type are called word forms, the set of word forms is called a paradigm, and the lexical type itself is called a word or lemma. For example, the lexical type "мама" (mom) consists of the word forms {мама, мамы, маме, маму, мамой, ...}.
Lexical types are grouped into syntactic categories, or parts of speech. A part of speech defines the role that a word plays in a sentence. This role is important for determining the correct place of a word in a sentence and, therefore, the meaning of the sentence. Well-known parts of speech in Russian are the noun, adjective, verb, adverb, etc.
Word forms of a lexical type can have properties. Such properties are also called lexical features or lexical attributes. Which properties are relevant depends on the syntactic category the lexical type belongs to. For example, case plays an important role for nouns, whereas verbs do not have this attribute.
Which syntactic categories are used to group lexical types and which lexical attributes they carry depend both on the language and on the concrete model of lexical analysis being implemented. The lexical model of AOT is considered below.
Lexical ambiguities
Ambiguities may arise while extracting words from the source text. Two kinds of ambiguity are considered here:
- Ambiguities of the first kind arise when assigning a lexical type to a character chain extracted from the text. Consider the Russian sentence "мама мыла раму" ("mom washed the frame"). Here the chain "мыла" can be a form of the verb "мыть" (to wash), but it can also be a form of the noun "мыло" (soap). Such cases of ambiguity are also called lexical homonymy.
- Ambiguities of the second kind arise when cutting the source text into chains of words. In most natural languages words are separated by spaces, although this principle can be violated (German composites are one example). Programming languages offer interesting examples as well. Consider an expression of the form "a >> b" in C++. In classic C this expression is tokenized unambiguously: identifier "a", right-shift operator ">>", identifier "b". But in recent versions of C++ this expression can also close a template parameter list when a template itself is the last parameter in the list; in that case the sequence of tokens is: identifier "a", closing of the inner template parameter list ">", closing of the outer template parameter list ">", identifier "b".
In this text we consider only lexical ambiguities of the first kind.
Morphological model of the AOT dictionary
The Strutext library implements the morphological model of AOT, so we devote some space to describing it here.
In the AOT dictionary, each lexical type is defined by two parameters (a schematic illustration follows the list):
- The stem string, to which suffixes are appended to form word forms.
- The number of the declension paradigm, which is a list of pairs (suffix, set of lexical attributes).
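Schematically, such a dictionary entry can be imagined as follows (this is only an illustration for the reader, not the library's actual data structures; all names here are made up):

#include <string>
#include <utility>
#include <vector>

// Illustration only: a possible in-memory view of an AOT-style dictionary entry.
// One paradigm is a list of (suffix, attribute set code) pairs.
typedef std::vector<std::pair<std::string, std::string> > Paradigm;

// A lexical type: the stem plus the index of its declension paradigm.
struct LexicalType {
  std::string stem;       // for example, "ЗЕЛЕН"
  size_t paradigm_index;  // for example, 15
};

// A word form is produced by concatenation: stem + suffix.
inline std::string MakeForm(const LexicalType& lt, const Paradigm& p, size_t i) {
  return lt.stem + p[i].first;
}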
There are relatively few distinct combinations of lexical attribute sets; they are listed in a special file, and each combination is encoded with a two-letter code. For instance:
аа A С мр,ед,им
аб A С мр,ед,рд
Эф A С мр,ед,рд,2
...
иа Y П прев,мр,ед,им,од,но
иб Y П прев,мр,ед,рд,од,но
...
кб a Г дст,нст,1л,ед
кв a Г дст,нст,1л,мн
кг a Г дст,нст,2л,ед
...
Here the first element of each line is the two-letter code of the attribute set, the third element is the part-of-speech code (С for noun, П for adjective, Г for verb, etc.), followed by a comma-separated list of grammatical attribute codes.
The dictionary description file consists of five sections, two of which are the most important: the section describing declension paradigms and the section of stems (lexical types). Each line in the paradigm section represents one declension paradigm. In the section of lexical types, each line contains a stem together with the line number of its declension paradigm.
Consider, for example, the word "зелёнка" (zelenka). The lexical type of this word is defined in the AOT dictionary by a line of the form
ЗЕЛЕН 15 12 1 Фа -
Here the number 15 is the number of the declension paradigm in the paradigm section. The line of this paradigm looks like this:
%КА*га%КИ*гб%КЕ*гв%КУ*гг%КОЙ*гд%КОЮ*гд%КЕ*ге%КИ*гж%ОК*гз%КАМ*ги%КИ*гй%КАМИ*гк%КАХ*гл
Pairs within the paradigm are delimited by the symbol "%", and the elements of each pair are separated by the symbol "*". The first pair (КА, га) defines the word form ЗЕЛЕН + КА = ЗЕЛЕНКА with the attribute set га = С жр,ед,им, i.e. noun, feminine, singular, nominative. The other pairs of the paradigm are deciphered in the same way.
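For illustration, here is a small sketch (not part of the library) that parses such a paradigm line and prints the word forms it produces for the stem ЗЕЛЕН; the paradigm string is shortened for brevity:

#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Split a paradigm line of the form "%SUFFIX*code%SUFFIX*code..." into (suffix, code) pairs.
std::vector<std::pair<std::string, std::string> > ParseParadigm(const std::string& line) {
  std::vector<std::pair<std::string, std::string> > result;
  size_t pos = 0;
  while ((pos = line.find('%', pos)) != std::string::npos) {
    size_t star = line.find('*', pos + 1);
    size_t next = line.find('%', star + 1);
    size_t code_end = (next == std::string::npos) ? line.size() : next;
    result.push_back(std::make_pair(line.substr(pos + 1, star - pos - 1),
                                    line.substr(star + 1, code_end - star - 1)));
    pos = code_end;
  }
  return result;
}

int main() {
  std::string stem = "ЗЕЛЕН";
  std::string paradigm = "%КА*га%КИ*гб%КЕ*гв%КУ*гг%КОЙ*гд";
  std::vector<std::pair<std::string, std::string> > pairs = ParseParadigm(paradigm);
  for (size_t i = 0; i < pairs.size(); ++i) {
    // Prints ЗЕЛЕНКА : га, ЗЕЛЕНКИ : гб, and so on.
    std::cout << stem + pairs[i].first << " : " << pairs[i].second << "\n";
  }
  return 0;
}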
The word encoding method used in AOT has its advantages and disadvantages. We will not discuss them here; we only note an interesting fact: the dictionary contains lexical types with an empty stem. For example, the plural of the word "человек" (person) is represented by the word form "люди" (people), which has no common stem with "человек". Therefore this word has to be defined by simply enumerating its word forms:
%ЧЕЛОВЕК*аа%ЧЕЛОВЕКА*аб%ЧЕЛОВЕКУ*ав%ЧЕЛОВЕКА*аг%ЧЕЛОВЕКОМ*ад%ЧЕЛОВЕКЕ*ае%ЛЮДИ*аж%ЛЮДЕЙ*аз%ЧЕЛОВЕК*аз%ЛЮДЯМ*аи%ЧЕЛОВЕКАМ*аи%ЛЮДЕЙ*ай%ЛЮДЬМИ*ак%ЧЕЛОВЕКАМИ*ак%ЛЮДЯХ*ал%ЧЕЛОВЕКАХ*ал
This paradigm can also be used for other words (with a non-empty stem), such as "богочеловек" (God-man) and "обезьяночеловек" (ape-man).
Let us now consider in more detail the set of syntactic categories of the AOT dictionary and their lexical attributes.
AOT syntactic categories
As mentioned above, the syntactic categories of the AOT dictionary are defined in a separate file as a set of lines, each of which maps a two-letter code to a part of speech and a set of lexical attributes. In the Strutext library, parts of speech and their attributes are represented as a hierarchy of C++ classes. Let us consider this implementation in more detail.
The models of the syntactic categories of the AOT dictionary are defined in the morpho/models directory; models for Russian and English are provided. Below are some fragments of the file morpho/models/rus_model.h, which describes the Russian language model.
The base class of all models is the abstract class PartOfSpeech, which contains the language tag as an enumeration and declares a virtual method returning this tag:
class PartOfSpeech : private boost::noncopyable {
public:
  /// Type of smart pointer to the class object.
  typedef boost::shared_ptr<PartOfSpeech> Ptr;

  /// Language tag definitions.
  enum LanguageTag {
    UNKNOWN_LANG = 0   ///< Unknown language.
    , RUSSIAN_LANG = 1 ///< Russian language.
    , ENGLISH_LANG = 2 ///< English language.
  };

  /// Language tag.
  virtual LanguageTag GetLangTag() const = 0;

  /// Virtual destructor for the abstract class.
  virtual ~PartOfSpeech() {}
};
The base class of all syntactic categories of the Russian language inherits from this class:
struct RussianPos : public PartOfSpeech {
  /// Type of smart pointer to the class object.
  typedef boost::shared_ptr<RussianPos> Ptr;

  /// Possible parts of speech.
  enum PosTag {
    UNKNOWN_PS = 0                 ///< Unknown part of speech.
    , NOUN_PS = 1                  ///< Noun.
    , ADJECTIVE_PS = 2             ///< Adjective.
    , PRONOUN_NOUN_PS = 3          ///< Pronoun used as a noun.
    , VERB_PS = 4                  ///< Verb in a personal (finite) form.
    , PARTICIPLE_PS = 5            ///< Participle.
    , ADVERB_PARTICIPLE_PS = 6     ///< Adverbial participle.
    , PRONOUN_PREDICATIVE_PS = 7   ///< Predicative pronoun.
    , PRONOUN_ADJECTIVE_PS = 8     ///< Pronominal adjective.
    , NUMERAL_QUANTITATIVE_PS = 9  ///< Cardinal numeral.
    , NUMERAL_ORDINAL_PS = 10      ///< Ordinal numeral.
    , ADVERB_PS = 11               ///< Adverb.
    , PREDICATE_PS = 12            ///< Predicative.
    , PREPOSITION_PS = 13          ///< Preposition.
    , CONJUCTION_PS = 14           ///< Conjunction.
    , INTERJECTION_PS = 15         ///< Interjection.
    , PARTICLE_PS = 16             ///< Particle.
    , INTRODUCTORY_WORD_PS = 17    ///< Introductory (parenthetical) word.
    , UP_BOUND_PS
  };

  /// Number.
  enum Number {
    UNKNOUN_NUMBER = 0       ///< Unknown number.
    , SINGULAR_NUMBER = 0x01 ///< Singular.
    , PLURAL_NUMBER = 0x02   ///< Plural.
  };

  /// Language type.
  enum Lang {
    NORMAL_LANG = 0     ///< Normal language.
    , SLANG_LANG = 1    ///< Slang.
    , ARCHAIZM_LANG = 2 ///< Archaism.
    , INFORMAL_LANG = 3 ///< Informal.
  };

  /// Gender definitions.
  enum Gender {
    UNKNOWN_GENDER = 0        ///< Unknown gender value.
    , MASCULINE_GENDER = 0x01 ///< Masculine.
    , FEMININE_GENDER = 0x02  ///< Feminine.
    , NEUTER_GENDER = 0x04    ///< Neuter.
  };

  /// Case definition.
  enum Case {
    UNKNOWN_CASE = 0          ///< Unknown case.
    , NOMINATIVE_CASE = 1     ///< Nominative.
    , GENITIVE_CASE = 2       ///< Genitive.
    , GENITIVE2_CASE = 3      ///< Second genitive.
    , DATIVE_CASE = 4         ///< Dative.
    , ACCUSATIVE_CASE = 5     ///< Accusative.
    , INSTRUMENTAL_CASE = 6   ///< Instrumental.
    , PREPOSITIONAL_CASE = 7  ///< Prepositional.
    , PREPOSITIONAL2_CASE = 8 ///< Second prepositional.
    , VOCATIVE_CASE = 9       ///< Vocative.
  };

  /// Tense.
  enum Time {
    UNKNOWN_TIME = 0      ///< Unknown tense.
    , PRESENT_TIME = 0x01 ///< Present.
    , FUTURE_TIME = 0x02  ///< Future.
    , PAST_TIME = 0x04    ///< Past.
  };

  /// Person.
  enum Person {
    UNKNOWN_PERSON = 0     ///< Unknown person.
    , FIRST_PERSON = 0x01  ///< First.
    , SECOND_PERSON = 0x02 ///< Second.
    , THIRD_PERSON = 0x04  ///< Third.
  };

  /// Entity kind.
  enum Entity {
    UNKNOWN_ENTITY = 0        ///< Unknown entity, for ordinary words.
    , ABBREVIATION_ENTITY = 1 ///< Abbreviations.
    , FIRST_NAME_ENTITY = 2   ///< First names.
    , MIDDLE_NAME_ENTITY = 3  ///< Patronymics (middle names).
    , FAMILY_NAME_ENTITY = 4  ///< Family names.
  };

  /// Animacy.
  enum Animation {
    UNKNOWN_ANIMATION = 0        ///< Unknown animacy.
    , ANIMATE_ANIMATION = 0x01   ///< Animate.
    , INANIMATE_ANIMATION = 0x02 ///< Inanimate.
  };

  /// Voice definition.
  enum Voice {
    UNKNOWN_VOICE = 0      ///< Unknown voice.
    , ACTIVE_VOICE = 0x01  ///< Active voice.
    , PASSIVE_VOICE = 0x02 ///< Passive voice.
  };

  /// Language tag.
  LanguageTag GetLangTag() const { return RUSSIAN_LANG; }

  /// The class is an abstract one -- virtual destructor.
  virtual ~RussianPos() {}

  /// Get part of speech tag.
  virtual PosTag GetPosTag() const = 0;

  /// Serialization implementation.
  virtual void Serialize(uint32_t& out) const = 0;

  /// Deserialization implementation.
  virtual void Deserialize(const uint32_t& in) = 0;

  /// Write POS signature.
  static void WritePosSign(PosTag pos, uint32_t& out) {
    // Write to the lower 5 bits.
    out |= static_cast<uint32_t>(pos);
  }

  /// Read POS signature.
  static PosTag ReadPosSign(const uint32_t& in) {
    return PosTag(in & 0x1f);
  }
};
The class defines the labels of the syntactic categories in the PosTag enumeration, together with the common lexical attributes. Besides the grammatical part, the class declares the Serialize and Deserialize methods for converting to and from a binary format: every syntactic type is packed into four bytes, represented by the uint32_t type.
The RussianPos class is abstract; classes representing concrete syntactic categories are derived from it. For example, the Noun class defines a noun:
struct Noun : public RussianPos {
  Noun()
    : number_(UNKNOUN_NUMBER)
    , lang_(NORMAL_LANG)
    , gender_(UNKNOWN_GENDER)
    , case_(UNKNOWN_CASE)
    , entity_(UNKNOWN_ENTITY) {}

  /// Get part of speech tag.
  PosTag GetPosTag() const { return NOUN_PS; }

  /**
   * \brief Serialization implementation.
   *
   * Binary map of the object:
   *     13       3       4       3      2       2        5
   * -----------------------------------------------------------
   *  Unused | Entity | Case | Gender | Lang | Number | POS tag |
   * -----------------------------------------------------------
   *
   * \param[out] ob The buffer to write to.
   */
  void Serialize(uint32_t& ob) const {
    ob |= static_cast<uint32_t>(number_) << 5;
    ob |= static_cast<uint32_t>(lang_) << 7;
    ob |= static_cast<uint32_t>(gender_) << 9;
    ob |= static_cast<uint32_t>(case_) << 12;
    ob |= static_cast<uint32_t>(entity_) << 16;
  }

  /**
   * \brief Deserialization implementation.
   *
   * Binary map of the object:
   *     13       3       4       3      2       2        5
   * -----------------------------------------------------------
   *  Unused | Entity | Case | Gender | Lang | Number | POS tag |
   * -----------------------------------------------------------
   *
   * \param ib The buffer to read from.
   */
  void Deserialize(const uint32_t& ib) {
    number_ = static_cast<Number>((ib & 0x0060) >> 5);
    lang_   = static_cast<Lang>((ib & 0x0180) >> 7);
    gender_ = static_cast<Gender>((ib & 0x0e00) >> 9);
    case_   = static_cast<Case>((ib & 0xf000) >> 12);
    entity_ = static_cast<Entity>((ib & 0x070000) >> 16);
  }

  Number number_;
  Lang   lang_;
  Gender gender_;
  Case   case_;
  Entity entity_;
};
The Noun class stores the following lexical attributes: number, language type (normal, archaic, slang, etc.), gender, case, and the entity kind (first name, abbreviation, etc.).
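A minimal usage example of the classes shown above (assuming rus_model.h is included and the classes are visible; the attribute values are chosen arbitrarily for illustration):

#include <cassert>
#include <stdint.h>

// Pack a Noun into its 32-bit representation and restore it.
void NounRoundTrip() {
  Noun noun;
  noun.number_ = RussianPos::SINGULAR_NUMBER;
  noun.gender_ = RussianPos::FEMININE_GENDER;
  noun.case_   = RussianPos::NOMINATIVE_CASE;

  uint32_t buffer = 0;
  RussianPos::WritePosSign(noun.GetPosTag(), buffer);  // the lower 5 bits hold NOUN_PS.
  noun.Serialize(buffer);                              // the remaining attribute fields.

  // The part of speech can be read back without knowing the concrete class.
  assert(RussianPos::ReadPosSign(buffer) == RussianPos::NOUN_PS);

  // Restoring the attributes from the binary representation.
  Noun restored;
  restored.Deserialize(buffer);
  assert(restored.case_ == RussianPos::NOMINATIVE_CASE);
}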
State machines for encoding dictionaries
To store dictionaries and to extract words from a dictionary efficiently, the Strutext library uses finite state machines. They are defined by the corresponding C++ types in the automata directory.
Recall that a finite state machine is defined by a transition function that maps some pairs (state, symbol) to a state: delta: Q x V -> Q. There is one initial state, in which the machine starts its work, and some number of accepting states. The machine reads the input chain symbol by symbol; if the transition function defines a state for the current state and the symbol read, the machine "goes" into this new state, after which the cycle repeats for the next symbol. The machine stops in two cases: when there is no transition for the pair (current state, symbol read), and when the entire input chain has been read to the end. In the first case the input chain is rejected by the machine; in the second case the chain is accepted if the machine has stopped in an accepting state.
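This run cycle can be written down as a short sketch (not a piece of the library itself, although it uses the Go and IsAcceptable operations and the kStartState/kInvalidState constants that appear later in the analyzer code):

#include <string>

// Run the automaton over a chain: read it symbol by symbol and accept the chain
// only if the whole chain is read and the machine stops in an accepting state.
template <class Automaton>
bool Accepts(const Automaton& automaton, const std::string& chain) {
  strutext::automata::StateId state = strutext::automata::kStartState;
  for (size_t i = 0; i < chain.size(); ++i) {
    state = automaton.Go(state, chain[i]);            // delta(state, symbol).
    if (state == strutext::automata::kInvalidState) {
      return false;                                   // no transition: the chain is rejected.
    }
  }
  return automaton.IsAcceptable(state);
}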
Thus, every time a new symbol of the input chain is read, the automaton has to find the new state for the pair (state, symbol). In the Strutext library the implementation of this lookup is factored out into a separate class called Transition. An automaton is essentially an array of such transition-table objects, one per state (automata/fsm.h):
template <class TransImpl>
struct FiniteStateMachine {
  /// Type of transition table.
  typedef TransImpl Transitions;

  ...

  /// State definition.
  struct State {
    Transitions trans_;  ///< Transition table of the state.
    bool is_accepted_;   ///< Is the state accepting?

    /// Default initialization.
    explicit State(bool is_accepted = false) : is_accepted_(is_accepted) {}
  };

  /// Type of the state list.
  typedef std::vector<State> StateTable;

  ...

  StateTable states_; ///< The table of states.
};
Here, the TransImpl template parameter represents the transition function.
The Strutext library provides two implementations of the transition function. One (automata/flex_transitions.h) is based on an ordinary std::map, where the symbol code is the key and the target state number is the value. The other (automata/flat_transitions.h) is based on a flat array indexed by the possible symbol codes. Each array element contains a state number; the value zero is reserved for the invalid state, i.e. it means that there is no transition. A nonzero value defines the transition (array index = symbol code, state number stored in the cell).
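As an illustration of the first, map-based approach, the lookup could be organized roughly as follows (a sketch only; the actual class in automata/flex_transitions.h may differ in details and names):

#include <map>
#include <stdint.h>

// Sketch of a map-based transition table: symbol code -> target state number.
// Zero is used here as the invalid ("no transition") value.
template <typename Char>
struct MapTransitionsSketch {
  typedef std::map<Char, uint32_t> TransTable;

  // Return the target state for the symbol, or 0 if there is no transition.
  uint32_t Go(const Char& symbol) const {
    typename TransTable::const_iterator it = table_.find(symbol);
    return it == table_.end() ? 0 : it->second;
  }

  // Add (or overwrite) the transition over the symbol.
  void AddTransition(const Char& symbol, uint32_t to) {
    table_[symbol] = to;
  }

  TransTable table_;
};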
The FiniteStateMachine class can say nothing about an input chain except whether it is accepted. To store additional information about accepted chains, attributes have to be attached to the accepting states. This is done in the AttributeFsm class template. The class takes as template parameters the implementation of the transition function and the attribute type of an accepting state. Note that attributes can be attached not only to accepting states (although it is unclear whether that makes sense), and that more than one attribute can be attached to a state; all of them are stored in a vector.
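The AttributeFsm header itself is not quoted here; judging by the description above, its essence can be sketched roughly as follows (an illustration, not the actual automata code):

#include <map>
#include <vector>

// Sketch: the attribute part of an automaton whose states can carry attributes.
template <typename Attribute>
struct StateAttributesSketch {
  typedef std::vector<Attribute> AttributeList;

  // Attach one more attribute to the state (normally an accepting one).
  void AddAttribute(strutext::automata::StateId state, const Attribute& attr) {
    attrs_[state].push_back(attr);
  }

  // Return the attributes of the state; an empty list if there are none.
  const AttributeList& GetStateAttributes(strutext::automata::StateId state) const {
    static const AttributeList kEmpty;
    typename std::map<strutext::automata::StateId, AttributeList>::const_iterator it = attrs_.find(state);
    return it == attrs_.end() ? kEmpty : it->second;
  }

  std::map<strutext::automata::StateId, AttributeList> attrs_;
};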
Storing a dictionary in a state machine gives the transition function of that machine a tree structure. The term trie, described by D. Knuth, is also used for this structure. The Strutext library implements such a state machine in the file automata/trie.h:
template <class TransImpl, typename Attribute>
struct Trie : public AttributeFsm<TransImpl, Attribute> {
  /// Chain identifier type.
  typedef Attribute ChainId;

  /// Attribute FSM type.
  typedef AttributeFsm<TransImpl, Attribute> AttributeFsmImpl;

  /// Default initialization.
  explicit Trie(size_t rsize = AttributeFsmImpl::kReservedStateTableSize) : AttributeFsmImpl(rsize) {}

  /// It may be a base class.
  virtual ~Trie() {}

  /**
   * \brief Adding a chain of symbols.
   *
   * \param begin Iterator of the chain's begin.
   * \param end   Iterator of the chain's end.
   * \param id    Chain identifier.
   *
   * \return The number of the last state of the chain.
   */
  template <typename SymbolIterator>
  StateId AddChain(SymbolIterator begin, SymbolIterator end, const ChainId& id);

  /**
   * \brief Adding a chain of symbols.
   *
   * \param begin Iterator of the chain's begin.
   * \param end   Iterator of the chain's end.
   *
   * \return The number of the last state of the chain.
   */
  template <typename SymbolIterator>
  StateId AddChain(SymbolIterator begin, SymbolIterator end);

  /**
   * \brief Search for the passed chain in the trie.
   *
   * \param begin Iterator of the chain's begin.
   * \param end   Iterator of the chain's end.
   * \result The reference to the list of attributes of the chain, if any.
   */
  template <typename SymbolIterator>
  const typename AttributeFsmImpl::AttributeList& Search(SymbolIterator begin, SymbolIterator end) const;
};
The code shows that there are two main methods: AddChain and Search. The latter is noteworthy in that it returns a reference to the attribute vector, i.e. state attributes are not copied during the search. If the input chain is not found, the attribute vector is empty.
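A possible usage sketch of the trie (the concrete transition-table type and namespace are not shown in this text, so the name FlexTransitions and the template arguments below are assumptions; see automata/flex_transitions.h and automata/trie.h for the real ones):

#include <string>

// Sketch: store two chains with uint32_t identifiers and look one of them up.
void TrieUsageSketch() {
  typedef Trie<FlexTransitions, uint32_t> WordTrie;  // template arguments are an assumption.
  WordTrie trie;

  std::string cat = "cat";
  std::string cap = "cap";
  trie.AddChain(cat.begin(), cat.end(), 1);  // the chain "cat" gets identifier 1.
  trie.AddChain(cap.begin(), cap.end(), 2);  // the chain "cap" gets identifier 2.

  const WordTrie::AttributeList& found = trie.Search(cat.begin(), cat.end());
  // "found" now contains identifier 1; for an unknown chain the list would be empty.
}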
The Strutext library also implements the Aho-Corasick automaton for efficient search of dictionary elements in a text. The implementation is located in automata/aho_corasick.h. A description of the principles and methods of its implementation is beyond the scope of this text; we only note that the interface is quite simple to use, and that there is an iterator over the chains found in the text.
It should also be noted that all automata can be serialized to and deserialized from standard C++ streams. This makes it possible to store automata in files on disk, i.e. to use them as storage for dictionaries in binary form.
Morphological analyzer
The morphological analyzer is a library located in the morpho/morpholib directory. Its main interface class, Morphologist, is declared in the file morpho/morpholib/morpho.h.
Before describing the interface and implementation of the class, let us outline the basic principles the implementation rests on.
First, there is a dictionary of stems, implemented as an object of the Trie class.
Second, each stem's accepting state is assigned a declension paradigm (as before, this is a vector of pairs (suffix, set of lexical attributes); the attribute set is represented by an instance of a class derived from PartOfSpeech).
Third, each lexical type is given a unique numerical identifier: the number of its stem in the dictionary.
Thus, to recognize a passed word form as a word, one has to search the automaton for its stem (which yields the identifier of the lexical type corresponding to that stem) and then look up the corresponding attributes by the ending. All of this must be done with ambiguities in mind, both when searching for stems and when matching endings. The search code looks like this:
/**
 * \brief Implementation of morphological analysis of the passed form.
 *
 * \param text Input text in UTF-8 encoding.
 * \param[out] lem_list List of lemmas with morphological attributes.
 */
void Analize(const std::string& text, LemList& lem_list) const {
  // The first phase. Go through the passed word text, encode each symbol
  // and remember the symbol codes in a string. If a word base is found at
  // some position, remember the attribute and the position for each
  // attribute.
  // Check for lexical types with an empty base first.
  typedef std::list<std::pair<uint64_t, size_t> > BaseList;  // (base attribute, base end position).
  BaseList base_list;
  strutext::automata::StateId state = strutext::automata::kStartState;
  if (bases_trie_.IsAcceptable(state)) {
    const typename Trie::AttributeList& attrs = bases_trie_.GetStateAttributes(state);
    for (size_t i = 0; i < attrs.size(); ++i) {
      base_list.push_back(std::make_pair(attrs[i], 0));
    }
  }

  // Perform the first phase.
  std::string code_str;
  typedef strutext::encode::Utf8Iterator<std::string::const_iterator> Utf8Iterator;
  for (Utf8Iterator sym_it(text.begin(), text.end()); sym_it != Utf8Iterator(); ++sym_it) {
    Code c = alphabet_.Encode(*sym_it);
    code_str += c;
    if (state != strutext::automata::kInvalidState) {
      state = bases_trie_.Go(state, c);
      if (bases_trie_.IsAcceptable(state)) {
        const typename Trie::AttributeList& attrs = bases_trie_.GetStateAttributes(state);
        for (size_t i = 0; i < attrs.size(); ++i) {
          base_list.push_back(std::make_pair(attrs[i], code_str.size()));
        }
      }
    }
  }

  // The second phase. Go through the found base list and find suffixes for them.
  // If suffixes have been found, add them to the lemma list.
  lem_list.clear();
  for (BaseList::iterator base_it = base_list.begin(); base_it != base_list.end(); ++base_it) {
    AttrMap attr;
    attr.auto_attr_ = base_it->first;
    std::string suffix = code_str.substr(base_it->second);
    // If the suffix is empty (an empty suffix is passed), add a zero symbol to it.
    if (suffix.empty()) {
      suffix.push_back('\0');
    }
    if (const SuffixStorage::AttrList* att_list = suff_store_.SearchAttrs(attr.line_id_, suffix)) {
      for (size_t i = 0; i < att_list->size(); ++i) {
        lem_list.push_back(Lemma(attr.lem_id_, (*att_list)[i]));
      }
    }
  }
}
As can be seen, the analysis algorithm is divided into two stages. First, the stems are extracted (here the possible existence of empty stems has to be taken into account). For each stem its end position in the input chain is remembered, so that the ending can be extracted later. At the second stage, endings matching the extracted stems are searched for. If an ending is found in the declension paradigm corresponding to a given stem, the lexical attributes of this ending are returned together with the identifier of the word.
The Morphologist class also provides a service for generating word forms from the lemma identifier and the passed lexical attributes. This is done by the Generate method:
/**
* \brief Generate form.
*
* \param lem_id The lemma identifier.
* \param attrs The attributes of the form.
* \return Generated text in UTF-8 encoding.
*/
std::string Generate(uint32_t lem_id, uint32_t attrs) const;
There is also a GenAllForms method that generates all forms of a given word, and a GenMainForm method that returns the main form of a word; for a noun this is, obviously, the nominative singular form.
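A usage sketch of the analyzer interface (how a Morphologist object is created and loaded from a binary dictionary is not shown in this text, and the exact contents of LemList elements are an assumption, so the lemma identifier and attributes are only indicated in comments below):

#include <string>

// Sketch: analyze a word form and then generate a form of the same lemma.
void MorphoUsageSketch(const Morphologist& morpher) {
  Morphologist::LemList lemmas;        // LemList appears in the Analize signature above;
                                       // its exact location is an assumption.
  morpher.Analize("зеленкой", lemmas); // UTF-8 input, as in the Analize signature.

  // Each element of "lemmas" carries a lemma identifier and a packed attribute value.
  // Feeding them back produces the text of a word form, for example:
  //   std::string form = morpher.Generate(lem_id, attrs);
  //   std::string main_form = morpher.GenMainForm(lem_id);  // signature is an assumption.
}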
The main.cpp file in the morpho/aot directory implements a parser of the AOT dictionary in its original format; it produces a binary representation compatible with the morphology library. The resulting binary dictionary can be used by the Morphologist class. The binary dictionaries themselves are not stored in the repository, but the user can generate them when necessary. The Russian dictionary, for example, can be built with the following command:
./Release/bin/aot-parser -t ../morpho/aot/rus_tabs.txt -d ../morpho/aot/rus_morphs.txt -m rus -b aot-rus.bin
In binary form the Russian dictionary takes slightly less than 20 MB.
To extract word forms from source text, one can use the WordIterator class defined in utility/word_iterator.h. This class treats words as sequences of letter symbols (symbols::IsLetter). The iterator returns each word as a Unicode string; it can be transcoded to UTF-8 with the GetUtf8Sequence function defined in encode/utf8_generator.h.
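A rough sketch of this pipeline (the exact WordIterator and GetUtf8Sequence signatures are not quoted in this text, so the construction and the calls below are assumptions; consult utility/word_iterator.h and encode/utf8_generator.h):

#include <iterator>
#include <string>

// Sketch: split UTF-8 text into words and convert each word back to UTF-8.
void IterateWordsSketch(const std::string& text) {
  typedef WordIterator<std::string::const_iterator> WordIt;  // assumed template signature.
  for (WordIt it(text.begin(), text.end()), end; it != end; ++it) {
    // *it is assumed to be the current word as a Unicode symbol sequence.
    std::string word_utf8;
    GetUtf8Sequence((*it).begin(), (*it).end(), std::back_inserter(word_utf8));  // assumed signature.
    // word_utf8 can now be passed, for example, to Morphologist::Analize.
  }
}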
Afterword
The text has turned out to be quite long and probably not easy to read. The author tried to simplify the presentation wherever possible, but given the complexity of the material there were apparently not many such places.
Nevertheless, the author hopes that the Strutext library described here will prove useful and that the work on its implementation has not been in vain.