mefrill October 27, 2014 at 16:18

Strutext Library for C ++ Text Processing

Introduction

This text can be considered as an overview of the Strutext library, which was conceived by the author as a set of effective algorithms for linguistic text processing in C ++. The library code is in the repository on Github . The library is open source and comes under the Apache License 2.0, i.e. can be used completely free of charge without any significant restrictions.

The last commit to the repository was made on February 16, 2014, in other words, the library has not been developed for more than six months. But, nevertheless, it seems to the author that the library already has enough interesting implemented algorithms to be useful to those developers who are involved in processing texts in natural languages.

Build Strutext implemented based on cmake. Almost all of the library code is implemented in C ++ 2003, but there is a small script written in Python version 2.7. The library uses boost version 1.53 and higher, as well as the Log4Cplus library as a tool for logging. At the moment, the library is built under Linux using the gcc compiler, but it seems to the author that it is quite easy to make a build under Windows.

A legitimate question will probably arise: why such a library is needed, if there is, for example, ICU? From the point of view of the author, some components of the library are not implemented efficiently enough. The implementation style is based on low-level C or high-level Java. This circumstance, from the author’s point of view, does not allow the convenient and efficient use of the ICU library in C ++. In addition, the ICU does not implement such an important component as morphology. Also, one of the main advantages of the Strutext library is the availability of effective text search algorithms based on the implemented finite state machine library. In general, ICU implements only one component, character processing in UNICODE, and in this sense, the Strutext library provides more features.

Library structure

Strutext is conceived as a tool for processing texts in natural languages at various levels of representation of these languages. Typically, the following levels of language representation are distinguished:

Character level. At this level, the text of the document is considered simply as a sequence of characters. Symbols must be classified in some way, i.e. I would like to distinguish letters from numbers, punctuation marks from spaces, etc. Also, I would like to have such a classification for as many languages as possible.
Lexical level. A sequence of characters is divided into words. Words, in turn, can be classified in some way. The class of many words is also called the lexical type. In the classical tradition, the lexical type is the word (for example, mother), and the words themselves are called word forms (for the word mother it will be: mother, mother, mothers, etc.). A set of word forms of a word is called a paradigm, and the word itself is called a token.
The syntactic level. At this level, the text is divided into sentences and the relationships between the words in the sentence are examined. Communications are basically hierarchical, i.e. appear as a tree, but there are also more complex and intricate relationships.
Semantic level. This is the level of selection of semantic constructions, which are built on the basis of syntactic structures extracted from the text of the document.

There are also other levels of presentation of the language, for example, pragmatic, but this will not be discussed here.

The Strutext library currently implements the symbolic and lexical levels of language representation. The main components of the library:

symbols - implementation of the symbolic level of language representation based on UNICODE32.
encode - a set of iterators for transcoding text.
automata is an implementation of a state machine library designed to search for character sequences in text.
morpho - a morphological analysis library for Russian and English.

The description of each component will take up a lot of space. Therefore, in this text, only the symbols component will be described below. If the audience expresses interest in the library, then the author will gladly continue the description in other articles. The component description can also be considered as an introduction to the corresponding programming area, i.e. manual, how to programmatically implement this component of word processing. In this sense, the author did not try to limit himself to a simple description of the library interface, but also set out the ideas and implementation methods that underlie it.

Symbols processing library

A bit about UNICODE

As you know, the UNICODE consortium was created to develop a standard for representing the symbols of world languages in computer memory. Since its inception, the consortium has already released seven versions of such a presentation. In the machine’s memory, characters can be encoded in various ways, depending on the purpose of use. For example, to represent characters in files, when it is important to save on size, the UTF-8 representation is used. If you do not need to use specific language features, then you can use the double-byte representation of UTF-16. For a complete view of the entire gamut of the UNICODE specification, it is better to use the four-byte UTF-32 representations.

Characters (letters) in the UNICODE standard are divided into classes. There are relatively few classes, we list some of them:

Lu is an uppercase letter.
Ll is a lowercase letter.
Nd is the decimal digit.
Zs - space used as a delimiter
Po - punctuation mark without additional specification
Cc is the control character.

One of the functions of the symbols library is to efficiently map the symbol code in UTF-32 to its class. To implement this functionality, it is convenient to use the UnicodeData.txt file . This file contains an enumeration of all the characters of the standard, together with their four-byte codes (in hexadecimal representation) and classes. The file is intended for processing by the machine, but is understandable to humans. For example, a space character is specified by a line of the form:

0020;SPACE;Zs;0;WS;;;;;N;;;;;

In the file, the ';' character is used as field separators. Accordingly, decimal digits are set as follows:

0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
0032;DIGIT TWO;Nd;0;EN;;2;2;2;N;;;;;
0033;DIGIT THREE;Nd;0;EN;;3;3;3;N;;;;;
0034;DIGIT FOUR;Nd;0;EN;;4;4;4;N;;;;;
0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;
0036;DIGIT SIX;Nd;0;EN;;6;6;6;N;;;;;
0037;DIGIT SEVEN;Nd;0;EN;;7;7;7;N;;;;;
0038;DIGIT EIGHT;Nd;0;EN;;8;8;8;N;;;;;
0039;DIGIT NINE;Nd;0;EN;;9;9;9;N;;;;;

We also list several definitions of letters:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042
0063;LATIN SMALL LETTER C;Ll;0;L;;;;;N;;;0043;;0043

It is interesting to note that for letters with classes Lu and Ll, codes for the corresponding lower (upper) letters are also set. This makes it possible to implement register conversion functions in the library.

Symbols library implementation

The symbols library implements the definition of UNICODE character classes, as well as conversion to upper or lower case.

To define classes, use the SymbolClass enumerator:

enum SymbolClass {
  UPPERCASE_LETTER      = 0x00000001,
  LOWERCASE_LETTER      = 0x00000002,
  TITLECASE_LETTER      = 0x00000004,
  CASED_LETTER          = UPPERCASE_LETTER | LOWERCASE_LETTER | TITLECASE_LETTER,
  MODIFIER_LETTER       = 0x00000008,
  OTHER_LETTER          = 0x00000010,
  LETTER                = CASED_LETTER | MODIFIER_LETTER | OTHER_LETTER,
  NONSPACING_MARK       = 0x00000020,
...

Enumerator elements are specified by powers of two, so they can be used as bit flags. In the library implementation, a four-byte value is set for each character in which the bits corresponding to its classes are set to units. Then, checking for a symbol to belong to a particular class is simply the value of the corresponding bit:

template
inline bool Is(const SymbolCode& code)  {
  return static_cast(class_name) & GetSymbolClass(code);
}

For the most commonly used classes, the corresponding functions are implemented:

inline bool IsLetter(const SymbolCode& code) {
  return Is(code);
}
inline bool IsNumber(const SymbolCode& code) {
  return Is(code);
}
inline bool IsPunctuation(const SymbolCode& code) {
  return Is(code);
}

To convert to upper / lower case, the ToLower and ToUpper functions are used. It should be noted that these functions can be applied not only to letters, but also to any other characters. For a character to which the concept of register is not applicable, the function returns the same code that was transmitted at the input.

Technically, all this is implemented quite efficiently. At the configuration stage, it runs a script written in Python, which reads the UnicodeData.txt file and generates a symbols.cpp file in which three arrays are defined for the character classes, upper and lower case. These arrays are declared as follows:

namespace details {
// The symbol tables declarations.
extern uint32_t    SYM_CLASS_TABLE[];
extern SymbolCode  SYM_UPPER_TABLE[];
extern SymbolCode  SYM_LOWER_TABLE[];
}  // namespace details.

The conversion functions to upper and lower case are set simply:

inline SymbolCode ToLower(const SymbolCode& code) {
  return details::SYM_LOWER_TABLE[code];
}
inline SymbolCode ToUpper(const SymbolCode& code) {
  return details::SYM_UPPER_TABLE[code];
}

To obtain a set of character classes, the following function is used:

inline const uint32_t& GetSymbolClass(const SymbolCode& code) {
  return details::SYM_CLASS_TABLE[code];
}

As you can see from the definition, the index in the array is the character code, so accessing the array element does not require any additional costs. The size of each array is limited by the number of characters defined in UnicodeData.txt. At the moment, this number is 0x200000, i.e. each array occupies 8 MB in memory, and all together - 24 MB. This seems like a small fee for efficiency.

Of course, in files, characters are almost never stored in UTF-32, it is inefficient to use 4 bytes to store a character code that fits in one byte. For storage, single-byte encodings are used, which came from the “pre-unicode world” (CP1251, Koi8-r, etc.), as well as UTF-8 encoding, specially designed for efficient storage of characters in files. The Strutext library provides powerful tools for analyzing character codes in UTF-8. The encode component does this.

Afterword

One of the main motivations for both writing text and developing the Strutext library was the author’s desire to share his experience in developing C ++ word processing applications with other developers. The library code can be considered as the material embodiment of the author’s professional experience (the author hopes that this code does not exhaust all his experience).

Of course, the verb “share” implies two sides: the one that shares, and the one that wants to accept this division. If it’s clear to everyone that there is no problem with the first side, then the presence of the second side is meant to be discovered, including by this publication. If there will be responses to the text, then the author is ready to work hard and describe the other components of the Strutext library.

Tags: