CoruNethron December 12, 2012 at 13:42

Creating a frequency dictionary based on an analysis of a library of fiction

From the sandbox

Hello everyone.

Recently, for polishing a morphological dictionary capable of (presumably) generating all possible word forms from an infinitive, I needed a fairly voluminous frequency dictionary of the Russian language. The frequency dictionary is a very simple thing, the words in it are ordered by the frequency with which they occur in the analyzed text.

The task, it would seem very simple and is probably solved when studying programming in the forefront. But for the analysis of such a bulky library, and the library I use extends over 157 hectares, the resources of an average home computer are tight enough. More precisely, the library is stored in fifty zip archives of 0.5 - 2 GB, each of which contains 20-30 thousand works in fb2 format. In a compressed form, it weighs 60 GB.

The problem was solved in c #. The result must be obtained in adequate time, for which I chose no more than 8 hours, so that I could start the performance in the evening and get the result in the morning.

Search for a solution

Array

The most obvious method of solving what is called “forehead” is two arrays, in the first - words, in the second - a number. Having met a new word, add it to the first array if it is absent there or add one to the index in the second array if we find the word in the first. Having tried this option, I immediately became disappointed in it, after a few hours the program creaked along the first half of the first archive. Any professional programmer probably already laughs at this approach. However, I confess - I did not even imagine that a catch was waiting for me.

Then I began to search - where is the very bottleneck that prevents the program from breathing. The bottleneck was at the time of adding a new word. To keep the array orderly - the word has to be inserted somewhere in the middle, and sometimes at the very beginning. To do this, you have to shift each element of the array to the right of the selected position to the right. When it comes to millions of words, this activity becomes very tiring for the processor and it takes revenge, postponing the completion of the program for weeks to come.

Ordered list

I had to look for such a data structure that would add each element simply to the end of that structure, but allow them to be ordered at the same time. Then I came across ordered lists. A cell of an ordered list stores a piece of data itself and a pointer to another cell. Everything is excellent, we just add a new word and change 2 indexes, the index of the previous word indicating this and the index of this word indicating the following. But, this is bad luck, a search in such a list is very computationally demanding. If in an ordered array we could start the search from the middle and divide it in half in one iteration, then in the ordered list we have to climb along the pointers from one to the other, like a thread, until we find the right place in the whole ball. I didn’t try this option,

Binary search tree

I found the following data structure after only a few hours. I came across binary search trees. After reading a little about the various options, I settled on a self-balancing AVL tree, created, by the way, by Soviet scientists Adelson-Welsky Georgy Maksimovich and Landis Evgeny Mikhailovich, and inheriting their name from their surnames.

The structure of the binary tree is similar to an ordered list, but each element, except for a few leaf (so-called leaves) refers to two elements. The root element is the one in the middle of the ordered array. It refers to the element smaller (left) and larger (right), in addition - it is guaranteed that all elements of the left branch will be less than this, and all elements of the right branch will be more. Having examined the left or right element, to which we have come - we will see the same hierarchy, it also refers to two elements, the left branch is also smaller, the right one is larger. To understand how to balance such a tree, it is best to read the corresponding article, for example here: AVL-trees. It is important to note that the binary search tree fully met my requirements. Quick search and quick addition of a new item.

Result

As a result, after spending several more hours on optimization, I got a multi-threaded application that unpacks the archive into RAM, counts the frequency of various words and processes the data using the AVL tree.

Here are a few lines on the results of the program, on the left is the word, on the right is the frequency. These are the most common words of various lengths:

i 34361402
a 36192092
from 52479822
in 126422246
and 158458227

...

by 23978868
he is 32602346
then 42163693
at 83001625
no 97183097

...

all 19576264
this is 21318968
him 27719894
like 30228770
that 63106338

...

even 6789652
was 6832494
if 7750404
me 12381969
was 15450767

...

may 5455561
very 5,477,013
time 6317279
when 9788599
to 9987225

...

hate 296
highly qualified 350
excellence 384
excellencies 489
excellence 3739

...

hate 46
hate 52
private business 60
excellence 91
National Socialist 96

In total, 9.5 million words were received, the analysis lasted 8482 seconds, the average processing speed of the unpacked text was 18.57 mb / s.

Now you can use the data to refine my morphological vocabulary, and having obtained the vocabulary it will be possible to improve the “frequency analyzer”, because it will be possible to group similar root words. In addition, the work requires a "frequency analyzer" with various encodings, etc. Next is parsing the sentence. In the end, I want to get some sort of adequate semantic network.

Despite the fact that my “report” covers both the topic of programming and linguistics - I ask you not to blame me for inaccuracies in writing (especially punctuation) or not the smoothest solution to the problem, but I ask you to point out these errors or suggest more elegant solutions, I I will be glad.

Thanks to all.

Tags: