How to collect bigrams for a corpus of any size on a home computer

In modern computational linguistics, bigrams, and n-grams in general, are an important statistical tool. In this article we will describe the difficulties you may run into when computing bigrams over a large corpus of texts, and present an algorithm that can be used on any home computer.

A bigram is two words that are adjacent in a text or, in our case, in a corpus of texts. For example, in the sentence:

The summer was hot .
1 2 3 4 5 (the space before the period is not a typo or a mistake)

the bigrams are:

  • The summer
  • summer was
  • was hot
  • hot .

Strictly speaking, bigrams consist not of words but of tokens. A token, i.e. an indivisible unit within our task, is either a word or a punctuation mark. Tokenization is not a simple topic in itself, so we will assume that our corpus is already tokenized and split into sentences, and that to turn a sentence into a list of tokens it is enough to split it on spaces.
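
For illustration, a minimal sketch of extracting bigrams from such a pre-tokenized sentence (the class and method names are our own):

import java.util.ArrayList;
import java.util.List;

public class Bigrams {
    // A sentence is a whitespace-separated string of tokens, e.g. "The summer was hot ."
    static List<String> extractBigrams(String sentence) {
        String[] tokens = sentence.split(" ");
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            bigrams.add(tokens[i] + " " + tokens[i + 1]);
        }
        return bigrams;
    }

    public static void main(String[] args) {
        // prints [The summer, summer was, was hot, hot .]
        System.out.println(extractBigrams("The summer was hot ."));
    }
}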

Our task will be to obtain a list like this:

  • The summer 190360
  • summer was 1017
  • was hot 2621
  • hot . 42146

where the number shows how many times the bigram occurs in the whole corpus.

Occasionally in the text we will allow ourselves to use the term word pair as a synonym for the word bigram.

In this article we intentionally omit almost all implementation details: the approach is interesting in itself, and it is not difficult to implement in your favorite programming language. Besides, the implementation contains enough interesting details to be worth a separate article. The few explanatory snippets that do appear are given in Java.

Naive approach


The naive solution looks like this:

  • walk through the corpus
  • extract the bigrams from each sentence
  • count them using a multiset data structure (in Java this is Guava's Multiset, or ConcurrentHashMultiset in the multi-threaded version)

On a relatively small and clean corpus this might work, but in the general case the naive approach will exhaust your memory before the whole collection of texts has been processed.
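
A minimal sketch of this naive counter, assuming the corpus arrives as a stream of already tokenized sentences (Multiset and ConcurrentHashMultiset here are Guava classes; everything else is our own illustration):

import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;
import java.util.stream.Stream;

public class NaiveCounter {
    // The counter grows without bound - this is what eventually exhausts memory.
    static Multiset<String> countBigrams(Stream<String> sentences) {
        ConcurrentHashMultiset<String> counter = ConcurrentHashMultiset.create();
        sentences.parallel().forEach(sentence -> {
            String[] tokens = sentence.split(" ");
            for (int i = 0; i + 1 < tokens.length; i++) {
                counter.add(tokens[i] + " " + tokens[i + 1]);
            }
        });
        return counter;
    }
}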

Add intermediate cutoffs


It is very easy to turn the naive approach into a working one by modifying it slightly:

  • walk through the corpus
  • extract the bigrams from each sentence
  • count them using a multiset
  • as soon as we see that memory is running low, clean up the counter, deleting the bigrams that have so far occurred fewer than a threshold number of times (one possible implementation of this cutoff is sketched just below)
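
One possible implementation of such a cutoff; the free-heap check and the threshold value are arbitrary choices made for the illustration:

import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;

public class Cutoff {
    static final long MIN_FREE_HEAP = 512L * 1024 * 1024; // when to prune, pick to taste
    static final int THRESHOLD = 3;                        // minimum count to keep a bigram

    // Drop rare bigrams from the counter when free heap runs low.
    static void pruneIfNeeded(ConcurrentHashMultiset<String> counter) {
        Runtime rt = Runtime.getRuntime();
        long usedHeap = rt.totalMemory() - rt.freeMemory();
        long freeHeap = rt.maxMemory() - usedHeap;
        if (freeHeap < MIN_FREE_HEAP) {
            for (Multiset.Entry<String> entry : counter.entrySet()) {
                if (entry.getCount() < THRESHOLD) {
                    counter.setCount(entry.getElement(), 0); // removes all occurrences
                }
            }
        }
    }
}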

This modification gives a perfectly workable algorithm, but the cutoffs bring one problem: along with useless noise, such as occasional typos and errors, we also throw away a lot of useful information about rare words. For example, if a lexeme (a set of word forms) occurs 1000 times in the corpus, each of its individual word forms may occur, say, fewer than 200 times, to say nothing of individual bigrams.

Our task is to count the bigrams as honestly as possible, without using intermediate cutoffs.

We use a disk as temporary memory


RAM is relatively inexpensive these days, but it still costs money. Moreover, many laptop models cannot be upgraded beyond 16 gigabytes no matter how much you want to: a platform limitation. Disk space is not a problem: it costs much less, and you can always plug in an external drive if needed.

When the semantic tags #hard_drive and #algorithm come up, merge sort and merging of sorted lists, which many of us wrote at school in Pascal, immediately spring to mind. The ideas behind these algorithms are well suited to our problem.

General scheme of the solution


Before moving on to the details, here is the general scheme of the solution:

  1. Split the corpus into blocks of approximately equal size, say 1 gigabyte each.
  2. Count the bigrams separately for each block, sort them in lexicographic order and write the result to disk.
  3. Using the algorithm for merging ordered lists, combine the per-block files into one, summing the numbers of occurrences for matching bigrams.

The size of each block can be chosen to taste (read: according to the amount of installed RAM), but for most tasks a size of about a gigabyte is more than convenient. You can also work with a monolithic corpus, flushing the results to disk and clearing the data structures whenever the amount of processed text reaches a limit.

Having looked at the algorithm from a bird's-eye view, we can move on to the details.

Count the bigrams for each block


To build a good architecture for the bigram counter, let us consider two important requirements:

  1. We want to count in several threads.
  2. At the output we need a list of bigrams sorted in lexicographic order.

It turns out that these two requirements can be met together. Instead of counting the bigrams with a single-level map (a multiset is essentially a key-to-counter map):

ConcurrentHashMultiset<String>

we suggest using a two-level map: the outer map is keyed by the first token of the bigram, and its value is a multiset counting the second tokens:

ConcurrentMap<String, ConcurrentHashMultiset<String>>

Experiments show that the multi-threaded counting of bigrams takes roughly the same time with either data structure, but sorting with the two-level map is much faster, because the keys of the outer and inner maps can be sorted independently.
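
A sketch of what such a two-level counter might look like; the class, its methods and the output format ("first second count" per line) are our own illustration:

import com.google.common.collect.ConcurrentHashMultiset;
import java.io.PrintWriter;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class TwoLevelCounter {
    // outer key: first token of the bigram; inner multiset: counts of the second tokens
    private final ConcurrentMap<String, ConcurrentHashMultiset<String>> counts =
            new ConcurrentHashMap<>();

    // Thread-safe increment of the (first, second) pair.
    void add(String first, String second) {
        counts.computeIfAbsent(first, k -> ConcurrentHashMultiset.create()).add(second);
    }

    // Sort the outer and inner keys independently and emit "first second count" lines.
    void writeSorted(PrintWriter out) {
        for (String first : new TreeSet<>(counts.keySet())) {
            ConcurrentHashMultiset<String> seconds = counts.get(first);
            for (String second : new TreeSet<>(seconds.elementSet())) {
                out.println(first + " " + second + " " + seconds.count(second));
            }
        }
    }
}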

Another major advantage of the two-level map is that it allows very fast additional filtering of the bigrams. For example, you can check that the tokens occur in a dictionary, or even normalize the word forms (reduce them to a common dictionary form). Performing these operations before identical bigrams have been grouped is very expensive: the same work would be repeated many times.

Combine results for all blocks


So, at the output of the previous step we have a set of files with entries of the form:

bigram1 count1
bigram2 count2
...
bigramN countN

where the keys are sorted in lexicographic order. Our task is to combine these files into one, summing the numbers of occurrences for matching keys. We will consider the task of merging two such files trivial and leave it without further explanation.
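
Still, for reference, a pairwise merge of two such files might look roughly like this; the sketch assumes that the count is the last space-separated field of every line and that both files are sorted by the full bigram string:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MergeTwo {
    // Merge two sorted "bigram count" files into one, summing counts of equal bigrams.
    static void merge(Path a, Path b, Path out) throws IOException {
        try (BufferedReader ra = Files.newBufferedReader(a, StandardCharsets.UTF_8);
             BufferedReader rb = Files.newBufferedReader(b, StandardCharsets.UTF_8);
             PrintWriter w = new PrintWriter(Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
            String la = ra.readLine();
            String lb = rb.readLine();
            while (la != null || lb != null) {
                int cmp = la == null ? 1 : lb == null ? -1 : key(la).compareTo(key(lb));
                if (cmp < 0) {                       // bigram present only in the first file
                    w.println(la);
                    la = ra.readLine();
                } else if (cmp > 0) {                // bigram present only in the second file
                    w.println(lb);
                    lb = rb.readLine();
                } else {                             // same bigram: sum the counts
                    w.println(key(la) + " " + (count(la) + count(lb)));
                    la = ra.readLine();
                    lb = rb.readLine();
                }
            }
        }
    }

    static String key(String line)  { return line.substring(0, line.lastIndexOf(' ')); }
    static long count(String line)  { return Long.parseLong(line.substring(line.lastIndexOf(' ') + 1)); }
}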

The overall task of combining all the files can be solved with an accumulator file, adding the per-block files to it one by one:

((((((N1 + N2) + N3) + N4) + N5) + N6) + N7)...

However, this approach is inefficient: after a number of iterations we will be adding relatively small per-block files to a relatively large accumulator, and most of the time will be spent reading data from disk and writing it back. It is much more profitable to build a strategy where files of roughly equal size are merged at each iteration: matching keys collapse into a single record, so the resulting file is smaller than the sum of the two inputs.

A recursive scheme in the spirit of merge sort is a great fit here. For 15 files it looks like this (for the merge function the first index is inclusive, the second is exclusive):

| _ merge (0, 15) = merge (0, 7) + merge (7, 15)
  | _ merge (0, 7) = merge (0, 3) + merge (3, 7)
    | _ merge (0, 3) = 0 + merge (1, 3)
      | _ merge (1, 3) = 1 + 2
    | _ merge (3, 7) = merge (3, 5) + merge (5, 7)
      | _ merge (3, 5) = 3 + 4
      | _ merge (5, 7) = 5 + 6
  | _ merge (7, 15) = merge (7, 11) + merge (11, 15)
    | _ merge (7, 11) = merge (7, 9) + merge (9, 11)
      | _ merge (7, 9) = 7 + 8
      | _ merge (9, 11) = 9 + 10
    | _ merge (11, 15) = merge (11, 13) + merge (13, 15)
      | _ merge (11, 13) = 11 + 12
      | _ merge (13, 15) = 13 + 14

As a result, the algorithm performs the same 14 merges as the accumulator variant, but works much more efficiently. The theoretical memory requirement of the merge is O(1); in practice memory is needed only for the read and write buffers.
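
A recursive driver over the per-block files might look roughly like this (it reuses the pairwise merge sketched above; the temporary-file naming is our own choice):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class MergeAll {
    // Merge files[from, to) into a single sorted file; from is inclusive, to is exclusive.
    static Path merge(List<Path> files, int from, int to) throws IOException {
        if (to - from == 1) {
            return files.get(from);             // a single block needs no merging
        }
        int mid = (from + to) / 2;
        Path left = merge(files, from, mid);
        Path right = merge(files, mid, to);
        Path result = Files.createTempFile("bigrams-", ".txt");
        MergeTwo.merge(left, right, result);    // the pairwise merge from the previous sketch
        return result;
    }
}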

Finally


Using the approach described above, you can collect not only bigrams over a corpus, but n-grams for any n, although you may have to use smaller blocks and flush intermediate results to disk more often.
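
For example, the bigram extraction sketched at the beginning generalizes to arbitrary n in the obvious way (again, the names are our own):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Ngrams {
    // Generalization of the bigram extraction above to arbitrary n.
    static List<String> extractNgrams(String sentence, int n) {
        String[] tokens = sentence.split(" ");
        List<String> ngrams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            ngrams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return ngrams;
    }
}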

As we said at the beginning, the implementation details deserve a separate discussion, and we will cover them in the next article.
