2che April 9, 2018 at 22:48

Literary Archiver

First of all, I congratulate all Orthodox Christians and those who sympathize with them for Easter and the end of Lent, and all the rest for the coming of spring. Only a month ago in the sandbox my debut about programming in Cyrillic finally drowned. I do not know what attracted the attention of readers to the greenery, but commented on the sheets, like a real article. In his sheet, TrllServ suggested using the idea for archiving. I love people who know how to find practical application of ideas. By expanding the notepad, I tried to sketch out an algorithm based on the properties of my encoding, namely, the unique typing of a character by the first bits. It is convenient to compress text with such an algorithm, that is, articles, books or copy-paste from the Internet - that which consists of words, and where the case of letters has a grammatical meaning. Subsequently, averages based on the rules of the Russian language were added to a simple algorithm, and all this came together in one complex program that effectively compresses a textbook of literature. We will call it Literary Archiver.

Archiving Methods

As said, the text consists of words. In experimental archivers, characters are always replaced by sequences of bits, because there are few of them, and codes take up less memory space.

A = 00, B = 01, B = 10, G = 110

When you need to compress a huge book, where each letter occurs millions of times, the stored memory is also huge. However, this method does not only work with characters. Words are also repeating units of text , and there are not as many as they seem. Dahl’s dictionary contains about 200,000 words, and the average anonymous person uses no more than 3-4 thousand words in everyday speech. Of course, in Russian, words can bend, conjugate, etc., but, if you believe tribayana.ru/?p=4143, in “War and Peace” by L. N. Tolstoy there are only 38873 original tokens. With one register (see method 2) and a fixed code length, each word will be encoded in two bytes. It’s good if the source file is written in UTF-8, where two bytes is a Cyrillic grapheme. The average word length in Russian is 7.2 letters. That is, the gain is 7.2 times. In the article, the list of words found in the file is called a “dictionary”.

“Dad” = 0, “soup” = 10, “herbs” = 11
If you decided to make an archiver, limit the encoding. In normal cases, we do not need all Unicode characters to encode the file. Most often, only the ASCII page and the Russian alphabet are useful. 256 characters - not many, but TrllServNo wonder I looked into the sandbox. Letters are uppercase and lowercase. Forget about hieroglyphs for now. If lowercase letters are the basis of the text, then capital letters most often cost one in each sentence. Moreover, if you build the code table according to the usual rules, they will have to select rather long bit sequences (with prefix encoding), and, even worse, the “dictionary” will be filled with the same words that differ only in case. A lot of memory is wasted. For us, “ё” and “ё” are one and the same letter, and for a machine - two completely different values from the code table. In order not to produce unnecessary words, all our letters are already stored in the archiver itself in pairs. He drops all capital letters to lowercase. At the same time, the words of the text are marked, where there were capital letters, and their number is saved in each word. Capital letters here always appear at the beginning of a word, “Camel Style” is cut into separate tokens: “Camel” and “Style”. So the word number in the text with the number of capital letters in it is uniquely decoded.

“Beginning” - 1
“CAPS” - 4
“JAMS” - 3
“OptOvIk”: “Op” - 1, “TOv” - 2, “Ik” - 1
Each non-letter character (arithmetic or punctuation) is encoded by a separate token (one character). Only a space is specially encoded. Words are almost always separated by spaces; manually inserting each is pointless. It is enough to establish a standard: when decoding between two words, one space is meant, unless otherwise indicated. And otherwise just indicated by a sign of the gap in the archive. There are several such exceptions:
- A series of spaces (more than 1) in the archive means the same series of spaces in the text. Useful when compressing python scripts with strict indentation.
  
  “Word1 ____ word2” -> “word1” + “____” + “word”
  * underscore here - spaces
- A space between two words in the archive means no space between them in the text. Yes, like that. Crutch to camel style.
  
  "ZABBOCHERG" -> "for" + "" + "bbo" + "" + "che" + "" + "rg"
  * all 4 words are marked, the number of capital letters for each is saved
- By default, non-letter characters “attach” right next to the word on the left and are separated by a space from the word (or sign) on the right. A space in the archive changes the rules.
  
  "Saltpeter, _ sugar" -> "saltpeter +", "+" sugar "
  " comma _, _ sir "->" comma "+" "+", "+" sir "
  " hey -, redneck! " -> "hey" + "," + "" + "redneck" + "!"
  “I’ve touched _, - the schoolboy” -> “I’ve touched” + "" + "," + "" + "the schoolboy"
  * here "_" is a space, and "-" is its absence. Due to the features of Habr, auto-correct.
- A single space at the very beginning of the archive (the first character) means a single space at the beginning of the text.
  
  [beginning of text] "first" -> [beginning of archive] "" + "first"
A series of spaces replaces single automatic spaces.
Numbers are encoded as letters, numbers as words. Letters with numbers concatenate together. Digit case is not possible.

“265” -> “265”
“228 articles” -> “228 articles”
“Adyn 1111 bumblebee” -> “Adyn 1111 bumblebee”

These 4 methods provide the "folding" of the source file without loss and distortion.

General algorithm

So what is going on?

The text is read from the source file and cut in parallel by the parser, a “dictionary” is compiled, “dictionary” indexes are substituted for the words of the text themselves. At the same time, the frequency of occurrence of each “vocabulary” word is calculated, and the labels of the capital letters are collected.
The "dictionary" is sorted by frequency of occurrence. In the implementation of the program, an index list is sorted for acceleration.
An “alphabet” is formed - a list of all the characters existing in the text, it is sorted by the frequency of occurrence of the characters in the “dictionary”. Instead of characters in the "dictionary" are substituted indexes of the "alphabet". The implementation also sorts a wildcard list.
The "alphabet" and "dictionary" are encoded in any way, for example, according to Huffman.
The archive file for recording opens. The code table of the ordered “alphabet” is introduced. The end is separated by vertical tabs.
Using the “alphabet” code table, the “dictionary” code table is bitwise entered. The end is separated by vertical tabs.
Capitalization indices and their numbers are introduced for those words where there are several. The end is separated by vertical tabs.
Using the codebook "dictionary" bit by bit text is entered.

Implementation

The figure is sad. While I was writing the code, it got confused. In the article, the algorithm looks simple, but understandable to the machine, it has become incomprehensible to humans. All permutations and indexes are, in fact, a terrible pyramid of pointers. I did everything I could, but I could almost everything. The program works and is almost ready, it does not implement only the algorithm for prefix encoding of the "dictionary" and "alphabet" - not the most difficult, but the most important part. Without it, the application works and correctly performs the remaining functions, but the size of the archive decreases slightly. Why niasil? Yes, ashamed. I got lost in arrays of pointers, and therefore did not encode. It takes time to unravel and refactor your creation, and people of my specialization are sorely lacking it now. Minobra course: less knowledge - more type tests for preparation. One could finish the project with colleagues, ask them for help, but as it turned out, the archiver algorithm for most of them is deadly. However, I will show what is. I did not find the “War and Peace” without the “introductory part”, so I ground the archive of “Masters and Margarita”. Fewer pages make more sense.

original size: 1.4 MB.
archive size: 1.6 MB.

Eureka! The archiver increased the file size. Without coding, it is pointless. Better consider a specific example.
Two lines with the mentioned exceptions:

JLK ijo, ewrt lhuiOij jfds + huk uih EGE! *
897 y98 uih - 124 jfds JLK

turned into this:

Of course, there is no talk about any compression without proper encoding, but the archive structure is clearly shown (essence) .

In general, if, while there are tests, there will a good anonymous, who adores dancing with tambourines and ready to write a crutch crutches, that the source code in C .

PS The implementation is ready.

With love.

Tags:

Literary Archiver

Archiving Methods

General algorithm

Implementation

Also popular now: