shvs May 1, 2014 at 12:19

Hacking a Vigenere cipher using frequency cryptanalysis

From the sandbox

“Imagine such a situation ... Once, when you leave the service at about one in the morning (the manager should set a good example), you notice a crumpled scrap of paper sticking out in the doorway ... The paper is excellent, it smells slightly of musk; the handwriting is clearly feminine and blows from it a kind of French charm. Now, with common sense, the new employee of Miss Hari begins to seem to you, perhaps, a little too exotic. Her French accent, an invariable black cocktail dress, a string of black pearls accentuating the neckline, and this exhilarating scent of musk filling the room when she enters ... She says she worked before at the Mac Donald Regional Computing Center in Kyokaku. Something is wrong here. Wait ... Is Miss Hari spying in favor of the famous French firm And Bay Em? And this note is an encryption, in which all the secrets of your latest miracle compiler? To incriminate Miss Hari, the note needs to be decrypted. But how?"

On Habré a couple of times articles flickered on Charles Weatherly’s book “Etudes for programmers”. Here is a fragment of one of the most interesting, in my opinion, sketches - "Secrets of the company", the main task of which is to crack the Vigenere cipher . Not so long ago, I realized this sketch, and in my article I will talk about how I did it and what happened in the end.

Task

The author suggests that we write a program that will allow us to crack the Vigenere cipher, or rather, one of its modifications. The text is encrypted as follows. A table is compiled, the so-called. Vizhener's square: the alphabet is written on the top and left, then some permutation of the original alphabet is placed on the first line, the same permutation cyclically shifted by one position, and so on. As a result, we get a table that maps each character pair to a character.

Then a keyword is selected and written repeatedly under the source text. Finally, each pair (the symbol of the source text; the symbol of the keyword below it) is encrypted using the table with the symbol corresponding to this pair. For example, the phrase “race star” will be encrypted using the “baggage” key and the above table like this:

Solution idea

Consider the reverse Vigenère square, that is, a table that maps to a pair (ciphertext symbol; keyword symbol) the source text symbol.

We agree to call the distance between two characters the difference in their positions in the alphabet: for example, the distance between 'a' and 'b' is 2. Then you can see that if the distance between the ith and jth characters of the keyword is d, then the distance between the corresponding characters of the source text is -d. Therefore, if we manage to find a keyword, then we can reduce the Vigenere cipher to a simple replacement cipher: each letter of the source text will be replaced by another, and the correspondence of the letters will be one-to-one. Hacking such a code is not difficult.

It remains to find the keyword. Let us know that the length of the keyword is L characters. Then the text can be divided into L groups, each of which will be encrypted using one character of the keyword, i.e. it will be a simple replacement code. Moreover, the permutations used in the alphabet will differ only in a shift equal to the sign of the distance between the corresponding characters of the keyword. Using methods of frequency cryptanalysis, we can determine these shifts.

Thus, the program will consist of three parts:
1) Determining the length of the keyword
2) Search for the keyword
3) Search for permutation of the alphabet and decoding of the text

Keyword Length

The keyword length is easiest to find using the match index method . This method was proposed by William Friedman in 1922 to crack the original Vigenere cipher, but it will work in our case as well. The method is based on the fact that the probability of coincidence of two random letters in some fairly long text (coincidence index) is a constant. Thus, if we divide the text into L groups of characters, each of which is encrypted with a simple replacement code (recall, this means that L is the length of the keyword), then the match indices for each of the groups will be quite close to the theoretical value of this value; for all other partitions, the match indices will be much lower. The coincidence index can be calculated using the formula

(fi is the number of i-letters of the alphabet in the text, and n is its length)

Below, for example, are indexes of matches for text encrypted using the “project” key (6 characters).

Thus, to determine the length of a keyword, it is necessary to calculate the indices of coincidence for breaking the text into L = 1, 2, ... groups, and then select the first, significantly superior to most of the others, from the obtained values. But ... there is a little trick. What if the text is encrypted using a key that can be divided into several similar parts? Below is a table of coincidence indices for the same text as in the first example, but encrypted using the “space” key (the same 6 characters). How to determine the key length in this case?

I played with various methods and found out that it is easiest to choose the necessary index like this: you need to take the first index, for which it is true: 1.06 * IP> ICi, i = 1, 2, ... Multiplying by 1.06 in most cases is enough to make a real IP exceed “multiple” IPs (in our example, these are indices at L = 12 and 18), but it is not enough for false indices (at L = 3, 9 and 15) to surpass the real ones.

Of course, my method does not give 100% result. Moreover, if the text is encrypted using a word consisting of identical parts (for example, “tartar”), then the length of the keyword will in any case be determined incorrectly (unless, of course, we want to find a meaningful keyword): there is no difference between encryption the key is “tartar” and “tar”. Therefore, it is important to give the user the ability to change the key length during the work of the program: for example, 6 did not fit, so it is worth trying 3 or 12.
Fortunately, for an encrypted note from a book, everything is determined quite simply. Looking at the table of coincidence indexes, we can say with confidence that the length of the keyword is 7 characters.

Keyword

Now that the key length is found, you can begin to search for the keyword itself. First we need to determine the probabilities of various shifts between the groups into which we divided the cipher (I recall that each group is encrypted using some permutation of the alphabet, while the permutations differ only by shifting each letter by several positions in the alphabet). To find the probabilities of shifts, I used the idea proposed by the translators of Etudes. If we omit all the mathematical calculations, it is as follows. For each group, we consider the number of characters x (x = 'a', 'b', ..., 'i') in it and based on information about the frequency of letters in the Russian language, we estimate the probability that the character y of the cipher corresponds to the character y (y = 'a', 'b', ..., 'i') of the source text. Comparing the obtained probability tables for different groups, we can calculate the probabilities of shifts r (r = 1, 2, ..., 32) between these groups. As a result, we get a table showing what is the probability of a shift r between groups i and j for any two groups and any shift.

There are two ways to find the keyword itself. The first way is to find the best shift configuration using the (only) resulting table. A complete search would take a very long time, so I went the other way. I searched for the key as follows. Take any word of suitable length and calculate its characteristic - the sum of the probabilities of shifts between every two characters in it (shifts between the characters of the keyword and between the letters of permutations in different groups of the cipher coincide up to a sign). Now make a small change to the word under study. If after a change the characterization of a word has become better, then we remember a new word, otherwise we forget a new word and return to the old one. By making random changes in this way, we quickly find the best word.

The only problem is that the best word is far from always the key. Tests have shown that even for texts of 1024 characters, the key and the best word often differ by one character. For example, the best word for a note from a book is “Fedisk.” I do not know such a word! For shorter texts, the key is determined completely incorrectly. Therefore, I had to refuse to search for the key based on only the table.

The second way to find the key is simple and uninteresting, but effective. We take a dictionary and calculate the characteristics of each suitable word (to increase the speed of the program, I divided the words by their lengths). We take the best word as a key word. This method works correctly even for short (400-500 characters) texts, but it has an obvious drawback: if the keyword is not in the dictionary, then the program will not work correctly under any circumstances.

However, if there is a key in the dictionary, the second method is much more effective than the first. Therefore, I used it in my program. This method immediately gave the correct (as it turned out later) keyword - “radish”.

Alphabet permutation

Having a keyword, we can modify the text so that it is encrypted with a simple replacement cipher. To do this, it is necessary to cyclically shift the symbols of the Lth group (L = 2, 3, ..., key length) by the distance between the first and Lth letters of the keyword to the left in the alphabetical order. After making these changes, it remains for us to find which letters of the source text correspond to the characters of the ciphertext.
Let's try to do this in the same way as we searched for a keyword without a dictionary. Take any correspondence table between the characters of the original and encrypted texts (for example, suppose that the letter 'a' corresponds to the letter 'a', the letter 'b' to 'b', etc.) and decrypt the text using it. We calculate the number of various bigrams (combinations of 2 letters) in the resulting text. Comparing the result with the reference table, we calculate the characteristic of the resulting text (I calculated the characteristic like this:

(b_ij is the number of bigrams in the “decrypted” text, and p_ij is the probability of a certain bigram in Russian)

Then we will make random changes to the correspondence table and at each step calculate a new characteristic. If the new characteristic turns out to be better (more) than the old one, we remember the changes, otherwise we roll them back. As a result, we get the correct correspondence table of letters, and we can restore the original text!

How fast is this permutation search algorithm? Below is a graph of the dependence of permutations on the number of iterations for an encrypted note from a book (for other texts, the graphs are similar). The number of iterations is plotted on the horizontal axis, and the characterization of the resulting text is plotted on the vertical axis. Characteristic changes occur jerky at each finding of the best table, therefore, for clarity, the graph is an interpolated point at which the changes occurred. When the graph line becomes straight and horizontal, the correct permutation table is found.

It is also interesting to look at the graphs of the dependence of the iteration number on the number of the successful change of the permutation table (left) and the text characteristics on the table change number (on the right).

We see that the characterization of the text changes with each change in the permutation table more or less linearly, and the number of attempts required to find a new successful change in the table grows rather quickly. The dependence of the characteristics of the text on the number of iterations is similar to the logarithmic. Indeed, the lower graphs show that if x is the number of iterations, y is the characteristic, and t is the number of successful table changes, then x∼e ^ t, y∼t, therefore, y∼ln (x). The logarithm grows rather slowly, but its growth is not limited by asymptotes, so finding a correct correspondence table takes a rather large, but not infinite amount of work. However, tests show that 10-12 thousand iterations are always enough.

Conclusion

After implementing the described algorithms and adding a graphical interface, we got such a program:

The program successfully decrypts texts with a length of 400-500 characters and more, the operating time does not exceed 10 seconds. I think this is a good result.

UPD Here is the linkto my program. I did it in Visual Studio 2012 and used Windows Forms, so before starting, make sure that you have the appropriate redistributable package and .net Framework 4 installed. You can encrypt the text in the program (for this, enter the text at the top of the form and the keyword; if the keyword will not be entered, it will be randomly selected) or to crack the cipher (for this, enter the ciphertext and press "Hack!"; if this indicates the length of the keyword or the keyword itself, this data will be used to decrypt, otherwise the program picks them up herself).

Tags: