LinkedIn hash leak analysis of mass audit capabilities
A week ago, a hash database with LinkedIn leaked , for others this event may be noteworthy in itself, but for me, first of all, this means an opportunity to analyze the modern possibilities of cracking passwords. And I’m not going to talk about how many times the word “password” was found among passwords and how long it takes to sort out six-character combinations. Rather, I will scare users with how complicated passwords can be “cracked” in a few hours. And I’ll tell programmers how to implement this effectively, and as a small gift I’ll attach a program that I wrote for a mass audit. There is also some educational program on the use of rainbow tables with simple conclusions.
And so, in an hourmanaged to "recover" about 2.5 million passwords on an average working configuration, without special dictionaries and rainbow tables. Among the passwords found there are 16-character alphanumeric combinations, and not in a single copy.
But let's start in order. Imagine that you come across a database of 6.5 million fresh hashes (unique 5,787,239 in total), what methods exist to recover the maximum number of passwords in a reasonable (say 7 days) time?
A small educational program . Many people believe in the wonderful properties of rainbow tables , supposedly they are "something capable of breaking everything at once." This is a huge misconception, moreover, for mass audit (thousands or millions of hashes) they are completely unsuitable. Therefore, forget the phrase “an attacker can generate a rainbow table!”
Why? Take a set of frail hundred rainbow tables the size of a hundred gigabytes , which are able with a probability of 98% to recover any password to 8 alphanumeric characters in one register. By the way, such tables can be generated somewhere in a couple of months on a rather powerful machine, but let them already be with us, like some kind of divine gift.
The time required to find a password for one hash value in such tables is about a minute. During this time, it is necessary to perform chain_length hash, reduction, and search operations on 100Gb thereby.
The ability to optimally search for multiple passwords to different hash values at once does not exist; for each hash, you need to build a separate rainbow chain and look for it in the table. Thus, we need approximately 5.7 million minutes to search.
This is about 10 years. Therefore, our divine gift in this case cannot be considered a modest gift.
However, a good set of rainbow tables will help restore the password to one hash value in minutes (given the same modest restrictions of 8-9 characters in length).
Direct searchit differs slightly from rainbow tables for mass hacking - it is easily optimized for searching for enumerated values in large sets of hash values.
We need to take each line from the set {a..z0..9} ^ 8, calculate its hash, and look in the database of hash values that LinkedIn carefully provided to us in this case .
And search is an operation that in this case is quite easy to optimize. Looking ahead, in my program I achieved O (1) searches even on such large bases.
The basis of the search is simple filtering - do not try to look for those hashes that we obviously will not find. We create an array of bit values (checkup) of size SIZE, about a hundred megabytes and build a function that maps the hash value to the index of this array. Oddly enough, such an object is also called a hash function, but not cryptographic, and it is often called a convolution. For each hash from LinkedIn, we compute the convolution, and write “1” in the corresponding bits of the checkup array.
When sorting, we consider the convolution == j from the received value, look at checkup [j], if there is 0, then there is no point in looking for such a value in the set (this is checked for O (1)). Otherwise, we use a binary search that already copes with O (log (N)).
If we go back to the numbers, then direct enumeration with similar optimization will take about a month on the same hardware, or several days on video cards.
That is, for mass hacking, even direct enumeration is more profitable than tables!
But we would like to deal with passwords longer than 8 characters and dictionaries come to the rescue . Dictionaries are very cool! They contain those passwords that were already someone’s, and the probability of being among ours is much higher for them than for random lines. And if you add a certain set of replacement rules, then you can work wonders. In this case, the speed of such enumeration will be comparable to the speed of direct enumeration.
But there is one drawback - dictionaries must come from somewhere. That is, the selection of words should not be random, but should be the result of some analysis. And downloading the dictionary from the vastness of the Global Network, you can get complete “junk”, the efficiency of which will be lower than the direct search (for example, I’ll specifically enter the lines there that hardly anyone will put as a password).
So we came to the optimal mass hacking strategy, in my opinion, related to frequency analysis . And we don’t need anything at all except a merged list of hashes. First
step . We start a direct search through the set of all characters that can be entered from the keyboard, up to 5-6 characters long. In fact, we don’t have a task to get all passwords of this length, we just want to get a certain amount, hundreds of thousands, for further analysis. If 6 characters is too short, then you can take 7-8, again, we do not need to exhaust the entire range for enumeration. The main thing is that it works for 5-10 minutes.
So, we found some passwords. Now you can conduct a frequency analysis, highlighting the most likely combinations of characters in a row. For example, “pass” is one of these combinations, but on LinkedIn and “link” too.
Step second . We start the search for the concatenation of the dictionary of frequent combinations with itself. Now I just remind, but if the essence is not clear, I recommend reading my previous note .
Let it work for 5-10 minutes too, and note that it will be much faster to find passwords than last time.
And the third step . Having collected the “critical mass” of the found passwords, say a tenth of all, we repeat the frequency analysis and again start the search, according to the dictionaries obtained. Now you can not stop - he will do his job.
At the moment, the search has been working for me for two hours without stopping, and it finds somewhere in the eye 50-100 passwords per second.
And here is an example of the resulting dictionary: dl.dropbox.com/u/243445/md5h/relevant.txt
You can check the security of your password by trying to "collect it from pieces" of the dictionary.
If four or fewer pieces are enough - it's bad, change it.
UPD is a screen with calculated statistics for the found passwords (I can calculate some other parameters if necessary):
We see that neither the password length, nor the use of special characters, nor a combination of registers and numbers, nor randomness can save.
As promised - this is the program that managed to do this. And by the way, you are free to try to repeat the result (or achieve the best).
Sources: dl.dropbox.com/u/243445/md5h/src.7z
Binary: dl.dropbox.com/u/243445/md5h/MD5BLAST.exe
Do not be surprised that it is called MD5- and not SHA- because, in As a workpiece, I took my program, which I already talked about .
The CUDA Toolkit is still needed: developer.nvidia.com/cuda-toolkit-32-downloads#Windows
Dictionary speed for SHA1 on GTX460 with 5.7 million unique hashes in the list - more than 60 mpwd / s. And do not scold the "low" speed - it's the same:
* SHA1;
* 5.7 million hashes to search;
* concatenation of the dictionary from strings of arbitrary length.
There are still no analogues with a higher speed for this task.
To start the search, you need to save the hashes in the hash_list.txt file, the dictionary in the words.txt file and call:
where 3 is the maximum degree (the number of dictionary concatenations), and 0.0 is the initial progress in percent.
For step 1 of the above diagram, in words.txt there should be all characters entered from the keyboard, one for each line: To get a list of relevant combinations: where the first file is the source for analysis, the second is for recording the results (yes, a bit not Unix -way)
and the remaining three parameters are adaptive frequency thresholds for two, three, and four-character combinations, respectively (higher number - fewer combinations as a result, you can experiment with them).
And so, in an hourmanaged to "recover" about 2.5 million passwords on an average working configuration, without special dictionaries and rainbow tables. Among the passwords found there are 16-character alphanumeric combinations, and not in a single copy.
But let's start in order. Imagine that you come across a database of 6.5 million fresh hashes (unique 5,787,239 in total), what methods exist to recover the maximum number of passwords in a reasonable (say 7 days) time?
- Rainbow tables ;
- Direct search;
- Dictionary attack (with rules);
- Frequency analysis.
A small educational program . Many people believe in the wonderful properties of rainbow tables , supposedly they are "something capable of breaking everything at once." This is a huge misconception, moreover, for mass audit (thousands or millions of hashes) they are completely unsuitable. Therefore, forget the phrase “an attacker can generate a rainbow table!”
Why? Take a set of frail hundred rainbow tables the size of a hundred gigabytes , which are able with a probability of 98% to recover any password to 8 alphanumeric characters in one register. By the way, such tables can be generated somewhere in a couple of months on a rather powerful machine, but let them already be with us, like some kind of divine gift.
The time required to find a password for one hash value in such tables is about a minute. During this time, it is necessary to perform chain_length hash, reduction, and search operations on 100Gb thereby.
The ability to optimally search for multiple passwords to different hash values at once does not exist; for each hash, you need to build a separate rainbow chain and look for it in the table. Thus, we need approximately 5.7 million minutes to search.
This is about 10 years. Therefore, our divine gift in this case cannot be considered a modest gift.
However, a good set of rainbow tables will help restore the password to one hash value in minutes (given the same modest restrictions of 8-9 characters in length).
Direct searchit differs slightly from rainbow tables for mass hacking - it is easily optimized for searching for enumerated values in large sets of hash values.
We need to take each line from the set {a..z0..9} ^ 8, calculate its hash, and look in the database of hash values that LinkedIn carefully provided to us in this case .
And search is an operation that in this case is quite easy to optimize. Looking ahead, in my program I achieved O (1) searches even on such large bases.
The basis of the search is simple filtering - do not try to look for those hashes that we obviously will not find. We create an array of bit values (checkup) of size SIZE, about a hundred megabytes and build a function that maps the hash value to the index of this array. Oddly enough, such an object is also called a hash function, but not cryptographic, and it is often called a convolution. For each hash from LinkedIn, we compute the convolution, and write “1” in the corresponding bits of the checkup array.
When sorting, we consider the convolution == j from the received value, look at checkup [j], if there is 0, then there is no point in looking for such a value in the set (this is checked for O (1)). Otherwise, we use a binary search that already copes with O (log (N)).
If we go back to the numbers, then direct enumeration with similar optimization will take about a month on the same hardware, or several days on video cards.
That is, for mass hacking, even direct enumeration is more profitable than tables!
But we would like to deal with passwords longer than 8 characters and dictionaries come to the rescue . Dictionaries are very cool! They contain those passwords that were already someone’s, and the probability of being among ours is much higher for them than for random lines. And if you add a certain set of replacement rules, then you can work wonders. In this case, the speed of such enumeration will be comparable to the speed of direct enumeration.
But there is one drawback - dictionaries must come from somewhere. That is, the selection of words should not be random, but should be the result of some analysis. And downloading the dictionary from the vastness of the Global Network, you can get complete “junk”, the efficiency of which will be lower than the direct search (for example, I’ll specifically enter the lines there that hardly anyone will put as a password).
Do it yourself
So we came to the optimal mass hacking strategy, in my opinion, related to frequency analysis . And we don’t need anything at all except a merged list of hashes. First
step . We start a direct search through the set of all characters that can be entered from the keyboard, up to 5-6 characters long. In fact, we don’t have a task to get all passwords of this length, we just want to get a certain amount, hundreds of thousands, for further analysis. If 6 characters is too short, then you can take 7-8, again, we do not need to exhaust the entire range for enumeration. The main thing is that it works for 5-10 minutes.
So, we found some passwords. Now you can conduct a frequency analysis, highlighting the most likely combinations of characters in a row. For example, “pass” is one of these combinations, but on LinkedIn and “link” too.
Step second . We start the search for the concatenation of the dictionary of frequent combinations with itself. Now I just remind, but if the essence is not clear, I recommend reading my previous note .
Let it work for 5-10 minutes too, and note that it will be much faster to find passwords than last time.
And the third step . Having collected the “critical mass” of the found passwords, say a tenth of all, we repeat the frequency analysis and again start the search, according to the dictionaries obtained. Now you can not stop - he will do his job.
At the moment, the search has been working for me for two hours without stopping, and it finds somewhere in the eye 50-100 passwords per second.
And here is an example of the resulting dictionary: dl.dropbox.com/u/243445/md5h/relevant.txt
You can check the security of your password by trying to "collect it from pieces" of the dictionary.
If four or fewer pieces are enough - it's bad, change it.
What managed to "break"
linkedinmel1234, andrea71103245, hockey101155239, magmag624222, carlito5657224, linda@790212, supercow779212, jesus143mary143, linkedin#239133, linkedinpassword123, thepassword1776000, 13051987159000, meatballstew123, latenightbreeze, whatthedillyo, friendofkellyg, hannah11emily9, linkedin7barry5, linkedin.passwd, linkedinrocksforeva, philip23marcus, 54fordpickup, nabe1959@, ge0rgin@, #1dust67, logic123tree456, ramgopal@123456, Jk971423, tiger!376400, ...
UPD is a screen with calculated statistics for the found passwords (I can calculate some other parameters if necessary):
We see that neither the password length, nor the use of special characters, nor a combination of registers and numbers, nor randomness can save.
Program
As promised - this is the program that managed to do this. And by the way, you are free to try to repeat the result (or achieve the best).
Sources: dl.dropbox.com/u/243445/md5h/src.7z
Binary: dl.dropbox.com/u/243445/md5h/MD5BLAST.exe
Do not be surprised that it is called MD5- and not SHA- because, in As a workpiece, I took my program, which I already talked about .
The CUDA Toolkit is still needed: developer.nvidia.com/cuda-toolkit-32-downloads#Windows
Dictionary speed for SHA1 on GTX460 with 5.7 million unique hashes in the list - more than 60 mpwd / s. And do not scold the "low" speed - it's the same:
* SHA1;
* 5.7 million hashes to search;
* concatenation of the dictionary from strings of arbitrary length.
There are still no analogues with a higher speed for this task.
To start the search, you need to save the hashes in the hash_list.txt file, the dictionary in the words.txt file and call:
MD5Blast words.txt 3 0.0
where 3 is the maximum degree (the number of dictionary concatenations), and 0.0 is the initial progress in percent.
For step 1 of the above diagram, in words.txt there should be all characters entered from the keyboard, one for each line: To get a list of relevant combinations: where the first file is the source for analysis, the second is for recording the results (yes, a bit not Unix -way)
a
b
c
...
MD5Blast _found.txt relevant.txt 1.0 4.0 16.0
and the remaining three parameters are adaptive frequency thresholds for two, three, and four-character combinations, respectively (higher number - fewer combinations as a result, you can experiment with them).
Small conclusions
- Rainbow tables with mass hacking are useless;
- The length of the password, the presence of special characters, its meaninglessness, different registers do not individually constitute security;
- Hashing passwords without using salt is like storing half of them in clear text ( learning from our mistakes );
- Frequency analysis allows you to generate good dictionaries "out of nothing."