Unique Carnegie Mellon University Password Database Study

Published on November 10, 2013



    A recent study of the Carnegie Mellon University password database revealed several interesting correlations between demographic characteristics and the quality of the passwords people choose. What makes the study unique is that every account examined belonged to an employee or student of Carnegie Mellon University. These passwords protected access to very important data and functions on the university website, so the length and complexity requirements at registration were quite strict. The university’s database held detailed personal data on all users, and the authentication server logs recorded how quickly passwords were typed, as well as successful and failed login attempts. In total, about 40,000 passwords from active and disabled accounts were examined.

    By contrast, the usual password databases that surface after leaks from hacked sites contain a lot of garbage: one-time accounts of random visitors with passwords like “12345” or “password”. They also carry very little information about the users, usually just a username or email address.


    As can be seen in the graph above, the strongest passwords belonged, as expected, to employees and students of the computer science and engineering faculties. They were followed by the humanities and the arts. Those who devoted themselves to politics and business turned out to be the most vulnerable. The probability of guessing the password of a computer science student is 45% lower than for a business school student. Apparently it is with good reason that the antagonism between system administrators and accountants is so often the subject of jokes and stories. Another pattern: the passwords men come up with are 8% better than those of women.

    Along with demographic and professional factors, the properties of the passwords themselves were also investigated. Everyone knows that the longer the password and the more digits, uppercase letters and special characters it contains, the better. Now, however, it has become clear how much better. Adding a lowercase letter to the password reduces the probability of it being guessed to 70% (the probability of guessing the original password is taken as 100%). Symbols and uppercase letters reduce it to 56% and 46%, respectively. The placement of characters and digits within the password also matters a great deal. An uppercase letter at the beginning of the password offers little advantage. Digits and special characters at the end do not work very well either. They work best when they are spread throughout the password. The patterns are clearly visible in these diagrams:
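
    To make the arithmetic concrete, here is a minimal sketch of how the quoted factors could be combined. It assumes that the effect of each added character multiplies independently, which the article does not state; the function name and the dictionary layout are purely illustrative.

```python
# Relative guess-probability factors quoted in the article (baseline = 1.0).
# ASSUMPTION: the factors of several added characters multiply independently.
FACTORS = {"lower": 0.70, "symbol": 0.56, "upper": 0.46}


def relative_guess_probability(added: dict[str, int]) -> float:
    """Return the guess probability relative to the original password (1.0)."""
    result = 1.0
    for kind, count in added.items():
        result *= FACTORS[kind] ** count
    return result


if __name__ == "__main__":
    examples = ({"lower": 1}, {"symbol": 1}, {"upper": 1}, {"symbol": 1, "upper": 1})
    for added in examples:
        print(added, f"{relative_guess_probability(added):.0%}")
```

    Under that assumption, adding one symbol and one uppercase letter would leave roughly a quarter of the original guessing probability (0.56 × 0.46 ≈ 0.26).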



    Of particular interest is the way the information was collected and processed. How did a plaintext password database and users' personal data fall into the hands of the scientists? Strictly speaking, they did not. The study was conducted with the assistance of the university’s security service under rather strict conditions. For historical reasons, user passwords on the university server were stored not as hashes but as encrypted records, and the encryption key was held by the security service. The scientists arranged to carry out the research before the university switched to a more modern scheme based on salted hashes.
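
    For contrast with reversible encryption, the following is a minimal sketch of salted password hashing using Python's standard library. It is not the university's actual scheme, which the article does not describe; PBKDF2, the iteration count and the function names are assumptions chosen for illustration.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # illustrative work factor, not a figure from the study


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) for storage; the password itself is never stored."""
    salt = secrets.token_bytes(16)  # fresh random salt for every account
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)
```

    Unlike an encrypted record, a stored digest cannot be turned back into the password by anyone holding a key, which is why a study of plaintext passwords was only possible before the switch.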

    The password database was decrypted on a separate computer that was not connected to the network, and only a few security officers had physical access to it. Passwords were kept only in RAM, with swap disabled. The scientists had to write scripts that produced statistical indicators without ever seeing the data itself. Every line of their code, as well as its output, was carefully reviewed to make sure that no sensitive data left the system. Debugging scripts under these conditions was very difficult and slow.

    Once the work was finished, all the source data was carefully destroyed. The statistical indicators themselves were chosen so that it would be impossible to single out a narrow group of users with particularly weak passwords and launch a targeted attack on their accounts. That is why, in some cases, small groups were merged into an impersonal “others” category.
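
    The merging of small groups can be sketched roughly as follows. The threshold, the function name and the sample counts are hypothetical; the article does not say how the researchers implemented this step or what group size they treated as too small.

```python
from collections import Counter

MIN_GROUP_SIZE = 50  # hypothetical threshold; the article does not state one


def merge_small_groups(counts: dict[str, int], min_size: int = MIN_GROUP_SIZE) -> dict[str, int]:
    """Fold every group smaller than min_size into a single 'others' bucket."""
    merged = Counter()
    for group, n in counts.items():
        merged[group if n >= min_size else "others"] += n
    return dict(merged)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    sample = {"computer science": 4200, "business": 3100, "group A": 12, "group B": 7}
    print(merge_small_groups(sample))
```

    Only the aggregated counts would then be reported, so no published figure describes a group small enough to point at particular accounts.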