Analysis of real user passwords and improved exhaustive search

    Today I read a translation of the article "Distribution of characters in passwords" on Habr and wanted to run a small analysis of my own. I am interested in password lengths, the first characters of passwords, and the bigrams (pairs of adjacent characters) used in passwords. The article also describes an algorithm for an improved exhaustive search over passwords.

    I downloaded the archive with passwords from here: http://thepiratebay.org/torrent/6443601
    I used only 1 file for the analysis: Sony_Pictures_International_BEAUTY_USERS.txt
    Number of passwords: 20 921


    What are the most popular password lengths?




    It can be seen that password lengths vary considerably. It is surprising that some users have passwords shorter than 6 characters, and strange that the registration system allows such passwords at all. Passwords that short make up less than 2.5% of the total. There are 2 passwords with a length of 35; they were very likely produced by a password generator.
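
    A length distribution like this takes only a few lines to compute. The following is a minimal sketch in Python, assuming the passwords are already loaded into a list of strings; the function names and the sample list are illustrative and are not taken from the dump.

```python
from collections import Counter


def length_histogram(passwords):
    """Count how many passwords have each length."""
    return Counter(len(p) for p in passwords)


def share_shorter_than(passwords, limit):
    """Fraction of passwords whose length is below `limit`."""
    if not passwords:
        return 0.0
    return sum(1 for p in passwords if len(p) < limit) / len(passwords)


# Made-up sample, only to show the calls.
sample = ["secret", "monkey12", "abc", "password", "letmein"]
hist = length_histogram(sample)
# Lengths ordered from most to least frequent.
by_popularity = [length for length, _ in hist.most_common()]
short_share = share_shorter_than(sample, 6)
```

    On the real file, by_popularity would give the ordering of lengths by popularity, and share_shorter_than(passwords, 6) the share of short passwords.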

    What characters prevail in passwords?




    As expected, vowels come first in popularity, followed by consonants and digits. Uppercase characters are used an order of magnitude less often: they account for less than 2.8% of all characters.
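
    The character statistics can be gathered in the same way. This is a sketch under the same assumption (a list of password strings); the helper names are mine.

```python
from collections import Counter


def char_frequencies(passwords):
    """Count every character across all passwords."""
    return Counter(ch for p in passwords for ch in p)


def uppercase_share(passwords):
    """Fraction of all characters that are uppercase letters."""
    counts = char_frequencies(passwords)
    total = sum(counts.values())
    upper = sum(n for ch, n in counts.items() if ch.isupper())
    return upper / total if total else 0.0


# Made-up sample, only to show the calls.
sample = ["Secret", "monkey12", "abc"]
freq = char_frequencies(sample)
share = uppercase_share(sample)
```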

    What characters do passwords most often begin with?




    Most often, passwords begin with the characters s, m, b, c. The next most popular first characters are p, t, d, a, j, l, r. A less popular group is g, k, 1, h, f, w, n, e. All other characters appear in first position even less often.

    What bigrams (pairs of adjacent characters) in passwords are more common than others?




    As can be seen from the diagram, the distribution of bigrams is far from random. The five most frequent bigrams, in decreasing order of popularity, are:
    ar (1367), le (1315), on (1239), ie (1136), es (1134).
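
    Counting bigrams amounts to sliding a two-character window over each password. A minimal sketch, again assuming a list of password strings; the sample is made up.

```python
from collections import Counter


def bigram_counts(passwords):
    """Count every pair of adjacent characters across all passwords."""
    return Counter(
        p[i:i + 2]
        for p in passwords
        for i in range(len(p) - 1)  # empty for passwords shorter than 2
    )


# Made-up sample, only to show the calls.
sample = ["maria", "marie", "charles"]
counts = bigram_counts(sample)
top = counts.most_common(1)
```

    In this sample the most frequent bigram is also "ar", which occurs once in each of the three words.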

    An algorithm for linearizing the enumeration of words of different lengths whose characters belong to a countable alphabet


    Now consider an algorithm that linearizes the enumeration of two-character words. The characters belong to a countable alphabet (not necessarily finite).
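
    One way to linearize the square of index pairs is to walk it one anti-diagonal at a time. The sketch below assumes that order; it is my choice for illustration, and any traversal that eventually reaches every cell would serve. The names pairs, pair_index and two_char_words are hypothetical.

```python
from itertools import count, islice


def pairs():
    """Yield every pair (i, j) of non-negative integers, one anti-diagonal
    (constant i + j) after another. Works for an infinite alphabet."""
    for s in count():
        for i in range(s + 1):
            yield i, s - i


def pair_index(i, j):
    """Position of (i, j) in the sequence produced by pairs()."""
    s = i + j
    return s * (s + 1) // 2 + i


def two_char_words(alphabet):
    """Same traversal for a finite alphabet: cells outside the
    len(alphabet) x len(alphabet) square are skipped."""
    k = len(alphabet)
    for s in range(2 * k - 1):
        for i in range(max(0, s - k + 1), min(s, k - 1) + 1):
            yield alphabet[i] + alphabet[s - i]


first_six = list(islice(pairs(), 6))
words = list(two_char_words("ab"))  # ['aa', 'ab', 'ba', 'bb']
```

    If the alphabet is sorted by popularity, this order tries combinations of popular characters before any combination involving a rare one.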



    What if you need to enumerate three-character words over a countable alphabet? I would rather not derive formulas for traversing a three-dimensional cube. Instead, you can reuse the traversal of the square and interpret one of its two coordinates as the index of a two-character word in that same traversal. In this way you can traverse words of any given length.

    Special attention should be paid to traversing words of different lengths. The word length and the word itself must additionally be linearized together as a pair. In this way, words of every length whose characters belong to a countable set will be enumerated.
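
    Both steps can be sketched with a single helper that inverts a traversal of the square. The code below assumes an anti-diagonal traversal (an illustrative choice) and works with character indices rather than characters; unpair, index_to_tuple and index_to_word are hypothetical names.

```python
from math import isqrt


def unpair(n):
    """Return the n-th cell (i, j) of the square, walking anti-diagonals."""
    s = (isqrt(8 * n + 1) - 1) // 2   # which anti-diagonal n falls on
    i = n - s * (s + 1) // 2          # offset inside that anti-diagonal
    return i, s - i


def index_to_tuple(n, length):
    """Decode n into `length` character indices by repeated unpairing:
    one coordinate is a character, the other is the number of a shorter word."""
    indices = []
    for _ in range(length - 1):
        n, last = unpair(n)
        indices.append(last)
    indices.append(n)
    return tuple(reversed(indices))


def index_to_word(n):
    """Decode n into a non-empty word of any length: one coordinate picks
    the length, the other picks the word among words of that length."""
    length_index, word_index = unpair(n)
    return index_to_tuple(word_index, length_index + 1)


triples = [index_to_tuple(n, 3) for n in range(3)]
mixed = [index_to_word(n) for n in range(3)]
```

    Here triples starts with (0, 0, 0), (0, 0, 1), (0, 1, 0), and mixed starts with (0,), (1,), (0, 0), so short words over popular characters come early.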

    In effect, we are now enumerating words of all lengths over an infinite alphabet. The real alphabet is finite, but the problem is easier to solve in the general case.

    Why might all this information be required?


    Now let's build the following ordered lists:
    - Password lengths
    - First characters of passwords
    - Second characters of bigrams

    Let's make a list of password lengths by their popularity:
    6, 8, 7, 9, 10, 11, 12, 13, 14, 16, 15

    Let's make a list of first characters in passwords by their popularity:
    s, m, b, c, p, t, d, a, j, l, r, g, k, 1, h, f, w, n, e, 0, o, i, 2, y, v, S, M, 4, B, 3, C, P, 5, T, D, z, ...

    For each first character of a bigram, let's list the second characters by popularity:
    e: s, l, n, e, y, t, 1, b, r, c, w, 2, v, x, f, 3, 4, a, 7, 5, k, 6, 9, j, h, u, d, m, 8, z, p, o, ...
    a: r, t, s, l, m, c, d, b, i, y, g, p, k, u, h, v, w, x, 2, f, 0, j, 4, 9, 3, 7, 1, 6, n, z, e, o, ...
    o: n, o, r, l, m, u, s, g, v, k, b, d, c, 1, 2, x, i, w, p, t, 0, 3, 6, j, z, 5, 9, 7, e, y, a, 8, ...
    r: a, i, o, l, d, t, 1, r, k, n, e, g, b, c, 4, 5, 7, 8, s, v, 3, 6, u, 9, w, y, h, j, 2, p, m, z, ...
    ...
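
    Lists of this kind can be derived directly from the password file. The following is a sketch, assuming a list of password strings; build_model, the dictionary keys, and the alphabetical tie-breaking rule are my assumptions, and the sample is made up.

```python
from collections import Counter, defaultdict


def ranked(counter):
    """Keys of a Counter, most frequent first; ties are broken by key order."""
    return [k for k, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))]


def build_model(passwords):
    """Build the three ordered lists: lengths, first characters, and,
    for every character, the characters that follow it."""
    lengths = Counter(len(p) for p in passwords if p)
    first = Counter(p[0] for p in passwords if p)
    following = defaultdict(Counter)
    for p in passwords:
        for a, b in zip(p, p[1:]):
            following[a][b] += 1
    return {
        "lengths": ranked(lengths),
        "first": ranked(first),
        "next": {a: ranked(c) for a, c in following.items()},
    }


# Made-up sample, only to show the calls.
sample = ["sam", "sara", "mars", "bar"]
model = build_model(sample)
```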

    Now you can organize a linear enumeration of passwords, starting with the most popular options. The inverse problem can also be solved: given a specific password, find its number in the sorted sequence. This, however, is somewhat harder, because the list of candidate characters at each position depends on the previous character.
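
    A rough sketch of both directions follows. It represents a password as a tuple of ranks: the rank of its first character in the first-character list, then the rank of each next character in the list that belongs to the previous character. Ordering candidates by total rank and the max_rank cut-off are my assumptions, not the article's exact scheme. The toy data keeps only the letters a, r, e, o, in the relative order they have in the lists above.

```python
from itertools import product

# Toy model restricted to four letters (an illustrative simplification).
FIRST = ["a", "r", "e", "o"]
FOLLOWING = {
    "e": ["e", "r", "a", "o"],
    "a": ["r", "e", "o"],
    "o": ["o", "r", "e", "a"],
    "r": ["a", "o", "r", "e"],
}


def decode(ranks, first, following):
    """Turn a non-empty tuple of ranks into a password,
    or None if some rank is out of range."""
    if ranks[0] >= len(first):
        return None
    word = first[ranks[0]]
    for r in ranks[1:]:
        options = following.get(word[-1], [])
        if r >= len(options):
            return None
        word += options[r]
    return word


def encode(password, first, following):
    """Inverse of decode. Raises ValueError or KeyError if the password
    uses a character that is missing from the lists."""
    ranks = [first.index(password[0])]
    for prev, ch in zip(password, password[1:]):
        ranks.append(following[prev].index(ch))
    return tuple(ranks)


def candidates(length, first, following, max_rank):
    """Yield passwords of one length, lowest total rank first.
    All rank tuples up to max_rank are materialized and sorted."""
    tuples = product(range(max_rank + 1), repeat=length)
    for ranks in sorted(tuples, key=lambda r: (sum(r), r)):
        word = decode(ranks, first, following)
        if word is not None:
            yield word


two_letter = list(candidates(2, FIRST, FOLLOWING, max_rank=1))
ranks_of_ore = encode("ore", FIRST, FOLLOWING)
```

    The rank tuple is easy to obtain; turning it into a single position in the full sequence additionally requires inverting whatever linearization is used for the tuples.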

    If the community finds the topic interesting, I will try to show how to compute a password's number in the sequence in one of my next articles.
