Analysis of leaked Gmail, Yandex and Mail.Ru passwords

    Recently, password databases of popular email services were made publicly available [ 1 , 2 , 3 ]. Today we will analyze them and answer a number of questions about the quality of the passwords and their possible source (or sources). We will also discuss quality metrics for individual passwords and for the sample as a whole.

    No less interesting are some anomalies and patterns in the password databases. They may shed light on where the data came from and on how dangerous this sample is for an ordinary user.

    Formally, we will consider the following questions: how strong are the passwords in the database, and could they have been collected by a dictionary attack? Are there any signs of phishing attacks? Could a single data leak be the only source? Was the database accumulated over a long period, or is the data exclusively “fresh”?

    Article structure:

    1. Data Description
    2. Invalid passwords and non-passwords
    3. Password length distribution
    4. Password Strength Distribution
    5. Dictionary attack
    6. Top passwords
    7. Gmail sample
    8. Rambler sample
    9. Open source analysis
    10. Conclusion

    Data Description

    The data in all three databases is a set of address-password pairs separated by a colon. No other metadata is available. The data is quite noisy, however: it contains strings that are neither mail addresses nor valid passwords.
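    Because a password may itself contain a colon, one cautious way to read such lines is to split on the first colon only. The sketch below is an illustration under that assumption; the function name parse_pairs and the sample lines are made up and are not part of the original analysis.

```r
# Split each "address:password" line on the FIRST colon only,
# so that passwords containing ":" stay intact.
# parse_pairs is a hypothetical helper, not the author's code.
parse_pairs <- function(lines) {
  # position of the first colon, -1 if the line has none
  pos <- as.integer(regexpr(":", lines, fixed = TRUE))
  has_sep <- pos > 0
  email <- ifelse(has_sep, substr(lines, 1, pos - 1), lines)
  password <- ifelse(has_sep, substr(lines, pos + 1, nchar(lines)), NA_character_)
  data.frame(email = email, password = password, stringsAsFactors = FALSE)
}

# Made-up sample lines (not taken from the leaked data)
sample_lines <- c("alice@example.com:qwerty",
                  "bob@example.com:pa:ss",
                  "garbage line")
pairs <- parse_pairs(sample_lines)
print(pairs)
```

    Lines without any separator keep their text in the email column and get a missing password, which makes the noisy entries easy to count later.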

    By studying the features of the data, we can put forward (or refute) hypotheses about the process by which the passwords were obtained.

    Invalid passwords and non-passwords

    The simplest criterion for an invalid password is that its length does not meet the requirements of the email services.

    The data suggests that the passwords in the sample could not all have come from an “internal” leak: several thousand of them cannot be valid passwords at all, because the services require a minimum length of six characters (and, for modern Gmail passwords, eight characters).

    Let us look at these abnormally long (over 60 characters) and short (under 6 characters) passwords in detail.


    The long passwords turn out to be fragments of HTML. One representative example:

    Examples like this indicate that phishing may be one of the password sources. The database entry was clearly collected automatically and never checked by a person. Phishing is also suggested by the fact that the password contains HTML markup, which would be quite unusual for passwords stolen by malware on an infected machine.

    Brief selection of passwords that are too short:

    Another indicator that phishing may be one of the sources is the presence of entries with an empty login or password. An entry consisting of a lone apostrophe instead of a password is especially interesting: perhaps the potential victim recognized the phishing form and tried to probe it for SQL injection.

    What can be stated unequivocally from this data? The database was never validated automatically. The most likely hypotheses are phishing and malware infection.
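    The anomalies described above (empty fields, HTML markup, lengths outside 6 to 60 characters) can be flagged mechanically. The following is a minimal sketch, not the author's code: classify_entry is a hypothetical helper, the HTML check is a crude regular expression, and the sample entries are made up.

```r
# Label each entry with the first anomaly it exhibits.
# Thresholds 6 and 60 follow the limits used in the article.
classify_entry <- function(email, password, min_len = 6, max_len = 60) {
  len <- nchar(password)
  ifelse(email == "" | password == "", "empty field",
  ifelse(grepl("<[^>]+>", password), "html markup",   # crude tag detector
  ifelse(len < min_len, "too short",
  ifelse(len > max_len, "too long", "ok"))))
}

# Made-up entries illustrating each category
emails    <- c("a@example.com", "", "c@example.com", "d@example.com")
passwords <- c("qwerty123", "secret", "'",
               "<input type=\"password\" value=\"x\">")
labels <- classify_entry(emails, passwords)
print(table(labels))
```

    Counting the labels per provider gives a quick picture of how much of each database is noise rather than credentials.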

    To evaluate the quality of the sample as a whole, we remove the obviously invalid passwords (shorter than 6 or longer than 60 characters) and look at the distribution of the remaining passwords over several parameters.

    Password length distribution

    As the graph below shows, most passwords are 8 characters long or shorter. This may indicate that a significant share of the passwords is potentially vulnerable to various kinds of brute-force attacks.
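    The claim about short passwords can be checked numerically as well as visually. A minimal sketch, assuming the passwords are held in a character vector; short_share and the demo vector are made up for illustration.

```r
# Share of passwords whose length does not exceed max_len characters.
short_share <- function(passwords, max_len = 8) {
  mean(nchar(passwords) <= max_len)
}

# Made-up passwords with lengths 6, 8, 12 and 9
demo <- c("qwerty", "12345678", "correcthorse", "password1")
share <- short_share(demo)
print(share)

# Cumulative share by length, useful for reading off other thresholds
print(cumsum(prop.table(table(nchar(demo)))))
```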

    Password Strength Distribution

    To test this hypothesis, consider a simple password strength metric based on the PCI standard.
    Let the password receive one point for each of the following conditions it satisfies:
    • Password contains at least 7 characters;
    • Password contains at least one lowercase letter;
    • Password contains at least one uppercase letter;
    • Password contains at least one digit;
    • Password contains at least one special character.

    A password that scores 4/5 is called strong (5/5 is very strong), 3/5 is medium, 2/5 is weak, and 0 or 1 point is very weak. The R code is given below.

    Strength function
    strength <- function(password){
      score <- 0
      # at least 7 characters
      if (nchar(password) >= 7){
        score <- score + 1
      }
      # at least one digit
      if (grepl("[[:digit:]]", password)){
        score <- score + 1
      }
      # at least one lowercase letter
      if (grepl("[[:lower:]]", password)){
        score <- score + 1
      }
      # at least one uppercase letter
      if (grepl("[[:upper:]]", password)){
        score <- score + 1
      }
      # at least one special symbol
      if (grepl("[#!?^@*+&%]", password)){
        score <- score + 1
      }
      # 0-1 very weak
      # 2 - weak
      # 3 - medium
      # 4 - strong
      # 5 - very strong
      score
    }
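    The score-to-category mapping described above can be written as a small lookup. This is a sketch only: strength_label is a hypothetical helper name, and the example scores are made up rather than computed from the leaked data.

```r
# Map a 0..5 score to the categories used in the article:
# 0-1 very weak, 2 weak, 3 medium, 4 strong, 5 very strong.
strength_label <- function(score) {
  cut(score,
      breaks = c(-Inf, 1, 2, 3, 4, 5),   # right-closed intervals
      labels = c("very weak", "weak", "medium", "strong", "very strong"))
}

# Made-up scores covering every possible value
scores <- 0:5
labels <- strength_label(scores)
print(data.frame(score = scores, label = labels))
```

    Applying table() to the resulting factor gives the per-category counts that a strength histogram would display.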

    The resulting strength distribution is:

    As the graph shows, most passwords fall into the weak categories. As an example, consider the passwords with zero strength, since they are most likely another representative group of invalid passwords.

    Zero Strength Passwords

    As the examples above show, these passwords are not valid (to a human they look more like input errors than real passwords): mail services do not allow a mailbox to be registered with a password they consider too simple, for example one that repeats the same character six times. This means that an even larger share of the passwords may be invalid by modern requirements.
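    One such "too simple" pattern, a single character repeated for the whole password, is easy to detect with a back-reference. This is a sketch of one possible check, not a reproduction of any mail service's actual rules; the function name and test strings are made up.

```r
# TRUE when the whole password is one character repeated two or more times.
is_single_char_repeat <- function(passwords) {
  grepl("^(.)\\1+$", passwords, perl = TRUE)
}

# Made-up examples
candidates <- c("aaaaaa", "111111", "qwerty", "aaaaab")
flags <- is_single_char_repeat(candidates)
print(data.frame(password = candidates, repeated = flags))
```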

    Is it possible that a substantial part of the database was collected over a long period, when password requirements were more lenient? Otherwise it is hard to explain such a large group of passwords that do not meet the requirements of modern mail systems.

    Dictionary attack

    As an additional argument, we conduct the following experiment: take a selection of relevant publicly shared password dictionaries, run them against the available passwords, and estimate the percentage of passwords contained in these dictionaries (the author literally did not go further than the first three Google results for [password dictionary]).

    The table above shows that a significant proportion of the passwords is contained in the dictionaries, which also indicates that some of the passwords could have been obtained by a dictionary attack (or some variation of brute-force enumeration).
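    The percentage in question is simply the share of leaked passwords that also occur in the combined dictionary. A minimal in-memory sketch, assuming both lists fit in memory as character vectors; dictionary_share and all sample values are made up for illustration.

```r
# Share of passwords that appear verbatim in the dictionary.
dictionary_share <- function(passwords, dictionary) {
  mean(passwords %in% dictionary)
}

# Made-up dictionary and password list
dictionary <- c("123456", "qwerty", "password")
leaked     <- c("qwerty", "Tr0ub4dor&3", "123456", "zxcvbn")
share <- dictionary_share(leaked, dictionary)
print(share)
```

    With real files, the dictionary vector could be built by reading each word list with readLines() and taking unique() of the result.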

    Top passwords

    Here is a selection of the most popular passwords. Note that most of them would not be accepted as valid passwords today.

    Gmail sample

    The actions described in this and the next section were performed, and the resulting data passed on, by a friend of a friend who wished to remain anonymous.

    Task: check whether the passwords are valid (i.e. whether they really match the accounts). Action: try to access the mailboxes for a small sample of about 150-200 entries. Result: only about 2-3% of the sampled passwords matched (a few hours after the data became public), and practically all of those accounts had already been deactivated at the time of the check. Fewer than 1% of the mailboxes were actually accessible, and those had been abandoned by their owners for at least a year.

    Rambler sample

    Lists of “really valid” addresses, compiled by a wide range of interested parties (aka script kiddies), are easy to find online.

    Interestingly, a rather large percentage of them are Rambler addresses.

    Rambler was warned a few days before publication and replied that the necessary security measures would be taken in the near future.

    <humor> Interestingly, the percentage of valid passwords here is significantly higher: until recently, Rambler stayed outside the media coverage of these events and had not activated any additional protection. This allowed an unknown “leak anthropologist” to study the last moments in the life of the mailboxes. Although the passwords were valid, all the mailboxes had long been abandoned (about 1-1.5 years), and each one ended with one of these letters:

    This is further confirmation of the phishing hypothesis and of the cumulative nature of the database.

    Open source analysis

    Let us return to open sources. An active search for the passwords and logins led us to a number of lists distributed on gaming forums:

    It turns out that part of the list had already been circulating online in some form.

    Thus, the data allows us to reject the hypothesis of a single data source such as an “internal leak”.

    The main part of the code used:
    Data Analysis and Visualization
    print("loading yandex data")
    yandex <- read.csv("yandex.txt", header = FALSE, sep = ":", quote = "", stringsAsFactors = FALSE)
    print("loading mailru data")
    mailru <- read.csv("mail.txt",   header = FALSE, sep = ":", quote = "", stringsAsFactors = FALSE)
    print("loading gmail data")
    gmail  <- read.csv("gmail.txt",  header = FALSE, sep = ":", quote = "", stringsAsFactors = FALSE)
    ##testing if data loaded correctly: each dataset should have two columns
    print("testing, if loaded correctly")
    print(dim(yandex)); print(dim(mailru)); print(dim(gmail))
    ##changing names
    names(yandex) <- c("email", "password")
    names(mailru) <- c("email", "password")
    names(gmail)  <- c("email", "password")
    print("computing lengths of passwords and adding to the datasets")
    # nchar() is vectorized, so no sapply() is needed
    yandex$pass_length   <- nchar(yandex$password)
    mailru$pass_length   <- nchar(mailru$password)
    gmail$pass_length    <- nchar(gmail$password)
    print("number of invalid passwords by length")
    print(nrow(yandex[yandex$pass_length <  6,]))
    print(nrow(yandex[yandex$pass_length >  60,]))
    print(nrow(mailru[mailru$pass_length <  6,]))
    print(nrow(mailru[mailru$pass_length >  60,]))
    print(nrow(gmail[gmail$pass_length <  6,]))
    print(nrow(gmail[gmail$pass_length >  60,]))
    print("removing invalid passwords by length")
    yandex <- subset(yandex, pass_length >= 6 & pass_length <= 60)
    mailru <- subset(mailru, pass_length >= 6 & pass_length <= 60)
    gmail  <- subset(gmail , pass_length >= 6 & pass_length <= 60)
    print("checking that they are removed")
    print(nrow(yandex[yandex$pass_length <  6,]))
    print(nrow(yandex[yandex$pass_length >  60,]))
    print(nrow(mailru[mailru$pass_length <  6,]))
    print(nrow(mailru[mailru$pass_length >  60,]))
    print(nrow(gmail[gmail$pass_length <  6,]))
    print(nrow(gmail[gmail$pass_length >  60,]))
    print("visualizing distribution of password lengths by provider")
    library(ggplot2)
    library(Rmisc)  # assumption: multiplot() comes from the Rmisc package; it is not part of ggplot2
    gmailcolor  <- "deepskyblue"
    yandexcolor <- "orangered1"
    mailrucolor <- "limegreen"
    # pass_length is numeric, so a continuous x scale with one-character-wide bins is used
    pgmail  <- ggplot(data=gmail,  aes(x=pass_length)) + geom_histogram(binwidth=1, colour="black", fill=gmailcolor,  aes(y=after_stat(density))) + scale_x_continuous(breaks=seq(6, 20, 1)) + coord_cartesian(xlim=c(5,21.5)) + xlab("Password length") + ylab("Share") + ggtitle("Gmail")
    pyandex <- ggplot(data=yandex, aes(x=pass_length)) + geom_histogram(binwidth=1, colour="black", fill=yandexcolor, aes(y=after_stat(density))) + scale_x_continuous(breaks=seq(6, 21, 1)) + coord_cartesian(xlim=c(5,21.5)) + xlab("Password length") + ylab("Share") + ggtitle("Yandex")
    pmailru <- ggplot(data=mailru, aes(x=pass_length)) + geom_histogram(binwidth=1, colour="black", fill=mailrucolor, aes(y=after_stat(density))) + scale_x_continuous(breaks=seq(6, 20, 1)) + coord_cartesian(xlim=c(5,20.5)) + xlab("Password length") + ylab("Share") + ggtitle("Mail.Ru")
    multiplot(pgmail, pyandex, pmailru, cols=3)
    print("computing strength of the passwords")
    yandex$strength <- sapply(yandex$password, strength)
    mailru$strength <- sapply(mailru$password, strength)
    gmail$strength  <- sapply(gmail$password,  strength)
    # the limits 1..5 leave zero-score passwords out of the plot, as in the original scale; bars show counts
    strength_scale <- scale_x_discrete(limits=as.character(1:5), labels=c("Very\nweak", "Weak", "Medium", "Strong", "Very\nstrong"))
    pgmail  <- ggplot(data=gmail,  aes(factor(strength))) + geom_bar(colour="black", fill=gmailcolor)  + xlab("Strength") + ylab("Count") + ggtitle("Gmail")   + strength_scale
    pyandex <- ggplot(data=yandex, aes(factor(strength))) + geom_bar(colour="black", fill=yandexcolor) + xlab("Strength") + ylab("Count") + ggtitle("Yandex")  + strength_scale
    pmailru <- ggplot(data=mailru, aes(factor(strength))) + geom_bar(colour="black", fill=mailrucolor) + xlab("Strength") + ylab("Count") + ggtitle("Mail.Ru") + strength_scale
    multiplot(pgmail, pyandex, pmailru, cols=3)
     print("Zero strength passwords")
     print(head(gmail[gmail$strength == 0,]))
     print(head(yandex[yandex$strength == 0,]))
     print(head(mailru[mailru$strength == 0,]))
    table_gmail  <- sort(table(gmail$password) , TRUE)
    table_yandex <- sort(table(yandex$password), TRUE)
    table_mailru <- sort(table(mailru$password), TRUE)
    print("gmail most frequent")
    print(head(table_gmail, 100))
    print("yandex most frequent")
    print(head(table_yandex, 100))
    print("mailru most frequent")
    print(head(table_mailru, 100))
    only_pass_gmail  <- gmail[ ,2] 
    write.csv(only_pass_gmail,  "only_pass_gmail",  row.names = FALSE)
    only_pass_yandex <- yandex[,2] 
    write.csv(only_pass_yandex, "only_pass_yandex", row.names = FALSE)
    only_pass_mailru <- mailru[,2] 
    write.csv(only_pass_mailru, "only_pass_mailru", row.names = FALSE)

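    Dictionary attack experiment code
    # $data: file with one password per line; $dict: output file for matches
    # (both are assumed to be set by the caller)
    j=0
    > "$dict"
    while IFS= read -r p; do
      j=$((j + 1))
      echo -n "$j"
      # -x: whole-line match, -F: treat the password as a literal string
      if grep -qxF -- "$p" dictionary/*; then
        echo " in "
        echo "$p" >> "$dict"
      else
        echo " out "
      fi
      # presumably the run was limited to the first 10000 passwords
      if (( j > 10000 )); then
        break
      fi
    done < "$data"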


    Thus, the most likely hypothesis is that this sample is a compilation from various sources (phishing, malware infection, dictionary attacks, aggregation of lists already in circulation) gathered over a long period of time. A considerable part of the data cannot be valid passwords at all by formal syntactic criteria, and the experimental check confirmed this.

    From the user's point of view, this event does not pose a significant danger and looks more like an attempt to manufacture a news story.

    UPD: Further evidence that the leaked data is a compilation of various sources is the presence in the database of a set of Gmail accounts that use the "+" feature, where the address has the form name+domain-or-word@ (thanks to geka for the reminder about this feature).
    Top 10 domains from the selection (the whole list is here):
    176 xtube
    132 daz
    88 filedropper
    66 daz3d
    64 eharmony
    63 friendster
    62 savage
    57 spam
    54 bioware
    52 savage2
    11 paygr
    11 comicbookdb
    About paygr: user gkond wrote:
    "Found my email in the dump. The password next to it was automatically generated by the service back in May 2011."

    At the same time, “paygr” occurs 11 times in the “+” Gmail list. Perhaps their database has also been compromised.

    Most importantly, comicbookdb acknowledged that their database had indeed been stolen along with the passwords (thanks to EnterSandman for the link):
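    Counts like the ones in the list above can be obtained by extracting the part between "+" and "@" from each address and tabulating it. The sketch below is an illustration only: extract_plus_tag is a hypothetical helper, and the addresses are made up (only the tags xtube and daz are borrowed from the list).

```r
# Return the "+tag" part of name+tag@domain addresses, NA when there is none.
extract_plus_tag <- function(emails) {
  m <- regmatches(emails, regexec("^[^+@]+\\+([^@]+)@", emails))
  vapply(m, function(x) if (length(x) == 2) x[2] else NA_character_,
         character(1))
}

# Made-up addresses
emails <- c("user1+xtube@gmail.com", "user2@gmail.com",
            "user3+daz@gmail.com", "user4+xtube@gmail.com")
tags <- extract_plus_tag(emails)

# table() ignores NA by default, so untagged addresses are not counted
tag_counts <- sort(table(tags), decreasing = TRUE)
print(tag_counts)
```

    A tag that matches a third-party service name hints that the address and password pair may have originated from that service's user database.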
