Secret data leaks found in more than 100,000 GitHub repositories


    The methodology for collecting secrets consists of several phases, which ultimately makes it possible to identify secrets with a high degree of confidence. Illustration from the research paper

    GitHub and similar platforms for publishing open source code have become standard tools for developers. However, a problem arises when that code works with authentication tokens, private API keys, or private cryptographic keys. To remain secure, this data must be kept secret. Unfortunately, many developers put sensitive information directly into their code, which often leads to accidental leaks.

    A group of researchers from the University of North Carolina conducted a large-scale study of secret data leaks on GitHub. They scanned billions of files collected by two complementary methods:

    • Real-time scanning of public GitHub commits over nearly six months
    • A snapshot of public repositories covering 13% of all repositories on GitHub, about 4 million repositories in total

    The conclusions are disappointing. The researchers found that leaks are widespread and affect more than 100,000 repositories. Worse still, thousands of new, unique secrets reach GitHub every day.

    The table lists the APIs of popular services and the risks associated with the leak of this information.



    Overall statistics on the secrets found show that Google API keys are exposed most often. RSA private keys and Google OAuth IDs are also common. The vast majority of leaks occur in single-owner repositories.

    Secret                      Total      Unique     % single owner
    Google API key              212,892    85,311     95.10%
    RSA private key             158,011    37,781     90.42%
    Google OAuth ID             106,909    47,814     96.67%
    General private key          30,286    12,576     88.99%
    Amazon AWS access key ID     26,395     4,648     91.57%
    Twitter access token         20,760     7,953     94.83%
    EC private key                7,838     1,584     74.67%
    Facebook access token         6,367     1,715     97.35%
    PGP private key               2,091       684     82.58%
    MailGun API key               1,868       742     94.25%
    MailChimp API key               871       484     92.51%
    Stripe standard API key         542       213     91.87%
    Twilio API key                  320        50     90.00%
    Square access token             121        61     96.67%
    Square OAuth secret              28        19     94.74%
    Amazon MWS auth token            28        13     100.00%
    Braintree access token           24         8     87.50%
    Picatic API key                   5         4     100.00%
    Total                       575,456   201,642     93.58%

    Real-time monitoring of commits made it possible to determine how much sensitive information is removed from repositories shortly after it appears. It turned out that a little more than 10% of secrets are deleted on the first day, and a few percent more over the following days. However, more than 80% of secrets are still in the repositories two weeks after being added, and this share barely decreases after that.

    Among the most notable leaks are an AWS account from a government agency in one of the countries of Eastern Europe, as well as 7,280 private RSA keys for accessing thousands of private VPNs.

    The study shows that an attacker, even with minimal resources, can compromise many GitHub users and harvest a large number of private keys. The authors note that many existing protection methods are ineffective against this kind of secret harvesting. For example, tools like TruffleHog show a detection rate of only 25%. The built-in GitHub limit on the number of API requests is also easily bypassed.
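
    TruffleHog is known for flagging strings that look random, using Shannon entropy as the measure. The sketch below illustrates that general heuristic only; it is not TruffleHog's code, and the function names, the threshold of 4.0 bits per character, and the minimum length of 20 are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy of s in bits per character (0.0 for an empty string)."""
    if not s:
        return 0.0
    counts = Counter(s)
    total = len(s)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_random(token, threshold=4.0, min_length=20):
    """Hypothetical filter: long tokens with high entropy are flagged.

    Both threshold and min_length are illustrative assumptions,
    not values taken from any particular tool.
    """
    return len(token) >= min_length and shannon_entropy(token) > threshold


if __name__ == "__main__":
    for token in ("a" * 30, "abcdefghijklmnopqrstuvwxyz012345"):
        print(token, round(shannon_entropy(token), 2), looks_random(token))
```

    A heuristic like this cannot tell a key from any other random-looking string, such as a hash or an identifier, and it says nothing about which service a key belongs to.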

    However, many of the secrets discovered follow distinct patterns that make them easier to find. It is logical to assume that the same patterns can be used to monitor for leaked secrets and warn developers. Such mechanisms should probably be implemented on the server side, that is, on GitHub itself. The service could issue a warning at the moment of a commit.
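
    A minimal sketch of such pattern-based detection might look like the following. The three regular expressions reflect widely documented key formats (the "AIza" prefix of Google API keys, the "AKIA" prefix of AWS access key IDs, and the PEM header of RSA private keys), but the pattern list and the find_secrets helper are assumptions made for illustration, not the researchers' actual rule set or anything GitHub runs.

```python
import re

# Illustrative patterns only: an assumption of this sketch,
# not the exact rules used in the study.
SECRET_PATTERNS = {
    "Google API key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "Amazon AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "RSA private key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
}


def find_secrets(text):
    """Return (secret type, line number) pairs for every pattern match."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((name, line_no))
    return findings


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID from AWS documentation.
    sample = 'region = "us-east-1"\naws_key = "AKIAIOSFODNN7EXAMPLE"\n'
    for name, line_no in find_secrets(sample):
        print(f"warning: possible {name} on line {line_no}")
```

    Run over the text of a commit before it is accepted, a check like this could produce the kind of warning described above. A match only shows that a string has the right shape; confirming that it is a live credential would need a separate validation step.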

    GitHub recently launched a beta version of its Token Scanning feature, which scans repositories, searches for tokens, and notifies service providers when a leak is found. The provider can then revoke the exposed key. The authors believe that their research can help GitHub improve this feature and expand the number of supported providers.
