Secret data leaks found in more than 100,000 GitHub repositories
The methodology for collecting secrets consists of several phases, which ultimately makes it possible to identify secret data with a high degree of confidence. Illustration from the research paper
GitHub and similar platforms for publishing open source code have become standard tools for developers. However, a problem arises when that code works with authentication tokens, private API keys, or private cryptographic keys. For security, this data must be kept secret. Unfortunately, many developers add sensitive information to their code, which often leads to accidental leaks.
A group of researchers from North Carolina State University conducted a large-scale study of secret data leaks on GitHub. They scanned billions of files, collected by two complementary methods:
- real-time scanning of public GitHub commits for nearly six months
- a snapshot of public repositories covering 13% of all repositories on GitHub, about 4 million repositories in total.
The findings are discouraging. The researchers found not only that leaks are widespread, affecting more than 100,000 repositories, but also that thousands of new, unique “secrets” reach GitHub every day.
The table lists the APIs of popular services and the risks associated with the leak of this information.
Overall statistics on the secrets found show that Google API keys are exposed most often. RSA private keys and Google OAuth IDs are also common. The vast majority of leaks occur through single-owner repositories.
Secret | Total | Unique | % single-owner |
---|---|---|---|
Google API Key | 212,892 | 85,311 | 95.10% |
RSA Private Key | 158,011 | 37,781 | 90.42% |
Google OAuth ID | 106,909 | 47,814 | 96.67% |
General Private Key | 30,286 | 12,576 | 88.99% |
Amazon AWS Access Key ID | 26,395 | 4,648 | 91.57% |
Twitter Access Token | 20,760 | 7,953 | 94.83% |
EC Private Key | 7,838 | 1,584 | 74.67% |
Facebook Access Token | 6,367 | 1,715 | 97.35% |
PGP Private Key | 2,091 | 684 | 82.58% |
MailGun API Key | 1,868 | 742 | 94.25% |
MailChimp API Key | 871 | 484 | 92.51% |
Stripe Standard API Key | 542 | 213 | 91.87% |
Twilio API Key | 320 | 50 | 90.00% |
Square Access Token | 121 | 61 | 96.67% |
Square OAuth Secret | 28 | 19 | 94.74% |
Amazon MWS Auth Token | 28 | 13 | 100.00% |
Braintree Access Token | 24 | 8 | 87.50% |
Picatic API Key | 5 | 4 | 100.00% |
Total | 575,456 | 201,642 | 93.58% |
Real-time monitoring of commits made it possible to determine how much sensitive information is removed from repositories shortly after it appears. It turned out that slightly more than 10% of secrets are deleted on the first day, and a few percent more over the following days. However, more than 80% of secrets remain in the repositories two weeks after being added, and this proportion hardly decreases afterwards.
Among the most notable leaks are an AWS account belonging to a government agency in an Eastern European country and 7,280 private RSA keys for accessing thousands of private VPNs.
The study shows that an attacker, even with minimal resources, can compromise many GitHub users and find a large number of private keys. The authors note that many existing protection methods are ineffective against the harvesting of secrets. For example, tools like TruffleHog are only about 25% effective. GitHub's built-in limit on the number of API requests is also easily bypassed.
However, many of the discovered secrets follow clear patterns that make them easier to find. It is logical to assume that the same patterns can be used to monitor for leaked secrets and warn developers. Such mechanisms should probably be implemented on the server side, that is, on GitHub itself. The service could issue a warning right at commit time.
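As a rough illustration of pattern-based detection, here is a minimal sketch in Python. The regular expressions are assumptions based on widely cited key formats (the `AKIA` prefix for AWS access key IDs, the `AIza` prefix for Google API keys, the PEM header for RSA private keys); they are not taken from the article, and the function name `find_secrets` is hypothetical. A real scanner would need many more patterns and some filtering of false positives.

```python
import re

# Illustrative patterns only (assumed from commonly cited key formats,
# not from the article or from any specific tool).
SECRET_PATTERNS = {
    "Amazon AWS Access Key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "Google API Key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "RSA Private Key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
}


def find_secrets(text):
    """Return (line_number, secret_type) pairs for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for secret_type, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, secret_type))
    return findings
```

A check like this could run against the text of each commit before it is accepted, for example `find_secrets(diff_text)`, and a non-empty result would trigger a warning to the developer.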
GitHub recently released a beta version of its Token Scanning feature, which scans repositories, searches for tokens, and notifies service providers of leaks. The provider can then revoke the key. The authors believe that their research can help GitHub improve this feature and expand the number of supported providers.