They crawled GitHub

    A group of researchers from North Carolina State University (NCSU) studied GitHub, the service for hosting software projects and developing them collaboratively. They found that more than 100,000 GitHub repositories contain API keys, tokens, and cryptographic keys.



    Unintentional leaks of critical information (encryption keys, tokens, API keys for various online services, and so on) have long been one of the hottest topics in security.


    Such leaks have already led to several major incidents involving personal data (Uber, DJI, DXC Technology, etc.).


    Between October 31, 2017 and April 20, 2018, the NCSU researchers scanned 4,394,476 files in 681,784 repositories through GitHub's own search API, plus 2,312,763,353 files in 3,374,973 repositories from a dataset already compiled in Google BigQuery.


    During the scan, the researchers looked for strings matching the patterns of API keys (Stripe, MailChimp, YouTube, etc.), tokens (Amazon MWS, PayPal Braintree, Amazon AWS, etc.), or cryptographic keys (RSA, PGP, etc.).
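
    Pattern-based scanning of this kind can be sketched in a few lines of Python. This is only a minimal illustration: the three regular expressions are assumptions based on widely known key formats (the AWS access key ID prefix and the PEM/PGP private key headers), not the templates the researchers actually used, and the function names are hypothetical.

```python
import re

# Illustrative patterns only (assumptions, not the study's exact templates).
PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Header lines of PEM-encoded RSA keys and ASCII-armored PGP keys.
    "rsa_private_key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
    "pgp_private_key": re.compile(r"-----BEGIN PGP PRIVATE KEY BLOCK-----"),
}


def find_candidate_secrets(text):
    """Return a (kind, matched_string) pair for every pattern hit in text."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0)))
    return hits


def unique_secrets(hits):
    """Collapse repeated finds, mirroring the total vs. unique distinction."""
    return set(hits)
```

    A match is only a candidate: a string with the right shape may be a placeholder or a revoked key, so a real scanner would need an additional validation step.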



    In total, the researchers found 575,476 tokens, API keys, and cryptographic keys, 201,642 of them unique. 93.58% of the finds were associated with single-owner accounts.



    A manual check of part of the results turned up AWS credentials for the website of a major government agency in a Western European country, and for a server holding millions of applications for admission to an American college.


    The study revealed an interesting trend: when owners did notice a leak, they reacted quickly. Of the data the researchers monitored, 19% was deleted within 16 days (12% within the first day; on what “deleted” means here, see below), while 81% was not removed at all during the observation period.


    Most interesting of all, none of the “deleted” data the researchers observed was physically deleted: the owners simply made a new commit, so the secrets remained in the repository history.
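
    To see why a new commit does not erase a secret, one can search the whole history rather than just the latest version of the files. The sketch below is an assumption about how such a check might look, not the researchers' method: it relies on git's `-S` search option, and the function names and the `repo_path` parameter are hypothetical.

```python
import subprocess


def history_search_command(secret):
    """Build a `git log` command that lists commits whose changes add or
    remove occurrences of `secret`, across all refs."""
    # -S<string> is git's "pickaxe" search; --all covers every branch and tag,
    # not only the currently checked-out one.
    return ["git", "log", "--all", "--oneline", "-S" + secret]


def commits_touching_secret(repo_path, secret):
    """Run the search inside a local clone and return one line per commit."""
    result = subprocess.run(
        history_search_command(secret),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

    If the list is non-empty after the secret has disappeared from the current files, the value can still be recovered from an older commit, which is why a leaked key is generally treated as compromised.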


    At the end of last year we published a short note on Habr describing how to use the DeviceLock DLP solution to prevent unintentional leaks by controlling the data uploaded to GitHub.


    News about individual data leaks is posted promptly on the Information Leaks channel.

