Big Brother is watching you, or the dark side of the Internet

    UPD: phpBB really is to blame: it treats bots as registered users and gives them read permissions. Thanks, khim and mikes.
    Some years ago I read somewhere that Google was going to index the “dark side of the Internet”: all kinds of databases, closed libraries, and paid sites in general. That is, information you can only view after entering at least a username and password. By some estimates, such “dark” information makes up 90 to 98% of the Internet.
    At the time I was delighted: it would become possible to read experts-exchange.com (yes, I know about the End key) and the similar sites I used.

    But recently I needed to set up an internal forum for an organization. The organization is fairly large, with offices spread across the country. The goal was to give geographically distributed employees a simple way to communicate within the organization. The forum was meant for discussing internal information that, to put it mildly, competitors should not have access to.

    What I did:
    • Added a subdomain
    • Installed and configured phpBB
    • Closed all forums: an unauthorized user gets the message “There are no forums on this site”
    • Added an extra field to the registration page with a question whose answer is known only to employees of the organization
    • Notified employees by email only. The link was not published anywhere on the Internet.
    However, a week later I noticed Googlebot, Yandex's bot, and other lesser-known spiders in the logs. That didn’t bother me: there are plenty of services that publish DNS statistics, and search engines could have found the forum through them.
    But a month later, I saw in the logs that Google was indexing the forum:
    66.249.71.178 - - [time] "GET /robots.txt HTTP/1.1" 404 2152 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.71.178 - - [time] "GET / HTTP/1.1" 200 17743 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.71.178 - - [time] "GET /viewtopic.php?f=x5&p=y96 HTTP/1.1" 200 26238 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.71.178 - - [time] "GET /viewforum.php?f=x5 HTTP/1.1" 200 13482 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.71.177 - - [time] "GET /viewforum.php?f=x0 HTTP/1.1" 200 14550 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.71.178 - - [time] "GET /viewtopic.php?f=x5&p=y34 HTTP/1.1" 200 15503 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
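
    Spotting crawler requests like these by eye gets tedious, so here is a minimal Python sketch of how such log lines could be filtered. It assumes the log format shown above; the names LOG_LINE, BOT_MARKERS and bot_hits, as well as the third sample line, are made up for illustration and are not taken from the original setup.

```python
import re

# Pattern for access-log lines shaped like the ones above (an assumption
# about the format; adjust it to the real server configuration).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]*)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Substrings used to recognise crawlers; an illustrative, non-exhaustive list.
BOT_MARKERS = ("googlebot", "yandex")


def bot_hits(lines):
    """Yield (ip, path, status) for requests whose User-Agent looks like a crawler."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent").lower()
        if any(marker in agent for marker in BOT_MARKERS):
            yield match.group("ip"), match.group("path"), int(match.group("status"))


SAMPLE = [
    '66.249.71.178 - - [time] "GET /robots.txt HTTP/1.1" 404 2152 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.71.178 - - [time] "GET /viewtopic.php?f=x5&p=y96 HTTP/1.1" 200 26238 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    # Hypothetical ordinary browser request, added only for contrast.
    '10.0.0.5 - - [time] "GET / HTTP/1.1" 200 17743 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

if __name__ == "__main__":
    for ip, path, status in bot_hits(SAMPLE):
        print(ip, path, status)
```

    A 200 status on a viewtopic.php request from a crawler is the worrying part: it means the page content was actually served. Keep in mind that the User-Agent header can be spoofed, so a match only shows what the client claimed to be.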
    


    I was somewhat shocked. HOW? How did Google get access to the forum? By this time the first two links were already showing up for the query "site:forum.of.site.com".
    I quickly added a robots.txt:
    	User-agent: Googlebot
    	Disallow: /
    	

    After a while the bot re-read robots.txt but kept on indexing. A week later, several dozen pages had appeared in Google’s cache.
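
    To rule out a mistake in the rules themselves, they can be checked offline. The sketch below uses Python's standard urllib.robotparser on the same two-line robots.txt; the helper name is_allowed and the test URL are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt above.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if these robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://forum.of.site.com/viewtopic.php?f=x5&p=y96"
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
    print("Yandex allowed:", is_allowed(ROBOTS_TXT, "Yandex", url))
```

    With these rules the parser refuses Googlebot everywhere but still allows any other crawler, because the file names only Googlebot. A "User-agent: *" record would be needed to cover the rest.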

    I started looking for ways to remove pages from the index and the cache.
    Google recommends adding lines to the pages’ HTML.

    I did that immediately; nevertheless, the indexing continued and the number of cached pages kept growing.

    I kept searching and found a tool for submitting a web page removal request. The service is inconvenient in that it lets you remove only one URL at a time and asks a lot of questions, but anyone can submit a request.
    Fortunately, I found a way to remove the entire site: add it to your sites in Google's webmaster tools, verify ownership, and then you can remove it. Perhaps the profession of SED (Search Engine Deoptimizer) will soon be in demand :)

    But the main question remains:

    How did Google get access?


    I have only one guess: one of the employees uses Google Desktop (its user-agent string points to this). Apparently, Google Desktop passes cookies along; in effect, it steals them. I don’t think it passes all form data, since that would be a scandal, and there are no POST requests from the bot.
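
    As the update at the top of this post says, the real cause turned out to be the forum engine itself: bots were treated like registered users and given read permissions. The toy Python model below only illustrates that kind of rule. It is not phpBB's actual code or schema, and every name in it (FORUM_READ_GROUPS, BOT_AGENT_MARKERS, Session, start_session, can_read) is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical permission table: guests are absent, so anonymous visitors
# see no forums, but the "bots" group has read access like members do.
FORUM_READ_GROUPS = {"registered", "bots"}

# Illustrative User-Agent markers used to recognise crawlers.
BOT_AGENT_MARKERS = ("googlebot", "yandex")


@dataclass
class Session:
    group: str


def start_session(user_agent, logged_in=False):
    """Pick a group for the visitor; known crawlers land in the 'bots' group."""
    agent = user_agent.lower()
    if any(marker in agent for marker in BOT_AGENT_MARKERS):
        return Session(group="bots")
    return Session(group="registered" if logged_in else "guests")


def can_read(session):
    """Return True if the session's group may read the forums."""
    return session.group in FORUM_READ_GROUPS


if __name__ == "__main__":
    googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print("crawler can read:", can_read(start_session(googlebot)))
    print("anonymous visitor can read:", can_read(start_session("Mozilla/5.0 (Windows NT 6.1)")))
```

    In a setup like this, closing the forums to guests is not enough: the group that crawlers are mapped to has to lose its read permission as well.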

