Internet Archive will crawl websites regardless of robots.txt settings
A website is essentially a set of files and folders that lives on a server. Among these files there is almost always one called robots.txt, placed in the site's root. It serves to instruct the "spiders": it tells search robots which parts of the site may be crawled and which may not. Webmasters use these instructions, for example, to keep duplicate content (tag pages, category pages, etc.) out of search indexes for SEO reasons, or to keep data that for some reason should not be public away from robots.
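For illustration, here is a minimal robots.txt sketch of the kind described above; the paths are hypothetical, but ia_archiver is the Internet Archive's actual crawler name:

```
# Keep duplicate-content sections out of search indexes for all crawlers
User-agent: *
Disallow: /tags/
Disallow: /category/

# Block one specific bot from the entire site
# (ia_archiver is the Internet Archive's crawler)
User-agent: ia_archiver
Disallow: /
```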
The idea of robots.txt appeared more than 20 years ago, and although the specific directives aimed at various search bots have changed since then, the mechanism works the same way it did years ago. The instructions in this file are obeyed by almost all search engines, as well as by the Internet Archive bot, which roams the web collecting pages for archiving. Now the developers of the service believe it is time to stop paying attention to what is written in robots.txt.
The problem is that the domains of abandoned sites often "drop", that is, their registration is not renewed, or the site's content is simply destroyed. Such domains are then "parked" (for a variety of purposes, including earning money from advertisements placed on the parked domain). Webmasters of parked domains usually close off the entire site in robots.txt. Worst of all, when the Internet Archive robot sees a directive blocking a directory from indexing, it deletes the already saved content of the site that previously lived on that domain.
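To make the mechanics concrete, here is a short Python sketch, using the standard library's urllib.robotparser, of the check an obedient crawler performs when it encounters the blanket disallow typical of a parked domain; the URL is hypothetical:

```python
from urllib import robotparser

# Typical robots.txt of a parked domain: the entire site is closed to all bots.
PARKED_ROBOTS_TXT = """
User-agent: *
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(PARKED_ROBOTS_TXT.splitlines())

# Any crawler that honors robots.txt -- including, until now, the
# Internet Archive bot -- concludes that nothing here may be fetched.
print(parser.can_fetch("ia_archiver", "https://example.com/old-article.html"))
# -> False
```

Until the policy change described below, the archive applied such a verdict retroactively to its already saved snapshots, which is exactly what causes the data loss.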
In other words, a site that was in the Internet Archive's database disappears from it, even though the domain now belongs to a different owner and the content the service once saved has long since vanished from the live web. As a result, unique data is deleted that could well be of great value to a certain category of people.
The Internet Archive takes snapshots of sites. If a site exists for a long time, there can be many such snapshots, so the development of a site can be traced from its very beginning to the newest version; habrahabr.ru is one example. When access for bots is blocked via robots.txt, tracing that history, or getting any information at all, becomes impossible.
A few months ago, the Internet Archive staff stopped following robots.txt instructions on US government websites. The experiment was successful, and now the Internet Archive bot will stop paying attention to the instructions in robots.txt on any site. If a webmaster wants his resource's content removed from the archive, he can contact the Internet Archive administration by e-mail.
For now, the developers will monitor the robot's behavior and the operation of the service itself under the new policy. If all goes well, the changes will become permanent.