
What part of the web is archived

The Internet Archive's Wayback Machine is the largest and best-known archive, storing web pages since 1995. Besides it, there are dozens of other services that also archive the web: search engine indexes and specialized archives such as Archive-It, UK Web Archive, WebCite, ArchiefWeb, Diigo, etc. How many web pages end up in these archives, relative to the total number of documents on the Internet?
It is known that as of 2011 the Internet Archive database contains more than 2.7 billion URIs, many of them in several copies made at different points in time. For example, the front page of Habr has already been "photographed" 518 times since July 3, 2006.
It is also known that Google's link index crossed the mark of one trillion unique URLs five years ago, although many documents there are duplicates. Google cannot analyze every URL, so the company decided to treat the number of documents on the Internet as infinite.
As an example of this "infinity of web pages," Google cites a web calendar application: there is no point in downloading and indexing its pages for millions of years ahead, because each page is generated on request.
Nevertheless, researchers want to know, at least approximately, what part of the web is archived and preserved for posterity. Until now, no one could answer this question. Specialists from Old Dominion University in Norfolk conducted a study and obtained a rough estimate.
To process the data, they used the Memento HTTP framework, which operates with the following concepts:
- URI-R to identify the address of the original resource.
- URI-M to identify the archived state of this resource at time t.
Accordingly, each URI-R may have zero or more URI-M states.
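
To make the URI-R / URI-M pair concrete, here is a minimal sketch (not code from the study) that counts how many mementos (URI-Ms) an archive holds for a single original resource (URI-R). It assumes the Internet Archive's public Memento TimeMap endpoint at web.archive.org/web/timemap/link/ and its standard link-format response; the Habr front page is used as the URI-R simply because it was mentioned above.

```python
# Minimal sketch: count the archived states (URI-Ms) of an original
# resource (URI-R) via a Memento TimeMap. The endpoint and response
# format are assumptions based on the Memento protocol, not the
# researchers' code.
import urllib.request

URI_R = "https://habr.com/"  # the original resource we ask about
TIMEMAP = "https://web.archive.org/web/timemap/link/" + URI_R

with urllib.request.urlopen(TIMEMAP) as resp:
    timemap = resp.read().decode("utf-8", errors="replace")

# A TimeMap is served in link-format, one entry per line; memento
# entries carry "memento" among their rel tokens (e.g. rel="memento",
# rel="first memento"). Entries like rel="original" or rel="timegate"
# are skipped by the filter below.
mementos = [
    line for line in timemap.splitlines()
    if 'rel="' in line and "memento" in line.split('rel="')[1].split('"')[0]
]
print(f"{URI_R}: {len(mementos)} archived copies (URI-Ms) found")
```

Within the same framework, a TimeGate lets a client negotiate by datetime, i.e. ask the archive for the URI-M closest to a given time t rather than listing them all.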
From November 2010 to January 2011, an experiment was conducted to determine the proportion of publicly available pages that end up in the archives. Since the number of URIs on the Internet is infinite (see above), it was necessary to find an acceptable sample that is representative of the entire web. Here the researchers combined several approaches:
- A selection from the Open Directory Project (DMOZ).
- Random sampling of URIs from search engines, as described by Ziv Bar-Yossef and Maxim Gurevich, “Random sampling from a search engine's index” (Journal of the ACM (JACM), 55(5), 2008).
- The most recently added URIs on the Delicious social bookmarking site, obtained via the Delicious Recent Random URI Generator.
- Links from the Bitly URL-shortening service, sampled with a random hash generator (see the sketch after this list).
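
For the Bitly sample, the idea is to draw random short-URL hashes and keep only those that actually resolve to a long URL. The sketch below is an illustration of that idea, not the researchers' tool: the 6-character base-62 hash format and the assumption that bit.ly answers an existing hash with a 301/302 redirect (and an unknown one with 404) are mine.

```python
# Illustrative sketch of random-hash sampling of a URL shortener.
# Assumptions (not from the paper): 6-character base-62 hashes, and
# that bit.ly replies to an existing hash with a redirect whose
# Location header is the original long URL.
import http.client
import random
import string

ALPHABET = string.ascii_letters + string.digits  # base-62 alphabet


def random_hash(length=6):
    """Draw a random short-URL hash of the assumed length."""
    return "".join(random.choice(ALPHABET) for _ in range(length))


def resolve(short_hash):
    """Return the long URL behind https://bit.ly/<hash>, or None if unused."""
    conn = http.client.HTTPSConnection("bit.ly", timeout=10)
    try:
        conn.request("HEAD", "/" + short_hash)
        resp = conn.getresponse()
        if resp.status in (301, 302):
            return resp.getheader("Location")
        return None
    finally:
        conn.close()


# Keep drawing hashes until a small sample of live URIs is collected
# (the study capped each of its samples at one thousand addresses).
sample = []
while len(sample) < 10:
    url = resolve(random_hash())
    if url:
        sample.append(url)
print(sample)
```

In a real run, misses (404s) would far outnumber hits, so rate limiting and politeness delays would be needed; the sketch omits them for brevity.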
For practical reasons, the size of each sample was limited to a thousand addresses. The analysis results are shown in the summary table for each of the four samples.

The study showed that, depending on the sample, from 35% to 90% of URIs have at least one copy in an archive. From 17% to 49% of URIs have two to five copies, from 1% to 8% have been "photographed" 6 to 10 times, and from 8% to 63% have been archived 10 or more times.
With relative certainty, we can say that at least 31.3% of all URIs are archived once a month or more often, and at least 35% of all URIs have at least one archived copy.
Naturally, the figures above do not apply to the so-called Deep Web, which conventionally includes dynamically generated pages served from databases, password-protected directories, social networks, paid archives of newspapers and magazines, Flash sites, digital books and other resources that are hidden behind a firewall, are not publicly accessible, and/or are not available for indexing by search engines. According to some estimates, the deep web may be several orders of magnitude larger than the surface layer.