Broken links - some statistics
After seeing today's post "D'Artagnan and the Internet, or working on the problem of broken links", I decided to share some statistics I collected while writing my master's thesis.
One of the tasks in my thesis was to solve the problem of broken links on a single resource. To demonstrate how relevant the problem is, I downloaded a dump of the Wikipedia database and ran a program that checked 700 thousand external links from its articles.
It turned out that 20% of the links were broken!
Study
A link was considered broken in any of the following cases:
○ The domain has been removed from DNS.
○ The HTTP connection fails.
○ The server returns a 4xx or 5xx HTTP response code: typically a deleted page (404), denied access (403), or a server error (500).
○ An internal page redirects to the site's home page.
○ An infinite HTTP 3xx redirect loop.
Substituted page content and program errors from PHP, ASP, and the like were also monitored, but these data were not included in the statistics.
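To make the criteria concrete, here is a minimal sketch of such a check in Python. This is not the program used in the thesis; the use of the requests library and the exact heuristics (a 30-redirect cap standing in for loop detection, path comparison for the home-page redirect) are my assumptions.

```python
import socket
from urllib.parse import urlsplit

import requests

def check_link(url: str, timeout: float = 10.0) -> str:
    parts = urlsplit(url)
    # Case 1: the domain has been removed from DNS.
    try:
        socket.getaddrinfo(parts.hostname, None)
    except socket.gaierror:
        return "broken: domain not in DNS"
    try:
        # requests follows up to 30 redirects by default and raises
        # TooManyRedirects on a loop (case 5).
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.TooManyRedirects:
        return "broken: redirect loop"
    except requests.RequestException:
        # Case 2: failure when connecting over HTTP.
        return "broken: connection failed"
    # Case 3: a 4xx or 5xx response code.
    if resp.status_code >= 400:
        return f"broken: HTTP {resp.status_code}"
    # Case 4: an internal page was redirected to the home page.
    if parts.path not in ("", "/") and urlsplit(resp.url).path in ("", "/"):
        return "broken: redirected to home page"
    return "ok"
```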
The database dump was obtained in August 2009. Three checks were then performed:
● October 2009 - 20.7% of links are broken
● November 2009 - 22.4%
● April 2010 - 23.8%
A gradual growth in the share of broken links is evident. At the same time, only 4% of the links that had previously been dead came back to life, i.e. in the vast majority of cases the failure is irreversible.
The figure below shows the breakdown of reasons why links stopped working:
A check of the external-site link catalog of the federal educational portal www.edu.ru revealed a similar picture: 24.5% of the links do not work.
Of course, this study makes no claim to scientific rigor, and the results are not especially accurate. Some of the checked links may have belonged to old revisions of articles; I could not track this. But it is clear that the problem of broken links exists. Some more numbers:
According to DomainTools, about 100,000 domain names cease to exist in a single day, and in total the number of expired domains exceeds the number of active ones more than threefold (for the .com, .org, .net, .info, .biz, and .us zones).
Archive.org estimates the average lifespan of a web page at 44 to 75 days.
What to do
If you need to keep external links working, you can use one of the following methods:
1. Periodically check links automatically and hide or delete the broken ones.
This approach is suitable when link availability is not critical, or when you only need to link to a site as a whole. Ready-made programs implement this principle: PHP Spider, ht://Check, VEinS, etc.
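For illustration, a sketch of such a periodic sweep over a link database. The links table, its columns, and the HEAD-request check are hypothetical; a real checker would apply the full set of criteria from the study above.

```python
import sqlite3

import requests

def sweep(db_path: str = "links.db") -> None:
    """Re-check all visible links and hide the ones that are dead."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT id, url FROM links WHERE visible = 1").fetchall()
    for link_id, url in rows:
        try:
            # HEAD is cheap, though some servers handle it poorly;
            # a GET would be the safer (and heavier) choice.
            status = requests.head(url, timeout=10, allow_redirects=True).status_code
            alive = status < 400
        except requests.RequestException:
            alive = False
        if not alive:
            # Hide rather than delete: about 4% of dead links in the
            # study came back to life, so keep a way to restore them.
            conn.execute("UPDATE links SET visible = 0 WHERE id = ?", (link_id,))
    conn.commit()
    conn.close()
```

Such a sweep can then be run from cron or any scheduler at whatever interval suits the resource.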
2. Save a copy of the resource on your own server and link to that copy.
This approach is preferable to the first when it is important to give users access to the resource for an unlimited time. It also rules out the page content being silently replaced. On the other hand, it raises the problem of copyright compliance for the stored copies.
The method is better suited to linking to a specific page or document, since keeping a copy of an entire site is rather difficult.
An example of a service built on this principle is Peeep.us. The Wayback Machine of the Internet Archive digital library has also been running since 1996, periodically collecting copies of publicly accessible web pages.
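A sketch of the save-a-copy idea, reduced to its simplest form. The storage directory and the URL scheme for serving copies are invented for the example, and only the HTML itself is saved (no images or stylesheets), which is exactly the whole-site difficulty mentioned above.

```python
import hashlib
import pathlib

import requests

COPIES = pathlib.Path("copies")

def save_copy(url: str) -> str:
    """Download the page and return a local path the site can link to."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    COPIES.mkdir(exist_ok=True)
    # Name the copy by a hash of the URL so repeated saves overwrite it.
    name = hashlib.sha1(url.encode()).hexdigest() + ".html"
    (COPIES / name).write_bytes(resp.content)
    return f"/copies/{name}"
```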
3. Combine methods 1 and 2: link to the original resource, and if it stops working or its content changes, switch the link to the saved copy.
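A sketch of that fallback logic. The stored checksum and the saved-copy URL are hypothetical fields, and an exact checksum is brittle for pages with any dynamic content, so a real implementation would need a fuzzier notion of "changed".

```python
import hashlib

import requests

def resolve(url: str, saved_copy_url: str, original_sha1: str) -> str:
    """Return the original URL if it is alive and unchanged, else the copy."""
    try:
        resp = requests.get(url, timeout=10)
        if resp.status_code < 400 and hashlib.sha1(resp.content).hexdigest() == original_sha1:
            return url
    except requests.RequestException:
        pass
    return saved_copy_url
```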
4. URN, PURL: how they can actually be used in practice is not very clear to me.