Website archiving

Original author: Antoine Beaupré
Recently I did a deep dive into the topic of website archiving, prompted by friends who feared losing control over their work hosted online, whether through poor system administration or hostile takedowns. Threats like these make website archiving an essential tool in any sysadmin's toolbox. As it turns out, some sites are much harder to archive than others. This article walks through the process of archiving traditional websites and shows why that process falls short on the trendy single-page applications that are bloating the modern web.

Simple Website Conversion

The days when websites were written by hand in HTML are long gone. Sites are now dynamic, built on the fly with the latest JavaScript, PHP, or Python frameworks. As a result, they have become more fragile: a database crash, a botched update, or an unpatched vulnerability can wipe out data. In my previous life as a web developer, I had to come to terms with the fact that customers expect websites to work forever. That expectation sits poorly with the web-development ethos of "moving fast and breaking things". Working with the Drupal content management system proved especially difficult in this regard, since major upgrades deliberately break compatibility with third-party modules, implying an expensive upgrade process that clients can rarely afford. The solution was to archive those sites: take a live, dynamic website and turn it into plain HTML files that any web server can serve forever. This process is useful for your own dynamic sites, and also for third-party sites that are outside your control and that you want to preserve.

For simple or static sites, the venerable Wget program does an excellent job, although mirroring an entire site calls for a real incantation:

    $ nice wget --mirror --execute robots=off --no-verbose --convert-links \
                --backup-converted --page-requisites --adjust-extension \
                --base=./ --directory-prefix=./ --span-hosts \
                --domains=www.example.com,example.com http://www.example.com/

This command downloads the content of the web page and also crawls every link within the specified domains. Before unleashing it on your favorite site, consider the possible consequences of such a crawl. The command above deliberately ignores robots.txt rules, as is now common practice among archivists, and downloads the site at full speed. Most crawlers have options to pause between requests and to limit bandwidth, so as not to put excessive load on the target site.
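For instance, a gentler variant of the mirroring command might look like this (example.com is a placeholder; `--wait` pauses between requests, `--random-wait` varies that pause, and `--limit-rate` caps bandwidth):

    $ wget --mirror --convert-links --page-requisites --adjust-extension \
           --wait=1 --random-wait --limit-rate=200k https://example.com/
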

The command also fetches the "page requisites", that is, style sheets (CSS), images, and scripts. The downloaded page content is rewritten so that links point to the local copy. The resulting set of files can be hosted on any web server as a static copy of the original website.

That is, when everything goes well. Anyone who has ever worked with a computer knows that things rarely go according to plan: there are plenty of interesting ways to derail the procedure. For example, it was fashionable for a while to put calendar blocks on websites. A CMS generates them on the fly, which sends crawlers into an infinite loop as they request ever more pages. Crafty archivists can resort to regular expressions (Wget, for instance, has a --reject-regex option) to skip problematic resources. Alternatively, if the site's administration interface is accessible, disable calendars, login forms, comment forms, and other dynamic areas. They will stop working anyway once the site becomes static, so it makes sense to remove that clutter from the original site as well.
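A calendar living under a /calendar/ path, say, could be excluded like this (the path and domain are hypothetical; adjust the expression to whatever URLs are causing the loop):

    $ wget --mirror --convert-links --page-requisites --adjust-extension \
           --reject-regex '/calendar/' https://example.com/
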

The JavaScript nightmare

Unfortunately, some websites are much more than just HTML. On single-page applications, for example, the web browser itself builds the content by executing a small JavaScript program. A simple user agent like Wget will struggle in vain to reconstruct a meaningful static copy of such sites, since it does not support JavaScript at all. In theory, sites should support progressive enhancement, so that content and functionality remain available without JavaScript, but these guidelines are rarely followed, as anyone who uses plugins like NoScript or uMatrix can confirm.

Traditional archiving methods sometimes fail in the dullest way. While trying to back up a local newspaper, I found that WordPress appends query strings (for example, ?ver=1.12.4) to the end of its includes. This confuses content-type detection on the web servers serving the archive, because they rely on the file extension to emit the correct Content-Type header. When such an archive is loaded in a browser, it fails to load the scripts, which breaks dynamic websites.
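One possible workaround is to strip the query strings from the mirrored file names after the fact. Below is a minimal, hypothetical sketch: the archive/ directory and file name are stand-ins for a real Wget mirror, and links inside the saved HTML would need the same rewrite (with sed, for example):

    mkdir -p archive
    touch 'archive/style.css?ver=1.12.4'   # stand-in for a mirrored asset
    # Rename every file containing "?" to just the part before the "?",
    # so the web server can infer Content-Type from the real extension.
    find archive -name '*\?*' -print0 | while IFS= read -r -d '' f; do
        mv -- "$f" "${f%%\?*}"
    done
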

As the browser gradually becomes a virtual machine for running arbitrary code, archiving methods based on pure HTML parsing have to adapt. The solution to these problems is to record (and replay) the HTTP headers delivered by the server during the crawl, and professional archivists use exactly this approach.

Creating and Displaying WARC Files

At the Internet Archive, Brewster Kahle and Mike Burner designed the ARC (ARChive) format in 1996: a way to combine the millions of small files produced during the archiving process. The format was eventually standardized as the WARC (Web ARChive) specification, released as an ISO standard in 2009 and revised in 2017. The standardization effort is led by the International Internet Preservation Consortium (IIPC). According to Wikipedia, it is "an international organization of libraries and other organizations established to coordinate efforts to preserve internet content for the future", whose members include the Library of Congress and the Internet Archive. The latter uses the WARC format in its Java crawler, Heritrix.

A WARC file aggregates multiple resources, such as HTTP headers, file contents, and other metadata, into a single compressed archive. Conveniently, the format is also supported by the Wget crawler via the --warc-file parameter. Unfortunately, browsers cannot display WARC files directly, so a dedicated viewer is required to access the archive, or it has to be converted. The simplest viewer I found is pywb, a Python package. It runs a simple web server with a Wayback-Machine-style interface for browsing the contents of WARC files. The following set of commands serves a WARC file at http://localhost:8080/:

    $ pip install pywb
    $ wb-manager init example
    $ wb-manager add example crawl.warc.gz
    $ wayback

Incidentally, this tool was created by the developers of the Webrecorder service, which uses a browser to capture the dynamic content of pages.
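For completeness, the Wget side of such a workflow might look like this (example.com and the crawl file-name prefix are placeholders; Wget writes crawl.warc.gz alongside the usual mirror):

    $ wget --mirror --page-requisites --warc-file=crawl https://example.com/
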

Unfortunately, pywb cannot load WARC files generated by Wget, because Wget follows an inconsistency in the WARC 1.0 specification that was fixed in version 1.1. Until Wget or pywb resolves the problem, the WARC files created by Wget are not reliable enough, so I personally started looking for alternatives. My attention was caught by a crawler simply named crawl. Here is how it is invoked:

    $ crawl https://example.com/

The program supports a few command-line options, but its defaults are quite workable: it downloads page resources such as CSS and images from other domains (unless the -exclude-related flag is given), but recursion does not go beyond the specified host. By default, ten parallel connections are used; the -c flag changes this. Most importantly, the resulting WARC files load perfectly in pywb.
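A sketch of a more tailored invocation, assuming the flags described above (example.com is a placeholder; -c 5 halves the parallelism and -exclude-related skips cross-domain resources):

    $ crawl -c 5 -exclude-related https://example.com/
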

Future work and alternatives

There are quite a few more resources for working with WARC files. In particular, there is a Wget replacement called Wpull that is specifically designed for archiving websites. It has experimental support for PhantomJS and integration with youtube-dl, which allow it to download more complex JavaScript sites and to capture streaming media, respectively. The program underpins the ArchiveBot archiving tool, developed by a "loose collective of rogue archivists, programmers, writers and loudmouths" at ArchiveTeam in an attempt to "save history before it's lost forever". PhantomJS integration does not seem to be as good as one would like, so ArchiveTeam also uses a raft of other tools to mirror more complex sites. For example, snscrape crawls social network profiles and generates lists of pages to send to ArchiveBot. Another tool is crocoite, which runs Chrome in headless mode to archive JavaScript-heavy sites.

This article would also be incomplete without a mention of the HTTrack "website copier". Like Wget, HTTrack creates local copies of websites but, unfortunately, does not support saving to WARC. Its interactive features may appeal more to novice users unfamiliar with the command line.
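For reference, a basic HTTrack mirror can also be created from the command line (example.com and the output directory are placeholders):

    $ httrack 'https://example.com/' -O ./example-mirror
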

In the same vein, during my research I found a rewrite of Wget called Wget2, which supports multi-threaded operation and is therefore faster. Some Wget features are missing, however, including templates, saving to WARC, and FTP support, while support for RSS, DNS caching, and improved TLS support have been added.

Finally, my personal dream for tools like these would be integration with my existing bookmarking system. I currently keep interesting links in Wallabag, a self-hosted service for saving interesting pages locally, developed as an alternative to the freemium Pocket service (now owned by Mozilla). But by design Wallabag saves only a "readable" version of the article rather than a full copy. In some cases the "readable version" is actually unreadable, and Wallabag sometimes fails at parsing altogether. Other tools, such as bookmark-archiver or reminiscence, instead save a screenshot of the page along with the full HTML, but unfortunately do not support the WARC format, which would provide even more faithful reproduction.

The sad truth of my experience with mirroring and archiving is that data dies. Fortunately, amateur archivists have tools at their disposal to preserve interesting content from the internet. For those who do not want to do it themselves, there is the Internet Archive, as well as the ArchiveTeam group, which is working on a backup of the Internet Archive itself.
