Background: the Internet Archive - its history, mission and subsidiary projects
There are probably few Habr users who have never heard of the Internet Archive, a service that finds and preserves digital data important to all of humanity, whether web pages, books, videos, or other kinds of information.
Who runs the Internet Archive, when did it appear and what is its mission? Read about it in today's "Background" column.
Why do we need the Archive?
It is not just about entertainment. The mission of the organization is universal access to all information. The Internet Archive seeks to combat the monopoly on the provision of information held by both telecommunication companies (Google, Facebook, etc.) and states.
At the same time, the Archive is a law-abiding organization. If US law requires some information to be deleted, the organization complies.
The Internet Archive also serves as a working tool for scientists, intelligence agencies, historians (for example, archeographers) and people in many other fields, not to mention individual users.
When did the Internet Archive appear?
The Archive was created by the American Brewster Kahle, who also founded the company Alexa Internet. Both of his services became extremely popular, and both are flourishing today.
The Internet Archive began archiving information from websites and storing copies of web pages in 1996. The headquarters of this non-profit organization is located in San Francisco, USA.
For the first five years, however, the data was not publicly available: it simply sat on the Archive's servers, and only the service's administrators could view old copies of sites. In 2001 the administration decided to open the stored data to everyone.
At the very beginning the Internet Archive was only a web archive, but the organization later began to save books, audio, moving images and software. Today the Internet Archive also acts as a repository for NASA photos and other images, Open Library texts, and so on.
How is the organization funded?
The Archive lives on voluntary donations from both organizations and individuals. You can also support it in bitcoins, using the wallet 1Archive1n2C579dMsAu3iC6tWzuQJz8dN. Over its lifetime this wallet has, incidentally, received 357.47245492 BTC, which is approximately $2.25 million at the current exchange rate.
How does the Archive work?
Most of the staff work in the book-scanning centers, doing routine but rather time-consuming work. The organization has three data centers in California, USA: one in San Francisco, a second in Redwood City and a third in Richmond. To avoid data loss in the event of a natural disaster or other catastrophe, the Archive has spare capacity in Egypt and Amsterdam.
“Millions of people have spent a lot of time and effort to share what we know in the form of the Internet with others. We want to create a library for this new publishing platform,” said Brewster Kahle, founder of the Internet Archive.
How big is the Archive now?
The Internet Archive has several divisions, and the one that collects information from websites has its own name: the Wayback Machine. At the time of writing, the archive held 339 billion saved web pages. In 2017, the Archive contained 30 petabytes of information: approximately 300 billion web pages, 12 million books, 4 million audio recordings, 3.3 million videos, 1.5 million photographs, and 170 thousand software distributions. In just one year the service has noticeably "gained weight": it now stores 339 billion web pages, 19 million books, 4.5 million video files, 4.7 million audio files, 3.2 million images of various kinds, and 381 thousand software distributions.
How is data storage organized?
Information is stored on hard drives in so-called "data nodes". These are servers, each containing 36 hard drives (plus two disks for the operating system). Data nodes are grouped into arrays of 10 machines each, which form a storage cluster. In 2016 the Archive used 8-terabyte HDDs, and the situation is about the same now. That means one node holds about 288 terabytes of data (36 x 8 TB). Drives of other sizes are also in use: 2, 3 and 4 TB.
In 2016, there were about 20,000 hard drives. The Archive's data centers are equipped with air conditioning to keep the climate constant. One storage cluster of 10 nodes draws about 5 kW of power.
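The figures above can be checked with a few lines of arithmetic. This is only a sketch: the constants come from the text, the function names are made up for illustration, and the calculation assumes every drive in a node is an 8 TB model and ignores replication overhead.

```python
# Figures quoted in the article (2016 numbers).
DRIVES_PER_NODE = 36      # data drives per node, excluding the two OS disks
DRIVE_TB = 8              # assumed: every drive is an 8 TB model
NODES_PER_CLUSTER = 10
CLUSTER_POWER_KW = 5      # approximate power draw of one 10-node cluster


def node_capacity_tb(drives=DRIVES_PER_NODE, drive_tb=DRIVE_TB):
    """Raw capacity of one data node in terabytes."""
    return drives * drive_tb


def cluster_capacity_tb(nodes=NODES_PER_CLUSTER):
    """Raw capacity of one cluster in terabytes."""
    return nodes * node_capacity_tb()


def watts_per_raw_tb():
    """Approximate power cost of one raw terabyte, in watts."""
    return CLUSTER_POWER_KW * 1000 / cluster_capacity_tb()


if __name__ == "__main__":
    print(node_capacity_tb(), "TB per node")
    print(cluster_capacity_tb(), "TB per cluster")
    print(round(watts_per_raw_tb(), 2), "W per raw TB")
```

Because the data is mirrored, the usable capacity is lower than the raw figure this sketch produces.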
The Internet Archive is structured as a virtual "library" divided into sections such as books, films, music, and so on. Each item has a description listed in the catalog, usually the title, the author's name and some additional information. From a technical point of view, the items are structured and stored in Linux directories.
The total amount of data stored by the Archive is 22 PB, and there is currently room for another 22 PB. “Because we are paranoid,” say service representatives.
Look at the screenshot of the contents of an item's directory: it includes a file with a name ending in "_files.xml". This is a catalog with information about all the files in the directory.
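Reading such a catalog file might look roughly like the sketch below. The article does not show the file's contents, so the XML layout here (a `files` root with `file` elements carrying `name` and `source` attributes plus `size` and `format` children) is an assumption made for illustration, as are the sample data and the `list_files` helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real layout of a "_files.xml" may differ.
SAMPLE_FILES_XML = """<files>
  <file name="example_item.pdf" source="original">
    <size>1048576</size>
    <format>Text PDF</format>
  </file>
  <file name="example_item_meta.xml" source="metadata">
    <size>512</size>
    <format>Metadata</format>
  </file>
</files>"""


def list_files(xml_text):
    """Return one dict per <file> entry found in the catalog."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("file"):
        size_text = node.findtext("size")
        entries.append({
            "name": node.get("name"),
            "source": node.get("source"),
            "size": int(size_text) if size_text is not None else None,
            "format": node.findtext("format"),
        })
    return entries


if __name__ == "__main__":
    for entry in list_files(SAMPLE_FILES_XML):
        print(entry["name"], entry["size"], entry["format"])
```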
What will happen to the data if one or more servers fail?
Nothing bad will happen, because the data is duplicated. As soon as a new item appears in the Archive's library, it is immediately replicated and placed on different hard drives on different servers. This "mirroring" of the content helps to cope with problems such as power outages and file system failures.
If a hard drive fails, it is replaced with a new one. Thanks to the mirrored, replicated data layout, the new drive is immediately filled with the data that was on the old, failed HDD.
The Archive has a specialized system that monitors the state of the HDDs. Six or seven failed drives have to be replaced every day.
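The mirroring and drive replacement described above can be illustrated with a toy model. This is not the Archive's actual software: the `MirroredStore` class, its round-robin placement and the two-copy default are all assumptions chosen to keep the sketch short.

```python
class MirroredStore:
    """Toy model: each item is copied to several distinct servers."""

    def __init__(self, servers, copies=2):
        if copies > len(servers):
            raise ValueError("need at least as many servers as copies")
        self.copies = copies
        # server name -> {item_id: data}; one dict stands in for one drive
        self.drives = {name: {} for name in servers}
        self._next = 0  # round-robin cursor (an assumed placement policy)

    def put(self, item_id, data):
        """Store the item on `copies` different servers and return their names."""
        names = list(self.drives)
        chosen = [names[(self._next + i) % len(names)] for i in range(self.copies)]
        self._next = (self._next + 1) % len(names)
        for name in chosen:
            self.drives[name][item_id] = data
        return chosen

    def replace_drive(self, server):
        """Swap in an empty drive and refill it from the surviving copies."""
        lost_items = set(self.drives[server])
        self.drives[server] = {}
        for item_id in lost_items:
            for other, contents in self.drives.items():
                if other != server and item_id in contents:
                    self.drives[server][item_id] = contents[item_id]
                    break
        return len(self.drives[server])  # number of items restored


if __name__ == "__main__":
    store = MirroredStore(["node-a", "node-b", "node-c"])
    store.put("item-1", b"first")
    store.put("item-2", b"second")
    print(store.replace_drive("node-b"), "items restored")
```

With at least two copies on different servers, losing any single drive leaves one copy to restore from, which is the property the article relies on.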
What is the Wayback Machine?
It is just one of the Internet Archive's services, and it specializes in saving web pages. The service has its own "spider", which regularly visits all sites accessible on the network and stores them on dedicated servers. The more popular a website is, the more often the robot copies its contents. If a site administrator does not want the site's information to be copied by the bot, it is enough to add a corresponding ban to the robots.txt file.
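How a crawler interprets such a ban can be checked with Python's standard `urllib.robotparser` module. The sketch below is illustrative only: the article does not name the crawler's user agent, so `ia_archiver` is an assumption here, and `example.com` is a placeholder site.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent token; the article does not say which one the bot uses.
ARCHIVE_BOT = "ia_archiver"

# A robots.txt that bans only the archive crawler from the whole site.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/index.html"
    print(is_allowed(ROBOTS_TXT, ARCHIVE_BOT, page))      # banned crawler
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))   # no rule applies
```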
Popular resources are copied frequently, almost daily. The Wayback Machine even indexes social networks, including Twitter and Facebook.
In 2017, the Archive launched an updated Wayback Machine, promising more convenient access to saved web pages. The service was, if not written from scratch, then heavily redesigned. It now supports a number of file formats that were simply not saved before. In the same year, the organization stated that its servers save about 1 billion web pages every week.
This is what Twitter looked like in 2007
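To look up an old copy of a page like this, one needs the address of a snapshot. The article does not describe the address format; the sketch below assumes the commonly seen pattern of a 14-digit timestamp followed by the original URL under `web.archive.org/web/`, and the `wayback_url` helper is hypothetical.

```python
from datetime import datetime

# Assumed URL prefix of Wayback Machine snapshots.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, when):
    """Build a snapshot address for `original_url` near the moment `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # e.g. 20070101000000
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(wayback_url("twitter.com", datetime(2007, 1, 1)))
```

The service is expected to redirect such a request to the capture closest to the requested moment, so the timestamp does not have to match a real snapshot exactly.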
What else can be found in the Internet Archive's database?
Books. The organization's collection is huge and includes digitized books, both common and very rare editions. Books are stored not only in English but in many other languages as well. The Archive has 33 specialized book-scanning centers, located in five countries around the world.
Every day the centers' employees scan about 1,000 books. The database contains millions of publications, and their digitization is financed both by ordinary people and by various organizations, including libraries and foundations.
Since 2007, the Internet Archive has kept publicly accessible books from Google Book Search in its database. After the launch the collection grew quickly: by 2013 more than 900 thousand books had been saved from the Google service.
One of the Archive's services also provides access to books that are fully open, of which there are already over a million. This service is called Open Library.
Video. The service stores 4.5 million clips, sorted by subject and covering a very wide range of topics. The Archive's servers hold films, documentaries, sports events, TV shows and many other materials.
In 2015, the Archive launched a large-scale project to digitize video tapes. At first it covered about 40 thousand cassettes from the archive of Marion Stokes, a woman who had recorded news broadcasts on tape for many decades. Later, other videotapes were added, sent to the Archive by people who support the idea of digitizing data important to humanity.
Audio. As with video, the Archive stores audio files, which are also sorted by subject. Last year the Archive began a new project: digitizing shellac records, the oldest audio recording format. The sound was preserved on discs made of shellac, a natural resin secreted by female lac insects. In total, the Great 78 Project archive contains several hundred thousand records.
Software. Of course, it is simply impossible to store all the software ever created by humanity, even for the Archive. The servers hold vintage software, for example programs for the Macintosh and DOS. In 2016, Archive staff published 1,500+ programs for Windows 3.1 that can be run directly in the browser. In 2017, the Internet Archive released an archive of software for the first Macintosh.
Games. Yes, the Archive provides access to a huge number of games, some of which can be played in a browser-based emulator. The games are very diverse and include titles from portable analog-digital consoles. There are MS-DOS games and console games for Atari and ColecoVision.
The organization first posted an archive of old games back in 2013. These were 30-40-year-old titles that could be played directly in the browser: games for the Atari 2600 (1977), Atari 7800 (1986), ColecoVision (1982), Philips Videopac G7000 (1978) and Astrocade (1983) consoles. Most interestingly, the Internet Archive has arranged things so that the games can be played entirely legally. The collection now has more than 3,400 games and continues to grow.