
Search technologies, or what's the catch in writing your own search engine
Once upon a time, an idea came to my mind: write my own search engine. It was a very long time ago, back when I was still at university; I knew little about the technology of developing large projects, but I was fluent in a couple of dozen programming languages and protocols, and I had plenty of websites of my own at the time.
Well, I do have a craving for monstrous projects, yes...
Back then, little was known about how search engines work, and articles, even in English, were scarce. Some friends who knew about my research at the time, drawing on the documents and ideas we dug up (including those born in our arguments), now teach good courses and invent new search technologies; in short, this topic spawned quite a bit of interesting work. That work fed, among other things, into new developments at various large companies, including Google, though I personally have no direct connection to that.
At the moment I have my own search engine, built from scratch end to end, with a learning component and many nuances: calculating PageRank (PR), collecting topic statistics, training the ranking function, and know-how like stripping irrelevant page content such as menus and ads. Indexing speed is roughly half a million pages per day. All of this runs on my two home servers, and at the moment I'm working on scaling the system out to about five spare servers I have access to.
Here, for the first time in public, I will describe what I built myself. I think many will be curious how Yandex, Google, and almost every search engine I know of work from the inside.
Building such systems poses many problems that are nearly impossible to solve in the general case, but with a few tricks, clever ideas, and a good understanding of how your computer's hardware works, you can simplify them dramatically. One example is recomputing PR: once you have several tens of millions of pages, it no longer fits in even the largest RAM, especially if, like me, you are greedy for information and want to store much more than a single number per page. Another is storing and updating the index: at minimum a two-dimensional database that maps each word to the list of documents in which it occurs.
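To make that structure concrete, here is a minimal in-memory sketch of such an inverted index in Python. The tokenizer, document ids, and posting format are purely illustrative assumptions; the real on-disk layout is a topic for the later articles.

```python
from collections import defaultdict

def tokenize(text):
    # Illustrative tokenizer: lowercase, split on whitespace.
    return text.lower().split()

# The "two-dimensional" structure: word -> list of (doc_id, [positions]).
index = defaultdict(list)

def add_document(doc_id, text):
    # Collect the positions of every word in this document...
    positions = defaultdict(list)
    for pos, word in enumerate(tokenize(text)):
        positions[word].append(pos)
    # ...then append one posting per word to the global index.
    for word, pos_list in positions.items():
        index[word].append((doc_id, pos_list))

add_document(1, "search engines index the web")
add_document(2, "the web grows every day")
print(index["web"])  # [(1, [4]), (2, [1])]
```

The hard part, of course, is not this toy structure but keeping it on disk, compressed, and cheap to update, which is exactly what the later articles deal with.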
Just think: by one estimate, Google keeps more than 500 billion pages in its index. If each word appeared on a page only once, and storing that fact took just 1 byte (which is impossible, since you need at least a page id, and that is 4 bytes or more), the index would already take 500 GB. In reality a word occurs on a page up to 10 times on average, a single entry rarely takes less than 30-50 bytes, and the whole index balloons by hundreds of times... So how do you store that? And keep it updated?
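A quick back-of-the-envelope check of those numbers; all figures are the rough assumptions from the paragraph above, not measurements.

```python
pages = 500 * 10**9          # ~500 billion pages in the index

# Deliberately impossible floor: one 1-byte entry per page.
floor_bytes = pages * 1
print(floor_bytes / 10**9, "GB")       # 500.0 GB

# More realistic: ~10 occurrences of a word per page,
# 30-50 bytes per entry (take 40 as a middle value).
occurrences = 10
entry_bytes = 40
realistic_bytes = pages * occurrences * entry_bytes
print(realistic_bytes / 10**12, "TB")            # 200.0 TB
print(realistic_bytes // floor_bytes, "x floor")  # 400 x floor
```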
I'll describe how all of this works, step by step: how to compute PR quickly and incrementally, how to store millions and billions of page texts and their addresses and search those addresses quickly, how the different parts of my database are organized, how to incrementally update an index of many hundreds of gigabytes, and, probably, how to build a learning ranking algorithm.
Today the search index alone takes 57 GB and grows by roughly 1 GB every day. The compressed page texts take another 25 GB, and I also store a pile of other useful data whose total volume is hard to even estimate, there is so much of it.
Here is the complete list of articles in this series:
0. Search technologies, or what's the catch in writing your own search engine
1. What does a search engine start with, or a few thoughts about the crawler
2. General words about search engines on the Web
3. The data engine of the search engine
4. On removing insignificant parts of pages when indexing a site
5. Methods for optimizing application performance when working with an RDB
6. A little about designing databases for a search engine
7. AVL trees and the breadth of their application
8. Working with URLs and their storage
9. Building an index for a search engine