Where a search engine begins, or a few thoughts about the crawler

    Continuing the series about building our own search engine.

    A search engine has several major tasks to solve. Let's start with the most basic one: fetching and saving an individual page.
    There are several ways to do this, depending on which processing methods you plan to use later.

    Obviously, you need a queue of pages to download from the web, if only to read them on long winter evenings when you have nothing better to do. I prefer to keep a queue of sites with their main pages, plus a local mini-queue of what I am processing right now. The reason is simple: a list of all the pages I would like to download in just one month can easily exceed the size of my rather large hard drive :), so I store only what is really needed - the sites (about 600,000 of them at the moment), their priorities, and their last load times.
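    To make the two-level scheme concrete, the idea looks roughly like this (a Python sketch purely for illustration; the class name, the SQLite schema and the method names are arbitrary, not a real loader):

```python
import sqlite3
from collections import deque

class CrawlQueues:
    """Two-level queue: a persistent table of sites (with priority and
    last-crawl time) plus a small in-memory queue of pages belonging to
    the site being processed right now."""

    def __init__(self, db_path="sites.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS sites ("
            " host TEXT PRIMARY KEY,"
            " priority INTEGER DEFAULT 0,"
            " last_crawled REAL DEFAULT 0)")
        self.local = deque()            # pages of the current site only

    def add_site(self, host, priority=0):
        # External links end up here; duplicates are silently ignored.
        self.db.execute(
            "INSERT OR IGNORE INTO sites (host, priority) VALUES (?, ?)",
            (host, priority))
        self.db.commit()

    def next_site(self):
        # Highest priority first, least recently crawled first.
        row = self.db.execute(
            "SELECT host FROM sites"
            " ORDER BY priority DESC, last_crawled ASC LIMIT 1").fetchone()
        return row[0] if row else None

    def add_local(self, url):
        # Links that stay within the site currently being processed.
        self.local.append(url)

    def next_local(self):
        return self.local.popleft() if self.local else None
```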


    When the next page is loaded, every link found on it must go either into the local queue, if it stays within the site I am currently processing, or into the main list of sites that I will have to come back to sooner or later.
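    Sorting the extracted links is straightforward: resolve each link against the page URL, then compare hosts. A rough sketch, reusing the CrawlQueues object from the previous snippet (again, the names are arbitrary):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def route_links(page_url, html, queues):
    """Resolve every link on the page and decide where it goes:
    same host -> local page queue, different host -> main site list."""
    current_host = urlparse(page_url).netloc
    parser = LinkExtractor()
    parser.feed(html)
    for link in parser.links:
        absolute = urljoin(page_url, link)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue                      # skip mailto:, javascript:, etc.
        if parsed.netloc == current_host:
            queues.add_local(absolute)
        else:
            queues.add_site(parsed.netloc)
```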

    How many pages should be fetched from one site at a time? Personally, I prefer no more than 100,000, although I periodically lower this limit to just 1,000 pages; there are not many sites with more pages than that anyway.
    Now let's look at the timing in more detail:

    If we fetch one page at a time, all pages sequentially, how many pages will we process in, say, an hour?
    - the page retrieval time consists of:
    · The time spent waiting for the DNS response (which, as practice shows, is not small at all). DNS maps the site name "site.ru" to the IP address of the server it lives on, and this is not a trivial task, considering that sites tend to move, packet routes change, and so on. In short, the DNS server keeps an address table, and every time we knock on it to find out the address - where to go for the page.
    · The time for connecting and sending the request (fast if you have at least an average channel).
    · The time for receiving the actual response - the page itself.
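    It is easy to see where the time actually goes by measuring the DNS lookup separately from the rest of the request. A rough sketch using only the standard library (the function name is arbitrary):

```python
import socket
import time
from urllib.parse import urlparse
from urllib.request import urlopen

def time_fetch(url):
    """Rough timing of the visible phases: DNS lookup, then the full
    request (connect + send + receive)."""
    host = urlparse(url).hostname

    t0 = time.time()
    socket.getaddrinfo(host, 80)          # DNS resolution only
    dns_time = time.time() - t0

    t1 = time.time()
    with urlopen(url, timeout=30) as resp:
        body = resp.read()                # connect, send, receive the page
    fetch_time = time.time() - t1

    return dns_time, fetch_time, len(body)

# Example: dns, fetch, size = time_fetch("http://site.ru/")
```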

    That is why Yandex, according to rumors, once ran into this very first problem: if you fetch really large numbers of pages, the provider's DNS cannot cope. In my experience the delay in resolving an address could reach 10 seconds, especially since the reply still has to travel back and forth over the network, and I am not the provider's only client. Note that when requesting 1,000 pages from one site sequentially, you will hit the provider's DNS 1,000 times.

    With modern hardware it is quite simple to set up a caching DNS server on your local network and put your load on it instead of on the provider - then the provider will also start passing your packets faster. Alternatively, you can go to the trouble of writing a cache inside your page loader, if you are writing at a fairly low level.
    If you use ready-made solutions such as Perl's LWP or HTTP modules, a local caching DNS server is the best option.
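    In Python, for comparison, the in-process variant can be as crude as memoizing getaddrinfo. This is only a sketch of the idea, not production code: it ignores TTLs, so stale addresses never expire, which a real resolver must handle.

```python
import functools
import socket

# Keep a reference to the real resolver before patching it.
_original_getaddrinfo = socket.getaddrinfo

@functools.lru_cache(maxsize=100_000)
def _cached_getaddrinfo(host, port, family=0, type=0, proto=0, flags=0):
    return _original_getaddrinfo(host, port, family, type, proto, flags)

# Everything that resolves names through the socket module
# (urllib, http.client, ...) now hits the in-process cache first.
socket.getaddrinfo = _cached_getaddrinfo
```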

    Now suppose a response takes 1-10 seconds on average - there are fast servers, and there are also very slow ones. Then you receive 6-60 pages per minute, 360-3,600 per hour, and roughly 8,000 to 60,000 per day (deliberately rounding down to allow for all kinds of delays: in reality, requesting one page at a time without a local DNS cache on a 100 Mbit/s channel, you will get about 10,000 pages per day - assuming, of course, the sites are all different and not one very fast one).
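    For the record, the numbers above are just this back-of-envelope calculation:

```python
def pages_per_day(seconds_per_page):
    # Strictly sequential: one request at a time, around the clock.
    return 24 * 3600 // seconds_per_page

print(pages_per_day(10))   # 8640  -> "about 8,000" per day
print(pages_per_day(1))    # 86400 -> rounded down to ~60,000 for real-world delays
```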

    And that is without even counting the time needed to process and save the pages - a frankly miserable result.

    Fine, I said, and started making 128 requests in parallel. Everything flew - a peak of 120 thousand pages per hour - until angry letters about DDoS attacks started arriving from the admins of the servers I was hammering. Well, yes: 5,000 requests in 5 minutes is probably more than any hosting will tolerate.

    The solution was to load 8-16 different sites at the same time, with no more than 2-3 pages in parallel from each. That gave me about 20-30 thousand pages per hour, which suited me. I should add that the numbers grow considerably at night.
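    Schematically, such a polite crawl looks something like this (a Python sketch with thread pools; the constants and function names are arbitrary, and error handling is omitted):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

SITES_IN_FLIGHT = 8        # 8-16 different sites at the same time
REQUESTS_PER_SITE = 2      # no more than 2-3 parallel pages per site

def fetch(url):
    with urlopen(url, timeout=30) as resp:
        return url, resp.read()

def crawl_site(host, urls):
    """Fetch one site's pages through a small private pool, so no single
    host ever sees more than REQUESTS_PER_SITE simultaneous connections."""
    with ThreadPoolExecutor(max_workers=REQUESTS_PER_SITE) as pool:
        return host, list(pool.map(fetch, urls))

def crawl(batches):
    """batches: iterable of (host, [urls]) pairs, e.g. taken from the
    site queue sketched earlier."""
    with ThreadPoolExecutor(max_workers=SITES_IN_FLIGHT) as outer:
        futures = [outer.submit(crawl_site, host, urls) for host, urls in batches]
        for future in futures:
            yield future.result()
```

    The outer pool limits how many sites are in flight at once, the inner one caps simultaneous connections to any single host, so the total load never exceeds SITES_IN_FLIGHT x REQUESTS_PER_SITE requests at a time.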

    The full list of my articles on the search engine will be kept up to date here: http://habrahabr.ru/blogs/search_engines/123671/
