A few general words about web search engines

    Since so many questions have arisen about how a search engine works in general, here is a short introductory article. To make it a little clearer what a search engine is and what it should do, I will describe it in general terms. It will probably not be very interesting to specialists and programmers, so bear with me.

    But, to the point: a search engine, in my humble opinion, should be able to find the most relevant results for a search query. In the case of text search, to which we are all accustomed, the query is a set of words; I personally limited its length to eight words. The answer is a set of links to the pages most relevant to the query. It is advisable to supply each link with an annotation, so that a person knows what to expect and can pick the desired result; this annotation is called a snippet.

    I must say that the search problem in general is unsolved: for any document that ranks highest for, say, the word "work", one can create a modified copy that will have even better relevance from the search engine's point of view, yet be complete nonsense from a person's point of view. It is a matter of price and time, of course. Given the vastness of today's Internet, there are, to put it mildly, a lot of such pages. Different systems fight them in different ways and with varying success; someday artificial intelligence will defeat us all...


    Algorithms that recognize meaning would help here, but I am familiar with only one of them (one that really recognizes meaning rather than counting statistics) and have little idea of its applicability. So the task is solved empirically, that is, by picking manipulations of the pages that separate the wheat from the chaff.

    In the real world, crawling the entire Internet in a second and finding the best results is not yet possible, so a search engine stores a local copy of the piece of the web it has managed to collect and process. To quickly extract, out of a billion pages, only those that contain the needed words, an "index" is built: a database in which each word maps to the list of pages that contain it. Naturally, it also has to store where in the text the word was found, how it was highlighted, and other numeric page metrics, to be used later when sorting.
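
    As a minimal sketch (in Python, with the postings simplified to a page id plus word positions, without the highlight flags and metrics a real engine stores), such an index could look like this:

        from collections import defaultdict

        def build_index(pages):
            """pages: dict mapping page_id -> page text.
            Returns word -> list of (page_id, positions) postings."""
            index = defaultdict(list)
            for page_id, text in pages.items():
                positions = defaultdict(list)
                for pos, word in enumerate(text.lower().split()):
                    positions[word].append(pos)
                for word, pos_list in positions.items():
                    index[word].append((page_id, pos_list))
            return index

        pages = {
            1: "search engines build an index of words",
            2: "an index maps each word to pages",
        }
        index = build_index(pages)
        print(index["index"])  # [(1, [4]), (2, [1])]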

    Say I have 100 million pages. An average word occurs on 1-1.5% of the pages, i.e. about 1 million pages per word (there are words that appear on every second page, but rarer ones are more common). Say 3 million distinct words are found; the rest occur much less often and are mostly typos and numbers. To store one record saying that a given word occurs on a given page: the page id is 4 bytes, the site id is 4 bytes, packed information about where and how the word was highlighted is 16-32 bytes, 3 link-ranking coefficients are 12 bytes, and the remaining metrics are another 12-24 bytes. How big the index will be, I leave for you to estimate:
    3 mln words * 1 mln pages * the size of one record.
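
    For those who do not want to estimate by hand, here is the arithmetic with mid-range record sizes assumed; note that treating every word as average makes this an upper bound, since most of the 3 million words are rarer:

        # Rough index-size estimate from the figures above
        # (mid-range record sizes are an assumption).
        words = 3_000_000              # distinct words worth indexing
        pages_per_word = 1_000_000     # average postings list length
        record = 4 + 4 + 24 + 12 + 18  # page id, site id, packed highlight info,
                                       # link coefficients, other metrics (bytes)
        total = words * pages_per_word * record
        print(total / 2**40)           # ~169 TiB, i.e. hundreds of terabytes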

    To build this index, three mechanisms are needed:

    indexing pages - fetching pages from the web and their initial processing
    computing link metrics of the PageRank type from the primary information (a minimal PageRank sketch follows this list)
    updating the existing index - adding the new information to it and re-sorting it by the computed metrics, PageRank in particular.
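
    Here is that sketch of the classic PageRank power iteration; the damping factor and iteration count are the usual textbook assumptions, not the values any particular engine uses:

        def pagerank(links, damping=0.85, iterations=50):
            """links: dict page -> list of pages it links to
            (every link target must itself be a key).
            Returns page -> rank, computed by power iteration."""
            pages = list(links)
            rank = {p: 1.0 / len(pages) for p in pages}
            for _ in range(iterations):
                new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
                for page, outgoing in links.items():
                    if not outgoing:  # dangling page: spread its weight evenly
                        share = damping * rank[page] / len(pages)
                        for p in pages:
                            new_rank[p] += share
                    else:
                        share = damping * rank[page] / len(outgoing)
                        for target in outgoing:
                            new_rank[target] += share
                rank = new_rank
            return rank

        links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
        print(pagerank(links))  # "c" ends up with the highest rank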

    Additionally, the page texts have to be saved in order to build annotations at search time.

Search process

    Many relevance metrics can be chosen: some depend on the "usefulness" of a result for a particular user, others on the total number of results found, still others on the properties of the pages themselves. For example, some search engines have a certain "standard" that they strive toward.

    For the machine, i.e. the server, to be able to sort the found results by some metric, each page is mapped to a set of numbers. For example: the total number of query words found in the text of the page, or their weight computed from how those words are highlighted in the text, and so on. Another kind of coefficient does not depend on the query at all, for example the number of pages that link to this one: the more there are, the heavier the page in the output. A third kind depends on the query itself: how rare its words are, and which of them are so common that they can be skipped.
    From this large set of coefficients the search must produce a single number per page, its relevance, and sort all the results by it.
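
    How exactly the coefficients collapse into one number is each engine's know-how; the simplest sketch, assuming a plain weighted sum with made-up coefficient names and weights, looks like this:

        def relevance(coeffs, weights):
            """Fold a page's coefficient vector into one relevance number."""
            return sum(weights[name] * value for name, value in coeffs.items())

        # Hypothetical coefficients; a real engine has hundreds of them.
        weights = {"words_found": 1.0, "word_weight": 2.5, "inbound_links": 0.8}
        page_coeffs = {"words_found": 3, "word_weight": 1.2, "inbound_links": 40}
        print(relevance(page_coeffs, weights))  # 3.0 + 3.0 + 32.0 = 38.0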

    Once the index is built, you can search in it (a combined sketch follows this list):

    break the query into words, pull the pieces of the index corresponding to each word, and intersect them or do something else depending on the chosen policy,
    compute the coefficients for each page - their number, if desired, can run well past a thousand,
    build the relevance metric from the coefficients, sort, and select the best results,
    build the annotations - snippets - and display the results.
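
    Reusing the build_index sketch from above, the query-time pipeline could look roughly like this; the single occurrence-count coefficient here stands in for the hundreds a real engine computes, and intersection is taken as the merge policy:

        def search(index, query, max_words=8):
            """Intersect the posting lists of the query words and rank pages."""
            words = query.lower().split()[:max_words]
            page_sets = [set(pid for pid, _ in index.get(w, [])) for w in words]
            if not page_sets:
                return []
            candidates = set.intersection(*page_sets)
            # Toy coefficient: total occurrences of the query words on the page.
            scored = []
            for pid in candidates:
                hits = sum(len(pos) for w in words
                           for p, pos in index.get(w, []) if p == pid)
                scored.append((hits, pid))
            scored.sort(reverse=True)
            return [pid for _, pid in scored]

        print(search(index, "index word"))  # pages with both words, best first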

    The full content and the list of my articles on the search engine will be kept updated here: http://habrahabr.ru/blogs/search_engines/123671/
