About removing unimportant parts of pages when indexing a site

The question of separating the necessary and useful content from the rest of the decorative frills quite often arises for those who collect one kind of information or another on the Web.

I think there is no particular reason to dwell on the algorithm for parsing HTML into a tree, especially since writing such parsers, in a generalized form, is taught as coursework in the 3rd or 4th year of university. The usual stack, a few tricks for skipping attributes (except those that will be needed later), and a tree as the output of the parse. The text is broken into words right during parsing, and the words are sent to a separate list where, in addition to general information, all positions of the word in the document are remembered. Naturally, the words in that list are already in the first normal form; I have already written about morphology, so below, after a small parser sketch, I simply copy from the previous article.
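
A minimal sketch of such a parser in Python; the choice of kept attributes and the word-splitting regex here are simplifying assumptions for illustration, not the original code:

```python
from html.parser import HTMLParser
import re

class Node:
    """A tree vertex: a tag, the attributes we kept, and children
    (child Nodes or plain text fragments)."""
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = attrs or {}
        self.children = []

class IndexingParser(HTMLParser):
    KEEP_ATTRS = {"id", "class"}      # assumption: the attributes "needed later"

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]      # the "usual stack"
        self.positions = {}           # word -> list of its positions in the document
        self.pos = 0

    def handle_starttag(self, tag, attrs):
        node = Node(tag, {k: v for k, v in attrs if k in self.KEEP_ATTRS})
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # pop back to the matching opening tag (tolerates unclosed tags)
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # words are split off right during parsing and sent to the position list
        for word in re.findall(r"\w+", data.lower()):
            self.positions.setdefault(word, []).append(self.pos)
            self.pos += 1
        self.stack[-1].children.append(data)
```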


First, based on Zaliznyak's morphological dictionary, we select the longest stem, cut off the ending, and substitute the first dictionary form. The whole process is packed into a tree for fast lookup; the terminal leaves contain the possible endings. We walk along the word, descending the tree letter by letter, until we reach the deepest reachable leaf, and there, based on the endings, we substitute the normalized form.
If a normal form was not found, we fall back to stemming: from the texts of books downloaded from lib.ru I built a table of ending frequencies; we look for the most common of the suitable endings (also with a tree) and replace it to obtain a normal form. Stemming works well for words that were not in the language 5-10 years ago: it easily reduces "crawlers" to "crawler".
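
A rough sketch of this two-step normalization; the dictionary and ending tables below are tiny illustrative stand-ins, not Zaliznyak's dictionary or the real frequency table, and a flat dict replaces the tree:

```python
# Stand-in for the dictionary: known word forms -> their normal (dictionary)
# form. In the real system this is a tree over endings; a dict shows the flow.
DICTIONARY = {"pages": "page", "indexed": "index", "trees": "tree"}

# Fallback stemming table: frequent endings (collected from a large corpus)
# and what to replace them with; the longest matching ending wins.
STEM_ENDINGS = {"ers": "er", "ings": "ing", "s": ""}

def normalize(word):
    word = word.lower()
    # 1) dictionary lookup gives the normal form directly
    if word in DICTIONARY:
        return DICTIONARY[word]
    # 2) otherwise stem by the most specific (longest) known ending
    for n in range(len(word) - 1, 0, -1):
        ending = word[-n:]
        if ending in STEM_ENDINGS:
            return word[:-n] + STEM_ENDINGS[ending]
    return word

print(normalize("crawlers"))   # not in the dictionary -> stemmed to "crawler"
```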


After long experiments with parsing HTML, I noticed that identical blocks of HTML obviously produce identical subtrees. Roughly speaking, if you take two pages, build two trees, and XOR them, only the needed content remains. Or, to put it even more simply: intersecting many of these trees within one site gives a probabilistic model: the more often a block occurs, the lower its significance. Anything found on more than 20-30% of the pages I throw away; it makes no sense to waste time on duplicate content.
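
A toy illustration of the "XOR between two trees" observation, using plain strings in place of real subtree fingerprints (the next sketch shows how an actual fingerprint can be computed):

```python
# Two pages share the template blocks but differ in the article block.
page_a = {"header-menu", "footer", "sidebar-links", "article: how indexing works"}
page_b = {"header-menu", "footer", "sidebar-links", "article: morphology notes"}

unique_content = page_a ^ page_b   # symmetric difference, i.e. XOR of the two sets
print(unique_content)
# {'article: how indexing works', 'article: morphology notes'}
```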

The obvious solution suggested itself: learn to compute a CRC of each subtree, and then it is easy to count the number of repetitions of every subtree. On a second parse it is then easy to zero out the vertices of any subtree that turned out to be too common, and the page text can always be reassembled from the remaining tree (although in practice that is hardly needed anywhere).
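
A sketch of that scheme, reusing the Node class from the parser sketch above; zlib.crc32 stands in for whatever checksum is actually used, and the 0.25 threshold is just one value from the 20-30% range mentioned earlier:

```python
import zlib

def subtree_crc(node):
    """Checksum of a subtree: the tag plus the checksums of all its children."""
    parts = [node.tag]
    for child in node.children:
        if isinstance(child, str):
            parts.append(child.strip())
        else:
            parts.append(str(subtree_crc(child)))
    return zlib.crc32("|".join(parts).encode("utf-8"))

def collect_stats(root, counter):
    """First pass: count in how many pages each subtree checksum occurs."""
    seen = set()
    def walk(node):
        seen.add(subtree_crc(node))
        for child in node.children:
            if not isinstance(child, str):
                walk(child)
    walk(root)
    for crc in seen:
        counter[crc] = counter.get(crc, 0) + 1

def prune(node, counter, total_pages, threshold=0.25):
    """Second pass: drop child subtrees seen on more than ~20-30% of the pages."""
    node.children = [
        child for child in node.children
        if isinstance(child, str)
        or counter.get(subtree_crc(child), 0) / total_pages <= threshold
    ]
    for child in node.children:
        if not isinstance(child, str):
            prune(child, counter, total_pages, threshold)
```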

So, in two passes over all the pages of the site (first we collect statistics, then we index), the problem of isolating the template is easily solved. In addition, we gain many advantages: boilerplate constructs and other meaningless blocks are the first to be thrown out.
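
The two passes might be tied together roughly like this, reusing the sketches above; fetch_page() and index_page() are hypothetical placeholders for downloading a page and feeding the pruned tree to the indexer:

```python
def index_site(urls):
    counter, trees = {}, {}

    # pass 1: parse every page and collect subtree statistics for the site
    for url in urls:
        parser = IndexingParser()
        parser.feed(fetch_page(url))        # fetch_page() is assumed, not shown
        trees[url] = parser.root
        collect_stats(parser.root, counter)

    # pass 2: drop the over-common subtrees, then index what is left
    for url, root in trees.items():
        prune(root, counter, total_pages=len(trees), threshold=0.25)
        index_page(url, root)               # index_page() is assumed, not shown
```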

The full content and list of my articles will be updated here: http://habrahabr.ru/blogs/search_engines/123671/
