DIY readability

    Since we have not yet learned how to beat the Great Chinese Roskomnadzor (I had wanted to write about a tool for bypassing Internet blocks), but I still want to share something odd from my work, I will talk about reimplementing a Readability-like algorithm using Node.js and a paper from the Beijing Institute of Technology.

    What is it all about

    Readability is a radical extension of AdBlock's idea of removing unneeded elements from web pages. Where AdBlock tears down only the things most obviously useless to the user (mainly advertising), Readability also strips scripts, styles, navigation and everything else that gets in the way. Pages like this used to be called the "print version", although here the text is really meant for reading (hence the name Readability).

    Lyrical digression about parsers

    The main characteristic of a parser for websites, or for any other loosely structured format, is how much knowledge it has about the particular ways the format is used in the wild.

    One degenerate case, possessing all the knowledge, is a parser for a single site. That is, if we want to scrape articles from Habrahabr, say, to print them at night on an inkjet printer and sacrifice them to Satan, we can look at the existing markup and easily determine that the post title lives in h1.title.

    A program written this way will hardly ever be wrong; but for every site other than Habrahabr, you will have to write a new program.

    The opposite degenerate case: the parser knows nothing at all about the format of the data it received. An example of such a program is strings (it exists on most non-toy operating systems).

    If you apply strings to some unreadable file, you get a list of everything inside it that looks like text. For example, the command
    strings `which ls`
    prints a bunch of formatting strings from inside the ls binary, plus the usage help:

    %e %b %T %Y 
    %e %b %R 
    usage: ls [-ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1] [file ...]

    The less knowledge, the more universal the parser.

    What is already there

    The sources of the first version of Readability are published, and they are a terrifying tangle of regular expressions. That in itself is not so bad, but the special-casing is just awful. I would like an algorithm with far less knowledge about popular sites on the Internet (see the lyrical digression above).

    The current version of Readability is closed-source and loaded with features of varying usefulness. There is an API.

    There is Apple's fork of the first version of Readability (the Reader feature in Safari). The source is not exactly open, but you can peek at it; there are even more regular expressions and special cases (for example, there is a variable named isWordPressSite).

    The problems with the original script are that it is hard to modify and its heuristics are ad hoc. It mostly works, but needs non-trivial manual tweaking. The licensing of the Apple version is also unclear.

    What to write

    A site parser with minimal knowledge of the markup. The input is a single page of a site, or a fragment of a page. The output is a textual representation of the input.

    An important criterion is universality: the program should work both on the client and on the server. So we do not tie ourselves to any existing DOM implementation, but build our own data structure (it is also faster than a full-fledged DOM, because we need only a tiny fraction of its data).
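A minimal sketch of what such a DOM-free structure might look like. All names here are mine, not taken from the original code; the real structure surely carries more fields.

```javascript
// A minimal tree node, independent of any DOM implementation.
// Only what the algorithm needs: tag name, children, and raw text.
function Node(name) {
  this.name = name;      // tag name, e.g. 'div'
  this.children = [];    // child Nodes
  this.text = '';        // text directly inside this node
}

// Total number of characters inside the node, counted recursively.
Node.prototype.chars = function () {
  return this.children.reduce(
    function (sum, child) { return sum + child.chars(); },
    this.text.length
  );
};

// Example: <div>hello<span>world</span></div>
var div = new Node('div');
div.text = 'hello';
var span = new Node('span');
span.text = 'world';
div.children.push(span);
```

Because the structure is just plain objects, the same code runs unchanged under Node.js and in the browser.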

    For the same reason, the program cannot download pages from the Internet by itself, store results on disk, show a user interface, or cross-stitch.

    The Life and Adventures of the Algorithm

    A search engine turned up several papers on algorithmizing the process described above. I liked this Chinese PDF the most.

    My formulas turned out slightly different, so I will briefly describe my version of the Chinese algorithm.

    For each tag in the document:
    1. Compute its score.

      Here chars is the amount of text (in characters) inside the tag, hyperchars is the amount of text inside links, and tags is the number of nested tags (all three metrics are recursive).
    2. Compute the sum of scores.
      The sum of the scores of first-generation children (i.e., not recursive).
    3. Find the tag with the maximum sum.
      With high probability, this is the container with the main text. Or the longest comment. Either way, there are letters inside, and that's nice.
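The three steps above can be sketched as follows. The exact score formula is not reproduced in the text, so this sketch uses a plausible text-density stand-in, (chars − hyperchars) / (tags + 1); treat the formula, names, and tree shape as my assumptions, not the article's code.

```javascript
// Recursively collect the three metrics described in step 1.
// Nodes are plain objects: { name, text, children }.
function metrics(node) {
  var m = { chars: node.text.length, hyperchars: 0, tags: 0 };
  if (node.name === 'a') m.hyperchars += node.text.length;
  node.children.forEach(function (child) {
    var c = metrics(child);
    m.chars += c.chars;
    // everything under a link counts as link text
    m.hyperchars += (node.name === 'a') ? c.chars : c.hyperchars;
    m.tags += 1 + c.tags;
  });
  return m;
}

// Step 1: per-tag score (stand-in formula, see lead-in).
function score(node) {
  var m = metrics(node);
  return (m.chars - m.hyperchars) / (m.tags + 1);
}

// Step 2: sum of scores of first-generation children only.
function sumOfChildScores(node) {
  return node.children.reduce(function (s, c) { return s + score(c); }, 0);
}

// Step 3: the tag whose children's scores sum highest.
function bestContainer(root) {
  var best = root, bestSum = sumOfChildScores(root);
  (function walk(node) {
    var s = sumOfChildScores(node);
    if (s > bestSum) { best = node; bestSum = s; }
    node.children.forEach(walk);
  })(root);
  return best;
}

// Tiny example: a nav full of links vs. an article with two paragraphs.
var p1 = { name: 'p', text: 'Lots of readable article text goes here.', children: [] };
var p2 = { name: 'p', text: 'Even more readable article text in here.', children: [] };
var article = { name: 'div', text: '', children: [p1, p2] };
var nav = { name: 'ul', text: '', children: [{ name: 'a', text: 'home', children: [] }] };
var body = { name: 'body', text: '', children: [nav, article] };
```

On this toy tree, the link-only nav scores near zero while the paragraph container wins, which is exactly the behavior the algorithm relies on.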

    Plenty of room for further work

    Further optimization. I will describe a few cases, but overall this is the most interesting topic; feel free to discuss it in the comments.

    Garbage inside the main text. Hapless bloggers love to put countless social buttons, Twitter widgets and other clutter right into the body of the post. For such buttons the score (see above) tends to zero, so they can be removed on that basis.

    Just in case, I also check that removing the garbage actually increased the parent's score; if it did not (or grew insignificantly), I leave the element alone.
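A hedged sketch of that check: tentatively remove a child and keep the removal only if the parent's score grows noticeably. The 10% threshold, the toy score, and all names are my guesses, not numbers from the article; `score` stands for whatever per-tag score the algorithm uses.

```javascript
// Try dropping each child; undo the removal unless the parent's
// score improves by more than 10% (threshold is an assumption).
function pruneGarbage(parent, score) {
  parent.children.slice().forEach(function (child) {
    var before = score(parent);
    var withChild = parent.children;
    parent.children = withChild.filter(function (c) { return c !== child; });
    if (score(parent) <= before * 1.1) {
      parent.children = withChild; // removal did not help enough: undo it
    }
  });
  return parent;
}

// Toy score for the example: plain characters minus link characters.
function toyScore(node) {
  var own = node.name === 'a' ? -node.text.length : node.text.length;
  return node.children.reduce(function (s, c) { return s + toyScore(c); }, own);
}

var share = { name: 'a', text: 'Tweet this! Share! Like!', children: [] };
var para  = { name: 'p', text: 'The actual article text, long and useful.', children: [] };
var post  = { name: 'div', text: '', children: [share, para] };
pruneGarbage(post, toyScore);
```

Here the share link drags the parent's score down, so removing it helps and the removal sticks, while removing the paragraph would hurt and is undone.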

    HTML knowledge. The algorithm itself uses no knowledge about document structure; such knowledge can now be layered on top to improve (or speed up) the program. That is, certain tags can be pessimized in advance.
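As one example of such added knowledge, tags that never contain readable text could simply be dropped before scoring. The tag list below is an obvious guess of mine, not taken from the article.

```javascript
// Tags assumed to never hold readable text (list is illustrative).
var NEVER_READABLE = { script: true, style: true, noscript: true, iframe: true };

// Remove hopeless tags from the tree before the scoring pass.
function stripHopeless(node) {
  node.children = node.children.filter(function (c) {
    return !NEVER_READABLE[c.name];
  });
  node.children.forEach(stripHopeless);
  return node;
}

// Example: a page with a script next to a paragraph.
var page = { name: 'body', text: '', children: [
  { name: 'script', text: 'var x = 1;', children: [] },
  { name: 'p', text: 'hello', children: [] }
] };
stripHopeless(page);
```

Besides improving quality, this shrinks the tree before the (recursive) metric computation, which is where the speed-up mentioned above would come from.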
