printf April 26, 2014 at 21:07

DIY readability

Since we have not yet learned how to win the ~~Great Chinese~~ Roskomnadzor, a thing for bypassing locks on the Internet , but I still want to tell something strange about my work, I will talk about the reimplementation of an algorithm similar to Readability using Node.js and the Beijing Institute of Technology .

What is it all about

Readability is a radical continuation of AdBlock’s idea of removing unnecessary elements from websites. Where AdBlock tries to tear down only the most useless things for the user (mainly advertising), Readability also removes scripts, styles, navigation and everything else that is unnecessary. Previously, this type of page was called the “print version”, although in fact the text is intended for reading (hence the name Readability - “Readability”).

Lyrical digression about parsers

The main characteristic of the site parser, or other loosely structured formats, is the amount of knowledge about particular cases of using the format in the wild.

A degenerate case of possessing all the knowledge is a parser of a single site. Those. if we want to steal articles from Habrahabr, for example, to print them at night on an inkjet printer and sacrifice to Satan - we can look at the existing layout and easily determine what the title of the post is h1.title.

A program written in this way will hardly be mistaken; for every site different from Habrahabr, you will have to write a new program.

A degenerate ideal case: the parser does not know at all in what format it received the data. An example of such a program isstrings(exists on most non-game operating systems).

If applied stringsto some non-readable file, you can get a list of everything that looks like text inside this file. For example, the command

strings `which ls`

print a bunch of lines for formatting inside the binary ls, and help.

%e %b %T %Y 
%e %b %R 
usage: ls [-ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1] [file ...]

The less knowledge, the more universal the parser.

What is already there

Sources of the first version of Readability are published and are a chilling ball of regular expressions. This in itself is not bad, but special cases are just awful. I would like an algorithm that has much less knowledge about popular sites on the Internet (see above, “lyrical digression”).

The current version of Readability is closed and hung with buns of diverse relevance. There is an API .

There is a fork of Apple's first version of Readability (Reader feature in Safari). The source code is not very open, but you can look at it, there are even more regular expressions and special cases (for example, there is such a variable - isWordPressSite).

The problems of the original script are the complexity of the modification, the arcade heuristic. It mainly works, but requires a non-trivial file refinement. The Apple version is also licensed unclear.

What to write

A site parser with minimal markup knowledge. Input data - one page of a site, or a fragment of a page. The result is a textual representation of the input.

An important criterion is universality: the program will work on both the client and the server. Therefore, we do not get attached to existing DOM implementations, but build our own data structure (it also works faster than a full-fledged DOM, because we need data from a gulkin, say, a nose).

For the same reason, the program will not be able to independently download pages from the Internet, store the results on disk, have a user interface, or cross-stitch.

Algorithm Life and Adventures

The search engine found several articles on the algorithmization of the process described above. Most of all I liked these Chinese ^PDFs .

My formulas turned out to be slightly different, so I will tell you briefly about my version of the Chinese algorithm.

For each tag in the document:

We consider the estimate.

Here chars is the amount of text (characters) inside the tag, hyperchars is the amount of text in the links, tags is the number of nested tags (all metrics are recursive).
We consider the sum of the grades.
The sum of grades of first-generation children (i.e., not recursively).
Found a tag with the maximum amount.
This is a high probability container for the main text. Or the longest comment. In any case, there is a letter inside, that's cool.

Plenty of room for labor

Further optimization. I will describe several cases, but in general this is the most interesting topic, you can chat in the comments.

Garbage in the main text. All sorrow bloggers like to put numerous buttons of their social contacts, twitter, etc. directly into the body of the post. unnecessary things. For such buttons, the score (score, see above) tends to zero, according to this principle they can be demolished.

Just in case, I also check that after the garbage has been removed, the parent’s score has increased, if not (or has grown insignificantly), then I won’t delete it.

HTML The algorithm does not use knowledge about the structure of the document, they can now be added to improve (or speed up) the program. That is, let's pessimize in advance

and

, or add annotations to invisible elements (in the browser) and skip them altogether - there really is room for activity, I haven’t implemented anything yet.

Text signals. If the text contains commas, periods, and other punctuation marks, it is most likely connected text (as opposed to navigation, for example). Such a heuristic was in Readability.

Here it is necessary to pay attention to the fact that punctuation marks in different languages are still different, and commas in Chinese ("，" Unicode U + FF0C) differ from the character "," (ASCII 44).

What happened, how to use

The result I called unpretentiously readability2, laid out in npm .

Briefly about the tests

Testing such a thing is necessary in order to avoid regressions (and generally automatically testing programs is cool).

Here a certain problem arises: the readability test is a saved page of a completely extraneous site, plus a “reference” text torn from it by hands. I don’t really understand how to distribute this in such a way that the legal traders do not try to destroy me for illegal copying of sites and texts.

If someone knows the correct answer, please write in the comments. Now the tests live in a closed repository, but they really want to be free.

Sources without tests: GitHib

Usage example

For illustrative purposes, I wrote a demo.html page in which there are two lines of text among all navigation.

The text is called “Title”. Content part:

The whole microdistrict was quietly observing God's miracle:
Pop Ignatius tilibonkali inflicted his church.

(Public domain)

By the way, I renounce property copyrights to this literary work, transferring it thus to the public domain (public domain). Distribute and use the full text can now all without restrictions.

This should be the result of running the program. If the result is not so, then everything is broken.

And here is the source code of the demo.js program with comments. Used parser sax authorship by Isaac Z. The Schlueter .

Documentation, aka API

Constructor:

var reader = new Readability

Doesn’t accept anything.

SAX interface:

reader.onopentag(tagName)       // <тег>
reader.onattribute(name, value) // атрибут=значение
reader.ontext(text)             // текст
reader.onclosetag(tagName)      //

Here all the arguments are strings.

To get the result:

var res = reader.compute(),
    text = reader.clean(res)

At the output: res.heading - the title of the article and text - the main text without formatting.

Instead, reader.cleanyou can write another formatter, then you will get not text, but simple markup, for example.

Conclusion

The program works. She is still a little scared to use, because there are only about 20 tests, but I'm working on it. Updates will be. Patches are welcome, except for any stupid ones. GitHub . MIT license, I forgot to upload it to the repository.

Important note: the picture on the left has nothing to do with the post. Therefore, if it does not load, and you do not see any picture on the left, do not be discouraged.

Better write in the comments what you think about all this.

Tags: