Parser in Nimbus Note, or how we solved the problem of "clean" HTML

    One of the key features of Nimbus Note is saving and editing notes as HTML documents. These notes are created or edited in a browser or on mobile devices and then sent to the server. And, as professional paranoia dictates, data coming from the user cannot be trusted. It may contain anything: XSS, a document whose layout looks like the dream of an abstract artist, or not even text at all. Therefore, data received from the user needs preprocessing. In this article I will describe some features of our solution to this problem.



    What could be complicated about that? Plug in some HTML purifier before saving and that's it. And indeed, that would have worked, if not for a few circumstances:

    • there can be a lot of text in a single note (several megabytes);
    • a significant number of simultaneous save requests is expected;
    • save requests will presumably come from different parts of the system, written in different languages;
    • additional checks may run after the text is processed and before it is saved;
    • after processing, the appearance of the note must stay as close to the original as possible (ideally, it should not change at all);
    • the page layout must not "suffer" when a saved note is displayed;
    • using an iframe is not an option.


    The first three points clearly call for a solution that runs separately from the main code. The fourth rules out the use of message queues (RabbitMQ, for example), since the caller needs the processed text back synchronously before saving it; or, at the very least, using queues would require non-trivial workarounds.

    And finally, the last three points require deep processing of the markup, taking into account that it is most likely invalid to begin with (stray and/or unclosed tags, attributes, and values). For example, if the width of some element is set to 100500, that value does not qualify as "acceptable" and must be deleted or replaced (depending on the settings).
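    To make the idea concrete, here is a minimal sketch of this kind of attribute cleaning in Python. The names ALLOWED_ATTRS, MAX_WIDTH and clean_attrs are illustrative assumptions, not the actual Nimbus Note code; the real rules live in the parser's configuration.

    # Minimal sketch of attribute cleaning; names and limits are illustrative.
    ALLOWED_ATTRS = {"href", "src", "width", "height", "style", "type"}
    MAX_WIDTH = 10000  # width="100500" does not qualify as "acceptable"

    def clean_attrs(attrs, replace_width=None):
        """Drop unknown attributes and delete or replace invalid widths."""
        cleaned = {}
        for name, value in attrs.items():
            if name not in ALLOWED_ATTRS:
                continue  # stray attribute: delete it
            if name == "width":
                try:
                    if int(value) > MAX_WIDTH:
                        if replace_width is None:
                            continue                # delete, depending on the settings...
                        value = str(replace_width)  # ...or replace
                except ValueError:
                    continue  # non-numeric width is dropped as well
            cleaned[name] = value
        return cleaned

    print(clean_attrs({"width": "100500", "onclick": "evil()", "src": "a.png"}))
    # -> {'src': 'a.png'}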

    All of the above led us to decide to write our own parser/validator (yes, reinventing the wheel). Python was chosen as the language: the main project is written in it, and, of course, aesthetic preferences played their part.

    In order not to write everything from scratch, we decided to make life easier and use some lightweight framework. The choice fell on Tornado, because we already had experience with it.
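    Roughly, such a service boils down to a single Tornado handler that accepts raw HTML in a POST body and returns the cleaned result. The sketch below is an assumption about the shape of the service, not our actual code; sanitize_html() is a placeholder for the real cleaning logic.

    import tornado.ioloop
    import tornado.web

    def sanitize_html(raw):
        # Placeholder: the real parser/validator works here.
        return raw

    class ParseHandler(tornado.web.RequestHandler):
        def post(self):
            raw = self.request.body.decode("utf-8", errors="replace")
            self.set_header("Content-Type", "text/html; charset=utf-8")
            self.write(sanitize_html(raw))

    def make_app():
        return tornado.web.Application([(r"/parse", ParseHandler)])

    if __name__ == "__main__":
        make_app().listen(8888)
        tornado.ioloop.IOLoop.current().start()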

    For scalability, nginx was put in front of the parsers as a load balancer. This structure lets processing capacity grow over a fairly wide range simply by adding parser instances. And a client-side timeout on the response from the parser caps the maximum wait at a value that keeps users inside their comfort zone (no feeling that "everything is hanging").
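    On the calling side this can look like the sketch below: the main application posts a note to the parser pool behind nginx and bounds the wait with request_timeout. The URL and the five-second limit are made-up illustrations, not our real settings.

    from tornado.httpclient import AsyncHTTPClient, HTTPError

    async def parse_note(html_text, timeout_seconds=5.0):
        client = AsyncHTTPClient()
        try:
            response = await client.fetch(
                "http://parser.internal/parse",   # nginx balancing the parser instances
                method="POST",
                body=html_text.encode("utf-8"),
                request_timeout=timeout_seconds,  # cap the user-visible wait
            )
        except HTTPError:
            # Timeout or parser failure: report a retriable error instead of hanging.
            return None
        return response.body.decode("utf-8")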

    At first, lxml was chosen as the HTML parsing engine. A good, fast parser written in C. And everything would have been fine with it, if not for a couple of "surprises".

    Firstly, in the course of this "glorious" work a well-known fact surfaced: the lxml library interprets HTML documents as "broken" XML. This peculiarity, which caused no concern at first, began to produce an ever-growing number of "crutches". For example, lxml firmly believed that an empty element is a self-closing tag and dutifully performed conversions like "<span></span>" => "<span/>".
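    A small illustration of that class of problem: lxml's default (XML) serialization collapses empty elements into self-closing tags, which is not what a browser expects from HTML. This is not necessarily the exact markup we hit, but it shows the "HTML as broken XML" behaviour.

    from lxml import etree, html

    doc = html.fromstring("<div><span></span>text</div>")
    print(etree.tostring(doc))   # b'<div><span/>text</div>'        -- the "broken XML" view
    print(html.tostring(doc))    # b'<div><span></span>text</div>'  -- the HTML serializer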

    However, we could have put up with the "crutches" if it were not for "secondly". During a test run on a copy of real data, the parser consistently crashed with a Segmentation Fault. What caused it remains unknown: the crash reliably occurred after processing roughly five hundred records, regardless of their content (the sample was taken from different places in the table).

    Thus, having collected a fair number of bumps and bruises, we settled on the combination of Beautiful Soup and html5lib, plus the crutches accumulated along the way.
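    In this combination html5lib parses the markup roughly the way a browser would (closing stray tags, fixing nesting), and Beautiful Soup exposes a convenient tree to walk and clean. A minimal sketch, with the attribute cleanup purely illustrative:

    from bs4 import BeautifulSoup

    raw = "<p>unclosed <b>bold <i>and badly nested"
    soup = BeautifulSoup(raw, "html5lib")

    # html5lib builds a full, valid document tree even from broken input.
    print(soup.body.decode_contents())

    # The tree can then be walked to drop forbidden tags and attributes, e.g.:
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if not k.startswith("on")}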

    After this decision it almost seemed that "here it is, happiness". And this happiness lasted exactly until a page from msn.com, processed by the parser, caught my eye. The remarkable features of that page were the active (and often bogus) use of the "type" attribute on "input" tags and its layout authors' love of "position: absolute;". Since the problem was localized, it was relatively easy to solve: adjust the configs, write a bit of code and, of course, add tests covering the newly found sore spots.
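    The fix for that case can be imagined as a couple of targeted rules plus regression tests, as in the sketch below. The helpers and the list of allowed input types are assumptions for illustration, not our production rules; the tests run under pytest.

    from bs4 import BeautifulSoup

    ALLOWED_INPUT_TYPES = {"text", "checkbox", "radio", "hidden", "submit", "button"}

    def clean_fragment(raw):
        soup = BeautifulSoup(raw, "html5lib")
        for tag in soup.find_all("input"):
            if tag.get("type", "text") not in ALLOWED_INPUT_TYPES:
                tag["type"] = "text"  # replace made-up types with a safe default
        for tag in soup.find_all(style=True):
            # drop position declarations so absolutely positioned blocks
            # cannot break the layout of the surrounding page
            rules = [r for r in tag["style"].split(";") if "position" not in r]
            tag["style"] = "; ".join(r.strip() for r in rules if r.strip())
        return soup.body.decode_contents()

    def test_bogus_input_type_is_replaced():
        assert 'type="text"' in clean_fragment('<input type="totally-made-up">')

    def test_position_absolute_is_removed():
        assert "position" not in clean_fragment('<div style="position:absolute">x</div>')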

    Now we are not just abstractly aware that many pages on the web contain invalid HTML; we are waiting for the next "surprise" to arrive. We wait, take preventive measures, and know that one day we will see it, having slipped past all the filters and all the tricks: a page that is the product of an abstract artist's feverish delirium...
