Igorbek January 26, 2010 at 15:36

Proper HTML Serialization in .NET

Hello, everyone!

Those who actively use XSLT to generate HTML (not just XHTML) have probably run into the situation where the output must be not only well-formed XML (XHTML) but also valid HTML for browsers that do not support XHTML, and the two are, in general, not the same thing. Until now we have resorted to "dirty hacks" in XSLT for this.
In this article I will describe a cleaner and more elegant method which, unfortunately, is not often used.

The method is specific to the .NET platform, but similar tools probably exist on other platforms.

Now, one thing at a time.

Introduction

Clearly, the XML information model is expressive enough to describe any HTML document, and such a document can be handled perfectly well with XML tools. The problem is that the textual representation of XML may differ from the representation of the same document in HTML.

The essence of the problem

The critical differences between XML serialization and HTML serialization are quite simple:

the document cannot have an XML declaration;
some elements must have a closing tag, which means that the standard XML serializer is wrong to write a self-closing tag (<div />) for an empty div, because the HTML parser expects a closing tag for the div;
some elements cannot contain entity references, which means that the HTML parser does not process entity references in elements such as script or style.
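
To make these three differences concrete, here is a small sketch of what the standard XmlWriter does with an empty div and a script. The element names and the script text are made up for the illustration; only well-known System.Xml calls are used.

```csharp
using System;
using System.Text;
using System.Xml;

var sb = new StringBuilder();
using (XmlWriter w = XmlWriter.Create(sb))
{
    w.WriteStartDocument();                    // emits the XML declaration
    w.WriteStartElement("body");

    w.WriteStartElement("div");
    w.WriteEndElement();                       // empty element: written as <div />

    w.WriteStartElement("script");
    w.WriteString("if (a < b && c) run();");   // '<' and '&' are escaped as entities
    w.WriteEndElement();

    w.WriteEndElement();
    w.WriteEndDocument();
}

string xml = sb.ToString();
Console.WriteLine(xml);
```

The result is fine as XML, but an HTML parser would treat <div /> as an unclosed div and would hand the script engine the literal text &lt; and &amp;.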

In addition, there are restrictions that depend on the content itself (they have nothing to do with the serializer), but it is important that they are met:

some elements cannot have content, i.e. they must be empty; for example, the link element may not have content, so even a link with no content but with a separate closing tag is an HTML parse error (which the parser will, of course, ignore);
some elements cannot contain child elements or comments; these are elements such as title and textarea;
general constraints on the structure of the document, which we will not consider here but leave to inquiring minds =)

The method itself

The point is that .NET serializes XML through XmlWriter, which takes care of all the work of formatting XML properly. This class is used in almost every operation that needs to write XML. In particular, XSL transformations (XslCompiledTransform.Transform) accept an instance of this class as the destination.
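
As an illustration of XmlWriter being the destination of a transformation, here is a minimal sketch. The stylesheet and the input document are invented for the example; the writer here is the standard one, so the output is still plain XML.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical one-template stylesheet that emits an empty div.
const string stylesheet =
    "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" +
    "<xsl:template match='/'><div class='box'/></xsl:template>" +
    "</xsl:stylesheet>";

var xslt = new XslCompiledTransform();
using (XmlReader xslReader = XmlReader.Create(new StringReader(stylesheet)))
{
    xslt.Load(xslReader);
}

var output = new StringBuilder();
var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
using (XmlReader input = XmlReader.Create(new StringReader("<doc/>")))
using (XmlWriter writer = XmlWriter.Create(output, settings))
{
    // Any XmlWriter subclass can be passed here instead of the standard one.
    xslt.Transform(input, writer);
}

string result = output.ToString();
Console.WriteLine(result);
```

Because Transform only sees the abstract XmlWriter, the serialization rules can be changed by substituting a different writer, without touching the stylesheet.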

So all that is needed is to implement our own XmlWriter that formats our XML according to HTML rules. Introducing: HtmlXmlWriter!

Theory

We take the HTML specification, specifically HTML5 (where would we be without it these days), and see that it distinguishes 5 kinds of elements:

void (empty) elements - area, base, br, col, command, embed, hr, img, input, keygen, link, meta, param, source;
raw text elements - script, style;
RCDATA elements (text only) - textarea, title;
foreign elements - elements from outside HTML, in particular from MathML and SVG; here we will treat every element outside the XHTML namespace as foreign;
normal elements - all other HTML elements.
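
The classification above can be sketched as a small lookup. The names HtmlElementKind, HtmlElements and Classify are hypothetical, and treating an empty namespace as HTML is an assumption made for this sketch (XSLT output often carries no namespace at all).

```csharp
using System;
using System.Collections.Generic;

public enum HtmlElementKind { Void, RawText, RcData, Foreign, Normal }

// Hypothetical helper: maps an element to one of the five kinds listed above.
public static class HtmlElements
{
    public const string XhtmlNamespace = "http://www.w3.org/1999/xhtml";

    private static HashSet<string> Names(params string[] names) =>
        new HashSet<string>(names, StringComparer.OrdinalIgnoreCase);

    // The void list is the one given in the article.
    private static readonly HashSet<string> VoidNames = Names(
        "area", "base", "br", "col", "command", "embed", "hr", "img",
        "input", "keygen", "link", "meta", "param", "source");
    private static readonly HashSet<string> RawTextNames = Names("script", "style");
    private static readonly HashSet<string> RcDataNames = Names("textarea", "title");

    public static HtmlElementKind Classify(string localName, string namespaceUri)
    {
        // Assumption: no namespace or the XHTML namespace means "HTML element";
        // everything else is foreign and is serialized as plain XML.
        if (!string.IsNullOrEmpty(namespaceUri) && namespaceUri != XhtmlNamespace)
            return HtmlElementKind.Foreign;
        if (VoidNames.Contains(localName)) return HtmlElementKind.Void;
        if (RawTextNames.Contains(localName)) return HtmlElementKind.RawText;
        if (RcDataNames.Contains(localName)) return HtmlElementKind.RcData;
        return HtmlElementKind.Normal;
    }
}
```

For example, Classify("br", "") falls into the void group, while an svg element in the SVG namespace is classified as foreign.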

Now our HtmlXmlWriter should make sure that no content is ever added to void elements and that they are always written self-closing (<br />).

Raw text elements may contain only text (no entities or comments), and that text must not contain a sequence that could be interpreted as the closing tag (regardless of case).

RCDATA elements cannot have child elements; they may contain only text, including entity references. Comments do not seem to be allowed in them either.

Foreign elements can contain anything; this is plain XML, with no restrictions.

Normal elements can also contain anything, but they must always have a closing tag.

Implementation

Actually, I will not give the implementation here; it is not complicated, and anyone can write it for themselves. I wrote one for myself, and perhaps, once I have documented it and if people tearfully beg me, I will post it to some code repository. Here I will give only some useful notes (perhaps a little disorganized).

HtmlXmlWriter derives from XmlWriter. It should wrap another XmlWriter instance (passed to the constructor) and by default delegate to that instance's corresponding methods.

HtmlXmlWriter should keep track of which element it is currently in (the name and kind of the last opened element), updating this in its WriteStartElement / WriteEndElement overrides. It should also keep track of whether it is inside an attribute (WriteStartAttribute / WriteEndAttribute).

When closing an element (WriteEndElement / WriteFullEndElement), call the inner writer's WriteEndElement or WriteFullEndElement depending on the kind of element.

The hardest part is raw text elements, because XmlWriter escapes some characters. So for these elements the text output (WriteCharEntity, WriteString, WriteSurrogateCharEntity) has to be replaced with WriteRaw. But we must not forget to check that the text does not contain the closing tag.
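
Putting these notes together, one possible shape of such a class is sketched below. This is not the author's implementation, only an illustration under several assumptions: elements with no namespace or the XHTML namespace count as HTML, RCDATA checks are left out, and the closing-tag check looks at each text chunk separately, so a tag split across two calls would slip through.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Illustrative sketch only: wraps an inner XmlWriter and adjusts a few calls.
public sealed class HtmlXmlWriter : XmlWriter
{
    private enum Kind { Void, RawText, Foreign, Other }
    private const string XhtmlNs = "http://www.w3.org/1999/xhtml";
    private static readonly HashSet<string> VoidNames = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "area", "base", "br", "col", "command", "embed", "hr", "img",
        "input", "keygen", "link", "meta", "param", "source"
    };
    private static readonly HashSet<string> RawNames =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "style" };

    private readonly XmlWriter inner;
    private readonly Stack<KeyValuePair<string, Kind>> open = new Stack<KeyValuePair<string, Kind>>();
    private bool inAttribute;

    public HtmlXmlWriter(XmlWriter inner)
    {
        if (inner == null) throw new ArgumentNullException("inner");
        this.inner = inner;
    }

    private bool In(Kind kind) => !inAttribute && open.Count > 0 && open.Peek().Value == kind;

    private static Kind Classify(string localName, string ns)
    {
        if (!string.IsNullOrEmpty(ns) && ns != XhtmlNs) return Kind.Foreign;
        if (VoidNames.Contains(localName)) return Kind.Void;
        return RawNames.Contains(localName) ? Kind.RawText : Kind.Other;
    }

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        if (In(Kind.Void)) throw new InvalidOperationException("Void elements cannot have children.");
        open.Push(new KeyValuePair<string, Kind>(localName, Classify(localName, ns)));
        inner.WriteStartElement(prefix, localName, ns);
    }

    public override void WriteEndElement() => CloseElement(false);
    public override void WriteFullEndElement() => CloseElement(true);

    private void CloseElement(bool fullRequested)
    {
        Kind kind = open.Pop().Value;
        // Void: always self-closing. Foreign: whatever the caller asked for.
        // Everything else: always an explicit closing tag.
        if (kind == Kind.Void || (kind == Kind.Foreign && !fullRequested)) inner.WriteEndElement();
        else inner.WriteFullEndElement();
    }

    // Text inside script/style bypasses escaping, after a closing-tag check.
    private void WriteText(string text, Action escaped)
    {
        if (In(Kind.Void)) throw new InvalidOperationException("Void elements cannot have content.");
        if (!In(Kind.RawText)) { escaped(); return; }
        string closing = "</" + open.Peek().Key;
        if (text.IndexOf(closing, StringComparison.OrdinalIgnoreCase) >= 0)
            throw new InvalidOperationException("Raw text must not contain " + closing);
        inner.WriteRaw(text);
    }

    public override void WriteString(string text) =>
        WriteText(text ?? string.Empty, () => inner.WriteString(text));
    public override void WriteCharEntity(char ch) =>
        WriteText(ch.ToString(), () => inner.WriteCharEntity(ch));
    public override void WriteSurrogateCharEntity(char lowChar, char highChar) =>
        WriteText(new string(new[] { highChar, lowChar }), () => inner.WriteSurrogateCharEntity(lowChar, highChar));

    public override void WriteStartAttribute(string prefix, string localName, string ns)
    {
        inAttribute = true;
        inner.WriteStartAttribute(prefix, localName, ns);
    }
    public override void WriteEndAttribute() { inAttribute = false; inner.WriteEndAttribute(); }

    // Everything below is plain delegation to the inner writer.
    public override WriteState WriteState => inner.WriteState;
    public override void Flush() => inner.Flush();
    public override void Close() => inner.Close();
    public override string LookupPrefix(string ns) => inner.LookupPrefix(ns);
    public override void WriteStartDocument() => inner.WriteStartDocument();
    public override void WriteStartDocument(bool standalone) => inner.WriteStartDocument(standalone);
    public override void WriteEndDocument() => inner.WriteEndDocument();
    public override void WriteDocType(string name, string pubid, string sysid, string subset) =>
        inner.WriteDocType(name, pubid, sysid, subset);
    public override void WriteCData(string text) => inner.WriteCData(text);
    public override void WriteComment(string text) => inner.WriteComment(text);
    public override void WriteProcessingInstruction(string name, string text) =>
        inner.WriteProcessingInstruction(name, text);
    public override void WriteEntityRef(string name) => inner.WriteEntityRef(name);
    public override void WriteWhitespace(string ws) => inner.WriteWhitespace(ws);
    public override void WriteChars(char[] buffer, int index, int count) => inner.WriteChars(buffer, index, count);
    public override void WriteRaw(char[] buffer, int index, int count) => inner.WriteRaw(buffer, index, count);
    public override void WriteRaw(string data) => inner.WriteRaw(data);
    public override void WriteBase64(byte[] buffer, int index, int count) => inner.WriteBase64(buffer, index, count);
}
```

In this sketch the XML declaration is left to the inner writer, so it would typically be created with XmlWriterSettings.OmitXmlDeclaration set to true. Wrapped that way, an empty div should come out as <div></div>, a br as <br />, and text inside script is written without entity escaping, while attribute values are still escaped as usual.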

Conclusion

With such a class in hand, you can easily pass it to XSL transformations (or anywhere else) and get proper HTML from XHTML, so that even the dumbest HTML parser will understand it.
