Parse HTML in .NET and survive: library analysis and comparison

While working on a hobby project, I needed to parse HTML. A Google search turned up a comment by Athari with a micro-review of current HTML parsers for .NET, for which many thanks to him.

Unfortunately, I found no figures or arguments in favor of one parser or another, which is why I wrote this article.

Today I will test the currently popular libraries for working with HTML, namely AngleSharp, CsQuery, Fizzler, HtmlAgilityPack and, of course, the Regex way, and compare them in terms of speed and ease of use.

TL;DR: The code for all benchmarks can be found on GitHub, along with the test results. The most relevant parser at the moment is AngleSharp: a fast, ~~trendy~~ parser with a convenient API.

Those interested in a detailed review, read on.

Content

Library Description
Benchmark
- Getting addresses from links on a page
- Retrieving data from a table
Findings

Library Description

This section briefly describes each library, its license, and so on.

HtmlAgilityPack

One of the most famous (if not the most famous) HTML parsers in the .NET world. Many articles have been written about it in both Russian and English, for example on Habrahabr.

In short, this is a fast and relatively convenient library for working with HTML (as long as the XPath queries are simple). The repository has not been updated for a long time.
License: MS-PL.

The parser is convenient when the task is typical and well described by an XPath expression. For example, getting all the links from a page takes very little code:

/// <summary>
/// Extract all anchor tags using HtmlAgilityPack
/// </summary>
public IEnumerable<string> HtmlAgilityPack()
{
    HtmlDocument htmlSnippet = new HtmlDocument();
    htmlSnippet.LoadHtml(Html);
    List<string> hrefTags = new List<string>();
    foreach (HtmlNode link in htmlSnippet.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }
    return hrefTags;
}

However, if you want to work with CSS classes, XPath will give you a lot of headaches:

/// <summary>
/// Extract all anchor tags using HtmlAgilityPack
/// </summary>
public IEnumerable<string> HtmlAgilityPack()
{
    HtmlDocument hap = new HtmlDocument();
    hap.LoadHtml(html);
    HtmlNodeCollection nodes = hap
        .DocumentNode
        .SelectNodes("//h3[contains(concat(' ', @class, ' '), ' r ')]/a");
    List<string> hrefTags = new List<string>();
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            hrefTags.Add(node.GetAttributeValue("href", null));
        }
    }
    return hrefTags;
}
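
The `concat(' ', @class, ' ')` trick in the query above is what makes XPath match a whole class name rather than a substring. The following is a minimal sketch of the difference, using only `System.Xml.XPath` from the standard library on a small well-formed fragment. The sample markup and the names `XPathClassDemo`, `Naive` and `TokenAware` are made up for illustration and do not come from the benchmark.

```csharp
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathClassDemo
{
    // Hypothetical, well-formed sample markup (not taken from the benchmark).
    private const string Sample = @"
<div>
  <h3 class='r'><a href='/one'>one</a></h3>
  <h3 class='title r big'><a href='/two'>two</a></h3>
  <h3 class='rating'><a href='/three'>three</a></h3>
</div>";

    // Substring test: also selects class='rating', because it contains the letter r.
    public const string Naive = "//h3[contains(@class, 'r')]/a";

    // Token test: pads the class list with spaces and looks for ' r ',
    // so only a whole class named r is selected.
    public const string TokenAware =
        "//h3[contains(concat(' ', @class, ' '), ' r ')]/a";

    // Returns the href of every element selected by the XPath expression.
    public static string[] Hrefs(string xpath)
    {
        return XDocument.Parse(Sample)
            .XPathSelectElements(xpath)
            .Select(a => (string)a.Attribute("href"))
            .ToArray();
    }
}
```

On this sample, `Hrefs(XPathClassDemo.Naive)` selects all three links, while `Hrefs(XPathClassDemo.TokenAware)` selects only `/one` and `/two`. A CSS selector such as `h3.r > a` expresses the same intent far more compactly, which is the gap Fizzler fills.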

Among the oddities: the API is peculiar, and sometimes incomprehensible and confusing. If nothing is found, it returns null rather than an empty collection. The library has also gone without updates: nobody has committed new code for a long time, and bugs are not being fixed. (Athari mentioned a critical bug, "Incorrect parsing of HTML4 optional end tags", which leads to incorrect processing of tags whose closing tags are optional.)
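
One way to live with the null result is a tiny helper that turns a null sequence into an empty one. This is only a sketch of a common pattern, not something HtmlAgilityPack provides: the names `SequenceExtensions` and `OrEmpty` are hypothetical, and the HtmlAgilityPack usage is shown in a comment so that the block compiles without the package.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SequenceExtensions
{
    // Returns the sequence itself, or an empty sequence when it is null,
    // so callers can iterate without a separate null check.
    public static IEnumerable<T> OrEmpty<T>(this IEnumerable<T> source)
    {
        return source ?? Enumerable.Empty<T>();
    }
}

// Assumed usage with HtmlAgilityPack, where SelectNodes returns null
// when the XPath expression matches nothing:
//
//   foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]").OrEmpty())
//   {
//       hrefTags.Add(link.GetAttributeValue("href", null));
//   }
```

Extension methods can be called on a null reference, which is what makes the `.OrEmpty()` call in the comment safe.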

Fizzler

An add-on for HtmlAgilityPack that lets you use CSS selectors.
The code below shows clearly what problem Fizzler solves:

// The document is loaded as usual
var html = new HtmlDocument();
html.LoadHtml(@"
  <html>
      <head></head>
      <body>
        <div>
          <p class='content'>Fizzler</p>
          <p>CSS Selector Engine</p></div>
      </body>
  </html>");
// Fizzler is a set of extension methods for HtmlAgilityPack,
// for example QuerySelectorAll on HtmlNode
var document = html.DocumentNode;
// returns: [<p class="content">Fizzler</p>]
document.QuerySelectorAll(".content");
// returns: [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("p");
// returns an empty sequence
document.QuerySelectorAll("body>p");
// returns: [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("body p");
// returns: [<p class="content">Fizzler</p>]
document.QuerySelectorAll("p:first-child");

In terms of speed it is practically the same as HtmlAgilityPack, but it is more convenient thanks to CSS selectors.

Fizzler has the same problem with commits as HtmlAgilityPack: there have been no updates for a long time and, apparently, none are expected.

License: LGPL.

CsQuery

It was one of the modern HTML parsers for .NET. It is based on the validator.nu parser for Java, which in turn is a port of the parser from the Gecko engine (Firefox).

The API is inspired by jQuery, and elements are selected with the CSS selector language. The method names are copied almost one-to-one, so programmers familiar with jQuery will find it easy to learn.

CsQuery development is currently in a passive phase.

Message from the developer

CsQuery is not being actively maintained. I no longer use it in my day-to-day work, and indeed don't even work in .NET much these day! Therefore it is difficult for me to spend any time addressing problems or questions. If you post issues, I may not be able to respond to them, and it's very unlikely I will be able to make bug fixes.

While the current release on NuGet (1.3.4) is stable, there are a couple known bugs (see issues) and there are many changes since the last release in the repository. However, I am not going to publish any more official releases, since I don't have time to validate the current code base and address the known issues, or support any unforseen problems that may arise from a new release.

I would welcome any community involvement in making this project active again. If you use CsQuery and are interested in being a collaborator on the project please contact me directly.

The author himself advises using AngleSharp as an alternative to his project.

The code for getting links from a page looks nice and familiar to anyone who has used jQuery:

/// <summary>
/// Extract all anchor tags using CsQuery
/// </summary>
public IEnumerable<string> CsQuery()
{
    List<string> hrefTags = new List<string>();
    CQ cq = CQ.Create(Html);
    foreach (IDomObject obj in cq.Find("a"))
    {
        hrefTags.Add(obj.GetAttribute("href"));
    }
    return hrefTags;
}

License: MIT

AngleSharp

Unlike CsQuery, it was written from scratch in C#. It also includes parsers for other languages.

The API is based on the official JavaScript HTML DOM specification. In some places there are quirks that are unusual for .NET developers (for example, accessing an invalid index in a collection returns null instead of throwing an exception; there is a separate Url class; the namespaces are very granular), but overall nothing critical.

The library is developing very quickly. The number of conveniences that make the job easier is impressive, for example IHtmlTableElement, IHtmlProgressElement, etc.

The code is clean, neat, and comfortable to work with.
For example, extracting links from a page is practically no different from Fizzler:

/// <summary>
/// Extract all anchor tags using AngleSharp
/// </summary>
public IEnumerable<string> AngleSharp()
{
    List<string> hrefTags = new List<string>();
    var parser = new HtmlParser();
    var document = parser.Parse(Html);
    foreach (IElement element in document.QuerySelectorAll("a"))
    {
        hrefTags.Add(element.GetAttribute("href"));
    }
    return hrefTags;
}

And for more complex cases, there are dozens of specialized interfaces that will help solve the task.

License: MIT

Regex

An ancient and not the most successful approach to working with HTML. I really liked Athari's comment, so I will reproduce it here:

Scary and terrible regular expressions. Using them is undesirable, but sometimes it becomes necessary, because parsers that build a DOM are noticeably greedier than Regex: they consume more CPU time and memory.

If it comes to regular expressions, you need to understand that you cannot build a universal and absolutely reliable solution on them. However, if you only want to parse one specific site, this problem may not be so critical.

For heaven's sake, do not turn regular expressions into an unreadable mess. You do not write C# code on one line with single-letter variable names, so do not mangle regular expressions either. The regex engine in .NET is powerful enough to write quality code.

The code for getting links from a page even looks more or less clear:

/// <summary>
/// Extract all anchor tags using Regex
/// </summary>
public IEnumerable<string> Regex()
{
    List<string> hrefTags = new List<string>();
    Regex reHref = new Regex(@"(?inx)
    <a \s [^>]*
        href \s* = \s*
            (?<q> ['""] )
                (?<url> [^""]+ )
            \k<q>
    [^>]* >");
    foreach (Match match in reHref.Matches(Html))
    {
        hrefTags.Add(match.Groups["url"].ToString());
    }
    return hrefTags;
}
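
In the spirit of the advice about readable patterns, here is a commented variant of the same idea, as a sketch rather than the code used in the benchmark. It relies on two assumptions that are stated in the comments: the URL itself contains no quote characters, and the `href` attribute is preceded by whitespace (so `data-href` is skipped). The names `HrefRegexSketch` and `ExtractHrefs` are made up for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HrefRegexSketch
{
    // Inline options: i = ignore case, n = only named groups capture,
    // x = ignore pattern whitespace and allow # comments.
    private static readonly Regex ReHref = new Regex(@"(?inx)
        <a \b [^>]*?            # an anchor tag and any attributes before href
            \s href \s* = \s*   # assumption: href follows whitespace, so data-href is skipped
            (?<q> ['""] )       # opening quote, single or double
            (?<url> [^'""]+ )   # assumption: the URL contains no quote characters
            \k<q>               # the same kind of quote closes the value
        [^>]* >                 # the rest of the tag");

    // Returns the value of the url group for every match, in document order.
    public static List<string> ExtractHrefs(string html)
    {
        return ReHref.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups["url"].Value)
            .ToList();
    }
}
```

Like any regex-based approach, this handles only the markup shapes it was written for; unquoted attribute values, for instance, are not matched.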

But if you suddenly want to work with tables, and in an elaborate format at that, then please look here first.

The license is listed on this site.

Benchmark

The speed of a parser, whatever one may say, is one of its most important attributes. How long a given task takes depends on the speed of HTML processing.

To measure parser performance I used the BenchmarkDotNet library by DreamWalker, for which many thanks to him.

The measurements were taken on an Intel® Core(TM) i7-4770 CPU @ 3.40GHz, but experience suggests that the relative timings will be similar on other configurations.

A few words about Regex: do not try this at home. Regex is a very good tool in the right hands, but working with HTML is definitely not the place to use it. Still, as an experiment, I tried to implement a minimally working version of the code. It did its job, but the amount of time I spent writing it suggests that I will definitely not do this again.

Well, let's look at the benchmarks.

Getting addresses from links on a page

This task, it seems to me, is basic for all parsers: more often than not, it is with this very problem that a fascinating acquaintance with the world of parsers (sometimes Regex) begins.

The benchmark code can be found on GitHub, and below is a table with the results:

Library	Average time	Standard deviation	Operations/sec
AngleSharp	8.7233 ms	0.4735 ms	114.94
CsQuery	12.7652 ms	0.2296 ms	78.36
Fizzler	5.9388 ms	0.1080 ms	168.44
HtmlAgilityPack	5.4742 ms	0.1205 ms	182.76
Regex	3.2897 ms	0.1240 ms	304.37

As expected, Regex was the fastest, but far from the most convenient. HtmlAgilityPack and Fizzler showed roughly the same processing time, ahead of AngleSharp. CsQuery, unfortunately, is hopelessly behind. It is quite possible that I just do not know how to use it properly; I will be glad to hear comments from people who have worked with this library.

It is not possible to compare convenience here, since the code is almost identical. But all other things being equal, I liked the CsQuery and AngleSharp code more.
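
To read the table in relative terms, each average time can be divided by the fastest one. The sketch below does just that with the average times from the table; the class and member names (`RelativeTimings`, `LinkExtractionMs`, `RelativeToFastest`) are made up for this example and are not part of the benchmark code.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RelativeTimings
{
    // Average times in milliseconds, copied from the table above.
    public static readonly Dictionary<string, double> LinkExtractionMs =
        new Dictionary<string, double>
        {
            ["AngleSharp"] = 8.7233,
            ["CsQuery"] = 12.7652,
            ["Fizzler"] = 5.9388,
            ["HtmlAgilityPack"] = 5.4742,
            ["Regex"] = 3.2897,
        };

    // Divides every average time by the smallest one,
    // so the fastest entry gets a factor of 1.0.
    public static Dictionary<string, double> RelativeToFastest(
        IDictionary<string, double> meansMs)
    {
        double fastest = meansMs.Values.Min();
        return meansMs.ToDictionary(kv => kv.Key, kv => kv.Value / fastest);
    }
}
```

With these figures, HtmlAgilityPack takes roughly 1.7 times as long as Regex, AngleSharp roughly 2.7 times, and CsQuery roughly 3.9 times.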

Retrieving data from a table

I ran into this problem in practice. Moreover, the table I had to work with was not a simple one.

A note on life in Belarus

I wanted to get up-to-date currency exchange rates in the glorious city of Minsk. I could not find any service that provides exchange rates from the banks, but I stumbled upon http://select.by/kurs/. The information there is updated often and it has what I need, but in a very inconvenient format.
Guys, if you are reading this, please build a proper service, or at least fix the HTML.

I tried to hide as much as possible of everything that is not specifically about HTML processing, but because of the nature of the task, this did not work out completely.

The code is about the same for all libraries; the only differences are in the API and in what results are returned. Two things are worth mentioning, however: first, AngleSharp has specialized interfaces, which made the task easier. Second, Regex is not suitable for this task at all.

Let's look at the results:

Library	Average time	Standard deviation	Operations/sec
AngleSharp	27.4181 ms	1.1380 ms	36.53
CsQuery	42.2388 ms	0.7857 ms	23.68
Fizzler	21.7716 ms	0.6842 ms	45.97
HtmlAgilityPack	20.6314 ms	0.3786 ms	48.49
Regex	42.2942 ms	0.1382 ms	23.64

As in the previous example, HtmlAgilityPack and Fizzler showed about the same, very good times. AngleSharp lags behind them, but perhaps I did not do everything in the most optimal way. To my surprise, CsQuery and Regex showed equally poor processing times. With CsQuery everything is clear, it is simply slow; with Regex it is not so simple, and most likely the problem can be solved in a more optimal way.

Findings

Everyone has probably drawn their own conclusions. For my part, I will add that AngleSharp is now the best choice, since it is actively developed, has an intuitive API and shows good processing times. Does it make sense to switch from HtmlAgilityPack to AngleSharp? Most likely not: install Fizzler and enjoy a very fast and convenient library.

Thank you all for your attention.
All the code can be found in the GitHub repository. Any additions and/or changes are welcome.
