Html Agility Pack - a handy .NET HTML parser

    Hello!
    The idea once came to me to analyze the job vacancies posted on Habr. I was specifically interested in whether there is any correlation between salary and a higher-education requirement. Exam season is now underway (mine included), so perhaps someone who is tired of fraying their nerves over exams will find this analysis useful.
    Since I am a .NET programmer, I decided to parse the Habr job ads in C#. I did not want to pick the HTML apart by hand, so I went looking for an HTML parser that would do the heavy lifting.
    Looking ahead, I will say that nothing interesting came out of the analysis, and the exams will still have to be taken :(
    But I will tell you a little about the very useful Html Agility Pack library.

    Parser selection


    I came across this library through a discussion on Stack Overflow. The comments there suggested other solutions as well, for example the SgmlReader library, which converts HTML into an XmlDocument, after which the full set of .NET XML tools is at your disposal. But for some reason that did not win me over, and I went off to download the Html Agility Pack.

    A quick tour of the Html Agility Pack


    Help files for the library can be downloaded from the project page. The functionality turned out to be very pleasing.
    About twenty main classes are available to us. Method names correspond to the DOM interfaces (remark by k12th), plus some extras: GetElementbyId(), CreateAttribute(), CreateElement(), and so on, so it will feel especially familiar if you have dealt with JavaScript. It seems that under the hood the HTML is still converted to XML, with HtmlDocument and the other classes acting as a wrapper; there is nothing wrong with that, and thanks to it we get features such as:





    • LINQ to Objects (via LINQ to XML)
    • XPath
    • XSLT
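    As a minimal sketch of what this looks like in practice (this assumes the HtmlAgilityPack NuGet package; the HTML snippet and the `menu` id are made up for illustration):

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet: HtmlAgilityPack

class QuickTour
{
    static void Main()
    {
        // Parse a string of (possibly malformed) HTML -- no network needed
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='menu'><a href='/jobs'>Jobs</a><a href='/blog'>Blog</a></div>");

        // DOM-style access, like in JavaScript
        HtmlNode menu = doc.GetElementbyId("menu");

        // LINQ to Objects over the node tree
        var links = menu.ChildNodes.Where(x => x.Name == "a");
        foreach (var a in links)
            Console.WriteLine(a.GetAttributeValue("href", "") + " -> " + a.InnerText);

        // ...or the same lookup via XPath
        var sameLinks = doc.DocumentNode.SelectNodes("//div[@id='menu']/a");
    }
}
```

    The loop prints each link's href together with its text; the XPath query at the end returns the same two nodes.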

    Parsing Habr!


    Vacancies on Habr are presented as a table whose rows give the position and the salary, but since we also need information about education, we will have to visit each vacancy page and parse it separately.
    So, let's start: first we take the table and pull the links, positions, and salaries out of it:
    static void GetJobLinks(HtmlDocument html)
    {
        var trNodes = html.GetElementbyId("job-items").ChildNodes.Where(x => x.Name == "tr");

        foreach (var item in trNodes)
        {
            var tdNodes = item.ChildNodes.Where(x => x.Name == "td").ToArray();
            if (tdNodes.Length != 0)
            {
                var location = tdNodes[2].ChildNodes.Where(x => x.Name == "a").ToArray();

                jobList.Add(new HabraJob()
                {
                    Url = tdNodes[0].ChildNodes.First().Attributes["href"].Value,
                    Title = tdNodes[0].FirstChild.InnerText,
                    Price = tdNodes[1].FirstChild.InnerText,
                    Country = location[0].InnerText,
                    Region = location[1].InnerText,
                    City = location[2].InnerText
                });
            }
        }
    }

    After that we need to follow each link and pull out the education information, and, while we are at it, the employment type too. There is one small snag: while the table with vacancy links lived in a div with a known id, the vacancy details sit in a table without any id at all, so I had to improvise a little:
    static void GetFullInfo(HabraJob job)
    {
        HtmlDocument html = new HtmlDocument();
        html.LoadHtml(wClient.DownloadString(job.Url));
        //html.LoadHtml(GetHtmlString(job.Url));

        // you really shouldn't do it like this :-(
        var table = html.GetElementbyId("main-content").ChildNodes[1].ChildNodes[9].ChildNodes[1].ChildNodes[2].ChildNodes[1].ChildNodes[3].ChildNodes.Where(x => x.Name == "tr").ToArray();

        foreach (var tr in table)
        {
            string category = tr.ChildNodes.FindFirst("th").InnerText;

            switch (category)
            {
                case "Company:":
                    job.Company = tr.ChildNodes.FindFirst("td").FirstChild.InnerText;
                    break;
                case "Education:":
                    job.Education = HabraJob.ParseEducation(tr.ChildNodes.FindFirst("td").InnerText);
                    break;
                case "Employment:":
                    job.Employment = HabraJob.ParseEmployment(tr.ChildNodes.FindFirst("td").InnerText);
                    break;
                default:
                    continue;
            }
        }
    }
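    The long `ChildNodes[1].ChildNodes[9]...` chain is fragile: any change in the markup breaks it. The library's XPath support gives a less brittle way to reach the same rows; the sketch below is self-contained for illustration, and the tiny HTML string stands in for the downloaded vacancy page (the real page structure is an assumption):

```csharp
using System;
using HtmlAgilityPack; // NuGet: HtmlAgilityPack

class XPathLookup
{
    static void Main()
    {
        var html = new HtmlDocument();
        // stand-in for the downloaded vacancy page
        html.LoadHtml("<div id='main-content'><table><tr><th>Education:</th><td>higher</td></tr></table></div>");

        // One XPath query instead of a brittle chain of ChildNodes indices:
        var rows = html.DocumentNode.SelectNodes("//div[@id='main-content']//table//tr");
        if (rows != null)   // SelectNodes returns null when nothing matches
        {
            foreach (var tr in rows)
                Console.WriteLine(tr.SelectSingleNode("th").InnerText
                                  + " " + tr.SelectSingleNode("td").InnerText);
        }
    }
}
```

    This way the code survives extra wrapper elements being added around the table, as long as the container id stays the same.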

    Results


    Well, then we save the results to XML and look in Excel at what came out... and we see that nothing good came out, because most companies either do not state the salary, or do not state the education requirement (they forget it, bury it in the body of the vacancy, or it genuinely does not matter to them), or leave out everything at once.
    For those interested, here are the results in xlsx and xml, and here is the source code.
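    The "save the results to XML" step can be sketched with the standard XmlSerializer; the shape of the HabraJob class here is my guess based on the fields used earlier, trimmed to three properties:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class HabraJob
{
    public string Url { get; set; }
    public string Title { get; set; }
    public string Price { get; set; }
}

class SaveResults
{
    static void Main()
    {
        var jobList = new List<HabraJob>
        {
            new HabraJob { Url = "http://example.com/job/1", Title = "Developer", Price = "n/a" }
        };

        // Serialize the whole list in one go; the root element becomes ArrayOfHabraJob
        var serializer = new XmlSerializer(typeof(List<HabraJob>));
        using (var writer = new StreamWriter("jobs.xml"))
            serializer.Serialize(writer, jobList);
    }
}
```

    The resulting jobs.xml opens directly in Excel as an XML table, which is all the analysis here needed.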

    PS


    While parsing I ran into a problem: the pages downloaded very slowly. I tried WebClient first and then WebRequest, but there was no difference. A Google search suggested explicitly disabling the proxy in the code, after which everything was supposed to be fine, but that did not help either.
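    For reference, the suggested fix (which did not help in my case) looks like this; setting Proxy to null skips the automatic proxy detection that often causes a multi-second delay on the first request:

```csharp
using System.Net;

class Download
{
    static void Main()
    {
        var wClient = new WebClient();
        wClient.Proxy = null;   // skip automatic proxy detection
        // string page = wClient.DownloadString("http://habrahabr.ru/job/");
    }
}
```

    The download line is commented out so the snippet does not depend on the network.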
