Grab pages with WebHarvest

    Grabbing information from web pages is a perennially relevant task, both for a project and simply in order to use a resource more conveniently. I mean usability, or just the need to see data in a different context. Grabbing someone else's information and using it for commercial purposes is always bad, and people usually try to punish it. For personal purposes, though, you can use it freely. This can probably be compared to using a pencil or colored markers while reading newspapers and magazines: if I circle ads in red or yellow and boldly cross out some of them, I am simply changing how the information is presented, in the light that suits my tasks. Though lawyers are still something to be wary of.

    Where the task came from


    About six months ago, BlogcampCEE was being planned in Kiev, and everyone who wanted to take part in the event had to register and fill out a profile. At the time I was barely acquainted with the local market of Internet projects, or with who was doing what in the Ukrainian IT and Internet markets. The public profiles of the event's participants make it possible to put together a certain picture of the market players. Such information is convenient to use when it is all on one page. Collecting all this data manually would mean spending a lot of time navigating pages and copy-pasting, and updates over time would also have to be done by hand.

    Search for a way


    I have programming skills, so something could be thought up. And laziness, as you know, is the engine of progress. Just recently my work colleagues solved the problem of extracting a site's rating from an HTML page on Alexa. The value was hidden behind a pile of fake spans with generated styles, which made it impossible to pull it out directly. That is how we got acquainted with WebHarvest, and it solved that problem perfectly, along with a bunch of similar ones that come up when writing tools for SEO specialists. They really do need a lot of data: page rank, keyword strength, competitor analysis, and so on. And for about 80% of the tasks involving data collection there is no API; you just have to walk the pages and gather the info.

    So, I want to share my experience of getting the necessary information into a readable form from the BlogcampCEE website (just a real-life example) using the WebHarvest library, the Java programming language (behind the scenes), the Ant build tool, and an XSLT processor for converting XML to HTML.

    First approach


    At the time I found the page blogcampcee.com/ru/group/tracker, which tracks all public events happening on the site. Among them is an event of the Usernode type, which marks the registration of a new user and contains a link to the user's profile.

    image: the tracker page

    Later I found the page blogcampcee.com/ru/userlist, where only users are shown; it would have been more logical to take it as the entry point for the task. But by then it was too late, everything was already done. It would not have helped much anyway, except to speed things up by dropping the uninteresting events and the extra type check. It certainly would not have saved me from the clicking.

    The task was to page through all the possible pages, get all the links to the user pages, and then go through each user and collect the necessary information.

    The configuration file has to be written in the WebHarvest GUI. It is not much of a development environment, but it is better than nothing at all. Debugging is not easy either, but at least the state of the variables can be inspected at any point of execution, and that is a lot.

    image: the WebHarvest GUI

    Functions are moved out into functions.xml, which is then used in the main configuration file. Walking the paginated list and pulling out the links to the users' pages is a separate function (borrowed, with some changes, from the WebHarvest examples). The main configuration file is blogcamp.xml.

    It contains the logic for walking through the users' pages and pulling out a few fields, including links to personal pages and project sites, which is what interested me most. All of this information is saved to a file in a custom XML format (users-samples.xml).
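
    I will not paste the real configs here (they are in the archive), but to give an idea of their shape, here is a rough sketch collapsed into a single file. It uses WebHarvest's processor syntax; the XPath expressions, URLs and output element names are invented for illustration and do not match the actual blogcampcee.com markup, and the pagination handling is omitted.

    <!-- A sketch only, not the actual functions.xml / blogcamp.xml from the archive.
         XPath expressions, URLs and output element names are illustrative. -->
    <config charset="UTF-8">

        <!-- from functions.xml: return the profile links found on one listing page
             (the real function also walks the pagination) -->
        <function name="user-links">
            <return>
                <xpath expression="//a[contains(@href, '/user/')]/@href">
                    <html-to-xml>
                        <http url="${pageUrl}"/>
                    </html-to-xml>
                </xpath>
            </return>
        </function>

        <!-- from blogcamp.xml: collect the links, then visit every profile
             and save the interesting fields into a custom XML file -->
        <var-def name="userLinks">
            <call name="user-links">
                <call-param name="pageUrl">http://blogcampcee.com/ru/group/tracker</call-param>
            </call>
        </var-def>

        <file action="write" path="users-samples.xml" charset="UTF-8">
            <template>&lt;users&gt;</template>
            <loop item="link" index="i">
                <list><var name="userLinks"/></list>
                <body>
                    <var-def name="profile">
                        <html-to-xml>
                            <http url="${link}"/>
                        </html-to-xml>
                    </var-def>
                    <template>&lt;user&gt;</template>
                    <!-- personal page, project site and the other fields of interest -->
                    <xpath expression="//div[@class='profile']//a/@href">
                        <var name="profile"/>
                    </xpath>
                    <template>&lt;/user&gt;</template>
                </body>
            </loop>
            <template>&lt;/users&gt;</template>
        </file>

    </config>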

    Modernization


    An XML file is not very convenient to read, so users-style.xsl was written for it to give it a readable, convenient look.

    And this is the page we get at the output of the XSLT processor.

    image: the desired presentation format of the user list
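
    The real users-style.xsl is in the archive. As a hint of what such a stylesheet amounts to, here is a minimal sketch; the users/user/name/homepage/project element names are assumptions matching the sketch above, not necessarily the real users-samples.xml format.

    <!-- Minimal XML-to-HTML stylesheet sketch; element names are illustrative. -->
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="html" encoding="UTF-8"/>

        <xsl:template match="/users">
            <html>
                <body>
                    <table border="1">
                        <tr><th>Name</th><th>Personal page</th><th>Project</th></tr>
                        <xsl:apply-templates select="user"/>
                    </table>
                </body>
            </html>
        </xsl:template>

        <xsl:template match="user">
            <tr>
                <td><xsl:value-of select="name"/></td>
                <td><a href="{homepage}"><xsl:value-of select="homepage"/></a></td>
                <td><a href="{project}"><xsl:value-of select="project"/></a></td>
            </tr>
        </xsl:template>
    </xsl:stylesheet>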

    Miniproject structure


    WebHarvest is a purely programmer's tool that can conveniently be used as a library in your own projects: feed it the necessary configuration files and get the desired result. So let us wire the grabbing and the XSL transformation into a single process with Ant, as if it were going to grow into a complex product, or simply so that later I would not have to remember what this whole thing was called and where it lived. The result is build.xml and the following structure.

    image: the mini-project structure
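
    The build itself boils down to two steps: run the grab, then run the XSL transformation. Below is a sketch of such a build.xml under the assumption that WebHarvest and its dependencies sit in a lib/ folder and the grab is launched through a small runner class (the Grabber class here is hypothetical); the working build.xml ships with the archive.

    <!-- Sketch of the build; paths, target names and the Grabber runner class are
         assumptions, the working build.xml is in the archive. -->
    <project name="blogcamp-grab" default="report" basedir=".">

        <!-- WebHarvest and its dependencies -->
        <path id="lib.path">
            <fileset dir="lib" includes="*.jar"/>
        </path>

        <!-- step 1: run the grab, producing users-samples.xml -->
        <target name="grab">
            <java classname="Grabber" fork="true" failonerror="true">
                <classpath>
                    <path refid="lib.path"/>
                    <pathelement path="classes"/>
                </classpath>
                <arg value="blogcamp.xml"/>
            </java>
        </target>

        <!-- step 2: turn the XML into a readable HTML page -->
        <target name="report" depends="grab">
            <xslt in="users-samples.xml" out="users.html" style="users-style.xsl"/>
        </target>

    </project>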

    Now, when at some point I need to recall what is going on and repeat these procedures (and that happens not infrequently, and after enough time has passed to completely forget what it all was), it will be very simple to resume, repeat, edit, and so on.

    The problem is solved. The necessary information is on one page, in a form convenient for me. The whole process repeats at the click of a button, which matters as the number of users grows and the list needs updating over time.

    Disadvantages


    The most important one is the lack of multithreading in the grabbing process; the command-line tool simply works that way. For my task this was not critical. Programmatically, you are free to do whatever you want: multithreading, parallelization, even running in distributed mode on a cluster of N machines. It is all in your hands.
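
    If I remember the library API correctly, running a config from Java takes only a few lines (ScraperConfiguration plus Scraper, as in the WebHarvest documentation), so wrapping it in threads is straightforward. A sketch; the startUrl variable name and the per-thread working directories are my assumptions rather than anything from the archive.

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;

    // Sketch: run the same WebHarvest config in several threads, each over its own
    // chunk of start pages. The "startUrl" variable name and the per-thread working
    // directories are assumptions; the config has to be written to read them.
    public class ParallelGrab {
        public static void main(String[] args) throws Exception {
            String[] startUrls = {
                "http://blogcampcee.com/ru/group/tracker?page=0",
                "http://blogcampcee.com/ru/group/tracker?page=1"
            };

            Thread[] workers = new Thread[startUrls.length];
            for (int i = 0; i < startUrls.length; i++) {
                final String url = startUrls[i];
                final int id = i;
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        try {
                            // each thread gets its own Scraper instance and work dir
                            ScraperConfiguration config = new ScraperConfiguration("blogcamp.xml");
                            Scraper scraper = new Scraper(config, "work/" + id);
                            scraper.addVariableToContext("startUrl", url);
                            scraper.execute();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
        }
    }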

    The structure of the solution itself is not entirely successful: one pass is made to collect all the user links, and only then are they processed. It would be better to process each user link as soon as it is received and save the result to a file. Then, if the connection drops or something else goes wrong, you would not have to start the grab from the very beginning, but could continue from the place where it broke off.

    There is no ordinal number in the resulting HTML; I noticed this only later and was too lazy to fix it, because that would mean remembering XSLT again.
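
    For the record, the row number is a one-liner in XSLT; something like this inside the template that renders one user row would have done it:

    <!-- inside the template that renders one user row -->
    <td><xsl:value-of select="position()"/></td>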

    There are few comments in the configuration files. Sorry, but this is the scourge that haunts me constantly. Usually comments start getting written after you celebrate six months of work on a project and realize that your memory has run out, while the bells and whistles in the project show no sign of stopping :)

    And many more can be found.

    Materials


    I do not like it when you read an article and cannot immediately try everything out for yourself. So I am posting the archive with the full project, including WebHarvest. For it to work comfortably, you need to be able to run Ant from the command line anywhere on your system. Important! Do not torture the blogcamp site. The event is over, but still: hosting and traffic are not necessarily unlimited.

    If you do not work with XPath much, it all gets forgotten quickly. Here are a couple of resources that helped me refresh (learn) these technologies, and after them a couple of example expressions for the flavor.

    XPath specification
    Examples of writing XPath expressions
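
    Expressions of the kind that come up in this sort of grabbing look roughly like this (the paths, attributes and class names are made up):

    //a[contains(@href, '/user/')]/@href        all profile links on a page
    //div[@class='profile']//a[1]/text()        text of the first link inside a profile block
    count(//span[starts-with(@id, 'fake')])     how many fake spans of the Alexa kind there are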

    Conclusion


    The main goal, to show that WebHarvest can be used as a grabber in your own projects, has been achieved.

    The tool has plenty of features and knobs, including the use of JavaScript and XQuery, saving cookies, basic authentication, user-agent spoofing, custom request headers, and so on. And you can always rewrite parts of it via the API, or customize it so far beyond recognition that even its creator would not recognize it later.
