Asynchronous web-mining using node.js

    I would like to share my experience with a web-mining task: collecting specific information from a given list of resources. I should note right away that this is not an attempt to build your own “search engine”; completely different approaches are used for that. The goal of web mining is to extract particular pieces of information, for example "business card" data from resources that support microformats.


    Now about the implementation: why node.js specifically? I had no restrictions on technology, so anything could have been used, from C++ and Java/.NET to Perl/Python. Here is why I chose node.js:
    • Asynchronous I/O. Asynchrony can be organized in other languages too, sometimes very simply (F# has the async block), but in node.js it comes out of the box and is the preferred way of performing operations.
    • The most familiar syntax with the fewest redundant constructs. This point is admittedly debatable, but JavaScript really is closer to people who have used C/C++, Java or C# than F# or Python is.
    • An HTTP client and regular expressions are supported out of the box, with no additional modules to install.
    • Execution speed. V8 does have a weak spot, context switching, but for this task that should not be a bottleneck; straight-line speed matters more, and that is exactly what V8 can boast of (NB: make a benchmark to back this up with numbers).

    Installing node.js


    Installation on my server (FreeBSD, amd64) went more than smoothly: "cd /usr/ports/www/node; make install" and node.js is ready to use.

    On Windows, the most accessible way to install is through cygwin. I did not find good instructions, although I did come across an implementation of node.js written purely in .NET.

    On Ubuntu it also installs without any problems, and good instructions are available.

    Next, I read a rather nice manual. Although the manual looks good, it only covers the basics, and when I wanted my web miner to behave like most other classes and be able to fire events, it turned out that the manual did not cover this at all. But more on that later.

    Page downloader


    Starting from the http.Client example and adding a wait for the entire document to load, URL parsing, and construction of the request, I ended up with the following “class”. The interesting part is how the class is registered as a source of events:
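    var events = require('events');
    var http = require('http');
    var sys = require('sys');
    var url = require('url');

    // Downloads a page and emits a 'page' event once the whole body has arrived.
    var webDownloader = function() {
      // Run the EventEmitter constructor on this instance.
      events.EventEmitter.call(this);

      this.load = function(sourceUrl) {
        var src = url.parse(sourceUrl);
        // http.createClient is the legacy client API this post was written against.
        var webClient = http.createClient(src.port == undefined ? 80 : src.port, src.hostname);
        var get = src.pathname + (src.search == undefined ? '' : src.search);
        sys.log('loading ' + src.href);
        var request = webClient.request('GET', get, {'host': src.hostname});
        request.end();
        var miner = this;
        request.on('response', function(response) {
          // console.log('STATUS: ' + response.statusCode);
          // console.log('HEADERS: ' + JSON.stringify(response.headers));
          response.setEncoding('utf8');
          var body = '';
          // Accumulate the chunks until the response is complete.
          response.on('data', function(chunk) {
            body += chunk;
          });
          response.on('end', function() {
            miner.emit('page', body, src);
          });
        });
      };
    };

    // Make webDownloader inherit on(), emit() and the rest from EventEmitter.
    sys.inherits(webDownloader, events.EventEmitter);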






    1. First, call the EventEmitter constructor from our own constructor: events.EventEmitter.call(this);
    2. “Inherit” the class from EventEmitter: sys.inherits(webDownloader, events.EventEmitter);
    3. “Emit” the event using the emit method: miner.emit('page', body, src);
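
    The three steps can be tried in isolation, without any network access. The sketch below is only an illustration, not the downloader from this post: FakeDownloader and its canned page body are made-up names, and util.inherits is used here as the same helper under the module's later name (util instead of sys).

```javascript
const { EventEmitter } = require('events');
const util = require('util');

// Hypothetical stand-in for the downloader: no network, canned body.
function FakeDownloader() {
  // Step 1: initialise the EventEmitter part of this instance.
  EventEmitter.call(this);
}

// Step 2: make FakeDownloader.prototype inherit on(), emit(), etc.
util.inherits(FakeDownloader, EventEmitter);

FakeDownloader.prototype.load = function (sourceUrl) {
  const body = '<html><body>stub page for ' + sourceUrl + '</body></html>';
  // Step 3: notify subscribers; extra arguments are passed to each listener.
  this.emit('page', body, sourceUrl);
};

const received = [];
const fakeLoader = new FakeDownloader();
fakeLoader.on('page', function (body, src) {
  received.push({ src: src, length: body.length });
});
fakeLoader.load('http://example.com/');
console.log(received);
```

    Because emit calls its listeners synchronously, the received array is already filled by the time load returns in this stub; with a real network request the event would of course arrive later.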


    Working with EventEmitter in particular is still poorly documented, so I had to do some googling.

    Now we can subscribe to the full page load event:
    var loader = new webDownloader();
    loader.on('page', vcardSearch);
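
    http.createClient belongs to the early node.js API this post was written against and is not available in current releases. As a rough sketch of the same downloader on a newer runtime, assuming the built-in http.get, the global URL class and ES class syntax (none of which appear in the original code, and requestOptionsFor is a made-up helper name), one might write:

```javascript
const { EventEmitter } = require('events');
const http = require('http');

// Turns a parsed URL into the options http.get needs (plain http only).
function requestOptionsFor(src) {
  return {
    hostname: src.hostname,
    // URL.port is an empty string when the default port is used.
    port: src.port === '' ? 80 : Number(src.port),
    path: src.pathname + src.search,
  };
}

class WebDownloader extends EventEmitter {
  load(sourceUrl) {
    const src = new URL(sourceUrl);
    const request = http.get(requestOptionsFor(src), (response) => {
      response.setEncoding('utf8');
      let body = '';
      response.on('data', (chunk) => { body += chunk; });
      // Emit only after the whole document has arrived.
      response.on('end', () => this.emit('page', body, src));
    });
    // Forward network failures as an 'error' event; note that EventEmitter
    // throws if 'error' is emitted while no 'error' listener is attached.
    request.on('error', (err) => this.emit('error', err));
  }
}
```

    Subscribing stays the same, loader.on('page', vcardSearch), since the URL object passed to the listener also exposes href.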


    Search for vCard data


    Now for a less interesting function, the one that pulls vCard data out of the page. I did not want to spend much time on a proper implementation, so I did it head-on: by searching for elements with the relevant classes.

    There is nothing particularly interesting here apart from the use of the Apricot module for parsing the page (htmlparser alone would really have been enough, but Apricot got me going much faster). At first I tried to build a CSS selector to find the relevant elements and use Apricot's find function (which in turn uses Sizzle for the search), but as it turned out, a recursive traversal of all elements is faster.
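
    To make the recursive approach concrete without depending on Apricot, the sketch below walks a hand-built tree of plain objects. The node shape (className, text, childNodes) mirrors the properties the function in this post reads, but the sample tree, the collectByClass name and the reduced class list are made up for illustration.

```javascript
// Classes we are interested in (a reduced, illustrative subset).
const wanted = new Set(['fn', 'nickname', 'email']);

// Depth-first walk: check the node's classes, then recurse into children.
function collectByClass(node, found = {}) {
  if (!node) return found;
  const classes = (node.className || '').split(/\s+/).filter(Boolean);
  for (const name of classes) {
    // Strip any markup left in the text, then trim whitespace.
    const value = (node.text || '').replace(/<\/?[^>]+(>|$)/g, '').trim();
    if (wanted.has(name) && value !== '') found[name] = value;
  }
  for (const child of node.childNodes || []) collectByClass(child, found);
  return found;
}

// Hand-built stand-in for a parsed page.
const tree = {
  className: 'vcard',
  text: '',
  childNodes: [
    { className: 'fn n', text: ' <b>Jane Doe</b> ', childNodes: [] },
    { className: 'note', text: 'ignored', childNodes: [] },
    { className: 'email', text: 'jane@example.com', childNodes: [] },
  ],
};

const result = collectByClass(tree);
console.log(result);
```

    Each element is visited exactly once and compared against a fixed set of class names, so no selector has to be compiled or matched.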

    As a result, we got this function:
    // Assumes sys, Apricot and the vCard class are already in scope (see the full code).
    var vcardSearch = function(body, src) {
      sys.log('scanning ' + src.href);
      Apricot.parse(body, function(doc) {
        var vcardClasses = [
          // required
          'fn',
          'family-name', 'given-name', 'additional-name', 'honorific-prefix', 'honorific-suffix',
          'nickname',
          // optional
          'adr', 'contact',
          'email',
          'post-office-box', 'extended-address', 'street-address', 'locality', 'region', 'postal-code', 'country-name',
          'bday', 'logo', 'org', 'photo', 'tel'
        ];
        var vcard = new vCard();

        // Recursively visit every element and record the text of those
        // whose class list contains a known vCard class.
        var scanElement = function(el) {
          if (el == undefined) return;
          if (el.className != undefined && el.className != '') {
            var classes = el.className.split(' ');
            for (var n = 0; n < classes.length; n++) {
              if (vcardClasses.indexOf(classes[n]) >= 0) {
                // Strip any markup left in the text, then trim.
                var value = el.text.replace(/<\/?[^>]+(>|$)/g, '').trim();
                if (value != '') vcard.Values[classes[n]] = value;
              }
            }
          }
          var children = el.childNodes || [];
          for (var i = 0; i < children.length; i++) scanElement(children[i]);
        };

        scanElement(doc.document.body);
        if (!vcard.isEmpty())
          sys.log('vCard = ' + vcard.toString());
        else
          sys.log('no vCard found on ' + src.href);
      });
    };
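
    The function relies on a vCard object whose definition is not shown in this post. Purely as a guess at its shape, based only on how it is used above (Values, isEmpty(), toString()), a minimal stand-in could look like this; the output format of toString() in particular is an assumption.

```javascript
// Hypothetical stand-in for the vCard class used by vcardSearch.
function vCard() {
  // Maps a microformat class name (e.g. 'fn') to the text found for it.
  this.Values = {};
}

// True when no property has been collected yet.
vCard.prototype.isEmpty = function () {
  return Object.keys(this.Values).length === 0;
};

// Renders "name: value" pairs on one line; the format is an assumption.
vCard.prototype.toString = function () {
  const values = this.Values;
  return Object.keys(values)
    .map(function (name) { return name + ': ' + values[name]; })
    .join('; ');
};

const card = new vCard();
card.Values['fn'] = 'Jane Doe';
card.Values['tel'] = '+1 555 0100';
console.log(card.isEmpty());
console.log('vCard = ' + card);
```

    Keeping the values in a plain object keyed by class name means a class seen twice on a page simply overwrites the earlier value, which matches the assignment in the scanning function.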




    Summary


    Using the result is simple:

    loader.load('http://www.google.com/profiles/olostan');
    loader.load('http://www.flickr.com/people/olostan/');



    I want to say right away that this was never meant as a finished, even remotely serious product, but rather as a proof of concept and a way to get a feel for node.js.

    The full code (uploaded to Google Docs; may require a Google account)

    P.S. This is a repost of my post from the sandbox. Apologies if that is not the done thing, but I would be interested to hear comments. Thanks to Romachev for the invite. I do not have enough karma to post to a topical blog.
