Conveyor: processing data with a time delay

What and why


I once needed to parse information from a website. I picked up Node.js and got down to business.
The site consisted of sections, and each section consisted of pages. To process one section, I had to make many requests, one per page.

That is where I ran into a restriction: the site started returning an error when requests came too frequently (more than a few requests per second). Not a problem, I thought, and solved it in the well-known way, by making a kind of "asynchronous loop". That is, once one page had been processed, I started a timer to process the next one.
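A minimal sketch of such an "asynchronous loop", assuming a plain array of items and a synchronous handler. The names processWithDelay and schedule are hypothetical and used only for illustration; schedule defaults to setTimeout and is a parameter only so the loop can be driven by a fake timer.

```javascript
// Process `items` one at a time, waiting `delayMs` between handler calls.
// `schedule` is a hypothetical hook with the same (fn, ms) shape as
// setTimeout; it exists so the loop can be tested without real timers.
function processWithDelay(items, handler, delayMs, done, schedule) {
    schedule = schedule || setTimeout;
    var index = 0;

    function step() {
        if (index >= items.length) {
            // Nothing left: report completion once.
            if (done) done();
            return;
        }
        handler(items[index]);
        index += 1;
        // Start the timer for the next item only after this one is handled.
        schedule(step, delayMs);
    }

    step();
}
```

Calling processWithDelay(pages, loadPage, 500) would handle the first page immediately and each following page 500 ms after the previous one (pages and loadPage are placeholders here).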

Then I remembered that I needed to parse several sections of this site and realized that this approach was becoming too inconvenient. So I wrote the Conveyor tool, which processes "data elements" (that is, applies a handler function to the given objects) with a time delay between elements. It also turned out to be convenient for "heavy" calculations that would otherwise run in a loop for a long time.

The Conveyor code is on GitHub, and you can install it through npm (the package is called dataconveyor). More structured documentation is also on GitHub. You can use it anytime, anywhere, without restrictions.

Below is a description of the Conveyor tool.

How to use


First, create an instance of the Conveyor object, giving it a data handler:

var conveyor = new Conveyor(function(element) {
    console.log(element);
}, {
    period: 100
});

Here we create an object that writes data to the console at an interval of 100 ms. After initialization, add the data:

conveyor.add(12);
conveyor.add("Ahoj, Habr!");
conveyor.add([firstElement, secondElement]);

Note that in the case of an array, firstElement and secondElement are processed as separate elements, not as one array. New data can also be added while processing is under way, i.e. conveyor.add() can be called inside the handler passed to the constructor.

Once the data has been added (by the way, processing starts immediately after the data is added), we can set a function that will be called after the handlers for all elements have been started and the interval has elapsed:

conveyor.whenStop(function() {
    console.log('Done.');
});

In such a simple way, we can start processing data with the frequency we need. This solved the problem of loading information from many pages. But another problem came up.

Having written a function like parseAllPages() (which loads information from all pages of one section), I did not foresee that I would want to call it for different sections simultaneously and asynchronously. To load information from several categories, I ran this parseAllPages() function inside another Conveyor. But several Conveyors are not synchronized with each other, so together they can make more requests per second than the restrictions allow: for example, two Conveyors with a period of 100 ms can send up to 20 requests per second between them.

To fix this, a boolean flag useQueue (false by default) was added to the Conveyor parameters. Setting it enables sequential processing: the next element is processed only after the previous one has finished. This mode makes it possible to synchronize several interconnected Conveyor objects. Example:

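var categoriesConveyor = new Conveyor(function (category, cb) {
    // parseAllPages is the section-loading function described above
    parseAllPages(category, function() {
        // tell the Conveyor that this category is finished
        cb();
    });
}, {
    period: 100,
    useQueue: true
});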

That is, categories are processed sequentially, while the pages within a category are not. The rest follows the algorithm described above.
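The idea of queue-style processing can be sketched as follows. This is a hypothetical stand-in written for illustration, not dataconveyor's actual implementation: the names SequentialQueue, next and schedule are assumptions, and schedule simply defaults to setTimeout.

```javascript
// Hypothetical illustration of sequential processing: the next element is
// taken only after the handler for the previous one has called `cb`.
function SequentialQueue(handler, periodMs, schedule) {
    this.handler = handler;
    this.periodMs = periodMs;
    this.schedule = schedule || setTimeout; // assumed (fn, ms) timer hook
    this.items = [];
    this.busy = false;
}

SequentialQueue.prototype.add = function (item) {
    this.items.push(item);
    // Start processing only if nothing is currently in flight.
    if (!this.busy) this.next();
};

SequentialQueue.prototype.next = function () {
    var self = this;
    if (self.items.length === 0) {
        self.busy = false;
        return;
    }
    self.busy = true;
    var item = self.items.shift();
    // The handler signals completion through `cb`; only then does the
    // pause before the next element begin.
    self.handler(item, function cb() {
        self.schedule(function () { self.next(); }, self.periodMs);
    });
};
```

Because an element is not started until the previous handler has reported completion, any work nested inside that handler finishes before the next element begins.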

The Conveyor.wait(count) function covers the case where elements will be added later, after whenStop has already been set. The function passed to whenStop will not be called until conveyor.add() has been called count times. If it turns out that no more data needs to be added, you can call Conveyor.unwait(count). The expected-elements counter can also be set when the Conveyor is created, through the expectedElementsCounter parameter.
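The bookkeeping behind such a counter can be sketched like this. It is only an assumption about how the logic might look, not the library's code: StopTracker, added, processed and check are hypothetical names, and the sketch assumes each added element reduces the expected count by one.

```javascript
// Hypothetical sketch: the stop callback fires only when no elements are
// still expected and none are waiting to be processed.
function StopTracker(onStop) {
    this.onStop = onStop;
    this.expected = 0; // elements announced via wait() but not yet added
    this.pending = 0;  // elements added but not yet processed
}

StopTracker.prototype.wait = function (count) {
    this.expected += count;
};

StopTracker.prototype.unwait = function (count) {
    // Never let the counter go below zero.
    this.expected = Math.max(0, this.expected - count);
    this.check();
};

StopTracker.prototype.added = function () {
    // Assumption: every added element counts against the expected total.
    if (this.expected > 0) this.expected -= 1;
    this.pending += 1;
};

StopTracker.prototype.processed = function () {
    this.pending -= 1;
    this.check();
};

StopTracker.prototype.check = function () {
    if (this.expected === 0 && this.pending === 0) this.onStop();
};
```

With wait(2), processing the first added element does not trigger the callback, because one more element is still expected; the callback runs after the second element is processed, or after unwait(1) if that element never arrives.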

And if you need to stop processing altogether (discarding the elements that have not been processed yet), call the Conveyor.forceStop() function.

This thing really helped me. I hope that someone will also find it useful.

I would be grateful for feedback, especially on JavaScript code style.
