We use Node.js to work with large files and raw data sets.



    This post is a translation of an article by Paige Niedringhaus, a full-stack software engineer. Her main specialty is JavaScript, but Paige also explores other languages and frameworks and shares her experience with her readers. The article should be especially interesting to novice developers.

    Recently, I was faced with a task that interested me - it was necessary to extract certain data from the huge volume of unstructured files of the US Federal Election Commission. I did not work too much with raw data, so I decided to take the challenge and take on this task. As a tool for solving it, I chose Node.js.


    The task was described in four points:
    • The program should count the total number of lines in the file.
    • Every eighth column contains a person's name. Load this data and build an array with all the names contained in the file, then print the 432nd and the 43,243rd name.
    • Every fifth column contains the date a donation was made. Count how many donations were made in each month and print the totals.
    • Every eighth column contains a person's name. Build an array of first names only, without surnames, and find out which name occurs most often and how many times.

    (The original task can be viewed at this link.)

    The file you need to work with is a regular .txt file, 2.55 GB in size. There is also a folder containing chunks of the main file, so you can debug the program on them without having to process the entire huge data set.

    Two possible solutions in Node.js


    In principle, working with large files doesn't scare a JavaScript developer; moreover, it is one of the core capabilities of Node.js, which offers several options for reading from and writing to files.

    The familiar one is fs.readFile(). It reads the entire file into memory, after which Node can work with it.

    The alternative is fs.createReadStream(), a function that streams the data in chunks, similar to how this is done in other languages such as Python or Java.
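
    For reference, here is how the two call shapes differ (a minimal sketch; 'data.txt' is a placeholder file name):

        const fs = require('fs');

        // fs.readFile() buffers the whole file into memory before the callback runs.
        fs.readFile('data.txt', 'utf8', (err, contents) => { /* ... */ });

        // fs.createReadStream() emits the file in chunks as they are read from disk.
        fs.createReadStream('data.txt').on('data', (chunk) => { /* ... */ });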

    The solution I chose


    Since I needed to count the total number of lines and parse the data for names and dates, I decided to go with the second option. There I could use the rl.on('line', ...) handler to pull the necessary data out of each line.

    Node.js createReadStream() & readFile() code

    Below is the code that I wrote using Node.js and the fs.createReadStream() function.



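    A minimal sketch of this version, reconstructed from the description that follows (the file name, the pipe delimiter, and the MMDDYYYY date layout are assumptions rather than details from the original):

        const fs = require('fs');
        const readline = require('readline');
        const stream = require('stream');

        const instream = fs.createReadStream('itcont.txt'); // assumed file name
        const outstream = new stream.Stream();
        const rl = readline.createInterface(instream, outstream);

        let lineCount = 0;
        const names = [];
        const firstNames = [];
        const dateDonationCount = {};

        rl.on('line', (line) => {
          lineCount++;
          const columns = line.split('|'); // assuming pipe-delimited records

          // Every eighth column holds the person's name, e.g. "SMITH, JOHN A".
          const name = columns[7] || '';
          names.push(name);

          // Keep only the first name: drop the surname before the comma,
          // then drop middle initials or second names after a space.
          const afterSurname = (name.split(', ')[1] || '').trim();
          firstNames.push(afterSurname.split(' ')[0]);

          // Every fifth column holds the donation date (assumed MMDDYYYY);
          // reduce it to YYYY-MM and count donations per month.
          const rawDate = columns[4] || '';
          const yearMonth = `${rawDate.slice(4, 8)}-${rawDate.slice(0, 2)}`;
          dateDonationCount[yearMonth] = (dateDonationCount[yearMonth] || 0) + 1;
        });

        rl.on('close', () => {
          console.log(`Total lines: ${lineCount}`);
          console.log(`The 432nd name: ${names[431]}`);
          console.log(`The 43,243rd name: ${names[43242]}`);
          // The name and donation tallies are shown in the snippets below.
        });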
    First, I had to set everything up, knowing that importing the data requires Node.js modules such as fs (the file system), readline, and stream. Next, I created an instream and an outstream, along with readline.createInterface(), which let me parse the file line by line and extract the necessary data.

    I also added some variables (and comments) for working with specific pieces of data: lineCount, dupeNames, and the arrays names, donations, and firstNames.

    In the rl.on('line', ...) handler, I set up the line-by-line parsing of the file. There I incremented the lineCount variable for each line. I used the JavaScript split() method to parse out the names, adding them to my names array. Next, I extracted just the first names, without surnames, while handling exceptions such as double names, middle initials, and so on. Then I separated the year and month from the date column, converted them to the YYYY-MM format, and updated the dateDonationCount tally.

    In the rl.on('close', ...) handler, I performed the final transformations of the data collected in the arrays and printed the results with console.log.

    lineCount and names were enough to find the 432nd and 43,243rd names; no transformations were required there. But identifying the most common name in the array and counting the donations per month were trickier tasks.

    To identify the most common name, I had to create an object of key-value pairs for each name (the key) and the number of its occurrences (the value), and then convert it into an array of arrays using the ES6 function Object.entries(). After that, sorting the names and finding the most frequent one was no longer difficult.
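
    A sketch of that step (assuming the firstNames array from the line-by-line parsing above):

        // Count occurrences of each first name in a plain object...
        const nameCounts = {};
        for (const name of firstNames) {
          nameCounts[name] = (nameCounts[name] || 0) + 1;
        }

        // ...then turn it into an array of [name, count] pairs with
        // Object.entries() and sort by count, descending.
        const [topName, topCount] = Object.entries(nameCounts)
          .sort((a, b) => b[1] - a[1])[0];

        console.log(`The most common first name is ${topName}, seen ${topCount} times.`);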

    With the donations, I did roughly the same trick: I created an object of key-value pairs and a logDateElements() function, which used ES6 string interpolation to display the key and value for each month. Then I created a new Map() from the dateDonationCount object and looped over each entry with logDateElements(). (It turned out not to be as easy as it seemed at first.)
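
    A sketch of that step as I understand it from the description (the names follow the text above):

        // Print one line per month using ES6 template-string interpolation.
        // Map's forEach passes (value, key), in that order.
        function logDateElements(value, key) {
          console.log(`Donations in ${key}: ${value}`);
        }

        // Convert the plain dateDonationCount object into a Map and walk it.
        const dateMap = new Map(Object.entries(dateDonationCount));
        dateMap.forEach(logDateElements);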

    And it worked: I was able to read a relatively small file, 400 MB in size, and extract the information I needed.

    After that, I reimplemented the same task with fs.readFile() instead of fs.createReadStream(), to see the difference. Here is the code:



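    (Again, a minimal sketch rather than the exact original code.)

        const fs = require('fs');

        // fs.readFile() pulls the entire file into memory before any parsing
        // begins — the crucial difference from the streaming version above.
        fs.readFile('itcont.txt', 'utf8', (err, data) => {
          if (err) throw err;
          const lines = data.split('\n');
          console.log(`Total lines: ${lines.length}`);
          // ...the same per-line column parsing as in the streaming version...
        });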
    You can see the whole solution here.

    Results of working with Node.js


    The solution seemed to work. I passed the path to the full file into readFileStream.js and... watched the Node server crash with a JavaScript heap out of memory error.



    It turned out that although everything worked, this solution tried to load the entire contents of the file into memory, which is impossible for a 2.55 GB file: by default, Node can hold only about 1.5 GB in memory at a time, no more.

    So neither of my solutions was up to the task. A new one was needed, one that could cope even with files this large.

    New solution


    As it turned out, what I needed was the popular npm module event-stream.

    Having studied the documentation, I understood what needed to be done. Here is the third version of the code.



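    A minimal sketch of the event-stream version (es.split() and es.mapSync() are the module's documented API; the file name and delimiter are the same assumptions as before):

        const fs = require('fs');
        const es = require('event-stream'); // npm install event-stream

        let lineCount = 0;
        const specificNames = []; // only the 432nd and 43,243rd names are kept

        fs.createReadStream('itcont.txt')
          .pipe(es.split()) // split the byte stream into lines on '\n'
          .pipe(
            es
              .mapSync((line) => {
                lineCount++;
                const columns = line.split('|');
                // Store only the two requested names instead of all 13 million.
                if (lineCount === 432 || lineCount === 43243) {
                  specificNames.push(columns[7]);
                }
                // ...first-name and per-month donation tallies as before...
              })
              .on('error', (err) => console.error(err))
              .on('end', () => {
                console.log(`Total lines: ${lineCount}`);
                console.log(`Names 432 and 43,243: ${specificNames.join(' and ')}`);
              })
          );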
    The module's documentation indicated that the data stream should be split into separate elements using the \n character at the end of each line of the .txt file.

    Essentially, the only thing I had to change was the handling of the names. Putting 13 million names into an array didn't work: the out of memory error came up again. I solved the problem by grabbing just the 432nd and 43,243rd names as I went and adding them to an array of my own. Not quite what the conditions asked for, but who said you can't be creative?

    Round 2. Trying the program out


    Yes, it's the same 2.55 GB file; we cross our fingers and watch the result.



    Success!

    As it turns out, Node.js out of the box is not quite suited to such tasks; its capabilities are somewhat limited. But once you extend them with modules, you can work with files even this large.
