How not to use the Node.js Stream API
Once again, someone on the Internet is wrong: yesterday's Node Weekly linked to a post in which the author tries to measure and compare the performance of the Node.js Stream API against its alternatives. It is saddening how the author works with streams and what conclusions he tries to draw from it:
> ... this worked out pretty well. Although it has been streaming its memory
Let's try to figure out what is wrong with the author's conclusions and code.
From my point of view, the problem is that the author of the article does not know how to use streams, and this is a problem one runs into quite often. In my opinion, this has three causes:
- The complicated history of the Node.js Stream API - the pain and suffering are described here.
- An API that is not the most intuitive if you try to use it without any wrappers.
- Rather strange documentation that presents streams as something very complex and low-level.
Taken together, this means that developers often do not know how to use the Stream API, and do not want to.
What is wrong with the author's code?
To begin, let's restate the task (the original is in English; a link to the file can be found in the post):
There is a 2.5 GB file with lines of the form:
C00084871|N|M3|P|201703099050762757|15|IND|COLLINS, DARREN ROBERT|SOUTHLAKE|TX|760928782|CELANESE|VPCHOP&TECH|02282017|153||PR2552193345215|1151824||P/R DEDUCTION ($76.92 BI-WEEKLY)|4030920171380058715
You need to parse it and find out the following information:
- Number of lines in the file
- The names on the 432nd and 43243rd lines (which immediately raises the question: do we count from 0 or from 1?)
- The most common name and how many times it occurs
- The number of contributions for each month
So what is the problem? The author honestly says that he loads the entire file into memory, which makes Node "hang", and then offers us an interesting fact:
> Fun fact: Node.js can only hold up to 1.67GB in memory at any one time
From this fact the author draws the strange conclusion that it is the Stream API that loads the entire file into memory, rather than his own incorrect code.
Let's refute the thesis "Although Node.js was streaming the whole file" by writing a small program that counts the number of lines in a file of any size:
```javascript
const { Writable } = require('stream')
const fs = require('fs')
const split = require('split')

let counter = 0

const linecounter = new Writable({
  write(chunk, encoding, callback) {
    counter = counter + 1
    callback()
  },
  writev(chunks, callback) {
    counter = counter + chunks.length
    callback()
  }
})

fs.createReadStream('itcont.txt')
  .pipe(split())
  .pipe(linecounter)

linecounter.on('finish', function() {
  console.log(counter)
})
```
NB: the code is intentionally written as simply as possible. Global variables are bad!
What you should pay attention to:
- split - an npm package that takes a stream of text at its "input" and returns a stream of lines, split on line breaks, at its "output". It is most likely implemented as a Transform stream. We pipe our file's ReadStream into it, and pipe its output onward ...
- linecounter - a Writable stream implementation. In it we implement two methods: one for processing a single chunk and one for processing several at once. A "chunk" in this situation is one line of the file. When processing, we add the appropriate number to the counter. It is important to understand: we never load the entire file into memory; the API divides everything into chunks that are convenient to process.
- 'finish' - the event that fires when the data arriving at our Writable stream has ended. When it happens, we log the counter.
Well, let's test our creation on a large file:
```
> node linecounter.js
13903993
```
As we can see, everything works. From this we can conclude that the Stream API copes excellently with files of any size, and the statement of the post's author is, to put it mildly, untrue. In much the same way we can compute any other value required by the task.
Tell us:
- Would you be interested to read how to solve the task completely, and how to refactor the resulting code into a maintainable form?
- Do you use the Stream API, and what difficulties have you run into?