Nodeload2: Download Engine - Reboot

Original author: technoweenie
  • Translation
Nodeload, the first GitHub team project built with Node.js, recently turned one year old. Nodeload is the service that packs the contents of a Git repository into ZIP archives and tarballs. Over the past year the load on the service has kept growing, and we have run into various problems along the way. Read about the origins of Nodeload if you don't remember why it works the way it does now.

Essentially, we had too many requests going through a single Nodeload server. These requests spawned `git archive` processes, which in turn opened SSH connections to the file servers. They constantly wrote gigabytes of data to disk and then pushed all of it through nginx. One simple option was to order more servers, but that would duplicate the cache of archived repositories, which I wanted to avoid if possible. So I decided to start over and rewrite Nodeload from scratch.

Now the Nodeload server works only as a simple proxy app. The proxy looks up the appropriate file server for the requested repository and streams the data directly from that file server. The file archiver now runs on the file servers themselves and is basically an HTTP interface to `git archive`. Cached archives are now written to a tmpfs partition to reduce the load on the I/O subsystem. The Nodeload proxy also prefers the backup file servers over the active ones, shifting most of the load onto the mostly idle backups.

Node.js is a great fit for this app because of its excellent streaming API. When implementing any kind of proxy, you have to deal with clients that cannot read data as fast as you can send it. When an HTTP server response stream cannot accept more data, its `write()` call returns `false`. You can then pause the proxied HTTP request stream until the response object emits a `drain` event. The `drain` event means the response object is ready to send more data, and you can now resume the proxied stream. This logic is completely encapsulated in the `ReadableStream.pipe()` method:
// proxy the file stream to the outgoing HTTP response
var reader = fs.createReadStream('some/file');
reader.pipe(res);

Heavy launch


After the launch, we came across some strange problems over the weekend:
  • Nodeload servers still had a heavy I/O load;
  • backup file servers used up all available RAM;
  • Nodeload servers used up all available RAM;
  • `top` and `ps` didn't show the nodeload processes growing in size. The nodeload processes appeared to be working fine, yet we watched the available server memory slowly shrink.
The high I/O turned out to be caused by the nginx `proxy_buffering` option. As soon as we turned it off, I/O dropped sharply. With buffering off, data flows at the speed of the client: if clients cannot download the archive fast enough, the proxy pauses the HTTP request stream. The pause propagates back to the archiver app, which pauses the file stream.
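For reference, the relevant nginx change looks roughly like this. The `location` path and upstream name are hypothetical; only `proxy_buffering off` is the actual setting described above.

```nginx
# Hypothetical location block for archive downloads.
location /archives/ {
    proxy_pass http://nodeload;  # "nodeload" upstream name is an assumption
    # Disable buffering so backpressure propagates through the proxied
    # stream instead of nginx spooling responses to disk.
    proxy_buffering off;
}
```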

To track down the memory leak, I tried installing v8-profiler (including the Felix Gnass patch that shows heap retainers, i.e. objects that keep the GC from freeing other objects) and used node-inspector to monitor live Node processes in production. The WebKit Web Inspector works great for profiling an app, but it didn't show any obvious memory leak.

By that time, @tmm1, @rtomayko and @rodjek had come to the rescue to brainstorm other possible causes. They eventually tracked the leak down to an accumulation of open file descriptors on the node processes:
tmm1@arch1:~$ sudo lsof -nPp 17655 | grep ":7005 ("
node 17655 git 16u IPv4 8057958 TCP 172.17.1.40:9232->172.17.0.148:7005 (ESTABLISHED)
node 17655 git 21u IPv4 8027784 TCP 172.17.1.40:8054->172.17.0.133:7005 (ESTABLISHED)
node 17655 git 22u IPv4 8058226 TCP 172.17.1.40:2498->172.17.0.134:7005 (ESTABLISHED)
This happened because read streams were not being closed properly when clients aborted a download. That left file descriptors open on the Nodeload server as well as on the file servers. It got bad enough that nagios warned us about the /data/archives partition filling up when it held only 20 MB of archives: the open file descriptors kept the server from reclaiming the space of removed archive caches.

The fix for this problem is to handle the `close` event on the HTTP request object on the server. `pipe()` does not handle this case, because it is written against the generic readable stream API. The `close` event is different from the more common `end` event: `close` means the HTTP request stream was aborted before `response.end()` could be called.
// check to see if the request is closed already
if (req.connection.destroyed) {
  return;
}
var reader = fs.createReadStream('/some/file');
req.on('close', function() {
  reader.destroy();
});
reader.pipe(res);

Conclusion


Nodeload is now more stable than before. The rewritten code is simpler and better tested. Node.js is working out great. And because we use HTTP everywhere, we can easily swap out any of the components. Our main goal now is to add better instrumentation for monitoring Nodeload and to keep improving the reliability of the service.
