Troubleshooting Node.js applications under the hood

    Since the advent of Node.js, diagnostics have improved in several stages, from application debugging to in-depth performance analysis. This time we will talk about strategies and tools - core dump debuggers and flame graphs - for dealing with production errors, memory leaks and performance degradation.


    The material is based on a transcript of a talk by Nikolai Matvienko of Grid Dynamics at our HolyJS 2017 Moscow conference in December.

    To understand what stage of the evolution of diagnostic tools we are at now, I suggest turning to history. I have marked the most significant milestones on the timeline.



    At first, tools such as DTrace, Instruments and perf were released - system profilers that appeared long before Node.js and are still used today. With the advent of Node.js came Node-oriented tools; here it is worth highlighting node-inspector, node-heapdump and node-memwatch. I marked this period gray and conditionally call it the time of troubles. You remember that there was a split in the community: at that time io.js separated from Node.js, and the community had no shared understanding of where to go next and what to choose. But in 2015 io.js and Node.js merged, and we got Node.js v4 Argon. The community gained a clear direction of development, and large companies such as Amazon, Netflix and Alibaba began using Node.js in production.

    At this point, effective tools that were ready for production use became popular.

    From 2017 onward, I conditionally distinguish a third stage.



    From this moment on, Node.js has been developing qualitatively. Worth noting are the full transition to TurboFan and the development of Async Hooks.

    In 2015, we shipped our release on Node.js v0.12 and gradually upgraded, first to the fourth version and later to the sixth. We are now planning the move to the eighth.

    In the year of that release we used first-stage tools, which were not really functional for production work. We have had good experience with the second-stage tools, and now we use tools of the third stage. But with each move to a new version of Node.js, we run into losing support for useful tools. In effect, we face a choice. Either we switch to a new version of Node.js that is better optimized and faster than the previous one, but we have no tools for it and cannot diagnose it. Or we stay on the old version of Node.js, which is slower, but we have tools for it and know its weak spots. Or there is a third option, when new tools appear that support, say, Node v4 and higher, but not the earlier versions.

    Our application is a large e-commerce project in which Node.js applications (a Web UI backend and Node.js microservices) occupy a central place in the enterprise architecture.



    A problem here can come from any direction. On days of increased load such as Black Friday, when the company earns 60% of its annual profit in a few days, maintaining high throughput is especially important. If a malfunction occurs, a war room is convened with the leads of all the systems present, including me for Node.js. If the problem is on the Node.js side, you need to quickly choose a strategy, pick the tools for it and fix the problem, because the company is losing thousands of dollars per minute.



    Below I will show, with concrete examples, how we act in production in cases of errors, memory leaks and performance degradation.

    Debugging applications in production


    There are a few premises for debugging in production.


    Consider a simple example: a product reservation service. The controller uses a service that sends a collection of products to a REST API and returns data, and an error occurs in the controller.





    In the picture you see the listing of ProductController. In the reserve method we get the profile from cookies; we also take the rewards account - where bonus points, id and other useful information are stored - from cookies as well. But that cookie is missing - undefined - so when id is read from rewards, an error is thrown. Locally, we would set a breakpoint (the red dot), get the state of all objects and find out why our code crashes.
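    A minimal sketch of what that controller might look like (the service and field names beyond ProductController, reserve, profile and rewards are assumptions):

        // Reading `id` from the missing rewards cookie throws a TypeError.
        class ProductController {
          reserve(request, response) {
            const profile = request.cookies.profile;  // present
            const rewards = request.cookies.rewards;  // missing -> undefined
            // TypeError: Cannot read property 'id' of undefined
            return this.reservationService.reserve(rewards.id, request.body.products);
          }
        }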



    I'll show you how this works under the hood, so you understand what a stack trace is. If we write the %DebugTrace() command - a native V8 command - and run Node.js with the --allow-natives-syntax option, that is, enable native V8 syntax, we get a big message with the stack trace.

    At the very top, in the second frame, we have the reserve function, i.e. the controller method, and we see there a link to the ProductController context - this - plus the request and response parameters; we also get the local variables, some of which live on the stack and the rest in the managed heap. We get the source code as well, and it would be nice to have something like this in production when debugging, in order to get the state of the objects.
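    A sketch of how to trigger it, as described above (the file name is an assumption):

        // debug-trace.js - run with: node --allow-natives-syntax debug-trace.js
        function reserve(request, response) {
          %DebugTrace();  // asks V8 to print the current frames, locals and context
        }
        reserve({ url: '/products/reserve' }, {});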



    Such support was implemented in 2015, starting with version 0.12. If we run Node with the flag --abort-on-uncaught-exception, a core dump is created - a memory snapshot of the whole process that contains the stack trace and a heap dump.

    For those who have forgotten, a reminder: memory conditionally consists of the call stack and the heap.



    In the call stack, the yellow bricks are frames. They contain function calls, and each call holds the input parameters, local variables and the return address. All objects, except small integers (SMIs), are stored in the managed heap, and the stack holds their addresses.

    We have figured out what a core dump consists of; now let's move on to the process of creating one.



    Everything is simple here: your application process is running, an error occurs, the process crashes and a core dump is created, then the application restarts. Next, you either connect to production remotely or pull the core dump to your local machine, debug it with a debugger, make a hotfix and publish it. Now that we understand what this process looks like, we need to find out which debuggers we can use.



    They all appeared before Node.js. To let them read the JS stack trace, several plugins were written for them at different times. Most are already outdated and do not support the latest versions of Node.js. I have tried almost all of them, and in this talk I want to highlight llnode. It is cross-platform (except Windows) and works with recent versions of Node.js, starting with the fourth. I will show how you can use this debugger to debug production errors.

    Consider the example described above - product reservations.



    We set up the environment: we enable core dump creation and run our Node application with the --abort-on-uncaught-exception flag. Next, we note the process pid, make a request as a user who has no rewards cookie, and get a message that an error occurred in ProductController, in the reserve method.
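    The setup might look like this (a sketch for a Linux box; app.js stands in for the real entry point):

        ulimit -c unlimited                          # allow core dumps of unlimited size
        node --abort-on-uncaught-exception app.js    # abort and dump core on an uncaught exception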



    Next, we get a message that the core dump - the memory snapshot - was created. Having received the core dump, or access to it, we check it: it weighs 1.3 GB. A lot.



    We open this core dump in the debugger with the llnode command and can get the stack trace with the bt command.
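    Along the lines of the following (the paths are assumptions; point llnode at your node binary and core file):

        llnode /usr/bin/node -c ./core    # load the core dump into lldb with the llnode plugin
        (llnode) bt                       # native backtrace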



    The image shows (above and below in more detail) that there is no information in frames 4 through 20.



    The fact is that this is the JS stack trace, which C++ debuggers cannot read. To read it, it is enough to use the plugin.



    Running the v8 bt command, we get the JS stack trace, where in the sixth frame we see the reserve function. The frame contains a link to the this context of ProductController and the parameters: IncomingMessage (request) and ServerResponse; the function's address in memory is highlighted in blue. Using this address we can get its source code from the core dump.

    Next, we write the v8 inspect --print-source [address] command and get the source code of our function from memory.



    Let's go back to the stack trace, where there were request and response parameters, and find out what was in request.



    In the image we see the v8 inspect command with the address. After it executes, we receive the state of the request object. Here we are interested in the cookies property, the url, the POST method and the body. Let's find out which products were sent to the server.

    Let's write the command v8 inspect [body address].



    We get an object whose products property is an array of three elements. We take its address, inspect it and get an array of ids.



    There is no problem here. Let's see what was in the req.cookies object.



    This is an object with two properties: profile, and rewards - which is undefined. Let's find out which user has no rewards account.



    We write the v8 inspect [profile address] command and get its id and storeid. Thus, in production, without having the sources at hand, we were able to recover the source code from memory and restore the user's session.

    That was a very convenient case. But if you use a framework that ships an error-handler middleware and you do not override that handler, then, unlike the previous example - where the stack trace gave us links to the this context, request and response - the stack trace will contain only the last calls inside Express itself and will point at nothing useful. You will have no links at all.



    I will give an example of how this happens.



    The same example with the reserve method, but now the error is thrown deep inside a service that works asynchronously. To catch it, we subscribe to the process's unhandledRejection event, where we call process.abort() to create a core dump.
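    A minimal sketch of that handler:

        // Abort on an unhandled promise rejection to force a core dump
        // (assumes core dumps are enabled, e.g. ulimit -c unlimited).
        process.on('unhandledRejection', (err) => {
          console.error(err);  // leave a trace in the logs
          process.abort();     // terminate the process and dump core
        });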



    The following shows what the backtrace looks like in this case.



    There is no information about ProductController or reserve. There is a call to process.on in this frame and nothing else useful. Such cases happen in production when an error was not caught and there is no information in the stack trace. Next I will show how to find all the objects while working only with the managed heap, without a stack trace.

    We can get all instances of objects of type IncomingMessage. This will be a long list of all the requests that were served, but we need to find the one that caused the process to crash.



    In this case we need the logs, and from them we need the request id - the X-Transaction-ID.



    llnode can search memory by the value of a string and find the references to it. If we write the command v8 findrefs --string [string value], we find an object that has an X-Transaction-ID property referring to this string. Let's find out what this object is.



    We write the v8 inspect [address] command and get the headers object. Next, we find out what refers to this object, and we get an IncomingMessage object. Thus, without a stack trace or addresses, we were able to restore the user's session.



    Let's delve into a more complex example, where you need to get local variables - they live in services. Suppose the error was thrown somewhere deep and there is little data. In this case a REST API is used: objects are fetched from the database and assigned to some local variable, say someLocalValue, and no object references someLocalValue. How do we get such a local variable out of memory? You can find the object, for example, by an id value taken from the logs, but you can also get objects from memory by the names of their properties: with the v8 findrefs name command we can find objects that have a userAccount or preferredStore property, and that is how we get the local variables we need!
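    To summarize the findrefs forms used in this section (the exact flags and the sample values here are illustrative and may vary between llnode versions):

        (llnode) v8 findrefs -v 0x3df9cbe7131      # by address: who references this object
        (llnode) v8 findrefs -n userAccount        # by property name
        (llnode) v8 findrefs -s my-transaction-id  # by string value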

    Now let's take on a real case: errors in production.



    For each error we create a core dump. But what is wrong here? There are a lot of errors - 3620 per minute - a core dump weighs 1.3 GB, and creating it and flushing it to disk takes about 15-20 seconds. What is clearly missing is logic that selects when to create a core dump.

    For example, your application is running and an error occurs, or you caught an error that is hard to reproduce. The process then gets into the ErrorHandler, where your logic decides: should we call process.abort for this error to create a core dump, or should we continue executing the program?

    If the error is unique and needs to be processed, a report is generated using node-report, then process.abort is called, a core dump is created, and the process is restarted. Your server is back up and serving traffic, and only later do you get access to the report, the core dump and the logs; then, using the debugger, you debug the problem, make a hotfix and publish it.



    The image below shows an example of the logic that does this selection. I call it the error registry. It limits core dump creation.



    It is a class backed by an in-memory database: when an error occurs, you check whether it is already registered in the registry. If not, you decide whether it needs to be registered; if so, you abort the process and create a core dump. This way you will not create core dumps for repeated errors - you select them - which speeds up the debugging process. A sketch follows below.
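    A sketch of such a registry (the in-memory Set stands in for the in-memory database the original used; the handler names are assumptions):

        const seen = new Set();  // registered error signatures

        function handleFatalError(err) {
          console.error(err);  // always leave a trace in the logs
          const key = `${err.name}: ${err.message}`;
          if (!seen.has(key)) {
            seen.add(key);     // first occurrence: register it...
            process.abort();   // ...and dump core for later debugging
          }
          // repeated error: keep serving traffic
        }

        process.on('uncaughtException', handleFatalError);
        process.on('unhandledRejection', handleFatalError);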



    I believe the debugging process in production is a kind of detective investigation, where node-report tells you what happened, where and under what circumstances a crime was committed against your process - the JS stack trace. The logs give you clues and evidence - valuable information. The debugger lets you find the objects that were present at the scene of the crime, that is, in those functions.

    This is how we work in production and debug errors.

    Search for memory leaks


    Let's move on to finding memory leaks.



    There are many ways to detect memory leaks (some are mentioned in the image above).



    For example, you can notice a certain correlation between the CPU load and memory graphs: when you have a memory leak, the garbage collector is busy and takes up more CPU time.

    Consider an example of a leak.



    The image above shows the garbage collector graph. Scavenge works in the young object space, Mark-Sweep in the old object space. The graph shows that the number of garbage collections is growing, which means memory is leaking and collections are needed more and more often. To understand where and in which space it leaks, we use the garbage collector trace.
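    The trace is enabled with V8 flags (app.js stands in for your entry point):

        node --trace_gc app.js                       # one line per collection: sizes before/after, pause time
        node --trace_gc --trace_gc_verbose app.js    # adds a per-space breakdown on each collection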



    The image above shows a list of memory spaces and their state at each garbage collection iteration. There are old space, new space and others. When we retrieved the source code from the core dump earlier, we took it from code space.



    From experience I can say: if memory leaks in the new object space, expensive cloning, merges and the like are to blame. If memory leaks in the old object space, it is usually forgotten closures and timers. And if memory leaks in large object space, where the garbage collector does not work, it is the serialization / deserialization of large JSON.

    The image below shows an example of a Scavenge trace, which works in the young object space. There is no memory leak here.



    How can we tell? Look at the two columns in bold green: this is the memory allocated to your process before and after a collection. The values are equal, which means the allocated memory did not grow after garbage collection. The blue font shows the time spent on garbage collection. The documentation states that it takes about 1 millisecond by default; in our project it is about 3 milliseconds, and we consider that the norm. The three lines visible at the top of the lower image are also garbage collection, but incremental - not all of your memory is scanned and cleaned, but piece by piece. The number of iterations is shown as well.

    The image below shows an example of a memory leak in the same space. In the lilac columns you can see that the amount of occupied memory is growing, that is, it increases after each garbage collection.



    How are things in the space of old objects?



    Here we see that the process takes longer: 90 ms, 110 ms, and so on. That is a lot. Moreover, it is also incremental, which additionally takes time in the background thread.

    Under the hood, Node.js has several threads. One for profiling, several for garbage collection, and a thread for executing the application itself.

    Consider the example below to understand how to detect memory leaks.



    Notice that the middleware that processes the request captures the start time at the beginning and, when the request has finished, captures the time after execution, at which point some processing function is called - someProcessOperation - and we pass the request to it. This is a closure: the request object is retained after the response has been sent, and there may be other references to it, which keeps a lot of objects alive. What should we do here? We do not need the entire request object - we can pass only its id and url. This way we avoid the memory leak.
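    A sketch of both versions (the middleware wiring and the id source are assumptions; someProcessOperation is from the example above):

        // Leaky: the closure keeps the whole `req` alive after the response,
        // and someProcessOperation may queue it, retaining it even longer.
        app.use((req, res, next) => {
          const start = Date.now();
          res.on('finish', () => someProcessOperation(req, Date.now() - start));
          next();
        });

        // Fixed: pass only the fields we actually need.
        app.use((req, res, next) => {
          const start = Date.now();
          const id = req.headers['x-transaction-id'];
          const url = req.url;
          res.on('finish', () => someProcessOperation({ id, url }, Date.now() - start));
          next();
        });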



    How can I find out?



    I will show how we do this with the llnode debugger (in the case of a fatal error, when Node.js creates a core dump after the application crashes from lack of memory). In the previous case, we got a core dump when an exception was thrown and the process terminated. This time, of course, we did not call process.abort() on a running application to get a core dump; instead we armed ourselves with the gcore tool, which lets you pass a process id at the infrastructure level and get a core dump on demand. However, here we ran into an infrastructure constraint, which is why we had to look for other options.

    A month ago, a library called gen-core was published. It allows you to fork your process, dump the fork's core and kill it without harming your main process, thus getting a core dump on demand.

    The image below shows an example of how gen-core works.



    Here a DiagnosticController makes it possible to take a core dump via the REST API.
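    A sketch of such an endpoint. The gen-core API shown here is an assumption based on the description above, not verified documentation:

        const genCore = require('gen-core');  // hypothetical usage

        app.get('/diagnostics/core-dump', (req, res) => {
          // assumed API: fork the process, dump the child's core, return the path
          genCore.dump((err, corePath) => {
            if (err) return res.status(500).send(err.message);
            res.send({ corePath });
          });
        });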

    How to search for memory leaks using llnode




    We have the findjsobjects command, which scans the entire heap, groups objects by type and reports the number of instances of each type, their size in memory, and so on. The image above shows that there are many ServerResponse and IncomingMessage instances, and there are also Timeout and String objects. The strings themselves are very large and take up a lot of memory; they often live in large object space, where the garbage collector does not work.

    Let's find out why we have so many request objects. Remember that the request object was being held in the middleware.

    There is a findjsinstances [type] command, and with it we get a list of request objects.



    We take the address of the first object and, using the v8 findrefs -v command, look for the objects in memory that reference it.



    There are two: the first is an array of 18 elements, and the second is a request object. Let's find out what this array is and why it holds our request objects.



    We use the v8 findrefs -v [array address] command and learn that this array lives in an object of type Profiller, in its processingQueue property. Thus we found out where our request objects are being held in memory. Next, we go to the source code - or retrieve it from the dump - and see where processingQueue is used.



    We use the v8 findjsinstances Timeout command, get a list of all timers, and then look for where they are used.



    Using the v8 i command, we inspect the first timeout.



    In the onTimeout property we can see the function that setInterval was called with, and the address with the file location. Thus, we quickly found out where we had a forgotten timer and why it keeps running.



    How to identify a memory leak at the development stage of the application


    There is the %DebugTrackRetainingPath(x) command, recently added to V8 (Node.js does not support it yet). If you download the V8 sources, build them, and run the JS sketched below in the V8 debug shell, you get the whole chain of references holding a variable, all the way up to the root object. Do not forget to invoke the garbage collector - in the example, that is the gc() call.
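    A sketch of such a test (the variable names are assumptions):

        // test.js - run with:
        //   d8 --allow-natives-syntax --track-retaining-path --expose-gc test.js
        let retained = { data: 'find me' };
        %DebugTrackRetainingPath(retained);  // ask V8 to report who retains this object
        const closure = () => retained;      // the closure keeps the object alive
        gc();                                // on collection, V8 prints the retaining path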



    We run a debug build of d8 with the flags --allow-natives-syntax, --track-retaining-path, --expose-gc and the file test.js. In the listing we get, step by step, what holds the object, all the way up to the root.

    Thus, we examined the search for memory leaks in production.

    Application Performance Profiling



    In the image above you see a real example: after a release, our performance dropped five-fold. Web UI load times increased dramatically.

    For such cases, there is a simple and effective algorithm of actions, which is demonstrated below.



    Although a profiler was added to Node.js back in 2013, the tick processor - the tool that turns V8 logs into a report - only appeared there last year. Before that, we either used the V8 profiler, which had its own tick processor, or had to download V8 itself, build it and use the tick processor from its tools.

    The profiler gives you high-level information: it helps determine the nature of the problem and shows which functions took the longest to execute.
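    With the built-in support, a profiling session boils down to two commands (the log file name will vary):

        node --prof app.js                      # writes a V8 log such as isolate-0x...-v8.log
        node --prof-process isolate-0x*-v8.log  # turns the log into a tick report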



    There is a convenient graphical representation of this log data called the flame graph, developed in 2013 by Brendan Gregg.

    Those who remember will confirm: in 2015, getting such a graph was a real headache, but now it is created with a single command using 0x.
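    Roughly like this (app.js again stands in for the entry point):

        npm install -g 0x   # the 0x flame graph tool
        0x app.js           # profile the process and generate an interactive flame graph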



    Now let's figure out how to work with it.

    To isolate the profile of our own application, we use the buttons at the bottom to filter out everything unnecessary. As a result, we see that the application itself takes up only a small part of the total execution time.



    The application's own stack of function calls is finite and fits on the screen. So what takes so long? If we turn on highlighting of npm packages, we see that they run nine times longer than the application itself. That is a lot.



    If we highlight the optimized functions, we see that there are very few of them: about 90% of calls were never optimized.



    If we look at the top of the calls, we see that many of our application's stacks were in handlebars; in addition, we had a lodash merge that merged objects, and third-party utilities that also took time.



    If we analyze the application's part of the same graph, we see that its basis is the native Promise, which was slow before TurboFan. But the most time-consuming part is page rendering. Our application, in percentage terms, is a tenth of the whole logic.



    What have we fixed?

    1. Reducing the CPU cost of expensive operations: merge, JSON and string parsing
    2. Removing logic from view templates
    3. Reviewing npm dependencies and updating packages
    4. Updating Node.js (bypassing V8 optimization killers) - in progress

    Below is the graph that we see after the corrections are made.



    Here it is worth paying attention to the right side, where the handlebars helpers are. Previously, view rendering occupied more than half of the graph; now it is a tenth. We optimized things quite well, and now the application itself accounts for about half the time. That does not mean it got slower - everything else simply started taking less time.



    The image above shows that even before the critical release the chart was creeping up, and we were gradually losing performance; then came the sharp jump. We spent 3.5 weeks in that critical period, looking for bugs and making changes.

    The essence of the problem was that the bug (parsing of JSON strings) had been in the application long before this release and only fired at a certain point. In the end, after all the fixes and optimizations, we doubled performance compared with the initial baseline.

    Conclusion


    • Performance must be part of the requirements.
    • Measurements must be made when introducing third-party libraries.
    • Measurements should be done on staging / pre-production environments.
    • An archive of measurement results must be kept.





    We hope you enjoyed Nikolai's experience. By the way, HolyJS 2018 Piter kicks off this Saturday, where Nikolai will talk about decomposing the main thread in Node.js to increase throughput.

