Finding Performance, Part 2: Profiling Java under Linux
They say you can watch fire, water and other people working forever, but there is one more thing: we are sure you can talk to Sasha Goldstein (goldshtn) about performance forever. We already interviewed Sasha before JPoint 2017, but that conversation was specifically about BPF, the subject of his talk.
This time we decided to dig deeper and find out the fundamental problems of performance monitoring and their solutions.


Where to start
- Last time, we talked in some detail about BPF and briefly discussed the problems of monitoring Java performance under Linux. This time I would like to concentrate not on a specific tool, but on problems and finding solutions. The first question is quite commonplace: how do you tell that you have performance problems at all? Should you even think about it if users are not complaining?
Sasha Goldstein: If you start thinking about performance only when your users complain, they will not stay with you for long. For many people, performance engineering means troubleshooting and crisis mode: phones ringing, lights blinking, the system down, the keyboard on fire. Those are supposedly the usual workdays of a performance engineer. In reality, a good performance engineer spends most of their time planning, designing, monitoring and preventing crises.
To begin with, capacity planning is an assessment of the expected system load and resource usage; scalability design helps you avoid bottlenecks and absorb significant increases in load; instrumentation and monitoring are vital for understanding what is going on inside the system so that you do not dig blindly; automated alerting makes sure you learn about problems as they arise, usually before users start complaining; and of course there will still be isolated crises that have to be solved under stress.
It is worth noting that the tools change constantly, but the process itself stays the same. To give a couple of concrete examples: capacity planning can be done on the back of a napkin; for end-to-end instrumentation and monitoring you can use APM solutions (such as New Relic or Plumbr), and ab or JMeter for quick load testing, and so on. To learn more, read Brendan Gregg's book Systems Performance, an excellent source on the performance life cycle and methodology, and Google's Site Reliability Engineering, which covers setting performance targets (Service Level Objectives) and monitoring them.
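For instance, a quick smoke test of an HTTP endpoint with ab might look like this (the URL is a placeholder; a real load test would model your actual traffic mix):

    # 10,000 requests, 50 at a time, against a hypothetical endpoint
    ab -n 10000 -c 50 http://localhost:8080/api/health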
- Suppose we understand that there is a problem: where do we start? It often seems to me that many people (especially non-professional performance engineers) are immediately ready to pull out JMH, rewrite everything with Unsafe and “hack the compilers”, and only then look at what actually happened. Is it fair to say that this is the wrong place to start?
Sasha Goldstein: It is fairly common, while writing code and running basic profiler tests, to find performance problems that can be fixed easily by changing the code or hacking the compilers. In production, however, in my experience this happens much less often. Many problems are specific to a single environment, are caused by changing workload patterns, or come from bottlenecks outside your application's code; only a small fraction can be microbenchmarked and improved at the source level with clever hacks.
Here are a few examples to illustrate:
- A couple of years ago, Datadog ran into a problem where inserts and updates in a PostgreSQL database jumped from 50 ms to 800 ms. They were running on AWS EBS with SSDs. Instead of tuning the database or changing application code, they found that EBS throttling was to blame: EBS volumes have an IOPS quota, and once you exceed it your I/O gets rate-limited.
- Recently I worked with a user who saw huge spikes in server response time caused by garbage collection delays. Some requests took more than 5 seconds, and the spikes appeared completely haphazardly as garbage collection went out of control. After carefully examining the system we found that the application's allocation behavior and garbage collector tuning were fine; instead, a jump in workload size had increased actual memory usage and triggered swapping, which is absolutely fatal for any garbage collector (if the collector has to page memory in and out just to mark live objects, it is game over). A quick way to check for this from the host is sketched after this list.
- A couple of months ago, Sysdig hit a container isolation problem: file system operations performed by container Y became much slower whenever it ran next to container X, even though memory usage and CPU load of both containers were very low. After some research they found that container X was thrashing the kernel's directory entry cache, which caused hash table collisions and, as a result, a significant slowdown. Again, changing application code or container resource allocation would not have solved this problem.
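As an aside on the second example, checking whether a host is swapping, and whether a particular JVM has pages swapped out, takes only standard Linux tools ($PID below is a placeholder for the JVM process ID):

    # watch the si/so columns for swap-in/swap-out activity
    vmstat 1 5

    # how much of this specific JVM is currently swapped out
    grep VmSwap /proc/$PID/status

    # per-device latency and utilization, useful when you suspect EBS throttling
    iostat -x 1 5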
I understand that it is often much easier to focus on things you can control, such as application-level hacks. Psychologically this is understandable: it does not require deep knowledge of the system or the environment, and for some reason it is considered “cooler” in some cultures. But reaching for it first is wrong.
- Surely you should first look at how the application or service behaves in production. Which tools do you recommend for this, and which would you advise against?
Sasha Goldstein: Monitoring and profiling in production is a layered set of tools and techniques.
We start with high-level performance metrics, focusing on resource usage (CPU, memory, disk, network) and load characteristics (number of requests, errors, request types, number of database queries). There are standard tools for obtaining this data on every operating system and runtime: on Linux you would typically use vmstat, iostat, sar, ifconfig and pidstat; for the JVM, JMX-based tools or jstat. These metrics can be collected continuously into a database, perhaps at a 5- or 30-second interval, so that you can analyze spikes and, when needed, go back and correlate them with deployments, releases, external events or workload changes. Importantly, many people collect only averages; averages are fine, but by definition they do not represent the full distribution of what you are measuring. It is much better to collect percentiles, and ideally even histograms.
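A minimal sketch of this kind of continuous collection might look like the following (in practice you would ship the samples into a time-series database rather than a flat file; the PID is a placeholder):

    PID=12345   # the JVM to watch
    while true; do
        date +%FT%T
        pidstat -u -r -d -p $PID 1 1   # CPU, memory and disk I/O for the process
        jstat -gcutil $PID             # heap occupancy and GC counters
        sleep 5
    done >> metrics.log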
The next level is event-level data, which usually cannot be collected continuously or stored for long: garbage collection logs, network requests, database queries, class loading, and so on. Making sense of this data once it has been stored somewhere is sometimes much harder than collecting it. It does, however, let you ask questions such as “which requests were running while the database CPU spiked to 100%” or “what were the disks' IOPS and response times while this request was executing”. Bare numbers, especially averages, will not let you do this kind of investigation.
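GC logs are one example of event data that is cheap to record continuously. A JDK 8-style configuration with rotation might look like this (paths are placeholders; JDK 9 and later replace these flags with -Xlog:gc*):

    java -Xloggc:/var/log/myapp/gc.log \
         -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M \
         -jar myapp.jar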
And finally, the “hardcore” level: SSH into the server (or launch tools remotely) to collect deeper data that cannot be recorded during the normal operation of the service. These are the tools commonly referred to as profilers.
For profiling Java in production there are plenty of dreadful tools that not only introduce a lot of overhead and pauses but can also outright lie to you. Even though the ecosystem has been around for 20 years, there are only a few reliable, low-overhead profiling techniques for JVM applications. I can recommend Richard Warburton's Honest Profiler, Andrei Pangin's async-profiler and, of course, my favorite, perf.
By the way, many tools focus on CPU profiling, that is, understanding which code paths are responsible for high CPU utilization. That is great, but often it is not the problem; we need tools that can show the code paths responsible for memory allocations (async-profiler can now do this too), page faults, cache misses, disk accesses, network requests, database queries and other events. Finding the right performance tools for investigating production environments is exactly what drew me to this area.
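For example, an allocation profile with async-profiler can be captured roughly like this (exact options vary between versions, so check profiler.sh --help; $PID is a placeholder):

    # 30-second allocation profile of a running JVM, rendered as a flame graph
    ./profiler.sh -d 30 -e alloc -f /tmp/alloc-flamegraph.svg $PID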
Java Profiling for Linux
- I have heard that the Java-on-Linux stack has a lot of problems with measurement reliability. Surely there is some way to fight this. How do you do it?
Sasha Goldstein: Yes, it is a sad story. Here is what the current situation looks like: you have a fast conveyor belt carrying a huge number of different parts, and you need to inspect them to find defects and understand how fast the line runs. You cannot inspect absolutely every part, so your strategy is to check one part per second and see whether everything is in order, and you have to do this through a tiny window above the belt, because getting closer is dangerous. Sounds workable, doesn't it? But then it turns out that when you look through the window, it does not show you what is on the belt right now; it waits until the belt enters a magical “safe” mode and only then lets you see anything. It also turns out that you will never see many of the parts, because the belt cannot enter its “safe” mode while they are passing by; and on top of that, spotting a defect through the window takes a whole 5 seconds, so doing it every second is impossible.
That is roughly the state of many profilers in the JVM world today. YourKit, jstack, JProfiler, VisualVM: they all take the same approach to CPU profiling, sampling threads at safepoints. They use a documented API to suspend all JVM threads and take their stack traces, which they then aggregate into a report of the hottest methods and stacks.
The problem with suspending the process this way is that threads do not stop immediately: the runtime waits until each of them reaches a safepoint, which may be many instructions or even whole methods away. As a result you get a biased picture of the application, and different profilers may even disagree with each other!
There is research showing how bad this gets, with each profiler having its own opinion about the hottest method in the same workload (Mytkowicz et al., “Evaluating the Accuracy of Java Profilers”). Moreover, if you have 1000 threads with complex Spring call stacks, collecting the stack traces itself becomes expensive, so you can sample perhaps no more than 10 times per second. As a result, your stack data diverges from the actual workload even further!
Solving these problems is not easy, but it is worth the investment: some production workloads simply cannot be profiled with “traditional” tools like the ones listed above.
There are two separate approaches and one hybrid:
- Richard Warburton's Honest Profiler uses an internal, undocumented API, AsyncGetCallTrace, which returns the stack trace of a single thread, does not require reaching a safepoint, and can be called from a signal handler. It was originally designed for Oracle Developer Studio. The basic approach is to install a signal handler, arrange for the signal to fire at a fixed rate (for example, 100 Hz), and then capture the stack trace of whichever thread is running inside the signal handler. There are obviously hard problems around aggregating the stack traces efficiently, especially in the context of a signal handler, but the approach works very well. (Java Flight Recorder, which requires a commercial license, uses the same approach.)
- Linux perf can provide rich sampling of stacks (not only on CPU cycles but also on other events, such as disk accesses and network operations). The problem is translating the addresses of JIT-compiled Java methods into method names, which requires a JVMTI agent that emits a text file (a perf map) that perf can read and use. There are also problems reconstructing stacks if the JIT omits frame pointers. This approach works well but requires some preparation; in return you get stack traces not only for JVM threads and Java methods but for every thread in the system, including kernel and C++ stacks. (A rough sketch of this flow follows this list.)
- Andrei Pangin's async-profiler combines the two approaches. It sets up sampling with perf_events, but also uses a signal handler that calls AsyncGetCallTrace to obtain the Java stack. Merging the two stacks gives a complete picture of what is happening in the thread while avoiding both the Java method name resolution problem and the frame pointer omission problem.
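A rough sketch of the perf-based flow from the second option, assuming perf-map-agent and Brendan Gregg's FlameGraph scripts are checked out locally and $PID is a placeholder:

    # the JVM should run with -XX:+PreserveFramePointer (JDK 8u60+) for full Java stacks
    perf record -F 99 -g -p $PID -- sleep 30

    # generate /tmp/perf-$PID.map so perf can resolve JIT-compiled method names
    ./perf-map-agent/bin/create-java-perf-map.sh $PID

    # fold the samples into a flame graph
    perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu.svg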
Any of these options is far better than safepoint-biased profilers in terms of accuracy, resource consumption and sampling frequency. They can be tricky to set up, but I think accurate, low-overhead profiling of production is worth the effort.
Container profiling
- Speaking of environments: it is now fashionable to pack everything into containers. Does anything change there? What should you keep in mind when working with containerized applications?
Sasha Goldstein: Containers pose interesting problems that many tools completely ignore, and as a result simply stop working.
Let me briefly recall that Linux containers are built around two key technologies: control groups (cgroups) and namespaces. Control groups let you impose resource quotas on a process or a group of processes: CPU time caps, memory limits, storage IOPS, and so on. Namespaces make container isolation possible: the mount namespace gives each container its own mount points (in effect, a separate file system), the PID namespace its own process identifiers, the network namespace its own network interfaces, and so on. Because of namespaces, many tools have a hard time talking to containerized JVM applications correctly (although some of these problems are not unique to the JVM).
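A quick way to see both mechanisms from the host (the container name is a placeholder, and the cgroup path assumes Docker's default cgroup v1 layout):

    # host-side PID of the container's main process
    PID=$(docker inspect --format '{{.State.Pid}}' myapp)

    # the namespaces this process belongs to
    ls -l /proc/$PID/ns

    # its cgroup membership and, for example, its memory limit
    cat /proc/$PID/cgroup
    cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes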
Before discussing specific issues, it is worth briefly going over the different observability tools for the JVM (a few representative invocations are sketched after this list). If you have not heard of some of them, it is time to refresh your knowledge:
- basic tools like jps and jinfo provide information about existing JVM processes and their configuration;
- jstack can be used to get a thread dump (stack traces) from running JVM processes;
- jmap can be used to get a heap dump of running JVM processes, or simpler class histograms;
- jcmd replaces most of the previous tools and sends commands to running JVM processes through the JVM attach interface, which is based on a UNIX domain socket that the JVM process and jcmd use to exchange data;
- jstat is for monitoring basic JVM performance information such as class loading, JIT compilation and garbage collection statistics; it relies on the JVM generating /tmp/hsperfdata_$UID/$PID files with this data in a binary format;
- the Serviceability Agent provides an interface for inspecting the memory of JVM processes, threads, stacks and so on, and can be used against a memory dump as well as live processes; it works by reading process memory and internal data structures;
- JMX (MBeans) can be used to obtain performance information from a running process, as well as to send commands that control its behavior;
- JVMTI agents can hook into various interesting JVM events, such as class loading, method compilation, thread start/stop, monitor contention, and so on.
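A few representative invocations of these tools, assuming a JVM with a placeholder $PID:

    jps -lv                        # which JVMs are running, and with what flags
    jcmd $PID VM.flags             # effective JVM flags
    jcmd $PID Thread.print         # thread dump, same data as jstack
    jcmd $PID GC.class_histogram   # per-class instance counts, like jmap -histo
    jstat -gcutil $PID 1s          # GC counters sampled every second (hsperfdata)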
Here are some problems that arise when profiling and monitoring containers from the host, and how to solve them (in the future I will try to tell you more about this):
- most tools require access to the process binaries to resolve symbols. Those binaries live in the container's mount namespace and are not visible from the host. This can be partly worked around by bind-mounting from the container, or by having the profiler enter the container's mount namespace while resolving symbols (this is what perf and the BCC tools I described in the previous interview now do);
- if you have a JVMTI agent that generates a perf map (e.g. perf-map-agent), it will be written to the container's /tmp using the process ID as the container sees it (e.g. /tmp/perf-1.map). The map file has to be made accessible to the host, and the host expects its own view of the process ID in the file name (again, perf and BCC can now handle this automatically);
- the JVM attach interface (which jcmd, jinfo, jstack and some other tools rely on) requires the attach file to be created with the right PID and in the right mount namespace, as well as a UNIX domain socket used to communicate with the JVM. This can be handled with the jattach utility, by creating the attach file after entering the container's namespaces, or by bind-mounting the relevant directories onto the host (see the sketch after this list);
- using the JVM performance data files (in /tmp/hsperfdata_$UID/$PID) that jstat relies on requires access to the container's mount namespace. This is easily addressed by bind-mounting the container's /tmp onto the host;
- the easiest approach with JMX-based tools is probably to treat the JVM as if it were remote: configure an RMI endpoint, just as you would for remote diagnostics;
- Serviceability Agent tools require an exact version match between the JVM process and the tools. Needless to say, you should not run them from the host, especially if it uses a different distribution or has different JVM versions installed.
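To make the attach-interface and perf-map items above concrete, here is a hedged sketch of working with a containerized JVM from the host (container name and PIDs are placeholders; the NSpid field requires a reasonably recent kernel, and recent jattach versions handle the namespace switch themselves):

    # host-side PID of the containerized JVM
    HOST_PID=$(docker inspect --format '{{.State.Pid}}' myapp)

    # the PID the JVM sees inside its own PID namespace (assumes one level of nesting)
    NS_PID=$(awk '/NSpid/ {print $3}' /proc/$HOST_PID/status)

    # copy the in-container perf map to where host-side perf expects to find it
    docker cp myapp:/tmp/perf-$NS_PID.map /tmp/perf-$HOST_PID.map

    # thread dump over the attach interface without needing a JDK in the container
    jattach $HOST_PID threaddump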
You might think at this point: what if I simply put the performance tools inside the container, so that all these isolation problems disappear? The idea is not bad, but many tools will not work in this configuration because of seccomp. Docker, for example, rejects the perf_event_open system call, which perf and async-profiler need for profiling; it also rejects the ptrace system call, which a large number of tools use to read the memory of the JVM process. Changing the seccomp policy to allow these system calls puts the host at risk, and by placing profiling tools inside the container you also increase its attack surface.
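For completeness, the relaxations mentioned above look roughly like this; they put the host at risk, so treat them as a last resort on a dedicated debugging host:

    # allow ptrace-based tools inside the container
    docker run --cap-add=SYS_PTRACE myimage

    # or disable the default seccomp profile entirely (much broader exposure)
    docker run --security-opt seccomp=unconfined myimage

    # perf_event_open also usually needs the host-wide paranoia level lowered
    sysctl -w kernel.perf_event_paranoid=1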
We wanted to continue the conversation and discuss the effect of hardware on profiling...

Very soon Sasha will come to St. Petersburg to run a training on profiling JVM applications in production and to speak at the Joker 2017 conference with a talk about BPF, so if you want to dive deeper into the subject, you have every chance to meet Sasha in person.