Profiling already running programs

    We all use profilers. The traditional workflow is to start the program "under the profiler" from the very beginning and then, after it finishes, analyze the raw dump with additional utilities.
    But what if we do not have root, and the program we want to profile is already running, has been running for a long time, and now something has gone wrong? And we want to figure it out quickly. Sound familiar?
    Then let's look at the most popular profilers and how they work, and after that at a profiler that solves this particular problem.


    Popular profilers

    If you know a fundamentally different one, write about it in the comments. In the meantime, let's look at these four:

    I. gprof

    Good old UNIX profiler which, according to Kirk McKusick, was written by Bill Joy to analyze the performance of BSD subsystems. In fact, the profiling support is provided by the compiler: it instruments the beginning and the end of every function, and the difference between these two points gives the function's execution time.
    It is worth noting that gprof therefore knows exactly how many times each function was called. And although this can be useful in some situations, it also has a downside: the measurement overhead can be comparable to, or even larger than, the function body itself. That is why, for example, C++ code is compiled with optimizations that lead to inlining.
    One way or another, gprof does not work with already running programs.

    II. Callgrind

    Callgrind is part of Valgrind, an excellent framework for building dynamic code analysis tools. Valgrind runs the program in a sandbox, effectively using virtualization. Callgrind performs profiling based on breakpoints on instructions such as call and ret. It slows the analyzed code down considerably, typically 5 to 20 times, so it is usually unsuitable for runtime analysis on big data.
    However, the tool is very popular, and its simple call-graph format is supported by excellent visualization tools, for example KCachegrind.

    III. OProfile

    OProfile is a system-wide profiler for Linux, capable of profiling all running code at low overhead.
    Being system-wide, it is not aimed at individual processes; it profiles the entire system instead. OProfile collects its metrics not from a system timer, as gprof or callgrind do, but from CPU performance counters. That is why its daemon requires privileges to run.
    Nevertheless, it is an indispensable tool when you need to understand the behavior of the whole system, the whole server at once. It is especially indispensable for profiling kernel code.
    New version of OProfile 0.9.8
    For versions 0.9.7 and earlier, the profiler consisted of a kernel driver and a daemon for collecting data. Since version 0.9.8, this approach has been replaced by Linux Kernel Performance Events (requires kernel 2.6.31 or newer). Release 0.9.8 also includes the 'operf' program, which allows unprivileged users to profile individual processes.


    IV. Google perftools

    This profiler is part of the Google perftools suite. I did not find a review of it on Habr, so I will describe it very briefly.
    The suite includes a set of libraries aimed at speeding up and analyzing C/C++ applications. The centerpiece is the tcmalloc allocator, which, besides speeding up memory allocation, carries tools for analyzing the classic problems: memory leaks and heap profiling.

    The second part is libprofiler, which collects statistics on CPU usage. It is worth dwelling on how it does this. Several times per second (100 by default), the program is interrupted by a timer signal. The stack is unwound in the handler of that signal and all the instruction pointers are recorded. At the end, the raw data is dumped to a file, from which statistics and a call graph can be built.
    Here are some details of how this is done:
    1. By default, the ITIMER_PROF timer is used for the signal; it ticks only while the program is consuming CPU. After all, we are usually not interested in where the program was waiting for keyboard input or for data on a socket. If that is interesting anyway, use env CPUPROFILE_REALTIME=1.

    2. The call stack is unwound either using libunwind or manually (which requires -fno-omit-frame-pointer; on x86 it always works).
    3. Function names are resolved afterwards using addr2line(1).
    4. Like the other Google perftools tools, the profiler can be linked in explicitly, or it can be preloaded via LD_PRELOAD.

    The principle of operation is interesting: the program is interrupted only N times per second, where N is fairly small. This is a so-called sampling profiler. Its advantage is that it barely disturbs the program being analyzed, no matter how many tiny functions are called inside it. Due to the nature of its operation, however, it cannot answer the question "how many times was this function called".
    In the case of the Google profiler there are a few more troubles:

    • This profiler, too, is not designed to attach to already running programs.
    • Recent versions do not work across fork(2), which sometimes makes them hard to use in daemons.



    Crxprof

    As promised, now about another profiler, written specifically to solve the problem identified above: easy profiling of already running processes.

    It collects call stacks and prints the hottest paths to the console when you press ENTER. It can also save the call graph in the callgrind format mentioned earlier. It works fast and, like any other sampling profiler, does not depend on the complexity of the calls in the program being profiled.
    Some work details
    Basically, crxprof works much like perftools, but performs external profiling via ptrace(2). Like perftools, it unwinds the stack with libunwind, and for the hard work of resolving addresses into function names it uses libbfd instead of addr2line(1). Several times per second the target program is stopped (SIGSTOP) and its call stack is captured with libunwind. Having loaded, at startup, a map of the functions of the profiled program and of the libraries linked to it, crxprof can quickly determine which function each IP (instruction pointer) belongs to. In parallel, a call graph is built on the assumption that there is a single central function, the entry point; usually this is __libc_start_main from libc.





    The source code is available on github. Since the utility was written for me and my colleagues, I fully admit it may not match your use case. Anyway, just ask.

    Let's build crxprof and look at an example of its use.

    Building


    What you need: Linux (2.6+), autoconf + automake, binutils-dev (provides libbfd), libunwind-dev (on my system it is called libunwind8-dev).
    To build, run:
    autoreconf -fiv
    ./configure
    make
    sudo make install
    

    If libunwind is installed in a non-standard location, use:
    ./configure --with-libunwind=/path/to/libunwind
    


    Profiling


    To do this, simply run
    crxprof pid
    

    And that's all! Now press ENTER to print the profile to the console, and ^C to finish. Crxprof will also print the profile on exit.
    crxprof: ptrace (PTRACE_ATTACH) failed: Operation not permitted
    If you see this error, ptrace is restricted on your system (Ubuntu?).
    You can read more about it here.
    In short, either run with sudo or (better) execute this in a console:

    $ echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
    


    Like any other unix utility, crxprof prints its usage when invoked with the --help switch. See man crxprof for details.
    crxprof --help
    Usage: crxprof [options] pid
    Options are:
     -t | --threshold N: visualize nodes that takes at least N% of time (default: 5.0)
     -d | --dump FILE: save callgrind dump to given FILE
     -f | --freq FREQ: set profile frequency to FREQ Hz (default: 100)
     -m | --max-depth N: show at most N levels while visualizing (default: no limit)
     -r | --realtime: use realtime profile instead of CPU
     -h | --help: show this help
     --full-stack: print full stack while visualizing
     --print-symbols: just print funcs and addrs (and quit)
    


    Real example


    To give a real but not overly complicated example, I use this small program in C. Compile it, run it, and ask crxprof to save the function call graph (4054 is the pid of the program being profiled):

    $ crxprof -d /tmp/test.calls 4054
    Reading symbols (list of function)
    reading symbols from /home/dkrot/work/crxprof/test/a.out (exe)
    reading symbols from /lib/x86_64-linux-gnu/libc-2.15.so (dynlib)
    reading symbols from /lib/x86_64-linux-gnu/ld-2.15.so (dynlib)
    Attaching to process: 6704
    Starting profile (interval 10ms)
    Press ENTER to show profile, ^C to quit
    1013 snapshot interrputs got (0 dropped)
    main (100% | 0% self)
     \_ heavy_fn (75% | 49% self)
       \_ fn (25% | 25% self)
     \_ fn (24% | 24% self)
    Profile saved to /tmp/test.calls (Callgrind format)
    ^C--- Exit since ^C pressed
    


    The statistics displayed on the console show that:
    • main() calls heavy_fn() (and this is the heaviest path)
    • heavy_fn() calls fn()
    • main() also calls fn() directly
    • heavy_fn() itself takes about half of the CPU time
    • fn() accounts for the remaining CPU time
    • main() itself consumes nothing



    Visualization follows the scheme "largest subtrees first". Thus, even for large real-world programs, you can use this simple console visualization, which should be convenient on servers.

    To visualize complex call graphs, it is convenient to use KCachegrind:

    $ kcachegrind /tmp/test.calls
    

    The picture that I got is presented on the right.
    Instead of a conclusion, let me note that so far only a few of my colleagues and I use this profiler. I hope it will be useful to you as well.
