Performance analysis of individual subsystems of the Linux perf report program

    Usually, I used only the simplest startup options to prepare a profiling report on Linux perf report(the performance data itself must be obtained before the perf reportcommand starts perf record, here you can read more with an example ):


    Module Report:


    $ perf report --stdio --kallsyms=/boot/System.map-$(uname -r) --sort=dso -g none

    Feature Report:


    perf report -g none --stdio --kallsyms=/boot/System.map-$(uname -r)

    Function report with callgraph plotting:


    perf report --stdio --kallsyms=/boot/System.map-$(uname -r) -g

    For many situations, such reports were enough to find performance issues. However, some time ago I was faced with a situation where the callgraph report showed a very large call tree and it was difficult to understand.


    For a better understanding of the performance report, I wanted to break it into separate reports that would exactly correspond to the program subsystems:


    1. Subsystem for receiving messages, the main function: amqp_service::on_message()
    2. Message processing subsystem, main function: input_handler::execute()
    3. Notification distribution subsystem, main function: event_handler::on_msg()

    It turned out that Linux perf has a key that allows you to select only events related to one function in the call stack:


           -p, --parent=
               A regex filter to identify parent. The parent is a caller of this function and searched through the callchain, 
              thus it requires callchain information recorded. The pattern is in the exteneded regex format and 
              defaults to "^sys_|^do_page_fault"

    This key allowed me to analyze performance separately for each function that is important to me. First, it must be said that according to the report, the total number of events in perf.dat was this:


    # Event count (approx.): 72784958041

    Parsing amqp_service :: on_message ()


    Reporting by amqp_service :: on_message ():


    $ perf report -x --stdio --kallsyms=/boot/System.map-$(uname -r) \
        -g -p "amqp_service::on_message"

    The number of events with this filter


    # Event count (approx.): 1590851860

    It turns out that 1590851860 events are about 2% of the total number of events 72784958041. That is, this function did not generate many events even under heavy load. Just in case, I look at what functions were in the top, when I worked amqp_service::on_message()and whether there are any in the top pthread_mutex_lock()/pthread_mutex_unlock():


    $ perf report -x --stdio --kallsyms=/boot/System.map-$(uname -r) \
        -g -p "amqp_service::on_message" | \
        grep -E -e "^[ ]+[0-9]"  | sed -re 's/[ ]+data_provider::.*+$//' | head
        29.38%  data_provider  [kernel.kallsyms]     [k] clear_page_c
        11.09%  data_provider  libc-2.12.so          [.] memcpy
         6.01%  data_provider  [kernel.kallsyms]     [k] __alloc_pages_nodemask
         5.27%  data_provider  [kernel.kallsyms]     [k] compact_zone
         4.44%  data_provider  [kernel.kallsyms]     [k] page_fault
         4.08%  data_provider  [kernel.kallsyms]     [k] get_pageblock_flags_group
         4.05%  data_provider  [kernel.kallsyms]     [k] __reset_isolation_suitable
         3.55%  data_provider  [kernel.kallsyms]     [k] compaction_alloc
         1.78%  data_provider  libpthread-2.12.so    [.] pthread_mutex_lock
         1.61%  data_provider  [kernel.kallsyms]     [k] __mem_cgroup_commit_charge

    From this data and from the full report with callgraph for, amqp_service::on_message()I see that under load the function mainly spends time copying messages. Blocking on mutecs in this function is not a problem, only 1.8% is under load, there is no large contention on mutecs in this function and in the functions called from it according to the report.


    Analysis of input_handler :: execute () function.


    Reporting on this separate function:


    $ perf report -x --stdio --kallsyms=/boot/System.map-$(uname -r) \
        -g -p "input_handler::execute"

    Number of events with this filter:


    # Event count (approx.): 16423187486

    It turns out that 16423187486 events are 22% of the total number of events 72784958041. This is already a lot of work regarding the total number of perf events and here I looked more closely at where the algorithm spends time.


    $ perf report -x --stdio --kallsyms=/boot/System.map-$(uname -r) \
        -g -p "input_handler::execute" | \
        grep -E -e "^[ ]+[0-9]"  | sed -re 's/[ ]+data_provider::.*+$//' | head
         7.31%  data_provider  data_provider.005.00  [.] dtl::basic_string_ref >::empty() const
         7.19%  data_provider  data_provider.005.00  [.] dtl::string_parser::fail() const
         4.95%  data_provider  data_provider.005.00  [.] dtl::string_parser::eof() const
         4.87%  data_provider  data_provider.005.00  [.] dtl::string_parser::peek() const
         3.46%  data_provider  data_provider.005.00  [.] meta::json::decoder::parse_value(dtl::basic_string_ref >&)
         3.08%  data_provider  data_provider.005.00  [.] dtl::basic_string_ref >::operator[](unsigned long) const
         2.53%  data_provider  data_provider.005.00  [.] dtl::basic_string_ref >::remove_prefix(unsigned long)
         2.30%  data_provider  data_provider.005.00  [.] dtl::hash_seq(unsigned char const*, unsigned long)
         2.21%  data_provider  data_provider.005.00  [.] bool dtl::strtou<(unsigned char)10, unsigned char>(dtl::basic_string_ref >&, unsigned char&)

    However, judging by these data, the function performs a large number of decoding, but there are no problems with scaling in the code - there are no significant number of calls to pthread_mutex_lock (). That is, you can add more threads for input_handler :: execute () and expect an increase in overall performance.


    Analysis of the function event_handler :: on_msg ().


    I will skip the analysis of the number of events and simply note that calls made from this function spent most of their time in blocking. This may need to be fixed since such a large number of locks is bad for scaling.


    $ perf report -x --stdio --kallsyms=/boot/System.map-$(uname -r) \
        -g -p "event_handler::on_msg" | \
        grep -E -e "^[ ]+[0-9]"  | sed -re 's/[ ]+data_provider::.*+$//' | head 
         6.73%  data_provider  libpthread-2.12.so    [.] pthread_mutex_lock
         4.90%  data_provider  libpthread-2.12.so    [.] pthread_mutex_unlock
         4.45%  data_provider  data_provider.005.00  [.] std::__atomic_base::operator++()
         3.84%  data_provider  [kernel.kallsyms]     [k] _spin_lock
         3.54%  data_provider  data_provider.005.00  [.] std::__atomic_base::fetch_add(unsigned long, std::memory_order)
         3.47%  data_provider  data_provider.005.00  [.] dtl::intrusive_ptr::intrusive_ptr(dtl::intrusive_ptr const&)
         2.92%  data_provider  data_provider.005.00  [.] (anonymous namespace)::page_allocator_base::allocate_obj(dtl::page_allocation_info&)
         2.70%  data_provider  data_provider.005.00  [.] std::char_traits::length(char const*)
         2.18%  data_provider  data_provider.005.00  [.] dtl::string::size() const
         1.96%  data_provider  data_provider.005.00  [.] dtl::string::create_(dtl::basic_string_ref > const&)

    As a result, building a report using a key --parenthelped to analyze the performance of individual parts of my program. However, due to the fact that the Lua library used was apparently compiled without -fno-omit-frame-pointer, some of the functions that were performed in the input_handler::execute()analysis with --parentwere not included in the report, although they are in the general report.


    Also popular now: