Monitoring PHP code performance with Pinba
Let's imagine a typical startup, gaining popularity, using, for example, PHP or Python. At first everything is on one server - PHP (or Python), Apache, MySQL. Then you take MySQL to a separate server, install nginx to distribute content, perhaps add memcached for caching and a few more application servers ...
Over time, the number of servers will grow, and sooner or later you will ask,
“What is happening with the system now? How often is this or that script executed and for how long? What operations take up most of the time? ” Monitoring systems like Zabbix give only a general, superficial picture of the state of the application.
To get answers to these questions, Badoo has developed a real-time monitoring and statistics service - Pinba. In this article we will talk about our experience of using it for monitoring and profiling our system.
To assess the difficulty level of the task, we will describe in a few words what Badoo is. Today, we have 2 data centers, one in America and one in Europe. Hundreds of PHP-servers, MySQL-servers, dozens of memcached and "self-written" services written in C / C ++ are installed in data centers. Badoo has its own subprojects (for example, a website, applications for mobile devices and desktops), the implementation and nature of the loads on which are very different. In one PHP request, interactions with dozens of different data sources (database, cache, service) can occur. This is bad for performance, but necessary for solving business tasks. And in addition to this, we post a new code 2 to 5 times a week. Monitoring the performance of such a system is not an easy task. Pinba helped solve a large layer of problems related to monitoring the performance of PHP scripts. This service collects information about all processed PHP requests and generates detailed reports on this data.
There was already an article on Habrahabr on how to start working with Pinba: " Pinba - monitor php in real time ." In it, the author described how to install and configure Pinba, so now we will concentrate on more interesting things: how it all works and what information can be obtained.
The general scheme for working with Pinba is shown in the diagram:
Pinba consists of a PHP module (English PHP extension ) and a server. The Pinba module automatically sends data about all processed requests to the server. A pinba server is a service written in C that:
To find out what is happening with the system at the moment, just connect to the server with any MySQL client and make several queries from the report tables. For example, you can see statistics on requests for each of the PHP scripts (one of the most used reports in Pinba):
Below we describe in more detail how to work with reports in Pinba.
Even at the planning stage of the first version of the Pinba service, the idea of organizing the collection of general statistics without making changes to the PHP code immediately appeared. People who have experience writing PHP modules know that the latter have the ability to perform some actions before each request (phase PHP_RINIT) and after processing it (phase PHP_RSHUTDOWN). For example, in the PHP_RINIT phase, a module can initialize its internal resources, and in RSHUTDOWN it can free them. The PHP module for Pinba uses both of these phases. In RINIT, it is initialized, in RSHUTDOWN, it calculates how many resources were spent processing the request, and then all the data is sent by the UDP packet to the Pinba server.
Sending a UDP packet to C is fast enough, so the Pinba module does not add extra load to the PHP process. And due to the fact that Pinba works as a module, it’s enough to install and configure it, and immediately, without making changes to the PHP code, you will be able to find out such parameters of your system as the number of requests for each of the scripts, the average operating time and much more.
Unfortunately, if you have a complex system, common parameters will not be enough.
Initially, there were no timers in the first version of Pinba (let's call it Pinba v0.1). This version could tell us how often a particular script is called, what is the average response time, what is the average response size in bytes, and the like. It was impossible to find out the frequency of queries to MySQL, memcached, or other services. It may not seem so critical to you, but in our system, besides MySQL and memcached, there are a dozen different services written in C / C ++.
If a problem appeared in the system (for example, something started to “slow down”), then Pinba v0.1 showed us that there was a problem, but it was not always possible to quickly understand what exactly it was. Most often, in such situations, one of the services to which PHP made queries (MySQL, memcached, and others) was guilty.
In order to learn how to find such problem areas quickly, timers were introduced in the next version of Pinba.
In fact, we decided to measure the time of the mysql_connect / mysql_query / memcache_connect calls, etc., send this time to the Pinba server and build timer reports. On pseudo-PHP code, it looks something like this:
Here we measure the working time of the mysql_connect call and attach 2 tags to the timer, by which we can subsequently build reports.
All timers, together with all the data, are sent to the Pinba-server and there the report tables are built on them. This allows you to see which operations are currently taking the most time. Moreover, if, for example, you have hundreds of MySQL servers and only one of them slows down, Pinba makes it easy to calculate the very problematic server thanks to a very flexible timer reporting scheme.
Here is a list of tables that Pinba gives access to:
Conventionally, all tables can be divided into 2 types: tables with raw data and reports.
There are 4 tables with raw data: request, tag, timer, timertag. These tables contain a record of each call to the script, timer, and tag in the last few minutes. In heavily loaded systems, these tables can contain millions of records, so we do not recommend using them directly. Instead, it is better to use report tables that contain enough useful data to analyze system performance.
Report_by_script_name - one of the most used reports; It contains information about how often each script was called, the average time it ran, its CPU usage, and the amount of traffic it receives.
Fields with the suffix _total are the sum of the corresponding values for all requests in the last 5 minutes. That is, for example, req_time_total is the sum of the execution time of all requests for the corresponding script in the last 5 minutes. And req_time_total / req_count is the average execution time of the script. The req_utime_ * and req_stime_ * fields indicate the CPU usage in user mode and system mode, respectively.
According to this data, in our company we build graphs of the number of requests to a particular script and graphs of CPU consumption by scripts (graphs 1, 2):
Graph 1.
Graph 2.
Report_by_hostname: This report shows how many requests go to a specific server. It has almost the same list of fields, but instead of the script_name field, the table contains the hostname field.
Probably, some expected to see the domain passed in the Host header in the hostname variable. But actually hostname is the contents of the variable $ _ENV ['HOSTNAME'] and contains the name of the machine.
The main essence of this report is to obtain information about the number of requests to a specific server and whether there are more requests to some servers than to neighboring ones.
Report_by_server_name : in this report, the data is grouped by the domain to which the request came. It shows the number of requests going to a particular domain. The table structure is the same as in the previous two, only the data is now grouped by the server_name field .
Report_by_status : this report is used to view the HTTP statuses with which our scripts end:
If the report has a status of 500, it means that we have somewhere a "Fatal error" in the PHP code.
The reports discussed above are useful for general system analysis. But for a more detailed analysis, we use timers and tags.
A timer is the execution time of a piece of code. You can attach multiple tags to a timer. A tag has a key and a value. For example, in the timer code shown above,
we measure the execution time of the mysql_connect function and attach 2 tags - group and host.
By default, Pinba has no reports on timers and tags, but it provides very flexible options for creating them. Below we’ll take a closer look at how this is done in Badoo.
First of all, we use timers to obtain statistics on requests to various data sources: MySQL, memcached, services written in C / C ++. We “wrapped” all such requests with Pinba timers with two tags: group and host (host is the address of the database or service, group is the operation on the service).
So, if we implemented a tag in our system, which is sent to Pinba, we can build several reports. First of all, it will be a full report on all tag values. In our example with the group tag, we can build a report on how often this or that operation is called and how much time (in total) it takes. To do this, create a table:
The table name is arbitrary and does not affect anything. It is important that the field order is exactly the same as in the example, as well as the values for ENGINE and COMMENT = 'tag_info: group'. In the comment, group is the name of the tag by which we create the report. And tag_info is a type of report (there is also tag_report, which will be discussed below).
After creating the table, Pinba will automatically begin to enter the data there. This report will look something like this:
For example, this report shows that we have some very slow curl_request .
Reports of the tag_info type give a general picture of its use in the system. To study in more detail where these or other operations are called from, you need to build the same report, but with a grouping by the name of the script. To do this, create a table in Pinba:
The table structure is in many ways similar to the previous tag_info_group report . Please note that the script_name field appeared , and in the COMMENT field we created a report of the tag_report type . From this table we can get all the same data as from the previous one, but grouped by script_name .
You can see what operations are performed in a particular script:
This example shows the difference between the hit_count and req_count fields . Req_count is the number of HTTP requests during which the timer with the specified tag_value was called at least once. Hit_count - how many times the timer with the specified tag_value was called.
The example above shows that in our script /admin/messages/all.phtml, memcache_get is called on average 30 times. Most likely this is a programmer error, and the page needs to be optimized.
If you have a lot of database servers, memcached, and other services, then there may be cases when only one server "slows down". And its "inhibition" against the general background may not be obvious. For these purposes, we made another tag called server . You can create a general report for him.tag_info: server and see the general statistics for each host. Here is sample data for memcached statistics:
The general report on the host tag is not sufficiently informative. It shows the frequency of calling timers with one or another host, but without information on the operations performed. Fortunately, Pinba can create reports like tag_info and tag_report with a grouping of 2 tags:
This report allows you to get statistics on operations on each host. For example, you can see the statistics of operations on a specific host:
For obvious reasons, it is SELECT that occupy most of the time.
Also, for example, you can compare statistics for the same operation on different hosts. Example for memcache_connect:
You can also make a report of the tag2_report type , in which information on 2 tags is broken down by script:
We give an example where this table may be useful. In our system, all users are distributed between hundreds of MySQL servers. This is called sharding. There are also certain central MySQL servers for specific tasks. In theory, all web scripts should only query “shards” and not query central databases. The hostname of our "shards" begins with dbs * or dbb *. You can check if there are scripts that make connections to other MySQL servers using this request:
It will show which scripts make requests to central databases.
But even all this was not enough for us. When the client sends us an HTTP request, the latter first gets into nginx and then proxies to PHP. What happens if the PHP process at this moment cannot accept and process the request? For example, if the server is overloaded, or due to network problems, nginx will not be able to contact the php-fpm process? In this case, the user will see the error, but we will not know about it in PHP (though it will be in the nginx logs).
In order to have an idea of what responses (which HTTP statuses) we return to the client, Pinba developer Anton Dovgal made a module for nginx . The module for nginx works the same as the PHP module, but for obvious reasons it does not have timers. The most useful reports he makes arereport_by_script_and_status and report_by_status .
Pinba stores data for the last few minutes. This time can be changed in the settings . But this is not enough if you want to monitor how system performance changes over time. To do this, you need to store Pinba report data (not necessarily all) for a long time and build graphs on them. Thus, you can observe the dynamics of changes in the load on the system and its individual scripts, as well as monitor how your PHP code loads databases, memcached, and other data sources.
Pinba cannot plot and store data over a long period. For this you need to use separate utilities. You can configure Zabbix to monitor Pinba and build graphs based on some of its values. You can use ring (cyclic) databases such as RRDtool or Graphite. In our company we use RRDtool.
RRDtool does not describe in a nutshell. He has a lot of his own nuances, and different people have different preferences in charting, so this topic is worthy of a separate article. There are already enough guides on the Internet to help you get started with this set of utilities from scratch.
What did Pinba give us? Most performance problems began to be detected much faster than before its implementation. If a script has become much more frequently requested by users, it can easily be found in Pinba. If suddenly one of the services (MySQL, memcached, etc.) has become slower to work, then with the correct reports in Pinba you can easily localize this service. And if it began to "slow down" due to the high frequency of the request, then thanks to Pinba it is sometimes possible to localize scripts that often request a specific problem service.
It is very useful to watch Pinba charts after uploading a new version of code to the server. You can see how the load on the system has changed. But for this it is important to think about what kind of graphs will be built.
However, do not take Pinba as a substitute for full server monitoring. For example, we use Zabbix to monitor the consumption of CPU, memory, disks and network activity on all servers. On the other hand, there is Pinba with charts in RRD. By comparing the graphs from these two systems, you can easily calculate the "cost" of a PHP request and estimate how many requests per second your servers can process. And this, in turn, allows you to plan whether you need to purchase new servers in the near future or are there opportunities for optimizing the code.
Apart from the PHP module, Pinba is in no way tied to PHP and can be used to monitor systems written in other languages. GitHub already has projects using Pinba in Python and Ruby.
Maxim max_m Matyukhin, developer
badoo company
Over time, the number of servers will grow, and sooner or later you will ask,
“What is happening with the system now? How often is this or that script executed and for how long? What operations take up most of the time? ” Monitoring systems like Zabbix give only a general, superficial picture of the state of the application.
To get answers to these questions, Badoo has developed a real-time monitoring and statistics service - Pinba. In this article we will talk about our experience of using it for monitoring and profiling our system.
A few words about Badoo
To assess the difficulty level of the task, we will describe in a few words what Badoo is. Today, we have 2 data centers, one in America and one in Europe. Hundreds of PHP-servers, MySQL-servers, dozens of memcached and "self-written" services written in C / C ++ are installed in data centers. Badoo has its own subprojects (for example, a website, applications for mobile devices and desktops), the implementation and nature of the loads on which are very different. In one PHP request, interactions with dozens of different data sources (database, cache, service) can occur. This is bad for performance, but necessary for solving business tasks. And in addition to this, we post a new code 2 to 5 times a week. Monitoring the performance of such a system is not an easy task. Pinba helped solve a large layer of problems related to monitoring the performance of PHP scripts. This service collects information about all processed PHP requests and generates detailed reports on this data.
How does it work
There was already an article on Habrahabr on how to start working with Pinba: " Pinba - monitor php in real time ." In it, the author described how to install and configure Pinba, so now we will concentrate on more interesting things: how it all works and what information can be obtained.
The general scheme for working with Pinba is shown in the diagram:
Pinba consists of a PHP module (English PHP extension ) and a server. The Pinba module automatically sends data about all processed requests to the server. A pinba server is a service written in C that:
- collects data on processed requests via UDP;
- stores all this data in memory and builds reports on it;
- provides access to raw data and reports using the MySQL protocol, that is, it is actually a MySQL engine.
To find out what is happening with the system at the moment, just connect to the server with any MySQL client and make several queries from the report tables. For example, you can see statistics on requests for each of the PHP scripts (one of the most used reports in Pinba):
mysql> SELECT * FROM report_by_script_name;
Below we describe in more detail how to work with reports in Pinba.
How the PHP module for Pinba works
Even at the planning stage of the first version of the Pinba service, the idea of organizing the collection of general statistics without making changes to the PHP code immediately appeared. People who have experience writing PHP modules know that the latter have the ability to perform some actions before each request (phase PHP_RINIT) and after processing it (phase PHP_RSHUTDOWN). For example, in the PHP_RINIT phase, a module can initialize its internal resources, and in RSHUTDOWN it can free them. The PHP module for Pinba uses both of these phases. In RINIT, it is initialized, in RSHUTDOWN, it calculates how many resources were spent processing the request, and then all the data is sent by the UDP packet to the Pinba server.
Sending a UDP packet to C is fast enough, so the Pinba module does not add extra load to the PHP process. And due to the fact that Pinba works as a module, it’s enough to install and configure it, and immediately, without making changes to the PHP code, you will be able to find out such parameters of your system as the number of requests for each of the scripts, the average operating time and much more.
Unfortunately, if you have a complex system, common parameters will not be enough.
Timers
Initially, there were no timers in the first version of Pinba (let's call it Pinba v0.1). This version could tell us how often a particular script is called, what is the average response time, what is the average response size in bytes, and the like. It was impossible to find out the frequency of queries to MySQL, memcached, or other services. It may not seem so critical to you, but in our system, besides MySQL and memcached, there are a dozen different services written in C / C ++.
If a problem appeared in the system (for example, something started to “slow down”), then Pinba v0.1 showed us that there was a problem, but it was not always possible to quickly understand what exactly it was. Most often, in such situations, one of the services to which PHP made queries (MySQL, memcached, and others) was guilty.
In order to learn how to find such problem areas quickly, timers were introduced in the next version of Pinba.
In fact, we decided to measure the time of the mysql_connect / mysql_query / memcache_connect calls, etc., send this time to the Pinba server and build timer reports. On pseudo-PHP code, it looks something like this:
pinba_timer_start(array('type'=>'db::connect', 'host'=>'users.sql'));
$conn = mysql_connect(...);
pinba_timer_stop();
Here we measure the working time of the mysql_connect call and attach 2 tags to the timer, by which we can subsequently build reports.
All timers, together with all the data, are sent to the Pinba-server and there the report tables are built on them. This allows you to see which operations are currently taking the most time. Moreover, if, for example, you have hundreds of MySQL servers and only one of them slows down, Pinba makes it easy to calculate the very problematic server thanks to a very flexible timer reporting scheme.
Getting started with tables
Here is a list of tables that Pinba gives access to:
mysql> SHOW TABLES FROM pinba;
+--------------------------------------+
| Tables_in_pinba |
+--------------------------------------+
| info |
| report_by_hostname |
| report_by_hostname_and_script |
| report_by_hostname_and_server |
| report_by_hostname_server_and_script |
| report_by_script_name |
| report_by_server_and_script |
| report_by_server_name |
| report_by_status |
| request |
| tag |
| timer |
| timertag |
+--------------------------------------+
13 rows in set (0.00 sec)
Conventionally, all tables can be divided into 2 types: tables with raw data and reports.
There are 4 tables with raw data: request, tag, timer, timertag. These tables contain a record of each call to the script, timer, and tag in the last few minutes. In heavily loaded systems, these tables can contain millions of records, so we do not recommend using them directly. Instead, it is better to use report tables that contain enough useful data to analyze system performance.
Report Tables
Report_by_script_name - one of the most used reports; It contains information about how often each script was called, the average time it ran, its CPU usage, and the amount of traffic it receives.
mysql> DESC report_by_script_name;
+------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| req_count | int(11) | YES | | NULL | |
| req_per_sec | float | YES | | NULL | |
| req_time_total | float | YES | | NULL | |
| req_time_percent | float | YES | | NULL | |
| req_time_per_sec | float | YES | | NULL | |
| ru_utime_total | float | YES | | NULL | |
| ru_utime_percent | float | YES | | NULL | |
| ru_utime_per_sec | float | YES | | NULL | |
| ru_stime_total | float | YES | | NULL | |
| ru_stime_percent | float | YES | | NULL | |
| ru_stime_per_sec | float | YES | | NULL | |
| traffic_total | float | YES | | NULL | |
| traffic_percent | float | YES | | NULL | |
| traffic_per_sec | float | YES | | NULL | |
| script_name | varchar(128) | YES | | NULL | |
+------------------+--------------+------+-----+---------+-------+
15 rows in set (0.00 sec)
Fields with the suffix _total are the sum of the corresponding values for all requests in the last 5 minutes. That is, for example, req_time_total is the sum of the execution time of all requests for the corresponding script in the last 5 minutes. And req_time_total / req_count is the average execution time of the script. The req_utime_ * and req_stime_ * fields indicate the CPU usage in user mode and system mode, respectively.
According to this data, in our company we build graphs of the number of requests to a particular script and graphs of CPU consumption by scripts (graphs 1, 2):
Graph 1.
Graph 2.
Report_by_hostname: This report shows how many requests go to a specific server. It has almost the same list of fields, but instead of the script_name field, the table contains the hostname field.
mysql> DESC report_by_hostname;
+------------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+-------------+------+-----+---------+-------+
| req_count | int(11) | YES | | NULL | |
.........
| hostname | varchar(32) | YES | | NULL | |
+------------------+-------------+------+-----+---------+-------+
15 rows in set (0.00 sec)
Probably, some expected to see the domain passed in the Host header in the hostname variable. But actually hostname is the contents of the variable $ _ENV ['HOSTNAME'] and contains the name of the machine.
The main essence of this report is to obtain information about the number of requests to a specific server and whether there are more requests to some servers than to neighboring ones.
Report_by_server_name : in this report, the data is grouped by the domain to which the request came. It shows the number of requests going to a particular domain. The table structure is the same as in the previous two, only the data is now grouped by the server_name field .
Report_by_status : this report is used to view the HTTP statuses with which our scripts end:
mysql> SELECT req_count, status FROM report_by_status;
+-----------+--------+
| req_count | status |
+-----------+--------+
| 540622 | 200 |
| 2578 | 301 |
| 21955 | 302 |
| 296 | 403 |
| 2090 | 404 |
+-----------+--------+
5 rows in set (0.14 sec)
If the report has a status of 500, it means that we have somewhere a "Fatal error" in the PHP code.
Timers and tags
The reports discussed above are useful for general system analysis. But for a more detailed analysis, we use timers and tags.
A timer is the execution time of a piece of code. You can attach multiple tags to a timer. A tag has a key and a value. For example, in the timer code shown above,
pinba_timer_start(array('group'=>'db::connect', 'host'=>'users.sql'));
$conn = mysql_connect(...);
pinba_timer_stop();
we measure the execution time of the mysql_connect function and attach 2 tags - group and host.
By default, Pinba has no reports on timers and tags, but it provides very flexible options for creating them. Below we’ll take a closer look at how this is done in Badoo.
First of all, we use timers to obtain statistics on requests to various data sources: MySQL, memcached, services written in C / C ++. We “wrapped” all such requests with Pinba timers with two tags: group and host (host is the address of the database or service, group is the operation on the service).
So, if we implemented a tag in our system, which is sent to Pinba, we can build several reports. First of all, it will be a full report on all tag values. In our example with the group tag, we can build a report on how often this or that operation is called and how much time (in total) it takes. To do this, create a table:
CREATE TABLE `tag_info_group` (
`tag_value` varchar(64) DEFAULT NULL,
`req_count` int(11) DEFAULT NULL,
`req_per_sec` float DEFAULT NULL,
`hit_count` int(11) DEFAULT NULL,
`hit_per_sec` float DEFAULT NULL,
`timer_value` float DEFAULT NULL
) ENGINE=PINBA DEFAULT CHARSET=latin1 COMMENT='tag_info:group'
The table name is arbitrary and does not affect anything. It is important that the field order is exactly the same as in the example, as well as the values for ENGINE and COMMENT = 'tag_info: group'. In the comment, group is the name of the tag by which we create the report. And tag_info is a type of report (there is also tag_report, which will be discussed below).
After creating the table, Pinba will automatically begin to enter the data there. This report will look something like this:
mysql> SELECT * FROM tag_info_group ORDER BY timer_value;
+--------------------+-----------+-------------+-----------+-------------+-------------+
| tag_value | req_count | req_per_sec | hit_count | hit_per_sec | timer_value |
+--------------------+-----------+-------------+-----------+-------------+-------------+
| curl_request | 56216 | 281.08 | 56710 | 283.55 | 11977.3 |
| memcache_get | 2563499 | 12817.5 | 11462991 | 57315 | 6823.04 |
| hsc_connect | 2319186 | 11595.9 | 2870366 | 14351.8 | 6140.5 |
| hsc_open_index | 2319158 | 11595.8 | 3285335 | 16426.7 | 5805.61 |
| hsc_select | 2312567 | 11562.8 | 3158465 | 15792.3 | 5778.57 |
| memcache_connect | 2563064 | 12815.3 | 4007451 | 20037.3 | 2389.65 |
| memcache_set | 917102 | 4585.51 | 1616016 | 8080.08 | 1059.08 |
| memcache_delete | 600720 | 3003.6 | 2451872 | 12259.4 | 785.702 |
| scribe_receive | 382881 | 1914.41 | 975892 | 4879.46 | 382.522 |
| hsc_update | 84877 | 424.385 | 85287 | 426.435 | 317.637 |
| hsc_insert | 41564 | 207.82 | 41574 | 207.87 | 280.468 |
| scribe_send | 382894 | 1914.47 | 975892 | 4879.46 | 174.156 |
| memcache_increment | 60817 | 304.085 | 88734 | 443.67 | 30.9102 |
| memcache_add | 14818 | 74.09 | 16281 | 81.405 | 4.71686 |
| memcache_multi_get | 2543 | 12.715 | 2543 | 12.715 | 1.59621 |
| hsc_delete | 4 | 0.02 | 9 | 0.045 | 0.003373 |
+--------------------+-----------+-------------+-----------+-------------+-------------+
16 rows in set (0.00 sec)
For example, this report shows that we have some very slow curl_request .
Reports of the tag_info type give a general picture of its use in the system. To study in more detail where these or other operations are called from, you need to build the same report, but with a grouping by the name of the script. To do this, create a table in Pinba:
CREATE TABLE `tag_report_group` (
`script_name` varchar(128) DEFAULT NULL,
`tag_value` varchar(64) DEFAULT NULL,
`req_count` int(11) DEFAULT NULL,
`req_per_sec` float DEFAULT NULL,
`hit_count` int(11) DEFAULT NULL,
`hit_per_sec` float DEFAULT NULL,
`timer_value` float DEFAULT NULL
) ENGINE=PINBA DEFAULT CHARSET=latin1 COMMENT='tag_report:group'
The table structure is in many ways similar to the previous tag_info_group report . Please note that the script_name field appeared , and in the COMMENT field we created a report of the tag_report type . From this table we can get all the same data as from the previous one, but grouped by script_name .
You can see what operations are performed in a particular script:
mysql> SELECT tag_value, req_count, hit_count, timer_value, hit_count/req_count FROM tag_report_group WHERE script_name='/admin/messages/all.phtml' ORDER BY timer_value DESC;
+--------------------+-----------+-----------+-------------+---------------------+
| tag_value | req_count | hit_count | timer_value | hit_count/req_count |
+--------------------+-----------+-----------+-------------+---------------------+
| memcache_get | 107439 | 3227537 | 1060.68 | 30.0406 |
| memcache_connect | 107439 | 887789 | 270.919 | 8.2632 |
| scribe_receive | 107573 | 283478 | 68.9106 | 2.6352 |
| memcache_set | 34339 | 153080 | 50.6079 | 4.4579 |
| scribe_send | 107573 | 283478 | 39.7583 | 2.6352 |
| memcache_delete | 18236 | 21306 | 4.54436 | 1.1683 |
| curl_request | 15 | 15 | 0.514396 | 1.0000 |
| memcache_decrement | 1250 | 1250 | 0.251141 | 1.0000 |
| memcache_multi_get | 125 | 125 | 0.066473 | 1.0000 |
| memcache_increment | 2 | 2 | 0.000336 | 1.0000 |
+--------------------+-----------+-----------+-------------+---------------------+
10 rows in set (0.03 sec)
This example shows the difference between the hit_count and req_count fields . Req_count is the number of HTTP requests during which the timer with the specified tag_value was called at least once. Hit_count - how many times the timer with the specified tag_value was called.
The example above shows that in our script /admin/messages/all.phtml, memcache_get is called on average 30 times. Most likely this is a programmer error, and the page needs to be optimized.
If you have a lot of database servers, memcached, and other services, then there may be cases when only one server "slows down". And its "inhibition" against the general background may not be obvious. For these purposes, we made another tag called server . You can create a general report for him.tag_info: server and see the general statistics for each host. Here is sample data for memcached statistics:
mysql> SELECT * FROM tag_info_server WHERE tag_value LIKE 'memcache%' ORDER BY timer_value;
+------------------+-----------+-------------+-----------+-------------+-------------+
| tag_value | req_count | req_per_sec | hit_count | hit_per_sec | timer_value |
+------------------+-----------+-------------+-----------+-------------+-------------+
| memcached7.mlan | 350702 | 1753.51 | 1767513 | 8837.57 | 781.644 |
| memcached3.mlan | 336217 | 1681.08 | 1734430 | 8672.15 | 871.633 |
| memcached1.mlan | 350082 | 1750.41 | 1823910 | 9119.55 | 883.52 |
| memcached8.mlan | 357798 | 1788.99 | 1793808 | 8969.04 | 908.848 |
| memcached10.mlan | 343287 | 1716.44 | 1757085 | 8785.42 | 909.042 |
| memcached5.mlan | 398042 | 1990.21 | 1861729 | 9308.64 | 914.772 |
| memcached4.mlan | 401584 | 2007.92 | 1877775 | 9388.88 | 937.53 |
| memcached2.mlan | 459912 | 2299.56 | 2028406 | 10142 | 971.189 |
| memcached6.mlan | 399768 | 1998.84 | 1912571 | 9562.86 | 975.094 |
| memcached9.mlan | 425276 | 2126.38 | 1939575 | 9697.88 | 1008.29 |
+------------------+-----------+-------------+-----------+-------------+-------------+
10 rows in set (0.00 sec)
The general report on the host tag is not sufficiently informative. It shows the frequency of calling timers with one or another host, but without information on the operations performed. Fortunately, Pinba can create reports like tag_info and tag_report with a grouping of 2 tags:
CREATE TABLE `tag_info_group_server` (
`group_value` varchar(64) DEFAULT NULL,
`server_value` varchar(64) DEFAULT NULL,
`req_count` int(11) DEFAULT NULL,
`req_per_sec` float DEFAULT NULL,
`hit_count` int(11) DEFAULT NULL,
`hit_per_sec` float DEFAULT NULL,
`timer_value` float DEFAULT NULL
) ENGINE=PINBA DEFAULT CHARSET=latin1 COMMENT='tag2_info:group,server'
This report allows you to get statistics on operations on each host. For example, you can see the statistics of operations on a specific host:
mysql> SELECT * FROM tag_info_group_server WHERE server_value = 'dbs101.mlan' ORDER BY timer_value;
+-------------+--------------+-----------+-------------+-----------+-------------+-------------+
| group_value | server_value | req_count | req_per_sec | hit_count | hit_per_sec | timer_value |
+-------------+--------------+-----------+-------------+-----------+-------------+-------------+
| db::replace | dbs101.mlan | 154 | 0.77 | 154 | 0.77 | 0.257526 |
| db::delete | dbs101.mlan | 1340 | 6.7 | 1340 | 6.7 | 1.84398 |
| db::begin | dbs101.mlan | 6437 | 32.185 | 6589 | 32.945 | 2.96375 |
| db::commit | dbs101.mlan | 6437 | 32.185 | 6589 | 32.945 | 3.04067 |
| db::insert | dbs101.mlan | 4674 | 23.37 | 11171 | 55.855 | 13.9457 |
| db::update | dbs101.mlan | 6358 | 31.79 | 10399 | 51.995 | 18.197 |
| db::connect | dbs101.mlan | 16826 | 84.13 | 16826 | 84.13 | 22.1383 |
| db::select | dbs101.mlan | 14372 | 71.86 | 47106 | 235.53 | 180.235 |
+-------------+--------------+-----------+-------------+-----------+-------------+-------------+
8 rows in set (0.01 sec)
For obvious reasons, it is SELECT that occupy most of the time.
Also, for example, you can compare statistics for the same operation on different hosts. Example for memcache_connect:
mysql> SELECT server_value, req_count, hit_count, timer_value FROM tag_info_group_server WHERE group_value='memcache_connect' ORDER BY timer_value;
+------------------+-----------+-----------+-------------+
| server_value | req_count | hit_count | timer_value |
+------------------+-----------+-----------+-------------+
| memcached1.mlan | 366138 | 365061 | 191.808 |
| memcached3.mlan | 354688 | 353760 | 192.184 |
| memcached10.mlan | 365358 | 364341 | 197.353 |
| memcached8.mlan | 368912 | 367839 | 213.768 |
| memcached7.mlan | 374160 | 373152 | 215.914 |
| memcached5.mlan | 415444 | 414434 | 217.855 |
| memcached4.mlan | 423394 | 422242 | 252.416 |
| memcached2.mlan | 478949 | 477641 | 272.346 |
| memcached6.mlan | 414088 | 413003 | 279.038 |
| memcached9.mlan | 441589 | 440481 | 377.952 |
+------------------+-----------+-----------+-------------+
10 rows in set (0.01 sec)
You can also make a report of the tag2_report type , in which information on 2 tags is broken down by script:
CREATE TABLE `tag_report_group_server` (
`sc0зript_name` varchar(128) DEFAULT NULL,
`tag1_value` varchar(64) DEFAULT NULL,
`tag2_value` varchar(64) DEFAULT NULL,
`req_count` int(11) DEFAULT NULL,
`req_per_sec` float DEFAULT NULL,
`hit_count` int(11) DEFAULT NULL,
`hit_per_sec` float DEFAULT NULL,
`timer_value` float DEFAULT NULL
) ENGINE=PINBA DEFAULT CHARSET=latin1 COMMENT='tag2_report:group,server'
We give an example where this table may be useful. In our system, all users are distributed between hundreds of MySQL servers. This is called sharding. There are also certain central MySQL servers for specific tasks. In theory, all web scripts should only query “shards” and not query central databases. The hostname of our "shards" begins with dbs * or dbb *. You can check if there are scripts that make connections to other MySQL servers using this request:
mysql> SELECT * FROM tag_report_group_server WHERE tag1_value='db::connect' AND tag2_value NOT LIKE 'dbs%' AND tag2_value NOT LIKE 'dbb%';
It will show which scripts make requests to central databases.
nginx + Pinba
But even all this was not enough for us. When the client sends us an HTTP request, the latter first gets into nginx and then proxies to PHP. What happens if the PHP process at this moment cannot accept and process the request? For example, if the server is overloaded, or due to network problems, nginx will not be able to contact the php-fpm process? In this case, the user will see the error, but we will not know about it in PHP (though it will be in the nginx logs).
In order to have an idea of what responses (which HTTP statuses) we return to the client, Pinba developer Anton Dovgal made a module for nginx . The module for nginx works the same as the PHP module, but for obvious reasons it does not have timers. The most useful reports he makes arereport_by_script_and_status and report_by_status .
Plotting
Pinba stores data for the last few minutes. This time can be changed in the settings . But this is not enough if you want to monitor how system performance changes over time. To do this, you need to store Pinba report data (not necessarily all) for a long time and build graphs on them. Thus, you can observe the dynamics of changes in the load on the system and its individual scripts, as well as monitor how your PHP code loads databases, memcached, and other data sources.
Pinba cannot plot and store data over a long period. For this you need to use separate utilities. You can configure Zabbix to monitor Pinba and build graphs based on some of its values. You can use ring (cyclic) databases such as RRDtool or Graphite. In our company we use RRDtool.
RRDtool does not describe in a nutshell. He has a lot of his own nuances, and different people have different preferences in charting, so this topic is worthy of a separate article. There are already enough guides on the Internet to help you get started with this set of utilities from scratch.
results
What did Pinba give us? Most performance problems began to be detected much faster than before its implementation. If a script has become much more frequently requested by users, it can easily be found in Pinba. If suddenly one of the services (MySQL, memcached, etc.) has become slower to work, then with the correct reports in Pinba you can easily localize this service. And if it began to "slow down" due to the high frequency of the request, then thanks to Pinba it is sometimes possible to localize scripts that often request a specific problem service.
It is very useful to watch Pinba charts after uploading a new version of code to the server. You can see how the load on the system has changed. But for this it is important to think about what kind of graphs will be built.
However, do not take Pinba as a substitute for full server monitoring. For example, we use Zabbix to monitor the consumption of CPU, memory, disks and network activity on all servers. On the other hand, there is Pinba with charts in RRD. By comparing the graphs from these two systems, you can easily calculate the "cost" of a PHP request and estimate how many requests per second your servers can process. And this, in turn, allows you to plan whether you need to purchase new servers in the near future or are there opportunities for optimizing the code.
Apart from the PHP module, Pinba is in no way tied to PHP and can be used to monitor systems written in other languages. GitHub already has projects using Pinba in Python and Ruby.
Maxim max_m Matyukhin, developer
badoo company
useful links
- Official Pinba Project Page
- Pinba Engine development list - Pinba's English-speaking discussion group
- Pinba Engine development list (in Russian) - Russian-speaking Google discussion group Pinba
- Pinba - real-time php monitoring - article on Habrahabr about the first steps with Pinba
- Official RRDtool Project Page
- Official Graphite Project Page
- Loop database - wikipedia.org
- Nginx module for pinba
- Python WSGI middleware for pinba