Postgres logging experience
We built our logging system on PostgreSQL. Yes, I know there are add-ons for ElasticSearch (GrayLog2, Logstash) and other similar tools, plus ones I have never heard of. Nevertheless, our tool is built on PostgreSQL, and it works.
On working days, all VLSI services in the cloud send us more than 11 billion log records per day. The records are kept for 3 days, and the total space they occupy never exceeds 32 TB. All of this is handled by 8 servers running PostgreSQL 9.6, each with 24 cores, 16 GB of RAM and four 1 TB SSD drives.
Who needs this?
Our services are written roughly 50% in C++, 40% in Python, 9% in SQL and 1% in JavaScript, and there are more than 200 of them. We often need to get to the bottom of various problems quickly, whether analyzing reported errors or logical ones. Sometimes we just need to monitor the work and check that everything is going according to plan. Very different groups do all of this: developers, testers, server administrators and, in some cases, management. So we need a tool that all of these groups can understand. We built our own logging system, or rather a system for tracing HTTP requests to our web services. It is not a universal logging solution, but it fits our way of working well. Besides simply viewing logs, we have other uses for the collected data; more on that in the next section.
Other applications of logs
For convenient analysis, an HTTP request to a web service can be represented as a call tree. In simplified form, such a tree looks like this:
Request to service A
| - Worker thread #1
| | - SQL query X to the database
| | - Internal sub-call 1
| | | - Request Y to Redis
| | | - Synchronous HTTP request to service B
| | - Internal sub-call 2
| | | - SQL query X to the database
| | | - Synchronous HTTP request to service C
| - Worker thread #2
| - Asynchronous request to service W
The report screenshot in fig. 1, which presents such a tree, speaks more eloquently. Each node shows how long it took to execute and what percentage of the parent's time that is. You can look for bottlenecks in the request, and the links let you open the logs of sub-calls on other services. The report is quite convenient and builds almost instantly. There are, of course, exceptions when a call tree contains millions of records (yes, we have such trees); then it takes longer, but trees of up to 5 million entries can still be retrieved. We call this report "profiling a single call", because it is most often used for profiling. Fig. 1. Profiling a single call
Sometimes a typical request needs to be profiled over a statistical sample of calls of the same kind rather than over a single call. For this we have a report that merges such calls into one tree and shows its nodes and leaves as squares whose area is the percentage of the parent call's time; see the screenshot in fig. 2. Fig. 2. Profiling a set of typical calls. There is also a report that catches network delays: service A sends a request to service B, the response arrives in 100 ms, yet the request on service B took only 10 ms, and 90 ms disappeared somewhere. We call this missing time "lag". The lag report is shown in fig. 3. Fig. 3. Lag report
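A minimal sketch of the lag arithmetic, just to make the definition concrete (the function name is illustrative, not part of our code):

def lag_ms(caller_elapsed_ms: int, callee_duration_ms: int) -> int:
    # "Lag" is the part of the caller's wait the callee cannot account for:
    # service A waited 100 ms for service B, B spent 10 ms processing,
    # so lag_ms(100, 10) == 90 ms went to the network, queues, etc.
    return caller_elapsed_ms - callee_duration_ms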
Besides these, we use other log reports as well, but they are not as widely used as the ones shown.
How does it all work
Our web services consist of equal, independent nodes. Each service node is made up of a control process and several worker processes. The control process receives HTTP requests from clients, puts them into a queue for processing, and sends responses back. The worker processes pick requests out of the control process's queue and do the actual work: they may go to the PostgreSQL database, to another web service, to Redis, or elsewhere.
Fig. 4. Architecture of the web service node at the process level.
Each request to a service has the following set of attributes, which we must write to the log:
- the unique name of the request, which we call the "service method"
- the service node the request finally arrived at
- the IP address of the request initiator
- the user session identifier
- the UUID of the request cascade (a request cascade is when one request from outside the service spawns a series of requests to other services)
- the request number on the service node (request numbers are reset when the service restarts)
- the number of the worker process that handled the request
Various events happen inside a request: calls to PostgreSQL, Redis, ClickHouse, RabbitMQ and other services, and calls to the service's internal methods. We record these events with the following attributes:
- the date and time the event occurred, to the millisecond
- the event text, the most important part: everything that does not fit into other attributes goes here - the text of an SQL query, a Redis command, the parameters of a call to another service, the parameters of an asynchronous call via RabbitMQ, and so on
- the event duration in milliseconds
- the number of the thread in which the event occurred
- processor ticks, needed to order events that share the same millisecond timestamp
- the event type: normal, warning or error
To a first approximation, then, the data structure looks like the figure below.
Fig. 5. The first version of the database structure for storing logs
Here in the figure:
Service is a table of service names; there are few of them, a few hundred.
ServiceNode is a table of service nodes, with several thousand records; several nodes can belong to one service.
Method is a table of method names; there are also several thousand of these.
Log is, in effect, the main table, where request data and request events are written. The table is deliberately denormalized so that queries do not waste time joining large tables and we do not have to keep extra indexes. For example, the request attributes could be moved into a separate table, but then the logic of adding and retrieving data would become more complicated. We could also move out the UUID and the user session identifier, but each new field would then need an index in the new table, an index on the foreign key in the Log table, and a superfluous primary key in the new table.
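To make the structure more concrete, here is a minimal sketch of what such a denormalized schema might look like. All table and column names, and the choice of types, are illustrative assumptions, not the real production DDL.

import psycopg2

# Reference tables plus one wide, deliberately denormalized log table,
# roughly following fig. 5 (illustrative names and types).
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS service      (id serial PRIMARY KEY, name text UNIQUE);
CREATE TABLE IF NOT EXISTS service_node (id serial PRIMARY KEY,
                                         service_id int REFERENCES service(id),
                                         host text);
CREATE TABLE IF NOT EXISTS method       (id serial PRIMARY KEY, name text UNIQUE);

CREATE TABLE IF NOT EXISTS log (
    event_ts        timestamp NOT NULL,  -- event date/time, ms precision
    service_node_id int NOT NULL,        -- node the request arrived at
    method_id       int NOT NULL,        -- the "service method"
    client_ip       inet,                -- request initiator IP
    session_id      text,                -- user session identifier
    cascade_uuid    uuid,                -- UUID of the request cascade
    request_no      bigint,              -- request number on the node
    worker_no       int,                 -- worker process number
    thread_no       int,                 -- thread the event occurred in
    cpu_ticks       bigint,              -- orders events within one millisecond
    duration_ms     int,                 -- event duration
    level           smallint,            -- normal / warning / error
    msg             text                 -- SQL text, Redis command, call args, ...
);
"""

if __name__ == "__main__":
    with psycopg2.connect("dbname=logs") as conn, conn.cursor() as cur:
        cur.execute(SCHEMA_SQL)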
The web services send their logs over HTTP through nginx (for load balancing). The log-service nodes process them and write them to the database. The scheme is shown below.
Fig. 6. Scheme for sending logs to the log service
Figure 7 shows what the logs of our requests to the node fix-osr-bl17.unix.tensor.ru of the Moscow service look like. The request is called "VLSI.PerformAction", its number is 15155; I will not show the UUID, it is displayed above the request name. The first entry is a message of the form "[m] [start] Edo → EDOCertCheckAttorney", which records the start of a call to an internal service method without arguments. Immediately after it a second sub-call begins, "[m] [start] Document.Incoming / 1 (234394;)", with a single argument whose value is 234394. Then a call comes in from a node of the caching service, the line reads "[rpc call]", and so on. Fig. 7. Log screenshot for the "VLSI.PerformAction" method on the node fix-osr-bl17.unix.tensor.ru
In the year and a half of its existence, this database and service scheme has not changed much. What problems did we run into? At first we wrote to a single database, and from the very first days we found that:
- We could not write to the database the usual way with INSERT, so we switched entirely to COPY.
- Deleting obsolete data the usual way with DELETE was unrealistic, so we switched to TRUNCATE. It acts on the whole table quickly enough, truncating the file to almost zero size. However, we had to keep a separate section table for each day of the month so that only stale data is deleted. There is still one unpleasant catch with TRUNCATE: if the PostgreSQL server has decided to run autovacuum to prevent wraparound on the table, TRUNCATE will not execute until that autovacuum finishes, and it can run for quite a while. So before cleaning we kill that process.
- We do not need the write-ahead log, so we started creating the tables with CREATE UNLOGGED TABLE.
- Synchronous writes to disk are not needed either, so we set fsync = off and full_page_writes = off. This is acceptable: database failures caused by disks did happen, of course, but they are extremely rare. (These write-path tricks are sketched together just below.)
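A rough sketch of that write path, under the assumptions above: UNLOGGED per-day tables, bulk loading through COPY, and TRUNCATE for retention. The table naming (log_dNN) and the tab-separated COPY format are illustrative assumptions; fsync and full_page_writes live in postgresql.conf, not in code.

import io

def write_batch(conn, day: int, rows: list) -> None:
    """Bulk-load a batch of log records into the section table for the given day."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(v) for v in row) + "\n")
    buf.seek(0)
    with conn.cursor() as cur:
        # UNLOGGED: no WAL is written, acceptable for short-lived log data.
        cur.execute(
            f"CREATE UNLOGGED TABLE IF NOT EXISTS log_d{day:02d} (LIKE log INCLUDING ALL)"
        )
        # COPY is far cheaper than row-by-row INSERTs at this volume.
        cur.copy_expert(f"COPY log_d{day:02d} FROM STDIN", buf)

def drop_day(conn, day: int) -> None:
    """Retention: TRUNCATE is near-instant, unlike DELETE on a huge table."""
    # In production the anti-wraparound autovacuum on this table is killed
    # first, otherwise TRUNCATE would wait for it to finish.
    with conn.cursor() as cur:
        cur.execute(f"TRUNCATE TABLE log_d{day:02d}")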
These measures sped the PostgreSQL server up considerably. We also tried changing the synchronous_commit and commit_delay parameters, but in our case they had no noticeable effect on performance.
With the tables split into data sections for each day of the month, the database scheme now looked like this:
Fig. 8. Database scheme with section tables for each day of the month
Over time, with log volumes constantly growing, we could no longer keep the data in a single database, and the first prototype of the log service was reworked into a distributed version. Each log-service node now wrote to one of several databases, picking the database for each write with a round-robin algorithm. This was convenient enough: statistically every database received the same load, the load scaled horizontally, and the data volumes across databases matched to within a few GB. Instead of one PostgreSQL server for logs, five were now at work. The scheme of the log-service nodes looked like this:
Fig. 9. The scheme of the log service with multiple databases
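A minimal sketch of that round-robin selection. The DSN list, the pool class and its method names are illustrative assumptions; they only show the idea of rotating writes across databases.

import itertools
import psycopg2

LOG_DB_DSNS = [f"host=logdb{i} dbname=logs" for i in range(1, 6)]

class RoundRobinPool:
    def __init__(self, dsns):
        # Every log-service node keeps a connection to every log database.
        self.conns = [psycopg2.connect(dsn) for dsn in dsns]
        self._cycle = itertools.cycle(self.conns)

    def next_connection(self):
        # Rotating through the circle spreads batches evenly, so
        # statistically all databases receive the same load.
        return next(self._cycle)

pool = RoundRobinPool(LOG_DB_DSNS)
conn = pool.next_connection()  # write the next batch of logs through this one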
As the diagram in fig. 9 shows, each service node holds connections to several databases at once. This scheme has obvious drawbacks: if one database failed or started to crawl, the whole service stalled, because everything was held back by the slowest database; and every new service node increased the number of connections to each PostgreSQL server.
Writing to the same table through a large number of simultaneous connections also causes more lock contention, which slows the writes down. The growth in connections per database can be contained with pgBouncer in transaction mode, but there are no miracles: request times grow slightly because the work passes through an extra hop, and in transaction mode connections to the database are switched too often, which does not help either.
We ran with that setup for about another year and finally moved to a scheme where each service node works with exactly one database, directly, without pgBouncer. It turned out to be twice as efficient; by then we had added 3 more databases, bringing the total to 8, and that is how many are still running today.
This scheme also brought small extra advantages: the number of connections to a database no longer grows when a new node is added, and if a database starts to crawl, nginx simply spreads the load over the other service nodes.
Fig. 10. The current scheme of the log service: one node - one database.
Additionally, to get the most out of the PostgreSQL server (it runs on CentOS 7), we switched the I/O scheduler to deadline. We also reduced the size of the dirty page cache: the vm.dirty_background_bytes parameter sets the amount of dirty data at which the system starts flushing it to disk in the background. If it is too large, the flush hits the disk as a load spike, so the parameter has to be tuned until that spike is smoothed out.
Besides performance problems, there is the pressing problem of running out of storage space. We currently have 32 TB, which is enough for 3 days. However, logging sometimes spikes severalfold, and the space runs out sooner. How do we cope without adding disks? We formulated the task for ourselves like this: at least some of the logs must survive, even at the cost of their level of detail.
Accordingly, we split each day's log table into three tables: the first stores data from 00:00 to 08:00, the second from 08:00 to 16:00, the third from 16:00 to 24:00. Each of these tables is further divided into three sections corresponding to three levels of log importance. The first level stores the basic facts of a request and its duration, without details, plus any errors that occurred during the request. The second level stores information about sub-calls and SQL queries. The third level stores everything that did not fit into the first two. If a log-service node sees that less than 15% of free space is left for writing, it starts cleaning the oldest third-level section, and so on until there is enough space to write. If no third-level sections are left and there is still not enough space, it starts cleaning second-level sections, and so on.
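A sketch of that space-reclamation loop. The section-table naming (log_dDD_hHH_lvlN for day, 8-hour slice and importance level) and the data directory path are illustrative assumptions.

import shutil

FREE_FRACTION_THRESHOLD = 0.15  # start cleaning when less than 15% is free

def free_fraction(path="/var/lib/pgsql/data") -> float:
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

def reclaim_space(conn, sections):
    """`sections`: list of (day, hour_slice, level) tuples, oldest first."""
    # Sacrifice the most detailed data first: level 3, then level 2,
    # keeping level 1 (request facts and errors) for as long as possible.
    for level in (3, 2):
        for day, hours, lvl in sections:
            if lvl != level:
                continue
            if free_fraction() >= FREE_FRACTION_THRESHOLD:
                return
            with conn.cursor() as cur:
                cur.execute(f"TRUNCATE TABLE log_d{day:02d}_h{hours:02d}_lvl{lvl}")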
So far this has all been about writing logs, but logs are written in order to be read. For reading, the sections of the table covered by the requested time period are computed and substituted into the query ourselves; PostgreSQL's built-in partitioning mechanism did not work for our case and is inconvenient, so we do not use it. What a log query result looks like was shown in fig. 7. The basic requirements for such queries are:
- it must be as fast as possible
- it must be possible to page through the result
It is impossible to satisfy both requirements 100% of the time, but we can for about 80% of queries, and that is what we did. We ended up with about 20 query profiles that fit into 11 indexes on the Log table. Unfortunately, indexes slow down insertion and eat space: in our case they take up about as much space as the data itself.
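As for substituting sections into the query by hand, here is a minimal sketch of the idea: for the requested time range we pick the per-day, per-8-hour section tables and stitch them together with UNION ALL. The naming convention and the filter column are illustrative assumptions (the importance-level split is omitted for brevity).

from datetime import datetime, timedelta

def sections_for(start: datetime, end: datetime):
    """Yield (day, hour_slice) pairs whose section tables cover [start, end]."""
    t = start.replace(minute=0, second=0, microsecond=0)
    seen = set()
    while t <= end:
        key = (t.day, (t.hour // 8) * 8)
        if key not in seen:
            seen.add(key)
            yield key
        t += timedelta(hours=1)

def build_query(start: datetime, end: datetime) -> str:
    # Only the sections that can contain matching rows end up in the query;
    # %(start)s, %(end)s and %(method_id)s are bound at execution time.
    parts = [
        f"SELECT * FROM log_d{d:02d}_h{h:02d} "
        f"WHERE event_ts BETWEEN %(start)s AND %(end)s AND method_id = %(method_id)s"
        for d, h in sections_for(start, end)
    ]
    return " UNION ALL ".join(parts) + " ORDER BY event_ts DESC LIMIT 500"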
We never select data for the whole time period the user asked for; that would be far too inefficient. In most cases it is enough to show the first or the last page of the result; more rarely, users page further. Consider the selection algorithm with an example: suppose logs are requested for a 1-hour period with 500 records per page.
First we run the query for a 1 ms window, going to all the log databases in parallel. After the results come back from every database, we merge them, sorting by time. If 500 records have not yet been collected, we shift past the window just covered, double the window size, and repeat until the 500 records needed for display are collected.
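A minimal sketch of that widening-window page fetch. query_all_dbs is a stand-in for the real parallel fan-out to every log database; it is assumed to return the rows matching the user's filters within the given slice.

from datetime import datetime, timedelta

PAGE_SIZE = 500

def fetch_page(query_all_dbs, period_start: datetime, period_end: datetime):
    rows = []
    window = timedelta(milliseconds=1)
    start = period_start
    while len(rows) < PAGE_SIZE and start < period_end:
        end = min(start + window, period_end)
        # Ask every log database for this slice (in parallel in real life),
        # then merge the partial results by event time.
        batch = query_all_dbs(start, end)
        rows = sorted(rows + batch, key=lambda r: r["event_ts"])
        start = end    # shift past the slice already covered
        window *= 2    # and double the window for the next attempt
    return rows[:PAGE_SIZE]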
Most of the time this simple algorithm is enough to get the data quickly. But if the filter conditions match millions of records over the whole period and you only need a handful of them, for example when searching for a specific, rarely occurring string in the "Msg" field of log events, the result will not come back quickly.
Is everything so rosy? Alas, no. A query can be constructed so that you will not see a result for a long time, and the database servers can be flooded with such queries until they can no longer cope. Since the system is used internally, we can rule out deliberate abuse (such people are easy to identify), leaving only accidental overload. We protect against it with a query timeout set via the PostgreSQL command "SET LOCAL statement_timeout TO ...". All the queries to a single database share a total budget of 1200 seconds: the first query gets a 1200-second timeout, the second gets 1200 seconds minus the time spent on the first, and so on. If the deadline cannot be met, an error is returned asking the user to narrow the filter conditions.
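A sketch of that shared time budget, with illustrative helper names. SET LOCAL only applies inside the current transaction, which psycopg2 opens implicitly on the first execute.

import time

TOTAL_BUDGET_S = 1200  # total time allowed for all queries to one database

def run_with_budget(conn, queries):
    """`queries`: list of (sql, params); each query gets whatever budget remains."""
    deadline = time.monotonic() + TOTAL_BUDGET_S
    results = []
    with conn.cursor() as cur:
        for sql, params in queries:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError("time budget exhausted; narrow the filter conditions")
            # statement_timeout is given in milliseconds when no unit is specified.
            cur.execute(f"SET LOCAL statement_timeout TO {int(remaining * 1000)}")
            cur.execute(sql, params)
            results.append(cur.fetchall())
    return results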
Attempts to move to other log storage systems
We made a serious attempt to move log storage to ClickHouse, using the MergeTree engine. Preliminary tests went fine and we rolled the system out to pre-production. Write speed was never in question, and the storage footprint was up to 7 times smaller. Two ClickHouse nodes processed the data, each with 20 cores and 64 GB of memory. (The pre-production PostgreSQL, by the way, was more modest: 8 cores and 32 GB of memory per server.) Be that as it may, the reduction in storage was tempting, and we were even ready to forgive some degradation of read queries compared to PostgreSQL.
But as soon as the number of read requests to a ClickHouse server exceeded a certain level, they slowed down sharply. We could not overcome this, and ClickHouse had to be abandoned. Perhaps the reason is that only one index can be created on a ClickHouse table; many of our log query profiles did not fit that index, and reads slowed down.
Besides ClickHouse, we also looked at ElasticSearch: we loaded the data of one production PostgreSQL log database (~4 TB) into it. It gave a gain of about 15% over PostgreSQL, which is not enough to make moving log storage to Elastic worthwhile.
Author: Alexey Terentyev