Surprises from the Query Planner in PostgreSQL

    Charts, reports and analytics are present in the back office of almost any enterprise, even a very small one. When the usual Excel / Numbers / LibreOffice spreadsheets become too cramped, but the data is still not really "big", in-house solutions are traditionally built on relational databases such as PostgreSQL, MySQL or MariaDB.

    These databases are free, they integrate conveniently with other components of the system thanks to SQL, they are popular, and most developers and analysts know how to work with them. They can also digest a load (both in traffic and in data volume) large enough to calmly hold out until the company can afford more complex (and expensive) analytics and reporting solutions.

    Starting position


    However, even in a technology that has been studied inside and out, there are nuances that can suddenly add to an engineer's worries. Besides reliability, the most frequently mentioned problem with databases is performance. Obviously, as the amount of data grows, response time degrades, but if this happens predictably and in proportion to the growth of the load, it is not so bad: you can see in advance when the database will start demanding attention and plan an upgrade or a move to a fundamentally different database. It is much worse when performance degrades unpredictably.

    The topic of improving database performance is as old as the world and very broad, and in this article I would like to focus on just one aspect: how PostgreSQL evaluates the efficiency of query plans, how that evaluation changes over time, and what can be done to make the planner's behavior more predictable.

    Although much of what will be discussed applies to all recent versions of this database, the examples below assume version 11.2, the latest at the time of writing.
    Before diving into the details, it makes sense to step back and say a few words about where performance problems in relational databases come from in the first place. What exactly is the database busy with when it "slows down"? A lack of memory (leading to a lot of disk or network access) or a weak CPU are obvious problems with clear solutions, but what else can affect query execution speed?

    A quick refresher


    In order to answer an SQL query, the database first needs to build a query plan: which tables and columns to look at, which indexes are needed, what to fetch from them, what to compare with what, how much memory is required, and so on. This plan takes the form of a tree whose nodes are a handful of typical operations with different computational complexity. Here are a few of them (N is the number of rows the operation has to process):

    Data fetch operations (SELECT ... WHERE ...):

    Seq Scan — load every row of the table and check the condition against it.
    Cost: O(N).

    Index Scan (b-tree index) — look up the matching entries in the index and take the data from there.
    Cost: O(log N), the cost of finding an element in a sorted tree.

    Index Scan (hash index) — look up the matching entries in the index and take the data from there.
    Cost: O(1), the cost of looking up an item in a hash table, excluding the cost of computing the hashes.

    Bitmap Heap Scan — collect the numbers of the matching rows from the index, then load only those rows and run the remaining checks on them.
    Cost: Index Scan + Seq Scan(M), where M is the number of rows found by the Index Scan. It is assumed that M << N, i.e. the index is more useful than a Seq Scan.

    Join operations (JOIN, SELECT from multiple tables):

    Nested Loop — for each row of the left table, look for a matching row in the right table.
    Cost: O(N²). But if one of the tables is much smaller than the other (a dictionary table) and practically does not grow over time, the actual cost can drop to O(N).

    Hash Join — compute a hash for each row of the left and right tables, which reduces the number of candidate pairs that have to be checked.
    Cost: O(N), but with a very inefficient hash function or a large number of identical join keys it can degrade to O(N²).

    Merge Join — sort the left and right tables by the join condition, then merge the two sorted lists.
    Cost: O(N log N), the cost of sorting plus a pass over the lists.

    Aggregation operations (GROUP BY, DISTINCT):

    GroupAggregate — sort the table by the grouping condition and then group adjacent rows in the sorted list.
    Cost: O(N log N).

    HashAggregate — compute a hash of the grouping condition for each row; rows with the same hash are aggregated together.
    Cost: O(N).
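    As a rough sketch of how these nodes combine into a plan tree, a join query like the one below might, depending on table sizes and available indexes, be executed as a Hash Join over two Seq Scans, or as a Nested Loop over an Index Scan (t1 and t2 here are hypothetical tables):

    -- The planner turns this into a tree of the node types listed above;
    -- EXPLAIN (discussed below) shows which ones it actually picked
    EXPLAIN
    SELECT t1.important_value, t2.name
    FROM t1
    JOIN t2 ON t2.id = t1.t2_id
    WHERE t1.a > 100;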

    As you can see, the cost of a query depends heavily on how the data is laid out in the tables and how well that layout matches the operations used in the query. A Nested Loop, despite its O(N²) cost, can be more profitable than a Hash Join or Merge Join when one of the joined tables degenerates to one or a few rows.

    Besides CPU time, cost also includes memory usage. Both are limited resources, so the query planner has to find a compromise. If it is mathematically cheaper to join two tables with a Hash Join but there is simply no room in memory for such a large hash table, the database may be forced to use a Merge Join instead. A "slow" Nested Loop, on the other hand, generally requires no additional memory and is ready to start producing results right after launch.
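    The amount of memory a single sort or hash node may use is bounded by the work_mem setting. A quick sketch of checking it and raising it for the current session only (the value being set is purely illustrative):

    SHOW work_mem;           -- 4MB by default
    SET work_mem = '256MB';  -- may let the planner keep a Hash Join that otherwise would not fit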

    The relative cost of these operations is easier to see on a chart. These are not absolute numbers, just an approximate ratio between the different operations.



    The Nested Loop curve "starts" lower because it requires no extra computation, no memory allocation and no copying of intermediate data, but its cost grows as O(N²). Merge Join and Hash Join have higher start-up costs, but after certain values of N they begin to beat Nested Loop. The planner tries to pick the plan with the lowest cost, and on the chart above it follows different operations for different N (the green dashed arrow): up to N1 rows it is cheaper to use Nested Loop, from N1 to N2 Merge Join wins, and after N2 Hash Join becomes cheaper. Hash Join, however, needs memory for its hash tables, and when N3 is reached that memory runs out, forcing the database back to Merge Join.

    When choosing a plan, the planner estimates the cost of each operation in it using a set of relative costs of certain "atomic" operations in the database: computations, comparisons, loading a page into memory, and so on. Here is a list of some of these parameters from the default configuration; there are not many of them:

    Relative cost constant    Default value
    seq_page_cost             1.0
    random_page_cost          4.0
    cpu_tuple_cost            0.01
    cpu_index_tuple_cost      0.005
    cpu_operator_cost         0.0025
    parallel_tuple_cost       0.1
    parallel_setup_cost       1000.0

    These constants alone are not enough, though: you also need to know that very "N", i.e. exactly how many rows from the previous results each operation will have to process. The upper bound is obvious here: the database "knows" how much data is in each table and can always estimate "to the maximum". For example, if you have two tables of 100 rows each, joining them can produce anywhere from 0 to 10,000 rows, so the next operation in the plan can receive up to 10,000 rows as input.

    But if you know at least something about the nature of the data in the tables, the number of rows can be predicted much more accurately. For example, for the two 100-row tables above, if you know in advance that the join will produce not 10 thousand rows but the same 100, the estimated cost of the next operation drops sharply. In that case this plan may turn out to be cheaper than the alternatives.

    Out of the box optimization


    To let the planner predict the size of intermediate results more accurately, PostgreSQL collects statistics on tables, accumulated in pg_statistic or in its more readable view pg_stats. The statistics are updated automatically when autovacuum runs, or explicitly with the ANALYZE command. They hold a variety of information about what the data in the tables looks like: value histograms, the fraction of empty fields, and other details. The planner uses all of this to predict the amount of data flowing through each operation of the plan tree more accurately, and thus to compute the cost of the operations and of the plan as a whole.
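    For instance, you can refresh the statistics by hand and peek at what the planner knows about a column; a minimal sketch, using the hypothetical table t1 from the example below:

    -- Refresh the statistics for the table explicitly
    ANALYZE t1;

    -- See what the planner knows about column t1.a
    SELECT null_frac, n_distinct, most_common_vals, histogram_bounds
    FROM pg_stats
    WHERE tablename = 't1' AND attname = 'a';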

    Take for example the query:
    SELECT t1.important_value FROM t1 WHERE t1.a > 100


    Suppose the histogram of values in the t1.a column shows that values greater than 100 occur in roughly 1% of the table's rows. Then we can predict that this query will return about a hundredth of all the rows of t1.
    The database lets you look at the predicted cost of a plan with the EXPLAIN command, and at its actual execution time with EXPLAIN ANALYZE.
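    A minimal sketch of both, on the same hypothetical query (EXPLAIN only plans the query, EXPLAIN ANALYZE actually runs it):

    -- Predicted cost and estimated row counts only; the query is not executed
    EXPLAIN SELECT t1.important_value FROM t1 WHERE t1.a > 100;

    -- The query is executed: estimates are shown alongside actual row counts and timings
    EXPLAIN (ANALYZE, BUFFERS) SELECT t1.important_value FROM t1 WHERE t1.a > 100;

    Comparing the estimated row counts with the actual ones in this output is the quickest way to spot where the planner's forecast went wrong.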

    It may seem that with automatic statistics everything should now be fine, but difficulties are still possible. There is a good article about this from Citus Data, with an example of automatic statistics being insufficient and of collecting additional statistics with CREATE STATISTICS (available since PostgreSQL 10).
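    As a sketch of that feature (the table addresses and its columns city and zip are made up for this example), extended statistics let you tell the planner that two columns are correlated:

    -- Record that city and zip are functionally dependent, so that
    -- "WHERE city = ... AND zip = ..." is not treated as two independent filters
    CREATE STATISTICS addresses_city_zip (dependencies) ON city, zip FROM addresses;
    ANALYZE addresses;   -- the new statistics are filled in on the next ANALYZE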

    So, for the planner there are two sources of error when calculating costs:

    1. The relative costs of primitive operations (seq_page_cost, cpu_operator_cost and so on) may by default be quite far from reality (processing one row costs 0.01, loading a sequential page costs 1, a random page 4). It is far from guaranteed that 100 row comparisons really cost as much as one page load.
    2. Errors in predicting the number of rows in intermediate operations. The actual cost of an operation can then differ wildly from the forecast.

    For complex queries, merely enumerating and costing all possible plans can itself take a lot of time. What good is it if the data comes back in 1 second but the database spent a minute planning the query? For this situation PostgreSQL has the genetic query optimizer (GEQO): a planner that does not build every possible plan, but starts from a few random ones and refines the best of them, guessing at ways to reduce the cost. This does not improve the accuracy of the forecast either, but it does speed up finding at least some reasonably good plan.
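    A small sketch of checking when it kicks in (the values in the comments are the stock defaults and may differ in your installation):

    SHOW geqo;            -- on: the genetic optimizer is enabled
    SHOW geqo_threshold;  -- 12: queries joining this many or more tables are planned with GEQO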

    Sudden rival plans


    If everything goes well, your query runs as fast as it can. As the amount of data grows, query execution time gradually increases too, and by watching it over time you can roughly predict when you will need to add memory or CPU cores, expand the cluster, and so on.

    But we have to keep in mind that the optimal plan has competitors with similar predicted costs which we simply do not see. And if the database suddenly switches the query to another plan, it comes as a surprise. It is fine if the database jumps to a more efficient plan. And if not? Let's look at the picture below, for example: it shows the predicted cost and the real execution time of two plans (red and green).



    Here one plan is shown in green and its closest "competitor" in red. The dotted lines show the predicted cost, the solid lines the real execution time. The gray dashed arrow marks the planner's choice.

    Suppose that one fine Friday evening the predicted number of rows in some intermediate operation reaches N1 and the "red" forecast starts to beat the "green" one. The planner switches to it. The actual query execution time jumps immediately (from the green solid line to the red one), i.e. the degradation graph of the database takes the form of a step (or perhaps a "wall"). In practice such a "wall" can increase query execution time by an order of magnitude or more.

    It is worth noting that this situation is probably more typical of back-office and analytics workloads than of the front end: the latter is usually tuned for many concurrent requests and therefore uses simpler queries, where the error in the plan forecasts is smaller. If the database serves reporting or analytics, queries can be arbitrarily complex.

    How to live with it?


    The question arises: could such invisible "submerged" plans have been foreseen somehow? After all, the problem is not that they are suboptimal, but that the switch to another plan can happen unpredictably and, by the law of meanness, at the most unfortunate moment.

    Unfortunately, you cannot see them directly, but you can search for alternative plans by changing the actual weights by which they are chosen. The idea is to push the plan that the planner currently considers optimal out of sight, so that one of its closest competitors becomes the "optimal" one and can therefore be seen with the EXPLAIN command. By periodically checking how the costs of such "competitors" and of the main plan change, you can estimate how likely the database is to "jump" to another plan soon.

    Besides collecting the cost forecasts of alternative plans, you can also run them and measure their performance, which also gives an idea of the internal "well-being" of the database.
    Let's see what tools we have for such experiments.

    First, you can explicitly "prohibit" specific operations with session variables. Conveniently, there is no need to change the config and restart the database: the value changes only in the current open session and does not affect other sessions, so you can experiment directly on real data. Here is the list of these variables with their default values; almost all operations are covered:
    Session variable                  Default value
    enable_bitmapscan                 on
    enable_hashagg                    on
    enable_hashjoin                   on
    enable_indexscan                  on
    enable_indexonlyscan              on
    enable_material                   on
    enable_mergejoin                  on
    enable_nestloop                   on
    enable_parallel_append            on
    enable_seqscan                    on
    enable_sort                       on
    enable_tidscan                    on
    enable_parallel_hash              on
    enable_partition_pruning          on
    enable_partitionwise_join         off
    enable_partitionwise_aggregate    off

    By prohibiting or allowing specific operations we force the planner to pick other plans, which we can then see with the same EXPLAIN command. In fact, "prohibiting" an operation does not forbid its use, it simply makes it vastly more expensive: in PostgreSQL every "forbidden" operation automatically gets a cost of 10 billion conventional units added to it. The total cost of the plan in EXPLAIN may therefore look astronomically high, but against the background of those tens of billions the weight of the remaining operations is still clearly visible, since it usually stays several orders of magnitude smaller.
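    A minimal sketch of the whole experiment in one session (the query here is the hypothetical one from earlier; in a real investigation substitute the query you are analyzing):

    -- Hide the currently "optimal" plan by banning Nested Loop for this session only
    SET enable_nestloop = off;

    -- The plan shown now is the closest competitor; forced Nested Loops, if any remain,
    -- will stand out with a ~10-billion cost penalty
    EXPLAIN SELECT t1.important_value FROM t1 WHERE t1.a > 100;

    -- Put the session back to the server default
    RESET enable_nestloop;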

    Of particular interest are the following two operations:

    • Hash Join. Its complexity is O(N), but if the forecast of the result size is wrong, the hash table may not fit in memory and the database will have to fall back to a Merge Join, with a cost of O(N log N).
    • Nested Loop. Its complexity is O(N²), so an error in the size forecast affects the speed of such a join quadratically.

    As an example, let's take some real numbers from queries we were optimizing in our company.

    Plan 1. With all operations allowed, the total cost of the best plan was 274962.09 units.

    Plan 2. With Nested Loop "forbidden", the cost grew to 40000534153.85. The 40 billion that make up the bulk of this cost correspond to four Nested Loop operations that were used anyway, despite the "ban". The remaining 534153.85 is the predicted cost of all other operations in the plan; as we can see, it is roughly twice the cost of the optimal plan, i.e. fairly close to it.

    Plan 3. With Hash Join "forbidden", the cost was 383253.77. This plan was really built without any Hash Join operations, since we see no billions in it. Its cost, however, is 30% higher than the optimum, which is also quite close.

    In reality, the query execution times were as follows:

    Plan 1 (all operations are allowed) completed in ~ 9 minutes.
    Plan 2 (with the “forbidden” nested loop) completed in 1.5 seconds.
    Plan 3 (with a “forbidden” hash join) was completed in ~ 5 minutes.

    The reason, as you can see, is a wrong estimate of the cost of the Nested Loop. Indeed, comparing EXPLAIN with EXPLAIN ANALYZE reveals an error in determining that ill-fated N in an intermediate operation: instead of the predicted single row, the Nested Loop ran into several thousand rows, which blew up the query execution time by a couple of orders of magnitude.

    The savings with Hash Join "forbidden" come from replacing hashing with sorting and a Merge Join, which in this case worked faster than the Hash Join. Note that this plan 3 is in reality almost twice as fast as the "optimal" plan 1, even though it was predicted to be slower.

    In practice, if your query suddenly (after a database upgrade or just on its own) started running much longer than before, first try forbidding either Hash Join or Nested Loop and see how this affects the query speed. If you are lucky, you will at least be able to ban the new non-optimal plan and return to the previous fast one.

    To do this you do not need to edit the PostgreSQL configuration files or restart the database: from any console you can simply change the value of the relevant variable for your open session. Other sessions will not be affected, the configuration changes only for your current session. For example, like this:

    SET enable_hashjoin='on';
    SET enable_nestloop='off';
    SELECT … 
    FROM … 
    (and the rest of the query being analyzed)
    

    The second way to influence the plan choice is to change the weights of the low-level operations. There is no universal recipe here, but, for example, if you have a database with a "warmed up" cache and the whole data set fits in memory, it is likely that loading a random page costs no more than loading a sequential one, whereas in the default config a random page is 4 times more expensive than a sequential one.

    Another example: the conditional cost of setting up parallel processing is 1000 by default, while the cost of loading a page is 1.0. It makes sense to change only one parameter at a time to see whether it affects the plan choice at all. The simplest experiments are setting a parameter to 0 or to some very high value (say, 1 million).
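    A sketch of such a per-session experiment (the values are illustrative, not a recommendation):

    -- Assume the data set is fully cached, so a random page read
    -- is nearly as cheap as a sequential one
    SET random_page_cost = 1.1;
    EXPLAIN SELECT t1.important_value FROM t1 WHERE t1.a > 100;

    -- Return the session to the server default before the next experiment
    RESET random_page_cost;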

    Keep in mind, however, that by improving performance for one query you may degrade it for another. In general, this is a wide field for experiments; it is better to change the parameters one at a time.

    Alternative Treatment Options


    A story about the planner would be incomplete without mentioning at least two PostgreSQL extensions.

    The first is SR_PLAN, for saving a calculated plan and forcing its use later on. This helps make the database's behavior more predictable in terms of plan choice.

    The second is the Adaptive Query Optimizer, which implements feedback to the planner from real query execution: the planner measures the actual results of executed queries and adjusts its future plans with that in mind. The database thus "tunes itself" to the specific data and queries.

    What else does the database do when it slows down?


    Now that we have more or less sorted out query planning, let's see what else can be improved, both in the database itself and in the applications using it, to squeeze the maximum performance out of it.

    Suppose the query plan is already optimal. If we rule out the most obvious problems (not enough memory, a slow disk or network), there are still the costs of computing hashes. There are probably great opportunities for future PostgreSQL improvements here (using the GPU or the SSE2 / SSE3 / AVX instructions of the CPU), but so far this has not been done, and hash calculations hardly ever use the capabilities of the hardware. You can, however, help the database a little here yourself.

    By default, indexes in PostgreSQL are created as b-tree indexes. Their advantage is versatility: such an index can be used both with equality conditions and with comparisons (greater / less than), and finding an element in it has logarithmic cost. But if your query uses only equality conditions, the index can also be created as a hash index, whose lookup cost is constant.
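    A minimal sketch (the column external_id is hypothetical; a hash index is only worth it for columns queried with equality alone):

    -- A hash index: constant-cost lookups, but usable only for equality conditions
    CREATE INDEX t1_external_id_hash_idx ON t1 USING hash (external_id);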

    Next, you can try to rewrite the query so that it can use parallel execution. To understand exactly how to do that, it is best to read the list of cases in which the planner automatically disables parallelism and avoid those situations. The guide on this topic describes all of them briefly, so there is no point in repeating them here.
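    Before rewriting anything, it is worth checking that parallelism is allowed at all in your session; a small sketch (the defaults mentioned in the comments may differ in your installation):

    SHOW max_parallel_workers_per_gather;   -- 0 disables parallel query entirely; 2 is the stock default

    -- Allow more workers for this session only and check whether the plan
    -- grows Gather / Parallel Seq Scan nodes
    SET max_parallel_workers_per_gather = 4;
    EXPLAIN SELECT count(*) FROM t1;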

    What can you do if the query still refuses to parallelize well? It is sad to watch how in your powerful multi-core database, where you are the only client, one core is busy at 100% while all the other cores just look on. In this case you have to help the database from the application side. Since each session gets its own core, you can open several sessions and split the big query into parts, issuing shorter and faster selects and combining them into the final result in the application. This lets you occupy the maximum of the CPU resources available to the PostgreSQL database.
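    A sketch of the idea, assuming the hypothetical t1 has a numeric id column that can be split into ranges:

    -- Session 1 (opened by the application):
    SELECT sum(important_value) FROM t1 WHERE id BETWEEN 1 AND 1000000;

    -- Session 2, running at the same time on another connection:
    SELECT sum(important_value) FROM t1 WHERE id BETWEEN 1000001 AND 2000000;

    -- The application then adds the partial sums together to get the final result.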

    In conclusion, I would like to note that the diagnostic and optimization possibilities described above are just the tip of the iceberg, but they are quite easy to use and can help you quickly localize a problem directly on live data, without the risk of breaking the config or disrupting the operation of other applications.

    Good luck with your queries, and may their plans be accurate and short.
