PostgreSQL on multi-core Power8 servers


    Abstract


    With the help of IBM's Moscow office, we tested the performance of the latest versions of the PostgreSQL DBMS on Power8 servers, studied how performance scales with the number of simultaneous queries, found the bottlenecks that limited performance, proposed new technical solutions, and achieved record performance.

    Introduction


    For a number of tasks, virtually unlimited scaling of transaction throughput can be achieved with distributed systems, in which the transaction flow is spread across a large number of servers in one way or another. Such scaling is often called “horizontal.” However, no universal distributed solution exists, and distribution comes at a price: the system architecture must be designed as distributed in advance. Distributed systems are less flexible than monolithic ones, they are harder to operate, and they require more highly qualified personnel. Some tasks are easier to parallelize, others harder. Therefore the demand for high-performance monolithic systems persists, and achieving the best possible results on a single server has been and remains an important task. This is often called “vertical scaling.”

    The essence of the problems arising from the parallel processing of a large number of transactions is the same in monolithic and distributed systems: contention between transactions for access to the same resources. Simply put, individual processes run in parallel and independently until they line up for a shared resource (which can be either a hardware resource or a piece of data stored in the database) and begin to wait for one another.

    To solve such problems, there are mechanisms for controlling access to resources: locks, as well as lock-free approaches that are suitable in some cases. Making these mechanisms more efficient, and making locks more granular, reduces the overhead of concurrent access.

    And while in distributed systems the bottleneck is usually the network, in monolithic systems running close to peak performance, growth is limited by these very mechanisms for controlling concurrent access.

    PostgreSQL scaling studies


    In 2012, at the PgCon conference, Robert Haas of EnterpriseDB presented the results of a study of PostgreSQL performance (the number of read queries per second) as a function of the number of simultaneously running client sessions (Fig. 1). The measurements were made on the amd64 platform using the pgbench tool shipped with PostgreSQL.

    Fig. 1. Graph from R. Haas's talk at the PgCon 2012 conference, the 32-core case

    Before analyzing this graph, let us note a few important points. PostgreSQL's architecture is such that each client is served by its own single-threaded server process, and disk I/O goes through a buffer cache located in shared memory common to all such processes. Thus, while a number of client sessions can work simultaneously and independently, the data in shared memory is itself a shared resource that is accessed concurrently. The disk subsystem is also a shared resource, but it was not considered in this study: all data was already in the buffer cache, so the measurements show the maximum achievable performance.

    In this graph, the growth in the number of transactions per second (TPS) with the number of clients is initially close to linear: the overhead of accessing shared resources is negligible, and each additional client fully occupies another processor core. However, as concurrency increases, the linear growth stops. When linear growth ceases while the number of parallel sessions is still below the number of processor cores, it means that the CPU is being used inefficiently and the cost of concurrent access is high.

    Fig. 1 shows that in the transition from version 9.1 to 9.2, thanks to improved locking mechanisms, performance kept growing steadily up to at least 32 concurrently working clients (the test machine had 32 cores, so performance no longer grew as the number of clients increased further). A little later, R. Haas made similar measurements on 64 cores (Fig. 2). It can be seen that near-linear vertical scalability is preserved even with this number of cores.

    Fig. 2. Graph from R. Haas's talk at the PgCon 2012 conference, the 64-core case

    These graphs, and the research behind them, inspired us to see how modern versions of PostgreSQL behave on modern multiprocessor servers. IBM, which supplies such servers on the Power platform, became interested as well (IBM sold its x86 server business to Lenovo in 2014, following its PC business earlier). Here is what came of it.

    Our research


    In February 2015, the best-known Russian PostgreSQL developers founded Postgres Professional, a company whose goal is the development of the PostgreSQL DBMS and the provision of a full range of related services. The creation of the company increased the amount of PostgreSQL development work done in Russia severalfold, made applied research possible, and opened the way to partnerships with software and hardware vendors. From the very beginning, Postgres Professional showed interest in partnering with IBM and joined the IBM PartnerWorld partner network. In May 2015, Terri Virnig, IBM's vice president of Power Ecosystem and Strategy, visited Moscow, confirming IBM's strong interest in PostgreSQL and in adapting it for maximum performance on the Power platform.

    As part of the partnership, in August 2015 an IBM E880 server in a two-node configuration was allocated for PostgreSQL testing at the IBM client center in Poughkeepsie, NY. Each node contained four eight-core processors with SMT (simultaneous multithreading) technology, each core providing up to 8 hardware threads. This configuration thus provides up to 2 * 4 * 8 * 8 = 512 threads of execution (in IBM terminology, “logical CPUs”, LCPUs). It is important to note that the SMT implementation in IBM Power8 differs from its implementation in Intel processors (known as Hyper-Threading): unlike the latter, it provides a noticeable, up to two-fold, performance gain on PostgreSQL workloads. The number of threads (LCPUs) can be controlled by setting it to 1, 2, 4, or 8 per CPU core. The testing methodology had previously been tried out on less powerful servers in Moscow. As in R. Haas's experiments, the tests were run with the pgbench utility shipped with PostgreSQL.

    The test scenario was a lookup query through a B-tree index on data that fit entirely in memory. The PostgreSQL server was started in a large LPAR and, using the standard NUMA facilities of the Linux kernel, pinned to a given number of NUMA nodes. The aim of the test was to achieve as many transactions per second (TPS) as possible.

    The very first run showed that when more cores are added, performance does not grow and even drops.

    Fig. 3. Performance versus the number of clients, for different numbers of LCPUs

    To understand why performance dropped as LCPUs were added, one needs to picture how the IBM E880 server is built. It has two motherboards, each carrying 4 processors. The motherboards are connected to each other by an inter-node interconnect. You can read more about the internal structure of IBM Power8 servers in a series of articles from IBM. This interconnect, although very fast, turned out to be predictably slower than the interaction of processors located on the same motherboard, and this would be noticeable not only in PostgreSQL but in any other system where processes interact through shared memory. Therefore, in the further search for bottlenecks we excluded the influence of the interconnect and used a single compute node, i.e. 256 LCPUs with SMT=8.

    First we tested the latest official PostgreSQL release at the time of testing (version 9.4.5), which scaled linearly only up to 32 clients; then the current snapshots of the 9.5 and 9.6 branches, on which near-linear growth continued up to 64 clients.


    Fig. 4. Performance depending on the number of threads

    Version 9.5 shows an improvement over version 9.4, which can be explained by the optimization of the LWLock mechanism (more on this below).

    So, the measurements had been made. But we decided not to stop there and to move on: to try to find a bottleneck in the code, eliminate it, and achieve a better result. The perf utility was used to look for bottlenecks; it showed that about 32% of CPU time was spent executing the s_lock function (Fig. 5).

    32.10% postgres       [.] s_lock
    7.77%  postgres       [.] GetSnapshotData
    2.64%  postgres       [.] AllocSetAlloc
    1.40%  postgres       [.] hash_search_with_hash_value
    1.37%  postgres       [.] base_yyparse
    1.36%  postgres       [.] SearchCatCache
    1.32%  postgres       [.] PinBuffer
    1.23%  postgres       [.] LWLockAcquire
    1.05%  postgres       [.] palloc
    1.01%  postgres       [.] ReadBuffer_common
    0.99%  postgres       [.] LWLockRelease
    0.94%  libc-2.17.so   [.] __memset_power7
    

    Fig. 5. An example of perf top output during testing

    To identify the places in the PostgreSQL code responsible for the s_lock calls, the gdb debugger was used (the performance of single-threaded perf was insufficient for building the call graph). Sampling backtraces with gdb showed that in most cases s_lock is called from the PinBuffer and UnpinBuffer functions, which increment and decrement the reference count of the buffer in which a particular data block is held. The reference count is protected by the BufferDesc.buf_hdr_lock spinlock, so working with it requires first acquiring that spinlock. To understand this spot better, let's look at what types of locks exist in PostgreSQL and why they are needed.

    Locks in PostgreSQL: why so many different kinds?


    PostgreSQL implements several types of locks, each solving its own set of problems.
    1. Spinlock is the simplest type of lock, with only two states: “busy” and “free”. Acquiring a spinlock consists of repeated attempts to change its state from “free” to “busy”. To avoid races, spinlocks use the atomic Test-and-Set (TAS) operation. Processes waiting for a spinlock do not queue. The DBMS supports neither monitoring of spinlocks nor automatic detection of deadlocks on them; all of this was sacrificed for high performance on very short operations. PostgreSQL contains assembler implementations of spinlocks for various hardware platforms; where no assembler implementation is available, spinlocks are emulated with UNIX semaphores, which are, however, much slower. (A minimal illustration in C follows this list.)
    2. Lightweight lock (LWLock) is a more complex type of lock with two levels: shared and exclusive. Either an unlimited number of shared locks or a single exclusive lock can be held at any moment. The LWLock implementation changed significantly between PostgreSQL 9.4 and 9.5: in 9.4 and below, an LWLock consisted of a set of fields protected by a spinlock, while starting with 9.5 most LWLock operations bypass the spinlock and use atomic operations on the LWLock “state variable”. Processes waiting for an LWLock are queued, so a long wait does not waste bus resources. Automatic deadlock detection for LWLocks is not supported; the DBMS algorithms are built so that such deadlocks never occur. Monitoring of LWLocks is being added (by ildus) as part of the more general task of monitoring wait events.
    3. Heavyweight lock (HWLock) is the most “advanced” type of lock. It has many levels and a non-trivial conflict matrix. HWLocks are taken on tables, indexes, and other database objects, both implicitly by the SQL commands that use them and explicitly by the LOCK command. The pg_locks system view exists for monitoring HWLocks. As the name implies, HWLocks are a rather heavyweight mechanism, so a special fastpath locking mechanism is provided for taking them in DML commands. In addition, since the order in which HWLocks are taken depends on user actions, the absence of deadlocks cannot be guaranteed, so PostgreSQL implements automatic deadlock detection.
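
    For illustration, a minimal TAS spinlock in C might look like the following. This is only a sketch using C11 atomics, with names of our own; PostgreSQL's actual s_lock is platform-specific assembler.

    #include <stdatomic.h>

    /* Sketch of a TAS spinlock: two states, "free" (clear) and "busy" (set).
     * Initialize with ATOMIC_FLAG_INIT. */
    typedef struct
    {
        atomic_flag busy;
    } SpinlockSketch;

    static void
    spin_acquire(SpinlockSketch *lock)
    {
        /*
         * Repeatedly try to flip the state from "free" to "busy".
         * atomic_flag_test_and_set is the TAS operation; waiters do not
         * queue, they simply keep retrying and burn CPU while they wait.
         */
        while (atomic_flag_test_and_set_explicit(&lock->busy,
                                                 memory_order_acquire))
            ;
    }

    static void
    spin_release(SpinlockSketch *lock)
    {
        atomic_flag_clear_explicit(&lock->busy, memory_order_release);
    }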


    In addition to the locks listed above, PostgreSQL has row-level locks, predicate locks, advisory locks, and others, which are in turn implemented using the three basic types described above.

    How we accelerated PostgreSQL


    To find out whether performance could be improved at the bottleneck we had discovered, we, together with leading PostgreSQL developer Andres Freund, developed a patch that replaces the buf_hdr_lock spinlock with atomic operations on the buffer “state”. The patch thus applies to buffers an optimization similar to the one made for LWLocks in version 9.5: PinBuffer now performs a Compare-and-Swap (CAS) operation in a loop, and UnpinBuffer becomes a single atomic decrement. This extended the region of near-linear growth roughly twofold and achieved a performance of more than 800 thousand TPS (Fig. 6).
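
    Schematically, the change looks as follows. This is a simplified sketch in C11 atomics, not the actual patch: in reality the buffer state word also packs flag bits next to the reference count, and PostgreSQL uses its own pg_atomic_* wrappers.

    #include <stdatomic.h>

    /* Sketch: here the buffer "state" is just a reference count. */

    /* PinBuffer after the patch: a CAS loop on the state variable itself,
     * with no buf_hdr_lock spinlock around it. */
    static void
    pin_buffer_sketch(atomic_uint *state)
    {
        unsigned old = atomic_load_explicit(state, memory_order_relaxed);

        /* On failure, 'old' is refreshed with the value actually seen. */
        while (!atomic_compare_exchange_weak_explicit(state, &old, old + 1,
                                                      memory_order_acquire,
                                                      memory_order_relaxed))
            ;
    }

    /* UnpinBuffer after the patch: a single atomic decrement, no loop in C. */
    static void
    unpin_buffer_sketch(atomic_uint *state)
    {
        atomic_fetch_sub_explicit(state, 1, memory_order_release);
    }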


    Fig. 6. Performance measurements with the developed patch

    However, we did not stop there either. If you look at the assembler implementation of the CAS operation on Power8, you can see that it is itself a loop (Fig. 7).

    # Input parameters:
    #  r3 – old value, r4 – new value
    #  r5 – address of the atomic variable
    .L1: lwarx 9,0,5
         cmpw 0,9,3
         bne- 0,.L2
         stwcx. 4,0,5
         bne- 0,.L1
    .L2: isync
    

    Figure 7. Implementation of the atomic CAS operation in Power8 assembler.

    Thus, a CAS operation executed in a loop (like the while loop in the PinBuffer sketch above) is a loop within a loop. Under heavy contention for the variable being changed, this can be noticeably slower than executing a single loop.

    If you look at the assembler implementation of the atomic increment operation, you can see that it is a similar loop, but with the add instruction inside it.

    # Input parameters:
    #  r3 – increment
    #  r5 – address of the atomic variable
    .L1: lwarx 9,0,5
         add 9,9,3
         stwcx. 9,0,5
         bne- 0,.L1
         isync
    

    Figure 8. Implementation of the atomic increment operation in Power8 assembler.

    Obviously, more complex computation could be placed between the lwarx and stwcx instructions.
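
    At the C level, by contrast, the atomic increment needs no outer retry loop: a single call suffices, and on Power the compiler emits exactly one lwarx/add/stwcx loop like the one in Fig. 8 (again a sketch with C11 atomics). Placing anything more complex between lwarx and stwcx, however, is beyond what the standard atomics expose and calls for assembler inserts.

    #include <stdatomic.h>

    /* Sketch: one atomic increment - a single hardware LL/SC loop (Fig. 8),
     * with no compare-and-retry loop at the C level. */
    static unsigned
    increment_state(atomic_uint *state)
    {
        return atomic_fetch_add_explicit(state, 1, memory_order_acq_rel);
    }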

    Based on this, two ideas for further optimization were born.
    1. Implement performance-critical functions such as PinBuffer and LWLockAttemptLock using assembler inserts. The downside of this option is that we abandon the abstraction level of the atomic operations the compiler provides us.
    2. An alternative idea is to use the atomic increment and implement a so-called “optimistic approach”: in PinBuffer and LWLockAttemptLock, atomically increment the “state variable” and then, from the value actually observed, check whether we had the right to make such a state change. If we did not, roll the change back with the inverse atomic operation. The calculation, and the “optimism” of the approach, is that the change will have to be rolled back only in a very small fraction of cases. Of course, all other functions working with the “state variable” must take this into account as well. (A sketch of this idea follows the list.)
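
    Here is what the optimistic idea might look like in C11 atomics. The state layout and the names are ours, purely for illustration; the real LWLock state packs more information.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Hypothetical layout: the low bit marks an exclusive holder, the
     * remaining bits count shared holders. */
    #define EXCLUSIVE_FLAG 1u
    #define SHARED_ONE     2u

    /* Optimistically take a shared lock: increment first, validate the
     * value we actually saw, and roll back in the (hopefully rare) case
     * where we had no right to the change. */
    static bool
    lock_shared_optimistic(atomic_uint *state)
    {
        unsigned seen = atomic_fetch_add_explicit(state, SHARED_ONE,
                                                  memory_order_acquire);

        if (seen & EXCLUSIVE_FLAG)
        {
            /* Conflict: undo our increment and report failure. */
            atomic_fetch_sub_explicit(state, SHARED_ONE, memory_order_release);
            return false;
        }
        return true;
    }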

    Both ideas were implemented as patches. The atomic-increment variant was tested in several versions, so that the effect of applying this approach to PinBuffer and to LWLockAttemptLock could be evaluated separately.


    Fig. 9. Performance measurements for the various patches

    The graph (Fig. 9) compares the performance of the following patches:
    1. pinunpin-cas - PinBuffer uses a CAS loop.
    2. pinunpin-increment - PinBuffer uses an “optimistic” atomic increment.
    3. pinunpin-cas-lwlock-increment - PinBuffer uses a CAS loop, LWLockAttemptLock uses an “optimistic” atomic increment.
    4. pinunpin-lwlock-increment - PinBuffer and LWLockAttemptLock both use an “optimistic” atomic increment.
    5. pinunpin-lwlock-asm - PinBuffer and LWLockAttemptLock are implemented in assembler.

    It can be seen that the “optimistic” atomic increment has almost no effect in PinBuffer, while applying the same approach to LWLockAttemptLock has a very large one. Roughly the same effect can be obtained with the assembler optimizations.

    Figure 10 shows how the perf top picture changed after the patch was applied. The bottleneck has moved to the GetSnapshotData function, which perhaps could also be accelerated, but at this point the time allotted to us for testing ran out, and we could not improve the results further. Nevertheless, the results achieved are of great value: a significant performance increase was obtained.

    13.75% postgres       [.] GetSnapshotData
    4.88%  postgres       [.] AllocSetAlloc
    2.47%  postgres       [.] LWLockAcquire
    2.11%  postgres       [.] hash_search_with_hash_value
    2.02%  postgres       [.] SearchCatCache
    2.00%  postgres       [.] palloc
    1.81%  postgres       [.] base_yyparse
    1.69%  libc-2.17.so   [.] __memset_power7
    1.63%  postgres       [.] LWLockRelease
    1.56%  libc-2.17.so   [.] __memcpy_power7
    1.33%  postgres       [.] _bt_compare
    0.99%  postgres       [.] core_yylex
    0.99%  postgres       [.] expression_tree_walker
    

    Fig. 10. Example of perf top output during testing after applying the experimental pinunpin-lwlock-asm patch.

    It should be noted that ours is not the only team working on the vertical scalability of Postgres. There is a general movement toward a more economical use of locks; this was discussed, in particular, in Andres Freund's talk at the October pgconf.eu conference. And, as already mentioned, version 9.5 includes LWLock optimizations.

    Conclusions


    As an experiment, we replaced the low-level locking system in PostgreSQL, which allowed us to roughly double scalability with the number of processor cores and thereby make PostgreSQL better suited for efficient use on very large Power8 servers.

    We tested patches that optimize the critical code sections used in locks and in the buffer manager, both in assembler and with the atomic operations provided by the compiler. The results showed that atomic operations, at a higher level of abstraction, can achieve almost the same performance as assembler inserts.

    The open-source IBM Advanced Toolchain (essentially a version of gcc with runtime libraries optimized for PPC) showed better performance than vanilla gcc and is recommended for building. XLC on Linux still has unresolved problems with PostgreSQL and is not recommended.

    PostgreSQL showed solid performance in an LPAR. Hardware partitioning with PowerVM turned out to be a powerful and convenient virtualization system, with which the performance of a single LPAR and of the PostgreSQL server inside it can be scaled up.

    SMT in Power8 can provide more than a twofold performance increase on tasks similar to those we tested on.

    The latest asm patch allowed us to reach a total performance of more than 2 million transactions per second on 48 Power8 cores. This was achieved with two instances of PostgreSQL running simultaneously on different compute nodes.

    In practice, this means that you do not need to worry about how well PostgreSQL scales across cores until you reach a performance of about 1 million transactions per second on top-end hardware, or run into contention on a single heavily used memory page (for example, the root of a B-tree). As a rule, our clients with >100 physical cores experience no problems with PostgreSQL scaling and do not feel the effects described here.

    Acknowledgments


    This study was conducted as part of the activities of the Russian PostgreSQL vendor Postgres Professional, a member of the IBM PartnerWorld program. The authors thank the Moscow office of IBM, and personally Denis Sosnovtsev and Alexander Lunyakov, for their interest in Postgres Professional's research and for help in getting access to the test equipment; Ivan Goncharov for valuable advice on its use; IBM vice president Terri Virnig for her support; and the staff of the IBM Poughkeepsie Client Center and the IBM Cambridge Innovation Center for organizing the tests.

    I would also like to thank my co-authors, Postgres Professional employees Dmitry Vasiliev (aka vadv), who directly ran the tests and wrote a significant part of this article, and Yuri Zhuravlev (aka stalkerg), who developed the initial version of the patch converting PinBuffer/UnpinBuffer to atomic operations.
