How we made PHP 7 twice as fast as PHP 5. Part 1: optimizing data structures

    In December 2015 , PHP 7.0 was released. Companies that switched to the "seven" noted that productivity has increased, and the load on the server has decreased. The first to move to the seven were Vebia and Etsy, and we have Badoo, Avito and OLX. For Badoo, switching to the seven cost $ 1 million in server savings. Thanks to PHP 7 in OLX, the average server load decreased by 3 times, increased efficiency and resource savings.

    Dmitry Stogov from Zend Technologies spoke at HighLoad ++ , which increased productivity. In decoding: about the internal structure of PHP, about the ideas at the heart of version 7.0, about changes in the basic data structures and algorithms that determined success.

    Disclaimer: As of March 2019, 80% of sites run on PHP, and70% of them are in PHP 5, although since January 1, 2019 this version is not supported . Dmitry’s report of 2016 about the principles due to which there was a double jump in productivity between PHP 5 and 7 is also relevant in March 2019. For half of the sites, for sure.

    About the speaker: Dmitry Stogov began programming back in the 80s: “Electronics B3-34”, Basic, assembler. In 2002 Dmitry became acquainted with PHP and soon began to work on improving it: he developed Turck MMCache for PHP, led the PHPNG project and played an important role in working on JIT for PHP. The last 14 years of Principal Engineer at Zend Technologies.

    Zend Technologies is developing PHP and commercial solutions based on it. In 1999, it was founded by Israeli programmers Andy Gutmans and Zeev Suraski, who two years ago created PHP 3. These people were at the forefront of PHP development and largely determined the current look of the language and the success of the technology.

    Zend Technologies is developing the PHP core and applications for it, and during the work I had to write extensions, get into all the subsystems and even engage in commercial projects, sometimes not at all connected with PHP. But the most interesting topic for me has always been performance .

    I started looking for ways to speed up PHP even before joining Zend, working on my own project that competed with the company. During the work on the project, I thoroughly understood the language and realized that working not with the mainstream project, you can influence only certain aspects of the script execution, and all the most interesting and effective can be created only in the kernel . This understanding and coincidence led me to Zend.

    A small digression into the history of PHP


    PHP is not just and not just a programming language . PHP stands for Personal Home Page - a tool for creating personal web pages and dynamic websites. Language is only one of its main parts. PHP is a huge library of functions, many extensions for working with other third-party libraries, for example, for access to the database or XML parsers, as well as a set of modules for communicating with various web servers.

    Danish programmer Rasmus Lerdorf introduced PHP in June 1995 . At that time it was just a set of CGI scripts written in Perl. In April 96, Rasmus introduced PHP / FI, and in June PHP / FI 2.0 was released. Subsequently, this version was substantially reworked by Andy Gutmans and Zeev Surasky, and in the 98th released PHP 3.0. By 2000, the language came to the kind that we are used to seeing today both in terms of language and internal architecture - PHP 4, based on the Zend Engine.

    Since version 4, PHP has evolved. The turning point was the release of PHP 5 in 2004, when the object model was completely updated . It was she who opened the era of PHP frameworks and raised the question of performance to a new level. Anticipating this, immediately after the release of 5.0, we at Zend thought about speeding up PHP and started working on improving productivity.

    Version 7.1, which was released in November 2016 on synthetic tests25 times faster than the 2002 version . According to the graph of performance changes in different branches, the main breakthroughs are visible in 5.1 and 7.0.



    In version 5.1, we just started working on performance, and everything we took on - it turned out, but after 5.3 we ran into a wall, all attempts to improve the interpreter came to nothing.

    Nevertheless, we found where to dig, and got even more than expected - 2.5-fold acceleration compared to the previous version 5.6 on tests. But the most interesting thing is that we got the same 2.5-fold acceleration on unchanged real applications. This is a phenomenon, because we developed the previous factor 2 throughout the life of the five in 10 years.



    The huge jump in 5.1 on synthetic tests is not noticeable on real applications. The reason is that with different uses, PHP performance rests on the brakes associated with different subsystems.

    The history of PHP 7 begins with a three-year stagnation that began in 2012 and ended in 2015 with the release of the seventh version. Then we realized that we could no longer increase productivity with small improvements of our interpreter and turned to the JIT side.

    Wandering around JIT


    Almost two years we spent on the JIT prototype for PHP-5.5. At first, we generated a very simple code - a sequence of calls for standard handlers, something like a stitched Fort code. Then they wrote their own Runtime Assembler , inline separate code for workarounds, but realized that such low-level optimizations did not give practical effect even on tests.

    Then we thought about deriving variable types using static analysis methods. Having realized the conclusion, we immediately received 2-fold acceleration in tests. Encouraged, they tried to write global register allocators, but failed. We used a fairly high-level representation, and it was almost impossible to use it for register allocation.

    To avoid low-level problems, we decided to try LLVM, and a year later we got 10x acceleration for bench.php , but nothing on real applications. In addition, compiling real applications now took minutes, for example, the first request to Wordpress took 2 minutes and did not give acceleration. Of course, this was completely unsuitable for real practice.

    Good code is possible with proper type prediction, which works poorly in real applications, and using PHP data structures makes the generated code inefficient.

    What slows down?


    We rethought the reasons for the failures and decided once again to see why PHP is slow. The picture shows the result of profiling several requests to the Wordpress home page.



    Less than 30% is spent on interpreting bytecode, 20% is the overhead of the memory manager, 13% is working with hash tables, and 5% is working with regular expressions.

    Working at JIT, we got rid of only the first 30%, and everything else lay dead weight. Almost everywhere, we were forced to use standard PHP data structures, which entailed overhead: memory allocation, reference counting, etc. This understanding led to the conclusion that it is necessary to replace key data structures in PHP. With this substitution of the foundation , the PHPNG project began.

    Phpng New generation


    The project was developed after unsuccessful attempts to create JIT for PHP. The main goal is to achieve a new level of productivity and lay the foundation for future improvements .

    We promised ourselves for some time to no longer use synthetic tests to measure performance - these are usually small computing programs that use a limited amount of data that fits completely into the processor’s cache. Real applications, by contrast, are subject to the brakes associated with subsystem memory, and a single read from memory can cost 100 computational instructions. The PHPNG project is a refactoring of key PHP data structures to optimize memory access . No innovation, 100% PHP 5 compatible.

    How to change these structures was clear. But the volume of dependent changes was huge, because the core of PHP itself is 150,000 lines , and almost every third needed to be changed. Add a hundred more extensions that are included in base distribution, a dozen modules for different web servers, and you will realize the grandeur of the project.

    We were not even sure that we would finish the project. Therefore, they launched the project in secret and opened it only when the first optimistic results appeared. It took two weeks to simply compile the kernel . Two weeks later, bench.php earned. We spent a month and a half to ensure the work of Wordpress. A month later, we opened the project - it was May 2014. At that time, we had an acceleration of 30% on Wordpress. It already seemed like a grand event.

    PHPNG immediately aroused a wave of interest, and in August 2014 it was adopted as the basis for the future of PHP 7 . It was already another project, with a different set of goals, where productivity was only one of them.

    PHP 7.0


    Version number 7 itself was in doubt. The previous version was the fifth. And the sixth one was developed several years ago and was completely devoted to native Unicode support , but the unsuccessful decisions made at the early stages of development led to excessive complexity of the kernel code and each extension. In the end, it was decided to freeze the project.

    By this time, a lot of material devoted to PHP 6 had already been accumulated: speeches at conferences, published books. In order not to confuse anyone, we called the project PHP 7, skipping PHP 6. This version was much luckier - PHP 7 was released in December 2015, almost according to plan.

    In addition to performance, some long-sought-after innovations appeared in PHP 7:

    • Ability to define scalar types of parameters and return values.
    • Exceptions instead of errors - now we can catch and process them.
    • There were Zero-cost assert(), anonymous classes, cleaning inconsistency, new operators and functions (<=>, ??).

    Innovation is good, but back to the internal changes. Let's talk about the path that PHP 7 has followed and where this path can lead us.

    zval


    This is the basic PHP data structure. It is used to represent any value in PHP . Since our language is dynamically typed and the type of variables can change during program execution, we need to store a type field (zend_uchar type), which can take the values ​​IS_NULL, IS_BOOL, IS_LONG, IS_DOUBLE, IS_ARRAY, IS_OBJECT, etc., and actually the value represented by union (value), where an integer, real number, string, array or object can be stored.

    zval in PHP 5


    The memory for each such structure was allocated separately in Heap. In addition to the type and value, it also contained a counter of references to the structure. So the structure took 24 bytes, not counting the overhead of the memory manager and the pointer to it.

    The picture at the top right shows the data structures that were created in PHP 5's memory for a simple script.



    On the stack, memory was allocated for 4 variables represented by pointers. The values ​​themselves (zval) are on the heap. In our case, these are just two zval, each of which is referenced by two variables, and accordingly their reference counters are set to 2.

    To access a type or a scalar value, you need at least two readings: first read the value of the pointer, and then the value of the structure. If you need to read not a scalar value, but, for example, part of a string or array, then you will need at least one more reading.

    zval in PHP 7


    Where we used pointers before, in the seven we began to embed zval. We have moved away from reference counting for scalar types. The fields type and value remained without significant changes, but some more flags and a reserved place were added, which I will talk about a little later.



    On the left - as it looked in PHP 5, and on the right - in PHP 7.



    Now zval themselves are on the stack. To read types and scalar values, just one machine instruction is enough. All values ​​are grouped in one memory area, which means that when working with local variables, we will practically have no losses due to misses of the processor cache. But the true power of the new performance is included when copying is needed.

    Copy Record


    In the top line of the script, another assignment was added.



    In PHP5, we allocated memory from the heap for the new zval, initialized its int (2), changed the value of the pointer to the variable b, and decreased the reference counter of the value to which b had previously referred.

    In PHP 7, we simply initialized the variable b directly in place with a few instructions , while in PHP 5 it required hundreds of instructions. So zval looks now in memory.



    These are two 64-bit words. The first word is meaning: integer, real or pointer. In the second word, type(he says how to interpret the meaning), flags, and a reserved place that would still be added when aligning. But it does not disappear, but is used by different subsystems to store indirectly related values.

    Flags are a set of bits where each bit indicates whether zval supports a protocol. For example, if it is IS_TYPE_REFCOUNTED, then when working with this zval, the engine should take care of the value of the reference counter. When assigning, increase; when leaving the scope, decrease; if the reference counter reaches zero, destroy the dependent structure.

    Of the types, compared to PHP 5, several new ones appeared.

    • IS_UNDEF - a marker of an uninitialized variable.
    • The single was IS_BOOLreplaced by separate IS_FALSEand IS_TRUE.
    • Added a separate type for links and a few more magic types.

    Types from IS_UNDEFto IS_DOUBLEare scalar, and do not require additional memory. To copy them, it is enough to copy the first machine 64-bit word with a value and half the second with a type and flags.

    Refcounted


    With other types more difficult. They are all represented by a subordinate structure, and zval simply stores a reference to this structure. For each type, this structure is different, but in terms of OOP, they all have a common abstract ancestor or structure zend_refcounted. It determines the format of the first 64-bit word , where the reference count and other information for the garbage collector are stored.



    This word can be considered simply as information for the garbage collector, and structures for specific types add their fields after this first word.

    Lines


    In the seven for the string, we store the calculated value of the hash function, its length and the characters themselves. The size of such a structure is variable and depends on the length of the string. The hash function is calculated for the string once, when necessary. In PHP 5, it was re-computed at every need.



    Now the strings have become reference countable, and if in PHP 5 we copied the characters themselves, now it’s enough to increase the reference count for this structure.

    As in PHP 5, we still have the concept of immutable or interned strings . They usually exist in one instance, live until the end of the query, and can behave like scalar values. We do not need to take care of the counter of references to them, and for copying it is enough to copy only zval itself with the help of four machine instructions.

    Arrays


    Arrays are represented by a built-in hash table and are not much different from PHP 5. The hash table itself has changed, but more on that separately.



    Arrays are now an adaptive structure that slightly changes its internal structure and behavior depending on the stored data. If we store only elements with close numeric keys, then we get access to the elements directly by index with a speed comparable to the speed of arrays in C. But if you add an element with a string key to the same array, it turns into a real hash with collision resolution.

    This is how the hash table looks like in PHP 5.



    This is a classic hash table implementation with collision resolution using linear lists (shown in the upper right corner). Each item is represented by a bucket. All Buckets are linked by doubly linked lists to resolve collisions, and linked by another doubly linked list to iterate in order. The values ​​for each zval are allocated separately - in Bucket we only store a link to it. Also, string keys can be allocated separately.

    Thus, for each hash table, you need to allocate a lot of small blocks of memory, and in order to find something later, you have to run along the pointers. Each such transition can cause cahce miss and a delay of ~ 10-100 processor cycles.

    This is what happened in PHP 7.



    The logical structure remained unchanged, only the physical one changed. Now, under a hash table, memory is allocated with one operation.

    In the picture, at the bottom of the base pointer, there are elements, and at the top is a hash array that is addressed by a hash function. For flat or packed arrays, when we store only elements with numerical indices, the upper part is not allocated at all, and we address the Bucket directly by number.

    To bypass elements, we sequentially sort them from top to bottom or from bottom to top, which modern processors do flawlessly. The values ​​are built into Buckets, but the reserved space in them is just used to resolve collisions. It stores the index of another Bucket with the same hash function value or the end of the list marker.

    The memory for the string values ​​of the keys is allocated separately, but it's still the same zend_string. When pasting into an array, it is enough to increase the reference counter of the string, although before we had to copy the characters directly, and when searching, we can now compare not the characters, but the pointers to the strings themselves.

    Immutable Arrays


    Previously, we had immutable strings, but now immutable arrays have also appeared. Like strings, they do not use the reference count and are not destroyed until the end of the request. This is a simple script that creates an array of a million elements, and each element is the same array with a single "hello" element.



    In PHP 5, at each loop iteration, a new empty array was created, “hello” was written to it, and all this was added to the resulting array. In PHP 7, at the compilation stage, we create just one immutable array that behaves like a scalar, and add it to the resulting one. In the presented example, this allows us to achieve more than 10-fold reduction in memory consumption and almost 10-fold acceleration.

    Constant arrays of millions of elements in real applications, of course, are not often found, but small ones are quite common. On each of them you will get a small, but a win.

    The objects


    Links to all objects in PHP 5 lay in a separate repository, and in zval there was only handle - a unique object ID.



    To get to the object, we made at least 3 readings. In addition, the memory for the value of each property of the object was allocated separately, and we needed at least 2 more readings to read it.

    In PHP 7, we were able to move to direct addressing.



    Now the address zend_objectis accessible with one machine instruction. And Property are built-in and to read them you need only one additional reading. They are also grouped together, which improves data locality and helps modern processors not stumble.

    In addition to the predefined property, a link to the class of this object is also stored here, some handlers - an analogue of virtual method tables, and a hash table for property that have not been defined. In PHP, you can add property to any object that was not originally defined, and if several machine instructions are enough to access the predefined property, then for non-predefined properties you will have to use a hash table, which will require dozens of machine instructions. Of course, this is much more expensive.

    Reference


    Finally, we had to introduce a separate type to represent PHP links.



    This is a completely transparent type. It is not visible to PHP scripts. Scripts see another zval that is built into the zend_reference structure. It is understood that we refer to one such structure from at least two places, and the reference counter of this structure is always greater than 1. As soon as the counter drops to 1, the link turns into a regular scalar value. The zval embedded in the link is copied to the last zval that references it, and the structure itself is deleted.

    It seems that working with reference is now much more complicated than with other types (and this is true), but in fact in PHP 5 we had to do work of comparable complexity when accessing any value (even a prime integer). Now we are applying more complex protocols to only one type and thereby speeding up work with all others, especially with scalar values.

    IS_FALSE and IS_TRUE


    I have already said that the single type IS_BOOL was split into separate IS_FALSE and IS_TRUE. This idea was spied on in the LuaJIT implementation, and was made to speed up one of the most common operations - conditional transition.



    If in PHP 5 it was required to read the type, check for boolean, read the value, find out whether it is true or false and make a transition based on this, now it’s enough to simply check the type and compare it with true:

    • if it is true, then we go along one branch;
    • if it is less than true, go to another branch;
    • if it is more than true, go to the so-called slow path (slow path) and there we check what type it came from and what to do with it: if it is integer, then we must compare its value with 0, if float - again with 0 ( but real), etc.

    Calling convention


    A change in the Calling Convention or function call convention is an important optimization that affects not only data structures, but also underlying algorithms. In the picture on the left is a small script consisting of the foo () function and its call. Below is the byte code into which this script was compiled PHP 5.



    First, I will tell you how it worked in PHP 5.

    Calling Convention in PHP 5


    The first instruction SEND_VALwas to send the value “3” to the foo function. To do this, she was forced to allocate a new zval on the heap, copy the value (3) there and write the value of the pointer to this structure onto the stack.



    Similarly with the second instruction. Then it DO_FCALLinitialized CALL FRAME, reserved space for local and temporary variables, and transferred control to the called function.



    The first statement RECVchecked the first argument and initialized the slot of the corresponding local variable ($ a) on the stack. Here we did without copying and simply increased the reference counter of the corresponding parameter (zval with a value of 3). Similarly, the second operator RECVestablished a connection between the variable $ b and parameter 5.



    Further body functions. 3 + 5 addition happened - it turned out 8. This is a temporary variable and its value was stored directly on the stack.



    RETURN and we return from the function.



    When returning, we release all variables and arguments that are out of scope. To do this, we go through all the zval referenced by slots from the freed frame, and for each we decrease the reference count. If it reaches 0, then destroy the corresponding structure.

    As you can see, even such a simple operation as sending a constant to a function requires allocating new memory, copying and increasing the reference counter, and then also double decreasing and deleting.

    Calling Convention in PHP 7


    In PHP 7, these problems have been fixed - now on the stack we store not the zval pointers, but the zval ones themselves.



    We also introduced a new instruction INIT_FCALL, which is now responsible for initializing and allocating memory for CALL FRAME, and reserving space for arguments and temporary variables.



    SEND_VAL 3now just copy the argument to the first slot for CALL FRAME. Next SEND_VAL 5to the second slot.



    Then the most interesting. It would seem that it DO_FCALLshould transfer control to the first instruction of the called function. But the arguments have already hit the slots that are reserved for the variable parameters $ a and $ b, and the instructions RECVjust do nothing. Therefore, you can simply skip them. We sent two parameters, so we skip two instructions. If they sent three, they would have missed three.



    So we go directly to the body of the function, make addition and return.



    When returning, we clear all local variables, but now only for two slots, and since we have scalars there, we again do not need to do anything.



    My story is a little simplified, it does not take into account functions with a variable number of arguments and the need for type checking and some other points.

    The new Calling Convention has broken compatibility a bit . PHP has features like func_get_argand func_get_args. If earlier they returned the original value of the sent parameter, now they return the current value of the corresponding local variable, because we simply do not store the original values. Just like C. debuggers do



    In addition, the function can no longer have multiple parameters with the same name. There was no point in this before, but foo($_, $_)I met such PHP code . What does it look like? (I recognized Prolog)

    New memory manager


    Having finished with the optimization of data structures and basic algorithms, we once again drew attention to all the braking subsystems. The memory manager in PHP 5 took up almost 20% of the processor time on Wordpress.

    After we got rid of a lot of allocations, his overhead costs became less, but still significant - and not because he was doing some significant work, but because he stumbled on the cache. This happened due to the fact that we used the classic Doug Lea's malloc algorithm, which involved finding suitable free memory locations by traveling through links and trees, and all these trips inevitably caused cache misses.

    Today, there are new memory management algorithms that take into account the features of modern processors. For instance:jemalloc and ptmalloc from google . At first, we tried to use them unchanged, but did not get a win, because the lack of PHP-specific functionality made it more expensive to completely free memory at the end of the request. As a result, we abandoned dlmalloc and wrote something of our own, combining ideas from the old memory manager and jemalloc.

    We have reduced the overhead of Memory Manager to 5% , reduced the memory overhead for service information and improved the use of the CPU cache. Suitable memory blocks are now searched by bitmaps, memory for small blocks is allocated from individual pages and cached upon release, specialized functions for frequently used block sizes are added.

    Many small improvements


    I spoke only about the most important improvements, but there were much more minor ones. I can mention some of them.

    • Fast API for parsing parameters of internal functions and a new API for iterating over HashTable.
    • New VM instructions: string concatenation, specialization, super-instructions.
    • Some internal functions have been turned into VM instructions: strlen, is_int.
    • Using CPU registers for VM registers: IP and FP.
    • Optimization of duplication and deletion of arrays.
    • Using link counts instead of copying wherever you can.
    • PCRE JIT.
    • Optimization of internal functions and serialize ().
    • Reduced code size and processed data.

    Some were very simple, for example, it took only three lines of code to enable JIT in regular Pearl expressions, and this immediately brought visible (2-3%) acceleration to almost all applications. Other optimizations touched on some narrow aspects of certain PHP functions, and are not particularly interesting, although the total contribution of all these minor improvements is quite significant.

    What have you come to


    This is the contribution of various subsystems on WordPress / PHP 7.0.



    Virtual machine contribution increased to 50%. Memory Manager already consumes less than 5% - and mainly not due to optimizations of the Memory Manager itself, but by reducing the number of calls to it. If earlier on the same test the memory was allocated 130 million times, now it’s only 10 million. It might seem that all the main acceleration was achieved by reducing the overhead of the Memory Manager and reducing the number of calls to it due to the improvement of data structures, but in fact all subsystems have been significantly improved.


    The main sources of acceleration:

    • The interpreter began to work 2 times better.
    • MM overhead decreased 17 times.
    • Hash tables began to work 4 times faster.
    • Total WordPress productivity has grown 3.5 times.

    At the beginning of the article, we talked about 2.5x real acceleration, but now the numbers are different. Why is that? The fact is that we measured the real speed in requests per second, but here the speed is measured by the profiler in terms of CPU time, in fact - processor clock cycles when it is not idle. When PHP is waiting for a response from the database, the processor stands and this time is not taken into account here.

    PHP 7 performance


    WordPress 3.6 was the main benchmark for us - on it we monitored performance from the first days of work. At some point, when the mysql extension was thrown out of PHP 7, we had to specifically support it, just to continue this graph.



    The graph shows that the main breakthroughs occurred in the first months of work on PHPNG. By August, 2/3 of the improvements had been scored. Then we moved in small steps, and scored the remaining third.

    Of course, we measured performance not only on WordPress, but also on other popular applications, and almost everywhere we see - from 1.5 to 2 times acceleration.

    PHP 7 and HHVM


    According to our version, we almost everywhere overtook even the current versions of HHVM at that time.



    But comparing with a third-party product is a thankless task. Always a gain in favor of the measuring one. The Facebook team version shows other results. On a graph, HHVM is everywhere proportionally faster. Perhaps this is due to different measurement procedures, testing on different hardware platforms, differences in fine-tuning, and maybe subjective factors also influenced.



    Apotheosis of PHP 7 - the beginning of the use of large sites. The pioneers were Chinese Vebia, American Etsy and Badoo. Highload checking revealed several significant problems, but they were quickly localized and fixed.

    Switching to PHP 7.0 for Etsy and Badoo allowed shutting down almost half of the servers in web farms. Badoo ratedmillion dollar savings.



    The graphs are indicative that at the time of the transition, the total processor load decreased by 2 times, and memory consumption - by as much as 7 times.

    On this happy note, we will end today's talk about PHP 7.0. But we will continue it in the second part devoted to PHP 7.1, in the optimization of which we went much further than data structures.

    In May, at PHP Russia, Dmitry Stogov will give a talk on the most interesting new technologies being developed for PHP 8 . If your experience is largely related to PHP, you know how to prepare it correctly, and are ready to share your best practices - send applications by April 1. And remember, the main thing that we are looking for is a lively useful experience, and we will help with the report, ask the right questions and tell you where to move.

    Also popular now: