Judy arrays in PHP

    Badoo uses many services in C and C ++, most of which work with huge amounts of data. As a rule, services act as a “fast cache” or a “fast database”, i.e. perform various operations with arrays of the same type of data. For quick access to data, we have long and successfully used Judy arrays. But once we wanted a strange thing: to process large arrays of integers in PHP, and we immediately remembered Judy.

    A bit of history

    Judy arrays were invented by Douglas Baskins in early 2000. Their development project was funded by HP, but was closed after about two years. During this time, four versions were released, and the development of the latter took more than a year, and in it the developers were able to speed up Judy twice, halve the memory consumption, although this was not an easy price: the amount of code grew 5 times, and its complexity - an order of magnitude.

    The algorithm for working Judy arrays is patented , however, the original implementation in C is available on the official website under the LGPL license.
    The project was named after Sister Baskins, because the developers could not come up with a better option.

    Tree arrays

    In truth, Judy arrays are not arrays themselves. Judy is a development of trie , i.e. prefix tree using different compression techniques. Judy arrays compares favorably with C arrays and standard hash array implementations: sparse Judy arrays use memory efficiently and scale perfectly, while showing comparable read and write speeds (see benchmarks ).
    A detailed description of the Judy work algorithm is unlikely to fit in the format of the article for Habr, so we just leave links to the official documentation: 1 and 2 .

    There are several types of Judy arrays:
    • Judy1 - a bitmap (bitmap), the index is an integer (integer-> bit);
    • JudyL - an array with an integer as an index (integer-> integer / pointer);
    • JudySL - an array of pointers with a string as an index (string-> integer / pointer);
    • JudyHS - an array of pointers with a set of bytes as an index (array of bytes-> integer / pointer).

    Judy arrays do not require initialization, and an empty array is a simple NULL pointer. The consequence of this is that Judy arrays are conveniently nested in each other. So convenient that JudySL and JudyHS are actually nested JudyL arrays.

    Benchmarks

    Built-in arrays in PHP are universal and, as a result, extremely inefficient in some cases. Compared with Judy, they are implemented quite trivially: these are hash arrays with a doubly linked list inside each element in case of hash collisions. PHP arrays store PHP variables, i.e. pointers to zval structures . In the vast majority of cases, this is exactly what is required of them, but there are exceptions.
    To demonstrate why we decided to use Judy, we will conduct a comparative test of Judy and PHP arrays. As an interface to Judy from PHP, we use the php-judy modulein which the Judy class is implemented with the implementation of the ArrayAccess interface. In the test, create arrays of the form integer -> integer (array (), Judy :: INT_TO_INT, Judy :: INT_TO_MIXED) or integer-> boolean (Judy :: BITSET) and fill them with elements with sequential keys.
    First, we’ll measure the speed of writing to arrays using a similar code:

    image

    See the interactive chart here .
    The peaks in writing array () are caused by doubling the size of the array and rebuilding it when all the hash elements are full. Small elevations in the Judy graphs can be explained by the measurement error.
    The graph shows that the different types of Judy in writing speed are approximately equal to the built-in array. The only exception is BITSET (Judy1), which is slightly faster.

    Then we’ll measure the speed of sequential reading from arrays:

    image

    See the interactive chart here .
    It can be seen from the graph that when reading Judy :: BITSET and Judy :: INT_TO_INT they slightly lose the built-in PHP array. This is because these arrays store bits and integers, respectively, and not PHP variables (i.e. zval). It is the overhead of creating and filling in the variables that rob us of the performance gain.
    On the other hand, array () and Judy :: INT_TO_MIXED store just the PHP variables inside, and they don’t have to spend time creating them, so they are approximately equal in speed.

    Finally, we will measure the memory consumption when filling the arrays and get the following graph: See the interactive graph here .
    image


    As you can see from the graph, the difference in memory consumption between PHP arrays and Judy1 / JudyL is huge, and for us this is the main criterion in this case, because we can well sacrifice part of the speed when reading from the array, in return having the opportunity to work with arrays of a much larger size, which previously did not fit in the server’s RAM at all.
    Thus, the use of Judy arrays in PHP is justified in solving problems associated with the processing of large data arrays, especially integers and bits. In terms of read and write speeds, Judy arrays are not too different from built-in arrays, however, due to differences in implementation, they use memory much more efficiently.
    Anticipating your questions, we note that, despite their obvious advantages, Judy arrays are unlikely to replace regular arrays in PHP, if only because the latter can contain both integer and lowercase indices, while in Judy this is implemented using two different types of arrays.

    Also popular now: