Under the hood of Redis: a hash table (part 1)

    If you know why running `hset mySet foo bar` costs at least 296 bytes of RAM, why Instagram engineers avoid string keys, why you should always tune hash-max-ziplist-entries / hash-max-ziplist-value, and why the data type underlying hash also underlies list, sorted set, and set - you can skip this article. For everyone else, I will try to explain. Understanding how hash tables are designed and how they work in Redis is critical when writing systems where saving memory matters.

    This article covers: what it costs Redis to store a key itself; what ziplist and dict are, when each is used, and how much memory they occupy; when a hash is stored as a ziplist and when as a dict, and what that gives us; and which tips from trendy articles on Redis optimization should not be taken seriously, and why.
    I apologize in advance for splitting the material into two articles. Having finished the part about the dictionary, I realized that the second part, about the ziplist, was turning out only a little shorter than the first. So, after trying the text on colleagues, I decided to split it in two.

    First, we need to understand how the individual structures underlying hashes work. Then we will calculate the cost of storing data in them. The material will be easier to follow if you already know what a redisObject is and how Redis stores strings.
    There is some confusion here: the hashes structure is often taken to be a hash table. Under the hood there is a separate type, dict, which is a real hash table; the hashes structure itself is a hybrid of dict and ziplist. A good introduction to how hash tables are organized and how they work will come in handy. Having shown a draft without this material to several of my colleagues, I became convinced that without it everything gets even more confusing.

    In Redis, you control which internal data type a particular hash key is represented by. The hash-max-ziplist-entries setting determines the maximum number of elements in a hash for which the REDIS_ENCODING_ZIPLIST encoding can be used. For example, with hash-max-ziplist-entries = 100, your hash will be represented as a ziplist as long as it has fewer than 100 elements; as soon as there are more, it is converted to REDIS_ENCODING_HT (dict). hash-max-ziplist-value works similarly: a ziplist is used as long as the length of every individual field value in the hash stays within hash-max-ziplist-value. The parameters work as a pair - the internal representation switches from ziplist to dict as soon as either of them is exceeded. By default, any new hash starts out with the ziplist encoding.
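
    You can watch this conversion happen with the OBJECT ENCODING command. A short illustrative session (myHash and the limit of 2 are arbitrary choices for the demonstration):
    config set hash-max-ziplist-entries 2
    +OK
    hset myHash f1 v1
    :1
    object encoding myHash
    $7
    ziplist
    hset myHash f2 v2
    :1
    hset myHash f3 v3
    :1
    object encoding myHash
    $9
    hashtable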

    Dictionaries are the key structure of Redis - not only for hashes, but also for list, zset, and set. This is the main structure Redis uses to store any kind of key-value data. Dictionaries in Redis are a classic hash table supporting insert, replace, delete, search, and random-element retrieval. The entire dictionary implementation can be found in dict.h and dict.c in the Redis source code.

    Any hash table has a number of parameters that are critical for understanding its performance and for choosing the right implementation. Besides algorithmic complexity, which matters for any data structure, there are factors that depend directly on the conditions of use. Particularly interesting here are the fill factor and the resizing strategy. The latter is especially important in any system working in real time, since every available theoretical implementation puts you before a choice: spend a lot of RAM or a lot of processor time. Redis uses so-called incremental resizing, although there are proposals to try linear hashing, monotonic keys, consistent hashing, or resizing by copying, depending on the specifics of the hash.

    The implementation of incremental resizing (the dictRehashMilliseconds and dictRehash functions in dict.c) boils down to the following:
    1. When resizing a table, we allocate a new hash table that is deliberately larger than the existing one (and do not touch the old one). As with strings, Redis doubles the number of slots in the new table. In addition, any requested size (including the initial one) is always rounded up to the nearest power of two: if you need 9 slots, 16 will be allocated. The minimum size of the new table is determined by the compile-time constant DICT_HT_INITIAL_SIZE (4 by default, defined in dict.h).
    2. During each read/write operation we look into both tables.
    3. All insert operations go only into the new table.
    4. On any operation we move n elements from the old table to the new one.
    5. Once all elements have migrated, we delete the old table.

    In the case of Redis, n is the number of keys the server managed to move within 1 millisecond, counted in steps of 1000; in other words, one rehashing step processes at least 1000 elements. This matters because with long string keys such a step can take noticeably more than 1 ms, stalling the server. A further problem is the high peak memory consumption during rehashing, which shows up sharply on hashes with a large number of "long" keys. In the calculations we will consider tables outside a hash, remembering, however, that if a Redis node holds one very large table, there is a real chance of a denial of service if its size cannot be changed - for example, on a budget hosted Redis (heroku's 25 mb plan). A simplified sketch of the migration loop follows below.
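
    Here is a simplified sketch of that loop, modeled on dictRehash in dict.c and using the dict structures listed further below (the real code also resets ht[1] and uses Redis' own allocator; some details are omitted):
    int dictRehash(dict *d, int n) {
        if (d->rehashidx == -1) return 0;          /* not rehashing */
        while (n--) {
            dictEntry *de, *nextde;
            if (d->ht[0].used == 0) {              /* migration finished */
                free(d->ht[0].table);              /* drop the old table */
                d->ht[0] = d->ht[1];               /* new table becomes ht[0] */
                /* ... reset ht[1] ... */
                d->rehashidx = -1;
                return 0;
            }
            while (d->ht[0].table[d->rehashidx] == NULL)
                d->rehashidx++;                    /* skip empty slots */
            de = d->ht[0].table[d->rehashidx];
            while (de) {                           /* move the whole chain */
                nextde = de->next;
                unsigned long h = dictHashKey(d, de->key) & d->ht[1].sizemask;
                de->next = d->ht[1].table[h];      /* prepend into new slot */
                d->ht[1].table[h] = de;
                d->ht[0].used--;
                d->ht[1].used++;
                de = nextde;
            }
            d->ht[0].table[d->rehashidx] = NULL;
            d->rehashidx++;
        }
        return 1;                                  /* more slots to migrate */
    }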

    The fill factor (the dict_force_resize_ratio constant, 5 by default) determines how sparse or dense the dictionary may get and when Redis starts doubling the size of the current dictionary. The current fill factor is defined as the ratio of the total number of elements in the hash table to the number of slots. In other words, with only 4 slots and 24 elements (distributed among them), Redis will decide to double the size (24 / 4 > 5). This check happens every time any dictionary key is accessed. Redis will also try to double the number of slots if the number of elements becomes equal to or greater than the number of slots.
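
    Roughly how that check looks; a sketch modeled on _dictExpandIfNeeded in dict.c (dict_can_resize is switched off while a child process is saving, to limit copy-on-write):
    static int dict_can_resize = 1;
    static unsigned int dict_force_resize_ratio = 5;

    static int _dictExpandIfNeeded(dict *d) {
        if (dictIsRehashing(d)) return DICT_OK;   /* already resizing */
        if (d->ht[0].size == 0)                   /* empty: start with 4 slots */
            return dictExpand(d, DICT_HT_INITIAL_SIZE);
        /* grow to used*2 when used >= size; force it regardless of
         * dict_can_resize once used/size exceeds the fill factor of 5 */
        if (d->ht[0].used >= d->ht[0].size &&
            (dict_can_resize ||
             d->ht[0].used / d->ht[0].size > dict_force_resize_ratio))
            return dictExpand(d, d->ht[0].used * 2);
        return DICT_OK;
    }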


    Look at these structures in the source:
    typedef struct dict {
        dictType *type;     /* callbacks: hash function, key compare, etc. */
        void *privdata;     /* passed to the callbacks */
        dictht ht[2];       /* two tables: ht[1] is used only while rehashing */
        long rehashidx;     /* rehashing not in progress if rehashidx == -1 */
        int iterators;      /* number of iterators currently running */
    } dict;
    typedef struct dictht {
        dictEntry **table;      /* array of slots (buckets) */
        unsigned long size;     /* number of slots, always a power of two */
        unsigned long sizemask; /* size - 1, maps a hash value to a slot */
        unsigned long used;     /* number of elements stored */
    } dictht;
    typedef struct dictEntry {
        void *key;              /* pointer to the key */
        union {                 /* the value: a pointer or an inline number */
            void *val;
            uint64_t u64;
            int64_t s64;
            double d;
        } v;
        struct dictEntry *next; /* next entry in the same slot (chaining) */
    } dictEntry;
    


    Each dictionary is built from three structures: dict, dictht, and dictEntry. dict is the root structure; its ht field holds one or two (during rehashing) dictht tables with their slot arrays. In turn, each dictht slot holds a linked list of dictEntry. Each dictEntry stores a pointer to the key and the value (either as a pointer or as a double / uint64_t / int64_t). If you use a number as the value, dictEntry stores the number directly; if you store a string, a pointer to a redisObject with an sds string on board is stored.
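
    The type field of dict points to a dictType, a table of callbacks that gives each dictionary its hash function and its key/value semantics (from dict.h):
    typedef struct dictType {
        unsigned int (*hashFunction)(const void *key); /* MurmurHash2 here */
        void *(*keyDup)(void *privdata, const void *key);
        void *(*valDup)(void *privdata, const void *obj);
        int (*keyCompare)(void *privdata, const void *key1, const void *key2);
        void (*keyDestructor)(void *privdata, void *key);
        void (*valDestructor)(void *privdata, void *obj);
    } dictType;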

    When Redis accesses a specific key inside a hash table (a sketch of the lookup follows this list):
    1. The hash function computes the target slot (MurmurHash2 is currently used as the hash function).
    2. All values in the slot are scanned one by one (a linked list), comparing the keys.
    3. The requested operation is performed on the found dictEntry (with its key and value).
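
    A minimal sketch of that lookup, modeled on dictFind in dict.c (dictHashKey and dictCompareKeys are real macros that dispatch through the dictType callbacks):
    dictEntry *dictFind(dict *d, const void *key) {
        if (d->ht[0].size == 0) return NULL;        /* empty dictionary */
        if (dictIsRehashing(d)) _dictRehashStep(d); /* do one rehash step */
        unsigned int h = dictHashKey(d, key);       /* hash the key */
        for (int table = 0; table <= 1; table++) {
            unsigned int idx = h & d->ht[table].sizemask;
            dictEntry *he = d->ht[table].table[idx];
            while (he) {                            /* walk the chain */
                if (dictCompareKeys(d, key, he->key)) return he;
                he = he->next;
            }
            if (!dictIsRehashing(d)) return NULL;   /* no second table */
        }
        return NULL;
    }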

    The help for HSET and HGET says these operations run in O(1). Believe it with caution: most of the time your hset will take O(n). In large tables (more than 1000 elements) with active writes, hget will also take O(n).
    The answer to the question of how much memory will actually be used depends on the operating system, the compiler, the bitness of your process, and the allocator used (jemalloc by default in Redis). All further calculations are given for Redis 3.0.5 built on a 64-bit server running CentOS 7.

    Now we can calculate how much memory is spent on the call `hset mySet foo bar`. The overhead of creating an empty dictionary:
    • dictEntry: 3 * size_of(pointer)
    • dictht: 3 * size_of(unsigned long) + size_of(pointer) = 24 + size_of(pointer)
    • dict: 2 * size_of(dictht) + 2 * size_of(int) + 2 * size_of(pointer) = 56 + 4 * size_of(pointer)

    If we put everything together, we get a formula for calculating the approximate overhead for storing n elements:
    	56 + 2 * size_of(pointer) + 3 * next_power(n) * size_of(pointer)
    	\_______________________/   \___________________________________/
    	       dict + dictht                     dictEntry
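
    To make the arithmetic easy to reproduce, here is a tiny helper; next_power and dict_overhead are hypothetical names of mine, not Redis functions, and a 64-bit build is assumed:
    #include <stdio.h>

    /* Hypothetical helpers evaluating the formula above. */
    static unsigned long next_power(unsigned long n) {
        unsigned long size = 4;              /* DICT_HT_INITIAL_SIZE */
        while (size < n) size *= 2;          /* round up to a power of two */
        return size;
    }

    static unsigned long dict_overhead(unsigned long n) {
        const unsigned long ptr = sizeof(void *);  /* 8 on a 64-bit build */
        return 56 + 2 * ptr + 3 * next_power(n) * ptr;
    }

    int main(void) {
        printf("%lu\n", dict_overhead(1));   /* prints 168, see the check below */
        return 0;
    }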
    


    This formula does not take into account the cost of storing the key itself and the value. The key is always an sds string inside a redisObject. The value, depending on its type, can be an sds string or a native number (integer or floating point). Let's check for `hset mySet foo bar`, remembering that the minimum number of slots in a new dictionary is DICT_HT_INITIAL_SIZE (4):
    56 + 2 * 8 + 3 * 4 * 8 = 168 bytes of dictionary overhead; adding 2 * 56 = 112 bytes for the `foo` and `bar` strings gives 280 bytes (which will be aligned up to 296 bytes).

    We check:
    config set hash-max-ziplist-entries 0
    +OK
    config set hash-max-ziplist-value 1
    +OK
    hset mySet foo bar
    :1
    debug object mySet
    +Value at:0x7f44894d6340 refcount:1 encoding:hashtable serializedlength:9 lru:5614464 lru_seconds_idle:6
    

    What does info memory show? It reports that 320 bytes were spent, 24 bytes more than we expected. This is the cost of alignment during memory allocation.

    How was the mySet key itself saved, and how does Redis find our hash by that name? At the heart of Redis lies the redisDb structure (the representation of a Redis database, in redis.h). It contains a single dict dictionary for all keys, exactly the same structure as described above. This is important because it gives an idea of the cost of storing the database's keys, and it underlies the advice not to store huge numbers of simple keys in one instance.
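
    For reference, here is the redisDb structure, slightly abridged, as it appears in redis.h of Redis 3.0 (comments are mine):
    typedef struct redisDb {
        dict *dict;              /* the keyspace: one entry per key */
        dict *expires;           /* keys with a TTL set */
        dict *blocking_keys;     /* keys with clients waiting (BLPOP) */
        dict *ready_keys;        /* blocked keys that received a PUSH */
        dict *watched_keys;      /* WATCHed keys for MULTI/EXEC */
        int id;                  /* database number */
        long long avg_ttl;       /* average TTL, for stats only */
    } redisDb;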

    Let's see what the well-known Instagram article says about storing hundreds of millions of simple keys: if you need to store many key-value pairs, do not use string keys, use hash tables.
    Here is the gist of the test from the Instagram article, rewritten in Lua so that nothing beyond a running Redis is needed to reproduce it:
    flushall
    +OK
    info memory
    $225
    # Memory
    used_memory:508408
    eval "for i=0,1000000,1 do redis.call('set', i, i) end" 0   
    $-1
    info memory
    $226
    # Memory
    used_memory:88578952
    

    To store 1,000,000 integer keys we needed a little more than 88 mb. Now let's save the same data in hash tables, distributing the keys evenly among 2000 hashes:
    flushall
    +OK
    info memory
    $227
    # Memory
    used_memory:518496
    eval "for i=0,1000000,1 do local bucket=math.floor(i/500); redis.call('hset', bucket, i, i) end" 0
    $-1
    info memory
    $229
    # Memory
    used_memory:104407616
    

    To store the same data with integer fields we spent a little more than 103 mb. However, if we try to store, say, 10,000,000 keys, the picture changes dramatically: roughly 970 mb versus roughly 165 mb (with most of the difference being storage overhead in the system key hash table). So the rule "if you have hundreds of millions of keys, do not use string keys" holds.

    Let's look at another aspect. Descriptions of this kind of optimization usually assume that you have many keys of the form entity:identifier (for example, `set username:4566 RedisMan`) and suggest switching to hash tables keyed bucket:id with the identifier as the field (e.g. `hset usernames:300 4566 RedisMan`). It is important to understand that a partial substitution of concepts takes place here: the sds string `username:4566` turns into the field key 4566 with the REDIS_ENCODING_INT encoding. That is, instead of an sds string inside a redisObject, a plain number is now used, so the minimum of 32 bytes per sds string (after jemalloc alignment) turns into 4 bytes.

    Let's force ziplist encoding for our hashes:
    config set hash-max-ziplist-entries 1000                                           
    +OK
    config set hash-max-ziplist-value 1000000
    +OK
    flushall
    +OK
    eval "for i=0,1000000,1 do local b=math.floor(i/500); redis.call('hset', 'usernames:' ..b, i, i) end" 0
    $-1
    info memory
    $228
    # Memory
    used_memory:16816432
    

    Storing our data took only 16 mb, a saving of 72 mb (5 times less) in my example with 1,000,000 keys. Tempting? We will get to the explanation in the second part of the article.

    As an intermediate conclusion, it is worth formulating several points important for a loaded system that needs to save memory:
    • Use numeric names for keys, values, and fields in hash tables wherever possible. Do not use prefixes/postfixes.
    • When designing systems that actively use Redis, proceed from the principle: one set of data requirements, one Redis instance. Storing heterogeneous data together is hard because of the shared hash-max-ziplist-entries / hash-max-ziplist-value settings and the difficulty of telling keys apart without prefixes.
    • When replacing simple keys with groups of hash tables, remember that this optimization pays off at a million keys or more.
    • If your instance holds millions or hundreds of millions of keys, you incur huge costs for storing them in the system dictionary, plus memory overhead for that dictionary's spare capacity. For 100 million keys this comes to about 2.5 gb for the system dict alone, not counting your data.
    • If you fit under the hash-max-ziplist-entries / hash-max-ziplist-value limits and your data is stored in a ziplist instead of a dict, you can get memory savings measured in hundreds of percent, paying for it with higher CPU utilization. More on that in the next part.
