How and why we optimized the SLAB cache shrinking algorithm in the Linux kernel

    The growing popularity of containers and their use together with control groups revealed a serious scalability problem, which leads to a significant performance drop on large machines. The problem is that the time needed to traverse the SLAB caches grows quadratically with the number of containers, and active consumption of large amounts of memory over a short period can push the system into a busy loop that consumes 100% of CPU time. Today I would like to tell you how we solved this problem by changing the algorithm that tracks the use of SLAB cache objects by memcg control groups and by optimizing the shrink_slab() function.

    Clearing memory

    Why did we need to optimize anything in the kernel at all? It all started when one of our customers, who actively uses containers and memory control groups (memcg), noticed strange CPU consumption peaks that occurred from time to time. Normally the system load was about 50%, but at peak times 100% of CPU time was occupied, and almost all of it was consumed by the kernel (sys time).
    The node itself was multi-tenant, with about 200 OpenVZ containers running on it. Analysis showed that many users had created nested Docker containers and multi-level hierarchies of memory control groups. Each top-level user container contained about 20 mount points and 20 memory control groups (memcg) created by systemd. In addition, there were mount points and control groups created by the aforementioned Docker. Simply put, the node was heavily loaded, much more heavily than the average node of our other customers. We wanted to find the cause of these peaks, because the same problem could also show up on less loaded machines, where it would be barely noticeable (for example, adding +5% sys time and quietly degrading performance).

    Using perf, I managed to catch one of the peaks and record a trace. It turned out that most of the CPU time was spent shrinking SLAB caches, namely the superblock caches:

    -  100,00%     0,00%  kswapd0  [kernel.vmlinux]  [k] kthread
       - 99,31% balance_pgdat
          - 82,11% shrink_zone
             - 61,69% shrink_slab
                - 58,29% super_cache_count
                   + 54,56% list_lru_count_one


    Here it is worth pausing for an explanation. As you probably know, the kernel caches unused data for a while before finally releasing the memory, and it makes extensive use of this principle. For example, the page cache contains pages of file data, which significantly speeds up repeated reads (because there is no need to go back to the disk). In our case the problem arose with the superblock metadata cache, which is kept in two LRU lists: s_dentry_lru and s_inode_lru.

    LRU (Least Recently Used)

    struct list_lru points to an array of linked lists, and each active memcg corresponds to one element (list_lru_one) of this array. When a SLAB object is no longer used by the kernel, the kernel adds it to one of the linked lists of the array (depending on which memcg the object belongs to or, roughly speaking, which memcg the process that created the object belonged to). The array itself is described as follows (list_lru::node::memcg_lrus):

    struct list_lru_memcg {
            struct rcu_head         rcu;
            /* array of per cgroup lists, indexed by memcg_cache_id */
            struct list_lru_one     *lru[0];        /* Array of linked lists */
    };

    struct list_lru_one {
            struct list_head        list;           /* Linked list of objects */
            /* may become negative during memcg reparenting */
            long                    nr_items;       /* Number of objects in the list */
    };
    

    lru[0] is the list of objects that belong to the memcg with ID 0;
    lru[1] is the list of objects that belong to the memcg with ID 1;
    ...
    lru[n] is the list of objects that belong to the memcg with ID n;
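
    To make this more concrete, here is a simplified sketch of how an unused object lands in the per-memcg list; it is loosely based on list_lru_add() from mm/list_lru.c, but abridged and not the exact kernel code:

    /* Sketch: adding an unused object to the LRU list of its memcg */
    bool list_lru_add(struct list_lru *lru, struct list_head *item)
    {
            int nid = page_to_nid(virt_to_page(item));
            struct list_lru_node *nlru = &lru->node[nid];
            struct mem_cgroup *memcg;
            struct list_lru_one *l;

            spin_lock(&nlru->lock);
            if (list_empty(item)) {
                    /* Pick the list_lru_one of the memcg the object is charged to */
                    l = list_lru_from_kmem(nlru, item, &memcg);
                    list_add_tail(item, &l->list);
                    l->nr_items++;
                    spin_unlock(&nlru->lock);
                    return true;
            }
            spin_unlock(&nlru->lock);
            return false;
    }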

    Our problem concerns the s_dentry_lru and s_inode_lru LRU lists which, as their names suggest, contain unused dentry and inode objects of a file system.
    Later, when the system or a specific memcg runs short of memory, some of the list elements are finally released; a special mechanism called a shrinker takes care of that.

    Shrinker

    When the kernel needs to allocate memory pages but there is no free memory on the NUMA node or in the system as a whole, the reclaim mechanism kicks in. It tries to discard or flush: 1) pages of file contents from the page cache; 2) pages of anonymous memory to swap; and 3) cached SLAB objects (these are the source of our problem).

    Dropping some of the cached SLAB objects frees pages only indirectly: these objects are usually much smaller than a page, and a single page holds hundreds of them. When some objects are released, gaps of free memory appear in the SLAB pages, and they can be reused for other SLAB objects. The kernel adopts this algorithm intentionally: it is simple and fairly efficient. The interested reader can look at the formula that selects the portion of objects to reclaim in the do_shrink_slab() function.

    This function performs the actual reclaim of a portion of the objects, guided by the description passed to it in struct shrinker:

    static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
                                        struct shrinker *shrinker, int priority)
    {
            ...
            /* Count the objects */
            freeable = shrinker->count_objects(shrinker, shrinkctl);
            if (freeable == 0)
                    return 0;

            total_scan = SOME_FRACTION_OF(freeable);

            while (total_scan >= batch_size) {
                    /* Free some objects */
                    ret = shrinker->scan_objects(shrinker, shrinkctl);
                    total_scan -= shrinkctl->nr_scanned;
            }
            ...
    }
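
    For the curious, the "SOME_FRACTION_OF" part is computed approximately like this (a simplified rendition of the logic in mm/vmscan.c; the fields are real, but details and corner cases are trimmed):

    /* Sketch: how do_shrink_slab() picks the portion of objects to scan */
    delta = freeable >> priority;   /* priority drops as memory pressure grows, so the share grows */
    delta *= 4;
    do_div(delta, shrinker->seeks); /* "seeks" reflects how costly it is to recreate the objects */
    total_scan += delta;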

    For the superblock shrinker, these callbacks are implemented as follows. Each superblock maintains its own s_dentry_lru and s_inode_lru lists of unused objects that belong to it:

    struct super_block {
            ...
            struct shrinker         s_shrink;       /* per-sb shrinker handle */
            ...
            struct list_lru         s_dentry_lru;
            struct list_lru         s_inode_lru;
            ...
    };


    The .count_objects method returns the number of objects:

    static unsigned long super_cache_count(struct shrinker *shrink,
                                           struct shrink_control *sc)
    {
            ...
            total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
            total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);

            /* Return a portion of the total number of objects */
            total_objects = vfs_pressure_ratio(total_objects);
            return total_objects;
    }
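
    The vfs_pressure_ratio() helper simply scales the count by the vm.vfs_cache_pressure sysctl (100 by default); as a sketch, it boils down to something like:

    /* Sketch: scale the object count by /proc/sys/vm/vfs_cache_pressure */
    static long vfs_pressure_ratio(unsigned long val)
    {
            return mult_frac(val, sysctl_vfs_cache_pressure, 100);
    }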


    The .scan_objects method actually releases objects:

    static unsigned long super_cache_scan(struct shrinker *shrink,
                                          struct shrink_control *sc)
    {
            ...
            /* Free some objects from s_dentry_lru */
            prune_dcache_sb(sb, sc);
            /* Free some objects from s_inode_lru */
            prune_icache_sb(sb, sc);
            ...
    }

    The number of objects to free is passed in the sc parameter, together with the memcg whose objects must be dropped from the LRU:

    struct shrink_control {
            int nid;                        /* NUMA node ID */
            unsigned long nr_to_scan;       /* Number of objects */
            struct mem_cgroup *memcg;       /* memcg */
    };
    

    Thus, prune_dcache_sb() selects one linked list from the struct list_lru_memcg::lru[] array and works with it; prune_icache_sb() does the same.
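
    Under the hood, the per-memcg list is looked up by the memcg's index; a simplified sketch of that lookup (close to list_lru_from_memcg_idx() in mm/list_lru.c, with some checks omitted) looks like this:

    /* Sketch: pick the per-memcg list, or the global one if there is no memcg */
    static struct list_lru_one *
    list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
    {
            struct list_lru_memcg *memcg_lrus = rcu_dereference(nlru->memcg_lrus);

            if (memcg_lrus && idx >= 0)
                    return memcg_lrus->lru[idx];
            return &nlru->lru;
    }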

    Old shrinker traversal algorithm

    With the standard approach, reclaiming SLAB objects when there is not enough memory in
    sc->target_mem_cgroup looks as follows:

    shrink_node()
    {
            ...
            struct mem_cgroup *root = sc->target_mem_cgroup;

            /* Loop over all memcgs that are children of sc->target_mem_cgroup */
            memcg = mem_cgroup_iter(root, NULL, &reclaim);
            do {
                    ...
                    shrink_slab(memcg, ...);
                    ...
            } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
            ...
    }
    

    We iterate over all child memcgs and call shrink_slab() for each of them. Then, in the shrink_slab() function, we go through all the registered shrinkers and call do_shrink_slab() for each of them:

    static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
                                     struct mem_cgroup *memcg, int priority)
    {
            list_for_each_entry(shrinker, &shrinker_list, list) {
                    struct shrink_control sc = {
                            .nid = nid,
                            .memcg = memcg,
                    };

                    ret = do_shrink_slab(&sc, shrinker, ...);
            }
    }
    

    Recall that a shrinker is added to this list for every superblock. Now let's calculate how many times do_shrink_slab() gets called in the case of 200 containers with 20 memcgs and 20 mount points in each. In total we have 200 * 20 mount points and 200 * 20 control groups. If memory runs short in the topmost memcg, we have to walk all of its child memcgs (that is, all of them) and call every shrinker from shrinker_list for each of them. So the kernel makes 200 * 20 * 200 * 20 = 16,000,000 calls to do_shrink_slab().

    The overwhelming majority of these calls are useless: containers are usually isolated from each other, and the probability that container CT1 uses super_block2 created in CT2 is low. In other words, if memcg1 is a control group from CT1, then the corresponding element of the array super_block2->s_dentry_lru->node->memcg_lrus->lru[memcg1_id] is an empty list, and there is no point in calling do_shrink_slab() for it.

    This problem can be modeled with a simple bash script (here we use the numbers from the patchset that was later merged into the kernel):
    $ echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
    $ mkdir /sys/fs/cgroup/memory/ct
    $ echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
    $ for i in `seq 0 4000`;
      do
              mkdir /sys/fs/cgroup/memory/ct/$i;
              echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
              mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
      done

    Let's see what happens if we invoke the cache dropping procedure 5 times in a row:
    $ time echo 3 > /proc/sys/vm/drop_caches

    The first iteration lasts 14 seconds, because the cached objects actually exist in memory: 0.00user 13.78system 0:13.78elapsed 99%CPU
    The second iteration takes 5 seconds, although there are no cached objects left: 0.00user 5.59system 0:05.60elapsed 99%CPU
    The third iteration takes 5 seconds: 0.00user 5.48system 0:05.48elapsed 99%CPU
    The fourth iteration takes 8 seconds: 0.00user 8.35system 0:08.35elapsed 99%CPU
    The fifth iteration takes 8 seconds: 0.00user 8.34system 0:08.35elapsed 99%CPU

    It became obvious that the shrinker traversal algorithm used by the vanilla kernel is not optimal, and we needed to change it into something that scales better.

    New shrinker traversal algorithm

    I wanted the new algorithm to:

    1. be free of the flaws of the old one, and
    2. not introduce any new locks: it should call do_shrink_slab() only when the call makes sense (that is, when the corresponding linked list from the s_dentry_lru or s_inode_lru array is not empty), without directly touching the memory of the linked lists themselves.

    It was clear that this could only be achieved with a new data structure built on top of the heterogeneous shrinkers (there are shrinkers not only for superblocks but also for other kinds of objects not described in this article; the reader can get acquainted with them by searching for prealloc_shrinker() in the kernel code). The new data structure has to encode two states: "it makes sense to call do_shrink_slab()" and "it makes no sense to call do_shrink_slab()".
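
    As a reminder of what the shrinker side looks like, here is a hedged sketch of how a memcg-aware shrinker such as the superblock one is typically described and registered (the struct shrinker fields are real; the surrounding code is illustrative):

    /* Sketch: describing and registering a memcg-aware shrinker */
    s->s_shrink.count_objects = super_cache_count;
    s->s_shrink.scan_objects  = super_cache_scan;
    s->s_shrink.seeks         = DEFAULT_SEEKS;
    s->s_shrink.flags         = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;

    /* prealloc_shrinker() reserves the shrinker's resources (including its id), ... */
    if (prealloc_shrinker(&s->s_shrink))
            goto fail;
    /* ... and register_shrinker_prepared() later makes it visible to reclaim */
    register_shrinker_prepared(&s->s_shrink);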

    Data structures like IDA were rejected because they use locks internally. A bitmap suits this role perfectly: it allows atomic modification of individual bits, and combined with memory barriers it lets us build an efficient algorithm without any locks.

    Each shrinker gets a unique id (shrinker::id), and each memcg gets a bitmap large enough to hold the largest id currently registered. When the first element is added to the list s_dentry_lru->node->memcg_lrus->lru[memcg_id], the bit with the number shrinker->id is set in the bitmap of the corresponding memcg. The same happens for s_inode_lru.
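
    In code, the moment when the bit is set looks roughly like this fragment inside list_lru_add() (a sketch based on the patchset; memcg_set_shrinker_bit() is the helper that marks the shrinker's bit in the memcg's bitmap):

    /* Sketch: the list goes from empty to non-empty => mark the shrinker bit */
    l = list_lru_from_kmem(nlru, item, &memcg);
    list_add_tail(item, &l->list);
    if (!l->nr_items++)
            memcg_set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));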

    Now the loop in shrink_slab() can be optimized to visit only the necessary shrinkers:

    unsigned long shrink_slab()
    {
            ...
            for_each_set_bit(i, map, shrinker_nr_max) {
                    ...
                    shrinker = idr_find(&shrinker_idr, i);
                    ...
                    do_shrink_slab(&sc, shrinker, priority);
                    ...
            }
    }
    

    (Clearing the bit when a shrinker goes back to the "there is no point in calling do_shrink_slab()" state is also implemented; see the GitHub commit for details.)
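
    Roughly, that clearing works like the following fragment of the memcg-aware part of shrink_slab() (a sketch based on the patchset: do_shrink_slab() reports SHRINK_EMPTY for an empty LRU, the bit is cleared, and one more call is made to avoid racing with a parallel list_lru_add()):

    ret = do_shrink_slab(&sc, shrinker, priority);
    if (ret == SHRINK_EMPTY) {
            /* The LRU turned out to be empty: clear our bit ... */
            clear_bit(i, map->map);
            /*
             * ... pair with the barrier in memcg_set_shrinker_bit() and call
             * do_shrink_slab() once more, so that objects added between the
             * first call and clear_bit() are not lost.
             */
            smp_mb__after_atomic();
            ret = do_shrink_slab(&sc, shrinker, priority);
            if (ret == SHRINK_EMPTY)
                    ret = 0;
            else
                    memcg_set_shrinker_bit(memcg, nid, i);
    }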

    If we repeat the cache dropping test, it shows significantly better results with the new algorithm:
    $ time echo 3 > /proc/sys/vm/drop_caches

    First iteration: 0.00user 1.10system 0:01.10elapsed 99%CPU
    Second iteration: 0.00user 0.00system 0:00.01elapsed 64%CPU
    Third iteration: 0.00user 0.01system 0:00.01elapsed 82%CPU
    Fourth iteration: 0.00user 0.00system 0:00.01elapsed 64%CPU
    Fifth iteration: 0.00user 0.01system 0:00.01elapsed 82%CPU

    The iterations from the second to the fifth take 0.01 seconds each, 548 times faster than before.

    Since a similar cache-shrinking pass happens every time the machine runs short of memory, this optimization significantly improves the performance of machines with a large number of containers and memory control groups. The patch set (17 patches) was accepted into the vanilla kernel, and you can find it there starting with version 4.19.

    During the review of the patches, a Google engineer joined the discussion, and it turned out that they were facing the same problem, so the patches were additionally tested on a different kind of workload.
    As a result, the patchset was accepted after its 9th iteration; getting it into the vanilla kernel took about 4 months. Today the patchset is also included in our own Virtuozzo 7 kernel, starting with version vz7.71.9.
