Sorting integers when out of memory
A translation; the author of the English original is dzeban.
Introduction
Last time we discussed how to artificially limit the memory available to a program. As a bonus, I ended up with libmemrestrict, a library of wrappers around functions like malloc for tracking memory usage, and ptrace-restrict, a ptrace-based tool that intercepts brk, sbrk and mmap calls for the same purpose.
So why bother limiting memory at all - is running out of it really that common? When was the last time the OOM killer took down your application? Do you always think about memory consumption while programming? Memory is cheap, and if you run short, you just add a couple more gigabytes.
And yet you cannot add memory endlessly - and not because there is no infinite supply of it. When processing big data, it is simply impossible to fit all the input into an array: the data has to be distributed between RAM, storage and the network. Processing data like that requires its own algorithms and techniques.
So I started working through such tasks, beginning with a simple one: how do you sort a million integers (4 MiB of data) with only 2 MiB of memory? This task generalizes to the case where there isn't enough memory to hold all the data.
Given
The task is to write a program that sorts a set of integers stored in a file. To generate such files I wrote the simplest utilities, randints and rangeints.
The requirements:

- The program must output the sorted array to stdout as text.
- It must measure its running time and print it to stderr. We can't simply run the program under the time utility, because that would also count the time to read the file and to print the output (a minimal timing sketch follows below).
- It must work with no more than half the file's size in memory. To enforce this we will use libmemrestrict or ptrace-restrict. For some methods these utilities won't help: for mmap, for example, they won't work, and physical memory will have to be limited instead.

The methods will be checked against the original problem (sorting 4 MiB in 2 MiB). I will also run them on a virtual machine with 128 MiB of memory to sort 500 MB (125 million four-byte integers).
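To illustrate the timing requirement, here is a minimal self-contained sketch that measures only the sort and reports to stderr. The tiny inline array and the exact structure are illustrative, not the article's actual code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int compar(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int arr[] = { 5, 3, 8, 1, 9, 2 };   /* stands in for data read from a file */
    size_t count = sizeof(arr) / sizeof(arr[0]);
    struct timespec start, end;

    /* Measure only the sort itself, so reading the input and printing
     * the result do not inflate the reported time. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    qsort(arr, count, sizeof(int), compar);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    fprintf(stderr, "%zu bytes sorted in %f seconds\n",
            count * sizeof(int), elapsed);

    for (size_t i = 0; i < count; i++)  /* sorted result to stdout */
        printf("%d\n", arr[i]);
    return 0;
}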
Naive approach
Let's try sorting the numbers head-on and measure the memory usage (and time): simply read the file into an array of integers and call qsort.
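Schematically, the naive program looks something like this - a sketch with error handling and the stderr timing (shown in the earlier sketch) trimmed; the article's actual naive program may differ in details:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static int compar(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char *argv[])
{
    if (argc != 2) return 1;

    struct stat st;
    stat(argv[1], &st);
    size_t count = st.st_size / sizeof(int);

    /* Read the whole file into one array - this is the whole point
     * (and the whole problem) of the naive approach. */
    int *arr = malloc(st.st_size);
    FILE *f = fopen(argv[1], "rb");
    fread(arr, sizeof(int), count, f);
    fclose(f);

    qsort(arr, count, sizeof(int), compar);

    for (size_t i = 0; i < count; i++)
        printf("%d\n", arr[i]);

    free(arr);
    return 0;
}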
Let's try a file with 4 MB of data. Without limitations, everything works:
$ ./naive 4M.bin > /dev/null
4000000 bytes sorted in 0.323146 seconds
but that's not interesting. Let's limit the memory to 2 MiB:
$ LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted
Segmentation fault
Raising the limit to 5 MB (libmemrestrict reads its settings from the environment) fails again:
$ MR_THRESHOLD=5000000 LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted
Segmentation fault
Obviously, qsort needs more memory. Let's see how much it wants, using massif from valgrind:
$ valgrind --tool=massif ./naive ints
$ ms_print massif.out.10676
Here's a nice graph:
MB
8.819^ :::::::::::::::::::::::::::#
| : #
| : #
| : #
| : #
| : #
| : #
| : #
| : #
| :::::::@ #::::::::::::::::::::::::
| : @ #
| : @ #
| : @ #
| : @ #
| : @ #
| @@@@@@: @ #
| @ : @ #
| @ : @ #
| :::@ : @ #
| ::: @ : @ #
0 +----------------------------------------------------------------------->Gi
0 1.721
You can see several allocations doubling up to the 4 MiB request - that's my array - and then another 4 MiB for qsort. Statistics:
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
21 173,222,581 5,247,504 4,000,568 1,246,936 0
22 173,223,802 5,246,920 4,000,000 1,246,920 0
23 173,226,655 5,247,504 4,000,568 1,246,936 0
24 173,229,202 5,246,920 4,000,000 1,246,920 0
25 173,229,311 9,246,928 8,000,000 1,246,928 0
26 869,058,772 9,246,928 8,000,000 1,246,928 0
86.52% (8,000,000B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->43.26% (4,000,000B) 0x400A26: readfile (in /home/avd/dev/cs/sorting/external/naive)
| ->43.26% (4,000,000B) 0x400ACD: main (in /home/avd/dev/cs/sorting/external/naive)
|
->43.26% (4,000,000B) 0x35D40383F7: qsort_r (in /usr/lib64/libc-2.18.so)
| ->43.26% (4,000,000B) 0x400B3D: main (in /home/avd/dev/cs/sorting/external/naive)
|
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
I use 4 million bytes, and qsort_r uses another 4 million. On top of that, about 1.2 MB of extra heap overhead shows up in massif's accounting.
Apparently, qsort here behaves as O(n) in space. That's strange, because qsort works in place and uses various optimization techniques to guarantee O(log n) space complexity. Further reading: the glibc qsort implementation.
Okay, can it sort 500 MB in 128 MiB of RAM?
$ ./naive 500M.bin > /dev/null
Segmentation fault
Of course not. And the performance without restrictions:
$ ./naive 4M.bin > /dev/null
4000000 bytes sorted in 0.322712 seconds
So: naive sorting works fine without restrictions, doesn't work under restrictions at all, and qsort requires O(n) memory. Strangely, if you limit the memory to 5.3 MB, it works and doesn't require O(n) memory. I'm still figuring this out. (A likely explanation, though I haven't verified it here: glibc's qsort first tries merge sort, which allocates an O(n) temporary buffer, and falls back to in-place quicksort only when that allocation fails.)
File and mmap
mmap is a hacky way to sort large amounts of data under memory restrictions. It shifts the burden of juggling data between memory and swap onto the OS.
It works like this (a sketch follows the list):
- mmap the whole file into memory
- get a pointer to the mapped data
- call the sorting algorithm on that pointer
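A minimal sketch of this approach, with error handling mostly trimmed. MAP_PRIVATE is one reasonable choice here (it keeps the sort's writes out of the original file); the article's actual mmaped program may differ in details:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

static int compar(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char *argv[])
{
    if (argc != 2) return 1;

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; pages are faulted in lazily on first access,
     * and dirty pages go to swap when physical memory runs out. */
    int *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    size_t count = st.st_size / sizeof(int);
    qsort(data, count, sizeof(int), compar);

    for (size_t i = 0; i < count; i++)
        printf("%d\n", data[i]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}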
And that's it! You will no longer run out of memory, even when sorting a file larger than the available RAM. To understand why, you need to know a little about how the OS manages memory.
Each program is represented by a process with its own virtual address space, isolated from the others. Its size is limited by the width of the CPU address bus: for a 32-bit CPU it is 2^32 bytes, i.e. 4 GiB.
All the memory a process allocates is allocated in virtual memory. Virtual memory is mapped onto physical memory by the kernel together with the hardware memory management unit, the MMU. And this usually happens lazily: when a process asks for memory, the kernel grants it immediately, but does not back it with physical memory right away - the virtual page is not yet mapped to a physical one. When such a page is first accessed (for writing, say), the MMU raises a page fault exception, which the kernel handles by mapping the virtual page to a physical page. From then on, writes to that virtual page are translated by the MMU into writes to the corresponding physical address.
On the other hand, since the virtual address space is limited only by the width of the address bus, a program may want more memory than is physically present. For example, on a 32-bit system with 256 MiB of RAM, a process can still allocate and use 1 GiB of memory. In this case, memory pages get swapped out: instead of RAM they go to a drive, such as a hard disk. When such a page is accessed again, the kernel reads it back from the drive into memory (evicting another page if needed).
The kernel is good at juggling data between memory and the drive, so it is natural to exploit this in our task. When we mmap our file, the kernel reserves a range of virtual addresses without allocating anything for them yet. When we touch those addresses, the kernel loads pages from the input file into memory. When physical memory runs out, the kernel evicts pages to swap. This way we balance the data between the file on disk, memory and swap.
The only limits are the virtual address space (4 GiB on 32-bit systems, 256 TiB on 64-bit) and the swap size - and storage devices are cheap these days.
Because mmap places the whole file into virtual memory, we can't use libmemrestrict or ptrace-restrict: they limit exactly the virtual memory. Trying to sort 100M of data within 10M of virtual memory, we get an error from mmap:
$ qemu-x86_64 -R 10M ./mmaped 100M.bin
mmap stack: Cannot allocate memory
No wonder: the file is loaded into virtual memory, and the kernel distributes it between physical memory and swap. So we need at least 100M of virtual memory, plus a bit more for qemu.
Therefore, for this method, I use a virtual machine with 128 MiB of memory. Here is my sorting program using mmap. And it works!
$ free -m
total used free shared buffers cached
Mem: 119 42 76 0 4 16
-/+ buffers/cache: 21 97
Swap: 382 0 382
$ ll -h 500M.bin
-rw-r--r-- 1 root root 477M Feb 3 05:39 500M.bin
$ ./mmaped 500M.bin > /dev/null
500000000 bytes sorted in 32.250000 seconds
Information from top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3167 root 20 0 480m 90m 90m R 84.6 76.4 1:18.65 mmaped
We use 500 MB of virtual memory while only 90 MiB of resident memory is actually in use. Note that MiB is 2^20 bytes, while MB is 10^6 = 1,000,000 bytes: 500 MB = 500,000,000 bytes, and 500,000,000 >> 20 = 476 MiB.
Looking at the details from vmstat during the 500 MB sort, we can see how the kernel balances between swap, disk cache, buffers and free memory:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 77776 2120 15104 1 27 710 971 24 34 3 1 95 1
1 1 0 2060 488 90068 1 27 785 1057 25 37 3 1 95 1
1 0 928 3400 60 89744 1 27 799 1092 25 38 3 1 94 1
0 2 1908 1928 212 92040 1 27 852 1174 26 40 4 1 94 1
0 2 3436 2360 280 93056 1 27 911 1282 28 42 4 1 94 2
0 0 5272 3688 196 94688 1 27 1066 1471 31 48 4 1 93 2
0 0 5272 3720 204 94700 1 27 1064 1469 31 48 4 1 93 2
At first we had ~70 MiB of free memory, an empty swap, and some memory allocated to I/O buffers and cache. Then, after mmap, all that memory went to the cache. When free memory ran out, the kernel started using swap, which keeps growing along with the I/O load. In the end, almost all memory is given to the disk cache - which is normal, since disk-cache pages are the first to be evicted when an application needs memory.
So, sorting via mmap is a nice hack that requires a basic understanding of memory management and gives a simple solution for processing a large amount of data in a small amount of memory.
External merge sort
Now suppose mmap can't be used - say, you want to sort a 5 GiB file on a 32-bit system.
There is a well-known, popular method for this: external merge sort. If the data doesn't fit into memory, you use external storage - a hard disk, for example - and work with the data piece by piece, since it won't fit into memory all at once.
External merge sort works like this (a condensed sketch follows the list):
- split the data into chunks the size of the available memory buffer
- sort each chunk in memory and write it to external storage
- you now have filesize/buffersize sorted chunks
- read buffersize/#chunks-sized portions of each chunk, merge them in the buffer, and write the result to the output
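A condensed sketch of this scheme, with error handling trimmed. The chunk file names are illustrative, and unlike the real implementation this version merges one element at a time rather than buffered portions of each chunk:

#include <stdio.h>
#include <stdlib.h>

static int compar(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Phase 1: read bufsize-element pieces, sort each, write to chunk files. */
static int split(FILE *in, int *buf, size_t bufsize)
{
    int nchunks = 0;
    size_t n;
    while ((n = fread(buf, sizeof(int), bufsize, in)) > 0) {
        char name[32];
        qsort(buf, n, sizeof(int), compar);
        snprintf(name, sizeof(name), "chunk.%d", nchunks++);
        FILE *out = fopen(name, "wb");
        fwrite(buf, sizeof(int), n, out);
        fclose(out);
    }
    return nchunks;
}

/* Phase 2: k-way merge - repeatedly output the smallest chunk head. */
static void merge(int nchunks)
{
    FILE **f = malloc(nchunks * sizeof(*f));
    int *head = malloc(nchunks * sizeof(*head));
    char *alive = malloc(nchunks);
    for (int i = 0; i < nchunks; i++) {
        char name[32];
        snprintf(name, sizeof(name), "chunk.%d", i);
        f[i] = fopen(name, "rb");
        alive[i] = fread(&head[i], sizeof(int), 1, f[i]) == 1;
    }
    for (;;) {
        int min = -1;
        for (int i = 0; i < nchunks; i++)
            if (alive[i] && (min < 0 || head[i] < head[min]))
                min = i;
        if (min < 0)
            break;                       /* all chunks exhausted */
        printf("%d\n", head[min]);
        alive[min] = fread(&head[min], sizeof(int), 1, f[min]) == 1;
    }
    for (int i = 0; i < nchunks; i++)
        fclose(f[i]);
    free(f); free(head); free(alive);
}

int main(int argc, char *argv[])
{
    if (argc != 3) return 1;
    size_t bufsize = strtoul(argv[2], NULL, 10) / sizeof(int); /* bytes -> elements */
    FILE *in = fopen(argv[1], "rb");
    int *buf = malloc(bufsize * sizeof(int));
    int nchunks = split(in, buf, bufsize);
    free(buf);
    fclose(in);
    merge(nchunks);
    return 0;
}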
I made a simple, unoptimized implementation:
$ LD_PRELOAD=./libmemrestrict.so ./external-merge 4M.bin 1048576 > /dev/null
4194304 bytes sorted in 0.383171 seconds
It sorted 4 MiB within 2 MiB of memory, using a 1 MiB buffer.
Now let's sort 500 MB. First, disable swap - we manage the exchange of data chunks manually:
$ swapoff /dev/vda5
Drop the caches:
$ echo 3 > /proc/sys/vm/drop_caches
Check the available memory:
$ free -m
total used free shared buffers cached
Mem: 119 28 90 0 0 6
-/+ buffers/cache: 21 97
Swap: 0 0 0
We will use a buffer of 50 MB - 10 times smaller than the file size.
$ ./external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 120.115180 seconds
Whoa, two whole minutes! Why? I/O, of course. Three things slow the process down. In the split phase, we read the file sequentially through a small buffer. In the merge phase, we constantly open and close the chunk files. And then there's the output: in the merge phase we write 500 MB of data to stdout, which adds load even though it is redirected to /dev/null. Disabling the output gives about a 25% speedup:
$ ./external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 87.140000 seconds
Memory usage is fine. Running the program under massif shows that peak consumption is the buffer size plus a small heap:
--------------------------------------------------------------------------------
Command: ./external-merge 500M.bin 50000000
Massif arguments: (none)
ms_print arguments: massif.out.17423
--------------------------------------------------------------------------------
MB
47.75^ :::::
|#::::::@:::::::::::@:::::::::@:::@::::@::::@::::::::@::::@::::@:::@
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
0 +----------------------------------------------------------------------->Gi
0 332.5
Number of snapshots: 98
Detailed snapshots: [4 (peak), 10, 20, 29, 32, 35, 38, 45, 48, 54, 64, 74, 84, 94]
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
0 0 0 0 0 0
1 119,690 584 568 16 0
2 123,141 50,004,496 50,000,568 3,928 0
3 4,814,014 50,005,080 50,001,136 3,944 0
4 4,817,234 50,005,080 50,001,136 3,944 0
99.99% (50,001,136B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.99% (50,000,000B) 0x400FA2: external_merge_sort (in /root/external-merge)
| ->99.99% (50,000,000B) 0x40128E: main (in /root/external-merge)
|
->00.00% (1,136B) in 1+ places, all below ms_print's threshold (01.00%)
You can limit the memory to just above 50 MB - the extra slack goes to temporary strings holding the chunk file paths:
$ LD_PRELOAD=./libmemrestrict.so MR_THRESHOLD=51000000 ./external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 87.900000 seconds
So memory-wise we're fine, but speed-wise we're not: mmap did this operation in 32 seconds. Let's improve our approach.
Let's profile the program with gprof. Build an instrumented binary:
$ gcc -pg -g -Wall -Wextra external-merge.c -o external-merge-gprof
Then run the program many times to accumulate statistics, using my handy script from the gprof article. Here is the result:
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
81.98 432.87 432.87 compar
18.17 528.82 95.95 print_arr
0.00 528.82 0.00 1100 0.00 0.00 form_filename
0.00 528.82 0.00 100 0.00 0.00 merge
0.00 528.82 0.00 100 0.00 0.00 save_buf
0.00 528.82 0.00 10 0.00 0.00 external_merge_sort
0.00 528.82 0.00 10 0.00 0.00 split
Most of the time went to sorting and output. But don't forget that gprof doesn't show time spent in system calls and I/O.
What can be improved here?
- add multithreading and I/O tricks to the external sort
- use a different sorting algorithm
Generic external merge sort is a simple idea for sorting big data in a small amount of memory, but without improvements it is slow.
Tuning the sort
You could, of course, use multithreading for splitting and merging - but that's a bad idea, as we'll see. Using it in the split phase makes no sense, since the data sits in a single buffer. What we can try instead is to influence how the kernel reads data. There are two functions for this:
- readahead (Linux-specific)
- posix_fadvise with the POSIX_FADV_SEQUENTIAL hint
These tell the memory management subsystem how we are going to read the data. In our case, the reads are sequential, so it would be good to have the file contents show up in the page cache ahead of time.
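For example, a helper along these lines could issue both hints before the split phase. hint_sequential is a hypothetical name for illustration; readahead requires _GNU_SOURCE on Linux:

#define _GNU_SOURCE   /* for readahead() */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: hint the kernel that fd will be read sequentially. */
static void hint_sequential(int fd, size_t nbytes)
{
    /* Portable hint: len = 0 means "to the end of the file". */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    /* Linux-specific: populate the page cache with the first nbytes now. */
    readahead(fd, 0, nbytes);
}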
In the merge phase, instead of constantly opening and closing files, we can create a dedicated thread for each chunk file. Each thread keeps its file open and keeps a buffer filled for it; filled buffers are then merged and sent to the output. And readahead works for each thread.
Here is my advanced multithreaded version of external merge sort. Well, as I said, multithreading is not a good idea here - there is no difference on a single-core processor:
$ ./mt-ext-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 117.380000 seconds
That's with data output. And without output:
$ ./mt-ext-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 91.040000 seconds
Well, let's try on a four-core machine (Intel Core i7-3612QM CPU @ 2.10GHz):
$ ./naive 500M.bin > /dev/null
500000000 bytes sorted in 23.040499 seconds
$ ./mmaped 500M.bin > /dev/null
500000000 bytes sorted in 23.542076 seconds
$ ./external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 39.228695 seconds
$ ./mt-external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 41.062793 seconds
$ ./external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 28.893745 seconds
$ ./mt-external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 28.368976 seconds
Now with a hundred chunks and a hundred threads:
$ ./external-merge-no-output 500M.bin 5000000 > /dev/null
500000000 bytes sorted in 27.107728 seconds
$ ./mt-external-merge-no-output 500M.bin 5000000 > /dev/null
500000000 bytes sorted in 28.558468 seconds
There is no difference between external-merge and mt-external-merge. Why? Because multithreading doesn't solve the problem of I/O being the bottleneck. It works well when:
- threads run independently of each other
- the I/O devices can serve requests in parallel - as in RAID, for example
Our threads are interdependent: the main thread has to wait for a buffer to be sorted before it can start the next read from the file. And reading for the split is faster than sorting a buffer, so the threads spend most of their time waiting for the main thread to finish.
Other ways to improve the algorithm are needed.
Special sorting algorithms
Let's try something other than quicksort. We know we are sorting integers, so let's exploit that. There are special algorithms for specific data types, and they fall into two groups:
- those that don't use comparisons
- those that don't require loading the whole array into memory
They beat O(n log n), the lower bound for comparison-based algorithms like quicksort. But not all of them are suitable under memory restrictions, so I settled on counting sort.
Counting Sort
If there is a lot of data with a small spread of values, counting sort fits. The idea is simple: instead of the data itself, keep an array of counters in memory. Read the data sequentially and increment the corresponding counters. The algorithm is linear in time, and its memory use is proportional to the range of the data.
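A minimal sketch of the idea. RANGE and BUFELEMS are illustrative constants (the real program takes the range and buffer size as parameters), and values outside the assumed range are simply ignored here:

#include <stdio.h>
#include <stdlib.h>

#define RANGE 1000000u   /* assumed data range: 0 .. RANGE-1 */
#define BUFELEMS 4096    /* read buffer size, in elements */

int main(int argc, char *argv[])
{
    if (argc != 2) return 1;
    FILE *in = fopen(argv[1], "rb");
    if (!in) { perror("fopen"); return 1; }

    /* The counters are the only large allocation: 4 bytes per possible value. */
    unsigned *count = calloc(RANGE, sizeof(unsigned));
    int buf[BUFELEMS];
    size_t n;
    while ((n = fread(buf, sizeof(int), BUFELEMS, in)) > 0)
        for (size_t i = 0; i < n; i++)
            if ((unsigned)buf[i] < RANGE)   /* guard: sketch ignores out-of-range values */
                count[buf[i]]++;
    fclose(in);

    /* Output each value as many times as it was seen. */
    for (unsigned v = 0; v < RANGE; v++)
        for (unsigned c = 0; c < count[v]; c++)
            printf("%u\n", v);

    free(count);
    return 0;
}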
A simple implementation works with an array indexed from 0 to N: the integers map directly to array indices. Here is my version, which works well without any tuning. Its second argument is the buffer size in elements; buffering speeds things up a lot, because the program doesn't have to read the file 4 bytes at a time.
$ ./counting-array 500M-range.bin 1000000 > /dev/null
Range is 1000000
500000000 bytes sorted in 3.240000 seconds
Yep. Half a gigabyte of data sorted in about three seconds, on 128 MiB of memory and one CPU. Compare with qsort or mmap:
$ ./mmaped 500M-range.bin > /dev/null
500000000 bytes sorted in 76.150000 seconds
23 times faster!
But don't forget the restrictions: only integers (or their equivalent), and only from a small, dense interval. I tried a variant for sparse intervals using hashing and binary search, but its performance was very poor.
And if we can assume our numbers are unique, each counter has only two states - present or absent - so a single bit per counter is enough, and the array shrinks. In fact, we don't even need an array of counters: we can store the numbers themselves as bits, turning the array into a bit vector. We read the file and set bit N if the number N occurred; then we walk the vector and print the numbers whose bits are set.
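A sketch of the bit-vector variant, under the same illustrative assumptions (unique non-negative integers below RANGE):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define RANGE 1000000u

int main(int argc, char *argv[])
{
    if (argc != 2) return 1;
    FILE *in = fopen(argv[1], "rb");
    if (!in) { perror("fopen"); return 1; }

    /* One bit per possible value: RANGE/8 bytes instead of RANGE*4. */
    uint8_t *bits = calloc((RANGE + 7) / 8, 1);
    int v;
    while (fread(&v, sizeof(v), 1, in) == 1)
        if ((unsigned)v < RANGE) {          /* guard: sketch ignores out-of-range values */
            unsigned u = (unsigned)v;
            bits[u >> 3] |= 1u << (u & 7);  /* set bit N for number N */
        }
    fclose(in);

    for (unsigned n = 0; n < RANGE; n++)
        if (bits[n >> 3] & (1u << (n & 7)))
            printf("%u\n", n);

    free(bits);
    return 0;
}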
Such solutions still require care, because you can overrun the limits anyway. For example, to sort all numbers from the full integer range (2^32), you need 1 bit per number: 4,294,967,296 bits = 536,870,912 bytes = 512 MiB. We only have 128 MiB, so that's not enough here. But in some cases the gain is colossal - there is a story on exactly this subject in Programming Pearls by Jon Bentley.
Knowing your data is very helpful.
Summary
Over the five months spent on this article I did a lot: a dozen programs, several good ideas, and many bad ones. And much more could still be done and fixed.
The simple problem of sorting data without enough memory exposed a whole set of things we usually don't think about:
- common algorithms don't suit every problem
- debugging and profiling are extremely useful and illustrative
- I/O becomes the bottleneck unless you offload the work to the kernel
- multithreading is not a silver bullet for performance
- know your data and your environment
Summary table:

| Test              | Naive quicksort | mmap + quicksort | External merge sort | Multithreaded external merge sort | Counting sort |
|-------------------|-----------------|------------------|---------------------|-----------------------------------|---------------|
| 4 MiB in 2 MiB    | Segfault        | N/A              | 0.38 s              | 0.41 s                            | 0.01 s        |
| 500 MB in 128 MiB | Segfault        | 32.25 s          | 87.14 s             | 91.04 s                           | 3.24 s        |
Know your data and design a simple algorithm to work with it!
References
- Sort a million 32-bit numbers in 2 MB of memory in Python
- External sorting of large data sets
- Effective sorting of data that exceeds the amount of memory
- Cracking the oyster (the first column from Programming Pearls)
- Multithreaded file input and output
- Improving Algorithms with Performance Measurement, Part 2
- Parallel Sorting Algorithms