Sorting integers when out of memory
A translation; the author of the English original is dzeban.
Introduction
Last time we discussed how to artificially limit the memory available to a program. As a bonus, I ended up with libmemrestrict, a library of wrappers around functions like malloc for tracking memory usage, and ptrace-restrict, a ptrace-based tool that intercepts brk, sbrk and mmap calls for the same purpose.
So why bother limiting memory at all - is running out of it really that common? When was the last time the OOM killer took down your application? Do you always think about memory consumption while programming? Memory is cheap, and if you run short, you just add a couple more gigabytes.
And yet you cannot add memory endlessly - and not because there is no infinite supply of it. When processing big data, it is simply impossible to fit all the input into an array: the data has to be distributed between RAM, storage and the network. Processing data like that requires its own algorithms and techniques.
So I started working through such tasks, beginning with a simple one: how do you sort a million integers (4 MiB of data) with only 2 MiB of memory? This task generalizes to the case where there isn't enough memory to hold all the data.
Given
The task is to write a program that sorts a set of integers stored in a file. To generate such files I wrote the simplest utilities, randints and rangeints.
The requirements:

- The program must output the sorted array to stdout as text.
- It must measure its running time and print it to stderr. We can't simply run the program under the time utility, because that would also count the time to read the file and to print the output (a minimal timing sketch follows below).
- It must work with no more than half the file's size in memory. To enforce this we will use libmemrestrict or ptrace-restrict. For some methods these utilities won't help: for mmap, for example, they won't work, and physical memory will have to be limited instead.

The methods will be checked against the original problem (sorting 4 MiB in 2 MiB). I will also run them on a virtual machine with 128 MiB of memory to sort 500 MB (125 million four-byte integers).
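To illustrate the timing requirement, here is a minimal self-contained sketch that measures only the sort and reports to stderr. The tiny inline array and the exact structure are illustrative, not the article's actual code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int compar(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int arr[] = { 5, 3, 8, 1, 9, 2 };   /* stands in for data read from a file */
    size_t count = sizeof(arr) / sizeof(arr[0]);
    struct timespec start, end;

    /* Measure only the sort itself, so reading the input and printing
     * the result do not inflate the reported time. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    qsort(arr, count, sizeof(int), compar);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    fprintf(stderr, "%zu bytes sorted in %f seconds\n",
            count * sizeof(int), elapsed);

    for (size_t i = 0; i < count; i++)  /* sorted result to stdout */
        printf("%d\n", arr[i]);
    return 0;
}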
Naive approach
Let's try sorting the numbers head-on and measure the memory usage (and time): simply read the file into an array of integers and call qsort.
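Schematically, the naive program looks something like this - a sketch with error handling and the stderr timing (shown in the earlier sketch) trimmed; the article's actual naive program may differ in details:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static int compar(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char *argv[])
{
    if (argc != 2) return 1;

    struct stat st;
    stat(argv[1], &st);
    size_t count = st.st_size / sizeof(int);

    /* Read the whole file into one array - this is the whole point
     * (and the whole problem) of the naive approach. */
    int *arr = malloc(st.st_size);
    FILE *f = fopen(argv[1], "rb");
    fread(arr, sizeof(int), count, f);
    fclose(f);

    qsort(arr, count, sizeof(int), compar);

    for (size_t i = 0; i < count; i++)
        printf("%d\n", arr[i]);

    free(arr);
    return 0;
}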
Let's try a file with 4 MB of data. Without limitations, everything works:
$ ./naive 4M.bin > /dev/null
4000000 bytes sorted in 0.323146 seconds
but that's not interesting. Let's limit the memory to 2 MiB:
$ LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted
Segmentation fault
Raising the limit to 5 MB (libmemrestrict reads its settings from the environment) fails again:
$ MR_THRESHOLD=5000000 LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted
Segmentation fault
Obviously, qsort needs more memory. Let's see how much it wants, using massif from valgrind:
$ valgrind --tool=massif ./naive ints
$ ms_print massif.out.10676
Here's a nice graph:
MB
8.819^ :::::::::::::::::::::::::::#
| : #
| : #
| : #
| : #
| : #
| : #
| : #
| : #
| :::::::@ #::::::::::::::::::::::::
| : @ #
| : @ #
| : @ #
| : @ #
| : @ #
| @@@@@@: @ #
| @ : @ #
| @ : @ #
| :::@ : @ #
| ::: @ : @ #
0 +----------------------------------------------------------------------->Gi
0 1.721
You can see several allocations doubling up to the 4 MiB request - that's my array - and then another 4 MiB for qsort. Statistics:
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
21 173,222,581 5,247,504 4,000,568 1,246,936 0
22 173,223,802 5,246,920 4,000,000 1,246,920 0
23 173,226,655 5,247,504 4,000,568 1,246,936 0
24 173,229,202 5,246,920 4,000,000 1,246,920 0
25 173,229,311 9,246,928 8,000,000 1,246,928 0
26 869,058,772 9,246,928 8,000,000 1,246,928 0
86.52% (8,000,000B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->43.26% (4,000,000B) 0x400A26: readfile (in /home/avd/dev/cs/sorting/external/naive)
| ->43.26% (4,000,000B) 0x400ACD: main (in /home/avd/dev/cs/sorting/external/naive)
|
->43.26% (4,000,000B) 0x35D40383F7: qsort_r (in /usr/lib64/libc-2.18.so)
| ->43.26% (4,000,000B) 0x400B3D: main (in /home/avd/dev/cs/sorting/external/naive)
|
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
I use 4 million bytes, and qsort_r uses another 4 million. On top of that, about 1.2 MB of extra heap overhead shows up in massif's accounting.
Apparently, qsort here behaves as O(n) in space. That's strange, because qsort works in place and uses various optimization techniques to guarantee O(log n) space complexity. Further reading: the glibc qsort implementation.
Okay, can it sort 500 MB in 128 MiB of RAM?
$ ./naive 500M.bin > /dev/null
Segmentation fault
Of course not. And the performance without restrictions:
$ ./naive 4M.bin > /dev/null
4000000 bytes sorted in 0.322712 seconds
So: naive sorting works fine without restrictions, doesn't work under restrictions at all, and qsort requires O(n) memory. Strangely, if you limit the memory to 5.3 MB, it works and doesn't require O(n) memory. I'm still figuring this out. (A likely explanation, though I haven't verified it here: glibc's qsort first tries merge sort, which allocates an O(n) temporary buffer, and falls back to in-place quicksort only when that allocation fails.)
File and mmap
mmap is a hacky way to sort large amounts of data under memory restrictions. It shifts the burden of juggling data between memory and swap onto the OS.
It works like this (a sketch follows the list):
- mmap the whole file into memory
- get a pointer to the mapped data
- call the sorting algorithm on that pointer
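A minimal sketch of this approach, with error handling mostly trimmed. MAP_PRIVATE is one reasonable choice here (it keeps the sort's writes out of the original file); the article's actual mmaped program may differ in details:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

static int compar(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char *argv[])
{
    if (argc != 2) return 1;

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; pages are faulted in lazily on first access,
     * and dirty pages go to swap when physical memory runs out. */
    int *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    size_t count = st.st_size / sizeof(int);
    qsort(data, count, sizeof(int), compar);

    for (size_t i = 0; i < count; i++)
        printf("%d\n", data[i]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}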
And that's it! You will no longer run out of memory, even when sorting a file larger than the available RAM. To understand why, you need to know a little about how the OS manages memory.
Each program is represented by a process with its own virtual address space, isolated from the others. Its size is limited by the width of the CPU address bus: for a 32-bit CPU it is 2^32 bytes, i.e. 4 GiB.
All the memory a process allocates is allocated in virtual memory. Virtual memory is mapped onto physical memory by the kernel together with the hardware memory management unit, the MMU. And this usually happens lazily: when a process asks for memory, the kernel grants it immediately, but does not back it with physical memory right away - the virtual page is not yet mapped to a physical one. When such a page is first accessed (for writing, say), the MMU raises a page fault exception, which the kernel handles by mapping the virtual page to a physical page. From then on, writes to that virtual page are translated by the MMU into writes to the corresponding physical address.
On the other hand, since the virtual address space is limited only by the width of the address bus, a program may want more memory than is physically present. For example, on a 32-bit system with 256 MiB of RAM, a process can still allocate and use 1 GiB of memory. In this case, memory pages get swapped out: instead of RAM they go to a drive, such as a hard disk. When such a page is accessed again, the kernel reads it back from the drive into memory (evicting another page if needed).
The kernel is good at juggling data between memory and the drive, so it is natural to exploit this in our task. When we mmap our file, the kernel reserves a range of virtual addresses without allocating anything for them yet. When we touch those addresses, the kernel loads pages from the input file into memory. When physical memory runs out, the kernel evicts pages to swap. This way we balance the data between the file on disk, memory and swap.
The only limits are the virtual address space (4 GiB on 32-bit systems, 256 TiB on 64-bit) and the swap size - and storage devices are cheap these days.
Because mmap places the whole file into virtual memory, we can't use libmemrestrict or ptrace-restrict: they limit exactly the virtual memory. Trying to sort 100M of data within 10M of virtual memory, we get an error from mmap:
$ qemu-x86_64 -R 10M ./mmaped 100M.bin
mmap stack: Cannot allocate memory
No wonder: the file is loaded into virtual memory, and the kernel distributes it between physical memory and swap. So we need at least 100M of virtual memory, plus a bit more for qemu.
Therefore, for this method, I use a virtual machine with 128 MiB of memory. Here is my sorting program using mmap. And it works!
$ free -m
total used free shared buffers cached
Mem: 119 42 76 0 4 16
-/+ buffers/cache: 21 97
Swap: 382 0 382
$ ll -h 500M.bin
-rw-r--r-- 1 root root 477M Feb 3 05:39 500M.bin
$ ./mmaped 500M.bin > /dev/null
500000000 bytes sorted in 32.250000 seconds
Information from top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3167 root 20 0 480m 90m 90m R 84.6 76.4 1:18.65 mmaped
We use 500 MB of virtual memory while only 90 MiB of resident memory is actually in use. Note that MiB is 2^20 bytes, while MB is 10^6 = 1,000,000 bytes: 500 MB = 500,000,000 bytes, and 500,000,000 >> 20 = 476 MiB.
Looking at the details from vmstat during the 500 MB sort, we can see how the kernel balances between swap, disk cache, buffers and free memory:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 77776 2120 15104 1 27 710 971 24 34 3 1 95 1
1 1 0 2060 488 90068 1 27 785 1057 25 37 3 1 95 1
1 0 928 3400 60 89744 1 27 799 1092 25 38 3 1 94 1
0 2 1908 1928 212 92040 1 27 852 1174 26 40 4 1 94 1
0 2 3436 2360 280 93056 1 27 911 1282 28 42 4 1 94 2
0 0 5272 3688 196 94688 1 27 1066 1471 31 48 4 1 93 2
0 0 5272 3720 204 94700 1 27 1064 1469 31 48 4 1 93 2
At first we had ~70 MiB of free memory, an empty swap, and some memory allocated to I/O buffers and cache. Then, after mmap, all that memory went to the cache. When free memory ran out, the kernel started using swap, which keeps growing along with the I/O load. In the end, almost all memory is given to the disk cache - which is normal, since disk-cache pages are the first to be evicted when an application needs memory.
So, sorting via mmap is a nice hack that requires a basic understanding of memory management and gives a simple solution for processing a large amount of data in a small amount of memory.
External merge sort
Now suppose mmap can't be used - say, you want to sort a 5 GiB file on a 32-bit system.
There is a well-known, popular method for this: external merge sort. If the data doesn't fit into memory, you use external storage - a hard disk, for example - and work with the data piece by piece, since it won't fit into memory all at once.
External merge sort works like this (a condensed sketch follows the list):
- split the data into chunks the size of the available memory buffer
- sort each chunk in memory and write it to external storage
- you now have filesize/buffersize sorted chunks
- read buffersize/#chunks-sized portions of each chunk, merge them in the buffer, and write the result to the output
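A condensed sketch of this scheme, with error handling trimmed. The chunk file names are illustrative, and unlike the real implementation this version merges one element at a time rather than buffered portions of each chunk:

#include <stdio.h>
#include <stdlib.h>

static int compar(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Phase 1: read bufsize-element pieces, sort each, write to chunk files. */
static int split(FILE *in, int *buf, size_t bufsize)
{
    int nchunks = 0;
    size_t n;
    while ((n = fread(buf, sizeof(int), bufsize, in)) > 0) {
        char name[32];
        qsort(buf, n, sizeof(int), compar);
        snprintf(name, sizeof(name), "chunk.%d", nchunks++);
        FILE *out = fopen(name, "wb");
        fwrite(buf, sizeof(int), n, out);
        fclose(out);
    }
    return nchunks;
}

/* Phase 2: k-way merge - repeatedly output the smallest chunk head. */
static void merge(int nchunks)
{
    FILE **f = malloc(nchunks * sizeof(*f));
    int *head = malloc(nchunks * sizeof(*head));
    char *alive = malloc(nchunks);
    for (int i = 0; i < nchunks; i++) {
        char name[32];
        snprintf(name, sizeof(name), "chunk.%d", i);
        f[i] = fopen(name, "rb");
        alive[i] = fread(&head[i], sizeof(int), 1, f[i]) == 1;
    }
    for (;;) {
        int min = -1;
        for (int i = 0; i < nchunks; i++)
            if (alive[i] && (min < 0 || head[i] < head[min]))
                min = i;
        if (min < 0)
            break;                       /* all chunks exhausted */
        printf("%d\n", head[min]);
        alive[min] = fread(&head[min], sizeof(int), 1, f[min]) == 1;
    }
    for (int i = 0; i < nchunks; i++)
        fclose(f[i]);
    free(f); free(head); free(alive);
}

int main(int argc, char *argv[])
{
    if (argc != 3) return 1;
    size_t bufsize = strtoul(argv[2], NULL, 10) / sizeof(int); /* bytes -> elements */
    FILE *in = fopen(argv[1], "rb");
    int *buf = malloc(bufsize * sizeof(int));
    int nchunks = split(in, buf, bufsize);
    free(buf);
    fclose(in);
    merge(nchunks);
    return 0;
}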
I made a simple, unoptimized implementation:
$ LD_PRELOAD=./libmemrestrict.so ./external-merge 4M.bin 1048576 > /dev/null
4194304 bytes sorted in 0.383171 seconds
It sorted 4 MiB within 2 MiB of memory, using a 1 MiB buffer.
Now let's sort 500 MB. First, disable swap - we manage the exchange of data chunks manually:
$ swapoff /dev/vda5
Drop the caches:
$ echo 3 > /proc/sys/vm/drop_caches
Check the available memory:
$ free -m
total used free shared buffers cached
Mem: 119 28 90 0 0 6
-/+ buffers/cache: 21 97
Swap: 0 0 0
We will use a buffer of 50 MB - 10 times smaller than the file size.
$ ./external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 120.115180 seconds
Whoa, two whole minutes! Why? I/O, of course. Three things slow the process down. In the split phase, we read the file sequentially through a small buffer. In the merge phase, we constantly open and close the chunk files. And then there's the output: in the merge phase we write 500 MB of data to stdout, which adds load even though it is redirected to /dev/null. Disabling the output gives about a 25% speedup:
$ ./external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 87.140000 seconds
Memory usage is fine. Running the program under massif shows that peak consumption is the buffer size plus a small heap:
--------------------------------------------------------------------------------
Command: ./external-merge 500M.bin 50000000
Massif arguments: (none)
ms_print arguments: massif.out.17423
--------------------------------------------------------------------------------
MB
47.75^ :::::
|#::::::@:::::::::::@:::::::::@:::@::::@::::@::::::::@::::@::::@:::@
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
|# : : @ : : : : @ : : @ @ @ @ : @ @ @ @
0 +----------------------------------------------------------------------->Gi
0 332.5
Number of snapshots: 98
Detailed snapshots: [4 (peak), 10, 20, 29, 32, 35, 38, 45, 48, 54, 64, 74, 84, 94]
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
0 0 0 0 0 0
1 119,690 584 568 16 0
2 123,141 50,004,496 50,000,568 3,928 0
3 4,814,014 50,005,080 50,001,136 3,944 0
4 4,817,234 50,005,080 50,001,136 3,944 0
99.99% (50,001,136B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.99% (50,000,000B) 0x400FA2: external_merge_sort (in /root/external-merge)
| ->99.99% (50,000,000B) 0x40128E: main (in /root/external-merge)
|
->00.00% (1,136B) in 1+ places, all below ms_print's threshold (01.00%)
You can limit the memory to just above 50 MB - the extra slack goes to temporary strings holding the chunk file paths:
$ LD_PRELOAD=./libmemrestrict.so MR_THRESHOLD=51000000 ./external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 87.900000 seconds
So memory-wise we're fine, but speed-wise we're not: mmap did this operation in 32 seconds. Let's improve our approach.
Let's profile the program with gprof. Build an instrumented binary:
$ gcc -pg -g -Wall -Wextra external-merge.c -o external-merge-gprof
Then run the program many times to accumulate statistics, using my handy script from the gprof article. Here is the result:
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
81.98 432.87 432.87 compar
18.17 528.82 95.95 print_arr
0.00 528.82 0.00 1100 0.00 0.00 form_filename
0.00 528.82 0.00 100 0.00 0.00 merge
0.00 528.82 0.00 100 0.00 0.00 save_buf
0.00 528.82 0.00 10 0.00 0.00 external_merge_sort
0.00 528.82 0.00 10 0.00 0.00 split
Most of the time went to sorting and output. But don't forget that gprof doesn't show time spent in system calls and I/O.
What can be improved here?
- add multithreading and I/O tricks to the external sort
- use a different sorting algorithm
Generic external merge sort is a simple idea for sorting big data in a small amount of memory, but without improvements it is slow.
Tuning the sort
You could, of course, use multithreading for splitting and merging - but that's a bad idea, as we'll see. Using it in the split phase makes no sense, since the data sits in a single buffer. What we can try instead is to influence how the kernel reads data. There are two functions for this:
- readahead (Linux-specific)
- posix_fadvise with the POSIX_FADV_SEQUENTIAL hint
These tell the memory management subsystem how we are going to read the data. In our case, the reads are sequential, so it would be good to have the file contents show up in the page cache ahead of time.
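For example, a helper along these lines could issue both hints before the split phase. hint_sequential is a hypothetical name for illustration; readahead requires _GNU_SOURCE on Linux:

#define _GNU_SOURCE   /* for readahead() */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: hint the kernel that fd will be read sequentially. */
static void hint_sequential(int fd, size_t nbytes)
{
    /* Portable hint: len = 0 means "to the end of the file". */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    /* Linux-specific: populate the page cache with the first nbytes now. */
    readahead(fd, 0, nbytes);
}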
In the merge phase, instead of constantly opening and closing files, we can create a dedicated thread for each chunk file. Each thread keeps its file open and keeps a buffer filled for it; filled buffers are then merged and sent to the output. And readahead works for each thread.
Here is my advanced multithreaded version of external merge sort. Well, as I said, multithreading is not a good idea here - there is no difference on a single-core processor:
$ ./mt-ext-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 117.380000 seconds
That's with data output. And without output:
$ ./mt-ext-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 91.040000 seconds
Well, let's try on a four-core machine (Intel Core i7-3612QM CPU @ 2.10GHz):
$ ./naive 500M.bin > /dev/null
500000000 bytes sorted in 23.040499 seconds
$ ./mmaped 500M.bin > /dev/null
500000000 bytes sorted in 23.542076 seconds
$ ./external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 39.228695 seconds
$ ./mt-external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 41.062793 seconds
$ ./external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 28.893745 seconds
$ ./mt-external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 28.368976 seconds
Now with a hundred chunks and a hundred threads:
$ ./external-merge-no-output 500M.bin 5000000 > /dev/null
500000000 bytes sorted in 27.107728 seconds
$ ./mt-external-merge-no-output 500M.bin 5000000 > /dev/null
500000000 bytes sorted in 28.558468 seconds
There is no difference between external-merge and mt-external-merge. Why? Because multithreading doesn't solve the problem of I/O being the bottleneck. It works well when:
- threads run independently of each other
- the I/O devices can serve requests in parallel - as in RAID, for example
Our threads are interdependent: the main thread has to wait for a buffer to be sorted before it can start the next read from the file. And reading for the split is faster than sorting a buffer, so the threads spend most of their time waiting for the main thread to finish.
Other ways to improve the algorithm are needed.
Special sorting algorithms
Let's try something other than quicksort. We know we are sorting integers, so let's exploit that. There are special algorithms for specific data types, and they fall into two groups:
- those that don't use comparisons
- those that don't require loading the whole array into memory
They beat O(n log n), the lower bound for comparison-based algorithms like quicksort. But not all of them are suitable under memory restrictions, so I settled on counting sort.
Counting Sort
If there is a lot of data with a small spread of values, counting sort fits. The idea is simple: instead of the data itself, keep an array of counters in memory. Read the data sequentially and increment the corresponding counters. The algorithm is linear in time, and its memory use is proportional to the range of the data.
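A minimal sketch of the idea. RANGE and BUFELEMS are illustrative constants (the real program takes the range and buffer size as parameters), and values outside the assumed range are simply ignored here:

#include <stdio.h>
#include <stdlib.h>

#define RANGE 1000000u   /* assumed data range: 0 .. RANGE-1 */
#define BUFELEMS 4096    /* read buffer size, in elements */

int main(int argc, char *argv[])
{
    if (argc != 2) return 1;
    FILE *in = fopen(argv[1], "rb");
    if (!in) { perror("fopen"); return 1; }

    /* The counters are the only large allocation: 4 bytes per possible value. */
    unsigned *count = calloc(RANGE, sizeof(unsigned));
    int buf[BUFELEMS];
    size_t n;
    while ((n = fread(buf, sizeof(int), BUFELEMS, in)) > 0)
        for (size_t i = 0; i < n; i++)
            if ((unsigned)buf[i] < RANGE)   /* guard: sketch ignores out-of-range values */
                count[buf[i]]++;
    fclose(in);

    /* Output each value as many times as it was seen. */
    for (unsigned v = 0; v < RANGE; v++)
        for (unsigned c = 0; c < count[v]; c++)
            printf("%u\n", v);

    free(count);
    return 0;
}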
A simple implementation works with an array indexed from 0 to N: the integers map directly to array indices. Here is my version, which works well without any tuning. Its second argument is the buffer size in elements; buffering speeds things up a lot, because the program doesn't have to read the file 4 bytes at a time.
$ ./counting-array 500M-range.bin 1000000 > /dev/null
Range is 1000000
500000000 bytes sorted in 3.240000 seconds
Yep. Half a gigabyte of data sorted in about three seconds, on 128 MiB of memory and one CPU. Compare with qsort or mmap:
$ ./mmaped 500M-range.bin > /dev/null
500000000 bytes sorted in 76.150000 seconds
23 times faster!
But don't forget the restrictions: only integers (or their equivalent), and only from a small, dense interval. I tried a variant for sparse intervals using hashing and binary search, but its performance was very poor.
And if we can assume our numbers are unique, each counter has only two states - present or absent - so a single bit per counter is enough, and the array shrinks. In fact, we don't even need an array of counters: we can store the numbers themselves as bits, turning the array into a bit vector. We read the file and set bit N if the number N occurred; then we walk the vector and print the numbers whose bits are set.
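A sketch of the bit-vector variant, under the same illustrative assumptions (unique non-negative integers below RANGE):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define RANGE 1000000u

int main(int argc, char *argv[])
{
    if (argc != 2) return 1;
    FILE *in = fopen(argv[1], "rb");
    if (!in) { perror("fopen"); return 1; }

    /* One bit per possible value: RANGE/8 bytes instead of RANGE*4. */
    uint8_t *bits = calloc((RANGE + 7) / 8, 1);
    int v;
    while (fread(&v, sizeof(v), 1, in) == 1)
        if ((unsigned)v < RANGE) {          /* guard: sketch ignores out-of-range values */
            unsigned u = (unsigned)v;
            bits[u >> 3] |= 1u << (u & 7);  /* set bit N for number N */
        }
    fclose(in);

    for (unsigned n = 0; n < RANGE; n++)
        if (bits[n >> 3] & (1u << (n & 7)))
            printf("%u\n", n);

    free(bits);
    return 0;
}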
Such solutions still require care, because you can overrun the limits anyway. For example, to sort all numbers from the full integer range (2^32), you need 1 bit per number: 4,294,967,296 bits = 536,870,912 bytes = 512 MiB. We only have 128 MiB, so that's not enough here. But in some cases the gain is colossal - there is a story on exactly this subject in Programming Pearls by Jon Bentley.
Knowing your data is very helpful.
Summary
Over the five months spent on this article I did a lot: a dozen programs, several good ideas, and many bad ones. And much more could still be done and fixed.
The simple problem of sorting data without enough memory exposed a whole set of things we usually don't think about:
- common algorithms don't suit every problem
- debugging and profiling are extremely useful and illustrative
- I/O becomes the bottleneck unless you offload the work to the kernel
- multithreading is not a silver bullet for performance
- know your data and your environment
Summary table:

| Test              | Naive quicksort | mmap + quicksort | External merge sort | Multithreaded external merge sort | Counting sort |
|-------------------|-----------------|------------------|---------------------|-----------------------------------|---------------|
| 4 MiB in 2 MiB    | Segfault        | N/A              | 0.38 s              | 0.41 s                            | 0.01 s        |
| 500 MB in 128 MiB | Segfault        | 32.25 s          | 87.14 s             | 91.04 s                           | 3.24 s        |
Know your data and design a simple algorithm to work with it!
References
- Sort a million 32-bit numbers in 2 MB of memory in Python
- External sorting of large data sets
- Effective sorting of data that exceeds the amount of memory
- Cracking the oyster (the first column from Programming Pearls)
- Multithreaded file input and output
- Improving Algorithms with Performance Measurement, Part 2
- Parallel Sorting Algorithms