Memory Allocation / Release Speed in C#
I ran into the following problem. I need to process a data file. The file is divided into sections about 1 MB long, each containing roughly 100,000 records in packed form. The number of records varies from section to section and is stored in each section's header. During processing a section is unpacked, and each record turns into 20 integers. The processing needs the current unpacked section plus several previous ones (around 5-10, but possibly more; how many is not known in advance). The question is how to allocate memory for the unpacked sections.
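To make the data shape concrete, here is roughly what one section looks like; the names and the header layout below are my own illustration, not the actual file format:

```csharp
// Illustrative sketch of the data layout (names are assumptions, not the real format).
class PackedSection
{
    public int RecordCount;     // taken from the section header, ~100,000
    public byte[] PackedData;   // ~1 MB of packed records
}

static class Unpacker
{
    public const int IntsPerRecord = 20;

    // One unpacked section is RecordCount * 20 ints, i.e. about 8 MB
    // for 100,000 records. The question is where this array comes from.
    public static int[] Unpack(PackedSection section)
    {
        var result = new int[section.RecordCount * IntsPerRecord];
        // ... decode section.PackedData into result ...
        return result;
    }
}
```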
The project in which this has to be solved is written in C# under VS 2008 (mixing in code from other languages is strongly discouraged), and the main system the finished program will run on is 64-bit Windows 7 (at least for now). And, as usual, processing needs to be as fast as possible.
The first question is whether I need to organize a pool of arrays for unpacking, or whether I can simply allocate a new array for each section. The second question is what the structure of that array should be: is it better to work with flat arrays 8 MB long, or to split the storage into smaller pieces and organize, say, an array of arrays? And in the second case, how long should those pieces be?
I took a few kinds of test objects:
- A jagged array int[][] of size M x N
- An array int[] of length N
- A home-made singly linked list of N elements:
  class list { public list next; public int val; }
- A standard List<int> of N elements
The numbers M and N for the two-dimensional case were chosen so that M * N = 40,000,000, which corresponds to the memory needed for 20 unpacked sections.
For each object I measured the average time for create + fill + read (after which the object was dropped), and, as a control, the time for fill + read alone (the object was created only once). Time was measured in nanoseconds per processed element. Each measurement was done twice: running on one processor core and running on 4 cores in parallel (in the second case the measured time was not multiplied by 4, so the result should normally be smaller than in the single-core case).
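A simplified sketch of such a measurement for the jagged-array case could look like this (an illustration of the method, not the exact code behind the table; plain threads are used for the four-core run, since .NET 3.5 has no Parallel.For):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class Bench
{
    // Measure create + fill + read for a jagged array int[m][n]
    // and report nanoseconds per element.
    static double MeasureJagged(int m, int n)
    {
        long sum = 0;
        var sw = Stopwatch.StartNew();
        var a = new int[m][];
        for (int i = 0; i < m; i++)
        {
            a[i] = new int[n];                              // create
            for (int j = 0; j < n; j++) a[i][j] = j;        // fill
        }
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) sum += a[i][j];     // read
        sw.Stop();
        if (sum == long.MinValue) Console.WriteLine();      // keep 'sum' alive
        return sw.Elapsed.TotalMilliseconds * 1e6 / ((double)m * n);
    }

    static void Main()
    {
        // One core:
        Console.WriteLine("1 core: {0:F2} ns/elem", MeasureJagged(1000, 40000));

        // Four cores: the same loop in four threads; the reported per-element
        // time is taken as-is, without multiplying by 4.
        var threads = new Thread[4];
        for (int t = 0; t < 4; t++)
        {
            threads[t] = new Thread(() =>
                Console.WriteLine("4 cores: {0:F2} ns/elem", MeasureJagged(1000, 40000)));
            threads[t].Start();
        }
        foreach (var th in threads) th.Join();
    }
}
```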
The results look like this:
| M x N | 8000x5000 | 2000x20000 | 1000x40000 | 100x400000 | 10x4000000 | 1x40000000 |
|---|---|---|---|---|---|---|
| int[][], full | 8.34 / 7.30 | 8.34 / 7.02 | 4.08 / 2.69 | 3.76 / 2.55 | 3.62 / 2.58 | 3.63 / 2.78 |
| int[][], R+W | 2.57 / 1.60 | 2.64 / 1.60 | 2.22 / 1.04 | 2.20 / 1.00 | 2.18 / 1.00 | 2.09 / 1.03 |
| int[], full | 1.94 / 1.04 | 1.85 / 0.96 | 3.4 / 1.58 | 3.44 / 2.69 | 3.60 / 3.63 | 3.60 / 2.78 |
| int[], R+W | 1.58 / 0.46 | 1.56 / 0.47 | 1.56 / 0.47 | 1.57 / 0.63 | 1.83 / 0.93 | 2.00 / 1.05 |
| list, full | 16.30 / 9.14 | 19.16 / 19.00 | 21.69 / 35.17 | 53.8 / 85.65 | 145 / 130 | |
| list, read | 2.32 / 0.60 | 2.29 / 0.61 | 2.31 / 1.12 | 6.4 / 2.58 | 7.2 / 3.67 | |
| List<int>, full | 8.95 / 4.21 | 11.06 / 4.74 | 11.98 / 5.03 | 11.85 / 6.38 | 11.85 / 6.98 | 13.71 / 8.10 |
| List<int>, R+W | 2.95 / 0.88 | 2.96 / 0.92 | 2.96 / 0.92 | 2.96 / 0.92 | 3.13 / 1.05 | 4.13 / 1.65 |
Each cell contains two times, in nanoseconds per element: for one core and for four cores.
What can we take away from this table? First, it turns out that the time to allocate memory depends linearly on the length of the array: a single flat array 160 MB long takes about 100 times longer to allocate than a 1.6 MB one. Secondly, if we only need to allocate one array for a short time, short arrays have the advantage: their allocation costs about 0.3 ns per word, versus about 1.8 ns per word for long ones (the difference between the "int[], full" and "int[], R+W" rows). This agrees with the often-cited fact that objects smaller than 85,000 bytes (roughly 85 KB, the Large Object Heap threshold) are allocated from a separate, faster heap. But when there are many arrays, the picture flips: long arrays cost about 1.5 ns per word, while short ones cost about 5.8 ns per word, almost 4 times more! So if you need a multidimensional array for a short time, you should not make it a jagged array with short inner arrays; it is better to look for another option, for example to allocate a one-dimensional array and compute the indices yourself.
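For that last option, the indexing wrapper is trivial; a sketch (the class and member names here are just for illustration):

```csharp
// Sketch: one flat array per unpacked section, indexed as [record, field]
// instead of a jagged int[][]. The field count of 20 matches the record format.
class SectionBuffer
{
    private readonly int[] data;
    private readonly int fieldsPerRecord;

    public SectionBuffer(int recordCount, int fieldsPerRecord)
    {
        this.fieldsPerRecord = fieldsPerRecord;
        data = new int[recordCount * fieldsPerRecord];   // one large allocation
    }

    // Computed index instead of a second dereference through an inner array.
    public int this[int record, int field]
    {
        get { return data[record * fieldsPerRecord + field]; }
        set { data[record * fieldsPerRecord + field] = value; }
    }
}
```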
In addition, the runtime clearly did not like my home-made linked list at all: as its length approached a million elements, the time to create a single node grew roughly six-fold compared to short lists.
Apparently the optimum for my task is to allocate long arrays (one per unpacked section), if I decide to allocate arrays every time. For a file of 1600 sections (a typical size), the time lost would be about 1.5 ns * 2,000,000 words per section * 1600 sections, i.e. roughly 5 seconds. True, one of the processing variants currently takes only 11 seconds in total (without any extra allocations), so there is something to think about: other kinds of processing will be longer and more complicated. It may turn out that I should keep reusing memory wherever possible and not lean too heavily on dynamic allocation. But maybe not.
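If reuse does win out, even a very simple pool of section buffers would be enough; a sketch of the idea (illustrative, not code from the project):

```csharp
using System.Collections.Generic;

// Sketch of a trivial buffer pool for unpacked sections: instead of allocating
// a new ~8 MB int[] per section, used buffers are returned here and handed out
// again. Names are illustrative assumptions.
class SectionBufferPool
{
    private readonly Stack<int[]> free = new Stack<int[]>();
    private readonly int bufferLength;

    public SectionBufferPool(int bufferLength)
    {
        this.bufferLength = bufferLength;
    }

    public int[] Rent()
    {
        // Reuse a buffer if one is available, otherwise allocate a fresh one.
        return free.Count > 0 ? free.Pop() : new int[bufferLength];
    }

    public void Return(int[] buffer)
    {
        if (buffer.Length == bufferLength)
            free.Push(buffer);   // keep it for the next section
    }
}
```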