# Using GPGPU to Compress Data (Part I)

Hello, dear Habra community.

Many of you have probably already heard about GPGPU (general-purpose computing on graphics cards); at the moment there are many implementations of this programming technique. We will focus on two of them: the well-known CUDA from Nvidia, and the, I think, slightly less popular but also well-known OpenCL framework. There are already enough articles on Habr describing the basic principles of these technologies, so we will not dwell on them. In this article I want to share the results obtained when using GPGPU for data compression, in comparison with the CPU.

#### Background

I have always been interested in parallel computing, so at university I chose this direction for my research work. When it came time to pick a thesis topic, having read various articles about GPGPU and the performance gains it brings, I decided that the thesis would be related specifically to GPGPU. I decided to try it on data compression (one could argue this is not the best task for GPGPU); the static Huffman algorithm was chosen as the compression algorithm.

#### Standard Huffman algorithm for data compression

For those who are not familiar with the algorithm, I will give a small example of its operation.
Consider the algorithm using the following sequence of characters as an example:
aaaaaaaccaabccbbbb
Step 1:
Read the entire file and count how many times each character from the extended ASCII set occurs.
For our example, the frequency of occurrence of characters is equal to:
a - 9, b - 5, c - 4.
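Step 1 can be sketched in a few lines. The article's implementation is in C#; Python is used here purely for illustration:

```python
from collections import Counter

# The example sequence from the article, as raw bytes.
data = b"aaaaaaaccaabccbbbb"

# Step 1: count how many times each byte (extended-ASCII character) occurs.
freqs = Counter(data)

print({chr(ch): n for ch, n in freqs.items()})  # {'a': 9, 'c': 4, 'b': 5}
```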
Step 2:
After counting the frequency of each character, we build a binary tree; consider the construction algorithm using the example.
• Our sequence consists of three characters; arrange them in decreasing order of frequency: {(a, 9), (b, 5), (c, 4)}.
• On the next line, write the set obtained from the previous one as follows:
- instead of the last two elements with their frequencies, write a new element whose name is "Vertex #n" and whose frequency is the sum of the frequencies of the last two elements of the previous set;
- sort the resulting set in descending order.
{(a, 9), ("Vertex #1", 9)}.
• Repeat the previous step until the set consists of a single element, ("Vertex #last_number", sum_of_all_frequencies):
{("Vertex #2", 18)}

We got a binary tree where the root is the last vertex.
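The merging procedure (take the two rarest elements, replace them with a new vertex carrying their combined frequency, repeat until one element remains) is exactly what a min-heap does naturally. A minimal Python sketch, for illustration only:

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_tree(freqs):
    """Repeatedly merge the two lowest-frequency elements into a new vertex."""
    tick = count()  # tie-breaker so heapq never has to compare node tuples
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)   # the two rarest elements...
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (a, b)))  # ...become one vertex
    return heap[0]

total, _, root = build_huffman_tree(Counter("aaaaaaaccaabccbbbb"))
print(total)  # 18, the sum of all the frequencies (9 + 5 + 4)
```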
Step 3:
Using the resulting tree, we can encode the file. To do this, we assign new bit sequences to the characters: walking down the tree from the root, we append a 1 for each right branch and a 0 for each left one. In our example, the characters are encoded as follows:
a - 0, b - 10, c - 11;
note that before compression, according to the ASCII table, the characters looked like:
a - 01100001, b - 01100010, c - 01100011.
When encoding, we replace each character with its new bit sequence.
To be able to decode the file, we must also save the tree to the file. Decoding is carried out by walking the tree from the root to the leaves.
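The code assignment in step 3 can be checked with a short recursive walk over the example tree. An illustrative Python sketch, with the tree written out literally as nested tuples ('a' to the left of the root, 'b' and 'c' under the right child):

```python
# The example tree: the root (Vertex #2) has leaf 'a' on the left and
# Vertex #1 (with leaves 'b' and 'c') on the right.
root = ("a", ("b", "c"))

def assign_codes(node, prefix=""):
    """Walk down from the root, appending '0' for left and '1' for right."""
    if isinstance(node, str):        # a leaf: its code is the path taken
        return {node: prefix}
    left, right = node
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes

codes = assign_codes(root)
print(codes)  # {'a': '0', 'b': '10', 'c': '11'}

# Encoding replaces each character with its bit sequence:
data = "aaaaaaaccaabccbbbb"
encoded = "".join(codes[ch] for ch in data)
print(len(data) * 8, "->", len(encoded))  # 144 -> 27 bits
```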

#### GPGPU Compression Algorithm

The encoding program is logically divided into a preprocessor (steps 1-2 of the algorithm) and the actual encoding (step 3). Steps 1 and 3 of the algorithm were offloaded to the GPGPU.
The first two preprocessing stages are standard for an implementation of the Huffman method:
1. counting the frequencies of the various characters of the alphabet;
2. sorting the frequency array.

Then you need to build a Huffman tree (step 2). In our case, instead of the Huffman tree, an additional data structure is constructed that contains:
1. an array of character codes;
2. an array of character code lengths.

This data structure was chosen to replace the standard Huffman tree because neither of the technologies used allows a full-fledged tree to be implemented on the GPGPU.
Of all the preprocessing steps described above, only the counting of symbol frequencies depends on the size of the input data; the remaining stages are elementary.
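The tree replacement described above can be sketched as two flat arrays indexed by character code. Python is used for illustration, and the array names here are invented, not taken from the article's C# code:

```python
ALPHABET = 256  # extended ASCII

# code_bits[i] holds the Huffman code of character i packed into an int;
# code_lens[i] holds that code's length in bits (0 means the character is unused).
code_bits = [0] * ALPHABET
code_lens = [0] * ALPHABET

# Fill the arrays from the example codes (a = 0, b = 10, c = 11).
for ch, code in {"a": "0", "b": "10", "c": "11"}.items():
    code_bits[ord(ch)] = int(code, 2)
    code_lens[ord(ch)] = len(code)

# Encoding a character is now just two array lookups with no pointer
# chasing, a flat layout that maps naturally onto GPGPU memory.
print(code_bits[ord("b")], code_lens[ord("b")])  # 2 2
```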

#### Results

C# was chosen as the programming language. To compare efficiency, two versions of the program were written for the CPU: CPU (Async), a parallel version, and CPU (Sync), a serial one; and two versions for the GPGPU (OpenCL and CUDA).

For testing, the following equipment was used:
• processor (CPU): AMD PHENOM X4 965 Black Edition;
• graphics card: MSI GeForce GTS 450.

To measure the speed of the programs, two files were selected: one 200 MB in size (210,596,352 bytes), the other 603 MB (633,128,511 bytes). Time was measured with the Stopwatch class available in C#. The measurement resolution depends on the processor frequency; on the computer used it is theoretically 1/36902109016524975 ≈ 2.7e-17 s.
Measurements were made ten times for each version of the program and for each file. Figures 1-4 show graphs of the average times over these measurements.
Figure 1.
In this test, OpenCL and CUDA are eight and a half times faster than CPU (Sync), and two and a half times faster than CPU (Async).
Figure 2.
In this test, OpenCL and CUDA are four and a half times faster than CPU (Sync), and one and a half times faster than CPU (Async).
Figure 3.
In this test, OpenCL and CUDA are eight times faster than CPU (Sync), and two and a half times faster than CPU (Async).
Figure 4.
In this test, OpenCL and CUDA are 4.3 times faster than CPU (Sync), and 1.3 times faster than CPU (Async).

#### Conclusion

In these tests, CUDA and OpenCL proved their effectiveness in comparison with the CPU.
Based on the results, we can conclude that for tasks that do not involve a large amount of mathematical computation, the runtime of programs written with OpenCL is equal to that of programs written with CUDA. In the second part of the article, I will cover the technical aspects of the implementation.