saul May 7, 2015 at 09:06

High-performance DEFLATE compression optimized for genomic data sets

Original author: Jim Guilford, Vinodh Gopal, Sean Gulley, Wajdi Feghali

Transfer

igzip is a high-performance library for performing gzip or DEFLATE compression. It was originally described in the article DEFLATE High Performance Compression for Intel Processors . This article describes a related release of source code that contains optional (at build time) optimizations to increase the compression ratio of genomic datasets in BAM and SAM formats. igzip runs about 4 times faster than Zlib when configured for maximum speed, and with approximately the same compression ratio for genomic data. We believe that igzip can be similarly optimized for other applications where data sets are different from regular text data.

Overview

Two main versions of igzip can be built: igzip0c and igzip1c. The first is faster, and the second is slightly slower, but with a higher compression ratio. Both of these versions are significantly faster than the standard gzip -1 or Zlib -1, but differ in a slightly lower compression ratio. These implementations use the SSE 4.2 instruction set. They are intended for Intel processors that support this instruction set.

In the article mentioned above, we described the compression ratio and performance of igzip when working on general-purpose data. After that, we expanded the implementation for efficient processing of genomic data. In the computational process used to determine the genomic sequence, data compression is a serious performance limitation. There are usage scenarios with long-term storage when a very high compression ratio is required, especially considering the significant size of the genomic data sets. Nevertheless, at various stages of determining the genomic sequence, there are operations that require very fast compression with a moderate degree of compression. In this study, we focused on two main data formats used to determine genomic sequence; these are SAM and BAM formats.

An analysis of these formats showed that the content of the data is very specific, found only in genomic data and is significantly different from ordinary data (for example, from data sets in the University of Calgary collection). We have identified the following basic properties of genomic data: a very small length of matches in LZ77 processing (~ 90% of matches less than 8 bytes long), very small offsets (in ~ 90% of matches, a shift of less than 1024 bytes) and a very diverse distribution of literals (for example, in the format SAM is a significant number of characters A, T, G, C from processed string sequences). We changed the Huffman tables in igzip to take these features into account, and achieved about the same compression ratio as Zlib-1, but 4 times faster.

Programming interface

API:

struct LZ_Stream2 {
    UINT8 *next_in;   // Next input byte
    UINT32 avail_in;  // number of bytes available at next_in
    UINT32 total_in;  // total number of bytes read so far
    UINT8 *next_out;  // Next output byte
    UINT32 avail_out; // number of bytes available at next_out
    UINT32 total_out; // total number of bytes written so far
    UINT32 end_of_stream; // non-zero if this is the last input buffer
    LZ_State2 internal_state;
  };
  void init_stream(LZ_Stream2 *stream);
  void fast_lz(LZ_Stream2 *stream);

The basic principle is that next_in points to the input buffer, and avail_in indicates the length of this buffer. Similarly, next_out indicates an empty output buffer, avail_out indicates the size of this buffer.

The fields total_in and total_out start at 0 and are updated by the fast_lz () function . They correspond to the total number of bytes read or written at the current moment in case this information is needed by the calling application.

The init_stream () function statically initializes the stream data structure, and in particular the internal state.
Call fast_lz ()will take data from the input buffer (updating next_in and avail_in ) and write the compressed stream to the output buffer (updating next_out and avail_out ). The function returns when either avail_in or avail_out returns zero (i.e., when the input data runs out or when the output buffer is full).

After transmitting the last input buffer, the end_of_stream flag should be set . The procedure will process the bitstream to the end of the input buffer if there is enough space in the output buffer.
Simple example

LZ_Stream2 stream;
    UINT8 inbuf[LEN_IN], outbuf[LEN_OUT];
    init_stream(&stream);
    stream.end_of_stream = 0;
    do {
        stream.avail_in = (UINT32) fread(inbuf, 1, LEN_IN, in);
        stream.end_of_stream = feof(in);
        stream.next_in = inbuf;
        do {
            stream.avail_out = LEN_OUT;
            stream.next_out = outbuf;
            fast_lz(&stream);
            fwrite(outbuf, 1, LEN_OUT - stream.avail_out, out);
        } while (stream.avail_out == 0);
        assert(stream.avail_in == 0);
    } while (! stream.end_of_stream);

The Zlib operation FLUSH_SYNC is not currently supported.

Source code versions

The options in the options.inc file determine which version will be built. This determines the number of characters in the YASM syntax. The Perl script converts this file to options.h for use in C code. You should not manually edit the options.h file. (If Perl is not available, you can manually edit options.h, but then make sure that the options in the options.h and options.inc files match.)

As mentioned above, the two main versions are 0c and 1c. The version is selected by editing the options.inc file, where one of the following lines can be.

%define MAJOR_VERSION IGZIP0C

%define MAJOR_VERSION IGZIP1C

(IGZIP0 and IGZIP1 are not supported in this release.)

In any of these versions, you can select one of three Huffman tables. The default table is for use with most files. There are two other tables optimized for use with genomic applications, in particular for SAM and BAM files. To select one of these tables, uncomment the corresponding lines (removing the starting semicolon).

;%define GENOME_SAM
;%define GENOME_BAM

If both lines are commented out (as shown above), the default tables are used. You should not uncomment both of these lines at the same time.

By default, igzip creates a GZIP file (RFC 1952). If the line

;%define ONLY_DEFLATE

If it is uncommented, then the output data of the DEFLATE format will be generated, i.e., without GZIP headers.

Figure 1. Supported (tested) versions in this release.

Code can be built on Linux or Windows. There is a makefile for Linux. The code combines C and assembler. To assemble assembler code, use the YASM assembler. The path to YASM in the Makefile can be changed according to where YASM is installed in the user environment.
You can compile the code (under Linux) in the form of a static library, for example, make lib0c or make lib1c, or in the form of a common object, for example make slib0c or make slib1c.

results

This igzip release supports not only two major versions (igzip0c and igzip1c), but also additional optimization for genomic data sets, in particular for BAM and SAM files. These files are quite different from regular files, and the prevalence of statistics for each type of these files means that you can optimize Huffman tables for these statistics and achieve a higher compression ratio.

To illustrate, three test files were compressed with different configurations. One of the test files was progl from the collection of the University of Calgary (the so-called Calgary Corps). It corresponds to a regular file. The other two files are arbitrary examples of genomic files in the BAM and SAM formats. In each test, the compression ratio was calculated as the quotient of dividing the size of the compressed file by the size of the uncompressed file (the smaller the better). For brevity, only the compression ratio for the “preferred” version of igzip is shown (that is, when igzip0c-bam is started for a BAM file). As expected, the preferred version provides a higher degree of compression than other versions (for example, when running igzip0c for a SAM file instead of igzip0c-sam). In addition, for simplicity, we show igzip performance only for preferred data types.

Note. Tests and performance tests are performed on specific computer systems and components and are consistent with the approximate performance of Intel products in these tests. Any differences in system hardware and in software or configuration may affect actual performance. When choosing purchased products, you should look for other information and performance tests. For more information on performance tests and the performance of Intel products, see here .

The performance results shown in this section were obtained when measured on an Intel Core i7 4770 processor (Haswell). Tests were conducted with Intel Turbo Boost Technology disabled and represent performance without using Intel Hyper-Threading Technology (Intel HT) on a single core. We offer results for in-memory data, excluding file I / O.

We compiled Zlib implementations using the GCC compiler with aggressive optimization options. For reference, we used Zlib version 1.2.8.

Temporal indicators were measured using the rdtsc () functionwhich returns the processor time stamp counter (TSC). TSC is the number of clock cycles since the last reset. TSC_initial - TSC value written before the function call. After the function finishes, rdtsc () is called again to write a new TSC_final loop counter. The actual number of cycles for the called procedure is calculated using the expression:
number of cycles = (TSC_final - TSC_initial) .

A significant number of such measurements were carried out for each file, followed by averaging of their results.

Figure 2. The difference in speed and compression ratio of igzip0c * compared to gzip / Zlib -1

Table 1. Compression performance (cycles per byte) and compression ratio

The table shows the compression ratio and performance for the “preferred” version of igzip for the corresponding file type. It can be seen that, in particular, for the BAM and SAM test files, the igzip compression ratio is very close to zlib -1. For SAM files, the compression ratio and speed turned out to be the same as for some ordinary files from the University of Calgary data set, while for the BAM format both the compression ratio and speed are lower than for SAM (this applies to both igzip and Zlib). In the BAM format, a significant part of these fields is already packed in binary format, which is why the compression using the LZ77 algorithms is negligible. The results show that igzip is about 4 times faster than Zlib in speed (when configured for maximum speed in both cases).

Conclusion

igzip is a library for high speed compression using the DEFLATE / gzip algorithm. Its various configurations are possible with different ratios of compression ratio and speed. In addition, there are additional optimization options for the genomic BAM and SAM files. During optimization, the compression ratio is close to gzip -1, but the speed is much higher. Having studied the SAM and BAM formats, we believe that it is possible to analyze various data sets from other applications, which can be quite different from standard text data, and develop individually selected igzip implementations with similar results.

For more information on compiler optimization, see our optimization notice .

Download igzip source code.

Tags: