Multi-pattern matching on a GPU: myth or reality?


A bit of a preamble


In the old days, when the grass was greener and the trees were taller, I firmly believed that scary words like thread divergence, cache misses, coalesced global memory accesses and so on made it impossible to implement multi-pattern search efficiently on a GPU. Years passed, the conviction remained, but at one point I came across the PFAC library. If you are curious what it is capable of, read on.


Why all this?


Why are we doing this and what do we want to get out of it? The interesting part is the performance of the PFAC library: I would like to see the high throughput promised by the GPU. For comparison, we will use the CPU and OpenMP implementations of the PFAC algorithm.

I assume that the reader has some idea of the CUDA architecture. If not, the basic ideas can be found in introductory articles, for example here and here.

About the library


The PFAC library was developed in 2012 and targeted the architectures and capabilities of the video cards of that time, so if you have a card with compute capability no higher than 3.0 and are willing to install Ubuntu 12.04 and CUDA 4.2, everything should work out of the box. However, we are not looking for easy ways: in this article we will work with Ubuntu 16.04 and CUDA 9.0. The test bench is a laptop with an Intel Core i7-7500U, a GeForce 940MX and 8 GB of RAM.

Before rushing into battle and counting gigabits, a few words about the algorithmic side, so we know what awaits us.

The library implements a parallel Aho-Corasick algorithm without failure transitions; PFAC is, accordingly, short for Parallel Failureless Aho-Corasick. The algorithm builds a prefix trie without failure links and packs it into a transition table.


As stated in the documentation, each thread processes the data starting from its own input character.

However, if you look at the source, each thread actually handles four starting characters, which has to do with memory alignment.
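To make the idea concrete, here is a minimal sketch of a failureless Aho-Corasick kernel in the simple one-thread-per-position form (ignoring the four-character trick). This is not PFAC's actual kernel; d_table, d_output, d_match and the flat [state][256] table layout are my own illustrative assumptions.

// Sketch: every thread starts matching at its own position and walks
// the transition table until it falls off the trie. Illustrative only.
__global__ void failureless_ac(const char *text, int n,
                               const int  *d_table,  // transitions: [state][256], -1 = none
                               const int  *d_output, // pattern id ending in a state, 0 = none
                               int        *d_match)  // per-position match result
{
    int start = blockIdx.x * blockDim.x + threadIdx.x;
    if (start >= n) return;
    int state = 0;                                   // root of the trie
    for (int pos = start; pos < n; ++pos) {
        state = d_table[state * 256 + (unsigned char)text[pos]];
        if (state < 0) break;                        // fell off the trie: this thread is done
        if (d_output[state] > 0)                     // some pattern ends in this state
            d_match[start] = d_output[state];
    }
}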

All the scary words I opened the article with are present in this design. No one guarantees that all threads within one warp (32 consecutive threads within one block) will access table entries that lie close together in memory. Nor is there any guarantee that all threads within a warp will take the same side of a branch or exit the trie-traversal loop at the same time.

But I propose not to dwell on the bad and, with burning eyes and an optimistic attitude, to start the computational experiment.

Deployment


First, let's try to deploy PFAC. I assume that CUDA is already installed and that the paths to nvcc and its libraries have been set up.

We go into the project and run make. Expectation: everything builds fine. Harsh reality: some strange errors. It turns out that CUDA 9.0 has dropped support for building for video cards with compute capability below 3.0. This can be worked around; we only need to change src/Makefile.

Makefile example for compute capability 5.0.
#
# Makefile in directory src
#
# resource usage:
#
# To compile a dynamic module
# (1) nvcc cannot accept -fPIC, so compile .cu to .cu.cpp first
#     nvcc -arch=sm_50 -cuda ../src/PFAC_kernel.cu
#
# (2) then use g++ to compile PFAC_notex_shared_reorder.cu.cpp
#     g++ -fPIC -c PFAC_kernel.cu.cpp
#
# (3) finally combine two object files to a .so library
#     g++ -shared -o libpfac.so  $(LIBS) PFAC_kernel.cu.o ...
#
#     $(LIBS) is necessary when compiling PFAC library on 32-bit machine
#
include ../common.mk
INC_DIR = ../include
LIB_DIR = ../lib
OBJ_DIR = ../obj
INCPATH += -I../include/
CU_SRC = PFAC_kernel.cu
CU_SRC += PFAC_reduce_kernel.cu
CU_SRC += PFAC_reduce_inplace_kernel.cu
CU_SRC += PFAC_kernel_spaceDriven.cu
CPP_SRC  = PFAC_reorder_Table.cpp
CPP_SRC += PFAC_CPU.cpp
CPP_SRC += PFAC_CPU_OMP.cpp
CPP_SRC += PFAC.cpp
inc_files = $(INC_DIR)/PFAC_P.h  $(INC_DIR)/PFAC.h
CU_OBJ = $(patsubst %.cu,%.o,$(CU_SRC))
CU_CPP = $(patsubst %.cu,%.cu.cpp,$(CU_SRC))
CPP_OBJ = $(patsubst %.cpp,%.o,$(CPP_SRC))
cppobj_loc = $(patsubst %.o,$(OBJ_DIR)/%.o,$(CPP_OBJ))
cppobj_fpic_loc = $(patsubst %.o,$(OBJ_DIR)/%_fpic.o,$(CPP_OBJ))
cu_cpp_sm50_loc = $(patsubst %.cpp,$(OBJ_DIR)/sm50_%.cpp,$(CU_CPP))
cu_cpp_obj_sm50_loc = $(patsubst %.cpp,$(OBJ_DIR)/sm50_%.cpp.o,$(CU_CPP))
all: mk_libso_no50  mk_lib_fpic 
mk_libso_no50: $(cu_cpp_sm50_loc) 
	$(CXX) -shared -o $(LIB_DIR)/libpfac_sm50.so $(LIBS) $(cu_cpp_obj_sm50_loc)
mk_liba: $(cppobj_loc)
	ar cru $(LIB_DIR)/libpfac.a  $(cppobj_loc)
	ranlib $(LIB_DIR)/libpfac.a
mk_lib_fpic: $(cppobj_fpic_loc)
	$(CXX) -shared -o $(LIB_DIR)/libpfac.so  $(cppobj_fpic_loc) $(LIBS)
$(OBJ_DIR)/%_fpic.o: %.cpp  $(inc_files)
	$(CXX) -fPIC -c $(CXXFLAGS) $(INCPATH) -o $@ $<
$(OBJ_DIR)/PFAC_CPU_OMP_reorder_fpic.o: PFAC_CPU_OMP_reorder.cpp  $(inc_files)
	$(CXX) -fPIC -c $(CXXFLAGS) $(INCPATH) -o $@ $<
$(OBJ_DIR)/PFAC_CPU_OMP_reorder.o: PFAC_CPU_OMP_reorder.cpp  $(inc_files)
	$(CXX) -c $(CXXFLAGS) $(INCPATH) -o $@ $<
$(OBJ_DIR)/%.o: %.cpp  $(inc_files)
	$(CXX) -c $(CXXFLAGS) $(INCPATH) -o $@ $<
$(OBJ_DIR)/sm50_%.cu.cpp: %.cu
	$(NVCC) -arch=sm_50 -cuda $(INCPATH) -o $@ $<
	$(CXX) -fPIC -O2 -c -o $@.o $@
#clean :
#	rm -f *.linkinfo
#	rm -f $(OBJ_DIR)/*
#	rm -f $(EXE_DIR)/*
####### Implicit rules
.SUFFIXES: .o .c .cpp .cc .cxx .C
.cpp.o:
	$(CXX) -c $(CXXFLAGS) $(INCPATH) -o "$@" "$<"
.cc.o:
	$(CXX) -c $(CXXFLAGS) $(INCPATH) -o "$@" "$<"
.cxx.o:
	$(CXX) -c $(CXXFLAGS) $(INCPATH) -o "$@" "$<"
.C.o:
	$(CXX) -c $(CXXFLAGS) $(INCPATH) -o "$@" "$<"
.c.o:
	$(CC) -c $(CFLAGS) $(INCPATH) -o "$@" "$<"
####### Build rules


We rebuild.

Phew! Everything seems to be fine. We add the path to the lib folder to LD_LIBRARY_PATH and try to run the test program simple_example. An error occurs:

simple_example.cpp:64: int main(int, char**): Assertion `PFAC_STATUS_SUCCESS == PFAC_status' failed.

If we print PFAC_status, we get status 10008, which corresponds to PFAC_STATUS_ARCH_MISMATCH: for some reason the architecture is not supported.
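For reference, a minimal way to print it, using the PFAC_getErrorString helper that simple_example itself uses further below (which call fails exactly is my assumption):

PFAC_status = PFAC_create( &handle ) ;
printf("PFAC_create: %d (%s)\n", (int)PFAC_status, PFAC_getErrorString(PFAC_status));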
We dive deeper into the code. In the file src/PFAC.cpp we find the fragment that generates this status.

The ifs that will never fire
    int device_no = 10*deviceProp.major + deviceProp.minor ;
    if ( 30 == device_no ){
        strcpy (modulepath, "libpfac_sm30.so");    
    }else if ( 21 == device_no ){
        strcpy (modulepath, "libpfac_sm21.so");    
    }else if ( 20 == device_no ){
        strcpy (modulepath, "libpfac_sm20.so");
    }else if ( 13 == device_no ){
        strcpy (modulepath, "libpfac_sm13.so");
    }else if ( 12 == device_no ){
        strcpy (modulepath, "libpfac_sm12.so");
    }else if ( 11 == device_no ){
        strcpy (modulepath, "libpfac_sm11.so");
    }else{
        return PFAC_STATUS_ARCH_MISMATCH ;
    }


Judging by the comments next to it, the intent was to reject only compute capability 1.0. We replace this fragment with

   int device_no = 10*deviceProp.major + deviceProp.minor ;
   if ( 11 > device_no )
        return PFAC_STATUS_ARCH_MISMATCH ;
   sprintf(modulepath, "libpfac_sm%d.so",  device_no);

Now everything seems fine: no other obvious dependencies on the card's compute capability remain.

We try to run the test example again and observe a strange error:

Error: fails to PFAC_matchFromHost, PFAC_STATUS_CUDA_ALLOC_FAILED: allocation fails on device memory.

Fortunately, there is a way to enable debug output in the library: in the file include/PFAC_P.h, uncomment the line

#define PFAC_PRINTF( ... ) printf( __VA_ARGS__ )

and, accordingly, comment out

//#define PFAC_PRINTF(...)

Now, after running simple_example, we get a more intelligible error message:

Error: cannot bind texture, 11264 bytes invalid texture reference
Error: fails to PFAC_matchFromHost, PFAC_STATUS_CUDA_ALLOC_FAILED: allocation fails on device memory

Aha, so the problem is not memory allocation but the use of texture memory. PFAC allows texture memory to be disabled; let's take advantage of that. In the file test/simple_example.cpp, right after the handle is created, add the line:

PFAC_setTextureMode(handle, PFAC_TEXTURE_OFF ) ;

A miracle happened: the console printed the long-awaited lines.

Something was found
At position 0, match pattern 1
At position 1, match pattern 3
At position 2, match pattern 4
At position 4, match pattern 4
At position 6, match pattern 2


However, the ancient manuscripts claim that texture memory improves performance thanks to the texture cache. Well then, let's try to get PFAC running with texture memory enabled.

PFAC supports two modes of operation, PFAC_TIME_DRIVEN and PFAC_SPACE_DRIVEN; we are interested in the first. Its implementation lives in src/PFAC_kernel.cu.

We find the lines where the texture reference is bound to global memory:

cuda_status = cudaBindTexture( &offset, (const struct textureReference*) texRefTable,
            (const void*) handle->d_PFAC_table, 
            (const struct cudaChannelFormatDesc*) &channelDesc,
            handle->sizeOfTableInBytes ) ;

For video cards with compute capability 5.0 and later, the texture-memory machinery changed slightly and the access mode set here is no longer correct, so we replace this call with

cuda_status = cudaBindTexture( &offset, tex_PFAC_table,
            (const void*) handle->d_PFAC_table, handle->sizeOfTableInBytes ) ;

After that, in the file test/simple_example.cpp, we turn texture memory back on:

PFAC_setTextureMode(handle, PFAC_TEXTURE_ON ) ;

We check: everything works.
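As an aside, texture references like tex_PFAC_table are deprecated in modern CUDA altogether; the replacement is texture objects. This is not what PFAC does, just a minimal sketch of binding a linear buffer through the newer API (d_table and sizeInBytes are my own placeholder names):

// Sketch: bind a linear device buffer as a texture object.
cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = d_table;                      // device pointer (assumed)
resDesc.res.linear.desc = cudaCreateChannelDesc<int>();   // one int per element
resDesc.res.linear.sizeInBytes = sizeInBytes;             // table size in bytes (assumed)

cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
// ... pass `tex` to the kernel and read the table with tex1Dfetch<int>(tex, i) ...
cudaDestroyTextureObject(tex);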

Testing and Comparison


Now let's look at the performance of the library.

For testing, we will use the signatures provided by ClamAV. Download the main.cvd file and unpack it (a .cvd container is a tar.gz archive preceded by a 512-byte text header, hence skip=1):

dd if=main.cvd of=main.tar.gz bs=512 skip=1
tar xzvf main.tar.gz 

We will test on the signatures from the main.mdb file. Then comes a pile of routine work: take the first n lines as patterns, shuffle the input data, and so on. I don't think the details are worth dwelling on; a rough sketch follows below.
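Just to give the flavor, something along these lines (the file names and the exact recipe are mine):

# the first n lines become the patterns (n = 1000 here)
head -n 1000 main.mdb > patterns.txt
# the rest, shuffled, serves as input data
tail -n +1001 main.mdb | shuf > input.txt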

To be able to play around with the performance, I slightly modified the file test/simple_example.cpp.

It now looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <chrono>
#include <iostream>
#include "PFAC.h"
int main(int argc, char **argv)
{
    if(argc < 2){
        printf("args input file, input pattern\n" );
        return 0;
    }
    char dumpTableFile[] = "table.txt" ;      
    char *inputFile = argv[1]; //"../test/data/example_input" ;
    char *patternFile = argv[2];//"../test/pattern/example_pattern" ;
    PFAC_handle_t handle ;
    PFAC_status_t PFAC_status ;
    int input_size ;    
    char *h_inputString = NULL ;
    int  *h_matched_result = NULL ;
    // step 1: create PFAC handle 
    PFAC_status = PFAC_create( &handle ) ;
    PFAC_status = PFAC_setTextureMode(handle, PFAC_TEXTURE_OFF);
    printf("%d\n", PFAC_status);
    assert( PFAC_STATUS_SUCCESS == PFAC_status );
    // step 2: read patterns and dump transition table 
    PFAC_status = PFAC_readPatternFromFile( handle, patternFile) ;
    if ( PFAC_STATUS_SUCCESS != PFAC_status ){
        printf("Error: fails to read pattern from file, %s\n", PFAC_getErrorString(PFAC_status) );
        exit(1) ;   
    }
    // dump transition table 
    FILE *table_fp = fopen( dumpTableFile, "w") ;
    assert( NULL != table_fp ) ;
    PFAC_status = PFAC_dumpTransitionTable( handle, table_fp );
    fclose( table_fp ) ;
    if ( PFAC_STATUS_SUCCESS != PFAC_status ){
        printf("Error: fails to dump transition table, %s\n", PFAC_getErrorString(PFAC_status) );
        exit(1) ;   
    }
    //step 3: prepare input stream
    FILE* fpin = fopen( inputFile, "rb");
    assert ( NULL != fpin ) ;
    // obtain file size
    fseek (fpin , 0 , SEEK_END);
    input_size = ftell (fpin);
    rewind (fpin);
    // allocate memory to contain the whole file
    h_inputString = (char *) malloc (sizeof(char)*input_size);
    assert( NULL != h_inputString );
    h_matched_result = (int *) malloc (sizeof(int)*input_size);
    assert( NULL != h_matched_result );
    memset( h_matched_result, 0, sizeof(int)*input_size ) ;
    // copy the file into the buffer
    input_size = fread (h_inputString, 1, input_size, fpin);
    fclose(fpin);    
    auto started = std::chrono::high_resolution_clock::now();
    // step 4: run PFAC on GPU           
    PFAC_status = PFAC_matchFromHost( handle, h_inputString, input_size, h_matched_result ) ;
    if ( PFAC_STATUS_SUCCESS != PFAC_status ){
        printf("Error: fails to PFAC_matchFromHost, %s\n", PFAC_getErrorString(PFAC_status) );
        exit(1) ;   
    }     
    auto done = std::chrono::high_resolution_clock::now();
    std::cout << "gpu_time: " << std::chrono::duration_cast(done-started).count()<< std::endl;
    memset( h_matched_result, 0, sizeof(int)*input_size ) ;
    PFAC_setPlatform(handle, PFAC_PLATFORM_CPU);
    started = std::chrono::high_resolution_clock::now();
    // step 5: run PFAC on CPU
    PFAC_status = PFAC_matchFromHost( handle, h_inputString, input_size, h_matched_result ) ;
    if ( PFAC_STATUS_SUCCESS != PFAC_status ){
        printf("Error: fails to PFAC_matchFromHost, %s\n", PFAC_getErrorString(PFAC_status) );
        exit(1) ;   
    }     
    done = std::chrono::high_resolution_clock::now();
    std::cout << "cpu_time: " << std::chrono::duration_cast(done-started).count()<< std::endl;
    memset( h_matched_result, 0, sizeof(int)*input_size ) ;
    PFAC_setPlatform(handle, PFAC_PLATFORM_CPU_OMP);
    started = std::chrono::high_resolution_clock::now();
    // step 6: run PFAC with OpenMP
    PFAC_status = PFAC_matchFromHost( handle, h_inputString, input_size, h_matched_result ) ;
    if ( PFAC_STATUS_SUCCESS != PFAC_status ){
        printf("Error: fails to PFAC_matchFromHost, %s\n", PFAC_getErrorString(PFAC_status) );
        exit(1) ;   
    }     
    done = std::chrono::high_resolution_clock::now();
    std::cout << "omp_time: " << std::chrono::duration_cast(done-started).count() << std::endl;
    PFAC_status = PFAC_destroy( handle ) ;
    assert( PFAC_STATUS_SUCCESS == PFAC_status );
    free(h_inputString);
    free(h_matched_result); 
    return 0;
}


Since I use std::chrono for the time measurements, the -std=c++0x flag must be added in test/Makefile (a sketch follows after the next command). And do not forget to set the number of threads for OpenMP (I used 4 at home):

export OMP_NUM_THREADS=4
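The Makefile change itself is a one-liner; a minimal sketch, assuming test/Makefile uses a CXXFLAGS variable the way src/Makefile does:

# in test/Makefile
CXXFLAGS += -std=c++0x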

Now we can start measuring performance. Let's see what the GPU can show. The length of each pattern is 32 bytes; the input size is 128 MB.
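Since we set out to count gigabits: given the input size in bytes and the times printed above in milliseconds, the throughput follows from a one-liner (my own helper, not part of PFAC):

// My own helper: measured time -> throughput in Gbit/s.
double gbit_per_s(size_t bytes, double ms)
{
    return (bytes * 8.0) / (ms * 1e6);   // bits / (1e6 * ms) == Gbit/s
}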

We get the following graph:

[graph: search throughput versus the number of patterns]

Performance degrades as the number of patterns grows. This is not surprising: the running time of failureless Aho-Corasick depends not only on the input but also on the patterns, since at every position in the text a thread must descend into the trie as far as the longest match between the text and some pattern prefix, and with more patterns the probability of matching some prefix grows.

And now the more interesting graph: the speedup from using the GPU.

[graph: GPU speedup versus the number of patterns]

With a small number of patterns the speedup is almost imperceptible: data transfer eats it up. But as the pattern count grows, so does the speedup, which points to the GPU's potential in a task that is unusual for it.

In lieu of a conclusion


When I took up this task I was skeptical: surely such a thing could not work. The result, however, gives hope that CUDA can be used in a domain that is unusual for it.
The researcher's path does not end here; a lot of work lies ahead, with plenty of performance studies and comparisons against competitors.

The first step has been taken; don't lose hope!
