How I accelerated strstr

I recently needed to write an analog of the strstr function (search for a substring in a string). I decided to speed it up. The result is an algorithm. I did not find it on the first links in the search engine, but there are a bunch of other algorithms, so I wrote it.


Graph comparing the speed of my algorithm, with the strstr function on 600 kb text of a Russian-language book, when searching for strings of 1 to 255 bytes in size:


image


The idea underlying the algorithm is as follows. First notation:


S - The line in which the search will be performed will be called
P - the line we are looking for will be called
sl - the length of the string S
pl - the length of the string P
PInd - the label compiled in the first step.


So the idea is this: If in line S we select elements with indices equal: pl 1, pl 2, ..., pl J, ..., pl (sl / pl), then if in the segment pl (J-1) .. .pl (J + 1) includes the string P, then the selected item must belong to string P. Otherwise, the string would not fit in length.


The picture, where it is drawn to where P reaches, is in pl * 3 .:


image


And yet, since the selected element is usually included in the string P a small number of times, you can check only these occurrences, and not the entire range.


Total algorithm is as follows:


Step 1. Sort P


For all elements (characters) of the string P, we sort the indices of these elements by the value of these elements. That is, we make it so that all the numbers of all elements equal to some value can be quickly obtained. This label will be called PInd:


image


As can be seen from this plate, when searching in step 2 for the desired string P equal to "sort indexes", we need to check a maximum of 2 substrings in S.


At first, I used quick sorting and some other to compile this plate, and then I realized that since the elements of the string are single-byte, you can use bitwise.


Stage 2. Search


We go along the line line S with jumps equal to pl, choosing the corresponding elements. For each selected item, check to see if it is on line P.


If it is, then for all indices from PInd of the corresponding element, we check whether substring S coincides with the beginning at the corresponding index from PInd by the offset relative to the selected element and the desired string P.


If it matches, then return the result; if not, continue.


The worst option for this algorithm depends on the way the strings are compared in the second step. If a simple comparison, then sl * pl + pl, if some else, then another. I used just comparison, as in regular strstr.


Due to the fact that the algorithm jumps through pl elements and checks only possible lines, it works faster.


The best option is when the algorithm skips the entire string S and does not find the text or finds it the first time, then it will spend about sl / pl.
The average speed I do not know how to calculate.


Here is one of my implementations of this algorithm, according to which a graph in C was built. Here pl is bounded, that is, the PInd table is constructed by the prefix P of length max_len, and not along the entire row. It was he who was used to build the chart:


Here is coded in si
char * my_strstr(const char * str1,  const char * str2,size_t slen){
    unsigned char max_len = 140;
    if ( !*str2 )
        return((char *)str1);
    //Очистка массивов
    unsigned char index_header_first[256];
    unsigned char index_header_end[256];
    unsigned char last_char[256];
    unsigned char sorted_index[256];
    memset(index_header_first,0x00,sizeof(unsigned char)*256);
    memset(index_header_end,0x00,sizeof(unsigned char)*256);
    memset(last_char,0x00,sizeof(unsigned char)*256);
    memset(sorted_index,0x00,sizeof(unsigned char)*256);
    //Этап 1.
    char *cp2 = (char*)str2;
    unsigned char v;
    unsigned int len =0;
    unsigned char current_ind = 1;
    while((v=*cp2) && (len slen){
        return NULL;
    }
    //Этап 2.
    unsigned char *s1, *s2;
    //Начинаем проверку с элемента S+pl-1
    unsigned char *cp = (unsigned char *) str1+(len-1);
    unsigned char *cp_end= cp+slen;
    while (cp=str1)
                {
                    //Сравниваем строки
                    s2 = (unsigned char*)str2;
                    while ( *s2 && !(*s1^*s2) )
                        s1++, s2++;
                    if (!*s2)
                        return (char*)(cp-pos_in_len);
                }
            }
            while( (ind = last_char[ind]) );
        }
        //Прыгаем вперёд на pl
        cp+=len;
    }
    return(NULL);
}

Update: This is really not a direct replacement for strstr, as it additionally requires a parameter - the length of the string S.


Also popular now: