An interesting bug in Lucene.Net


    Some programmers, when they hear about static analysis, say they do not need it: all of their code is covered by unit tests, and that is enough to catch every error. I came across an error that unit tests could find in theory, but unless you already know it is there, writing such a test is almost impossible.

    Introduction


    Lucene.Net is a popular full-text search library ported from Java to C#. Its source code is open and available on the project website: https://lucenenet.apache.org/.

    The project evolves slowly, does not contain much code, and is used by many other projects for full-text search, so our analyzer found only five suspicious fragments [1]. I did not expect more. But one of these warnings seemed especially interesting to me, and I decided to tell the readers of our blog about it.

    About the error found


    We have the V3035 diagnostic, which detects cases where =+ was mistakenly written instead of +=, so that the + becomes a unary plus. When I implemented it by analogy with the V588 diagnostic for C++, I wondered: is it really possible to make this mistake in C#? In C++ it is understandable: some people use plain text editors instead of an IDE, where it is easy to make a typo and not notice it. But when typing in Visual Studio, which automatically reformats the code as soon as you enter a semicolon, how could you miss it? It turns out that you can. I found exactly this error in Lucene.Net. It is all the more interesting because it is rather difficult to detect by any means other than static analysis. Consider the code:

    protected virtual void Substitute( StringBuilder buffer )
    {
        substCount = 0;
        for ( int c = 0; c < buffer.Length; c++ ) 
        {
            ....
            // Take care that at least one character
            // is left left side from the current one
            if ( c < buffer.Length - 1 ) 
            {
                // Masking several common character combinations
                // with an token
                if ( ( c < buffer.Length - 2 ) && buffer[c] == 's' &&
                    buffer[c + 1] == 'c' && buffer[c + 2] == 'h' )
                {
                    buffer[c] = '$';
                    buffer.Remove(c + 1, 2);
                    substCount =+ 2;
                }
                ....
                else if ( buffer[c] == 's' && buffer[c + 1] == 't' ) 
                {
                    buffer[c] = '!';
                    buffer.Remove(c + 1, 1);
                    substCount++;
                }
                ....
            }
        }
    }

    The GermanStemmer class strips the suffixes of German words to extract a common stem. It works as follows: first, the Substitute method replaces certain letter combinations with single characters so that they are not mistaken for a suffix. The replacements are 'sch' to '$', 'st' to '!', and so on (as can be seen in the code sample). The number of characters by which these replacements shorten the word is accumulated in the substCount variable. Next, the Strip method cuts off the suffixes, and finally the Resubstitute method performs the reverse replacement: '$' to 'sch', '!' to 'st'. For example, for the word kapitalistischen (capitalist), the stemmer works as follows: kapitalistischen => kapitali!i$en (Substitute) => kapitali!
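    The substitution and reverse substitution steps described above can be sketched roughly as follows. This is a minimal illustration, not the Lucene.Net implementation: the class name SubstitutionRoundTripSketch is hypothetical, only the 'sch' and 'st' rules mentioned in the text are modelled, and the Strip step is left out entirely.

```csharp
using System.Text;

// Hypothetical, simplified model of the two steps around Strip.
// Only the 'sch' -> '$' and 'st' -> '!' rules from the article are included.
public static class SubstitutionRoundTripSketch
{
    // Replaces 'sch' with '$' and 'st' with '!' in a single left-to-right pass.
    public static string Substitute(string word)
    {
        var buffer = new StringBuilder(word);
        for (int c = 0; c < buffer.Length; c++)
        {
            if (c < buffer.Length - 2 && buffer[c] == 's' &&
                buffer[c + 1] == 'c' && buffer[c + 2] == 'h')
            {
                buffer[c] = '$';
                buffer.Remove(c + 1, 2);   // drop the 'c' and 'h'
            }
            else if (c < buffer.Length - 1 &&
                     buffer[c] == 's' && buffer[c + 1] == 't')
            {
                buffer[c] = '!';
                buffer.Remove(c + 1, 1);   // drop the 't'
            }
        }
        return buffer.ToString();
    }

    // Performs the reverse replacement: '$' -> 'sch', '!' -> 'st'.
    public static string Resubstitute(string word)
    {
        var result = new StringBuilder();
        foreach (char ch in word)
        {
            if (ch == '$') result.Append("sch");
            else if (ch == '!') result.Append("st");
            else result.Append(ch);
        }
        return result.ToString();
    }
}
```

    With these two rules, Substitute("kapitalistischen") yields "kapitali!i$en", and Resubstitute applied to that result restores the original word.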

    Because of this typo, when 'sch' is replaced with '$', the substCount variable is assigned the value 2 instead of being increased by 2. This error is quite difficult to find by any means other than static analysis. There are developers who say: why do I need a static analyzer if I have unit tests? Well, to catch this error with tests, you would have to test Lucene.Net on German text using GermanStemmer. The tests would have to index a word that contains the combination 'sch' together with another letter combination that is also substituted, and that other combination must appear in the word before 'sch', so that substCount is non-zero by the time the expression substCount =+ 2 is executed. That is a rather nontrivial combination for a test.
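    To see why the choice of test word matters, here is a small sketch that runs the counting logic both ways. It is an assumption-laden simplification: the class SubstCountSketch and the useTypo flag are hypothetical, and only the 'sch' and 'st' branches from the sample above are reproduced.

```csharp
using System.Text;

// Hypothetical stand-in for the counting part of Substitute.
// Not the Lucene.Net code: only the 'sch' and 'st' branches are modelled.
public static class SubstCountSketch
{
    // Returns the final substCount for the given word.
    // useTypo == true  -> the 'sch' branch uses "=+ 2" (assigns +2).
    // useTypo == false -> the 'sch' branch uses "+= 2" (adds 2).
    public static int Count(string word, bool useTypo)
    {
        var buffer = new StringBuilder(word);
        int substCount = 0;
        for (int c = 0; c < buffer.Length; c++)
        {
            if (c < buffer.Length - 1)
            {
                if (c < buffer.Length - 2 && buffer[c] == 's' &&
                    buffer[c + 1] == 'c' && buffer[c + 2] == 'h')
                {
                    buffer[c] = '$';
                    buffer.Remove(c + 1, 2);
                    if (useTypo)
                        substCount =+ 2;   // unary plus: overwrites the counter with 2
                    else
                        substCount += 2;   // intended: adds 2 to the counter
                }
                else if (buffer[c] == 's' && buffer[c + 1] == 't')
                {
                    buffer[c] = '!';
                    buffer.Remove(c + 1, 1);
                    substCount++;
                }
            }
        }
        return substCount;
    }
}
```

    For "schule" both variants return 2, and for "schwester" (where 'sch' comes before 'st') both return 3, so tests built on such words pass despite the typo. Only a word like "kapitalistischen", where 'st' precedes 'sch', exposes the difference: the typo variant returns 2 while the intended variant returns 3.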

    Conclusion


    Unit tests and static analysis are not mutually exclusive; they are complementary software development techniques [2]. I suggest downloading the PVS-Studio static analyzer, checking your projects, and finding the errors that your unit tests did not detect.

    References


    1. Andrey Karpov. Why small programs have a low density of errors.
    2. Andrey Karpov. How static analysis complements TDD.


    If you want to share this article with an English-speaking audience, then please use the link to the translation: Ilya Ivanov. An unusual bug in Lucene.Net .
