A gut feeling confirmed by numbers
For a long time I have been bothered by articles on the Internet that try to judge the benefits of static code analyzers based on checking small projects.
Many of the articles I have read make the same assumption: if 2 errors were found in a project of N lines of code, then in a full-fledged project of N * 100 lines you will find only 200 errors. From this the authors conclude that static analysis is nice, but not that useful: too few errors, so it is better to invest in other defect-detection techniques.
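To make that assumption explicit, here is a minimal sketch of the linear extrapolation such articles perform. The project sizes are illustrative placeholders, not data from any particular article.

// A minimal sketch of the assumption criticized above: the same error
// density is assumed for the small test project and for the 100x larger one.
#include <iostream>

int main()
{
  const double errorsFound = 2.0;                // errors found in the small project
  const double smallSize   = 10000.0;            // N lines of code (illustrative)
  const double largeSize   = 100.0 * smallSize;  // the "full-fledged" project

  const double density   = errorsFound / smallSize; // errors per line, assumed constant
  const double predicted = density * largeSize;     // 2 * 100 = 200

  std::cout << "Predicted errors in the large project: " << predicted << "\n";
}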
There are two main reasons why people try an analyzer on small projects. First, a large production project is not that easy to check: you need to configure something, write something somewhere, exclude some libraries from the analysis, and so on. Naturally, nobody wants to do all of that; there is a desire to check something quickly without messing with the settings. Second, a huge project will produce a huge number of diagnostic messages, and nobody wants to spend a lot of time going through them either. It is much easier to take a smaller project.
As a result, a person does not touch the large project he actually works on, but takes something small instead: for example, an old course project or a small open-source project from GitHub.
He checks that project, linearly extrapolates how many errors he would find in his large project, and then writes an article about this research.
At first glance, such studies look correct and useful. But I was sure they were not.
The first drawback of all these studies is obvious: a working, already debugged version of a project is taken. Many of the errors that static analysis could have found quickly were instead found slowly and painfully, during testing or after user complaints. In other words, such studies forget that static analysis is a tool for constant use, not a one-off check. After all, programmers look at the warnings issued by the compiler regularly, not once a year.
The second flaw of such research is more subtle and more interesting. I had a clear feeling that small and large projects cannot be evaluated the same way. Suppose a student writes a good 1000-line term project in 5 days. I am sure that in 500 days he will not be able to write a good commercial application of 100,000 lines of code. The growth of complexity will get in the way: the larger the program becomes, the harder it is to add new functionality, the more testing it requires, and the more time goes into dealing with errors.
In short, I had a gut feeling, but I could not quite put it into words. Then one of my colleagues came to my aid: while reading Steve McConnell's book Code Complete, he noticed an interesting table in it, one I had forgotten about. That table immediately puts everything in its place!
Of course it is incorrect to estimate the number of errors in large projects by looking at small ones! They have a different error density!
The larger the project, the more errors per 1000 lines of code it contains. Take a look at this wonderful table:
Table 1. Project size and typical error density. The book cites the data sources: "Program Quality and Programmer Productivity" (Jones, 1977) and "Estimating Software Costs" (Jones, 1998).
To make the data easier to grasp, let's plot it.
Chart 1. Typical error density in a project. Blue: maximum; red: average; green: minimum.
I think these graphs make it clear that the dependence is not linear: the larger the project, the easier it is to make a mistake.
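To see how much this changes the estimate, here is a minimal sketch comparing the linear extrapolation with a density-based one. The ErrorsPerKLOC values below are illustrative placeholders that only mimic the shape of Table 1, not the actual figures from McConnell's sources.

// A minimal sketch: the same estimate, but with error density growing with
// project size instead of staying constant. Density values are illustrative
// placeholders only; for real figures see Table 1.
#include <iostream>

// Hypothetical errors per 1000 lines of code for a given project size.
double ErrorsPerKLOC(double linesOfCode)
{
  if (linesOfCode < 2000.0)   return 5.0;   // term-project scale
  if (linesOfCode < 64000.0)  return 20.0;  // mid-sized project
  if (linesOfCode < 512000.0) return 40.0;  // large project
  return 60.0;                              // very large project
}

int main()
{
  const double smallSize = 1000.0;    // the 1000-line term project
  const double largeSize = 100000.0;  // the 100,000-line application

  const double smallErrors = ErrorsPerKLOC(smallSize) * smallSize / 1000.0;
  const double largeErrors = ErrorsPerKLOC(largeSize) * largeSize / 1000.0;

  // Linear extrapolation from the small project would predict 100x the
  // small-project errors; the density-based estimate is noticeably higher.
  std::cout << "Linear extrapolation:   " << smallErrors * 100.0 << "\n"
            << "Density-based estimate: " << largeErrors << "\n";
}

The numbers themselves are made up; the point is only that a constant-density assumption systematically underestimates the error count of the larger project.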
Of course, a static analyzer does not detect all errors. Still, the larger the project, the more effective it is, and it is even more effective when used regularly.
By the way, in a small project the analyzer may find no errors at all, or only a couple, and that can lead to completely wrong conclusions. Therefore, I highly recommend trying error-detection tools on real working projects.
Yes, it is more difficult, but you will get a realistic picture of what such tools can do. For example, as one of the authors of PVS-Studio, I can promise that we will try to help everyone who contacts us. If something does not work while you are evaluating PVS-Studio, write to us: many problems can be solved simply by configuring the tool properly.
P.S. I invite you to join my Twitter @Code_Analysis or the community on Reddit. There I regularly post links to interesting articles on C/C++, static code analysis, optimization, and other programming topics, both ours and other people's. Otherwise I keep getting kicked out of every other place for "self-promotion".