What we know and what we don’t know about software development effort estimation
(Translation)
The vast majority of studies and reports confirm a tendency for software projects to exceed their budgets and schedules; on average, the overrun is about 30 percent [1]. Moreover, if we compare estimation accuracy in the 1980s with that reported in recent studies, we find no significant difference. (The only analysis suggesting a significant improvement in estimation quality is found in the Standish Group’s Chaos Reports; that “improvement”, however, most likely stems from the researchers improving the quality of their data, moving from a sample overloaded with problematic projects to a more representative one [2].) Estimation methods have not changed significantly either: despite intensive research on formal estimation models, expert estimation remains the dominant method [3].
The obvious lack of breakthroughs in effort estimation methodology does not mean that we have not learned more about it. In this article I try to summarize some of the knowledge I believe we now have. Some of it can potentially improve estimation quality, some most likely cannot, and some concerns what we know about what we do not know about software development effort estimation. All the material I use to support these claims has been published [1].
What do we know
After reviewing the research on effort estimation, I have selected seven findings that are consistent across most studies:
There is no “best” estimation model or method
A great deal of research compares the accuracy of estimates produced by different models and methods, and these accuracy competitions yield a wide variety of “winners” [4]. The main reason for this instability in the results seems to be that many key relationships, such as that between project size and development effort, vary significantly from context to context [5]. In addition, the variables with the greatest impact on effort also vary, which leads to the conclusion that estimation models and methods should be tailored to the context in which they are applied.
The lack of stable key relationships also explains why statistically advanced estimation models, as a rule, improve estimation accuracy little or not at all compared with simple models. Advanced models lean too heavily on historical data and can perform even worse than simple models when applied in a different context. These results suggest that development companies are better off building their own estimation models than expecting universal methods and tools to be accurate in their particular context.
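As an illustration, here is a minimal sketch of what a simple context-specific model might look like: a power-law model, effort = a * size^b, fitted by least squares to a company’s own project history. The data, the function-point size measure, and the model form are assumptions made for illustration, not a recommendation of any particular metric.

```python
import math

# Hypothetical historical projects from one company's own context:
# (size in function points, actual effort in person-hours).
history = [(120, 950), (300, 2900), (80, 600), (500, 5600), (210, 1800)]

def fit_power_model(projects):
    """Fit effort = a * size^b by least squares on log-transformed data."""
    xs = [math.log(s) for s, _ in projects]
    ys = [math.log(e) for _, e in projects]
    n = len(projects)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

def estimate(a, b, size):
    """Point estimate of effort for a project of the given size."""
    return a * size ** b

a, b = fit_power_model(history)
print(f"effort = {a:.2f} * size^{b:.2f}")
print(f"estimate for a 250 FP project: {estimate(a, b, 250):.0f} person-hours")
```

The point is not the particular model but that its coefficients come from the company’s own context rather than from a universal tool.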
A client’s focus on low price leads to cost overruns
The tendency to underestimate effort is most pronounced when a supplier is selected on the basis of price, for example in a request for quotes. In less price-sensitive situations, such as in-house development, there is no such tendency; in fact, the opposite can even occur. This suggests that a key driver of underestimation is the client’s focus on obtaining the lowest possible price: suppliers who underestimate the effort are more likely to win the contract. Clients may therefore be able to avoid cost overruns by paying less attention to the quoted price and more to the contractor’s competence.
“Minimum-maximum” estimation intervals are too narrow
Estimation intervals, such as 90 percent confidence intervals, are systematically too narrow to reflect the real uncertainty of the effort required. Despite strong evidence of our inability to estimate minimum and maximum effort reliably, current estimation methods continue to assume that this is a solvable task. This is especially visible in PERT-style (three-point) estimation, where the final estimate is derived from the most likely, minimum, and maximum estimates.
Instead of asking experts for minimum and maximum effort values, developers should use historical data on previous estimation errors to set realistic minimum-maximum intervals [6].
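A minimal sketch of this idea, with invented numbers: rather than relying on expert-supplied minimum and maximum values (as in PERT’s E = (min + 4*likely + max) / 6), take the empirical distribution of past actual-to-estimate ratios and scale the new point estimate by its low and high percentiles. The project history and percentile choices below are illustrative assumptions.

```python
# Hypothetical history of (estimated, actual) effort for past projects.
history = [(100, 130), (200, 180), (150, 210), (300, 420), (80, 95),
           (120, 110), (250, 340), (90, 140), (400, 520), (60, 75)]

def empirical_interval(history, new_estimate, low_pct=0.1, high_pct=0.9):
    """Scale a point estimate by empirical percentiles of past
    actual/estimate ratios, instead of asking experts for min and max."""
    ratios = sorted(actual / est for est, actual in history)
    def pct(p):
        # crude nearest-rank percentile, adequate for a sketch
        return ratios[min(int(p * len(ratios)), len(ratios) - 1)]
    return new_estimate * pct(low_pct), new_estimate * pct(high_pct)

lo, hi = empirical_interval(history, 200)
print(f"interval for a point estimate of 200: [{lo:.0f}, {hi:.0f}]")
```

With a history that mostly overran its estimates, the resulting interval is shifted upward from the point estimate, which is exactly the correction expert-set intervals tend to miss.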
It is easy to mislead estimators, and hard to recover from being misled
Any software development effort estimate, even one based on a formal estimation model, requires expert judgment. Although expert judgment can be reasonably accurate, it is also strongly affected by external factors. Probably the strongest (and most harmful) influence occurs when those producing the estimate receive, before or during estimation, information about the budget, the client’s expectations, the available time, or other values that can act as estimation “anchors”. Without noticing it, experts produce estimates that are unreasonably close to the anchor values; knowing that a client expects a low price or a small number of person-hours, for example, is likely to lead to underestimation. Expert judgment can also be affected by how the estimation request itself is phrased.
Despite much research on how to remove misleading information and neutralize estimation bias, no reliable debiasing methods have been identified. The main practical conclusion is that those responsible for estimation should be shielded as far as possible from irrelevant and misleading information, for example by removing it from the requirements documentation.
Relevant historical data and checklists improve accuracy
One well-documented way to improve estimation accuracy is to use historical data and estimation checklists. When the historical data is relevant and the checklists are tailored to the company’s needs, activities are less likely to be overlooked, risk buffers are more likely to be adequate, and previous experience is reused to the fullest, all of which leads to more realistic estimates. Accuracy improves especially when data on similar projects can be used for so-called estimation by analogy [7].
Despite the obvious usefulness of such tools, many companies still do not use them to improve their estimation accuracy.
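Estimation by analogy can be sketched as a simple nearest-neighbor lookup over a company’s project history: estimate a new project as the average actual effort of its most similar past projects. The features, distance measure, and data below are illustrative assumptions; real analogy-based tools use richer feature sets.

```python
# Hypothetical past projects: (size_fp, team_size, actual_effort_hours).
history = [
    (100, 4, 900), (250, 6, 2600), (120, 5, 1150),
    (400, 8, 5200), (90, 3, 700), (260, 5, 2450),
]

def estimate_by_analogy(history, size_fp, team_size, k=2):
    """Average the actual effort of the k most similar past projects
    (squared Euclidean distance on range-normalized features)."""
    max_s = max(p[0] for p in history)
    max_t = max(p[1] for p in history)
    def dist(p):
        return ((p[0] - size_fp) / max_s) ** 2 + \
               ((p[1] - team_size) / max_t) ** 2
    nearest = sorted(history, key=dist)[:k]
    return sum(p[2] for p in nearest) / k

# A new 240 FP project with a team of 6 is estimated from its
# two closest analogues in the history.
print(estimate_by_analogy(history, 240, 6))
```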
Combining independent estimates improves accuracy
The average of several estimates from different sources is likely to be more accurate than most of the individual estimates. The key factor for improving accuracy is precisely the independence of the estimates: they should differ in expertise, in the experts’ backgrounds, and in the estimation process used. Delphi-like processes such as planning poker, in which developers reveal independently produced effort estimates (their “cards”), look especially useful in the context of software development effort estimation.
A structured group estimation process adds value beyond the mechanical combination of estimates, because the exchange of knowledge increases the total knowledge in the group. The negative effects of group judgment, such as groupthink and the greater willingness to take risks in a group than individually, have not been documented for software effort estimation.
Estimation models are, on average, less accurate than expert estimates. However, because models and experts arrive at their estimates through different processes, combining the two approaches is especially useful for improving accuracy.
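The accuracy gain from averaging independent estimates can be illustrated with a small simulation. The per-expert biases and noise level below are invented for illustration; the point is only that independent errors partially cancel when estimates are averaged.

```python
import random

random.seed(1)
TRUE_EFFORT = 1000  # hypothetical true effort of the task

# Each expert has a personal bias (background, process) plus random noise.
biases = [0.8, 0.9, 1.0, 1.1, 1.25]  # invented per-expert biases

def expert_estimate(bias):
    return TRUE_EFFORT * bias * random.lognormvariate(0, 0.3)

trials = 2000
group_err = individual_err = 0.0
for _ in range(trials):
    estimates = [expert_estimate(b) for b in biases]
    group_avg = sum(estimates) / len(estimates)
    group_err += abs(group_avg - TRUE_EFFORT)
    # expected error of a randomly chosen lone expert
    individual_err += sum(abs(e - TRUE_EFFORT) for e in estimates) / len(estimates)

print(f"mean abs error, averaged estimates: {group_err / trials:.0f}")
print(f"mean abs error, lone expert:        {individual_err / trials:.0f}")
```

If the experts shared the same bias (no independence), averaging would cancel noise but not the bias, which is why the diversity of backgrounds matters.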
Estimates can be harmful
Estimates do not only predict the future; they often influence it. Estimates that are too low can lead to lower quality, rework in later phases, and a higher risk of project failure; estimates that are too high can reduce productivity, in line with Parkinson’s law that “work expands so as to fill the time available for its completion”.
This is why it is important to consider carefully whether an effort estimate is really needed at a given stage. If it is not strictly necessary, it may be safer to work without one, or to postpone estimation until more information is available. Agile development methods, which plan only the next sprint or release using feedback from previous sprints or releases, can be a good way to avoid the harm done by premature estimates.
What we don't know
Several estimation-related problems remain without a satisfactory solution, despite the volume of research. Three of them illustrate especially vividly how limited our knowledge in this area is.
How to accurately estimate effort in very large, complex software projects
Mega-projects place especially high demands on effort estimation, not only because the stakes are high, but also because relevant experience and historical data are lacking. Many activities typical of mega-projects, such as resolving organizational issues among many participants with differing interests and goals, are very hard to estimate accurately, as they usually involve changes to business processes and complex interactions between project participants and existing software.
How to measure software size and complexity for accurate estimation
Despite years of research into measuring software size and complexity, none of the proposed metrics is good enough for effort estimation. Some size and complexity contexts apparently lend themselves to accurate estimation, but such contexts are quite rare.
How to measure and predict productivity
Even with a good assessment of a project’s size and complexity, a reliable estimate requires reliably predicting the productivity of the teams and team members who will work on it. This prediction is complicated by the surprisingly large productivity differences between and within teams, and there is no reliable method for it (with the possible exception of trialsourcing).
At present we do not even know whether software development projects exhibit economies of scale (productivity increases with project size) or diseconomies of scale (productivity decreases with project size). Most empirical studies seem to indicate that, on average, software projects show economies of scale, while most practitioners believe the opposite. Unfortunately, the study results supporting economies of scale appear to be artifacts of how the studies were designed, rather than reflections of a deep relationship between project size and productivity.
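The two hypotheses can be stated precisely through the exponent b in an effort model of the form effort = a * size^b: b < 1 implies economies of scale (productivity, size divided by effort, grows with project size), while b > 1 implies diseconomies. A tiny sketch, with arbitrary parameter values chosen only for illustration:

```python
def productivity(size, a=3.0, b=1.1):
    """Units of size delivered per unit of effort, under effort = a * size^b.
    Productivity equals size^(1 - b) / a, so its trend depends only on b."""
    return size / (a * size ** b)

# b > 1 (diseconomies of scale): productivity falls as projects grow.
print(productivity(100, b=1.1) > productivity(1000, b=1.1))   # True
# b < 1 (economies of scale): productivity rises with project size.
print(productivity(100, b=0.9) < productivity(1000, b=0.9))   # True
```

The empirical dispute described above is, in these terms, a dispute over whether the exponent fitted to real project data is genuinely below or above 1, or merely appears so because of how the studies sample and measure projects.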
Thus, what we currently know about software development effort estimation does not, in fact, let us solve the estimation problems of real projects. We can, however, point to several practices that tend to improve the reliability of estimates. In particular, companies are likely to improve their estimation accuracy if they:
- develop and apply simple estimation models tailored to their contexts, combined with expert judgment;
- use historical data on estimation errors to construct minimum-maximum intervals;
- avoid exposing estimators to misleading or irrelevant information;
- use checklists validated within the organization;
- use structured group estimation methods that guarantee the independence of the individual estimates;
- avoid premature estimation based on highly incomplete information.
Highly competitive bidding rounds that focus on the lowest price are very likely to select an over-optimistic contractor and, as a result, to lead to missed deadlines and poor software quality. In other fields this is known as the “winner’s curse”. In the long run, most clients will come to realize that focusing on the lowest bid for software development ultimately harms project success. Until then, development companies should watch for situations in which they can win a contract only with an over-optimistic estimate, and keep response strategies ready to counter or avoid the winner’s curse.
Sources:
1. T. Halkjelsvik and M. Jørgensen, “From Origami to Software Development: A Review of Studies on Judgment-Based Predictions of Performance Time,” Psychological Bulletin, vol. 138, no. 2, 2012, pp. 238–271.
2. M. Jørgensen and K. Moløkken-Østvold, “How Large Are Software Cost Overruns? A Review of the 1994 CHAOS Report,” Information and Software Technology, vol. 48, no. 4, 2006, pp. 297–301.
3. M. Jørgensen, “A Review of Studies on Expert Estimation of Software Development Effort,” J. Systems and Software, vol. 70, no. 1, 2004, pp. 37–60.
4. T. Menzies and M. Shepperd, “Special Issue on Repeatable Results in Software Engineering Prediction,” Empirical Software Eng., vol. 17, no. 1, 2012, pp. 1–17.
5. J.J. Dolado, “On the Problem of the Software Cost Function,” Information and Software Technology, vol. 43, no. 1, 2001, pp. 61–72.
6. M. Jørgensen and D.I.K. Sjøberg, “An Effort Prediction Interval Approach Based on the Empirical Distribution of Previous Estimation Accuracy,” Information and Software Technology, vol. 45, no. 3, 2003, pp. 123–136.
7. B. Flyvbjerg, “Curbing Optimism Bias and Strategic Misrepresentation in Planning: Reference Class Forecasting in Practice,” European Planning Studies, vol. 16, no. 1, 2008, pp. 3–21.
About the author
Magne Jørgensen is a researcher at the Simula Research Laboratory and a professor at the University of Oslo. His main research areas are effort estimation, contractor selection processes, outsourcing, and the assessment of software developers’ competencies. You can contact him by email at magnej@simula.no.