A Guide to Selecting
Software Measures
and Metrics
Capers Jones
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
Preface ...............................................................................................................vii
Acknowledgments ..............................................................................................xi
About the Author .............................................................................................xiii
1 Introduction ...........................................................................................1
2 Variations in Software Activities by Type of Software .........................17
3 Variations in Software Development Activities by Type of Software.........29
4 Variations in Occupation Groups, Staff Size, Team Experience ...........35
5 Variations due to Inaccurate Software Metrics That Distort Reality .......45
6 Variations in Measuring Agile and CMMI Development ....................51
7 Variations among 60 Development Methodologies ..............................59
8 Variations in Software Programming Languages ................................63
9 Variations in Software Reuse from 0% to 90% .....................................69
10 Variations due to Project, Phase, and Activity Measurements .............77
11 Variations in Burden Rates or Overhead Costs ....................................83
12 Variations in Costs by Industry ............................................................87
13 Variations in Costs by Occupation Group............................................93
14 Variations in Work Habits and Unpaid Overtime ................................97
15 Variations in Functional and Nonfunctional Requirements ................105
Preface

1. Agile coaches
2. Architects (software)
3. Architects (systems)
4. Architects (enterprise)
5. Assessment specialists
6. Capability maturity model integrated (CMMI) specialists
7. Configuration control specialists
8. Cost estimating specialists
9. Customer support specialists
10. Database administration specialists
11. Education specialists
12. Enterprise resource planning (ERP) specialists
13. Expert-system specialists
14. Function point specialists (certified)
15. Graphics production specialists
16. Human factors specialists
17. Integration specialists
18. Library specialists (for project libraries)
19. Maintenance specialists
20. Marketing specialists
21. Member of the technical staff (multiple specialties)
22. Measurement specialists
23. Metric specialists
24. Project cost analysis specialists
25. Project managers
26. Project office specialists
27. Process improvement specialists
28. Quality assurance specialists
29. Scrum masters
30. Security specialists
31. Technical writing specialists
32. Testing specialists (automated)
33. Testing specialists (manual)
34. Web page design specialists
35. Web masters
Another major leak is the failure to record the rather high costs incurred by users when they participate in software projects, such as embedded users on agile projects. Users also provide requirements, participate in design and phase reviews, perform acceptance testing, and carry out many other critical activities. User costs can collectively approach 85% of the effort of the actual software development teams.
Without multiplying examples, this new book is somewhat like a medical book
that attempts to discuss treatments for common diseases. This book goes through
a series of measurement and metric problems and explains the damages they can
cause. There are also some suggestions on overcoming these problems, but the main
focus of the book is to show readers all of the major gaps and problems that need to
be corrected in order to accumulate accurate and useful benchmarks for software
projects. I hope readers will find the information to be of use.
Quality data are even worse than productivity and resource data and are only
about 25% complete. The new technical debt metric is only about 17% complete.
Few companies even start quality measures until after unit test, so all early bugs
found by reviews, desk checks, and static analysis are invisible. Technical debt does
not include consequential damages to clients, nor does it include litigation costs
when clients sue for poor quality.
Hardly anyone measures bad fixes, that is, new bugs injected by the bug repairs themselves. About 7% of bug repairs contain new bugs, and this can rise above 35% for modules with high cyclomatic complexity. Even fewer companies measure bad test cases, or bugs in test libraries, which average about 15%.
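The compounding effect of bad fixes can be sketched in a few lines. The 7% and 35% rates are the averages cited above; the closed-form function itself is an illustrative simplification, not a formula from this book.

```python
# Sketch of how bad fixes compound: each round of repairs injects new
# bugs at the bad-fix rate, and those injected bugs must be repaired too.

def total_repairs(initial_bugs: float, bad_fix_rate: float) -> float:
    """Total repair actions needed until all bugs are cleared.

    Each repair injects a new bug with probability bad_fix_rate, so the
    total workload is a geometric series:
    initial_bugs * (1 + r + r^2 + ...) = initial_bugs / (1 - r).
    """
    return initial_bugs / (1.0 - bad_fix_rate)

print(round(total_repairs(1000, 0.07)))  # 1075 repairs for 1,000 bugs at 7%
print(round(total_repairs(1000, 0.35)))  # 1538 for high-complexity modules at 35%
```

Even at the industry-average 7% rate, roughly one repair in fourteen is wasted effort that also goes unmeasured.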
Yet another problem with software measurements has been the continuous
usage for more than 50 years of metrics that distort reality and violate standard
economic principles. The two most flagrant metrics with proven errors are cost per
defect and lines of code (LOC). The cost per defect metric penalizes quality and
makes buggy applications look better than they are. The LOC metric makes
requirements and design invisible and, even worse, penalizes modern high-level
programming languages.
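The LOC distortion can be shown with a few lines of arithmetic. All of the numbers below are hypothetical, chosen only to illustrate why cost per LOC flatters low-level languages while cost per function point reflects true economics.

```python
# Illustrative sketch of the LOC paradox: the same application, at the
# same size in function points, implemented in a low-level and a
# high-level language. The LOC and cost figures are hypothetical.

def metrics(function_points: int, loc: int, total_cost: float) -> dict:
    return {
        "cost_per_loc": total_cost / loc,
        "cost_per_fp": total_cost / function_points,
    }

# The same 100-function-point application, two implementations:
assembly = metrics(100, loc=30_000, total_cost=120_000)
java_like = metrics(100, loc=5_000, total_cost=50_000)

# Cost per LOC makes the low-level version look more productive...
print(assembly["cost_per_loc"], java_like["cost_per_loc"])  # 4.0 10.0
# ...while cost per function point shows the high-level version is cheaper.
print(assembly["cost_per_fp"], java_like["cost_per_fp"])    # 1200.0 500.0
```

The high-level version costs less than half as much in total, yet its cost per LOC is 2.5 times worse: exactly the inversion the text describes.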
Professional benchmark organizations such as Namcook Analytics, Q/P Management Group, David Consulting, and TI Metricas in Brazil that validate client historical data before logging it can achieve measurement accuracy of perhaps 98%. Contract projects that need accurate billable hours in order to get paid are often more than 90% accurate for development effort (though many omit unpaid overtime, and they never record user costs).
Function point metrics are the best choice for both economic and quality analyses of software projects. The new SNAP (software nonfunctional assessment process) metric measures nonfunctional requirements, but it is difficult to apply and also lacks empirical data.
Ordinary internal information system projects and web applications developed
under a cost-center model where costs are absorbed instead of being charged out
are the least accurate and are the ones that average only 37%. Agile projects are
very weak in measurement accuracy, often less than 50%. Self-reported benchmarks are also weak in measurement accuracy, often accumulating less
A distant analogy to this book on measurement problems is Control of
Communicable Diseases in Man, published by the U.S. Public Health Service. It has
concise descriptions of the symptoms and causes of more than 50 common communicable diseases, together with discussions of proven effective therapies.
Another medical book with useful guidance for those of us in software is Paul Starr's excellent The Social Transformation of American Medicine, published in 1982, which won a Pulitzer Prize. Some of its topics on improving medical records and medical education have much to offer on improving software records and software education.
So as not to have an entire book filled with problems, Appendix 2 is a more positive section that shows 25 quantitative goals that could be achieved between now and 2026 if the industry takes measurements seriously and also takes quality seriously.
Acknowledgments
Thanks to my wife, Eileen Jones, for making this book possible. Thanks for her
patience when I get involved in writing and disappear for several hours. Also thanks
for her patience on holidays and vacations when I take my portable computer and
write early in the morning.
Thanks to my neighbor and business partner Ted Maroney, who handles
contracts and the business side of Namcook Analytics LLC, which frees up my time
for books and technical work. Thanks also to Aruna Sankaranarayanan for her excellent work with our Software Risk Master (SRM) estimation tool and our website.
Thanks also to Larry Zevon for the fine work on our blog and to Bob Heffner for
marketing plans. Thanks also to Gary Gack and Jitendra Subramanyam for their
work with us at Namcook.
Thanks to other metrics and measurement research colleagues who also attempt
to bring order into the chaos of software development: Special thanks to the late
Allan Albrecht, the inventor of function points, for his invaluable contribution to
the industry and for his outstanding work. Without Allan’s pioneering work on
function points, the ability to create accurate baselines and benchmarks would
probably not exist today in 2016.
The new SNAP team from International Function Point Users Group (IFPUG)
also deserves thanks: Talmon Ben-Canaan, Carol Dekkers, and Daniel French.
Thanks also to Dr. Alain Abran, Mauricio Aguiar, Dr. Victor Basili, Dr. Barry
Boehm, Dr. Fred Brooks, Manfred Bundschuh, Tom DeMarco, Dr. Reiner Dumke,
Christof Ebert, Gary Gack, Tom Gilb, Scott Goldfarb, Peter Hill, Dr. Steven Kan,
Dr. Leon Kappelman, Dr. Tom McCabe, Dr. Howard Rubin, Dr. Akira Sakakibara,
Manfred Seufort, Paul Strassman, Dr. Gerald Weinberg, Cornelius Wille, the late
Ed Yourdon, and the late Dr. Harlan Mills for their own solid research and for the
excellence and clarity with which they communicated ideas about software. The
software industry is fortunate to have researchers and authors such as these.
Thanks also to the other pioneers of parametric estimation for software projects: Dr. Barry Boehm of COCOMO, Tony DeMarco and Arlene Minkiewicz of PRICE, Frank Freiman and Dan Galorath of SEER, Dr. Larry Putnam of SLIM and the other Putnam family members, Dr. Howard Rubin of Estimacs, Dr. Charles Turk (a colleague at IBM when we built DPS in 1973), and William Roetzheim.
About the Author

Capers Jones is currently the vice president and chief technology officer of Namcook Analytics LLC (www.Namcook.com). Namcook Analytics LLC designs leading-edge risk, cost, and quality estimation and measurement tools. Software Risk
Master (SRM)™ is the company’s advanced estimation tool with a patent-pending
early sizing feature that allows sizing before requirements via pattern matching.
Namcook Analytics also collects software benchmark data and engages in longer
range software process improvement, quality, and risk-assessment studies. These
Namcook studies are global and involve major corporations and some government
agencies in many countries in Europe, Asia, and South America. Capers Jones is
the author of 15 software books and several hundred journal articles. He is also an
invited keynote speaker at many software conferences in the United States, Europe,
and the Pacific Rim.
Chapter 1
Introduction
It is fair to ask: if historical data are incomplete, how is it possible to know the true amounts and to evaluate the quantity of missing data?
In order to correct the gaps and omissions that are normal in cost-tracking
systems, it is necessary to interview the development team members and the
project managers. During these interview sessions, the contents of the historical data collected for the project are compared to a complete work breakdown structure derived from similar projects.
For each activity and task that occurs in the work breakdown structure, but
which is missing from the historical data, the developers are asked whether or not
the activity occurred. If it did occur, the developers are asked to reconstruct from
memory or their informal records the number of hours that the missing activity
accrued.
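The validation step described above amounts to comparing recorded effort against a reference chart of accounts and asking the team about each gap. A minimal sketch follows; the activity names and the `find_leakage` helper are hypothetical illustrations, not the author's actual 25-activity chart.

```python
# Sketch of historical-data validation: compare recorded activities
# against a reference chart of accounts and list the gaps that the
# development team should be interviewed about.

# Hypothetical subset of a standard chart of accounts:
REFERENCE_ACTIVITIES = [
    "requirements", "design", "design reviews", "coding",
    "code inspections", "unit testing", "system testing",
    "user documentation", "project management",
]

def find_leakage(recorded_hours: dict) -> list:
    """Activities in the reference chart with no recorded effort."""
    return [a for a in REFERENCE_ACTIVITIES if recorded_hours.get(a, 0) == 0]

# A typical "leaky" tracking record covering only part of the work:
recorded = {"design": 400, "coding": 1200, "unit testing": 300}
print(find_leakage(recorded))
# e.g. ['requirements', 'design reviews', 'code inspections', ...]
```

Each activity on the resulting list becomes an interview question: did it occur, and if so, roughly how many hours did it accrue?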
Problems with errors and leakage from software cost-tracking systems are as
old as the software industry itself. The first edition of the author’s book, Applied
Software Measurement, was published in 1991. The third edition was published
in 2008. Yet the magnitude of errors in cost- and resource-tracking systems is
essentially the same today as it was in 1991. Following is an excerpt from the third
edition that summarizes the main issues of leakage from cost-tracking systems:
It is a regrettable fact that most corporate tracking systems for effort and costs
(dollars, work hours, person months, etc.) are incorrect and manage to omit from
30% to more than 70% of the real effort applied to software projects. Thus most
companies cannot safely use their own historical data for predictive purposes.
When benchmark consulting personnel go on-site and interview managers and
technical personnel, these errors and omissions can be partially corrected by
interviews.
The commonest omissions from historical data, ranked in order of significance,
are given in Table 1.1.
Not all of these errors are likely to occur on the same project, but enough of
them occur so frequently that ordinary cost data from project tracking systems are
essentially useless for serious economic study, for benchmark comparisons between
companies, or for baseline analysis to judge rates of improvement.
A more fundamental problem is that most enterprises simply do not record data for anything but a small subset of the activities actually performed. In carrying out interviews with project managers and project teams to validate and correct historical data, the author has observed the following patterns of incomplete and missing data, using the 25 activities of a standard chart of accounts as the reference model (Table 1.2).
When the author and his colleagues collect benchmark data, we ask the managers and personnel to try to reconstruct any missing cost elements. Reconstruction of data from memory is plainly inaccurate, but it is better than omitting the missing data entirely.
Unfortunately, the bulk of the software literature and many historical studies only report information to the level of complete projects, rather than to the level of specific activities. Such gross bottom-line data cannot readily be validated and are almost useless for serious economic purposes.
Table 1.3 illustrates the differences between full activity-based costs for a soft-
ware project and the typical leaky patterns of software measurements normally
carried out. Table 1.3 uses a larger 40-activity chart of accounts that shows typical
work patterns for large systems of 10,000 function points or more.
As can be seen, measurement leaks degrade the accuracy of the information
available to C-level executives and also make economic analysis of software costs
very difficult unless the gaps are corrected.
To illustrate the effect of leakage from software tracking systems, consider what
the complete development cycle would look like for a sample project. The sample is
for a PBX switching system of 1,500 function points written in the C programming
language. Table 1.4 illustrates a full set of activities and a full set of costs.
Table 1.3 Measured Effort versus Actual Effort: 10,000 Function Points
(Only fragments of this table survive in this extraction. Recoverable rows, as percent of total measured results: 4 Requirements 4.25; 6 Prototyping 2.00; 7 Architecture 0.50; 22 Integration 0.75; 37 Installation/training 0.65.)
Now consider what the same project would look like if only design, code, and unit test (DCUT) were recorded by the company's tracking system. This combination is called DCUT, and it has been a common software measurement for more than 50 years. Table 1.5 illustrates the partial DCUT results.
Instead of a productivity rate of 6.00 function points per staff month, Table 1.5 indicates a productivity rate of 18.75 function points per staff month. Instead of a schedule of almost 25 calendar months, Table 1.5 indicates a schedule of less than 7 calendar months. Instead of a cost per function point of U.S. $1,666, the DCUT results are only U.S. $533 per function point.
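The gap between the full and DCUT views can be recomputed from the figures quoted above. The burdened monthly cost of $10,000 used here is an assumption, chosen because it reproduces the quoted cost-per-function-point values from the $8,000 average salary plus overhead.

```python
# Recomputing the DCUT distortion for the 1,500 function point PBX
# example. The $10,000 burdened monthly cost is an assumed figure
# consistent with the cost-per-function-point values in the text.

FP = 1500
BURDENED_MONTHLY_COST = 10_000

def project_view(fp_per_staff_month: float) -> tuple:
    """(effort in staff months, total cost, cost per function point)."""
    effort_months = FP / fp_per_staff_month
    cost = effort_months * BURDENED_MONTHLY_COST
    return effort_months, cost, round(cost / FP, 2)

full = project_view(6.00)    # all activities measured
dcut = project_view(18.75)   # only design, code, and unit test measured

print(full)  # (250.0, 2500000.0, 1666.67) -- complete history
print(dcut)  # (80.0, 800000.0, 533.33) -- same project, leaky history
```

The same project appears to cost $2.5 million or $800,000 depending solely on which activities the tracking system happens to record.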
Yet both Tables 1.4 and 1.5 are for exactly the same project. Unfortunately, what passes for historical data far more often matches the partial results shown in Table 1.5 than the complete results shown in Table 1.4. This leakage of data is not economically valid, and it is not what C-level executives need and deserve to understand the real costs of software.

Table 1.4 Example of Complete Costs for Software Development
(Average monthly salary $8,000; CMM level 1; programming language C. Column headings: activities; staff function point assignment scope; monthly function point production rate; work hours per function point; burdened cost per function point; schedule months; staff; effort months. Only one data row survives in this extraction: 20 Field (beta) testing: 1,000; 250.00; 0.53; $40.00; 4.00; 1.50; 6.00.)

Table 1.5 Example of Partial Costs for Software Development (DCUT = Design, Code, and Unit Test)
(CMM level 1; programming language C. The detailed rows of this table did not survive in this extraction.)
Internal software projects where the development organization is defined as
a cost center are the most incomplete and inaccurate in collecting software data.
Many in-house projects by both corporations and government agencies lack use-
ful historical data. Thus such organizations tend to be very optimistic in their
internal estimates because they have no solid basis for comparison. If they switch
to a commercial estimating tool, they tend to be surprised at how much more
costly the results might be.
External projects that are being built under contract, and projects where the
development organization is a profit center, have stronger incentives to capture
costs with accuracy. Thus contractors and outsource vendors are likely to keep
better records than internal software groups.
Another major gap for internal software projects developed by companies for
their own use is the almost total failure to measure user costs. Users participate in
requirements, review documents, participate in phase reviews, perform acceptance
tests, and are sometimes embedded in development teams if the agile methodology
is used. Sometimes user costs can approach or exceed 75% of development costs.
Table 1.6 shows typical leakage for user costs for internal projects where users are
major participants. Table 1.6 shows an agile project of 1,000 function points.
As can be seen in Table 1.6, user costs were more than 35% of development costs. This is too large a value to remain invisible and unmeasured if software economic analysis is going to be taken seriously.
Tables 1.3 through 1.6 show how wide the differences can be between full
measurement and partial measurement. But an even wider range is possible,
because many companies measure only coding and do not record unit test as a
separate cost element.
Table 1.7 shows the approximate distribution of tracking methods noted at
more than 150 companies visited by the author and around 26,000 projects.
Among the author's clients, about 90% of project historical data are wrong and incomplete until Namcook consultants help the clients to correct them. In fact, the average among the author's clients is that historical data are only about 37% complete for effort and less than 25% complete for quality.
Only 10% of the author’s clients actually have complete cost and resource data
that include management and specialists such as technical writers. These projects
usually have formal cost-tracking systems and also project offices for larger projects.
They are often contract projects where payment depends on accurate records of
effort for billing purposes.
Leakage from cost-tracking systems and the wide divergence in what activities
are included present a major problem to the software industry. It is very difficult
to perform statistical analysis or create accurate benchmarks when so much of
the reported data are incomplete, and there are so many variations in what gets
recorded.
Table 1.6 User Effort versus Development Team Effort: Agile 1,000 Function Points
(Column headings: activities; team percent of total; user percent of total. Only fragments survive in this extraction, all from the team percent of total column: 7 Architecture 0.50; 13 Coding 22.50; 22 Integration 0.75.)
The gaps and variations in historical data explain why the author and his
colleagues find it necessary to go on-site and interview project managers and
technical staff before accepting historical data. Unverified historical data are
often so incomplete as to negate the value of using them for benchmarks and
industry studies.
When we look at software quality data, we see similar leakages. Many com-
panies do not track any bugs before release. Only sophisticated companies such as
IBM, Raytheon, and Motorola track pretest bugs.
At IBM, there were even volunteers who recorded bugs found during desk check
sessions, debugging, and unit testing, just to provide enough data for statistical analysis.
(The author served as an IBM volunteer and recorded desk check and unit test bugs.)
Table 1.8 shows the pattern of missing data for software defect and quality measurements for an application of a nominal 1,000 function points in Java.

Table 1.8 Measured Quality versus Actual Quality: 1,000 Function Points
(Column headings: defect removal activities; defects removed; defects measured; percent of total. The detailed rows of this table did not survive in this extraction.)
Out of the 25 total forms of defect removal, data are collected only for 13 of
these under normal conditions. Most quality measures ignore all bugs found before
testing, and they ignore unit test bugs too.
The apparent defect density of the measured defects is less than one-third of the true volume of software defects. In other words, true defect potentials would be about 3.50 defects per function point, but due to gaps in the measurement of quality, apparent defect potentials would seem to be just under 1.00 defects per function point.
The apparent defect removal efficiency (DRE) is artificially reduced from more
than 94% to less than 80% due to the missing defect data from static analysis,
inspections, and other pretest removal activities.
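This distortion can be reproduced from the approximate figures above: a true defect potential of about 3.50 per function point, a true DRE above 94%, and measurement beginning only after unit test. The exact values in the sketch are illustrative round numbers, not measured data.

```python
# Sketch of how missing pretest defect data distorts both apparent
# defect density and defect removal efficiency (DRE).

FP = 1000
true_potential = 3.50 * FP              # ~3,500 defects actually created
delivered = true_potential * (1 - 0.94)  # true DRE of 94% -> ~210 escape

# Only defects found after unit test get measured, so the potential
# appears to be about 1.00 defect per function point:
measured_potential = 1.00 * FP

true_dre = (true_potential - delivered) / true_potential
apparent_dre = (measured_potential - delivered) / measured_potential

print(f"true DRE: {true_dre:.0%}")        # true DRE: 94%
print(f"apparent DRE: {apparent_dre:.0%}")  # apparent DRE: 79%
```

The delivered defects are the same in both views; only the denominator shrinks, so the organization looks both less buggy and less effective at removal than it really is.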
For the software industry as a whole, the costs of finding and fixing bugs are the
top cost driver. It is professionally embarrassing for the industry to be so lax about
measuring the most expensive kind of work since software began.
The problems illustrated in Tables 1.1 through 1.8 are just the surface manifestation of a deeper issue. After more than 50 years, the software industry lacks anything that resembles a standard chart of accounts for collecting historical data.
This lack is made more difficult by the fact that in real life, there are many
variations of activities that are actually performed. There are variations due to
application size, and variations due to application type.
Chapter 2
Variations in Software Activities by Type of Software
In many industries, building large products is not the same as building small products. Consider the differences in specialization and methods required to build a rowboat versus an 80,000-ton cruise ship.
A rowboat can be constructed by a single individual using only hand tools.
But a large modern cruise ship requires more than 350 workers including many
specialists such as pipe fitters, electricians, steel workers, painters, and even interior
decorators and a few fine artists.
Software follows a similar pattern: building large systems in the 10,000 to 100,000 function point range is more or less equivalent to building other large structures such as ships, office buildings, or bridges. Many kinds of specialists are utilized, and the development activities are quite extensive compared to smaller applications.
Table 2.1 illustrates the variations in development activities noted for six size
plateaus using the author’s 25-activity checklist for development projects.
Below the plateau of 1,000 function points (which is roughly equivalent to
100,000 source code statements in a procedural language such as COBOL), less
than half of the 25 activities are normally performed. But large systems in the
10,000 to 100,000 function point range perform more than 20 of these activities.
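The rule of thumb above (1,000 function points is roughly 100,000 COBOL statements) generalizes to other languages through statements-per-function-point ratios. The sketch below uses that COBOL figure plus the SRM default of 53 logical statements per function point implied by the KLOC values in Table 2.2; such ratios are rough approximations that vary by source.

```python
# Converting function points to approximate logical code size.
# Ratios are rough approximations: COBOL per the rule of thumb in the
# text, and the SRM default implied by Table 2.2 (5.3 KLOC per 100 FP).

STATEMENTS_PER_FP = {
    "cobol": 100,        # procedural language, as in the text's example
    "srm_default": 53,   # default ratio used in Table 2.2
}

def logical_statements(function_points: int, language: str) -> int:
    return function_points * STATEMENTS_PER_FP[language]

print(logical_statements(1000, "cobol"))       # 100000 statements
print(logical_statements(100, "srm_default"))  # 5300 statements
```

The spread between such ratios is another reason LOC-based comparisons across languages mislead, as Chapter 1 noted.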
To illustrate these points, Table 2.2 shows more detailed quantitative variations
in results for three size plateaus, 100, 1,000, and 10,000 function points.
Table 2.1 Development Activities for Six Project Size Plateaus
(Columns: 1, 10, 100, 1,000, 10,000, and 100,000 function points; an X means the activity is normally performed at that size. Column alignment of the X marks was lost in this extraction, so only the surviving rows and the per-plateau totals are reliable.)
2. Prototyping X X X
3. Architecture X X
4. Project plans X X X
5. Initial design X X X X X
6. Detail design X X X X
7. Design reviews X X
8. Coding X X X X X X
9. Reuse acquisition X X X X X X
12. Independent verification and validation
13. Change control X X X
15. User documentation X X X X
22. Independent testing
24. Installation/training X X X
25. Project management X X X X X X
Activities per plateau: 5; 6; 9; 18; 22; 23
Table 2.2 Variations by Powers of Ten (100, 1,000, and 10,000 Function Points)
(Only the rows below survive in this extraction; values are for the 100, 1,000, and 10,000 function point plateaus, respectively.)
Size in function points: 100; 1,000; 10,000
Size in logical KLOC (SRM default for KLOC): 5.3; 53; 530
Physical lines of code (LOC) per month (includes blank lines, comments, headers, etc.): 2,689.42; 2,156.03; 876.31
Requirements creep (total percent growth): 1.00%; 6.00%; 15.00%
Requirements creep (function points): 1; 60; 1,500
Probable deferred features to release 2: 0; 0; 2,500
Client planned project cost: $65,625; $812,500; $18,667,600
Actual total project cost: $71,930; $897,250; $22,075,408
Plan/actual cost difference: $6,305; $84,750; $3,407,808
Plan/actual percent difference: 8.77%; 9.45%; 15.44%
Planned cost per function point: $656.25; $812.50; $1,866.76
Actual cost per function point: $719.30; $897.25; $2,207.54
Defects per function point: 2.31; 4.09; 5.75
Security flaws: 0; 3; 81
Test cases, unit test: 101; 1,026; 10,461
Document sizing, architecture: 17; 76; 376
High warranty repairs/low maintainability: 6.00; 14.75; 32.00
Architects (staffing): 0; 0; 0.86
Activities performed (X marks as extracted; column alignment lost): 02 Prototyping X X; 03 Architecture X; 04 Project plans X X; 05 Initial design X X; 06 Detail design X X X; 07 Design reviews X; 08 Coding X X X; 09 Reuse acquisition X X X; 10 Package purchase X; 11 Code inspections X; 12 Independent verification and validation (IV&V); 13 Change control X X; 14 Formal integration X X; 15 User documentation X X X; 16 Unit testing X X X; 17 Function testing X X X; 18 Integration testing X X; 19 System testing X X; 20 Beta testing X; 21 Acceptance testing X X; 22 Independent testing; 23 Quality assurance X X; 24 Installation/training X; 25 Project management X X X
Activities performed per plateau: 8; 17; 23
As can be seen in Table 2.2, what happens for a small project of 100 function points can be very different from what happens for a large system of 10,000 function points. Note the presence of many kinds of software specialists at the large 10,000 function point size and their absence for the smaller sizes. Note also the increase in activities from 8 to 23 as application size gets larger.
Just consider the simple mathematical combinations that have to be estimated or measured as software size increases. A small project of 100 function points might have three occupation groups and perform eight activities: that results in 24 combinations that need to be predicted or measured. A large system of 10,000 function points might have 20 occupation groups and perform 25 activities, for a total of 500 combinations. Even worse, some activities require many occupation groups, whereas others require only a few or even one. The total permutations can run into the billions of potential combinations!
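The bookkeeping burden just described can be computed directly: every pairing of an occupation group with an activity is a cell that must be estimated or measured.

```python
# Each (occupation group, activity) pair is one estimation/measurement
# cell, so the grid grows multiplicatively with project size.

def measurement_cells(occupation_groups: int, activities: int) -> int:
    return occupation_groups * activities

print(measurement_cells(3, 8))    # small 100-FP project: 24 cells
print(measurement_cells(20, 25))  # large 10,000-FP system: 500 cells
```

A twentyfold growth in the measurement grid, before even considering which groups work on which activities, is why large-system benchmarks demand far more rigorous tracking than small-project norms.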
Chapter 3
Variations in Software Development Activities by Type of Software
Another key factor that influences software development activities is the type of software being constructed. For example, the methods utilized for building military software are very different from civilian norms: military software projects use independent verification and validation (IV&V) and also independent testing, which seldom occur for civilian projects.
The systems and commercial software domains also have fairly complex development activities compared to management information systems. The outsource domain, due to contractual implications, also uses a fairly extensive set of development activities.
Table 3.1 illustrates the differences in development activities that the author has
noted across the six types of software.
As can be seen, the activities for outsourced, commercial systems, and military
software are somewhat more numerous than for web and MIS projects where
development processes tend to be rudimentary in many cases.
The six types of software shown in Table 3.1 are far from being the only
kinds of software developed. For example, open-source applications developed
by independent personnel have a unique development method that can be quite
different from software developed by a single organization and a single team.
Software that requires government certification such as the U.S. Food and
Drug Administration, Federal Aviation Administration, or Department of Defense
will also have unique development patterns, and these can vary based on the spe-
cific government agency rules and regulations.
Table 3.1 Development Activities for Six Project Types
Activities Performed: Web, MIS, Outsource, Commercial, Systems, Military
01 Requirements X X X X X
02 Prototyping X X X X X X
03 Architecture X X X X
04 Project plans X X X X X
05 Initial design X X X X X
06 Detail design X X X X X
07 Design reviews X X X X
08 Coding X X X X X X
09 Reuse acquisition X X X X X X
10 Package purchase X X X X X X
11 Code inspections X X X X
13 Change control X X X X X
14 Formal integration X X X X X
15 User documentation X X X X X
Table 3.1 (Continued) Development Activities for Six Project Types
Activities Performed Web MIS Outsource Commercial Systems Military
16 Unit testing X X X X X X
17 Function testing X X X X X
18 Integration testing X X X X X
19 System testing X X X X X X
20 Beta testing X X X
21 Acceptance testing X X X X X
22 Independent testing X
23 Quality assurance X X X X
24 Installation/training X X X X X
25 Project management X X X X X
Activities 6 18 22 23 23 25
Large financial software applications in the United States that are subject to the
Sarbanes–Oxley rules will have a very elaborate and costly governance process that
never occurs on other kinds of software applications.
Software that is intended to be used or marketed in many countries will have
elaborate and expensive nationalization procedures that may include translation of
all documents, HELP text, and sometimes even code comments into other national
languages.
Table 3.2 shows the most likely number of activities, occupation groups, and
combinations for 20 different types of software:
Software Type / Activities / Occupation Groups / Combinations
1 Military software-weapons 33 35 1,155
4 Telecom—public switches 27 26 702
12 Multinational applications 18 15 270
18 Web applications 9 11 99
19 Open source 8 4 32
20 Personal software 3 1 3
Averages 20 18 424
Chapter 4
Variations in Occupation
Groups, Staff Size,
Team Experience
2. Architects (software)
3. Architects (systems)
4. Architects (enterprise)
5. Assessment specialists
8. Cost-estimating specialists
9. Customer-support specialists
One of the most common management reactions when projects start to run late
is to add more people. Of course this sometimes slows things down, but it is still
a common phenomenon as noted years ago by Dr. Fred Brooks in his classic book
The Mythical Man-Month (1975).
Here too the software literature and software benchmarks are strangely silent.
If the average complement of software engineers for 1,000 function points is six
people, what would happen if it were increased to 10 people? What would happen
if it were reduced to three people? The literature and most benchmarks are silent
on this basic issue.
There is a curve called the Putnam–Norden–Rayleigh (PNR) curve that
shows the relationships between software effort and schedules. In essence, the
curve shows that one person working for 10 months and 10 people working for one
month are not equivalent.
With 10 people, communications would cause confusion and probably stretch
the schedule to two months, hence doubling the effort. One person might not have
all of the necessary skills, so the schedule might slip to 12 calendar months. Some
intermediate value such as four people working for 2.5 months would probably
deliver the optimal result.
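A toy model makes this tradeoff concrete. The coefficients here are illustrative assumptions, not Putnam's or Norden's actual equations and not SRM's algorithm: communication overhead is charged per pairwise channel, and a solo worker pays an assumed skill-coverage penalty.

```python
def schedule_months(team_size: int, base_effort: float = 10.0) -> float:
    """Approximate calendar schedule for a nominal 10 staff-month task."""
    channels = team_size * (team_size - 1) // 2   # n(n-1)/2 communication paths
    overhead = 1.0 + 0.02 * channels              # assumed 2% drag per channel
    skill_gap = 1.2 if team_size == 1 else 1.0    # one person lacks some skills
    effort = base_effort * overhead * skill_gap   # staff-months actually burned
    return effort / team_size                     # calendar months

# One person slips to 12 months; ten people finish in about 2 months but
# burn nearly double the effort; a mid-sized team of four lands near the
# optimum, matching the shape of the PNR curve.
```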
As mentioned, Fred Brooks’ classic book, The Mythical Man-Month, made this
concept famous, and also has one of the best software book titles of all time. Fred
is also famous for the phrase no silver bullet to highlight the fact that no known
methodology solves all software engineering problems.
(Phil Crosby’s book, Quality Is Free (1979), is another great book title that
resonates through the ages. Phil also developed the phrase zero defects, which is a
laudable goal even if hard to achieve.)
The PNR curve originated with Lord Rayleigh, a British physicist who died in
1919. He was a discoverer of argon, and he also developed a mathematical
model of light scattering that explains why the sky is blue. Peter Norden of IBM
applied Rayleigh curves to the staffing of engineering projects.
(Figure: a Rayleigh staffing curve plotting effort, staff × time, against time,
with a flat approximation over the linear range.)
For software projects, with their many occupation groups, the Rayleigh curves
are not a perfect fit but are still a useful concept.
Incidentally, the agile concept of pair programming is an existence proof that
doubling staff size does not cut elapsed time in half. In fact, some pair-programming
projects take longer than the same number of function points developed by a single
programmer!
The literature on pair programming is woefully inadequate because it only com-
pares individual programmers to pairs and ignores other factors such as inspections,
static analysis, automated proofs, requirements models, and many other modern
quality techniques.
The author’s observations and data are that single programmers using inspec-
tions and static analysis have better quality than pair programmers and also have
shorter coding schedules.
Table 4.2 presents the normal staffing and occupation group patterns for four
software size plateaus: 100, 1,000, 10,000, and 100,000 function points. Java is the
assumed language in all four cases. Table 4.2 shows only 20 of the more common
software occupation groups that occur with a high frequency.
Note that the small project of 10 function points used few occupation groups,
with programmers and testers being the two main categories. But as applications
get larger, more and more specialists are needed: business analysts, function point
counters, database administration, and many others.
Note also that because some of the specialists such as technical writers and busi-
ness analysts are only involved part of the time, it is necessary to deal with part-time
fractional personnel rather than all full-time personnel. This is why there are more
occupations than people for 10 function points but more people than occupations
above 1,000 function points.
Figure 4.3 shows the probable impact of various team sizes for an application of
1,000 function points coded in Java.
Figure 4.3 The impact of team size on schedules for 1,000 function points.
As can be seen, adding staff can shorten schedules, up to a point. However,
adding staff raises costs and lowers productivity, and excessive staff can lead
to confusion and communication problems. Selecting the optimal staff size for a
specific project is one of the more complex calculations performed by Software
Risk Master (SRM) and other parametric estimation tools.
Team size, team experience, work hours, and unpaid overtime are all personnel
issues that combine in a very complex pattern. This is why parametric estimation
tools tend to be more accurate than manual estimates and also more repeatable
because the algorithms are embedded in the tool and not only in the minds of
project managers.
Another factor of importance is that of experience levels. SRM uses a five-point
scale to rank team experience ranging from novice to expert. The chart in Figure 4.4
shows the approximate differences using an average team as the 100% mark.
As can be seen, the results are slightly asymmetrical. Top teams are about 30%
more productive than average, but novice teams are only about 15% below average.
The reason is that normal corporate training and appraisal programs tend to weed
out the really unskilled, so they seldom become actual team members. The same
appraisal programs reward the skilled, which explains why the best results have
a longer tail.
Software is a team activity. The ranges in performance for specific individuals
can top 100%. But there are not very many of these superstars. Only about 5% to
10% of general software populations are at the really high level of the performance
spectrum.
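The five-point scale can be sketched as a simple productivity multiplier. The +30% and -15% endpoints come from this section; the two intermediate values are interpolated assumptions, not SRM's actual coefficients:

```python
# Productivity relative to an average team (average = 1.00).
EXPERIENCE_MULTIPLIER = {
    "expert": 1.30,         # about 30% above average (from the text)
    "experienced": 1.15,    # assumed intermediate value
    "average": 1.00,
    "inexperienced": 0.92,  # assumed intermediate value
    "novice": 0.85,         # about 15% below average (from the text)
}

def adjusted_rate(base_fp_per_month: float, level: str) -> float:
    """Scale a nominal productivity rate by team experience level."""
    return base_fp_per_month * EXPERIENCE_MULTIPLIER[level]
```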
A third personnel factor with a strong influence is that of unpaid overtime.
Unpaid overtime has a tangible impact on software costs, and a more subtle impact
on software schedules. Projects with plenty of unpaid overtime will have shorter
schedules, but because most tracking systems do not record unpaid overtime, it is
difficult to study this situation without interviewing development teams.
Figure 4.4 Impact of team experience, shown as a percentage of average
performance from novice through expert.
Figure 4.5 shows the probable impact of unpaid overtime.
Unpaid overtime is seldom measured, and this causes problems for estimation
and also with using historical data as benchmarks. The missing unpaid overtime
can make as much as 15% of total effort invisible!
Figure 4.5 is a standard SRM output and shows the impact of unpaid overtime
on project costs for a project of 1,000 function points or 53,333 Java statements.
The graph shows unpaid overtime hours per calendar month for the full team:
2 hours per month from each member of a team of seven people yields 14 free
hours of work each month.
As can be seen, unpaid overtime is a significant factor for software cost esti-
mates, and it is equally significant for software schedule prediction. SRM includes
unpaid overtime as a standard adjustment to estimates, but the default value is zero.
Users need to provide local values for unpaid overtime in their specific organiza-
tions and for specific projects.
The range of unpaid overtime runs from 0 to more than 20 hours per month.
This is a major variable but one that is often not measured or included properly in
software cost estimates. When personnel are working long hours for free, this has a
big impact on project results.
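The effect can be sketched as follows. This is an illustrative calculation, not SRM's internal algorithm: unpaid hours add work capacity, which shortens the schedule, without adding billed cost.

```python
def schedule_and_cost(total_work_hours: float, team_size: int,
                      paid_hours: float, unpaid_hours: float,
                      rate_per_hour: float):
    """Calendar months and billed cost, given per-person monthly hours."""
    capacity = team_size * (paid_hours + unpaid_hours)  # hours delivered/month
    months = total_work_hours / capacity
    cost = months * team_size * paid_hours * rate_per_hour  # unpaid is free
    return months, cost

# Seven people each donating 2 unpaid hours add 14 free hours per month;
# raising unpaid hours shortens the schedule and lowers the billed cost.
```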
In today's world, with software being developed globally and international
outsource vendors holding about 30% of the U.S. software market, local work
hours per calendar month also need to be included. The sum of paid and unpaid
hours per month ranges globally from about 202 for India to 116 for the
Netherlands (Figure 4.6).
These factors are all intertwined and have simultaneous impacts. The best
results for U.S. software would be small teams of experts working more than
150 hours per month who put in perhaps 12 hours of unpaid overtime per month.
Figure 4.5 Project costs with monthly unpaid overtime hours from 2 to 16
(cost axis from 1,150,000 to 1,400,000 dollars).
Figure 4.6 Work hours per calendar month by country, from India (highest) to
the Netherlands (lowest): India, Peru, Malaysia, Japan, the United States,
Canada, France, Norway, and the Netherlands.
U.S. industry segments with this combination include start-up companies, com-
puter game companies, open-source development, and commercial software
companies.
The worst U.S. results would come from large teams of novices working fewer
than 125 hours per month with zero unpaid overtime; being inexperienced, they
need frequent meetings to decide what has to be done, rather than just doing
the necessary work. Sectors that might fit this pattern include state and local
government software groups.
In other words, large teams tend to have messy communication channels that
slow down progress. U.S. industry segments with this combination include state
and federal government software projects, unionized shops, and some time-and-
materials contracts where unpaid overtime is not allowed.
On a global basis, the best results would be small teams of experts in coun-
tries with intense work months of more than 160 hours per month and more than
16 hours of unpaid overtime each month.
On a global basis, the worst results would be large teams of novices in heavily
unionized countries working less than 120 hours per month where unpaid overtime
does not exist.
Chapter 5
Variations due to
Inaccurate Software
Metrics That
Distort Reality
1. The traditional lines of code (LOC) metrics, which penalize modern high-level
languages and make requirements and design invisible.
2. The cost per defect metric, which penalizes quality and makes buggy software
look better than it really is.
3. The technical debt metric, which omits expensive quality issues such as
litigation and consequential damages. Technical debt does not occur at all for
projects whose quality is so bad that they are canceled and never released,
even though some canceled projects cost millions of dollars. Technical debt is
also hard to apply to embedded and systems software. It does provide interesting
if incomplete information for information technology projects.
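The LOC penalty in point 1 is easy to demonstrate numerically. The figures below are illustrative assumptions, not measured data: the same application coded in a low-level and a high-level language.

```python
# Same 1,000-function-point application, two languages (assumed figures).
low_level = {"loc": 250_000, "cost": 2_000_000}   # e.g., an assembly-like language
high_level = {"loc": 50_000, "cost": 1_000_000}   # e.g., a modern language

def cost_per_loc(version: dict) -> float:
    return version["cost"] / version["loc"]

# The high-level version halves the real cost of the project, yet its
# cost per LOC ($20) looks worse than the low-level version's ($8):
# LOC metrics reward verbose languages and penalize concise ones.
```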
Why has the software industry used these invalid and inaccurate metrics for
more than 60 years, with hardly any of the software literature even questioning
their results?
Universities have been essentially silent on these metric problems, and indeed
some continue to teach software engineering and software project management
using LOC and cost per defect metrics without any cautions to students at all that
these metrics are invalid and distort reality.
Progress in software engineering requires a great deal of technical knowledge
that can only be derived from accurate measurements of size, quality, costs, and
technology effectiveness. Until the software industry abandons bad metrics
and switches to functional metrics, progress will resemble a drunkard's walk,
with as many backward steps as forward ones.
The software industry has suffered from inaccurate metrics and sloppy and
incomplete measurement practices for more than 60 years. This is a key factor in
today’s poor software quality and low software productivity.
If medicine had the same dangerous combination of bad metrics and incomplete
measurements as software does, then medical doctors would probably not be using
sterile surgical procedures even in 2016 and might still be treating infections with
bloodletting and leeches instead of with antibiotics! Vaccinations would probably
not exist and antibiotics might not have been discovered. Things like joint replace-
ments would be impossible because they require very accurate measurements.
Sometimes metrics problems do impact other industries. A huge lawsuit
occurred in Narragansett, Rhode Island when a surveyor miscalculated property
lines and a house was built that encroached on a public park by about 6 feet! This
house was not a small shack but an imposing home of over 4,000 square feet.
The park had been deeded to the city by a philanthropist and part of the deed
restrictions were that the park dimensions needed to be kept unchanged and
the park had to be kept available for free public usage. Obviously the new house
changed the park’s physical dimensions and kept citizens from using the portion of
the park under or near the new house.
This case reached the Rhode Island Supreme Court and the court ordered that
the house either be demolished completely or moved far enough to restore the park
boundaries. Note that the unfortunate owners of the new house were not at fault
because they and the builder acted in good faith and depended on the flawed sur-
vey. As might be expected, the surveyor declared bankruptcy so there was no way
for the owner to recover the costs.
Problems of this magnitude are rare in home construction, but similar problems
occur on about 35% of large software projects that are either canceled completely or
more than a year late for delivery and over budget by more than 50%.
A recommendation by the author is that every reader of this book should also
acquire and read Paul Starr's book, The Social Transformation of American Medicine
(1982, Perseus Group). This book won a well-deserved Pulitzer Prize in 1984.
Only about 150 years ago, medicine had a similar combination of poor mea-
surements and inaccurate metrics combined with mediocre professional training.
Starr’s book on how medical practice improved to reach today’s high standards is a
compelling story with many topics that are relevant to software.
The software industry is one of the largest and most wealthy industries in
human history. Software has created many multibillion dollar companies such as
Apple, Facebook, Google, and Microsoft. Software has created many millionaires
and also quite a few billionaires, such as Bill Gates, Larry Ellison, Sergey Brin,
Jeff Bezos, and Mark Zuckerberg.
However, software quality remains mediocre in 2016 and software wastage
remains alarmingly bad. (Wastage is the combination of bug repairs, cyber attacks,
canceled projects, and time spent on litigation for poor quality.) These software
problems are somewhat like smallpox and diphtheria. They can be prevented by
vaccination or successfully treated.
In order to improve software performance and reduce software wastage,
the software industry needs to eliminate inaccurate metrics that distort reality. The
software industry needs to adopt functional metrics and also capture the true and
complete costs of software development and maintenance, instead of just measuring
small fractions such as design, code, and unit test (DCUT), which include less than
30% of total software costs.
Software quality control also needs to focus on defect prevention and pretest
defect removal instead of considering testing alone. For testing itself, it is
beneficial to use formal mathematical methods for test case design, such as
cause–effect graphs and design of experiments, and to rely on certified test
personnel rather than informal testing by untrained developers.
It would be technically possible to improve software development by more than
50% and reduce software wastage by more than 90% within 10 years if there were
a rapid and effective method of technology transfer that could reach hundreds of
companies and thousands of software personnel.
However, as of 2016, there are no really effective channels that can rapidly
spread proven facts about better development methods and better software quality
control. Consider the problems with current software learning channels.
Many universities, at both the graduate and undergraduate levels, still teach
with lines of code and cost per defect metrics and hence provide disinformation
to students instead of solid facts. Functional metrics are seldom taught in
universities except in passing, and are often combined with hazardous metrics
such as lines of code because the faculty has not analyzed the problems.
Professional societies such as the Institute of Electrical and Electronics
Engineers (IEEE), the Association for Computing Machinery, the Society for
Information Management, the Project Management Institute, the International
Function Point Users Group, and so on provide valuable networks and social
services for members, but they do not provide reliable quantitative data. Also,
it would be more effective if the software professional societies followed the
lead of the American Medical Association (AMA) and provided reciprocal
memberships and better sharing of information.
The major standards for software quality and risk, such as ISO 9000/9001 and
ISO 31000, provide useful guidelines, but there are no empirical quantified data
showing that either risk or quality benefits tangibly from adhering to ISO
standards. This is also true for other standards such as IEEE and OMG.
Achieving levels 3 through 5 on the Software Engineering Institute’s capability
maturity model integrated (CMMI) does yield tangible improvements in quality.
However, the Software Engineering Institute (SEI) itself does not collect or publish
quantitative data on quality or productivity. (The Air Force gave the author a con-
tract to demonstrate the value of higher CMMI levels.)
The software journals, including refereed software journals, contain almost no
quantitative data at all. The author’s first job out of college was editing a medical
journal. About a third of the text in medical articles discusses the metrics and
measurement methods used and how data were collected and validated. Essentially
every medical article has reliable data based on accurate measures and valid metrics.
In contrast, the author has read more than a dozen refereed software articles that
used the LOC metric without even defining whether physical lines or logical state-
ments were used, and these can vary by over 500%. Some refereed software articles
did not even mention which programming languages were used, and these can vary
by over 2000%.
The author has read more than 100 refereed articles that claim it costs 100 times
as much to fix a bug after release as during development even though this is not actually
true. Compared to medical journals, refereed software journals are embarrassingly
amateurish even in 2016 when it comes to metrics, measures, and quantitative results.
Software quality companies in testing and static analysis make glowing claims
about their products but produce no facts or proven quantitative information about
actual defect removal efficiency (DRE).
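DRE has a simple standard definition: defects removed before release divided by the total of defects found before and after release (the post-release count is conventionally taken from early production use). A minimal sketch, with an assumed example count:

```python
def defect_removal_efficiency(removed_before_release: int,
                              found_after_release: int) -> float:
    """Fraction of total defects removed before delivery."""
    total = removed_before_release + found_after_release
    return removed_before_release / total

# 970 defects removed during development and 30 reported by users
# gives a DRE of 97%.
```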
Software education and training companies teach some useful specific courses,
but all of them lack an effective curriculum covering defect prevention, pretest
defect removal, effective test technologies, and the measurement of defect
potentials and DRE, which should be basic topics in all software quality
curricula.
Software quality conferences often have entertaining speakers, but suffer from
a shortage of factual information and solid quantitative data about methods to
reduce defect potentials and raise DRE.
There are some excellent published books on software quality, but only a few of
these have sold more than a few thousand copies in an industry with millions of prac-
titioners. For example, Paul Strassmann’s book on The Squandered Computer (1997)
covers software economic topics quite well. Steve Kan's book Metrics and Models
in Software Quality Engineering (2002) does an excellent job on quality metrics
and measures; Mike Harris, David Herron, and Stasia Iwanicki's book The Business
Value of IT (2008) is another solid title with software economic facts; Alain
Abran's book Software Metrics and Metrology (2010) covers functional metrics;
and Olivier Bonsignour and the author's book The Economics of Software Quality
(2012) contains quantified data on the effectiveness of various methods, tools,
and programming languages.
There are some effective software benchmark organizations that use function
point metrics for productivity and quality studies, but all of these collectively
have only a few thousand clients.
Some of these benchmark groups include the International Software Benchmarking
Standards Group (ISBSG), the Quality/Productivity Management Group, the David
Consulting Group, Software Productivity Research (SPR), TI Metricas in Brazil,
Quantitative Software Management (QSM), and Namcook Analytics LLC.
On a scale of 1 to 10, the quality of medical information is about a 9.9; the
quality of legal information is about a 9; the quality of information in the
electronic and mechanical engineering fields is also about a 9; for software in
2016, the overall quality of published information is perhaps a 2.5. In fact,
some published data that use cost per defect and lines of code have a negative
value of perhaps -5 due to the distortion of reality by these two common but
inaccurate metrics. Table 5.1 shows the comparative accuracy of measured
information for 15 technical and scientific fields.
1 Medicine 9.90
2 Astronomy 9.85
4 Physics 9.75
8 Architecture 9.20
11 Biology 8.90
Average 7.87
Wastage, poor quality, poor metrics, poor measurements, and poor technol-
ogy transfer are all endemic problems of the software industry. This is not a good
situation in a world driven by software that also has accelerating numbers of cyber
attacks and looming cyber warfare.
All of these endemic software problems of bad metrics and poor measures are
treatable problems that could be eliminated if software adopts some of the methods
used by medicine as discussed in Paul Starr’s book The Social Transformation of
American Medicine (1982).
Chapter 6
Variations in Measuring
Agile and CMMI
Development
Agile software development has become the number one software development
methodology in the United States and in more than 25 other countries. (There are
currently about 70 named software development methodologies such as waterfall,
iterative, DevOps, RUP, container development, and mashups.)
Another popular development approach, although not a true methodology, is
that of achieving the higher levels of the Software Engineering Institute (SEI) capa-
bility maturity model integrated (CMMI®).
High CMMI levels are primarily found in the defense sector, although some
civilian groups also achieve high CMMI levels. India has become famous for the
large numbers of companies with high CMMI levels.
The CMMI approach is the older of the two, having been published by the SEI
in 1987. The newer agile approach was first published in 2001.
There is now fairly solid evidence about the benefits of higher CMMI levels from
many studies. When organizations move from CMMI level 1 up to level 2, 3, 4, and 5,
their productivity and quality levels tend to improve based on samples at each level.
When they adopt the newer Team Software Process (TSP) and Personal Software
Process (PSP), also endorsed by the SEI, there is an additional boost in performance.
Unfortunately, there is not as much reliable data for agile, due in part to the
complexity of the agile process and in part to the use of nonstandard and highly
variable metrics such as story points, velocity, and burn down, which lack both
standards and formal training and hence vary by hundreds of percent.
What the CMMI provides is a solid framework of activities, much better rigor
in the areas of quality control and change management, and much better mea-
surement of progress, quality, and productivity than was previously the norm.
Measurement and collection of data for projects that use the CMMI tend to
be fairly complete. In part, this is due to the measurement criteria of the CMMI,
and in part it is due to the fact that many projects using the CMMI are contract
projects, where accurate time and expense records are required under the terms of
the contracts and needed to receive payments.
Watts Humphrey's newer TSP and PSP are also very good at collecting data.
Indeed the TSP and PSP data are among the most precise ever collected. However, the
TSP data are collected using task hours or the actual number of hours for specific tasks.
Nontask activities such as departmental meetings and training classes are excluded.
The history of the agile methods is not as clear as the history of the CMMI
because the agile methods are somewhat diverse. However, in 2001, the famous
Agile Manifesto was published. The Manifesto for Agile Software Development was
informally published by the 17 participants who attended the Agile planning ses-
sion at Snowbird, Utah in 2001. This provided the essential principles of agile devel-
opment. That being said, there are quite a few agile variations that include Extreme
Programming (XP), Crystal Development, Adaptive Software Development,
Feature-Driven Development, and several others.
Some of the principal beliefs found in the agile manifesto include the following:
The agile methods and the CMMI are all equally concerned about three of the
same fundamental problems:
However, the agile method and the CMMI approach draw apart on two other
fundamental problems:
The agile methods take a strong stand that paper documents in the form of rigorous
requirements and specifications are too slow and cumbersome to be effective.
In the agile view, daily meetings with clients are more effective than written
specifications. In the agile view, daily team meetings or Scrum sessions are the best
way of tracking progress, as opposed to written status reports. The CMMI approach
does not fully endorse this view.
The CMMI takes a strong stand that measurements of quality, productivity,
schedules, costs, and so on are a necessary adjunct to process improvement and
should be done well. In the CMMI view, without data demonstrating effective
progress, it is hard to prove whether a methodology is a success. The agile
methods do not fully endorse this view. In fact, one of the notable gaps in the
agile approach is the absence of quantitative quality or productivity data that
can prove the success of the agile methods.
Indeed, some agile derivative methods, such as pair programming, in which two
programmers share a workstation, add to costs and schedules with very little
actual benefit. The literature on pair programming is embarrassing and totally
omits topics such as inspections and static analysis that benefit solo
programmers.
Although some agile projects do measure, they often use metrics other than
function points. For example, some agile projects use story points and others may
use web-object points or running tested features (RTF). These metrics are interesting,
but lack formal training, ISO standards, and large collections of validated historical
data and therefore cannot be easily used for comparisons to older projects.
Because the CMMI approach was developed in the 1980s when the waterfall method
was common, it is not difficult to identify the major activities that are
typically performed. For an application of 1,000 function points (approximately
100,000 source code statements), the 20 activities given in Table 6.1 would be
typical using CMMI.
Using the CMMI, the entire application of 1,000 function points would have
the initial requirements gathered and analyzed, the specifications written, and
various planning documents produced before coding got underway.
By contrast, the agile methods of development would follow a different pattern.
Because the agile goal is to deliver running and usable software to clients as rapidly
as possible, the agile approach would not wait for the entire 1,000 function points
to be designed before coding started.
What would be most likely with the agile methods would be to divide the
overall project into four smaller projects, each of about 250 function points in size.
(Possibly as many as five subset projects of 200 function points might be used for a
total of 1,000 function points.) In the agile terminology, these smaller segments are
termed iterations or sometimes sprints.
These subset iterations or sprints are normally developed in a time box fashion
that ranges between perhaps two weeks and three months based on the size of the
iteration. For the example here, we can assume about two calendar months for each
iteration or sprint.
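The decomposition above can be sketched arithmetically. The sizes and the roughly two-month time box are the ones assumed in this example, not fixed agile rules:

```python
def plan_iterations(total_fp: int, fp_per_iteration: int,
                    months_per_iteration: float):
    """Split an application into fixed-size iterations (sprints)."""
    count = -(-total_fp // fp_per_iteration)   # ceiling division
    return count, count * months_per_iteration

# 1,000 function points in 250-FP iterations of about two months each
# yields four iterations; 200-FP iterations would yield five.
```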
However, in order to know what the overall general set of features would be, an agile
project would start with Iteration 0 or a general planning and requirements-gathering
session. At this session, the users and developers would scope out the likely architec-
ture of the application and then subdivide it into a number of iterations.
Also, at the end of the project when all of the iterations have been completed,
it will be necessary to test the combined iterations at the same time. Therefore, a
release phase follows the completion of the various iterations. For the release, some
additional documentation may be needed. Also, cost data and quality data need to
be consolidated for all of the iterations. A typical agile development pattern might
resemble Table 6.2.
Table 6.2 Typical Agile Development Pattern (partial)

Iteration 0
1. General overall requirements
2. Planning
4. Funding

Iterations 1–4
1. User requirements for each iteration
4. Coding
5. Testing
6. Scrum sessions
7. Iteration documentation

Release
1. Integration of all iterations

The most interesting and unique features of the agile methods are the follow-
ing: (1) the decomposition of the application into separate iterations, (2) the daily
face-to-face contact with one or more user representatives, and (3) the daily scrum
sessions to discuss the backlog of work left to be accomplished and any problems
that might slow down the progress. Another interesting feature is to create the test
cases before the code itself is written, which is a feature of XP and several other
agile variations.
Note that the author’s Software Risk Master ™ (SRM) tool has a special feature
for agile projects that aggregates all of the data from various sprints and converts the
data into a standard chart of accounts that can be used for side-by-side comparisons
with other software methodologies.
Table 6.3 illustrates this method using side-by-side comparisons of agile and
waterfall for a project of 1,000 function points.
Table 6.3 Side-by-Side Agile and Waterfall for 1,000 Function Points (from
Requirements through Delivery to Clients)

                                          Agile       Waterfall
                                          (Scrum)     (CMMI 1)
Overall Project
  Development schedule (months)           11.82       15.85
Development Activities
  Requirements effort (staff months)       7.17       15.85
Normalized Data
  IFPUG function points per month         11.85        6.31
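The kind of consolidation SRM performs can be sketched in miniature. The activity names and per-sprint staff-month figures below are hypothetical illustrations, not SRM's actual chart of accounts or client data:

```python
# Hypothetical effort data (staff months) for four sprints, consolidated
# into one chart of accounts so agile results can be compared side by side
# with waterfall or other methodologies.
sprints = [
    {"requirements": 1.8, "design": 2.0, "coding": 6.5, "testing": 4.0},
    {"requirements": 1.7, "design": 2.2, "coding": 6.8, "testing": 4.1},
    {"requirements": 1.9, "design": 2.1, "coding": 6.4, "testing": 3.9},
    {"requirements": 1.8, "design": 2.0, "coding": 6.6, "testing": 4.0},
]

def consolidate(sprints):
    """Sum each activity across all sprints into a single chart of accounts."""
    totals = {}
    for sprint in sprints:
        for activity, effort in sprint.items():
            totals[activity] = totals.get(activity, 0.0) + effort
    return totals

totals = consolidate(sprints)
total_effort = sum(totals.values())   # total staff months for the project
fp_per_month = 1000 / total_effort    # normalized productivity
```

Once sprint-level data are merged this way, the same normalized metrics (function points per staff month, work hours per function point) can be computed for any methodology.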
The SRM method of converting agile sprint data into a standard chart of
accounts is currently the only available method that can show side-by-side com-
parisons between agile and other popular methodologies such as DevOps, iterative,
Rational Unified Process (RUP), TSP, and waterfall.
Overall, agile projects tend to be somewhat faster and have a higher productivity
than waterfall projects up to about 1,000 function points in size. (The average size
of the author’s clients’ agile projects is about 270 function points.)
Above this size, agile tends to become complicated and troublesome. For large
applications in the 10,000 function point size range, the TSP and RUP methodolo-
gies are superior to both agile and waterfall development.
Although function point metrics are not common with agile projects, they do
provide significant advantages and especially for benchmark comparisons between
diverse methodologies such as agile, XP, Crystal, DevOps, TSP, RUP, and waterfall.
Chapter 7
Variations among
60 Development
Methodologies
Over and above the waterfall and Agile methodologies, there are over 60 named
software development methodologies and many hybrids that combine elements of
two or more methodologies.
In fact, new methodologies are created at a rate of about one new methodol-
ogy every eight months! This has been true for more than 40 years. These alternate
methodologies tend to have somewhat different results in terms of both productiv-
ity and quality.
For some of the very newest methods such as container development, microser-
vices, and GIT, there are not yet sufficiently accurate quantified data to include
them in this book.
The software industry does not make rational decisions about which methodology
to use based on solid empirical data. Instead, various methodologies appear and
attract followers based mainly on popularity, more or less like religious cults. Of
course, if the methodology does not provide any benefits at all, then it will lose out
when the next methodology achieves popularity. This explains the rapid rise and
fall of methodologies such as Computer Aided Software Engineering (CASE) and
Rapid Application Development (RAD).
Popularity and subjective opinions also explain the current popularity of Agile,
although in fact it does have some empirical data available as being successful on
smaller projects below 1,000 function points and 100 users. Agile is not the
optimal choice for large applications in the 10,000 function point size range,
where RUP and the Team Software Process (TSP) have better results. Table 7.1
shows the comparative results of these methodologies.
Chapter 8
Variations in Software
Programming Languages
As of 2016, there are more than 3,000 programming languages in existence and new
languages keep appearing at rates of more than one language every month! Why the
software industry has so many programming languages is a sociological mystery.
However, the existence of more than 3,000 languages is a proof that no known
language is fully useful for all sizes and types of software applications. This proof
is strengthened by the fact that a majority of applications need more than one
programming language and some have used up to 15 programming languages! An
average application circa 2016 uses at least two languages such as Java and HTML
or C# and MySQL.
Only about 50 of these thousands of languages are widely used. Many older
languages are orphans and have no working compilers and no active programmers.
The major languages circa 2016 include C dialects, Java,
Ruby, R, Python, Basic dialects, SQL, and HTML.
The influence of programming languages on productivity is inversely related
to application size. For small projects of 100 function points, coding is about
80% of total effort and therefore languages have a strong impact.
For large systems in the 10,000 function point size range, coding is only about
30% of total effort and other activities such as finding and fixing bugs and produc-
ing paper documents dilute the impact of pure coding.
Table 8.1 shows the impact of 80 programming languages for an application of
a nominal 1,000 function points in size. At that size, coding is about 50% of total
effort, whereas documents and paper work, bug repairs, and noncoding activities
comprise the other 50%.
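The dilution effect can be modeled crudely in a few lines. Assuming, for illustration, a language that halves coding effort, the project-level saving shrinks with the coding share just described (80% at 100 function points, about 50% at 1,000, and about 30% at 10,000):

```python
# Crude model: a language that halves coding effort saves less at the
# project level as the coding share of total effort shrinks with size.
def project_saving(coding_share, coding_speedup):
    # coding_share: fraction of total effort spent on coding (0..1)
    # coding_speedup: 2.0 means the new language halves coding effort
    new_total = coding_share / coding_speedup + (1 - coding_share)
    return 1 - new_total  # fraction of total project effort saved

for size, share in [(100, 0.80), (1000, 0.50), (10000, 0.30)]:
    print(size, round(project_saving(share, 2.0), 2))
# At 100 FP the project saves 40%; at 1,000 FP, 25%; at 10,000 FP, only 15%.
```

The stronger the language, the more its benefit is diluted by the noncoding work that dominates large systems.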
The data shown in Table 8.1 are the aggregate results for a complete software
project that includes requirements, design, and noncode work as well as coding and
testing. Pure coding would have much higher rates than those shown in Table 8.1,
but for overall benchmark purposes it is the work involved with complete project
development that matters.
The reasons why the software industry has more than 3,000 programming lan-
guages and why almost every application uses multiple languages is outside the
scope of this book. However, sociological factors seem to have a bigger impact
than technical factors.
The existence of 3,000 programming languages is a proof that none of them
are fully adequate, or otherwise that language would dominate the world’s software
projects. Instead we have large numbers of specialized languages that are more or
less optimized for certain kinds of applications, but not very good for other kinds
of applications.
How the impact of programming languages will be combined with the impact
of the new software nonfunctional assessment process (SNAP) metric is outside the
scope of this book and probably not well understood as of 2016.
Chapter 9
Variations in Software
Reuse from 0% to 90%
(A figure here plots function points per staff month, on a scale of 0 to 150, for
varying amounts of reusable material, including reusable architecture, reusable
design, and reusable estimates.)
Software reuse is based on the existence of common software patterns. There are
many kinds of software patterns that need effective visual representations: architecture,
design, schedule planning, quality control, cyber attacks, and many more.
Once a project’s taxonomy is firm and empirical results have been analyzed
from similar projects, another set of patterns comes into play for collecting reusable
materials (Table 9.2).
Among the patterns in Table 9.2 are the following:

4. Data patterns for the information created and used by the application
9. Source patterns for the mix of legacy, COTS, reuse, open-source, and custom features
10. Code patterns for any custom code, in order to avoid security flaws
14. Defect removal patterns for the sequence of inspections, static analysis, and test stages
18. Support patterns for contacts between customers and support teams
21. Value and ROI patterns to compare project costs to long-range value
2.  Insurance applications–property              45    90
3.  Insurance applications–life                  50    90
4.  Banking applications                         60    85
6.  Education applications–primary/secondary     30    85
7.  Wholesale applications                       60    85
9.  Retail applications                          40    80
10. Manufacturing applications                   45    75
12. Insurance applications–health                25    70
13. Education applications–university            35    70
14. Weapons systems                              20    55
15. Medical applications                         15    45
One major barrier to expanding reuse at the level of specific functions is the
fact that there are no effective taxonomies for individual features used in software
applications. Current taxonomies work on entire software applications but are not
yet applied to the specific feature sets of these applications. For example, the widely
used Excel spreadsheet application has dozens of built-in reusable functions, but
there is no good taxonomy for identifying what all of these functions do.
Obviously the commercial software industry and the open-source software
industry are providing reuse merely by selling software applications that are used
by millions of people. For example, Microsoft Windows is probably the single most
widely used application on the planet with more than a billion users in over 200
countries. The commercial and open-source software markets provide an existence
proof that software reuse is an economically viable business.
Commercial reuse is a fairly large and growing industry circa 2016. For example,
hundreds of applications use Crystal Reports. Thousands use commercial and
reusable static analysis tools, firewalls, antivirus packages, and the like. Hundreds
of major companies deploy enterprise resource planning (ERP) tools that attempt
reuse at the corporate portfolio level. Reuse is not a new technology, but neither
is it yet an industry with proper certification to eliminate bugs and security flaws
prior to deployment.
Informal reuse is common in 2016 but seldom measured and included in
software benchmarks. Among the author’s clients and measured projects, informal
reuse is about 15% of source code but less than 5% of other artifacts such as
requirements and designs.
Chapter 10
Variations due to
Project, Phase, and
Activity Measurements
Neither project level nor phase level data will be useful in exploring process
improvements, or in carrying out multiple regression analysis to discover the impact
of various tools, methods, and approaches.
Collecting data only at the level of projects and phases correlates strongly with
failed or canceled measurement programs, because the data cannot be used for
serious process research.
Historically, project-level measurements have been used most often. Phase-level
measurements have ranked second to project-level measurements in frequency of
usage. Unfortunately, phase-level measurements are inadequate for serious economic
study. Many critical activities such as user documentation or formal inspections
span multiple phases and hence tend to be invisible when data are collected at
the phase level.
Also, data collected at the levels of activities, tasks, and subtasks can easily be
rolled up to provide phase-level and project-level views. The reverse is not true: you
cannot explode project-level data or phase-level data down to the lower levels with
acceptable accuracy and precision. If you start with measurement data that are too
coarse, you will not be able to do very much with it.
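This rollup, and the impossibility of reversing it, can be illustrated with hypothetical activity-level cost data; the activities and dollar amounts below are invented for the sketch:

```python
# Hypothetical activity-level costs (dollars), keyed by (phase, activity).
# Activity data roll up cleanly to phases and to the project; a project
# total alone cannot be exploded back down to activities.
activity_costs = {
    ("requirements", "user interviews"):    30_000,
    ("requirements", "requirements docs"):  20_000,
    ("design",       "initial design"):     40_000,
    ("design",       "design inspections"): 15_000,
    ("coding",       "coding"):             90_000,
    ("coding",       "unit testing"):       25_000,
}

phase_totals = {}
for (phase, _activity), cost in activity_costs.items():
    phase_totals[phase] = phase_totals.get(phase, 0) + cost

project_total = sum(phase_totals.values())
# phase_totals -> requirements 50,000; design 55,000; coding 115,000
# project_total -> 220,000, which by itself reveals nothing about the
# inner structure of the work
```

Starting from the 220,000 figure alone, no amount of analysis can recover which activities consumed the money, which is exactly why coarse data defeats process research.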
Table 10.1 gives an illustration that can clarify the differences. Assume you are
thinking of measuring a project such as the construction of a small PBX switching
system used earlier in this book. Here are the activities that might be included at
the level of the project, phases, and activities for the chart of accounts used to collect
measurement data.
If you collect measurement cost data only to the level of a project, you will have
no idea of the inner structure of the work that went on. Therefore the data will not
give you the ability to analyze activity-based cost factors and is almost useless for
purposes of process improvement. This is one of the commonest reasons for the
failure of software measurement programs: the data are not granular enough to find
out why projects were successful or unsuccessful.
Measuring at the phase level is only slightly better. There are no standard phase
definitions nor any standards for the activities included in each phase. Worse,
activities such as project management that span every phase are not broken out
for separate cost analysis. Many activities such as quality assurance and technical
writing span multiple phases, so phase-level measurements are not effective for
process improvement work.
Measuring at the activity level does not imply that every project performs every
activity. For example, small MIS projects and client-server applications normally
perform only 9 or so of the 25 activities that are shown above. Systems software such
as operating systems and large switching systems will typically perform about 20 of
the 25 activities. Only large military and defense systems will routinely perform all
25 activities.
Here too, by measuring at the activity level, useful information becomes
available. It is obvious that one of the reasons that systems and military software
have much lower productivity rates than MIS projects is because they do many
more activities for a project of any nominal size.
Measuring at the task and subtask levels is more precise than activity-level
measurement but also much harder to accomplish. However, in recent years Watts
Humphrey’s Team Software Process (TSP) and Personal Software Process (PSP)
have started accumulating effort data to the level of tasks. This is perhaps the first
time that such detailed information has been collected on a significant sample of
software projects.
Table 10.2 illustrates what activity-based benchmark data would look like using
a large 40-activity chart of accounts normally used by the author for major systems
in the 10,000 function point size range.
As can be seen in Table 10.2, activity-based benchmarks provide an excellent
quantity of data for productivity and economic analysis of software cost struc-
tures. Only this kind of detailed benchmark information is truly useful for process
improvement studies and economic analysis.
Chapter 11
Variations in Burden
Rates or Overhead Costs
A major problem associated with software cost studies is the lack of generally
accepted accounting practices for determining the burden rate or overhead costs
that are added to basic salaries to create a metric called the fully burdened salary
rate that corporations use for determining business topics such as the charge-out
rates for cost centers. The fully burdened rate is also used for other business pur-
poses such as contracts, outsource agreements, and return on investment (ROI)
calculations.
The components of the burden rate are highly variable from company to com-
pany. Some of the costs included in burden rates can be as follows: social secu-
rity contributions, unemployment benefit contributions, various kinds of taxes,
rent on office space, utilities, security, postage, depreciation, portions of mortgage
payments on buildings, various fringe benefits (medical plans, dental plans, dis-
ability, moving and living, vacations, etc.), and sometimes the costs of indirect staff
(human resources, purchasing, mail room, etc.).
One of the major gaps in the software literature, and for that matter in the
accounting literature, is the almost total lack of international comparisons of the
typical burden rate methodologies used in various countries. So far as it can be
determined, there are no published studies that explore burden rate differences
between countries such as the United States, Canada, India, the European Union
countries, Japan, China, and so on.
Among the author’s clients, the range of burden rates runs from a low of per-
haps 15% of basic salary levels to a high of approximately 300%. In terms of dollars,
that range means that the fully-burdened charge rate for the position of senior
systems programmer in the United States can run from a low of about $15,000 per
year to a high of $350,000 per year.
Unfortunately, the software literature is almost silent on the topic of burden or
overhead rates. Indeed, many of the articles on software costs not only fail to detail
the factors included in burden rates, but often fail to even state whether the burden
rate itself was used in deriving the costs that the articles are discussing!
Table 11.1 illustrates some of the typical components of software burden rates,
and also how these components might vary between a large corporation with a
massive infrastructure and a small start-up corporation that has very few overhead
cost elements.
When the combined ranges of basic salaries and burden rates are applied to
software projects in the United States, they yield almost a 6 to 1 variance in bill-
able costs for projects where the actual number of work months or work hours are
identical!
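As a rough illustration, combining the two ends of the burden range with hypothetical U.S. salaries approaches the variance just described (the salary figures are assumptions, not measured data):

```python
# Hypothetical U.S. salaries combined with the 15% and 300% burden rates
# cited in the text; the salary figures are assumptions for illustration.
def burdened_cost(salary, burden_rate):
    """Fully burdened annual cost: base salary plus overhead."""
    return salary * (1 + burden_rate)

low = burdened_cost(60_000, 0.15)    # modest salary, lean start-up overhead
high = burdened_cost(120_000, 3.00)  # high salary, massive infrastructure
ratio = high / low                   # roughly 7 to 1, the same order as the
                                     # ~6 to 1 variance in billable costs
```

Note that nothing in this arithmetic involves the hours actually worked: the variance is entirely a product of compensation and accounting practice.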
When the salary and burden rate ranges are applied to international projects,
they yield about a 15 to 1 variance between countries such as India, Pakistan, or
Bangladesh at the low end of the spectrum, and Germany, Switzerland, or Japan on
the high end of the spectrum.
Hold in mind that this 15 to 1 range of cost variance is for projects where the
actual number of hours worked is identical.
When productivity differences are considered too, there is more than a 100
to 1 variance between the most productive projects in companies with the lowest
salaries and burden rates and the least productive projects in companies with the
highest salaries and burden rates.
Table 11.1 Components of Typical Burden Rates in Large and Small Companies
Chapter 12
Variations in Costs
by Industry
Software cost comparisons should be within the context of the same or related
industries, and should be made against organizations of similar size and located in
similar geographic areas.
Industry differences and differences in geographic regions and company sizes
are so important that cost data cannot be accepted at face value without knowing
the details of the industry, city, and company size.
Over and above differences in compensation, there are also significant differ-
ences in productivity, due in part to work hour patterns and in part to the experi-
ence and technology stacks used. Table 12.2 shows ranges among Namcook clients.
As can be seen, productivity and compensation vary widely by industry, and
also by country and by geographic region.
Average values are misleading: the overall range around the averages runs from
about 50% below average to perhaps 125% above average, based on team experience,
tools, methodologies, programming languages, and available reusable materials.
Chapter 13
Variations in Costs by
Occupation Group
Table 13.1 Variations in Compensation for 15 U.S. Software Occupation Groups
Longevity is mainly a factor for unionized positions, which are rare
for software in the United States, but common in Europe and Australia.
This factor can add about another plus or minus $7,500 to the ranges of
compensation for technical positions, and even more for executive and managerial
positions. (Indeed, multimillion dollar executive bonuses have been noted. Whether
these huge bonuses are justified or not is outside the scope of this book.)
Also not illustrated are the bonus programs and stock equity programs that
many companies offer to software technical employees and to managers. For exam-
ple, the stock equity program at Microsoft has become famous for creating more
millionaires than any similar program in the U.S. industry.
Chapter 14
Variations in Work Habits
and Unpaid Overtime
Table 14.1 Global and Industry Variations in Software Work Hours 2016
(Namcook data: software work hours per month, unpaid overtime per month,
total hours per month, and percentage of U.S. total hours per month, by
country and by U.S. industry segment)
Table 14.2 shows one project under two work patterns. The first version assumes
that only normal work hours each month are worked. The second version shows
the same project in crunch mode
where the work hours comprise 110%, with all of the extra hours being in the form
of unpaid overtime by the software team.
As exempt software personnel are normally paid on a monthly basis rather than
on an hourly basis, the differences in apparent results between normal and intense
work patterns are both significant and also tricky when performing software eco-
nomic analyses.
As can be seen in Table 14.2, applying intense work pressure to a software proj-
ect in the form of unpaid overtime can produce significant and visible reductions
in software costs and software schedules. (However, there may also be invisible and
harmful results in terms of staff fatigue and burnout.)
Table 14.2 introduces five terms that are significant in software measurement
and also cost estimation, but which need a definition.
The first term is assignment scope (abbreviated to Ascope), which is the quantity
of function points (FP) normally assigned to one staff member.
The second term is production rate (abbreviated to Prate), which is the monthly
rate in function points at which the work will be performed.
The third term is nominal production rate (abbreviated to Nominal Prate in FP),
which is the rate of monthly progress measured in function points without any
unpaid overtime being applied.
The fourth term is virtual production rate (abbreviated to Virtual Prate in FP),
which is the apparent rate of monthly productivity in function points that will
result when unpaid overtime is applied to the project or activity.
The fifth term is work hours per function point, which simply accumulates the
total number of work hours expended and divides that amount by the function
point total of the application.
As software staff members are paid monthly but work hourly, the most visible
impact of unpaid overtime is to decouple productivity measured in work hours per
function point from productivity measured in function points per staff month.
Assume that a small 60 function point project would normally require two
calendar months or 320 work hours to complete. Now assume that the program-
mer assigned worked double shifts and finished the project in one calendar month,
although 320 hours were still needed.
If the project had been a normal one stretched over two months, the productivity
rate would have been 30 function points per staff month and 5.33 work hours
per function point. By applying unpaid overtime to the work and finishing in one
month, the virtual productivity rate appears to be 60 function points per staff
month, but the actual cost remains 5.33 work hours per function point.
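The arithmetic of this example can be reproduced directly, using the five terms defined above:

```python
# Worked example from the text: 60 function points and 320 total work hours.
function_points = 60
work_hours = 320

nominal_prate = function_points / 2   # two normal months: 30 FP per staff month
virtual_prate = function_points / 1   # one crunch month:  60 FP per staff month
hours_per_fp = work_hours / function_points  # identical either way: ~5.33
```

The function points per staff month figure doubles under crunch conditions while work hours per function point does not move, which is precisely the decoupling the text describes.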
Variations in work patterns are extremely significant when dealing
with international software projects. There are major national differences in terms
of work hours per week, quantities of unpaid overtime, numbers of annual holi-
days, and annual vacation periods.
Chapter 15
Variations in Functional
and Nonfunctional
Requirements
When building a home, the various components are estimated separately for
the foundations, support beams, flooring, plumbing, electrical wiring, heating and
air-conditioning (AC), roofing, and so on. But cost estimates for homes also aggre-
gate these disparate costs into cost per square foot, a metric that contractors, home
owners, taxing authorities, and mortgage specialists all know and understand.
Some of these home costs are based on user requirements, that is, quality of
cabinets, roofing materials, window design, and so forth. Other costs are based on
nonfunctional mandates by state and local governments.
For example, a home in Rhode Island within 1,000 yards of the ocean has to
have hurricane-proof windows. A home within 100 yards of an aquifer has to have
a special septic system. These are not things that the users want because they are
expensive, but they have to be used due to nonfunctional state requirements.
Software also has many disparate activities—requirements, architecture, design,
coding, testing, integration, configuration control, quality assurance, technical
writing, project management, and so on. Here too, they can be estimated sepa-
rately with their own units of measure but should also be consolidated into a single
metric.
From 1978 until 2012, the metric of cost per function point was used to aggre-
gate as many as 60 different kinds of software work into a total cost of ownership
(TCO) that used function points for data normalization.
In 2012, International Function Point Users Group (IFPUG) introduced a new
metric called SNAP. This is an acronym for software nonfunctional assessment process.
Government projects are often more expensive than civilian projects of the
same size, in part because of higher volumes of nonfunctional requirements. They are
also more expensive because of elaborate procurement and governance procedures
that generate huge volumes of paper documents and expensive status reporting.
In fact, defense projects are almost the only known software projects to use
independent verification and validation (IV and V) and independent testing. This
alone makes defense projects at least 5% more expensive than civilian projects of the
same size in function points.
If a typical civilian software project of 1,000 function points requires 100
staff months, a similar state government project of 1,000 function points would
probably require 110 staff months and a similar military software project of 1,000
function points might require 125 staff months.
The nonfunctional requirements compared to function points would probably
be 15% for the civilian project, 20% for the state government project, and 30% for
the military project.
Table 15.1 shows approximate SNAP points and also a percentage of SNAP
points compared to function points as predicted by SRM.
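These ratios can be sketched as a small calculation; the figures come from the paragraphs above, but the formulas are a simplification, not SRM's actual estimating algorithms:

```python
# Rough sketch of the government vs. civilian comparisons above for a
# nominal 1,000 function point application.
function_points = 1000

sectors = {            # sector: (staff months, SNAP points as a fraction of FP)
    "civilian":         (100, 0.15),
    "state government": (110, 0.20),
    "military":         (125, 0.30),
}

for sector, (staff_months, snap_fraction) in sectors.items():
    snap_points = function_points * snap_fraction
    extra_effort = staff_months / sectors["civilian"][0] - 1
    print(f"{sector}: {snap_points:.0f} SNAP points, "
          f"{extra_effort:.0%} more effort than civilian")
# civilian: 150 SNAP points, 0% more effort than civilian
# state government: 200 SNAP points, 10% more effort than civilian
# military: 300 SNAP points, 25% more effort than civilian
```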
Chapter 16
Variations in Software
Quality Results
This chapter is basically a catalog of the known problems with software quality and
with software quality measurements and metrics. The overall quality observations
are based on the author’s lifetime collection of quality data from 1976 through
2016, containing about 26,000 software projects.
It is sad to report that recent software quality data collected since about 2010
are not a great deal better than older data from the 1980s. Newer programming
languages have reduced code defects. Static analysis has raised defect removal effi-
ciency, but because the bulk of software work circa 2016 is in renovation of legacy
software, neither better languages nor static analysis have wide usage in legacy
repairs.
Indeed most static analysis tools do not support some of the older languages such
as MUMPS, ALGOL 68, and JOVIAL, and so have no role in legacy renovation.
New development projects with quality-strong methods such as Team Software
Process (TSP) and high-level programming languages such as C# or Objective-C
are better than older projects of the same size.
Some companies are quite good at both quality control and quality measures.
The author is fortunate to have started collecting quality data at IBM circa 1970,
where quality measurements and quality control methods were taken seriously by
corporate executives.
The author was also fortunate to have worked at ITT, where the famous Phil
Crosby wrote Quality Is Free, and ITT executives also took quality seriously.
The two corporate chairmen, Thomas J. Watson Jr. of IBM and Harold Geneen
of ITT, were willing to spend significant corporate funds to improve quality
and quality measures and thereby improve overall software performance.
(A table here lists percentages for quality approaches: prototypes 20.00%,
models 25.00%, CMMI 3 15.00%, and CMMI 5 30.00%.)
Notes:
DRE goes up with team experience and high CMMI levels.
DRE goes up with quality-strong methods such as TSP and RUP.
DRE is inversely proportional to cyclomatic complexity.
Most forms of testing are <35% in DRE.
Inspections and static analysis have the highest nominal DRE levels.
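Defect removal efficiency (DRE) itself is conventionally computed as the percentage of total defects removed before release, with user-reported defects in the first 90 days of use serving as the post-release count. A minimal sketch with hypothetical defect counts:

```python
# DRE = defects removed before release, as a percentage of all defects
# eventually found (pre-release removals plus user reports in the first
# 90 days of use).
def dre(removed_before_release, found_after_release):
    total = removed_before_release + found_after_release
    return 100.0 * removed_before_release / total

# Hypothetical project: 950 defects removed during development and
# 50 reported by users in the first 90 days.
print(dre(950, 50))  # 95.0
```

Each prevention or removal stage in the notes above contributes its own percentage toward this cumulative figure.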
No company or project performs all of these quality control steps. But overall
quality goes up with a synergistic combination of defect prevention, pretest defect
removal, and formal testing using mathematical test case design and certified test
personnel. Quality goes down when pretest removal is skimped, or with informal
testing by untrained amateur development personnel.
Most Fortune 500 companies spend close to half a billion dollars per year fixing
bugs that could have been prevented or eliminated prior to software deployment
with better quality control. Unfortunately, defect removal costs are underreported
in many benchmark studies due to lack of specific defect counts and costs for desk
checking, static analysis, and unit testing, all of which are commonly excluded
from defect reports and not broken out in cost data.
The unfortunate cost per defect metric, distorts reality and penalizes quality.
It gives the false appearance that defect removal costs go up as a development cycle
proceeds. In fact, defect removal costs are fairly flat. Defect removal cost per func-
tion point is a much better and more reliable way to show the true value of high
quality.
Time spent finding and fixing bugs is the number one software cost driver at
about 40% of total effort. Time spent on canceled projects is variable from company
to company, but the U.S. average is about 8%.
Time spent on cyber attacks is unfortunately increasing and is about 11% in
2016. Time spent on litigation can be large for companies with active lawsuits, but
the overall industry total is <5%.
the original problem and added no new problems. Is it any wonder that there was
a lawsuit with four bad fixes in a row for an expensive high-end financial software
package?
Using static analysis on the first-bug repair would probably have eliminated the
original bad fix, and thereby eliminated the need for a lawsuit as well. In fact, static
analysis or inspections would probably have eliminated the original problem so bad
fixes would have been moot.
Table 16.4 Results of Poor Quality and High Quality Software (Nominal
10,000 Function Points; 1,500 SNAP Points)
[Most of the table body was not recovered; only four subtotal rows survive, comparing Poor Quality against High Quality: 1 versus 7, 1 versus 10, 9 versus 11, and 1 versus 9.]
The curricula of the major software quality training companies are embarrassing
because of the gaps and omissions in the topics covered. Even worse, not
a single quality education company has actual quantified data on software defect
origins, defect densities, defect prevention, or DRE levels.
You would have to go back almost 200 years in medical education to find
knowledge of the basic topics needed to train physicians as skimpy as what is
available for training software quality and test personnel in 2016.
You might take quality courses from companies such as Construx, CAST, IBM,
ITMPI, SQE, QAI, the SANS Institute, Parasoft, Smart Bear, and probably from
other local educators, but these would probably be single-topic courses such as
static analysis or automated testing.
Worse, these courses even from major quality companies would lack quantitative
data on defect potentials, DRE, bad-fix injections, EPR, or any other of the critical
topics that quality professionals should know about. The software industry is run-
ning blind due to a widespread lack of quantitative quality data.
Quality data are available from benchmark organizations such as Davids
Consulting, Gartner Group, Namcook Analytics LLC, TIMetricas, Q/P
Management Group/QSM, and several others. But the combined set of clients for
all current quality benchmark organizations is less than 50,000 customers in an
industry employing close to 20,000,000 people on a global basis.
Quality data can be predicted by some parametric software estimation tools such
as Software Risk Master (SRM), KnowledgePlan, SEER, SLIM, and COCOMO,
but the combined market for all of these parametric tools is less than 25,000
customers in an industry employing almost 20,000,000 people on a global basis.
In other words, even companies that offer accurate quality data have com-
paratively few clients who are interested in that data, even though it could save
Nobody in 2016 sells all the tools that are needed to control software quality!
Most quality tool vendors do not even know about effective quality tools other
than the ones they sell. There are no software quality companies in 2016 that have
the depth and breadth of medical companies such as McKesson or Johnson &
Johnson.
The static analysis companies sell only static analysis, and the testing companies
sell only test tools. To get quality metrics and measurement tools you need
additional vendors; to get ordinary defect-tracking tools you need still other vendors;
to get quality benchmark data you need another set of vendors; and to get software
quality predictions via commercial estimating tools you need yet another set of vendors.
No known company as of 2016 covers the full spectrum of software quality
tools, technologies, topics, and effective quality methods, although a few large
companies such as IBM, Microsoft, and Hewlett Packard may sell perhaps 12 to
15 software quality tools out of the set of 56.
Of course no pharmaceutical company sells medicines for all diseases and no
physicians can treat all medical conditions, but physicians at least learn about
almost all common medical conditions as a basic part of their education. There are
also specialists who can deal with uncommon medical conditions.
Medicine has the Index Medicus that provides an overall description of the use
of thousands of prescription drugs, their side effects, and dosage. There is no exact
equivalent to the Index Medicus for software bugs and their treatment, but the
closest is probably Capers Jones’ and Olivier Bonsignour’s book on The Economics
of Software Quality published in 2012.
Medicine also has many wide-ranging books such as Control of Communicable
Diseases in Man, published by the U.S. Surgeon General’s office, which show the
scope of common infectious diseases such as polio and smallpox as well as their
known treatments. Software has nothing like the breadth and depth of the medical
literature.
A book recommended by the author to all clients and colleagues is The Social
Transformation of American Medicine (1982) by Paul Starr, which won both the
Pulitzer Prize and the Bancroft Prize in 1984. It provides an excellent guide to
how medicine was transformed from a poorly educated craft into one of the top
learned professions in world history.
Surprisingly, about 150 years ago medicine was even more chaotic
than software is today. Medical schools did not require college degrees or even
high-school graduation to enter. Medical students never entered hospitals during
training because the hospitals used private medical staff. There was no monitoring
of medical malpractice, and quacks could become physicians. There were no medi-
cal licenses or board certifications.
There was no formal evaluation of prescription drugs before release, and harm-
ful substances such as opium could be freely prescribed. (A Sears–Roebuck catalog
in the 1890s offered liquid opium as a balm for quieting noisy children. This
product was available without prescription.)
1. Software has poor quality control due to lack of knowledge of effective soft-
ware quality techniques.
2. Software has embarrassingly poor education on software quality due to lack
of empirical data.
3. Software has embarrassingly bad and incomplete quality data due to the use
of ineffective and hazardous metrics such as cost per defect, combined with the
failure to use effective metrics such as function points and DRE.
The sad thing about poor software quality is that all three of these problems are
treatable conditions that could be eliminated in less than 10 years if one or more
major software companies became proactive in (1) effective metrics and measures,
(2) fact-based education with quantitative data, and (3) expanded quality control
that encompasses effective quality measures, effective defect prevention, effective
pretest defect removal, and effective formal testing.
Defect potentials and DRE data were based on the large IBM collection of
quality data. However, once the quality data had been analyzed, it was then possible
to predict defect potentials and DRE for future projects via parametric estimation
tools. In fact, the author developed IBM’s first quality and cost prediction tool in
about 1973 with the assistance of Dr. Charles Turk. All of the author’s commercial
estimation tools have predicted defect potentials and DRE. This adds to overall
cost estimate accuracy because the cost of finding and fixing bugs is the number
one software cost driver.
Knowledge of defect potentials and probable DRE values can lead to accurate
cost and schedule estimates that often come to within 5% of reality. (The number
two cost driver for large software projects and for all military software projects
is that of producing paper documents. However, paperwork costs are outside the
scope of the present report, although the author’s commercial estimating tools all
predict paperwork costs.)
Function points were also developed by IBM in the 1970s in part to measure
defect potentials better than lines of code (LOC) could. For that matter, function
points are also good at measuring and predicting paperwork costs, which cannot
be done with LOC metrics.
DRE is calculated by measuring all bugs found internally prior to release, and
then measuring all bugs reported by customers in the first 90 days after release. If
the development team found 950 bugs in a software application and users reported
50 bugs, then DRE is 95%. As can be seen, this is a very simple metric and easy to
calculate and understand.
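As a minimal sketch (the function name is illustrative, not taken from any published tool), the DRE calculation described above can be written as:

```python
def defect_removal_efficiency(internal_bugs: int, released_bugs: int) -> float:
    """DRE = bugs found before release divided by total bugs found,
    where total = internal bugs plus bugs reported by customers in
    the first 90 days after release, expressed as a percentage."""
    total = internal_bugs + released_bugs
    if total == 0:
        return 0.0  # no defects found anywhere; DRE is undefined
    return 100.0 * internal_bugs / total

# The example from the text: 950 internal bugs, 50 user-reported bugs.
print(defect_removal_efficiency(950, 50))  # 95.0
```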
tester had no other assignments, he or she would still have worked a 40-hour week
and the costs would have been $2,500. The cost per defect would be infinite because
zero defects were found. Cost per function point for zero-defect software
would be $25.00.
If the 15 hours of slack time are backed out, leaving 25 hours for actual testing,
the costs would have been $1,893.75. With slack time removed, the cost per func-
tion point would be $18.38 for zero-defect software.
Time and motion studies of defect repairs do not support the aphorism that "it
costs 100 times as much to fix a bug after release as before." Bugs typically require
between 15 minutes and 6 hours to repair regardless of where they are found. Cost
per defect penalizes quality and makes testing seem more and more expensive
as fewer and fewer bugs are found. Defect removal cost per function point is a
much better economic metric for studying cost of quality (COQ).
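The distortion can be demonstrated with a few lines of arithmetic. The stage names, hours, and defect counts below are invented for illustration (they are not the values in Table 16.7); only the $75.75 hourly rate and the 1,000 function point size come from the text:

```python
STAFF_HOUR = 75.75        # fully burdened cost per staff hour (from the text)
FUNCTION_POINTS = 1000    # nominal application size (from the text)

# (stage, fixed hours writing/running tests, defects found, repair hours per defect)
stages = [
    ("Unit test",     200, 100, 2.0),
    ("Function test", 200,  50, 2.0),
    ("System test",   200,  10, 2.0),
]

results = []
for stage, test_hours, defects, repair_hours in stages:
    total_cost = (test_hours + defects * repair_hours) * STAFF_HOUR
    cost_per_defect = total_cost / defects if defects else float("inf")
    cost_per_fp = total_cost / FUNCTION_POINTS
    results.append((stage, cost_per_defect, cost_per_fp))
    print(f"{stage}: ${cost_per_defect:,.2f} per defect, ${cost_per_fp:,.2f} per FP")
```

Because the fixed cost of writing and running test cases is spread over fewer and fewer defects, cost per defect rises steeply in the later stages even though total spending falls; cost per function point falls, reflecting the true economics.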
Table 16.7 illustrates a side-by-side comparison of cost per defect and defect
removal cost per function point for an application of 1000 function points.
As can be seen, the reduction in defects in the later test stages combined with
the fixed costs of writing and running tests cause the cost per defect to rise steeply.
Defect removal cost per function point is much better for economic analysis of
COQ.
Software quality measures are so poor in 2016 that most companies do not
record the separate expense elements of writing test cases, running test cases, and
fixing defects. All three are simply lumped together, which of course makes late
defects seem much more expensive than early defects.
Table 16.7 Cost per Defect versus Cost per Function Point (Assumes $75.75
per Staff Hour for Costs; 1,000 Function Points)
[Columns: Total Costs; Number of Defects; $ per Defect; $ per FP. Data rows not recovered.]
Table 16.8 Software Quality and the SEI Capability Maturity Model
Integrated (CMMI) for 2,500 Function Points
[Columns: CMMI Level; Defect Potential per Function Point; Defect Removal Efficiency (%); Delivered Defects per Function Point; Delivered Defects. Data rows not recovered.]
For the United States as a whole, and especially for large software systems in the
10,000 function point size range, finding and fixing bugs is the number one cost
driver. The major U.S. software cost drivers circa 2016 are shown in Table 16.9.
As long as software is constructed from custom designs and manual code, it
will always be error prone and expensive. The ultimate goal of software engineering
should be to construct applications from suites of standard reusable components
that have been certified to zero-defect quality levels. If this occurs, then the future
cost drivers for software circa 2026 will have a much more cost-effective pattern
than those of 2016, as shown in Table 16.10.
A combination of increased volumes of certified reusable materials combined
with effective software quality control and better quality measures and metrics
should offer the following attractive economic advantages:
[Table 16.9 fragment, U.S. Software Cost Drivers in Rank Order for 2016. Only the final row survives: 16, the costs of litigation for software failures and disasters. Summary lines: Average $ per function point = $1,250 dev; $1,400 maint; $2,650 TCO. Cyber attack $ per function point = $75 prevention; $450 recovery; $525 TCA.]
[Table 16.10 fragment, Best Case Future U.S. Cost Drivers for 2026. Only the first row survives: 1, the costs of innovation and new kinds of software.]
Software quality gets worse as application size goes up. Defect potentials
increase and DRE goes down as shown in Table 16.11.
As can be seen, software defect potentials get larger with application size,
whereas DRE declines. It is technically possible to lower large system defect poten-
tials and to raise large system defect removal efficiency, but in fact most companies
circa 2016 do not know enough about software quality control to do this.
It is sometimes asked why, if small software projects have better quality than
large systems, it would not be best to decompose large systems into numerous small
programs. Unfortunately, decomposing a large system into small pieces would be
like trying to decompose an 80,000 ton cruise ship into a set of 5,000 small boats.
You do not get the same features.
Moving from custom designs and manual coding to construction from libraries
of standard reusable components would reduce large application defect potentials.
Expanded use of quality function deployment (QFD) would also be of use for large
systems to lower defect potentials.
Expanded use of static analysis, inspections, requirement models, and auto-
mated proofs would raise DRE for large systems to over 97% even for 100,000
function points.
Software has become one of the most important industries in human history.
In spite of the success of software and the way it has transformed industry, military,
and government operations, it could be much more effective if software quality had
better metrics and measurement practices, and used state-of-the-art defect removal
techniques such as inspections and static analysis as well as testing. Software would
also benefit from moving away from custom design and manual coding to
construction from a suite of certified reusable components.
Variations in Pattern-Based Early Sizing
The sizing method used by the author in the Software Risk Master (SRM) estima-
tion tool is novel for software but often used by other industries. The method is
based on pattern matching using a formal taxonomy.
The unique Namcook pattern-matching approach is based on the same method-
ology as the well-known Trulia and Zillow databases for real-estate costs and also
the Kelley Blue Book for used automobile prices. It is also used by Vision Appraisal
for municipal property tax analysis.
With the real-estate databases home buyers can find the costs, taxes, and other
information for all listed homes in all U.S. cities. They can specify patterns for
searching such as home size, lot size, number of rooms, and so on.
The main taxonomy topics used for software pattern matching in the Namcook
SRM tool are shown in Table 17.1.
As the SRM pattern-matching approach is fast, it is used to size not only
International Function Point Users Group (IFPUG) function points and software
nonfunctional assessment process (SNAP) points but also a total of 23 software
size metrics (Table 17.2).
The SRM sizing method also predicts the sizes of many other subcategories of
software deliverables including
Table 17.1 Taxonomy Patterns for Application Sizing and Risk Analysis
1. Country where the software will be developed (China, United States,
Sweden, and so on)
3. Region or city where the software will be developed (big cities are expensive)
4. Work hours per month for the software team (varies by country and
industry)
9. Local average team monthly salary and burden rates, overtime premiums
12. Hardware platform(s) (smart phone, PC, mainframe, server, and so on)
13. Operating system(s): (Android, Linux, Unix, IOS, IBM, Windows, and so on)
14. Development methodologies that will be used (Agile, RUP, TSP, and so on)
17. Programming language(s) that will be used (C#, C++, Java, SQL, and so on)
20. Class of the project (internal use, open-source, commercial, and so on)
21. Type of the project (embedded, web application, client–server, and so on)
22. Reusable material percent available for the project (design, code, tests,
and so on)
[List fragment (context lost): 2. Automated code-based; 3. Automated UML-based.]
◾ Feature growth during development and for three years after deployment.
◾ Postrelease incidents and bug reports for three years.
For the basic SRM taxonomy, all of the topics are usually known well before require-
ments. All of the questions are multiple choice questions except for start date and
compensation and burden rates. Default cost values are provided for situations where
such cost information is not known or is proprietary. This might occur if multiple
contractors are bidding on a project and they all have different cost structures.
The answers to the multiple choice questions form a pattern that is then
compared against a Namcook knowledge base of more than 26,000 software
projects. As with the real-estate databases, software projects that have identical
patterns usually have about the same size and similar results in terms of sched-
ules, staffing, risks, and effort.
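The lookup itself can be sketched in a few lines. Everything here is hypothetical: the field names, the tiny in-memory knowledge base, and the use of a plain median. The real SRM taxonomy has many more fields and a knowledge base of more than 26,000 projects:

```python
from statistics import median

# Hypothetical knowledge base: taxonomy pattern -> observed sizes in function points.
KNOWLEDGE_BASE = {
    ("United States", "finance", "client-server", "new system"): [9500, 10200, 10400],
    ("United States", "retail", "web application", "new system"): [1100, 1250],
}

def size_by_pattern(country, industry, app_type, scope):
    """Return the median function point size of stored projects whose
    taxonomy pattern matches exactly, or None when there is no match."""
    sizes = KNOWLEDGE_BASE.get((country, industry, app_type, scope))
    return median(sizes) if sizes else None

print(size_by_pattern("United States", "finance", "client-server", "new system"))  # 10200
```

The design choice mirrors the real-estate analogy: projects sharing an identical taxonomy pattern are treated as comparable, so a new project is sized from the historical distribution of its pattern rather than from its requirements documents.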
Sizing via pattern matching can be used prior to requirements and therefore
perhaps six months earlier than most other sizing methods. The method is also very
quick and usually takes less than 5 minutes per project. With experience, the time
required can drop down to less than 2 minutes per project.
The pattern-matching approach is very useful for large applications >10,000
function points where manual sizing might take weeks or even months. With
pattern matching, the actual size of the application does not affect the speed of
the result and even massive applications in excess of 100,000 function points can
be sized in a few minutes or less.
SNAP is an emerging metric, but not yet old enough to have accumulated
substantial empirical data on the factors that cause SNAP values to go up or come
down. Function point metrics, on the other hand, have over 40 years of accumu-
lated historical data and formal benchmarks for over 75,000 projects.
In comparing one software project against another, it is important to know
exactly what kinds of software applications are being compared. This is not as easy
as it sounds. The industry lacks a standard taxonomy of software projects that can
be used to identify projects in a clear and unambiguous fashion.
Since 1984 the author has been using a multipart taxonomy for classifying
projects. The major segments of this taxonomy include nature, scope, class, type,
and complexity.
(Note that this taxonomy was originally developed for measuring historical proj-
ects. However, from analysis of the data collected, it was noted that applications with
the same patterns on the taxonomy were of about the same size and had similar cost
and schedule distributions. This discovery led to the current sizing method based
on pattern matching embedded in the author’s Software Risk Master™ (SRM) tool.)
Following are samples of the basic definitions of the author’s taxonomy as used
in the SRM tool:
PROJECT NATURE:
PROJECT SCOPE:
1. Subroutine
2. Module
3. Reusable module
4. Disposable prototype
5. Evolutionary prototype
6. Standalone program
7. Component of a system
8. Release of a system (other than the initial release)
9. New system (initial release)
10. Enterprise system (linked integrated systems)
PROJECT CLASS:
PROJECT TYPE:
PROBLEM COMPLEXITY:
CODE COMPLEXITY:
DATA COMPLEXITY:
In addition, the author also uses codes for countries (telephone codes work for this
purpose as do ISO country codes), for industries (Department of Commerce North
American Industry Classifications [NAIC codes are used]), and geographic region
(Census Bureau state codes work in the United States. Five-digit zip codes or tele-
phone area codes could also be used.).
Why such a complex multilayer taxonomy is necessary can be demonstrated
by a thought experiment of comparing the productivity rates of two unlike appli-
cations with widely differing taxonomy patterns. Suppose the two applications
have the following taxonomy aspects as shown in Table 17.3.
As shown by Table 17.3, the productivity rate of Application A would probably
be 25.00 function points per staff month. The productivity rate of Application B
would probably be only 5.00 function points per staff month.
The total amount of effort devoted to Project B exceeded the effort devoted to
Project A by more than 400 to 1.
The cost per function point for Application A was only $400 per function
point, whereas the cost per function point for Application B is $2,000 per func-
tion point.
These two examples are from the same country, and same geographic region,
but from different industry segments and for very different types and sizes of
applications.
If one of the projects were done in China or India, the ranges would be even
broader by another 200% or so. If a high-cost country such as Switzerland were one
of the locations, the costs would swing upward.
Does that mean that the technologies, tools, or skills on Project A are superior
to those used on Project B? It does not—it simply means two very different kinds of
software projects are being compared, and great caution must be used to keep from
drawing incorrect conclusions.
In particular, software tool and methodology vendors should exercise more
care when developing their marketing claims, many of which appear to be derived
exclusively from comparisons of unlike projects in terms of the nature, scope, class,
type, and size parameters.
Several times vendors have made claims of 10 to 1 improvements in productivity
rates as a result of using their tools. Invariably these claims are false and are based
on comparisons of unlike projects. The most common error is that of comparing
a very small project against a very large system. Another common error is to compare
only a subset of activities such as coding against a complete project measured from
requirements to delivery.
[Table 17.3 fragment: defect potentials of 2.50 versus 5.50 bugs per function point; security flaws of 0 versus 53.]
Only in the past week, the author received a question from a webinar attendee
about object-oriented productivity rates. The questioner said that his group using
object-oriented (OO) programming languages was more than 10 times as productive
as the data cited in the webinar. But the data the questioner was using covered
only code development! The data in the webinar ran from requirements to delivery
and also included project management. It was an apples-to-oranges comparison of
coding alone versus a full set of activities and occupation groups.
Unfortunately, not using a standard taxonomy and not clearly identifying the
activities that are included in the data are the norm for software measurements
circa 2016.
Chapter 18
Gaps and Errors in When Projects Start. When Do They End?
When a major software project starts is the single most ambiguous point in the
entire life cycle. For many projects, there can be weeks or even months of informal
discussions and preliminary requirements gathering before it is decided that the
application looks feasible.
(If the application does not look feasible and no project results, there may still
be substantial resources expended that it would be interesting to know about.)
Even when it is decided to go forward with the project that does not automati-
cally imply that the decision was reached on a particular date, which can be used to
mark the commencement of billable or formal work.
So far as can be determined, there are no standards or even any reasonable
guidelines for determining the exact starting points of software projects. The
methodology used by the author to determine project starting dates is admittedly
crude: I ask the senior project manager for his or her opinion as to when the proj-
ect began, and utilize that point unless there is a better source.
Sometimes a formal request for proposal (RFP) exists, and also the responses to
the request. For contract projects, the date of the signing of the contract may serve
as the starting point. However, for the great majority of systems and MIS applica-
tions, the exact starting point is clouded in uncertainty and ambiguity.
Although the end date of a software project is less ambiguous than the start
date, there are still many variances. The author uses the delivery of the software
to the first actual customer as the termination date or the end of the project.
Although this works for commercial packages such as Microsoft Office, it does
not work for major applications such as ERP packages where the software cannot
be used on the day of delivery, but instead has to be installed and customized for
each specific client. In this situation, the actual end of the project would be the
date of the initial usage of the application for business purposes. In the case of
major ERP packages such as SAP and Oracle, successfully using the software can
be more than six calendar months after the software was delivered and installed
on the customer’s computers due to extensive customization and migration from
legacy applications.
The bottom line is that the software industry as of 2016 does not have any
standard, unambiguous method for measuring either the start or end
dates of major software projects. As a result, historical data on schedules are highly
suspect.
The nominal start point for software projects suggested by the author is the day
that formal requirements begin. The nominal end point suggested by the author
for software projects is the day the first actual client receives the final version (not
the Beta test version but the actual final version).
Also confusing are gaps and errors caused by overlapping schedules of various
development activities.
As software project schedules are among the most critical of all software
project factors, one might think that methods for measuring schedules would
be fully matured after some 60 years of trial and error. There is certainly no
shortage of project scheduling tools, many of which can record historical data
as well as plan unfinished projects. For example, Microsoft Project, Computer
Aid’s Automated Project Office (APO), and the Jira tool can also measure soft-
ware schedules.
However, the measurement of original schedules, slippages to those schedules,
milestone completions, missed milestones, and the overlaps among partially con-
current activities is still a difficult and ambiguous undertaking.
One of the fundamental problems is the tendency of software projects to keep
only the current values of schedule information, rather than a continuous record
of events. For example, suppose the design of a project was scheduled to begin
in January and end in June. In May it becomes apparent that June is unlikely, so
July becomes the new target. In June, new requirements are levied, so the design
stretches to August when it is nominally complete. Unfortunately, the original date
(June in this example) and the intermediate date (July in this example) are often
lost. Each time the plan of record is updated, the new date replaces the former date,
which then disappears from view.
It would be very useful and highly desirable to keep track of each change to a
schedule, why the schedule changed, and what were the prior schedules for com-
pleting the same event or milestone.
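One way to preserve that history is an append-only log per milestone, so that earlier planned dates are never overwritten. The following is a sketch with invented field and class names, using the design example above (June slipping to July, then to August):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class MilestoneHistory:
    """Append-only record of every planned completion date for one milestone."""
    name: str
    entries: list = field(default_factory=list)  # (date recorded, planned end, reason)

    def replan(self, recorded: date, planned_end: date, reason: str) -> None:
        self.entries.append((recorded, planned_end, reason))  # never overwrite history

    def current_plan(self) -> date:
        return self.entries[-1][1]

    def total_slip_months(self) -> int:
        first, last = self.entries[0][1], self.entries[-1][1]
        return (last.year - first.year) * 12 + (last.month - first.month)

design = MilestoneHistory("Design complete")
design.replan(date(2016, 1, 15), date(2016, 6, 30), "original plan of record")
design.replan(date(2016, 5, 10), date(2016, 7, 31), "design running late")
design.replan(date(2016, 6, 20), date(2016, 8, 31), "new requirements levied")
print(design.total_slip_months())  # 2
```

With a structure like this, the original date, every intermediate date, and the reason for each change remain available for later statistical analysis instead of disappearing each time the plan of record is updated.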
Another ambiguity of software measurement is the lack of any universal agree-
ment as to what constitutes the major milestones of software projects. A large
Gaps and Errors in When Projects Start. When Do They End? ◾ 159
Requirements ******
Design *********
Coding *********
Testing *********
Documentation *******
Project management *************************
Calendar months 1 2 3 4 5 6 7 8 9 10 11 12
Simplistic total schedule measurements along the lines of The project started in
August of 2010 and was finished in May of 2015 are essentially worthless for serious
productivity and economic studies.
Accurate and effective schedule measurement would include the schedules of
specific activities and tasks, and also include the network of concurrent and over-
lapped activities. Further, any changes to the original schedules would be recorded
too. Table 19.1 illustrates a typical schedule slippage pattern for an application of a
nominal 1,000 function points coded in Java.
As can be seen in Table 19.1, there were three slips of the original schedule: the
first occurred in April and was caused by design slipping one month; the second
occurred in September and was caused by coding slipping two months; the last
and biggest occurred in January and was caused by testing slipping six months,
leading to an overall slip of seven months when the original planned schedule
is contrasted with the actual results.
This pattern is fairly common due to poor quality control and also due to
project management skimping or bypassing pretest inspections and static analysis.
The inevitable result is a huge and unplanned slippage in testing schedules due to
excessive defects found when testing begins.
The main point of this chapter is not the overall magnitude of schedule slippage,
but the fact that the intermediate slippage is seldom recorded and often invisible!
Without knowing the intermediate dates it is not possible to perform really
accurate statistical surveys of software schedules. At the very least, the original start
date and original planned end date should be kept so that the final delivery date can
be used to calculate net project slippage.
One of the advantages of function point measurements, as opposed to lines of
code (LOC) metrics, is that function points are capable of measuring requirements
growth. The initial function point analysis can be performed from the initial user
requirements. Later when the application is delivered to users, a final function point
analysis is performed.
By comparing the quantity of function points at the end of the requirements
phase with the volume of function points of the delivered application, it has been
found that the average rate of growth of creeping requirements is between 1%
and 3% per calendar month, and sometimes more. The largest volume of creeping
requirements noted by the author was an astonishing 289%.
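Because the growth rate is quoted per calendar month, the projected size at delivery follows directly. This sketch assumes the rate compounds monthly, which is a simplification of mine rather than a claim from the text:

```python
def projected_size(initial_fp: float, monthly_creep: float, months: int) -> float:
    """Size after compounding requirements creep (e.g., 0.02 for 2% per month)."""
    return initial_fp * (1 + monthly_creep) ** months

# 1,000 function points growing at 2% per month over a 12-month schedule:
print(round(projected_size(1000, 0.02, 12)))  # 1268
```

Even a mid-range 2% monthly creep adds roughly 27% to a one-year project, which is why creep cannot be ignored in estimates.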
The new software nonfunctional assessment process (SNAP) metric for measur-
ing nonfunctional requirements is also subject to creep, but there are insufficient
data on SNAP point creep to include it in this chapter. However, SNAP points
seem to have less creep than ordinary function points, probably due to the fact that
they are not subject to changing user requests.
Creeping requirements are one of the major reasons for cost overruns and
schedule delays. For every requirements change of 10 function points, about 10
calendar days will be added to project schedules and more than 20 hours of staff
effort. It is important to measure the quantity of creeping requirements, and it is
also important for contracts and outsource agreements to include explicit terms
for how they will be funded.
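The rule of thumb above reduces to roughly one calendar day and at least two staff hours per added function point. A hedged sketch (linear scaling is an assumption of this sketch, not a claim from the text):

```python
def change_impact(creep_fp: float):
    """Chapter rule of thumb: each 10 FP of requirements change adds about
    10 calendar days of schedule and more than 20 hours of staff effort."""
    added_days = creep_fp * (10 / 10)   # ~1 calendar day per added function point
    added_hours = creep_fp * (20 / 10)  # ~2 staff hours per added FP (lower bound)
    return added_days, added_hours

print(change_impact(150))  # (150.0, 300.0): 150 FP of creep -> ~150 days, 300+ hours
```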
It is also obvious that as requirements continue to grow and change, it will be
necessary to factor in the changes, and produce new cost estimates and new schedule
plans. Yet in a majority of U.S. software projects, creeping requirements are neither
measured explicitly nor factored into project plans. This problem is so hazardous that
it really should be viewed as professional malpractice on the part of management.
There are effective methods for controlling change but these are not always used.
Joint Application Design (JAD), formal requirements inspections, and change con-
trol boards have long track records of success. Yet these methods are almost never
used in applications that end up in court for cancellation or major overruns.
162 ◾ A Guide to Selecting Software Measures and Metrics
One of the standard features of the author’s Software Risk Master ™ (SRM) tool
is the prediction of requirements creep. It is necessary to do this because sometimes
the magnitude of creep is about as large as the original estimate!
The author has been an expert witness in a lawsuit in Canada where the client
added 10,000 function points to an application originally sized at 10,000 function
points. More recently an arbitration case involved the costs of adding 5,000 func-
tion points to an application originally sized at 15,000 function points.
Requirements creep is a major contributor to cost and schedule overruns, and it
cannot be omitted from software cost estimates.
The SRM sizing algorithms actually create 15 size predictions. The initial
prediction is for the nominal size at the end of requirements. SRM also predicts
requirements creep and deferred functions for the initial release.
After the first release, SRM predicts application growth for a 10-year period. To
illustrate the full set of SRM size predictions, Table 18.2 shows a sample application
with a nominal starting size of 10,000 function points. All of the values are in
round numbers to make the patterns of growth clear.
As can be seen in Table 18.2 software applications do not have a single fixed
size, but continue to grow and change for as long as they are being used by custom-
ers or clients. Therefore productivity and quality data need to be renormalized from
time to time. Namcook suggests renormalization at the beginning of every
fiscal or calendar year.
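A small illustration of why renormalization matters, using hypothetical defect and size figures: the same defect count divided by a stale size quietly distorts quality comparisons as the application grows.

```python
def defects_per_fp(defects: float, size_fp: float) -> float:
    """Normalized defect density: defects divided by current application size."""
    return defects / size_fp

# Hypothetical: 270 latent defects against a 10,000 FP application at delivery.
# A year later the application has grown to 10,800 FP; unless the denominator
# is renormalized, year-over-year comparisons silently drift.
print(round(defects_per_fp(270, 10000), 4))  # 0.027
print(round(defects_per_fp(270, 10800), 4))  # 0.025
```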
Chapter 19
Gaps and Errors in Measuring Software Quality
Yet another definition for software quality has been a string of words ending
in …ility such as reliability and maintainability. However laudable these attributes
are, they are all ambiguous and difficult to measure. Further, they are hard to predict
before applications are built.
The quality standard ISO/IEC 9126 includes a list of words such as portability,
maintainability, reliability, and usability. It is astonishing that there is no
discussion of defects or bugs in this so-called quality standard.
Worse, the ISO/IEC definitions are almost impossible to predict before develop-
ment and are not easy to measure after release, nor are they quantified. It is obvious
that an effective quality measure needs to be predictable, measurable, and quantifiable.
Reliability is predictable in terms of mean time to failure (MTTF) and mean
time between failures (MTBF). Indeed these are standard predictions from the
author’s Software Risk Master (SRM) tool.
However in real life, software reliability is inversely proportional to delivered
defects and high-severity defects. Therefore the ISO quality standards should have
included defect potentials, defect removal efficiency (DRE), and delivered defect
densities.
An effective definition for software quality that can be both predicted before
applications are built and then measured after applications are delivered is as follows:
“Software quality is the absence of defects which would either cause the application
to stop working, or cause it to produce incorrect results.”
As delivered defects impact reliability, maintainability, usability, fitness for use,
conformance to requirements, and also customer satisfaction, any effective defini-
tion of software quality must recognize the central importance of achieving low
volumes of delivered defects. Software quality is impossible without low levels of
delivered defects, no matter what definition is used.
This definition has the advantage of being applicable to all software deliverables
including requirements, architecture, design, code, documents, and even test cases.
It is also applicable to all types of software: commercial, embedded, information
systems, open-source, military, web applications, smart phone applications, and
even computer games.
If software quality focuses on the prevention or elimination of defects, there are
some effective corollary metrics that are quite useful.
Finding and fixing bugs is the number one cost driver for the entire software
industry. As bug repairs cost more than any other cost driver, these costs should
be carefully measured and analyzed as part of a quality measurement program.
Unfortunately, measuring software quality is a challenging task that has seldom
been done well.
It is not difficult to count bugs or defects, but counting alone is not sufficient
to understand software quality and software quality economics. We also need to
know what causes the bugs, the best ways of getting rid of them, the costs of having
bugs, and also the severity and consequences of software bugs that are released to
customers.
Gaps and Errors in Measuring Software Quality ◾ 167
Counting software bugs has been done for over 60 years, but how it
is done is not consistent. All bugs should be counted, including bugs found
privately by means of desk checks or unit tests. These private forms of bug
detection can be done with volunteers, and it is not necessary for every single
software engineer to do this.
For large applications there are likely to be hundreds or thousands of bug
reports. And there are millions of applications. This means that bug reporting
needs to include statistical analysis that can help to understand defect origins and
hopefully reduce them in the future. Therefore bug reports must be structured
to facilitate large-scale statistical studies. In today’s world of ever-increasing cyber
threats, bug reports must also be examined as possible security flaws.
The best metric for normalizing defect potentials is the function point metric.
Lines of code (LOC) cannot deal with defects found in requirements and design.
The new software nonfunctional assessment process (SNAP) metric may also be
used for normalization, but as of 2016 there is not enough empirical data about
defects per SNAP point to include it in this chapter.
An important and useful metric for software quality is that of defect potentials.
This metric was developed by IBM circa 1970. It is the sum total of bugs originat-
ing from all major sources of error including requirements, design, code, and other
deliverables.
Defect potentials can be predicted before projects start, and of course bugs will
be measured and reported as they are found. The current U.S. average for software
defect potentials for 2016 is given in Table 19.1.
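Because Table 19.1 itself is not reproduced in this text, the per-origin values in the sketch below are purely illustrative placeholders. What the sketch shows is the mechanism the chapter describes: defect potential is the sum of bugs from all major origins, scaled by application size.

```python
# Illustrative defect-potential values per function point by origin.
# These are NOT the Table 19.1 figures; they only demonstrate the arithmetic.
ORIGIN_POTENTIALS = {
    "requirements": 1.00,
    "design": 1.25,
    "code": 1.50,
    "documents": 0.50,
    "bad fixes": 0.25,
}

def total_defect_potential(size_fp: float, potentials=ORIGIN_POTENTIALS) -> float:
    """Defect potential = sum over all defect origins, scaled by size in FP."""
    return size_fp * sum(potentials.values())

print(total_defect_potential(1000))  # 4500.0 with these illustrative values
```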
Most of the defect origins are self-explanatory. However, the bad-fix category
needs a word of explanation. About 7% of bug repairs contain new bugs in the
repair itself. For modules with high cyclomatic complexity, bad-fix injections have
been seen to top 35% of bugs reported by clients.
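Bad fixes compound: repairs of injected bugs can themselves inject bugs. Modeling the chain as a geometric series is my simplification (it assumes every injected bug is eventually found and repaired):

```python
def total_repairs(initial_defects: float, bad_fix_rate: float = 0.07) -> float:
    """Each repair injects new bugs at bad_fix_rate, which must themselves be
    repaired; the chain converges to the geometric series D / (1 - r)."""
    return initial_defects / (1 - bad_fix_rate)

print(round(total_repairs(1000)))        # 1075 repairs for 1,000 original bugs
print(round(total_repairs(1000, 0.35)))  # 1538 in a high-complexity module
```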
There are of course ranges in these averages. There are also variations in average
values based on sample size and nature. For example, systems software and web
applications do not have the same patterns of defect potentials.
Table 19.1 enumerates the origins of bugs in the application as delivered to cli-
ents. There are also bugs or defects in test cases and test plans, but these are not part
of the delivered software. However it is useful to record the following:
A study by IBM of regression test cases noted that about 15% of test cases either had
errors or gaps in them.
Another important topic that needs to be studied using reported defect data is
that of error-prone modules. A study by IBM about bugs reported against the oper-
ating systems, the IMS database, and other commercial software products found
that bugs are not randomly distributed, but tend to clump in a small number of
modules. For example, the IBM IMS database product had 425 modules. About
57% of customer-reported bugs were found in only 31 of these modules. IMS had
over 200 modules with zero defects that never received bug reports. Needless to
say, identifying error-prone modules (EPM) is an important aspect of software
bug reporting.
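Identifying EPMs is essentially a concentration calculation over per-module bug counts. A sketch with a small hypothetical system, echoing the IMS finding that a handful of modules carry most of the defects:

```python
def defect_concentration(bug_counts_by_module: dict, top_n: int) -> float:
    """Share of all reported bugs carried by the top_n buggiest modules."""
    counts = sorted(bug_counts_by_module.values(), reverse=True)
    return sum(counts[:top_n]) / sum(counts)

# Hypothetical system: most modules are clean, a few clump the defects.
bugs = {"mod_a": 40, "mod_b": 25, "mod_c": 5, "mod_d": 3, "mod_e": 2,
        "mod_f": 0, "mod_g": 0, "mod_h": 0}
print(round(defect_concentration(bugs, 2), 3))  # 0.867: two modules hold most bugs
```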
As can be seen, software bugs originate from many different sources. At IBM it
was the role of software change teams to assign an origin code to every reported
defect whether found internally or reported by customers.
It is also necessary to know the severity of these bugs or defects, and especially
so for bugs released into the field, and hence encountered by customers. The IBM
severity scale is the most widely used and it has been in use since the early 1960s
(Table 19.2).
• Invalid defect: a bug reported against the software but caused by something else
• Severity 2: 18%
• Severity 3: 35%
• Severity 4: 45%
• Total: 100%
The approximate 2016 distribution of valid unique bugs by severity level after
release to customers is given in Table 19.3.
Using valid unique bugs as the basis of comparison, the approximate percentage
of invalid, abeyant, and duplicate bugs is given in Table 19.4.
For commercial software with thousands of users, every valid bug report usually
has a number of duplicate reports of the same bug, sometimes hundreds of duplicates.
Once the initial bug report is identified, all later reports of the same bug are
stamped as duplicates and are not turned over to change teams.
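A minimal sketch of the triage step described above, keyed on a hypothetical defect signature: the first report is routed to the change team and later matches are merely stamped and counted.

```python
def triage(reports, known_signatures=None):
    """Route incoming bug reports: the first report with a given signature goes
    to the change team; later reports with the same signature are stamped as
    duplicates and only counted."""
    known = set(known_signatures or [])
    routed, duplicates = [], 0
    for sig in reports:
        if sig in known:
            duplicates += 1
        else:
            known.add(sig)
            routed.append(sig)
    return routed, duplicates

routed, dups = triage(["crash-install", "crash-install", "bad-color", "crash-install"])
print(routed, dups)  # ['crash-install', 'bad-color'] 2
```

Each duplicate is cheap by itself, but as the text notes, the aggregate stream adds significant cost to commercial maintenance.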
For those of us who have actually worked in software maintenance and defect
repairs, invalid defects and duplicate defects form a constant stream of bug reports.
Each of these by itself is not very expensive, but in total they add significant costs
to commercial software maintenance.
The possible record for duplicate defects was a word processor that had a bug in
the installation procedure so customers could not install the application. That bug
generated over 10,000 phone calls to the vendor on day 1 of the release, and in fact
shut down the phone system temporarily.
An example of an invalid defect happened to the author. A client who used several
estimating tools sent us a bug report, but it was not a bug in our software but rather
a bug found in a competitive product. Even though it was not our bug, we routed it
to the other company and notified the client. This took about 30 minutes of our time
for a bug that was invalid.
An example of an abeyant defect was a bug report from the early days of personal
computers. An application that ran well on IBM personal computers and most
clones such as Compaq displayed unusual screen colors when running on an ITT
Xtra personal computer. Of course there were not very many of these ITT Xtra
machines and only one had installed the software that generated the unusual screen
colors. A fix was developed by recoding the screen colors while using an ITT Xtra,
but that version would not work properly on authentic IBM computers. However,
the ITT Xtra was taken off the market so the problem disappeared.
The next aspect of software quality measurement is the most complex and dif-
ficult. When a software application damages a customer, these damages are called
by the legal term consequential damages.
Usually consequential damages are financial, but other kinds of harm can
include injuries or even deaths. It is also important to measure the consequences of
software defects that are released to customers. Table 19.5 shows a scale listing of
various kinds of consequential damages.
The consequences of software defects are usually not apparent immediately
after release, and may not show up for months or even years. This means
that software defect data needs to be kept active for a significant period of time.
Once defects have been reported and turned over to the change teams, they can
be analyzed to find out why the defects slipped out to customers. In other words,
it is useful to identify an optimal defect removal strategy that can eliminate future
defects of the same type.
There are several components of an optimal defect elimination strategy as given
in Table 19.6.
• Strategy 3: The defect should have been found via static analysis
• Strategy 4: The defect should have been found via formal inspections
• Strategy 6: The defect might have been found via combinations of methods
When all of these defect topics are put together, a typical software defect report
from a customer of a commercial software application might look like Table 19.7
after it has been entered into a corporate defect tracking system.
Note that to improve the ease of statistical analysis across many industries, the
defect report includes the North American Industry Classification System (NAICS)
code of the client reporting the defect. These codes are available from the Department
of Commerce and other sources. The newer NAICS code is an update and
replacement for the older Standard Industrial Classification (SIC) codes.
As can be seen, software defects are endemic problems and there are so many
of them that it is important to have high efficiency in fixing them. Of course it
would be even better to lower defect potentials and raise defect removal efficiency
(DRE) levels.
The next important metric for quality is that of DRE or the percentage of
bugs found and eliminated prior to the release of software to customers. The cur-
rent U.S. average for DRE is about 92.5% and the range is from less than 80%
up to about 99.65%.
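Expressed as a ratio, DRE is pre-release defects divided by total defects, with post-release defects conventionally counted for a fixed window (the 90-day window discussed in this chapter). A sketch:

```python
def dre(found_before_release: int, found_after_release: int) -> float:
    """Defect removal efficiency: the share of all known defects removed
    before release. Post-release counts conventionally cover 90 days."""
    total = found_before_release + found_after_release
    return found_before_release / total

# 925 bugs removed before release, 75 reported in the first 90 days:
print(f"{dre(925, 75):.1%}")  # 92.5%, the U.S. average cited in the text
```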
Software defect potentials can be reduced by three important techniques:
The term efficiency is used because each form of defect prevention and defect
removal has a characteristic efficiency level, based on measuring thousands of
Gaps and Errors in Measuring Software Quality ◾ 173
software projects. The author’s SRM tool includes quality estimates as a standard
feature. Given in Table 19.8 are sample SRM quality predictions for a 1,000 function
point application coded in Java and developed by an expert team at level 5
on the CMMI.
As can be seen software quality measurement is a fairly complex activity, but a
very important one. Software in 2016 has far too high a value for defect potentials
and far too low a value for DRE.
• DRE >99%: excellent
• DRE >95%: good
• DRE >90%: fair
• DRE <85%: poor
• DRE <80%: malpractice
Table 19.10 shows U.S. ranges in defect potentials from small projects of 1
function point up to massive systems of 100,000 function points.
As can be seen defect potentials go up rapidly with application size. This is one
of the key reasons why large systems fail so often and also run late and over budget.
Table 19.11 shows the overall U.S. ranges in DRE by application size from a
size of 1 function point up to 100,000 function points. As can be seen DRE goes
down as size goes up.
Note that the defects discussed in this chapter include all severity levels, ranging
from severity 1 showstoppers down to severity 4 cosmetic errors. Obviously it is
important to measure defect severity levels as well as recording the number of defects.
(The normal period for measuring DRE starts with requirements inspections
and ends 90 days after delivery of the software to its users or customers. Of course
there are still latent defects in the software that would not be found in 90 days, but
having a 90-day interval provides a standard benchmark for DRE. It might be
thought that extending the period from 90 days to 6 months or 12 months would
provide more accurate results. However, updates and new releases usually come out
after 90 days, so these would dilute the original defect counts.)
Latent defects found after the 90-day period can exist for years, but on average
about 50% of residual latent defects are found in each calendar year. The results
vary with number of users of the applications. The more users, the faster resid-
ual latent defects are discovered. Results also vary with the nature of the software
itself. Military, embedded, and systems software tends to find bugs or defects more
quickly than information systems. Table 19.12 shows defect potentials for various
kinds of software.
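The 50%-per-year discovery rate quoted above implies a simple geometric decay of residual latent defects, which can be sketched as:

```python
def latent_defects_remaining(initial_latent: float, years: int,
                             annual_find_rate: float = 0.50) -> float:
    """About 50% of residual latent defects are found each calendar year,
    so the remainder decays geometrically."""
    return initial_latent * (1 - annual_find_rate) ** years

print(latent_defects_remaining(64, 3))  # 8.0 defects still latent after 3 years
```

As the text notes, the actual rate varies with the number of users and the kind of software; the 50% figure is the cited average.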
Note that International Function Point Users Group (IFPUG) function points,
version 4.3 are used in this chapter for expressing results.
A future edition may include the new SNAP metric for nonfunctional size, but
to date there is insufficient data about quality to show defects normalized using the
SNAP metric.
The form of defect called bad fix in Table 19.12 is that of secondary defects
accidentally present in a bug or defect repair itself.
There are large ranges in terms of both defect potentials and DRE levels. The
best in class organizations have defect potentials that are below 2.5 defects per func-
tion point coupled with DREs that top 99% across the board. Projects that are
below 3.0 defects per function point coupled with a cumulative DRE level of about
97% tend to be lower in cost and shorter in development schedules than applica-
tions with higher defect potentials and lower levels of removal efficiency.
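The delivered-defect arithmetic behind these thresholds is straightforward: delivered density is the defect potential times the fraction of defects not removed. A sketch using the values quoted above:

```python
def delivered_defects_per_fp(defect_potential_per_fp: float, dre: float) -> float:
    """Delivered defect density = potential per FP times the share NOT removed."""
    return defect_potential_per_fp * (1 - dre)

# Best in class (2.5 per FP at 99% DRE) vs. the 3.0 per FP / 97% DRE threshold:
print(round(delivered_defects_per_fp(2.5, 0.99), 3))  # 0.025 per FP
print(round(delivered_defects_per_fp(3.0, 0.97), 3))  # 0.09 per FP
```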
Observations of projects that run late and have significant cost overruns show
that the primary cause of these problems is excessive quantities of defects that
are not discovered or removed until testing starts. Such projects appear to be on
schedule and within budget until testing begins. Delays and cost overruns occur
when testing starts, and hundreds or even thousands of latent defects are discov-
ered. The primary schedule delays occur due to test schedules far exceeding their
original plans.
DRE levels peak at about 99.85%. In examining data from about 26,000 soft-
ware projects over a period of 40 years, only two projects had zero defect reports in
the first year after release. This is not to say that achieving a DRE level of 100% is
impossible, but it is certainly very rare.
Organizations with defect potentials higher than 5.00 per function point coupled
with DRE levels of 85% or less can be viewed as exhibiting professional malprac-
tice. In other words, their defect prevention and defect removal methods are below
acceptable levels for professional software organizations.
Most forms of testing average only about 30% to 35% in DRE and seldom
top 50%. Formal design and code inspections, on the other hand, often top 85% in
DRE and average about 65%.
Table 19.12 Average Defect Potentials for Six Application Types (data expressed in defects per function point)
With every form of defect removal having a comparatively low level of removal
efficiency, it is obvious that many separate forms of defect removal need to be
carried out in sequence to achieve a high level of cumulative defect removal. The
phrase cumulative defect removal refers to the total number of defects found before
the software is delivered to its customers.
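If stage efficiencies are treated as independent (a simplifying assumption), cumulative DRE follows from multiplying the surviving fractions. The stage values below are illustrative: the ~65% inspection and ~35% testing figures echo averages quoted in this chapter, while the 55% static-analysis figure is a placeholder of mine.

```python
def cumulative_dre(stage_efficiencies) -> float:
    """Each stage removes a fraction of the defects still present, so the
    surviving fraction multiplies: DRE = 1 - product(1 - e_i)."""
    remaining = 1.0
    for e in stage_efficiencies:
        remaining *= (1 - e)
    return 1 - remaining

# Inspection (65%), static analysis (55%), then six ~35% test stages:
stages = [0.65, 0.55] + [0.35] * 6
print(f"{cumulative_dre(stages):.3%}")  # about 98.8%
```

No single stage reaches a high DRE, yet the sequence together approaches 99%, which is the point of the paragraph above.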
Table 19.13 shows patterns of defect prevention and defect removal for the same
six forms of software shown in Table 19.12.
Note: Other forms of defect removal exist besides the ones shown here. Text static
analysis, requirements models, and the use of cause-effect graphs for test case design
are not included. This is partly to save space and partly because these methods are
still uncommon. These appear to achieve results ranging between about 50%
and 80% in defect prevention and removal efficiency, but the samples are very limited.
Note that the Agile topic of pair programming is also left out. This is because the
author’s data on pair programming shows it to be expensive, slow, and less effective
than individual programmers using static analysis and testing. The literature on
pair programming is embarrassing. Most studies only compare unaided individu-
als against unaided pairs, without any discussion at all as to the use of inspections,
static analysis, mathematical test case design, certified test personnel, or other top-
ics that can improve software quality.
Table 19.13 Patterns of Defect Prevention and Removal Activities
The popular metric of technical debt only includes about 17% of the true costs of
poor quality. The major omissions from the technical debt metric are
1. Projects with quality so poor that they are canceled and not released.
2. Projects with quality so poor that developers are sued by unhappy clients.
3. Consequential damages or harm caused to users by poor software quality.
Table 19.14 Cost Elements for Software Cost of Quality (COQ)
3. Defect origins
4. Defect consequences
10. Number of test scripts and test cases for all test stages
28. Repair costs for defects that were not detected and hence delivered
29. Costs of bad fixes or new bugs in bug repairs
30. Costs of bad-test cases that add to test expense but not to value
31. Costs of quality support tools (static analysis, testing, and so on)
32. Costs of quality reporting tools
33. Costs of test library support tools
34. Costs of quality administrative support (clerks, help desks)
35. Software methodologies used for statistical purposes
36. Programming languages used for statistical purposes
37. Hardware platform of application for statistical purposes
38. Software platform (operating system) for statistical purposes
39. CMMI level of development group for statistical purposes
40. Compensation costs for all personnel involved in quality
41. Work hours per month for all personnel involved in quality
42. Unpaid overtime hours per month for all personnel involved in quality
43. Paid overtime hours per month for all personnel involved in quality
44. Function point size for data normalization
45. SNAP point size for data normalization
46. Logical code size for secondary data normalization
47. Start date of project for statistical purposes
48. Delivery date of project for statistical purposes
49. Current release number of project for statistical purposes
50. Number of software users or clients for statistical purposes
51. Countries of software users or clients for statistical purposes
52. Industries of software users or clients for statistical purposes
53. Geographical regions of users or clients for statistical purposes
54. Companies of users or clients for statistical purposes
55. Costs of normalization, statistical analysis, and quality reporting
56. Stabilization period: months after release to achieve zero defects
Table 19.15 Cost of Quality (COQ) Pretest and Test Defect Removal (1,000 function point application; monthly cost $10,000)
Table 19.16 Cost of Quality (COQ) Testing Only
complex than anything used by the software industry. For some sociological reason,
only the software industry fails to carry out serious data collection with valid met-
rics and perform serious statistical analysis.
It is hard to see how software can grow from a semi-skilled craft to become a
true engineering profession without using accurate metrics and using professional-
grade measurements.
Measuring only design, code, and unit test (DCUT) is unprofessional.
Measuring only code defects and starting quality measures late is unprofessional.
Continuing to use flawed metrics such as LOC or cost per defect for economic
analysis is unprofessional. Software needs accurate cost, schedule, and quality data
based on effective measurement practices and reliable metrics such as function
point metrics.
For software, the combination of function points for data normalization and
related metrics such as DRE can provide a solid basis for software quality
economic analysis. This combination can also show the optimal sequence of defect
prevention methods, pretest defect removal methods, and test stages needed to top
99% in DRE at a lower cost than today's poor quality control and inaccurate
quality measures and metrics.
Possibly the new SNAP metric for nonfunctional size will contribute to quality
economic understanding too, but as of 2016, there is little empirical quality data
available for this new metric.
Chapter 20
Gaps and Errors due to Multiple Metrics without Conversion Rules
There are many science and engineering disciplines that have multiple metrics
for the same values. For example, we have nautical miles, statute miles, and
kilometers for measuring speed and distance. We have both Fahrenheit and
Celsius for measuring temperature. We have three methods for measuring the
octane ratings of gasoline. There are also three methods of evaluating consumer
credit ratings. However, other engineering and business disciplines have conver-
sion rules from one metric to another.
The software industry is unique in having more metric variants than any
other engineering discipline in history, combined with an almost total lack of
conversion rules from one metric to another. As a result, producing accurate
benchmarks of software productivity and quality is much harder than for any
other engineering field.
The author has identified five distinct variations in methods for counting lines
of code (LOC), and 20 distinct variations in counting function point metrics. There
are no standard conversion rules between any of these variants, although there are
some conversion rules between COSMIC and International Function Point Users
Group (IFPUG) function points.
Here is an example of why this situation is harmful to the industry. Suppose
you are a consultant who has been commissioned by a client to find data on the
costs and schedules of producing a certain kind of software, such as a PBX switch-
ing system.
You scan the literature and benchmark databases and discover that data exist
on 90 similar PBX projects that would seem to be an adequate sample. You would
like to perform a statistical analysis of the results for presentation to the client.
But now the problems begin when trying to do statistical analysis of the 90 PBX
data samples (Table 20.1).
2. Three were measured using LOC and counted physical lines without
comments.
3. Three were measured using lines of code and counted logical statements.
4. Three were measured using lines of code and did not state the counting
method.
5. Three were constructed from reusable objects and only counted custom
code.
7. Three were measured using IFPUG function point metrics without SNAP.
8. Three were measured using IFPUG function points plus SNAP points.
20. Three were measured using Gartner backfired function point metrics.
Gaps and Errors due to Multiple Metrics without Conversion Rules ◾ 223
22. Three were measured using QSM backfired function point metrics.
27. Three were measured using goal-question metrics with local metrics.
As of 2016, there are no proven and effective conversion rules between most of
these metric variations. There is no effective way of performing a statistical analysis
of results across multiple dissimilar metrics. Why the software industry has devel-
oped so many competing variants of software metrics is an unanswered sociological
question.
Another interesting sociological problem is that adherents of each of these
30 metrics claim that it is the most accurate way of measuring software. There is
no cesium atom for software accuracy and hence no effective way of determining
accuracy against a fixed and unchanging value.
Some of these metric variations make certain kinds of software seem bigger
than other metric variations. What most metric adherents believe to be accuracy is
merely getting a bigger value for their favorite types of software compared to other
metric variants.
This started with Mark II function points and has continued ever since. It cannot
be overemphasized: accuracy is relative and essentially impossible to prove.
Other factors such as consistency across multiple counting personnel and ease of
use are more relevant than fanciful claims of accuracy.
Of course for certain hazardous metrics such as cost per defect and LOC, it can
be proven mathematically that they violate standard economic assumptions, and
hence are proven to be inaccurate. In other words, it is possible to prove inaccuracy
for software metrics but not possible to prove accuracy.
The inaccuracy for cost per defect is that it penalizes quality due to the fixed
costs of defect removal. The inaccuracy of LOC is that it penalizes high-level lan-
guages due to the fixed costs of noncode work such as requirements and design.
For unexplained sociological reasons, the software industry has developed more
competing variations than any other industry in human history: over 3,000 pro-
gramming languages, over 60 software development methodologies, over 50 static
analysis tools, over 37 competing benchmark organizations, over 25 software project
management tools, and about 30 competing software size metrics. There are at least
5 metric variations for counting LOC and at least 20 function point variations to
say nothing of pseudo functional metrics such as story points and use-case points.
The existence of so many variations is proof that none are fully adequate or else the
adequate variation would eliminate all of the others.
Developers of new versions of function point metrics almost always fail to
provide conversion rules between their new version and older standard metrics
such as IFPUG function points. This is happening right now with the software
nonfunctional assessment process (SNAP) metric, which as yet has no conversion
rules to the older function point metric. In the author’s view, it is the responsibil-
ity of the developers of new metrics to provide conversion rules to older metrics.
The existence of five separate methods for counting source code and at least
20 variations in counting function points with almost no conversion rules from
one metric to another is a professional embarrassment to the software industry.
As of 2016, the plethora of ambiguous metrics is slowing progress toward a true
economic understanding of the software industry.
However, a partial technical solution to this problem does exist. Using the high-
speed pattern-matching method for function point size prediction in the Software
Risk Master™ (SRM) tool, it would be possible to perform separate size estimates
using each of the metric examples shown above (although this is not likely to occur
and not likely to be useful if it does occur). The high-speed method embedded in
the SRM tool produces size in terms of both IFPUG function points and logical
code statements. In fact, SRM produces size in a total of 23 metrics as shown in
Table 20.2.
The SRM tool can also be used from before requirements start through develop-
ment and also for legacy applications. It can also be used on commercial packages,
open-source applications, and even classified applications, if they can be placed on
a standard taxonomy of nature, scope, class, type, and complexity.
Not only would this tool provide size in terms of standard IFPUG function
points, but the taxonomy that is included with the tool would facilitate large-
scale benchmark studies. After samples of perhaps 100 applications were sized
that had used story points, use-case points, or task hours, enough data would
become available to perform useful statistical studies of the size ratios of all com-
mon metrics.
Tools such as SRM can be used to convert applications from one metric to
another. In fact, it is technically possible for SRM to generate application sizes in all
current metric variants. However, doing this would not add any value to software
economic analysis and the results would probably be unintelligible to most metrics
users, who only care about one metric and ignore all others.
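As a rough illustration of what such a conversion involves, "backfiring" divides logical code size by a language-specific LOC-per-function-point ratio. The ratios below are illustrative approximations, not authoritative values; published backfire tables disagree with one another, which is precisely the conversion-rule problem described above.

```python
# Backfiring: rough conversion from logical source code statements to
# IFPUG function points via LOC-per-function-point ratios.
# These ratios are illustrative approximations only.

LOC_PER_FP = {
    "assembly": 320,
    "c": 128,
    "java": 53,
}

def backfire(logical_loc, language):
    return logical_loc / LOC_PER_FP[language]

print(round(backfire(79500, "java")))  # about 1,500 function points
```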
Table 20.2 Software Risk Master Size Predictions for 2,500 Function Points
It is obvious to those of us who collect benchmark data that, as of 2016, the
software industry has far too many metrics but not very much actual useful data
in any of those metrics.
Chapter 21
Gaps and Errors in Tools, Methodologies, Languages
As this book has pointed out, collecting accurate quantified data about software
effort, schedules, staffing, costs, and quality is seldom done well and often not done
at all. But even if accurate quantitative data were collected, the data by themselves
would not be sufficient to explain the variations that occur in project outcomes.
In addition to quantitative data, it is also necessary to record a wide variety of
supplemental topics in order to explain the variations that occur. Examples of the
kinds of supplemental data are shown in Table 21.1.
This kind of information lacks any kind of standard representation. The author’s
approach uses multiple choice questions to ascertain the overall pattern of tool and
methodology usage. At the end of the questionnaire, space is provided to name
the specific tools, languages, and other factors that had a significant effect on the
project.
There are a number of widely used questionnaires that gather supporting data
on methods, tools, languages, and other factors that influence software outcomes.
The oldest of these is perhaps the questionnaire designed by the author that was first
used in 1984, about a year before the Software Engineering Institute (SEI) started
assessments. The assessment questionnaire created by the SEI is perhaps the second,
and became available by about 1985.
Since then dozens of consulting and benchmark organizations have created
questionnaires for collecting data about software tools and methods. Some of
these organizations, in alphabetical order, include the David Consulting Group,
Galorath Associates, Gartner Group, the International Software Benchmarking
Standards Group (ISBSG), Namcook Analytics, and Quantitative Software
Management (QSM). There are many more in addition to these.
Each of these questionnaires may be useful in its own right, but because they
are all somewhat different and there are no industry standards that define the
information to be collected, it is hard to carry out large-scale studies.
Table 21.2 illustrates the author’s method of collecting data on the experience
levels of clients, managers, and development personnel. We use a five-point scale
where the levels mean the following:
1. Very experienced
2. Experienced
3. Average
4. Below average experience
5. All novices
Gaps and Errors in Tools, Methodologies, Languages ◾ 229
Following are the specific kinds of personnel, where experience levels are important
to the success of software projects:
A reliable taxonomy combined with a useful set of assessment and benchmark
questions are critical steps in improving software engineering, and turning it into a
true profession instead of a craft.
As software has such high labor content and such dismal quality control, it is of
considerable importance to be able to measure productivity and quality using stan-
dard economic principles. It is also of considerable importance to be able to predict
productivity and quality before major projects start. In order to accomplish these
basic goals, a number of standards need to be adopted for various measurement
topics. These standards are shown in Table 21.3.
Table 21.3 Data Elements Needed for Effective Benchmarks and Estimates
1. A standard taxonomy for identifying projects without ambiguity.
20. A standard for applying total cost of ownership (TCO) to software projects.
There are more issues than the 20 standards shown here, but until these 20
standards are addressed and their problems solved, it will not be possible to use
standard economic assumptions for software applications.
What the author suggests is a continuing series of workshops involving the
major organizations that are concerned with software: the SEI, IFPUG,
COSMIC, ISBSG, ITMPI, PMI, ISO, Namcook Analytics, and so on.
Universities should also be involved. Unfortunately as of 2016 the relationships
among these organizations tend to be somewhat adversarial. Each wants its own
method to become the basis of international standards. Therefore cooperation on
common problems is difficult to bring about.
Also involved should be the companies that produce parametric software estimation
tools, in alphabetical order: Galorath, Namcook Analytics, Price Systems, QSM,
Software Productivity Research, and USC.
Appendix 1: Alphabetical
Discussion of Metrics
and Measures
Introduction
This appendix includes an alphabetical discussion of common software measure-
ment and metrics terms.
Over the past 50 years the software industry has grown to become one of the
major industries of the twenty-first century. On a global basis, software applica-
tions are the main operating tools of corporations, government agencies, and mil-
itary forces. Every major industry employs thousands of software professionals.
The total employment of software personnel on a global basis probably exceeds
20,000,000 workers.
Due to the importance of software and because of the high costs of software
development and maintenance combined with less than optimal quality, it is
important to measure both software productivity and software quality with high
precision. But this seldom happens.
For more than 60 years, the software industry has used a number of metrics that
violate standard economic concepts and produce inaccurate and distorted results.
Two of these are lines of code or LOC metrics and the cost per defect metric. LOC
metrics penalize high-level languages and make requirements and design invis-
ible. Cost per defect penalizes quality and ignores the true value of quality, which
is derived from shorter schedules and lower development and maintenance costs.
Both LOC and cost per defect metrics can be classed as professional malpractice for
overall economic analysis. However, both have limited use for more specialized
purposes.
One of the reasons IBM invested more than a million dollars into the development
of function point metrics was to provide a metric that could be used to measure
both productivity and quality with high precision and with adherence to standard
economic principles.
For more than 200 years, a basic law of manufacturing has been understood by
all major industries except software: “If a manufacturing cycle includes a high compo-
nent of fixed costs and there is a decline in the number of units manufactured, the cost per
unit will go up.” The problems with both LOC metrics and cost per defect are due
to ignoring this basic law of manufacturing economics.
For modern software projects, requirements and design are often more expensive
than coding. Further, requirements and design are inelastic and stay more or less
constant regardless of coding size and coding time. When there is a switch from
a low-level language such as assembly to a higher level language such as Java, the
quantity of code and the effort for coding are reduced, but requirements and design
act like fixed costs, so the cost per line of code goes up.
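The fixed-cost effect can be sketched numerically. All figures below are invented for illustration (a constant coding rate, a fixed noncode cost, and rounded LOC-per-function-point ratios); they are not data from this book:

```python
# Requirements and design behave as fixed costs; coding effort shrinks
# with higher-level languages. Illustrative numbers for a hypothetical
# 1,500 function point application, assuming the same coding rate
# (1,500 LOC per staff month) and a $10,000 monthly staff cost.

def costs(loc, loc_per_month, noncode_cost, rate_per_month=10000):
    code_cost = (loc / loc_per_month) * rate_per_month
    total = noncode_cost + code_cost
    return total / loc, total / 1500  # (cost per LOC, cost per FP)

asm = costs(480000, 1500, 500000)   # assembly: ~320 LOC per FP
java = costs(79500, 1500, 500000)   # Java: ~53 LOC per FP

# Cost per LOC *rises* with Java even though total cost falls.
print(f"assembly: ${asm[0]:.2f}/LOC  ${asm[1]:,.0f}/FP")
print(f"java:     ${java[0]:.2f}/LOC  ${java[1]:,.0f}/FP")
```

The Java version costs far less in total, yet its cost per LOC is higher, which is the paradox Table A.1 documents.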
Table A.1 illustrates the paradoxical reversal of productivity rates using LOC
metrics in a sample of 10 versions of a private branch exchange (PBX) switching
application coded in 10 languages, but all the same size of 1,500 function points.
As can be seen from the table, the Assembly version had the largest amount of
effort but also the highest apparent productivity measured with LOC per month
and the lowest measured with function points per month.
Table A.1 Productivity Rates for 10 Versions of the Same Software Project
(A PBX Switching System of 1,500 Function Points in Size)

              Effort     Function Points   Work Hours per   LOC per       LOC per
Language      (Months)   per Staff Month   Function Point   Staff Month   Staff Hour
Table A.2 Cost per Defect for Six Forms of Testing (Assumes U.S. $75.75
per Staff Hour for Costs)

                   Writing      Running      Repairing    Total        Number of   $ per
                   Test Cases   Test Cases   Defects      Costs        Defects     Defect
Unit test          $1,250.00    $750.00      $18,937.50   $20,937.50   50          $418.75
Function test      $1,250.00    $750.00      $7,575.00    $9,575.00    20          $478.75
Regression test    $1,250.00    $750.00      $3,787.50    $5,787.50    10          $578.75
Performance test   $1,250.00    $750.00      $1,893.75    $3,893.75    5           $778.75
System test        $1,250.00    $750.00      $1,136.25    $3,136.25    3           $1,045.42
Acceptance test    $1,250.00    $750.00      $378.75      $2,378.75    1           $2,378.75
Table A.3 Cost per Function Point for Six Forms of Testing (Assumes
U.S. $75.75 per Staff Hour for Costs, Assumes 100 Function Points in
the Application)

              Writing      Running      Repairing   Number       Total $ per
              Test Cases   Test Cases   Defects     of Defects   Function Point
By contrast, looking at the same project and the same testing sequence using
the metric defect removal cost per function point, the true economic situation
becomes clear.
It is important to understand that Tables A.2 and A.3 both show the results
for the same project and also use identical constant values for writing test cases,
running them, and fixing bugs. However, defect removal costs per function point
decline when total defects decline, whereas cost per defect grows more and more
expensive as defects decline.
Here too the traditional cost per defect metric ignores the impact of fixed costs
in a process with a high percentage of fixed costs.
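The arithmetic behind Tables A.2 and A.3 can be reconstructed from the fixed costs shown above; the $378.75 per-defect repair cost is inferred from the table values (5 hours at $75.75 per hour):

```python
# Writing ($1,250) and running ($750) test cases are fixed costs per
# test stage; repairs cost $378.75 per defect. As defect counts fall
# in later test stages, cost per defect necessarily rises.

FIXED = 1250.00 + 750.00   # writing + running test cases
REPAIR = 378.75            # repair cost per defect (5 h at $75.75/h)

def cost_per_defect(defects):
    return (FIXED + REPAIR * defects) / defects

for stage, defects in [("Unit test", 50), ("System test", 3),
                       ("Acceptance test", 1)]:
    print(f"{stage}: ${cost_per_defect(defects):,.2f} per defect")
```

Running this reproduces the $418.75, $1,045.42, and $2,378.75 figures in Table A.2, showing that the apparent cost increase is entirely a fixed-cost artifact.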
These are very common software industry problems. But the basic point
is that manufacturing economics and fixed costs need to be included in
software manufacturing and production studies. Thus far much of the software
literature has ignored fixed costs, even though all other industries have
understood their impact on manufacturing economics for more than 200 years.
Software is unique in being the only industry that ignores fixed costs.
Unfortunately, most universities that teach software engineering also ignore the
impact of fixed costs. Therefore software engineering students and software proj-
ect management students graduate without ever learning how to measure software
economic productivity or the economic value of high software quality levels. This
is a professional embarrassment.
1. Stop measuring with unreliable metrics such as LOC and cost per defect and
begin to move toward activity-based costs, function point metrics, and defect
removal efficiency (DRE) metrics.
2. Start every project with formal early sizing that includes requirements creep,
formal risk analysis, formal cost, and quality predictions using parametric
estimation tools, and with requirements methods that will minimize toxic
requirements and excessive requirements creep later on.
3. Start every significant project with a formal risk analysis study based on the
risk patterns of applications of about the same size and the same taxonomy
patterns.
4. Raise DRE from less than 90% to more than 99.5% across the board. This
will also shorten development schedules and lower costs. This cannot be done
by testing alone, but needs a synergistic combination of pretest inspections,
static analysis, and formal testing using mathematically designed test cases
and certified test personnel.
5. Lower defect potentials from above 4.00 per function point to below 2.00
per function point for the sum of bugs in requirements, design, code,
documents, and bad-fix injections. This can only be done by increasing the
volume of reusable materials, combined with much better quality measures
than today.
6. Increase the volume of reusable materials from less than 15% to more than
85% as rapidly as possible. Custom designs and hand coding are intrinsi-
cally expensive and error-prone. Only use of certified reusable materials that
approach zero defects can lead to industrial-strength software applications
that can operate without excessive failure and without causing high levels of
consequential damages to clients.
7. Increase the immunity of software to cyber attacks. This must go beyond
normal firewalls and antivirus packages and include permanent changes in
software permissions, and probably in the von Neumann architecture as well.
There are proven methods that can do this, but they are not yet deployed.
Cyber attacks are a growing threat to all governments, businesses, and also to
individual citizens whose bank accounts and other valuable stored data are at
increasing risk.
4. Cyber attacks
3. Be unambiguous
14. Support all sizes of software from small changes through major systems
15. Support all classes and types of software (embedded, systems, Web, etc.)
both development and maintenance. The SNAP committee has not even addressed
requirements creep or changes in SNAP size over long time periods. SNAP has not
yet been applied to embedded software, to commercial packages such as Windows
10, or to weapons systems. Function points, on the other hand, have been used to
size naval shipboard gun controls, cruise missile navigation packages, cell-phone
operating systems, and all other known forms of software.
Other function point variations such as COSMIC, NESMA, FISMA, unadjusted,
engineering function points, feature points, and so on vary in how many criteria they
meet, but most meet more than 15 of the 20 criteria.
The automated function point method meets the first 5 of the 20 criteria. It is
cost effective, standardized, and unambiguous. However, it lacks conversion rules
and does not support all classes and types of software. For example, automated
function points are not yet used on embedded applications, medical devices, or weapons
systems. They have not yet been used on commercial software packages such as
SAP, Oracle, Windows 10, and so on.
The older lines of code (LOC) metric meets only criterion 5 and none of the oth-
ers. LOC metrics are fast and cheap, but otherwise fail to meet the other 19 criteria.
The LOC metric makes requirements and design invisible and penalizes high-level
languages.
The cost per defect metric does not actually meet any of the 20 criteria and also
does not address the value of high quality in achieving shorter schedules and lower
costs.
The technical debt metric does not currently meet any of the 20 criteria, although
it is such a new metric that it probably will be able to meet some of the criteria in
the future. Technical debt has a large and growing literature, but does not actually
meet criterion 9 because the literature resembles the blind men and the elephant,
with various authors using different definitions for technical debt. Technical debt
comes close to meeting criteria 14 and 15.
The story point metric for Agile projects seems to meet five criteria, that is,
numbers 6, 14, 16, 17, and 18, but it varies so widely and is so inconsistent that
it cannot be used across companies, and of course it cannot be used at all without
user stories.
The use-case metric seems to meet criteria 5, 6, 9, 11, 14, and 15 but cannot be
used to compare data from projects that do not utilize use-cases.
This set of 20 software metric criteria is a useful guide for selecting metrics that
are likely to produce results that match standard economics and do not distort real-
ity, as do so many current software metrics.
Abeyant Defects
The term abeyant defect originated in IBM in the late 1960s. It refers to an unusual
kind of bug that is unique to a single client and a single configuration and does not
occur anywhere else. In fact, the change team tasked with fixing the bug may not
be able to reproduce it. Abeyant defects are both rare and extremely troublesome
when they occur. It is usually necessary to send a quality expert to the client site
to find out what unique combination of hardware and software led to the abeyant
defect occurring. Some abeyant defects have taken more than two weeks to identify
and repair. In today’s world of software with millions of users and spotty technical
support some abeyant defects may never be fixed.
Activity-Based Costs
The term activity is defined as the sum total of the work required to produce a major
deliverable such as requirements, architecture, design documents, source code, or
test cases. The number of activities associated with software projects ranges from a
low of three (design, code, and test) to more than 50.
Several parametric estimation tools such as Software Risk Master (SRM)
predict activity costs. A typical pattern of seven software activities for a midsized
software project of 1,000 function points in size might include: (1) requirements,
(2) design, (3) coding, (4) testing, (5) quality assurance, (6) user documentation,
and (7) project management.
One of the virtues of function point metrics is that they can show productivity
rates for every known activity, as illustrated by Table A.6, which is an example for
a generic project of 1,000 function points in size coded in Java.
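In the spirit of Table A.6, activity-based productivity is simply application size divided by each activity's effort. The effort figures below are invented placeholders, not the book's data:

```python
# Activity-based productivity for a hypothetical 1,000 function point
# project. Effort values (staff months) are illustrative placeholders.

SIZE_FP = 1000
effort_months = {
    "requirements": 5.0,
    "design": 8.0,
    "coding": 30.0,
    "testing": 25.0,
    "quality assurance": 4.0,
    "documentation": 3.0,
    "management": 10.0,
}

for activity, months in effort_months.items():
    print(f"{activity:20s} {SIZE_FP / months:7.1f} FP per staff month")

total = sum(effort_months.values())
print(f"{'net (all activities)':20s} {SIZE_FP / total:7.1f} FP per staff month")
```

Note how the net rate is far lower than the coding-only rate, which is why design-code-unit-test measurements overstate productivity.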
The ability to show productivity for each activity is a virtue of function
point metrics and is not possible with many older metrics such as LOC and cost
per defect. To conserve space Table A.6 only shows seven activities, but this same
form of representation can be extended to more than 40 activities and more than
250 tasks.
Function points are the only available metric in 2016 that allows both activity
and task-level analysis of software projects.
LOC metrics cannot show noncode work at all. Story points might show activities,
but only for Agile projects and not for other forms of software. Use-case points
require that use-cases actually be used. Only function point metrics are method-
ologically neutral and applicable to all known software activities and tasks.
As of 2016, how the new SNAP metric will be used for activity-based cost
analysis is still uncertain.
Accuracy
The topic of accuracy is often applied to questions such as the accuracy of estimates
compared to historical data. However, it should also be applied to the question of
how accurate the historical data themselves are. As discussed in the section on
historical data leakage, what is called historical data is often less than 50%
complete and omits major tasks and activities such as unpaid overtime, project
management, and the work of part-time specialists such as technical writers. Little
empirical data are available on the accuracy of a host of important software topics
including, in alphabetical order, application costs, customer satisfaction, defects,
development effort, maintainability, maintenance effort, reliability, schedules,
size, staffing, and usability. Adherents of the various function point metrics
(COSMIC, FISMA, NESMA, etc.) frequently assert that their specific counting
method is more accurate than rival function point methods such as that of the
International Function Point Users Group (IFPUG). These are unproven assertions,
and they are also irrelevant in an industry where historical data include only about
37% of the true costs of software development. As a general rule, better accuracy
is needed for every software metric without exception. There is no cesium atom
for software accuracy. Consistency across multiple counting personnel is a useful
surrogate for true accuracy.
Agile Metrics
The Agile development approach has created an interesting and unique set of met-
rics that are used primarily by the Agile community. Other metrics such as func-
tion points and DRE work with agile projects too and are needed if the Agile
method is to be compared to other methods such as Rational Unified Process (RUP)
and Team Software Process (TSP) because the Agile metrics themselves are not very
useful for cross-method comparisons. The Agile approach of dividing larger
applications into small discrete sprints adds a further challenge to overall data
collection. Some common Agile metrics include burn down, burn up, story points,
and velocity. This is a complex topic and one that is still evolving, so a Google
search on Agile metrics will turn up many alternatives. The method used by the
author for comparison between
Agile and other methods is to convert story points and other Agile metrics into
function points and to convert the effort from various sprints into a standard chart
of accounts showing requirements, design, coding, testing, and so on, for all sprints
in aggregate form. This allows side-by-side comparisons between agile projects and
other methods such as the RUP, TSP, waterfall, iterative, and many others.
Analysis of Variance
Analysis of variance (ANOVA) is a collection of statistical methods for analyzing
the ranges of outcomes from groups of related factors. ANOVA might be applied
to the schedules of a sample of 100 software projects of the same size and type,
or to the delivered defect volumes of the same sample. There are textbooks and
statistical tools available that explain and support ANOVA. ANOVA is related to
design of experiments and particularly to the design of well-formed experiments.
Variance and variations are major elements of both software estimating and
software measures.
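A one-way ANOVA F-statistic can be computed from first principles, for example on delivered defect densities from three groups of projects (the sample values below are invented):

```python
# One-way ANOVA F-statistic from first principles: the ratio of
# between-group variance to within-group variance. A large F suggests
# the group means differ by more than chance would explain.

def anova_f(groups):
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    # Between-group sum of squares (weighted by group size)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical delivered defect densities (defects per function point
# in hundredths) for three groups of projects:
groups = [[2.1, 2.4, 2.0], [3.0, 3.3, 3.1], [4.2, 3.9, 4.4]]
print(f"F = {anova_f(groups):.1f}")
```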
Annual Reports
As all readers know, public companies are required to produce annual reports for
shareholders. These reports discuss costs, profits, business expansion, or contraction,
and other vital topics. Some sophisticated corporations also produce annual
software reports on the same schedule as corporate annual reports, that is, in the
first quarter of a fiscal year showing results for the prior fiscal year. The author has
produced such reports and they are valuable in explaining to senior management
at the CFO and CEO level of what kind of progress in software occurred in the
past fiscal year. Some of the topics included in these annual reports are software
demographics such as numbers of software personnel by job and occupation group,
numbers of customers supported by the software organizations, productivity for
the prior year and current year targets, quality for the prior year and current year
targets, customer satisfaction, reliability levels, and other relevant topics such as
the mix of COTS packages, open source packages, and internal development. Also
included would be modern issues such as cyber attacks and any software-related
litigation. Really sophisticated companies might also include topics such as numbers
of software patents filed in the prior year.
Small projects are far more numerous than large systems. Large systems are far
more expensive and more troublesome than small projects. Coincidentally Agile
development is a good choice below 1,000 function points. TSP and RUP are good
choices above 1,000 function points. So far, Agile has not scaled up well to really
large systems above 10,000 function points, but TSP and RUP do well in this zone.
Size is not constant either before release or afterward; so long as there are active
users, applications grow continuously. During development, the measured growth
rate is 1%–2% per calendar month; after release, the measured rate is 8%–15% per
calendar year. Over a 10-year period, a typical mission-critical departmental system
starting at 15,000 function points might have grown substantially.
As can be seen, software applications are never static if they have active users. This
continuous growth is important to predict before starting and to measure at the
end of each calendar or fiscal year. The cumulative information on original devel-
opment, maintenance, and enhancement is called total cost of ownership or TCO.
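The compounding effect of these growth rates is easy to underestimate. A sketch using the low end of the postrelease range (8% per year) for an application delivered at 15,000 function points:

```python
# Postrelease functional growth compounded at 8% per calendar year
# (the low end of the 8%-15% range cited above) for an application
# delivered at 15,000 function points.

size = 15000.0
for year in range(1, 11):
    size *= 1.08
print(f"Size after 10 years: {size:,.0f} function points")
```

Even at the low end of the range, the application more than doubles in a decade, which is why TCO studies that ignore postrelease growth understate maintenance and enhancement costs.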
Appraisal Metrics
Many major corporations have annual appraisals of both technical and managerial
personnel. Normally the appraisals are given by an employee’s immediate man-
ager, but often include comments from other managers. Appraisal data are highly
confidential and in theory not used for any purpose other than compensation
adjustments or occasionally for terminations for cause. One interesting sociological
issue has been noted from a review of appraisal results in a Fortune 500 company.
Technical personnel with the highest appraisal scores tend to leave jobs more fre-
quently than those with lower scores. The most common reason for leaving was "I do
not like working for bad management." Indirect observation supports the hypothesis
that teams with high appraisal scores outperform teams with low appraisal scores.
Some companies, such as Microsoft, try to force-fit appraisal scores into patterns,
that is, only a certain low percentage of employees can be ranked as excellent.
Although the idea is to prevent appraisal-score creep, that is, rating many more
people as excellent than truly merit it, the force-fit method tends to lower morale
and lead to voluntary turnover by employees who feel wrongly appraised. In some
countries and in companies with software personnel who are union members, it
may be illegal to have appraisals. The topic of appraisal scores and their impact on
quality and productivity needs additional study, but of necessity studies involving
appraisal scores would need to be highly confidential and covered by nondisclosure
agreements. The bottom line is that appraisals are a good source of data on experi-
ence and knowledge, and it would be useful to the industry to have better empirical
data on these important topics.
Assessment
The term assessment in a software context has come to mean a formal evaluation of key
practice areas covering topics such as requirements, quality, and measures. In the
defense sector the assessment method developed by Watts Humphrey, Bill Curtis,
and colleagues at the Software Engineering Institute (SEI) is the most common. One
byproduct of SEI assessments is placing organizations on a five-point scale called
the capability maturity model integrated (CMMI®). However, the SEI is not unique
nor is it the oldest organization performing software assessments. The author's
former company, Software Productivity Research (SPR), was doing combined assessment
and benchmark studies in 1984, a year before the SEI was first incorporated.
There is also a popular assessment method in Europe called TickIT. Several former
officers of SPR now have companies that provide both assessment and benchmark
data collection. These include, in alphabetical order, the David Consulting Group,
Namcook Analytics LLC, and the Quality/Productivity Measurement group.
Assignment Scope
The term assignment scope refers to the amount of a specific deliverable that is nor-
mally assigned to one person. The metrics used for assignment scope can be either
natural metrics such as pages of a manual or synthetic metrics such as function
points. Common examples of assignment scopes would include code volumes, test-
case construction, documentation pages, customers supported by one phone agent,
and the amount of source code assigned to maintenance personnel. Assignment
scope metrics and production rate metrics are used in software estimation tools.
Assignment scopes are discussed in several of the author’s books including Applied
Software Measurement and Estimating Software Costs.
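A minimal sketch of how estimation tools combine assignment scope with production rate; the specific numeric values below are illustrative assumptions, not data from this book:

```python
def estimate(size_fp, assignment_scope_fp, production_rate_fp_per_month):
    """Combine assignment scope (amount of work assigned to one person)
    with production rate (amount produced per staff month) to derive
    staffing, effort, and an approximate schedule."""
    staff = size_fp / assignment_scope_fp
    effort_months = size_fp / production_rate_fp_per_month
    schedule_months = effort_months / staff
    return staff, effort_months, schedule_months

# Hypothetical 1,500 function point project: assume one assignment
# covers 150 FP and output averages 10 FP per staff month.
staff, effort, schedule = estimate(1_500, 150, 10)
```

Here the sketch yields a team of 10, 150 staff months of effort, and a 15 calendar month schedule; real estimation tools adjust both metrics by activity, language, and team experience.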
Attrition Measures
As we all know, personnel change jobs frequently. During the high-growth period
of software engineering in the 1970s, most software engineers had as many as five
jobs for five companies. In today’s weak economy, job hopping is less common. In
any case most corporations measure annual personnel attrition rates by job titles.
Examination of exit interviews shows that top personnel leave more often than
average personnel, and do so because they do not like working for bad management.
For software engineers, technical challenge and capable colleagues tend to be larger
factors in attrition than compensation.
Analytics LLC website, www.Namcook.com. The author's tool produces size in a
total of 23 metrics including function points, story points, use-case points, physical
and logical source code size, and others.
Backfiring
In the early 1970s, IBM became aware that LOC metrics had serious flaws as a
productivity metric because it penalized modern languages and made noncod-
ing work invisible. Alan Albrecht and colleagues at IBM White Plains began
development of function points. They had hundreds of IBM applications available
with accurate counts of logical code statements. As the function point metric
was being tested it was noted that various languages had characteristic levels,
or number of code statements per function point. The COBOL language, for
example, averaged about 106.7 statements per function point in the procedure and
data divisions. Basic assembly language averaged about 320 statements per function
point. These observations led to a concept called backfiring or mathematical
conversion between older LOC data and newer function points. However, due to
variances in programming styles there were ranges of more than two to one in both
directions. COBOL varied from about 50 statements per function point to more
than 175 statements per function point even though the average value was 106.7.
Backfiring was not accurate but was easy to do and soon became a common sizing
method for legacy applications where code already existed. Today in 2014 several
companies such as Gartner Group, QSM, and Software Productivity Research
(SPR) sell commercial tables of conversion rates for more than 1,000 programming
languages. Interestingly, the values among these tables are not always the same
for specific languages. Backfiring remains popular in spite of its low accuracy for
specific applications and languages.
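The arithmetic can be sketched using the two average levels quoted above; real commercial tables cover more than 1,000 languages and, as noted, individual programming styles vary by more than two to one around these averages:

```python
# Average logical statements per function point (averages from the text)
STATEMENTS_PER_FP = {
    "cobol": 106.7,
    "basic assembly": 320.0,
}

def backfire_to_fp(logical_statements, language):
    """Approximate function points from a logical statement count."""
    return logical_statements / STATEMENTS_PER_FP[language]

def backfire_to_loc(function_points, language):
    """Approximate logical statements from a function point count."""
    return function_points * STATEMENTS_PER_FP[language]
```

A 106,700 statement COBOL system backfires to roughly 1,000 function points, but given the 50 to 175 statements per function point range cited above, the true count could fall anywhere from roughly 600 to more than 2,000 function points.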
Bad-Fix Injections
Some years ago, IBM discovered that about 7% of attempts to fix software bugs
contained new bugs in the repairs themselves. These were termed bad fixes. In
extreme cases, such as very high cyclomatic complexity levels, bad-fix injections can
top 25%. This brings up the point that repairs to software are themselves sources
of error. Therefore static analysis, inspections, and regression testing are needed for
all significant defect repairs. Bad-fix injections were first identified in the 1970s.
They are discussed in the author’s book The Economics of Software Quality.
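A simple model of the compounding effect, assuming each repair cycle fixes all outstanding defects but injects new ones at the bad-fix rate; the cycle structure is an illustrative assumption, and only the 7% rate comes from the text:

```python
def repair_cycles(initial_defects, bad_fix_rate=0.07, cycles=4):
    """Track defects remaining after successive repair cycles when a
    fraction of fixes (about 7% per IBM's finding) are themselves bad."""
    remaining = float(initial_defects)
    trail = []
    for _ in range(cycles):
        remaining = remaining * bad_fix_rate  # fixes applied; bad fixes remain
        trail.append(remaining)
    return trail

# 1,000 defects: about 70 bad-fix defects survive the first repair
# cycle, about 5 the second, and so on.
trail = repair_cycles(1_000)
```

This is why static analysis, inspections, and regression testing of significant defect repairs pay for themselves: each cycle of unverified repairs re-seeds the defect population.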
Bad-Test Cases
A study of regression test libraries by IBM in the 1970s found that about 15% of test
cases had errors in them. (The same study also found about 20% duplicate test cases
that tested the same topics without adding any value.) This is a topic that is severely
under-reported in the quality and test literature. Test cases that themselves contain
errors add to testing costs but do not add to testing thoroughness.
Balanced Scorecard
Art Schneiderman, Robert Kaplan, and David Norton (formerly of Nolan and
Norton) originated the balanced scorecard concept as known today, although there
were precursors. The book The Balanced Scorecard by Kaplan and Norton made it
popular. It is now widely used for both software and nonsoftware purposes. A bal-
anced scorecard comprises four views and related metrics that combine the financial
perspective, the learning and growth perspective, the customer or stakeholder
perspective, and the internal business process perspective. The balanced scorecard
is not just a retroactive set of
measures, but also includes proactive forward planning and strategy approaches.
Although balanced scorecards might be used by software organizations, they are
most commonly used at higher corporate levels where software, hardware, and
other business factors need integration.
Baselines
For software process improvement, a baseline is a measurement of quality and
productivity at the current moment before the improvement program begins.
As the improvement program moves through time, additional productivity and
quality data collections will show rates of progress over time. Baselines may
also have contract implications, as outsource vendors tender offers to provide
development or maintenance services cheaper and faster than the current baseline
rates. In general, the same kinds of data are collected for both baselines and
benchmarks.
Bayesian Analysis
Bayesian analysis is named after the English mathematician Thomas Bayes from
the eighteenth century. Its purpose, in general, is to use historical data and obser-
vations to derive the odds of occurrences or events. In 1999, a doctoral student at
the University of Southern California, Sunita Devnani-Chulani, applied Bayesian
analysis to software cost-estimating methods such as Checkpoint (designed by the
author of this paper), COCOMO, SEER, SLIM, and some others. This was an
interesting study. In any case, Bayesian analysis is useful in combining prior data
points with hypotheses about future outcomes.
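A minimal example of the idea; all numbers below are hypothetical, chosen only to show the mechanics of combining a prior probability with new evidence:

```python
def bayes_update(prior, evidence_given_true, evidence_given_false):
    """Bayes' rule: P(H|E) = P(E|H)P(H) / P(E)."""
    numerator = evidence_given_true * prior
    denominator = numerator + evidence_given_false * (1 - prior)
    return numerator / denominator

# Hypothetical: half of past projects overran their schedules; a missed
# early milestone was observed in 80% of overruns but only 20% of
# on-time projects. Seeing a missed milestone raises the overrun odds.
posterior = bayes_update(prior=0.50,
                         evidence_given_true=0.80,
                         evidence_given_false=0.20)
```

In this hypothetical the posterior probability of an overrun rises from 50% to 80%, which is exactly the style of historical-data-driven calibration applied to the estimating tools named above.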
Benchmarks
The term benchmark is much older than software and originally applied to chis-
eled marks in stones used by surveyors for leveling rods. Since then the term has
become generalized, and as of 2014 there are well over 500 different
forms of benchmarks in almost every industry. Major corporations have
been observed to use more than 60 benchmarks including attrition rates, com-
pensation by occupation, customer satisfaction, market shares, quality, produc-
tivity, and many more. Total costs for benchmarks can top U.S. $5,000,000
per year, but are scattered among many operating units, so benchmark costs
are seldom consolidated. In this chapter, a narrower form of benchmark
is relevant that deals specifically with software development productivity and
sometimes with software quality. As this chapter is written in 2014 there are
more than 25 organizations that provide software benchmark services. Among
these can be found the International Software Benchmarking Standards
Group (ISBSG), Namcook Analytics (the author’s company), the Quality and
Productivity Management Group, Quantimetrics, Reifer Associates, Software
Productivity Research (SPR), and many more. The data provided by these vari-
ous benchmark organizations varies, of course, but tends to concentrate on
software development results. Function point metrics are most widely used for
software benchmarks, but other metrics such as LOC also occur. Benchmark
data can either be self-reported by clients of benchmark groups or collected
by on-site or remote meetings with clients. The on-site or remote collection of
benchmark data by commercial benchmark groups allows known errors, such as
failure to record unpaid overtime, to be corrected; such corrections may not occur
with self-reported benchmark data.
Bug
One of the legends of software engineering is that the term bug first referred
to an actual insect that had jammed a relay in an electromechanical computer. The
term bug has since come to mean any form of defect in either code or other deliv-
erables such as requirements and design bugs. Bug reports during development and
after release are standard software measures. See also defect later in this chapter.
There is a pedantic discussion among academics that involves differences between
failures and faults and defects and bugs, but common definitions are more widely
used than academic nuances.
Burden Rates
Software cost structures are divided into two main categories: the costs of salaries
and the costs of overhead, commonly called the burden rate. Salary
costs are obvious and include the hourly or monthly salaries of software personnel.
Burden rates are not at all obvious and vary from industry to industry, from com-
pany to company, and from country to country. In the United States some of the
normal components of burden rates include insurance, office space, computers and
equipment, telephone service, taxes, unemployment, and a variety of other fees and
local taxes. Burden rates can vary from a low of about 25% of monthly salary costs
to a high of more than 100% of salary costs. Some industries such as banking and
finance have very high burden rates; other industries such as manufacturing and
agriculture have lower burden rates. But the specifics of burden rates need to be
examined for each company in specific locations where the company does business.
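The arithmetic is simple but frequently omitted from cost studies; a sketch using the burden range cited above, with an assumed salary figure:

```python
def fully_burdened_monthly_cost(monthly_salary, burden_rate):
    """Salary plus overhead. Burden rates run from roughly 25% of
    salary up to more than 100%, varying by industry and country."""
    return monthly_salary * (1 + burden_rate)

# Assumed $10,000 monthly salary under low- and high-burden industries
low = fully_burdened_monthly_cost(10_000, 0.25)   # e.g., a low-burden industry
high = fully_burdened_monthly_cost(10_000, 1.00)  # e.g., a high-burden industry
```

The same salary thus implies a true monthly cost anywhere from $12,500 to $20,000, which is why burden rates must be examined company by company before comparing cost benchmarks.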
Burn Down
Although this metric can be used with any method, it is most popular with Agile
projects. The burn down rate is normally expressed graphically by showing the
amount of work to be performed as compared to the amount of time desired to
complete the work. Burn down is somewhat similar in concept to earned value.
A variety of commercial and open-source tools can produce burn down charts.
See also the next topic, burn up. The work can be expressed in terms of user
stories or natural deliverables such as pages of documentation or source code.
Burn Up
This form of chart can also be used with any method but is most popular with Agile
projects. Burn up charts show the amount of work already completed compared to the
backlog of uncompleted work. Burn down charts, as discussed above, show uncom-
pleted work and time remaining. Here too a variety of commercial and open-source
tools can produce the charts. The work completed can be stories or natural metrics.
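Both charts derive from the same iteration data, one showing work remaining and the other work completed; a sketch with assumed story point values:

```python
def burn_charts(total_work, completed_per_iteration):
    """Return (burn_down, burn_up) series: work remaining and work
    completed after each iteration. Units may be story points or
    natural deliverables such as pages or source code."""
    burn_down, burn_up, done = [], [], 0
    for completed in completed_per_iteration:
        done += completed
        burn_up.append(done)
        burn_down.append(total_work - done)
    return burn_down, burn_up

# A hypothetical 100 story point backlog over five iterations
down, up = burn_charts(100, [20, 25, 15, 25, 15])
```

Commercial and open-source tools plot these same two series graphically; the burn-down line falls toward zero while the burn-up line climbs toward the backlog total.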
Business Value
The term business value is somewhat subjective and ambiguous. Business value can
include tangible financial value, intangible value, and also intellectual property
such as patents. Tangible value can include revenues, profits, and services such
as education and consulting. Intangible value can include customer satisfaction,
employee morale, and benefits to human life or safety as might be found with
medical software. Business value tends to vary from industry to industry and from
company to company. It can also vary from project to project.
Certification
As this section was written in 2016 there are more than 50 U.S. organizations that
provide some form of certification for software workers. Among the kinds of cer-
tification that are currently available are certification for counting function points,
certification for testers, certification for quality assurance personnel, certification for
project managers, and certification offered by specific companies such as Microsoft
and Apple for working on their products and software packages. However, there are
little empirical data demonstrating that certified personnel actually perform better
than uncertified personnel doing the same work. There have been studies that
show that certified function point counters are fairly congruent when counting the
same application. However, there is a shortage of data as to the performance of certi-
fied test personnel and certified project management personnel. There is no reason to
doubt that certification does improve performance; what is missing are solid bench-
mark data that prove this to be the case and quantify the magnitudes of the benefits.
Chaos Report
See the Standish report for additional data. This is an annual report on IT project
failures published by the Standish Group consulting company. Due to the
extensive literature surrounding this report, a Google search is recommended.
Chaos Theory
Chaos theory is an important subfield of mathematics and physics that deals with
system evolution for systems that are strongly influenced by initial or starting con-
ditions. The sensitivity to initial conditions has become popular in science fiction
and is known as the butterfly effect based on a 1972 paper written by Edward Lorenz
that included a statement that a butterfly flapping its wings in Brazil might cause a
tornado in Texas. Chaos theory seems to be a factor in the termination of software
applications prior to delivery. It may also play a part in software breach of contract
litigation. Chaos theory deals with abrupt departures from trend lines. By contrast
Rayleigh curves, discussed later in this alphabetical list, assume smooth and con-
tinuous trend lines. Since about 32% of large systems more than 10,000 function
points in size are canceled prior to completion, it seems obvious that both Rayleigh
curves and chaos theory need to be examined in a software context. A canceled
project obviously departs from a Rayleigh curve. A deeper implication of chaos
theory is that the outcomes of software systems are not predictable even if every
step is determined by the prior step. From working as an expert witness in a number
of lawsuits, it does seem probable that chaos theory is relevant to breach of contract
lawsuits. Failing projects and successful projects sometimes have similar initial con-
ditions, but soon diverge into separate paths. Chaos theory needs additional study
for its relevance to a number of software phenomena. Chaos theory is also likely
to become an important factor in analysis of cyber-security and the root causes of
successful cyber attacks. When historical data are poorly tracked, project status
information becomes invisible to project managers and governance executives. This
tends to lead a project away from a normal Rayleigh curve and toward a disastrous
ending such as termination for poor quality or a negative ROI. It would be inter-
esting to explore the combined impacts of accurate early estimates combined with
accurate progress tracking versus inaccurate early estimates and inaccurate progress
tracking.
Cognitive Dissonance
The phrase cognitive dissonance refers to both a theory and a set of experiments by
the psychologist Dr. Leon Festinger on opinion formation and entrenched beliefs.
Dr. Festinger found that once a belief is strongly held, the human mind rejects evi-
dence that opposes the belief. When the evidence becomes overwhelming, there is
then an abrupt change of opinion and a move to a new idea. Cognitive dissonance
is common in scientific fields and explains why theories such as sterile surgical pro-
cedures, continental drift, and Darwin’s theory of evolution were rejected by many
professionals when the theories were first published. Cognitive dissonance is also
part of military history and explains the initial rejection of major innovations such
as replacing muskets with rifles, the rejection of screw propellers for naval ships,
the rejection of naval cannon mounted on adjustable mounts, the rejection of iron-
clad ships, and the initial rejection of Samuel Colt’s revolver. Cognitive dissonance
is also part of business and caused the initial rejection of air-conditioning, and
the initial rejection of variable-speed windshield wipers for automobiles (later the
wiper idea was accepted and led to patent litigation by the inventor who had shown
a prototype to Ford). Cognitive dissonance is also part of software and explains
why various invalid metrics are still used even though they have been proven to be
inaccurate. Cognitive dissonance is clearly a factor in the continued use of invalid
metrics such as LOC and cost per defect. It is also a factor in the resistance to new
metrics such as function points and SNAP points. Thus cognitive dissonance plays
a part in the bad metrics and measurement practices that are endemic in the soft-
ware industry. Apparently the weight of evidence is not yet strong enough to cause
an abrupt switch to function point metrics. Cognitive dissonance is also part of
software methodology selection, such as the current belief that Agile development
is a panacea, suitable for all sizes and forms of software projects.
Cohesion
The cohesion metric is one of several metrics developed by Larry Constantine. See
also coupling later in this chapter. The cohesion metric deals with how closely
all parts of a module are related. High cohesion implies that all parts of a module
are closely related to whatever functionality the module provides, and hence the
module is probably easy to read and understand.
Complexity
The scientific literature encompasses no fewer than 25 discrete forms of complex-
ity. Software engineering has managed to ignore most of these and tends to use
primarily cyclomatic and essential complexity, Halstead complexity, and the sub-
jective complexity associated with function point metrics. However, many other
forms of complexity such as fan complexity, flow complexity, syntactic complex-
ity, semantic complexity, mnemonic complexity, and organizational complexity also
have tangible impacts on software projects and software applications. The full suite
of 25 different forms of complexity is discussed in the author’s book Estimating
Software Costs. The topic of complexity needs additional study in a software context
because major forms of complexity are not included in either software cost estimates
or software benchmarks as of 2016.
Consequential Damages
Consequential damages is a legal term referring to harm experienced
by a customer as the result of a product malfunctioning or failing. As it
happens, software is extremely likely to malfunction or fail, and hence probably
causes more consequential damages than any other manufactured product in
the twenty-first century. Examples of consequential damages include having to
restate prior year financial results based on bugs in accounting software, deaths
or injuries due to malfunctions of medical software, huge financial losses in stock
markets due to malfunctions of stock-trading software, errors in taxes and with-
holding due to errors in software used by tax collection agencies, and many more.
Consequential damages are not included in either cost of quality (COQ) metrics
or in technical debt metrics. One reason for this is that software developers may
not know of consequential damages unless they are actually sued by a disgruntled
client. Even then the consequential damages will be for only a single client unless
the suit is a class action. In the modern world of 2016, software bugs probably
cost more than a trillion dollars per year for consequential damages without any
good way of measuring the actual harm caused or total costs to many industries.
Even worse, the consequential damages from cyber attacks are growing faster
than almost any business risk in history. The software literature is almost silent
on the overall consequential damages caused by poor quality software with sig-
nificant security flaws.
Contracts—Fixed Cost
The concept of a fixed-cost or fixed-price contract is that the work will be performed
for an agreed amount even though there may be changes in scope. Fixed-cost con-
tracts have a tendency toward litigation in several situations. In one case where the
author was an expert witness, the client added 82 major changes totaling more than
3,000 function points. The client did not want to pay because it was a fixed-price
contract even though there was a clause for out of scope changes. The court decided
in favor of the vendor, who did get paid. In another case, an arbitration, the client
agreed to pay for changes in scope but only the amount agreed to in the contract.
Some of the late changes cost quite a bit more due to the need for extensive changes
to the architecture of the application and then regression testing. Fixed-cost con-
tracts need to include clauses for out of scope changes and also a sliding scale of
costs for changes made late in the development cycle. Fixed-cost contracts also need
constant monitoring by clients and by vendor executives.
Function points are very good contract metrics because of
the large volume of benchmark data available. Further, function points lend
themselves to using a sliding scale of costs for handling requirements creep and
even removing features from software. A number of civilian outsource com-
panies are also using function points for software contracts, and this trend is
expanding in 2016. Function points are also valuable for activity-based cost
analysis and can be applied to earned value measurements, although the U.S.
government and the Department of Defense are behind the civilian sectors in
function point usage. Function points are already playing a major role in soft-
ware litigation for breach of contract, poor quality, and other endemic problems
that end up in court.
Cost Center
The phrase cost center is an accounting term that refers to a corporate organization
that does not add to bottom line profit but does expend costs. A profit center is an
organization that does produce revenue. For internal software produced by com-
panies for their own use, the majority operate under a cost-center model,
that is, they build the software without charging the users. As there are no charges,
software measurement practices for cost centers tend to leak and omit major software
cost elements such as unpaid overtime and management. Among the author's clients,
the average completeness of software cost data under the cost-center model is
only about 37% of true costs.
Cost Drivers
Quite a few software researchers such as Barry Boehm, Ian Sommerville, and the
author use the concept of cost drivers. Normal project accounting keeps track of
costs by activity. However, cost drivers aggregate costs across all activities. For
example, one of the cost drivers used by both Boehm and Jones is that of soft-
ware documentation, which spans every phase and almost every activity. In total,
more than 100 documents can be created for large systems and these often cost
more than the code itself. The four major cost drivers cited by the author of this
chapter for specific projects are (1) finding and fixing bugs, (2) document creation,
(3) meetings and communications, and (4) requirements creep. When looking at
larger national results across thousands of projects, additional cost drivers include:
(1) canceled projects, (2) cyber attacks, (3) cyber attack recovery and reparations,
and (4) litigation for breach of contract, intellectual property, and other causes.
Cost drivers are useful for software economic analysis because they highlight major
areas that need study and improvements. A full list of major software cost drivers
is shown in Table A.7.
As can be seen in Table A.7, software has far too many major risk and problem
factors ranking high in the list of overall software cost drivers.
Table A.7 U.S. Software Cost Drivers in Rank Order for 2016
1 The cost of finding and fixing bugs
Cost of Quality
The cost of quality (COQ) metric is much older than software and was first made
popular by the 1951 book titled Juran’s QC Handbook by the well-known manufac-
turing quality guru Joseph Juran. Phil Crosby's later book, Quality Is Free (1979),
also added to the literature. Cost of quality is not well-named because it really
focuses on the cost of poor quality rather than the cost of high quality. In its general
form, cost of quality includes prevention, appraisal, failure costs, and total costs.
When used for software the author of this chapter modifies these terms for a soft-
ware context: defect prevention, pretest defect removal, test defect removal, and
postrelease defect removal. The author also includes several topics that are not part
of standard COQ analysis: cost of projects canceled due to poor quality, cost of
consequential damages or harm to customers from poor quality, and cost of litiga-
tion and damage awards due to poor quality.
Table A.8 shows cost per LOC side by side with cost per function point to illustrate
the errors. As can be seen from the table, cost per LOC reverses true economic
productivity and makes the most expensive version, coded in assembly language,
look cheaper than the least expensive version, coded in Smalltalk. In Table A.8,
identical applications coded in different languages are shown to highlight these
LOC metric errors.
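The reversal can be reproduced with a few lines of arithmetic; the LOC and cost figures below are illustrative assumptions standing in for Table A.8, not the book's actual data:

```python
# Two identical 100 function point applications in different languages
# (all figures assumed for illustration)
apps = {
    "assembly":  {"fp": 100, "loc": 32_000, "cost": 160_000},
    "smalltalk": {"fp": 100, "loc": 2_000,  "cost": 40_000},
}
for a in apps.values():
    a["cost_per_loc"] = a["cost"] / a["loc"]
    a["cost_per_fp"] = a["cost"] / a["fp"]
```

Cost per function point correctly shows the assembly version as four times more expensive; cost per LOC makes the same assembly version look four times cheaper, which is the economic reversal described above.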
Coupling
This is another interesting metric developed by Larry Constantine (see also cohe-
sion). The coupling metric refers to how modules exchange or share information.
Coupling can range from low coupling to high coupling. Low coupling tends to
be associated with well-structured software that is easy to read and comprehend.
There are many forms of coupling ranging from no coupling at all through content
coupling when a module depends on the inner workings of another module. Some
forms of coupling include data coupling, temporal coupling, stamp coupling, message
coupling, control coupling, and others as well. Coupling and cohesion are often
used as a set of related metrics.
costs such as India. The number of customer support personnel needed to allow
clients to reach a live person in less than 5 minutes by phone or e-mail can be
estimated and some tools such as SRM include these predictions. Some sophis-
ticated companies such as Apple have calculated customer support needs and
do a pretty good job. Others, such as Verizon, have ignored their customers
and have inadequate support where it is next to impossible to reach a live sup-
port person. Customer support staffing is based in part on expected numbers of
postrelease defects, in part on expected numbers of clients using the software,
and in part on whether support will be available 24 hours per day or only during
one or two shifts.
Cyclomatic Complexity
This metric is one of the most widely used indicators of software structure, along
with essential complexity. The cyclomatic complexity metric was developed in 1976
by Tom McCabe. It is based on graph theory and is an expression of the control
flow graph of an application. Cyclomatic complexity for software with no branches
is 1. As the number of branches increases, cyclomatic complexity increases. Once
cyclomatic complexity rises above 20, it is hard to follow the flow, and hence some
branches may be wrong. The formula for cyclomatic complexity is graph edges
minus nodes plus 2. Cyclomatic complexity plays a part in estimating test cases and
maintenance effort. High cyclomatic complexity levels are also cited in litigation
for poor quality. An interesting theoretical question is whether or not code with
low cyclomatic complexity is possible for very complex problems. See also essential
complexity and Halstead complexity.
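The formula is easy to apply by hand to a control flow graph; a small check of the definition given above:

```python
def cyclomatic_complexity(edges, nodes):
    """McCabe's cyclomatic complexity: graph edges minus nodes plus 2."""
    return edges - nodes + 2

# Straight-line code (no branches): a chain of 4 nodes has 3 edges.
straight = cyclomatic_complexity(edges=3, nodes=4)
# One if/else (start, test, then, else, join, end): 6 nodes, 6 edges.
branched = cyclomatic_complexity(edges=6, nodes=6)
```

Straight-line code yields the minimum value of 1, and each added binary branch raises the value by 1, so a module above 20 contains roughly 20 independent paths to test.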
Dashboard
The term dashboard is much older than software and has been applied to the
control panels of various devices such as automobiles where instruments provide
useful information to the operator. In a software context the term dashboard
refers to a continuous display of information about a project’s status, includ-
ing but not limited to completed tasks versus unfinished tasks, completed test
cases versus unfinished test cases, and completed documents versus unfinished
documents. A number of commercial and some open-source tools provide auto-
mated or semi-automated dashboards for software projects. Some of these support
a number of projects at the same time and are useful for portfolio analysis,
and also for data center analysis when many applications are executing
simultaneously.
A number of tools produce dashboards for software projects, including the Automated
Project Office (APO) from Computer Aid, Cognos from IBM, iDashboards, SAP
analytics, and DataWatch; in fact, there are almost 40 such products.
Defect (Definition)
There is a somewhat pedantic academic discussion of the differences between a
failure, a fault, a defect, a bug, an error, an incident, anomaly, and so on. The term
defect is a good general-purpose term that can encompass all of these. A defect is
an accidental mistake by a human that causes either total stoppage of software or
incorrect results.
264 ◾ Appendix 1: Alphabetical Discussion of Metrics and Measures
Defect Consequences
The term defect consequences is derived from the legal term consequential dam-
ages. It refers to the type and severity of harm that bugs or security flaws cause to
software customers. The most serious consequences would include human deaths,
injuries, or major financial losses in excess of U.S. $1,000,000. Other consequences
might include reduced operational efficiency, loss or theft of confidential data, and
loss of customers or at least reduced customer satisfaction levels.
Defect Density
For many years defect density has informally been defined as defects per KLOC.
This of course omits requirements and design defects, which often outnumber code
defects. Worse, this definition penalizes high-level languages. Assume you have
5,000 lines (5 KLOC) of assembly code with 50 bugs. Now assume the same algo-
rithms are coded in 1,000 lines (1 KLOC) of Java with 10 bugs. Both have exactly
10 bugs per KLOC as apparent defect densities, even though assembly has five
times as many code bugs. Assume both versions were 20 function points in size.
With this assumption, assembly has 2.5 bugs per function point, whereas Java has
only 0.5 bugs per function point. As can be seen defects per function point correctly
compensates for the reduced bug counts, whereas KLOC metrics do not show any
value for reduced defect volumes. For that matter defects per function point can
also include bugs in requirements, design, architecture, user documents, and all
other categories.
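The arithmetic in this example can be replayed directly; the figures below are the ones given in the text, and defects per function point is simply defects divided by functional size.

```python
# Replaying the defect-density example from the text: the same algorithms
# coded in assembly (5 KLOC, 50 bugs) and in Java (1 KLOC, 10 bugs),
# with both versions sized at 20 function points.
versions = {
    "assembly": {"kloc": 5.0, "function_points": 20, "bugs": 50},
    "java": {"kloc": 1.0, "function_points": 20, "bugs": 10},
}

for name, v in versions.items():
    per_kloc = v["bugs"] / v["kloc"]
    per_fp = v["bugs"] / v["function_points"]
    print(f"{name}: {per_kloc:.1f} bugs/KLOC, {per_fp:.2f} bugs per function point")

# assembly: 10.0 bugs/KLOC, 2.50 bugs per function point
# java: 10.0 bugs/KLOC, 0.50 bugs per function point
```

Both languages show identical defects per KLOC, while defects per function point correctly reflect that assembly produced five times as many code bugs.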
latent defects in a month or two. These factors explain the measured differences in
defect discovery rates for various kinds of software. Embedded and systems soft-
ware usually have the fastest defect discovery rate; Web projects with thousands
of visitors have fast defect discovery rates. Interestingly, Agile projects, which are often done to support fewer than 100 users, have fairly slow defect discovery rates that may lead to a premature assertion that Agile quality is better than it really is.
Defect Origins
The phrase defect origins was first defined in IBM circa 1968. It refers to the specific place in which a software defect was created. There are six common software
defect origins: (1) software requirements, (2) architecture, (3) design, (4) source
code, (5) user documents, and (6) bad fixes or secondary bugs in defect repairs.
There are other sources of defects such as data errors and bugs in test cases, but
they are not normally included in software defect measurements. When bugs were
reported, IBM quality engineers noted the point in time where the bug was found
(inspection, testing, deployment, etc.) and also noted the place where the bug
was created. They then assigned an origin code to each bug. This allowed IBM to
explore quality in a fairly sophisticated way and led to many important findings
such as the fact that requirements and design errors often outnumber code errors.
Make no mistake, a requirements defect such as Y2K will eventually end up in
source code, but that is not where the Y2K bug started. It started as an explicit
user requirement to conserve space by using only two digits for date fields. Every
company that builds software should explore their own defect origins, as indeed
many do.
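The IBM practice described above amounts to tallying each reported bug by an origin code; a hedged sketch, where the six origin codes come from the text but the bug records themselves are invented for illustration.

```python
from collections import Counter

# Tallying bugs by origin code, in the spirit of the IBM practice
# described above. The six origins are from the text; the bug
# records below are invented illustrative data.
ORIGINS = ["requirements", "architecture", "design", "code", "documents", "bad fix"]

bug_reports = [
    {"id": 1, "found_in": "testing", "origin": "requirements"},
    {"id": 2, "found_in": "inspection", "origin": "design"},
    {"id": 3, "found_in": "deployment", "origin": "code"},
    {"id": 4, "found_in": "testing", "origin": "requirements"},
]

by_origin = Counter(bug["origin"] for bug in bug_reports)
for origin in ORIGINS:
    print(f"{origin:>13}: {by_origin.get(origin, 0)}")
```

Recording both where each bug was found and where it originated is what allows the kind of analysis IBM used to discover that requirements and design errors often outnumber code errors.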
by severity level after release would be: (1) severity 1 = 1%, (2) severity 2 =
15%, (3) severity 3 = 35%, and (4) severity 4 = 49%. However, because most
companies fix high-severity bugs more quickly than low-severity bugs, clients tend to push low-severity bugs up into severity level 2 in order to get a quicker repair. Some applications have more than 50% of bugs reported as severity 2 by
clients even for trivial issues such as the placement of text on a screen or the color
of a display. Also, some defect reports turn out to be suggested improvements
that are reported as defects. The actual determination of defect severity is usu-
ally assigned to a quality assurance or maintenance team that sometimes has to
negotiate with clients.
Deferred Features
For many software projects either clients, executives, or business pressures such as
government laws and mandates dictate schedules that are shorter than technically
possible. In the case of impossible schedules, something has to give, and it is often features that are desirable but not mandatory. Below 100 function points, software is usually delivered close to 100% complete. Above 10,000 function points it is not uncommon for the first release to omit more than 35% of planned features in order to meet a delivery date shorter than is technically possible. Deferred features are an endemic problem of large software projects. An interesting law attributed to Chris Winter of IBM states that 80% of features delivered on time are more valuable than 100% of features delivered late.
Delphi Methods
The term Delphi method is named for the famous Greek oracles who resided at the temple of Delphi and were sought out by various leaders to predict future events.
In the modern Delphi method, panels of experts answer questions in a formal
structured way but anonymously. After the first round of questions a second round
is prepared using summaries from the first round. There may be additional rounds
until a concurrence of opinions is reached. The concept is based on the hypothesis
that groups of experts can pool their knowledge and do a better job of prediction
than a single expert. Delphi is used more for corporate decisions than for software
decisions but is sometimes used for major applications with high risks. As Delphi
depends on expertise, it is important to select participants who have actual knowl-
edge of the issues.
Delivered Defects
As software DRE is almost always less than 100% and often less than 90%, the
great majority of software applications are delivered with latent defects. By using
historical data major companies such as IBM are able to predict delivered defects
in future projects, and also use effective methods to keep delivered defects at
very low levels. Some tools such as SRM predict delivered defects as a standard
feature. In fact, SRM predicts not only total defects but also delivered defects
by origin: defects caused by requirements, design, code, bad testing, and so on.
Delivered defects are predicted by using defect potentials and DRE. They are of
course measured as they occur. Note that some defects are not found until several years after release. Indeed, delivering software with excessive defects slows down
defect discovery because clients do not trust the software and avoid using it, if
possible. Annual reports by clients of delivered defects are only around 30% of
actual latent defects for IT applications, but higher for systems and embedded
software.
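Delivered defects follow directly from defect potentials and DRE, as described above; the sketch below uses invented illustrative numbers rather than figures from the book.

```python
# Delivered defects = defect potential x (1 - DRE).
# The defect potential and DRE values below are invented for illustration.
def delivered_defects(defect_potential, dre):
    """defect_potential: total defects expected; dre: fraction removed (0..1)."""
    return defect_potential * (1.0 - dre)

# 5,000 potential defects at 90% DRE leaves 500 latent defects;
# raising DRE to 99% leaves only 50.
print(round(delivered_defects(5000, 0.90)))  # 500
print(round(delivered_defects(5000, 0.99)))  # 50
```

The tenfold difference between 90% and 99% DRE is why DRE, rather than testing effort alone, is the quantity worth measuring.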
Dilution
The term dilution refers to the loss of equity that entrepreneurs may experience
if they receive venture capital, and especially if they receive more than one round
of venture capital. In order to get funding for a software company or major
projects from a venture funding source, probably 20% of the ownership will
be turned over to the venture capitalists. If the project runs through the initial
investment, which many do, and second- or third-round financing is needed, the entrepreneurs occasionally end up with less than 15% ownership. Quite a
few venture-funded companies fail completely and go bankrupt. As a service to
software entrepreneurs, SRM includes a venture-funding routine that will predict
both the number of rounds of funding and the probable dilution of ownership.
It can also predict the odds of failure or bankruptcy. As these predictions can
be done early before any money is committed at all, hopefully both the entre-
preneurs and the venture capitalist will have a good preview of probable results
before committing serious money.
Documentation Costs
Documentation costs are fairly sparse for small projects and especially for Agile
projects. However, for large systems above 10,000 function points and especially
for government and military software projects, more than 100 kinds of documents
might be created and the total costs of these documents are often greater than the
costs of the source code itself. SRM has a standard feature for predicting document
numbers, pages, words, and costs. For a small project of 10 function points, total
document pages will be around 50. For projects of 100 function points, total doc-
ument pages will be around 400. For projects of 1,000 function points, total
document pages will be around 3,500. For projects of 10,000 function points, total
document pages will be around 32,500. Studies by the author have noted that pages
per function point tend to decline with larger applications because full documen-
tation might go past the lifetime reading speed of a single individual. Document
costs need additional research in the software engineering field. For large civilian
systems, document costs are the #2 cost driver and for large defense systems, some-
times the #1 cost driver even going past finding and fixing bugs. Function points
are the best metrics for studying document costs. To highlight the huge volume of
documents for major systems, following are the numbers and sizes of documents
for a systems software application of 25,000 function points, such as a central office
switching system (Table A.9).
Note that document completeness is also a problem for large systems.
Document completeness is inversely proportional to application size measured
in function points. For example, complete requirements and design documents
are only possible for small applications below about 500 function points in size.
Above that, applications grow during development and the larger they are the
more they grow. Documentation costs are the #2 cost driver for applications
larger than 10,000 function points. For military and defense projects, which
produce about three times the volume of paper as civilian projects, documen-
tation costs are the #1 cost driver. Agile projects have reduced documentation
costs, so that they are only the #4 cost driver, below coding. However for Agile
projects meetings and communication costs may be the #2 cost driver, replacing
paperwork costs.
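The page counts quoted above imply that pages per function point decline steadily with application size; a quick check using the book's own four data points:

```python
# Document pages by application size, as given in the text.
pages_by_size = {10: 50, 100: 400, 1_000: 3_500, 10_000: 32_500}

for fp, pages in pages_by_size.items():
    print(f"{fp:>6} FP: {pages:>6} pages, {pages / fp:.2f} pages per function point")

# pages per function point: 5.00, 4.00, 3.50, 3.25 -- declining with size,
# consistent with the text's observation.
```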
Each duplicate defect can take from 5 to 15 minutes of work. Individually this is not very significant, but if an application receives 10,000 duplicate defect reports the total cost can be substantial.
cause trouble for software projects because schedule slippage is most severe during
testing due to having more bugs than anticipated.
Enhancement Metrics
As of 2016, there are more enhancement projects for legacy applications than
there are new development projects. Enhancements are more difficult to estimate
and measure than new software development. This is because the size, structure,
and understanding of the legacy software interact with the enhancement itself.
Assume that a new small project of 100 function points is developed. This might
require a total of 12 work hours per function point. Now assume that a 100 func-
tion point enhancement is being made to a well-structured, well-documented
legacy application of 1,000 function points. As the architecture and design issues
were solved by the legacy application, the enhancement might only require 11
work hours per function point. Now assume that a 100 function point enhance-
ment is to be made to a large system of 10,000 function points with high cyc-
lomatic complexity and missing documentation. In this case, digging into the
legacy code and the need to carry out major regression testing might raise the
effort to 14 work hours per function point. As can be seen, estimating and mea-
surement needs to include both the new enhancement and the legacy application.
For measurements, both the specific enhancement needs to be measured and also
the cumulative total cost of ownership (TCO) for the updated legacy applica-
tion. In other words, two sets of measures are needed for enhancements. Several
commercial parametric estimation tools such as the author’s SRM can predict
enhancements, but need input data about the size and decay of the legacy applica-
tion. See also entropy.
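The three scenarios above can be replayed directly; the work-hour rates are the figures from the text, while the scenario labels are descriptive and not terms from the book.

```python
# Replaying the enhancement-effort example from the text: work hours
# per function point depend on the state of the legacy application.
scenarios = {
    "new 100 FP project": 12,
    "100 FP enhancement, well-structured 1,000 FP legacy app": 11,
    "100 FP enhancement, complex 10,000 FP legacy app": 14,
}

SIZE_FP = 100
for name, hours_per_fp in scenarios.items():
    print(f"{name}: {SIZE_FP * hours_per_fp} work hours")

# 1200, 1100, and 1400 work hours respectively
```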
Entropy
The concept of entropy is not a software concept but a basic fact of physics. All systems and natural objects tend to have an increase in disorder over time, which is called entropy. This is why we age and why stars turn into supernovae. For software,
entropy is observed in a gradual increase in cyclomatic and essential complexity
over time, due to the structural damages caused by hundreds of small changes over
long time periods. It is possible to reverse entropy by restructuring or refactoring
software, but this is expensive and unreliable if done manually for large systems.
Automated restructuring tools exist, but they only support a few languages and are of
uncertain effectiveness. Entropy needs much more study and more direct measure-
ments. As entropy is associated with all human artifacts and also with all natural
systems, it is a fundamental fact of nature.
Error-Prone Modules
In the early 1970s, IBM undertook an interesting study of the distribution of bug
reports in a number of major software projects including operating systems, com-
pilers, database products, and others. One of the most important findings was
that bugs were not randomly distributed through all modules of large systems, but
tended to clump in a few modules, which were termed error-prone modules (EPM).
For example, 57% of customer-reported bugs in the IMS database application were
found in 32 modules out of a total of 425 modules. More than 300 IMS modules
had zero-defect reports. Other companies replicated these findings and EPM are
an established fact of large systems. Two common causes for EPM have been noted:
(1) high levels of cyclomatic complexity and (2) bypassing or skimping on inspec-
tions, static analysis, and formal testing. In theory EPM can be avoided by proper
quality control, but even now in 2014 they tend to be far too common in far too
many large applications.
Essential Complexity
This metric was also developed by Tom McCabe in 1976 and is a variation on his
more famous cyclomatic complexity metric. Note that Fred Brooks also uses the
term in a different context as the minimum set of factors in large complex problems.
The McCabe form of essential complexity is derived from cyclomatic complexity,
but it replaces well-structured control sequences with a single statement. If a code
section has a cyclomatic complexity of 10 but includes well-structured sequences
then essential complexity might be only 3 or 4.
Experience
The author’s benchmark collection method and the SRM tool use experience for
a number of occupations, including client experience, software engineer experi-
ence, tester experience, software quality assurance experience, project management
experience, customer support experience, and several others. Experience is ranked
on a subjective scale of 1–5: 1 = expert; 2 = above average; 3 = average; 4 = below
average; 5 = inexperienced.
As can be seen results are slightly asymmetrical. Top teams are about 30% more
productive than average, but novice teams are only about 15% lower than average.
The reason for this is that normal corporate training and appraisal programs tend
to weed out the really unskilled personnel so that they seldom become actual team
members. The same appraisal programs reward the skilled, so that explains the fact
that the best results have a longer tail.
Software is a team activity. The ranges in performance for specific individu-
als can top 100%. But there are not very many of these super stars. Only about
5%–10% of general software populations are at the really high end of the performance spectrum.
Individual practitioners can vary in performance by more than 10 to 1, but soft-
ware is normally a team event. In any case top performers are rare and bottom per-
formers are usually terminated, so average performance is the norm with a weight
on the high side. Also, bad management tends to slow down and degrade the per-
formance of top technical personnel, some of whom quit their jobs as a result.
Expert Estimation
The term expert estimation refers to manual software estimates by human beings as
opposed to using a parametric estimation tool such as COCOMO II, CostXpert,
KnowledgePlan, SEER, SRM, SLIM, or TrueCost. A comparison of 50 manual
estimates and 50 parametric estimates by the author found that below 250 function
points manual estimates and parametric estimates were almost identical. As application size increased, manual estimates became progressively more optimistic and predicted shorter schedules and lower costs than what actually occurred. Above 5,000 function points, manual estimates even by experts tended to be hazardous and excessively
optimistic by more than 35%. This is not surprising because the validity of historical data is also poor for large systems above 5,000 function points due to leakage of major cost elements such as unpaid overtime, management, and specialists. See also
parametric estimation later in this report.
The definition used by the author for project failure is: “software that is terminated
without delivery due to errors, delays, or cost overruns or software whose development
company is sued for breach of contract after delivery for excessive errors.” See also the
definition for successful software later in this report. In between success and fail-
ure are thousands of projects that finally get released but are late and over budget
and probably have too many bugs after delivery. That is the modus operandi for software circa 2014. Another cut at a definition of failing projects would be projects in the lowest 15% in terms of quality and productivity rates from the benchmark collections of companies such as Namcook Analytics, Q/P Management Group, Software Productivity Research, and others.
Failure Rate
The term failure is defined by the author as a software project that is terminated prior to being completed due to poor quality, negative ROI, or some other cause
that was self-inflicted by the development team. Projects that are terminated for
business reasons such as buying a commercial software application rather than
finishing an internal application are not failures. The topic of software failures
has a lot of publicity, due in part to the large number of failures included in
the Standish Report, produced by the Standish consulting group. However, that
report only covers information technology and does not include systems soft-
ware or commercial software, both of which have lower failure rates. Also, the
Standish report does not show failures by application size, which is a serious
omission. The author’s data on project failure rates by size is as follows: The prob-
ability of a software project failing and not being completed is proportional to
the cube root of the size of the software application using IFPUG function points
with the results expressed as a percentage. For 1,000 function points the odds
are about 8%; for 10,000 function points the odds are about 16%; for 100,000
function points the odds are about 32%. These rules are not perfect, but are based
on observations taken from about 20,000 software projects of all sizes and types
including Web applications, smart phones, systems software, medical devices,
and military projects.
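The author's three data points double with each tenfold increase in size (8%, 16%, 32%). The sketch below reproduces that doubling pattern; note it is a fit to the quoted examples rather than a literal cube-root formula, and the rule itself is explicitly approximate.

```python
import math

# Failure probability fitted to the three data points in the text:
# 1,000 FP -> ~8%, 10,000 FP -> ~16%, 100,000 FP -> ~32%.
# The odds double with each tenfold size increase.
def failure_probability_pct(function_points):
    return 8.0 * 2 ** math.log10(function_points / 1_000)

for size in (1_000, 10_000, 100_000):
    print(f"{size:>7} FP: ~{failure_probability_pct(size):.0f}% failure odds")
```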
False Positive
The term false positive refers to misidentifying a code sequence as being incorrect when in fact it is correct. False positives can occur with testing and inspections, but are most often associated with static analysis tools, some of which may have more
than 10% false positives. False positives are annoying, but it is probably safer to
have a few false positives than to miss real bugs. Every form of defect removal is
less than 100% in removal efficiency and produces at least a few false positives.
Feature Bloat
This term is not ordinarily quantified, but is a subjective statement that many soft-
ware packages have features that are in excess of the ones truly needed by the vast
majority of users. For example, the author of this chapter has written 16 books and
hundreds of articles with Microsoft Word, but probably has used less than 15% of
the total feature set available in Microsoft Word. This is not to say that the features
in Word are useless, but there are so many of them that few authors ever use the
majority of available features in either Word or Excel. In theory, feature bloat could
be measured with function points and the newer SNAP metric for nonfunctional
size. However, there is a logical inconsistency. Function points are defined in terms of user benefits, and feature bloat is considered to have no user benefit. This might be resolved by creating a bloat point metric that would be counted like function points but assume
zero-user benefits and perhaps zero-business value as well. Feature bloat is basically a subjective opinion and not a truly measurable attribute.
Fixed Costs
In a manufacturing process, the term fixed cost refers to a cost that stays constant
no matter how many products are built per month. A prime example of a fixed cost
would be the rent paid for a software office building. For software applications many
costs are not fixed in the classic sense of being constants, but they are inelastic and
stay more or less the same. For example, requirements and design are likely to stay
more or less the same regardless of what coding language is used. After release of soft-
ware, companies will have maintenance personnel standing by to fix bugs regardless
of how many bugs are reported by users. Assume a company has a full-time main-
tenance programmer standing by at a cost of U.S. $10,000 per month. Now assume
that project A has 10 bug reports in the first month of use. The cost per defect for
project A will be U.S. $1,000. Now assume that next month the same maintenance
programmer is standing by for project B, which only has 1 bug. Now the cost per
defect will be U.S. $10,000 for project B. Fixed and variable costs need to be ana-
lyzed for software, and especially for quality work. See also variable costs later in this
report. See also burden rate earlier in this report for another look at fixed costs.
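The cost-per-defect distortion described above is easy to reproduce; the U.S. $10,000 monthly cost and the bug counts are the figures from the text's example.

```python
# A fixed monthly maintenance cost divided by a variable bug count:
# the fewer the bugs, the worse "cost per defect" looks.
MONTHLY_MAINTENANCE_COST = 10_000  # U.S. dollars, from the text's example

def cost_per_defect(bugs_reported):
    return MONTHLY_MAINTENANCE_COST / bugs_reported

print(cost_per_defect(10))  # project A: 1000.0 dollars per defect
print(cost_per_defect(1))   # project B: 10000.0 dollars per defect
```

The total cost is identical in both months; only the apparent cost per defect changes, which is why fixed and variable costs need to be separated in quality economics.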
Function Points
In the late 1960s and early 1970s, the number of programming languages used
inside IBM expanded from assembly to include COBOL, FORTRAN, PL/I, APL,
and others. It was found that LOC metrics penalized high-level languages and did
not encompass requirements and design work. IBM commissioned Al Albrecht
and his colleagues in IBM White Plains to develop a metric that could include
all software activities and was not based on source code. The results were func-
tion point metrics that were developed circa 1975. In 1978 at a joint conference by
Share, Guide, and IBM, Albrecht presented function points to the outside world.
Function points started to be used by IBM customers and in 1984 the IFPUG
was formed in Montreal, and later moved to the United States. Function point
metrics are the weighted combination of inputs, outputs, inquiries, logical files,
and interfaces adjusted for complexity. In today’s world of 2014 function points
are supported by an ISO standard and by certification exams. Function points are
probably the most widely used software metric in the world and have more bench-
mark data than all other metrics put together. There are a number of alternative
methods of counting function points discussed later in this report under the topic
of function point variations.
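The weighted combination described above can be sketched using the commonly published IFPUG average complexity weights; real counting distinguishes low, average, and high complexity per element, and the element counts below are invented for illustration.

```python
# Unadjusted function points as a weighted sum of the five element types.
# The weights below are the commonly published IFPUG *average* complexity
# weights; actual counts rate each element low/average/high.
AVERAGE_WEIGHTS = {
    "inputs": 4,
    "outputs": 5,
    "inquiries": 4,
    "logical_files": 10,
    "interfaces": 7,
}

def unadjusted_function_points(counts):
    return sum(AVERAGE_WEIGHTS[kind] * n for kind, n in counts.items())

# Invented counts for a small illustrative application:
counts = {"inputs": 10, "outputs": 8, "inquiries": 5,
          "logical_files": 4, "interfaces": 2}
print(unadjusted_function_points(counts))  # 10*4 + 8*5 + 5*4 + 4*10 + 2*7 = 154
```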
Gantt Charts
The phrase Gantt chart refers to a graphical method for showing overlapped project
schedules first developed in 1910 by Henry Gantt. These charts are of course much
older than software and used by many industries. However, they are also widely
used for software projects because they show that true waterfalls are uncommon.
Instead software projects normally start a new activity before the prior activity is
finished. Thus design starts before requirements are finished; coding starts before
design is finished, and so forth. A Gantt chart consists of horizontal bars showing
time lines for activities as shown below using a simple example.
Requirements        **********
Design                   **********
Coding                        **********
Testing                            **********
Documentation                 **********
Quality assurance                  **********
Management          *********************
Goal-Question Metrics
The phrase goal-question metric (GQM) refers to a fairly new and general way of
measurement developed by Dr. Victor Basili of the University of Maryland, with
contributions by Dr. David Weiss and Albert Endres of IBM. The GQM approach
is a general-purpose measurement method and not limited to software. It includes a
six-step process: (1) set a goal, (2) generate questions based on the goal, (3) specify
metrics, (4) develop data collection method, (5) collect and validate data, and (6) do
a postmortem on results. It is possible to use the GQM approach with standard
metrics such as function points and DRE. Indeed two useful goals for the software
industry are (1) to raise average productivity rates to 15 function points per staff
month and (2) to raise average DRE levels to 99%.
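The goal-question-metric chain lends itself to a simple data representation; the sketch below instantiates it for one of the example goals just mentioned, with invented question and metric wording.

```python
# A minimal goal-question-metric chain, using one of the example goals
# from the text. Field names and question wording are invented.
gqm = {
    "goal": "Raise average defect removal efficiency (DRE) to 99%",
    "questions": [
        "What is our current DRE per project?",
        "Which defect removal activities contribute most to DRE?",
    ],
    "metrics": [
        "DRE = defects removed before release / total defects found",
        "defects found per removal activity (inspection, static analysis, test)",
    ],
}

for question in gqm["questions"]:
    print(f"{gqm['goal']!r} -> {question}")
```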
released if it is able to work and perform most tasks. However, more sophisticated
software managers know that released bugs lower customer satisfaction and raise
support and warranty costs. Further, if software is developed properly using a com-
bination of defect prevention, pretest defect removal such as inspections and static
analysis, and formal testing by certified test personnel using mathematical test case
design, it can achieve >99% DRE and still be quicker than sloppy development
that only achieved 90% DRE or less. The good-enough fallacy is symptomatic of
inept management who need better training in software economics and software
quality control. Make no mistake: the shortest software schedules correlate with
the highest DRE levels and the lowest defect potentials. Software schedules slip
because there are too many bugs in software when testing starts. See also technical
debt discussed later in this chapter.
Governance
The financial collapse of Enron and other major financial problems, some of them partly blamed on software, led to the passage of the Draconian Sarbanes–Oxley law in the United States. This law is aimed at corporate executives, and can bring criminal
charges against corporate executives for poor governance or lack of due diligence.
The term governance means constant oversight and due diligence by executives of
software and operations that might have financial consequences if mistakes are
made. A number of the measures discussed in this report are relevant to governance, including but not limited to: cyclomatic complexity, defect origins, defect
severity, defect potentials, defect detection efficiency (DDE), DRE, delivered
defects, function point size metrics, and reliability.
Halstead Complexity
The metrics discussed in this topic were developed by Dr. Maurice Halstead in 1977
and deal primarily with code complexity, although they have more general uses.
Halstead set up a suite of metrics that included operators (verbs or commands) and
operands (nouns or data). By enumerating distinct operators and operands various
metrics such as program length, volume, and difficulty are produced. Halstead metrics and cyclomatic complexity metrics are different but somewhat congruent. Today in 2014, Halstead complexity is less widely used than cyclomatic
complexity.
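The standard Halstead definitions can be sketched as follows; the operator and operand counts are invented for illustration.

```python
import math

# Halstead metrics from operator/operand counts:
# n1, n2 = distinct operators and operands; N1, N2 = total occurrences.
def halstead(n1, n2, N1, N2):
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2) * (N2 / n2)
    return {"vocabulary": vocabulary, "length": length,
            "volume": volume, "difficulty": difficulty}

# Invented counts for a small code fragment:
m = halstead(n1=8, n2=8, N1=20, N2=20)
print(f"length={m['length']}, volume={m['volume']:.1f}, difficulty={m['difficulty']}")
# length=40, volume=160.0, difficulty=10.0
```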
project management, user costs, and the work of part-time specialists such as
quality assurance, technical writers, business analysts, Agile coaches, project
office staff, and many more. Leakage is worse for projects created via cost centers
than via profit centers. Quality data leakage is also severe and includes omitting
bugs in requirements and design, omitting bugs found by unit test, omitting
bugs found by static analysis, and omitting bugs found by developers themselves.
At IBM, there were volunteers who reported unit test and self-discovered bugs
in order to provide some kind of statistical knowledge of these topics. Among
the author’s clients, overall cost data for cost-center projects average about 37% complete; quality data average only about 24% complete. Projects developed
under time and material contracts are more accurate than fixed-price contracts.
Projects developed by profit centers are more accurate than projects developed
by cost centers.
Incidents
The term incident is used in maintenance measurement and estimation. It is
a complex term that combines many factors such as bug reports, help requests,
and change requests that may impact software applications after they have been
released. SRM estimates and measures software maintenance incidents, which
include the following:
Mandatory changes
TOTAL INCIDENTS
Industry Comparisons
Software is produced essentially by every industry in the world. There is little pub-
lished data that compares software quality and productivity across industry lines.
From the author’s data collection of about 26,000 projects, the high-technology
industries that manufacture complex physical equipment (medical devices, avionics,
and embedded applications) have the best quality. Banks and insurance companies
have the best productivity. One of the virtues of function point metrics is the ability to make direct comparisons across all industries.
The U.S. Department of Commerce and the Census Bureau have developed an
encoding method that is used to identify industries for statistical purposes called
the North American Industry Classification (NAIC). Refer to the NAIC code
discussed later in this document for a description.
Inflation Metrics
Over long periods of time wages, taxes, and other costs tend to increase steadily. This
is called inflation and is normally measured in terms of a percentage increase. For
software, inflation rates play a part in large systems that take many years to develop.
They also play a part in long-range legacy application maintenance. Inflation also
plays a part in selection of outsource countries. For example, in 2014 the inflation
rates in China and India are higher than in the United States, which will eventually
erode the current cost advantages of these two countries for outsource contracts.
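Compound inflation adjustment for a multiyear project can be sketched simply; the rate and base cost below are invented illustrative values, not figures from the book.

```python
# Adjusting a fixed monthly labor cost for compound annual inflation,
# as matters for multiyear development or long-range maintenance.
# The 5% rate and the base cost are invented illustrative values.
def inflated_cost(base_cost, annual_rate, years):
    return base_cost * (1 + annual_rate) ** years

base = 10_000       # monthly cost in year 0
rate = 0.05         # 5% annual inflation
for year in range(4):
    print(f"year {year}: {inflated_cost(base, rate, year):,.2f}")
# year 0: 10,000.00 ... year 3: 11,576.25
```

Over a four-year development schedule even a modest inflation rate shifts labor costs noticeably, which is one reason long projects are harder to estimate.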
International Comparisons
Software is developed in every known country in the world. This raises the question of which methods are effective in comparing productivity and quality across national boundaries. Some of the factors that have international impacts include: (1) average
compensation levels for software personnel by country, (2) national inflation rates,
(3) work hours per month by country, (4) vacation and public holidays by country,
(5) unionization of software personnel and local union regulations, (6) probabilities of
strikes or civil unrest, (7) stability of electric power supplies by country, (8) logistics such
as air travel, (9) time zones that make communication difficult between countries with
more than a 4-hour time difference, (10) knowledge of spoken and written English, which is the dominant language of software, and (11) intellectual property laws and
protection of patents and source code. Function point metrics allow interesting global
comparisons of quality and productivity that are not possible using other metrics.
Inspection Metrics
One of the virtues of formal inspections of requirements, design, code, and other
deliverables is the suite of standard metrics that are part of the inspection process.
282 ◾ Appendix 1: Alphabetical Discussion of Metrics and Measures
Inspection data routinely include preparation effort, inspection session team size
and effort, defects detected before and during inspections, defect repair effort after
inspections, and calendar time for the inspections for specific projects. These data
are useful in comparing the effectiveness of inspections against other methods of
defect removal such as pair programming, static analysis, and various forms of test-
ing. To date, inspections have the highest levels of DRE (>85%) of any known form
of software defect removal.
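The DRE figure cited above is a simple ratio of defects removed before release to total defects. A minimal sketch, using the common convention that defects reported in the first 90 days after release define the "after" count (the defect counts themselves are hypothetical):

```python
def defect_removal_efficiency(found_before_release, found_after_release):
    """DRE = defects removed before release / total defects, as a percentage.
    By convention the 'after' count covers the first 90 days of use."""
    total = found_before_release + found_after_release
    return 100.0 * found_before_release / total if total else 0.0

# Hypothetical project: inspections, static analysis, and testing remove
# 950 defects; users report 50 more in the first 90 days.
print(defect_removal_efficiency(950, 50))  # → 95.0
```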
Invalid Defects
The term invalid defect refers to a bug report against a software application that, on
examination, turns out not to be a true defect. Some of the common reasons for
invalid defects include user errors, hardware errors, and operating system errors
mistaken for application errors. As an example of an invalid defect, a bug report
against a competitive estimation tool was sent to the author’s company by mistake.
Even though it was not our bug, it took about an hour to forward the bug to the
actual company and to notify the client of the error. Invalid defects are not true
defects but they do accumulate costs. Overall about 15% of reported bugs against
many software applications are invalid defects.
ISO/IEC Standards
This phrase is an amalgamation of the International Organization for Standardization,
commonly abbreviated to ISO, and the International Electrotechnical Commission,
commonly abbreviated to IEC. These groups have hundreds of standards covering
essentially every industry. Some of the standards that are relevant to software
include: ISO/IEC 20926:2009 for IFPUG function points, the ISO/IEC 9126 quality
standard, and the newer ISO 31000:2009 risk management standard. An issue for all ISO/IEC standards
is lack of empirical data that proves the benefits of the standards. There is no reason
to doubt that international standards are beneficial, but it would be useful to have
empirical data that shows specific benefits. For example, do the ISO quality and risk
standards actually improve quality or reduce risks? As of 2014 nobody knows. The
standards community should probably take lessons from the medical community
and include proof of efficacy and avoidance of harm as part of the standards creation
process. As medicine has learned from the many harmful side-effects of prescription
drugs, releasing a medicine without thorough testing can cause immense harm to
patients including death. Releasing standards without proof of efficacy and avoid-
ance of harmful side-effects should be a standard practice itself.
Kanban
Kanban is a Japanese method of streamlining manufacturing first developed by
Toyota. It has become famous under the phrase just in time. The Kanban approach
uses interesting methods for marking progress and showing when a deliverable
is ready for the next step in production. Kanban is used with software, but not
consistently. The Agile approach has adopted some Kanban ideas, as have other
methodologies. Quite a number of methods for quality control were first used in
Japan, whose national interest in quality is thousands of years old. Other Japanese
methods include quality circles, Kaizen, and Poka-Yoke. Empirical data gathered
from Japanese companies indicate very high software quality levels, so the combi-
nations of Japanese methods have proven to be useful and successful in a software
context.
KLOC
This term uses K to express 1,000 and LOC for lines of code. This is a metric that
dates back to the 1960s as a way of measuring software size as well as software
costs and defect densities. However, both KLOC and LOC metrics share common
problems in that they penalize high-level languages and make requirements and
design effort and defects invisible.
Language Levels
In the late 1960s and early 1970s, programming languages began to increase rapidly
in both number and power. By the mid-1970s, more than 50 languages were in use.
The phrases low level and high level were subjective and had no mathematical rigor.
IBM wanted to be able to evaluate the power
of various languages and so developed a mathematical form for quantifying levels.
This method used basic assembly language as the baseline and assigned it
level 1. Other languages were evaluated based on how many statements in basic
assembly language it would take to duplicate one statement in the higher-level lan-
guage. Using this method, both COBOL and FORTRAN were level 3 languages
because it took an average of three assembly statements to provide the features
of one statement in COBOL or FORTRAN. Later when function points were
invented in 1975, the level concept was extended to support function points and
was used for backfiring or mathematical conversion between code volumes and
function points. Here too basic assembly was the starting point, and it took about
320 assembly statements to be equivalent to one function point. Today in 2014,
tables of language levels are commercially available and include about 1,000 differ-
ent languages. For example, Java is level 6; objective C is level 12; PL/I is level 4; C
is level 2.5, and so forth. This topic is popular and widely used but needs additional
study and more empirical data to prove the validity of the assigned levels for each
language. Combinations of languages can also be assigned levels, such as Java and
HTML or COBOL and SQL.
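The backfiring arithmetic described above can be sketched directly from the language levels the text cites. The helper below assumes the level-1 baseline of roughly 320 assembly statements per function point:

```python
# Backfiring: convert logical code statements to approximate function points
# via language levels. Level 1 (basic assembly) ≈ 320 statements per FP.
STATEMENTS_PER_FP_LEVEL_1 = 320

# Levels taken from the text; treat them as approximations.
LANGUAGE_LEVELS = {"basic assembly": 1, "C": 2.5, "COBOL": 3, "FORTRAN": 3,
                   "PL/I": 4, "Java": 6, "Objective C": 12}

def backfire_function_points(logical_statements, language):
    """Approximate function points from a count of logical code statements."""
    level = LANGUAGE_LEVELS[language]
    statements_per_fp = STATEMENTS_PER_FP_LEVEL_1 / level
    return logical_statements / statements_per_fp

# At level 6, Java needs about 53.3 statements per FP, so roughly
# 53,333 logical Java statements ≈ 1,000 function points.
print(round(backfire_function_points(53_333, "Java")))  # → 1000
```

As the text cautions, the assigned levels need more empirical validation, so backfired sizes should be treated as rough approximations.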
Lean Development
The term lean implies less body fat and a lower weight than average. When
applied to software, it means building software with a smaller staff than normal,
while hopefully not slowing down development or causing harmful side effects.
Lean manufacturing originated at Toyota in Japan,
but the concepts spread to software and especially to the Agile approach. Some of
the lean concepts include eliminate waste, amplify learning, and build as fast as
possible. A lean method called value stream mapping includes useful metrics. As
with many other software concepts, lean suffers from a lack of solid empirical data
that demonstrates effectiveness and lack of harmful side effects. The author’s clients
that use lean methods have done so on small projects below 1,000 function points,
and their productivity and quality levels have been good but not outstanding. As of
2014, it is uncertain how lean concepts will scale up to large systems in the 10,000
function point size range. However, TSP and RUP have proof of success for large
systems so lean should be compared against them.
Learning Curves
The concept of learning curves is that when human beings need to master a new
skill, their initial performance will be suboptimal until the skill is truly mastered.
This means that when companies adopt a new methodology, such as Agile, the first
project may lag in terms of productivity or quality or both. Learning curves have
empirical data from hundreds of technical fields in dozens of industries. However,
for software, learning curves are often ignored when estimating initial projects
based on Agile, TSP, RUP, or other new methods. In general, expect suboptimal
performance on the first project that uses a new methodology.
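The classic industrial learning-curve model can quantify this effect. The sketch below uses Wright's formulation with an 80% learning rate, which is a common manufacturing default rather than a software-specific figure from the text:

```python
import math

def unit_time(first_unit_time, n, learning_rate=0.80):
    """Wright learning curve: each doubling of repetitions cuts unit effort
    to `learning_rate` of its prior value. An 80% rate is a common
    industrial default, not a software-calibrated figure."""
    b = math.log(learning_rate, 2)   # negative exponent for the power law
    return first_unit_time * n ** b

# If the first project with a new method takes 100 effort units,
# the fourth takes about 0.8 * 0.8 = 64% of that.
print(round(unit_time(100, 4)))  # → 64
```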
Maintenance Metrics
The term maintenance is highly ambiguous. No fewer than 23 different kinds of
work are subsumed under the single term maintenance. Some of these forms of
maintenance include defect repairs, refactoring, restructuring, reverse engineering,
reengineering of legacy applications, and even enhancements or adding new fea-
tures. For legal reasons, IBM made a rigorous distinction between maintenance in
the sense of defect repairs and enhancements or adding new features. A court order
required IBM to provide maintenance information to competitors, but the order
did not define what the word maintenance meant. A very useful metric for main-
tenance is to use function point metrics for the quantity of software one mainte-
nance programmer can keep up and running for one year. The current average is
about 1,500 function points. For very well structured software, the maintenance
assignment scope can top 5,000 function points. For very bad software with high
complexity and convoluted paths, the maintenance assignment scope can drop below
500 function points. Other metrics in the maintenance field include the number of
clients one telephone support person can handle during a typical day (about 10) and
the number of bugs that a maintenance programmer can fix per month (from 8 to 12).
In spite of the complexity of maintenance, the tasks of maintenance, customer sup-
port, and enhancement can be measured and predicted fairly well. This is important
because in 2014 the world population of software maintenance personnel is larger
than the world population of software development personnel.
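The maintenance assignment scope figures above translate into a simple staffing calculation:

```python
import math

def maintenance_staff_needed(application_fp, assignment_scope_fp=1_500):
    """Maintenance assignment scope: the function points one programmer can
    keep up and running for a year (≈1,500 average; up to ~5,000 for well
    structured software, below ~500 for badly structured software)."""
    return math.ceil(application_fp / assignment_scope_fp)

# A 10,000-FP legacy system with average structure vs. a well-structured one:
print(maintenance_staff_needed(10_000))         # → 7
print(maintenance_staff_needed(10_000, 5_000))  # → 2
```

The same calculation run with a 500-FP scope shows why badly structured legacy systems need ten times the maintenance staff of well-structured ones.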
[Table: meeting costs — columns for meeting events, number of meetings, attendees, meeting hours, total costs, and $ per function point]
It is easy to see why meetings and communications are an important software cost
driver. However, they are seldom measured or included in benchmark reports even
though they may rank high in total costs.
Metrics Conversion
With two different forms of LOC metrics, more than a dozen variations of
function point metrics, plus story points, use-case points, and RICE objects, one
might think that metrics conversion between various metrics would be sophisti-
cated and supported by both commercial and open-source tools, but this is not
the case. In the author’s view it is the responsibility of a metric inventor to provide
conversion rules between a new metric and older metrics. For example, it is NOT
the responsibility of the IFPUG to waste resources deriving conversion rules for
every minor variation or new flavor of function point. As a courtesy, the author’s
SRM tool does provide conversions between 23 metrics, and this seems to be
the largest number of conversions as of 2016. There are more narrow published
conversions between COSMIC and IFPUG function points. However, metrics
conversion is a very weak link in the chain of software measurement techniques.
Examples of metrics conversion are shown below for an application of a nominal
1,000 IFPUG function points. These are standard outputs from the author’s tool
(Table A.11).
In the author’s opinion software has too many metrics, too many variations of
similar metrics, and a serious shortage of accurate benchmark data based on valid
metrics and activity-based costs.
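A conversion tool of the kind described typically reduces every metric to a common base before converting outward. The sketch below normalizes through IFPUG function points; the ratios are placeholders for illustration only, not the proprietary values in SRM or any published study:

```python
# Hypothetical conversion ratios relative to IFPUG function points.
# Real tools calibrate these from historical data; these numbers are
# invented for illustration.
RATIO_TO_IFPUG = {"ifpug": 1.00, "cosmic": 1.15, "story_points": 2.00}

def convert(size, from_metric, to_metric):
    """Convert a size between metrics by normalizing to IFPUG first."""
    ifpug = size / RATIO_TO_IFPUG[from_metric]   # step 1: normalize
    return ifpug * RATIO_TO_IFPUG[to_metric]     # step 2: convert outward

# A nominal 1,000 IFPUG FP application, expressed in the other units:
print(convert(1_000, "ifpug", "cosmic"))
print(convert(1_000, "ifpug", "story_points"))
```

Normalizing through one hub metric keeps the table linear in the number of metrics; direct pairwise tables would need a ratio for every pair.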
Metrics Education—Academic
Academic training in software metrics is embarrassingly bad. So far as can be deter-
mined from limited samples, not a single academic course mentions that LOC
metrics penalize high-level languages or that cost per defect metrics penalize
quality. The majority of academics probably do not even know these basic facts of
software metrics. The topics universities should teach about software metrics include:
manufacturing economics and the difference between fixed and variable software
costs, activity-based cost analysis, defect potentials and defect removal efficiency,
function point analysis, metrics conversion, comparing unlike software methods,
comparing international software projects, software growth patterns during devel-
opment and after release. They should also teach the hazards of metrics with proven
mathematical and economic flaws such as LOC and cost per defect, both of which
violate standard economic assumptions.
Metrics—Natural
The phrase natural metric refers to a metric that measures something visible and
tangible that can be seen and counted without ambiguity. Examples of natural
metrics for software would include pages of documents, test cases created, test cases
executed, and physical LOC. By contrast synthetic metrics are not visible and not
tangible.
Metrics—Synthetic
The phrase synthetic metric refers to things that are abstract and based on math-
ematics rather than on actual physical phenomena. Examples of synthetic metrics
for software include function point metrics, cyclomatic complexity metrics, logi-
cal code statements, test coverage, and defect density. Both synthetic and natural
metrics are important, but synthetic metrics are more difficult to count. However,
synthetic metrics tend to be very useful for normalization of economic and quality
data, which are difficult to do with natural metrics.
Metrics Validation
Before a metric is released to the outside world and everyday users, it should be
validated under controlled conditions and proven to be effective and be without
harmful consequences. Metrics such as function points and SNAP did undergo
extensive validation. Other metrics were just developed and published
without any validation. Older metrics such as LOC and “cost per defect” have been
in use for more than 50 years without yet being formally studied or validated for
ranges of effectiveness and for harmful consequences.
Monte Carlo Simulation
Monte Carlo methods use repeated random samples to derive probabilities and more general rules. For example, collecting data
on software projects from a sample of 50 commercial banks might provide useful
information on ranges of banking software performance. Doing a similar study for
50 manufacturing companies would provide similar data, and comparing the two
sets would also be insightful. For predictive modeling ranges of inputs would be
defined and then dozens or scores of runs would be made to check the distributions
over the ranges. John von Neumann programmed the ENIAC computer to provide
Monte Carlo simulations so this method is as old as the computer industry. Monte
Carlo simulation is also part of some software estimation tools.
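A Monte Carlo run of the kind used in estimation tools samples each input from a defined range across many trials and then examines the resulting distribution. The ranges below are illustrative assumptions, not calibrated figures:

```python
import random

random.seed(42)  # make the sketch reproducible

def simulate_effort(trials=10_000):
    """Sample application size and productivity from assumed ranges and
    return the median effort over many trials (in work hours)."""
    efforts = []
    for _ in range(trials):
        size_fp = random.uniform(800, 1_200)     # function points
        hours_per_fp = random.uniform(10, 20)    # work hours per FP
        efforts.append(size_fp * hours_per_fp)
    efforts.sort()
    return efforts[trials // 2]

print(round(simulate_effort()))  # median effort, near 15,000 work hours
```

In practice the full sorted list would be kept so that percentiles (e.g., a 90% confidence bound) can be reported, not just the median.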
Morale Metrics
A topic needing more study, but one for which data are difficult to gather, is the
impact of morale on team performance. Many companies such as IBM and Apple
perform morale studies, but these are usually kept internal and not published
outside. Sometimes
interesting correlations do get published. For example, when IBM opened the new
Santa Teresa programming center, designed specifically for software, the morale
studies found that morale was much higher at Santa Teresa than at the nearby San
Jose lab where the programmers had worked before. Productivity and quality were
also high. Of course, these findings do not prove that the new architecture caused
the improvement, but they were interesting. In general, high morale correlates with
high quality and high productivity, and low morale with the opposite. But more study
is needed on this topic because it is an important one for software engineering.
Among the factors known to cause poor morale and even voluntary termination
among software engineers have been the following: (1) poor project management,
(2) forced use of pair programming without the consent of the software personnel,
(3) impossible demands for short schedules by clients or executives, (4) more than
6 hours of unpaid overtime per week for long periods, and (5) arbitrary curve fitting
for appraisals that limits the number of top ratings to a fixed statistical quota.
NAIC Codes
The North American Industry Classification (NAIC) system provides codes for thousands of industries. The full NAIC code is six digits, but for many
benchmarks the two-digit and three-digit versions are useful because they are more
general. Some relevant two-digit NAIC codes for software include: manufacturing
31–33; retail 44–45; information 51; finance 52; professional services 54; education
61. For benchmarks and also for software cost estimation, NAIC codes are useful
to ensure apples to apples comparisons. NAIC codes are free as are a number of tools
for looking up the codes for specific industries.
National Averages
Given the size and economic importance of software one might think that every
industrialized nation would have accurate data on software productivity, qual-
ity, and demographics. This does not seem to exist. There seem to be no effective
national averages for any software topic, and software demographics are suspect
too. Although basic software personnel counts are known fairly well, the Bureau of
Labor Statistics data do not show most of the 126 software occupations. For example, there is
no good data on business analysts, software quality assurance, database analysts,
and scores of other ancillary personnel associated with software development and
maintenance. Creating a national repository of quantified software data would
benefit the United States. It would probably have to be done either by a major
university or by a major nonprofit association such as the ACM, IEEE, PMI, SIM,
or perhaps all of these together. Funding might be provided by major software
companies such as Apple, Microsoft, IBM, Oracle, and the like, all of whom
have quite a bit of money and also large research organizations. Currently the best
data on software productivity and quality tends to come from companies that build
commercial estimation tools, and companies that provide commercial benchmark
services. All of these are fairly small companies. If you look at the combined data
from all 2015 software benchmark groups such as Galorath, International Software
Benchmarking Standards Group (ISBSG), Namcook Analytics, Price Systems,
Q/P Management Group, Quantimetrics, QSM, Reifer Associates, and Software
Productivity Research, the total number of projects is about 80,000. However, all
of these are competitive companies, and with a few exceptions such as the recent
joint study by ISBSG, Namcook, and Reifer, the data are not shared or compared.
Nor are the data always consistent. One would think that a major consulting company
such as Gartner, Accenture, or KPMG would assemble national data from these
smaller sources, but this does not seem to happen. Although it is possible in 2015
to get rough employment and salary data for a small set of software occupation
groups, there is no true national average that encompasses all industries.
Nondisclosure Agreements
When the author and his colleagues from Namcook Analytics LLC collect bench-
mark data from clients, the data are provided under a nondisclosure agreement.
Nonfunctional Requirements
Software requirements come in two flavors: functional requirements, which describe
what the customer wants the software to do, and nonfunctional requirements, which
are needed to make the software work on various platforms or are required by
government mandate.
Consider home construction before considering software. A home built overlooking
the ocean will have windows with a view—this is a functional requirement by the
owners. But due to zoning and insurance demands, homes near the ocean in many
states will need hurricane-proof windows. This is a nonfunctional requirement. See
the discussion of the new SNAP metric later in this report. Typical nonfunctional
requirements are changes to software to allow it to operate on multiple hardware
platforms or operate under multiple operating systems.
Normalization
In software, the term normalization has different meanings in different contexts,
such as database normalization and software project result normalization. In
this chapter the form of normalization of interest is converting raw data to a
fixed metric so that comparisons of different projects are easy to understand.
The function point metric is a good choice for normalization. Both work hours
per function point and defects per function point can show the results of dif-
ferences in application size, differences in application methodology, differences
in CMMI levels, and other topics of interest. However, there is a problem that
is not well-covered in the literature, and for that matter not well-covered by the
function point associations. Application size is not constant. During develop-
ment, software applications grow due to creeping requirements at more than
1% per calendar month. After release, applications continue to grow for as long
as they are being used, at more than 8% per calendar year. This means that both
productivity and quality data need to be renormalized from time to time as
applications grow.
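The growth rates above compound in a straightforward way. A sketch using the text's 1% per month and 8% per year figures:

```python
def size_after_growth(initial_fp, dev_months, years_in_service,
                      monthly_creep=0.01, annual_growth=0.08):
    """Apply ~1%/month requirements creep during development and ~8%/year
    post-release growth (rates taken from the text)."""
    size = initial_fp * (1 + monthly_creep) ** dev_months
    return size * (1 + annual_growth) ** years_in_service

# A nominal 1,000-FP application after 18 months of development
# and 5 years of use has grown by roughly three-quarters:
print(round(size_after_growth(1_000, 18, 5)))
```

Normalized measures such as work hours per function point therefore depend on when the size was measured, which is why renormalization matters.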
Object-Oriented Metrics
Object-oriented (OO) languages and methods have become mainstream
development approaches. For example, all software at Apple uses the Objective-C
programming language. The terminology and concepts of object-oriented
development are distinctive and not the same as those of procedural languages.
However, some standard metrics such as function points and DRE work well with
object-oriented development. In addition, the OO community has developed met-
rics suites that are tailored to the OO approach. These include methods, classes,
inheritance, encapsulation, and some others. Coupling and cohesion are also used
with OO development. This is too complex a topic for a short discussion so a Google
search on object-oriented metrics will bring up interesting topics such as weighted
methods per class and depth of inheritance tree.
Occupation Groups
A study of software demographics in large companies was funded by AT&T and
carried out by the author and his colleagues. Some of the participants in the study
included IBM, the Navy, Texas Instruments, Ford, and other major organizations.
The study found 126 occupations in total, but no company employed more than 50
of them. Among the occupations were Agile coaches, architects, business analysts,
configuration control specialists, designers, estimating specialists, function point
specialists, human factors specialists, programmers or software engineers, project
office specialists, quality assurance specialists, technical writers, and test specialists.
The number of occupation groups increased with both application size and also with
company size. Traditional programming can be less than 30% of the team and less
than 30% of the effort for large applications. The study also found that no human
resource group actually knew how many software occupations were employed or
even how many software personnel were employed. It was necessary to interview
local managers. The study also found that some software personnel refused to be
identified with software due to low status. These were aeronautical or automotive
engineers building embedded software. Very likely government statistics on software
employment are wrong. If corporate HR organizations do not know how many
software people are employed, they cannot report accurate software employment
figures to the government. There is a need for continuing study of this topic. Also
needed are comparisons of productivity and quality between projects staffed with
generalists and similar
projects staffed by specialists.
Although programmers and testers dominate, neither occupation group reaches
even 30% of overall staffing. Needless to say there are wide variations.
Also with a total of 126 known occupation groups, really large systems will have
much greater diversity in occupations than shown here.
Parametric Estimation
The term parametric estimation refers to software cost and quality estimates produced
by one or more commercial software estimation tools such as COCOMO II,
CostXpert, KnowledgePlan, SEER, SLIM, SRM, or TruePrice. Parametric estimates
are derived from the study and analysis of historical data from past projects. As a result
the commercial estimation companies tend to also provide benchmark services. Some
of the parametric estimation companies such as the author’s Namcook Analytics have
data on more than 20,000 projects. A comparison by the author of 50 parametric
estimates and 50 manual estimates by experienced project managers found that both
manual and parametric estimates were close for small projects below 250 function
points. But as application size increased, manual estimates became progressively
more optimistic, whereas parametric estimates stayed within 10% well past 100,000 function
points. For small projects both manual and parametric estimates should be accurate
enough to be useful, but for major systems parametric estimates are a better choice.
Some companies utilize two or more parametric estimation tools and run them all
when dealing with large mission-critical software applications. Convergence of the
estimates by separate parametric estimation tools adds value to major projects.
Pair Programming
Pair programming is an example of a methodology that should have been validated
before it started being used, but was not. The concept of pair programming is that
two programmers take turns coding and navigating using the same computer. Clearly
if personnel salaries are U.S. $100,000 per year and the burden rate is U.S. $50,000
per year, then a pair is going to cost twice as much as one programmer, that is, U.S.
$300,000 per year instead of U.S. $150,000 per year. A set of 10 pairs will cost U.S.
$3,000,000 per year, and return fairly low value. The literature on pair programming
is trivial and only compares unaided pairs against unaided individual programmers
without any reference to static analysis, inspections, or other proven methods of qual-
ity control. Although pair enthusiasts claim knowledge transfer as a virtue, there are
better methods of knowledge transfer including inspections and mentoring. Although
some programmers enjoy pair programming, many do not and several reports discuss
programmers who quit companies specifically to get away from pair programming.
This method should have been evaluated prior to release using a sample of at least
25 pairs compared to 25 individuals, and the experiments should also have compared
pairs and individuals with and without static analysis. The experiments should also
have compared pairs against individuals who used formal inspections. The author’s
data indicates that pairs always cost more, are usually slower, and are not as effective
for quality control, as individual programmers who use inspections and static analy-
sis. An unanswered question in the pair programming literature is this: if pair
programming is good, why not also pair testers, quality assurance, project managers,
business analysts, and the other 125 occupations associated with software?
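The cost arithmetic in this section is simple to verify:

```python
# Figures from the text: $100,000 salary plus a $50,000 burden rate
# gives a fully burdened cost of $150,000 per person per year.
SALARY = 100_000
BURDEN = 50_000
fully_burdened = SALARY + BURDEN     # $150,000 per person

individual_cost = fully_burdened     # one programmer per workstation
pair_cost = 2 * fully_burdened       # two programmers, one workstation
ten_pairs_cost = 10 * pair_cost      # a department of 10 pairs

print(pair_cost, ten_pairs_cost)  # → 300000 3000000
```

The doubled labor cost is the baseline any claimed quality or knowledge-transfer benefit of pairing has to overcome.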
Pareto Analysis
The famous Pareto principle states that 80% of various issues will be caused by 20%
of the possible causes. The name was coined by Joseph Juran in honor of the
Italian economist Vilfredo Pareto, who noted in 1906 that 20% of the pea pods
in his garden produced 80% of the peas. Pareto analysis is much more than
the 80/20 rule and includes sophisticated methods for analyzing complex problems
with many variables. Pareto distributions are frequently noted in software, such as
the discovery of error-prone modules (EPMs), and a Microsoft study showed that
fixing 20% of bugs would eliminate 80% of system crashes. Some of the areas where
Pareto analysis seems to
show up include: (1) a minority of personnel seem to produce the majority of effec-
tive work and (2) in any industry a minority of companies are ranked best to work for
by annual surveys. Pareto diagrams are often used in software for things like analyz-
ing customer help requests and bug reports.
[Figure: Pareto chart of customer complaints — bars sorted in descending order (parking difficult, sales rep was rude, poor lighting, layout confusing, sizes limited, clothing faded, clothing shrank), with a cumulative-percentage line and an 80% line separating the significant few from the insignificant many]
Note that Pareto charts are useful for showing several kinds of data visually at the
same time.
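The "significant few" in a Pareto analysis can be identified by sorting the categories and accumulating counts until 80% of the total is reached. The complaint counts below are invented for illustration:

```python
# Hypothetical complaint counts for the categories shown in the chart.
complaints = {"parking difficult": 112, "sales rep was rude": 82,
              "poor lighting": 32, "layout confusing": 22,
              "sizes limited": 16, "clothing faded": 10,
              "clothing shrank": 6}

total = sum(complaints.values())
cumulative = 0
significant_few = []
# Walk categories in descending order until 80% of complaints are covered.
for category, count in sorted(complaints.items(), key=lambda kv: -kv[1]):
    cumulative += count
    significant_few.append(category)
    if cumulative / total >= 0.80:
        break

print(significant_few)
```

With these counts, the top three of seven categories account for just over 80% of all complaints, which is where corrective effort should be concentrated first.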
Pattern Matching
Patterns have become an important topic in software engineering and will become
even more important as reuse enters the mainstream. Today in 2014 design patterns
and code patterns are both fairly well-known and widely used. Patterns are also use-
ful in measurement and estimation. For example, the author’s patent-pending early
sizing method is based on patterns of historical projects that match the taxonomy
of the new application that is being sized. Patterns need to be organized using stan-
dard taxonomies of application nature, scope, class, and type. Patterns are also used
by hundreds of other industries. For example, the Zillow database of real estate
and the Kelley Blue Book of used cars are both based on pattern matching.
Performance Metrics
For the most part this chapter deals with metrics for software development and main-
tenance. But software operating speed is also important, as is hardware operating
speed. There are dozens of performance metrics and performance evaluation meth-
ods. A Google search on the phrase software performance metrics is recommended.
Among these metrics are load, stress, data throughput, capacity, and many others.
Phase Metrics
The term phase refers to a discrete set of tasks and activities that center on produc-
ing a major deliverable such as requirements. For software projects there is some
ambiguity in phase terms and concepts, but a typical pattern of software phases
would include: (1) requirements, (2) design, (3) coding or construction, (4) testing,
and (5) deployment. Several commercial estimation tools predict software costs and
schedules by phase. However, there are major weaknesses with the phase concept.
Among these weaknesses is the fact that many activities such as technical documen-
tation, quality assurance, and project management span multiple phases. Another
weakness is the implicit assumption of a waterfall development method, so that
phases are not a good choice for Agile projects. Activity-based cost analysis is a better
and more accurate alternative to phases for planning and estimating software.
Portfolio Metrics
The term portfolio in a software context refers to the total collection of software
owned and operated by a corporation or a government unit. The portfolio would
include custom developed software, commercial software packages, and open-
source software packages. In today’s world of 2015 it will also include cloud appli-
cations that companies use but do not have installed on their own computers, such
as Google documents. Function point metrics are a good choice for portfolios.
LOC metrics might be used but with thousands of applications coded in hundreds
of languages, LOC is not an optimal choice. In today’s world of 2015, a Fortune
500 company can easily own more than 5,000 software applications with an aggre-
gate size approaching 10,000,000 function points. Very few companies know how
large their portfolios are. Shown below in Table A.12 is a sample portfolio for a
manufacturing company with a total of 250,000 employees.
As can be seen from Table A.12, corporate portfolios comprise thousands of
applications and millions of function points.
Productivity
The standard economic definition of productivity is goods or services produced per
unit of labor or expense. The software industry has not yet settled on a standard
unit for the goods or services part of this definition. Among the units used for
goods or services are function points, LOC, story points, RICE objects, and
use-case points. Of these, only function points can be applied
to every activity and every kind of software developed by all known methodolo-
gies. As of 2014, function point metrics are the best choice for software goods and
services and therefore for measuring economic productivity. However, the software
literature includes more than a dozen others such as several flavors of LOC metrics,
story point, use-case points, velocity, and so on. So far as can be determined no
other industry besides software has such a plethora of bad choices for measuring
economic productivity.
Production Rate
This metric is often paired with assignment scope to create software cost and schedule
estimates for specific activities. The production rate is the amount of work a person
can complete in a fixed time period such as an hour, a week, or a month. Using the
simple natural metric of pages in a user’s guide assigned to a technical writer, the
assignment scope might be 50 pages and the production rate might be 25 pages per
month. This combination would lead to an estimate of one writer and two calendar
months. Production rates can be calculated using any metric for a deliverable item,
such as pages, source code, function points, story points, and so on.
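The assignment scope and production rate calculation described above can be sketched in a few lines of Python. The function name and the use of rounding up to whole people are illustrative assumptions, not conventions from this book.

```python
import math

def estimate_staff_and_schedule(deliverable_size, assignment_scope, production_rate):
    """Estimate staffing and schedule from assignment scope and production rate.

    deliverable_size: total units to produce (pages, function points, etc.)
    assignment_scope: units one person is responsible for
    production_rate: units one person completes per calendar month
    """
    staff = math.ceil(deliverable_size / assignment_scope)   # whole people
    months = deliverable_size / (staff * production_rate)    # calendar months
    return staff, months

# The example above: a 50-page user's guide, assignment scope of 50 pages,
# production rate of 25 pages per month -> one writer for two calendar months.
staff, months = estimate_staff_and_schedule(50, 50, 25)
```

The same two inputs work for any deliverable metric, which is why parametric estimation tools pair them activity by activity.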
Professional Malpractice
As software is not a licensed profession it cannot actually have professional mal-
practice in 2016. Yet several metrics in this report are cited as being professional
malpractice in specific contexts. The definition of professional malpractice is an
instance of incompetence or negligence on the part of a professional. A corollary to
this definition is that academic training in the profession should have provided
all professionals with sufficient information to avoid most malpractice situations.
As of 2016, software academic training is inadequate to warn software engineers
and software managers of the hazards of bad metrics. The metric LOC is viewed as
professional malpractice in the specific context of attempting (1) economic analysis
across multiple programming languages and (2) economic analysis that includes
requirements, design, and noncode work. LOC metrics would not be malpractice
if used for studying pure coding speed or for studying code defects in specific languages.
The metric cost per defect is viewed as professional malpractice in the context of
(1) exploring the economic value of quality and (2) comparing a sequence of defect
removal operations for the same project. Cost per defect would not be a malpractice
if fixed costs were backed out or for comparing identical defect removal activi-
ties such as unit test across several projects. LOC metrics make requirements and
design invisible and penalize modern high-level languages. Cost per defect makes
the buggiest software look cheapest and ignores the true value of quality in shorten-
ing schedules and lowering costs.
Profit Center
A profit center is a corporate group or organization whose work contributes to the
income and profits of the company. The opposite case would be a cost center where
money is consumed but the work does not bring in revenues. For internal soft-
ware that companies build for their own use, some companies use the cost-center
approach, and some use the profit-center approach. Cost-center software is provided
to internal clients for free, and funding comes from some kind of corporate account.
Profit-center software would charge internal users for the labor and materials needed
to construct custom software. In general, measures and metrics are better under the
profit center model because without good data there is no way to bill the clients.
As a general rule for 2015 about 60% of internal software groups are run using the
cost-center model and 40% are run using the profit-center model. For commercial
software, development is clearly a profit-center model. For embedded software in
medical devices or automotive engines the software is part of a hardware product
and usually not sold separately. However, it still might be developed under a profit-center model, although not always. Overall, profit centers tend to be somewhat more efficient
and cost effective than cost centers. This topic should be included in standard bench-
mark reports, but is actually somewhat difficult to find in the software literature.
Project-Level Metrics
Probably the most common form of benchmark in the world is an overall result
for a software project without any granularity or internal information about
activities and tasks. For example, a typical project-level benchmark for an application of 1,000 function points might be that it required 15 work hours per
function point, had a schedule of 15 calendar months, and a cost of U.S. $1,200
per function point. The problem with this high-level view is that there is no
way to validate it. Did the project include project management? Did the project
include unpaid overtime? Did the project include part-time workers such as qual-
ity assurance and technical writers? There is no way of being sure of what really
happens with project-level metrics. See the discussion of activity-based costs ear-
lier in this report.
Not everybody answers these questions the same way, and
there are no agreed-upon rules or standards for defining a software project's start date.
Quality
There are many competing definitions for software quality, including some like
conformance to requirements that clearly do not work well. Others such as maintain-
ability and reliability are somewhat ambiguous and only partial definitions. The
definition used by the author is the absence of defects that would cause a software
application to either stop completely or to produce incorrect results. This definition has
the virtue of being able to be used with requirements and design defects as well as
code defects. As requirements are often buggy and filled with errors, these defects
need to be included in a working definition for software quality. Defects also cor-
relate with customer satisfaction, in that as bugs go up satisfaction comes down.
IT projects: 10.04
Commercial: 9.12
Systems/embedded: 7.11
Military/defense: 5.12
Note that there are large variations by application size and also large variations
by application type. There are also large variations by country, although interna-
tional data are not shown here. Japan and India, for example, would be better than
the United States. Also note that other benchmark providers might have data with
different results from the data shown here. This is normally because
benchmark companies have unique sets of clients, so the samples are almost
always different. Also, there is little coordination or cooperation among various
benchmark groups, although the author, ISBSG, and Don Reifer did produce a
report on project size with data from all three organizations.
Quality data also tend to leak, omitting defects found in
requirements and design, defects found by static analysis, and defects found by
desk checking and unit test. Even delivered defect counts leak because if too many bugs
are released, usage will drop and hence latent bugs will remain latent and not be
discovered.
From the author’s collection of about 26,000 projects following are average
approximate values for software quality. Here too other benchmark sources will
vary (Table A.14).
As can be seen from the above table there are variations by application size and
also variations by application type. For national average purposes, the value shown
by type is more meaningful than size, because there are very few applications larger
than 10,000 function points, so these large sizes distort average values. In other
words, defect potentials average about 4.94, whereas defect removal averages about
Size (function points):
1 FP: defect potential 1.50 per FP; DRE 96.93%; delivered defects 0.05 per FP
Type:
Domestic outsource: defect potential 4.32 per FP; DRE 94.50%; delivered defects 0.24 per FP
93.78% and delivered defects average about 0.30 circa 2016 if the view is cross
industry. Overall ranges of defect potentials run from about 1.25 per function
point to about 7.50 per function point. Ranges of defect removal run from 99.65%
to a low of less than 77.00%. Of course, averages and ranges are both variable fac-
tors and change based on the size and type of software projects used in the samples
for calculating averages.
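The arithmetic linking defect potentials, defect removal efficiency, and delivered defects can be checked directly. This is a sketch of the standard calculation implied by the averages above, not code from the book.

```python
def delivered_defects_per_fp(defect_potential, dre_percent):
    """Delivered defects per function point = potential * (1 - DRE)."""
    return defect_potential * (1.0 - dre_percent / 100.0)

# Cross-industry averages cited in the text: a potential of 4.94 defects
# per FP and DRE of 93.78% yield roughly 0.30 delivered defects per FP.
delivered = delivered_defects_per_fp(4.94, 93.78)
```

The same one-liner shows why DRE matters so much: moving DRE from 93.78% to 99% cuts delivered defects by a factor of about six at the same defect potential.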
Assuming an application size of 1,000 function points, these rules of thumb generate
the following schedule durations in calendar months:
As can easily be seen, the differences between experts and novices translate
into significant schedule differences, and would also lead to differences in effort,
costs, and quality.
Rayleigh Curve
Lord Rayleigh was an English physicist who won a Nobel Prize in 1904 for the discov-
ery of Argon gas. He also developed a family of curves that showed the distribution
of results for several variables. Larry Putnam and Peter Norden adopted this
curve as a method of describing software staffing, effort, and
schedules. The curves for software are known as Putnam–Norden–Rayleigh curves or
PNR. A Google search for this term will show many different articles. In general, the
curves are a good approximation for software staffing over time. The PNR curves,
and other forms of Rayleigh curves, assume smooth progress. For software this is not
always the case. There are often severe discontinuities in the real world caused by
creeping requirements, canceled projects, deferred features, or other abrupt changes.
For example, about 32% of large systems above 10,000 function points are canceled
without being completed, which truncates PNR curves. For smaller projects with
better odds of success, the curves fit better. Larry Putnam was the original
developer of the SLIM estimation tool, which supports the family of curves, as do
other tools as well. See also Chaos theory earlier in this chapter for a discussion of
discontinuities and random events.
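A minimal sketch of the PNR staffing curve follows. The parameterization used here (total effort and time of peak staffing) is one common form of the Rayleigh curve and is an assumption for illustration, not a formula taken from this section.

```python
import math

def pnr_staffing(t, total_effort, t_peak):
    """Putnam-Norden-Rayleigh staffing level at time t.

    total_effort: area under the curve (e.g., total person-months)
    t_peak: time at which staffing reaches its maximum
    """
    return (total_effort / t_peak ** 2) * t * math.exp(-t ** 2 / (2 * t_peak ** 2))

# Staffing builds smoothly to a peak at t_peak and then tails off; real
# projects, as noted above, often show discontinuities instead.
```

Plotting this function for a range of t values reproduces the familiar smooth build-up and long tail that estimation tools such as SLIM assume.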
Reliability Metrics
Software reliability refers, in general, to how long software can operate success-
fully without encountering a bug or crashing. Reliability is often expressed using
mean time to failure (MTTF) and mean time between failures (MTBF). Studies
at IBM found that reliability correlated strongly with numbers of released defects
and DRE. High reliability or encountering a bug or failure less than once per year
normally demands DRE levels of >99% and delivered defect densities of <0.001
per function point.
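MTBF over an observation window is a simple ratio; the sketch below shows the basic calculation, with the one-year example chosen to match the reliability threshold mentioned above.

```python
def mtbf(operating_hours, failures):
    """Mean time between failures over an observation window."""
    if failures == 0:
        return float("inf")  # no failures observed in the window
    return operating_hours / failures

# One failure in a year of continuous operation (8,760 hours) gives an
# MTBF of 8,760 hours -- roughly the "bug or failure less than once per
# year" level associated in the text with DRE above 99%.
annual_mtbf = mtbf(8760, 1)
```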
Requirement
There is some ambiguity in exactly what constitutes a requirement. In general, a
requirement is a description of a specific feature that clients want software to per-
form. A requirement for a word-processing software package might include auto-
matic spell checking. Requirements can be expressed in terms of use-cases, user
stories, text, mathematical formulae, or combinations of methods. No matter how
requirements are expressed, they are known to have several attributes that cause
problems for software projects. These attributes include: (1) errors in requirements,
(2) toxic requirements that are harmful to software (such as Y2k), and (3) incom-
pleteness that leads to continuous requirements growth. There are also deeper
and more subtle problems such as the lack of an effective taxonomy that can put
requirements into a well-formed hierarchy. All software needs to accept inputs,
perform various calculations, and produce results. But this general statement needs
to be expanded into a formal taxonomy that would encompass error checking, user
error avoidance, and many others. From comparisons of explicit requirements and
function points, an average requirement takes about 3.0 function points to imple-
ment, with a range between 0.5 and 20. The IFPUG organization has developed
a method of dividing requirements into functional requirements and nonfunc-
tional requirements that is supported by a new metric called SNAP for software
nonfunctional assessment process. Functional requirements are things users want.
Nonfunctional requirements are things like government mandates that have to be
included whether users want them or not. Although this distinction makes sense,
the great bulk of software requirements documents do not distinguish between
these two categories as of 2016.
Requirements Creep
Requirements have been known to change during development for more than
60 years. It was only after function points were released in 1978 that requirements
creep could be measured explicitly. A sample of software projects were sized using
function points at the end of the requirements phase. Later the same projects were
resized at the point of delivery to customers. As both starting and ending sizes were
known and the calendar month schedules were known, this allowed researchers at
IBM and elsewhere to measure requirements creep exactly. Assume that an appli-
cation was measured at the end of requirements at 1,000 function points. Assume
that the same application was measured 12 months later at release and was found
to be 1,200 function points in size. This is an average monthly growth rate of
1.67%. The total growth or creep was 200 function points or about 16.66 function
points per month. The additional 200 function points show a total growth of 20%.
Growth does not stop with delivery, but continues forever as long as the software
is in active use. Postrelease growth is slower at about 8% per calendar year. As of
2016, there is little or no data on SNAP nonfunctional requirements growth and
change over time. Indeed, the IFPUG SNAP committee has not yet addressed this
topic, which is a major omission.
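The creep arithmetic in the example above is easy to reproduce. The function below is a sketch of that calculation; the simple (non-compounded) monthly percentage matches the way the text computes its 1.67% figure.

```python
def creep_stats(size_at_requirements_end, size_at_delivery, schedule_months):
    """Requirements creep from sizes measured at two points in time."""
    growth_fp = size_at_delivery - size_at_requirements_end
    total_growth_pct = 100.0 * growth_fp / size_at_requirements_end
    monthly_fp = growth_fp / schedule_months
    monthly_pct = total_growth_pct / schedule_months  # simple, not compounded
    return growth_fp, total_growth_pct, monthly_fp, monthly_pct

# The example above: 1,000 FP at the end of requirements, 1,200 FP at
# delivery 12 months later -> 200 FP of creep (20%), about 16.7 FP and
# 1.67% per month.
stats = creep_stats(1000, 1200, 12)
```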
Requirements Metrics
It is a bad assumption to believe that user requirements are error-free. User require-
ments contain many errors and some requirements may be toxic and should not be
in the application at all; Y2K is an example of a toxic requirement. The essential
metrics for software requirements include but are not limited to (1) requirements
size in pages, words, and diagrams, (2) requirements errors found by inspection,
(3) possibly toxic requirements pointed out to users by domain experts, (4) rates
of requirements growth or change during development, (5) requirements deferred
to future releases in order to achieve arbitrary schedule targets, and (6) whether
requirements are functional or nonfunctional. As the author’s estimating SRM tool
has an early sizing feature that allows it to be used prior to requirements, it predicts
all six of these essential requirements metrics. An example for an application of
10,000 function points at delivery using the RUP would be: starting size = 8,023
function points; creep = 1,977 function points; monthly rate of creep = 1.82%;
total creep = 19.77%; requirements defects = 1,146; toxic requirements = 27;
requirements completeness = 73.68%; explicit requirements = 2,512; function
points per requirement = 3.37; SNAP points = 1,250 with a growth of 150, leading to
a total of 1,400 SNAP points. All of these predictions can be made before requirements analysis starts by using pattern matching from similar completed projects.
Requirements are volatile and also error prone. In the future formal patterns of
reusable requirements will no doubt smooth out current problems and provide bet-
ter overall requirements than are common in 2016.
Return on Investment
A Google search on the phrase return on investment (ROI) will bring up hundreds
of articles and over a dozen flavors or ROI including return on assets, financial rate
of return, economic rate of return, and a number of others. For software projects,
ROI is often not done at all and is seldom done well. What is needed for software
ROI are accurate predictions of project schedules and costs prior to starting, and
accurate measures of schedules and costs after completion. Quality also needs to be
predicted and measured because poor quality will puff up maintenance costs and
warranty repair costs to alarming levels, and may also trigger expensive litigation
due to consequential damages. It is technically possible to predict schedules and
costs with good accuracy using any of the available parametric estimation tools,
with the caveat that they cannot be used until requirements are known. The SRM
tool includes a patent-pending early sizing feature that allows it to predict costs and
schedules prior to requirements. The SRM tool also includes ROI as a standard
output, assuming that the client who commissioned the estimate can provide value
data for tangible and intangible value. SRM compares costs to value to calculate
ROI, but it does not predict value. That must be user-supplied information because
value can range from almost nothing to creating an entirely new business that will
earn billions of dollars, as shown by Microsoft, Facebook, and Twitter. The essen-
tial problems with ROI calculations circa 2014 include: (1) optimistic estimates for
costs and schedules, (2) optimistic quality estimates, (3) leaky historical data,
(4) failure to include requirements creep in schedule and cost estimates, (5) very
poor tracking of progress, (6) poor quality control that leads to delays and cost
overruns, and (7) optimistic revenue or value predictions.
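Among the many flavors of ROI mentioned above, the simplest compares net gain to cost. The sketch below shows only that basic form with hypothetical dollar figures; it is not the SRM tool's calculation.

```python
def simple_roi(total_value, total_cost):
    """Basic return on investment: net gain divided by cost."""
    return (total_value - total_cost) / total_cost

# Hypothetical example: a project costing $2M that returns $5M in tangible
# and intangible value has a simple ROI of 1.5.
roi = simple_roi(5_000_000, 2_000_000)
```

Note that this depends entirely on the user-supplied value figure, which, as the text says, a tool can compare against cost but cannot predict.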
Reusable Materials
As custom design and manual coding of software is intrinsically expensive and
error-prone, there is a need to move away from custom development and move
toward construction from certified standard reusable components. However,
reuse covers much more than just source code. The full suite of reusable artifacts
includes but is not limited to reusable (1) architecture, (2) design, (3) requirements,
(4) plans, (5) estimates, (6) data structures, (7) source code, (8) test plans, (9) test
cases, (10) test scripts, and (11) user documentation. Currently software is built
rather like an America’s Cup yacht or a Formula 1 race car, using custom designs
and extensive manual labor. In the future, software might be constructed like regu-
lar automobiles such as Fords or Toyotas, using assembly lines of reusable materials
and perhaps even robots. Neither team experience, methodologies, nor program-
ming languages have as much impact on software productivity rates as does reuse
of certified components.
RICE Objects
ERP companies such as SAP and Oracle use the phrase RICE objects as a work met-
ric. The acronym RICE stands for reports, interfaces, conversions, and enhance-
ments. These are some of the activities associated with deploying ERP and building
and customizing applications to work with ERP packages.
Risk Metrics
Software projects have a total of about 210 possible risk factors. Among these are
outright cancellation, schedule delays, cost overruns, breach of contract litiga-
tion, patent litigation, cyber attacks, and many more. The risk analysis engine of
the author’s SRM tool predicts 20 of the 210 risks and assigns each risk a prob-
ability percent based on historical data derived from similar projects of the same
size and type. For example, risk of breach of contract litigation ranges from 0%
for in-house projects to about 15% for large contract waterfall projects with inex-
perienced personnel. Risk severities are also predicted using a scale from 1 to 10
with the lower numbers being less serious. Risk avoidance probabilities are also
calculated based on weighted combinations of CMMI levels, methodologies, and
team experience levels. The worst case would be cowboy development at CMMI 1
by a team of novices. The best case would be TSP or RUP at CMMI 5 by a team
of experienced personnel. These risk predictions and metrics are standard features
of the SRM tool. The normal way of presenting risks resembles the chart as given
in Table A.15.
Risks vary by size, complexity, experience, CMMI levels, and other factors that
are unique to individual projects.
Poor quality and defect measures (omits >10% of bugs): probability 28.00%; severity 7.00
Feature bloat and useless features (>10% not used): probability 22.00%; severity 5.00
Average risks for this size and type of project: probability 18.44%; severity 8.27
Root-Cause Analysis
The phrase root-cause analysis refers to a variable set of methods and statistical
approaches that attempt to find out why specific problems occurred. Root-cause
analysis or RCA is usually aimed at serious problems that can cause harm or large
costs if not abated. RCA is not only used for software but is widely used by many
high-technology industries and also by medicine and military researchers. As an example
of software RCA, a specific high-severity bug in a software application might have
slipped through testing because no test case looked for the symptoms of the bug.
A first-level issue might have been that project managers arbitrarily shortened test
case design periods. Another cause might be that test personnel did not use formal
test design methods based on mathematics such as design of experiments. Further,
testing might have been performed by untrained developers rather than by certified
test personnel. The idea of RCA is to work backward from a specific problem and
identify as many layers of causes as can be proven to exist. RCA is expensive, but
some tools are available from commercial and open-source vendors. See also failure
mode and effects analysis (FMEA) discussed earlier in this report.
Sample Sizes
An interesting question is what kinds of sample sizes are needed to judge software
productivity and quality levels? Probably the minimum sample would be 20 projects
of the same size, class, and type. As the permutations of size, class, and type total
more than 2,000,000 instances, a lot of data are needed to understand the key
variables that impact software project results. To judge national productivity and
quality levels about 10,000 projects per country would be useful. As software is a
major industry in more than 100 countries, the global sample size for the overall
software industry should include about 1,000,000 projects. As of 2014 the sum
total of all known software benchmarks is only about 80,000 software projects.
See the discussion of taxonomies later in this report.
Schedule Compression
Software schedules routinely run later than planned. Analysis by the author of more
than 500 projects found that average schedule demands by clients or senior manag-
ers approximated raising application size in function points to the 0.3 power. Actual
delivery dates for the same projects had exponents ranging from the 0.37 to 0.41
power. For a generic application of 1,000 function points clients wanted the software
in 8 calendar months and it took between 12 and 17 months to actually deliver it.
This brings up two endemic problems for the software industry: (1) software clients
and executives consistently demand schedules shorter than those in which it is
possible to build software and (2) software construction methods need to switch
from custom development to using larger volumes of standard reusable components
in order to shorten development schedules.
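The size-to-a-power rule described above is easy to reproduce. The exponents come from the text; the function itself is just an illustration.

```python
def schedule_months(function_points, exponent):
    """Approximate schedule as application size raised to a power."""
    return function_points ** exponent

# For a generic 1,000 FP application:
demanded = schedule_months(1000, 0.30)     # what clients ask for: ~7.9 months
actual_low = schedule_months(1000, 0.37)   # ~12.9 months
actual_high = schedule_months(1000, 0.41)  # ~17.0 months
```

Raising 1,000 to the 0.30 power gives about 8 months, while the 0.37 to 0.41 exponents give roughly 13 to 17 months, matching the demanded-versus-actual gap described above.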
Schedule Overlap
The term schedule overlap defines the normal practice of starting an activity before
a prior activity is completed. See the discussion on Gantt chart for a visual represen-
tation of schedule overlap. Normally for projects, design starts when requirements
are about 75% complete, coding starts when design is about 50% complete, and
testing starts when coding is about 25% complete. This means that the net schedule
of a software project from beginning to end is shorter than the sum of the activ-
ity schedules. Parametric estimation tools and also project management tools that
support PERT and Gantt charts all handle schedule overlaps, which are normal
for software projects. Schedule overlap is best handled using activity-based cost
analysis or task-based cost analysis. Agile projects with a dozen or more sprints are
a special case for schedule overlap calculations.
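The overlap percentages above can be turned into a net schedule calculation. The sketch below uses hypothetical activity durations; only the start fractions (75%, 50%, 25%) come from the text.

```python
def net_schedule(durations, start_fractions):
    """Net schedule when each activity starts before its predecessor finishes.

    durations[i]: duration of activity i in months
    start_fractions[i]: fraction of activity i-1 that must be complete
                        before activity i starts (index 0 is ignored)
    """
    start, net_end = 0.0, durations[0]
    for i in range(1, len(durations)):
        start += start_fractions[i] * durations[i - 1]
        net_end = max(net_end, start + durations[i])
    return net_end

# Hypothetical activities: requirements 4 mo, design 4 mo (starts at 75%
# of requirements), coding 6 mo (starts at 50% of design), testing 5 mo
# (starts at 25% of coding). Net schedule: 11.5 months versus 19 months
# if the activities ran strictly in sequence.
net = net_schedule([4, 4, 6, 5], [0, 0.75, 0.50, 0.25])
```

This is why the net schedule of an overlapped project is shorter than the sum of its activity schedules, as the text states.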
Schedule Slip
As discussed earlier in the section on schedule compression, users routinely demand
delivery dates for software projects that are quicker than technically possible.
However, schedule slip is not quite the same. Assume that a project is initially
scheduled for 18 calendar months. At about month 16 the project manager reports
that more time is needed and the schedule will be 20 months. At about month 19
the project manager reports that more time is needed and the schedule will be 22
months. At about month 21 the manager reports that more time is needed and the
schedule will be 24 months. In other words, schedule slip is the cumulative sequence
of small schedule delays usually reported only a short time before the nominal
schedule is due. This is an endemic problem for large software projects. The root
causes are inept scheduling before the project starts, requirements creep during development, and poor quality control that stretches out testing schedules. It should be
noted that most software projects seem to be on time and even early until testing,
at which point they are found to have so many bugs that the planned test schedules
double or triple.
Scope
The word scope in a software context is synonymous with size and is measured
using function points, story points, LOC, or other common metrics. Scope creep is
another common term that is synonymous with requirements creep.
Security Metrics
In today’s world of hacking, denial of service, and cyber attacks companies are
beginning to record attempts at penetrating firewalls and other defenses. Also
measured are the strength of various encryption schemes for data and confidential
information, and the strength of passwords. This is a complex topic that
changes rapidly, so a Google search on security metrics is recommended for staying
current. After software is released and actually experiences attacks, data should be
kept on the specifics of each attack, on the staffing, costs, and schedules for
recovery, and on financial losses to both companies and individuals.
Size Adjustment
Many of the tables and graphs in this report, and others by the same author, show
data expressed in even powers of 10, that is, 100 function points, 1,000 function
points, 10,000 function points, and so on. This is not because the projects were all
even values. The author has a proprietary tool that converts application size to even
values. For example, if several PBX switches range from a low of 1,250 function
points to a high of 1,750 function points, they could all be expressed at a median
value of 1,500 function points. The reason for this is to highlight the impact of
specific factors such as methodologies, experience levels, and CMMI levels. Size
adjustment is a subtle issue and includes adjusting defect potentials and require-
ment creep. In other words, size adjustments are not just adding or subtracting
function points and keeping all other data at the same ratios as the original. For
example, if software size in function points is doubled, defect potentials will go up
by more than 100% and DRE will decline.
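A sketch of the superlinear adjustment follows. The 1.2 exponent is an illustrative assumption chosen only to show total defects growing by more than 100% when size doubles, as the text claims; it is not a figure taken from this passage.

```python
def adjust_defect_potential(base_size, base_potential_per_fp, new_size, exponent=1.2):
    """Scale per-FP defect potential when normalizing an application to a new size.

    The exponent (assumed 1.2 here) makes total defect potential grow
    faster than size, so per-FP potential rises as size rises.
    """
    base_total = base_potential_per_fp * base_size
    new_total = base_total * (new_size / base_size) ** exponent
    return new_total / new_size  # new potential per function point

# Doubling size from 1,000 to 2,000 FP raises the per-FP potential, so the
# total defect count more than doubles.
adjusted = adjust_defect_potential(1000, 4.0, 2000)
```

This is the sense in which size adjustment is "not just adding or subtracting function points": the quality ratios must be rescaled too.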
SNAP Metrics
Function point metrics were developed to measure the size of software features that
benefit users of the software. But there are many features in software that do not
benefit users but are still required due to technical or legal constraints. This new
metric, distinct from function points, is termed software nonfunctional assessment
process or SNAP. As an example of a nonfunctional requirement, consider home con-
struction. A home owner with an ocean view will prefer windows facing toward the
ocean, which is a functional requirement. However, local zoning codes and insur-
ance regulations mandate that windows close to the ocean must be hurricane proof,
which is very expensive. This is a nonfunctional requirement. Function points and
SNAP metrics are calculated separately. However, data from the author's clients who have
tried SNAP suggest that SNAP points approximate 15%–20% of the volume of function points. Because SNAP is new and only slowly being deployed, the counting method may change and additional data may become available in the future. Some examples
of software SNAP might include security features and special features so the software
can operate on multiple hardware platforms or multiple operating systems. As this
report is being drafted, an announcement in March of 2014 indicates that IFPUG
and Galorath Associates are going to perform joint studies on SNAP metrics.
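The 15%–20% ratio reported by the author's clients gives a quick way to bracket SNAP size from function points. This is a back-of-the-envelope sketch of that ratio, not an IFPUG counting rule, and the ratio itself may shift as SNAP matures.

```python
def snap_range(function_points, low=0.15, high=0.20):
    """Bracket SNAP points as 15%-20% of function point size."""
    return function_points * low, function_points * high

# A 10,000 FP application would be expected to carry roughly
# 1,500 to 2,000 SNAP points under this ratio.
lo, hi = snap_range(10000)
```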
Software Quality Assurance (SQA)
SQA opinions about software quality need to be independent and objective, and not watered down based on threats of reprisals by project managers
in case of a negative opinion. The SQA groups in major companies collect quality
data and also provide quality training. The SQA personnel also participate in formal
inspections, often as moderators. In terms of staffing, SQA organizations are typi-
cally about 3% of the development team size, although there are ranges. The IBM
SQA organizations also had a true research and development function over and above
normal project status reporting. For example, while working in an IBM quality assur-
ance group, the author performed research on software metrics pros and cons, and
also designed IBM’s first parametric software estimation tool in 1973. Formal SQA
organizations are not testing groups, although some companies call testing groups
by this name. Testing groups usually report to development management, whereas
SQA groups report through a separate organization up to a VP of Quality. One of
the more famous VP’s of Quality was Phil Crosby of ITT, whole book Quality Is Free
(1979) remains a best-seller even in 2014. The author also worked for ITT and was
the software representative to the ITT corporate quality council.
[Figure: metrics of interest to corporate groups such as manufacturing and purchasing, including sizing, size, productivity, replacement cost, quality, schedules, and costs.]
The largest differences in tool use between laggards and leaders are for project
management and quality assurance. Laggards and leaders use similar tool suites for
development, but the leaders use more than twice as many tools for management
and quality assurance tasks than do the laggards.
Sprint
The term sprint is an interesting Agile concept and term. In some Agile projects,
overall features are divided into sets that can be built and delivered separately, often
in a short time period of six weeks to two months. These subsets of overall applica-
tion functionality are called sprints. The use of this term is derived from racing and
implies a short distance rather than a marathon. The sprint concept works well for
projects below 1,000 function points, but begins to encounter logistical problems at
about 5,000 function points. For really large systems >10,000 function points there
would be hundreds of sprints and there are no current technologies for decompos-
ing really large applications into small sets of independent features that fit the sprint
concept.
Staffing Level
In the early days of software the term staffing level meant the number of pro-
grammers it might take to build an application, with ranges from 1 to perhaps 5.
In today’s world of 2014 with a total of 126 occupation groups this term has
become much more complicated. Parametric estimation tools such as SRM and
also project management tools such as Microsoft Project can predict the number
of people needed to build software. SRM predicts a standard set of 20 occupation
groups including business analysts, architects, programmers, test personnel, qual-
ity assurance, technical writers, and managers. Staffing is not constant for most
occupations, but rises and falls as work is completed. Staffing levels by occupation
include average numbers of personnel and peak numbers of personnel. See also
the Rayleigh curve discussion earlier in this book. The staffing profile for a major
system of 25,000 function points is shown below and is a standard output from the
author's SRM tool (Table A.17).
As can be seen, software is a multidisciplinary team activity with many different
occupation groups and special skills.
    Occupation Group            Average Staff    Peak Staff
 1  Programmers                      94             141
 2  Testers                          83             125
 3  Designers                        37              61
 4  Business analysts                37              57
 5  Technical writers                16              23
 6  Quality assurance                14              22
 8  Database administration           8              11
10  Administrative support            8              10
11  Configuration control             5               6
12  Project librarians                4               5
14  Estimating specialists            3               4
15  Architects                        2               3
16  Security specialists              1               2
17  Performance specialists           1               2
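The Rayleigh-style staffing pattern referred to above can be sketched numerically. The sketch below is a minimal illustration of the Putnam–Norden–Rayleigh staffing curve, not the SRM algorithm; the 480 staff-month total and month-12 peak are invented for illustration.

```python
import math

def rayleigh_staff(t, total_effort, t_peak):
    """Staff level at time t on a Putnam-Norden-Rayleigh curve.

    total_effort: total effort (e.g., staff-months); t_peak: month of
    peak staffing. Illustrative only -- not the author's SRM algorithm.
    """
    return (total_effort / t_peak**2) * t * math.exp(-t**2 / (2 * t_peak**2))

# Month-by-month staffing for a hypothetical 480 staff-month project
# whose staffing peaks at month 12.
profile = [rayleigh_staff(m, 480, 12) for m in range(1, 37)]
peak_month = profile.index(max(profile)) + 1  # -> 12
```

Note how peak staffing substantially exceeds average staffing, which is why SRM reports both figures per occupation group.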
Story Points
Story points are a somewhat subjective metric based on analysis of designs expressed
in terms of user stories. Story points are not standardized and vary by as much as
400% from company to company. They are used primarily with Agile projects and
can be used to predict velocity. A Google search will bring up an extensive literature
including several papers that challenge the validity of story points.
Taxonomy
In addition to the taxonomy itself, the author's benchmark recording method also
captures data on 20 supplemental topics that are significant to software project
results.
The eight taxonomy factors and the 20 supplemental factors make comparisons of
projects accurate and easy to understand by clients. The taxonomy and supplemen-
tal factors are also used for pattern matching or converting historical data into use-
ful estimating algorithms. As it happens, applications that have the same taxonomy
are also about the same in terms of schedules, costs, productivity, and quality. That
being said, there are millions of permutations from the factors used in the author’s
taxonomy. However, the vast majority of software applications can be encompassed
by fewer than 100 discrete patterns.
Technical Debt
The concept of technical debt was put forth by Ward Cunningham. It is a brilliant
metaphor, but not a very good metric as currently defined. The idea of technical
debt is that shortcuts or poor architecture, design, or code made to shorten
development schedules will lead to downstream postrelease work. This is certainly
true. But the term debt brings up the analogy of financial debt, and here there are
problems. Financial debt is normally a two-party transaction between a borrower
and a lender; technical debt is self-inflicted by one party.

A subtle issue with technical debt is that it makes a tacit assumption that shortcuts
are needed to achieve early delivery. They are not. A combination of defect
prevention, pretest defect removal, and formal testing can deliver software with
close to zero technical debt faster and cheaper than the same project with
shortcuts, which usually skimp on quality control.

A more serious problem is that too many postrelease costs are not included in
technical debt. If an outsource contractor is sued for poor performance or poor
quality, then litigation and damages should be included in technical debt.
Consequential damages to users of software caused by bugs or failures should be
included, as should losses in stock value due to poor quality. Also, about 32% of
large systems are canceled without being completed. These have huge quality costs
but zero technical debt. Overall, technical debt seems to encompass only about
17% of the full costs of poor quality and careless development.

Another omission with technical debt is the lack of a normalization method.
Absolute technical debt, like absolute financial debt, is important, but it would
also help to know technical debt per function point. This would allow comparisons
of various project sizes and also various development methods. Technical debt can
be improved over time if there is an interest in doing so. Technical debt is
currently a hot topic in the software literature, so it will be interesting to see if
there are changes in its structure and topics over time.
Test Metrics
This is a complex topic and also somewhat ambiguous and subjective. Among the
suite of common test metrics circa 2015 are: test cases created; work hours per test
case; test work hours per function point; reused regression test cases; test cases per
function point; test cases executed successfully; test cases executed and failing; test
coverage for branches, paths, code statements, and risks; defects detected; test
intervals or schedules; test iterations; and test DRE levels.
Test Coverage
The phrase test coverage is somewhat ambiguous and can be used to describe the
following: the percent of code statements executed during testing, the percent of
branches or paths tested, and the percent of possible risks for which test cases exist.
Unpaid Overtime
The majority of U.S. software personnel are termed exempt, which means they are
not required to be paid overtime even if they work much more than 40 hours per
week. Unpaid overtime is an important factor for both software costs and software
schedules. Unfortunately, unpaid overtime is the most common form of data that
leaks or does not get reported via normal project tracking. If you are comparing
benchmarks between identical projects and one of them had 10 hours per week
of unpaid overtime, whereas the other had 0 hours of unpaid overtime, no doubt
the project with overtime will be cheaper and have a shorter schedule. But if the
unpaid overtime is invisible and not included in project tracking data, there is no
good way to validate the results of the benchmarks. Among the author’s clients
unpaid overtime of about 4 hours per week is common, but omitting this unpaid
overtime from formal cost tracking is also common. As can be seen, the impact of
unpaid overtime on costs and schedules is significant. The following chart shows
an application of 1,000 function points with compensation at U.S. $10,000 per
staff month.
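The arithmetic behind this effect can be sketched directly. The sketch below is not a reproduction of the author's chart; it assumes, purely for illustration, a fixed total effort of 6,600 hours and a team of five, using the $10,000 per staff month and 132-hour work month figures from the text.

```python
HOURS_PER_MONTH = 132          # effective U.S. work month (from the text)
COST_PER_STAFF_MONTH = 10_000  # U.S. dollars (from the text)

def schedule_and_cost(effort_hours, staff, unpaid_ot_per_week=0.0):
    """Calendar months and recorded cost for a fixed amount of effort.

    Unpaid overtime adds effective work hours but no recorded cost.
    Illustrative assumption: total effort is unchanged by overtime.
    """
    hours_per_month = HOURS_PER_MONTH + unpaid_ot_per_week * 4.33
    months = effort_hours / (staff * hours_per_month)
    cost = months * staff * COST_PER_STAFF_MONTH
    return months, cost

# Hypothetical 1,000-function-point project: 6,600 effort hours, 5 people.
normal = schedule_and_cost(6_600, 5)                        # 10 months, $500,000
with_ot = schedule_and_cost(6_600, 5, unpaid_ot_per_week=10)  # shorter and cheaper
```

The project with unpaid overtime appears both faster and cheaper in the tracking data, even though the same work was done; if the overtime is invisible, benchmark comparisons are distorted.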
Use-Case Points
Use-cases are part of the design approach featured by the Unified Modeling
Language (UML) and included in the Rational Unified Process (RUP). Use-cases
are fairly common among IBM customers, as are use-case points. This is a metric
based on use-cases and used for estimation. It was developed in 1993 by Gustav
Karner prior to IBM acquiring the method. This is a fairly complex metric and a
Google search is recommended to
bring up definitions and additional literature. Use-case points and function points
can be used for the same software. Unfortunately, use-case points only apply to
projects with use-case designs, whereas function points can be used for all software,
and are therefore much better for benchmarks. IBM should have published conver-
sion rules between use-case points and function points, because both metrics were
developed by IBM. In the absence of IBM data, the author’s SRM tool predicts
and converts data between use-case points and IFPUG function points. A total of
1,000 IFPUG function points is roughly equal to 333 use-case points. However,
this value will change because use-cases also vary in depth and complexity. It is the
responsibility of newer metrics such as use-case points to provide conversion rules
to older metrics such as function points, but this responsibility is seldom acknowl-
edged by metrics developers and did not occur for use-case points.
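The rough 3-to-1 ratio stated above can be expressed as a small conversion sketch. The ratio is an approximation derived from the text (1,000 IFPUG function points to roughly 333 use-case points) and will vary with use-case depth and complexity.

```python
# Approximate ratio from the text: about 3 IFPUG function points per
# use-case point. Real ratios vary with use-case depth and complexity.
FP_PER_USE_CASE_POINT = 3.0

def use_case_points_to_fp(ucp):
    """Convert use-case points to approximate IFPUG function points."""
    return ucp * FP_PER_USE_CASE_POINT

def fp_to_use_case_points(fp):
    """Convert IFPUG function points to approximate use-case points."""
    return fp / FP_PER_USE_CASE_POINT

size_fp = use_case_points_to_fp(333)  # -> 999.0, close to 1,000
```

A conversion of this kind is what makes benchmark comparisons possible between projects sized with the two metrics.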
User Costs
For internal IT projects users provide requirements, review documents, participate
in phase reviews, and may even do some actual testing. However, user costs are
seldom reported. Also, user costs are normally not included in the budgets for soft-
ware applications. The author’s SRM tool predicts user costs for IT projects. Total
user costs range from less than 50% of software development costs to more than
70% of software development costs. This topic is underreported in the software
literature and needs additional research. A sample of typical user costs for a medium
IT project of 2,500 function points is shown below:
User costs are not always measured even though they can top 65% of development
costs. They are also difficult to measure because they are not usually included in
software project budgets, and are scattered among a variety of different organiza-
tions, each of which may have their own budgets.
Value (Intangible)
The topic of intangible value is ambiguous and varies from application to applica-
tion. Some of the many forms of intangible value include: medical value, value to
human life and safety, military value for improving military operations, customer
satisfaction value, team morale value, and corporate prestige value. It would be
theoretically possible to create a value point metric similar to function point metrics
to provide a scale or range of intangible values.
Value (Tangible)
Value comes in two flavors: tangible and intangible. The forms of software
tangible value come in several flavors: (1) direct revenues such as software sales,
(2) indirect revenues such as training and maintenance contracts, and (3) operating
cost reductions and work efficiency improvements. Tangible value can be expressed
in terms of currencies such as dollars and can be included in a variety of accounting
formulae such as accounting rate of return and internal rate of return.
Variable Costs
As the name implies, variable costs are the opposite of fixed costs and tend to be
directly proportional to the number of units produced. An example of a variable
cost would be the number of units produced per month in a factory. An example of
a fixed cost would be the monthly rent for the factory itself. For software, an impor-
tant variable cost is the amount and cost of code produced for a specific require-
ment, which varies by language. Another variable cost would be the number and
costs of bug repairs on a monthly basis. The software industry tends to blur together
fixed and variable costs, and this explains endemic errors in the LOC metric and
the cost per defect metric.
Velocity
Velocity is a metric widely used on Agile projects. It can be used in both forward
predictive modes and historical data collection modes. Velocity can be used with
tangible deliverables such as document pages and also with synthetic metrics such
as story points. As velocity is not precisely defined, users could do a Google search
to bring up additional literature on the velocity metric.
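A minimal sketch of velocity in its two modes, using invented sprint data:

```python
import math

def velocity(completed_points_per_sprint):
    """Historical mode: average story points completed per sprint."""
    return sum(completed_points_per_sprint) / len(completed_points_per_sprint)

def sprints_remaining(backlog_points, completed_points_per_sprint):
    """Forward predictive mode: whole sprints to burn down the backlog."""
    v = velocity(completed_points_per_sprint)
    return math.ceil(backlog_points / v)

# Hypothetical team with three sprints of history and 120 points left.
est = sprints_remaining(120, [28, 32, 30])  # velocity 30 -> 4 sprints
```

Because velocity rests on story points, it inherits their 400% company-to-company variation, so it is most useful within a single team rather than across organizations.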
Venn Diagram
In 1880, the mathematician John Venn developed a simple graphing technique to
teach set theory. Each set was represented by a circle. The relationship between two
sets could be shown by the overlap between the circles. The use of Venn diagrams
is much older than software, and is used by dozens of kinds of engineers and math-
ematicians due to the simplicity and elegance of the approach. A simple Venn dia-
gram with two sets is shown below.
Venn diagrams can be used with more than two circles, of course, but they become
complex and lose visual appeal with more than four circles.
War Room
In a software context, a war room is a room set aside for project planning and status
tracking. Usually they have tables with planning documents and often one or more
walls are covered with project flow diagrams that indicate current status. Usually
war rooms are found for large systems in the 10,000 function point size range.
Warranty Costs
Most software projects do not have warranties. Look at the fine print on almost any
box of commercial software and you will see phrases such as no warranty expressed
or implied. In the rare cases where some form of warranty is provided, it can range
from replacement of a disk with a new version to actually fixing problems. There is
no general rule and each application by each company will probably have a unique
warranty policy. This is professionally embarrassing for the software industry,
which should offer standard warranties for all software. Some outsource contracts
include warranties, but here too there are variations from contract to contract.
Work Hours
There is a major difference between the nominal number of hours worked and the
actual number of hours worked. In the United States the nominal work week is
40 hours. But due to lunch breaks, coffee breaks, and other nonwork time, the
effective work week is around 33 hours per week or 132 hours per month. There
are major differences in work hours from country to country and these differences
are important for both software measurement and software estimation. The effective
work month for the United States is 132 hours, for China 186 hours, for Sweden
126 hours, and so forth. These variances mean that a project that might require one
calendar month in the United States would require only three weeks in China but
more than one month in Sweden.
Variations in work hours per month do not translate one-for-one into higher
or lower productivity. Other topics such as experience and methodologies are also
important. Even so the results are interesting and thought-provoking.
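The schedule arithmetic above can be sketched directly from the work-month figures given in the text:

```python
# Effective work hours per month, from the text.
EFFECTIVE_HOURS_PER_MONTH = {"United States": 132, "China": 186, "Sweden": 126}

def calendar_months(effort_hours, country):
    """Calendar months of schedule for one person in a given country.

    Ignores the other factors the text notes (experience, methodology)
    that also influence productivity.
    """
    return effort_hours / EFFECTIVE_HOURS_PER_MONTH[country]

# A task of 132 effort hours: one U.S. calendar month.
us = calendar_months(132, "United States")  # 1.0 month
china = calendar_months(132, "China")       # about 0.71 month (~3 weeks)
sweden = calendar_months(132, "Sweden")     # about 1.05 months
```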
Zero Defects
The ultimate goal of software engineering is to produce software applications with
zero defects after release. As software applications are known to be error-prone, this
is a challenging goal indeed. Custom designs and hand coding of applications are
both error-prone; the average DRE as of 2014 is less than 90%, and defects average
more than 3.0 per function point when requirements defects, design defects, code
defects, and bad-fix defects are all enumerated. The best approach to achieving
zero defects is a combination of defect prevention, pretest defect removal, formal
testing, and construction from certified reusable components.
Appendix 2: Twenty-Five Software Engineering Targets
Introduction
The purpose of measurement and metrics is to gain insights and improve software
results. Now that metrics problems have been discussed, it is appropriate to show a
selection of potential improvements that seem to be technically feasible.
Following is a collection of 25 goals or targets for software engineering progress,
developed by Namcook Analytics LLC for the five-year period between 2016 and
2021. Some of these goals are achievable now in 2016, although not many
companies have achieved them; a small selection of leading companies has already
achieved some.
Unfortunately, less than 5% of U.S. and global companies have achieved any of
these goals, and less than 0.1% have achieved most of them. None of the author's
clients have yet achieved every goal, although some probably will by 2017 or 2018.
The author suggests that every major software-producing company and gov-
ernment agency have their own set of five-year targets, using the current list as a
starting point.
334 ◾ Appendix 2: Twenty-Five Software Engineering Targets
1. Raise defect removal efficiency (DRE) from <90.0% to >99.5%. DRE counts
bugs found by all removal activities, including all forms of inspections. It is
paired with the defect potential metric discussed in the next paragraph. DRE
is measured by comparing all bugs found during development to those
reported in the first 90 days by customers. The current U.S. average is about
90%; Agile is about 92%. Quality-strong methods such as the Rational
Unified Process (RUP) and Team Software Process (TSP) usually top 96% in
DRE. Only a few top companies using a full suite of defect prevention, pretest
defect removal, and formal testing with mathematically designed test cases
and certified test personnel can top 99% in DRE. The upper limit of DRE
circa 2015 is about 99.6%. DRE of 100% is theoretically possible but has not
been encountered on more than about 1 project out of 10,000.
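The DRE measurement rule described above reduces to a one-line calculation; the bug counts below are invented for illustration.

```python
def defect_removal_efficiency(internal_defects, first_90_day_defects):
    """DRE: bugs found before release as a percentage of all bugs found
    through the first 90 days of customer use, as described above."""
    total = internal_defects + first_90_day_defects
    return 100.0 * internal_defects / total

# Hypothetical project: 990 bugs found in development, 10 by customers.
dre = defect_removal_efficiency(990, 10)  # -> 99.0
```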
2. Lower software defect potentials from >4.0 per function point to
<2.0 per function point. The phrase defect potentials was coined in IBM
circa 1970. Defect potentials are the sum of bugs found in all deliverables:
(1) requirements, (2) architecture, (3) design, (4) code, (5) user documents,
and (6) bad fixes. Requirements and design bugs often outnumber code bugs.
Today defect potentials can top 6.0 per function point for large systems in
the 10,000 function point size range. Achieving this goal requires effective
defect prevention such as joint application design (JAD), quality function
deployment (QFD), requirements modeling, certified reusable components,
and others. It also requires a complete software quality measurement pro-
gram. Achieving this goal also requires better training in common sources
of defects found in requirements, design, and source code. The most effective
way of lowering defect potentials is to switch away from custom designs and
manual coding, which are intrinsically error prone. Construction from certified
reusable components can cause a very significant reduction in software defect
potentials.
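Defect potentials and DRE combine to predict delivered defects. A minimal sketch using the approximate values from the text:

```python
def delivered_defects(size_fp, defect_potential_per_fp, dre_percent):
    """Delivered defects = total defect potential x (1 - DRE).

    Values from the text: potentials above 4.0 per function point and
    DRE near 90% are typical today; the targets are <2.0 and >99.5%.
    """
    total_potential = size_fp * defect_potential_per_fp
    return total_potential * (1.0 - dre_percent / 100.0)

# A 10,000-function-point system, today versus the stated targets.
today = delivered_defects(10_000, 4.0, 90.0)   # about 4,000 delivered bugs
target = delivered_defects(10_000, 2.0, 99.5)  # about 100 delivered bugs
```

The two goals are multiplicative: halving the potential and raising DRE together cut delivered defects by roughly a factor of 40 in this example.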
3. Lower cost of quality (COQ ) from >45.0% of development to <15.0%
of development. Finding and fixing bugs has been the most expensive task
in software for more than 50 years. A synergistic combination of defect
prevention and pretest inspections and static analysis are needed to achieve
this goal. The probable sequence would be to raise defect removal efficiency
from today’s average of less than 90% up to 99%. At the same time defect
potentials can be brought down from today’s averages of more than 4.0
per function point to less than 2.0 per function point. This combination
will have a strong synergistic impact on maintenance and support costs.
Incidentally, lowering the cost of quality will also lower technical debt. But as
of 2016 technical debt is not a standard metric and varies so widely that it
is hard to quantify.
4. Reduce average cyclomatic complexity from >25.0 to <10.0. Achieving
this goal requires careful analysis of software structures, and of course it also
requires measuring cyclomatic complexity for all modules. As cyclomatic
complexity tools are common and some are open source, every application
should use them without exception.
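Cyclomatic complexity itself is computed from the control-flow graph by McCabe's formula; a minimal sketch with an invented graph:

```python
def cyclomatic_complexity(edges, nodes, connected_components=1):
    """McCabe's cyclomatic complexity: M = E - N + 2P for a control-flow
    graph with E edges, N nodes, and P connected components."""
    return edges - nodes + 2 * connected_components

# A hypothetical module whose control-flow graph has 34 nodes and 57 edges:
m = cyclomatic_complexity(57, 34)  # -> 25, above the <10.0 target
```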
5. Raise test coverage from <75.0% to >98.5% for risks, paths, and require-
ments. Achieving this goal requires using mathematical design methods for
test case creation, such as design of experiments. It also requires measurement
of test coverage, plus predictive tools that can predict the number of test cases
needed based on function points, code volumes, and cyclomatic complexity.
The author's Software Risk Master (SRM) tool predicts test cases for 18 kinds
of testing and therefore can also predict probable test coverage.
6. Eliminate error-prone modules in large systems. Bugs are not randomly
distributed. Achieving this goal requires careful measurements of code defects
during development and after release with tools that can trace bugs to specific
modules. Some companies such as IBM have been doing this for many years.
Error-prone modules (EPM) are usually less than 5% of total modules, but
receive more than 50% of total bugs. Prevention is the best solution. Existing
error-prone modules in legacy applications may require surgical removal and
replacement. However, static analysis should be used on all identified EPM.
In one study a major application had 425 modules. Of these, 57% of all bugs
were found in only 31 modules built by one department. Over 300 modules
were zero-defect modules. EPM are easy to prevent but difficult to repair
once they are created. Usually surgical removal is needed. EPM are the most
expensive artifacts in the history of software. EPM are somewhat like the
medical condition of smallpox; that is, they can be completely eliminated with
vaccination and effective control techniques. Error-prone modules often top 3.0
defects per function point, with less than 80% of their defects removed prior
to release. They also tend to top 50 in terms of cyclomatic complexity, which
makes higher defect removal via testing difficult.
7. Eliminate security flaws in all software applications. As cyber crime
becomes more common, the need for better security is more urgent. Achieving
this goal requires use of security inspections, security testing, and automated
tools that seek out security flaws. For major systems containing valuable
financial or confidential data, ethical hackers may also be needed.
8. Reduce the odds of cyber attacks from >10.0% to <0.1%. Achieving this
goal requires a synergistic combination of better firewalls, continuous anti-
virus checking with constant updates to viral signatures, and also increasing
the immunity of software itself by means of changes to basic architecture
and permission strategies. It may also be necessary to rethink hardware and
software architectures to raise the immunity levels of both.
9. Reduce bad-fix injections from >7.0% to <1.0%. Not many people know
that about 7% of attempts to fix software bugs contain new bugs in the fixes
themselves commonly called bad-fixes. When cyclomatic complexity tops 50, the
bad-fix injection rate can soar to 25% or more. Reducing bad-fix injection
requires measuring and controlling cyclomatic complexity, using static analy-
sis for all bug fixes, testing all bug fixes, and inspections of all significant fixes
prior to integration.
10. Reduce requirements creep from >1.5% per calendar month to <0.25%
per calendar month. Requirements creep has been an endemic problem of
the software industry for more than 50 years. Although prototypes, Agile-
embedded users, and joint application design (JAD) are useful, it is tech-
nically possible to use automated requirements models also to improve
requirements completeness. The best method would be to use pattern match-
ing to identify the features of applications similar to the one being developed.
A precursor technology would be a useful taxonomy of software application
features, which does not actually exist in 2016 but could be created with sev-
eral months of concentrated study.
11. Lower the risk of project failure or cancellation on large 10,000 func-
tion point projects from >35.0% to <5.0%. Cancellation of large systems
due to poor quality, poor change control, and cost overruns that turn return
on investment (ROI) from positive to negative is an endemic problem of the
software industry and is totally unnecessary. A synergistic combination of
effective defect prevention and pretest inspections and static analysis can
come close to eliminating this far too common problem. Parametric estimation
tools that can predict risks, costs, and schedules with greater accuracy than
optimistic manual estimates are also recommended.
12. Reduce the odds of schedule delays from >50.0% to <5.0%. As the main
reasons for schedule delays are poor quality and excessive requirements creep,
solving some of the earlier problems in this list will also solve the problem of
schedule delays. Most projects seem on time until testing starts, when huge
quantities of bugs begin to stretch out the test schedule to infinity. Defect
prevention combined with pretest static analysis can reduce or eliminate
schedule delays. This is a treatable condition and it can be eliminated within
five years.
13. Reduce the odds of cost overruns from >40.0% to <3.0%. Software cost
overruns and software schedule delays have similar root causes, that is, poor
quality control and poor change control combined with excessive require-
ments creep. Better defect prevention combined with pretest defect removal
can help to cure both of these endemic software problems. Using accurate
parametric estimation tools rather than optimistic manual estimates is also
useful in lowering cost overruns.
14. Reduce the odds of litigation on outsource contracts from >5.0% to
<1.0%. The author of this book has been an expert witness in 12 breach of
contract cases. All of these cases seem to have similar root causes that include
poor quality control, poor change control, and very poor status tracking.
A synergistic combination of early sizing and risk analysis prior to contract
signing plus effective defect prevention and pretest defect removal can lower
the odds of software breach of contract litigation.
15. Lower maintenance and warranty repair costs by >75.0% compared
to 2016 values. Starting in about 2000, the number of U.S. maintenance
is static and consists either of text such as story points or very primitive and
limited diagrams such as flowcharts or UML diagrams. The technology needed
to create a new kind of animated graphical design method, in full color and
three dimensions, exists today in 2016. It is only necessary to develop the
symbol set and begin to animate the design process.
23. Develop an interactive learning tool for software engineering based on
massively interactive game technology. New concepts are occurring almost
every day in software engineering. New programming languages are coming
out on a weekly basis. Software lags medicine and law and other forms of
engineering in having continuing education. But live instruction is costly and
inconvenient. The need is for an interactive learning tool with a built-in
curriculum planning feature. It is technically possible to build such a tool today.
By licensing a game engine it would be possible to build a simulated software
university where avatars could take both classes and also interact with one
another.
24. Develop a suite of dynamic, animated project planning, and estimating
tools that will show growth of software applications. Today the outputs of
all software estimating tools are static tables augmented by a few graphs. But
software applications grow during development at more than 1% per calendar
month, and they continue to grow after release at more than 8% per calendar
year. It is obvious that software planning and estimating tools need dynamic
modeling capabilities that can show the growth of features over time. They
should also show the arrival (and discovery) of bugs or defects entering
from requirements, design, architecture, code, and other defect sources. The
ultimate goal, which is technically possible today, would be a graphical model
that shows application growth from the first day of requirements through
25 years of usage.
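The growth rates quoted above can be turned into a simple compound-growth sketch; the 1,000-function-point starting size and 18-month schedule are invented for illustration.

```python
def projected_size(initial_fp, dev_months, years_in_service,
                   dev_growth=0.01, post_release_growth=0.08):
    """Application size after compound growth: more than 1% per calendar
    month during development and more than 8% per year after release
    (rates from the text)."""
    size = initial_fp * (1 + dev_growth) ** dev_months
    return size * (1 + post_release_growth) ** years_in_service

# A 1,000-function-point application: 18 months of development,
# then 10 years of use -- roughly 2,600 function points.
final = projected_size(1_000, 18, 10)
```

Even these modest rates more than double the application over a decade, which is why static size estimates go stale so quickly.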
25. Introduce licensing and board certification for software engineers and
specialists. It is strongly recommended that every reader of this book also
read Paul Starr’s book The Social Transformation of American Medicine (1982).
This book won a Pulitzer Prize in 1984. Starr’s book shows how the American
Medical Association (AMA) was able to improve academic training, reduce
malpractice, and achieve a higher level of professionalism than other technical
fields. Medical licenses and board certification of specialists were a key
factor in medical progress. It took over 75 years for medicine to reach its
current professional status, but with Starr's book as a guide software could do
the same within 10 years. This is outside the 5-year window of this book, but
the process should start in 2015.
Note that the function point metrics used in this book refer to function points as
defined by the International Function Point Users Group (IFPUG). Other function
points such as COSMIC, FISMA, NESMA, unadjusted, and so on can also be
used but would have different quantitative results.
The technology stack available in 2016 is already good enough to achieve each
of these 25 targets, although few companies have done so. Some of the
technologies associated with achieving these 25 targets include, but are not
limited to, the following.
The software engineering field has been very different from older and more mature
forms of engineering. One of the main differences between software engineering
and true engineering fields is that software engineering has very poor measurement
practices and far too much subjective information instead of solid empirical data.
This short chapter suggests a set of 25 quantified targets that, if achieved, would
make significant advances in both software quality and software productivity. But
the essential message is that poor software quality is a critical factor that needs to get
better in order to improve software productivity, schedules, costs, and economics.
Suggested Readings on Software Measures and Metric Issues
Gack, Gary. Managing the Black Hole: The Executive's Guide to Software Project Risk. Thomson,
GA: Business Expert Publishing, 2010.
Gack, Gary. Applying Six Sigma to Software Implementation Projects. http://software.isixsigma.
com/library/content/c040915b.asp (last accessed on October 13, 2016).
Galorath, Dan. Software Sizing, Estimating, and Risk Management: When Performance is
Measured Performance Improves. Philadelphia, PA: Auerbach Publishing, 2006, 576 p.
Garmus, David and Herron, David. Measuring the Software Process: A Practical Guide to
Functional Measurement. Englewood Cliffs, NJ: Prentice Hall, 1995.
Garmus, David and Herron, David. Function Point Analysis – Measurement Practices for
Successful Software Projects. Boston, MA: Addison Wesley Longman, 2001, 363 p.
Garmus, David, Russac Janet, and Edwards, Royce. Certified Function Point Counters
Examination Guide. Boca Raton, FL: CRC Press, 2010.
Gilb, Tom and Graham, Dorothy. Software Inspections. Reading, MA: Addison Wesley, 1993.
Glass, Robert L. Software Runaways: Lessons Learned from Massive Software Project Failures.
Englewood Cliffs, NJ: Prentice Hall, 1998.
Harris, Michael D.S., Herron, David, and Iwanicki, Stasia. The Business Value of IT. Boca
Raton, FL: CRC Press, Auerbach, 2008.
IFPUG (52 authors). The IFPUG Guide to IT and Software Measurement. Boca Raton, FL:
CRC Press, Auerbach publishers, 2012.
International Function Point Users Group (IFPUG). IT Measurement – Practical Advice from
the Experts. Boston, MA: Addison Wesley Longman, 2002, 759 p.
Jacobsen, Ivar, Ng Pan-Wei, McMahon, Paul, Spence, Ian, and Lidman, Svente. The Essence of
Software Engineering: Applying the SEMAT Kernel. Boston, MA: Addison Wesley, 2013.
Jones, Capers. Patterns of Software System Failure and Success. Boston, MA: International
Thomson Computer Press, 1995, 250 p.
Jones, Capers. Software Quality – Analysis and Guidelines for Success. Boston, MA: International
Thomson Computer Press, 1997, 492 p.
Jones, Capers. Sizing up software. Scientific American Magazine, 1998, 279(6):104–111.
Johnson, James et al. The Chaos Report. West Yarmouth, MA: The Standish Group, 2000.
Jones, Capers. Software Assessments, Benchmarks, and Best Practices. Boston, MA: Addison
Wesley Longman, 2000, 657 p.
Jones, Capers. Estimating Software Costs. New York: McGraw-Hill, 2007.
Jones, Capers. Conflict and Litigation Between Software Clients and Developers. Narragansett,
RI: Software Productivity Research, Inc., 2008, 45 p.
Jones, Capers. Preventing Software Failure: Problems Noted in Breach of Contract Litigation.
Narragansett, RI: Capers Jones & Associates, 2008, 25 p.
Jones, Capers. Applied Software Measurement. New York: McGraw-Hill, 3rd edition, 2008, 668 p.
Jones, Capers. Software Engineering Best Practices. New York: McGraw Hill, 2010.
Jones, Capers and Bonsignour, Olivier. The Economics of Software Quality. Reading, MA:
Addison Wesley, 2011.
Jones, Capers. A Short History of the Cost per Defect Metric. Narragansett, RI: Namcook
Analytics LLC, 2014.
Jones, Capers. A Short History of Lines of Code Metrics. Narragansett, RI: Namcook Analytics
LLC, 2014.
Jones, Capers. The Technical and Social History of Software Engineering. Boston, MA: Addison
Wesley Longman, 2014.
Kan, Stephen H. Metrics and Models in Software Quality Engineering. Boston, MA: Addison
Wesley Longman, 2nd edition, 2003, 528 p.
Websites
Information Technology Metrics and Productivity Institute (ITMPI): www.ITMPI.org
International Software Benchmarking Standards Group (ISBSG): www.ISBSG.org
International Function Point Users Group (IFPUG): www.IFPUG.org
Project Management Institute (PMI): www.PMI.org
Capers Jones, Namcook Analytics LLC: www.Namcook.com
Construx: www.construx.com
ITMPI: www.itmpi.org
QuantiMetrics: www.quantimetrics.net