A Guide to Selecting
Software Measures
and Metrics

Capers Jones
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-1380-3307-8 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface ...............................................................................................................vii
Acknowledgments ..............................................................................................xi
About the Author .............................................................................................xiii
1 Introduction ...........................................................................................1
2 Variations in Software Activities by Type of Software .........................17
3 Variations in Software Development Activities by Type of Software.........29
4 Variations in Occupation Groups, Staff Size, Team Experience ...........35
5 Variations due to Inaccurate Software Metrics That Distort Reality .......45
6 Variations in Measuring Agile and CMMI Development ....................51
7 Variations among 60 Development Methodologies ..............................59
8 Variations in Software Programming Languages ................................63
9 Variations in Software Reuse from 0% to 90% .....................................69
10 Variations due to Project, Phase, and Activity Measurements .............77
11 Variations in Burden Rates or Overhead Costs ....................................83
12 Variations in Costs by Industry ............................................................87
13 Variations in Costs by Occupation Group............................................93
14 Variations in Work Habits and Unpaid Overtime ................................97
15 Variations in Functional and Nonfunctional Requirements ................105


16 Variations in Software Quality Results ..............................................115


Missing Software Defect Data ................................................................. 116
Software Defect Removal Efficiency ........................................................ 117
Money Spent on Software Bug Removal .................................................. 119
Wasted Time by Software Engineers due to Poor Quality .......................121
Bad Fixes or New Bugs in Bug Repairs ....................................................121
Bad-Test Cases (An Invisible Problem) .....................................................122
Error-Prone Modules with High Numbers of Bugs ..................................122
Limited Scopes of Software Quality Companies......................................123
Lack of Empirical Data for ISO Quality Standards .................................134
Poor Test Case Design .............................................................................135
Best Software Quality Metrics .................................................................135
Worst Software Quality Metrics ..............................................................136
Why Cost per Defect Distorts Reality .....................................................137
Case A: Poor Quality ..........................................................................137
Case B: Good Quality .........................................................................137
Case C: Zero Defects ..........................................................................137
Be Cautious of Technical Debt ................................................................139
The SEI CMMI Helps Defense Software Quality....................................139
Software Cost Drivers and Poor Quality .................................................139
Software Quality by Application Size ......................................................140
17 Variations in Pattern-Based Early Sizing ...........................................147
18 Gaps and Errors in When Projects Start. When Do They End? .........157
19 Gaps and Errors in Measuring Software Quality ...............................165
Measuring the Cost of Quality ................................................................179
20 Gaps and Errors due to Multiple Metrics without Conversion Rules ......221
21 Gaps and Errors in Tools, Methodologies, Languages .......................227
Appendix 1: Alphabetical Discussion of Metrics and Measures .................233
Appendix 2: Twenty-Five Software Engineering Targets from 2016
through 2021...............................................................................................333
Suggested Readings on Software Measures and Metric Issues................... 343
Summary and Conclusions on Measures and Metrics ................................349
Index ...........................................................................................................351
Preface

This is my 16th book overall and my second book on software measurement. My first measurement book was Applied Software Measurement, which was published by McGraw-Hill in 1991, with a second edition in 1996 and a third edition in 2008.
The reason I decided on a new book on measurement instead of a fourth edition of my older book is that this new book has a different vantage point. The first book was a tutorial on software measurement, with practical advice on getting started and on producing useful reports for management and clients.
This new book is not a tutorial on measurement, but rather a critique of a number of bad measurement practices, hazardous metrics, and huge gaps and omissions in the software literature that leave major topics uncovered and unexamined. In fact, the completeness of software historical data among more than 100 companies and 20 government groups is only about 37%.
In my regular professional work, I help clients collect benchmark data. In doing
this, I have noticed major gaps and omissions that need to be corrected if the data
are going to be useful for comparisons or estimating future projects.
Among the more serious gaps are leaks from software effort data that, if not corrected,
will distort reality and make the benchmarks almost useless and possibly even harmful.
One of the most common leaks is that of unpaid overtime. Software is a very
labor-intensive occupation, and many of us work very long hours. But few companies
actually record unpaid overtime. This means that software effort is underreported by
around 15%, which is too large a value to ignore.
Other leaks include the work of part-time specialists who come and go as
needed. There are dozens of these specialists, and their combined effort can top
45% of total software effort on large projects. There are too many to show all of
these specialists, but some of the more common include the following:

1. Agile coaches
2. Architects (software)
3. Architects (systems)


4. Architects (enterprise)
5. Assessment specialists
6. Capability maturity model integrated (CMMI) specialists
7. Configuration control specialists
8. Cost estimating specialists
9. Customer support specialists
10. Database administration specialists
11. Education specialists
12. Enterprise resource planning (ERP) specialists
13. Expert-system specialists
14. Function point specialists (certified)
15. Graphics production specialists
16. Human factors specialists
17. Integration specialists
18. Library specialists (for project libraries)
19. Maintenance specialists
20. Marketing specialists
21. Member of the technical staff (multiple specialties)
22. Measurement specialists
23. Metric specialists
24. Project cost analysis specialists
25. Project managers
26. Project office specialists
27. Process improvement specialists
28. Quality assurance specialists
29. Scrum masters
30. Security specialists
31. Technical writing specialists
32. Testing specialists (automated)
33. Testing specialists (manual)
34. Web page design specialists
35. Web masters

Another major leak is the failure to record the rather high costs for users when they participate in software projects, such as embedded users on agile projects. Users also provide requirements, participate in design and phase reviews, perform acceptance testing, and carry out many other critical activities. User costs can collectively approach 85% of the effort of the actual software development teams.
Without multiplying examples, this new book is somewhat like a medical book
that attempts to discuss treatments for common diseases. This book goes through
a series of measurement and metric problems and explains the damages they can
cause. There are also some suggestions on overcoming these problems, but the main
focus of the book is to show readers all of the major gaps and problems that need to
be corrected in order to accumulate accurate and useful benchmarks for software
projects. I hope readers will find the information to be of use.
Quality data are even worse than productivity and resource data and are only
about 25% complete. The new technical debt metric is only about 17% complete.
Few companies even start quality measures until after unit test, so all early bugs
found by reviews, desk checks, and static analysis are invisible. Technical debt does
not include consequential damages to clients, nor does it include litigation costs
when clients sue for poor quality.
Hardly anyone measures bad fixes, or new bugs in bug repairs themselves.
About 7% of bug repairs have new bugs, and this can rise above 35% for modules
with high cyclomatic complexity. Even fewer companies measure bad-test cases, or
bugs in test libraries, which average about 15%.
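One simple way to see why these percentages matter is to treat each bug repair as having some probability of injecting a new defect that must itself be repaired. The short sketch below sums the resulting geometric series; it is an illustrative framing added here, not a figure taken from measured client data.

    # Illustrative side calculation: if a fixed share of bug repairs inject new
    # defects, the total repair workload follows a geometric series.

    def total_repairs(initial_defects, bad_fix_rate):
        """Each repair creates one new defect with probability bad_fix_rate."""
        return initial_defects / (1.0 - bad_fix_rate)

    print(round(total_repairs(1000, 0.07)))  # about 1,075 repairs for 1,000 original bugs
    print(round(total_repairs(1000, 0.35)))  # about 1,538 repairs in high-complexity modules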
Yet another problem with software measurements has been the continuous
usage for more than 50 years of metrics that distort reality and violate standard
economic principles. The two most flagrant metrics with proven errors are cost per
defect and lines of code (LOC). The cost per defect metric penalizes quality and
makes buggy applications look better than they are. The LOC metric makes
requirements and design invisible and, even worse, penalizes modern high-level
programming languages.
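A hypothetical side calculation shows both distortions at once: fixed test preparation and execution costs make the buggier application look cheaper per defect found, and fixed noncode effort makes the low-level language look more productive per line of code. The effort and cost figures below are invented for illustration, although the 128 and 53 LOC per function point ratios echo the C and Java values used later in this book.

    # Hypothetical illustration of the two classic metric distortions.

    def cost_per_defect(fixed_test_cost, cost_per_fix, defects_found):
        """Total test cost divided by the number of defects found."""
        return (fixed_test_cost + cost_per_fix * defects_found) / defects_found

    # Two versions of the same application; test preparation and execution
    # are largely fixed costs regardless of how many bugs show up.
    print(cost_per_defect(50000, 500, 500))  # buggy release:   $600 per defect
    print(cost_per_defect(50000, 500, 50))   # quality release: $1,500 per defect

    # Same 1,000 function points coded in a low-level and a high-level language.
    # Requirements, design, and documentation effort do not shrink with code volume.
    fixed_noncode_months = 40
    for language, loc_per_fp, coding_months in [("C", 128, 60), ("Java", 53, 30)]:
        loc = 1000 * loc_per_fp
        total_months = fixed_noncode_months + coding_months
        print(language, round(loc / total_months), "LOC per staff month")
    # C appears almost 70% more productive even though it needs 100 rather
    # than 70 staff months for the same delivered functionality.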
Professional benchmark organizations such as Namcook Analytics, Q/P
Management Group, Davids’ Consulting, and TI Metricas in Brazil that validate
client historical data before logging it can achieve measurement accuracy of perhaps
98%. Contract projects that need accurate billable hours in order to get paid are
often accurate to within 90% for development effort (but many omit unpaid
overtime, and they never record user costs).
Function point metrics are the best choice for both economic and quality
analyses of software projects. The new SNAP metric for software nonfunctional
assessment process measures nonfunctional requirements but is difficult to apply
and also lacks empirical data.
Ordinary internal information system projects and web applications developed under a cost-center model, where costs are absorbed instead of being charged out, are the least accurate and are the ones that average only 37% completeness. Agile projects are very weak in measurement accuracy and often have less than 50% accuracy. Self-reported benchmarks are also weak in measurement accuracy and are often less than 35% complete in accumulating actual costs.
A distant analogy to this book on measurement problems is Control of
Communicable Diseases in Man, published by the U.S. Public Health Service. It has
concise descriptions of the symptoms and causes of more than 50 common com-
municable diseases, together with discussions of proven effective therapies.
Another medical book with useful guidance for those of us in software is
Paul Starr’s excellent book on The Social Transformation of American Medicine.
This book won a Pulitzer Prize in 1982. Some of the topics on improving medical
records and medical education have much to offer on improving software records
and software education.
So as not to have an entire book filled with problems, Appendix 2 is a more positive section that shows 25 quantitative goals that could be achieved between now and 2026 if the industry takes measurements seriously and also takes quality seriously.
Acknowledgments

Thanks to my wife, Eileen Jones, for making this book possible. Thanks for her
patience when I get involved in writing and disappear for several hours. Also thanks
for her patience on holidays and vacations when I take my portable computer and
write early in the morning.
Thanks to my neighbor and business partner Ted Maroney, who handles
contracts and the business side of Namcook Analytics LLC, which frees up my time
for books and technical work. Thanks also to Aruna Sankaranarayanan for her excellent work with our Software Risk Master (SRM) estimation tool and our website.
Thanks also to Larry Zevon for the fine work on our blog and to Bob Heffner for
marketing plans. Thanks also to Gary Gack and Jitendra Subramanyam for their
work with us at Namcook.
Thanks to other metrics and measurement research colleagues who also attempt
to bring order into the chaos of software development: Special thanks to the late
Allan Albrecht, the inventor of function points, for his invaluable contribution to
the industry and for his outstanding work. Without Allan’s pioneering work on
function points, the ability to create accurate baselines and benchmarks would
probably not exist today in 2016.
The new SNAP team from International Function Point Users Group (IFPUG)
also deserves thanks: Talmon Ben-Canaan, Carol Dekkers, and Daniel French.
Thanks also to Dr. Alain Abran, Mauricio Aguiar, Dr. Victor Basili, Dr. Barry
Boehm, Dr. Fred Brooks, Manfred Bundschuh, Tom DeMarco, Dr. Reiner Dumke,
Christof Ebert, Gary Gack, Tom Gilb, Scott Goldfarb, Peter Hill, Dr. Steven Kan,
Dr. Leon Kappelman, Dr. Tom McCabe, Dr. Howard Rubin, Dr. Akira Sakakibara,
Manfred Seufort, Paul Strassman, Dr. Gerald Weinberg, Cornelius Wille, the late
Ed Yourdon, and the late Dr. Harlan Mills for their own solid research and for the
excellence and clarity with which they communicated ideas about software. The
software industry is fortunate to have researchers and authors such as these.
Thanks also to the other pioneers of parametric estimation for software projects:
Dr. Barry Boehm of COCOMO, Tony DeMarco and Arlene Minkiewicz of
PRICE, Frank Freiman and Dan Galorath of SEER, Dr. Larry Putnam of SLIM
and the other Putnam family members, Dr. Howard Rubin of Estimacs, Dr. Charles
Turk (a colleague at IBM when we built DPS in 1973), and William Roetzheim
of ExcelerPlan. Many of us started work on parametric estimation in the 1970s and brought out our commercial tools in the 1980s.
Thanks to my former colleagues at Software Productivity Research (SPR) for
their hard work on our three commercial estimating tools (SPQR/20 in 1984;
CHECKPOINT in 1987; and KnowledgePlan in 1990): Doug Brindley, Chas
Douglis, Lynn Caramanica, Carol Chiungos, Jane Greene, Rich Ward, Wayne
Hadlock, Debbie Chapman, Mike Cunnane, David Herron, Ed Begley, Chuck
Berlin, Barbara Bloom, Julie Bonaiuto, William Bowen, Michael Bragen, Doug
Brindley, Kristin Brooks, Tom Cagley, Sudip Charkraboty, Craig Chamberlin,
Michael Cunnane, Charlie Duczakowski, Gail Flaherty, Richard Gazoorian,
James Glorie, Scott Goldfarb, David Gustafson, Bill Harmon, Shane Hartman,
Bob Haven, Steve Hone, Jan Huffman, Peter Katsoulas, Richard Kauffold, Scott
Moody, John Mulcahy, Phyllis Nissen, Jacob Okyne, Donna O’Donnel, Mark
Pinis, Tom Riesmeyer, Janet Russac, Cres Smith, John Smith, Judy Sommers, Bill
Walsh, and John Zimmerman. Thanks also to Ajit Maira and Dick Spann for their
service on SPR’s board of directors.
Appreciation is also due to various corporate executives who supported the
technical side of measurement and metrics by providing time and funding. From
IBM, the late Ted Climis and the late Jim Frame both supported the author's measurement work and in fact commissioned several studies of productivity and quality
inside IBM as well as funding IBM’s first parametric estimation tool in 1973. Rand
Araskog and Dr. Charles Herzfeld at ITT also provided funds for metrics studies,
as did Jim Frame who became the first ITT VP of software.
Thanks are also due to the officers and employees of the IFPUG. This organization started almost 30 years ago in 1986 and has grown to become the largest
software measurement association in the history of software. When the affiliates in
other countries are included, the community of function point users is the largest
measurement association in the world.
There are other function point associations such as Common Software
Measurement International Consortium, Finnish Software Metrics Association,
and Netherlands Software Metrics Association, but all 16 of my software books
have used IFPUG function points. This is in part due to the fact that Al Albrecht
and I worked together at IBM and later at Software Productivity Research.
About the Author

Capers Jones is currently the vice president and chief technology officer of Namcook Analytics LLC (www.Namcook.com). Namcook Analytics LLC designs leading-edge risk, cost, and quality estimation and measurement tools. Software Risk Master (SRM)™ is the company's advanced estimation tool with a patent-pending early sizing feature that allows sizing before requirements via pattern matching.
Namcook Analytics also collects software benchmark data and engages in longer
range software process improvement, quality, and risk-assessment studies. These
Namcook studies are global and involve major corporations and some government
agencies in many countries in Europe, Asia, and South America. Capers Jones is
the author of 15 software books and several hundred journal articles. He is also an
invited keynote speaker at many software conferences in the United States, Europe,
and the Pacific Rim.

Chapter 1

Introduction

As the developer of a family of software cost-estimating tools, the author is often asked what seems to be a straightforward question: How accurate are the estimates compared to historical data?
The answer to this question is surprising. Usually the estimates from modern
parametric estimation tools are far more accurate than the historical data used by
clients for comparisons! This fact is surprising because much of what are called
historical data are incomplete and omit most of the actual costs and work effort that
were accrued.
In some cases historical data capture only 25% or less of the full amount of effort
that was expended. Among the author’s IT clients, the average completeness of
historical effort data is only about 37% of the true effort expended when calibrated
by later team interviews that reconstruct the missing data elements such as unpaid
overtime.
Quality data are incomplete too. Most companies do not even start measuring
quality until after unit test, so all requirement and design defects are excluded,
as are static analysis defects and unit test defects. The result is a defect count that
understates the true numbers of bugs by more than 75%. In fact, some companies
do not measure defects until after release of the software.
Thus when the outputs from an accurate parametric software cost-estimating
tool such as Software Risk Master ™ (SRM), COCOMO II, CostXpert,
ExcelerPlan, KnowledgePlan, True-Price, SEER, or SLIM are compared to what
are called historical data, the results tend to be alarming and are also confusing
to clients and client executives.
The outputs from the estimating tools often indicate higher costs, more effort,
and longer schedules than the historical data indicate. It is seldom realized that the
difference is because of major gaps and omissions in the historical data themselves,
rather than because of errors in the estimates.


It is fair to ask: if historical data are incomplete, how is it possible to know the true amounts and to evaluate the quantity of missing data that were left out?
In order to correct the gaps and omissions that are normal in cost-tracking systems, it is necessary to interview the development team members and the project managers. During these interview sessions, the contents of the historical data collected for the project are compared to a complete work breakdown structure derived from similar projects.
For each activity and task that occurs in the work breakdown structure, but
which is missing from the historical data, the developers are asked whether or not
the activity occurred. If it did occur, the developers are asked to reconstruct from
memory or their informal records the number of hours that the missing activity
accrued.
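A minimal sketch of that validation step is shown below. The activity names and recorded hours are hypothetical; a real benchmark engagement would use the full chart of accounts discussed later in this chapter.

    # Compare recorded activity hours against a reference chart of accounts and
    # flag the gaps that interviewers then try to reconstruct from memory or
    # informal records. Activity names and hours here are hypothetical.

    STANDARD_CHART_OF_ACCOUNTS = [
        "Requirements", "Prototyping", "Architecture", "Project planning",
        "Initial design", "Detail design", "Design reviews", "Coding",
        "Code inspections", "Configuration management", "Integration",
        "User documentation", "Unit testing", "Function testing",
        "Integration testing", "System testing", "Acceptance testing",
        "Quality assurance", "Installation and training", "Project management",
    ]

    def find_measurement_gaps(recorded_hours):
        """Return the reference activities that have no recorded effort."""
        return [activity for activity in STANDARD_CHART_OF_ACCOUNTS
                if recorded_hours.get(activity, 0.0) <= 0.0]

    # A typical "leaky" historical record: little beyond design, code, and unit test.
    recorded = {"Detail design": 1200.0, "Coding": 4800.0, "Unit testing": 900.0}
    gaps = find_measurement_gaps(recorded)
    print(len(gaps), "of", len(STANDARD_CHART_OF_ACCOUNTS), "activities unrecorded")
    for activity in gaps:
        print(" -", activity)  # each gap becomes an interview question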
Problems with errors and leakage from software cost-tracking systems are as
old as the software industry itself. The first edition of the author’s book, Applied
Software Measurement, was published in 1991. The third edition was published
in 2008. Yet the magnitude of errors in cost- and resource-tracking systems is
essentially the same today as it was in 1991. Following is an excerpt from the third
edition that summarizes the main issues of leakage from cost-tracking systems:
It is a regrettable fact that most corporate tracking systems for effort and costs
(dollars, work hours, person months, etc.) are incorrect and manage to omit from
30% to more than 70% of the real effort applied to software projects. Thus most
companies cannot safely use their own historical data for predictive purposes.
When benchmark consulting personnel go on-site and interview managers and
technical personnel, these errors and omissions can be partially corrected by
interviews.
The commonest omissions from historical data, ranked in order of significance,
are given in Table 1.1.
Not all of these errors are likely to occur on the same project, but enough of
them occur so frequently that ordinary cost data from project tracking systems are
essentially useless for serious economic study, for benchmark comparisons between
companies, or for baseline analysis to judge rates of improvement.
A more fundamental problem is that most enterprises simply do not record data for anything but a small subset of the activities actually performed. In carrying out interviews with project managers and project teams to validate and correct historical data, the author has observed the following patterns of incomplete and missing data, using the 25 activities of a standard chart of accounts as the reference model (Table 1.2).
When the author and his colleagues collect benchmark data, we ask the managers and personnel to try to reconstruct any missing cost elements. Reconstruction of data from memory is plainly inaccurate, but it is better than omitting the missing data entirely.
Unfortunately, the bulk of the software literature and many historical studies only report information at the level of complete projects rather than at the level of specific activities.

Table 1.1 Most Common Gaps in Software Measurement Data


Sources of Cost Errors Magnitude of Cost Errors

1. Unpaid overtime by exempt staff (Up to 25% of reported effort)

2. Charging time to the wrong project (Up to 20% of reported effort)

3. User effort on software projects (Up to 50% of reported effort)

4. Management effort on software projects (Up to 15% of reported effort)

5. Specialist effort on software projects (Up to 45% of reported effort)


Business analysts
Human factors specialists
Database administration specialists
Integration specialists
Quality assurance specialists
Technical writing specialists
Education specialists
Hardware or engineering specialists
Marketing specialists
Metrics and function point specialists

6. Effort spent prior to cost-tracking start-up (Up to 10% of reported effort)

7. Inclusion/exclusion of nonproject tasks (Up to 25% of reported effort)


Departmental meetings
Courses and education
Travel

Overall error magnitude (Up to 175% of reported effort)

Average accuracy of historical data (37% of true effort and costs)

Such gross bottom-line data cannot readily be validated and are almost useless for serious economic purposes.
Table 1.3 illustrates the differences between full activity-based costs for a software project and the typical leaky patterns of software measurements normally carried out. Table 1.3 uses a larger 40-activity chart of accounts that shows typical work patterns for large systems of 10,000 function points or more.
As can be seen, measurement leaks degrade the accuracy of the information
available to C-level executives and also make economic analysis of software costs
very difficult unless the gaps are corrected.
To illustrate the effect of leakage from software tracking systems, consider what
the complete development cycle would look like for a sample project. The sample is
for a PBX switching system of 1,500 function points written in the C programming
language. Table 1.4 illustrates a full set of activities and a full set of costs.

Table 1.2 Gaps and Omissions Observed in Data for a Software Chart of Accounts

Activities Performed Completeness of Historical Data

01 Requirements Missing or Incomplete

02 Prototyping Missing or Incomplete

03 Architecture Missing or Incomplete

04 Project planning Missing or Incomplete

05 Initial analysis and design Missing or Incomplete

06 Detail design Incomplete

07 Design reviews Missing or Incomplete

08 Coding Complete

09 Reusable code acquisition Missing or Incomplete

10 Purchased package acquisition Missing or Incomplete

11 Code inspections Missing or Incomplete

12 Independent verification and validation Complete (defense only)

13 Configuration management Missing or Incomplete

14 Integration Missing or Incomplete

15 User documentation Missing or Incomplete

16 Unit testing Incomplete

17 Function testing Incomplete

18 Integration testing Incomplete

19 System testing Incomplete

20 Field testing Missing or Incomplete

21 Acceptance testing Missing or Incomplete

22 Independent testing Complete (defense only)

23 Quality assurance Missing or Incomplete

24 Installation and training Missing or Incomplete

25 Project management Missing or Incomplete

26 Total project resources, costs Incomplete



Table 1.3 Measured Effort versus Actual Effort: 10,000 Function Points
Activities, Percent of Total (%), Measured Results (%)

1 Business analysis 1.25

2 Risk analysis/sizing 0.26

3 Risk solution planning 0.25

4 Requirements 4.25

5 Requirements inspection 1.50

6 Prototyping 2.00

7 Architecture 0.50

8 Architecture inspection 0.25

9 Project plans/estimates 0.25

10 Initial design 5.00

11 Detail design 7.50 7.50

12 Design inspections 2.50

13 Coding 22.50 22.50

14 Code inspections 20.00

15 Reuse acquisition 0.03

16 Static analysis 0.25

17 COTS Package purchase 0.03

18 Open-source acquisition 0.03

19 Code security audit 0.25

20 Independent verification and validation 1.00

21 Configuration control 1.00

22 Integration 0.75

23 User documentation 2.00

24 Unit testing 0.75 0.75

25 Function testing 1.25 1.25

26 Regression testing 1.50 1.50

27 Integration testing 1.00 1.00

28 Performance testing 0.50

29 Security testing 0.50

30 Usability testing 0.75

31 System testing 2.50 2.50

32 Cloud testing 0.50

33 Field (beta) testing 0.75

34 Acceptance testing 1.00

35 Independent testing 1.50

36 Quality assurance 2.00

37 Installation/training 0.65

38 Project measurement 0.50

39 Project office 1.00

40 Project management 10.00

Cumulative results 100.00 37.00

Unpaid overtime 7.50

Now consider what the same project would look like if only design, code, and unit test (DCUT) were recorded by the company's tracking system. This combination has been a common software measurement scope for more than 50 years. Table 1.5 illustrates the partial DCUT results.
Instead of a productivity rate of 6.00 function points per staff month, Table 1.5 indicates a productivity rate of 18.75 function points per staff month. Instead of a schedule of almost 25 calendar months, Table 1.5 indicates a schedule of less than 7 calendar months. Instead of a cost per function point of U.S. $1,666, the DCUT results are only U.S. $533 per function point.
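The arithmetic behind these contrasting figures is straightforward, and a short sketch makes the distortion explicit; the effort totals and the $10,000 fully burdened monthly rate below are the ones quoted in Tables 1.4 and 1.5.

    # Productivity and unit cost computed from the totals in Tables 1.4 and 1.5
    # (1,500 function points, $10,000 fully burdened cost per staff month).

    def productivity_and_cost(function_points, effort_staff_months, monthly_rate):
        fp_per_month = function_points / effort_staff_months
        cost_per_fp = (effort_staff_months * monthly_rate) / function_points
        return fp_per_month, cost_per_fp

    full_scope = productivity_and_cost(1500, 249.99, 10000)  # all 25 activities
    dcut_only = productivity_and_cost(1500, 80.00, 10000)    # design, code, unit test

    print("Full scope: %.2f FP per staff month, $%.2f per function point" % full_scope)
    print("DCUT only:  %.2f FP per staff month, $%.2f per function point" % dcut_only)
    # The project did not become cheaper; the tracking system simply dropped
    # roughly two-thirds of the effort that was actually expended.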
Yet both Tables 1.4 and 1.5 describe exactly the same project. Unfortunately, what passes for historical data far more often matches the partial results shown in Table 1.5 than the complete results shown in Table 1.4.
Table 1.4 Example of Complete Costs for Software Development
Average monthly salary $8,000

Burden rate 25%

Fully burdened monthly rate $10,000

Work hours per calendar month 132.00

Application size in FP 1,500

Application type Systems

CMM level 1

Programming language C

LOC per FP 128.00

Activities, Staff Function Point Assignment Scope, Monthly Function Point Production Rate, Work Hours per Function Point, Burdened Cost per Function Point, Schedule Months, Staff, Effort Months

01 Requirements 500 200.00 0.66 $50.00 2.50 3.00 7.50

02 Prototyping 500 150.00 0.88 $66.67 3.33 3.00 10.00

03 Architecture 1,000 300.00 0.44 $33.33 3.33 1.50 5.00


04 Project plans 1,000 500.00 0.26 $20.00 2.00 1.50 3.00

05 Initial design 250 175.00 0.75 $57.13 2.86 3.00 8.57

06 Detail design 250 150.00 0.88 $66.67 1.67 6.00 10.00

07 Design reviews 200 225.00 0.59 $44.47 0.89 7.50 6.67

08 Coding 150 25.00 5.28 $400.00 6.00 10.00 60.00

09 Reuse acquisition 500 1,000.00 0.13 $10.00 0.50 3.00 1.50

10 Package purchase 2,000 2,000.00 0.07 $5.00 1.00 0.75 0.75

11 Code inspections 150 75.00 1.76 $133.33 2.00 10.00 20.00

12 Independent verification and validation 1,000 250.00 0.53 $40.00 4.00 1.50 6.00

13 Configuration management 1,500 1,750.00 0.08 $5.73 0.86 1.00 0.86

14 Integration 750 350.00 0.38 $28.60 2.14 2.00 4.29

15 User documentation 1,000 75.00 1.76 $133.33 13.33 1.50 20.00

16 Unit testing 200 150.00 0.88 $66.67 1.33 7.50 10.00

17 Function testing 250 150.00 0.88 $66.67 1.67 6.00 10.00

18 Integration testing 250 175.00 0.75 $57.13 1.43 6.00 8.57

19 System testing 250 200.00 0.66 $50.00 1.25 6.00 7.50

20 Field (beta) testing 1,000 250.00 0.53 $40.00 4.00 1.50 6.00

21 Acceptance testing 1,000 350.00 0.38 $28.60 2.86 1.50 4.29

22 Independent testing 750 200.00 0.66 $50.00 3.75 2.00 7.50

23 Quality assurance 1,500 250.00 0.53 $40.00 6.00 1.00 6.00

24 Installation/training 1,500 250.00 0.53 $40.00 6.00 1.00 6.00

25 Project management 1,000 75.00 1.76 $133.33 13.33 1.50 20.00

Cumulative Results 420.00 6.00 22.01 $1,666.60 24.65 3.57 249.99



Table 1.5 Example of Partial Costs for Software Development (DCUT = Design, Code, and Unit Test)

Average monthly salary $8,000

Burden rate 25%

Fully burdened monthly rate $10,000

Work hours per calendar month 132

Application size in FP 1,500

Application type Systems

CMM level = 1

Programming language C

LOC per FP = 128

Activities, Staff Function Point Assignment Scope, Monthly Function Point Production Rate, Work Hours per Function Point, Burdened Cost per Function Point, Schedule Months, Staff, Effort Months

01 Design 500 150 0.88 $66.67 3.33 3.00 10.00

02 Coding 150 25 5.28 $400.00 6.00 10.00 60.00



03 Unit testing 200 150 0.88 $66.67 1.33 7.50 10.00

Cumulative results 130 18.75 7.04 $533.33 6.93 11.54 80.00

This leakage of data is not economically valid, and it denies C-level executives the information they need and deserve in order to understand the real costs of software.
Internal software projects where the development organization is defined as
a cost center are the most incomplete and inaccurate in collecting software data.
Many in-house projects by both corporations and government agencies lack use-
ful historical data. Thus such organizations tend to be very optimistic in their
internal estimates because they have no solid basis for comparison. If they switch
to a commercial estimating tool, they tend to be surprised at how much more
costly the results might be.
External projects that are being built under contract, and projects where the
development organization is a profit center, have stronger incentives to capture
costs with accuracy. Thus contractors and outsource vendors are likely to keep
better records than internal software groups.
Another major gap for internal software projects developed by companies for
their own use is the almost total failure to measure user costs. Users participate in
requirements, review documents, participate in phase reviews, perform acceptance
tests, and are sometimes embedded in development teams if the agile methodology
is used. Sometimes user costs can approach or exceed 75% of development costs.
Table 1.6 shows typical leakage for user costs for internal projects where users are
major participants. Table 1.6 shows an agile project of 1,000 function points.
As can be seen in Table 1.6, user costs were more than 35% of development costs. This is too large a value to remain invisible and unmeasured if software economic analysis is going to be taken seriously.
Tables 1.3 through 1.6 show how wide the differences can be between full
measurement and partial measurement. But an even wider range is possible,
because many companies measure only coding and do not record unit test as a
separate cost element.
Table 1.7 shows the approximate distribution of tracking methods noted at more than 150 companies visited by the author, covering around 26,000 projects.
Among the author's clients, about 90% of project historical data are wrong and incomplete until Namcook consultants help the clients to correct them. In fact, the average among the author's clients is that historical data are only about 37% complete for effort and less than 25% complete for quality.
Only 10% of the author’s clients actually have complete cost and resource data
that include management and specialists such as technical writers. These projects
usually have formal cost-tracking systems and also project offices for larger projects.
They are often contract projects where payment depends on accurate records of
effort for billing purposes.
Leakage from cost-tracking systems and the wide divergence in what activities
are included present a major problem to the software industry. It is very difficult
to perform statistical analysis or create accurate benchmarks when so much of
the reported data are incomplete, and there are so many variations in what gets
recorded.

Table 1.6 User Effort versus Development Team Effort: Agile 1,000
Function Points
Activities, Team Percent of Total, User Percent of Total

1 Business analysis 1.25 3.75

2 Risk analysis/sizing 0.26

3 Risk solution planning 0.25

4 Requirements 4.25 5.31

5 Requirements inspection 1.50 1.50

6 Prototyping 2.00 0.60

7 Architecture 0.50

8 Architecture inspection 0.25

9 Project plans/estimates 0.25

10 Initial design 5.00

11 Detail design 7.50

12 Design inspections 2.50

13 Coding 22.50

14 Code inspections 20.00

15 Reuse acquisition 0.03

16 Static analysis 0.25

17 COTS package purchase 0.03 1.00

18 Open-source acquisition 0.03

19 Code security audit 0.25

20 Independent verification and validation 1.00

21 Configuration control 1.00

22 Integration 0.75

23 User documentation 2.00 1.00

24 Unit testing 0.75

25 Function testing 1.25

26 Regression testing 1.50

27 Integration testing 1.00

28 Performance testing 0.50

29 Security testing 0.50

30 Usability testing 0.75

31 System testing 2.50

32 Cloud testing 0.50

33 Field (beta) testing 0.75 9.00

34 Acceptance testing 1.00 4.00

35 Independent testing 1.50

36 Quality assurance 2.00

37 Installation/training 0.65 9.75

38 Project measurement 0.50

39 Project office 1.00

40 Project management 10.00

Cumulative results 100.00 35.91

Unpaid overtime 5.00 5.00

The gaps and variations in historical data explain why the author and his
colleagues find it necessary to go on-site and interview project managers and
technical staff before accepting historical data. Unverified historical data are
often so incomplete as to negate the value of using them for benchmarks and
industry studies.
When we look at software quality data, we see similar leakages. Many companies do not track any bugs before release. Only sophisticated companies such as IBM, Raytheon, and Motorola track pretest bugs.

Table 1.7 Distribution of Cost/Effort-Tracking Methods


Activities Percent of Projects

Coding only 5.00

Coding, unit test 10.00

Design, coding, and unit test (DCUT) 40.00

Requirements, design, coding, and testing 20.00

All development, but not project management 15.00

All development and project management including specialists 10.00

Total 100.00

At IBM, there were even volunteers who recorded bugs found during desk check
sessions, debugging, and unit testing, just to provide enough data for statistical analysis.
(The author served as an IBM volunteer and recorded desk check and unit test bugs.)
Table 1.8 shows the pattern of missing data for software defect and quality measurements for an application of a nominal 1,000 function points in Java.

Table 1.8 Measured Quality versus Actual Quality: 1000 Function Points
Defect Removal Activities, Defects Removed, Defects Measured, Percent of Total

1 Requirements inspection 200 5.71

2 Requirements changes 25 0.71

3 Architecture inspection 50 1.43

4 Initial design inspection 100 2.86

5 Detail design inspection 300 8.57

6 Design changes 50 1.43

7 Code inspections 750 21.43

8 Code changes 150 4.29

9 User document editing 75 2.14

10 User document changes 20 0.57

11 Static analysis 700 20.00

12 Unit test 100 2.86

13 Function testing 150 150 4.29

14 Regression testing 50 50 1.43

15 Integration testing 150 150 4.29

16 Performance testing 50 50 1.43

17 Security testing 30 30 0.86

18 Usability testing 40 40 1.14

19 System testing 100 100 2.86

20 Cloud testing 70 70 2.00

21 Field (beta) testing 40 40 1.14

22 Acceptance testing 30 30 0.86

23 Independent testing 20 20 0.57

24 Quality assurance 50 50 1.43

25 90-day customer bug reports 200 200 5.71

Cumulative results 3,500 980 100.00

Percent (%) of total defects 100.00 32.33

Defects per function point 3.50 0.98

Discovered defects 3,300 780

Delivered defects 200 200

Defect removal efficiency (DRE) 94.29% 79.59%

Out of the 25 total forms of defect removal, data are collected only for 13 of
these under normal conditions. Most quality measures ignore all bugs found before
testing, and they ignore unit test bugs too.
The apparent defect density of the measured defects is less than one-third of the true volume of software defects. In other words, true defect potentials would be about 3.50 defects per function point, but due to gaps in the measurement of quality, apparent defect potentials would seem to be just under 1.00 defects per function point.

The apparent defect removal efficiency (DRE) is artificially reduced from more
than 94% to less than 80% due to the missing defect data from static analysis,
inspections, and other pretest removal activities.
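The DRE arithmetic behind these two percentages can be reproduced directly from the totals quoted in Table 1.8, as the short sketch below shows.

    # Defect removal efficiency (DRE) computed from the totals in Table 1.8.

    def defect_removal_efficiency(found_before_release, found_after_release):
        """DRE = internal removals / (internal removals + user-reported defects)."""
        return found_before_release / (found_before_release + found_after_release)

    customer_reported = 200    # defects reported in the first 90 days of use
    true_internal = 3300       # all inspections, static analysis, and test stages
    measured_internal = 780    # only the 13 activities that were actually measured

    print("True DRE:     %.2f%%" % (100 * defect_removal_efficiency(true_internal, customer_reported)))
    print("Apparent DRE: %.2f%%" % (100 * defect_removal_efficiency(measured_internal, customer_reported)))
    # 94.29% versus 79.59%: same project, same bugs; the lower figure is purely
    # an artifact of the missing pretest measurements.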
For the software industry as a whole, the costs of finding and fixing bugs are the
top cost driver. It is professionally embarrassing for the industry to be so lax about
measuring the most expensive kind of work since software began.
The problems illustrated in Tables 1.1 through 1.8 are just the surface manifestation of a deeper issue. After more than 50 years, the software industry lacks anything that resembles a standard chart of accounts for collecting historical data.
This lack is made more difficult by the fact that in real life, there are many
variations of activities that are actually performed. There are variations due to
application size, and variations due to application type.
Chapter 2

Variations in Software
Activities by Type
of Software

In many industries, building large products is not the same as building small products. Consider the differences in specialization and methods required to build a rowboat versus an 80,000-ton cruise ship.
A rowboat can be constructed by a single individual using only hand tools. But a large modern cruise ship requires more than 350 workers, including many specialists such as pipe fitters, electricians, steel workers, painters, and even interior decorators and a few fine artists.
Software follows a similar pattern: building large systems in the 10,000 to 100,000 function point range is more or less equivalent to building other large structures such as ships, office buildings, or bridges. Many kinds of specialists are utilized, and the development activities are quite extensive compared to smaller applications.
Table 2.1 illustrates the variations in development activities noted for six size
plateaus using the author’s 25-activity checklist for development projects.
Below the plateau of 1,000 function points (which is roughly equivalent to
100,000 source code statements in a procedural language such as COBOL), less
than half of the 25 activities are normally performed. But large systems in the
10,000 to 100,000 function point range perform more than 20 of these activities.
To illustrate these points, Table 2.2 shows more detailed quantitative variations
in results for three size plateaus, 100, 1,000, and 10,000 function points.

Table 2.1 Development Activities for Six Project Size Plateaus

Activities Performed, 1 Function Point, 10 Function Points, 100 Function Points, 1,000 Function Points, 10,000 Function Points, 100,000 Function Points
1. Requirements X X X X X X

2. Prototyping X X X

3. Architecture X X
4. Project plans X X X
5. Initial design X X X X X

6. Detail design X X X X

7. Design reviews X X

8. Coding X X X X X X

9. Reuse acquisition X X X X X X

10. Package purchase X X

11. Code inspections X X X

12. Independent verification and validation
13. Change control X X X

14. Formal integration X X X


15. User documentation X X X X

16. Unit testing X X X X X X

17. Function testing X X X X

18. Integration testing X X X

19. System testing X X X

20. Beta testing X X

21. Acceptance testing X X X

22. Independent testing

23. Quality assurance X

24. Installation/training X X X

25. Project management X X X X X X

Activities 5 6 9 18 22 23


Table 2.2 Variations by Powers of Ten (100, 1,000, and 10,000 Function Points)
Size in function points 100 1,000 10,000
Size in SNAP points 13 155 1,750
Examples Medium update, Smart phone app, Local system
Team experience Average Average Average
Methodology Iterative Iterative Iterative
Sample size for this table 150 450 50
CMMI levels (0 = CMMI not used) 0 1 1
Monthly burdened costs $10,000 $10,000 $10,000
Major cost drivers (rank order):
1 Coding, Bug repairs, Bug repairs
2 Bug repairs, Coding, Paperwork
3 Management, Paperwork, Coding
4 Meetings, Management, Creep
5 Function requirements, Function requirements, Function requirements
6 Nonfunctional requirements, Nonfunctional requirements, Nonfunctional requirements
7 Paperwork, Creep, Meetings
8 Integration, Integration, Integration
9 Creep, Meetings, Management
Programming language Java Java Java
Source statements per function point 53 53 53
Size in logical code statements (Software Risk Master [SRM] default for LOC) 5,300 53,000 530,000

Size in logical KLOC (SRM default for KLOC) 5.3 53 530
Size in physical LOC (not recommended) 19,345 193,450 1,934,500
Size in physical KLOC (not recommended) 19.35 193.45 1,934.50
Client planned schedule in calendar months 5.25 12.5 28
Actual schedule in calendar months 5.75 13.8 33.11
Plan/actual schedule difference 0.5 1.3 5.11
Schedule slip percent 9.61% 10.43% 18.26%
Staff size (technical + management) 1.25 6.5 66.67
Effort in staff months 7.19 89.72 2,207.54
Work hours per month (U.S. value) 132 132 132
Unpaid overtime per month (software norms) 0 8 16
Effort in staff hours 949.48 11,843.70 291,395.39
International Function Point Users Group (IFPUG) function points per month 13.9 11.15 4.53
Work hours per function point 9.49 11.84 29.14
Logical lines of code (LOC) per month (includes executable statements and data definitions) 736.83 590.69 240.09
Physical lines of code (LOC) per month (includes blank lines, comments, headers, etc.) 2,689.42 2,156.03 876.31
Requirements creep (total percent growth) 1.00% 6.00% 15.00%
Requirements creep (function points) 1 60 1,500
Probable deferred features to release 2 0 0 2,500
Client planned project cost $65,625 $812,500 $18,667,600
Actual total project cost $71,930 $897,250 $22,075,408
Plan/actual cost difference $6,305 $84,750 $3,407,808
Plan/actual percent difference 8.77% 9.45% 15.44%
Planned cost per function point $656.25 $812.50 $1,866.76
Actual cost per function point $719.30 $897.25 $2,207.54

Defect Potentials and Removal Percent

Defect Potentials Defects Defects Defects


Requirement defects 5 445 6,750
Architecture defects 0 1 27
Design defects 25 995 14,700
Code defects 175 2,150 30,500
Document defects 11 160 1,650
Bad-fix defects 15 336 3,900

Total defects 231 4,087 57,527

Defects per function point 2.31 4.09 5.75
Defect removal efficiency (DRE) 97.50% 96.00% 92.50%
Delivered defects 6 163 4,313
High-severity defects 1 20 539
Security flaws 0 3 81
Delivered defects per function point 0.06 0.16 0.43
Delivered defects per KLOC 1.09 3.08 8.14

Test Cases for Selected Tests
Unit test 101 1,026 10,461
Function test 112 1,137 11,592
Regression test 50 512 5,216
Component test 67 682 6,955
Performance test 33 341 3,477
System test 106 1,080 11,012
Acceptance test 23 237 2,413
Total 492 5,016 51,126
Test cases per function point 4.92 5.02 5.11
Probable test coverage 95.00% 92.00% 87.00%
Probable peak cyclomatic complexity 12 15 >25.00

Document Sizing

Document Sizes Pages Pages Pages


Requirements 40 275 2,126

Architecture 17 76 376

Initial design 45 325 2,625

Detail design 70 574 5,118

Test plans 23 145 1,158

Development plans 6 55 550

Cost estimates 17 76 376

User manuals 38 267 2,111

HELP text 19 191 1,964

Courses 15 145 1,450

Status reports 20 119 1,249

Change requests 18 191 2,067

Bug reports 97 1,048 11,467

Total 423 3,486 32,638

Document set completeness 96.96% 91.21% 78.24%
Document pages per function point 4.23 3.49 3.26

Project Risks Risk % Risk % Risk %


Cancellation 8.80 14.23 26.47

Negative ROI 11.15 18.02 33.53

Cost overrun 9.68 15.65 34.00

Schedule slip 10.74 18.97 38.00

Unhappy customers 7.04 11.38 34.00

Litigation 3.87 6.26 11.65

Technical debt/high COQ 5.00 16.00 26.21

Cyber attacks 7.00 9.75 15.30

Financial risk 9.00 21.00 41.00

High warranty repairs/low maintainability 6.00 14.75 32.00
Risk average 7.83 14.60 29.22
Project staffing by occupation group 100 1,000 10,000

Programmers 1.91 6.23 43.53

Testers 1.85 5.66 38.58

Designers 0.51 2.13 18.00

Business analysts 0 2.13 9.00

Technical writers 0.44 1.05 7.00

Quality assurance 0.46 0.98 5.00

1st line managers 1.21 1.85 7.13

Database administration 0 0 3.68

Project office staff 0 0 3.19

Administrative support 0 0 3.68

Configuration control 0 0 2.08

Project librarians 0 0 1.72

2nd line managers 0 0 1.43

Estimating specialists 0 0 1.23

Architects 0 0 0.86

Security specialists 0 0 0.49

Performance specialists 0 0 0.49

Function point counters 0 0.07 0.49

Human factors specialists 0 0 0.49

3rd line managers 0 0 0.36

Average total staff 6.37 20.11 148.42

Project Activity Patterns

Activities Performed, 100 Function Points, 1,000 Function Points, 10,000 Function Points
01 Requirements X X X

02 Prototyping X X

03 Architecture X

04 Project plans X X

05 Initial design X X

06 Detail design X X X

07 Design reviews X

08 Coding X X X

09 Reuse acquisition X X X

10 Package purchase X

11 Code inspections X

12 Independent verification and validation (IV&V)
13 Change control X X
14 Formal integration X X
15 User documentation X X X

16 Unit testing X X X

17 Function testing X X X

18 Integration testing X X

19 System testing X X

20 Beta testing X

21 Acceptance testing X X

22 Independent testing
23 Quality assurance X X
24 Installation/training X
25 Project management X X X

Activities 8 17 23

As can be seen in Table 2.2, what happens for a small project of 100 function points can be very different from what happens for a large system of 10,000 function points. Note the presence of many kinds of software specialists at the large 10,000 function point size and their absence for the smaller sizes. Note also the increase in activities from 8 to 23 as application size gets larger.
Just consider the simple mathematical combinations that have to be estimated
or measured as software size increases. A small project of 100 function points
might have three occupation groups and perform eight activities: that results in
24 combinations that need to be predicted or measured. A large system of 10,000
function points might have 20 occupation groups and perform 25 activities. This
results in a total of 500 combinations that need to be predicted or measured. Even
worse, some activities require many occupation groups, whereas others require only
a few or even one. The total permutations can run into the billions of potential
combinations!
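The combination counts quoted above come from simple multiplication, and a tiny sketch makes the explosion easy to see; the subset argument in the final comment is an illustrative reading of the permutation claim rather than a calculation taken from this book.

    # Activity/occupation combinations that a complete tracking system would have
    # to predict or measure, using the figures quoted in the paragraph above.

    def measurement_cells(occupation_groups, activities):
        return occupation_groups * activities

    print(measurement_cells(3, 8))    # small 100 function point project: 24 cells
    print(measurement_cells(20, 25))  # large 10,000 function point system: 500 cells

    # Once each activity can be staffed by any nonempty subset of the 20
    # occupation groups, there are 2**20 - 1 possible staffing patterns per
    # activity, so the space of patterns across all 25 activities explodes
    # combinatorially (illustrative reasoning, not a figure from the book).
    print(2**20 - 1)  # 1,048,575 staffing patterns for a single activity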
Chapter 3

Variations in Software
Development Activities
by Type of Software

Another key factor that influences software development activities is the type of software being constructed. For example, the methods utilized for building military software are very different from civilian norms: military software projects use independent verification and validation (IV&V) and also independent testing, which seldom occur for civilian projects.
The systems and commercial software domains also have fairly complex development activities compared to management information systems. The outsource domain, due to contractual implications, also uses a fairly extensive set of development activities.
Table 3.1 illustrates the differences in development activities that the author has
noted across the six types of software.
As can be seen, the activities for outsourced, commercial, systems, and military software are somewhat more numerous than for web and MIS projects, where development processes tend to be rudimentary in many cases.
The six types of software shown in Table 3.1 are far from being the only
kinds of software developed. For example, open-source applications developed
by independent personnel have a unique development method that can be quite
different from software developed by a single organization and a single team.
Software that requires government certification by agencies such as the U.S. Food and Drug Administration, the Federal Aviation Administration, or the Department of Defense will also have unique development patterns, and these can vary based on the specific government agency rules and regulations.

Table 3.1 Development Activities for Six Project Types

Activities Performed Web MIS Outsource Commercial Systems Military

01 Requirements X X X X X

02 Prototyping X X X X X X

03 Architecture X X X X

04 Project plans X X X X X

05 Initial design X X X X X

06 Detail design X X X X X

07 Design reviews X X X X

08 Coding X X X X X X

09 Reuse acquisition X X X X X X

10 Package purchase X X X X X X

11 Code inspections X X X X

12 Independent verification and validation X

13 Change control X X X X X

14 Formal integration X X X X X

15 User documentation X X X X X


16 Unit testing X X X X X X

17 Function testing X X X X X

18 Integration testing X X X X X

19 System testing X X X X X X

20 Beta testing X X X

21 Acceptance testing X X X X X

22 Independent testing X

23 Quality assurance X X X X

24 Installation/training X X X X X

25 Project management X X X X X

Activities 6 18 22 23 23 25

Large financial software applications in the United States that are subject to the
Sarbanes–Oxley rules will have a very elaborate and costly governance process that
never occurs on other kinds of software applications.
Software that is intended to be used or marketed in many countries will have
elaborate and expensive nationalization procedures that may include translation of
all documents, HELP text, and sometimes even code comments into other national
languages.
Table 3.2 shows the most likely number of activities, occupation groups, and
combinations for 20 different types of software:

Table 3.2 Software Activities, Occupations, and Combinations


Project Types Activities Occupations Combinations

1 Military software-weapons 33 35 1,155

2 Operating systems 30 31 930

3 ERP packages 30 30 900

4 Telecom—public switches 27 26 702

5 Embedded software 34 20 680

6 Systems software 28 23 644

7 Avionics applications 26 24 624

8 Medical applications 20 22 440

9 Social networks 21 18 378

10 Federal government applications 19 19 361

11 Financial applications 16 17 272

12 Multinational applications 18 15 270

13 State government applications 17 15 255

14 Big data 16 15 240

15 Municipal applications 18 13 234

16 Computer games 11 12 132

17 Smart phone applications 12 10 120

18 Web applications 9 11 99

19 Open source 8 4 32

20 Personal software 3 1 3

Averages 20 18 424

As can be seen in the table, building large military weapon systems is a
much more complicated problem than building web applications or open-source
applications. As activities and occupation groups get larger, projects become more
complex, more difficult, and have higher risks of failure, cost overruns, and schedule
overruns. Occupation groups will be discussed in detail in Chapter 4.
Chapter 4

Variations in Occupation
Groups, Staff Size,
Team Experience

Software development and maintenance are among the most labor-intensive
occupations in human history. A study commissioned by AT&T and carried out
by the author and his colleagues found a total of 116 different software occupation
groups such as business analysts, software engineers, testers, software quality
assurance, technical writers, and agile coaches.
No single company or project employed all of the 116 occupations, but some
large systems in companies such as Microsoft and IBM employed up to 50 different
occupations. A more recent study in 2016 found a total of 205 occupations.
Alarmingly, the largest increase in occupation groups was that concerned with
cyber-security due to the huge increase in cybercrime and looming threats of cyber
warfare.
Table 4.1 includes some of the part-time specialists who are often omitted from
historical data collection.
These specialists work mainly on large software projects having more than 1,000
function points. In fact, the larger the project, the more specialists are employed. In
total their combined efforts can top 60% of the total effort, and their contributions
should be captured as a normal part of historical data collection efforts.
One would think that the software literature and available benchmarks on
software would focus on staffing and occupation groups, but this is not the case. In
fact, the literature is almost silent on the varieties of software occupations, and few
software benchmarks include identifying the kinds of workers involved.


Table 4.1 Software Occupation Groups and Specialists 2016


1. Agile coaches

2. Architects (software)

3. Architects (systems)

4. Architects (enterprise)

5. Assessment specialists

6. Capability maturity model integrated (CMMI) specialists

7. Configuration control specialists

8. Cost-estimating specialists

9. Customer-support specialists

10. Database administration specialists

11. Education specialists

12. Enterprise resource planning (ERP) specialists

13. Expert-system specialists

14. Function-point specialists (certified)

15. Graphics-production specialists

16. Human-factors specialists

17. Integration specialists

18. Library specialists (for project libraries)

19. Maintenance specialists

20. Marketing specialists

21. Member of the technical staff (multiple specialties)

22. Measurement specialists

23. Metric specialists

24. Project cost-analysis specialists

25. Project managers

26. Project-office specialists

27. Process-improvement specialists



28. Quality-assurance specialists

29. Scrum masters

30. Security specialists

31. Technical-writing specialists

32. Testing specialists (automated)

33. Testing specialists (manual)

34. Web page design specialists

35. Web masters

One of the most common management reactions when projects start to run late
is to add more people. Of course this sometimes slows things down, but it is still
a common phenomenon as noted years ago by Dr. Fred Brooks in his classic book
The Mythical Man-Month (1975).
Here too the software literature and software benchmarks are strangely silent.
If the average complement of software engineers for 1,000 function points is six
people, what would happen if it were increased to 10 people? What would happen
if it were reduced to three people? The literature and most benchmarks are silent
on this basic issue.
There is a curve called the Putnam–Norden–Rayleigh (PNR) curve that
shows  the relationships between software effort and schedules. In essence, the
curve shows that one person working for 10 months and 10 people working for one
month are not equivalent.
With 10 people, communications would cause confusion and probably stretch
the schedule to two months, hence doubling the effort. One person might not have
all of the necessary skills, so the schedule might slip to 12 calendar months. Some
intermediate value such as four people working for 2.5 months would probably
deliver the optimal result.
As mentioned, Fred Brooks’ classic book, The Mythical Man-Month, made this
concept famous, and also has one of the best software book titles of all time. Fred
is also famous for the phrase no silver bullet to highlight the fact that no known
methodology solves all software engineering problems.
(Phil Crosby’s book, Quality Is Free (1979), is another great book title that
resonates through the ages. Phil also developed the phrase zero defects, which is a
laudable goal even if hard to achieve.)
The PNR curve originated with Lord Rayleigh, a British physicist who died in
1919. He was the discoverer of argon gas, and he also developed a mathematical
model of light scattering that explains why the sky is blue. Peter Norden of IBM
and Larry Putnam of Quantitative Software Management (QSM) applied Rayleigh
curves to software projects, with fairly good success.
There are several flavors of the Rayleigh curve. One flavor shows that software
staffing is not linear but starts small, builds to a peak, and then declines near the
end of development (Figure 4.1).
A second flavor shows the relationships between effort and schedules and indi-
cates an interesting concept that there will be an impossible region where a project
cannot be completed in that scheduled duration no matter how large the staff is
(Figure 4.2).
Figure 4.1 Rayleigh curve illustration (the Rayleigh staffing profile plotted
against time, with a flat staffing approximation shown for comparison).

Figure 4.2 Impossible region where no further time compression is possible
(the PNR effort/time curve: effort, staff × time, plotted against the time to
complete the project, with a linear range and an impossible zone).

The PNR curves are generic and do not show specific occupation groups or
skill levels or unpaid overtime. Therefore some modifications and customization
are needed. The Rayleigh curves were first applied to small projects and were used
for the population of coding programmers. For large systems with 50 or more
occupation groups, the Rayleigh curves are not a perfect fit but are still a useful
concept.
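For readers who want to see the shape of the curve numerically, the following is a
minimal sketch of the classic Norden/Putnam form of the Rayleigh staffing profile,
staff(t) = (K × t / td²) × exp(−t² / (2 × td²)), where K is total effort and td is the
month of peak staffing. The 150 staff months and the 6-month peak used here are
assumptions for illustration only, not SRM outputs.

    import math

    TOTAL_EFFORT = 150.0   # K: total staff months (assumed for illustration)
    PEAK_MONTH = 6.0       # td: month of peak staffing (assumed)

    def rayleigh_staff(t):
        """Instantaneous staffing at month t for a Rayleigh profile."""
        return (TOTAL_EFFORT * t / PEAK_MONTH ** 2) * math.exp(-t ** 2 / (2 * PEAK_MONTH ** 2))

    for month in range(0, 13):
        print(f"month {month:2d}: {rayleigh_staff(month):5.2f} equivalent full-time staff")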
Incidentally, the agile concept of pair programming is an existence proof that
doubling staff size does not cut elapsed time in half. In fact, some pair-programming
projects take longer than the same number of function points developed by a single
programmer!
The literature on pair programming is woefully inadequate because it only com-
pares individual programmers to pairs and ignores other factors such as inspections,
static analysis, automated proofs, requirements models, and many other modern
quality techniques.
The author’s observations and data are that single programmers using inspec-
tions and static analysis have better quality than pair programmers and also have
shorter coding schedules.
Table 4.2 presents the normal staffing and occupation group patterns for five
software size plateaus: 10, 100, 1,000, 10,000, and 100,000 function points. Java is the
assumed language in all five cases. Table 4.2 shows only 20 of the more common
software occupation groups that occur with a high frequency.

Table 4.2 Software Staffing Patterns: 10 to 100,000 Function Points

Project Staff 10 100 1,000 10,000 100,000

1 Programmers 1.0 1.9 6.2 43.5 301.1

2 Testers 1.0 1.8 5.7 38.6 265.9

3 Designers 0.0 0.5 2.1 18.0 135.0

4 Business analysts 0.0 0.0 2.1 9.0 110.0

5 Technical writers 0.3 0.4 1.1 7.0 62.5

6 Quality assurance 0.0 0.5 1.0 5.0 53.4

7 1st line managers 0.5 1.2 1.9 7.1 37.0

8 Database administration 0.0 0.0 0.5 3.7 26.3

9 Project office staff 0.0 0.0 0.0 3.2 22.8

10 Administrative support 0.0 0.0 0.0 3.7 21.0

11 Configuration control 0.0 0.0 0.0 2.1 14.9

12 Project librarians 0.0 0.0 0.0 1.7 12.3

13 2nd line managers 0.0 0.0 0.0 1.4 9.0

14 Estimating specialists 0.0 0.0 0.0 1.2 8.8


15 Architects 0.0 0.0 0.0 0.9 6.1

16 Security specialists 0.0 0.0 0.0 0.5 3.5

17 Performance specialists 0.0 0.0 0.0 0.5 3.5

18 Function point counters 0.0 0.0 0.5 2.0 5.0

19 Human-factors specialists 0.0 0.0 0.0 1.0 3.5

20 3rd line managers 0.0 0.0 0.0 0.0 2.2

Total staff 1.8 4.5 14.8 106.5 802.9

Total occupations 5.0 7.0 10.0 19.0 20.0

Note that the small project of 10 function points used few occupation groups,
with programmers and testers being the two main categories. But as applications
get larger, more and more specialists are needed: business analysts, function point
counters, database administration, and many others.
Note also that because some of the specialists such as technical writers and busi-
ness analysts are only involved part of the time, it is necessary to deal with part-time
fractional personnel rather than all full-time personnel. This is why there are more
occupations than people for 10 function points but more people than occupations
above 1,000 function points.
Figure 4.3 shows the probable impact of various team sizes for an application of
1,000 function points coded in Java.

Figure 4.3 The impact of team size on schedules for 1,000 function points
(team sizes from 5 to 10 plotted against development schedules in calendar
months).

As can be seen, adding staff can shorten schedules, up to a point. However, add-
ing staff raises costs and lowers productivity. Also, adding excessive staff can lead
to confusion and communication problems. Selecting the optimal staff size for a
specific project is one of the more complex calculations of parametric estimating
tools such as Software Risk Master (SRM) and the other parametric estimation
tools.
Team size, team experience, work hours, and unpaid overtime are all personnel
issues that combine in a very complex pattern. This is why parametric estimation
tools tend to be more accurate than manual estimates and also more repeatable
because the algorithms are embedded in the tool and not only in the minds of
project managers.
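A toy model makes the trade-off visible. It assumes that effort inflates with the
number of two-way communication paths, n(n − 1)/2, and that the schedule is
roughly effort divided by staff; the base effort of 84 staff months and the overhead
factor are invented values for illustration and are not the algorithms used by SRM
or other parametric tools.

    # Toy model (not SRM's algorithms): adding staff shortens the schedule up to
    # a point, but communication overhead grows as n(n - 1) / 2 and raises effort.
    BASE_EFFORT = 84.0          # staff months for a friction-free team (assumed)
    OVERHEAD_PER_PATH = 0.02    # effort penalty per communication path (assumed)

    for staff in range(4, 17, 2):
        paths = staff * (staff - 1) / 2
        effort = BASE_EFFORT * (1 + OVERHEAD_PER_PATH * paths)
        schedule = effort / staff
        print(f"staff {staff:2d}: {paths:5.0f} paths, effort {effort:6.1f} staff months, "
              f"schedule {schedule:5.1f} months")

In this sketch the schedule shrinks until about 10 people and then begins to grow
again, while total effort rises steadily, which is the qualitative pattern described
above.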
Another factor of importance is that of experience levels. SRM uses a five-point
scale to rank team experience ranging from novice to expert. The chart in Figure 4.4
shows the approximate differences using an average team as the 100% mark.
As can be seen, results are slightly asymmetrical. Top teams are about 30% more
productive than average, but novice teams are only about 15% lower than average.
The reason for this is that normal corporate training and appraisal programs tend
to weed out the really unskilled so that they seldom become actual team members.
The same appraisal programs reward the skilled, so that explains the fact that the
best results have a longer tail.
Software is a team activity. The ranges in performance for specific individuals
can top 100%. But there are not very many of these superstars. Only about 5% to
10% of general software populations are at the really high level of the performance
spectrum.
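A minimal sketch of how such experience adjustments might be applied is shown
below. The 0.85 and 1.30 multipliers correspond to the roughly −15% and +30%
figures just cited; the intermediate values and the 10 function points per month
baseline are assumptions for illustration, not SRM's calibrated factors.

    # Illustrative experience multipliers relative to an average team (= 1.00).
    EXPERIENCE_MULTIPLIER = {
        "novice": 0.85,
        "inexperienced": 0.92,   # assumed intermediate value
        "average": 1.00,
        "experienced": 1.15,     # assumed intermediate value
        "expert": 1.30,
    }
    AVERAGE_RATE = 10.0  # function points per staff month for an average team (assumed)

    for level, factor in EXPERIENCE_MULTIPLIER.items():
        print(f"{level:13s}: about {AVERAGE_RATE * factor:5.2f} function points per month")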
Figure 4.4 Impact of team experience on software development productivity
(the five experience levels from novice to expert plotted as a percentage of
average team performance).

A third personnel factor with a strong influence is that of unpaid overtime.
Unpaid overtime has a tangible impact on software costs, and a more subtle impact
on software schedules. Projects with plenty of unpaid overtime will have shorter
schedules, but because most tracking systems do not record unpaid overtime, it is
difficult to study this situation without interviewing development teams. Figure 4.5
shows the probable impact of unpaid overtime.
Unpaid overtime is seldom measured, and this causes problems for estimation
and also with using historical data as benchmarks. The missing unpaid overtime
can make as much as 15% of total effort invisible!
Figure 4.5 is a standard SRM output and shows the impact of unpaid overtime
on project costs for a project of 1,000 function points or 53,333 Java statements.
The graph shows unpaid overtime hours per calendar month for the full team:
for example, 2 hours per month for a team of seven people amounts to 14 hours of
free work each month.
As can be seen, unpaid overtime is a significant factor for software cost esti-
mates, and it is equally significant for software schedule prediction. SRM includes
unpaid overtime as a standard adjustment to estimates, but the default value is zero.
Users need to provide local values for unpaid overtime in their specific organiza-
tions and for specific projects.
The range of unpaid overtime runs from 0 to more than 20 hours per month.
This is a major variable but one that is often not measured or included properly in
software cost estimates. When personnel are working long hours for free, this has a
big impact on project results.
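The direction of the effect can be sketched with a simplified calculation. All of the
values below (team size, the nominal 132-hour paid work month, the $7,500 monthly
cost, and the 18.5-month baseline schedule) are assumptions for illustration; the
sketch only shows why more unrecorded free hours translate into shorter apparent
schedules and lower apparent costs, and is not SRM's actual adjustment.

    # Simplified illustration: unpaid overtime compresses schedules and lowers
    # apparent costs because the free hours are never recorded or billed.
    TEAM_SIZE = 10
    PAID_HOURS_PER_MONTH = 132          # nominal paid work month (assumed)
    COST_PER_STAFF_MONTH = 7500         # fully burdened monthly cost (assumed)
    BASELINE_SCHEDULE_MONTHS = 18.5     # schedule with zero unpaid overtime (assumed)

    paid_team_hours = TEAM_SIZE * PAID_HOURS_PER_MONTH
    for unpaid in (0, 2, 4, 8, 12, 16):
        total_team_hours = TEAM_SIZE * (PAID_HOURS_PER_MONTH + unpaid)
        months = BASELINE_SCHEDULE_MONTHS * paid_team_hours / total_team_hours
        cost = months * TEAM_SIZE * COST_PER_STAFF_MONTH
        print(f"unpaid {unpaid:2d} hours/person/month: schedule {months:5.2f} months, "
              f"apparent cost ${cost:,.0f}")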
In today’s world with software being developed globally and international out-
source vendors having about 30% of the U.S. software, market local work hours
per calendar month also need to be included. The sum of paid and unpaid hours per
month ranges globally from about 202 for India to 116 for the Netherlands
(Figure 4.6).
Figure 4.5 Impact of unpaid overtime on software project costs (project costs
for the 1,000 function point example plotted for monthly unpaid overtime levels
of 0, 2, 4, 8, 12, and 16 hours).

Figure 4.6 Typical work hours per month by country (India, Peru, Malaysia,
Japan, the United States, Canada, France, Norway, and the Netherlands), ranging
from about 202 hours for India down to about 116 for the Netherlands.

These factors are all intertwined and have simultaneous impacts. The best
results for U.S. software would be small teams of experts working more than
150 hours per month who put in perhaps 12 hours of unpaid overtime per month.
U.S. industry segments with this combination include start-up companies, com-
puter game companies, open-source development, and commercial software
companies.
The worst U.S. results would be large teams of novices working less than
125 hours per month with zero unpaid overtime; being inexperienced, they need
frequent meetings to decide what has to be done rather than simply doing the
necessary work. Sectors that might fit this pattern include state and local
government software groups.
In other words, large teams tend to have messy communication channels that
slow down progress. U.S. industry segments with this combination include state
and federal government software projects, unionized shops, and some time-and-
materials contracts where unpaid overtime is not allowed.
On a global basis, the best results would be small teams of experts in coun-
tries with intense work months of more than 160 hours per month and more than
16 hours of unpaid overtime each month.
On a global basis, the worst results would be large teams of novices in heavily
unionized countries working less than 120 hours per month where unpaid overtime
does not exist.
Chapter 5

Variations due to
Inaccurate Software
Metrics That
Distort Reality

The software industry is almost unique in having a collection of widely used
metrics that are dangerously inaccurate and even distort reality and reverse true
economic results! Three very common dangerous metrics circa 2016 include the
following:

1. The traditional lines of code (LOC) metrics, which penalize modern high-level
languages and make requirements and design invisible (a worked example
follows this list).
2. The cost per defect metric, which penalizes quality and makes buggy software
look better than it really is.
3. The technical debt metric, which omits expensive quality issues such as litiga-
tion and consequential damages. Technical debt does not occur at all for
projects where quality is so bad that they are canceled and never released,
even though some canceled projects cost millions of dollars. Technical debt is
also hard to apply to embedded and systems software. It does provide inter-
esting if incomplete information for information technology projects.
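To make the first distortion concrete, here is a worked example. It assumes (purely
for illustration, using the language sizes that appear later in Table 8.1) that a 1,000
function point application requires 53,333 lines of Java or 320,000 lines of basic
assembly, that coding cost is proportional to the lines written, and that the noncode
work of requirements, design, and documentation costs the same in either language.
Cost per LOC then makes the assembly version look cheaper even though its total
cost is far higher; the dollar figures themselves are invented.

    # Why cost per LOC penalizes high-level languages (assumed values only).
    NONCODE_COST = 500_000        # requirements, design, documents, management (assumed)
    COST_PER_LINE = 5.0           # coding and unit test cost per line written (assumed)
    FUNCTION_POINTS = 1000

    for language, loc in (("Java", 53_333), ("basic assembly", 320_000)):
        total = NONCODE_COST + loc * COST_PER_LINE
        print(f"{language:15s}: total ${total:>9,.0f}, cost per LOC ${total / loc:5.2f}, "
              f"cost per function point ${total / FUNCTION_POINTS:8.2f}")

Under these assumptions the assembly version costs nearly three times as much in
total, yet its cost per LOC is less than half that of the Java version, which is exactly
the reversal of true economic results described above.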

It is an interesting sociological question as to why an industry that employs
thousands of mathematicians and statisticians should continue to use dangerous
and inaccurate metrics for more than 60  years with hardly any of the software
literature even questioning their results?
Universities have been essentially silent on these metric problems, and indeed
some continue to teach software engineering and software project management
using LOC and cost per defect metrics without any cautions to students at all that
these metrics are invalid and distort reality.
Progress in software engineering requires a great deal of technical knowledge
that can only be derived from accurate measurements of size, quality, costs, and
technology effectiveness. Until the software industry abandons bad metrics
and switches to functional metrics, progress will resemble a drunkard's walk, with
as many backward steps as forward ones.
The software industry has suffered from inaccurate metrics and sloppy and
incomplete measurement practices for more than 60 years. This is a key factor in
today’s poor software quality and low software productivity.
If medicine had the same dangerous combination of bad metrics and incomplete
measurements as software does, then medical doctors would probably not be using
sterile surgical procedures even in 2016 and might still be treating infections with
bloodletting and leeches instead of with antibiotics! Vaccinations would probably
not exist and antibiotics might not have been discovered. Things like joint replace-
ments would be impossible because they require very accurate measurements.
Sometimes metrics problems do impact other industries. A huge lawsuit
occurred in Narragansett, Rhode Island when a surveyor miscalculated property
lines and a house was built that encroached on a public park by about 6 feet! This
house was not a small shack but an imposing home of over 4,000 square feet.
The park had been deeded to the city by a philanthropist and part of the deed
restrictions were that the park dimensions needed to be kept unchanged and
the park had to be kept available for free public usage. Obviously the new house
changed the park’s physical dimensions and kept citizens from using the portion of
the park under or near the new house.
This case reached the Rhode Island Supreme Court and the court ordered that
the house either be demolished completely or moved far enough to restore the park
boundaries. Note that the unfortunate owners of the new house were not at fault
because they and the builder acted in good faith and depended on the flawed sur-
vey. As might be expected, the surveyor declared bankruptcy so there was no way
for the owner to recover the costs.
Problems of this magnitude are rare in home construction, but similar problems
occur on about 35% of large software projects that are either canceled completely or
more than a year late for delivery and over budget by more than 50%.
A recommendation by the author is that every reader of this book should also
acquire and read Paul Starr's book, The Social Transformation of American Medicine
(1982, Perseus Group). This book won a well-deserved Pulitzer Prize in 1984.
Only about 150 years ago, medicine had a similar combination of poor mea-
surements and inaccurate metrics combined with mediocre professional training.

Starr’s book on how medical practice improved to reach today’s high standards is a
compelling story with many topics that are relevant to software.
The software industry is one of the largest and wealthiest industries in
human history. Software has created many multibillion dollar companies such as
Apple, Facebook, Google, and Microsoft. Software has created many millionaires
and also quite a few billionaires such as Bill Gates, Larry Ellison, Sergey Brin,
Jeff Bezos, and Mark Zuckerberg.
However, software quality remains mediocre in 2016  and software wastage
remains alarmingly bad. (Wastage is the combination of bug repairs, cyber attacks,
canceled projects, and time spent on litigation for poor quality.) These software
problems are somewhat like smallpox and diphtheria. They can be prevented by
vaccination or successfully treated.
In order to improve software performance and reduce software wastage,
the software industry needs to eliminate inaccurate metrics that distort reality. The
software industry needs to adopt functional metrics and also capture the true and
complete costs of software development and maintenance, instead of just measuring
small fractions such as design, code, and unit test (DCUT) that include less than
30% of total software costs.
Software quality control also needs to focus on defect prevention and pretest defect
removal instead of just considering testing. In addition, of course, for
testing itself it is beneficial to use formal mathematical methods for test case design
such as cause–effect graphs and design of experiments. Also, formal testing should
use more certified test personnel instead of relying on informal testing by untrained
developers.
It would be technically possible to improve software development productivity by
more than 50% and to reduce software wastage by more than 90% within 10 years,
if there were a rapid and effective method of technology transfer that could reach
hundreds of companies and thousands of software personnel.
However, as of 2016 there are no really effective learning channels that can rapidly
spread proven facts about better development methods and better software quality
control. Consider the problems with the main software learning channels as of 2016.
Many universities at both graduate and undergraduate levels often still use
lines of code and cost per defect and hence are providing disinformation to students
instead of solid facts. Functional metrics are seldom taught in universities except in
passing and are often combined with hazardous metrics such as lines of code because
the faculty has not analyzed the problems.
Professional societies such as the Institute of Electrical and Electronics Engineers
(IEEE), Association of Computing Machinery, Society of Information Management,
Project Management Institute, International Function Point Users Group, and so on
provide valuable networks and social services for members, but they do not provide
reliable quantitative data. Also, it would be more effective if the software profes-
sional societies followed the lead of the American Medical Association (AMA) and
provided reciprocal memberships and better sharing of information.

The major standards for software quality and risk such as ISO 9000/9001 and
ISO 31000 provide useful guidelines, but there are no empirical quantified data
that either risks or quality have tangible benefits from adhering to ISO standards.
This is also true for other standards such as IEEE and OMG.
Achieving levels 3 through 5 on the Software Engineering Institute’s capability
maturity model integrated (CMMI) does yield tangible improvements in quality.
However, the Software Engineering Institute (SEI) itself does not collect or publish
quantitative data on quality or productivity. (The Air Force gave the author a con-
tract to demonstrate the value of higher CMMI levels.)
The software journals, including refereed software journals, contain almost no
quantitative data at all. The author’s first job out of college was editing a medical
journal. About a third of the text in medical articles discusses the metrics and
measurement methods used and how data were collected and validated. Essentially
every medical article has reliable data based on accurate measures and valid metrics.
In contrast, the author has read more than a dozen refereed software articles that
used the LOC metric without even defining whether physical lines or logical state-
ments were used, and these can vary by over 500%. Some refereed software articles
did not even mention which programming languages were used, and these can vary
by over 2000%.
The author has read more than 100 refereed articles that claim it costs 100 times
as much to fix a bug after release as during development even though this is not actually
true. Compared to medical journals, refereed software journals are embarrassingly
amateurish even in 2016 when it comes to metrics, measures, and quantitative results.
Software quality companies in testing and static analysis make glowing claims
about their products but produce no facts or proven quantitative information about
actual defect removal efficiency (DRE).
Software education and training companies teach some useful specific courses
but all of them lack an effective curriculum that includes defect prevention, pretest
defect removal, and effective test technologies, or measuring defect potentials and
DRE that should be basic topics in all software quality curricula.
Software quality conferences often have entertaining speakers, but suffer from
a shortage of factual information and solid quantitative data about methods to
reduce defect potentials and raise DRE.
There are some excellent published books on software quality, but only a few of
these have sold more than a few thousand copies in an industry with millions of prac-
titioners. For example, Paul Strassmann’s book on The Squandered Computer (1997)
covers software economic topics quite well. Steve Kan’s book on Metrics and Models in
Software Quality Engineering (2002) does an excellent job on quality metrics and mea-
sures; Mike Harris, David Herron, and Stasia Iwanacki’s book on The Business Value of
IT (2008) is another solid title with software economic facts; Alain Abran’s book on
Software Metrics and Metrology (2010) covers functional metrics; Olivier Bonsignour
and the author’s book on The Economics of Software Quality (2012) have quantified
data on the effectiveness of various methods, tools, and programming languages.

There are some effective software benchmark organizations that use function
point metrics for productivity and quality studies, but all of these collectively have
only a few thousand clients.
Some of these benchmark groups include the International Software Bench-
marking Standards Group (ISBSG), the Quality/Productivity Management Group,
the David Consulting Group, Software Productivity Research (SPR), TI Metricas in
Brazil, Quantitative Software Management (QSM), and Namcook Analytics LLC.
On a scale of 1 to 10, the quality of published medical information is about a 9.9;
the quality of legal information is about a 9; the quality of information in electronic
and mechanical engineering is also about a 9; for software in 2016 the overall quality
of published information is perhaps a 2.5. In fact, some published data that use cost
per defect and lines of code have a negative value of perhaps 5 due to the distortion
of reality by these two common but inaccurate metrics. Table 5.1 shows the comparative

Table 5.1 Accuracy of Published Data and Common Metrics


Technical Field Data Accuracy

1 Medicine 9.90

2 Astronomy 9.85

3 Aeronautical engineering 9.80

4 Physics 9.75

5 Mechanical engineering 9.70

6 Electrical engineering 9.35

7 Chemical engineering 9.30

8 Architecture 9.20

9 Civil engineering 9.25

10 Automotive engineering 9.00

11 Biology 8.90

12 Social sciences 4.50

13 Systems engineering 4.00

14 Political science 3.00

15 Software engineering 2.50

Average 7.87

Wastage, poor quality, poor metrics, poor measurements, and poor technol-
ogy transfer are all endemic problems of the software industry. This is not a good
situation in a world driven by software that also has accelerating numbers of cyber
attacks and looming cyber warfare.
All of these endemic software problems of bad metrics and poor measures are
treatable problems that could be eliminated if software adopts some of the methods
used by medicine as discussed in Paul Starr’s book The Social Transformation of
American Medicine (1982).
Chapter 6

Variations in Measuring
Agile and CMMI
Development

Agile software development has become the number one software development
methodology in the United States and in more than 25 other countries. (There are
currently about 70 named software development methodologies such as waterfall,
iterative, DevOps, RUP, container development, and mashups.)
Another popular development approach, although not a true methodology, is
that of achieving the higher levels of the Software Engineering Institute (SEI) capa-
bility maturity model integrated (CMMI®).
High CMMI levels are primarily found in the defense sector, although some
civilian groups also achieve high CMMI levels. India has become famous for the
large numbers of companies with high CMMI levels.
The CMMI approach is the older of the two, having been published by the SEI
in 1987. The newer agile approach was first published in 2001.
There is now fairly solid evidence about the benefits of higher CMMI levels from
many studies. When organizations move from CMMI level 1 up to level 2, 3, 4, and 5,
their productivity and quality levels tend to improve based on samples at each level.
When they adopt the newer Team Software Process (TSP) and Personal Software
Process (PSP), also endorsed by the SEI, there is an additional boost in performance.
Unfortunately, there is not as much reliable data for agile, due in part to the
complexity of the agile process and in part to the use of nonstandard and highly
variable metrics such as story points, velocity, burn down, and others that lack both
standards and formal training and hence vary by hundreds of percent.


What the CMMI provides is a solid framework of activities, much better rigor
in the areas of quality control and change management, and much better mea-
surement of progress, quality, and productivity than was previously the norm.
Measurement and collection of data for projects that use the CMMI tend to
be fairly complete. In part, this is due to the measurement criteria of the CMMI,
and in part it is due to the fact that many projects using the CMMI are contract
projects, where accurate time and expense records are required under the terms of
the contracts and needed to receive payments.
Watts Humphrey's newer TSP and PSP are also very good in collecting data.
Indeed the TSP and PSP data are among the most precise ever collected. However, the
TSP data are collected using task hours or the actual number of hours for specific tasks.
Nontask activities such as departmental meetings and training classes are excluded.
The history of the agile methods is not as clear as the history of the CMMI
because the agile methods are somewhat diverse. However, in 2001, the famous
Agile Manifesto was published. The Manifesto for Agile Software Development was
informally published by the 17 participants who attended the Agile planning ses-
sion at Snowbird, Utah in 2001. This provided the essential principles of agile devel-
opment. That being said, there are quite a few agile variations that include Extreme
Programming (XP), Crystal Development, Adaptive Software Development,
Feature-Driven Development, and several others.
Some of the principal beliefs found in the agile manifesto include the following:

◾ Working software is the goal, not documents.
◾ Working software is the primary measure of success.
◾ Close and daily contact between developers and clients is necessary.
◾ Face-to-face conversation is the best form of communication.
◾ Small self-organizing teams give the best results.
◾ Quality is critical, so testing should be early and continuous.

The agile methods and the CMMI are all equally concerned about three of the
same fundamental problems:

1. Software requirements always change.
2. Fixing software bugs is the most expensive software activity in history.
3. High quality leads to high productivity and short schedules.

However, the agile method and the CMMI approach draw apart on two other
fundamental problems:

4. Paperwork is the second most expensive software activity in history.
5. Without careful measurements, continuous progress is unlikely.

The agile methods take a strong stand that paper documents in the form of rigorous
requirements and specifications are too slow and cumbersome to be effective.

In the agile view, daily meetings with clients are more effective than written
specifications. In the agile view, daily team meetings or Scrum sessions are the best
way of tracking progress, as opposed to written status reports. The CMMI approach
does not fully endorse this view.
The CMMI takes a strong stand that measurements of quality, productivity,
schedules, costs, and so on are a necessary adjunct to process improvement and
should be done well. In the view of the CMMI, without data that demonstrates
effective progress, it is hard to prove that a methodology is a success or not. The agile
methods do not fully endorse this view. In fact, one of the notable gaps in the agile
approach is any quantitative quality or productivity data that can prove the success
of the agile methods.
Indeed, some agile derivative methods, such as pair programming in which two
programmers share a workstation, add to costs and schedules with very little actual
benefit. The literature on pair programming is embarrassing and totally omits
topics such as inspections and static analysis that benefit solo programmers.
Although some agile projects do measure, they often use metrics other than
function points. For example, some agile projects use story points and others may
use web-object points or running tested features (RTF). These metrics are interesting,
but lack formal training, ISO standards, and large collections of validated historical
data and therefore cannot be easily used for comparisons to older projects.
Owing to the fact that the CMMI approach was developed in the 1980s when
the waterfall method was common, it is not difficult to identify the major activities
that are typically performed. For an application of 1,000 function points, the
20 activities given in Table 6.1 would be typical using the CMMI.
Table 6.1 Normal CMMI Activities for a Civilian Application of 1,000 Function Points

1. Requirements
2. Prototyping
3. Architecture
4. Project planning and estimating
5. Initial design
6. Detailed design
7. Design inspections
8. Coding
9. Reuse acquisition
10. Code inspections
11. Change and configuration control
12. Software quality assurance
13. Integration
14. Test plans
15. Unit testing
16. New function testing
17. Regression testing
18. Integration testing
19. Acceptance testing
20. Project management

Using the CMMI, the entire application of 1,000 function points would have
the initial requirements gathered and analyzed, the specifications written, and
the various planning documents produced before coding got underway.
By contrast, the agile methods of development would follow a different pattern.
Because the agile goal is to deliver running and usable software to clients as rapidly
as possible, the agile approach would not wait for the entire 1,000 function points
to be designed before coding started.
What would be most likely with the agile methods would be to divide the
overall project into four smaller projects, each of about 250 function points in size.
(Possibly as many as five subset projects of 200 function points might be used for a
total of 1,000 function points.) In the agile terminology, these smaller segments are
termed iterations or sometimes sprints.
These subset iterations or sprints are normally developed in a time box fashion
that ranges between perhaps two weeks and three months, based on the size of the
iteration. For the example here, we can assume about two calendar months for each
iteration or sprint.
However, in order to know what the overall general set of features would be, an agile
project would start with Iteration 0, a general planning and requirements-gathering
session. At this session, the users and developers would scope out the likely architecture
of the application and then subdivide it into a number of iterations.
Also, at the end of the project when all of the iterations have been completed,
it will be necessary to test the combined iterations at the same time. Therefore, a
release phase follows the completion of the various iterations. For the release, some
additional documentation may be needed. Also, cost data and quality data need to
be consolidated for all of the iterations. A typical agile development pattern might
resemble Table 6.2.
Table 6.2 Normal Agile Activities for an Application of 1,000 Function Points

Iteration 0
1. General overall requirements
2. Planning
3. Sizing and estimating
4. Funding

Iterations 1–4
1. User requirements for each iteration
2. Test planning for each iteration
3. Test case development for each iteration
4. Coding
5. Testing
6. Scrum sessions
7. Iteration documentation
8. Iteration cost accumulation
9. Iteration quality data

Release
1. Integration of all iterations
2. Final testing of all iterations
3. Acceptance testing of application
4. Total cost accumulation
5. Quality data accumulation
6. Final scrum session

The most interesting and unique features of the agile methods are the following:
(1) the decomposition of the application into separate iterations, (2) the daily
face-to-face contact with one or more user representatives, and (3) the daily scrum
sessions to discuss the backlog of work left to be accomplished and any problems
that might slow down progress. Another interesting feature is to create the test
cases before the code itself is written, which is a feature of XP and several other
agile variations.

Note that the author’s Software Risk Master ™ (SRM) tool has a special feature
for agile projects that aggregates all of the data from various sprints and converts the
data into a standard chart of accounts that can be used for side-by-side comparisons
with other software methodologies.
Table 6.3 illustrates this method using side-by-side comparisons of agile and
waterfall for a project of 1,000 function points.

Table 6.3 Side-by-Side Agile and Waterfall for 1,000 Function Points (from
Requirements through Delivery to Clients)

Agile (Scrum)    Waterfall (CMMI 1)

Average monthly cost $7,500 $7,500

Overall Project
Development Schedule (months) 11.82 15.85

Staff (technical + management) 7 10

Development Effort (staff months) 84 158

Development Costs $633,043 $1,188,670

Development Activities
Requirements Effort (staff months) 7.17 15.85

Design effort (staff months) 13.50 31.70

Coding effort (staff months) 21.95 31.70

Testing effort (staff months) 25.32 45.96

Documentation effort (staff months) 6.75 12.68

Management effort (staff months) 9.28 19.81

TOTAL EFFORT (Staff months) 83.98 157.70

Normalized Data
IFPUG function points per month 11.85 6.31

Work hours per function point 11.14 20.92

Cost per Function Point


$ per IFPUG function point $633 $1,189

The SRM method of converting agile sprint data into a standard chart of
accounts is currently the only available method that can show side-by-side com-
parisons between agile and other popular methodologies such as DevOps, iterative,
Rational Unified Process (RUP), TSP, and waterfall.
Overall, agile projects tend to be somewhat faster and have higher productivity
than waterfall projects up to about 1,000 function points in size. (The average size
of the author's clients' agile projects is about 270 function points.)
Above this size, agile tends to become complicated and troublesome. For large
applications in the 10,000 function point size range, the TSP and RUP methodolo-
gies are superior to both agile and waterfall development.
Although function point metrics are not common with agile projects, they do
provide significant advantages and especially for benchmark comparisons between
diverse methodologies such as agile, XP, Crystal, DevOps, TSP, RUP, and waterfall.
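The general idea of consolidating sprint-level records into a standard chart of
accounts can be sketched as follows. The sprint figures are invented (they were
chosen only to echo the agile column of Table 6.3 roughly), the six activity buckets
are a simplification, and the 132-hour work month is an assumption; this outline
is not the SRM implementation.

    # Illustrative roll-up of per-sprint effort (staff months) into a standard
    # chart of accounts for side-by-side methodology comparisons.
    sprints = [  # invented example data for a 1,000 function point agile project
        {"requirements": 1.8, "design": 3.4, "coding": 5.5, "testing": 6.3,
         "documentation": 1.7, "management": 2.3},
        {"requirements": 1.8, "design": 3.4, "coding": 5.5, "testing": 6.3,
         "documentation": 1.7, "management": 2.3},
        {"requirements": 1.8, "design": 3.4, "coding": 5.5, "testing": 6.3,
         "documentation": 1.7, "management": 2.3},
        {"requirements": 1.8, "design": 3.3, "coding": 5.4, "testing": 6.4,
         "documentation": 1.7, "management": 2.4},
    ]
    FUNCTION_POINTS = 1000
    WORK_HOURS_PER_MONTH = 132   # nominal work month (assumed)

    totals = {}
    for sprint in sprints:
        for activity, effort in sprint.items():
            totals[activity] = totals.get(activity, 0.0) + effort

    total_effort = sum(totals.values())
    print("Standard chart of accounts (staff months):")
    for activity, effort in totals.items():
        print(f"  {activity:14s} {effort:6.1f}")
    print(f"Function points per staff month: {FUNCTION_POINTS / total_effort:6.2f}")
    print(f"Work hours per function point:   "
          f"{total_effort * WORK_HOURS_PER_MONTH / FUNCTION_POINTS:6.2f}")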
Chapter 7

Variations among
60 Development
Methodologies

Over and above the waterfall and Agile methodologies, there are over 60 named
software development methodologies and many hybrids that combine elements of
two or more methodologies.
In fact, new methodologies are created at a rate of about one new methodol-
ogy every eight months! This has been true for more than 40 years. These alternate
methodologies tend to have somewhat different results in terms of both productiv-
ity and quality.
For some of the very newest methods such as container development, microser-
vices, and GIT, there are not yet sufficiently accurate quantified data to include
them in this book.
The software industry does not make rational decisions about which methodology
to use based on solid empirical data. Instead, various methodologies appear and
attract followers based mainly on popularity, more or less like religious cults. Of
course, if a methodology does not provide any benefits at all, then it will lose out
when the next methodology achieves popularity. This explains the rapid rise and
fall of methodologies such as Computer Aided Software Engineering (CASE) and
Rapid Application Development (RAD).
Popularity and subjective opinions also explain the current popularity of Agile,
although there is some empirical data showing it to be successful on smaller
projects below 1,000 function points and 100 users. Agile is not the optimal choice
for large applications in the 10,000 function point size range, where RUP and Team
Software Process (TSP) have better results. Table 7.1 shows the comparative results
for 60 software development methodologies for applications of a nominal
1,000 function points in size, derived from the author's studies with about
150 companies and over 26,000 projects.

Table 7.1 Productivity Comparison of 60 Software Methodologies


Development Methodologies    Function Points per Month    Work Hours per Function Point

1 Reuse-oriented (99% reusable materials) 100.00 1.32

2 Reuse-oriented (85% reusable materials) 73.33 1.80

3 Reuse-oriented (50% reusable materials) 35.00 3.77

4 IntegraNova 28.31 4.66

5 Mashup 25.00 5.28

6 Service-oriented modeling 18.86 7.00

7 Pattern-based 13.89 9.50

8 Product line engineering 13.68 9.65

9 Model-driven 12.22 10.80

10 SEMAT + TSP 12.00 11.00

11 SEMAT + Agile 11.89 11.10

12 Feature driven (FDD) 11.78 11.21

13 Hybrid (Agile/RUP/TSP) 11.68 11.30

14 TSP/PSP 11.63 11.35

15 Crystal 11.48 11.50

16 Specifications by Example 11.28 11.70

17 DSDM 11.09 11.90

18 Agile scrum 11.00 12.00

19 Kaizen 10.82 12.20

20 Kanban 10.78 12.24

21 Lean 10.64 12.41

22 CMMI 5 + TSP 10.56 12.50

23 Open-source 10.48 12.60

24 Microsoft solutions 10.27 12.85

25 Hybrid (Agile + Waterfall) 10.15 13.00


26 Continuous development 10.15 13.00

27 T-VEC 10.00 13.20

28 Rational Unified Process (RUP) from IBM 9.92 13.31

29 Legacy redevelopment 9.89 13.35

30 Object oriented (OO) 9.85 13.40

31 Extreme programming (XP) 9.78 13.50

32 CMMI 3 + spiral 9.78 13.50

33 Legacy renovation 9.70 13.61

34 Test-driven development 9.70 13.61

35 Prototypes–disposable 9.50 13.89

36 Legacy data mining 9.50 13.89

37 CASE 9.50 13.89

38 CMMI 4 + iterative 9.43 14.00

39 DevOps 9.43 14.00

40 RAD 9.43 14.00

41 Information engineering (IE) 9.29 14.21

42 Clean room 9.29 14.21

43 Spiral development 9.29 14.21

44 Evolutionary development (EVO) 9.13 14.46

45 Prototypes–evolutionary 9.10 14.51

46 CMMI 3 + iterative 8.95 14.75

47 Structured development 8.80 15.00

48 Iterative 8.80 15.00

49 CMMI 2 + iterative 8.52 15.49

50 Global 24 hour 8.46 15.60


51 CMMI 1 + iterative 8.25 16.00

52 Merise 8.25 16.00

53 Reverse engineering 8.25 16.00

54 Waterfall 8.00 16.50

55 Prince 2 7.76 17.01

56 Reengineering 7.54 17.51

57 V-Model 7.02 18.80

58 Cowboy 6.50 20.31

59 Pair programming 6.00 22.00

60 Antipatterns 5.08 25.98

Average 11.74 11.24

As can be seen in Table 7.1, software methodologies have a big impact on results,
but reuse has the largest impact of all. Custom designs and manual coding are
intrinsically inefficient and expensive no matter what methodology is used.
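One practical note about Table 7.1: its two columns are reciprocal views of the
same productivity number, and the published pairs are consistent with a nominal
work month of roughly 132 hours (an inference from the table values, not a figure
stated in this chapter). A short conversion, assuming that 132-hour month:

    # Converting between the two productivity columns of Table 7.1, assuming a
    # nominal 132-hour work month (inferred from the published value pairs).
    WORK_HOURS_PER_MONTH = 132.0

    def hours_per_function_point(fp_per_month):
        return WORK_HOURS_PER_MONTH / fp_per_month

    def function_points_per_month(hours_per_fp):
        return WORK_HOURS_PER_MONTH / hours_per_fp

    print(hours_per_function_point(11.00))   # agile scrum row -> about 12.0 hours per FP
    print(function_points_per_month(16.50))  # waterfall row   -> about 8.0 FP per month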
Chapter 8

Variations in Software
Programming Languages

As of 2016, there are more than 3,000 programming languages in existence and new
languages keep appearing at rates of more than one language every month! Why the
software industry has so many programming languages is a sociological mystery.
However, the existence of more than 3,000 languages is a proof that no known
language is fully useful for all sizes and types of software applications. This proof
is strengthened by the fact that a majority of applications need more than one
programming language and some have used up to 15 programming languages! An
average application circa 2016 uses at least two languages such as Java and HTML
or C# and MySQL.
Only about 50 of these thousands of languages are widely used. Many older
languages are orphans and have no working compilers and no active programmers.
Why software keeps developing new programming languages is an interesting
sociological question. The major languages circa 2016 include C dialects, Java,
Ruby, R, Python, Basic dialects, SQL, and HTML.
The influence of programming languages on productivity is inversely related
to application size. For small projects of 100 function points, coding is about
80% of total effort and therefore languages have a strong impact.
For large systems in the 10,000 function point size range, coding is only about
30% of total effort and other activities such as finding and fixing bugs and produc-
ing paper documents dilute the impact of pure coding.
Table 8.1 shows the impact of 80 programming languages for an application of
a nominal 1,000 function points in size. At that size, coding is about 50% of total
effort, whereas documents and paper work, bug repairs, and noncoding activities
comprise the other 50%.


Table 8.1 Productivity Variations Based on Programming Languages


Languages FP per Month Work Hours per FP Size in LOC

1 IntegraNova 33.25 3.97 5,333

2 Excel 31.70 4.16 6,400

3 BPM 30.75 4.29 7,111

4 Generators 30.75 4.29 7,111

5 Mathematica10 28.31 4.66 9,143

6 Mathematica9 24.78 5.33 12,800

7 TranscriptSQL 24.78 5.33 12,800

8 QBE 24.78 5.33 12,800

9 X 24.78 5.33 12,800

10 TELON 22.34 5.91 16,000

11 APS 21.77 6.06 16,842

12 Forte 21.18 6.23 17,778

13 MUMPS 20.55 6.42 18,824

14 IBM ADF 19.89 6.64 20,000

15 Smalltalk 19.19 6.88 21,333

16 Eiffel 18.45 7.16 22,857

17 ASP NET 17.66 7.48 24,615

18 Objective C 16.82 7.85 26,667

19 Visual Basic 16.82 7.85 26,667

20 Delphi 15.92 8.29 29,091

21 APL 14.97 8.82 32,000

22 Julia 13.95 9.46 35,556

23 M 13.95 9.46 35,556

24 OPA 13.95 9.46 35,556

25 Perl 13.95 9.46 35,556

26 Elixir 13.41 9.84 37,647


27 Haskell 13.41 9.84 37,647

28 Mixed Languages 13.41 9.84 37,647

29 R 30.75 4.29 25,000

30 DB2 12.85 10.27 40,000

31 LiveScript 12.85 10.27 40,000

32 Oracle 12.85 10.27 40,000

33 Erlang 12.27 10.76 42,667

34 CICS 11.67 11.31 45,714

35 DTABL 11.67 11.31 45,714

36 F# 11.67 11.31 45,714

37 Ruby 11.67 11.31 45,714

38 Simula 11.67 11.31 45,714

39 Dart 11.36 11.62 47,407

40 RPG III 11.36 11.62 47,407

41 Ada 95 11.05 11.95 49,231

42 Ceylon 11.05 11.95 49,231

43 Fantom 11.05 11.95 49,231

44 C# 10.72 12.31 51,200

45 X10 10.72 12.31 51,200

46 C++ 10.40 12.70 53,333

47 Go 10.40 12.70 53,333

48 Java 10.40 12.70 53,333

49 PHP 10.40 12.70 53,333

50 Python 10.40 12.70 53,333

51 Zimbu 9.72 13.58 58,182


52 Quick Basic 9.37 14.08 60,952

53 Basic (interpreted) 9.02 14.64 64,000

54 Forth 9.02 14.64 64,000

55 haXe 9.02 14.64 64,000

56 Lisp 9.02 14.64 64,000

57 Prolog 9.02 14.64 64,000

58 SH (shell scripts) 9.02 14.64 64,000

59 ESPL/I 8.29 15.93 71,111

60 Javascript 8.29 15.93 71,111

61 ABAP 7.52 17.55 80,000

62 Modula 7.52 17.55 80,000

63 PL/I 7.52 17.55 80,000

64 Pascal 6.73 19.62 91,429

65 PL/S 6.73 19.62 91,429

66 GW Basic 6.32 20.90 98,462

67 Algol 5.89 22.39 106,667

68 Bliss 5.89 22.39 106,667

69 Chill 5.89 22.39 106,667

70 COBOL 5.89 22.39 106,667

71 Coral 5.89 22.39 106,667

72 Fortran 5.89 22.39 106,667

73 Jovial 5.89 22.39 106,667

74 C 5.02 26.27 128,000

75 XML 5.02 26.27 128,000

76 HTML 4.11 32.09 160,000


77 Macro Assembly 3.16 41.79 213,333

78 JCL 3.06 43.13 220,690

79 Basic Assembly 2.16 61.18 320,000

80 Machine language 1.11 119.36 640,000

Average 13.17 13.83 67,066

The data shown in Table 8.1 are the aggregate results for a complete software
project that includes requirements, design, and noncode work as well as coding and
testing. Pure coding would have much higher rates than those shown in Table 8.1,
but for overall benchmark purposes it is the work involved with complete project
development that matters.
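Readers who want to experiment with the language data can derive a source-
statements-per-function-point ratio for each language by dividing the Size in LOC
column of Table 8.1 by the 1,000 function point application size. The short sizing
sketch below uses a handful of those derived ratios; the 2,500 function point
application is a hypothetical example.

    # Sizing sketch: predicted code volume from function points, using
    # LOC-per-function-point ratios derived from Table 8.1 (Size in LOC / 1,000).
    LOC_PER_FP = {
        "Smalltalk": 21.3, "C#": 51.2, "Java": 53.3, "C++": 53.3,
        "C": 128.0, "macro assembly": 213.3,
    }
    APPLICATION_SIZE_FP = 2500   # hypothetical application size

    for language, ratio in sorted(LOC_PER_FP.items(), key=lambda item: item[1]):
        print(f"{language:15s}: about {APPLICATION_SIZE_FP * ratio:9,.0f} "
              f"logical source statements")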
The reasons why the software industry has more than 3,000 programming
languages and why almost every application uses multiple languages are outside the
scope of this book. However, sociological factors seem to have a bigger impact
than technical factors.
The existence of 3,000 programming languages is a proof that none of them
are fully adequate, or otherwise that language would dominate the world’s software
projects. Instead we have large numbers of specialized languages that are more or
less optimized for certain kinds of applications, but not very good for other kinds
of applications.
How the impact of programming languages will be combined with the impact
of the new software nonfunctional assessment process (SNAP) metric is outside the
scope of this book and probably not well understood as of 2016.
Chapter 9

Variations in Software
Reuse from 0% to 90%

It is obvious to those of us who collect software benchmark data that custom
designs and manual coding are intrinsically slow, error-prone, and expensive. The
level of software development sophistication in 2016 is about the same as that of
firearm manufacture in 1798, just before Eli Whitney introduced standard reusable
parts.
Eventually the software industry will move away from inefficient manual
construction and move toward assembly from standard reusable features. This is
more than just reusable source code. There are a total of 15 software artifacts that
are potentially reusable (Table 9.1).
Figure 9.1 illustrates why software reuse is the ultimate methodology that is
needed to achieve high levels of productivity, quality, and schedule adherence at
the same time. Figure 9.1 illustrates a generic application of 1,000 function points
coded in the Java language.
In 2016, applications of 1,000 function points in size are normally created at
rates between about 5.0 and 13.0 function points per staff month using custom
designs and manual coding. Antipatterns and Waterfall would be at the low end of
the spectrum, whereas Agile, RUP, TSP, and other advanced methods would be at
the high end of the spectrum.
However, a full complement of reusable materials should be able to push
software development rates up above 100 function points per staff month with
90% reuse, and above 250 function points per staff month with 99% reuse. Already
in 2016 productivity rates approaching 100 function points per month are starting
to appear with mashups, or applications constructed from segments of existing
applications.
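A toy model shows why the productivity curve climbs so steeply as the reusable
fraction rises: only the portion that is not reused still has to be custom designed,
hand coded, and fully tested, while certified reusable features cost only a small
integration fraction. The base rate of 10 function points per staff month and the
5% integration factor are assumptions for illustration; this sketch is not the source
of Figure 9.1 and will not match its numbers exactly.

    # Toy model of reuse leverage (illustrative assumptions only).
    BASE_RATE = 10.0             # FP per staff month with 0% reuse (assumed)
    REUSE_EFFORT_FACTOR = 0.05   # relative effort to acquire, integrate, and test
                                 # a certified reusable feature (assumed)

    for reuse in (0.0, 0.3, 0.5, 0.7, 0.9, 0.99):
        relative_effort = (1 - reuse) + reuse * REUSE_EFFORT_FACTOR
        rate = BASE_RATE / relative_effort
        print(f"{reuse:4.0%} reuse: about {rate:6.1f} function points per staff month")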


Table 9.1 Potentially Reusable Software Components


1. Reusable requirements

2. Reusable architecture

3. Reusable design

4. Reusable project plans

5. Reusable estimates

6. Reusable source code

7. Reusable test plans

8. Reusable test scripts

9. Reusable test cases

10. Reusable marketing plans

11. Reusable user manuals

12. Reusable training materials

13. Reusable HELP screens and help text

14. Reusable customer support plans

15. Reusable maintenance plans

Figure 9.1 Impact of reuse on software productivity (function points per staff
month, from 0 to about 150, plotted for reuse levels from 0% to 90%).



Software reuse is based on the existence of common software patterns. There are
many kinds of software patterns that need effective visual representations: architecture,
design, schedule planning, quality control, cyber attacks, and many more.
Once a project's taxonomy is firm and empirical results have been analyzed from similar projects, another set of patterns comes into play for collecting reusable materials (Table 9.2).

Table 9.2 Software Application Patterns for Effective Reuse


1. Architectural patterns for the overall application structure

2. Design patterns for key application features

3. Requirements patterns from all projects with the same taxonomy

4. Data patterns for the information created and used by the application

5. Occupation group patterns (designers, coders, testers, QA, etc.)

6. Development activity patterns (requirements, architecture, design, code, etc.)

7. Growth patterns of new features during development and after release

8. Reuse patterns for standard features from standard components

9. Source patterns for the mix of legacy, COTS, reuse, open-source, and
custom features

10. Code patterns for any custom code, in order to avoid security flaws

11. Risk patterns based on similar applications

12. Security patterns (kinds of attacks noted on similar applications)

13. Governance patterns for software, dealing with financial data

14. Defect removal patterns for the sequence of inspections, static analysis,
and test stages

15. Marketing patterns for distribution of the software to clients

16. Usage patterns for typical usage scenarios

17. Maintenance patterns for defect repairs after release

18. Support patterns for contacts between customers and support teams

19. Enhancement patterns of future changes after initial deployment

20. Cost and schedule patterns for development and maintenance

21. Value and ROI patterns to compare project costs to long-range value

22. Litigation patterns for patent suits, breach of contract, and so on



These patterns, combined with growing libraries of standard reusable components, should be able to increase application productivity rates from today's average of below 10 function points per staff month to more than 100 function points per staff month. In some cases, for fairly small applications, productivity could approach or even exceed 200 function points per staff month. The software industry should not be satisfied with custom design and manual coding because it is intrinsically expensive, slow, and error prone.
Reuse also benefits quality and security. Table 9.3 shows the approximate
impact of reuse on delivered software defects for applications between 100 and
10,000 function points in size.
Defect potentials are shown in terms of defects per function point because that
metric allows all defect origins to be included (requirements defects, design defects,
code defects, document defects, bad fixes, or secondary defects).
Table 9.3 is based on aggregate projects that range from 100 to 10,000 function
points in size.
Security flaws will also be reduced from using certified reusable components.
Table 9.4 illustrates the probable reduction in released security flaws.
Table 9.4 shows the same sequence as Table 9.3, only for the prevention and
removal of security flaws, also for applications between 100 and 10,000 function
points in size. In general, there are fewer security flaws than defects, but they are
harder to find and eliminate so the defect removal efficiency (DRE) is lower against
security flaws than against ordinary bugs.

Table 9.3 Reuse and Software Quality Levels at Delivery


Percent of Reuse | Defects per Function Point | Defect Removal Percent | Delivered Defects per FP

90 1.00 99.50 0.01

80 1.25 98.00 0.03

70 1.50 95.00 0.08

60 2.00 92.00 0.16

50 2.50 90.00 0.25

40 3.00 88.00 0.36

30 3.75 85.00 0.56

20 4.25 83.00 0.72

10 5.00 81.00 0.95

0 5.75 79.00 1.21



Table 9.4 Reuse and Software Security Flaws at Delivery


Percent of Reuse | Security Flaws per Function Point | Flaw Removal Percent | Delivered Flaws per FP

90 0.40 94.53 0.02

80 0.50 93.10 0.03

70 0.60 90.25 0.06

60 0.80 87.40 0.10

50 1.00 85.50 0.15

40 1.20 83.60 0.20

30 1.50 80.75 0.29

20 1.91 78.85 0.40

10 2.50 76.95 0.58

0 3.16 75.05 0.79

The bottom line is that certified reusable components would be substantially free from both latent defects and latent security flaws.
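The arithmetic behind Tables 9.3 and 9.4 is straightforward: the delivered value is the potential per function point multiplied by the fraction that escapes removal. A minimal sketch, using rows taken directly from the two tables:

    def delivered_per_fp(potential_per_fp, removal_percent):
        # Delivered defects (or security flaws) per function point equal the
        # potential times the fraction NOT removed before release.
        return potential_per_fp * (1.0 - removal_percent / 100.0)

    print(delivered_per_fp(5.75, 79.00))   # 0% reuse defects: ~1.21 per FP (Table 9.3)
    print(delivered_per_fp(1.00, 99.50))   # 90% reuse defects: ~0.005, shown rounded as 0.01
    print(delivered_per_fp(3.16, 75.05))   # 0% reuse security flaws: ~0.79 per FP (Table 9.4)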
Reuse potential volumes vary by industry and application type. Reuse potential
is the percentage of overall application features that are provided by certified reus-
able components rather than being custom designed and manually coded. Table 9.5
shows approximate reuse potentials for the current year of 2016, and then future
reuse potentials for 2050, or 35 years from now.
In many industries, most corporate software applications do pretty much the same things as those of every other company in the industry. For example, all banks and all insurance companies perform very similar software functions, some of them mandated by government regulations.
The concept of reusable components is to identify the specific sets of features
that are potentially reusable for every company in specific industries. For some
industries such as banking and stock trading, there are Federal laws and mandates
that make reuse mandatory for at least some critical features.
Some examples of common reusable features circa 2016 include but are
not limited to accounting rate of return, automotive GPS software, bar code
reading, browser add-ins, compound interest, Crystal reports, cryptographic
key processing, currency conversion, Excel functions, facial recognition, infla-
tion rates, internal rate of return, metrics conversion, PDF conversion, real
estate depreciation, state sales tax calculations, traffic light controls, and word
templates.

Table 9.5 Software Reuse Potentials in Selected Industry Segments


Industry Segment | 2016 Reuse Potential (%) | 2050 Reuse Potential (%)

1 Electric power applications 35 95

2 Insurance applications–property 45 90

3 Insurance applications–life 50 90

4 Banking applications 60 85

5 State government applications 35 85

6 Education applications–primary/ 30 85
secondary

7 Wholesale applications 60 85

8 Municipal government applications 40 80

9 Retail applications 40 80

10 Manufacturing applications 45 75

11 Federal civilian government applications 30 75

12 Insurance applications–health 25 70

13 Education applications–university 35 70

14 Weapons systems 20 55

15 Medical applications 15 45

Average Reuse Potential 38 78

As of 2016, reusable components approximate roughly 15% of the features in many common applications, and sometimes top 30%. Reuse is not always certified, but the major commercial reusable components are fairly reliable.
Unfortunately, there are several gaps in the reuse domain that need to be filled:
(1) There is no effective taxonomy of reusable features; (2) there are no available
catalogs of reusable features that might be acquired from commercial sources;
(3) software measurements tend to ignore or omit reusable features, which distorts
productivity and quality data; (4) some software estimating tools do not include
reuse (although this is a standard feature in the author’s Software Risk Master
(SRM) estimating tool); and (5) much of the literature on reuse only covers code
and does not yet fully support reusable requirements, reusable designs, reusable test
materials, and reusable user documents.

One major barrier to expanding reuse at the level of specific functions is the
fact that there are no effective taxonomies for individual features used in software
applications. Current taxonomies work on entire software applications but are not
yet applied to the specific feature sets of these applications. For example, the widely
used Excel spreadsheet application has dozens of built-in reusable functions, but
there is no good taxonomy for identifying what all of these functions do.
Obviously the commercial software industry and the open-source software
industry are providing reuse merely by selling software applications that are used
by millions of people. For example, Microsoft Windows is probably the single most
widely used application on the planet with more than a billion users in over 200
countries. The commercial and open-source software markets provide an existence
proof that software reuse is an economically viable business.
Commercial reuse is a fairly large and growing industry circa 2016. For example,
hundreds of applications use Crystal Reports. Thousands use commercial and
reusable static analysis tools, firewalls, antivirus packages, and the like. Hundreds
of major companies deploy enterprise resource planning (ERP) tools that attempt
reuse at the corporate portfolio level. Reuse is not a new technology, but neither
is it yet an industry with proper certification to eliminate bugs and security flaws
prior to deployment.
Informal reuse is common in 2016 but seldom measured and included in
software benchmarks. Among the author’s clients and measured projects, informal
reuse is about 15% of source code but less than 5% of other artifacts such as
requirements and designs.
Chapter 10

Variations due to
Project, Phase, and
Activity Measurements

Another weakness of software measurement is the chart of accounts used, or the set of activities for which resource and cost data are collected. The topic of selecting the activities to be included in software project measurements is a difficult issue and cannot be taken lightly. There are five main contenders:

1. Project-level measurements: most common in 2016

2. Phase-level measurements: second most common in 2016

3. Activity-level measurements: best choice for accuracy in 2016

4. Task-level measurements: used mainly for complex military software in 2016

5. Subtask-level measurements: seldom if ever used in 2016

Project-level measurements and phase-level measurements have been the most widely used for more than 50 years, but they are also the least accurate. Of these five levels of measurement, only activity, task, and subtask measurements will allow benchmark data collection with a precision of better than 2.5% and support the concept of activity-based costing.


Neither project level nor phase level data will be useful in exploring process
improvements, or in carrying out multiple regression analysis to discover the impact
of various tools, methods, and approaches.
Collecting data only at the level of projects and phases correlates strongly with
failed or canceled measurement programs, because the data cannot be used for
serious process research.
Historically, project-level measurements have been used most often. Phase-level
measurements have ranked second to project-level measurements in frequency of
usage. Unfortunately, phase-level measurements are inadequate for serious economic
study. Many critical activities such as user documentation or formal inspections
span multiple phases and hence tend to be invisible when data are collected at
the phase level.
Also, data collected at the levels of activities, tasks, and subtasks can easily be
rolled up to provide phase-level and project-level views. The reverse is not true: you
cannot explode project-level data or phase-level data down to the lower levels with
acceptable accuracy and precision. If you start with measurement data that are too
coarse, you will not be able to do very much with it.
Table 10.1 gives an illustration that can clarify the differences. Assume you are thinking of measuring a project such as the construction of a small PBX switching system used earlier in this book. Here are the activities that might be included at the level of the project, phases, and activities for the chart of accounts used to collect measurement data.
If you collect measurement cost data only to the level of a project, you will have
no idea of the inner structure of the work that went on. Therefore the data will not
give you the ability to analyze activity-based cost factors and is almost useless for
purposes of process improvement. This is one of the commonest reasons for the
failure of software measurement programs: the data are not granular enough to find
out why projects were successful or unsuccessful.
Measuring at the phase level is only slightly better. There are no standard phase
definitions nor any standards for the activities included in each phase. Worse,
activities such as project management that span every phase are not broken out
for separate cost analysis. Many activities such as quality assurance and technical
writing span multiple phases, so phase-level measurements are not effective for
process improvement work.
Measuring at the activity level does not imply that every project performs every activity. For example, small MIS projects and client-server applications normally perform only 9 or so of the 25 activities that are shown in Table 10.1. Systems software such as operating systems and large switching systems will typically perform about 20 of the 25 activities. Only large military and defense systems will routinely perform all 25 activities.
Here too, by measuring at the activity level, useful information becomes
available. It is obvious that one of the reasons that systems and military software

Table 10.1 Project, Phase, and Activity-Level Measurement Charts of Accounts

Project Level | Phase Level | Activity Level

PBX switch 1. Requirements 1. Requirements

2. Analysis 2. Prototyping

3. Design 3. Architecture

4. Coding 4. Planning

5. Testing 5. Initial design

6. Installation 6. Detail design

7. Design review

8. Coding

9. Reused code acquisition

10. Package acquisition

11. Code inspection

12. Independent verification and validation

13. Configuration control

14. Integration

15. User documentation

16. Unit test

17. Function test

18. Integration test

19. System test

20. Field test

21. Acceptance test

22. Independent test

23. Quality assurance

24. Installation

25. Management

have much lower productivity rates than MIS projects is because they do many
more activities for a project of any nominal size.
Measuring at the task and subtask levels is more precise than activity-level measurements but also much harder to accomplish. However, in recent years Watts
Humphrey’s Team Software Process (TSP) and Personal Software Process (PSP)
have started accumulating effort data to the level of tasks. This is perhaps the first
time that such detailed information has been collected on a significant sample of
software projects.
Table 10.2 illustrates what activity-based benchmark data would look like using
a large 40-activity chart of accounts normally used by the author for major systems
in the 10,000 function point size range.
As can be seen in Table 10.2, activity-based benchmarks provide an excellent
quantity of data for productivity and economic analysis of software cost struc-
tures. Only this kind of detailed benchmark information is truly useful for process
improvement studies and economic analysis.

Table 10.2 Example of Activity-Based Benchmark with 40 Activities


Language Java

Function points 10,000

Lines of code 533,333

KLOC 533

Development Activities | Work Hours per FP | FP per Month | Work Hours per KLOC | LOC per Month

1 Business analysis 0.02 7,500.00 0.33 400,000

2 Risk analysis/sizing 0.00 35,000.00 0.07 1,866,666

3 Risk solution planning 0.01 15,000.00 0.17 800,000

4 Requirements 0.38 350.00 7.08 18,667

5 Requirement inspection 0.22 600.00 4.13 32,000

6 Prototyping 0.33 400.00 0.62 213,333

7 Architecture 0.05 2,500.00 0.99 133,333

8 Architecture inspection 0.04 3,000.00 0.83 160,000

9 Project plans/estimates 0.03 5,000.00 0.50 266,667

10 Initial design 0.75 175.00 14.15 9,333

11 Detail design 0.75 175.00 14.15 9,333

12 Design inspections 0.53 250.00 9.91 13,333

13 Coding 4.00 33.00 75.05 1,760

14 Code inspections 3.30 40.00 61.91 2,133

15 Reuse acquisition 0.01 10,000.00 0.25 533,333

16 Static analysis 0.02 7,500.00 0.33 400,000

17 COTS Package purchase 0.01 10,000.00 0.25 533,333

18 Open-source acquisition 0.01 10,000.00 0.25 533,333

19 Code security audit 0.04 3,500.00 0.71 186,667

20 Independent verification and validation 0.07 2,000.00 1.24 106,667

21 Configuration control 0.04 3,500.00 0.71 186,667

22 Integration 0.04 3,500.00 0.71 186,667

23 User documentation 0.29 450.00 5.50 24,000

24 Unit testing 0.88 150.00 16.51 8,000

25 Function testing 0.75 175.00 14.15 9,333

26 Regression testing 0.53 250.00 9.91 13,333

27 Integration testing 0.44 300.00 8.26 16,000

28 Performance testing 0.33 400.00 6.19 21,333

29 Security testing 0.26 500.00 4.95 26,667

30 Usability testing 0.22 600.00 4.13 32,000

31 System testing 0.88 150.00 16.51 8,000

32 Cloud testing 0.13 1,000.00 2.48 53,333

33 Field (beta) testing 0.18 750.00 3.30 40,000

34 Acceptance testing 0.05 2,500.00 0.99 133,333

35 Independent testing 0.07 2,000.00 1.24 106,667

36 Quality assurance 0.18 750.00 3.30 40,000

37 Installation/training 0.04 3,500.00 0.71 186,667

38 Project measurement 0.01 10,000.00 0.25 533,333

39 Project office 0.18 750.00 3.30 40,000

40 Project management 4.40 30.00 82.55 1,600

Cumulative Results 20.44 6.46 377.97 349.46
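The derived columns in Table 10.2 follow from the work hours per function point column once two assumptions are fixed: a nominal work month (132 hours, the U.S. figure used elsewhere in this book) and the code volume per function point (533,333 LOC for 10,000 function points, or about 53.3 LOC per function point for this Java example). A hedged sketch of that roll-up:

    WORK_HOURS_PER_MONTH = 132.0        # assumed nominal U.S. work month
    LOC_PER_FP = 533_333 / 10_000       # about 53.3 for the Java example above

    def derived_columns(hours_per_fp):
        # Returns (FP per month, work hours per KLOC, LOC per month)
        # for one activity, given its work hours per function point.
        fp_per_month = WORK_HOURS_PER_MONTH / hours_per_fp
        hours_per_kloc = hours_per_fp * (1000.0 / LOC_PER_FP)
        loc_per_month = fp_per_month * LOC_PER_FP
        return fp_per_month, hours_per_kloc, loc_per_month

    print(derived_columns(4.00))  # coding: ~33 FP/month, ~75 hours/KLOC, ~1,760 LOC/month
    print(derived_columns(4.40))  # project management: ~30 FP/month, ~82.5 hours/KLOC

Activities roll up by summing the hours per function point column; the 40 activities total 20.44 work hours per function point, which is a net of about 132 / 20.44 = 6.46 function points per staff month for the project as a whole, matching the cumulative row.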


Chapter 11

Variations in Burden
Rates or Overhead Costs

A major problem associated with software cost studies is the lack of generally
accepted accounting practices for determining the burden rate or overhead costs
that are added to basic salaries to create a metric called the fully burdened salary
rate that corporations use for determining business topics such as the charge-out
rates for cost centers. The fully burdened rate is also used for other business pur-
poses such as contracts, outsource agreements, and return on investment (ROI)
calculations.
The components of the burden rate are highly variable from company to com-
pany. Some of the costs included in burden rates can be as follows: social secu-
rity contributions, unemployment benefit contributions, various kinds of taxes,
rent on office space, utilities, security, postage, depreciation, portions of mortgage
payments on buildings, various fringe benefits (medical plans, dental plans, dis-
ability, moving and living, vacations, etc.), and sometimes the costs of indirect staff
(human resources, purchasing, mail room, etc.).
One of the major gaps in the software literature, and for that matter in the
accounting literature, is the almost total lack of international comparisons of the
typical burden rate methodologies used in various countries. So far as it can be
determined, there are no published studies that explore burden rate differences
between countries such as the United States, Canada, India, the European Union
countries, Japan, China, and so on.
Among the author’s clients, the range of burden rates runs from a low of per-
haps 15% of basic salary levels to a high of approximately 300%. In terms of dollars,
that range means that the fully-burdened charge rate for the position of senior


systems programmer in the United States can run from a low of about $15,000 per
year to a high of $350,000 per year.
Unfortunately, the software literature is almost silent on the topic of burden or
overhead rates. Indeed, many of the articles on software costs not only fail to detail
the factors included in burden rates, but often fail to even state whether the burden
rate itself was used in deriving the costs that the articles are discussing!
Table 11.1 illustrates some of the typical components of software burden rates,
and also how these components might vary between a large corporation with a
massive infrastructure and a small start-up corporation that has very few overhead
cost elements.
When the combined ranges of basic salaries and burden rates are applied to
software projects in the United States, they yield almost a 6 to 1 variance in bill-
able costs for projects where the actual number of work months or work hours are
identical!
When the salary and burden rate ranges are applied to international projects,
they yield about a 15 to 1 variance between countries such as India, Pakistan, or
Bangladesh at the low end of the spectrum, and Germany, Switzerland, or Japan on
the high end of the spectrum.
Bear in mind that this 15 to 1 range of cost variance is for projects where the
actual number of hours worked is identical.
When productivity differences are considered too, there is more than a 100
to 1 variance between the most productive projects in companies with the lowest
salaries and burden rates and the least productive projects in companies with the
highest salaries and burden rates.
Table 11.1 Components of Typical Burden Rates in Large and Small Companies
Large Company Small Company

Average annual salary $50,000 100.0% Average annual salary $50,000 100.0%

Personnel Burden Personnel Burden


Payroll taxes $5,000 10.0% Payroll taxes $5,000 10.0%

Bonus $5,000 10.0% Bonus 0 0.0%

Benefits $5,000 10.0% Benefits $2,500 5.0%

Profit sharing $5,000 10.0% Profit sharing 0 0.0%

Subtotal $20,000 40.0% Subtotal $7,500 15.0%

Office Burden Office Burden


Office rent $10,000 20.0% Office rent $5,000 10.0%

Property taxes $2,500 5.0% Property taxes $1,000 2.0%

Office supplies $2,000 4.0% Office supplies $1,000 2.0%

Janitorial service $1,000 2.0% Janitorial service $1,000 2.0%

Utilities $1,000 2.0% Utilities $1,000 2.0%



Subtotal $16,500 33.0% Subtotal $9,000 18.0%


Corporate Burden Corporate Burden


Information systems $5,000 10.0% Information systems 0 0.0%

Finance $5,000 10.0% Finance 0 0.0%

Human resources $4,000 8.0% Human resources 0 0.0%

Legal $3,000 6.0% Legal 0 0.0%

Subtotal $17,000 34.0% Subtotal 0 0.0%

Total burden $53,500 107.0% Total burden $16,500 33.0%

Salary + burden $103,500 207.0% Salary + burden $66,500 133.0%

Monthly rate $8,625 Monthly rate $5,542
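The arithmetic of Table 11.1 can be sketched in a few lines: the fully burdened monthly rate is the base salary plus the sum of the burden components, divided by 12. The component values below are the large-company column of the table; the function itself is only illustrative:

    def fully_burdened_monthly_rate(base_salary, burden_components):
        # Fully burdened monthly charge rate: (salary + total burden) / 12,
        # plus the burden expressed as a percentage of base salary.
        total_burden = sum(burden_components.values())
        burden_percent = 100.0 * total_burden / base_salary
        return (base_salary + total_burden) / 12.0, burden_percent

    large_company = {
        "payroll taxes": 5_000, "bonus": 5_000, "benefits": 5_000, "profit sharing": 5_000,
        "office rent": 10_000, "property taxes": 2_500, "office supplies": 2_000,
        "janitorial service": 1_000, "utilities": 1_000, "information systems": 5_000,
        "finance": 5_000, "human resources": 4_000, "legal": 3_000,
    }
    print(fully_burdened_monthly_rate(50_000, large_company))  # (8625.0, 107.0)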


Chapter 12

Variations in Costs
by Industry

Although software productivity based on human effort in terms of work hours or work months can be measured with acceptable precision, the same cannot be said for software costs.
A fundamental problem with software cost measures is the fact that salaries and
compensation vary widely from job to job, worker to worker, company to company,
region to region, industry to industry, and country to country.
Among the author’s clients in the United States, the basic salary of the occupa-
tion of software project manager ranges from a low of about $42,000 per year to
a high of almost $150,000 per year. When international clients are included, the
range for the same position runs from less than $15,000 per year to more than
$175,000 a year.
Table 12.1 shows averages and ranges for project management compensation for 20 U.S. industries, with average results in the left column followed by the ranges observed based on company size and on geographic region. In general, large corporations pay more than small companies. Also, large urban areas such as the San Francisco Bay area or the urban areas in New York and New Jersey have much higher pay scales than do more rural areas or smaller communities.
Also some industries such as banking and financial services and telecommu-
nications manufacturing tend to have compensation levels that are far above U.S.
averages, whereas other industries such as government service and education tend
to have compensation levels that are significantly lower than U.S. averages.
These basic economic facts mean that it is unsafe and inaccurate to use U.S.
averages for cost comparisons of software. At the very least, cost comparisons should


Table 12.1 Annual Salary Levels for Software Project Managers in 20 Industries in the United States

Industry | Average Annual Salary | Range by Company Size (+ or −) | Range by Geographic Region (+ or −) | Maximum Annual Salary | Minimum Annual Salary

Banking $93,845 $20,000 $6,000 $119,845 $66,025

Electronics $92,823 $15,000 $6,000 $113,823 $70,353

Telecommunications $92,555 $15,000 $6,000 $113,555 $70,085

Software $92,288 $15,000 $6,000 $113,288 $69,818

Consumer products $91,944 $14,000 $5,500 $111,444 $71,079

Chemicals $90,874 $13,000 $5,500 $109,374 $60,379

Defense $86,486 $13,000 $5,500 $104,986 $66,691

Food/beverages $84,174 $12,000 $5,000 $101,174 $65,984

Media $80,384 $12,000 $5,000 $97,384 $62,194

Industrial equipment $80,260 $12,000 $5,000 $97,260 $62,070

Distribution $80,143 $11,000 $5,000 $96,143 $63,023

Insurance $78,235 $10,000 $5,000 $93,235 $62,185

Public utilities $76,098 $7,500 $4,500 $88,098 $63,258

Retail $75,034 $7,000 $4,500 $86,534 $62,729

Health care $72,632 $7,500 $4,500 $84,632 $59,792

Nonprofits $71,583 $7,500 $4,500 $83,583 $58,743

Transportation $71,155 $7,000 $4,500 $82,655 $58,850

Textiles $70,176 $7,000 $4,500 $81,676 $57,871

Government $67,571 $6,000 $4,000 $77,571 $56,871

Education $66,741 $6,000 $4,000 $76,741 $56,041

Average $80,750 $10,875 $5,025 $96,650 $63,737



Table 12.2 Industry Productivity Ranges circa 2016


Industry | Function Points per Month 2016 | Work Hours per Function Point 2016

1 Smartphone/tablet applications 15.25 8.66

2 Software (commercial) 15.00 8.80

3 Social networks 14.90 8.86

4 Software (outsourcing) 14.00 9.43

5 Open source development 13.75 9.60

6 Entertainment—films 13.00 10.15

7 Consulting 12.70 10.39

8 Entertainment—television 12.25 10.78

9 Banks—commercial 11.50 11.48

10 Banks—investment 11.50 11.48

11 Credit unions 11.20 11.79

12 Entertainment—music 11.00 12.00

13 Insurance—medical 10.50 12.57

14 Insurance—life 10.00 13.20

15 Stock/commodity brokerage 10.00 13.20

16 Insurance—property and casualty 9.80 13.47

17 Manufacturing—telecommunications 9.75 13.54

18 Telecommunications operations 9.75 13.54

19 Process control and embedded 9.00 14.67

20 Pharmacy chains 9.00 14.67

21 Manufacturing—pharmaceuticals 8.90 14.83

22 Transportation—airlines 8.75 15.09

23 Oil extraction 8.75 15.09

24 Hotels 8.75 15.09


25 Publishing (books/journals) 8.60 15.35

26 Education—university 8.60 15.35

27 Professional support—medicine 8.55 15.44

28 Government—police 8.50 15.53

29 Accounting/financial consultants 8.50 15.53

30 Professional support—law 8.50 15.53

31 Sports (pro baseball, football, etc.) 8.50 15.53

32 Other industries 8.30 15.90

33 Manufacturing—electronics 8.25 16.00

34 Wholesale 8.25 16.00

35 Manufacturing—general 8.25 16.00

36 Manufacturing—chemicals 8.00 16.50

37 Transportation—trains 8.00 16.50

38 Manufacturing—nautical 8.00 16.50

39 Transportation—bus 8.00 16.50

40 Hospitals—administration 8.00 16.50

41 Transportation—ship 8.00 16.50

42 Automotive sales 8.00 16.50

43 Retail 8.00 16.50

44 Transportation—truck 8.00 16.50

45 Manufacturing—medical devices 7.75 17.03

46 Manufacturing—automotive 7.75 17.03

47 Agriculture 7.75 17.03

48 Manufacturing—appliances 7.60 17.37

49 Education—secondary 7.60 17.37


50 Games—traditional 7.50 17.60

51 Education—primary 7.50 17.60

52 Automotive repairs 7.50 17.60

53 Manufacturing—aircraft 7.25 18.21

54 Public utilities—water 7.25 18.21

55 Real estate—commercial 7.25 18.21

56 Real estate—residential 7.25 18.21

57 Government—intelligence 7.20 18.33

58 Construction 7.10 18.59

59 Public utilities—electricity 7.00 18.86

60 Manufacturing—apparel 7.00 18.86

61 Mining—metals 7.00 18.86

62 Waste management 7.00 18.86

63 Mining—coal 7.00 18.86

64 Food—restaurants 7.00 18.86

65 Government—municipal 7.00 18.86

66 Manufacturing—defense 6.85 19.27

67 Government—military 6.75 19.56

68 Natural gas generation 6.75 19.56

69 Government—state 6.50 20.31

70 Government—county 6.50 20.31

71 Government—federal civilian 6.50 20.31

72 ERP vendors 6.00 22.00

Averages 8.79 15.02



be within the context of the same or related industries, and comparisons should be
made against organizations of similar size and located in similar geographic areas.
Industry differences and differences in geographic regions and company sizes
are so important that cost data cannot be accepted at face value without knowing
the details of the industry, city, and company size.
Over and above differences in compensation, there are also significant differ-
ences in productivity, due in part to work hour patterns and in part to the experi-
ence and technology stacks used. Table 12.2 shows ranges among Namcook clients.
As can be seen, productivity and compensation vary widely by industry, and
also by country and by geographic region.
Average values are misleading, and the overall ranges around the averages run from about 50% below average to perhaps 125% above average, based on team experience, tools, methodologies, programming languages, and available reusable materials.
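The two columns of Table 12.2 are reciprocal views of the same productivity level. A minimal sketch, assuming the nominal 132-hour U.S. work month used in this book (other countries, as Chapter 14 shows, would need different divisors):

    ASSUMED_WORK_HOURS_PER_MONTH = 132.0   # nominal U.S. work month

    def work_hours_per_fp(fp_per_month):
        # Convert function points per staff month into work hours per function point.
        return ASSUMED_WORK_HOURS_PER_MONTH / fp_per_month

    print(work_hours_per_fp(15.25))  # smartphone/tablet applications: ~8.66, as in Table 12.2
    print(work_hours_per_fp(6.00))   # ERP vendors: 22.00, as in Table 12.2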
Chapter 13

Variations in Costs by
Occupation Group

Other software-related positions besides project management have broad ranges in compensation too, and there are now more than 205 total software-related occupations in the United States! This means that in order to do software cost studies, it is necessary to deal with major differences in costs based on industry, company size, geographic location, the kinds of specialists that are present on any given project, and years of tenure or merit appraisal results.
Table 13.1 illustrates the ranges of basic compensation (exclusive of bonuses or
merit appraisal adjustments) for 15 software occupations in the United States. As
can be seen, the range of possible compensation levels runs from less than $50,000
to more than $120,000.
Over and above the basic compensation levels shown in Table 13.1, a number of specialized occupations are now offering even higher compensation levels than those illustrated. For example, programmers who are familiar with the SAP R/3 integrated system and the ABAP programming language can expect compensation levels about 10% higher than average, and may even receive a signing bonus similar to those offered to professional athletes. This is also true for some big data experts and for cybersecurity experts.
Even if only basic compensation is considered, it can easily be seen that software
projects developed by large companies in large cities such as New York and San
Francisco will have higher cost structures than the same applications developed by
small companies in smaller cities such as Little Rock or Knoxville.
Although the topic is not illustrated and the results are often proprietary, there
are also major variations in compensation based on merit appraisals and/or longevity

Table 13.1 Variations in Compensation for 15 U.S. Software Occupation Groups

Occupation | Average Annual Salary | Range by Company Size (+ or −) | Range by Geographic Region (+ or −) | Range by Industry (+ or −) | Maximum Annual Salary | Minimum Annual Salary
Software architect $97,370 $13,000 $4,500 $7,500 $122,370 $70,620

Senior systems programmer $96,300 $13,000 $4,500 $6,000 $119,800 $71,155

Senior systems analyst $88,810 $11,000 $4,000 $6,000 $109,810 $66,340


Systems programmer $83,460 $12,000 $4,000 $5,500 $104,960 $60,455
Systems analyst $74,900 $10,500 $3,750 $5,000 $94,150 $54,303
Process analyst $71,690 $10,500 $3,750 $5,000 $90,940 $51,093
Programmer/analyst $70,620 $11,000 $3,500 $5,000 $90,120 $49,755
Database analyst $69,550 $12,000 $3,750 $6,000 $91,300 $46,278
Application programmer $69,550 $10,000 $3,500 $5,000 $88,050 $49,755
Maintenance programmer $67,410 $10,000 $3,500 $5,000 $85,910 $47,615
Testing specialist $66,875 $10,000 $3,500 $5,000 $85,375 $47,080
Metrics specialist $66,340 $8,000 $3,750 $5,000 $83,090 $48,418

Quality assurance $65,270 $7,500 $3,500 $5,000 $81,270 $48,150


Technical writer $55,105 $5,000 $3,500 $3,000 $66,605 $42,800
Customer support $51,895 $2,000 $3,500 $2,000 $59,395 $43,870
Average $73,009 $9,700 $3,767 $5,067 $91,543 $53,179

within grade. Longevity is mainly a factor for unionized positions, which are rare
for software in the United States, but common in Europe and Australia.
This factor can add about another plus or minus $7,500 to the ranges of
compensation for technical positions, and even more for executive and managerial
positions. (Indeed, multimillion-dollar executive bonuses have been noted. Whether these huge bonuses are justified or not is outside the scope of this book.)
Also not illustrated are the bonus programs and stock equity programs that
many companies offer to software technical employees and to managers. For exam-
ple, the stock equity program at Microsoft has become famous for creating more
millionaires than any similar program in the U.S. industry.
Chapter 14

Variations in Work Habits
and Unpaid Overtime

The software industry is a highly labor-intensive one. So long as software is built using human effort as the primary tool, all of the factors associated with work patterns and overtime will continue to be significant. That being said, unpaid overtime is the most common omission from software cost-tracking systems. Unpaid overtime averages more than 10 hours a week in the United States and more than 16 hours a week in Japan. This is far too significant a factor to be ignored, but that is usually the case.
Assume that a typical month contains four work weeks, each composed of five
eight-hour working days. The combination of 4 weeks × 5 days × 8 hours = 160
available hours in a typical month.
However, at least in the United States, the effective number of hours worked
each month is often less than 160 due to factors such as coffee breaks, meetings,
and slack time between assignments.
Thus in situations where there is no intense schedule pressure, the effective
number of work hours per month may only amount to about 80% of the available
hours, or about 132 hours per calendar month.
The Organisation for Economic Co-operation and Development (OECD) publishes general work hour data for all countries. However, the author's company, Namcook Analytics LLC, modifies the OECD general data for software.
Table 14.1 shows a large sample of global and industry ranges in effective work
hours.
As most readers know, software projects are often under intense schedule pres-
sures and overtime is quite common. The majority of professional U.S. software
personnel are termed exempt, which means that they do not receive overtime pay


Table 14.1 Global and Industry Variations in Software Work Hours 2016
Countries | Namcook Software Work Hours per Month | Namcook Software Unpaid Overtime per Month | Namcook Software Total Hours per Month | Percentage of U.S. Total Hours per Month

1 India 190.00 12.00 202.00 146.38

2 Taiwan 188.00 10.00 198.00 143.48

3 Mexico 185.50 12.00 197.50 143.12

4 China 186.00 8.00 194.00 140.58

5 Peru 184.00 6.00 190.00 137.68

6 Colombia 176.00 6.00 182.00 131.88

7 Pakistan 176.00 6.00 182.00 131.88

8 Hong Kong 168.15 12.00 180.15 130.54

9 Thailand 168.00 8.00 176.00 127.54

10 Malaysia 169.92 6.00 175.92 127.48

11 Greece 169.50 6.00 175.50 127.17

12 South Africa 168.00 6.00 174.00 126.09

13 Israel 159.17 8.00 167.17 121.14

14 Vietnam 160.00 6.00 166.00 120.29

15 Philippines 160.00 4.00 164.00 118.84

16 Singapore 155.76 8.00 163.76 118.67

17 Hungary 163.00 6.00 163.00 118.12

18 Poland 160.75 2.00 162.75 117.93

19 Turkey 156.42 4.00 160.42 116.24

20 Brazil 155.76 4.00 159.76 115.77

21 Panama 155.76 4.00 159.76 115.77

22 Chile 149.64 8.00 157.64 114.23

23 Estonia 157.42 0.00 157.42 114.07

24 Japan 145.42 12.00 157.42 114.07


25 Switzerland 148.68 8.00 156.68 113.54

26 Czech Republic 150.00 0.00 150.00 108.70

27 Russia 145.51 4.00 149.51 108.34

28 Argentina 148.68 0.00 148.68 107.74

29 South Korea 138.00 6.00 144.00 104.35

30 United States 132.00 6.00 138.00 100.00

31 Saudi Arabia 141.60 0.00 141.60 102.61

32 Portugal 140.92 0.00 140.92 102.11

33 United Kingdom 137.83 2.00 139.83 101.33

34 Finland 139.33 0.00 139.33 100.97

35 Ukraine 138.06 0.00 138.06 100.04

36 Venezuela 134.52 2.00 136.52 98.93

37 Austria 134.08 0.00 134.08 97.16

38 Luxembourg 134.08 0.00 134.08 97.16

39 Italy 129.21 2.00 131.21 95.08

40 Belgium 131.17 0.00 131.17 95.05

41 New Zealand 128.25 2.00 130.25 94.38

42 Denmark 128.83 0.00 128.83 93.36

43 Canada 126.11 2.00 128.11 92.84

44 Australia 127.44 0.00 127.44 92.35

45 Ireland 127.42 0.00 127.42 92.33

46 Spain 124.34 2.00 126.34 91.55

47 France 123.25 0.00 123.25 89.31


48 Iceland 120.00 0.00 120.00 86.96

49 Sweden 119.55 0.00 119.55 86.63

50 Norway 118.33 0.00 118.33 85.75

51 Germany 116.42 0.00 116.42 84.36

52 Netherlands 115.08 0.00 115.08 83.39

Average 148.21 3.85 151.94 110.10

U.S. Industry Segments | Namcook Software Work Hours per Month | Namcook Software Unpaid Overtime per Month | Namcook Software Total Hours per Month | Namcook (%) of U.S. Total Hours per Month

1 Start-up technology companies 191.67 16.00 207.67 150.48

2 Technology companies 175.00 14.00 189.00 136.96

3 Computer games 165.00 8.00 173.00 125.36

4 Open source 160.42 8.00 168.42 122.04

5 Web/cloud 150.00 8.00 158.00 114.49

6 Bioengineering/medicine 147.50 10.00 157.50 114.13

7 Fixed-price contracts 138.28 12.00 150.28 108.90

8 Management consulting 142.00 8.00 150.00 108.70

9 Outsource contractors 140.13 8.00 148.13 107.34


10 Manufacturing 136.44 6.00 142.44 103.22

11 Finance/insurance 134.59 6.00 140.59 101.88

12 Telecom 134.59 6.00 140.59 101.88

13 Entertainment 132.75 6.00 138.75 100.54

14 U.S. Average 132.00 6.00 138.00 100.00

15 Wholesale/retail 131.00 6.00 137.00 99.28

16 Health care 130.00 4.00 134.00 97.10

17 Avionics 129.06 4.00 133.06 96.42

18 Energy 127.29 4.00 131.29 95.14

19 Profit-center projects 125.38 4.00 129.38 93.75

20 Time and material contracts 129.06 0.00 129.06 93.52

21 Education 123.53 2.00 125.53 90.96

22 Federal government 123.53 0.00 123.53 89.52

23 Cost-center projects 119.84 2.00 121.84 88.29

24 Defense 121.69 0.00 121.69 88.18

25 State/local government 117.26 0.00 117.26 84.97

Average 142.61 7.20 149.81 108.56



for work in the evening or on weekends. Indeed, many software cost-tracking systems do not even record overtime hours.
Thus for situations where schedule pressures are intense, not only might the software team work for the available 160 hours per month, but they would also work late in the evenings and on weekends. Thus on crunch projects, the work might amount to 110% of the available hours, or about 176 hours per month.
Table 14.2 compares two versions of the same project, which can be assumed
to be a 1,000 function points information systems application written in COBOL.
The first version is a normal version where only about 80% of the available hours

Table 14.2 Differences between Normal and Intense Software Work Patterns
Activity Project 1 Project 2

Work Habits Normal Intense Difference Percentage

Function point size 1,000 1,000 0 0.00

Size in lines of code (LOC) 55,000 55,000 0 0.00

LOC per FP 55 55 0 0.00

Ascope in FP 200 200 0 0.00

Nominal Prate in FP 10 10 0 0.00

Availability 80.00% 110.00% 30.00% 37.50

Hours per month 128 176 48 37.50

Unpaid overtime 0 16 16 Nil

Salary per month $10,000.00 $10,000.00 $0.00 0.00

Staff 5 5 0 0.00

Effort months 125 90.91 −34.09 −27.27

Schedule months 31.25 16.53 −14.72 −47.11

Cost $1,250,000 $909,100 ($340,900) −27.27

Cost per FP $1,250.00 $909.10 ($341) −27.27

Work hours per FP 16 16 0 0.00

Virtual Prate in FP 8 11 3 37.50

Cost per LOC $22.73 $16.53 ($6.20) −27.27

LOC per month 800 1,100 300 37.50



each month are worked. The second version shows the same project in crunch mode
where the work hours comprise 110%, with all of the extra hours being in the form
of unpaid overtime by the software team.
As exempt software personnel are normally paid on a monthly basis rather than
on an hourly basis, the differences in apparent results between normal and intense
work patterns are both significant and also tricky when performing software eco-
nomic analyses.
As can be seen in Table 14.2, applying intense work pressure to a software proj-
ect in the form of unpaid overtime can produce significant and visible reductions
in software costs and software schedules. (However, there may also be invisible and
harmful results in terms of staff fatigue and burnout.)
Table 14.2 introduces five terms that are significant in software measurement
and also cost estimation, but which need a definition.
The first term is assignment scope (abbreviated to Ascope), which is the quantity
of function points (FP) normally assigned to one staff member.
The second term is production rate (abbreviated to Prate), which is the monthly
rate in function points at which the work will be performed.
The third term is nominal production rate (abbreviated to Nominal Prate in FP),
which is the rate of monthly progress measured in function points without any
unpaid overtime being applied.
The fourth term is virtual production rate (abbreviated to Virtual Prate in FP),
which is the apparent rate of monthly productivity in function points that will
result when unpaid overtime is applied to the project or activity.
The fifth term is work hours per function point, which simply accumulates the
total number of work hours expended and divides that amount by the function
point total of the application.
As software staff members are paid monthly but work hourly, the most visible
impact of unpaid overtime is to decouple productivity measured in work hours per
function point from productivity measured in function points per staff month.
Assume that a small 60 function point project would normally require two
calendar months or 320 work hours to complete. Now assume that the program-
mer assigned worked double shifts and finished the project in one calendar month,
although 320 hours were still needed.
If the project had been a normal one stretched over two months, the productiv-
ity rate would have been 30 function points per staff month and 5.33 work hours
per function point. By applying unpaid overtime to the work and finishing in one
month, the virtual productivity rate appears to be 60 function points per staff
month, but the actual number of hours required remains 5.33 work hours per function point.
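That decoupling is easy to see in a short sketch of the 60 function point example (the $10,000 monthly salary is an assumed figure added here to show the cost effect as well):

    def productivity_views(size_fp, total_work_hours, calendar_months, monthly_salary=10_000):
        # For a one-person project, calendar months equal staff months.
        hours_per_fp = total_work_hours / size_fp        # unaffected by unpaid overtime
        fp_per_staff_month = size_fp / calendar_months   # inflated by unpaid overtime
        apparent_cost = calendar_months * monthly_salary # staff are paid by the month
        return hours_per_fp, fp_per_staff_month, apparent_cost

    print(productivity_views(60, 320, 2))  # normal: 5.33 hours/FP, 30 FP/month, $20,000
    print(productivity_views(60, 320, 1))  # double shifts: 5.33 hours/FP, 60 FP/month, $10,000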
Variations in work patterns are extremely significant variations when dealing
with international software projects. There are major national differences in terms
of work hours per week, quantities of unpaid overtime, numbers of annual holi-
days, and annual vacation periods.

In fact, it is very dangerous to perform international studies without taking this phenomenon into account. Variations in work practices are a major differentiating factor for international software productivity and schedule results.
Software is currently among the most labor-intensive commodities on the global market. Therefore work practices and work effort applied to software exert a major influence on productivity and schedule results. In every country, the top software personnel tend to work rather long hours, so Table 14.1 can only be used for very rough comparisons.
The differences in national work patterns compounded with differences in
burdened cost structures can lead to very significant international differences in
software costs and schedules for the same size and kind of application.
Chapter 15

Variations in Functional
and Nonfunctional
Requirements

When building a home, the various components are estimated separately for the foundations, support beams, flooring, plumbing, electrical systems, heating and air-conditioning (AC), roofing, and so on. But cost estimates for homes also aggregate these disparate costs into cost per square foot, which is a metric that contractors, home owners, taxing authorities, and mortgage specialists all know and understand.
Some of these home costs are based on user requirements, that is, quality of
cabinets, roofing materials, window design, and so forth. Other costs are based on
nonfunctional mandates by state and local governments.
For example, a home in Rhode Island within 1,000 yards of the ocean has to
have hurricane-proof windows. A home within 100 yards of an aquifer has to have
a special septic system. These are not things that the users want because they are
expensive, but they have to be used due to nonfunctional state requirements.
Software also has many disparate activities—requirements, architecture, design,
coding, testing, integration, configuration control, quality assurance, technical
writing, project management, and so on. Here too, they can be estimated sepa-
rately with their own units of measure but should also be consolidated into a single
metric.
From 1978 until 2012, the metric of cost per function point was used to aggre-
gate as many as 60 different kinds of software work into a total cost of ownership
(TCO) that used function points for data normalization.
In 2012, International Function Point Users Group (IFPUG) introduced a new
metric called SNAP. This is an acronym for software nonfunctional assessment process.


Before SNAP, nonfunctional requirements were still part of software development and had the effect of raising cost per function point (just as government building mandates raise the cost per square foot of home construction). When SNAP metrics arrived in 2012, they were not designed to be mathematically equivalent to function points (although they might have been). This means that software no longer has a single unifying metric for all cost elements. Consider the following two cases, using simple whole numbers to make the results clear.
In Case 1, circa 2010, a software application of 1,000 function points might have had a total development effort of 100 staff months, so the net productivity rate was 10 function points per staff month. Perhaps 20% of the effort went to building the nonfunctional requirements. Assume development costs were $1,000,000 based on a monthly rate of $10,000. The net cost would be $1,000 per function point.
In Case 2, circa 2016, the same application might be sized at 900 function points and 150 SNAP points. But the total effort would still be 100 staff months because it is the same application. Assume that 80 months went to normal software development and 20 months went to building the nonfunctional features.
Now the 2016 productivity for normal development would be 11.25 function points per month and the SNAP effort would be 7.50 SNAP points per month. But what about the total effort for the total project? It is still 100 months and $1,000,000, but now there is no overall metric for normalization.
The regular functional development costs were $800,000 and the SNAP costs
were $200,000 so the total is still $1,000,000. The functional costs are $889 per
function point and the SNAP costs are $1,333 per SNAP point.
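A short sketch of the Case 1 and Case 2 arithmetic above shows why reporting overall net dollars per function point alongside the separate functional and SNAP costs preserves comparability with older benchmarks:

    MONTHLY_COST = 10_000

    # Case 2 circa 2016: 900 function points plus 150 SNAP points, 100 staff months total
    functional_cost = 80 * MONTHLY_COST        # $800,000 for the functional work
    snap_cost = 20 * MONTHLY_COST              # $200,000 for the nonfunctional work

    print(functional_cost / 900)               # ~$889 per function point
    print(snap_cost / 150)                     # ~$1,333 per SNAP point
    print((functional_cost + snap_cost) / 900) # ~$1,111 net per function point
    # For comparison, Case 1 circa 2010: $1,000,000 / 1,000 FP = $1,000 per function point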
It is useful to know the separate costs for the functional and nonfunctional requirements. In general, nonfunctional requirements are more costly than normal user requirements. But to make it easier to compare new software applications using SNAP with the more than 75,000 existing pre-SNAP applications already sized with function points, it is still possible to show overall net dollars per function point.
Retrofitting SNAP to older software benchmarks is difficult and expensive. It is
also expensive to integrate SNAP into commercial parametric estimation tools. But
continuing to use overall cost per function point in addition to the separate costs for
functional and nonfunctional requirements would allow apples-to-apples compari-
sons between new software and legacy software.
Both Case 1 and Case 2 cost $1,000,000 and both required 100 staff months
as they are actually the same project. Using net cost per function point after SNAP
costs are known would provide continuity with historical data from older legacy
applications.
Nonfunctional requirements seem to go up with size in function points. They
are also a higher percentage of total features for government and defense soft-
ware. Also software facing government certification or controls such as Sarbanes–
Oxley, U.S. Food and Drug Administration certification, and Federal Aviation

Administration certification has larger volumes of nonfunctional requirements. There are also many nonfunctional requirements associated with telecommunications and enterprise resource planning packages.
As shown in Table 15.1, 100 applications are sized by Software Risk Master
(SRM) and sorted by the overall percentage of nonfunctional SNAP points com-
pared to regular function points. Note that these are predicted values but based on
observations of many government projects.

Table 15.1 Sizes of 100 Software Applications


Applications | Size in Function Points IFPUG 4.3 | SNAP Nonfunction Points | Size in Logical Code Statements | SNAP Percent (%)

1 Obamacare 107,350 33,450 12,345,250 31.16


website (all
features)

2 Citizens bank 4,017 1,240 367,224 30.87


on-line

3 State Motor 11,240 3,450 599,467 30.69


vehicle
registrations

4 Property tax 1,492 457 136,438 30.62


assessments

5 IRS income tax 19,013 5,537 1,352,068 29.12


analysis

6 VA Patient 23,109 6,500 4,929,910 28.13


monitoring

7 EZPass vehicle 4,751 1,300 253,400 27.36


controls

8 Consumer credit 1,332 345 53,288 25.90


report

9 FEDEX shipping 17,378 4,500 926,802 25.90


controls

10 American 20,141 4,950 1,432,238 24.58


Express billing

11 Insurance claims 11,033 2,567 252,191 23.27


handling


12 Airline 38,392 8,900 6,142,689 23.18


reservation
system

13 State wide child 17,850 4,125 952,000 23.11


support

14 U.S. Air Traffic 306,324 70,133 65,349,222 22.90


control

15 World-wide 307,328 65,000 28,098,560 21.15


military
command and
control
(WWMCCS)

16 IBM Future 515,323 108,218 68,022,636 21.00


System FS/1
(circa 1985 not
completed)

17 Israeli air 300,655 63,137 24,052,367 21.00


defense system

18 Shipboard gun 21,199 4,240 1,938,227 20.00


controls

19 Star Wars missile 352,330 68,800 32,212,992 19.53


defense

20 Aegis destroyer 253,088 49,352 20,247,020 19.50


Command and
Conquer

21 North Korean 273,961 50,957 25,047,859 18.60


Border defenses

22 Iran’s air defense 260,100 46,558 23,780,557 17.90


system

23 Norton antivirus 2,151 369 152,942 17.16


software


24 M1 Abrams 19,569 3,131 1,789,133 16.00


battle tank
operations

25 Skype 21,202 3,392 1,130,759 16.00

26 Bank ATM 3,917 571 208,927 14.58


controls

27 Apple iPhone v6 19,366 2,518 516,432 13.00


operations

28 Linux 17,505 2,276 700,205 13.00

29 FBI fingerprint 25,075 3,260 2,674,637 13.00


analysis

30 NASA space 23,153 3,010 2,116,878 13.00


shuttle

31 Oracle 229,434 29,826 18,354,720 13.00

32 MRI medical 18,785 2,442 1,335,837 13.00


imaging

33 Google search 18,640 2,423 1,192,958 13.00


engine

34 Data Warehouse 21,895 2,846 1,077,896 13.00

35 Amazon website 18,080 2,350 482,126 13.00

36 Tomahawk cruise 17,311 2,250 1,582,694 13.00


missile

37 Cruise ship 18,896 2,456 1,343,713 13.00


navigation

38 EBAY transaction 16,390 2,110 1,498,554 12.87


controls

39 Cat scan medical 4,575 585 244,000 12.79


device


40 Denver Airport 17,002 2,166 1,554,497 12.74


luggage
(original)

41 Inventory 16,661 2,111 1,332,869 12.67


management

42 SAP 253,500 32,070 18,480,000 12.65

43 JAVA compiler 1,281 162 91,096 12.65

44 Laser printer 1,285 162 82,243 12.61


driver

45 IBM IMS 15,392 1,939 1,407,279 12.60


database

46 Lasik surgery 3,625 456 178,484 12.58


(wave guide)

47 All-in-one printer 1,306 163 52,232 12.48


driver

48 PBX switching 1,658 207 132,670 12.48


system

49 Android 14,019 1,749 690,152 12.48


operating system

50 IRA account 1,340 167 71,463 12.46


management

51 Sun Java compiler 1,310 163 119,772 12.44

52 Digital camera 1,344 167 286,709 12.43


controls

53 MapQuest 3,969 493 254,006 12.42

54 Motorola cell 1,579 196 144,403 12.41


phone contact
list

55 Seismic analysis 1,564 194 83,393 12.41


56 Sidewinder 1,518 188 60,730 12.38


missile controls

57 SAS statistical 10,927 1,349 999,065 12.35


package

58 Google Gmail 1,379 170 98,037 12.33

59 Patriot missile 16,239 2,001 1,484,683 12.32


controls

60 SpySweeper 2,227 274 109,647 12.30


antispyware

61 Sun D-Trace utility 3,505 430 373,832 12.27

62 NVIDIA graphics 3,793 464 151,709 12.23


card

63 Toyota robotic 14,912 1,822 3,181,283 12.22


manufacturing

64 Apple iPod 1,507 183 80,347 12.15

65 AutoCAD 1,900 230 121,631 12.10

66 Microsoft Project 2,108 255 192,757 12.10


2007

67 Microsoft 3,450 416 157,714 12.06


Outlook

68 Mozilla Firefox 1,450 174 132,564 12.00


(original)

69 North Korean 37,235 4,468 5,101,195 12.00


Long-Range
Missile controls

70 Microsoft Visual 2,068 247 110,300 11.94


Basic

71 Intel Math 1,768 211 141,405 11.94


function library

(Continued)
112 ◾ A Guide to Selecting Software Measures and Metrics

Table 15.1 (Continued) Sizes of 100 Software Applications


Size in
Function SNAP Size in SNAP
Points Nonfunction Logical Code Percent
Applications IFPUG 4.3 Points IFPUG Statements (%)

72 State 12,300 1,461 656,000 11.88


transportation
ticketing

73 Smart bomb 1,267 150 67,595 11.84


targeting

74 Wikipedia 1,257 148 67,040 11.77

75 All-in-one printer 1,963 231 125,631 11.77

76 Garmin hand- 1,858 218 118,900 11.73


held GPS

77 Microsoft Word 3,309 388 176,501 11.72


2007

78 Microsoft Excel 4,429 516 404,914 11.65


2007

79 Chinese 4,500 522 197,500 11.60


submarine
sonar

80 Quicken 2015 13,811 1,599 679,939 11.58

81 Adobe Illustrator 2,507 280 178,250 11.17

82 Windows 10 (all 198,050 21,786 12,675,200 11.00


features)

83 Microsoft Office 93,498 10,285 5,983,891 11.00


Professional
2010

84 Cochlear implant 1,250 135 66,667 10.80


(embedded)

85 Casio atomic 1,250 129 66,667 10.32


watch with
compass, tides

86 NSA code 35,897 3,590 3,829,056 10.00


decryption

(Continued)
Variations in Functional and Nonfunctional Requirements ◾ 113

Table 15.1 (Continued) Sizes of 100 Software Applications


Size in
Function SNAP Size in SNAP
Points Nonfunction Logical Code Percent
Applications IFPUG 4.3 Points IFPUG Statements (%)

87 NASA Hubble 21,632 2,163 1,977,754 10.00


controls

88 Computer BIOS 1,215 111 86,400 9.14

89 Automobile fuel 1,202 109 85,505 9.07


injection

90 APAR analysis 1,248 113 159,695 9.06


and routing

91 Antilock brake 1,185 107 63,186 9.03


controls

92 FBI Carnivore 31,111 2,800 3,318,515 9.00

93 Hearing aid 1,142 102 30,448 8.93


(multi program)

94 Ccleaner utility 1,154 103 73,864 8.92

95 Logitech cordless 1,134 96 90,736 8.46


mouse

96 Instant 1,093 89 77,705 8.14


messaging

97 Oracle CRM 10,491 836 745,995 7.97


Features

98 DNA Analysis 10,380 808 511,017 7.78

99 Twitter (original 1,002 77 53,455 7.68


circa 2009)

100 Denial of service 866 – 79,197 0.00


virus

Averages 42,682 7,739 4,250,002 14.46

Note: Sizes assume IFPUG 4.3.


All sizes predicted by SRM.
Copyright © 2016 by Capers Jones.
All rights reserved.

Government projects are often more expensive than civilian projects of the
same size, in part because of higher volumes of nonfunctional requirements. They are
also more expensive because of elaborate procurement and governance procedures
that generate huge volumes of paper documents and expensive status reporting.
In fact, defense projects are almost the only known software projects to use
independent verification and validation (IV and V) and independent testing. This
alone makes defense projects at least 5% more expensive than civilian projects of the
same size in function points.
If a typical civilian software project of 1,000 function points requires 100
staff months, a similar state government project of 1,000 function points would
probably require 110 staff months and a similar military software project of 1,000
function points might require 125 staff months.
The nonfunctional requirements compared to function points would probably
be 15% for the civilian project, 20% for the state government project, and 30% for
the military project.
Table 15.1 shows approximate SNAP points and also a percentage of SNAP
points compared to function points as predicted by SRM.
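
To make the ratios in Table 15.1 concrete, the short sketch below (Python) shows the simple relationship between function points, an assumed SNAP percentage, and SNAP points. The 13% ratio is only an illustrative value taken from the middle of the table; it is not an SRM algorithm.

def estimate_snap_points(function_points, snap_percent):
    # SNAP (nonfunctional) points approximated as a percentage of function points
    return round(function_points * snap_percent / 100.0)

# Example: a 10,000 function point application with an assumed 13% SNAP ratio
print(estimate_snap_points(10000, 13.0))  # -> 1300 SNAP points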
Chapter 16

Variations in Software
Quality Results

This chapter is basically a catalog of the known problems with software quality and
with software quality measurements and metrics. The overall quality observations
are based on the author’s lifetime collection of quality data from 1976 through
2016, containing about 26,000 software projects.
It is sad to report that recent software quality data collected since about 2010
are not a great deal better than older data from the 1980s. Newer programming
languages have reduced code defects, and static analysis has raised defect removal
efficiency, but because the bulk of software work circa 2016 is in the renovation of
legacy software, neither better languages nor static analysis has wide usage in legacy
repairs.
Indeed, most static analysis tools do not support some of the older languages, such
as MUMPS, ALGOL 68, and JOVIAL, and so have no role in legacy renovation.
New development projects with quality-strong methods such as Team Software
Process (TSP) and high-level programming languages such as C# or Objective-C
are better than older projects of the same size.
Some companies are quite good at both quality control and quality measures.
The author is fortunate to have started collecting quality data at IBM circa 1970,
where quality measurements and quality control methods were taken seriously by
corporate executives.
The author was also fortunate to have worked at ITT, where the famous Phil
Crosby wrote Quality Is Free, and ITT executives also took quality seriously.
The two corporate chairmen, Thomas J. Watson Jr. of IBM and Harold Geneen
of ITT, were willing to spend significant corporate funds to improve quality
and quality measures and thereby improve overall software performance. Both


companies made significant progress in software and also engineering quality


under these exceptional leaders. Follow-on executives in both companies continued
to support quality and measurements after Watson and Geneen retired.
This chapter is an overview of a variety of gaps, errors, and bad choices associ-
ated with poor software quality.

Missing Software Defect Data


Software quality measurements and metrics are so poor that over 70% of actual bugs
are not measured or recorded. Very few companies or individual researchers know the
total quantity of bugs in software. Most companies ignore bugs found by desk check-
ing, static analysis, and unit test: together these methods account for more than 50%
of total bugs removed. As of 2016, software quality data consist mainly of bugs found
during the later stages of testing. Quality data should start early and include all bugs.
At IBM, volunteers even recorded bugs found via desk checking and unit test, which
are normally private defect removal activities that go unmeasured.
The missing data have the unintended consequence of making software quality
look better than it really is. If you do not count more than half of the bugs in your
software, you really do not know its quality level.
Worse, almost all companies ignore bugs that originate in requirements, archi-
tecture, and design, and focus attention only on code bugs, which comprise less
than 40% of defect volumes. Bad fixes or new bugs in bug repairs are also seldom
recorded, although they are quite common (Table 16.1). The current U.S. average
for all bug sources is shown later.
The total sum of all bug sources is called the defect potential of a software appli-
cation. The term originated in IBM circa 1970, but is widely used by dozens of
leading technology companies in 2016.
Table 16.1 Approximate Average U.S. Software Defect Potentials
circa 2016
1. Requirements 0.70 defects per function point

2. Architecture 0.10 defects per function point

3. Design 0.95 defects per function point

4. Code 1.15 defects per function point

5. Security code flaws 0.25 defects per function point

6. Documents 0.45 defects per function point

7. Bad fixes 0.65 defects per function point

Totals 4.25 defects per function point
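
As a simple illustration of how these values are used, the defect potential of an application can be approximated by multiplying its size in function points by the per-function-point values above. The sketch below is a minimal example; the 1,000 function point size is an assumed input, not a figure from this chapter.

DEFECT_POTENTIAL_PER_FP = {
    "requirements": 0.70, "architecture": 0.10, "design": 0.95, "code": 1.15,
    "security code flaws": 0.25, "documents": 0.45, "bad fixes": 0.65,
}

def total_defect_potential(function_points):
    # Sum of all defect origins (4.25 per function point) times application size
    return function_points * sum(DEFECT_POTENTIAL_PER_FP.values())

print(total_defect_potential(1000))  # -> about 4,250 potential defects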



Software Defect Removal Efficiency


Most companies do not know (or seem to care) about the efficiency of various kinds
of defect removal methods in finding bugs. Most forms of testing are less than 35%
in defect removal efficiency.
Static analysis is about 55% efficient in finding bugs. Formal inspections have
topped 85% efficiency in finding bugs (Table 16.2). The approximate software
defect removal efficiency (DRE) values for a sample of selected defect prevention
and defect removal methods are shown below.

Table 16.2 Defect Prevention and DRE


Defect Prevention Nominal Efficiency

1 Joint application design (JAD) 25.00%

2 Quality function deployment (QFD) 45.00%

3 Prototypes 20.00%

4 Models 25.00%

5 CMMI 3 15.00%

6 CMMI 5 30.00%

Pretest Removal Efficiency

1 Desk check 26.00%

2 Pair programming 15.00%

3 Static analysis 55.00%

4 Informal walk-throughs 40.00%

5 Formal Inspections 85.00%

6 Independent verification and validation (IV and V) 15.00%

7 Automated proofs of correctness 33.00%

8 Manual proofs of correctness 5.00%

Test Removal Efficiency

1 Formal test planning 20.00%

2 Unit test 31.00%

3 Function test 34.00%


4 Regression test 13.00%

5 Cloud test 16.00%

6 Component test 31.00%

7 Usability test 12.00%

8 Stress/Performance test 12.00%

9 Security test 15.00%

10 Independent test 22.00%

11 Hardware platform test 10.00%

12 Software platform test 10.00%

13 Nationalization test 10.00%

14 Supply-chain test 22.00%

15 System test 35.00%

16 Beta test 20.00%

17 Acceptance test 20.00%

Special Methods for Special Defects Efficiency

1 Ethical hacking 18.00%

2 Penetration teams 29.00%

3 Defect seeding 27.00%

4 Race condition detection 33.00%

5 SANS Institute code defect categories 47.00%

Cumulative observed DRE range: 78.00% to 99.65%.

Notes:
DRE goes up with team experience and high CMMI levels.
DRE goes up with quality-strong methods such as TSP and RUP.
DRE is inversely proportional to cyclomatic complexity.
Most forms of testing are <35% in DRE.
Inspections and static analysis have highest nominal DRE levels.
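
A series of defect removal stages achieves a higher cumulative DRE than any single stage because each stage removes a share of the defects that escaped the earlier ones. The sketch below shows that arithmetic; the stage efficiencies are example values taken from Table 16.2, and the simple multiplicative model is an assumption, not a claim about any specific project.

def cumulative_dre(stage_efficiencies):
    # Combine per-stage removal efficiencies (fractions) into a cumulative DRE,
    # assuming each stage removes its share of the defects still remaining.
    remaining = 1.0
    for efficiency in stage_efficiencies:
        remaining *= (1.0 - efficiency)
    return 1.0 - remaining

# Example: static analysis, formal inspections, and three test stages
print(round(cumulative_dre([0.55, 0.85, 0.31, 0.34, 0.35]), 4))  # -> about 0.98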

No company or project performs all of these quality control steps. But overall
quality goes up with a synergistic combination of defect prevention, pretest defect
removal, and formal testing using mathematical test case design and certified test
personnel. Quality goes down when pretest removal is skimped, or with informal
testing by untrained amateur development personnel.

Money Spent on Software Bug Removal


Overall, the United States spends about 40 cents out of every software dollar on
finding and fixing bugs, or just over $142 billion per year. With better quality,
this could drop below 10 cents out of every dollar. Poor software quality is one
of the largest sources of industrial wasted effort in all human history (Table 16.3).
Following are the approximate U.S. national totals for bug repairs circa 2016.

Table 16.3 2016 U.S. Software Quality Costs

                              Software Staff   Monthly Salary   Monthly U.S. Cost    Annual U.S. Cost
U.S. development staff        1,300,000        $10,000          $13,000,000,000      $156,000,000,000
U.S. maintenance staff        1,850,000        $9,000           $16,650,000,000      $199,800,000,000
TOTALS                        3,150,000        $9,413           $29,650,000,000      $355,800,000,000

                              Projects      Average Function Points   Total Function Points   Annual Dollars per Function Point
Development projects          560,000       275                       154,000,000             $1,012.99
Legacy maintenance projects   2,850,000     140                       399,000,000             $500.75
TOTALS                        3,410,000     162                       553,000,000             $643.40

% on bug repairs              40.00%        $142,320,000,000
% on cyber attacks            11.00%        $39,138,000,000
% on canceled projects        8.00%         $28,464,000,000
% on litigation               3.00%         $10,674,000,000
TOTAL U.S. WASTAGE            62.00%        $220,596,000,000

Bug repairs per FP                          $257.36
Cyber attacks per FP                        $70.77
Canceled projects per FP                    $51.47
Litigation per FP                           $19.30
U.S. WASTAGE PER FUNCTION POINT             $398.91
Note 1: Bug repairs, cyber attacks, and canceled projects are all symptoms of poor
quality.
Note 2: About 5% of outsource contracts go to litigation for poor quality or
cancellation.
Note 3: Better quality control would speed up schedules, lower costs, and raise
value by about 50%.
Note 4: Static analysis, inspections, models, mathematical test case design, and
certified reuse are critical.
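
A minimal sketch of the arithmetic behind Table 16.3, using the staffing, salary, and percentage figures shown in the table:

# Annual U.S. software cost = staff count * monthly salary * 12 months
dev_cost = 1_300_000 * 10_000 * 12      # about $156.0 billion
maint_cost = 1_850_000 * 9_000 * 12     # about $199.8 billion
total_cost = dev_cost + maint_cost      # about $355.8 billion

# Wastage categories as percentages of total annual cost (from Table 16.3)
wastage_percentages = {"bug repairs": 0.40, "cyber attacks": 0.11,
                       "canceled projects": 0.08, "litigation": 0.03}
total_wastage = sum(pct * total_cost for pct in wastage_percentages.values())
print(total_wastage)  # -> about $220.6 billion, or 62% of the total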

Most Fortune 500 companies spend close to half a billion dollars per year fixing
bugs that could have been prevented or eliminated prior to software deployment
with better quality control. Unfortunately, defect removal costs are underreported
in many benchmark studies due to the lack of specific defect counts and costs for desk
checking, static analysis, and unit testing, all of which are commonly excluded
from defect reports and not broken out in cost data.
The unfortunate cost per defect metric distorts reality and penalizes quality.
It gives the false appearance that defect removal costs go up as a development cycle
proceeds. In fact, defect removal costs are fairly flat. Defect removal cost per
function point is a much better and more reliable way to show the true value of
high quality.

Wasted Time by Software Engineers due to Poor Quality


Quality today is so poor that U.S. software engineers only spend about 50 days
a year on productive work. With better quality, this could go above 125 days per
year. This topic is based on long-range studies with volunteers who kept daily time
sheets as to how they spent each day. Formal inspections could probably shift
15 days a year from the wastage column to the productive column. Static analysis
is also cost effective and efficient and probably could shift at least 12 days a year
from the wastage column to the productive column. The term wastage includes
these topics:

1. Time spent finding and fixing bugs


2. Time spent on projects that are canceled due to poor quality or negative
return on investment (ROI)
3. Time spent dealing with cyber attacks due to poor quality control
4. Time spent in depositions and courts time for lawsuits for poor quality

Time spent finding and fixing bugs is the number one software cost driver at
about 40% of total effort. Time spent on canceled projects is variable from company
to company, but the U.S. average is about 8%.
Time spent on cyber attacks is unfortunately increasing and is about 11% in
2016. Time spent on litigation can be large for companies with active lawsuits, but
the overall industry total is <5%.

Bad Fixes or New Bugs in Bug Repairs


About 7% of bug repairs have new bugs in them called bad fixes (discovered by
IBM circa 1970). When cyclomatic complexity goes above 25 for specific modules,
bad-fix injections can approach 25%. For some error-prone modules (EPM), bad
fixes have topped 70%. Unfortunately, only a few leading companies such as IBM
measure bad-fix injections or know how common they are.
With better quality control and better quality measures, bad-fix injections
could drop below 1%. Static analysis of bug repairs could essentially eliminate
bad-fix injections. Inspections could eliminate bad fixes too, but static analysis is
much faster and easier to deploy for small changes such as bug repairs.
In one lawsuit for a commercial financial software package where the author
was an expert witness, the case centered on four bad fixes released to a customer
over a nine-month period. (The original bug caused errors in financial reports that
cost the plaintiff about $3,000,000 in bank fees.)
None of the four initial repairs fixed the original problem; all four added new
problems. Finally, the fifth repair 10 months after the first repair successfully fixed

the original problem and added no new problems. Is it any wonder that there was
a lawsuit with four bad fixes in a row for an expensive high-end financial software
package?
Using static analysis on the first-bug repair would probably have eliminated the
original bad fix, and thereby eliminated the need for a lawsuit as well. In fact, static
analysis or inspections would probably have eliminated the original problem so bad
fixes would have been moot.

Bad-Test Cases (An Invisible Problem)


About 15% of test cases are either duplicates or have bugs in them. These bad-test
cases add to testing costs, but they find no bugs. They also slow down testing and
may report false positives. With better test case design methods, bad-test cases
could be reduced to less than 1%.
The software literature is almost silent on the topic of bad-test cases, and not
very thorough on test-case design. None of the software testing companies mention
this problem, nor do they present data on how their tools or methods reduce
bad-test case incidence.
The only company known to have studied bad-test cases was IBM, which did
several surveys of their major test libraries. Test case design based on design of
experiments has the lowest volume of bad-test case creation.

Error-Prone Modules with High Numbers of Bugs


Bugs are not randomly distributed, but clump in a small number of EPM. In
general, less than 10% of modules in a software application will contain over 50%
of bugs.
EPM cannot be found easily by testing, but inspections and static analysis could
eliminate EPM. EPM were discovered by IBM circa 1970. As a sample, the IBM IMS
product had 425 modules. Of these, about 300 were zero-defect modules. About 57%
of all bugs were found in only 31 modules. Some EPM had more than 5.0 bugs per
function point as opposed to just over 1.0 bug per function point for normal modules.
EPM occur in most large software systems and have been independently verified
by AT&T, Motorola, Raytheon, and other companies with good quality assurance
teams and good quality measurements. However, only sophisticated companies
with very good quality measures that trace bugs back to specific modules are able
to find and remediate EPM.
EPM are the most expensive artifacts in all software and cost 5 to 10 times
more than low-defect modules. Many EPM are so complex and have such high
cyclomatic complexity that they cannot be fixed (bad-fix injections approach 100%), but need
to be surgically removed and replaced.

It is technically possible to eliminate EPM from all software packages. Doing
so would require a combination of code inspections of critical modules and
static analysis of all modules.
Mathematical test case design, test coverage tools, and the use of cyclomatic com-
plexity measures would also aid in the elimination of EPM. EPM are technically treat-
able and avoidable and are the software equivalent of a major disease such as mumps
that could be completely eliminated by vaccination and preventive care.
However, software quality measurements are so bad throughout the industry
that probably 85% of the major applications that have EPM do not measure quality
well enough to identify them, much less remove them, before they cause harm.
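
Finding EPM requires quality data that trace each bug back to the module where it was found. The sketch below shows one simple way to flag candidate error-prone modules from such data; the module names, counts, and the 5.0 bugs per function point threshold are hypothetical illustrative values suggested by the IBM IMS observations above, not data from any real system.

def find_error_prone_modules(module_data, threshold=5.0):
    # module_data is a list of (name, bugs_found, function_points) tuples;
    # flag modules whose defect density exceeds the threshold
    return [name for name, bugs, fp in module_data
            if fp > 0 and bugs / fp > threshold]

# Hypothetical module-level defect data
modules = [("billing_update", 62, 10), ("report_writer", 9, 12), ("audit_trail", 4, 8)]
print(find_error_prone_modules(modules))  # -> ['billing_update']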

Limited Scopes of Software Quality Companies


Most quality companies are one-trick ponies that only care about one subject.
Some quality companies sell automated testing, some sell static analysis, some sell
automated proofs, or whatever; no company sells a full suite of software quality
tools and methods that encompass all sources of software defects and all forms of
software defect removal.
Effective quality control needs a synergistic combination of defect prevention,
pretest defect removal, and formal testing with certified test personnel. Worse,
most quality companies have zero empirical data as to the efficacy of their tools
and methods. They make vast claims of better quality but provide no case studies
or validated results.
If drugs and pharmaceutical products were released to the public with as little
validation as software quality tools, the U.S. death rate would probably be twice
what it actually is today.
A list of 56 software topics is shown in Table 16.4 that includes defect origins,
defect prevention methods, and defect removal stages that run from early require-
ments to postdelivery for a large system of a nominal 10,000 function points and
1,500 software nonfunctional assessment process (SNAP) points in size in Java.
All of these 56 quality control factors are important for large systems in the
10,000 function point size range. The problem today in 2016 is that no known
software quality company sells more than one or two of these 56 quality methods
or even knows about the others!
Few quality companies and even fewer of their clients know about the other
factors! A narrow focus on testing and basic ignorance of the suite of effective soft-
ware defect prevention and defect removal methods is an endemic and chronic
problem of the software industry.
If a reader wanted to learn about all 56 quality factors, he or she would probably
need a dozen courses from at least half a dozen software quality training companies
because none of them cover the full spectrum of effective quality tools and methods
or even know about them!

Table 16.4 Results of Poor Quality and High Quality Software (Nominal
10,000 Function Points; 1,500 SNAP Points)

                                                              Poor Quality   High Quality

U.S. Software Defect Potentials per Function Point 2016
1 Requirements defects (functional and nonfunctional)            0.90           0.30
2 Architecture defects                                           0.20           0.05
3 Design defects                                                 1.10           0.35
4 Code defects                                                   1.55           0.60
5 Security flaw defects                                          0.35           0.10
6 Document defects                                               0.45           0.20
7 Bad-fix defects (new bugs in bug repairs)                      0.55           0.05
Total                                                            5.10           1.65

Application Defect Removal Efficiency Results
8 DRE                                                            91.00%         98.50%
9 Defects removed per function point                             4.64           1.63
10 Defects removed—actual total                                  46,410         16,253
11 Defects delivered per function point                          0.46           0.02
12 Defects delivered—actual total                                4,590          247
13 High severity defects delivered per function point            0.46           0.02
14 High severity defects delivered—actual total                  689            27
15 Security flaws delivered—actual total                         55             1

Application Defect Prevention Stages
16 Joint application design (JAD)                                No             Yes
17 Prototype                                                     Yes            Yes
18 Requirements models (primarily functional)                    No             Yes
19 Quality function deployment (QFD)                             No             Yes
20 Automated proofs                                              No             Yes
21 SEMAT (Software Engineering Methods and Theory)               No             Yes
22 Six-Sigma for software                                        No             Yes
23 Capability maturity model (CMMI)—defense only                 No             No
Total                                                            1              7

Application Pretest Defect Removal Stages
24 Formal inspections of requirements                            No             Yes
25 Formal inspection of architecture (large systems)             No             Yes
26 Formal inspections of design                                  No             Yes
27 Formal inspections of new/changed code                        No             Yes
28 Formal quality assurance reviews                              No             Yes
29 Pair programming (not recommended)                            No             No
30 Independent verification and validation (defense only)        No             No
31 FOG readability index of requirements, design                 No             Yes
32 Static analysis of application code and changed code          No             Yes
33 Ethical hackers on high-security software                     No             Yes
34 Cyclomatic complexity analysis and reduction                  No             Yes
35 SANS institute defect category analysis and removal           No             Yes
Total                                                            1              10

Application Test Defect Removal Stages
36 Unit test—automated                                           Yes            Yes
37 New function test                                             Yes            Yes
38 Regression test—automated                                     Yes            Yes
39 Stress/performance test                                       Yes            Yes
40 Usability test                                                No             Yes
41 Component test                                                Yes            Yes
42 Independent test (defense only)                               No             No
43 Security test                                                 No             Yes
44 System test—automated                                         Yes            Yes
45 Multiplatform test                                            Yes            Yes
46 Global nationalization test                                   Yes            Yes
47 Beta/acceptance test                                          Yes            Yes
Total                                                            9              11

Application Post-Release Quality Stages
48 Static analysis of all code changes/bug repairs               No             Yes
49 Formal inspection of large changes                            No             Yes
50 Cyber attack defenses (firewalls, antivirus, etc.)            Yes            Yes
51 Penetration teams (high security applications)                No             Yes
52 Maintainability analysis of legacy applications               No             Yes
53 Test library analysis (defective test case removal)           No             Yes
54 Error-prone module (EPM) analysis and removal                 No             Yes
55 Race-condition analysis and correction                        No             Yes
56 Cyclomatic complexity analysis and correction                 No             Yes
Total                                                            1              9

56 TOTAL QUALITY CONTROL FACTORS                                 19             53

The curricula of the major software quality training companies are embarrass-
ing because of the gaps, omissions, and topics that are not covered. Even worse, not
a single quality education company has actual quantified data on software defect
origins, defect densities, defect prevention, or DRE levels.
You would have to go back almost 200 years in medical education to find
such skimpy knowledge of the basic topics needed to train physicians as we have for
training software quality and test personnel in 2016.
You might take quality courses from companies such as Construx, CAST, IBM,
ITMPI, SQE, QAI, the SANS Institute, Parasoft, Smart Bear, and probably from
other local educators, but these would probably be single-topic courses such as
static analysis or automated testing.
Worse, these courses even from major quality companies would lack quantitative
data on defect potentials, DRE, bad-fix injections, EPM, or any of the other critical
topics that quality professionals should know about. The software industry is run-
ning blind due to a widespread lack of quantitative quality data.
Quality data are available from benchmark organizations such as Davids
Consulting, Gartner Group, Namcook Analytics LLC, TIMetricas, Q/P
Management Group/QSM, and several others. But the combined set of clients for
all current quality benchmark organizations is less than 50,000 customers in an
industry employing close to 20,000,000 people on a global basis.
Quality data can be predicted by some parametric software estimation tools such
as Software Risk Master (SRM), KnowledgePlan, SEER, SLIM, and COCOMO,
but the combined market for all of these parametric tools is less than 25,000
customers in an industry employing almost 20,000,000 people on a global basis.
In other words, even companies that offer accurate quality data have com-
paratively few clients who are interested in that data, even though it could save

companies and governments billions of dollars in reduced defect repairs and


reduced cyber attack recovery costs.
It is professionally embarrassing how unsophisticated software quality
education is compared to medical school curricula for training physicians.
You probably could not take courses on this set of 56 topics from any university
because their curricula tend to deal only with a few of the more common methods
and concentrate primarily on testing. I have yet to see a university with quantitative
data on software defect volumes, severity levels, origins, or effective defect removal
methods with quantitative results.
You might take some courses from nonprofit associations such as the American
Society for Quality (ASQ) or the Project Management Institute (PMI). But no
single organization in 2016 covers more than a small fraction of the total intellec-
tual content of effective software quality control.
To illustrate the kind of quality education that is needed, Table 16.5 shows a
sample curriculum for software quality assurance training and Table 16.6 shows a
sample curriculum for software test training.
Table 16.5 Software Quality Assurance Curricula circa 2016
Software Quality Assurance Courses Days Value

1 Hazardous quality metrics: cost per defect 0.50 10.00

2 Hazardous quality metrics: lines of code 0.50 10.00

3 Hazardous quality metrics: technical debt 0.50 10.00

4 Hazardous quality metrics: story points 0.50 10.00

5 Effective quality metrics: function points 0.50 10.00

6 Effective quality metrics: defect removal % 0.50 10.00

7 Effective quality metrics: defect severity levels 0.50 10.00

8 Effective quality metrics: defect origin analysis 0.50 10.00

9 Emerging quality metrics: SNAP points 0.50 10.00

10 Overview of major software failures 1.00 10.00

11 Overview of major software cyber attacks 1.00 10.00

12 Error-prone module (EPM) analysis 1.00 10.00

13 Software defect detection efficiency (DDE) 1.00 10.00

14 Software defect removal efficiency (DRE) 1.00 10.00

15 Software defect tracking 1.00 10.00


16 Software defect prevention (JAD, QFD, etc.) 1.00 10.00

17 Software pretest defect removal 1.00 10.00

18 Software test defect removal 1.00 10.00

19 Software requirements modeling 1.00 10.00

20 Functional and nonfunctional requirements 2.00 10.00

21 Software static analysis: text 1.00 10.00

22 Software static analysis: code 1.00 10.00

23 Software correctness proofs: manual 1.00 10.00

24 Software correctness proofs: automated 1.00 10.00

25 Software security and quality in 2016 2.00 10.00

26 Quality benchmarks: Namcook, Q/P, and so on 1.00 10.00

27 Software security inspections 3.00 10.00

28 Security flaw removal (hacking, test, etc.) 3.00 10.00

29 Error-prone module (EPM) analysis 2.00 9.95

30 Software test case design 2.00 9.75

31 Software test library management 1.00 9.75

32 Reducing bad-fix injections 1.00 9.75

33 Test case conflicts and errors 1.00 9.75

34 Software requirement inspections 1.00 9.75

35 Software design inspections 2.00 9.50

36 Software code inspections 2.00 9.50

37 Software test inspections 2.00 9.50

38 Defect removal using pair programming 1.00 9.50

39 Defect removal using container development 1.00 9.50

40 Defect removal using DevOps 2.00 9.50

41 Defect removal using TSP/PSP 2.00 9.00


42 Defect removal using Agile 2.00 9.00

43 Defect removal using RUP 2.00 9.00

44 Automated software testing 2.00 9.00

45 Quality assurance of software reuse 1.00 9.00

46 Quality assurance of COTS and ERP 1.00 9.00

47 Quality assurance of open source 1.00 9.00

48 Tools: Quality assurance 1.00 9.00

49 Tools: Defect prediction 1.00 9.00

50 Defect removal using Waterfall development 1.00 8.00

51 Cost of quality (COQ) 1.00 8.00

52 Overview of the CMMI 1.00 8.00

53 ISO and IEEE quality standards 1.00 7.00

54 Six-Sigma: Green belt 3.00 7.00

55 Six-Sigma: Black belt 3.00 7.00

TOTAL 70.50 9.49

Table 16.6 Software Testing Courses circa 2016


Software Testing Courses Days Value

1 Test case design optimization 2.00 10.00

2 Test cases—design of experiments 2.00 10.00

3 Test cases—cause/effect graphing 2.00 10.00

4 Test cases and requirements 2.00 10.00

5 Risk-based test case design 2.00 10.00

6 Analysis of gaps and errors in test case designs 2.00 10.00

7 Cyclomatic complexity and test coverage 2.00 10.00

8 Test library control 2.00 10.00


9 Security testing overview 2.00 10.00

10 Advanced security testing 3.00 10.00

11 Test schedule estimating 1.00 10.00

12 Software defect potential estimating 1.00 10.00

13 DRE measurement 1.00 10.00

14 Software build planning and control 1.00 10.00

15 Big data test design 2.00 10.00

16 Cloud test design 2.00 10.00

17 Removal of incorrect test cases 1.00 10.00

18 Test coverage analysis 1.00 9.50

19 Identifying error-prone modules (EPM) 2.00 9.50

20 Database test design 1.00 9.50

21 Test case conflicts and errors 1.00 9.25

22 Static analysis and testing 1.00 9.00

23 Reducing bad-fix injections 1.00 9.00

24 Basic black box testing 1.00 9.00

25 Basic white box testing 1.00 9.00

26 Basic gray box testing 1.00 9.00

27 Fundamentals of risk-based testing 1.00 9.00

28 Fundamentals of unit testing 1.00 9.00

29 Fundamentals of regression testing 1.00 9.00

30 Fundamentals of component testing 1.00 9.00

31 Fundamentals of stress testing 1.00 9.00

32 Fundamentals of virus testing 2.00 9.00

33 Fundamentals of lab testing 1.00 9.00

34 Fundamentals of system testing 2.00 9.00


35 Fundamentals of external beta testing 1.00 9.00

36 Fundamentals of acceptance testing 1.00 9.00

37 Testing web applications 1.00 9.00

38 Tools: Automated testing 2.00 9.00

39 Tools: Test case design 1.00 9.00

40 Tools: Test library control 1.00 9.00

41 Tools: Defect tracking 1.00 9.00

42 Tools: Complexity analysis 0.50 9.00

43 Tools: Test coverage analysis 0.50 9.00

44 Fundamentals of reusable test materials 1.00 9.00

45 Testing cloud, SOA, and SaaS 2.00 8.80

46 Testing COTS application packages 1.00 8.75

47 Testing ERP applications 1.00 8.75

48 Testing reusable functions 1.00 8.75

49 Supply chain testing 1.00 8.50

50 Function points for test measures 1.00 7.00

TOTAL 67.00 9.31

In today’s world, software quality assurance has an expanding role in cyber


defenses and cyber attack recovery. Software quality assurance personnel need
much more knowledge on security topics in 2016 than they did 30 years ago in
1986.
In 2016, software testing has become a barrier to cyber attacks, so special attention
is needed for testing for software security flaws.
Between software quality assurance training and software test personnel train-
ing there is a need to expand on both university curricula and the limited curricula
from software quality companies, neither of which are fully adequate as of 2016.
If you wanted to acquire actual supporting tools for these 56 quality topics,
you would probably need to go to at least 15 commercial quality companies, static
analysis companies, and test tool companies, and another half dozen open-source
quality groups.

Nobody in 2016 sells all the tools that are needed to control software quality!
Most quality tool vendors do not even know about effective quality tools other
than the ones they sell. There are no software quality companies in 2016 that have
the depth and breadth of medical companies such as McKesson or Johnson  &
Johnson.
The static analysis companies only sell static analysis; the testing companies
only sell test tools; to get quality metrics and measurement tools you need additional
vendors; to get ordinary defect-tracking tools you need still other vendors; to
get quality benchmark data you need another set of vendors; and to get software quality
predictions via commercial estimating tools you need yet another set of vendors.
No known company as of 2016 covers the full spectrum of software quality
tools, technologies, topics, and effective quality methods, although a few large
companies such as IBM, Microsoft, and Hewlett Packard may sell perhaps 12 to
15 software quality tools out of the set of 56.
Of course no pharmaceutical company sells medicines for all diseases and no
physicians can treat all medical conditions, but physicians at least learn about
almost all common medical conditions as a basic part of their education. There are
also specialists who can deal with uncommon medical conditions.
Medicine has the Index Medicus that provides an overall description of the use
of thousands of prescription drugs, their side effects, and dosage. There is no exact
equivalent to the Index Medicus for software bugs and their treatment, but the
closest is probably Capers Jones’ and Olivier Bonsignour’s book on The Economics
of Software Quality published in 2012.
Medicine also has many wide-ranging books such as Control of Communicable
Diseases in Man, published by the U.S. Surgeon General’s office, which show the
scope of common infectious diseases such as polio and smallpox as well as their
known treatments. Software has nothing like the breadth and depth of the medical
literature.
A book recommended by the author to all clients and colleagues is The Social
Transformation of American Medicine (1982) by Paul Starr. This book won a Pulitzer
Prize in 1984. It also won the Bancroft Prize in 1984. This book provides an excel-
lent guide to how medicine was transformed from a poorly educated craft into one
of the top learned professions in world history.
Surprisingly at one time about 150 years ago, medicine was even more chaotic
than software is today. Medical schools did not require college degrees or even
high-school graduation to enter. Medical students never entered hospitals during
training because the hospitals used private medical staff. There was no monitoring
of medical malpractice, and quacks could become physicians. There were no medi-
cal licenses or board certifications.
There was no formal evaluation of prescription drugs before release, and harm-
ful substances such as opium could be freely prescribed. (A Sears–Roebuck catalog
in the 1890s offered liquid opium as a balm for quieting noisy children. This
product was available without prescription.)

Paul Starr’s excellent book The Social Transformation of American Medicine


shows how the American Medical Association (AMA) transformed itself and also
medical practices to improve medical education and introduce medical licenses and
board certifications.
This book by Paul Starr provides a full guide for the set of steps needed by the
software industry in order to become a true profession.
One of the interesting methods used by the AMA was reciprocal membership
with all state medical societies. This had the effect of raising AMA membership
from below 800 to more than 80,000, which finally gave physicians enough political
clout to lobby for medical licenses. It would be interesting if the IEEE, SIM, ACM,
and other software professional organizations also had reciprocal memberships
instead of more or less competing against one another.
Poor software quality is a sociological problem as well as a technological prob-
lem. Starr’s book showed how medicine gradually improved both the sociology of
medical practice and the underlying technology of medical offices and hospitals
over about a 50 year period.
With Starr’s book as a guide, software engineering might be able to accomplish
the same results in less than 25 years instead of the 50 years required to profes-
sionalize medicine.
The three major problems facing the software industry in 2016 are the following:

1. Software has poor quality control due to lack of knowledge of effective soft-
ware quality techniques.
2. Software has embarrassingly poor education on software quality due to lack
of empirical data.
3. Software has embarrassingly bad and incomplete quality data due to the use
of ineffective and hazardous metrics such as cost per defect, combined with the
failure to use effective metrics such as function points and DRE.

The sad thing about poor software quality is that all three of these problems are
treatable conditions that could be eliminated in less than 10 years if one or more
major software companies became proactive in 1) Effective metrics and measures,
2) Fact-based education with quantitative data, and 3) Expanded quality control
that encompassed effective quality measures, effective defect prevention, effective
pretest defect removal, and effective formal testing.

Lack of Empirical Data for ISO Quality Standards


There are a number of ISO quality standards, and also IEEE and other profes-
sional standards that deal in software quality topics. Unfortunately as of 2016,
there are no empirical data that prove that adherence to ISO quality standards

actually makes a tangible improvement in either lowering software defect potentials


or raising defect removal efficiency levels.
The standards do contain useful checklists and are sometimes required as a
basis for business contracts. However, it would be helpful if it could be proven that
adherence to ISO quality standards, risk standards, or any of the other ISO, ANSI
standards, or IEEE standards generated tangible improvements.
There are dozens of standards organizations and hundreds of individual stan-
dards such as ISO 9000, ISO 9001, ISO 9126, ISO/IEC 25010, and many more. It is not
the fault of the standards groups themselves that there are no solid data on efficacy.
The root cause is poor software measurement practices and an overall industrial
lack of quantification on all software quality topics.
Some of the ISO/IEC standards such as ISO/IEC 20926 on functional size
measurements were created by experts, and contain quite a bit of useful data,
plus narrowing down the ranges of counts by means of lists of standard features.
Standards serve a useful purpose in lowering the bandwidth of potential varia-
tions. However, empirical data on efficacy would be useful and especially so for
software where technical progress resembles a drunkard’s walk due to lack of
empirical data on all methods and tools, combined with chronically bad mea-
surement practices.

Poor Test Case Design


Most test cases are designed casually. Mathematical test case design using design
of experiments or cause–effect graphing can raise overall test coverage >90% while
reducing total test case volumes and lowering testing costs. Test case design and
execution are best when carried out by certified test personnel. However, uncerti-
fied amateurs are far and away most common. Certified test personnel do about
15% of U.S. software testing mainly in defense and systems software; uncertified
amateurs do the other 85% of test case designs.

Best Software Quality Metrics


The two best metrics for measuring quality are defect potentials measured in func-
tion points and defect removal efficiency or DRE. The current U.S. averages are
defect potentials of about 4.25 per function point and DRE of about 92.5%.
Defect potentials were developed by IBM circa 1970 and comprise all bugs
likely to be found in requirements, architecture, design, code, and bad fixes of bugs
in bug repairs. Defect removal efficiency, also developed by IBM, is the percentage
of bugs found and fixed prior to release of the software to clients. DRE includes
inspection, static analysis, and all forms of testing.

Defect potentials and DRE data were based on the large IBM collection of
quality data. However, once the quality data had been analyzed, it was then possible
to predict defect potentials and DRE for future projects via parametric estimation
tools. In fact, the author developed IBM’s first quality and cost prediction tool in
about 1973 with the assistance of Dr. Charles Turk. All of the author’s commercial
estimation tools have predicted defect potentials and DRE. This adds to overall
cost estimate accuracy because the costs of finding and fixing bugs is the number
one software cost driver.
Knowledge of defect potentials and probable DRE values can lead to accurate
cost and schedule estimates that often come to within 5% of reality. (The number
two cost driver for large software projects and for all military software projects
is that of producing paper documents. However, paperwork costs are outside the
scope of the present report, although the author’s commercial estimating tools all
predict paperwork costs.)
Function points were also developed by IBM in the 1970s in part to measure
defect potentials better than lines of code (LOC) could. For that matter, function
points are also good at measuring and predicting paperwork costs, which cannot
be done with LOC metrics.
DRE is calculated by measuring all bugs found internally prior to release, and
then measuring all bugs reported by customers in the first 90 days after release. If
the development team found 950 bugs in a software application and users reported
50 bugs, then DRE is 95%. As can be seen, this is a very simple metric and easy to
calculate and understand.
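
A minimal sketch of that calculation, using the same numbers as the example above:

def defect_removal_efficiency(internal_defects, field_defects_90_days):
    # DRE = defects found before release / total defects found
    # (internal defects plus user-reported defects in the first 90 days)
    total = internal_defects + field_defects_90_days
    return internal_defects / total if total else 0.0

print(defect_removal_efficiency(950, 50))  # -> 0.95, that is, 95% DRE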

Worst Software Quality Metrics


The worst metrics for measuring quality are LOC and cost per defect. LOC met-
rics cannot measure noncode bugs in requirements and design that are often more
numerous than code bugs. LOC also penalizes modern high-level languages and
makes assembly language look better than Ruby. LOC metrics cannot measure or
predict paperwork costs either, and this is the number two cost driver.
Cost per defect penalizes software quality and is cheapest for the buggiest soft-
ware. The reason for this is that both metrics ignore fixed costs. For testing, the work of
test case design, test case development, and test case execution is a fixed cost not
directly related to the number of bugs in the software. Thus testing has a high level
of fixed logistical costs for building and running tests whether there are any bugs
in the software or not.
(The whole urban legend that it costs 100 times more to fix a bug after release
than it does early in development is a hoax based on poor measurement practices
that ignore fixed costs. Bug repairs stay flat throughout, if measured properly.)

Why Cost per Defect Distorts Reality


In order to understand the problems with cost per defect, you need to understand
a basic law of manufacturing productivity that has been understood by all indus-
tries except software for more than 200 years: If a manufacturing process has a high
number of fixed costs and there is a reduction in the number of units produced, the cost
per unit will go up.
As software testing proceeds through a normal test cycle, each test stage will
find fewer bugs than the stage before. But the fixed costs of writing test cases and
running them will make the cost per defect go up. Here are three small examples.

Case A: Poor Quality


Assume that a tester spent 15 hours writing test cases, 10 hours running them, and
15 hours fixing 10 bugs. The total hours spent was 40 and the total cost was $2500.
Since 10 bugs were found, the cost per defect was $250. The cost per function point
for the week of testing would be $25.

Case B: Good Quality


In this second case, assume that a tester spent 15 hours writing test cases, 10 hours
running them, and 5 hours fixing one bug, which was the only bug discovered.
However, since no other assignments were waiting, the tester worked a full
week and 40 hours were charged to the project. The total cost for the week was
still $2500, so the cost per defect has jumped to $2500. Cost per function point is
still $25.
If the 10 hours of slack time are backed out, leaving 30 hours for actual testing
and bug repairs, the cost per defect would be $2,273.50 for the single bug. Cost per
function point would be $2.37.
As an application moves through a full test cycle that includes unit test, func-
tion test, regression test, performance test, system test, and acceptance test, the
time required to write test cases and the time required to run test cases stays almost
constant, but the number of defects found steadily decreases, so cost per defect
steadily rises due to fixed costs.

Case C: Zero Defects


In this third case, assume that a tester spent 15 hours writing test cases and 10 hours
running them. No bugs or defects were discovered.
Because no defects were found, the cost per defect metric cannot be used at all.
But 25 hours of actual effort were expended writing and running test cases. If the

tester had no other assignments, he or she would still have worked a 40 hour week
and the costs would have been $2500. The cost per defect would be infinity because
there were zero defects found. Cost per function point for zero-defect software
would be $25.00.
If the 15 hours of slack time are backed out, leaving 25 hours for actual testing,
the costs would have been $1,893.75. With slack time removed, the cost per func-
tion point would be $18.38 for zero-defect software.
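
The fixed-cost effect behind Cases A through C can be shown in a few lines of code. This is only a sketch: the hourly rate, hours, bug counts, and function point size below are assumed inputs for illustration, not the exact figures used in the cases above.

def test_stage_metrics(fixed_hours, repair_hours, bugs_found, function_points, rate=75.75):
    # fixed_hours covers writing and running test cases; repair_hours covers bug fixing
    total_cost = (fixed_hours + repair_hours) * rate
    cost_per_defect = total_cost / bugs_found if bugs_found else float("inf")
    cost_per_fp = total_cost / function_points
    return round(cost_per_defect, 2), round(cost_per_fp, 2)

# Same fixed effort, fewer bugs: cost per defect explodes, cost per FP barely moves
print(test_stage_metrics(25, 15, 10, 100))   # many bugs found
print(test_stage_metrics(25, 5, 1, 100))     # only one bug found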
Time and motion studies of defect repairs do not support the aphorism that “it
costs 100 times as much to fix a bug after release as before.” Bugs typically require
between 15 minutes and 6 hours to repair regardless of where they are found. Cost
per defect penalizes quality and makes testing seem more and more expensive
as fewer and fewer bugs are found. Defect removal cost per function point is a
much better economic metric for studying cost of quality (COQ).
Table 16.7 illustrates a side-by-side comparison of cost per defect and defect
removal cost per function point for an application of 1000 function points.
As can be seen, the reduction in defects in the later test stages combined with
the fixed costs of writing and running tests causes the cost per defect to rise steeply.
Defect removal cost per function point is much better for economic analysis of
COQ.
Software quality measures are so poor in 2016 that most companies do not
record the separate expense elements of writing test cases, running test cases, and
fixing defects. All three are simply lumped together, which of course causes late
defects to appear much more expensive than early defects.

Table 16.7 Cost per Defect versus Cost per Function Point (Assumes $75.75
per Staff Hour for Costs; 1000 Function Points)
Test Stage           Total Costs    Number of Defects    $ per Defect    $ per FP

Unit test $20,937.50 50 $418.75 $209.38

Function test $9,575.00 20 $478.75 $95.75

Regression test $5,787.50 10 $578.75 $57.88

Performance test $3,893.75 5 $778.75 $38.94

System test $3,136.25 3 $1,045.42 $31.36

Acceptance test $2,378.75 1 $2,378.75 $23.79

AVERAGE $4,954.25 8 $1,052.08 $49.54



Be Cautious of Technical Debt


The newer technical debt metric is popular but highly ambiguous. There are no
ISO standards for technical debt and every company seems to measure it differ-
ently. It is a good metaphor, but as of 2016 not a very good metric.

The SEI CMMI Helps Defense Software Quality


The Software Engineering Institute (SEI) capability maturity model integrated
(CMMI) has proven to benefit quality for military and defense software, although
it is seldom used in the civilian sector. Based on a study by the author funded by
the Air Force, the defect potentials and DRE values by CMMI level are given below
in Table 16.8.
There are of course ranges and overlaps among all five CMMI levels. Also many
civilian and commercial software companies do not use the CMMI at all, yet some
have quality levels as good as CMMI level 5.

Software Cost Drivers and Poor Quality


It is useful to examine software cost drivers or the relative topics where software
expenses are deployed during software development. The natural assumption is
that coding should be the number one or largest cost driver but this is not the case.

Table 16.8 Software Quality and the SEI Capability Maturity Model
Integrated CMMI for 2,500 function points
CMMI Level     Defect Potential per    Defect Removal     Delivered Defects per    Delivered
               Function Point          Efficiency (%)     Function Point           Defects

SEI CMMI 1 4.50 87.00 0.585 1,463

SEI CMMI 2 3.85 90.00 0.385 963

SEI CMMI 3 3.00 96.00 0.120 300

SEI CMMI 4 2.50 97.50 0.063 156

SEI CMMI 5 2.25 99.00 0.023 56



For the United States as a whole, and especially for large software systems in the
10,000 function point size range, finding and fixing bugs is the number one cost
driver. The major U.S. software cost drivers circa 2016 are shown in Table 16.9.
As long as software is constructed from custom designs and manual code, it
will always be error prone and expensive. The ultimate goal of software engineering
should be to construct applications from suites of standard reusable components
that have been certified to zero-defect quality levels. If this occurs, then the future
cost drivers for software circa 2026 will have a much more cost-effective pattern
than those of 2016, as shown in Table 16.10.
A combination of increased volumes of certified reusable materials combined
with effective software quality control and better quality measures and metrics
should offer the following attractive economic advantages:

◾ Increase the volumes of certified reusable software materials (designs, code,
test cases, etc.) from <10% as of 2016 to >85% circa 2036.
◾ Software defect potentials could drop from 4.25 per function point to <1.50
per function point.
◾ DRE could increase from 92.50% up to about 99.65%.
◾ Software development effort could decrease from about 15.00 work hours per
function point down below 4.00 work hours per function point.
◾ Software development costs could drop from about $1100 per function point
down below $325 per function point.
◾ Software maintenance costs could drop from about $200 per function point
per year down below $50 per function point per year.
◾ Large systems above 10,000 function points that are canceled due to poor qual-
ity could drop from about 35% of all systems down below 2% of all systems.

Software Quality by Application Size


As manual counting of function points is slow and costly (the average rate is about
500 function points counted per day for a certified function point counter), there is
little function point data above 10,000 function points.
At a counting rate of 500 function points per day it would take 20 staff days to
count 10,000 function points and 200 staff days to count 100,000 function points.
Most companies do not want to spend so much time or money for manual function
point counts. (With the new SNAP metrics, counting costs are about 25% higher
than traditional function point counts.)
Assuming a cost of $2000 per day for a certified function point counter, the
10,000 function point application would cost $40,000 and the 100,000 function
point example would cost $400,000. (With SNAP, the counting costs might go
to $50,000 and $500,000 for the 10,000 and 100,000 function point samples,
respectively.)
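
The counting-cost arithmetic in the preceding paragraphs is straightforward; a minimal sketch, assuming the 500 function points per day counting rate, $2,000 per day cost, and 25% SNAP surcharge mentioned above:

def manual_count_cost(function_points, fp_per_day=500, cost_per_day=2000, include_snap=False):
    # Staff days and cost for a manual function point count;
    # a 25% surcharge is applied when SNAP points are also counted
    days = function_points / fp_per_day
    cost = days * cost_per_day
    if include_snap:
        cost *= 1.25
    return days, cost

print(manual_count_cost(10_000))                      # -> (20.0, 40000.0)
print(manual_count_cost(100_000, include_snap=True))  # -> (200.0, 500000.0)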
Table 16.9 U.S. Software Cost Drivers in Rank Order for 2016
1 The cost of finding and fixing bugs

2 The cost of canceled projects

3 The cost of producing paper documents

4 The cost of programming or coding

5 The cost of security flaws and cyber attacks

6 The accrued costs of schedule delays on large applications

7 The costs of functional requirements

8 The costs of nonfunctional requirements

9 The cost of requirements changes (functional and nonfunctional)

10 The costs of user effort for internal software projects

11 The cost of postrelease customer support

12 The costs of meetings and communication

13 The costs of project management

14 The costs of legacy maintenance and renovation as software ages



15 The costs of innovation and new kinds of software

16 The costs of litigation for software failures and disasters

17 The costs of training and learning (customer training)

18 The costs of avoiding or removing security flaws

19 The costs of porting to ERP packages

20 The costs of assembling reusable components

Average $ per function point = $1,250 dev; $1,400 maint; $2,650 TCO

Cyber attack $ per function point = $75 prevention; $450 recovery; $525 TCA

Table 16.10 Best Case Future U.S. Cost Drivers for 2026
1 The costs of innovation and new kinds of software
2 The costs of assembling reusable components
3 The cost of programming or coding
4 The costs of avoiding or removing security flaws
5 The costs of functional requirements
6 The costs of nonfunctional requirements
7 The cost of requirements changes (functional and nonfunctional)
8 The costs of user effort for internal software projects
9 The costs of legacy maintenance and renovation as software ages
10 The cost of postrelease customer support
11 The costs of meetings and communication
12 The costs of project management
13 The costs of porting to ERP packages
14 The costs of training and learning (customer training)
15 The cost of producing paper documents
16 The cost of finding and fixing bugs
17 The cost of security flaws and cyber attacks
18 The accrued costs of schedule delays on large applications
19 The cost of canceled projects
20 The costs of litigation for software failures and disasters
Average $ per function point = $450 dev; $350 maint; $800 TCO
Cyber attack $ per function point = $175 prevention; $150 recovery; $325 TCA

Table 16.11 Software Quality Results by Application Size 2016

Size in IFPUG        Defect        Defect Removal     Delivered
Function Points      Potentials    Efficiency (%)     Defects per FP
          1            1.25            99.66              0.004
         10            2.14            98.79              0.026
        100            2.97            95.71              0.127
      1,000            3.27            93.60              0.209
     10,000            5.20            90.45              0.497
    100,000            7.05            87.80              0.860
  1,000,000            8.00            79.65              1.628
    Average            4.27            92.24              0.479

Software quality gets worse as application size goes up: as Table 16.11 shows, defect potentials increase and DRE declines. It is technically possible to lower large system defect potentials and to raise large system defect removal efficiency, but in fact most companies circa 2016 do not know enough about software quality control to do this.
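The delivered-defect column in Table 16.11 is simply the defect potential multiplied by the share of defects that escape removal; a minimal sketch of that relationship, using three rows from the table:

# Delivered defects per FP = defect potential per FP x (1 - DRE).
# The rows below are taken from Table 16.11.

rows = [  # (size in IFPUG FP, defect potential per FP, DRE as a fraction)
    (1_000, 3.27, 0.9360),
    (10_000, 5.20, 0.9045),
    (100_000, 7.05, 0.8780),
]

for size, potential_per_fp, dre in rows:
    delivered_per_fp = potential_per_fp * (1.0 - dre)
    total_delivered = delivered_per_fp * size
    print(f"{size:>7} FP: {delivered_per_fp:.3f} delivered defects per FP "
          f"(about {total_delivered:,.0f} in total)")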
It is sometimes asked why, if small software projects have better quality than
large systems, it would not be best to decompose large systems into numerous small
programs. Unfortunately, decomposing a large system into small pieces would be
like trying to decompose an 80,000 ton cruise ship into a set of 5,000 small boats.
You do not get the same features.
Moving from custom designs and manual coding to construction from libraries
of standard reusable components would reduce large application defect potentials.
Expanded use of quality function deployment (QFD) would also be of use for large
systems to lower defect potentials.
Expanded use of static analysis, inspections, requirement models, and auto-
mated proofs would raise DRE for large systems to over 97% even for 100,000
function points.
Software has become one of the most important industries in human history.
In spite of the success of software and the way it has transformed industry, military,
and government operations, it could be much more effective if software quality had
better metrics and measurement practices, and used state-of-the-art defect removal
techniques such as inspections and static analysis as well as testing. Software would
also benefit from moving away from custom design and manual coding to con-
struction from a suite of certified reusable components.

A combination of defect potentials and DRE measures provides software engineering and quality personnel with powerful tools for predicting and measuring all forms of defect prevention and all forms of defect removal. Function points are the best metric for normalizing defect potentials because function points are the only metrics that can handle requirements, design, architecture, and other sources of noncode defects.
This chapter uses International Function Point Users Group (IFPUG) function points version 4.3. Other forms of function point metric such as COSMIC, FISMA, and NESMA would produce similar, but not identical, values to those shown here. As of 2016, there are insufficient data on SNAP metrics to show defect potentials and DRE.
Chapter 17

Variations in Pattern-Based
Early Sizing

The sizing method used by the author in the Software Risk Master (SRM) estima-
tion tool is novel for software but often used by other industries. The method is
based on pattern matching using a formal taxonomy.
The unique Namcook pattern-matching approach is based on the same method-
ology as the well-known Trulia and Zillow databases for real-estate costs and also
the Kelley Blue Book for used automobile prices. It is also used by Vision Appraisal
for municipal property tax analysis.
With the real-estate databases home buyers can find the costs, taxes, and other
information for all listed homes in all U.S. cities. They can specify patterns for
searching such as home size, lot size, number of rooms, and so on.
The main taxonomy topics used for software pattern matching in the Namcook SRM tool are shown in Table 17.1.
As the SRM pattern-matching approach is fast, it is used to size not only International Function Point Users Group (IFPUG) function points and software nonfunctional assessment process (SNAP) point metrics but also a total of 23 software size metrics (Table 17.2).
The SRM sizing method also predicts the sizes of many other subcategories of
software deliverables including

◾ Page and word sizes for 20 kinds of software documents.


◾ Test cases and test scripts for 18 kinds of software testing.
◾ Source code size for 84 programming languages and combinations of languages.
◾ Probable numbers of bugs from all sources (requirements, code, design,
and so on).


Table 17.1 Taxonomy Patterns for Application Sizing and Risk Analysis
1. Country where the software will be developed (China, United States,
Sweden, and so on)

2. Industry where the software will be developed (banks, government, defense, and so on)

3. Region or city where the software will be developed (big cities are expensive)

4. Work hours per month for the software team (varies by country and
industry)

5. Anticipated paid and unpaid monthly overtime (varies widely)

6. Experience of the management team (experts to novices)

7. Experience of the development team (experts to novices)

8. Experience of the clients (experts to novices)

9. Local average team monthly salary and burden rates, overtime premiums

10. Planned/actual start date for the project

11. Desired/actual delivery date for the project

12. Hardware platform(s) (smart phone, PC, mainframe, server, and so on)

13. Operating system(s): (Android, Linux, Unix, IOS, IBM, Windows, and so on)

14. Development methodologies that will be used (Agile, RUP, TSP, and so on)

15. CMMI level of the development group

16. Usage or no usage of SEMAT (software engineering methods and theory)

17. Programming language(s) that will be used (C#, C++, Java, SQL, and so on)

18. Nature of the project (new, enhancement, and so on)

19. Scope of the project (subprogram, program, departmental system, and so on)

20. Class of the project (internal use, open-source, commercial, and so on)

21. Type of the project (embedded, web application, client–server, and so on)

22. Reusable material percent available for the project (design, code, tests,
and so on)

23. Problem complexity ranging from very low to very high

24. Code complexity ranging from very low to very high

25. Data complexity ranging from very low to very high



Table 17.2 Metrics Supported by SRM Pattern Matching


1. IFPUG function points

2. Automated code-based

3. Automated UML-based

4. Backfired function points

5. Nonfunctional SNAP points based on SNAP rules

6. COSMIC function points

7. FISMA function points

8. NESMA function points

9. Simple function points

10. Mark II function points

11. Unadjusted function points

12. Function points light

13. Engineering function points

14. Feature points

15. Use-case points

16. Story points

17. Lines of code (logical statements)

18. Lines of code (physical lines)

19. RICE objects

20. Micro function points

21. Logical code statements

22. Physical lines of code

23. Additional metrics as published

◾ Feature growth during development and for three years after deployment.
◾ Postrelease incidents and bug reports for three years.

For the basic SRM taxonomy, all of the topics are usually known well before require-
ments. All of the questions are multiple choice questions except for start date and
compensation and burden rates. Default cost values are provided for situations where
such cost information is not known or is proprietary. This might occur if multiple
contractors are bidding on a project and they all have different cost structures.
The answers to the multiple choice questions form a pattern that is then
compared against a Namcook knowledge base of more than 26,000 software
projects. As with the real-estate databases, software projects that have identical
patterns usually have about the same size and similar results in terms of sched-
ules, staffing, risks, and effort.
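The SRM implementation itself is proprietary, but the general idea of taxonomy-based pattern matching can be illustrated with a deliberately simplified sketch; the taxonomy fields, benchmark values, and function names below are hypothetical.

from statistics import median

# Hypothetical benchmark data keyed by a (nature, scope, class, type) pattern;
# a real benchmark repository would hold many projects per pattern plus
# schedule, staffing, and risk data, not just sizes.
BENCHMARK_SIZES = {
    ("new", "new system", "military contract", "trusted system"): [9_500, 10_200, 10_800],
    ("new", "disposable prototype", "personal program", "nonprocedural"): [110, 125, 140],
}

def early_size_estimate(nature, scope, project_class, project_type):
    """Median size (IFPUG FP) of benchmark projects sharing the same taxonomy pattern."""
    sizes = BENCHMARK_SIZES.get((nature, scope, project_class, project_type))
    return median(sizes) if sizes else None

print(early_size_estimate("new", "new system", "military contract", "trusted system"))  # 10200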
Sizing via pattern matching can be used prior to requirements and therefore
perhaps six months earlier than most other sizing methods. The method is also very
quick and usually takes less than 5 minutes per project. With experience, the time
required can drop down to less than 2 minutes per project.
The pattern-matching approach is very useful for large applications >10,000
function points where manual sizing might take weeks or even months. With
pattern matching, the actual size of the application does not affect the speed of
the result and even massive applications in excess of 100,000 function points can
be sized in a few minutes or less.
SNAP is an emerging metric, but not yet old enough to have accumulated
substantial empirical data on the factors that cause SNAP values to go up or come
down. Function point metrics, on the other hand, have over 40 years of accumu-
lated historical data and formal benchmarks for over 75,000 projects.
In comparing one software project against another, it is important to know
exactly what kinds of software applications are being compared. This is not as easy
as it sounds. The industry lacks a standard taxonomy of software projects that can
be used to identify projects in a clear and unambiguous fashion.
Since 1984 the author has been using a multipart taxonomy for classifying
projects. The major segments of this taxonomy include nature, scope, class, type,
and complexity.
(Note that this taxonomy was originally developed for measuring historical proj-
ects. However, from analysis of the data collected, it was noted that applications with
the same patterns on the taxonomy were of about the same size and had similar cost
and schedule distributions. This discovery led to the current sizing method based
on pattern matching embedded in the author’s Software Risk Master™ (SRM) tool.)
Following are samples of the basic definitions of the author's taxonomy as used in the SRM tool:

PROJECT NATURE:

1. New program development


2. Enhancement (new functions added to existing software)
3. Maintenance (defect repair to existing software)
4. Conversion or adaptation (migration to new platform)
5. Reengineering (reimplementing a legacy application)
6. Package modification (revising purchased software)

PROJECT SCOPE:

1. Subroutine
2. Module
3. Reusable module
4. Disposable prototype
5. Evolutionary prototype
6. Standalone program
7. Component of a system
8. Release of a system (other than the initial release)
9. New system (initial release)
10. Enterprise system (linked integrated systems)

PROJECT CLASS:

1. Personal program, for private use


2. Personal program, to be used by others
3. Academic program, developed in an academic environment
4. Internal program, for use at a single location
5. Internal program, for use at multiple locations
6. Internal program, for use on an intranet
7. Internal program, developed by external contractor
8. Internal program, with functions used via time sharing
9. Internal program, using military specifications
10. External program, to be put in public domain
11. External program, to be placed on the Internet
12. External program, leased to users
13. External program, bundled with hardware
14. External program, unbundled and marketed commercially
15. External program, developed under commercial contract
16. External program, developed under government contract
17. External program, developed under military contract

PROJECT TYPE:

1. Nonprocedural (generated, query, and spreadsheet)


2. World Wide Web application
3. Batch applications program
4. Interactive applications program
5. Interactive GUI applications program
6. Batch database applications program
7. Interactive database applications program
8. Client/server applications program
9. Scientific or mathematical program


10. Systems or support program including middleware
11. Communications or telecommunications program
12. Process-control program
13. Trusted system
14. Embedded or real-time program
15. Graphics, animation, or image-processing program
16. Multimedia program
17. Robotics or mechanical automation program
18. Artificial intelligence program
19. Neural net program
20. Hybrid project (multiple types)

PROBLEM COMPLEXITY:

1. No calculations or only simple algorithms


2. Majority of simple algorithms and simple calculations
3. Majority of simple algorithms plus a few of average complexity
4. Algorithms and calculations of both simple and average complexity
5. Algorithms and calculations of average complexity
6. A few difficult algorithms mixed with average and simple
7. More difficult algorithms than average or simple
8. A large majority of difficult and complex algorithms
9. Difficult algorithms and some that are extremely complex
10. All algorithms and calculations are extremely complex

CODE COMPLEXITY:

1. Most programming done with buttons or pull down controls


2. Simple nonprocedural code (generated, database, and spreadsheet)
3. Simple plus average nonprocedural code
4. Built with program skeletons and reusable modules
5. Average structure with small modules and simple paths
6. Well structured, but some complex paths or modules
7. Some complex modules, paths, and links between segments
8. Above average complexity, paths, and links between segments
9. Majority of paths and modules are large and complex
10. Extremely complex structure with difficult links and large modules

DATA COMPLEXITY:

1. No permanent data or files required by application


2. Only one simple file required, with few data interactions
3. One or two files, simple data, and little complexity


4. Several data elements, but simple data relationships
5. Multiple files and data interactions of normal complexity
6. Multiple files with some complex data elements and interactions
7. Multiple files, complex data elements, and data interactions
8. Multiple files, majority of complex data elements, and interactions
9. Multiple files, complex data elements, many data interactions
10. Numerous complex files, data elements, and complex interactions

In addition, the author also uses codes for countries (telephone codes work for this purpose, as do ISO country codes), for industries (Department of Commerce North American Industry Classification [NAIC] codes are used), and for geographic regions (Census Bureau state codes work in the United States; five-digit zip codes or telephone area codes could also be used).
Why such a complex multilayer taxonomy is necessary can be demonstrated
by a thought experiment of comparing the productivity rates of two unlike appli-
cations with widely differing taxonomy patterns. Suppose the two applications
have the following taxonomy aspects as shown in Table 17.3.
As shown by Table 17.3, the productivity rate of Application A would probably be 25.00 function points per staff month. The productivity rate of Application B would probably be only 5.00 function points per staff month.
The total amount of effort devoted to Project B exceeded the effort devoted to Project A by more than 400 to 1.
The cost per function point for Application A was only $400, whereas the cost per function point for Application B was $2,000.
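These productivity and unit-cost figures are straightforward ratios; the sketch below reproduces them from the raw values in Table 17.3, assuming the roughly 132 work hours per month implied by the table's hours-per-function-point figures (an assumption, not a stated constant).

def productivity_and_unit_cost(function_points, months_of_effort, dev_cost,
                               work_hours_per_month=132):
    """Return (FP per staff month, work hours per FP, $ per FP)."""
    fp_per_month = function_points / months_of_effort
    hours_per_fp = (months_of_effort * work_hours_per_month) / function_points
    cost_per_fp = dev_cost / function_points
    return fp_per_month, hours_per_fp, cost_per_fp

print(productivity_and_unit_cost(125, 5.0, 50_000))             # Application A
print(productivity_and_unit_cost(10_000, 2_000.0, 20_000_000))  # Application B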
These two examples are from the same country, and same geographic region,
but from different industry segments and for very different types and sizes of
applications.
If one of the projects were done in China or India, the ranges would be even
broader by another 200% or so. If a high-cost country such as Switzerland was one
of the locations, the costs would swing upward.
Does that mean that the technologies, tools, or skills on Project A are superior
to those used on Project B? It does not—it simply means two very different kinds of
software projects are being compared, and great caution must be used to keep from
drawing incorrect conclusions.
In particular, software tool and methodology vendors should exercise more
care when developing their marketing claims, many of which appear to be derived
exclusively from comparisons of unlike projects in terms of the nature, scope, class,
type, and size parameters.
Several times vendors have made claims of 10 to 1 improvements in productivity
rates as a result of using their tools. Invariably these claims are false and are based
on comparison of unlike projects. The most common error is that of comparing
Table 17.3 Comparative Results of Widely Different Taxonomy Patterns

Parameter             Application A: Prototype         Application B: System
Country code          1 = United States                1 = United States
Region code           06 = California                  06 = California
Industry code         1569 = telecommunications        2235 = National Security
Nature                1 = New                          1 = New
Scope                 4 = Disposable prototype         9 = New system
Class                 1 = Personal program             17 = Military contract
Type                  1 = Nonprocedural                13 = Trusted system
Problem complexity    1 = Simple                       9 = Complex algorithms
Code complexity       2 = Nonprocedural code           10 = Highly complex code
Data complexity       2 = No files needed              10 = Many complex files
Application size      125 function points              10,000 function points
Application size      15 SNAP points                   2,000 SNAP points
Months of effort      5.00                             2,000.00
Productivity          25.00 function points/month      5.00 function points/month
Productivity          5.28 work hours/function point   26 work hours/function point
Defect potentials     2.50 bugs per function point     5.50 bugs per function point
Defect removal %      98.5%                            91.5%
Delivered defects     5                                4,675
High severity         0                                748
Security flaws        0                                53
Bugs/function point   0.038                            0.470
$ for development     $50,000                          $20,000,000
$ per function point  $400                             $2,000

a very small project against a very large system. Another common error is to compare
only a subset of activities such as coding against a complete project measured from
requirement to delivery.
Only in the past week, the author received a question from a webinar attendee
about object-oriented productivity rates. The questioner said that his group using
object-oriented (OO) programming languages was more than 10 times as produc-
tive as the data cited in the webinar. But the data the questioner was using consisted
only of code development! The data in the webinar ran from requirements to deliv-
ery and also included project management. It was an apples-to-oranges comparison of coding only versus a full set of activities and occupation groups.
Unfortunately, not using a standard taxonomy and not clearly identifying the
activities that are included in the data are the norm for software measurements
circa 2016.
Chapter 18

Gaps and Errors in When Projects Start.
When Do They End?

When a major software project starts is the single most ambiguous point in the
entire life cycle. For many projects, there can be weeks or even months of informal
discussions and preliminary requirements gathering before it is decided that the
application looks feasible.
(If the application does not look feasible and no project results, there may still
be substantial resources expended that it would be interesting to know about.)
Even when it is decided to go forward with the project that does not automati-
cally imply that the decision was reached on a particular date, which can be used to
mark the commencement of billable or formal work.
So far as can be determined, there are no standards or even any reasonable
guidelines for determining the exact starting points of software projects. The
methodology used by the author to determine project starting dates is admittedly crude: he asks the senior project manager for his or her opinion as to when the project began, and uses that point unless there is a better source.
Sometimes a formal request for proposal (RFP) exists, and also the responses to
the request. For contract projects, the date of the signing of the contract may serve
as the starting point. However, for the great majority of systems and MIS applica-
tions, the exact starting point is clouded in uncertainty and ambiguity.
Although the end date of a software project is less ambiguous than the start
date, there are still many variances. The author uses the delivery of the software
to the first actual customer as the termination date or the end of the project.


Although this works for commercial packages such as Microsoft Office, it does
not work for major applications such as ERP packages where the software cannot
be used on the day of delivery, but instead has to be installed and customized for
each specific client. In this situation, the actual end of the project would be the
date of the initial usage of the application for business purposes. In the case of
major ERP packages such as SAP and Oracle, successfully using the software can
be more than six calendar months after the software was delivered and installed
on the customer’s computers due to extensive customization and migration from
legacy applications.
The bottom line is that the software industry as of 2016 does not have any
standard method or unambiguous methods for measuring either the start or end
dates of major software projects. As a result, historical data on schedules are highly
suspect.
The nominal start point for software projects suggested by the author is the day
that formal requirements begin. The nominal end point suggested by the author
for software projects is the day the first actual client receives the final version (not
the Beta test version but the actual final version).
Also confusing are gaps and errors caused by overlapping schedules of various
development activities.
As software project schedules are among the most critical of all software
project factors, one might think that methods for measuring schedules would
be fully matured after some 60 years of trial and error. There is certainly no
shortage of project scheduling tools, many of which can record historical data
as well as plan unfinished projects. For example, Microsoft Project, Computer
Aid’s Automated Project Office (APO), and the Jira tool can also measure soft-
ware schedules.
However, the measurement of original schedules, slippages to those schedules,
milestone completions, missed milestones, and the overlaps among partially con-
current activities is still a difficult and ambiguous undertaking.
One of the fundamental problems is the tendency of software projects to keep
only the current values of schedule information, rather than a continuous record
of events. For example, suppose the design of a project was scheduled to begin
in January and end in June. In May it becomes apparent that June is unlikely, so
July becomes the new target. In June, new requirements are levied, so the design
stretches to August when it is nominally complete. Unfortunately, the original date
(June in this example) and the intermediate date (July in this example) are often
lost. Each time the plan of record is updated, the new date replaces the former date,
which then disappears from view.
It would be very useful and highly desirable to keep track of each change to a
schedule, why the schedule changed, and what were the prior schedules for com-
pleting the same event or milestone.
Another ambiguity of software measurement is the lack of any universal agree-
ment as to what constitutes the major milestones of software projects. A large
minority of projects tend to use completion of requirements, completion of design, completion of coding, and completion of testing as major milestones. However, many
other activities are on the critical path for completing software projects, and there
may not be any formal milestones for these: examples include completion of user
documentation and completion of patent search.
It is widely known that those of us in software normally commence the next
activity in a sequence long before the current activity is truly resolved and com-
pleted. Thus design usually starts before requirements are firm, coding starts before
design is complete, and testing starts long before coding is completed. The classic
Waterfall model that assumes that activities flow from one to another in sequential
order is actually a fiction, which has almost never occurred in real life.
When an activity such as design starts before a predecessor activity such as
requirements are finished, the author uses the term and metric called overlap to
capture this phenomenon. For example, if the requirements for a project took four
months, but design started on month three, then we would say that design over-
lapped requirements by 25%. As may be seen, overlap is defined as the amount of
calendar time still remaining in an unfinished activity when a nominal successor
activity begins.
The amount of overlap or concurrency among related activities should be a
standard measurement practice, but in fact is often omitted or ignored. This is
a truly critical omission, because there are no ways to use historical data for accu-
rate schedule prediction of future projects if this information is missing. (The
average overlap on projects assessed by the author is about 25%, but the range
goes up to 100%.)
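The overlap calculation itself is simple; a minimal sketch using elapsed months from project start rather than calendar dates:

def overlap_percent(predecessor_start, predecessor_end, successor_start):
    """Overlap = calendar time left in the predecessor activity when the successor
    begins, as a percentage of the predecessor's duration (values in elapsed months)."""
    duration = predecessor_end - predecessor_start
    remaining = max(0.0, predecessor_end - successor_start)
    return 100.0 * remaining / duration

# Requirements run from month 0 to month 4; design starts at month 3.
print(overlap_percent(0, 4, 3))  # -> 25.0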
A very complex part of software cost estimation is that of dealing with sched-
ule overlap, or the amount of work in a prior activity that is unfinished when the
current activity begins. Overlap is hard to understand from words, but easy to
understand from a Gantt chart (Figure 18.1).
In this Gantt chart, the schedules for the individual technical activities total to
39 calendar months, but the elapsed schedule is only 12 months. This is because
every technical activity starts before the previous activity is finished.
A modern software planning tool or a modern software cost estimation tool can
provide very useful and more aesthetic Gantt charts (Figure 18.2).

Requirements ******
Design *********
Coding *********
Testing *********
Documentation *******
Project management *************************

Calendar months 1 2 3 4 5 6 7 8 9 10 11 12

Figure 18.1 Example of schedule overlaps for software projects.


Figure 18.2 Expanded sample of a Gantt chart.

Simplistic total schedule measurements along the lines of The project started in
August of 2010 and was finished in May of 2015 are essentially worthless for serious
productivity and economic studies.
Accurate and effective schedule measurement would include the schedules of
specific activities and tasks, and also include the network of concurrent and over-
lapped activities. Further, any changes to the original schedules would be recorded
too. Table 18.1 illustrates a typical schedule slippage pattern for an application of a nominal 1,000 function points coded in Java.
As can be seen in Table 18.1, there were three slips of the original schedule: the first occurred in April and was caused by design slipping one month. The second occurred in September and was caused by coding slipping two months. The last

Table 18.1 Typical Schedule Slippage for 1,000 Function Points

Planned          Original     First        Second       Final        Slip
Completion       Schedule     Slip         Slip         Slip         Months
                 (1/5/2016)   (4/5/2016)   (9/15/2016)  (1/5/2017)
Start date       1/5/2016     1/5/2016     1/5/2016     1/5/2016     0.00
Requirements     2/5/2016     2/5/2016     2/5/2016     2/5/2016     0.00
Design           5/5/2016     6/5/2016     6/5/2016     6/5/2016     1.00
Coding           10/5/2016    11/5/2016    12/5/2016    12/5/2016    2.00
Testing          12/5/2016    1/5/2017     2/5/2017     6/5/2017     6.00
Delivery date    1/5/2017     2/5/2017     3/5/2017     7/5/2017     7.00
Elapsed months   12.00        13.00        14.00        19.00        7.00



and biggest slip occurred in January and was caused by testing slipping six months, leading to an overall slip of seven months when the original planned schedule is contrasted with the actual results.
This pattern is fairly common due to poor quality control and also due to
project management skimping or bypassing pretest inspections and static analysis.
The inevitable result is a huge and unplanned slippage in testing schedules due to
excessive defects found when testing begins.
The main point of this chapter is not the overall magnitude of schedule slippage,
but the fact that the intermediate slippage is seldom recorded and often invisible!
Without knowing the intermediate dates it is not possible to perform really
accurate statistical surveys of software schedules. At the very least, the original start
date and original planned end date should be kept so that the final delivery date can
be used to calculate net project slippage.
One of the advantages of function point measurements, as opposed to lines of
code (LOC) metrics, is that function points are capable of measuring requirements
growth. The initial function point analysis can be performed from the initial user
requirements. Later when the application is delivered to users, a final function point
analysis is performed.
By comparing the quantity of function points at the end of the requirements
phase with the volume of function points of the delivered application, it has been
found that the average rate of growth of creeping requirements is between 1%
and 3% per calendar month, and sometimes more. The largest volume of creeping
requirements noted by the author was an astonishing 289%.
The new software nonfunctional assessment process (SNAP) metric for measur-
ing nonfunctional requirements is also subject to creep, but there are insufficient
data on SNAP point creep to include it in this chapter. However, SNAP points
seem to have less creep than ordinary function points, probably due to the fact that
they are not subject to changing user requests.
Creeping requirements are one of the major reasons for cost overruns and
schedule delays. For every requirements change of 10 function points, about 10
calendar days will be added to project schedules and more than 20 hours of staff
effort. It is important to measure the quantity of creeping requirements, and it is
also important for contracts and outsource agreements to include explicit terms
for how they will be funded.
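The rules of thumb just cited can be turned into a rough, deliberately simplified estimate of creep impact; the sketch below treats the rates as linear (about 1 calendar day and 2 staff hours per added function point) and ignores compounding of the monthly growth rate.

def creep_impact(initial_fp, schedule_months, monthly_growth=0.02,
                 days_per_added_fp=1.0, hours_per_added_fp=2.0):
    """Rough estimate of requirements creep and its schedule/effort impact."""
    added_fp = initial_fp * monthly_growth * schedule_months
    return added_fp, added_fp * days_per_added_fp, added_fp * hours_per_added_fp

fp, days, hours = creep_impact(initial_fp=1_000, schedule_months=12)
print(f"~{fp:.0f} FP of creep, ~{days:.0f} extra calendar days, ~{hours:.0f} staff hours")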
It is also obvious that as requirements continue to grow and change, it will be
necessary to factor in the changes, and produce new cost estimates and new schedule
plans. Yet in a majority of U.S. software projects, creeping requirements are neither
measured explicitly nor factored into project plans. This problem is so hazardous that
it really should be viewed as professional malpractice on the part of management.
There are effective methods for controlling change but these are not always used.
Joint Application Design (JAD), formal requirements inspections, and change con-
trol boards have long track records of success. Yet these methods are almost never
used in applications that end up in court for cancellation or major overruns.
One of the standard features of the author’s Software Risk Master ™ (SRM) tool
is the prediction of requirements creep. It is necessary to do this because sometimes
the magnitude of creep is about as large as the original estimate!
The author has been an expert witness in a lawsuit in Canada where the client
added 10,000 function points to an application originally sized at 10,000 function
points. More recently an arbitration case involved the costs of adding 5,000 func-
tion points to an application originally sized at 15,000 function points.
Requirements creep is a major contributor to cost and schedule overruns, and it
cannot be omitted from software cost estimates.
The SRM sizing algorithms actually create 15 size predictions. The initial
prediction is for the nominal size at the end of requirements. SRM also predicts
requirements creep and deferred functions for the initial release.
After the first release, SRM predicts application growth for a 10 year period. To
illustrate the full set of SRM size predictions, Table 18.2 shows a sample application

Table 18.2 SRM Multiyear Sizing Example (Copyright © by Capers Jones.
All Rights Reserved)

Nominal application size in IFPUG function points      10,000
SNAP points                                              1,389
Language                                                 C
Language level                                           2.50
Logical code statements                              1,280,000

                                 Function Points  SNAP Points  Logical Code
 1. Size at end of requirements       10,000         1,389       1,280,000
 2. Size of requirement creep          2,000           278         256,000
 3. Size of planned delivery          12,000         1,667       1,536,000
 4. Size of deferred features         –4,800          (667)       (614,400)
 5. Size of actual delivery            7,200         1,000         921,600
 6. Year 1 usage                      12,000         1,667       1,536,000   Kicker
 7. Year 2 usage                      13,000         1,806       1,664,000
 8. Year 3 usage                      14,000         1,945       1,792,000
 9. Year 4 usage                      17,000         2,361       2,176,000   Kicker
10. Year 5 usage                      18,000         2,500       2,304,000
11. Year 6 usage                      19,000         2,639       2,432,000
12. Year 7 usage                      20,000         2,778       2,560,000
13. Year 8 usage                      23,000         3,195       2,944,000   Kicker
14. Year 9 usage                      24,000         3,334       3,072,000
15. Year 10 usage                     25,000         3,473       3,200,000

Kicker = Extra features added to defeat competitors.
Notes: Simplified with whole numbers for clarity.
Deferred features usually due to schedule slips.

with a nominal starting size of 10,000 function points. All of the values are in
round numbers to make the patterns of growth clear.
As can be seen in Table 18.2 software applications do not have a single fixed
size, but continue to grow and change for as long as they are being used by custom-
ers or clients. Therefore productivity and quality data need to be renormalized from
time to time. Namcook suggests renormalization at the beginning of every fiscal or calendar year.
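The growth pattern in Table 18.2 can be approximated with a deliberately simplified sketch (illustrative only; this is not the SRM sizing algorithm): roughly 1,000 function points of ordinary growth per year, with larger kicker releases in selected years.

def multiyear_sizes(year_one_fp=12_000, years=10, annual_growth_fp=1_000,
                    kicker_years=(4, 8), kicker_extra_fp=2_000):
    """Illustrative growth curve: steady annual growth plus occasional kicker releases."""
    size = year_one_fp
    sizes = [(1, size)]
    for year in range(2, years + 1):
        size += annual_growth_fp
        if year in kicker_years:
            size += kicker_extra_fp
        sizes.append((year, size))
    return sizes

for year, size in multiyear_sizes():
    print(f"Year {year:2}: {size:,} function points")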
Chapter 19

Gaps and Errors in Measuring
Software Quality

A long-standing problem with measuring software quality is defining what software quality actually means. As we all know, the topic of quality is somewhat ambiguous in every industry. Definitions for quality can encompass subjective aesthetic quality and also precise quantitative units such as numbers of defects and their severity levels.
Over the years software has tried a number of alternate definitions for quality
that are not actually useful. For example, one widely used definition for software
quality has been conformance to requirements.
Requirements themselves are filled with bugs or errors that comprise about
20% of the overall defects found in software applications. Defining quality as
conformance to a major source of errors is circular reasoning and clearly invalid.
We need to include requirement errors in our definition of quality.
Another widely used definition for quality has been fitness for use. But this
definition is ambiguous and cannot be predicted before the software is released,
or even measured well after release. If an application has thousands of users, there
is no way of knowing if a piece of software is fit for all of the many possible usage
patterns.
It is obvious that a workable definition for software quality must be unambigu-
ous and capable of being predicted before release and then measured after release
and should also be quantified and not purely subjective.


Yet another definition for software quality has been a string of words ending
in …ility such as reliability and maintainability. However laudable these attributes
are, they are all ambiguous and difficult to measure. Further, they are hard to predict
before applications are built.
The quality standard ISO/IEC 9126 includes a list of words such as portability, maintainability, and reliability. It is astonishing that there is no discussion of defects or bugs in this so-called quality standard.
Worse, the ISO/IEC definitions are almost impossible to predict before develop-
ment and are not easy to measure after release, nor are they quantified. It is obvious
that an effective quality measure needs to be predictable, measurable, and quantifiable.
Reliability is predictable in terms of mean time to failure (MTTF) and mean
time between failures (MTBF). Indeed these are standard predictions from the
author’s Software Risk Master (SRM) tool.
However in real life, software reliability is inversely proportional to delivered
defects and high-severity defects. Therefore the ISO quality standards should have
included defect potentials, defect removal efficiency (DRE), and delivered defect
densities.
An effective definition for software quality that can be both predicted before
applications are built and then measured after applications are delivered is as follows:
“Software quality is the absence of defects which would either cause the application
to stop working, or cause it to produce incorrect results.”
As delivered defects impact reliability, maintainability, usability, fitness for use,
conformance to requirements, and also customer satisfaction, any effective defini-
tion of software quality must recognize the central importance of achieving low
volumes of delivered defects. Software quality is impossible without low levels of
delivered defects, no matter what definition is used.
This definition has the advantage of being applicable to all software deliverables
including requirements, architecture, design, code, documents, and even test cases.
It is also applicable to all types of software: commercial, embedded, information
systems, open-source, military, web applications, smart phone applications, and
even computer games.
If software quality focuses on the prevention or elimination of defects, there are
some effective corollary metrics that are quite useful.
Finding and fixing bugs is the number one cost driver for the entire software
industry. As bug repairs cost more than any other cost driver, these costs should
be carefully measured and analyzed as part of a quality measurement program.
Unfortunately, measuring software quality is a challenging task that has not always
been done well for software.
It is not difficult to count bugs or defects, but that is not a sufficient topic
to understand software quality and software quality economics. We also need to
know what causes the bugs, the best ways of getting rid of them, the costs of having
bugs, and also the severity and consequences of software bugs that are released to
customers.

Counting software bugs has been done for over 60 years but how this
is done is not consistent. All bugs should be counted including bugs found
privately by means of desk checks or unit tests. These private forms of bug
detection can be done with volunteers, and it is not necessary for every single
software engineer to do this.
For large applications there are likely to be hundreds or thousands of bug
reports. And there are millions of applications. This means that bug reporting
needs to include statistical analysis that can help to understand defect origins and
hopefully reduce them in the future. Therefore bug reports must be structured
to facilitate large-scale statistical studies. In today’s world of ever-increasing cyber
threats, bug reports must also be examined as possible security flaws.
The best metric for normalizing defect potentials is the function point metric.
Lines of code (LOC) cannot deal with defects found in requirements and design.
The new software nonfunctional assessment process (SNAP) metric may also be used for normalization, but as of 2016 there is not enough empirical data about defects per SNAP point to include it in this chapter.
An important and useful metric for software quality is that of defect potentials.
This metric was developed by IBM circa 1970. It is the sum total of bugs originat-
ing from all major sources of error including requirements, design, code, and other
deliverables.
Defect potentials can be predicted before projects start, and of course bugs will
be measured and reported as they are found. The current U.S. average for software
defect potentials for 2016 is given in Table 19.1.
Most of the defect origins are self-explanatory. However, the bad-fix category
needs a word of explanation. About 7% of bug repairs contain new bugs in the
repair itself. For modules with high cyclomatic complexity, bad-fix injections have
been seen to top 35% of bugs reported by clients.

Table 19.1 Average Software Defect Potentials circa 2016 for the United States

• Requirements           0.70 defects per function point
• Architecture           0.10 defects per function point
• Design                 0.95 defects per function point
• Code                   1.15 defects per function point
• Security code flaws    0.25 defects per function point
• Documents              0.45 defects per function point
• Bad fixes              0.65 defects per function point
• Total                  4.25 defects per function point
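Applying the per-function-point averages of Table 19.1 to a specific application is a matter of multiplication; a minimal sketch:

# U.S. average defect potentials from Table 19.1, in defects per function point.
DEFECTS_PER_FP = {
    "requirements": 0.70,
    "architecture": 0.10,
    "design": 0.95,
    "code": 1.15,
    "security code flaws": 0.25,
    "documents": 0.45,
    "bad fixes": 0.65,
}

def defect_potential(application_size_fp):
    """Expected defects by origin, plus the total, for an application of a given size."""
    by_origin = {origin: rate * application_size_fp
                 for origin, rate in DEFECTS_PER_FP.items()}
    return by_origin, sum(by_origin.values())

by_origin, total = defect_potential(1_000)
print(round(total))        # 4,250 defects at the 4.25 per FP average
print(by_origin["code"])   # 1,150 expected coding defects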



There are of course ranges in these averages. There are also variations in average
values based on sample size and nature. For example, systems software and web
applications do not have the same patterns of defect potentials.
Table 19.1 enumerates the origins of bugs in the application as delivered to cli-
ents. There are also bugs or defects in test cases and test plans, but these are not part
of the delivered software. However it is useful to record the following:

◾ Defects and omissions in test plans


◾ Defects and omissions in test scripts
◾ Defects and omissions in test cases

A study by IBM of regression test cases noted that about 15% of test cases either had
errors or gaps in them.
Another important topic that needs to be studied using reported defect data is
that of error-prone modules. A study by IBM about bugs reported against the oper-
ating systems, the IMS database, and other commercial software products found
that bugs are not randomly distributed, but tend to clump in a small number of
modules. For example, the IBM IMS database product had 425 modules. About
57% of customer-reported bugs were found in only 31 of these modules. IMS had
over 200 modules with zero defects that never received bug reports. Needless to
say, identifying error-prone modules (EPM) is an important aspect of software
bug reporting.
As can be seen, software bugs originate in many different sources. At IBM it
was the role of software change teams to assign an origin code to every reported
defect whether found internally or reported by customers.
It is also necessary to know the severity of these bugs or defects, and especially
so for bugs released into the field, and hence encountered by customers. The IBM
severity scale is the most widely used and it has been in use since the early 1960s
(Table 19.2).

Table 19.2 IBM Software Defect Severity Scale


• Severity 1 Software operation stops completely

• Severity 2 Major function disabled

• Severity 3 Minor function disabled or work around available

• Severity 4 Cosmetic error that does not affect software operation

• Invalid defect Bug reported against software but due to something else

• Abeyant defect Unique bug that cannot be duplicated by repair team

• Duplicate defects Multiple reports of the same defect



Table 19.3 Distribution of Bugs by Severity Levels
• Severity 1 < 2%

• Severity 2 18%

• Severity 3 35%

• Severity 4 45%

• Total 100%

Table 19.4 Distribution of Invalid and Duplicate Defect Reports


• Invalid defect reports 20% of valid defect reports

• Abeyant defects 0.01% of valid defect reports

• Duplicate defects 475% of valid defect reports

The approximate 2016 distribution of valid unique bugs by severity level after
release to customers is given in Table 19.3.
Using valid unique bugs as the basis of comparison, the approximate percentage
of invalid, abeyant, and duplicate bugs is given in Table 19.4.
For commercial software with thousands of users, every valid bug report usually
has a number of duplicate reports of the same bug, sometimes hundreds of duplicates.
Once the initial bug report is identified, all other reports of the same bug are simply stamped as duplicates and are not turned over to change teams.
For those of us who have actually worked in software maintenance and defect
repairs, invalid defects, and duplicate defects form a constant stream of bug reports.
Each of these by itself is not very expensive, but in total they add significant costs
to commercial software maintenance.
The possible record for duplicate defects was a word processor that had a bug in
the installation procedure so customers could not install the application. That bug
generated over 10,000 phone calls to the vendor on day 1 of the release, and in fact
shut down the phone system temporarily.
An example of an invalid defect happened to the author. A client who used several
estimating tools sent us a bug report, but it was not a bug in our software but rather
a bug found in a competitive product. Even though it was not our bug, we routed it
to the other company and notified the client. This took about 30 minutes of our time
for a bug that was invalid.
An example of abeyant defect was a bug report from the early days of personal
computers. An application that ran well on IBM personal computers and most
clones such as Compaq displayed unusual screen colors when running on an ITT
Xtra personal computer. Of course there were not very many of these ITT Xtra
machines and only one had installed the software that generated the unusual screen
colors. A fix was developed by recoding the screen colors while using an ITT Xtra,
but that version would not work properly on authentic IBM computers. However,
the ITT Xtra was taken off the market so the problem disappeared.
The next aspect of software quality measurement is the most complex and dif-
ficult. When a software application damages a customer, these damages are called
by the legal term consequential damages.
Usually consequential damages are financial, but other kinds of harm can
include injuries or even deaths. It is also important to measure the consequences of
software defects that are released to customers. Table 19.5 shows a scale listing of
various kinds of consequential damages.
The consequences of software defects are usually not immediately apparent
soon after release, and may not show up for months, or even years later. This means
that software defect data needs to be kept active for a significant period of time.
Once defects have been reported and turned over to the change teams, they can
be analyzed to find out why the defects slipped out to customers. In other words,
it is useful to identify an optimal defect removal strategy that can eliminate future
defects of the same type.
There are several components of an optimal defect elimination strategy as given
in Table 19.6.

Table 19.5 Software Consequential Damages Scale


• Cons 0 Defect causes multiple kinds of damage to clients

• Cons 1 Defect causes injury or death

• Cons 2 Defect degrades or damages security

• Cons 3 Defect causes major financial loss > $1,000,000

• Cons 4 Defect causes minor financial loss

• Cons 5 Defect stops operation of physical devices

• Cons 6 Defect degrades operation of physical devices

Table 19.6 Software Defect Prevention and Removal Strategies


• Strategy 1: The defect should have been prevented from occurring

• Strategy 2: The defect should have been prevented via modeling

• Strategy 3: The defect should have been found via static analysis

• Strategy 4: The defect should have been found via formal inspections

• Strategy 5: The defect should have been found by formal testing

• Strategy 6: The defect might have been found via combinations of methods

When all of these defect topics are put together, a typical software defect report from a customer of a commercial software application might look like Table 19.7 after it has been entered into a corporate defect tracking system.
Note that to improve the ease of statistical analysis across many industries the
defect report includes the North American Industry Classification (NAIC) code
of the client reporting the defect. These codes are available from the Department of
Commerce and other sources as well. The newer NAIC code is an update and
replacement for the older standard industry classification or SIC codes.

Table 19.7 Sample Format of a Software Defect Report


Date of report July 22, 2016

Client company ABC Company

Industry NAIC code 51120 software publishing

Client CMMI level CMMI 0 (not used by company)

Client location Boston, MA

Client contact Jane Doe, Quality Manager

Prior bug reports by client 3

Prior bug reports for app 27

Prior high-severity bug reports 5

Prior security flaws reported 2

Defect status (current) First known report of this defect

Duplicate status No known duplicate reports of this defect

Validity status Valid unique defect

Application with defect TestGen1 Test case generator

Related applications affected None

Module(s) with defect Language table selection module

Error-prone module (EPM)? Normal module not error-prone

Hardware platform HP server

Operating system Unix

Development method used Iterative development

Programming language used Java

Static analysis used No

Mathematical test case design No

Certified test personnel No

Test cases for this defect No

Defect description Would not generate test cases for MySQL

Defect origin Origin 3 Design problem

Defect severity Severity 2 major feature disabled

Defect consequence Consequence 3 minor financial loss

Defect security impact No security damages from this defect

Future defect prevention None identified

Future defect removal Formal design inspections (not done)

Defect routing Dispatched to change team Zeta 7/23/2016

Defect repair status Fix scheduled for August 6 release

Change team contact J. Smith, Zeta team lead
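Because the text stresses that defect reports should be structured to support large-scale statistical analysis, a sketch of such a record may be useful; the field names below loosely follow Table 19.7 but are illustrative, not a standard schema.

from dataclasses import dataclass
from datetime import date

@dataclass
class DefectReport:
    """Structured defect report; fields loosely follow Table 19.7 (illustrative only)."""
    report_date: date
    client_naic_code: str     # industry code, to support cross-industry statistics
    application: str
    module: str
    origin: str               # requirements, design, code, document, bad fix, ...
    severity: int             # 1 = operation stops ... 4 = cosmetic (IBM scale)
    validity: str             # valid, invalid, duplicate, or abeyant
    consequence: str          # e.g., "minor financial loss"
    security_impact: bool
    suggested_removal: str    # e.g., "formal design inspections"

report = DefectReport(
    report_date=date(2016, 7, 22),
    client_naic_code="51120",
    application="TestGen1",
    module="Language table selection module",
    origin="design",
    severity=2,
    validity="valid",
    consequence="minor financial loss",
    security_impact=False,
    suggested_removal="formal design inspections",
)
print(report.severity, report.origin)  # 2 design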

As can be seen, software defects are endemic problems and there are so many
of them that it is important to have high efficiency in fixing them. Of course it
would be even better to lower defect potentials and raise defect removal efficiency
(DRE) levels.
The next important metric for quality is that of DRE or the percentage of
bugs found and eliminated prior to the release of software to customers. The cur-
rent U.S. average for DRE is about 92.5% and the range is from less than 80%
up to about 99.65%.
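DRE itself is a simple ratio of defects removed before release to total defects found, with user-reported defects conventionally counted for the first 90 days after release (as discussed later in this chapter); a minimal sketch:

def defect_removal_efficiency(found_before_release, found_after_release):
    """DRE as a percentage; post-release defects are usually counted for 90 days."""
    total = found_before_release + found_after_release
    return 100.0 * found_before_release / total if total else 100.0

print(defect_removal_efficiency(925, 75))  # -> 92.5, roughly the 2016 U.S. average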
Software defect potentials can be reduced by three important techniques:

• Defect prevention (techniques such as JAD, SEMAT, QFD, and so on)
• Pretest defect removal (inspections, static analysis, pair programming, and so on)
• Test defect removal (automated, manual, white box, black box, and so on)

The term efficiency is used because each form of defect prevention and defect removal has a characteristic efficiency level, based on measuring thousands of software projects. The author's SRM tool includes quality estimates as a standard feature. Given in Table 19.8 are sample SRM quality predictions for a 1,000 function point application coded in Java and developed by an expert team at level 5 on the CMMI.
As can be seen software quality measurement is a fairly complex activity, but a
very important one. Software in 2016 has far too high a value for defect potentials
and far too low a value for DRE.

Table 19.8 Software Defect Prediction Example

Defect Potentials                 Per KLOC   Per FP   Number
Requirements defect potential        13.99     0.75      746
Design defect potential              14.58     0.78      777
Code defect potential                20.82     1.11    1,110
Document defect potential             2.61     0.14      139
Total Defect Potential               51.99     2.77    2,773

Defect Prevention   Efficiency (%)  Per KLOC  Per FP  Remainder  Bad Fixes
1 JAD                    22.50         40.29    2.15      2,149         19
2 QFD                    26.00         30.17    1.61      1,609         16
3 Prototype              20.00         24.38    1.30      1,300          9
4 Models                 62.00          9.33    0.50        498         24
Subtotal                 81.19          9.78    0.52        522         68

Pretest Removal     Efficiency (%)  Per KLOC  Per FP  Remainder  Bad Fixes
1 Desk check             25.00          7.34    0.39        391         12
2 Pair programming       14.73          6.44    0.34        344         10
3 Static analysis        59.00          2.72    0.15        145          4
4 Inspections            85.00          0.42    0.02         22          1
Subtotal                 95.57          0.43    0.02         23         27

Test Removal        Efficiency (%)  Per KLOC  Per FP  Remainder  Bad Fixes
1 Unit                    30.0          0.29    0.02         16          0
2 Function                33.0          0.20    0.01         11          1
3 Regression              12.0          0.19    0.01         10          0
4 Component               30.0          0.14    0.01          7          0
5 Performance             10.0          0.13    0.01          7          0
6 System                  34.0          0.13    0.00          5          0
7 Acceptance              15.0          0.08    0.00          4          0
Subtotal                  81.6          0.08    0.00          4          2

Defects Discovered                                        2,866         98
Defects delivered                                             4          0

QUALITY RESULTS                   Per KLOC   Per FP   Number
Defects delivered                     0.08     0.00        4
High severity (sev 1 + sev 2)         0.01     0.00        1
Security flaws                        0.01     0.00        0

High severity % (industry norm)      16.84%
High severity % (user claims)        58.93%
(Users exaggerate severity 2 defects for faster repairs)

Delivered per FP                      0.004
High severity per FP                  0.001
Security flaws per FP                 0.000
Delivered per KLOC                    0.082
High severity per KLOC                0.014
Security flaws per KLOC               0.006
Cumulative defect removal efficiency 99.84%

DRE Evaluation
>99%  excellent
>97%  very good
>95%  good
>90%  fair
<85%  poor
<80%  malpractice
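Table 19.8 chains a series of prevention and removal stages, each removing a percentage of the defects still present; the sketch below reproduces that arithmetic in simplified form, ignoring the bad-fix injections that the full SRM model adds back at each stage.

def surviving_defects(defect_potential, stage_efficiencies):
    """Defects remaining after a chain of prevention/removal stages (bad fixes ignored)."""
    remaining = float(defect_potential)
    for efficiency in stage_efficiencies:
        remaining *= (1.0 - efficiency)
    return remaining

stages = [0.225, 0.26, 0.20, 0.62,                   # prevention: JAD, QFD, prototype, models
          0.25, 0.1473, 0.59, 0.85,                  # pretest: desk check, pairs, static, inspections
          0.30, 0.33, 0.12, 0.30, 0.10, 0.34, 0.15]  # seven test stages
remaining = surviving_defects(2_773, stages)
print(round(remaining), f"{100.0 * (1 - remaining / 2_773):.2f}% DRE")  # a few defects, DRE above 99%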

It is technically possible to lower defect potentials to below 2.50 per function point and to raise DRE up to 99.00% for all projects and perhaps 99.90% for critical projects. But accomplishing these targets requires a synergistic combination of defect prevention, pretest defect removal, test defect removal, and complete and accurate quality measurements using accurate metrics such as function points.
Some additional topics in software quality are also important, such as variations
by size and type of application. Table 19.9 shows overall U.S. results in terms of
best, average, and worst case quality levels.
Defect potentials obviously vary by size, with small projects typically having
low defect potentials. Defect potentials rise faster than size increases, with large
systems above 10,000 function points having alarmingly high defect potentials.

Table 19.9 U.S. Average Ranges of Defect Potentials circa 2016
(Defects per IFPUG 4.3 Function Point)

Defect Origins    Best   Average   Worst
Requirements      0.34     0.70     1.35
Architecture      0.04     0.10     0.25
Design            0.63     0.95     1.58
Code              0.47     1.15     2.63
Security flaws    0.18     0.25     0.50
Documents         0.20     0.45     0.65
Bad fixes         0.39     0.65     1.49
TOTAL             2.25     4.25     8.45


Table 19.10 Software Defect Potentials per Function Point by Size
(Defects per IFPUG 4.3 Function Point)

Function Points    Best   Average   Worst
      1            0.60     1.15     2.55
     10            1.25     1.90     4.25
    100            1.75     3.20     6.50
  1,000            2.14     4.75     9.90
 10,000            3.38     6.50    13.30
100,000            4.35     8.00    14.19
Average            2.25     4.25     8.45

Table 19.10 shows U.S. ranges in defect potentials from small projects of 1
function point up to massive systems of 100,000 function points.
As can be seen defect potentials go up rapidly with application size. This is one
of the key reasons why large systems fail so often and also run late and over budget.
Table 19.11 shows the overall U.S. ranges in DRE by application size from a
size of 1 function point up to 100,000 function points. As can be seen DRE goes
down as size goes up.
Note that the defects discussed in this chapter include all severity levels, rang-
ing from severity 1 show stoppers down to severity 4 cosmetic errors. Obviously it is
important to measure defect severity levels as well as recording number of defects.
(The normal period for measuring DRE starts with requirements inspections
and ends 90 days after delivery of the software to its users or customers. Of course

Table 19.11 U.S. Software Average DRE Ranges by Application Size
Function Points Best (%) Average (%) Worst (%)

1 99.90 97.00 94.00

10 99.00 96.50 92.50

100 98.50 95.00 90.00

1,000 96.50 94.50 87.00

10,000 94.00 89.50 83.50

100,000 91.00 86.00 78.00

Average 95.80 92.20 86.20



(The normal period for measuring DRE starts with requirements inspections and ends 90 days after delivery of the software to its users or customers. Of course there are still latent defects in the software that would not be found in 90 days, but having a 90-day interval provides a standard benchmark for DRE. It might be thought that extending the period from 90 days to 6 or 12 months would provide more accurate results. However, updates and new releases usually come out after 90 days, so these would dilute the original defect counts.)
Latent defects found after the 90-day period can exist for years, but on average about 50% of the residual latent defects are found in each calendar year. The results vary with the number of users of the application: the more users, the faster residual latent defects are discovered. Results also vary with the nature of the software itself. Military, embedded, and systems software tends to find bugs or defects more quickly than information systems. Table 19.12 shows defect potentials for various kinds of software.
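The roughly 50% per year discovery rate for residual latent defects behaves like a geometric decay, as the following illustrative sketch shows; the starting count of 100 latent defects is an assumption chosen only for the example.

```python
# Illustrative geometric decay of latent defects after the 90-day DRE window.
def latent_defects_remaining(initial_latent, yearly_discovery_rate=0.50, years=5):
    """Latent defects still undiscovered at the end of each calendar year."""
    remaining = float(initial_latent)
    schedule = []
    for year in range(1, years + 1):
        remaining *= (1.0 - yearly_discovery_rate)
        schedule.append((year, remaining))
    return schedule

for year, left in latent_defects_remaining(100):
    print(f"End of year {year}: about {left:.1f} latent defects remain")
# Roughly half of the remaining latent defects are discovered in each additional year.
```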
Note that International Function Point Users Group (IFPUG) function points, version 4.3, are used in this chapter for expressing results.
A future edition may include the new SNAP metric for nonfunctional size, but
to date there is insufficient data about quality to show defects normalized using the
SNAP metric.
The form of defect called bad fix in Table 19.12 is that of secondary defects
accidentally present in a bug or defect repair itself.
There are large ranges in terms of both defect potentials and DRE levels. The
best in class organizations have defect potentials that are below 2.5 defects per func-
tion point coupled with DREs that top 99% across the board. Projects that are
below 3.0 defects per function point coupled with a cumulative DRE level of about
97% tend to be lower in cost and shorter in development schedules than applica-
tions with higher defect potentials and lower levels of removal efficiency.
Observations of projects that run late and have significant cost overruns show that the primary cause of these problems is an excessive quantity of defects that are neither discovered nor removed until testing starts. Such projects appear to be on schedule and within budget until testing begins. Delays and cost overruns occur when testing starts and hundreds or even thousands of latent defects are discovered. The primary schedule delays occur because test schedules far exceed their original plans.
DRE levels peak at about 99.85%. In examining data from about 26,000 soft-
ware projects over a period of 40 years, only two projects had zero defect reports in
the first year after release. This is not to say that achieving a DRE level of 100% is
impossible, but it is certainly very rare.
Organizations with defect potentials higher than 5.00 per function point coupled
with DRE levels of 85% or less can be viewed as exhibiting professional malprac-
tice. In other words, their defect prevention and defect removal methods are below
acceptable levels for professional software organizations.
Most forms of testing average only about 30% to 35% in DRE and seldom top 50%. Formal design and code inspections, on the other hand, often top 85% in DRE and average about 65%.

Table 19.12 Average Defect Potentials for Six Application Types (Data expressed in defects per function point)
Web MIS Outsource Commercial System Military Average

Requirements 0.50 0.90 1.00 1.15 1.30 1.50 1.06

Design 1.00 1.15 1.20 1.30 1.50 1.65 1.30

Code 1.25 1.25 1.40 1.50 1.65 1.20 1.38

Documents 0.30 0.60 0.50 0.70 0.70 1.20 0.67

Bad fix 0.45 0.40 0.30 0.50 0.70 0.60 0.49

TOTAL 3.50 4.30 4.40 5.15 5.85 6.15 4.89



With every form of defect removal having a comparatively low level of removal
efficiency, it is obvious that many separate forms of defect removal need to be
carried out in sequence to achieve a high level of cumulative defect removal. The
phrase cumulative defect removal refers to the total number of defects found before
the software is delivered to its customers.
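The arithmetic behind cumulative removal is simple: if each stage removes a fraction of the defects still present when it starts, cumulative DRE is one minus the product of the stage escape rates. The stage efficiencies in the sketch below are illustrative placeholders rather than values taken from Table 19.13.

```python
def cumulative_dre(stage_efficiencies):
    """Cumulative DRE when each stage removes a fraction of the defects it receives."""
    escape_rate = 1.0
    for efficiency in stage_efficiencies:
        escape_rate *= (1.0 - efficiency)
    return 1.0 - escape_rate

# Six stages that each remove only 30% to 55% of incoming defects still combine well.
stages = [0.30, 0.35, 0.55, 0.40, 0.35, 0.30]
print(f"Cumulative DRE: {cumulative_dre(stages):.2%}")  # roughly 94%
```

This is why a long series of individually modest removal activities is needed to approach 99% cumulative DRE.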
Table 19.13 shows patterns of defect prevention and defect removal for the same
six forms of software shown in Table 19.12.
Note: Other forms of defect removal exist besides the ones shown here. Text static
analysis, requirements models, and the use of cause-effect graphs for test case design
are not included. This is partly to save space and partly because these methods are
still uncommon. They appear to achieve results that range between about 50% and 80% in defect prevention and removal efficiency, but the samples are very limited.
Note that the Agile topic of pair programming is also left out. This is because the
author’s data on pair programming shows it to be expensive, slow, and less effective
than individual programmers using static analysis and testing. The literature on
pair programming is embarrassing. Most studies only compare unaided individu-
als against unaided pairs, without any discussion at all as to the use of inspections,
static analysis, mathematical test case design, certified test personnel, or other top-
ics that can improve software quality.

Measuring the Cost of Quality


The phrase cost of quality (COQ) is a misnomer because software quality itself saves money and saves time. The number one cost driver for software is the cost of finding and fixing bugs.
One might think that in a major industry where bug repairs are the top cost driver, empirical data on software quality costs would be plentiful and accurate. This is not the case. Software quality costs are not well understood, and the historical reason for this is bad metrics and bad measurement practices.
As already mentioned, cost per defect is useless for quality economic analysis because it ignores fixed costs and hence makes buggy software look cheaper than high-quality software.
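A small numeric sketch with hypothetical costs shows the distortion: when test preparation and execution are fixed costs, the application with fewer defects reports a higher cost per defect even though its total defect removal cost is far lower.

```python
# Hypothetical illustration of the fixed-cost problem with cost per defect.
def test_stage_costs(defects_found, fixed_cost=10_000.0, repair_cost_per_defect=250.0):
    """Return (total cost, cost per defect) for one test stage."""
    total = fixed_cost + defects_found * repair_cost_per_defect
    return total, total / defects_found

for label, defects in (("buggy application", 500), ("high-quality application", 50)):
    total, per_defect = test_stage_costs(defects)
    print(f"{label}: total ${total:,.0f}, cost per defect ${per_defect:,.0f}")
# The buggy application spends $135,000 but reports only $270 per defect;
# the high-quality application spends $22,500 yet reports $450 per defect.
```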
The metric LOC ignores requirements, design, architecture, and document
defects that collectively are more numerous than code defects and also more dif-
ficult and more expensive to remove. The LOC metric also penalizes modern high-
level languages and makes low-level languages look better than they are.
The new metric technical debt is an interesting metaphor but not an effective metric. It includes only the direct costs of changes and repairs due to development shortcuts and ineffective defect removal prior to release.
Table 19.13 Patterns of Defect Prevention and Removal Activities

Web (%) MIS (%) Outsource (%) Commercial (%) System (%) Military (%)

Prevention Activities

Prototypes 20.00 20.00 20.00 20.00 20.00 20.00

Six-Sigma 20.00 20.00

JAD sessions 30.00 30.00

QFD sessions 25.00

Subtotal 20.00 44.00 44.00 20.00 52.00 36.00

Pretest Removal

Desk checking 15.00 15.00 15.00 15.00 15.00 15.00

Requirements review 30.00 25.00 20.00 20.00

Design review 40.00 45.00 45.00 30.00

Document review 20.00 20.00 20.00

Code static analysis 55.00 55.00 55.00 55.00



Code inspections 50.00 60.00 40.00

Independent Verification and Validation 20.00


Correctness proofs 10.00

Usability labs 25.00

Subtotal 15.00 15.00 64.30 92.00 90.00 94.00

Testing Activities

Unit test 30.00 25.00 25.00 25.00 25.00 25.00

New function test 30.00 30.00 30.00 30.00 30.00 30.00

Regression test 20.00 20.00 20.00 20.00

Integration test 30.00 30.00 30.00 30.00 30.00 30.00

Performance test 15.00 15.00 20.00

System test 35.00 35.00 35.00 35.00 40.00 35.00

Independent test 15.00

Field test 50.00 35.00 30.00

Acceptance test 25.00 25.00 30.00



Subtotal 76.11 76.11 80.89 91.88 92.69 93.63


Overall Efficiency 87.40 88.63 96.18 99.32 99.58 99.33



As a result, technical debt only includes about 17% of the true costs of poor quality. The major omissions from the technical debt metric are as follows:

1. Projects with quality so poor that they are canceled and not released.
2. Projects with quality so poor that developers are sued by unhappy clients.
3. Consequential damages or harm caused to users by poor software quality.

To understand software quality costs and quality economics, it is necessary to collect raw cost data and then to normalize that cost data into a standard metric for statistical analysis.
The quality topics that need to be included in COQ data for understanding software quality economics are listed in Table 19.14.
As can be seen in Table 19.14, understanding software COQ requires a substantial number of data points combined with good measurement tools and practices.
Table 19.15 illustrates what needs to be done before and during testing for an
application of 1,000 function points in size coded in Java. Assume that Table 19.15
reflects a critical application that handles targeting on a cruise missile and therefore
requires a very high level of software quality.
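The size figures in the header of Table 19.15 follow from simple arithmetic on its stated assumptions (1,000 function points, 20% reuse, and Java at about 53.33 logical LOC per function point, which is consistent with the common backfiring convention of roughly 320 LOC per function point divided by the language level). The sketch below merely reproduces that arithmetic.

```python
# Reproduce the size block at the top of Table 19.15 from its stated inputs.
application_fp = 1_000
reuse_fraction = 0.20
loc_per_fp = 320 / 6.0      # language level 6; matches the table's 53.33 LOC per FP

reused_fp = int(application_fp * reuse_fraction)        # 200 reused function points
developed_fp = application_fp - reused_fp               # 800 developed function points
application_loc = round(application_fp * loc_per_fp)    # about 53,333 LOC
reused_loc = round(reused_fp * loc_per_fp)              # about 10,667 LOC
developed_loc = round(developed_fp * loc_per_fp)        # about 42,667 LOC

print(reused_fp, developed_fp, application_loc, reused_loc, developed_loc)
```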
As can be seen, measuring software quality requires a substantial amount of data because quality control involves many different kinds of defect prevention and defect removal activities, and they all need to be measured in order to understand software quality economics. The quantity of data is roughly equivalent to the amount of data used in carrying out an annual medical examination on a human patient or calculating the orbit of a possible exoplanet around a distant sun.
The application in Table 19.15 used a full suite of pretest defect removal activi-
ties including inspections and static analysis. This pattern is effective but uncom-
mon and used mainly for large commercial applications, medical devices, and large
systems software applications such as telecommunications switching systems and
major defense systems.
It is more common in the United States to skimp on pretest defect removal and
utilize only testing. Table 19.16 shows the results of the same application without
effective pretest defect removal and using only normal test stages.
These two tables show the same application and differ only in the use of pretest
inspections and static analysis. But the results are striking and show important facts
about software quality economics.
Table 19.17 is a side-by-side comparison of the results with and without a full
suite of pretest defect removal activities.
The bottom line of this comparison confirms Phil Crosby’s famous phrase that
quality is free. Not only is high quality software less expensive than poor quality
software, but development schedules are also shorter.
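The headline figures from Table 19.17 can be summarized in a few lines of arithmetic; the values below are copied from that table and the percentage saving is derived directly from them.

```python
# Headline numbers from Table 19.17 (the 1,000 FP Java example).
with_pretest = {"removal_cost": 438_483, "delivered_defects": 27, "dre": 0.9631}
testing_only = {"removal_cost": 1_030_000, "delivered_defects": 582, "dre": 0.8177}

saving = testing_only["removal_cost"] - with_pretest["removal_cost"]
print(f"Cost avoided by pretest removal: ${saving:,} "
      f"({saving / testing_only['removal_cost']:.0%} of the testing-only COQ)")
print(f"Delivered defects: {with_pretest['delivered_defects']} with pretest removal "
      f"versus {testing_only['delivered_defects']} with testing only")
```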

Table 19.14 Cost Elements for Software Cost of Quality (COQ)


1. Defect potentials

2. Defect severity levels

3. Defect origins

4. Defect consequences

5. Defect logistics and reporting costs

6. Defect prevention methods, costs

7. Pretest defect removal methods, costs

8. Test defect removal methods, costs

9. Number of test stages used for application

10. Number of test scripts and test cases for all test stages

11. Postrelease defect removal costs

12. High-severity defect removal costs

13. Distribution and release costs for defect repairs

14. Processing costs for invalid defect reports

15. Processing costs for duplicate defect reports

16. Processing costs for abeyant defect reports

17. Costs for bad fixes or new bugs in bug repairs

18. Replacement costs for error-prone modules (EPM)

19. Loss of revenues due to schedule delays caused by poor quality

20. Loss of customers due to unhappiness caused by poor quality

21. Loss of developers due to unhappiness caused by poor quality

22. Litigation costs due to poor quality that damages clients

23. Training costs for quality-related courses

24. Training costs for defect prevention methods

25. Preparation costs for defect removal activities

26. Execution costs for defect removal activities

27. Repair costs for defects detected during removal operations

28. Repair costs for defects that were not detected and hence delivered
29. Costs of bad fixes or new bugs in bug repairs
30. Costs of bad-test cases that add to test expense but not to value
31. Costs of quality support tools (static analysis, testing, and so on)
32. Costs of quality reporting tools
33. Costs of test library support tools
34. Costs of quality administrative support (clerks, help desks)
35. Software methodologies used for statistical purposes
36. Programming languages used for statistical purposes
37. Hardware platform of application for statistical purposes
38. Software platform (operating system) for statistical purposes
39. CMMI level of development group for statistical purposes
40. Compensation costs for all personnel involved in quality
41. Work hours per month for all personnel involved in quality
42. Unpaid overtime hours per month for all personnel involved in quality
43. Paid overtime hours per month for all personnel involved in quality
44. Function point size for data normalization
45. SNAP point size for data normalization
46. Logical code size for secondary data normalization
47. Start date of project for statistical purposes
48. Delivery date of project for statistical purposes
49. Current release number of project for statistical purposes
50. Number of software users or clients for statistical purposes
51. Countries of software users or clients for statistical purposes
52. Industries of software users or clients for statistical purposes
53. Geographical regions of users or clients for statistical purposes
54. Companies of users or clients for statistical purposes
55. Costs of normalization, statistical analysis, and quality reporting
56. Stabilization period: months after release to achieve zero defects
Table 19.15 Cost of Quality (COQ) Pretest and Test Defect Removal
Application size (FP) 1,000 Experienced clients

Reuse % 20.00% Experienced team

Reused function points 200 Iterative with pretest removal

Developed function points 800 Defense weapon system

Classified use

Language Java CMMI 5

Language Level 6.00 Separate test department

Uncertified test personnel

LOC per function point 53.33 IFPUG 4.3 function points

Application size (LOC) 53,333 SNAP not shown

Application size (KLOC) 53.33

Reused code 10,667

Developed code 42,667

Developed KLOC 42.67

Monthly cost: $10,000
Hourly cost: $62.50


Requirements    Design    Code    Document    TOTALS
(Defects per Function Point)

Defect Potentials per FP 0.38 0.70 1.95 0.24 3.28

Defect potentials 384 700 1,954 240 3,278

Percent of total defects 11.70% 21.36% 59.61% 7.33% 100.00%

Security Vulnerabilities 0 1 2 0 3

Pretest REMOVAL METHODS

Requirements Efficiency    Design Efficiency    Code Efficiency    Document Efficiency    Total Efficiency

1 Text static analysis 50.00% 50.00% 0.00% 50.00% 20.20%

Defects discovered 192 350 0 120 662

Bad-fix injection 6 11 0 4 20

Defects remaining 198 361 1,954 124 2,636

Team size 3.00


A Guide to Selecting Software Measures and Metrics

Schedule (months) 0.24

Defect logging, routing $180 $328 $0 $113 $621

Set up and running $350 $350 $0 $350 $1,050


Defect repairs $1,498 $2,735 $0 $939 $5,172

Repair integration/test $120 $219 $0 $75 $414

Text static cost $2,148 $3,632 $0 $1,476 $7,256

$ per function point $2.15 $3.63 $0 $1.48 $7.26

$ per KLOC $40.28 $68.10 $0 $27.68 $136.05

$ per defect $11.20 $10.37 $0 $12.29 $10.96

2 Requirements inspection 87.00% 10.00% 1.00% 8.50% 7.26%

Defects discovered 172 36 20 11 238

Bad-fix injection 5 1 1 0 7

Defects remaining 31 326 1,935 114 2,405

Team size 5.50

Schedule (months) 1.44

Defect logging, routing $2,148 $451 $244 $99 $2,942



Preparation/inspections $10,938 $9,375 $7,813 $4,688 $32,813


Defect repairs $26,856 $5,634 $1,404 $723 $34,617



Repair integration/test $5,908 $1,239 $672 $362 $8,181

Requirement inspection cost $45,850 $16,699 $10,133 $5,871 $78,553

$ per function point $45.85 $16.70 $10.13 $5.87 $78.55

$ per KLOC $859.69 $313.11 $189.99 $110.08 $1,472.86

$ per defect $266.76 $463.13 $518.62 $558.19 $330.07

3 Design inspection 40.00% 87.00% 7.00% 26.00% 19.15%

Defects discovered 12 283 135 30 461

Bad-fix injection 0 8 4 1 23

Defects remaining 19 51 1,803 85 1,958

Team size 5.50

Schedule (months) 1.99

Defect logging, routing $154 $3,541 $1,693 $277 $5,665

Preparation/inspections $9,375 $10,938 $6,250 $3,125 $29,688

Defect repairs $1,927 $44,260 $10,581 $922 $57,691

Repair integration/test $578 $8,852 $4,232 $1,384 $15,046


Design inspection cost $12,035 $67,590 $22,757 $5,708 $108,089

$ per function point $12.03 $67.59 $22.76 $5.71 $108.09

$ per KLOC $225.65 $1,267.31 $426.68 $107.02 $2,026.67

$ per defect $975.61 $238.61 $168.02 $193.37 $234.69

4 Document Inspection 5.00% 5.00% 5.00% 90.00% 8.69%

Defects discovered 1 3 90 76 170

Bad-fix injection 0 0 0 0 1

Defects remaining 18 48 1,714 9 1,788

Team size 5.00

Schedule (months) 0.46

Defect logging, routing $1 $4 $141 $72 $217.97

Preparation/inspections $2,188 $2,188 $1,333 $1,563 $7,271

Defect repairs $206 $397 $5,918 $4,776 $11,297



Repair integration/test $15 $40 $1,409 $2,388 $3,851

Document inspection costs $2,410 $2,628 $8,801 $8,797 $22,637



$ per function point $2.41 $2.63 $8.80 $8.80 $22.64

$ per KLOC $45.19 $49.28 $165.01 $164.95 $424.44

$ per defect $2,553.98 $1,034.25 $97.60 $115.14 $133.10

5 Code static analysis 2.00% 10.00% 65.00% 3.00% 57.16%

Defects discovered 0 5 1,114 0 1,119

Bad-fix injection 0 0 33 0 34

Defects remaining 18 44 633 8 703

Team size 2.67

Schedule (months) 0.75

Defect logging, routing $12 $166 $1,740 $0 $1,919

Preparation/execution $267 $267 $267 $0 $800

Defect repairs $78 $1,056 $10,442 $25 $11,601

Repair integration/test $17 $226 $5,221 $12 $5,476

Code static costs $374 $1,715 $17,670 $37 $19,797

$ per function point $0.37 $1.72 $17.67 $0.04 $19.80


$ per KLOC $7.02 $32.16 $331.31 $0.69 $371.19

$ per defect $1,043.51 $355.22 $15.86 $141.56 $17.69

6 Code Inspection 15.00% 25.00% 87.00% 30.00% 80.67%

Defects discovered 3 11 551 3 567

Bad-fix injection 0 0 17 0 17

Defects remaining 15 33 99 6 153

Team size 9.10

Schedule (months) 0.98

Defect logging, routing $9 $37 $1,894 $9 $1,949

Preparation/Inspection $6,250 $7,813 $12,000 $3,125 $29,188

Defect repairs $742 $3,747 $43,035 $238 $47,762

Repair integration/test $124 $511 $8,607 $119 $9,361

Code inspection cost $7,125 $12,108 $65,535 $3,491 $88,259



$ per function point $7.12 $12.11 $65.54 $3.49 $88.26


$ per KLOC $133.59 $227.03 $1,228.79 $65.45 $1,654.86



$ per defect $2,701.01 $1,110.70 $118.97 $1,374.56 $155.68

Pretest Summary

Pretest gross schedule 7.78

Pretest net schedule 4.67

Requirements Design Code Documents Total

Pretest efficiency

Pretest defects removed 369 667 1,855 234 3,125

Pretest efficiency % 96.08% 95.28% 94.94% 97.50% 95.34%

Defects Remaining 15 33 99 6 153

High-Severity Defects 3 6 22 1 31

Security Vulnerabilities 0 0 0 0 0

Pretest Removal Cost

Defect logging, routing $2,505 $4,527 $5,712 $569 $13,313

Prepare/execute $29,367 $30,929 $27,663 $12,850 $100,808

Defect repairs $31,309 $57,829 $71,380 $7,622 $168,140


Repair integration/test $6,762 $11,087 $20,141 $4,339 $42,329

Pretest Removal Total $69,942 $104,373 $124,895 $25,380 $324,591


Removal $ per Function Point $69.94 $104.37 $124.90 $25.38 $324.59

Removal $ per KLOC $1,311.42 $1,956.99 $2,341.79 $475.88 $6,086.07

Removal $ per defect $189.76 $156.46 $67.33 $108.34 $103.87

TEST DEFECT

REMOVAL STAGES

Defects at Test Start 15 33 99 6 153

Security Vulnerabilities 0 0 0 0 0

Predicted Test Cases 591 1,179 888 0 2,658

Per function point 0.59 1.18 0.89 0.00 2.66

Per KLOC 13.85 27.64 20.81 0.00 62.30

Requirements Efficiency    Design Efficiency    Code Efficiency    Document Efficiency    Total Efficiency

1 Unit testing 4.00% 7.00% 35.00% 10.00% 24.92%


Defects discovered 1 2 35 1 38


Bad-fix injection 0 0 1 0 1

Defects remaining 14 31 65 5 116

Test cases used 44 106 155 0 306

Test coverage achieved 7.50% 9.00% 17.50% 0.00% 22.10%

Team size 6.69

Schedule (months) 0.25

Defect logging, routing $9 $36 $324 $6 $375

Test case design/running $3,750 $3,750 $4,000 $0 $11,500

Defect repairs $188 $578 $2,270 $47 $3,083

Repair integration/test $75 $217 $1,081 $28 $1,401

Unit test cost $4,022 $4,581 $7,675 $81 $16,359

$ per function point $4.02 $4.58 $7.68 $0.08 $16.36



$ per KLOC $75.42 $85.89 $143.91 $1.51 $306.74

$ per defect $6,692.16 $1,981.18 $221.88 $134.38 $429.31

2 Function testing 5.00% 22.00% 40.00% 30.00% 30.39%


Defects discovered 1 7 26 2 35

Bad-fix injection 0 0 1 0 1

Defects remaining 14 24 40 4 82

Test cases used 207 324 244 0 775

Test coverage achieved 35.00% 27.50% 27.50% 0.00% 58.50%

Team size 6.69

Schedule (months) 0.32

Defect logging, routing $34 $318 $571 $76 $999

Test case design/running $4,375 $4,375 $4,500 $0 $13,250

Defect repairs $271 $1,693 $2,448 $152 $4,565

Repair integration/test $90 $635 $1,632 $51 $2,408

Function test cost $4,770 $7,021 $9,151 $279 $21,222

$ per function point $4.77 $7.02 $9.15 $0.28 $21.22



$ per KLOC $89.44 $131.64 $171.59 $5.24 $397.91


$ per defect $6,605.03 $1,036.51 $350.46 $171.88 $602.30



3 Regression testing 7.50% 7.50% 33.00% 15.00% 20.31%

Defects discovered 1 2 13 1 17

Bad-fix injection 0 0 0 0 0

Defects remaining 13 22 27 3 66

Test cases used 251 236 289 0 776

Test coverage achieved 42.50% 20.05% 32.50% 0.00% 61.78%

Team size 6.69

Schedule (months) 0.21

Defect logging, routing $48 $85 $288 $13 $434

Test case design/running $3,125 $3,125 $3,750 $0 $10,000

Defect repairs $419 $681 $1,236 $54 $2,390

Repair integration/test $129 $170 $824 $18 $1,141

Regression test cost $3,721 $4,062 $6,098 $85 $13,966

$ per function point $3.72 $4.06 $6.10 $0.08 $13.97

$ per KLOC $69.77 $76.15 $114.35 $1.59 $261.85


$ per defect $3,610.01 $2,236.06 $462.56 $146.88 $840.91

4 Integration testing 20.00% 27.00% 45.00% 22.00% 32.84%

Defects discovered 3 6 12 1 22

Bad-fix injection 0 0 0 0 1

Defects remaining 10 17 15 3 45

Test cases used 236 245 239 0 721

Test coverage achieved 40.00% 20.80% 26.95% 0.00% 57.04%

Team size 6.69

Schedule (months) 0.30

Defect logging, routing $319 $758 $1,146 $135 $2,358

Test case design/running $3,438 $3,438 $4,125 $0 $11,000

Defect repairs $1,115 $2,274 $1,528 $68 $4,984

Repair integration/test $319 $568 $764 $23 $1,674



Integration test cost $5,190 $7,038 $7,563 $226 $20,016


$ per function point $5.19 $7.04 $7.56 $0.23 $20.02



$ per KLOC $97.31 $131.96 $141.80 $4.23 $375.30

$ per defect $2,036.28 $1,160.68 $618.71 $312.50 $928.47

5 System testing 12.00% 18.00% 52.00% 34.00% 29.18%

Defects discovered 1 3 8 1 13

Bad-fix injection 0 0 0 0 0

Defects remaining 9 14 8 2 32

Test cases used 222 248 260 0 730

Test coverage achieved 37.50% 21.05% 29.28% 0.00% 57.09%

Team size 7.10

Schedule (months) 0.21

Defect logging, routing $77 $186 $497 $27 $788

Test case design/running $3,438 $3,438 $3,750 $0 $10,625

Defect repairs $539 $932 $746 $82 $2,300

Repair integration/test $154 $280 $746 $27 $1,207

System test cost $4,208 $4,836 $5,740 $137 $14,921


$ per function point $4.21 $4.84 $5.74 $0.14 $14.92

$ per KLOC $78.90 $90.68 $107.62 $2.57 $279.77

$ per defect $3,414.03 $1,620.90 $721.14 $156.25 $1,143.05

6 Acceptance testing 14.00% 15.00% 20.00% 24.00% 16.39%

Defects discovered 1 2 2 0 5

Bad-fix injection 0 0 0 0 0

Defects remaining 8 12 6 1 27

Test cases used 21 41 31 0 93

Test coverage achieved 3.50% 3.50% 3.50% 0.00% 6.83%

Team size 5.27

Schedule (months) 0.08

Defect logging, routing $238 $385 $284 $78 $985

Test case design/running $0 $0 $0 $0 $0



Defect repairs $953 $1,283 $332 $65 $2,632

Repair integration/test $159 $192 $142 $13 $506



Acceptance test cost $1,350 $1,860 $759 $156 $4,124

$ per function point $1.35 $1.86 $0.76 $0.16 $4.12

$ per KLOC $25.31 $34.87 $14.22 $2.92 $77.33

$ per defect $1,062.50 $906.25 $500.00 $375.00 $784.75

Test summary

Test gross schedule 6.56

Test net schedule 3.28

Test Removal Efficiency

Test Defects Removed 7 21 93 5 126

Testing Efficiency % 47.81% 64.61% 93.81% 77.87% 82.36%

Defects Remaining 8 12 6 1 27

High Severity Defects 1 2 1 0 5

Security Vulnerabilities 0 0 0 0 0

Test Removal Cost

Defect logging/routing $725 $1,768 $3,112 $335 $5,940


Test prepare/execution $18,125 $18,125 $20,125 $0 $56,375

Defect repairs $3,485 $7,441 $8,560 $468 $19,954

Repair integration/test $926 $2,063 $5,189 $160 $8,338

Test Removal Total $23,261 $29,397 $36,986 $964 $90,608

Removal $ per Function Point $23.26 $29.40 $36.99 $0.96 $90.61

Removal $ per KLOC $436.14 $551.19 $693.50 $18.07 $1,698.89

Removal $ per defect $3,238.09 $1,377.47 $398.90 $206.15 $719.56

Total Results

Total Defects Removed 376 688 1,948 239 3,251

Total Bad-fix injection 11 21 58 7 98

Cumulative Removal % 95.16% 95.51% 96.79% 96.57% 96.31%

Pretest gross schedule 7.78

Test gross schedule 6.56



Gross removal schedule 14.34


Pretest net schedule 4.67



Test net schedule 3.28

Net removal schedule 7.95

Remaining Defects 8 12 6 1 27

High-severity Defects 1 2 1 0 5

Security Vulnerabilities 0 0 0 0 0

Stabilization Period 1.24 1.82 1.25 0.26 3.38

(Calendar months to zero defects after first release of software)

Remaining Defects per Function Point 0.01 0.01 0.01 0.00 0.03
Remaining Defects per KLOC 0.15 0.22 0.11 0.02 0.51

Pretest Cost $69,942 $104,373 $124,895 $25,380 $324,591



Per function point $69.94 $104.37 $124.90 $25.38 $324.59

Per KLOC $1,311.42 $1,956.99 $2,341.79 $475.88 $6,086.07


Per defect $189.76 $156.46 $67.33 $108.34 $103.87

Testing Cost $23,261 $29,397 $36,986 $964 $90,608

Per function point $23.26 $29.40 $36.99 $0.96 $90.61

Per KLOC $436.14 $551.19 $693.50 $18.07 $1,698.89

Per defect $3,238.09 $1,377.47 $398.90 $206.15 $719.56

Prerelease Removal $ $93,203 $133,770 $161,882 $26,344 $415,198

Per function point $93.20 $133.77 $161.88 $26.34 $415.20

Per KLOC $1,747.56 $2,508.18 $3,035.28 $493.94 $7,784.96

Per defect $248.03 $194.31 $83.12 $110.26 $127.72

Postrelease Removal $ $8,088 $11,325 $3,248 $623 $23,284

Defect logging, routing $735 $1,096 $573 $125 $2,529

Defect repairs $5,882 $7,307 $1,529 $249 $14,966

Defect integration/test $1,471 $2,923 $1,146 $249 $5,789



Technical Debt $8,088 $11,325 $3,248 $623 $23,284


$ Per function point $8.09 $11.33 $3.25 $0.62 $23.28



$ per KLOC $151.65 $11.33 $60.90 $11.67 $436.58

$ per defect $21.52 $16.45 $1.67 $2.61 $7.16

TOTAL DEFECT COSTS (Cost of Quality) $101,291 $145,095 $165,130 $26,966 $438,482

$ per function point $101.29 $145.09 $165.13 $26.97 $438.48

$ per KLOC $1,899.21 $2,720.53 $3,096.19 $505.62 $8,221.54

$ per defect $261.70 $204.62 $82.31 $109.57 $130.96


Table 19.16 Cost of Quality (COQ) Testing Only
Application size (FP) 1,000 Experienced clients

Reuse (%) 20.00 Experienced team

Reused function points 200 Iterative without pre-test removal

Developed function points 800 Defense weapons system

Classified use

Language Java CMMI 3

Language Level 6.00 Separate test department

Uncertified test personnel

LOC per function point 53.33 IFPUG 4.3 function points

Application size (LOC) 53,333 SNAP not shown

Application size (KLOC) 53.33

Reused code 10,667

Developed code 42,667

Developed KLOC 42.67



Monthly cost: $10,000

Hourly cost: $62.50

Requirements    Design    Code    Document    TOTALS
(Defects per Function Point)

Defect Potentials per FP 0.38 0.70 1.95 0.24 3.28

Defect Potentials 384 700 1,954 240 3,278

Percent of Total Defects 11.70% 21.36% 59.61% 7.33% 100.00%

Security Vulnerabilities 0 1 2 0 3

Pretest Removal Methods

1. Text static analysis Not used

2. Requirements inspection Not used

3. Design inspection Not used

4. Document inspection Not used

5. Code static analysis Not used

6. Code inspection Not used

Pretest efficiency Not used



Pretest Removal Cost Not used

TEST DEFECT REMOVAL STAGES


Defects at Test Start 384 700 1,954 240 3,278

Security Vulnerabilities 0 1 2 0 3

Predicted Test Cases 591 1,179 888 0 2,658

Per Function Point 0.59 1.18 0.89 0.00 2.66

Per KLOC 13.85 27.64 20.81 0.00 62.30

Requirements Efficiency    Design Efficiency    Code Efficiency    Document Efficiency    Total Efficiency

1 Unit testing 4.00% 7.00% 34.00% 10.00% 22.96%

Defects discovered 15 49 664 24 753

Bad-fix injection 1 3 46 2 53

Defects remaining 367 648 1,243 215 2,472

Test cases used 44 106 155 – 306

Test coverage achieved 7.50% 9.00% 17.50% 0.00% 22.10%



Team size 6.69


Schedule (months) 1.65

Defect logging, routing $240 $459 $6,228 $150 $7,077

Test case design/running $3,750 $3,750 $4,000 $0 $11,500

Defect repairs $4,795 $12,252 $43,594 $1,877 $62,518

Repair integration/test $1,918 $4,595 $20,759 $1,126 $28,398

Unit test cost $10,703 $21,056 $74,580 $3,153 $109,493

$ per function point $10.70 $21.06 $74.58 $3.15 $109.49

$ per KLOC $200.68 $394.80 $1,398.38 $59.13 $2,052.99

$ per defect $697.51 $429.64 $112.27 $131.25 $145.47

2 Function testing 5.00% 18.00% 39.00% 30.00% 27.67%

Defects discovered 18 117 485 64 684



Bad-fix injection 1 8 34 5 48

Defects remaining 348 523 724 146 1,740


Test cases used 207 324 244 0 775

Test coverage achieved 35.00% 27.50% 27.50% 0.00% 58.50%

Team size 6.69

Schedule (months) 2.46

Defect logging, routing $861 $5,465 $10,604 $3,017 $19,947

Test case design/running $4,375 $4,375 $4,500 $0 $13,250

Defect repairs $6,885 $29,146 $45,447 $6,034 $87,512

Repair integration/test $2,295 $7,286 $30,298 $2,011 $41,891

Function test cost $14,416 $46,272 $90,849 $11,063 $162,600

$ per function point $14.42 $46.27 $90.85 $11.06 $162.60

$ per KLOC $270.29 $867.60 $1,703.42 $207.43 $3,048.74

$ per defect $785.17 $396.90 $187.41 $171.88 $237.69

3 Regression testing 7.50% 7.50% 30.00% 15.00% 17.49%


Gaps and Errors in Measuring Software Quality

Defects discovered 26 39 217 22 304


Bad-fix injection 2 3 15 2 21

Defects remaining 320 481 492 122 1,415

Test cases used 251 236 289 0 776

Test coverage achieved 42.50% 20.05% 32.50% 0.00% 61.78%

Team size 6.69

Schedule (months) 1.42

Defect logging, routing $1,222 $1,838 $4,753 $478 $8,292

Test case design/running $3,125 $3,125 $3,750 $0 $10,000

Defect repairs $10,589 $14,708 $27,161 $2,049 $54,507

Repair integration/test $3,258 $3,677 $13,580 $683 $21,199

Regression test cost $18,195 $23,348 $49,245 $3,210 $93,997

$ per function point $18.19 $23.35 $49.24 $3.21 $94.00



$ per KLOC $341.15 $437.78 $923.34 $60.18 $1,762.44

$ per defect $698.01 $595.30 $226.63 $146.88 $308.77


4 Integration testing 18.00% 24.00% 45.00% 22.00% 29.77%

Defects discovered 58 115 221 27 421

Bad-fix injection 4 8 15 2 29

Defects remaining 258 357 255 94 964

Test cases used 236 245 239 – 721

Test coverage achieved 40.00% 20.80% 26.95% 0.00% 57.04%

Team size 6.69

Schedule (months) 2.35

Defect logging, routing $2,697 $5,411 $4,841 $589 $13,538

Test case design/running $3,438 $3,438 $4,125 $0 $11,000

Defect repairs $25,173 $43,288 $27,663 $2,522 $98,647

Repair integration/test $7,192 $10,822 $13,832 $841 $32,687

Integration test cost $38,500 $62,958 $50,461 $3,952 $155,871



$ per function point $38.50 $62.96 $50.46 $3.95 $155.87


$ per KLOC $721.88 $1,180.46 $946.15 $74.10 $2,922.59

$ per defect $669.12 $545.40 $228.01 $146.88 $370.08

5 System testing 12.00% 18.00% 45.00% 34.00% 25.09%

Defects discovered 31 64 115 32 242

Bad-fix injection 2 5 8 2 17

Defects remaining 225 289 132 59 705

Test cases used 222 248 260 0 730

Test coverage achieved 37.50% 21.05% 29.28% 0.00% 57.09%

Team size 7.10

Schedule (months) $0

Defect logging, routing $1,452 $3,016 $2,510 $695 $7,673

Test case design/running $3,438 $3,438 $3,750 $0 $10,625

Defect repairs $13,550 $20,107 $10,758 $2,981 $47,395

Repair integration/test $3,871 $6,032 $10,758 $994 $21,655

System test cost $22,311 $32,593 $27,775 $4,670 $87,348


$ per function point $22.31 $32.59 $27.78 $4.67 $87.35

$ per KLOC $418.32 $611.12 $520.79 $87.56 $1,637.78

$ per defect $720.36 $506.55 $242.06 $146.88 $361.16

6 Acceptance testing 14.00% 15.00% 20.00% 24.00% 16.38%

Defects discovered 31 43 26 14 116

Bad-fix injection 2 3 2 1 8

Defects remaining 191 242 104 44 582

Test cases used 21 41 31 0 93

Test coverage achieved 3.50% 3.50% 3.50% 0.00% 6.83%

Team size 5.27

Schedule (months) 1.42

Defect logging, routing $1,476 $2,029 $578 $312 $4,396

Test case design/running $0 $0 $0 $0 $0



Defect repairs $23,620 $27,057 $5,784 $2,231 $58,693


Repair integration/test $3,937 $4,059 $2,479 $446 $10,921



Acceptance test cost $29,033 $33,145 $8,842 $2,989 $74,010

$ per function point $29.03 $33.15 $8.84 $2.99 $74.01

$ per KLOC $544.37 $621.48 $165.79 $56.05 $1,387.69

$ per defect $921.88 $765.63 $334.38 $209.38 $640.74

Test Summary

Test gross schedule 13.15

Test net schedule 8.60

Test Removal Efficiency

Test Defects Removed 192 458 1,850 196 2,696

Testing Efficiency % 50.14% 65.39% 94.68% 81.60% 82.25%

Defects Remaining 191 242 104 44 582

High-Severity Defects 33 46 23 6 107


◾ A Guide to Selecting Software Measures and Metrics

Security Vulnerabilities 0 0 0 0 0

Test Removal Cost


Defect logging/routing $6,727 $16,383 $24,777 $4,765 $52,653

Test prepare/execution $18,125 $18,125 $16,866.79 $0 $56,375

Defect repairs $84,613 $146,558 $160,407 $17,694 $409,272

Repair integration/test $22,472 $36,471 $91,706 $6,101 $156,749

Test Removal Total $131,937 $217,537 $293,757 $28,560 $675,049

Removal $ per Function Point $131.94 $217.54 $293.76 $28.56 $675.05

Removal $ per KLOC $2,473.82 $4,078.81 $5,507.94 $535.50 $12,657.17

Removal $ per defect $685.89 $475.15 $158.80 $145.68 $250.38

Total Results

Total Defects Removed 192 458 1,850 196 2,696

Total Bad-fix injection 13 32 129 14 189

Cumulative Removal (%) 48.36 67.73 96.44 81.60 81.77

Pretest gross schedule 0.00



Test gross schedule 13.15


Gross removal schedule 13.15



Pretest net schedule 0.00

Test net schedule 8.60

Net removal schedule 8.60

Remaining Defects 191 242 104 44 582

High-severity Defects 33 41 18 8 99

Security Vulnerabilities 0 1 1 0 2

Stabilization Period 13.62 16.26 8.62 4.54 31.36

(Calendar months to zero defects after first release of software)

Remaining Defects per Function Point 0.19 0.24 0.10 0.04 0.58

Remaining Defects per KLOC 3.59 4.54 1.95 0.83 10.91

Pretest Cost $0.00 $0.00 $0.00 $0.00 $0.00



Per function point $0.00 $0.00 $0.00 $0.00 $0.00

Per KLOC $0.00 $0.00 $0.00 $0.00 $0.00


Per defect $0.00 $0.00 $0.00 $0.00 $0.00

Testing Cost $104,254 $174,965 $177,967 $22,668 $479,875

Per function point $104.25 $174.96 $177.97 $22.67 $479.87

Per KLOC $1,954.76 $3,280.59 $3,336.89 $425.03 $8,997.66

Per defect $541.98 $382.16 $96.21 $115.63 $177.99

Prerelease Removal $ $104,254 $174,965 $177,967 $22,668 $479,875

Per function point $104.25 $174.96 $177.97 $22.67 $479.87

Per KLOC $1,954.76 $3,280.59 $3,336.89 $425.03 $8,997.66

Per defect $541.98 $382.16 $96.21 $115.63 $177.99

Postrelease Removal $

Defect logging, routing $17,930 $22,715 $9,743 $4,145 $54,533

Defect repairs $155,396 $181,718 $25,980 $8,290 $371,384

Defect integration/test $35,861 $60,573 $19,485 $8,290 $124,208


Technical Debt $209,187 $265,005 $55,208 $20,725 $550,125


$ per function point $209.19 $265.01 $55.21 $20.73 $550.13

$ per KLOC $3,922 $4,969 $1,035 $389 $10,315

$ per defect $1,087.48 $578.83 $29.84 $105.72 $204.05

TOTAL DEFECT COST (Cost of Quality) $313,440 $439,970 $233,175 $43,393 $1,030,000

$ per function point $313.44 $439.97 $233.18 $43.39 $1,030.00

$ per KLOC $5,877.01 $8,249.44 $4,372.04 $813.63 $19,312.50

$ per defect $1,522.86 $898.12 $117.80 $206.86 $357.04



Table 19.17 DRE Comparisons with and without Pretest Activities

PreTest Testing only Difference

Defect potentials 3,278 3,278 0

Pretest defect removal 3,125 0 3,125

Test defect removal 126 2,696 –2,570

Defects delivered 27 582 –555

High-severity defects delivered 5 99 –94

Security flaws delivered 0 2 –2

Stabilization months 3.38 31.26 –27.88

Test schedule months 7.95 13.55 –5.60

Defect removal efficiency (DRE) 96.31% 81.77% 14.54%

Pretest Removal Costs $324,591 $0 $324,591

Test Removal Costs $90,608 $479,875 –$389,267

Postrelease Removal Costs $23,284 $550,125 –$526,841

Total Removal Costs $438,483 $1,030,000 –$591,517

Defect removal $ per FP $438.48 $1,030.00 –$591.52

Defect removal $ per KLOC $8,221 $19,312 –$11,091

Defect removal $ per defect $131.00 $357.00 –$226.00

This kind of detailed information about software quality seems complex, and it is to a certain extent. However, the kinds of detailed data collected and used by medical doctors, chemists, physicists, statisticians, and economists are much more complex than anything used by the software industry. For some sociological reason, only the software industry fails to carry out serious data collection with valid metrics and to perform serious statistical analysis.
It is hard to see how software can grow from a semi-skilled craft to become a true engineering profession without using accurate metrics and professional-grade measurement practices.
Measuring only design, code, and unit test (DCUT) is unprofessional. Measuring only code defects, or starting quality measurement late, is unprofessional. Continuing to use flawed metrics such as LOC or cost per defect for economic analysis is unprofessional. Software needs accurate cost, schedule, and quality data based on effective measurement practices and reliable metrics such as function point metrics.
For software, the combination of function points for data normalization and related metrics such as DRE can provide a solid basis for software quality economic analysis. This combination can also show the optimal sequence of defect prevention methods, pretest defect removal methods, and test stages needed to top 99% in DRE at a lower cost than today's poor quality control and inaccurate quality measures and metrics.
Possibly the new SNAP metric for nonfunctional size will contribute to quality economic understanding as well, but as of 2016 there is little empirical quality data available for this new metric.
Chapter 20
Gaps and Errors due to Multiple Metrics without Conversion Rules

There are many science and engineering disciplines that have multiple metrics
for the same values. For example, we have nautical miles, statute miles, and
kilometers for measuring speed and distance. We have both Fahrenheit and
Celsius for measuring temperature. We have three methods for measuring the
octane ratings of gasoline. There are also three methods of evaluating consumer
credit ratings. However, other engineering and business disciplines have conver-
sion rules from one metric to another.
The software industry is unique in having more metric variants than any
other engineering discipline in history, combined with an almost total lack of
conversion rules from one metric to another. As a result, producing accurate
benchmarks of software productivity and quality is much harder than for any
other engineering field.
The author has identified five distinct variations in methods for counting lines
of code (LOC), and 20 distinct variations in counting function point metrics. There
are no standard conversion rules between any of these variants, although there are
some conversion rules between COSMIC and International Function Point Users
Group (IFPUG) function points.
Here is an example of why this situation is harmful to the industry. Suppose
you are a consultant who has been commissioned by a client to find data on the
costs and schedules of producing a certain kind of software, such as a PBX switch-
ing system.


You scan the literature and benchmark databases and discover that data exist
on 90 similar PBX projects that would seem to be an adequate sample. You would
like to perform a statistical analysis of the results for presentation to the client.
But now the problems begin when trying to do statistical analysis of the 90 PBX
data samples (Table 20.1).

Table 20.1 Variations in Software Size Measurement Practices circa 2016


1. Three were measured using LOC and counted physical lines with
comments.

2. Three were measured using LOC and counted physical lines without
comments.

3. Three were measured using lines of code and counted logical statements.

4. Three were measured using lines of code and did not state the counting
method.

5. Three were constructed from reusable objects and only counted custom
code.

6. Three were constructed from reusable objects and counted reuse + custom code.

7. Three were measured using IFPUG function point metrics without SNAP.

8. Three were measured using IFPUG function points plus SNAP points.

9. Three were measured using automated function point tools.

10. Three were measured using COSMIC function point metrics.

11. Three were measured using Full function points.

12. Three were measured using Mark II function point metrics.

13. Three were measured using FISMA function points.

14. Three were measured using NESMA function points.

15. Three were measured using unadjusted function points.

16. Three were measured using Engineering function points.

17. Three were measured using legacy data mining tools.

18. Three were measured using web-object points.

19. Three were measured using Function points light.

20. Three were measured using Gartner backfired function point metrics.

21. Three were measured using SPR backfired function point metrics.

22. Three were measured using QSM backfired function point metrics.

23. Three were measured using Feature points.

24. Three were measured using Story points.

25. Three were measured using Use-Case points.

26. Three were measured using MOOSE object-oriented metrics.

27. Three were measured using goal-question metrics with local metrics.

28. Three were measured using TSP/PSP task hours.

29. Three were measured using RTF metrics.

30. Three were measured using pattern-matching function points.

As of 2016, there are no proven and effective conversion rules between most of
these metric variations. There is no effective way of performing a statistical analysis
of results across multiple dissimilar metrics. Why the software industry has devel-
oped so many competing variants of software metrics is an unanswered sociological
question.
Another interesting sociological problem is that adherents of each of these
30 metrics claim that it is the most accurate way of measuring software. There is
no cesium atom for software accuracy and hence no effective way of determining
accuracy against a fixed and unchanging value.
Some of these metric variations make certain kinds of software seem bigger
than other metric variations. What most metric adherents believe to be accuracy is
merely getting a bigger value for their favorite types of software compared to other
metric variants.
This started with Mark II function points and has continued ever since. It cannot be overemphasized: accuracy is relative and essentially impossible to prove.
Other factors such as consistency across multiple counting personnel and ease of
use are more relevant than fanciful claims of accuracy.
Of course, for certain hazardous metrics such as cost per defect and LOC, it can be proven mathematically that they violate standard economic assumptions and hence are inaccurate. In other words, it is possible to prove inaccuracy for software metrics but not possible to prove accuracy.
The inaccuracy for cost per defect is that it penalizes quality due to the fixed
costs of defect removal. The inaccuracy of LOC is that it penalizes high-level lan-
guages due to the fixed costs of noncode work such as requirements and design.
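A hypothetical two-language comparison makes the LOC distortion concrete: if noncode work is a fixed cost per function point, a low-level language generates more code for the same functionality and therefore looks cheaper per LOC even though it costs more per function point. All of the dollar figures below are invented for illustration.

```python
# Hypothetical costs; the point is the direction of the distortion, not the values.
def project_costs(function_points, loc_per_fp, code_cost_per_loc=10.0,
                  noncode_cost_per_fp=400.0):
    loc = function_points * loc_per_fp
    total = noncode_cost_per_fp * function_points + code_cost_per_loc * loc
    return total, total / loc, total / function_points

for language, loc_per_fp in (("low-level language", 200), ("high-level language", 50)):
    total, per_loc, per_fp = project_costs(1_000, loc_per_fp)
    print(f"{language}: ${total:,.0f} total, ${per_loc:.2f} per LOC, ${per_fp:.0f} per FP")
# The low-level language reports the lower cost per LOC despite the higher total cost.
```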

For unexplained sociological reasons, the software industry has developed more
competing variations than any other industry in human history: over 3,000 pro-
gramming languages, over 60 software development methodologies, over 50 static
analysis tools, over 37 competing benchmark organizations, over 25 software project
management tools, and about 30 competing software size metrics. There are at least 5 metric variations for counting LOC and at least 20 function point variations, to say nothing of pseudo-functional metrics such as story points and use-case points. The existence of so many variations is proof that none are fully adequate; otherwise the adequate variation would have eliminated all of the others.
Developers of new versions of function point metrics almost always fail to
provide conversion rules between their new version and older standard metrics
such as IFPUG function points. This is happening right now with the software
nonfunctional assessment process (SNAP) metric, which as yet has no conversion
rules to the older function point metric. In the author’s view, it is the responsibil-
ity of the developers of new metrics to provide conversion rules to older metrics.
The existence of five separate methods for counting source code and at least
20 variations in counting function points with almost no conversion rules from
one metric to another is a professional embarrassment to the software industry.
As of 2016, the plethora of ambiguous metrics is slowing progress toward a true
economic understanding of the software industry.
However, a partial technical solution to this problem does exist. Using the high-speed pattern-matching method for function point size prediction in the Software Risk Master™ (SRM) tool, it would be possible to perform separate size estimates using each of the metric examples shown above (although this is not likely to occur and not likely to be useful if it does occur). The high-speed method embedded in the SRM tool produces size in terms of both IFPUG function points and logical code statements. In fact, SRM produces size in a total of 23 metrics, as shown in Table 20.2.
The SRM tool can also be used from before requirements start through develop-
ment and also for legacy applications. It can also be used on commercial packages,
open-source applications, and even classified applications, if they can be placed on
a standard taxonomy of nature, scope, class, type, and complexity.
Not only would this tool provide size in terms of standard IFPUG function
points, but the taxonomy that is included with the tool would facilitate large-
scale benchmark studies. After perhaps 100 applications that had used story points, use-case points, or task hours were sized, enough data would become available to perform useful statistical studies of the size ratios of all common metrics.
Tools such as SRM can be used to convert applications from one metric to
another. In fact, it is technically possible for SRM to generate application sizes in all
current metric variants. However, doing this would not add any value to software
economic analysis and the results would probably be unintelligible to most metrics
users, who only care about one metric and ignore all others.

Table 20.2 Software Risk Master Size Predictions for 2,500 Function Points

Metrics Size IFPUG (%)

1 IFPUG 4.3 2,500 100.00

2 Automated code based function points 2,675 107.00

3 Automated UML based function points 2,575 103.00

4 Backfired function points 2,375 95.00

5 COSMIC function points 2,857 114.29

6 Fast function points 2,425 97.00

7 Feature points 2,500 100.00

8 Function points 2,550 102.00

9 Full function points 2,925 117.00

10 Function points light 2,413 96.50

11 IntegraNova function points 2,725 109.00

12 Mark II function points 2,650 106.00

13 NESMA function points 2,600 104.00

14 RICE objects 11,786 471.43

15 SCCQI function points 7,571 302.86

16 Simple function points 2,438 102.63

17 SNAP nonfunctional size metrics 455 18.18

18 SRM pattern-matching function points 2,500 100.00

19 Story points 1,389 55.56

20 Unadjusted function points 2,225 89.00

21 Use-Case points 833 33.33

Source Code LOC per FP

22 Logical code statements 133,325 53.33

23 Physical LOC (with blanks, comments) 538,854 215.54
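If the percentages in Table 20.2 are treated as rough ratios relative to IFPUG 4.3, a simple lookup can translate an IFPUG size into approximate equivalents in a few of the other metrics. The sketch below is only an illustration of that idea for this particular 2,500 function point example; it is not a feature of SRM, and the ratios should not be assumed to hold for other sizes or application types.

```python
# Illustrative only: Table 20.2 percentages treated as rough ratios to IFPUG 4.3.
RATIO_TO_IFPUG = {
    "COSMIC function points": 1.1429,
    "Mark II function points": 1.06,
    "NESMA function points": 1.04,
    "Story points": 0.5556,
    "Use-Case points": 0.3333,
    "Logical code statements": 53.33,   # the LOC per FP column for this example
}

def approximate_sizes(ifpug_size):
    """Return rough equivalents of an IFPUG size in other metrics."""
    return {metric: round(ifpug_size * ratio) for metric, ratio in RATIO_TO_IFPUG.items()}

for metric, size in approximate_sizes(2_500).items():
    print(f"{metric}: ~{size:,}")
```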

It is obvious to those of us who collect benchmark data that, as of 2016, the software industry has far too many metrics but not very much actual useful data in any of them.

It is an interesting sociological observation that measurements tend to change human behavior. Therefore it is important to select measurements that will cause behavioral changes in positive and beneficial directions.
Measuring defect potentials and defect removal efficiency (DRE) levels has been noted to drive very beneficial improvements in software development practices.
Measuring with cost per defect and LOC, on the other hand, tends to lead to regressions because both metrics penalize quality and modern software engineering techniques.
Chapter 21

Gaps and Errors in Tools, Methodologies, Languages

As this book has pointed out, collecting accurate quantified data about software
effort, schedules, staffing, costs, and quality is seldom done well and often not done
at all. But even if accurate quantitative data were collected, the data by themselves
would not be sufficient to explain the variations that occur in project outcomes.
In addition to quantitative data, it is also necessary to record a wide variety of
supplemental topics in order to explain the variations that occur. Examples of the
kinds of supplemental data are shown in Table 21.1.
This kind of information lacks a standard representation. The author's approach uses multiple-choice questions to ascertain the overall pattern of tool and methodology usage. At the end of the questionnaire, space is provided to name the specific tools, languages, and other factors that had a significant effect on the project.
There are a number of widely used questionnaires that gather supporting data
on methods, tools, languages, and other factors that influence software outcomes.
The oldest of these is perhaps the questionnaire designed by the author that was first
used in 1984, about a year before the Software Engineering Institute (SEI) started
assessments. The assessment questionnaire created by the SEI is perhaps the second,
and became available by about 1985.
Since then dozens of consulting and benchmark organizations have created
questionnaires for collecting data about software tools and methods. Some of
these organizations, in alphabetical order, include the David Consulting Group,
Galorath Associates, Gartner Group, the International Software Benchmarking
Standards Group (ISBSG), Namcook Analytics, and Quantitative Software
Management (QSM). There are many more in addition to these.


Table 21.1 Factors That Influence Software Project Results

1. The level of the development team on the CMMI scale
2. The experience levels of management, clients, and development teams
3. The nature of the project (new, major enhancement, minor enhancement, and so on)
4. The type of software (web, systems, embedded, military, etc.)
5. Methodologies used, such as Agile, Rational Unified Process (RUP), and so on
6. Whether or not the SEMAT software engineering essence was used
7. Whether or not Six Sigma analysis was utilized
8. Whether or not quality function deployment (QFD) was utilized
9. The kinds of estimating tools used for predicting costs, schedules, and quality
10. The kinds of project management tools used for tracking progress
11. Which combination of the roughly 3,000 programming languages was utilized
12. Whether the project utilized subcontractors or offshore developers
13. Whether the project utilized static analysis before testing
14. The specific combination of inspections and tests that were used
15. Whether automated test tools were utilized
16. Whether Joint Application Design (JAD) was utilized
17. Whether requirements models were used
18. Whether prototypes were built
19. The kinds of change control methods utilized
20. The volumes of reusable materials available and used

Each of these questionnaires may be useful in its own right, but because they
are all somewhat different and there are no industry standards that define the
information to be collected, it is hard to carry out large-scale studies.
Table 21.2 illustrates the author’s method of collecting data on the experience
levels of clients, managers, and development personnel. We use a five-point scale
where the levels mean the following:
1. Very experienced
2. Experienced
3. Average
4. Below average experience
5. All novices
Table 21.2 Software Project Experience Factors

Experience Inputs
Note: Experience affects schedules, productivity, and quality
Note: Default scoring is "3.00 – average"
Note: Decimals are acceptable, such as "2.50"

Scoring pattern:
1 = Very experienced
2 = Experienced
3 = Average experience
4 = Below average experience
5 = Novice

Experience Levels by Group                        Experience Scores
 1  Application type experience                         2.00
 2  Application size experience                         1.50
 3  Client experience with similar projects             2.00
 4  Project management experience                       1.00
 5  Architecture team experience                        2.00
 6  Requirements team experience                        2.00
 7  Design team experience                              3.00
 8  Database team experience                            3.00
 9  Development team experience                         2.80
10  Test team experience                                3.00
11  Quality assurance team experience                   3.00
12  Technical writing team experience                   3.00
13  Customer support team experience                    4.00
14  Maintenance team experience                         4.00
15  Methodology experience                              4.00
16  Hardware platform experience                        2.00
17  Programming language experience                     1.00
18  Operating system experience                         1.00
    Overall Experience                                  2.46
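A minimal sketch of how the overall score in Table 21.2 can be derived: it is simply the arithmetic mean of the 18 individual scores on the five-point scale. The dictionary below reproduces the scores from the table; the function name is illustrative and not part of any commercial tool.

```python
# Overall experience is the mean of the 18 group scores on the 1-5 scale
# (1 = very experienced ... 5 = novice), as in Table 21.2.

EXPERIENCE_SCORES = {
    "Application type": 2.00, "Application size": 1.50,
    "Client (similar projects)": 2.00, "Project management": 1.00,
    "Architecture team": 2.00, "Requirements team": 2.00,
    "Design team": 3.00, "Database team": 3.00,
    "Development team": 2.80, "Test team": 3.00,
    "Quality assurance team": 3.00, "Technical writing team": 3.00,
    "Customer support team": 4.00, "Maintenance team": 4.00,
    "Methodology": 4.00, "Hardware platform": 2.00,
    "Programming language": 1.00, "Operating system": 1.00,
}

def overall_experience(scores: dict) -> float:
    """Arithmetic mean of the individual experience scores."""
    return sum(scores.values()) / len(scores)

print(f"Overall experience: {overall_experience(EXPERIENCE_SCORES):.2f}")  # 2.46
```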
Table 21.2 lists the specific kinds of personnel whose experience levels are important to the success of software projects.
A reliable taxonomy combined with a useful set of assessment and benchmark questions are critical steps in improving software engineering and turning it into a true profession instead of a craft.
As software has such high labor content and such dismal quality control, it is of
considerable importance to be able to measure productivity and quality using stan-
dard economic principles. It is also of considerable importance to be able to predict
productivity and quality before major projects start. In order to accomplish these
basic goals, a number of standards need to be adopted for various measurement
topics. These standards are shown in Table 21.3.

Table 21.3 Data Elements Needed for Effective Benchmarks and Estimates

1. A standard taxonomy for identifying projects without ambiguity.
2. A set of standard charts of accounts for collecting activity-level data.
3. A standard chart of accounts for user data on internal projects.
4. A standard for measuring schedule slippage.
5. A standard for measuring requirements creep and requirements churn.
6. A set of standard charts of accounts for maintenance (defect repairs).
7. A set of standard charts of accounts for enhancements (new features).
8. A set of standard charts of accounts for customer support and service.
9. A standard questionnaire for collecting methodology and tool information.
10. A standard definition for when projects start.
11. A standard definition for when projects end.
12. A standard definition for defect origins and defect potentials.
13. A standard definition for security flaws.
14. A standard definition for defect potentials.
15. A standard definition for defect removal efficiency (DRE).
16. A standard checklist for measuring the contribution of specialists.
17. Standard job descriptions for software specialists.
18. Multiple conversion rules between various metrics.
19. A standard for applying cost of quality (COQ) to software projects.
20. A standard for applying total cost of ownership (TCO) to software projects.
There are more issues than the 20 standards shown here, but until these 20
standards are addressed and their problems solved, it will not be possible to use
standard economic assumptions for software applications.
What the author suggests is a continuing series of workshops that involve the major organizations concerned with software: the SEI, IFPUG, COSMIC, ISBSG, ITMPI, PMI, ISO, Namcook Analytics, and so on.
Universities should also be involved. Unfortunately as of 2016 the relationships
among these organizations tend to be somewhat adversarial. Each wants its own
method to become the basis of international standards. Therefore cooperation on
common problems is difficult to bring about.
Also involved should be the companies that produce parametric software estimation tools, including (in alphabetical order) Galorath, Namcook Analytics, Price Systems, QSM, Software Productivity Research, and USC.
Appendix 1: Alphabetical Discussion of Metrics and Measures

Introduction
This appendix includes an alphabetical discussion of common software measure-
ment and metrics terms.
Over the past 50 years the software industry has grown to become one of the
major industries of the twenty-first century. On a global basis, software applica-
tions are the main operating tools of corporations, government agencies, and mil-
itary forces. Every major industry employs thousands of software professionals.
The total employment of software personnel on a global basis probably exceeds
20,000,000 workers.
Due to the importance of software and because of the high costs of software
development and maintenance combined with less than optimal quality, it is
important to measure both software productivity and software quality with high
precision. But this seldom happens.
For more than 60 years, the software industry has used a number of metrics that
violate standard economic concepts and produce inaccurate and distorted results.
Two of these are lines of code or LOC metrics and the cost per defect metric. LOC
metrics penalize high-level languages and make requirements and design invis-
ible. Cost per defect penalizes quality and ignores the true value of quality, which
is derived from shorter schedules and lower development and maintenance costs.
Both LOC and cost per defect metrics can be classed as professional malpractice for
overall economic analysis. However, both have limited use for more specialized
purposes.
One of the reasons IBM invested more than a million dollars into the development
of function point metrics was to provide a metric that could be used to measure
both productivity and quality with high precision and with adherence to standard
economic principles.


For more than 200 years, a basic law of manufacturing has been understood by
all major industries except software: “If a manufacturing cycle includes a high compo-
nent of fixed costs and there is a decline in the number of units manufactured, the cost per
unit will go up.” The problems with both LOC metrics and cost per defect are due
to ignoring this basic law of manufacturing economics.
For modern software projects, requirements and design are often more expensive
than coding. Further, requirements and design are inelastic and stay more or less
constant regardless of coding size and coding time. When there is a switch from a low-level language such as assembly to a higher level language such as Java, the quantity of code and the effort for coding are reduced, but requirements and design act like fixed costs, so the cost per line of code will go up.
Table A.1 illustrates the paradoxical reversal of productivity rates using LOC
metrics in a sample of 10 versions of a private branch exchange (PBX) switching
application coded in 10 languages, but all the same size of 1,500 function points.
As can be seen from the table, the Assembly version had the largest amount of
effort but also the highest apparent productivity measured with LOC per month
and the lowest measured with function points per month.

Table A.1 Productivity Rates for 10 Versions of the Same Software Project (A PBX Switching System of 1,500 Function Points in Size)

Language       Effort      Function Points    Work Hours per    LOC per        LOC per
               (Months)    per Staff Month    Function Point    Staff Month    Staff Hour
Assembly        781.91          1.92               68.81            480           3.38
C               460.69          3.26               40.54            414           3.13
CHILL           392.69          3.82               34.56            401           3.04
PASCAL          357.53          4.20               31.46            382           2.89
PL/I            329.91          4.55               29.03            364           2.76
Ada83           304.13          4.93               26.76            350           2.65
C++             293.91          5.10               25.86            281           2.13
Ada95           269.81          5.56               23.74            272           2.06
Objective C     216.12          6.94               19.02            201           1.52
Smalltalk       194.64          7.71               17.13            162           1.23
Average         360.13          4.17               31.69            366           2.77
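The reversal shown in Table A.1 follows directly from the fixed-cost rule quoted above. The sketch below is a simplified model, not a recomputation of the table: the fixed non-code effort, the constant coding rate, and the language levels are assumed values chosen only to make the mechanism visible.

```python
# Simplified model of the LOC paradox illustrated by Table A.1.  The numbers
# below (fixed non-code effort, coding rate, language levels) are illustrative
# assumptions, not values taken from the table.

APP_SIZE_FP = 1_500            # same application in both languages
FIXED_NONCODE_MONTHS = 150     # requirements, design, docs, management (assumed fixed)
CODING_LOC_PER_MONTH = 1_000   # assumed constant coding rate in every language

LOC_PER_FP = {"Assembly": 250, "Smalltalk": 21}   # approximate language levels (assumed)

for language, loc_per_fp in LOC_PER_FP.items():
    loc = APP_SIZE_FP * loc_per_fp
    effort = FIXED_NONCODE_MONTHS + loc / CODING_LOC_PER_MONTH   # total staff months
    print(f"{language:10s} effort {effort:7.1f} months | "
          f"{loc / effort:6.0f} LOC per month | "
          f"{APP_SIZE_FP / effort:5.2f} FP per month")
```

With these assumed inputs the low-level language shows by far the higher LOC per month even though it needs roughly three times the total effort, which is exactly the distortion the table documents.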



Function points match standard economic assumptions, whereas LOC metrics reverse standard economics and distort reality. In this table, using LOC metrics for productivity comparisons across different languages would be professional malpractice due to the distortion of economic facts. In other words, moving to high-level languages reduces the number of units produced in a development cycle that has a high percentage of fixed costs, so the cost per unit rises.
When testing software, the time needed to write test cases and run them is comparatively inelastic and stays more or less constant regardless of how many bugs are found. When few bugs are found, test case preparation and execution act like fixed costs, so the cost per defect will go up. Actual defect repair costs are comparatively flat, although there are ranges; the ranges are found in every form of defect removal and do not rise in a consistent pattern.
Table A.2 shows the mathematics of the cost per defect metric. The fixed costs for writing and running test cases are exactly the same in every row. Labor costs are set at U.S. $75.75 per hour throughout the table, and the defect repair column assumes a constant value of 5 hours per defect for every form of test.
As can be seen from the table, the fixed costs of writing and running test cases cause cost per defect to go up as defect volumes come down. This of course is due to the basic rule of manufacturing economics that, in the presence of fixed costs, a decline in the number of units will increase the cost per unit. However, actual defect repairs were a constant 5 hours for every form of testing in the table.

Table A.2 Cost per Defect for Six Forms of Testing (Assumes U.S. $75.75 per Staff Hour for Costs)

                    Writing       Running       Repairing      Total         Number of    $ per
                    Test Cases    Test Cases    Defects        Costs         Defects      Defect
Unit test           $1,250.00     $750.00       $18,937.50     $20,937.50       50        $418.75
Function test       $1,250.00     $750.00       $7,575.00      $9,575.00        20        $478.75
Regression test     $1,250.00     $750.00       $3,787.50      $5,787.50        10        $578.75
Performance test    $1,250.00     $750.00       $1,893.75      $3,893.75         5        $778.75
System test         $1,250.00     $750.00       $1,136.25      $3,136.25         3        $1,045.42
Acceptance test     $1,250.00     $750.00       $378.75        $2,378.75         1        $2,378.75
Table A.3 Cost per Function Point for Six Forms of Testing (Assumes U.S. $75.75 per Staff Hour for Costs; Assumes 100 Function Points in the Application)

                    Writing       Running       Repairing    Total $ per         Number of
                    Test Cases    Test Cases    Defects      Function Points     Defects
Unit test           $12.50        $7.50         $189.38      $209.38                50
Function test       $12.50        $7.50         $75.75       $95.75                 20
Regression test     $12.50        $7.50         $37.88       $57.88                 10
Performance test    $12.50        $7.50         $18.94       $38.94                  5
System test         $12.50        $7.50         $11.36       $31.36                  3
Acceptance test     $12.50        $7.50         $3.79        $23.79                  1

By contrast, looking at the same project and the same testing sequence using
the metric defect removal cost per function point, the true economic situation
becomes clear.
It is important to understand that Tables A.2 and A.3 both show the results
for the same project and also use identical constant values for writing test cases,
running them, and fixing bugs. However, defect removal costs per function point
decline when total defects decline, whereas cost per defect grows more and more
expensive as defects decline.
Here too the traditional cost per defect metric ignores the impact of fixed costs
in a process with a high percentage of fixed costs.
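The arithmetic behind Tables A.2 and A.3 is simple enough to reproduce. The sketch below uses only the constants stated in the text ($75.75 per hour, 5 hours per repair, $1,250 for writing and $750 for running test cases, and 100 function points) and shows in one loop why $ per defect climbs while defect removal $ per function point falls.

```python
# Reproduces the arithmetic of Tables A.2 and A.3 from the constants in the text.

HOURLY_RATE = 75.75
REPAIR_HOURS_PER_DEFECT = 5
WRITE_COST, RUN_COST = 1_250.00, 750.00   # fixed costs per test stage
FUNCTION_POINTS = 100

DEFECTS_FOUND = {            # defects found at each test stage (Table A.2)
    "Unit test": 50, "Function test": 20, "Regression test": 10,
    "Performance test": 5, "System test": 3, "Acceptance test": 1,
}

for stage, defects in DEFECTS_FOUND.items():
    repair_cost = defects * REPAIR_HOURS_PER_DEFECT * HOURLY_RATE
    total = WRITE_COST + RUN_COST + repair_cost
    print(f"{stage:17s} ${total / defects:9,.2f} per defect | "
          f"${total / FUNCTION_POINTS:7.2f} per function point")
```

Cost per defect rises from $418.75 to $2,378.75 as fewer defects remain to be found, while cost per function point falls from $209.38 to $23.79, matching the two tables.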
These are very common software industry problems. But the basic point
is that manufacturing economics and fixed costs need to be included in soft-
ware manufacturing and production studies. Thus far much of the software
literature has ignored fixed costs. All other industries except software have
understood the impact of fixed costs on manufacturing economics for more
than 200 years! Software is unique as it is the only industry that ignores fixed
costs.
Unfortunately, most universities that teach software engineering also ignore the
impact of fixed costs. Therefore software engineering students and software proj-
ect management students graduate without ever learning how to measure software
economic productivity or the economic value of high software quality levels. This
is a professional embarrassment.
Seven Urgent Needs for Software Engineering

Software is a major industry, but not yet a full profession with consistent excellence in results. Indeed, quality lags far behind what is needed. Software engineering has an urgent need for seven significant accomplishments:

1. Stop measuring with unreliable metrics such as LOC and cost per defect and
begin to move toward activity-based costs, function point metrics, and defect
removal efficiency (DRE) metrics.
2. Start every project with formal early sizing that includes requirements creep, formal risk analysis, and formal cost and quality predictions using parametric estimation tools, and with requirements methods that will minimize toxic requirements and excessive requirements creep later on.
3. Start every significant project with a formal risk analysis study based on the
risk patterns of applications of about the same size and the same taxonomy
patterns.
4. Raise DRE from less than 90% to more than 99.5% across the board. This
will also shorten development schedules and lower costs. This cannot be done
by testing alone, but needs a synergistic combination of pretest inspections,
static analysis, and formal testing using mathematically designed test cases
and certified test personnel.
5. Lower defect potentials from above 4.00 per function point to below 2.00 per function point for the sum of bugs in requirements, design, code, documents, and bad-fix injections. This can only be done by increasing the volume of reusable materials, combined with much better quality measures than today (a short arithmetic sketch of defect potentials and DRE follows this list).
6. Increase the volume of reusable materials from less than 15% to more than
85% as rapidly as possible. Custom designs and hand coding are intrinsi-
cally expensive and error-prone. Only use of certified reusable materials that
approach zero defects can lead to industrial-strength software applications
that can operate without excessive failure and without causing high levels of
consequential damages to clients.
7. Increase the immunity of software to cyber attacks. This must go beyond
normal firewalls and antivirus packages and include permanent changes in
software permissions, and probably in the von Neumann architecture as well.
There are proven methods that can do this, but they are not yet deployed.
Cyber attacks are a growing threat to all governments, businesses, and also to
individual citizens whose bank accounts and other valuable stored data are at
increasing risk.
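To make items 4 and 5 concrete, the short sketch below applies the usual definitions: defect potential is defects per function point multiplied by application size, and DRE is the fraction of those defects removed before release. The 1,000 function point example size is an assumption chosen only for illustration.

```python
# Illustrative arithmetic for defect potentials and defect removal efficiency (DRE).

def delivered_defects(size_fp: float, potential_per_fp: float, dre: float) -> float:
    """Defects remaining at delivery = total defect potential * (1 - DRE)."""
    return size_fp * potential_per_fp * (1.0 - dre)

SIZE = 1_000  # function points (assumed example size)

print(f"Today  (4.00 per FP, 90.0% DRE): "
      f"{delivered_defects(SIZE, 4.00, 0.900):.0f} delivered defects")
print(f"Target (2.00 per FP, 99.5% DRE): "
      f"{delivered_defects(SIZE, 2.00, 0.995):.0f} delivered defects")
```

Under these assumptions, delivered defects drop from about 400 to about 10, which is why items 4 and 5 together have such a large economic impact.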

The remainder of this Appendix discusses a variety of software metrics and measurement methods in alphabetical order:
Alphabetical Discussion of Software Metrics and Measures
This Appendix was created in the middle of 2016. Additional metrics topics
will be developed in the future. Metrics are not static but change dynamically
over time.

15 Common Software Risks Associated with Application Size


Nine out of 15 common software risks involve numeric information and errors with
estimates or measures as contributing factors. Many software project risks could be
minimized or avoided by formal parametric estimates, formal risk analysis prior to
starting, accurate status tracking, and accurate benchmarks from similar projects
(Table A.4).

Table A.4 Common Software Risk Factors circa 2016

1. Canceled projects ***
2. Consequential damages to clients
3. Cost overruns ***
4. Cyber attacks
5. Estimate errors or rejection of accurate estimates ***
6. Impossible demands by clients or management ***
7. Litigation for breach of contract ***
8. Litigation for patent violation
9. Poor change control
10. Poor measurement after completion ***
11. Poor quality control ***
12. Poor tracking during development ***
13. Requirements creep
14. Toxic requirements and requirements errors
15. Schedule slips by >25% ***

Note: Risks associated with software size measurement and metrics issues are indicated by triple asterisks (***).
Twenty Suggested Criteria for Software Metrics Selection


Software metrics are created by ad hoc methods, often by amateurs, and broadcast to
the world with little or no validation or empirical results. This set of 20 criteria shows
the features that effective software metrics should have as attributes (Table A.5).
Currently IFPUG function point metrics meet 19 of these 20 criteria. Function
points are somewhat slow and costly so criterion 5 is not fully met.
The new SNAP metric circa 2016 meets only criteria 7 and 8: there is training available and a user association. However, SNAP is very expensive and does not yet meet any of the other criteria, such as having conversion rules and supporting both development and maintenance.

Table A.5 Twenty Criteria for Software Metrics Selection

1. Be validated before release to the world
2. Be standardized, preferably by ISO or OMG
3. Be unambiguous
4. Be consistent from project to project
5. Be cost effective and have automated support
6. Be useful for both predictions and measurements
7. Have formal training for new practitioners
8. Have a formal user association
9. Have ample and accurate published data
10. Have conversion rules to and from other metrics
11. Support both development and maintenance
12. Support all activities (requirements, design, code, test, etc.)
13. Support all software deliverables (documents, code, tests, etc.)
14. Support all sizes of software from small changes through major systems
15. Support all classes and types of software (embedded, systems, Web, etc.)
16. Support both quality and productivity measures and estimates
17. Support requirements creep over time
18. Support consumption and usage of software as well as construction
19. Support new projects, enhancement projects, and maintenance projects
20. Support new technologies as they appear (languages, cloud, methodologies, etc.)
The SNAP committee has not even addressed
requirements creep or changes in SNAP size over long time periods. SNAP has not
yet been applied to embedded software, to commercial packages such as Windows
10, or to weapons systems. Function points, on the other hand, have been used to
size naval shipboard gun controls, cruise missile navigation packages, cell-phone
operating systems, and all other known forms of software.
Other function point variations such as COSMIC, NESMA, FISMA, unadjusted,
engineering function points, feature points, and so on vary in how many criteria they
meet, but most meet more than 15 of the 20 criteria.
The automated function point method meets the first 5 of the 20 criteria. It is
cost effective, standardized, and unambiguous. However, it lacks conversion rules and does not support all classes and types of software. For example, automated func-
tion points are not yet used on embedded applications, medical devices, or weapons
systems. They have not yet been used on commercial software packages such as
SAP, Oracle, Windows 10, and so on.
The older lines of code (LOC) metric meets only criterion 5 and none of the oth-
ers. LOC metrics are fast and cheap, but otherwise fail to meet the other 19 criteria.
The LOC metric makes requirements and design invisible and penalizes high-level
languages.
The cost per defect metric does not actually meet any of the 20 criteria and also
does not address the value of high quality in achieving shorter schedules and lower
costs.
The technical debt metric does not currently meet any of the 20 criteria, although
it is such a new metric that it probably will be able to meet some of the criteria in
the future. Technical debt has a large and growing literature, but does not actually
meet criterion 9 because the literature resembles the blind men and the elephant,
with various authors using different definitions for technical debt. Technical debt
comes close to meeting criteria 14 and 15.
The story point metric for Agile projects seems to meet five criteria; that is, num-
bers 6, 14, 16, 17, and 18 but varies so widely and is so inconsistent that it cannot
be used across companies and certainly cannot be used without user stories.
The use-case metric seems to meet criteria 5, 6, 9, 11, 14, and 15 but cannot be
used to compare data from projects that do not utilize use-cases.
This set of 20 software metric criteria is a useful guide for selecting metrics that
are likely to produce results that match standard economics and do not distort real-
ity, as do so many current software metrics.

Abeyant Defects
The term abeyant defect originated in IBM in the late 1960s. It refers to an unusual
kind of bug that is unique to a single client and a single configuration and does not
occur anywhere else. In fact, the change team tasked with fixing the bug may not
be able to reproduce it. Abeyant defects are both rare and extremely troublesome
when they occur. It is usually necessary to send a quality expert to the client site
to find out what unique combination of hardware and software led to the abeyant
defect occurring. Some abeyant defects have taken more than two weeks to identify
and repair. In today’s world of software with millions of users and spotty technical
support some abeyant defects may never be fixed.

Activity-Based Costs
The term activity is defined as the sum total of the work required to produce a major
deliverable such as requirements, architecture, design documents, source code, or
test cases. The number of activities associated with software projects ranges from a
low of three (design, code, and test) to more than 50.
Several parametric estimation tools such as Software Risk Master (SRM)
predict activity costs. A typical pattern of seven software activities for a midsized
software project of 1,000 function points in size might include: (1) requirements,
(2) design, (3) coding, (4) testing, (5) quality assurance, (6) user documentation,
and (7) project management.
One of the virtues of function point metrics is that they can show productivity rates for every known activity, as illustrated by Table A.6, which is an example for a generic project of 1,000 function points in size coded in Java.
The ability to show productivity for each activity is a virtue of function point metrics and is not possible with many older metrics such as LOC and cost per defect.

Table A.6 Sample of Activity-Based Costs (1,000 Function Points; Java; 132 Work Hours per Month)

Activities             Work Hours per     Function Points     Months of       Project
                       Function Point     per Staff Month     Staff Effort    Costs
Requirements                1.25              105.60              9.47        $94,697
Design                      2.00               66.00             15.15        $151,515
Coding                      6.00               22.00             45.45        $454,545
Testing                     5.00               26.40             37.88        $378,788
Documentation               0.50              264.00              3.79        $37,879
Quality assurance           0.75              176.00              5.68        $56,818
Project management          2.00               66.00             15.15        $151,515
TOTAL                      17.50                7.54             132.58       $1,325,758



To conserve space, Table A.6 shows only seven activities, but this same form of representation can be extended to more than 40 activities and more than 250 tasks.
Function points are the only available metric in 2016 that allows both activity-level and task-level analysis of software projects.
LOC metrics cannot show noncode work at all. Story points might show activities, but only for Agile projects and not for other forms of software. Use-case points require that use-cases be employed in the first place. Only function point metrics are methodologically neutral and applicable to all known software activities and tasks.
As of 2016, how the new SNAP metric will be used for activity-based cost analysis is still uncertain.
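The numbers in Table A.6 can be derived from just two inputs per activity (work hours per function point and the application size), plus the 132 work hours per month and the roughly $10,000 fully burdened cost per staff month that the dollar figures in the table imply. A minimal sketch:

```python
# Recomputes Table A.6 from work hours per function point.
# Assumes 1,000 function points, 132 work hours per month, and the
# $10,000 fully burdened monthly rate implied by the table's dollar figures.

SIZE_FP = 1_000
HOURS_PER_MONTH = 132
COST_PER_STAFF_MONTH = 10_000

WORK_HOURS_PER_FP = {
    "Requirements": 1.25, "Design": 2.00, "Coding": 6.00, "Testing": 5.00,
    "Documentation": 0.50, "Quality assurance": 0.75, "Project management": 2.00,
}

total_months = total_cost = 0.0
for activity, hours_per_fp in WORK_HOURS_PER_FP.items():
    fp_per_month = HOURS_PER_MONTH / hours_per_fp       # productivity per activity
    months = SIZE_FP / fp_per_month                      # staff months of effort
    cost = months * COST_PER_STAFF_MONTH
    total_months += months
    total_cost += cost
    print(f"{activity:20s} {fp_per_month:7.2f} FP/month "
          f"{months:7.2f} months  ${cost:>11,.0f}")

print(f"{'TOTAL':20s} {SIZE_FP / total_months:7.2f} FP/month "
      f"{total_months:7.2f} months  ${total_cost:>11,.0f}")
```

Running this reproduces the 132.58 staff months and roughly $1.33 million total shown in the table.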

Accuracy
The topic of accuracy is often applied to questions such as the accuracy of estimates compared to historical data. However, it should also be applied to the question of how accurate the historical data themselves are. As discussed in the section on historical data leakage, what is called historical data is often less than 50% complete and omits major tasks and activities such as unpaid overtime, project management, and the work of part-time specialists such as technical writers. There is little empirical data available on the accuracy of a host of important software topics, including (in alphabetical order) application costs, customer satisfaction, defects, development effort, maintainability, maintenance effort, reliability, schedules, size, staffing, and usability. Proponents of various function point metrics (COSMIC, FISMA, NESMA, etc.) frequently assert that their specific counting method is more accurate than rival function point methods such as those of the International Function Point Users Group (IFPUG). These are unproven assertions, and also irrelevant in an industry where historical data include only about 37% of the true costs of software development. As a general rule, better accuracy is needed for every software metric without exception. There is no cesium atom for software accuracy. Consistency across multiple counting personnel is a useful surrogate for true accuracy.

Agile Metrics
The Agile development approach has created an interesting and unique set of met-
rics that are used primarily by the Agile community. Other metrics such as func-
tion points and DRE work with agile projects too and are needed if the Agile
method is to be compared to other methods such as Rational Unified Process (RUP)
and Team Software Process (TSP) because the Agile metrics themselves are not very
useful for cross-method comparisons. The Agile approach of dividing larger applica-
tions into small discrete sprints adds challenge to overall data collection. Some com-
mon Agile metrics include burn down, burn up, story points, and velocity. This is a
complex topic and one that is still in evolution, so a Google search on Agile metrics will turn
up many alternatives. The method used by the author for comparison between
Agile and other methods is to convert story points and other Agile metrics into
function points and to convert the effort from various sprints into a standard chart
of accounts showing requirements, design, coding, testing, and so on, for all sprints
in aggregate form. This allows side-by-side comparisons between agile projects and
other methods such as the RUP, TSP, waterfall, iterative, and many others.

Analysis of Variance
Analysis of variance (ANOVA) is a collection of statistical methods for analyzing
the ranges of outcomes from groups of related factors. ANOVA might be applied
to the schedules of a sample of 100 software projects of the same size and type,
or to the delivered defect volumes of the same sample. There are text books and
statistical tools available that explain and support ANOVA. ANOVA is related to
design of experiments and particularly to the design of well-formed experiments.
Variance and variations are major elements of both software estimating and
software measures.

Annual Reports
As all readers know, public companies are required to produce annual reports for
shareholders. These reports discuss costs, profits, business expansion, or contraction,
and other vital topics. Some sophisticated corporations also produce annual
software reports on the same schedule as corporate annual reports, that is, in the
first quarter of a fiscal year showing results for the prior fiscal year. The author has
produced such reports and they are valuable in explaining to senior management
at the CFO and CEO level of what kind of progress in software occurred in the
past fiscal year. Some of the topics included in these annual reports are software
demographics such as numbers of software personnel by job and occupation group,
numbers of customers supported by the software organizations, productivity for
the prior year and current year targets, quality for the prior year and current year
targets, customer satisfaction, reliability levels, and other relevant topics such as
the mix of COTS packages, open source packages, and internal development. Also
included would be modern issues such as cyber attacks and any software related liti-
gation. Really sophisticated companies might also include topics such as numbers
of software patents filed in the prior year.

Application Size Ranges


As function point metrics circa 2016 are somewhat expensive to count manually, they have not been used on really large systems above 10,000 function points in size. As a result, the software literature is biased toward small applications and has little data on the world's largest systems. Among the set of really large systems can
be found the worldwide military command and control system (WWMCCS) at about 300,000 function points, major ERP packages such as SAP and Oracle at about 250,000 function points, large operating systems from IBM and Microsoft at about 150,000 function points, large information systems such as airline reservations at about 100,000 function points, and dozens of heavy-duty systems software applications such as central-office switching systems at about 25,000 function points in size. These sizes were derived from backfiring, that is, mathematical conversion from lines of code to function points. The global distribution of applications by size is approximately the following:

• 100,000 function points and above: 1%
• 10,000–100,000 function points: 5%
• 1,000–10,000 function points: 15%
• 100–1,000 function points: 40%
• Below 100 function points: 39%

Small projects are far more numerous than large systems, but large systems are far more expensive and more troublesome than small projects. Coincidentally, Agile development is a good choice below 1,000 function points, whereas TSP and RUP are good choices above 1,000 function points. So far, Agile has not scaled up well to really large systems above 10,000 function points, but TSP and RUP do well in this zone.
Size is not constant either before release or afterward: so long as there are active users, applications grow continuously. During development, the measured growth rate is 1%–2% per calendar month; after release, the measured rate is 8%–15% per calendar year. A typical postrelease growth pattern might resemble the following. Over a 10-year period, a typical mission-critical departmental system starting at 15,000 function points might have:

Two major system releases of 2,000 function points each
Two minor system releases of 500 function points each
Four major enhancements of 250 function points each
Ten minor enhancements of 50 function points each
Total growth for 10 years: 6,500 function points
System size after 10 years: 21,500 function points
Ten-year total growth: 43%

As can be seen, software applications are never static if they have active users. This
continuous growth is important to predict before starting and to measure at the
end of each calendar or fiscal year. The cumulative information on original devel-
opment, maintenance, and enhancement is called total cost of ownership or TCO.
Predicting TCO is a standard estimation feature of SRM, which also predicts growth rates before and after release.
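The 10-year growth example above is simple arithmetic, sketched below. The release mix is taken directly from the list; treating each line as a per-release quantity is what makes the totals work out to 6,500 function points and 43% growth.

```python
# Ten-year growth arithmetic for the 15,000 function point example above.

INITIAL_SIZE = 15_000   # function points at initial delivery

RELEASES = [            # (count, function points added per release)
    (2, 2_000),         # major system releases
    (2, 500),           # minor system releases
    (4, 250),           # major enhancements
    (10, 50),           # minor enhancements
]

growth = sum(count * size for count, size in RELEASES)
final_size = INITIAL_SIZE + growth

print(f"Total growth over 10 years: {growth:,} function points")
print(f"System size after 10 years: {final_size:,} function points")
print(f"Ten-year growth percent:    {100 * growth / INITIAL_SIZE:.0f}%")
```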

Appraisal Metrics
Many major corporations have annual appraisals of both technical and managerial
personnel. Normally the appraisals are given by an employee’s immediate man-
ager, but often include comments from other managers. Appraisal data are highly
confidential and in theory not used for any purpose other than compensation
adjustments or occasionally for terminations for cause. One interesting sociological
issue has been noted from a review of appraisal results in a Fortune 500 company.
Technical personnel with the highest appraisal scores tend to leave jobs more fre-
quently than those with lower scores. The most common reason for leaving was I do
not like working for bad management. Indirect observation supports the hypothesis
that teams with high appraisal scores outperform teams with low appraisal scores.
Some companies such as Microsoft try to force-fit appraisal scores into fixed patterns, that is, only a certain low percentage of employees can be ranked as excellent. Although the idea is to prevent appraisal score creep, or ranking many more people as excellent than is truly warranted, the force-fit method tends to lower morale and lead to voluntary turnover by employees who feel wrongly appraised. In some
countries and in companies with software personnel who are union members, it
may be illegal to have appraisals. The topic of appraisal scores and their impact on
quality and productivity needs additional study, but of necessity studies involving
appraisal scores would need to be highly confidential and covered by nondisclosure
agreements. The bottom line is that appraisals are a good source of data on experi-
ence and knowledge, and it would be useful to the industry to have better empirical
data on these important topics.

Assessment
The term assessment in a software context has come to mean a formal evaluation of key practice areas covering topics such as requirements, quality, and measures. In the
defense sector the assessment method developed by Watts Humphrey, Bill Curtis,
and colleagues at the Software Engineering Institute (SEI) is the most common. One
byproduct of SEI assessments is placing organizations on a five-point scale called
the capability maturity model integrated (CMMI®). However, the SEI is not unique
nor is it the oldest organization performing software assessments. The author’s for-
mer company, Software Productivity Research (SPR) was doing combined assess-
ment and benchmark studies in 1984, a year before the SEI was first incorporated.
There is also a popular assessment method in Europe called TickIT. Several former
officers of SPR now have companies that provide both assessment and benchmark
data collection. These include, in alphabetical order, the David Consulting Group,
Namcook Analytics LLC, and the Quality/Productivity Measurement group. SPR
itself continues to provide assessments and benchmarks as well. Assessments are generally useful because most companies need impartial outside analysis by trained experts to find out their software strengths and weaknesses.

Assignment Scope
The term assignment scope refers to the amount of a specific deliverable that is nor-
mally assigned to one person. The metrics used for assignment scope can be either
natural metrics such as pages of a manual or synthetic metrics such as function
points. Common examples of assignment scopes would include code volumes, test-
case construction, documentation pages, customers supported by one phone agent,
and the amount of source code assigned to maintenance personnel. Assignment
scope metrics and production rate metrics are used in software estimation tools.
Assignment scopes are discussed in several of the author’s books including Applied
Software Measurement and Estimating Software Costs.
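As a hedged illustration of how assignment scopes and production rates feed an estimate (the specific values below are invented for the example, not taken from any benchmark): staff is the deliverable size divided by the assignment scope, effort is size divided by the production rate, and schedule is effort divided by staff.

```python
# Illustrative use of assignment scope and production rate in estimation.
# All numeric values below are invented for the example, not benchmark data.

def estimate(size, assignment_scope, production_rate):
    """size and assignment_scope in the same unit (e.g., function points);
    production_rate in units per staff month."""
    staff = size / assignment_scope          # people needed
    effort = size / production_rate          # staff months
    schedule = effort / staff                # calendar months
    return staff, effort, schedule

# Example: 1,000 function points, one person per 250 FP of assignment scope,
# and a production rate of 10 FP per staff month (assumed values).
staff, effort, schedule = estimate(1_000, 250, 10)
print(f"Staff: {staff:.1f}  Effort: {effort:.1f} staff months  "
      f"Schedule: {schedule:.1f} calendar months")
```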

Attrition Measures
As we all know, personnel change jobs frequently. During the high-growth period
of software engineering in the 1970s, most software engineers had as many as five
jobs for five companies. In today’s weak economy, job hopping is less common. In
any case most corporations measure annual personnel attrition rates by job titles.
Examination of exit interviews shows that top personnel leave more often than
average personnel, and do so because they do not like working for bad management.
For software engineers, technical challenge and capable colleagues tend to be larger
factors in attrition than compensation.

Automatic Function Point Counting


The Object Management Group (OMG) has published a standard for automatic
function point counting. This standard is supported by an automated tool by
CAST software. A similar tool has been demonstrated by Relativity Technologies.
Both tools use mathematical approaches and can generate size from source code
or, in the CAST tool, from UML diagrams. Neither of the tools has published data on the speed of counting or on the accuracy of the counts compared to normal manual function point analysis. Another high-speed sizing method is based
on pattern matching, as developed by the author. The pattern-matching method
can produce function point sizes for applications between 10 and 300,000 func-
tion points in about 1.8 minutes, regardless of the actual application size. The tool
operates by using a formal taxonomy of application nature, scope, class, type, and complexity, from which the tool derives function point size using historical data from projects that share the same taxonomy pattern. The author's sizing tool is included
in the SRM tool and is also available for demonstration purposes on the Namcook
Analytics LLC website, www.Namcook.com. The author's tool produces size in a
total of 23 metrics including function points, story points, use-case points, physical
and logical source code size, and others.
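The pattern-matching approach described above is proprietary, but its core idea can be sketched: place the new application on the taxonomy, find historical projects that share the same taxonomy pattern, and use their sizes as the prediction. The code below is a simplified illustration with an invented miniature data set; it is not the SRM algorithm or its data.

```python
# Simplified illustration of sizing by taxonomy pattern matching.
# The tiny history list is invented; SRM's actual data and algorithm differ.

from statistics import median

HISTORY = [  # (nature, scope, class, type, complexity, size in function points)
    ("new", "application", "internal", "IT", "average", 2_300),
    ("new", "application", "internal", "IT", "average", 2_700),
    ("new", "application", "internal", "IT", "average", 2_450),
    ("enhancement", "component", "internal", "IT", "low", 300),
]

def size_by_pattern(nature, scope, klass, type_, complexity):
    """Median size of historical projects sharing the same taxonomy pattern."""
    matches = [size for *pattern, size in HISTORY
               if tuple(pattern) == (nature, scope, klass, type_, complexity)]
    if not matches:
        raise ValueError("no historical projects share this taxonomy pattern")
    return median(matches)

print(size_by_pattern("new", "application", "internal", "IT", "average"))  # 2450
```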

Backfiring
In the early 1970s, IBM became aware that LOC metrics had serious flaws as productivity metrics because they penalized modern languages and made noncoding work invisible. Alan Albrecht and colleagues at IBM White Plains began
development of function points. They had hundreds of IBM applications available
with accurate counts of logical code statements. As the function point metric
was being tested it was noted that various languages had characteristic levels,
or numbers of code statements per function point. The COBOL language, for example, averaged about 106.7 statements per function point in the procedure and data divisions. Basic assembly language averaged about 320 statements per function point. These observations led to a concept called backfiring, or mathematical
conversion between older LOC data and newer function points. However, due to
variances in programming styles there were ranges of more than two to one in both
directions. COBOL varied from about 50 statements per function point to more
than 175 statements per function point even though the average value was 106.7.
Backfiring was not accurate but was easy to do and soon became a common sizing
method for legacy applications where code already existed. Today in 2014 several
companies such as Gartner Group, QSM, and Software Productivity Research
(SPR) sell commercial tables of conversion rates for more than 1,000 programming
languages. Interestingly, the values among these tables are not always the same
for specific languages. Backfiring remains popular in spite of its low accuracy for
specific applications and languages.
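A hedged sketch of the backfiring arithmetic, using the COBOL figures quoted above: dividing a logical statement count by the language's average statements per function point gives an approximate size, and applying the observed low and high values (roughly 50 to 175 for COBOL) shows how wide the uncertainty really is. The 100,000-statement legacy application is a hypothetical example.

```python
# Backfiring: approximate function points from logical code statements.
# COBOL figures (average 106.7, range roughly 50-175) are from the text above.

def backfire(logical_statements, statements_per_fp):
    """Convert a logical statement count to approximate function points."""
    return logical_statements / statements_per_fp

COBOL_AVG, COBOL_LOW, COBOL_HIGH = 106.7, 50, 175
loc = 100_000   # logical COBOL statements in a hypothetical legacy application

print(f"Average assumption: {backfire(loc, COBOL_AVG):>8,.0f} function points")
print(f"Possible range:     {backfire(loc, COBOL_HIGH):>8,.0f} "
      f"to {backfire(loc, COBOL_LOW):,.0f} function points")
```

The same 100,000 statements could plausibly represent anywhere from about 570 to 2,000 function points, which is why backfiring is convenient but not accurate for specific applications.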

Bad-Fix Injections
Some years ago, IBM discovered that about 7% of attempts to fix software bugs
contained new bugs in the repairs themselves. These were termed bad fixes. In
extreme cases such as very high cyclomatic complexity levels bad-fix injections can
top 25%. This brings up the point that repairs to software are themselves the sources
of error. Therefore static analysis, inspections, and regression testing are needed for
all significant defect repairs. Bad-fix injections were first identified in the 1970s.
They are discussed in the author’s book The Economics of Software Quality.

Bad-Test Cases
A study of regression test libraries by IBM in the 1970s found that about 15% of test
cases had errors in them. (The same study also found about 20% duplicate test cases
that tested the same topics without adding any value.) This is a topic that is severely
under-reported in the quality and test literature. Test cases that themselves contain
errors add to testing costs but do not add to testing thoroughness.

Balanced Scorecard
Art Schneiderman, Robert Kaplan, and David Norton (formerly of Nolan and
Norton) originated the balanced scorecard concept as known today, although there
were precursors. The book The Balanced Scorecard by Kaplan and Norton made it
popular. It is now widely used for both software and nonsoftware purposes. A bal-
anced scorecard comprises four views and related metrics that combine the financial perspective, the learning and growth perspective, the customer or stakeholder perspective, and the internal business process perspective. The balanced scorecard is not just a retroactive set of
measures, but also includes proactive forward planning and strategy approaches.
Although balanced scorecards might be used by software organizations, they are
most commonly used at higher corporate levels where software, hardware, and
other business factors need integration.

Baselines
For software process improvement, a baseline is a measurement of quality and
productivity at the current moment before the improvement program begins.
As the improvement program moves through time, additional productivity and
quality data collections will show rates of progress over time. Baselines may
also have contract implications, since outsource vendors tender offers to provide development or maintenance services cheaper and faster than the current rates. In general, the same kinds of data are collected for both baselines and
benchmarks.

Bayesian Analysis
Bayesian analysis is named after the English mathematician Thomas Bayes from
the eighteenth century. Its purpose, in general, is to use historical data and obser-
vations to derive the odds of occurrences or events. In 1999, a doctoral student at
the University of Southern California, Sunita Devnani-Chulani, applied Bayesian
analysis to software cost-estimating methods such as Checkpoint (designed by the
author of this book), COCOMO, SEER, SLIM, and some others. This was an
interesting study. In any case, Bayesian analysis is useful in combining prior data
points with hypotheses about future outcomes.

Benchmarks
The term benchmark is much older than software and originally applied to chis-
eled marks in stones used by surveyors for leveling rods. Since then the term has
become generalized, and as of 2014 it is applied to well over 500 different forms of benchmarks in almost every industry. Major corporations have
been observed to use more than 60 benchmarks including attrition rates, com-
pensation by occupation, customer satisfaction, market shares, quality, produc-
tivity, and many more. Total costs for benchmarks can top U.S. $5,000,000
per year, but are scattered among many operating units, so benchmark costs
are seldom consolidated. In this chapter, a more narrow form of benchmark
is relevant that deals specifically with software development productivity and
sometimes with software quality. As this chapter is written in 2014 there are
more than 25 organizations that provide software benchmark services. Among
these can be found the International Software Benchmarking Standards
Group (ISBSG), Namcook Analytics (the author’s company), the Quality and
Productivity Management Group, Quantimetrics, Reifer Associates, Software
Productivity Research (SPR), and many more. The data provided by these vari-
ous benchmark organizations varies, of course, but tends to concentrate on
software development results. Function point metrics are most widely used for
software benchmarks, but other metrics such as LOC also occur. Benchmark
data can either be self-reported by clients of benchmark groups or collected
by on-site or remote meetings with clients. The on-site or remote collection of
benchmark data by commercial benchmark groups allows known errors, such as failure to record unpaid overtime, to be corrected; such corrections may not occur with self-reported benchmark data.

Breach of Contract Litigation


The author has worked as an expert witness in more than a dozen software
breach of contract cases. These are concerned with either projects that are ter-
minated without being delivered or with projects that were delivered but failed
to work or at least failed to work well. The main kinds of data collected during
breach of contract cases center on quality and on requirements creep, both of
which are common in breach of contract litigation. Common problems noted during
these cases that are relevant to software metrics issues include: (1) poor estimates
prior to starting, (2) poor quality control during development, (3) poor change
control during development, and (4) very poor and sometimes misleading status
tracking during development. About 5% of outsource contracts seem to end up
in court. Litigation is expensive and the costs can easily top U.S. $5,000,000
for both the plaintiff and the defendant. It is an interesting phenomenon that
all of the cases except one where the author was an expert witness were for
major systems larger than 10,000 function points in size. It is unfortunate that
neither the costs of canceled projects nor the costs of breach of contract litigation
are currently included in the metric of technical debt that is discussed later in
this report.
Bug
One of the legends of software engineering is that the term bug first referred
to an actual insect that had jammed a relay in an electromechanical computer. The
term bug has since come to mean any form of defect in either code or other deliv-
erables such as requirements and design bugs. Bug reports during development and
after release are standard software measures. See also defect later in this chapter.
There is a pedantic discussion among academics that involves differences between
failures and faults and defects and bugs, but common definitions are more widely
used than academic nuances.

Burden Rates
Software cost structures are divided into two main categories: the costs of salaries
and the costs of overhead commonly called the burden rate and also overhead. Salary
costs are obvious and include the hourly or monthly salaries of software personnel.
Burden rates are not at all obvious and vary from industry to industry, from com-
pany to company, and from country to country. In the United States some of the
normal components of burden rates include insurance, office space, computers and
equipment, telephone service, taxes, unemployment, and a variety of other fees and
local taxes. Burden rates can vary from a low of about 25% of monthly salary costs
to a high of more than 100% of salary costs. Some industries such as banking and
finance have very high burden rates; other industries such as manufacturing and
agriculture have lower burden rates. But the specifics of burden rates need to be
examined for each company in specific locations where the company does business.

Burn Down
Although this metric can be used with any method, it is most popular with Agile
projects. The burn down rate is normally expressed graphically by showing the
amount of work to be performed as compared to the amount of time desired to
complete the work. Burn down is somewhat similar in concept to earned value.
A  variety of commercial and open-source tools can produce burn down charts.
Read also the next topic of burn up. The work can be expressed in terms of user
stories or natural deliverables such as pages of documentation or source code.

Burn Up
This form of chart can also be used with any method but is most popular with Agile
projects. Burn up charts show the amount of work already completed compared to the
backlog of uncompleted work. Burn down charts, as discussed above, show uncom-
pleted work and time remaining. Here too a variety of commercial and open-source
tools can produce the charts. The work completed can be stories or natural metrics.
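As a minimal sketch (the backlog size and per-sprint figures are invented), burn-down and burn-up values can be derived from the same record of completed work per iteration: burn-down tracks what remains against the total, while burn-up tracks the cumulative amount completed.

```python
# Minimal burn-down / burn-up calculation from completed work per sprint.
# The backlog size and per-sprint figures are invented for illustration.

TOTAL_BACKLOG = 120          # e.g., story points planned for the release
completed_per_sprint = [18, 22, 15, 25, 20]

done = 0
for sprint, completed in enumerate(completed_per_sprint, start=1):
    done += completed
    remaining = TOTAL_BACKLOG - done
    print(f"Sprint {sprint}: burn-up {done:3d} | burn-down {remaining:3d}")
```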
Business Value
The term business value is somewhat subjective and ambiguous. Business value can
include tangible financial value, intangible value, and also intellectual property
such as patents. Tangible value can include revenues, profits, and services such
as education and consulting. Intangible value can include customer satisfaction,
employee morale, and benefits to human life or safety as might be found with
medical software. Business value tends to vary from industry to industry and from
company to company. It can also vary from project to project.

Canceled Project Metrics


The author’s data indicates that about 32% of large systems in the 10,000 function
point size range are canceled. The Standish report also reports significant project can-
cellations. It would benefit the industry to perform postmortems and collect standard
benchmarks for all canceled projects. The data elements would include: (1) nature
and type of project, (2) size of application at point of cancellation, (3) methodologies
used on the project such as waterfall, Agile, and so on, (4) programming languages,
(5) time to cancellation in calendar months, (6) costs accrued to point of cancella-
tion, (7) team size and occupation groups, and (8) business reason for cancellation
such as negative return on investment (ROI), poor quality, excessive schedule delays,
or some other reason. There is some difficulty in collecting data on canceled projects
because most companies are embarrassed by their failures and prefer to keep them
secret. The best and most complete data on cancelled projects does not come from
ordinary benchmark data, but from the depositions and discovery documents pro-
duced during litigation. However, some of these data may be covered by nondisclo-
sure agreements.

Certification
As this section was written in 2016 there are more than 50 U.S. organizations that
provide some form of certification for software workers. Among the kinds of cer-
tification that are currently available are certification for counting function points,
certification for testers, certification for quality assurance personnel, certification for
project managers, and certification offered by specific companies such as Microsoft
and Apple for working on their products and software packages. However, there is little empirical data demonstrating that certification actually improves performance compared to uncertified personnel doing the same work. There have been studies that
show that certified function point counters are fairly congruent when counting the
same application. However, there is a shortage of data as to the performance of certi-
fied test personnel and certified project management personnel. There is no reason to
doubt that certification does improve performance; what is missing is solid benchmark data that prove this to be the case and quantify the magnitude of the benefits.
Certification by Food and Drug Administration, Federal Aviation Administration, and so on
Federal regulatory agencies in the United States such as the Food and Drug
Administration (FDA) and the Federal Aviation Administration (FAA) require certifica-
tion of both hardware and software such as medical devices and avionics packages.
The U.S. Sarbanes–Oxley law requires expensive governance (a form of pseudo
certification) for major financial software projects. This has teeth because criminal
charges can be filed for poor Sarbanes–Oxley governance! There are similar certi-
fication agencies in every major country. Various kinds of government certification
for software are expensive and require the production of a variety of special reports
and metrics. These add between 5% and 8% to the costs of software applications
undergoing certification. They also add time to schedules and certified software
packages usually require several months more than the same size and type of soft-
ware that is not certified. The Agile methodology needs to be considerably modified
to support certification because the mandated certification documents are required
and are not optional. The TSP and RUP methods are easier to use for government-
certified software packages.

Certification of Reusable Materials


As software reuse becomes a mainstream development approach, there is a growing
need for formal libraries of certified reusable materials that approach zero-defect
status. These libraries would contain much more than source code; they would also include architectural patterns, design patterns, reusable test cases, and reusable user documents. For that matter, reusable plans and estimates could also be
included. The certification process would include formal inspections, static analysis,
and testing by certified test personnel using mathematical test case design methods
such as those based on design of experiments. Custom designs and manual coding
are intrinsically expensive and error prone. Construction of software from standard
reusable components is the only method that can make permanent improvements
to quality and productivity at the same time. Of course, development of the reusable
components themselves will be slower and more expensive than custom develop-
ment, but would soon return a positive ROI as reuse went above about five applica-
tions using one standard component. The catalog process for the reusable materials
would be based on a formal taxonomy of software features, which is a topic needing
more study and research.

Chaos Report
See the Standish report for additional data. This is an annual report of IT project
failures published by the Standish Group consulting company. Due to the
extensive literature surrounding this report, a Google search is recommended.

Chaos Theory
Chaos theory is an important subfield of mathematics and physics that deals with
system evolution for systems that are strongly influenced by initial or starting con-
ditions. The sensitivity to initial conditions has become popular in science fiction
and is known as the butterfly effect based on a 1972 paper written by Edward Lorenz
that included a statement that a butterfly flapping its wings in Brazil might cause a
tornado in Texas. Chaos theory seems to be a factor in the termination of software
applications prior to delivery. It may also play a part in software breach of contract
litigation. Chaos theory deals with abrupt departures from trend lines. By contrast
Rayleigh curves, discussed later in this alphabetical list, assume smooth and con-
tinuous trend lines. Since about 32% of large systems more than 10,000 function
points in size are canceled prior to completion, it seems obvious that both Rayleigh
curves and chaos theory need to be examined in a software context. A canceled
project obviously departs from a Rayleigh curve. A deeper implication of chaos
theory is that the outcomes of software systems are not predictable even if every
step is determined by the prior step. From working as an expert witness in a number
of lawsuits, it does seem probable that chaos theory is relevant to breach of contract
lawsuits. Failing projects and successful projects sometimes have similar initial con-
ditions, but soon diverge into separate paths. Chaos theory needs additional study
for its relevance to a number of software phenomena. Chaos theory is also likely
to become an important factor in analysis of cyber-security and the root causes of
successful cyber attacks. When historical data are poorly tracked, project status
information becomes invisible to project managers and governance executives. This
tends to lead a project away from a normal Rayleigh curve and toward a disastrous
ending such as termination for poor quality or a negative ROI. It would be inter-
esting to explore the combined impacts of accurate early estimates combined with
accurate progress tracking versus inaccurate early estimates and inaccurate progress
tracking.

Cloud Measures and Metrics


As we all know, cloud computing is the wave of the future. Some standard metrics
such as function points work well for both development of cloud applications, and
also for consumption and usage of cloud services. Financial measures such as ROI
also work but need to be adjusted for fixed and variable costs. Consumption of cloud
services will be a critical factor, and here function points are among the best metrics.
For example, a cloud-based software estimation tool will be about 3,000 function
points in size. If this cloud tool is used 10 hours per month by 1,000 cloud subscrib-
ers, that represents a monthly consumption rate of 30,000,000 function points. Function
points are already used for consumption studies, and this should add value to cloud
economic analysis. For consumption studies individual function points may be too
small, so a metric of kilofunction points similar to kilowatts may be needed.
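
As a rough illustration of this consumption arithmetic, the sketch below multiplies application size by usage hours and subscribers, using the hypothetical figures from the paragraph above; the function names and the kilofunction point conversion are illustrative assumptions rather than an industry standard.

    # Sketch of cloud consumption measured in function points (illustrative only).

    def monthly_consumption_fp(app_size_fp, hours_per_user, users):
        """Consumption = application size x usage hours x subscribers."""
        return app_size_fp * hours_per_user * users

    def to_kilofunction_points(fp):
        """Kilofunction points, by analogy with kilowatts."""
        return fp / 1000.0

    consumption = monthly_consumption_fp(app_size_fp=3000, hours_per_user=10, users=1000)
    print(consumption)                           # 30,000,000 function points per month
    print(to_kilofunction_points(consumption))   # 30,000 kilofunction points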

Capability Maturity Model Integrated Levels


One of the interesting byproducts of the CMMI® is the placement of software
organizations (not specific projects) on a plateau of five levels indicating increas-
ing sophistication: Level 1 = initial, Level 2 = managed, Level 3 = defined, Level
4 = quantitatively managed, and Level 5 = optimizing. From a study by the
author that was commissioned by the U.S. Air Force, ascending the CMMI ladder
tends to improve both quality and productivity, with the caveat that the best Level
1 groups are actually better than the worst Level 3 groups. The CMMI approach
is widely used in the defense community but not much used by the commercial
sector. An older ranking scale developed by the author a year before the SEI was
incorporated is also relevant and is widely used in the commercial sector. The
author’s rankings move in the opposite direction from the SEI rankings. It is too
bad that the SEI did not check the literature before bringing out their own metric.
The author’s metric also uses a five-point scale: Level 1 = expert, Level 2 = above
average, Level 3 = average, Level 4 = below average, and Level 5 = inexperienced.
The author’s scale supports two decimal places of precision such as 2.25 rather
than the integer values used by the SEI. The author’s scale can be converted to the
equivalent SEI scale. Both scales are used in the SRM tool in both predictive and
measurement modes.
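
The text states that the author's scale can be converted to the equivalent SEI scale; the sketch below assumes a simple reversal (6 minus the author's score), which is an illustrative assumption rather than the published SRM conversion.

    # Hedged sketch: converting the author's 5-point scale to an approximate
    # CMMI level. The "6 minus score" mapping is an assumption for illustration.

    def author_scale_to_cmmi(author_score):
        """Author scale: 1 = expert ... 5 = inexperienced (decimals allowed).
        CMMI scale:      1 = initial ... 5 = optimizing."""
        if not 1.0 <= author_score <= 5.0:
            raise ValueError("score must be between 1.0 and 5.0")
        return round(6.0 - author_score, 2)

    print(author_scale_to_cmmi(2.25))   # 3.75 -- between defined and quantitatively managed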

Cognitive Dissonance
The phrase cognitive dissonance refers to both a theory and a set of experiments by
the psychologist Dr. Leon Festinger on opinion formation and entrenched beliefs.
Dr. Festinger found that once a belief is strongly held, the human mind rejects evi-
dence that opposes the belief. When the evidence becomes overwhelming, there is
then an abrupt change of opinion and a move to a new idea. Cognitive dissonance
is common in scientific fields and explains why theories such as sterile surgical pro-
cedures, continental drift, and Darwin’s theory of evolution were rejected by many
professionals when the theories were first published. Cognitive dissonance is also
part of military history and explains the initial rejection of major innovations such
as replacing muskets with rifles: the rejection of screw propellers for naval ships,
the rejection of naval cannon mounted on adjustable mounts, the rejection of iron-
clad ships, and the initial rejection of Samuel Colt’s revolver. Cognitive dissonance
is also part of business and caused the initial rejection of air-conditioning, and
the initial rejection of variable-speed windshield wipers for automobiles (later the
wiper idea was accepted and led to patent litigation by the inventor who had shown
a prototype to Ford). Cognitive dissonance is also part of software and explains
why various invalid metrics are still used even though they have been proven to be
inaccurate. Cognitive dissonance is clearly a factor in the continued use of invalid
metrics such as LOC and cost per defect. It is also a factor in the resistance to newer
metrics such as function points and SNAP points. Thus cognitive dissonance plays

a part in the bad metrics and measurement practices that are endemic in the soft-
ware industry. Apparently the weight of evidence is not yet strong enough to cause
an abrupt switch to function point metrics. Cognitive dissonance is also part of
software methodology selection, such as the current belief that Agile development
is a panacea and suitable for all sizes and forms of software projects.

Cohesion
The cohesion metric is one of several metrics developed by Larry Constantine. See
also coupling later in this chapter. The cohesion metric deals with how closely
all parts of a module are related. High cohesion implies that all parts of a module are
closely related to whatever functionality the module provides, and hence probably
easy to read and understand.

Complexity
The scientific literature encompasses no fewer than 25 discrete forms of complex-
ity. Software engineering has managed to ignore most of these and tends to use
primarily cyclomatic and essential complexity, Halstead complexity, and the sub-
jective complexity associated with function point metrics. However, many other
forms of  complexity such as fan complexity, flow complexity, syntactic complex-
ity, semantic complexity, mnemonic complexity, and organizational complexity also
have tangible impacts on software projects and software applications. The full suite
of 25  different forms of complexity is discussed in the author’s book Estimating
Software Costs. The topic of complexity needs additional study in a software context
because major forms of complexity are not included in either software cost estimates
or software benchmarks as of 2016.

Consequential Damages
Consequential damages is a legal term that refers to harm experi-
enced by a customer as the result of a product malfunctioning or failing. As it
happens, software is extremely likely to malfunction or fail, and hence probably
causes more consequential damages than any other manufactured product in
the twenty-first century. Examples of consequential damages include having to
restate prior year financial results based on bugs in accounting software, deaths
or injuries due to malfunctions of medical software, huge financial losses in stock
markets due to malfunctions of stock-trading software, errors in taxes and with-
holding due to errors in software used by tax collection agencies, and many more.
Consequential damages are not included in either cost of quality (COQ) metrics
or in technical debt metrics. One reason for this is that software developers may
not know of consequential damages unless they are actually sued by a disgruntled
client. Even then the consequential damages will be for only a single client unless

the suit is a class action. In the modern world of 2016, software bugs probably
cost more than a trillion dollars per year for consequential damages without any
good way of measuring the actual harm caused or total costs to many industries.
Even worse, the consequential damages from cyber attacks are growing faster
than almost any business risk in history. The software literature is almost silent
on the overall consequential damages caused by poor quality software with sig-
nificant security flaws.

Contracts—Fixed Cost
The concept of a fixed-cost or fixed-price contract is that the work will be performed
for an agreed amount even though there may be changes in scope. Fixed-cost con-
tracts have a tendency toward litigation in several situations. In one case where the
author was an expert witness, the client added 82 major changes totaling more than
3,000 function points. The client did not want to pay because it was a fixed-price
contract even though there was a clause for out of scope changes. The court decided
in favor of the vendor, who did get paid. In another case, an arbitration, the client
agreed to pay for changes in scope but only the amount agreed to in the contract.
Some of the late changes cost quite a bit more due to the need for extensive changes
to the architecture of the application and then regression testing. Fixed-cost con-
tracts need to include clauses for out of scope changes and also a sliding scale of
costs for changes made late in the development cycle. Fixed-cost contracts also need
constant monitoring by clients and by vendor executives.

Contracts—Time and Materials


The concept of a time and materials contract is that the vendor will charge for
actual hours expended and also charge for any tools or materials acquired, such
as certified reusable components. Time and materials contracts tend to keep good
records of hours expended and hence are useful for historical productivity studies.
A caveat is that some vendors have a tendency to either work slowly or put in addi-
tional team members that may not be needed. Therefore it is useful to have reliable
benchmark data from similar projects. Also useful would be formal estimates prior
to starting that are agreed to by both the client and the vendor. Tools such as SRM
can predict size, costs, and schedules before projects start. Time and materials con-
tracts need careful planning prior to starting to ensure that they are not extended
artificially by vendors, which has been observed with some government time and
materials contracts.

Contracts—Using Function Points


The government of Brazil requires function points for all software contracts. The
governments of Italy, Japan, Malaysia, and South Korea have also started the

same requirement. Function points are very good contract metrics because of
the large volume of benchmark data available. Further, function points lend
themselves to using a sliding scale of costs for handling requirements creep and
even removing features from software. A number of civilian outsource com-
panies are also using function points for software contracts, and this trend is
expanding in 2016. Function points are also valuable for activity-based cost
analysis and can be applied to earned value measurements, although the U.S.
government and the Department of Defense are behind the civilian sectors in
function point usage. Function points are already playing a major role in soft-
ware litigation for breach of contract, poor quality, and other endemic problems
that end up in court.

Cost Center
The phrase cost center is an accounting term that refers to a corporate organization
that does not add to bottom line profit but does expend costs. A profit center is an
organization that does produce revenue. For internal software produced by com-
panies for their own use, the majority of these operate under a cost-center model,
that is, they develop the software without charging the users. As there are no charges,
software measurement practices for cost centers tend to leak and omit major software
cost elements such as unpaid overtime and management. Among the author’s cli-
ents the average completeness of software cost data under the cost-center model is
only about 37% of true costs.

Cost Drivers
Quite a few software researchers such as Barry Boehm, Ian Sommerville, and the
author use the concept of cost drivers. Normal project accounting keeps track of
costs by activity. However, cost drivers aggregate costs across all activities. For
example, one of the cost drivers used by both Boehm and Jones is that of soft-
ware documentation, which spans every phase and almost every activity. In total,
more than 100 documents can be created for large systems and these often cost
more than the code itself. The four major cost drivers cited by the author of this
chapter for specific projects are (1) finding and fixing bugs, (2) document creation,
(3) meetings and communications, and (4) requirements creep. When looking at
larger national results across thousands of projects, additional cost drivers include:
(1) canceled projects, (2) cyber attacks, (3) cyber attack recovery and reparations,
and (4) litigation for breach of contract, intellectual property, and other causes.
Cost drivers are useful for software economic analysis because they highlight major
areas that need study and improvements. A full list of major software cost drivers
is shown in Table A.7.
As can be seen by Table A.7, software has far too many major risk and problem
factors that rank too high in this list of overall software cost drivers.

Table A.7 U.S. Software Cost Drivers in Rank Order for 2016
1 The cost of finding and fixing bugs

2 The cost of canceled projects

3 The cost of producing paper documents

4 The cost of programming or coding

5 The cost of security flaws and cyber attacks

6 The accrued costs of schedule delays on large applications

7 The costs of functional requirements

8 The costs of nonfunctional requirements

9 The cost of requirements changes (functional and nonfunctional)

10 The costs of user effort for internal software projects

11 The cost of postrelease customer support

12 The costs of meetings and communication

13 The costs of project management

14 The costs of legacy maintenance and renovation as software ages

15 The costs of innovation and new kinds of software

16 The costs of litigation for software failures and disasters

17 The costs of training and learning (customer training)

18 The costs of avoiding or removing security flaws

19 The costs of porting to ERP packages

20 The costs of assembling reusable components


Average $ per function point = $1,250 development; $1,400 maintenance;
$2,650 TCO.
Cyber attack $ per function point = $75 prevention; $450 recovery; $525 TCA.

Cost of Quality
The cost of quality (COQ) metric is much older than software and was first made
popular by the 1951 book titled Juran’s QC Handbook by the well-known manufac-
turing quality guru Joseph Juran. Phil Crosby's later book, Quality Is Free (1979)
also added to the literature. Cost of quality is not well-named because it really
focuses on the cost of poor quality rather than the cost of high quality. In its general

form, cost of quality includes prevention, appraisal, failure costs, and total costs.
When used for software the author of this chapter modifies these terms for a soft-
ware context: defect prevention, pretest defect removal, test defect removal, and
postrelease defect removal. The author also includes several topics that are not part
of standard COQ analysis: cost of projects canceled due to poor quality, cost of
consequential damages or harm to customers from poor quality, and cost of litiga-
tion and damage awards due to poor quality.
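
A minimal sketch of a software-oriented COQ rollup using the categories named above; the dollar figures are placeholders rather than benchmark data.

    # Sketch of a software cost-of-quality rollup (placeholder numbers).

    software_coq = {
        "defect prevention":                    100_000,
        "pretest defect removal":               250_000,
        "test defect removal":                  400_000,
        "postrelease defect removal":           150_000,
        # Author's extensions beyond classic COQ:
        "projects canceled due to poor quality":      0,
        "consequential damages to customers":         0,
        "litigation and damage awards":               0,
    }

    total_coq = sum(software_coq.values())
    print(f"Total cost of (poor) quality: ${total_coq:,}")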

Cost per Defect


The fixed costs of defect removal, such as writing test cases and having maintenance
programmers ready and waiting, cause cost per defect to rise steadily throughout
the development cycle. Fixed costs also cause cost per defect to be cheapest for
the buggiest software, which clouds and confuses the economic study of software
quality. This metric is not suitable for economic analysis. The alternate metric defect
removal cost per function point is a much better indicator of the actual benefits of
high quality. Cost per defect also ignores the main value points of high quality, that
is, shorter schedules, lower costs, and more satisfied customers.
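
A small illustration of the fixed-cost distortion described above, using assumed numbers: the same fixed testing cost spread over different defect counts makes the buggiest project look cheapest per defect, while defect removal cost per function point moves in the economically sensible direction.

    # Illustration (placeholder numbers): fixed costs make the buggiest project
    # look "cheapest" per defect but not per function point.

    def cost_per_defect(fixed_cost, variable_cost_per_defect, defects):
        return (fixed_cost + variable_cost_per_defect * defects) / defects

    def removal_cost_per_fp(fixed_cost, variable_cost_per_defect, defects, function_points):
        return (fixed_cost + variable_cost_per_defect * defects) / function_points

    FIXED, VARIABLE, SIZE_FP = 50_000, 200, 1_000   # assumed values

    for defects in (50, 500):   # low-defect vs high-defect project of the same size
        print(defects, "defects:",
              round(cost_per_defect(FIXED, VARIABLE, defects)), "$ per defect;",
              round(removal_cost_per_fp(FIXED, VARIABLE, defects, SIZE_FP)), "$ per FP")
    # 50 defects:  1200 $ per defect;  60 $ per FP
    # 500 defects:  300 $ per defect; 150 $ per FP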

Cost per Function Point


If used carefully, cost per function point is the top-ranked metric for software-
economic analysis. However, there are some caveats and cautions that need to
be understood: (1) costs vary by size, and large systems cost more than small pro-
grams; (2) costs vary by type of software, and systems and embedded software
cost more than Web and IT applications; (3) costs vary by geographical area, and
rural locations such as Nebraska cost less than urban areas such as New York
or San Francisco; (4) costs vary by industry, and some industries such as bank-
ing have much higher costs than others such as manufacturing; (5) costs vary
by country, and some countries such as Switzerland cost a lot more than other
countries such as Pakistan; and (6) costs vary by time, both while projects are
in progress and also after release when new features are added. Continuous
growth of software over time requires that cost data be renormalized from time
to time, such as once per year after release. (If a project is 1,000 function points
at initial release but grows by 100 function points per year for 10 years in a row,
then cost per function point needs annual adjustments to include the current year’s
changes.)
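
A sketch of the annual renormalization described in the parenthetical example, with an assumed initial cost and an assumed cost per added function point.

    # Sketch of annual renormalization: 1,000 FP at release, growing 100 FP per
    # year. The cost figures are assumed for illustration.

    initial_size_fp, initial_cost = 1_000, 1_000_000
    growth_fp_per_year, cost_per_new_fp = 100, 1_200   # assumed enhancement cost

    size, cumulative_cost = initial_size_fp, initial_cost
    for year in range(1, 11):
        size += growth_fp_per_year
        cumulative_cost += growth_fp_per_year * cost_per_new_fp
        print(f"Year {year}: {size} FP, cumulative ${cumulative_cost:,}, "
              f"${cumulative_cost / size:,.2f} per FP")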

Cost per Lines of Code and KLOC


Both cost per LOC and cost per KLOC (with K standing for 1,000 LOC) have
been in use for more than 50 years, but suffer from severe errors when comparing
projects coded in different programming languages or combinations of languages.

Table A.8 Cost of Development for 10 Versions of the Same Software
Project (A PBX Switching System of 1,500 Function Points in Size)

Language       Effort     Burdened Salary   Burdened      Burdened Cost per   Burdened Cost
               (Months)   (per Month)       Costs         Function Point      per LOC

Assembly       781.91     $10,000           $7,819,088    $5,212.73           $20.85
C              460.69     $10,000           $4,606,875    $3,071.25           $24.18
CHILL          392.69     $10,000           $3,926,866    $2,617.91           $24.93
PASCAL         357.53     $10,000           $3,575,310    $2,383.54           $26.19
PL/I           329.91     $10,000           $3,299,088    $2,199.39           $27.49
Ada83          304.13     $10,000           $3,041,251    $2,027.50           $28.56
C++            293.91     $10,000           $2,939,106    $1,959.40           $35.63
Ada95          269.81     $10,000           $2,698,121    $1,798.75           $36.71
Objective C    216.12     $10,000           $2,161,195    $1,440.80           $49.68
Smalltalk      194.64     $10,000           $1,946,425    $1,297.62           $61.79
Average        360.13     $10,000           $3,601,332    $2,400.89           $27.34

Table A.8 shows cost per LOC side by side with cost per function point to illustrate
the errors. As can be seen, cost per LOC reverses true economic productivity and
makes the most expensive version, coded in assembly language, look cheaper than
the least expensive version, coded in Smalltalk. The errors of LOC metrics are clearly
visible because identical applications coded in different languages are shown side
by side.
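
The reversal can be reproduced with two rows of Table A.8; the LOC counts below are approximations back-calculated from the table (roughly 250 LOC per function point for assembly and 21 for Smalltalk), so treat them as illustrative.

    # Recomputing two rows of Table A.8 to show the LOC reversal.

    SIZE_FP, MONTHLY_COST = 1_500, 10_000

    versions = {
        # language: (effort in months, approximate LOC)
        "Assembly":  (781.91, 375_000),
        "Smalltalk": (194.64,  31_500),
    }

    for language, (effort_months, loc) in versions.items():
        cost = effort_months * MONTHLY_COST
        print(f"{language:10s} ${cost:,.0f}  "
              f"${cost / SIZE_FP:,.2f} per FP  ${cost / loc:,.2f} per LOC")

    # Assembly costs far more in total and per function point, yet its cost per
    # LOC ($20.85) looks "better" than Smalltalk's ($61.79).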

Cost per Story Point


The cost per story point metric is useful for projects utilizing user stories as require-
ments and design method. However, this metric cannot be used for large-scale
economic studies involving projects that use other kinds of requirements methods
than story points. As of 2014, story points have no ISO standards and no certification
examinations and have been observed to vary by as much as 400% from company
to company. There are very few benchmark data available using story points, and
what data are available need to be used with caution due to the variability of
this metric.

Coupling
This is another interesting metric developed by Larry Constantine (see also cohe-
sion). The coupling metric refers to how modules exchange or share information.
Coupling can range from low coupling to high coupling. Low coupling tends to
be associated with well-structured software that is easy to read and comprehend.
There are many forms of coupling ranging from no coupling at all through content
coupling when a module depends on the inner workings of another module. Some
forms of coupling include data coupling, temporal coupling, stamp coupling, message
coupling, control coupling, and others as well. Coupling and cohesion are often
used as a set of related metrics.

Currency Exchange Rates


For international projects, software personnel will probably be paid in local
currencies. As currency exchange rates vary every day as well as varying over lon-
ger time periods, this is a significant issue for accurate estimates for international
projects. Currency exchange rates are economic topics that affect all industries that
work globally and are not just software. The most common method of dealing
with currency exchange rates in software estimates is to use current values and
then make adjustments later if significant changes occur. Currency exchange rates
also play a part in global outsource contracts and are somewhat related to inflation
rates. Both inflation and currency exchange rates can make long-range projects
unpredictable.

Customer Satisfaction Metrics


Most large software companies devote considerable time and energy to finding out
whether customers like their software or are not happy with it. Usually questionnaires
or interviews are created by human factors specialists or even by psychologists. Studies
of customer satisfaction at IBM noted a strong correlation between delivered defect
rates and overall satisfaction. In fact, studies of many kinds of consumer products
such as televisions and stereos found that quality was the number one determinant of
high satisfaction. Other topics include speed of defect repairs, ease of reaching support
teams, and aesthetic factors.

Customer Support Metrics


Two endemic problems of the software industry are that software projects have
too many bugs after release and that software support personnel are difficult to
reach because there may not be enough of them. Because live support personnel
are costly, many companies in expensive countries such as Japan, Switzerland,
and the United States outsource customer support to countries with lower labor

costs such as India. The number of customer support personnel needed to allow
clients to reach a live person in less than 5 minutes by phone or e-mail can be
estimated and some tools such as SRM include these predictions. Some sophis-
ticated companies such as Apple have calculated customer support needs and
do a pretty good job. Others, such as Verizon, have ignored their customers
and have inadequate support where it is next to impossible to reach a live sup-
port person. Customer support staffing is based in part on expected numbers of
postrelease defects, in part on expected numbers of clients using the software,
and in part on whether support will be available 24 hours per day or only during
one or two shifts.

Cyber Attack Metrics


Cyber attacks are increasing in variety and frequency. A Google search on cyber
attacks and cyber attack metrics will show current data, which changes almost daily.
Some of the important metrics for cyber attack frequency and origins are kept
by government agencies such as the FBI, Homeland Security, and the CIA. The
Congressional Cyber-Security Caucus, started by Representatives Jim Langevin
(Democrat) of Rhode Island and Mike McCaul (Republican) of Texas, pub-
lishes excellent weekly summaries on cyber attacks and is highly recommended
because it is both free and contains valuable data. (This is a rare instance of coop-
eration between Democrats and Republicans.) Cyber attack data include numbers
of attacks by type such as denial of service, viruses, worms, and so on. The data should
also include the value of any stolen materials, the number of citizens whose data
is compromised, and the eventual costs of recovering from cyber attacks. Many
companies have been lax and even incompetent in reporting cyber attacks, and
a few have lagged in notifying customers of possibly stolen data. These problems
are endemic in 2014 and seem to be growing worse. Cyber attacks have moved
from individual hackers to organized crime and also to hostile national govern-
ments, all of whom have active cyber warfare units that seek out weaknesses in
other countries including the United States. This is a huge problem and it will be
getting worse.

Cyclomatic Complexity
This metric is one of the most widely used indicators of software structure, along
with essential complexity. The cyclomatic complexity metric was developed in 1976
by Tom McCabe. It is based on graph theory and is an expression of the control
flow graph of an application. Cyclomatic complexity for software with no branches
is 1. As the number of branches increases, cyclomatic complexity increases. Once cyc-
lomatic complexity rises above 20, it is hard to follow the flow, and hence some
branches may be wrong. The formula for cyclomatic complexity is graph edges
minus nodes plus 2. Cyclomatic complexity plays a part in estimating test cases and

maintenance effort. High cyclomatic complexity levels are also cited in litigation
for poor quality. An interesting theoretical question is whether or not code with
low cyclomatic complexity is possible for very complex problems. See also essential
complexity and Halstead complexity.
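
A minimal sketch of McCabe's formula (edges minus nodes plus two times the number of connected components) applied to a made-up control flow graph containing a single if/else decision.

    # McCabe cyclomatic complexity from a control flow graph (illustrative graph).

    def cyclomatic_complexity(edges, nodes, connected_components=1):
        return edges - nodes + 2 * connected_components

    # Control flow graph of "if (x) then A else B; end":
    #   entry -> decision -> A -> exit
    #                     -> B -> exit
    nodes = 5   # entry, decision, A, B, exit
    edges = 5   # entry->decision, decision->A, decision->B, A->exit, B->exit
    print(cyclomatic_complexity(edges, nodes))   # 2 (one decision adds one path)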

Dashboard
The term dashboard is much older than software and has been applied to the
control panels of various devices such as automobiles where instruments provide
useful information to the operator. In a software context the term dashboard
refers to a continuous display of information about a project’s status, includ-
ing but not limited to completed tasks versus unfinished tasks, completed test
cases versus unfinished test cases, and completed documents versus unfinished
documents. A number of commercial and some open-source tools provide auto-
mated or semi-automated dashboards for software projects. Some of these sup-
port a number of projects at the same time and are useful for portfolio analysis,
and also for data center analysis when many applications are executing
simultaneously.
A number of tools produce dashboards for software projects, including the Automated
Project Office (APO) from Computer Aid, Cognos from IBM, IDashboards,
SAP analytics, and DataWatch; in fact, almost 40 such products are available.

Data Point Metrics


Major corporations own more data than they own software. Data are expensive to
create and maintain and are known to contain many errors. But as of 2014 there is
no effective size metric for databases and repositories. It is theoretically possible to
construct a data point metric that would resemble the structure of function point
metrics but would size data volumes. A data point metric would be useful in studies of
data ownership and data quality. Some of the atomic elements of a data point metric
might be logical files, entities, relationships, attributes, inquiries, and interfaces. The
fundamental idea is to have function points and data points relatively congruent.
If this were the case, then an application such as a website might be sized at 10,000
function points and 15,000 data points. The idea is to allow better estimates and
better benchmarks for data-rich applications such as medical records, tax records,
retail chain websites, and many others.

Defect (Definition)
There is a somewhat pedantic academic discussion of the differences between a
failure, a fault, a defect, a bug, an error, an incident, an anomaly, and so on. The term
defect is a good general-purpose term that can encompass all of these. A defect is
an accidental mistake by a human that causes either total stoppage of software,

unacceptably slow performance, or the creation of incorrect data and results.


Defects can originate from multiple sources. A requirements defect would be some-
thing like the Y2K problem. A design defect would be something like understating
a performance goal. An architectural defect would be something like using client
server when a distributed network would be better. A code defect is something like
branching to the wrong address. A document defect is something like omitting
a step in the installation procedure for software. A bad-fix defect is a bug in an
attempt to fix a prior bug.

Defect Consequences
The term defect consequences is derived from the legal term consequential dam-
ages. It refers to the type and severity of harm that bugs or security flaws cause to
software customers. The most serious consequences would include human deaths,
injuries, or major financial losses in excess of U.S. $1,000,000. Other consequences
might include reduced operational efficiency, loss or theft of confidential data, and
loss of customers or at least reduced customer satisfaction levels.

Defect Density
For many years defect density has informally been defined as defects per KLOC.
This of course omits requirements and design defects, which often outnumber code
defects. Worse, this definition penalizes high-level languages. Assume you have
5,000 lines (5 KLOC) of assembly code with 50 bugs. Now assume the same algo-
rithms are coded in 1,000 lines (1 KLOC) of Java with 10 bugs. Both have exactly
10 bugs per KLOC as apparent defect densities, even though assembly has five
times as many code bugs. Assume both versions were 20 function points in size.
With this assumption, assembly has 2.5 bugs per function point, whereas Java has
only 0.5 bugs per function point. As can be seen defects per function point correctly
compensates for the reduced bug counts, whereas KLOC metrics do not show any
value for reduced defect volumes. For that matter defects per function point can
also include bugs in requirements, design, architecture, user documents, and all
other categories.
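
The assembly-versus-Java example above, worked in code.

    # Defects per KLOC versus defects per function point for identical
    # functionality (20 function points) coded in two languages.

    versions = {
        # language: (KLOC, code bugs)
        "Assembly": (5.0, 50),
        "Java":     (1.0, 10),
    }
    SIZE_FP = 20

    for language, (kloc, bugs) in versions.items():
        print(f"{language:8s} {bugs / kloc:.1f} bugs per KLOC   "
              f"{bugs / SIZE_FP:.1f} bugs per function point")

    # Both score 10.0 bugs per KLOC, hiding the fact that the assembly version
    # has five times as many bugs; per function point (2.5 vs 0.5) the
    # difference is visible.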

Defect Discovery Factors


When software applications are released, they still contain defects. These are
known as latent defects until they are discovered. This brings up the key question of
what factors lead to the discovery of latent defects. As it turns out, the number of
users, the amount of usage, and the variety of usage are the three critical factors. One
user using software for 1 hour a month and doing one task probably would not
find many latent defects. One million users using software 24 hours a day for
several thousand different kinds of tasks will probably flush out the majority of

latent defects in a month or two. These factors explain the measured differences in
defect discovery rates for various kinds of software. Embedded and systems soft-
ware usually have the fastest defect discovery rate; Web projects with thousands
of visitors have fast defect discovery rates. Interestingly Agile projects, which are
often done to support less than 100 users, have fairly slow defect discovery rates
that may lead to a premature assertion that Agile quality is better than it really
might be.

Defect Detection Efficiency


This is one of two quality metrics developed by IBM circa 1970. Defect detection
refers to the percentage of bugs identified prior to release. See also the next metric
on DRE. In earlier eras, defect detection efficiency (DDE) and DRE were almost
identical and only a few bugs found on the actual day of release were not fixed by
the time of release. In today’s world of 2014 with bigger systems and greater sched-
ule urgency, DDE averages more than 10% higher than DRE. That is, at least 10%
of the known bugs in a software application are not fixed before the software is
released. Of course, these bugs cause problems and create consequential damages,
not that anybody cares about clients anymore.

Defect Removal Efficiency


The phrase defect removal efficiency (DRE) refers to the percentage of bugs found
and fixed prior to release, when compared to customer-reported bugs in the
first 90 days after release. If a development team finds and fixes 990 bugs and
clients find and report only 10 bugs in the first 3 months, then DRE would be
99%, an excellent result. However, as of 2014, average DRE hovers around 90%
and for Agile around 92%. The best DRE values come from synergistic combi-
nations of pretest inspections and static analysis combined with formal testing
using mathematical test case design and at least nine test stages: (1) subroutine
test of code segments, (2) unit test of modules, (3) function test, (4) regression
test, (5) component test, (6) performance test, (7) security test, (8) system test,
and (9) acceptance or Beta test. This combination usually tops 99% in DRE.
Projects that omit pretest defect removal and use only three or four test stages
are often less than 85% in DRE. DRE is probably the most useful and effective
quality metric. It is easy to measure, and high levels of DRE correlate with high
productivity and high levels of customer satisfaction and also with high levels of
team morale.
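
A minimal sketch of the DRE calculation as defined above: bugs removed before release divided by total bugs, counting customer reports in the first 90 days.

    # Defect removal efficiency as a percentage.

    def defect_removal_efficiency(internal_bugs, field_bugs_90_days):
        return 100.0 * internal_bugs / (internal_bugs + field_bugs_90_days)

    print(defect_removal_efficiency(990, 10))    # 99.0 percent
    print(defect_removal_efficiency(900, 100))   # 90.0 percent, roughly the current average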

Defect Origins
The phrase defect origins was first defined in IBM circa 1968. It refers to the
specific work product in which a software defect was created. There are six common software

defect origins: (1) software requirements, (2) architecture, (3) design, (4) source
code, (5) user documents, and (6) bad fixes or secondary bugs in defect repairs.
There are other sources of defects such as data errors and bugs in test cases, but
they are not normally included in software defect measurements. When bugs were
reported, IBM quality engineers noted the point in time where the bug was found
(inspection, testing, deployment, etc.) and also noted the place where the bug
was created. They then assigned an origin code to each bug. This allowed IBM to
explore quality in a fairly sophisticated way and led to many important findings
such as the fact that requirements and design errors often outnumber code errors.
Make no mistake, a requirements defect such as Y2K will eventually end up in
source code, but that is not where the Y2K bug started. It started as an explicit
user requirement to conserve space by using only two digits for date fields. Every
company that builds software should explore their own defect origins, as indeed
many do.

Defect Resolution Time


This topic refers to the number of hours or days between the point in time when
a bug is first reported, and the point in time when users get a new version of the
software with the bug fixed.
Early on during requirements and design when software is still easily change-
able, defect resolution time is normally less than 8 hours. As software development
proceeds, more and more artifacts may require correction. For example, a require-
ments bug found during testing may require changes to the original requirements
specification, design documents, probably the source code, and perhaps even test
cases. The logistics of defect repairs become more convoluted with time. Once
defects are fixed and tested they may not be immediately released to customers.
For low-severity bugs, defect repairs are usually aggregated into the next release.
However, for high severity bugs and especially severity 1 bugs when the software
does not work at all, patches and emergency repairs are sent out as needed.

Defect Severity Levels


All bugs are not equally serious. Way back in the early 1960s when software
was first becoming a business tool, IBM recognized that software bugs needed
to be classified. The original IBM classification is still working after more than
60 years. Under this classification, severity 1 is the highest and indicates that
the software does not work at all. Severity 2 is second and indicates that a major
feature is disabled. Severity 3 is next and indicates either a minor issue or one
with an available work around. Severity 4 means a cosmetic problem such as a
spelling error that does not affect software operation at all. There are some other
categories besides severity: (1) invalid defect reports and (2) duplicate defect
reports. From analysis of hundreds of projects, the normal technical distribution

by severity level after release would be: (1) severity 1 = 1%, (2) severity 2 =
15%, (3) severity 3 = 35%, and (4) severity 4 = 49%. However, because most
companies fix high-severity bugs quicker than low-severity bugs, clients tend to
try and push low-severity bugs up into severity level 2 in order to get a quicker
repair. Some applications have more than 50% of bugs reported as severity 2 by
clients even for trivial issues such as the placement of text on a screen or the color
of a display. Also, some defect reports turn out to be suggested improvements
that are reported as defects. The actual determination of defect severity is usu-
ally assigned to a quality assurance or maintenance team that sometimes has to
negotiate with clients.
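
A small sketch applying the technical severity distribution quoted above to a hypothetical batch of 1,000 released defects; the defect count is a placeholder.

    # Applying the quoted severity distribution to a batch of released defects.

    severity_distribution = {1: 0.01, 2: 0.15, 3: 0.35, 4: 0.49}
    released_defects = 1_000

    for severity, share in severity_distribution.items():
        print(f"Severity {severity}: about {round(released_defects * share)} defects")
    # In practice, client pressure inflates the reported share of severity 2.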

Deferred Features
For many software projects either clients, executives, or business pressures such as
government laws and mandates dictate schedules that are shorter than technically
possible. In the case of impossible schedules, something has to give, and it is
often features that are desirable but not mandatory. Below 100 function points,
software is usually delivered close to 100% complete. Above 10,000 function points
it is not uncommon for the first release to omit more than 35% of planned features
in order to make a shorter than possible delivery date. Deferred features are an
endemic problem of large software projects. An interesting law attributed to Chris Winter
of IBM is that 80% of features delivered on time are more valuable than 100% of features
delivered late.

Delphi Methods
The term Delphi method is based on a line of famous Greek oracles who lived in the
temple in Delphi and were sought out by various leaders to predict future events.
In the modern Delphi method, panels of experts answer questions in a formal
structured way but anonymously. After the first round of questions a second round
is prepared using summaries from the first round. There may be additional rounds
until a concurrence of opinions is reached. The concept is based on the hypothesis
that groups of experts can pool their knowledge and do a better job of prediction
than a single expert. Delphi is used more for corporate decisions than for software
decisions but is sometimes used for major applications with high risks. As Delphi
depends on expertise, it is important to select participants who have actual knowl-
edge of the issues.

Delivered Defects
As software DRE is almost always less than 100% and often less than 90%, the
great majority of software applications are delivered with latent defects. By using

historical data major companies such as IBM are able to predict delivered defects
in future projects, and also use effective methods to keep delivered defects at
very low levels. Some tools such as SRM predict delivered defects as a standard
feature. In fact, SRM predicts not only total defects but also delivered defects
by origin: defects caused by requirements, design, code, bad testing, and so on.
Delivered defects are predicted by using defect potentials and DRE. They are of
course measured as they occur. Note that not all defects are found until several
years after release. Indeed, delivering software with excessive defects slows down
defect discovery because clients do not trust the software and avoid using it, if
possible. Annual reports by clients of delivered defects are only around 30% of
actual latent defects for IT applications, but higher for systems and embedded
software.

Design, Code, and Unit Test


The design, code, and unit test (DCUT) method has been in use for perhaps 50 years
or more. In the 1960s, DCUT comprised about 85% of the total work of software
and was a reasonably useful approach for both estimates and benchmarks. Today in
2014 with more than 125 occupation groups involved with large systems, DCUT
comprises less than 30% of total development effort. For example, DCUT excludes
quality assurance, technical writers, integration and configuration control, project
offices, and project managers. DCUT is not an effective method today and should
be replaced by activity-based cost analysis that includes the entire suite of software
development activities.

Dilution
The term dilution refers to the loss of equity that entrepreneurs may experience
if they receive venture capital, and especially if they receive more than one round
of venture capital. In order to get funding for a software company or major
projects from a venture funding source, probably 20% of the ownership will
be turned over to the venture capitalists. If the project runs through the initial
investment, which many do, and second or third rounds of financing are needed,
the entrepreneurs occasionally end up with less than 15% ownership. Quite a
few venture-funded companies fail completely and go bankrupt. As a service to
software entrepreneurs, SRM includes a venture-funding routine that will predict
both the number of rounds of funding and the probable dilution of ownership.
It can also predict the odds of failure or bankruptcy. As these predictions can
be done early before any money is committed at all, hopefully both the entre-
preneurs and the venture capitalist will have a good preview of probable results
before committing serious money.

Documentation Costs
Documentation costs are fairly sparse for small projects and especially for Agile
projects. However, for large systems above 10,000 function points and especially
for government and military software projects, more than 100 kinds of documents
might be created and the total costs of these documents are often greater than the
costs of the source code itself. SRM has a standard feature for predicting document
numbers, pages, words, and costs. For a small project of 10 function points, total
document pages will be around 50. For projects of 100 function points, total doc-
ument pages will be around 400. For projects of 1,000 function points, total
document pages will be around 3,500. For projects of 10,000 function points, total
document pages will be around 32,500. Studies by the author have noted that pages
per function point tend to decline with larger applications because full documen-
tation might go past the lifetime reading speed of a single individual. Document
costs need additional research in the software engineering field. For large civilian
systems, document costs are the #2 cost driver and for large defense systems, some-
times the #1 cost driver even going past finding and fixing bugs. Function points
are the best metrics for studying document costs. To highlight the huge volume of
documents for major systems, following are the numbers and sizes of documents
for a systems software application of 25,000 function points, such as a central office
switching system (Table A.9).
Note that document completeness is also a problem for large systems.
Document completeness is inversely proportional to application size measured
in function points. For example, complete requirements and design documents
are only possible for small applications below about 500 function points in size.
Above that, applications grow during development and the larger they are the
more they grow. Documentation costs are the #2 cost driver for applications
larger than 10,000 function points. For military and defense projects, which
produce about three times the volume of paper as civilian projects, documen-
tation costs are the #1 cost driver. Agile projects have reduced documentation
costs, so that they are only the #4 cost driver, below coding. However for Agile
projects meetings and communication costs may be the #2 cost driver, replacing
paperwork costs.
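
A hedged sketch that interpolates total document pages from the four data points quoted above; the log-log interpolation is an assumption for illustration and does not reproduce SRM's actual sizing rules.

    import math

    # Interpolating document pages from the quoted anchor points.
    ANCHORS = [(10, 50), (100, 400), (1_000, 3_500), (10_000, 32_500)]

    def estimated_document_pages(size_fp):
        if size_fp <= ANCHORS[0][0]:
            return ANCHORS[0][1]
        for (s1, p1), (s2, p2) in zip(ANCHORS, ANCHORS[1:]):
            if size_fp <= s2:
                fraction = (math.log(size_fp) - math.log(s1)) / (math.log(s2) - math.log(s1))
                return round(math.exp(math.log(p1) + fraction * (math.log(p2) - math.log(p1))))
        return ANCHORS[-1][1]   # beyond 10,000 FP the quoted anchors run out

    print(estimated_document_pages(2_500))   # roughly 8,500 pages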

Duplicate Defect Reports


For commercial and open-source software with hundreds or thousands of users, it
often happens that bugs are reported by more than one customer. These are called
duplicate defects. Actual quality is based on valid unique defects that exclude dupli-
cates. However, duplicate defects, if there are many of them, can add considerable
expense to maintenance costs. Duplicate defects still need to be logged and exam-
ined. It may also be necessary to notify clients of the receipt of the defect reports.

Table A.9 Software Document Sizes for 25,000 Function Points


Document             Pages      Words          Percent Complete (%)

Requirements 4,936 1,974,490 61.16

Architecture 748 299,110 70.32

Initial design 6,183 2,473,272 55.19

Detail design 12,418 4,967,182 65.18

Test plans 2,762 1,104,937 55.37

Development plans 1,375 550,000 68.32

Cost estimates 748 299,110 71.32

User manuals 4,942 1,976,783 80.37

HELP text 4,965 1,986,151 81.37

Courses 3,625 1,450,000 79.85

Status reports 3,553 1,421,011 70.32

Change requests 5,336 2,134,284 66.16

Bug reports 29,807 11,922,934 76.22

Each duplicate defect can take from 5 minutes to 15 minutes of work. Individually
this is not very significant, but if an application receives 10,000 duplicate defects
the costs can be significant.
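
Quick arithmetic for the duplicate-handling cost mentioned above, assuming 5 to 15 minutes per duplicate and an assumed loaded support rate of $75 per hour.

    # Cost of logging and examining duplicate defect reports (assumed rate).

    duplicates = 10_000
    minutes_low, minutes_high = 5, 15
    hourly_rate = 75            # assumed loaded support cost per hour

    low  = duplicates * minutes_low  / 60 * hourly_rate
    high = duplicates * minutes_high / 60 * hourly_rate
    print(f"Duplicate handling cost: ${low:,.0f} to ${high:,.0f}")
    # Roughly $62,500 to $187,500 for 10,000 duplicates.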

Earned Value Measurements


Earned value is a formal method accompanied by charts and proce-
dures for combining scope, progress, costs, and remaining work in a consistent manner.
Earned value analysis (EVA) is frequently used on government and defense software
contract projects, but not so much in the civilian sector. Earned value measurements
(EVM) originated for Federal government projects in the 1960s. The major com-
ponents of EVM include a development plan, a valuation of planned work, and
predefined earning rules for completed work, often linked to payments to contrac-
tors. Assume a project is scheduled for one year and a budget of U.S. $1,000,000.
At six months, supposedly 50% of the work would be done. If it is noted that only
30% of the work is done but 50% of the budget is gone, there is a problem that
needs to be addressed. EVM is a complex system with many formal definitions and
calculations. One issue is that EVM does not include software quality, which can

cause trouble for software projects because schedule slippage is most severe during
testing due to having more bugs than anticipated.
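
The one-year, $1,000,000 example above expressed with the standard earned value quantities; the estimate-at-completion line uses the common BAC/CPI formula.

    # Standard earned value quantities applied to the example in the text.

    budget_at_completion = 1_000_000
    planned_value = 0.50 * budget_at_completion   # 50% of work scheduled by month 6
    earned_value  = 0.30 * budget_at_completion   # only 30% of work actually done
    actual_cost   = 0.50 * budget_at_completion   # 50% of budget already spent

    schedule_performance_index = earned_value / planned_value   # 0.6
    cost_performance_index     = earned_value / actual_cost     # 0.6
    estimate_at_completion     = budget_at_completion / cost_performance_index

    print(schedule_performance_index, cost_performance_index)
    print(f"Estimate at completion: ${estimate_at_completion:,.0f}")  # about $1,666,667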

Enhancement Metrics
As of 2016, there are more enhancement projects for legacy applications than
there are new development projects. Enhancements are more difficult to estimate
and measure than new software development. This is because the size, structure,
and understanding of the legacy software interact with the enhancement itself.
Assume that a new small project of 100 function points is developed. This might
require a total of 12 work hours per function point. Now assume that a 100 func-
tion point enhancement is being made to a well-structured, well-documented
legacy application of 1,000 function points. As the architecture and design issues
were solved by the legacy application, the enhancement might only require 11
work hours per function point. Now assume that a 100 function point enhance-
ment is to be made to a large system of 10,000 function points with high cyc-
lomatic complexity and missing documentation. In this case, digging into the
legacy code and the need to carry out major regression testing might raise the
effort to 14 work hours per function point. As can be seen, estimating and mea-
surement needs to include both the new enhancement and the legacy application.
For measurements, both the specific enhancement needs to be measured and also
the cumulative total cost of ownership (TCO) for the updated legacy applica-
tion. In other words, two sets of measures are needed for enhancements. Several
commercial parametric estimation tools such as the author’s SRM can predict
enhancements, but need input data about the size and decay of the legacy applica-
tion. See also entropy.
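
The three scenarios above, expressed as a small table in code; the hours per function point come directly from the text.

    # Effort for a 100 FP change depends on the condition of the legacy application.

    scenarios = {
        # scenario: work hours per function point (from the text)
        "new 100 FP project":                                   12,
        "100 FP enhancement, well-structured 1,000 FP legacy":  11,
        "100 FP enhancement, decayed 10,000 FP legacy":         14,
    }

    CHANGE_SIZE_FP = 100
    for scenario, hours_per_fp in scenarios.items():
        print(f"{scenario}: {CHANGE_SIZE_FP * hours_per_fp} work hours")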

Entropy
The concept of entropy is not a software concept but a basic fact of physics. All
systems and natural objects tend to have an increase in disorder over time, which is
called entropy. This is why we age and why stars turn into supernovae. For software,
entropy is observed in a gradual increase in cyclomatic and essential complexity
over time, due to the structural damages caused by hundreds of small changes over
long time periods. It is possible to reverse entropy by restructuring or refactoring
software, but this is expensive and unreliable if done manually for large systems.
Automated restructuring tools exist, but they only support a few languages and are of
uncertain effectiveness. Entropy needs much more study and more direct measure-
ments. As entropy is associated with all human artifacts and also with all natural
systems, it is a fundamental fact of nature.

Error-Prone Modules
In the early 1970s, IBM undertook an interesting study of the distribution of bug
reports in a number of major software projects including operating systems, com-
pilers, database products, and others. One of the most important findings was
that bugs were not randomly distributed through all modules of large systems, but
tended to clump in a few modules, which were termed error-prone modules (EPM).
For example, 57% of customer-reported bugs in the IMS database application were
found in 32 modules out of a total of 425 modules. More than 300 IMS modules
had zero-defect reports. Other companies replicated these findings and EPM are
an established fact of large systems. Two common causes for EPM have been noted:
(1) high levels of cyclomatic complexity and (2) bypassing or skimping on inspec-
tions, static analysis, and formal testing. In theory EPM can be avoided by proper
quality control, but even now in 2014 they tend to be far too common in far too
many large applications.

Enterprise Resource Planning Metrics


Enterprise resource planning (ERP) refers to a class of major software applica-
tions such as SAP and Oracle that attempt to provide an integrated solution
to corporate data needs by replacing older legacy applications with a suite of
tools for accounting, marketing, manufacturing, customer resource planning,
and other common business activities. ERP packages are large and some top
250,000 function points in size. ERP installation and deployment tend to be
troublesome and routinely take longer and cost more than planned. Some of
the ERP topics that need to be measured include training of personnel, instal-
lation cost, and data migration costs to the ERP package from older software,
and new applications and enhancements to existing applications. Also the ERP
packages themselves require extensive customization. The SAP and Oracle ERP
packages use a performance metric based on RICE objects. The term RICE stands
for reports, interfaces, conversions, and enhancements. A RICE object is a kind
of project that needs cost and schedule estimation. Function points can also be
used for ERP planning.

Essential Complexity
This metric was also developed by Tom McCabe in 1976 and is a variation on his
more famous cyclomatic complexity metric. Note that Fred Brooks also uses the
term in a different context as the minimum set of factors in large complex problems.
The McCabe form of essential complexity is derived from cyclomatic complexity,
but it replaces well-structured control sequences with a single statement. If a code
section has a cyclomatic complexity of 10 but includes well-structured sequences
then essential complexity might be only 3 or 4.

Experience
The author’s benchmark collection method and the SRM tool use experience for
a number of occupations, including client experience, software engineer experi-
ence, tester experience, software quality assurance experience, project management
experience, customer support experience, and several others. Experience is ranked
on a subjective scale of 1–5: 1 = expert; 2 = above average; 3 = average; 4 = below
average; 5 = inexperienced.
As can be seen, results are slightly asymmetrical. Top teams are about 30% more
productive than average, but novice teams are only about 15% lower than average.
The reason for this is that normal corporate training and appraisal programs tend
to weed out the really unskilled personnel so that they seldom become actual team
members. The same appraisal programs reward the skilled, so that explains the fact
that the best results have a longer tail.
Software is a team activity. The ranges in performance for specific individu-
als can top 100%. But there are not very many of these super stars. Only about
5%–10% of general software populations are at the really high level of the perfor-
mance spectrum.
Individual practitioners can vary in performance by more than 10 to 1, but soft-
ware is normally a team event. In any case top performers are rare and bottom per-
formers are usually terminated, so average performance is the norm with a weight
on the high side. Also, bad management tends to slow down and degrade the per-
formance of top technical personnel, some of whom quit their jobs as a result.
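
A hedged sketch that turns the 1-5 experience score into a productivity adjustment, pinning +30% at expert and -15% at inexperienced as stated above; the linear interpolation between those points is an assumption for illustration.

    # Experience score (1 = expert ... 5 = inexperienced) to productivity
    # adjustment. Linear interpolation is assumed for illustration only.

    def productivity_adjustment(score):
        if not 1.0 <= score <= 5.0:
            raise ValueError("score must be between 1.0 and 5.0")
        if score <= 3.0:
            return 0.30 * (3.0 - score) / 2.0    # expert side, up to +30%
        return -0.15 * (score - 3.0) / 2.0       # novice side, down to -15%

    for s in (1.0, 2.25, 3.0, 4.0, 5.0):
        print(s, f"{productivity_adjustment(s):+.0%}")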

Expert Estimation
The term expert estimation refers to manual software estimates by human beings as
opposed to using a parametric estimation tool such as COCOMO II, CostXpert,
KnowledgePlan, SEER, SRM, SLIM, or TrueCost. A comparison of 50 manual
estimates and 50 parametric estimates by the author found that below 250 function
points manual estimates and parametric estimates were almost identical. As applica-
tion size increased, manual estimates became progressively optimistic and predicted
shorter schedules and lower costs than what actually occurred. Above 5,000 func-
tion points, manual estimates even by experts tended to be hazardous and excessively
optimistic by more than 35%. This is not surprising because the validity of historical data is also poor for large systems above 5,000 function points, due to leakage of major cost elements such as unpaid overtime, management, and specialists. See also
parametric estimation later in this report.

Failing Project (Definition)


All of the author’s software books since 1991 and every issue of the annual Standish
report deal with failing projects. What does failure mean in a software context?
The definition used by the author for project failure is: “software that is terminated
without delivery due to errors, delays, or cost overruns or software whose development
company is sued for breach of contract after delivery for excessive errors.” See also the
definition for successful software later in this report. In between success and fail-
ure are thousands of projects that finally get released but are late and over budget
and probably have too many bugs after delivery. That is the modus operandi for software circa 2014. Another cut at a definition of failing projects would be projects in the lowest 15% in terms of quality and productivity rates from the benchmark collections of companies such as Namcook Analytics, Q/P Management Group, Software Productivity Research, and others.

Failure Modes and Effect Analysis


This methodology was developed in the 1950s for examining hardware failures
but has also been applied to software. It is not as common for software as root-
cause analysis that is discussed later in this report. Failure modes and effect analysis
(FMEA) is an inductive approach that works backward from specific failures and
identifies earlier conditions that led to them. FMEA can work all the way back
to development and even design mistakes. It is a common approach for hardware
devices and less common for software. FMEA also includes criticality analysis
(CA). FMEA can be used in two directions: (1) predictive mode for analyzing the
risks of future failures and (2) analytical mode for examining actual failures that
have occurred. Due to the complexity of the method a Google search is recom-
mended to bring up the relevant literature.

Failure Rate
The term failure is defined by the author as a software project that is terminated
prior to being completed due to poor quality, negative ROI, or some other cause
that was self-inflicted by the development team. Projects that are terminated for
business reasons such as buying a commercial software application rather than
finishing an internal application are not failures. The topic of software failures
has a lot of publicity, due in part to the large number of failures included in
the Standish Report, produced by the Standish consulting group. However, that
report only covers information technology and does not include systems soft-
ware or commercial software, both of which have lower failure rates. Also, the
Standish report does not show failures by application size, which is a serious
omission. The author’s data on project failure rates by size is as follows: The prob-
ability of a software project failing and not being completed is proportional to
the cube root of the size of the software application using IFPUG function points
with the results expressed as a percentage. For 1,000 function points the odds
are about 8%; for 10,000 function points the odds are about 16%; for 100,000
function points the odds are about 32%. These rules are not perfect, but are based
on observations taken from about 20,000 software projects of all sizes and types
including Web applications, smart phones, systems software, medical devices,
and military projects.
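
A minimal sketch of the size-to-failure-probability rule quoted above follows; it is anchored on the stated values (about 8% at 1,000 function points, doubling for each tenfold increase in size), and the continuous interpolation between those points is an assumption:

# Sketch: approximate project failure probability by size in IFPUG function points,
# anchored to the figures in the text (8% at 1,000 FP, 16% at 10,000 FP, 32% at 100,000 FP).
import math

def failure_probability(function_points):
    decades_above_1000 = math.log10(function_points / 1000.0)
    return 0.08 * (2.0 ** decades_above_1000)

for size in (1_000, 10_000, 100_000):
    print(size, round(failure_probability(size) * 100, 1), "%")
# 1000 8.0 %, 10000 16.0 %, 100000 32.0 %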

False Positive
The term false positive refers to misidentifying a code sequence as being incorrect,
when in fact it is correct. This metric can occur with testing and inspections,
but is most widely used with static analysis tools some of which may have more
than 10% false positives. False positives are annoying, but it is probably safer to
have a few false positives than to miss real bugs. Every form of defect removal is
less than 100% in removal efficiency and produces at least a few false positives.

Feature Bloat
This term is not ordinarily quantified, but is a subjective statement that many soft-
ware packages have features that are in excess of the ones truly needed by the vast
majority of users. For example, the author of this chapter has written 16 books and
hundreds of articles with Microsoft Word, but probably has used less than 15% of
the total feature set available in Microsoft Word. This is not to say that the features
in Word are useless, but there are so many of them that few authors ever use the
majority of available features in either Word or Excel. In theory, feature bloat could
be measured with function points and the newer SNAP metric for nonfunctional
size. However, there is a logical inconsistency: function points are defined in terms of user benefits, whereas feature bloat is considered to have little or no benefit. This might be resolved by creating a bloat point metric that would be counted like function points but assume zero user benefit and perhaps zero business value as well. Feature bloat is basically a subjective opinion and not a truly measurable attribute.

Fixed Costs
In a manufacturing process, the term fixed cost refers to a cost that stays constant
no matter how many products are built per month. A prime example of a fixed cost
would be the rent paid for a software office building. For software applications many
costs are not fixed in the classic sense of being constants, but they are inelastic and
stay more or less the same. For example, requirements and design are likely to stay
more or less the same regardless of what coding language is used. After release of soft-
ware, companies will have maintenance personnel standing by to fix bugs regardless
of how many bugs are reported by users. Assume a company has a full-time main-
tenance programmer standing by at a cost of U.S. $10,000 per month. Now assume
that project A has 10 bug reports in the first month of use. The cost per defect for
project A will be U.S. $1,000. Now assume that next month the same maintenance
programmer is standing by for project B, which only has 1 bug. Now the cost per
defect will be U.S. $10,000 for project B. Fixed and variable costs need to be ana-
lyzed for software, and especially for quality work. See also variable costs later in this
report. See also burden rate earlier in this report for another look at fixed costs.
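
A minimal sketch of the arithmetic in the example above follows, showing why a fixed monthly cost makes cost per defect rise as quality improves while cost per function point does not; the 1,000 function point application size is a hypothetical value added for illustration:

# Sketch: a fixed maintenance cost divided by defect counts versus by function points.
FIXED_MONTHLY_COST = 10_000.0   # one maintenance programmer standing by (from the text)
APPLICATION_SIZE_FP = 1_000.0   # hypothetical size used for the per-FP view

def cost_per_defect(defects_reported):
    return FIXED_MONTHLY_COST / defects_reported

def cost_per_function_point():
    return FIXED_MONTHLY_COST / APPLICATION_SIZE_FP

print(cost_per_defect(10))        # 1000.0 -> project A looks "cheap" per defect
print(cost_per_defect(1))         # 10000.0 -> project B looks "expensive" per defect
print(cost_per_function_point())  # 10.0 in either month: the fixed cost has not changed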

Function Points
In the late 1960s and early 1970s, the number of programming languages used
inside IBM expanded from assembly to include COBOL, FORTRAN, PL/I, APL,
and others. It was found that LOC metrics penalized high-level languages and did
not encompass requirements and design work. IBM commissioned Al Albrecht
and his colleagues in IBM White Plains to develop a metric that could include
all software activities and was not based on source code. The results were func-
tion point metrics that were developed circa 1975. In 1978 at a joint conference by
Share, Guide, and IBM, Albrecht presented function points to the outside world.
Function points started to be used by IBM customers and in 1984 the IFPUG
was formed in Montreal, and later moved to the United States. Function point
metrics are the weighted combination of inputs, outputs, inquiries, logical files,
and interfaces adjusted for complexity. In today’s world of 2014 function points
are supported by an ISO standard and by certification exams. Function points are
probably the most widely used software metric in the world and have more bench-
mark data than all other metrics put together. There are a number of alternative
methods of counting function points discussed later in this report under the topic
of function point variations.
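
A minimal sketch of an unadjusted function point count follows, using the commonly published IFPUG "average complexity" weights; the weights and the element counts are illustrative assumptions, and a certified count also classifies each element as low, average, or high complexity:

# Sketch: unadjusted IFPUG-style function point total from the five element types,
# using the commonly published average-complexity weights (illustrative only).
AVERAGE_WEIGHTS = {
    "external_inputs": 4,
    "external_outputs": 5,
    "external_inquiries": 4,
    "internal_logical_files": 10,
    "external_interface_files": 7,
}

def unadjusted_function_points(counts):
    return sum(AVERAGE_WEIGHTS[element] * n for element, n in counts.items())

example_counts = {
    "external_inputs": 20,
    "external_outputs": 15,
    "external_inquiries": 10,
    "internal_logical_files": 8,
    "external_interface_files": 4,
}
print(unadjusted_function_points(example_counts))  # 303 unadjusted function points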

Function Points per Month


The metric function points per month is one of two common productivity metrics
based on function points. The second common metric is work hours per function
point. The two are mathematically equivalent but can produce very different results
on a global basis. The number of effective work hours per month varies from coun-
try to country. In the United States, the average number of work hours per month is
132; in China 186; in Sweden 126; in Iceland 120; in India 190, and so forth. This
means that a software project that requires 132 hours of work will take one calen-
dar month in the United States but only about three weeks in India and about five
weeks in Iceland. Estimating tools such as SRM that support global estimates and
produce both function points per month and work hours per function point need
to be sensitive to global work patterns.
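
A minimal sketch of how the two productivity metrics relate through national work patterns follows, using the monthly work-hour figures quoted above; the 500 function point, 4,000 work-hour project is a hypothetical example:

# Sketch: work hours per function point versus function points per month,
# using the monthly work-hour figures quoted in the text.
WORK_HOURS_PER_MONTH = {"United States": 132, "China": 186, "Sweden": 126,
                        "Iceland": 120, "India": 190}

def productivity(project_fp, total_work_hours, country):
    hours_per_fp = total_work_hours / project_fp
    fp_per_staff_month = WORK_HOURS_PER_MONTH[country] / hours_per_fp
    return hours_per_fp, fp_per_staff_month

# Hypothetical project: 500 function points and 4,000 work hours of effort.
for country in ("United States", "India", "Iceland"):
    hours_per_fp, fp_per_month = productivity(500, 4_000, country)
    print(country, hours_per_fp, "hours per FP;", round(fp_per_month, 2), "FP per month")
# Hours per FP are identical everywhere; FP per month differs with national work patterns.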

Function Point Variations


Not long after function points were released by IBM and taken over by the IFPUG,
several researchers claimed that IFPUG function points were not accurate for sys-
tems software or for other kinds of software outside traditional information tech-
nology. The results were a series of alternate counting rules for function points that
produced different results than IFPUG counts. The first of these variations was
the Mark II function point in the United Kingdom. It is an interesting sociologi-
cal phenomenon that all of the variants produce larger counts than IFPUG; none
produce smaller counts. The reason is that the inventors of the variants thought that
some kinds of software were harder and more complex than information systems.
Although this may be true, the difficulty could be handled by the fact that these
non-IT applications required more work hours. It was not necessary to puff up the
function point counts, but this is what happened. As a result in 2014 there are many
counting rules for function points in addition to IFPUG counting rules. Among
the variations are COSMIC function points, engineering function points, feature
points, FISMA function points, function points light, Mark II function points,
NESMA function points, and unadjusted function points. In fact, new variations
are occurring quite often so the list of function point variations is growing larger.
The author’s SRM tool produces size data in a total of 23 of these alternate metrics.
However, for expressing productivity and quality data, SRM uses IFPUG func-
tion points as the standard metric. As of 2014 IFPUG function points have more
benchmark data than all of the variants put together. It is mildly surprising that
so much energy goes into function point variations when the underlying historical
data leaks and is often less than 50% complete. The author’s opinion is that these
function point variations only make the waters muddy and make function points
less useful and less consistent than they should be. For unknown reasons human
beings seem to like creating metric variations, so we have statute and nautical miles,
Fahrenheit and Celsius, British Imperial gallons and U.S. gallons, three methods of
calculating gasoline octane ratings, and many other examples of multiple metrics
for the same topics.

Gantt Charts
The phrase Gantt chart refers to a graphical method for showing overlapped project
schedules first developed in 1910 by Henry Gantt. These charts are of course much
older than software and used by many industries. However, they are also widely
used for software projects because they show that true waterfalls are uncommon.
Instead software projects normally start a new activity before the prior activity is
finished. Thus design starts before requirements are finished; coding starts before
design is finished, and so forth. A Gantt chart consists of horizontal bars showing
time lines for activities as shown below using a simple example.

Requirements       **********
Design                 **********
Coding                      **********
Testing                           **********
Documentation               **********
Quality assurance              **********
Management         *********************

A variety of software project management tools and software parametric estimation tools produce Gantt charts as standard outputs, and some also support PERT diagrams. A Gantt chart is a useful visual aid for understanding schedule duration and schedule overlaps among adjacent activities.

Generalists versus Specialists


In the early days of software, programmers or engineers handled requirements,
design, coding, and testing. As applications grew larger, specialists appeared. Today
in 2014, there are a total of 126 occupation groups. Some organizations prefer the
generalist approach with few specialists; others prefer the specialist approach with business analysts, programmers, test specialists, quality assurance specialists, and so forth. Empirical data indicate that the generalist approach tops out below 1,000 function points and becomes hazardous above that size. For example, certified test specialist per-
sonnel are about 5% more efficient in finding bugs in each of the test stages of func-
tion test, regression test, performance test, and system test. To use a nonsoftware
analogy, generalists can build small boats and canoes, but if you need to build an
80,000 ton cruise ship you will need dozens of special skills.

Goal-Question Metrics
The phrase goal-question metric (GQM) refers to a fairly new and general way of
measurement developed by Dr. Victor Basili of the University of Maryland, with
contributions by Dr. David Weiss and Albert Endres of IBM. The GQM approach
is a general-purpose measurement method and not limited to software. It includes a
six-step process: (1) set a goal, (2) generate questions based on the goal, (3) specify
metrics, (4) develop data collection method, (5) collect and validate data, and (6) do
a postmortem on results. It is possible to use the GQM approach with standard
metrics such as function points and DRE. Indeed two useful goals for the software
industry are (1) to raise average productivity rates to 15 function points per staff
month and (2) to raise average DRE levels to 99%.

Good-Enough Quality Fallacy


As software managers are often poor at understanding software economics, it has become commonplace to think that software with significant bugs can be
released if it is able to work and perform most tasks. However, more sophisticated
software managers know that released bugs lower customer satisfaction and raise
support and warranty costs. Further, if software is developed properly using a com-
bination of defect prevention, pretest defect removal such as inspections and static
analysis, and formal testing by certified test personnel using mathematical test case
design, it can achieve >99% DRE and still be quicker than sloppy development
that only achieved 90% DRE or less. The good-enough fallacy is symptomatic of
inept management who need better training in software economics and software
quality control. Make no mistake: the shortest software schedules correlate with
the highest DRE levels and the lowest defect potentials. Software schedules slip
because there are too many bugs in software when testing starts. See also technical
debt discussed later in this chapter.

Governance
The financial collapse of Enron and other major financial problems, some of them partly blamed on software, led to the passage of the Draconian Sarbanes–Oxley law in the United States. This law is aimed at corporate executives and can bring criminal
charges against corporate executives for poor governance or lack of due diligence.
The term governance means constant oversight and due diligence by executives of
software and operations that might have financial consequences if mistakes are
made. A number of the measures discussed in this report are relevant to governance, including but not limited to cyclomatic complexity, defect origins, defect
severity, defect potentials, defect detection efficiency (DDE), DRE, delivered
defects, function point size metrics, and reliability.

Halstead Complexity
The metrics discussed in this topic were developed by Dr. Maurice Halstead in 1977
and deal primarily with code complexity, although they have more general uses.
Halstead set up a suite of metrics that included operators (verbs or commands) and
operands (nouns or data). By enumerating distinct operators and operands various
metrics such as program length, volume, and difficulty are produced. Halstead
metrics and cyclomatic complexity metrics are different but somewhat congruent. Today in 2014, Halstead complexity is less widely used than cyclomatic
complexity.
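
A minimal sketch of the standard Halstead calculations from operator and operand counts follows; the counts in the example are hypothetical:

# Sketch: Halstead software-science metrics from operator and operand counts.
import math

def halstead_metrics(n1, n2, N1, N2):
    """n1/n2 = distinct operators/operands; N1/N2 = total operators/operands."""
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2.0) * (N2 / n2)
    effort = difficulty * volume
    return {"vocabulary": vocabulary, "length": length, "volume": round(volume, 1),
            "difficulty": round(difficulty, 1), "effort": round(effort, 1)}

# Hypothetical counts for a small routine.
print(halstead_metrics(n1=12, n2=20, N1=60, N2=45))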

Historical Data Leakage


Leakage from historical data is an endemic problem of the software industry.
Leakage has the effect of making both quality and productivity look better than
they really are. Leakage was first noted by the author in the 1970s. The most
common omissions from historical productivity data include unpaid overtime,
project management, user costs, and the work of part-time specialists such as
quality assurance, technical writers, business analysts, Agile coaches, project
office staff, and many more. Leakage is worse for projects created via cost centers
than via profit centers. Quality data leakage is also severe and includes omitting
bugs in requirements and design, omitting bugs found by unit test, omitting
bugs found by static analysis, and omitting bugs found by developers themselves.
At IBM, there were volunteers who reported unit test and self-discovered bugs
in order to provide some kind of statistical knowledge of these topics. Among
the author’s clients, overall cost data for cost-center projects average about 37%
complete. Quality data averaged only about 24% complete. Projects developed
under time and material contracts are more accurate than fixed-price contracts.
Projects developed by profit centers are more accurate than projects developed
by cost centers.

Incidents
The term incident is used in maintenance measurement and estimation. It is
a complex term that combines many factors such as bug reports, help requests,
and change requests that may impact software applications after they have been
released. SRM estimates and measures software maintenance incidents, which
include the following:

Customer help requests
Customer defect reports
High-severity defect reports
Customer change requests
Mandatory changes
Security flaws removed
Cyber attacks: prevented
Cyber attacks: successful
Invalid defect reports
Duplicate defect reports
Bad fixes (new bugs in bug repairs)
Reopened defect reports
TOTAL INCIDENTS

Industry Comparisons
Software is produced essentially by every industry in the world. There is little pub-
lished data that compares software quality and productivity across industry lines.
From the author’s data collection of about 26,000 projects, the high-technology
industries that manufacture complex physical equipment (medical devices, avionics,
and embedded applications) have the best quality. Banks and insurance companies
have the best productivity. One of the virtues of function point metrics is the ability to make direct comparisons across all industries.
The U.S. Department of Commerce and the Census Bureau have developed an
encoding method that is used to identify industries for statistical purposes called
the North American Industry Classification (NAIC). Refer to the NAIC code
discussed later in this document for a description.

Inflation Metrics
Over long periods of time wages, taxes, and other costs tend to increase steadily. This
is called inflation and is normally measured in terms of a percentage increase. For
software, inflation rates play a part in large systems that take many years to develop.
They also play a part in long-range legacy application maintenance. Inflation also
plays a part in selection of outsource countries. For example, in 2014 the inflation
rates in China and India are higher than in the United States, which will eventually
erode the current cost advantages of these two countries for outsource contracts.

International Comparisons
Software is developed in every known country in the world. This brings up the question of which methods are effective for comparing productivity and quality across national boundaries. Some of the factors that have international impacts include: (1) average
compensation levels for software personnel by country, (2) national inflation rates,
(3) work hours per month by country, (4) vacation and public holidays by country,
(5) unionization of software personnel and local union regulations, (6) probabilities of
strikes or civil unrest, (7) stability of electric power supplies by country, (8) logistics such
as air travel, (9) time zones that make communication difficult between countries with
more than a 4 hour time difference, (10) knowledge of spoken and written English,
which are the dominant languages for software, and (11) intellectual property laws and
protection of patents and source code. Function point metrics allow interesting global
comparisons of quality and productivity that are not possible using other metrics.

Inspection Metrics
One of the virtues of formal inspections of requirements, design, code, and other
deliverables is the suite of standard metrics that are part of the inspection process.
Inspection data routinely include preparation effort, inspection session team size and effort, defects detected before and during inspections, defect repair effort after inspections, and calendar time for the inspections for specific projects. These data
are useful in comparing the effectiveness of inspections against other methods of
defect removal such as pair programming, static analysis, and various forms of test-
ing. To date, inspections have the highest levels of DRE (>85%) of any known form
of software defect removal.
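
A minimal sketch of how defect removal efficiency (DRE) is typically computed for a removal stage such as an inspection follows: defects the stage removed divided by the total later shown to have been present; the counts are hypothetical:

# Sketch: defect removal efficiency (DRE) for one removal stage such as an inspection.
def removal_efficiency(defects_removed_by_stage, defects_found_later):
    total_present = defects_removed_by_stage + defects_found_later
    return defects_removed_by_stage / total_present

# Hypothetical example: an inspection removes 87 defects; testing and early production
# use later find 13 more that were present when the inspection was held.
print(round(removal_efficiency(87, 13), 2))  # 0.87 -> 87% DRE for the inspection stage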

Invalid Defects
The term invalid defect refers to a bug report against a software application that, on examination, turns out not to be a true defect. Some of the common reasons for
invalid defects include: user errors, hardware errors, and operating system errors
mistaken for application errors. As an example of an invalid defect, a bug report
against a competitive estimation tool was sent to the author’s company by mistake.
Even though it was not our bug, it took about an hour to forward the bug to the
actual company and to notify the client of the error. Invalid defects are not true
defects but they do accumulate costs. Overall about 15% of reported bugs against
many software applications are invalid defects.

ISO/IEC Standards
This phrase is an amalgamation of the International Organization for Standardization, commonly abbreviated to ISO, and the International Electrotechnical Commission, commonly abbreviated to IEC. These groups have hundreds of standards covering essentially every industry. Some of the standards that are relevant to software include ISO/IEC 20926:2009 for IFPUG function points, the ISO/IEC 9126 quality standard, and the newer ISO 31000:2009 risk management standard. An issue for all ISO/IEC standards
is lack of empirical data that proves the benefits of the standards. There is no reason
to doubt that international standards are beneficial, but it would be useful to have
empirical data that shows specific benefits. For example, do the ISO quality and risk
standards actually improve quality or reduce risks? As of 2014 nobody knows. The
standards community should probably take lessons from the medical community
and include proof of efficacy and avoidance of harm as part of the standards creation
process. As medicine has learned from the many harmful side-effects of prescription
drugs, releasing a medicine without thorough testing can cause immense harm to
patients including death. Releasing standards without proof of efficacy and avoid-
ance of harmful side-effects should be a standard practice itself.

Kanban
Kanban is a Japanese method of streamlining manufacturing first developed by
Toyota. It has become famous under the phrase just in time. The Kanban approach
uses interesting methods for marking progress and showing when a deliverable
is ready for the next step in production. Kanban is used with software, but not
consistently. The Agile approach has adopted some Kanban ideas, as have other
methodologies. Quite a number of methods for quality control were first used in
Japan, whose national interest in quality is thousands of years old. Other Japanese
methods include quality circles, Kaizen, and Poka Yoke. Empirical data gathered
from Japanese companies indicate very high software quality levels, so the combi-
nations of Japanese methods have proven to be useful and successful in a software
context.

Kelvin’s Law of 1883


"If you cannot measure it, you cannot improve it." William Thomson became the first Baron Kelvin and is commonly known as Lord Kelvin. He was a mathematician and physicist with many accomplishments, including measuring absolute
zero temperature. His famous quotation is widely cited in the software literature
and is a primary incentive for striving for effective software metrics.

Key Performance Indicators


This term is applied to dozens of industries and technical fields including software.
The general meaning is progress toward a specific goal. This definition is congru-
ent with goal-question metrics and with rates of improvement discussed later in this
report. KPI can include both quantitative and qualitative information. KPI can
also be used in predictive and measurement modes. Due to the large scope of top-
ics and the large literature available, a Google search is recommended to bring up
recent documents on KPI. SEI assessments also include KPI.

KLOC
This term uses K to express 1,000 and LOC for lines of code. This is a metric that
dates back to the 1960s as a way of measuring both software size and also software
costs and defect densities. However, both KLOC and LOC metrics share common
problems in that they penalize high-level languages and make requirements and
design effort and defects invisible.

Language Levels
In the late 1960s and early 1970s, programming languages began a rapid increase in both number and power. By the mid 1970s,
more than 50 languages were in use. The phrases low level and high level were sub-
jective and had no mathematical rigor. IBM wanted to be able to evaluate the power
of various languages and so developed a mathematical form for quantifying levels.
This method used basic assembly language as the primary unit and assigned it level 1. Other languages were evaluated based on how many statements in basic
assembly language it would take to duplicate one statement in the higher-level lan-
guage. Using this method, both COBOL and FORTRAN were level 3 languages
because it took an average of three assembly statements to provide the features
of one statement in COBOL or FORTRAN. Later when function points were
invented in 1975, the level concept was extended to support function points and
was used for backfiring or mathematical conversion between code volumes and
function points. Here too basic assembly was the starting point, and it took about
320 assembly statements to be equivalent to one function point. Today in 2014,
tables of language levels are commercially available and include about 1,000 differ-
ent languages. For example, Java is level 6; objective C is level 12; PL/I is level 4; C
is level 2.5, and so forth. This topic is popular and widely used but needs additional
study and more empirical data to prove the validity of the assigned levels for each
language. Combinations of languages can also be assigned levels, such as Java and
HTML or COBOL and SQL.
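
A minimal sketch of the backfiring arithmetic implied above follows, taking 320 logical statements per function point at level 1 (as stated in the text) and dividing by the language level; published tables vary, so the ratios should be treated as rough approximations:

# Sketch: "backfiring" between logical code statements and function points using
# language levels; level 1 (basic assembly) is roughly 320 statements per FP.
LANGUAGE_LEVEL = {"basic assembly": 1.0, "C": 2.5, "COBOL": 3.0,
                  "PL/I": 4.0, "Java": 6.0, "Objective C": 12.0}
STATEMENTS_PER_FP_AT_LEVEL_1 = 320.0

def statements_per_function_point(language):
    return STATEMENTS_PER_FP_AT_LEVEL_1 / LANGUAGE_LEVEL[language]

def backfired_function_points(language, logical_statements):
    return logical_statements / statements_per_function_point(language)

print(round(statements_per_function_point("Java"), 1))    # about 53.3 statements per FP
print(round(backfired_function_points("Java", 53_000)))   # roughly 994 function points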

Lean Development
The term lean is a relative term that implies less body fat and a lower weight than
average. When applied to software, the term means building software with a smaller staff than normal, while hopefully not slowing down development or
causing harmful side effects. Lean manufacturing originated at Toyota in Japan,
but the concepts spread to software and especially to the Agile approach. Some of
the lean concepts include eliminate waste, amplify learning, and build as fast as
possible. A lean method called value stream mapping includes useful metrics. As
with many other software concepts, lean suffers from a lack of solid empirical data
that demonstrates effectiveness and lack of harmful side effects. The author’s clients
that use lean methods have done so on small projects below 1,000 function points,
and their productivity and quality levels have been good but not outstanding. As of
2014, it is uncertain how lean concepts will scale up to large systems in the 10,000
function point size range. However, TSP and RUP have proof of success for large
systems so lean should be compared against them.

Learning Curves
The concept of learning curves is that when human beings need to master a new
skill, their initial performance will be suboptimal until the skill is truly mastered.
This means that when companies adopt a new methodology, such as Agile, the first
project may lag in terms of productivity or quality or both. Learning curves have
empirical data from hundreds of technical fields in dozens of industries. However
for software, learning curves are often ignored when estimating initial projects
based on Agile, TSP, RUP, or whatever. In general, expect suboptimal performance
for a period of three to six months, followed by rapid improvement in performance after the learning period. Assume, for example, that average productivity using waterfall development is 6.00 function points per staff month and a company wants to adopt lean and Agile techniques. What might occur at three-month intervals could be: first quarter = 4.00 function points per staff month, second quarter = 5.00, third quarter = 6.00, fourth quarter = 8.00, and the next calendar year 10.00 function points per staff month. In other words, performance is low for the
first six months due to the learning curve; after that it improves and hits new highs
after 12 months.
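
A minimal sketch that restates the quarterly example above in code follows; the rates are the example figures, not a general model:

# Sketch: learning-curve effect after adopting a new methodology, restating the
# example above (baseline of 6.00 FP per staff month under waterfall).
BASELINE_FP_PER_MONTH = 6.00
QUARTERLY_RATES = [4.00, 5.00, 6.00, 8.00]   # first year after adoption
SECOND_YEAR_RATE = 10.00

for quarter, rate in enumerate(QUARTERLY_RATES, start=1):
    change = (rate / BASELINE_FP_PER_MONTH - 1.0) * 100
    print(f"Quarter {quarter}: {rate:.2f} FP/month ({change:+.0f}% vs. baseline)")
change = (SECOND_YEAR_RATE / BASELINE_FP_PER_MONTH - 1.0) * 100
print(f"Second year: {SECOND_YEAR_RATE:.2f} FP/month ({change:+.0f}% vs. baseline)")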

Lines of Code Metrics


Lines of code are probably the oldest metric for software and date back to the 1950s.
LOC metrics come in two distinct flavors: (1) physical lines and (2) logical code
statements. Of these two, physical lines are the easiest to count but the least accurate in terms of how developers think about programs. Physical lines can include blanks between paragraphs and also comments, neither of which has any bearing on the code.
Logical statements deal with executable commands and data definitions, which are
the things programmers consider when writing code. However, both physical and
logical code still penalize high-level languages and make requirements and design
invisible. A study by the author of software journals such as IEEE Software, the
IBM Systems Journal, Crosstalk, Cutter, and so on found that about one third of
published articles used physical LOC; one third used logical code statements; and
the remaining third just used LOC without specifying either physical or logical.
There can be as much as a 500% difference between counts of physical and logical
code. The inconsistent use of logical and physical LOC in the software literature is
symptomatic of the sloppy measurement practices of the software community.

Maintenance Metrics
The term maintenance is highly ambiguous. No fewer than 23 different kinds of
work are subsumed under the single term maintenance. Some of these forms of
maintenance include defect repairs, refactoring, restructuring, reverse engineering,
reengineering of legacy applications, and even enhancements or adding new fea-
tures. For legal reasons, IBM made a rigorous distinction between maintenance in
the sense of defect repairs and enhancements or adding new features. A court order
required IBM to provide maintenance information to competitors, but the order
did not define what the word maintenance meant. A very useful metric for main-
tenance is to use function point metrics for the quantity of software one mainte-
nance programmer can keep up and running for one year. The current average is
about 1,500 function points. For very well structured software, the maintenance
assignment scope can top 5,000 function points. For very bad software with high
complexity and convoluted paths, the maintenance assignment scope can drop below
500 function points. Other metrics in the maintenance field include the number of
clients one telephone support person can handle during a typical day (about 10) and
the number of bugs that a maintenance programmer can fix per month (from 8 to 12).
In spite of the complexity of maintenance, the tasks of maintenance, customer sup-
port, and enhancement can be measured and predicted fairly well. This is important
because in 2014 the world population of software maintenance personnel is larger
than the world population of software development personnel.
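
A minimal sketch of staffing arithmetic based on the maintenance assignment scope figures quoted above follows; the 30,000 function point portfolio is a hypothetical value:

# Sketch: maintenance staffing from "maintenance assignment scope", the function
# points one maintenance programmer can keep running for a year (figures from the text).
import math

ASSIGNMENT_SCOPE_FP = {"well structured": 5_000, "average": 1_500, "poorly structured": 500}

def maintenance_staff_needed(portfolio_size_fp, structure="average"):
    return math.ceil(portfolio_size_fp / ASSIGNMENT_SCOPE_FP[structure])

# Hypothetical legacy portfolio of 30,000 function points.
for structure in ASSIGNMENT_SCOPE_FP:
    print(structure, maintenance_staff_needed(30_000, structure), "maintenance programmers")
# well structured 6, average 20, poorly structured 60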

Measurement Speed and Cost


A topic of some importance is how easy or difficult it is to use specific metrics and
measure useful facts about software. Manual methods are known to be slow and
costly. For example, manual function point counts only proceed at a rate of about
500 function points per day. At a consulting cost of U.S. $3,000 per day that
would mean it costs U.S. $6.00 for every function point counted. (The author has
filed a U.S. patent application on a high-speed early sizing method that can predict
function points and other metrics in an average time of 1.8 minutes per project
regardless of the size of the application. This is a standard feature of the author’s
SRM tool.) Collecting manual benchmark data by interviewing a development
team takes about three hours per project. Assuming four development personnel
and a manager are interviewed, the effort would be 15 staff hours for the develop-
ment group and 3 consulting hours: 18 hours in total. At average hourly costs, benchmark data collection would run about U.S. $2,250 per project. By contrast, self-reported data can be gathered for about half of that. Automated tools for high-
speed function point analysis, for cyclomatic complexity, and for code counting
are all available but to date none have published speed and cost data. However,
the topics of measurement speed and measurement costs are under reported in the
software literature and need more work.
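
A minimal sketch of the counting-cost arithmetic quoted above (about 500 function points counted per day at roughly U.S. $3,000 per consulting day) follows; the 10,000 function point application is a hypothetical example:

# Sketch: cost of manual function point counting at the rates quoted in the text.
COUNTING_RATE_FP_PER_DAY = 500
CONSULTING_COST_PER_DAY = 3_000.0

def manual_counting_cost(application_size_fp):
    days = application_size_fp / COUNTING_RATE_FP_PER_DAY
    cost = days * CONSULTING_COST_PER_DAY
    return days, cost, cost / application_size_fp

# Hypothetical 10,000 function point application.
days, total_cost, cost_per_fp = manual_counting_cost(10_000)
print(days, "days;", total_cost, "dollars;", cost_per_fp, "dollars per FP counted")
# 20.0 days; 60000.0 dollars; 6.0 dollars per function point counted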

Meetings and Communications


One of the major cost drivers for software projects is that of meetings and communications. Between 12% and about 20% of software development costs take the form of meetings with customers, team meetings, or meetings between managers and other managers. If travel is included for international projects, the percentages can be even higher. Agile projects have cut down on document costs compared to ordinary projects, but have increased meeting and communication costs. Unless both documents and meetings are measured, which is usually not the case, it is hard to see which approach is the best. A typical pattern of meetings for a software project of 2,500 function points is shown using the SRM tool (Table A.10).

Table A.10 Software Meetings for 2,500 Function Points
SRM Estimates for Meetings and Communications

Meeting Events | Number of Meetings | Attendees | Total Meeting Hours | Costs | $ per Function Point
Conference calls | 25 | 7 | 178 | $13,509 | $5.40
Client meetings | 6 | 8 | 186 | $14,066 | $5.63
Architecture/design meetings | 5 | 7 | 158 | $11,949 | $4.78
Team technical meetings | 59 | 8 | 1,976 | $149,665 | $59.87
Team status meetings | 78 | 14 | 2,750 | $208,308 | $83.32
Executive status meetings | 6 | 7 | 191 | $14,461 | $5.78
Problem analysis meetings | 8 | 10 | 530 | $40,120 | $16.05
Phase reviews | 3 | 15 | 362 | $27,435 | $10.97

Meeting FP per staff month: 52.17
Meeting work hours per function point: 2.53
Percent (%) of development costs: 12.33

It is easy to see why meetings and communications are an important software cost
driver. However, they are seldom measured or included in benchmark reports, even though they may rank high among total cost drivers.

Methodology Comparison Metrics


A basic purpose of software metrics should be to compare the results of various
methodologies such as Agile, extreme programming, pair programming, waterfall,
RUP, TSP, Prince2, Merise, Iterative, and in fact all 35 named methodologies. The
only current metric that is useful for side-by-side comparisons of methodologies is
the function point metric. LOC does not measure requirements and design and
penalizes high-level languages. Story points do not measure projects without user
stories. Use-case points do not measure projects without use-cases. Function points
measure everything.

Methodology Validation before Release


In medicine and some engineering fields, before a new therapy can be released
to the public, it must undergo a series of tests and validation exercises to ensure
that it works as advertised and does not have serious problems or cause harm.
For software methodologies it would be useful to include a validation phase
before releasing the methods to the world. IBM did validate function point met-
rics and formal inspections and also the RUP. Methods that seem to have been
released without much in the way of formal validation include Agile develop-
ment and pair programming. Now that both have been in use for a number of
years, Agile seems to be effective below 1,000 function points for projects with
limited numbers of users, some of whom can participate directly. Agile is not
yet effective for large systems above 10,000 function points or for projects with
millions of users. Agile also has problems with software that needs FDA or FAA
certification due in part to the huge volumes of paper documents required by
Federal certification. Methodologies, like prescription medicines, should come
with warning labels that describe proper use and include cautions about possible
harmful consequences if the methodology is used outside its proven range of
effectiveness. Current methods that need validation, proof of success, and evidence of a lack of harmful side effects include pair programming, which is intrinsically expensive, and lean development, which is useful for hardware but still not validated for software.

Metrics Conversion
With two different forms of LOC metrics, more than a dozen variations of function point metrics, plus story points, use-case points, and RICE objects, one might think that metrics conversion between the various metrics would be sophisticated and supported by both commercial and open-source tools, but this is not the case. In the author's view it is the responsibility of a metric inventor to provide
conversion rules between a new metric and older metrics. For example, it is NOT
the responsibility of the IFPUG to waste resources deriving conversion rules for
every minor variation or new flavor function point. As a courtesy, the author’s
SRM tool does provide conversions between 23 metrics, and this seems to be
the largest number of conversions as of 2016. There are also narrower published conversions between COSMIC and IFPUG function points. However, metrics conversion is a very weak link in the chain of software measurement techniques.
Examples of metrics conversion are shown below for an application of a nominal
1,000 IFPUG function points. These are standard outputs from the author’s tool
(Table A.11).
In the author’s opinion software has too many metrics, too many variations of
similar metrics, and a serious shortage of accurate benchmark data based on valid
metrics and activity-based costs.

Table A.11 Variations in Software Size Metrics

Alternate Metrics | Size | Percentage of IFPUG (%)
1 IFPUG 4.3 | 1,000 | 100.00
2 Automated code-based | 1,070 | 107.00
3 Automated UML-based | 1,030 | 103.00
4 Backfired function points | 1,000 | 100.00
5 COSMIC function points | 1,143 | 114.29
6 Fast function points | 970 | 97.00
7 Feature points | 1,000 | 100.00
8 FISMA function points | 1,020 | 102.00
9 Full function points | 1,170 | 117.00
10 Function points light | 965 | 96.50
11 IntegraNova models | 1,090 | 109.00
12 Mark II function points | 1,060 | 106.00
13 NESMA function points | 1,040 | 104.00
14 RICE objects | 4,714 | 471.43
15 SCCQI function points | 3,029 | 302.86
16 Simple function points | 975 | 97.50
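
A minimal sketch of applying the percentage ratios in Table A.11 programmatically follows, converting an IFPUG count into approximate sizes in a few other metrics; the ratios are taken from the table and should be treated as rough approximations:

# Sketch: converting an IFPUG function point count into approximate sizes in other
# metrics using a subset of the ratios from Table A.11.
RATIO_VS_IFPUG = {
    "COSMIC function points": 1.1429,
    "FISMA function points": 1.02,
    "Mark II function points": 1.06,
    "NESMA function points": 1.04,
    "Simple function points": 0.975,
    "RICE objects": 4.7143,
}

def convert_from_ifpug(ifpug_size):
    return {metric: round(ifpug_size * ratio) for metric, ratio in RATIO_VS_IFPUG.items()}

print(convert_from_ifpug(1_000))   # approximately reproduces the Table A.11 values
print(convert_from_ifpug(2_500))   # the same ratios applied to another application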

Metrics Education—Academic
Academic training in software metrics is embarrassingly bad. So far as can be deter-
mined from limited samples, not a single academic course mentions that LOC
metrics penalize high level languages and that cost per defect metrics penalize
quality. The majority of academics probably do not even know these basic facts of
software metrics. Topics universities should teach about software metrics include manufacturing economics and the difference between fixed and variable software costs, activity-based cost analysis, defect potentials and defect removal efficiency, function point analysis, metrics conversion, comparing unlike software methods, comparing international software projects, and software growth patterns during development and after release. They should also teach the hazards of metrics with proven
mathematical and economic flaws such as LOC and cost per defect, both of which
violate standard economic assumptions.

Metrics Education—Professional Societies and Metrics Companies
Metrics training from professional societies and from companies that use metrics, such as benchmark and estimation companies, is generally focused on teaching specific skills such as function point analysis. The estimating companies teach the specifics
of using their tools, and also provide some more general training on estimation and
measurement topics. Academic institutions are so weak in metrics training that
probably the societies and metrics companies provide more hours of training than
all universities put together, and do a better job overall.

Metrics—Natural
The phrase natural metric refers to a metric that measures something visible and
tangible that can be seen and counted without ambiguity. Examples of natural
metrics for software would include pages of documents, test cases created, test cases
executed, and physical LOC. By contrast synthetic metrics are not visible and not
tangible.

Metrics—Synthetic
The phrase synthetic metric refers to things that are abstract and based on math-
ematics rather than on actual physical phenomena. Examples of synthetic metrics
for software include function point metrics, cyclomatic complexity metrics, logi-
cal code statements, test coverage, and defect density. Both synthetic and natural
metrics are important, but synthetic metrics are more difficult to count. However,
synthetic metrics tend to be very useful for normalizing economic and quality data, which is difficult to do with natural metrics.

Metrics Validation
Before a metric is released to the outside world and everyday users, it should be
validated under controlled conditions and proven to be effective and be without
harmful consequences. Metrics such as function points and SNAP did undergo extensive validation. Other metrics were just developed and published without any validation. Older metrics such as LOC and “cost per defect” have been
in use for more than 50 years without yet being formally studied or validated for
ranges of effectiveness and for harmful consequences.

Monte Carlo Method


This phrase refers to a predictive method named after the famous gaming mecca of Monte Carlo. Applied to business and technology, the Monte Carlo method uses numerous
samples to derive probabilities and more general rules. For example, collecting data
on software projects from a sample of 50 commercial banks might provide useful
information on ranges of banking software performance. Doing a similar study for
50 manufacturing companies would provide similar data, and comparing the two
sets would also be insightful. For predictive modeling ranges of inputs would be
defined and then dozens or scores of runs would be made to check the distributions
over the ranges. John von Neumann programmed the ENIAC computer to provide
Monte Carlo simulations so this method is as old as the computer industry. Monte
Carlo simulation is also part of some software estimation tools.
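
A minimal sketch of a Monte Carlo simulation applied to a software schedule estimate follows; the triangular distributions, their parameters, and the staffing level are illustrative assumptions, not figures from the text:

# Sketch: Monte Carlo simulation of a project schedule from uncertain inputs.
import random

def simulate_schedules(runs=10_000):
    samples = []
    for _ in range(runs):
        size_fp = random.triangular(900, 1_300, 1_000)           # uncertain size (FP)
        fp_per_staff_month = random.triangular(5.0, 12.0, 8.0)   # uncertain productivity
        staff = 6                                                 # assumed team size
        samples.append(size_fp / fp_per_staff_month / staff)
    return sorted(samples)

samples = simulate_schedules()
median = samples[len(samples) // 2]
p90 = samples[int(len(samples) * 0.9)]
print(f"median schedule about {median:.1f} months; 90th percentile about {p90:.1f} months")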

Morale Metrics
A topic needing more study, but one for which data are difficult to gather, is the impact of morale on team performance. Many companies such as IBM and Apple perform morale
studies, but these are usually kept internally and not published outside. Sometimes
interesting correlations do get published. For example, when IBM opened the new
Santa Teresa programming center, designed specifically for software, the morale
studies found that morale was much higher at Santa Teresa than at the nearby San Jose lab where the programmers had worked before. Productivity and quality were
also high. Of course, these findings do not prove that the new architecture was the cause, but they were interesting. In general, high morale correlates with high qual-
ity and high productivity, and low morale with the opposite case. But more study
is needed on this topic because it is an important one for software engineering.
Among the factors known to cause poor morale and even voluntary termination
among software engineers have been the following: (1) poor project management,
(2) forced use of pair programming without the consent of the software personnel,
(3) impossible demands for short schedules by clients or executives, (4) more than
6 hours of unpaid overtime per week for long periods, and (5) arbitrary curve fitting in appraisals that limits the number of top-rated personnel to a fixed statistical quota.

North American Industry Classification Codes (Replacements for Standard Industry Classification Codes)
There are thousands of industries. There is also a need to do cross-industry com-
parisons for topics such as revenues, employment, quality, and so on. The U.S.
Census Bureau and the U.S. Department of Commerce have long recognized the
need for cross-industry comparisons. Some years ago they published a large table
of codes for industries called standard industry classification or SIC codes. More
recently, in 1997, the SIC codes were replaced and updated by a new encoding method called North American Industry Classification or NAIC codes. The govern-
ment of Mexico also participated in creating the NAIC codes. The author and his
colleagues use NAIC codes when collecting benchmark data. A Google search on
NAIC code will bring up useful tables and a look-up engine for finding the NAIC
codes of thousands of industries. The full NAIC code is six digits, but for many
benchmarks the two-digit and three-digit versions are useful because they are more
general. Some relevant two-digit NAIC codes for software include: manufacturing
31–33; retail 44–45; information 51; finance 52; professional services 54; education
61. For benchmarks and also for software cost estimation, NAIC codes are useful
to ensure apples to apples comparisons. NAIC codes are free as are a number of tools
for looking up the codes for specific industries.

National Averages
Given the size and economic importance of software one might think that every
industrialized nation would have accurate data on software productivity, qual-
ity, and demographics. Such data do not seem to exist. There seem to be no effective national averages for any software topic, and software demographics are suspect
too. Although basic software personnel counts are known fairly well, the Bureau of Labor Statistics data does not show most of the 126 occupations. For example, there is
no good data on business analysts, software quality assurance, database analysts,
and scores of other ancillary personnel associated with software development and
maintenance. Creating a national repository of quantified software data would
benefit the United States. It would probably have to be done either by a major
university or by a major nonprofit association such as the ACM, IEEE, PMI, SIM,
or perhaps all of these together. Funding might be provided by major software
companies such as Apple, Microsoft, IBM, Oracle, and the similar, all of whom
have quite a bit of money and also large research organizations. Currently the best
data on software productivity and quality tends to come from companies that build
commercial estimation tools, and companies that provide commercial benchmark
services. All of these are fairly small companies. If you look at the combined data
from all 2015 software benchmark groups such as Galorath, International Software
Benchmarking Standards Group (ISBSG), Namcook Analytics, Price Systems,
Q/P Management Group, Quantimetrics, QSM, Reifer Associates, and Software
Productivity Research, the total number of projects is about 80,000. However, all
of these are competitive companies, and with a few exceptions such as the recent
joint study by ISBSG, Namcook, and Reifer, the data are not shared or compared, nor are they always consistent. One would think that a major consulting company
such as Gartner, Accenture, or KPMG would assemble national data from these
smaller sources, but this does not seem to happen. Although it is possible in 2015
to get rough employment and salary data for a small set of software occupation
groups, there is no true national average that encompasses all industries.

Nondisclosure Agreements
When the author and his colleagues from Namcook Analytics LLC collect bench-
mark data from clients, the data are provided under a nondisclosure agreement or
NDA as commonly abbreviated. These agreements prevent the benchmark organization from identifying the client or the specific projects from which data are
collected. Of course, if the data are merely added to a collection of hundreds of
other projects for statistical analysis that does not violate the NDA because it is
not possible to identify where the data came from. Academics and many readers
of benchmark reports that conceal the sources of the data due to NDA agreements
complain that the sources should be identified, and some even assume that the data
are invalid unless the sources are named. NDAs are a normal part of professional
benchmark data collection and serve to protect proprietary client information that
should not be shared with competitors or the outside world. In a sense benchmark
NDA agreements are similar to the confidentiality between lawyers and clients and
the confidentiality of medical information between physicians and patients. NDAs
are a common method for protecting information and need to be honored by all
benchmark collection personnel.

Nonfunctional Requirements
Software requirements come in two flavors: functional requirements, which are what the customer wants the software to do, and nonfunctional requirements, which are needed to make the software work on various platforms or are required by government mandate.
Consider home construction before considering software. A home built overlooking
the ocean will have windows with a view—this is a functional requirement by the
owners. But due to zoning and insurance demands, homes near the ocean in many
states will need hurricane-proof windows. This is a nonfunctional requirement. See
the discussion of the new SNAP metric later in this report. Typical nonfunctional
requirements are changes to software to allow it to operate on multiple hardware
platforms or operate under multiple operating systems.

Normalization
In software, the term normalization has different meanings in different contexts,
such as database normalization and software project result normalization. In
this chapter the form of normalization of interest is converting raw data to a
fixed metric so that comparisons of different projects are easy to understand.
The function point metric is as good choice for normalization. Both work hours
per function point and defects per function point can show the results of dif-
ferences in application size, differences in application methodology, differences
in CMMI levels, and other topics of interest. However, there is a problem that
is not well-covered in the literature, and for that matter not well-covered by the
function point associations. Application size is not constant. During develop-
ment, software applications grow due to creeping requirements at more than
1% per calendar month. After release applications continue to grow for as long
as they are being used at more than 8% per calendar year. This means that both

productivity and quality data need to be renormalized from time to time to match the current size. The author recommends normalization at requirements
end and again at delivery for new software. For software in the field and being
used, the author recommends renormalization once a year probably at the start
of each fiscal or calendar year.
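
A minimal sketch of why renormalization matters follows, using the growth rates quoted above (about 1% per calendar month during development and about 8% per calendar year after release); the starting size and effort figures are hypothetical:

# Sketch: requirements creep and post-release growth, and their effect on normalized
# productivity. Growth rates follow the text; starting size and effort are hypothetical.
def size_at_delivery(size_at_requirements_end, development_months):
    return size_at_requirements_end * (1.01 ** development_months)   # ~1% per month

def size_after_release(delivered_size, years_in_use):
    return delivered_size * (1.08 ** years_in_use)                   # ~8% per year

initial_size = 1_000.0        # function points at requirements end (hypothetical)
effort_hours = 12_000.0       # total development effort (hypothetical)

delivered = size_at_delivery(initial_size, development_months=18)
in_use = size_after_release(delivered, years_in_use=3)

print(round(delivered), round(in_use))                         # about 1196 and 1507 FP
print(round(effort_hours / initial_size, 1), "work hours per FP at requirements end")
print(round(effort_hours / delivered, 1), "work hours per FP renormalized at delivery")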

Object-Oriented Metrics
Object-oriented (OO) languages and methods have become mainstream
development approaches. For example, all software at Apple uses the Objective
C programming language. The terminology and concepts of object-oriented
development are somewhat unique and not the same as procedural languages.
However, some standard metrics such as function points and DRE work well with
object-oriented development. In addition, the OO community has developed met-
rics suites that are tailored to the OO approach. These include methods, classes,
inheritance, encapsulation, and some others. Coupling and cohesion are also used
with OO development. This is too complex a topic for a short discussion so a Google
search on object-oriented metrics will bring up interesting topics such as weighted
methods per class and depth of inheritance tree.

Occupation Groups
A study of software demographics in large companies was funded by AT&T and
carried out by the author and his colleagues. Some of the participants in the study
included IBM, the Navy, Texas Instruments, Ford, and other major organizations.
The study found 126 occupations in total, but no company employed more than 50
of them. Among the occupations were Agile coaches, architects, business analysts,
configuration control specialists, designers, estimating specialists, function point
specialists, human factors specialists, programmers or software engineers, project
office specialists, quality assurance specialists, technical writers, and test specialists.
The number of occupation groups increased with both application size and also with
company size. Traditional programming can be less than 30% of the team and less
than 30% of the effort for large applications. The study also found that no human
resource group actually knew how many software occupations were employed or
even how many software personnel were employed. It was necessary to interview
local managers. The study also found that some software personnel refused to be
identified with software due to low status. These were aeronautical or automotive
engineers building embedded software. Very likely government statistics on software
employment are wrong. If corporate HR organizations do not know how many software people are employed, they cannot report software employment accurately to the government either. There is a need for continuing study of this topic. Also needed are compari-
sons of productivity and quality between projects staffed with generalists and similar
projects staffed by specialists.

Although programmers and testers dominate, note that neither of these occupation
groups even reaches 30% of overall staffing levels. Needless to say there are wide variations.
Also with a total of 126 known occupation groups, really large systems will have
much greater diversity in occupations than shown here.

Parametric Estimation
The term parametric estimation refers to software cost and quality estimates produced
by one or more commercial software estimation tools such as COCOMO II,
CostXpert, KnowledgePlan, SEER, SLIM, SRM, or TruePrice. Parametric estimates
are derived from the study and analysis of historical data from past projects. As a result
the commercial estimation companies tend to also provide benchmark services. Some
of the parametric estimation companies such as the author’s Namcook Analytics have
data on more than 20,000 projects. A comparison by the author of 50 parametric
estimates and 50 manual estimates by experienced project managers found that both
manual and parametric estimates were close for small projects below 250 function
points. But as application size increased manual estimates became progressively opti-
mistic, whereas parametric estimates stayed within 10% well past 100,000 function
points. For small projects both manual and parametric estimates should be accurate
enough to be useful, but for major systems parametric estimates are a better choice.
Some companies utilize two or more parametric estimation tools and run them all
when dealing with large mission-critical software applications. Convergence of the
estimates by separate parametric estimation tools adds value to major projects.

Pair Programming
Pair programming is an example of a methodology that should have been validated
before it started being used, but was not. The concept of pair programming is that
two programmers take turns coding and navigating using the same computer. Clearly
if personnel salaries are U.S. $100,000 per year and the burden rate is U.S. $50,000
per year, then a pair is going to cost twice as much as one programmer, that is, U.S.
$300,000 per year instead of U.S. $150,000 per year. A set of 10 pairs will cost U.S.
$3,000,000 per year, and return fairly low value. The literature on pair programming
is trivial and only compares unaided pairs against unaided individual programmers
without any reference to static analysis, inspections, or other proven methods of qual-
ity control. Although pair enthusiasts claim knowledge transfer as a virtue, there are
better methods of knowledge transfer including inspections and mentoring. Although
some programmers enjoy pair programming, many do not and several reports discuss
programmers who quit companies specifically to get away from pair programming.
This method should have been evaluated prior to release using a sample of at least
25 pairs compared to 25 individuals, and the experiments should also have compared
pairs and individuals with and without static analysis. The experiments should also
have compared pairs against individuals who used formal inspections. The author’s
data indicate that pairs always cost more, are usually slower, and are not as effective
for quality control as individual programmers who use inspections and static analysis.
An unanswered question in the pair programming literature is this: if pairing
programmers is good, why not also pair testers, quality assurance personnel, project
managers, business analysts, and the other 125 occupations associated with software?
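
The cost arithmetic above is simple enough to sketch directly; the salary and burden figures are the illustrative ones used in this section:

```python
# Illustrative pair programming cost comparison using the figures in the text:
# $100,000 salary plus $50,000 burden per person per year.

fully_loaded_cost = 100_000 + 50_000   # one programmer per year: $150,000
pair_cost = 2 * fully_loaded_cost      # two programmers on one work item: $300,000
ten_pairs_cost = 10 * pair_cost        # a set of 10 pairs: $3,000,000

print(fully_loaded_cost, pair_cost, ten_pairs_cost)
```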

Pareto Analysis
The famous Pareto principle states that 80% of various issues will be caused by 20%
of the possible causes. The name was coined by Joseph Juran in honor of the Italian
economist Vilfredo Pareto, who noted in 1906 that 20% of the pea pods in his garden
produced 80% of the peas. Pareto analysis is much more than the 80/20 rule and includes
sophisticated methods for analyzing complex problems with many variables. Pareto
distributions are frequently noted in software: examples include the discovery of
error-prone modules (EPM) and a Microsoft study showing that fixing 20% of bugs would
eliminate 80% of system crashes. Other areas where Pareto distributions seem to
show up include: (1) a minority of personnel seem to produce the majority of effective
work and (2) in any industry a minority of companies are ranked best to work for
by annual surveys. Pareto diagrams are often used in software for things like analyzing
customer help requests and bug reports.
[Pareto chart: customer complaints ranked by frequency (parking difficult, sales rep was rude, poor lighting, layout confusing, sizes limited, clothing faded, clothing shrank), with complaint counts on the left axis, cumulative percentage and an 80% line on the right axis, and the leading categories labeled as the significant few and the remainder as the insignificant many.]

Note that Pareto charts are useful for showing several kinds of data visually at the
same time.
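
A brief sketch of the cumulative-percentage calculation that underlies a Pareto chart such as the one above; the complaint counts below are invented for illustration:

```python
# Illustrative Pareto analysis: rank causes by frequency and flag the
# "significant few" that account for roughly 80% of all complaints.
# The counts below are invented purely for illustration.

complaints = {
    "Parking difficult": 110,
    "Sales rep was rude": 80,
    "Poor lighting": 40,
    "Layout confusing": 25,
    "Sizes limited": 15,
    "Clothing faded": 6,
    "Clothing shrank": 4,
}

total = sum(complaints.values())
cumulative = 0
for cause, count in sorted(complaints.items(), key=lambda kv: kv[1], reverse=True):
    cumulative += count
    share = 100.0 * cumulative / total
    marker = "  <-- significant few" if share <= 80.0 else ""
    print(f"{cause:20s} {count:4d}  cumulative {share:5.1f}%{marker}")
```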

Pattern Matching
Patterns have become an important topic in software engineering and will become
even more important as reuse enters the mainstream. Today in 2014 design patterns
and code patterns are both fairly well-known and widely used. Patterns are also use-
ful in measurement and estimation. For example, the author’s patent-pending early
sizing method is based on patterns of historical projects that match the taxonomy
of the new application that is being sized. Patterns need to be organized using stan-
dard taxonomies of application nature, scope, class, and type. Patterns are also used
by hundreds of other industries. For example, the Zillow database of real estate
and the Kelley Blue Book of used cars are both based on pattern matching.

Performance Metrics
For the most part this chapter deals with metrics for software development and main-
tenance. But software operating speed is also important, as is hardware operating
speed. There are dozens of performance metrics and performance evaluation meth-
ods. A Google search on the phrase software performance metrics is recommended.
Among these metrics are load, stress, data throughput, capacity, and many others.

Program Evaluation and Review Technique


The famous Program Evaluation and Review Technique (PERT) method was
developed by the U.S. Navy in the 1950s for handling the logistics of naval ship
construction. It is closely aligned to the critical-path approach. In practice, PERT
diagrams show a network of activities and timelines, with pessimistic, optimistic,
and expected durations. Part of the PERT analysis is to identify the critical path
where time cannot be easily compressed. PERT graphs are often used in conjunc-
tion with GANTT charts, discussed earlier. PERT is a large and complex topic so a
Google search on PERT diagrams or PERT methodology will bring up extensive sets
of papers and reports. In today’s world there are commercial and open-source tools
that can facilitate PERT analysis and create PERT diagrams and GANTT charts
for software projects.

Phase Metrics
The term phase refers to a discrete set of tasks and activities that center on produc-
ing a major deliverable such as requirements. For software projects there is some
ambiguity in phase terms and concepts, but a typical pattern of software phases
would include: (1) requirements, (2) design, (3) coding or construction, (4) testing,
and (5) deployment. Several commercial estimation tools predict software costs and
schedules by phase. However, there are major weaknesses with the phase concept.
Among these weaknesses is the fact that many activities such as technical documen-
tation, quality assurance, and project management span multiple phases. Another
weakness is the implicit assumption of a waterfall development method, so that
phases are not a good choice for Agile projects. Activity-based cost analysis is a better
and more accurate alternative to phases for planning and estimating software.

Portfolio Metrics
The term portfolio in a software context refers to the total collection of software
owned and operated by a corporation or a government unit. The portfolio would
include custom developed software, commercial software packages, and open-
source software packages. In today’s world of 2015 it will also include cloud appli-
cations that companies use but do not have installed on their own computers, such
as Google documents. Function point metrics are a good choice for portfolios.
LOC metrics might be used but with thousands of applications coded in hundreds
of languages, LOC is not an optimal choice. In today’s world of 2015, a Fortune
500 company can easily own more than 5,000 software applications with an aggre-
gate size approaching 10,000,000 function points. Very few companies know how
large their portfolios are. Shown below in Table A.12 is a sample portfolio for a
manufacturing company with a total of 250,000 employees.
As can be seen from Table A.12, corporate portfolios comprise thousands of
applications and millions of function points.

Table A.12 Sample Manufacturing Corporate Portfolio


Corporate Functions | Number of Applications Used | Function Points | Lines of Code
1 Accounts payable 22 55,902 3,074,593
2 Accounts receivable 29 71,678 3,942,271
3 Advertising 42 62,441 2,809,867

4 Advisory boards—technical 6 9,678 532,286


5 Banking relationships 50 175,557 9,655,657
6 Board of directors 5 7,093 390,118
7 Building maintenance 3 3,810 209,556
8 Business intelligence 27 94,302 5,186,625
9 Business partnerships 22 55,902 3,074,593
10 Competitive analysis 39 97,799 5,378,919
11 Consultant management 3 5,040 277,174
12 Contract management 42 124,883 6,868,564
13 Customer resource management 77 193,740 10,655,693
14 Customer support 60 90,659 4,986,251

15 Divestitures 12 18,017 990,928
16 Education—customers 9 13,205 726,262
17 Education—staff 5 7,093 390,118
18 Embedded software 120 359,193 30,531,426
19 Energy consumption monitoring 5 7,093 390,118
20 Energy acquisition 5 8,032 441,749
21 Engineering 125 437,500 32,812,500
22 ERP—corporate 100 400,000 28,000,000
23 Finances (corporate) 120 335,247 18,438,586
24 Finances (divisional) 88 236,931 13,031,213
25 Governance 12 30,028 1,651,546
26 Government certification (if any) 31 45,764 2,517,025
27 Government regulations (if any) 16 24,583 1,352,043
28 Human resources 9 13,205 726,262
29 Insurance 7 10,298 566,415
30 Inventory management 60 90,659 4,986,251
31 Legal department 31 45,764 2,517,025
32 Litigation 42 62,441 3,434,282
33 Long-range planning 9 22,008 1,210,437
34 Maintenance—product 106 158,606 8,723,313
35 Maintenance—buildings 6 9,678 725,844
36 Manufacturing 269 470,014 35,251,071
37 Market research 50 75,239 4,138,139
38 Marketing 35 51,821 2,850,144
39 Measures—customer satisfaction 5 7,093 390,118
40 Measures—financial 31 45,764 2,517,025
41 Measures—market share 10 14,952 822,381
42 Measures—performance 11 16,931 931,220
43 Measures—quality 12 18,017 990,928
44 Measures—ROI and profitability 42 62,441 3,434,282
45 Mergers and acquisitions 31 76,273 4,195,041
46 Office suites 10 34,889 1,918,888
47 Open-source tools—general 93 140,068 7,703,748
48 Order entry 35 51,821 2,850,144
49 Outside services—manufacturing 31 45,764 2,517,025
50 Outside services—legal 35 86,368 4,750,241
51 Outside services—marketing 19 27,836 1,530,981
52 Outside services—sales 21 31,520 1,733,601
53 Outside services—terminations 11 13,259 729,258
54 Outsource management 42 62,441 3,434,282
55 Patents and inventions 24 35,692 1,963,038
56 Payrolls 27 67,359 3,704,732
57 Planning—manufacturing 57 85,197 4,685,808
58 Planning—products 12 18,017 990,928
59 Process management 14 21,709 1,194,018
60 Product design 77 193,740 10,655,693
61 Product nationalization 16 36,874 2,028,064
62 Product testing 50 75,239 4,138,139
63 Project offices 42 72,848 4,006,662
64 Project management 12 33,031 1,816,701
65 Purchasing 39 58,679 3,227,351
66 Quality control 16 24,583 1,352,043
67 Real estate 10 14,952 822,381
68 Research and development 154 537,321 29,552,650
69 Sales 60 90,659 4,986,251
70 Sales support 19 27,836 1,530,981

71 Security—buildings 27 40,415 2,222,839
72 Security—computing and software 42 145,697 8,013,325
73 Shareholder relationships 10 34,889 1,918,888
74 Shipping/receiving products 35 86,368 4,750,241
75 Software development 113 337,550 18,565,265
76 Standards compliance 16 24,583 1,352,043
77 Stocks and bonds 27 94,302 5,186,625
78 Supply chain management 64 96,472 5,305,959
79 Taxes 57 141,994 7,809,679
80 Travel 12 30,028 1,651,546
81 Unbudgeted costs—cyber 42 114,476 6,296,184
attacks
82 Warranty support 8 11,661 641,378

Portfolio Totals 3,214 7,268,513 434,263,439

Productivity
The standard economic definition for productivity is goods or services produced per
unit of labor or expense. The software industry has not yet found a standard topic
that can be used for the goods or services part of this definition. Among the units
used for goods or services are function points, LOC, story points,
RICE objects, and use-case points. Of these, only function points can be applied
to every activity and every kind of software developed by all known methodolo-
gies. As of 2014, function point metrics are the best choice for software goods and
services and therefore for measuring economic productivity. However, the software
literature includes more than a dozen others such as several flavors of LOC metrics,
story point, use-case points, velocity, and so on. So far as can be determined no
other industry besides software has such a plethora of bad choices for measuring
economic productivity.
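
As a simple illustration of economic productivity expressed with function points, the sketch below converts between function points per staff month and work hours per function point. It assumes roughly 132 net work hours per staff month, which is an assumption made for this example rather than a value stated in this section:

```python
# Illustrative conversion between two common productivity expressions.
# Assumes roughly 132 net work hours per staff month (an assumption here).

WORK_HOURS_PER_STAFF_MONTH = 132

def work_hours_per_fp(fp_per_staff_month):
    return WORK_HOURS_PER_STAFF_MONTH / fp_per_staff_month

def fp_per_staff_month(work_hours_per_fp_value):
    return WORK_HOURS_PER_STAFF_MONTH / work_hours_per_fp_value

print(round(work_hours_per_fp(10.0), 1))    # 10 FP per month is about 13.2 hours per FP
print(round(fp_per_staff_month(15.0), 1))   # 15 hours per FP is about 8.8 FP per month
```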

Production Rate
This metric is often paired with assignment scope to create software cost and schedule
estimates for specific activities. The production rate is the amount of work a person
can complete in a fixed time period such as an hour, a week, or a month. Using the
simple natural metric of pages in a user’s guide assigned to a technical writer, the
assignment scope might be 50 pages and the production rate might be 25 pages per
month. This combination would lead to an estimate of one writer and two calendar
months. Production rates can be calculated using any metric for a deliverable item,
such as pages, source code, function points, story points, and so on.
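
A minimal sketch of how assignment scope and production rate combine into a staffing and schedule estimate, using the technical-writer example above:

```python
# Assignment scope: amount of a deliverable one person is responsible for.
# Production rate: amount of that deliverable one person completes per month.
# Values below are the technical-writer example from the text.

def estimate_staff_and_schedule(deliverable_size, assignment_scope, production_rate):
    staff = deliverable_size / assignment_scope
    schedule_months = assignment_scope / production_rate
    return staff, schedule_months

staff, months = estimate_staff_and_schedule(deliverable_size=50,
                                            assignment_scope=50,
                                            production_rate=25)
print(staff, months)   # 1.0 writer, 2.0 calendar months
```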

Professional Malpractice
As software is not a licensed profession it cannot actually have professional mal-
practice in 2016. Yet several metrics in this report are cited as being professional
malpractice in specific contexts. The definition of professional malpractice is an
instance of incompetence or negligence on the part of a professional. A corollary to
this definition is that academic training in the profession should have provided
all professionals with sufficient information to avoid most malpractice situations.
As of 2016, software academic training is inadequate to warn software engineers
and software managers of the hazards of bad metrics. The metric LOC is viewed as
professional malpractice in the specific context of attempting (1) economic analysis
across multiple programming languages and (2) economic analysis that includes
requirements, design, and noncode work. LOC metrics would not be malpractice
for studying pure coding speed or for studying code defects in specific languages.
The metric cost per defect is viewed as professional malpractice in the context of
(1) exploring the economic value of quality and (2) comparing a sequence of defect
removal operations for the same project. Cost per defect would not be a malpractice
if fixed costs were backed out or for comparing identical defect removal activi-
ties such as unit test across several projects. LOC metrics make requirements and
design invisible and penalize modern high-level languages. Cost per defect makes
the buggiest software look cheapest and ignores the true value of quality in shorten-
ing schedules and lowering costs.
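
The fixed-cost distortion behind cost per defect can be illustrated with a hedged example; the dollar figures below are invented and are not the author's data. When test preparation and execution costs are fixed, the project with fewer bugs shows a higher cost per defect even though its total cost of quality is lower:

```python
# Invented illustration of why cost per defect penalizes high quality:
# fixed test preparation and execution costs are spread over fewer defects.

FIXED_TEST_COSTS = 15_000      # writing and running test cases (invented figure)
REPAIR_COST_PER_DEFECT = 500   # variable repair cost per defect (invented figure)

def cost_per_defect(defects_found):
    total_cost = FIXED_TEST_COSTS + defects_found * REPAIR_COST_PER_DEFECT
    return total_cost / defects_found

print(cost_per_defect(100))   # buggier release: $650 per defect
print(cost_per_defect(10))    # cleaner release: $2,000 per defect, yet lower total cost
```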

Profit Center
A profit center is a corporate group or organization whose work contributes to the
income and profits of the company. The opposite case would be a cost center where
money is consumed but the work does not bring in revenues. For internal soft-
ware that companies build for their own use, some companies use the cost-center
approach, and some use the profit-center approach. Cost-center software is provided
to internal clients for free, and funding comes from some kind of corporate account.
Profit-center software would charge internal users for the labor and materials needed
to construct custom software. In general, measures and metrics are better under the
profit center model because without good data there is no way to bill the clients.
As a general rule for 2015 about 60% of internal software groups are run using the
cost-center model and 40% are run using the profit-center model. For commercial
software, development is clearly a profit-center model. For embedded software in
medical devices or automotive engines the software is part of a hardware product
and usually not sold separately. However, it still might be developed under a profit-
center model, although not always. Overall, profit centers tend to be somewhat more efficient
and cost effective than cost centers. This topic should be included in standard bench-
mark reports, but is actually somewhat difficult to find in the software literature.

Progress Improvements—Measured Rates for Quality and Productivity
For an industry notorious for poor quality and low productivity it is obvious that
metrics and measurements should be able to measure improvements over time.
Although this is technically possible and not even very difficult, it seldom happens.
The reason is that when companies collect data for benchmarks they tend to regard
them as one-shot measurement exercises, and not as a continuing activity with met-
rics collected once or twice a year for long periods. Some leading companies such
as IBM do measure rates of progress and so do some consulting groups such as the
author’s Namcook Analytics LLC. From long-range measures more than a 10-year
period, quality can be improved at annual rates of 25% or more for at least five years
in a row. Productivity is harder to accomplish and annual improvements are usually
less than 10%. Quality is measured using defect potentials and DRE. Productivity
is measured using work hours per function point for applications of nominally the
same size and type. Switching to Agile from waterfall is beneficial, but the Agile
learning curve is so steep that initial results will be disappointing.

Project End Date


The start and end dates of software projects are surprisingly ambiguous. The defi-
nition for project end date used by the author is the date the software is delivered
to its intended users. This assumes that the end date is for development—clearly
maintenance and enhancement work could continue for years. An alternate end
date would be the freeze point for software projects after which no further changes
can be made to the current release. This is normally several weeks prior to delivery
to clients. There are no fixed rules for end dates.

Project-Level Metrics
Probably the most common form of benchmark in the world is an overall result
for a software project without any granularity or internal information about
activities and tasks. For example, a typical project-level benchmark for an appli-
cation of 1,000 function points might be that it required 15 work hours per
function point, had a schedule of 15 calendar months, and a cost of U.S. $1,200
per function point. The problem with this high-level view is that there is no
way to validate it. Did the project include project management? Did the project
include unpaid overtime? Did the project include part-time workers such as qual-
ity assurance and technical writers? There is no way of being sure of what really
happens with project-level metrics. See the discussion of activity-based costs ear-
lier in this report.

Project Office or Project Management Office


For large companies that build large systems above 10,000 function points in size it
is very common to have a dedicated team of planning and estimating specialists who
work together in an organization called either a project office or a project management
office (PMO). These organizations are found in most major corporations such as
IBM, AT&T, Motorola, and hundreds of others. PMO staffing runs from a low of
two up to more than a dozen for massive software projects in the 100,000 function
point size range. As ordinary project managers are not trained in either software
estimation or measurement, the PMO groups employ specialists who are trained.
Further, the PMO offices are usually well-stocked with a variety of project manage-
ment tools including parametric estimation (SEER, KnowledgePlan, Software Risk
Master, etc.), project planning tools (Microsoft Project, Timeline, etc.), and more
recently newer tools such as the Automated Project Office (APO) by Computer
Aid Inc. As a general rule large software projects supported by formal PMO groups
have better track records for on-time delivery and cost accuracy than projects of the
same size that do not have PMO organizations.

Project Start Date


The start date of a software project is one of the most uncertain and ambiguous
topics in the entire metrics literature. Long before requirements began someone
had to decide that a specific software application was needed. This need had to
be expressed to higher managers who would be asked to approve funds. The need
would have to be explained to software development management and some techni-
cal personnel. Then formal requirements gathering and analysis would occur. What
is the actual start date? For practical purposes the prerequirements discussions and
funding discussions are seldom tracked, and even if they were tracked there would
be no easy way to assign them to a project until it is defined. About the only date
that is crisp is the day when requirements gathering starts. However, for projects
created by inventors for their own purposes, there are no formal requirements other
than concepts in the mind of the inventor. When collecting benchmark data the
author asks the local project manager for the start date and also asks what work
took place on that date. Not everybody answers these questions the same way, and
there are no agreed-to rules or standards for defining a software project’s start date.

Quality
There are many competing definitions for software quality, including some like
conformance to requirements that clearly do not work well. Others such as maintain-
ability and reliability are somewhat ambiguous and only partial definitions. The
definition used by the author is the absence of defects that would cause a software
application to either stop completely or to produce incorrect results. This definition has
the virtue of being able to be used with requirements and design defects as well as
code defects. As requirements are often buggy and filled with errors, these defects
need to be included in a working definition for software quality. Defects also cor-
relate with customer satisfaction, in that as bugs go up satisfaction comes down.

Quality Function Deployment


Quality function deployment (QFD) was originally developed in Japan for hard-
ware products by Dr. Yoji Akao in 1966. More recently it has been applied to soft-
ware. QFD is included in a glossary of software metrics and measurement because
of the interesting fish-bone diagrams that are part of the QFD process. These are
also called house of quality because the top of the diagram resembles a peaked roof.
The QFD topic is a complex subject and a Google search will bring up the litera-
ture on QFD. QFD is effective in improving the delivered quality of a number of
kinds of products, including software. The kinds of software using QFD tend to be
engineering applications and medical devices where there are significant liabilities and very high
operational reliability is needed.

Ranges of Software Development Productivity


Considering that software is more than 60 years old in 2014, one might think that
both average productivity rates and ranges of productivity would be well-known and
widely published. This is not the case. There are books such as the author’s Applied
Software Measurement (2008) that have ranges and averages, and there are benchmark
sources such as the ISBSG that publish ranges and averages for subsets, but there is
no source of national data that is continuously updated to show U.S. national aver-
ages for software productivity or the ranges of productivity. This would be somewhat
equivalent to published data on U.S. life expectancy levels. Among the author’s clients
the range of software productivity is from a low of just over 1 function point per staff
month for large defense applications to a high of just under 100 function points per
staff month for small civilian projects with more than 75% reusable materials. From
the author’s collection of about 20,000 projects, the ranges by size are expressed in
terms of function points per month which are as given in Table A.13.

Table A.13 Ranges of Function Points per Staff Month


Size Monthly Rate

1 function point 33.84

10 function points 21.56

100 function points 16.31

1,000 function points 12.67

10,000 function points 3.75

100,000 function points 2.62

Average (not weighted) 13.27

Web projects 12.32

Domestic outsource 11.07

IT projects 10.04

Commercial 9.12

Systems/embedded 7.11

Civilian government 6.21

Military/defense 5.12

Average (not weighted) 8.72

Note that there are large variations by application size and also large variations
by application type. There are also large variations by country, although interna-
tional data are not shown here. Japan and India, for example, would be better than
the United States. Also note that other benchmark providers might have data with
different results from the data shown here. This could be due to the fact that
benchmark companies normally have unique sets of clients, so the samples are almost
always different. Also, there is little coordination or cooperation among various
benchmark groups, although the author, ISBSG, and Don Reifer did produce a
report on project size with data from all three organizations.

Ranges of Software Development Quality


Because poor quality and excessive volumes of delivered defects are endemic prob-
lems for the software industry, it would be useful to have a national repository of
software quality data. This does not exist. In fact, quality data are much harder
to collect than productivity data due to leakage that leaves out defects found in
requirements and design, defects found by static analysis, and defects found by
desk checking and unit test. Even delivered defects leak because if too many bugs
are released, usage will drop and hence latent bugs will remain latent and not be
discovered.
From the author’s collection of about 26,000 projects following are average
approximate values for software quality. Here too other benchmark sources will
vary (Table A.14).
As can be seen from the above table there are variations by application size and
also variations by application type. For national average purposes, the value shown
by type is more meaningful than size, because there are very few applications larger
than 10,000 function points, so these large sizes distort average values. In other
words, defect potentials average about 4.94, whereas defect removal averages about

Table A.14 Ranges of Software Quality and Delivered Defects


Defect Potential | Removal Efficiency (%) | Defects Delivered

Size
1 1.50 96.93 0.05

10 2.50 97.50 0.06

100 3.00 96.65 0.10

1,000 4.30 91.00 0.39

10,000 5.25 87.00 0.68

100,000 6.75 85.70 0.97

Average 3.88 92.46 0.37

Type
Domestic outsource 4.32 94.50 0.24

IT projects 4.62 92.25 0.36

Web projects 4.64 91.30 0.40

Systems/embedded 4.79 98.30 0.08

Commercial 4.95 93.50 0.32

Government 5.21 88.70 0.59

Military 5.45 98.65 0.07

Average 4.94 93.78 0.30



93.78% and delivered defects average about 0.30 circa 2016 if the view is cross
industry. Overall ranges of defect potentials run from about 1.25 per function
point to about 7.50 per function point. Ranges of defect removal run from a high of about
99.65% down to a low of less than 77.00%. Of course, averages and ranges are both variable fac-
tors and change based on the size and type of software projects used in the samples
for calculating averages.
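
Defect potentials, DRE, and delivered defects are linked by simple arithmetic; a short sketch using values in the range of Table A.14 (the specific inputs are illustrative):

```python
# Delivered defects per function point = defect potential * (1 - DRE).
# Inputs below are in the range of Table A.14 and are illustrative.

def delivered_per_fp(defect_potential, dre_percent):
    return defect_potential * (1.0 - dre_percent / 100.0)

def delivered_total(size_fp, defect_potential, dre_percent):
    return size_fp * delivered_per_fp(defect_potential, dre_percent)

print(round(delivered_per_fp(4.94, 93.78), 2))     # about 0.31 per function point
print(round(delivered_total(1_000, 4.30, 91.00)))  # about 387 delivered defects at 1,000 FP
```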

Ranges of Software Schedules


Probably the best way to handle ranges of software schedules is to use a graph
that shows best, average, and worst case schedules for a range of application sizes.
However, some useful rules of thumb can be used to predict approximate schedule
durations from the start of requirements to delivery of the software to initial customers.

Expert teams Function points raised to the 0.37 power

Good teams Function points raised to the 0.38 power

Average teams Function points raised to the 0.39 power

Below average teams Function points raised to the 0.40 power

Novice teams Function points raised to the 0.41 power

Assuming an application size of 1,000 function points, these rules of thumb generate
the following schedule durations in calendar months:

Expert teams 12.9 calendar months

Good teams 13.8 calendar months

Average teams 14.8 calendar months

Below average teams 15.8 calendar months

Novice teams 17.0 calendar months

As can easily be seen, the differences between the experts and novices translate
into significant schedule differences, and would also lead to differences in effort,
costs, and quality.
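
These rules of thumb are easy to automate; a brief sketch applying the exponents above:

```python
# Schedule rule of thumb: calendar months are approximately
# function_points ** exponent, where the exponent reflects team capability.

EXPONENTS = {
    "Expert teams": 0.37,
    "Good teams": 0.38,
    "Average teams": 0.39,
    "Below average teams": 0.40,
    "Novice teams": 0.41,
}

def schedule_months(function_points, exponent):
    return function_points ** exponent

for team, exponent in EXPONENTS.items():
    print(f"{team:20s} {schedule_months(1_000, exponent):5.1f} calendar months")
# For 1,000 function points this reproduces the 12.9 to 17.0 month range above.
```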

Rayleigh Curve
Lord Rayleigh was an English physicist who won a Nobel Prize in 1904 for the discov-
ery of Argon gas. He also developed a family of curves that showed the distribution
of results for several variables. This curve was adopted by Larry Putnam and Peter
Norden as a method of describing software staffing, effort, and
schedules. The curves for software are known as Putnam–Norden–Rayleigh curves or
PNR. A Google search for this term will show many different articles. In general, the
curves are a good approximation for software staffing over time. The PNR curves,
and other forms of Rayleigh curves, assume smooth progress. For software this is not
always the case. There are often severe discontinuities in the real world caused by
creeping requirements, canceled projects, deferred features, or other abrupt changes.
For example, about 32% of large systems above 10,000 function points are canceled
without being completed, which truncates PNR curves. For smaller projects with
better odds of success the curves are more successful. Larry Putnam was the original
developer of the SLIM estimation tool, which supports the family of curves, as do
other tools as well. See also Chaos theory earlier in this chapter for a discussion of
discontinuities and random events.

Reliability Metrics
Software reliability refers, in general, to how long software can operate success-
fully without encountering a bug or crashing. Reliability is often expressed using
mean time to failure (MTTF) and mean time between failures (MTBF). Studies
at IBM found that reliability correlated strongly with numbers of released defects
and DRE. High reliability, meaning that a bug or failure is encountered less than once per year,
normally demands DRE levels of >99% and delivered defect densities of <0.001
per function point.
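
A minimal sketch of computing MTBF from an operating log; the input values are illustrative:

```python
# Mean time between failures (MTBF) computed from operating hours and the
# number of failures observed in that period. Input values are illustrative.

def mtbf(operating_hours, failures_observed):
    """Average operating hours between failures; unbounded if none were observed."""
    return operating_hours / failures_observed if failures_observed else float("inf")

print(mtbf(8_760, 12))   # one year of operation with 12 failures: 730.0 hours
print(mtbf(8_760, 0))    # no failures observed: inf
```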

Repair and Rework Costs


This term overlaps the term technical debt that is discussed later in the report. The
phrase deals with the sum total effort and costs for logging bugs, routing them to
repair teams, fixing the bugs, testing the fixes for regressions, integrating the new
code, and then releasing the new version of a software application. As bug repairs
are the #1 cost driver, repair and rework costs often top 25% of total development
budgets.

Requirement
There is some ambiguity in exactly what constitutes a requirement. In general, a
requirement is a description of a specific feature that clients want software to per-
form. A requirement for a word-processing software package might include auto-
matic spell checking. Requirements can be expressed in terms of use-cases, user
stories, text, mathematical formulae, or combinations of methods. No matter how
requirements are expressed, they are known to have several attributes that cause
problems for software projects. These attributes include: (1) errors in requirements,
(2) toxic requirements that are harmful to software (such as Y2k), and (3) incom-
pleteness that leads to continuous requirements growth. There are also deeper
and more subtle problems such as the lack of an effective taxonomy that can put
requirements into a well-formed hierarchy. All software needs to accept inputs,
perform various calculations, and produce results. But this general statement needs
to be expanded into a formal taxonomy that would encompass error checking, user
error avoidance, and many others. From comparisons of explicit requirements and
function points, an average requirement takes about 3.0 function points to imple-
ment, with a range between 0.5 and 20. The IFPUG organization has developed
a method of dividing requirements into functional requirements and nonfunc-
tional requirements that is supported by a new metric called SNAP for software
nonfunctional assessment process. Functional requirements are things users want.
Nonfunctional requirements are things like government mandates that have to be
included whether users want them or not. Although this distinction makes sense,
the great bulk of software requirements documents do not distinguish between
these two categories as of 2016.

Requirements Creep
Requirements have been known to change during development for more than
60 years. It was only after function points were released in 1978 that requirements
creep could be measured explicitly. A sample of software projects was sized using
function points at the end of the requirements phase. Later the same projects were
resized at the point of delivery to customers. As both starting and ending sizes were
known and the calendar month schedules were known, this allowed researchers at
IBM and elsewhere to measure requirements creep exactly. Assume that an appli-
cation was measured at the end of requirements at 1,000 function points. Assume
that the same application was measured 12 months later at release and was found
to be 1,200 function points in size. This is an average monthly growth rate of
1.67%. The total growth or creep was 200 function points or about 16.66 function
points per month. The additional 200 function points show a total growth of 20%.
Growth does not stop with delivery, but continues forever as long as the software
is in active use. Postrelease growth is slower at about 8% per calendar year. As of
2016, there is little or no data on SNAP nonfunctional requirements growth and
change over time. Indeed the IFPUG SNAP committee has not yet addressed this
topic; a major omission.
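
The creep arithmetic in the example above is easy to reproduce; a short sketch using the same figures:

```python
# Requirements creep measured from function point counts taken at the end of
# requirements and again at delivery, using the example figures in the text.

def creep_summary(size_at_requirements_end, size_at_delivery, schedule_months):
    growth_fp = size_at_delivery - size_at_requirements_end
    total_growth_pct = 100.0 * growth_fp / size_at_requirements_end
    monthly_growth_pct = total_growth_pct / schedule_months
    fp_added_per_month = growth_fp / schedule_months
    return growth_fp, total_growth_pct, monthly_growth_pct, fp_added_per_month

print(creep_summary(1_000, 1_200, 12))
# (200, 20.0, 1.666..., 16.666...): 20% total creep, about 1.67% per month
```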

Requirements Metrics
It is a bad assumption to believe that user requirements are error-free. User require-
ments contain many errors and some requirements may be toxic and should not be
in the application at all; Y2K is an example of a toxic requirement. The essential
metrics for software requirements include but are not limited to (1) requirements
size in pages, words, and diagrams, (2) requirements errors found by inspection,
(3) possibly toxic requirements pointed out to users by domain experts, (4) rates
of requirements growth or change during development, (5) requirements deferred
to future releases in order to achieve arbitrary schedule targets, and (6) whether
requirements are functional or nonfunctional. As the author’s estimating SRM tool
has an early sizing feature that allows it to be used prior to requirements, it predicts
all six of these essential requirements metrics. An example for an application of
10,000 function points at delivery using the RUP would be: starting size = 8,023
function points; creep = 1,977 function points; monthly rate of creep = 1.82%;
total creep = 19.77%; requirements defects = 1,146; toxic requirements = 27;
requirements completeness = 73.68%; explicit requirements = 2,512; function
points per requirement = 3.37; SNAP points 1,250 with a growth of 150 leading to
a total of 1,400 SNAP points. All of these predictions can be made before require-
ments analysis starts by using pattern matching from similar completed projects.
Requirements are volatile and also error prone. In the future formal patterns of
reusable requirements will no doubt smooth out current problems and provide bet-
ter overall requirements than are common in 2016.

Return on Investment
A Google search on the phrase return on investment (ROI) will bring up hundreds
of articles and over a dozen flavors of ROI including return on assets, financial rate
of return, economic rate of return, and a number of others. For software projects,
ROI is often not done at all and is seldom done well. What is needed for software
ROI are accurate predictions of project schedules and costs prior to starting, and
accurate measures of schedules and costs after completion. Quality also needs to be
predicted and measured because poor quality will puff up maintenance costs and
warranty repair costs to alarming levels, and may also trigger expensive litigation
due to consequential damages. It is technically possible to predict schedules and
costs with good accuracy using any of the available parametric estimation tools,
with the caveat that they cannot be used until requirements are known. The SRM
tool includes a patent-pending early sizing feature that allows it to predict costs and
schedules prior to requirements. The SRM tool also includes ROI as a standard
output, assuming that the client who commissioned the estimate can provide value
data for tangible and intangible value. SRM compares costs to value to calculate
ROI, but it does not predict value. That must be user-supplied information because
value can range from almost nothing to creating an entirely new business that will
earn billions of dollars, as shown by Microsoft, Facebook, and Twitter. The essen-
tial problems with ROI calculations circa 2014 include: (1) optimistic estimates for
costs and schedules, (2) optimistic quality estimates, (3) leaky historical data,
(4) failure to include requirements creep in schedule and cost estimates, (5) very
poor tracking of progress, (6) poor quality control that leads to delays and cost
overruns, and (7) optimistic revenue or value predictions.

Reusable Materials
As custom design and manual coding of software is intrinsically expensive and
error-prone, there is a need to move away from custom development and move
toward construction from certified standard reusable components. However,
reuse covers much more than just source code. The full suite of reusable artifacts
include but are not limited to reusable (1) architecture, (2) design, (3) requirements,
(4) plans, (5) estimates, (6) data structures, (7) source code, (8) test plans, (9) test
cases, (10) test scripts, and (11) user documentation. Currently software is built
rather like an America’s Cup yacht or a Formula 1 race car, using custom designs
and extensive manual labor. In the future, software might be constructed like regu-
lar automobiles such as Fords or Toyotas, using assembly lines of reusable materials
and perhaps even robots. Neither team experience, methodologies, nor program-
ming languages have as much impact on software productivity rates as does reuse
of certified components.

RICE Objects
ERP companies such as SAP and Oracle use the phrase RICE objects as a work met-
ric. The acronym RICE stands for reports, interfaces, conversions, and enhance-
ments. These are some of the activities associated with deploying ERP and building
and customizing applications to work with ERP packages.

Risk Metrics
Software projects have a total of about 210 possible risk factors. Among these are
outright cancellation, schedule delays, cost overruns, breach of contract litiga-
tion, patent litigation, cyber attacks, and many more. The risk analysis engine of
the author’s SRM tool predicts 20 of the 210 risks and assigns each risk a prob-
ability percent based on historical data derived from similar projects of the same
size and type. For example, risk of breach of contract litigation ranges from 0%
for in-house projects to about 15% for large contract waterfall projects with inex-
perienced personnel. Risk severities are also predicted using a scale from 1 to 10
with the lower numbers being less serious. Risk avoidance probabilities are also
calculated based on weighted combinations of CMMI levels, methodologies, and
team experience levels. The worst case would be cowboy development at CMMI 1
by a team of novices. The best case would be TSP or RUP at CMMI 5 by a team
of experienced personnel. These risk predictions and metrics are standard features
of the SRM tool. The normal way of presenting risks resembles the chart as given
in Table A.15.
Risks vary by size, complexity, experience, CMMI levels, and other factors that
are specific to individual projects. A short sketch showing how the aggregate financial
risk figure in Table A.15 can be derived follows the table.

Table A.15 Probabilities and Severity Levels of Software Project Risks


Risk Analysis from Similar Projects | Normal Odds (%) | Risk Severity

Optimistic cost estimates 35.00 9.50

Inadequate quality control using only testing 30.00 10.00

Excessive schedule pressure from clients, executives 29.50 6.50

Technical problems hidden from clients, executives 28.60 10.00

Executive dissatisfaction with progress 28.50 8.50

Client dissatisfaction with progress 28.50 9.00

Poor quality and defect measures (omits >10% of bugs) 28.00 7.00

Poor status tracking 27.80 7.50

Significant requirements creep (>10%) 26.30 8.00

Poor cost accounting (omits >10% of actual costs) 24.91 6.50

Schedule slip (>10% later than plan) 22.44 8.00

Feature bloat and useless features (>10% not used) 22.00 5.00

Unhappy customers (>10% dissatisfied) 20.00 9.25

Cost overrun (>10% of planned budget) 18.52 8.50

High warranty and maintenance costs 15.80 7.75

Cancellation of project due to poor performance 14.50 10.00

Low reliability after deployment 12.50 7.50

Negative ROI due to poor performance 11.00 9.00

Litigation (patents) 9.63 9.50

Security vulnerabilities in software 9.60 10.00

Theft of intellectual property 8.45 9.50

Litigation (breach of contract) 7.41 9.50

Toxic requirements that should be avoided 5.60 9.00

Low team morale 4.65 5.50

Average Risks for this size and type of project 18.44 8.27

Financial Risk: (cancel, cost overrun, and negative ROI) 44.02
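
A hedged sketch of how the financial risk figure at the bottom of Table A.15 can be derived from the individual rows; the three rows selected follow the table's own parenthetical note:

```python
# Aggregating individual risk probabilities, as in the bottom of Table A.15.
# The three entries are the cancellation, cost overrun, and negative ROI rows.

financial_risks = {
    "Cancellation of project due to poor performance": 14.50,
    "Cost overrun (>10% of planned budget)": 18.52,
    "Negative ROI due to poor performance": 11.00,
}

financial_risk = sum(financial_risks.values())   # a simple sum, matching the table
print(round(financial_risk, 2))                  # 44.02
```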



Root-Cause Analysis
The phrase root-cause analysis refers to a variable set of methods and statistical
approaches that attempt to find out why specific problems occurred. Root-cause
analysis or RCA is usually aimed at serious problems that can cause harm or large
costs if not abated. RCA is not only used for software but is widely used by many
high-technology industries and also by medical and military researchers. As an example
of software RCA, a specific high-severity bug in a software application might have
slipped through testing because no test case looked for the symptoms of the bug.
A first-level issue might have been that project managers arbitrarily shortened test
case design periods. Another cause might be that test personnel did not use formal
test design methods based on mathematics such as design of experiments. Further,
testing might have been performed by untrained developers rather than by certified
test personnel. The idea of RCA is to work backward from a specific problem and
identify as many layers of causes as can be proven to exist. RCA is expensive, but
some tools are available from commercial and open-source vendors. See also failure
mode and effects analysis (FMEA) discussed earlier in this report.

Sample Sizes
An interesting question is what kinds of sample sizes are needed to judge software
productivity and quality levels? Probably the minimum sample would be 20 projects
of the same size, class, and type. As the permutations of size, class, and type total
more than 2,000,000 instances, a lot of data are needed to understand the key
variables that impact software project results. To judge national productivity and
quality levels about 10,000 projects per country would be useful. As software is a
major industry in more than 100 countries, the global sample size for the overall
software industry should include about 1,000,000 projects. As of 2014 the sum
total of all known software benchmarks is only about 80,000 software projects.
See the discussion of taxonomies later in this report.

Schedule Compression
Software schedules routinely run later than planned. Analysis by the author of more
than 500 projects found that average schedule demands by clients or senior manag-
ers approximated raising application size in function points to the 0.3 power. Actual
delivery dates for the same projects had exponents ranging from the 0.37 to 0.41
power. For a generic application of 1,000 function points clients wanted the software
in 8 calendar months and it took between 12 and 17 months to actually deliver it.
This brings up two endemic problems for the software industry: (1) software clients
and executives consistently demand schedules shorter than the time in which it is possible
to build the software, and (2) software construction methods need to switch from custom devel-
opment to using larger volumes of standard reusable components in order to shorten
schedules by 50% or more. Normal attempts to compress software project schedules
include adding personnel, which usually backfires, truncating quality control and
test periods, which always backfires, and increasing overlap between activities. None
of these are usually successful and indeed may make schedules worse. Another com-
mon method of schedule compression is to defer planned features to a later release.
Use of formal-risk analysis and estimation before starting of projects can minimize
the odds of irrational schedule demands. Benchmarks from projects of the same size
and type can also minimize the odds of irrational schedule demands. Overall, impossible
schedule demands are the #1 cause, and poor development and construction
methods are the #2 cause, of delivering software later than desired.
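
A short sketch of the gap between demanded and actual schedules described above, using the exponents cited in this section:

```python
# Demanded schedules approximate size ** 0.3, while actual delivery schedules
# fall between size ** 0.37 and size ** 0.41, per the analysis cited above.

def months(function_points, exponent):
    return function_points ** exponent

size = 1_000
demanded = months(size, 0.30)
actual_low, actual_high = months(size, 0.37), months(size, 0.41)
print(round(demanded, 1), round(actual_low, 1), round(actual_high, 1))
# roughly 7.9 months demanded versus 12.9 to 17.0 months actually needed
```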

Schedule Overlap
The term schedule overlap defines the normal practice of starting an activity before
a prior activity is completed. See the discussion on Gantt chart for a visual represen-
tation of schedule overlap. Normally for projects, design starts when requirements
are about 75% complete, coding starts when design is about 50% complete, and
testing starts when coding is about 25% complete. This means that the net schedule
of a software project from beginning to end is shorter than the sum of the activ-
ity schedules. Parametric estimation tools and also project management tools that
support PERT and GANTT charts all handle schedule overlaps, which are normal
for software projects. Schedule overlap is best handled using activity-based cost
analysis or task-based cost analysis. Agile projects with a dozen or more sprints are
a special case for schedule overlap calculations.

Schedule Slip
As discussed earlier in the section on schedule compression, users routinely demand
delivery dates for software projects that are quicker than technically possible.
However, schedule slip is not quite the same. Assume that a project is initially
scheduled for 18 calendar months. At about month 16 the project manager reports
that more time is needed and the schedule will be 20 months. At about month 19
the project manager reports that more time is needed and the schedule will be 22
months. At about month 21 the manager reports that more time is needed and the
schedule will be 24 months. In other words, schedule slip is the cumulative sequence
of small schedule delays usually reported only a short time before the nominal
schedule is due. This is an endemic problem for large software projects. The root
causes are inept scheduling before the project starts, requirements creep during devel-
opment, and poor quality control that stretches out testing schedules. It should be
noted that most software projects seem to be on time and even early until testing,
at which point they are found to have so many bugs that the planned test schedules
double or triple.

Scope
The word scope in a software context is synonymous with size and is measured
using function points, story points, LOC, or other common metrics. Scope creep is
another common term that is synonymous with requirements creep.

Security Metrics
In today’s world of hacking, denial of service, and cyber attacks companies are
beginning to record attempts at penetrating firewalls and other defenses. Also
measured are the strength of various encryption schemes for data and confidential
information, as well as password strength. This is a complex topic that changes
rapidly, so a Google search on security metrics is recommended in order to stay
current. After software is released and actually experiences attacks, data should be
kept on the specifics of each attack and also on the staffing, costs, and schedules for
recovery and also financial losses to both companies and individuals.

Six-Sigma for Software


The concept of Six-Sigma was developed at Motorola circa 1986. It became famous
when Jack Welch adopted Six-Sigma for General Electric. The concept originated
for hardware manufacturing. More recently Six-Sigma has been applied to software,
with mixed but generally good results. The term Six-Sigma is a mathematical way of
expressing reliability or the odds of defects occurring. To achieve Six-Sigma results
DRE would need to be 99.99966%. The current U.S. average for DRE is less than
90% and very few projects achieve 99%. The Six-Sigma approach has an extensive
literature and training, as does Six-Sigma for software. A Google search on the
phrase Six-Sigma for software will bring up hundreds of documents and books.

Size Adjustment
Many of the tables and graphs in this report, and others by the same author, show
data expressed in even powers of 10, that is, 100 function points, 1,000 function
points, 10,000 function points, and so on. This is not because the projects were all
even values. The author has a proprietary tool that converts application size to even
values. For example, if several PBX switches range from a low of 1,250 function
points to a high of 1,750 function points they could all be expressed at a median
value of 1,500 function points. The reason for this is to highlight the impact of
specific factors such as methodologies, experience levels, and CMMI levels. Size
adjustment is a subtle issue and includes adjusting defect potentials and require-
ment creep. In other words, size adjustments are not just adding or subtracting
function points and keeping all other data at the same ratios as the original. For
example, if software size in function points is doubled, defect potentials will go up
by more than 100% and DRE will decline.

SNAP Metrics
Function point metrics were developed to measure the size of software features that
benefit users of the software. But there are many features in software that do not
benefit users but are still required due to technical or legal constraints. This new
metric, distinct from function points, is termed software nonfunctional assessment
process or SNAP. As an example of a nonfunctional requirement, consider home con-
struction. A home owner with an ocean view will prefer windows facing toward the
ocean, which is a functional requirement. However, local zoning codes and insur-
ance regulations mandate that windows close to the ocean must be hurricane proof,
which is very expensive. This is a nonfunctional requirement. Function points and
SNAP metrics are calculated separately. However, data from the author's clients who have
tried SNAP suggest that SNAP point totals approximate 15%–20% of the volume of function points. Due
to the fact that SNAP is new and only slowly being deployed, there may be future
changes in the counting method and additional data in the future. Some examples
of software SNAP might include security features and special features so the software
can operate on multiple hardware platforms or multiple operating systems. As this
report is being drafted, an announcement in March of 2014 indicates that IFPUG
and Galorath Associates are going to perform joint studies on SNAP metrics.

Software Employment Statistics


Most of us depend on the Bureau of Labor Statistics, the Census Bureau, and the
Department of Commerce for statistics about software employment. However, from
a study of software occupation groups commissioned by AT&T, some sources of error
were noted. The Bureau of Labor Statistics showed about 1,018,000 software pro-
grammers in 2012 with an impressive 22% growth rate. Our study showed that not
a single human resource group kept good records or even knew how many software
personnel were employed. Further, some software developers building embedded soft-
ware at companies such as Ford and several medical and avionics companies refused
to be identified as software engineers due to low status. They preferred their original
academic job titles of automotive engineer, aeronautical engineer, or anything but
software engineer. Many hi-tech companies used a generic title of member of the techni-
cal staff that included both software and hardware engineers of various kinds, without
specifically identifying the software personnel. It is very likely that government statis-
tics are on the low side. If the HR groups of Fortune 500 companies do not know how
many software people work there, probably the government does not know either.

Software Quality Assurance


As a young software engineer working for IBM in California, the author worked in
one of IBM’s software quality assurance (SQA) organizations. As SQA groups evalu-
ate the quality status of software projects, they need an independent organization
separate from the development organization. This is to ensure that SQA opinions are
objective, and not watered down based on threats of reprisals by project managers
in case of a negative opinion. The SQA groups in major companies collect quality
data and also provide quality training. The SQA personnel also participate in formal
inspections, often as moderators. In terms of staffing, SQA organizations are typi-
cally about 3% of the development team size, although there are ranges. The IBM
SQA organizations also had a true research and development function over and above
normal project status reporting. For example, while working in an IBM quality assur-
ance group, the author performed research on software metrics pros and cons, and
also designed IBM’s first parametric software estimation tool in 1973. Formal SQA
organizations are not testing groups, although some companies call testing groups
by this name. Testing groups usually report to development management, whereas
SQA groups report through a separate organization up to a VP of Quality. One of
the more famous VPs of Quality was Phil Crosby of ITT, whose book Quality Is Free
(1979) remains a best-seller even in 2014. The author also worked for ITT and was
the software representative to the ITT corporate quality council.

Software Usage and Consumption Metrics


To complete the economic models of software projects, usage, and consumption as
well as production need to be measured. As it happens function point metrics can
be used for consumption studies as well as for production and maintenance studies.
For examples, physicians have access to more than 3,000,000 function points in
MRI devices and other diagnostic tools; attorneys have access to more than 325,000
function points of legal tools; project managers have access to about 35,000 function
points if they use tools such as parametric estimation, Microsoft Project, and cost accounting. The overall set of metrics and measures needed is shown below.
[Figure: Functional metrics in industry. The diagram contrasts production studies of software projects (sizing, productivity, quality, schedules, and costs) with value analysis and usage studies at the individual, organizational, and enterprise levels, including software portfolios measured by size, replacement cost, productivity, and quality, and usage by occupation groups such as managers, engineers, marketing and sales, administrators, supervisors, manufacturing, and purchasing.]

A full economic model of commercial, systems, and embedded applications as well as some IT applications would combine production and usage data. The ranges in project management tool usage for leading, average, and lagging projects are given in Table A.16.
Similar data are also known for software development, software quality assur-
ance, software maintenance, and software testing. It is interesting and significant

Table A.16 Software Project Management Tools


Numbers and Size Ranges of Software Project Management Tools
(Tool Sizes are Expressed in Terms of IFPUG Function Points, Version 4.2)

Project Management Tools Lagging Average Leading

1 Project planning 1,000 1,250 3,000

2 Project-cost estimating 3,000

3 Statistical analysis 3,000

4 Methodology management 750 3,000

5 Reusable-feature analysis 2,000

6 Quality estimation 2,000

7 Assessment support 500 2,000

8 Project office support 500 2,000

9 Project measurement 1,750

10 Portfolio analysis 1,500

11 Risk analysis 1,500

12 Resource tracking 300 750 1,500

13 Governance tools 1,500

14 Value analysis 350 1,250

15 Cost-variance reporting 500 500 1,000

16 Personnel support 500 500 750

17 Milestone tracking 250 750

18 Budget support 250 750

19 Function-point analysis 250 750


20 Backfiring: LOC to FP 300

21 Earned value analysis 250 300

22 Benchmark data collection 300

Subtotal 1,800 4,600 30,000

Tools 4 12 22

that the largest differences in tool use between laggards and leaders are for project
management and quality assurance. Laggards and leaders use similar tool suites for
development, but the leaders use more than twice as many tools for management and quality assurance tasks as do the laggards.

Sprint
The term sprint is an interesting Agile concept and term. In some Agile projects,
overall features are divided into sets that can be built and delivered separately, often
in a short time period of six weeks to two months. These subsets of overall applica-
tion functionality are called sprints. The use of this term is derived from racing and
implies a short distance rather than a marathon. The sprint concept works well for
projects below 1,000 function points, but begins to encounter logistical problems at
about 5,000 function points. For really large systems above 10,000 function points there would be hundreds of sprints, and there are no current technologies for decomposing such large applications into small sets of independent features that fit the sprint concept.

Staffing Level
In the early days of software the term staffing level meant the number of pro-
grammers it might take to build an application, with ranges from 1 to perhaps 5.
In today’s world of 2014, with a total of 126 occupation groups, this term has become much more complicated. Parametric estimation tools such as SRM and also project management tools such as Microsoft Project can predict the number
of people needed to build software. SRM predicts a standard set of 20 occupation
groups including business analysts, architects, programmers, test personnel, qual-
ity assurance, technical writers, and managers. Staffing is not constant for most
occupations, but rises and shrinks as work is finished. Staffing levels by occupation
include average numbers of personnel and peak numbers of personnel. See also
the Rayleigh curve discussion earlier in this report. The staffing profile for a major
system of 25,000 function points is shown below and is a standard output from the
author’s SRM tool (Table A.17).
As can be seen, software is a multidisciplinary team activity with many different occupation groups and special skills.

Table A.17 Software Occupation Groups for 25,000 Function Points


Occupation Groups and Part-Time Specialists

Normal Staff Peak Staff

1 Programmers 94 141

2 Testers 83 125

3 Designers 37 61

4 Business analysts 37 57

5 Technical writers 16 23

6 Quality assurance 14 22

7 1st line managers 15 20

8 Database administration 8 11

9 Project office staff 7 10

10 Administrative support 8 10

11 Configuration control 5 6

12 Project librarians 4 5

13 2nd line managers 3 4

14 Estimating specialists 3 4

15 Architects 2 3

16 Security specialists 1 2

17 Performance specialists 1 2

18 Function point counters 1 2

19 Human factors specialists 1 2

20 3rd line managers 1 1



Standish Report (Chaos Report)


The consulting company of the Standish Group publishes an annual report on
IT failures. This is called the chaos report but is also cited as the Standish report.
The report is widely cited but also widely challenged. Even so it contains inter-
esting data and information about project failures and failure modes. Note that
the Standish report is limited to IT projects and does not deal with systems or
embedded software, which have lower failure rates than IT projects. Nor does it
deal with government and military projects, which have higher failure rates than
IT projects.

Story Points
Story points are a somewhat subjective metric based on analysis of designs expressed
in terms of user stories. Story points are not standardized and vary by as much as
400% from company to company. They are used primarily with Agile projects and
can be used to predict velocity. A Google search will bring up an extensive literature
including several papers that challenge the validity of story points.

Successful Projects (Definition)


The terms software failure and software success are ambiguous in the literature. The
author’s definition of success attempts to quantify the major issues troubling soft-
ware: success means <3.00 defects per function point, >97% DRE, >97% of valid
requirements implemented, <10% requirements creep, 0 toxic requirements forced
into application by unwise clients, >95% of requirements defects removed, development
schedule achieved within + or – 3% of a formal plan, and costs achieved within + or –
3% of a formal parametric cost estimate. See also the definition of failing projects
earlier in this report. Another cut at a definition of a successful project would be
one that is in the top 15% in terms of software productivity and quality from all
of the projects collected by benchmark organizations such as Namcook Analytics,
Q/P Management Group, Software Productivity Research, and others.
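As an illustration only, the following short Python sketch encodes the quantitative success thresholds listed above as a simple checklist. The dictionary keys and the sample values are hypothetical; the thresholds themselves come from the definition in this entry.

def is_successful(project: dict) -> bool:
    # Thresholds taken from the definition of a successful project given above.
    return (
        project["defect_potential_per_fp"] < 3.00
        and project["dre_percent"] > 97.0
        and project["valid_requirements_implemented_percent"] > 97.0
        and project["requirements_creep_percent"] < 10.0
        and project["toxic_requirements"] == 0
        and project["requirements_defects_removed_percent"] > 95.0
        and abs(project["schedule_variance_percent"]) <= 3.0  # versus a formal plan
        and abs(project["cost_variance_percent"]) <= 3.0      # versus a parametric estimate
    )

sample = {  # hypothetical project data
    "defect_potential_per_fp": 2.1,
    "dre_percent": 98.5,
    "valid_requirements_implemented_percent": 98.0,
    "requirements_creep_percent": 6.0,
    "toxic_requirements": 0,
    "requirements_defects_removed_percent": 96.0,
    "schedule_variance_percent": 2.0,
    "cost_variance_percent": -1.5,
}
print(is_successful(sample))  # True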

Taxonomy of Software Projects


Taxonomies are the underpinning of science and extremely valuable to all sciences.
Software does not yet have an agreed-to taxonomy of software application sizes
and types used by all companies and all types of software. However, as part of
the author’s benchmark services, we have developed a useful taxonomy that allows apples-to-apples comparisons of any and all kinds of projects. The taxonomy consists
of eight primary topics:

1. Application nature (new, enhancement, COTS modification, etc.)


2. Application scope (algorithms, module, program, system, etc.)
3. Application class (internal, commercial, defense, outsource, etc.)
4. Application type (Web, IT, embedded, telecom, etc.)
5. Platform complexity (single platform, multiple platforms, etc.)
6. Problem complexity (low, average, and high)
7. Code complexity (low, average, and high)
8. Data complexity (low, average, and high)

In addition to the taxonomy itself, the author’s benchmark recording method also
captures data on 20 supplemental topics that are significant to software project
results. These include:

1. Development methodology (Agile, RUP, TSP, waterfall, spiral, etc.)
2. Quality methodologies (inspections, static analysis, test stages, etc.)
3. Activity-based cost analysis of development steps
4. Defect potentials and DRE
5. Special attributes (CMMI, SEMAT, FDA or FAA certification, etc.)
6. Programming languages (assembly, Java, Objective C, HTML, mixed, etc.)
7. Requirements growth during development and after release
8. Experience levels (clients, developers, testers, managers, etc.)
9. Development country or countries (United States, India, Japan, multiple, etc.)
10. Development state or region (Florida, New York, California, etc.)
11. Hardware platforms (smartphone, tablet, embedded, mainframe, etc.)
12. Software platforms (Windows, Linux, IBM, Apple, etc.)
13. Tool suites (for design, coding, testing, project management, etc.)
14. Volumes of reusable materials available (0% to 100%)
15. Compensation levels and burden rates (project cost structures)
16. Country where the software is being built
17. Paid and unpaid overtime for the development team
18. Experience levels for clients, managers, and development teams
19. North American Industry Classification (NAIC code) for all industries
20. Hardware and software platforms for the application

The eight taxonomy factors and the 20 supplemental factors make comparisons of
projects accurate and easy to understand by clients. The taxonomy and supplemen-
tal factors are also used for pattern matching or converting historical data into use-
ful estimating algorithms. As it happens, applications that have the same taxonomy
are also about the same in terms of schedules, costs, productivity, and quality. That
being said, there are millions of permutations from the factors used in the author’s
taxonomy. However, the vast majority of software applications can be encompassed by fewer than 100 discrete patterns.
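To make the idea of pattern matching concrete, the following Python sketch models the eight primary taxonomy topics as a simple record. The field names and example values are illustrative assumptions; the author's actual benchmark taxonomy uses formal code tables for each topic.

from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class ProjectTaxonomy:
    nature: str               # new, enhancement, COTS modification, etc.
    scope: str                # algorithm, module, program, system, etc.
    application_class: str    # internal, commercial, defense, outsource, etc.
    application_type: str     # Web, IT, embedded, telecom, etc.
    platform_complexity: str  # single platform, multiple platforms, etc.
    problem_complexity: str   # low, average, high
    code_complexity: str      # low, average, high
    data_complexity: str      # low, average, high

# Projects whose taxonomy tuples match tend to have similar schedules, costs,
# productivity, and quality, so they can be benchmarked against each other.
a = ProjectTaxonomy("new", "system", "internal", "IT",
                    "multiple platforms", "average", "average", "high")
b = ProjectTaxonomy("new", "system", "internal", "IT",
                    "multiple platforms", "average", "average", "high")
print(astuple(a) == astuple(b))  # True: same pattern, comparable projects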

Technical Debt
The concept of technical debt was put forth by Ward Cunningham. It is a brilliant
metaphor, but not a very good metric as currently defined. The idea of techni-
cal debt is that shortcuts or poor architecture, design, or code made to shorten
development schedules will lead to downstream postrelease work. This is certainly
true. But the use of the term debt brings up the analogy of financial debt, and here
there are problems. Financial debt is normally a two-party transaction between a
borrower and a lender; technical debt is self-inflicted by one party. A subtle issue
with technical debt is that it makes a tacit assumption that shortcuts are needed
to achieve early delivery. They are not. A combination of defect prevention, pretest
defect removal, and formal testing can deliver software with close to zero technical debt faster and cheaper than the same project with shortcuts, which usually skimp
on quality control. A more serious problem is that too many postrelease costs are
not included in technical debt. If an outsource contractor is sued for poor perfor-
mance or poor quality, then litigation and damages should be included in techni-
cal debt. Consequential damages to users of software caused by bugs or failures
should be included in technical debt. Also losses in stock value due to poor quality
should be included in technical debt. Also, about 32% of large systems are canceled
without being completed. These have huge quality costs but zero technical debt. Overall, technical debt seems to encompass only about 17% of the full costs of poor
quality and careless development. Another omission with technical debt is the lack
of a normalization method. Absolute technical debt like absolute financial debt is
important, but it would also help to know technical debt per function point. This
would allow comparisons of various project sizes and also various development
methods. Technical debt can be improved over time if there is an interest in doing
so. Technical debt is currently a hot topic in the software literature so it will be
interesting to see if there are changes in structure and topics over time.

Test Metrics
This is a complex topic and also somewhat ambiguous and subjective. Among the suite of common test metrics circa 2015 are: test cases created, work hours per test case, test work hours per function point, reused regression test cases, test cases per function point, test cases executed successfully, test cases executed and failing, test coverage for branches, paths, code statements, and risks, defects detected, test intervals or schedules, test iterations, and test DRE levels.

Test Coverage
The phrase test coverage is somewhat ambiguous and can be used to describe the
following: the percent of code statements executed during testing, the percent
of branches or paths, and the percent of possible risks for which test cases exist.
All  definitions tend to be inversely related to cyclomatic and essential complex-


ity. For highly complex applications, the number of test cases to approach 100%
coverage can approximate infinity. As of 2016 the only software for which 100%
test coverage can be achieved would be straight-line software with a cyclomatic
complexity number of 1. Test coverage is important but surprisingly ambiguous,
given how poor software quality is and how important testing is. There should be
published data on test coverage by size, cyclomatic complexity, essential complex-
ity, and also by specific test stage such as unit test, function test, regression test, and
so forth. Currently the literature on test coverage tends to be vague and ambiguous
as to what kind of coverage actually occurs.

Total Cost of Ownership


This is a very important metric but one that is difficult to collect and to study. The
term total cost of ownership (TCO) includes development and at least three years
of maintenance, enhancement, and customer support. Some projects have been in
continuous use for more than 25 years. To be really inclusive, TCO should also
include user costs. Further, it would be helpful if TCO included cost drivers such as
finding and fixing bugs, paperwork production, coding, testing, and project man-
agement. The author’s tool SRM can predict and measure TCO for a minimum
of three calendar or fiscal years after release. The actual mathematical algorithms could be extended past three years, but general business uncertainty lowers the odds of accurate predictions more than three years out. For example, a corporate acquisition or merger could make dramatic changes, as could the sale of a business unit.
A typical pattern of TCO for a moderate size application of 2,500 function points
in size for three years might be as follows (Table A.18):
In order to collect TCO data it is necessary to have a continuous measure-
ment program that collects effort data, enhancement data, support data, and defect
repair data at least twice a year for all major projects.

Table A.18 Total Cost of Ownership (TCO)


Three-Year TCO Staffing Effort Months Percent of TCO (%)

Development 7.48 260.95 46.17

Enhancement 2.22 79.75 10.58

Maintenance 2.36 85.13 10.35

Support 0.34 12.29 0.68

User costs 4.20 196.69 32.12

Total TCO 16.60 634.81 100.0



Unpaid Overtime
The majority of U.S. software personnel are termed exempt that means they are
not required to be paid overtime even if they work much more than 40 hours per
week. Unpaid overtime is an important factor for both software costs and software
schedules. Unfortunately, unpaid overtime is the most common form of data that
leaks or does not get reported via normal project tracking. If you are comparing
benchmarks between identical projects and one of them had 10 hours per week
of unpaid overtime, whereas the other had 0 hours of unpaid overtime, no doubt
the project with overtime will be cheaper and have a shorter schedule. But if the
unpaid overtime is invisible and not included in project tracking data, there is no
good way to validate the results of the benchmarks. Among the author’s clients unpaid overtime of about 4 hours per week is common, but omitting this unpaid overtime from formal cost tracking is also common. The impact of unpaid overtime on costs and schedules is significant. Consider an application of 1,000 function points with compensation at U.S. $10,000 per staff month.
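A minimal Python sketch of this scenario follows. It assumes, purely for illustration, an effort of 10 work hours per function point; the 132-hour U.S. work month and the 1,000 function point, $10,000 per staff month figures come from the text. The point is that unreported unpaid overtime makes the same project look cheaper and faster.

FUNCTION_POINTS = 1_000
HOURS_PER_FP = 10            # assumed effort rate, for illustration only
MONTHLY_COST = 10_000        # U.S. dollars per staff month (from the text)
PAID_HOURS_PER_MONTH = 132   # effective U.S. work month (from the text)

def apparent_effort_and_cost(unpaid_hours_per_week: float):
    total_hours = FUNCTION_POINTS * HOURS_PER_FP
    worked_hours_per_month = PAID_HOURS_PER_MONTH + unpaid_hours_per_week * 4  # ~4 weeks/month
    staff_months = total_hours / worked_hours_per_month
    cost = staff_months * MONTHLY_COST  # unpaid hours never show up in the cost data
    return round(staff_months, 1), round(cost)

print(apparent_effort_and_cost(0))   # (75.8, 757576) with no unpaid overtime
print(apparent_effort_and_cost(4))   # (67.6, 675676) with 4 unpaid hours per week

The second result appears roughly 11% cheaper and shorter even though exactly the same number of hours was worked, which is why leaked unpaid overtime distorts benchmark comparisons.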

Use-Case Points
Use-cases are part of the design approach featured by the unified modeling lan-
guage (UML) and included in the RUP. Use-cases are fairly common among IBM
customers as are use-case points. This is a metric based on use-cases and used for
estimation. It was developed in 1993 by Gustav Karner prior to IBM acquiring the
method. This is a fairly complex metric and a Google search is recommended to
bring up definitions and additional literature. Use-case points and function points
can be used for the same software. Unfortunately, use-case points only apply to
projects with use-case designs, whereas function points can be used for all software,
and are therefore much better for benchmarks. IBM should have published conver-
sion rules between use-case points and function points, because both metrics were
developed by IBM. In the absence of IBM data, the author’s SRM tool predicts
and converts data between use-case points and IFPUG function points. A total of
1,000 IFPUG function points is roughly equal to 333 use-case points. However,
this value will change because use-cases also vary in depth and complexity. It is the
responsibility of newer metrics such as use-case points to provide conversion rules
to older metrics such as function points, but this responsibility is seldom acknowl-
edged by metrics developers and did not occur for use-case points.
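Since the text gives an approximate ratio of 1,000 IFPUG function points to 333 use-case points, a rough conversion can be sketched in Python as below. The 3:1 ratio is only an approximation that varies with use-case depth and complexity; this is not the SRM conversion logic itself.

FP_PER_USE_CASE_POINT = 3.0  # approximate ratio quoted above; varies in practice

def use_case_points_to_fp(use_case_points: float) -> float:
    return use_case_points * FP_PER_USE_CASE_POINT

def fp_to_use_case_points(function_points: float) -> float:
    return function_points / FP_PER_USE_CASE_POINT

print(round(fp_to_use_case_points(1_000)))  # ~333 use-case points
print(round(use_case_points_to_fp(333)))    # ~999 IFPUG function points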

User Costs
For internal IT projects users provide requirements, review documents, participate
in phase reviews, and may even do some actual testing. However, user costs are
seldom reported. Also, user costs are normally not included in the budgets for soft-
ware applications. The author’s SRM tool predicts user costs for IT projects. Total
user costs range from less than 50% of software development costs to more than
70% of software development costs. This topic is underreported in the software literature and needs additional research. A sample of typical user costs for a midsize IT project of 2,500 function points is shown below:

User Activities                 Staffing   Schedule    Effort        Costs     $ per FP

User requirements team             3.85       8.06      44.77     $604,407      $241.76
User architecture team             2.78       1.93       5.37      $75,529       $29.01
User planning/estimating team      1.89       1.69       5.13      $69,232       $27.69
User prototype team                3.13       4.03      12.59     $169,989       $68.00
User design review team            5.00       3.22      16.12     $217,587       $87.03
User change control team           3.33      16.12      53.73     $725,288      $290.12
User governance team               2.56      10.48      26.86     $362,644      $145.06
User document review team          6.67       2.01      13.43     $181,322       $72.53
User acceptance test team          5.56       2.42      13.43     $181,322       $72.53
User installation team             4.35       1.21       5.26      $70,952       $28.38

Subtotal                           4.20       5.12     196.69   $2,755,733    $1,062.11

User costs are not always measured even though they can top 65% of development
costs. They are also difficult to measure because they are not usually included in
software project budgets, and are scattered among a variety of different organizations, each of which may have its own budget.

Value (Intangible)
The topic of intangible value is ambiguous and varies from application to applica-
tion. Some of the many forms of intangible value include: medical value, value to
human life and safety, military value for improving military operations, customer
satisfaction value, team morale value, and corporate prestige value. It would be
theoretically possible to create a value point metric similar to function point metrics
to provide a scale or range of intangible values.

Value (Tangible)
Value comes in two flavors: tangible and intangible. Tangible software value comes in several forms: (1) direct revenues such as software sales, (2) indirect revenues such as training and maintenance contracts, and (3) operating cost reductions and work efficiency improvements. Tangible value can be expressed in terms of currencies such as dollars and is included in a variety of accounting formulas such as accounting rate of return and internal rate of return.

Variable Costs
As the name implies, variable costs are the opposite of fixed costs and tend to be
directly proportional to the number of units produced. An example of a variable
cost would be the number of units produced per month in a factory. An example of
a fixed cost would be the monthly rent for the factory itself. For software, an impor-
tant variable cost is the amount and cost of code produced for a specific require-
ment, which varies by language. Another variable cost would be the number and
costs of bug repairs on a monthly basis. The software industry tends to blur together
fixed and variable costs, and this explains endemic errors in the LOC metric and
the cost per defect metric.
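The distortion caused by blurring fixed and variable costs can be shown with a small Python sketch using purely hypothetical numbers: a fixed cost for writing and running test cases plus a variable repair cost per bug. As quality improves and fewer bugs are found, cost per defect climbs even though total cost falls, while cost per function point behaves sensibly.

FIXED_TEST_COST = 10_000     # hypothetical: writing and running test cases
VARIABLE_COST_PER_BUG = 200  # hypothetical: repair cost per defect found
APPLICATION_SIZE_FP = 100    # hypothetical application size in function points

for defects_found in (500, 50, 5):
    total_cost = FIXED_TEST_COST + VARIABLE_COST_PER_BUG * defects_found
    print(f"defects={defects_found:3d}  total=${total_cost:,}  "
          f"cost per defect=${total_cost / defects_found:,.0f}  "
          f"cost per FP=${total_cost / APPLICATION_SIZE_FP:,.2f}")
# defects=500  total=$110,000  cost per defect=$220    cost per FP=$1,100.00
# defects= 50  total=$20,000   cost per defect=$400    cost per FP=$200.00
# defects=  5  total=$11,000   cost per defect=$2,200  cost per FP=$110.00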

Velocity
The term velocity is a metric widely used by Agile projects. It can be used in both
forward predictive modes and historical data collection modes. Velocity can also be used with tangible deliverables such as document pages and also with synthetic metrics such as story points. As velocity is not precisely defined, users could do a Google search to bring up additional literature on the velocity metric.

Venn Diagram
In 1880, the mathematician John Venn developed a simple graphing technique to
teach set theory. Each set was represented by a circle. The relationship between two
sets could be shown by the overlap between the circles. The use of Venn diagrams
is much older than software, and is used by dozens of kinds of engineers and math-
ematicians due to the simplicity and elegance of the approach. A simple Venn dia-
gram with two sets is shown below.

Venn diagrams can be used with more than two circles, of course, but they become complex and lose visual appeal with more than four circles.

Visual Status and Progress Tracking


It is not uncommon for large software projects to utilize both project offices and also
war rooms. One of the purposes of both is to have current data on project status
and progress, often using visual methods. The most common visual method is to
use an entire wall and put up a general process flow chart that indicates completed
tasks, work in progress, and future tasks. One company, Shoulder’s Corporation,
has gone beyond wall charts and used a three-dimensional (3D) tracking system
using colored balls suspended from strings. Project offices are more sedate, but
usually include status-tracking tools such as Microsoft Project and Computer
Aid’s Automated Project Office. It is obvious that the industry needs animated 3D
graphical status tracking tightly coupled with planning and estimation. The track-
ing tool should also display continuous requirements growth and visualization of
defect removal progress. The tool also needs to continue beyond release and show
annual growth plus customer use and customer bug reports for 10 years or more.

War Room
In a software context, a war room is a room set aside for project planning and status
tracking. Usually they have tables with planning documents and often one or more
walls are covered with project flow diagrams that indicate current status. Usually
war rooms are found for large systems in the 10,000 function point size range.

Warranty Costs
Most software projects do not have warranties. Look at the fine print on almost any
box of commercial software and you will see phrases such as no warranty expressed
or implied. In the rare cases where some form of warranty is provided, it can range
from replacement of a disk with a new version to actually fixing problems. There is
no general rule and each application by each company will probably have a unique
warranty policy. This is professionally embarrassing for the software industry,
which should offer standard warranties for all software. Some outsource contracts
include warranties, but here too there are variations from contract to contract.

Work Hours
There is a major difference between the nominal number of hours worked and the
actual number of hours worked. In the United States the nominal work week is
40 hours. But due to lunch breaks, coffee breaks, and other nonwork time, the
effective work week is around 33 hours per week or 132 hours per month. There
are major differences in work hours from country to country and these differences
are important for both software measurement and software estimation. The effective
work month for the United States is 132 hours, for China 186 hours, for Sweden
126 hours, and so forth. These variances mean that a project that might require one
calendar month in the United States would require only three weeks in China but
more than one month in Sweden.
Variations in work hours per month do not translate one-for-one into higher
or lower productivity. Other topics such as experience and methodologies are also
important. Even so the results are interesting and thought-provoking.

Work Hours per Function Point


The two most common methods for expressing productivity with function point
metrics are work hours per function point and function points per staff month. The
two are mathematically related, but not identical due to variations in the number
of hours worked per month. Assume a software project of 100 function points that
can be completed at a rate of 10 hours per function point. In the United States
at 132 hours per month this project would take 1,000 hours and 7.58 calendar
months, with a productivity rate of 13.2 function points per month. In China the
project would also take 1,000 hours but only 5.38 calendar months, with a produc-
tivity rate of 18.6 function points per month.
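The arithmetic in this paragraph can be reproduced with a few lines of Python. The function name is an illustrative assumption; the work-month values are the ones quoted in this report for the United States, China, and Sweden.

def productivity_and_schedule(size_fp: float, hours_per_fp: float, work_hours_per_month: float):
    total_hours = size_fp * hours_per_fp
    months = total_hours / work_hours_per_month   # months of effort at the national work month
    fp_per_staff_month = size_fp / months         # equals work_hours_per_month / hours_per_fp
    return round(months, 2), round(fp_per_staff_month, 1)

print(productivity_and_schedule(100, 10, 132))  # United States: (7.58, 13.2)
print(productivity_and_schedule(100, 10, 186))  # China:         (5.38, 18.6)
print(productivity_and_schedule(100, 10, 126))  # Sweden:        (7.94, 12.6)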

Zero Defects
The ultimate goal of software engineering is to produce software applications with
zero defects after release. As software applications are known to be error-prone, this
is a challenging goal indeed. Custom designs and hand coding of applications are
both error-prone and the average DRE as of 2014 is less than 90%, and defects aver-
age more than 3.0 per function point when requirements defects, design defects,
code defects, and bad-fix defects are all enumerated. The best approach to achieving
zero-defects for software would be to construct applications from libraries of reusable


components that are certified to near zero-defect levels. The reusable components
would have to go through a variety of defect prevention, pretest removal, and test
stages. It is clear that cost per defect cannot be used for zero-defect quality costs, but cost per function point works very well.

Zero-Size Software Changes


One of the most difficult activities to predict or measure is that of changes to soft-
ware that have zero function points. Two examples might be (1) shifting an input
question from the bottom of the screen to the top and (2) reversing the sequence
of printing a software cost estimate and showing outputs before inputs, instead of
the normal way of showing inputs first. Both examples require work but have zero function points because the features are already present in the application. One way of measuring these changes would be to backfire from the source code that is changed. Another method would be to record the hours needed for the work and then apply local costs and calculate function points from costs; that is, if past local projects cost U.S. $1,000 per function point and the zero-size change took U.S. $1,000 to accomplish, it is probably about 1 function point in size.
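The cost-based approximation can be written as a one-line Python helper, shown here only to make the arithmetic explicit; the name and sample inputs are hypothetical.

def equivalent_function_points(change_cost: float, local_cost_per_fp: float) -> float:
    # Equivalent size of a zero-function-point change, backfired from local cost history.
    return change_cost / local_cost_per_fp

print(equivalent_function_points(1_000, 1_000))  # 1.0 -> roughly 1 function point
print(equivalent_function_points(4_500, 1_000))  # 4.5 -> roughly 4.5 function points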

Summary and Conclusions for Appendix 1


Over the past 60 years the software industry has become one of the largest indus-
tries in human history, and software applications have changed every aspect of
business, government, and military operations. No other industry has had such a
profound change on human communication and human knowledge transfer.
But in spite of the many successes of the software industry, software applications are characterized by poor quality when released, frequent cost and schedule overruns, and many canceled projects. Further, software is one of the most labor-
intensive industries in history and approaches cotton cultivation in total work hours
to deliver a product.
In order to solve the problems of software and convert a manual and expensive
craft into a modern engineering profession with a high degree of manufacturing
automation, the software industry needs much better metrics and measurement
disciplines, and much more in the way of standard reusable components.
Better measures and better metrics are the stepping stones to software engi-
neering excellence. It is hoped that this report will highlight both measurement
problems and also increase the usage of effective metrics such as function points
and DRE.
Appendix 2: Twenty-Five Software Engineering Targets from 2016 through 2021

Introduction
The purpose of measurement and metrics is to gain insights and improve software
results. Now that metrics problems have been discussed, it is appropriate to show a
selection of potential improvements that seem to be technically feasible.
Following is a collection of 25 goals or targets for software engineering progress developed by Namcook Analytics LLC for the five years between 2016 and 2021. Some of these goals are technically achievable now in 2016, but so far they have been achieved only by a small selection of leading companies. Unfortunately fewer than 5% of U.S. and global companies have achieved any of these goals, and less than 0.1% of companies have achieved most of them. None of the author's clients have yet achieved every goal, although some probably will by 2017 or 2018.
The author suggests that every major software-producing company and gov-
ernment agency have their own set of five-year targets, using the current list as a
starting point.

1. Raise defect removal efficiency (DRE) from <90.0% to >99.5%. This is the most important goal for the industry. It cannot be achieved by testing
alone but requires pretest inspections and static analysis. Automated proofs
and automated analysis of code are useful. Certified test personnel and math-
ematical test case design using techniques such as design of experiments are
useful. The DRE metric was developed by IBM circa 1970 to prove the value

of inspections. It is paired with the defect potential metric discussed in the next
paragraph. DRE is measured by comparing all bugs found during develop-
ment to those reported in the first 90 days by customers. The current U.S.
average is about 90%. Agile is about 92%. Quality-strong methods such as Rational Unified Process (RUP) and Team Software Process (TSP) usually top 96% in DRE. Only a few top companies using a full suite of defect prevention, pretest defect removal, and formal testing with mathematically designed test cases and certified test personnel can top 99% in DRE. The upper limit of
DRE circa 2015 is about 99.6%. DRE of 100% is theoretically possible but has
not been encountered on more than about 1 project out of 10,000.
2. Lower software defect potentials from >4.0 per function point to
<2.0 per function point. The phrase defect potentials was coined in IBM
circa 1970. Defect potentials are the sum of bugs found in all deliverables:
(1) requirements, (2) architecture, (3) design, (4) code, (5) user documents,
and (6) bad fixes. Requirements and design bugs often outnumber code bugs.
Today defect potentials can top 6.0 per function point for large systems in
the 10,000 function point size range. Achieving this goal requires effective
defect prevention such as joint application design (JAD), quality function
deployment (QFD), requirements modeling, certified reusable components,
and others. It also requires a complete software quality measurement pro-
gram. Achieving this goal also requires better training in common sources
of defects found in requirements, design, and source code. The most effec-
tive way of lowering defect potentials is to switch away from custom designs and manual coding, which are intrinsically error prone. Construction from certified reusable components can cause a very significant reduction in software defect potentials.
3. Lower cost of quality (COQ ) from >45.0% of development to <15.0%
of development. Finding and fixing bugs has been the most expensive task
in software for more than 50 years. A synergistic combination of defect
prevention and pretest inspections and static analysis are needed to achieve
this goal. The probable sequence would be to raise defect removal efficiency
from today’s average of less than 90% up to 99%. At the same time defect
potentials can be brought down from today’s averages of more than 4.0
per function point to less than 2.0 per function point. This combination
will have a strong synergistic impact on maintenance and support costs.
Incidentally lowering cost of quality will also lower technical debt. But as
of 2016 technical debt is not a standard metric and varies so widely that it
is hard to quantify.
4. Reduce average cyclomatic complexity from >25.0 to <10.0. Achieving
this goal requires careful analysis of software structures, and of course it also
requires measuring cyclomatic complexity for all modules. As cyclomatic complexity tools are common and some are open source, every application should use them without exception.
5. Raise test coverage from <75.0% to >98.5% for risks, paths, and require-
ments. Achieving this goal requires using mathematical design methods for
test case creation such as using design of experiments. It also requires mea-
surement of test coverage. It also requires predictive tools that can predict
number of test cases based on function points, code volumes, and cyclomatic
complexity. The author’s Software Risk Master (SRM) tool predicts test cases
for 18 kinds of testing and therefore can also predict probable test coverage.
6. Eliminate error-prone modules in large systems. Bugs are not randomly
distributed. Achieving this goal requires careful measurements of code defects
during development and after release with tools that can trace bugs to specific
modules. Some companies such as IBM have been doing this for many years.
Error-prone modules (EPM) are usually less than 5% of total modules, but
receive more than 50% of total bugs. Prevention is the best solution. Existing
error-prone modules in legacy applications may require surgical removal and
replacement. However, static analysis should be used on all identified EPM.
In one study a major application had 425 modules. Of these, 57% of all bugs
were found in only 31 modules built by one department. Over 300 modules
were zero-defect modules. EPM are easy to prevent but difficult to repair
once they are created. Usually surgical removal is needed. EPM are the most
expensive artifacts in the history of software. EPM is somewhat like the med-
ical condition of smallpox, that is, it can be completely eliminated with vac-
cination and effective control techniques. Error-prone modules often top 3.0 defects per function point, with less than 80% of their defects removed prior to release. They also tend to top 50 in terms of cyclomatic complexity. Higher defect removal via testing is difficult due to the high cyclomatic complexity levels.
7. Eliminate security flaws in all software applications. As cyber crime
becomes more common, the need for better security is more urgent. Achieving
this goal requires use of security inspections, security testing, and automated
tools that seek out security flaws. For major systems containing valuable
financial or confidential data, ethical hackers may also be needed.
8. Reduce the odds of cyber attacks from >10.0% to <0.1%. Achieving this
goal requires a synergistic combination of better firewalls, continuous anti-
virus checking with constant updates to viral signatures, and also increasing
the immunity of software itself by means of changes to basic architecture
and permission strategies. It may also be necessary to rethink hardware and
software architectures to raise the immunity levels of both.
9. Reduce bad-fix injections from >7.0% to <1.0%. Not many people know
that about 7% of attempts to fix software bugs contain new bugs in the fixes
themselves commonly called bad-fixes. When cyclomatic complexity tops 50, the
bad-fix injection rate can soar to 25% or more. Reducing bad-fix injection
requires measuring and controlling cyclomatic complexity, using static analy-
sis for all bug fixes, testing all bug fixes, and inspections of all significant fixes
prior to integration.
10. Reduce requirements creep from >1.5% per calendar month to <0.25%
per calendar month. Requirements creep has been an endemic problem of
the software industry for more than 50 years. Although prototypes, Agile-
embedded users, and joint application design (JAD) are useful, it is tech-
nically possible to use automated requirements models also to improve
requirements completeness. The best method would be to use pattern match-
ing to identify the features of applications similar to the one being developed.
A precursor technology would be a useful taxonomy of software application
features, which does not actually exist in 2016 but could be created with sev-
eral months of concentrated study.
11. Lower the risk of project failure or cancellation on large 10,000 func-
tion point projects from >35.0% to <5.0%. Cancellation of large systems
due to poor quality, poor change control, and cost overruns that turn return
on investment (ROI) from positive to negative is an endemic problem of the
software industry and is totally unnecessary. A synergistic combination of
effective defect prevention and pretest inspections and static analysis can
come close to eliminating this far too common problem. Parametric estima-
tion tools that can predict risks, costs, and schedules with greater accuracy than ineffective manual estimates are also recommended.
12. Reduce the odds of schedule delays from >50.0% to <5.0%. As the main
reasons for schedule delays are poor quality and excessive requirements creep,
solving some of the earlier problems in this list will also solve the problem of
schedule delays. Most projects seem on time until testing starts, when huge
quantities of bugs begin to stretch out the test schedule to infinity. Defect
prevention combined with pretest static analysis can reduce or eliminate
schedule delays. This is a treatable condition and it can be eliminated within
five years.
13. Reduce the odds of cost overruns from >40.0% to <3.0%. Software cost
overruns and software schedule delays have similar root causes, that is, poor
quality control and poor change control combined with excessive require-
ments creep. Better defect prevention combined with pretest defect removal
can help to cure both of these endemic software problems. Using accurate
parametric estimation tools rather than optimistic manual estimates is also useful in lowering cost overruns.
14. Reduce the odds of litigation on outsource contracts from >5.0% to
<1.0%. The author of this book has been an expert witness in 12 breach of
contract cases. All of these cases seem to have similar root causes that include
poor quality control, poor change control, and very poor status tracking.
A synergistic combination of early sizing and risk analysis prior to contract
signing plus effective defect prevention and pretest defect removal can lower
the odds of software breach of contract litigation.
15. Lower maintenance and warranty repair costs by >75.0% compared
to 2016 values. Starting in about 2000, the number of U.S. maintenance
programmers began to exceed the number of development programmers.


IBM discovered that effective defect prevention and pretest defect removal
reduced delivered defects to such low levels that maintenance costs were
reduced by at least 45% and sometimes as much as 75%. Effective software
development and effective quality control have a larger impact on maintenance
costs than on development. It is technically possible to lower software main-
tenance for new applications by over 60% compared to current averages. By
analyzing legacy applications and removing error-prone modules plus some
refactoring, it is also possible to lower maintenance costs for legacy software
by about 25%. Technical debt would be reduced as well, but technical debt is
not a standard metric and varies widely so it is hard to quantify. Static analy-
sis tools should routinely be run against all active legacy applications.
16. Improve the volume of certified reusable materials from <15.0% to
>85.0%. Custom designs and manual coding are intrinsically error-prone
and inefficient no matter what methodology is used. The best way of con-
verting software engineering from a craft to a modern profession would be
to construct applications from libraries of certified reusable material, that
is, reusable requirements, design, code, and test materials. Certification to
near zero-defect levels is a precursor, so effective quality control is on the
critical path to increasing the volumes of certified reusable materials. All
candidate reusable materials should be reviewed, and code segments should
also be inspected and have static analysis runs. Also, reusable code should be
accompanied by reusable test materials, and supporting information such as
cyclomatic complexity and user information.
17. Improve average development productivity from <8.0 function points
per month to >16.0 function points per month. Productivity rates vary
based on application size, complexity, team experience, methodologies, and
several other factors. However, when all projects are viewed in aggregate,
average productivity is below 8.0 function points per staff month. Doubling
this rate needs a combination of better quality control and much higher vol-
umes of certified reusable materials, probably 50% or more.
18. Improve work hours per function point from >16.5 to <8.25. Goal 17
and this goal are essentially the same but use different metrics. However,
there is one important difference. Work hours per month will not be the
same in every country. For example, a project in the Netherlands with 116
work hours per month will have the same number of work hours as a project
in China with 186 work hours per month. But the Chinese project will need
fewer calendar months than the Dutch project due to the more intense work
pattern.
19. Improve maximum productivity to >100 function points per staff month
for 1,000 function points. Today in early 2016 productivity rates for 1,000
function points range from about 5 to 12 function points per staff month. It
is intrinsically impossible to top 100 function points per staff month using
custom designs and manual coding. Only construction from libraries of standard reusable components can make such high productivity rates possible.
However, it would be possible to increase the volume of reusable materials.
The precursor needs are a good taxonomy of software features, catalogs of
reusable materials, and a certification process for adding new reusable materi-
als. It is also necessary to have a recall method in case the reusable materials
contain bugs or need changes.
20. Shorten average software development schedules by >35.0% compared
to 2016 averages. The most common complaint of software clients and cor-
porate executives at the CIO and CFO level is that big software projects
take too long. Surprisingly it is not hard to make them shorter. A synergistic
combination of better defect prevention, pretest static analysis and inspec-
tions, and larger volumes of certified reusable materials can make significant
reductions in schedule intervals. In today’s world raising software application
size in function points to the 0.4 power provides a useful approximation of
schedule duration in calendar months. But current technologies are sufficient
to lower the exponent to the 0.37 power. Raising 1,000 function points to the
0.4 power indicates a schedule of 15.8 calendar months. Raising 1,000 func-
tion points to the 0.37 power shows a schedule of only 12.9 calendar months.
This shorter schedule is made possible by using effective defect prevention
augmented by pretest inspections and static analysis. Reusable software com-
ponents could lower the exponent down to the 0.3 power, or 7.9 calendar months (a short sketch of this arithmetic follows this list). Schedule delays are rampant today, but they are treatable conditions
that can be eliminated.
21. Raise maintenance assignment scopes from <1,500 function points to
>5,000 function points. The metric maintenance assignment scope refers
to the number of function points that one maintenance programmer can
keep up and running during a calendar year. This metric was developed
by IBM in the 1970s. The current range is from <300 function points for
buggy and complex software to >5,000 function points for modern software
released with effective quality control. The current U.S. average is about
1,500 function points. This is a key metric for predicting maintenance staff-
ing for both individual projects and also for corporate portfolios. Achieving
this goal requires effective defect prevention, effective pretest defect removal,
and effective testing using modern mathematically-based test case design
methods. It also requires low levels of cyclomatic complexity. Static analy-
sis should be run on all applications during development and on all legacy
applications as well.
22. Replace today’s static and rigid requirements, architecture, and design
methods with a suite of animated design tools combined with pattern
matching. When they are operating, software applications are the fastest
objects yet created by the human species. When being developed, software
applications grow and change on a daily basis. Yet every single design method
is static and consists either of text such as story points or very primitive and
limited diagrams such as flowcharts or UML diagrams. The technology needed to create a new kind of animated graphical design method, in full color and three dimensions, exists today in 2016. It is only necessary to develop the sym-
bol set and begin to animate the design process.
23. Develop an interactive learning tool for software engineering based on
massively interactive game technology. New concepts are occurring almost
every day in software engineering. New programming languages are coming
out on a weekly basis. Software lags medicine and law and other forms of
engineering in having continuing education. But live instruction is costly and
inconvenient. The need is for an interactive learning tool with a built-in cur-
riculum planning feature. It is technically possible to build such a tool today.
By licensing a game engine it would be possible to build a simulated software
university where avatars could take both classes and also interact with one
another.
24. Develop a suite of dynamic, animated project planning, and estimating
tools that will show growth of software applications. Today the outputs of
all software estimating tools are static tables augmented by a few graphs. But
software applications grow during development at more than 1% per calendar
month, and they continue to grow after release at more than 8% per calendar
year. It is obvious that software planning and estimating tools need dynamic
modeling capabilities that can show the growth of features over time. They
should also present the arrival (and discovery) of bugs or defects entering
from requirements, design, architecture, code, and other defect sources. The
ultimate goal, which is technically possible today, would be a graphical model
that shows application growth from the first day of requirements through
25 years of usage.
25. Introduce licensing and board certification for software engineers and
specialists. It is strongly recommended that every reader of this book also
read Paul Starr’s book The Social Transformation of American Medicine (1982).
This book won a Pulitzer Prize in 1984. Starr’s book shows how the American
Medical Association (AMA) was able to improve academic training, reduce
malpractice, and achieve a higher level of professionalism than other techni-
cal field. Medical licenses and board certification of specialists were a key
factor in medical progress. It took over 75 years for medicine to reach the
current professional status, but with Starr’s book as a guide software could do
the same within 10 years. This is outside the 5-year window of this article, but
the process should start in 2015.
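As promised in goal 20, the schedule approximation can be sketched in a few lines of Python. The exponents and the resulting month figures for a 1,000 function point application are the ones quoted in that goal; the sketch illustrates the power-law approximation, not the SRM estimating algorithm.

def schedule_months(size_fp: float, exponent: float) -> float:
    # Approximate schedule in calendar months: application size raised to an exponent.
    return round(size_fp ** exponent, 1)

for exponent, label in ((0.40, "current average practice"),
                        (0.37, "effective defect prevention and pretest removal"),
                        (0.30, "high volumes of certified reusable components")):
    print(f"{label}: {schedule_months(1_000, exponent)} calendar months")
# current average practice: 15.8 calendar months
# effective defect prevention and pretest removal: 12.9 calendar months
# high volumes of certified reusable components: 7.9 calendar months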

Note that the function point metrics used in this book refer to function points as
defined by the International Function Point Users Group (IFPUG). Other function
points such as COSMIC, FISMA, NESMA, unadjusted, and so on can also be
used but would have different quantitative results.
The technology stack available in 2016 is already good enough to achieve each
of these 25 targets, although few companies have done so. Some of the technolo-
gies associated with achieving these 25 targets include, but are not limited to, the
following.

Technologies Useful in Achieving Software Engineering Goals
◾ Use early risk analysis, sizing, and both quality and schedule/cost esti-
mation before starting major projects, such as Namcook’s Software Risk
Master (SRM).
◾ Use parametric estimation tools rather than optimistic manual estimates. All
of the parametric tools (COCOMO, CostXpert, ExcelerPlan, KnowledgePlan,
SEER, SLIM, SRM, and True Price) will produce better results for large applications than manual estimates, which become progressively more optimistic as application size grows larger.
◾ Use effective defect prevention such as Joint Application Design (JAD) and
quality function deployment (QFD).
◾ Use pretest inspections of major deliverables such as requirements, architec-
ture, design, code, and so on.
◾ Use both text static analysis and source code static analysis for all software.
This includes new applications and 100% of active legacy applications.
◾ Use the SANS Institute list of common programming bugs and avoid them all.
◾ Use the FOG and FLESCH readability tools on requirements, design, and so on.
◾ Use mathematical test case design such as design of experiments.
◾ Use certified test and quality assurance personnel.
◾ Use function point metrics for benchmarks and normalization of data.
◾ Use effective methodologies such as Agile and XP for small projects; RUP
and TSP for large systems. Hybrid methods are also effective such as Agile
combined with TSP.
◾ Use automated test coverage tools.
◾ Use automated cyclomatic complexity tools.
◾ Use parametric estimation tools that can predict quality, schedules, and costs.
Manual estimates tend to be excessively optimistic.
◾ Use accurate measurement tools and methods with at least 3.0% precision.
◾ Consider applying automated requirements models, which seem to be effec-
tive in minimizing requirements issues.
◾ Consider applying the new SEMAT method (Software Engineering Methods
and Theory) that holds promise for improved design and code quality.
SEMAT comes with a learning curve, so reading the published book is neces-
sary prior to use.
It is past time to change software engineering from a craft to a true engineering


profession. It is also past time to switch from partial and inaccurate analysis of
software results, to results with high accuracy for both predictions before projects
start, and measurements after projects are completed.
The 25 goals shown above are positive targets that companies and government
groups should strive to achieve. But software engineering also has a number of harm-
ful practices that should be avoided and eliminated. Some of these are bad enough
to be viewed as professional malpractice. Following are six hazardous software
methods, some of which have been in continuous use for more than 50 years with-
out their harm being fully understood.

Six Hazardous Software Engineering Methods to be Avoided
1. Stop trying to measure quality economics with cost per defect. This met-
ric always achieves the lowest value for the buggiest software, so it penal-
izes actual quality. The metric also understates the true economic value of
software by several hundred percent. This metric violates standard economic
assumptions and can be viewed as professional malpractice for measuring
quality economics. The best economic measure for cost of quality is defect
removal costs per function point. Cost per defect ignores the fixed costs for
writing and running test cases. It is a well-known law of manufacturing eco-
nomics that if a process has a high proportion of fixed costs the cost per unit
will go up. The urban legend that it costs 100 times as much to fix a bug after release as before release is not valid; the costs are almost flat if measured properly.
2. Stop trying to measure software productivity with “lines of code” metrics.
This metric penalizes high level languages. This metric also makes non-coding
work such as requirements and design invisible. This metric can be viewed as
professional malpractice for economic analysis involving multiple program-
ming languages. The best metrics for software productivity are work hours per
function point and function points per staff month. Both of these can be used
at activity levels and also for entire projects. These metrics can also be used for
noncode work such as requirements and design. LOC metrics have limited use
for coding itself but are hazardous for larger economic studies of full projects.
LOC metrics ignore the costs of requirements, design, and documentation
that are often larger than the costs of the code itself.
3. Stop measuring “design, code, and unit test” or DCUT. Measure full
projects including management, requirements, design, coding, integration,
documentations, all forms of testing, and so on. DCUT measures encompass
less than 30% of the total costs of software development projects. It is profes-
sionally embarrassing to measure only part of software development projects.

4. Be cautious of “technical debt.” It is a useful metaphor but not a complete metric for understanding quality economics. Technical debt omits the high costs of canceled projects, and it excludes consequential damages to clients as well as litigation costs and possible damage awards to plaintiffs. As a result, technical debt captures only about 17% of the true costs of poor quality. Cost of quality (COQ) is a better metric for quality economics.
5. Avoid “pair programming.” Pair programming is expensive and less effective for quality than a combination of inspections and static analysis. Do read the literature on pair programming, especially the reports by programmers who quit jobs specifically to avoid it. The literature in favor of pair programming also illustrates a general weakness of software engineering research: it compares pairs only to individual programmers, without comparing pair programming to methods with proven quality results such as inspections and static analysis, and without any discussion of tools, methods, inspections, and so on.
6. Stop depending on testing alone, without effective defect prevention and effective pretest defect removal such as inspections and static analysis. Testing by itself is expensive and seldom tops 85% in defect removal efficiency (DRE), the percentage of defects removed before delivery measured against all defects found before and after release. A synergistic combination of defect prevention, pretest removal such as static analysis, and inspections can raise DRE above 99% while lowering costs and shortening schedules at the same time (the DRE arithmetic is sketched below).
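To make the cost per defect distortion in point 1 tangible, here is a minimal numerical sketch. All figures (application size, defect counts, fixed test costs, cost per fix) are invented for illustration and are not data from this book.

```python
# Illustrative only: invented figures showing why cost per defect penalizes quality.
# The fixed cost of writing and running test cases does not shrink when fewer bugs exist.

def quality_economics(function_points, fixed_test_cost, defects_found, cost_per_fix):
    total_cost = fixed_test_cost + defects_found * cost_per_fix
    return {
        "cost per defect": total_cost / defects_found,
        "defect removal cost per function point": total_cost / function_points,
    }

# Two releases of the same 1,000 function point application:
buggy = quality_economics(1000, fixed_test_cost=50_000, defects_found=500, cost_per_fix=200)
clean = quality_economics(1000, fixed_test_cost=50_000, defects_found=50, cost_per_fix=200)

print(buggy)  # {'cost per defect': 300.0, 'defect removal cost per function point': 150.0}
print(clean)  # {'cost per defect': 1200.0, 'defect removal cost per function point': 60.0}
```

The cleaner release costs far less in total, yet cost per defect makes it look four times worse; defect removal cost per function point reports the true economic picture.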
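For point 6, the defect removal efficiency arithmetic can be sketched the same way. DRE is the percentage of defects removed before delivery, measured against all defects found internally plus those reported by users (commonly counted for the first 90 days of use); the defect counts below are invented for illustration.

```python
# Illustrative DRE calculation with invented defect counts.

def defect_removal_efficiency(removed_before_release, found_after_release):
    total_defects = removed_before_release + found_after_release
    return 100.0 * removed_before_release / total_defects

# Testing alone on an application with 1,000 total defects:
print(defect_removal_efficiency(850, 150))  # 85.0 -> roughly the ceiling for test-only projects

# Defect prevention + static analysis + inspections + testing on the same application:
print(defect_removal_efficiency(995, 5))    # 99.5 -> the >99% level cited above
```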

The software engineering field has been very different from older and more mature forms of engineering. One of the main differences is that software engineering has very poor measurement practices and far too much subjective information instead of solid empirical data. This short chapter suggests a set of 25 quantified targets that, if achieved, would produce significant advances in both software quality and software productivity. The essential message is that poor software quality is a critical factor that must improve in order to improve software productivity, schedules, costs, and economics.
Suggested Readings on Software Measures and Metric Issues

Abran, Alain. Software Maintenance Management: Evolution and Continuous Improvement. Los Alamitos, CA: Wiley-IEEE Computer Society, 2008.
Abran, Alain. Software Metrics and Metrology. Los Alamitos, CA: Wiley-IEEE Computer Society, 2010.
Abran, Alain. Software Estimating Models. Los Alamitos, CA: Wiley-IEEE Computer Society, 2015.
Black, Rex. Managing the Testing Process: Practical Tools and Techniques for Managing Hardware
and Software Testing. New York: Wiley, 2009, 672 p.
Boehm, Barry Dr. Software Engineering Economics. Englewood Cliffs, NJ: Prentice Hall,
1981, 900 p.
Brooks, Fred. The Mythical Man-Month. Reading, MA: Addison-Wesley, 1975; Anniversary edition, 1995.
Bundschuh, Manfred and Dekkers, Carol. The IT Metrics Compendium. Heidelberg,
Germany: Springer, 2005.
Charette, Bob. Software Engineering Risk Analysis and Management. New York: McGraw-Hill, 1989.
Charette, Bob. Application Strategies for Risk Management. New York: McGraw-Hill, 1990.
Chess, Brian and West, Jacob. Secure Programming with Static Analysis. Boston, MA: Addison
Wesley, 2007, 624 p.
Cohen, Lou. Quality Function Deployment – How to Make QFD Work for You. Upper Saddle
River, NJ: Prentice Hall, 1995, 368 p.
Constantine, Larry L. Beyond Chaos: The Expert Edge in Managing Software Development.
Boston, MA: Addison Wesley, 2001.
Crosby, Philip B. Quality is Free. New York: New American Library, Mentor Books,
1979, 270 p.
DeMarco, Tom. Controlling Software Projects. New York: Yourdon Press, 1982, 284 p.
DeMarco, Tom. Peopleware: Productive Projects and Teams. New York: Dorset House, 1999, 245 p.
Ebert, Christof, Dumke, Reinder, and Bundschuh, Manfred. Best Practices in Software
Measurement. Heidelberg, Germany: Springer, 2004.
Ewusi-Mensah, Kweku. Software Development Failures. Cambridge, MA: MIT Press, 2003.


Gack, Gary. Managing the Black Hole: The Executives Guide to Software Project Risk. Thomson,
GA: Business Expert Publishing, 2010.
Gack, Gary. Applying Six Sigma to Software Implementation Projects. http://software.isixsigma.
com/library/content/c040915b.asp (last accessed on October 13, 2016).
Galorath, Dan. Software Sizing, Estimating, and Risk Management: When Performance is
Measured Performance Improves. Philadelphia, PA: Auerbach Publishing, 2006, 576 p.
Garmus, David and Herron, David. Measuring the Software Process: A Practical Guide to
Functional Measurement. Englewood Cliffs, NJ: Prentice Hall, 1995.
Garmus, David and Herron, David. Function Point Analysis – Measurement Practices for
Successful Software Projects. Boston, MA: Addison Wesley Longman, 2001, 363 p.
Garmus, David, Russac Janet, and Edwards, Royce. Certified Function Point Counters
Examination Guide. Boca Raton, FL: CRC Press, 2010.
Gilb, Tom and Graham, Dorothy. Software Inspections. Reading, MA: Addison Wesley, 1993.
Glass, Robert L. Software Runaways: Lessons Learned from Massive Software Project Failures.
Englewood Cliffs, NJ: Prentice Hall, 1998.
Harris, Michael D.S., Herron, David, and Iwanicki, Stasia. The Business Value of IT. Boca
Raton, FL: CRC Press, Auerbach, 2008.
IFPUG (52 authors). The IFPUG Guide to IT and Software Measurement. Boca Raton, FL: CRC Press, Auerbach Publishers, 2012.
International Function Point Users Group (IFPUG). IT Measurement – Practical Advice from
the Experts. Boston, MA: Addison Wesley Longman, 2002, 759 p.
Jacobsen, Ivar, Ng Pan-Wei, McMahon, Paul, Spence, Ian, and Lidman, Svente. The Essence of
Software Engineering: Applying the SEMAT Kernel. Boston, MA: Addison Wesley, 2013.
Jones, Capers. Patterns of Software System Failure and Success. Boston, MA: International
Thomson Computer Press, 1995, 250 p.
Jones, Capers. Software Quality – Analysis and Guidelines for Success. Boston, MA: International Thomson Computer Press, 1997, 492 p.
Jones, Capers. Sizing up software. Scientific American Magazine, 1998, 279(6):104–111.
Johnson, James et al. The Chaos Report. West Yarmouth, MA: The Standish Group, 2000.
Jones, Capers. Software Assessments, Benchmarks, and Best Practices. Boston, MA: Addison
Wesley Longman, 2000, 657 p.
Jones, Capers. Estimating Software Costs. New York: McGraw-Hill, 2007.
Jones, Capers. Conflict and Litigation Between Software Clients and Developers. Narragansett,
RI: Software Productivity Research, Inc., 2008, 45 p.
Jones, Capers. Preventing Software Failure: Problems Noted in Breach of Contract Litigation.
Narragansett, RI: Capers Jones & Associates, 2008, 25 p.
Jones, Capers. Applied Software Measurement. New York: McGraw-Hill, 3rd edition, 2008, 668 p.
Jones, Capers. Software Engineering Best Practices. New York: McGraw Hill, 2010.
Jones, Capers and Bonsignour, Olivier. The Economics of Software Quality. Reading, MA:
Addison Wesley, 2011.
Jones, Capers. A Short History of the Cost per Defect Metric. Narragansett, RI: Namcook
Analytics LLC, 2014.
Jones, Capers. A Short History of Lines of Code Metrics. Narragansett, RI: Namcook Analytics
LLC, 2014.
Jones, Capers. The Technical and Social History of Software Engineering. Boston, MA: Addison
Wesley Longman, 2014.
Kan, Stephen H. Metrics and Models in Software Quality Engineering. Boston, MA: Addison
Wesley Longman, 2nd edition, 2003, 528 p.

Pressman, Roger. Software Engineering – A Practitioner’s Approach. New York: McGraw-Hill, 6th edition, 2005.
Putnam, Lawrence H. Measures for Excellence – Reliable Software On Time, Within Budget.
Englewood Cliffs, NJ: Yourdon Press–Prentice Hall, 1992, 336 p.
Putnam, Lawrence H. and Myers, Ware. Industrial Strength Software - Effective Management
Using Measurement. Los Alamitos, CA: IEEE Press, 1997, 320 p.
Radice, Ronald A. High Quality Low Cost Software Inspections. Andover, MA: Paradoxicon Publishing, 2002, 479 p.
Royce, Walker E. Software Project Management: A Unified Framework. Reading, MA: Addison
Wesley Longman, 1998.
Starr, Paul. The Social Transformation of American Medicine. (Pulitzer Prize in 1982),
New York: Basic Books, 1982.
Strassmann, Paul. The Business Value of Computers: An Executive’s Guide. Boston, MA:
International Thomson Computer Press, 1994.
Strassmann, Paul. The Squandered Computer. New Canaan, CT: The Information Economics
Press, 1997, 426 p.
Wiegers, Karl E. Peer Reviews in Software – A Practical Guide. Boston, MA: Addison Wesley
Longman, 2002, 232 p.
Yourdon, Edward. Death March - The Complete Software Developer’s Guide to Surviving
“Mission Impossible” Projects. Upper Saddle River, NJ: Prentice Hall PTR, 1997, 218 p.
Yourdon, Edward. Outsource: Competing in the Global Productivity Race. Upper Saddle River,
NJ: Prentice Hall PTR, 2005, 251 p.

Websites
Information Technology Metrics and Productivity Institute (ITMPI): www.ITMPI.org
International Software Benchmarking Standards Group (ISBSG): www.ISBSG.org
International Function Point Users Group (IFPUG): www.IFPUG.org
Project Management Institute (PMI): www.PMI.org
Capers Jones (Namcook Analytics LLC): www.Namcook.com

Software Benchmark Organizations circa 2015

Software Benchmark Providers (listed in alphabetical order)
1. 4SUM Partners: www.4sumpartners.com
2. Bureau of Labor Statistics, Dept. of Labor: www.bls.gov
3. Capers Jones (Namcook Analytics LLC): www.namcook.com
4. CAST Software: www.castsoftware.com
5. Congressional Cyber-Security Caucus: cybercaucus.langevin.house.gov
6. Construx: www.construx.com
7. COSMIC function points: www.cosmicon.com
8. Cyber-Security and Information Systems: https://s2cpat.thecsiac.com/s2cpat/
9. David Consulting Group: www.davidconsultinggroup.com
10. Forrester Research: www.forrester.com
11. Galorath Incorporated: www.galorath.com
12. Gartner Group: www.gartner.com
13. German Computer Society: http://metrics.cs.uni-magdeburg.de/
14. Hoovers Guides to Business: www.hoovers.com
15. IDC: www.IDC.com
16. ISBSG Limited: www.isbsg.org
17. ITMPI: www.itmpi.org
18. Jerry Luftman (Stevens Institute): http://howe.stevens.edu/index.php?id=14
19. Level 4 Ventures: www.level4ventures.com
20. Metri Group, Amsterdam: www.metrigroup.com
21. Namcook Analytics LLC: www.namcook.com
22. Price Systems: www.pricesystems.com
23. Process Fusion: www.process-fusion.net
24. QuantiMetrics: www.quantimetrics.net
25. Quantitative Software Management (QSM): www.qsm.com
26. Q/P Management Group: www.qpmg.com
27. RBCS, Inc.: www.rbcs-us.com
28. Reifer Consultants LLC: www.reifer.com
29. Howard Rubin: www.rubinworldwide.com
30. SANS Institute: www.sans.org
31. Software Benchmarking Organization (SBO): www.sw-benchmark.org
32. Software Engineering Institute (SEI): www.sei.cmu.edu
33. Software Improvement Group (SIG): www.sig.eu
34. Software Productivity Research: www.SPR.com
35. Standish Group: www.standishgroup.com
36. Strassmann, Paul: www.strassmann.com
37. System Verification Associates LLC: http://sysverif.com
38. Test Maturity Model Integrated: www.experimentus.com


Summary and Conclusions on Measures and Metrics

Software is one of the driving forces of modern industry and government operations. Software controls almost every complicated device now used by human beings. But software remains a difficult and intractable discipline that is hard to predict and hard to measure, and the quality of software is embarrassingly bad. Given the economic importance of software, it is urgent to make software development and maintenance true engineering disciplines, as opposed to art forms or skilled crafts.
In order to make software engineering a true engineering discipline and a true profession, much better measurement practices are needed than those used to date. Quantitative and qualitative data need to be collected in standard forms that are amenable to statistical analysis.

Index

Note: Page numbers followed by f and t refer to figures and tables, respectively.

80/20 rule, 296 Ascope. See Assignment scope (Ascope)


ASQ (American Society for Quality), 128
A Assessment, 245–246
Assignment scope (Ascope), 103, 246, 302
ABAP programming language, 93 AT&T, 35
Abeyant defect, 169, 240–241 Attrition measures, 246
Academic training, 289 Automated function point method, 240
Accuracy, 242 Automated Project Office (APO), 158,
Activity-based benchmark, 80t–82t 263, 304
Activity-based costs, 241–242, 241t, 257, 268 Automated/semi-automated dashboards, 263
ad hoc method, 239 Automatic function point counting,
Agile approach, 283 246–247
Agile concept, 39 Automatic spell checking, 309
Agile development approach, 242
Agile Manifesto, 52 B
Agile methodology, 59, 252
Agile methods, measurement variations in, Backfiring, 244, 247
51–57 Bad-fix(es), 121, 177, 335
application of 1,000 function points, 55t category, 167
fundamental problems of, 52 defect, 264
SRM method, 57 injections, 247
Agile metrics, 242–243 Bad-test cases, 122, 247–248
Agile project, 53, 240, 242 Balanced scorecard, 248
1000 function points, 12t–13t The Balanced Scorecard (book), 248
ALGOL 68 programming language, 115 Bancroft Prize, 133
American Medical Association (AMA), 47, 134 Baselines, 248
American Society for Quality (ASQ), 128 Bayesian analysis, 248
Analysis of variance (ANOVA), 243 Benchmark, 2, 11, 13, 35, 37, 248–249,
Annual reports, 243 303–304
ANOVA (Analysis of variance), 243 activity-based, 80t–82t
Antipatterns, 69 data, 222, 225
APO (Automated Project Office), 158, 263, 304 elements for effective, 230t
Apparent defect density, 15 groups, 49
Apple company, 251, 262 organizations, 127, 322
Applied Software Measurement (book), 2 Bloat point metric, 275
Architectural defect, 264 Bottom line data, 3
Argon gas, 37 Brooks, Fred, Dr., 37


Bug(s), 166–167, 250 Control of Communicable Diseases in Man


distribution by severity level, 169t (book), 133
repairs, bad fixes/new bugs in, 121–122 COQ. See Cost of quality (COQ)
report, 169 COSMIC, 221
Burden rate(s), 83, 250 Cost center
components in companies, 85t–86t internal software projects, 11
international projects, salary and, 84 model, 257
Burn down charts, 250 software, 302
Burn up charts, 250 Cost driver, 257–258, 258t
Business value, 250 Cost/effort-tracking methods, distribution of,
The Business Value of IT (book), 48 11, 14t
Butterfly effect, 253 Cost of quality (COQ), 138, 179, 182–220,
255, 258–259
C cost elements, 183t–184t
pretest and test defect removal,
C, 3 185t–204t
C#, 115 testing, 205t–218t
CA (criticality analysis), 274 Cost per defect metric, 45, 120, 136, 179, 223,
Capability maturity model 226, 233, 240, 259, 341
integration (CMMI), 48, distorts reality, 137
51, 139, 245, 254 vs cost per function point, 138t
development, measurement variations in, Cost per function point, 120, 137–138, 137t,
51–57 153, 236, 236t, 259
civilian application of 1,000 function cost per defect vs, 137t
points, 54t Cost per LOC and KLOC, 259–260, 260t
fundamental problems of, 52 Cost per square foot, 105
CASE (Computer Aided Software Cost per story point, 260
Engineering), 59 Cost-tracking systems, 2, 11
Celsius, 221 Coupling metric, 261
Certification process, 251 Creeping requirements, 161
FDA and FAA, 252 Criticality analysis (CA), 274
reusable materials, 252 Crunch projects/mode, 102–103
Certified reusable components, 72–73 Crystal Reports, 75
Certified test personnel, 135 Cumulative defect removal, 179
Chaos report, 252, 322 Currency exchange rates, 261
Chaos theory, 253 Customer satisfaction metrics, 261
Civilian software projects, 114 Customer support metrics, 261–262
Cloud-based software, 253 Cyber attacks, 237, 262
Cloud computing, 253 Cyber-security, 35, 253
Cloud measures and metrics, 253 experts, 93
CMMI. See Capability maturity model Cyclomatic complexity, 122–123,
integration (CMMI) 262–263
COBOL, 102, 247, 284
Code defect, 219, 264 D
Cognitive dissonance, 254–255
Cohesion metric, 255 Dashboard, 263
Complexity metric, 255 Data
Computer Aided Software Engineering elements for benchmarks, 230t
(CASE), 59 point metrics, 263
Consequential damages, 170, 255–256 DCUT. See Design, code, and unit test
Contenders, 77 (DCUT)
Contract litigation, breach of, 249 DDE (defect detection efficiency), 265

Defect, 263–264 Estimating Software Costs (book), 255


consequences, 264 EVA (earned value analysis), 270
delivered, 267–268 EVM (earned value measurements), 270–271
density, 264 Experience, 273
design, 264 Expert Estimation, 273
discovery factors, 264–265
document, 264 F
duplicate, 269–270
invalid defect, 282 FAA (Federal Aviation Agency), 252, 288
origins, 265–266 Fahrenheit, 221
potential(s), 72, 116, 167, 175, 178t, 334 Failing project (definition), 273–274
prevention and removal activities, 180t–181t Failure modes and effect analysis (FMEA), 274
resolution time, 266 Failure rate, 274–275
severity levels, 266–267 False positive, 275
Defect detection efficiency (DDE), 265 FDA (Food and Drug Administration), 252
Defect removal efficiency (DRE), 16, 48, Feature bloat, 275
135–136, 172, 226, 237, 265, 333–334 Federal Aviation Agency (FAA), 252, 288
defect prevention and, 117t–118t Financial debt, 324
U.S. software average DRE ranges, 176t Financial measures, 253
with and without pretest activities, 219t Fish-bone diagrams, 305
Defects per KLOC, 264 Five-point scale, 228
Defense sector, 51, 245 Fixed costs, 236, 259, 275
Deferred features, 267 contracts, 256
Delivered defect, 267–268 FMEA (failure modes and effect analysis), 274
Delphi method, 267 Food and Drug Administration (FDA), 252
Design, code, and unit test (DCUT), 6, 10t, 47, Force-fit method, 245
219, 268, 341 Formal inspections, phase level, 78
Design defect, 264 FORTRAN, 284
Dilution, 268 FP. See Function points (FP)
Documentation costs, 269, 270t Fully burdened rate, 83
Document defect, 264 Fully burdened salary rate, 83
Draconian Sarbanes–Oxley law, 279 Functional and nonfunctional requirements,
DRE. See Defect removal efficiency (DRE) variations in, 105–114
Duplicate defect reports, 269–270 Function point metrics circa 2016, 243
Function points (FP), 103, 136, 167, 235, 247,
E 275–276
automated, 240
Earned value analysis (EVA), 270 contracts, 256–257
Earned value measurements (EVM), 270–271 measurement, advantages, 161
The Economics of Software Quality (book), 48, metric, 145, 150, 167
133, 247 per month, 276
Effective quality control, 123 variations, 277
Endemic software problems, 50 work hours per, 330
Enhancement metrics, 271
Enterprise resource planning (ERP), 75, 272 G
companies, 312
packages, 158 Gantt chart, 159, 160f, 277–278
Entropy, 271 Gaps and omissions, 2, 4t
ERP. See Enterprise resource planning (ERP) Geneen, Harold, 115–116
Error-prone modules (EPM), 121, 168, 272, 335 Generalists versus specialists, 278
with high numbers of bugs, 122–123 Goal-question metric (GQM), 278
Essential complexity, 272 Good-enough quality fallacy, 278–279

Governance, 279 International Function Point Users Group


Government (IFPUG), 105, 147, 177, 221, 224,
agencies, 262 242, 277, 339
certification, 29, 106 International organization for standards
projects, 114 (ISO), 282
service and education, 87 The International Software Benchmarking
Graph theory, 262 Standards Group (ISBSG), 49, 227,
249, 292
H Invalid defect, 282
ISBSG. See The International Software
Halstead complexity, 279 Benchmarking Standards Group
Hazardous metrics, 223 (ISBSG)
High-level programming language, 115 ISO/IEC 9126 quality standard, 166
High-speed pattern-matching method, ISO/IEC standards, 282
224 ISO quality standards, 166
High-speed sizing method, 246 Iterations, 53
Historical data, 1, 11
omissions from, 1–2, 3t J
unverified, 13
Historical data leakage, 279–280 Java, 39, 234
Hi-tech companies, 317 Jira tool, 158
House of quality, 305 Joint Application Design (JAD), 161
Humphrey, Watt, 52 JOVIAL, 115
Juran’s QC Handbook (book), 258
I Just in time, 282

IBM, 13–14, 35, 116, 240, 247, 326 K


IMS database, 168
severity scale, 168, 168t Kanban approach, 282–283
IBM circa 1968, 265 Kelvin, Lord, 283
IBM circa 1970, 115–116, 122, 135, 167, Kelvin’s Law of 1883, 283
333–334 Key Performance Indicators (KPI), 283
IEC (international electrotechnical KLOC, 283
commission), 282
IFPUG. See International Function Point Users L
Group (IFPUG)
Incident, 280 Labor-intensive commodities, 104
Independent testing, 29, 114 Language levels, 283–284
Independent verification and validation Larry Putnam of Quantitative Software
(IV and V), 29, 114 Management (QSM), 38
Index Medicus, 133 Lean development, 284
Industry comparisons, 281 Learning curves, 284–285
Industry Productivity Ranges Circa 2016, Light scattering, mathematical model of, 37
89t–91t Lines of code (LOC), 136, 161, 223
Inflation metric, 281 metric, 45, 233, 240, 285, 341
Inspection metrics, 281–282 Longevity, 93, 95
Institute of Electrical and Electronic Engineers
(IEEE), 47, 134 M
Intangible value, 251, 328
International comparison, 281 Maintenance
International electrotechnical commission assignment scope, 338
(IEC), 282 metric, 285–286

Manufacturing economics, 236 Occupation groups, 294


Mark II function points, 223, 277 One-trick-ponies, 123
McCabe, Tom, 272 Oracle, 158, 240, 244, 272, 312
Mean time between failures (MTBF), 166, 309 Organization for International Economic
Mean time to failure (MTTF), 166, 309 Cooperation and Development
Measurement speed and cost, 286 (OECD), 97
Meetings and communications, 286–287, 287t Overhead, 250
costs, 269 Overlap, 159
Methodology comparison metrics, 287
Metrics and Models in Software Quality P
Engineering (book), 48
Metrics conversion, 288–289, 289t Pair programming, 39, 53, 179, 288,
Microsoft, 35, 95, 251 295–296, 342
Microsoft Office, 158 Parametric estimation, 295
Microsoft Project, 158 tool, 1, 41, 241, 273, 315
Microsoft Windows, 75 Pareto analysis, 296
Microsoft Word, 275 Pattern matching, 296–297, 323
Military software projects, 29, 269 approach, 150, 246
MIS projects, 29, 78–80 SRM, 148, 149t
Monte Carlo method, 290–291 PBX switching system. See Private branch
Morale metrics, 291 exchange (PBX) switching system
MTBF (mean time between failures), 166, 309 Performance metrics, 297
MTTF (mean time to failure), 166, 309 Personal Software Process (PSP), 51, 80
MUMPS (programming language), 115 PERT (Program Evaluation and Review
The Mythical Man-Month (book), 37 Technique), 297
Phase metrics, 297
N PMI (Project Management Institute), 128
PMO (Project Management Office), 304
Namcook, 92, 163 PNR. See Putnam–Norden–Rayleigh (PNR)
pattern-matching approach, 147 curve
Namcook Analytics LLC, 49, 97, 127, 245, Portfolio metric, 298–301, 298t–301t
292, 333 Private branch exchange (PBX) switching
National Averages, 292 system, 3, 78, 221, 234, 234t
Natural metric, 290 Production rate (Prate), 103, 302
NDA (nondisclosure agreements), 292–293 nominal and virtual, 103
Nominal production rate, 103 Productivity metric, 301
Nominal start and end point, 158 Professional malpractice, 161, 233, 302
Nondisclosure agreements (NDA), 292–293 Professional societies and metrics companies, 290
Nonfunctional requirements, 293 Profit center, 11, 257, 302–303
Nonprofit associations, 128 Program Evaluation and Review Technique
Norden, Peter, 37 (PERT), 297
Normal and intense work patterns, Progress improvements, 303
102–103, 102t Project
Normalization, 293–294 end date, 303
North American Industry Classification office, 304
(NAIC) code, 171, 281, 291–292 start date, 304–305
Project-level metrics, 303–304
O Project Management Institute (PMI), 128
Project Management Office (PMO), 304
Objective-C, 115 Project, phase, and activity measurements,
Object Management Group (OMG), 246 77–82
Object-oriented (OO) languages, 155, 294 charts of accounts, 79t

Published data and common metrics, 49, 49t Root-cause analysis (RCA), 274, 314
Pulitzer Prize, 46, 133, 339 Rowboat, 17
Putnam–Norden–Rayleigh (PNR) curve, Rules of thumb, 308
37–38, 309 Running tested features (RTF), 53
RUP. See Rational Unified Process
Q (RUP)

Quality, 165, 305 S


data, 116, 127, 136
data leakage, 280 Sample sizes, 314
Quality function deployment (QFD), 144, 305 SAP, 158, 240, 244, 272, 312
Quality Is Free (book), 258, 318 SAP R/3 integrated system, 93
Quantitative data, 227 Sarbanes–Oxley rules, 32
Quantitative Software Management (QSM), Schedule
49, 227 compression, 314–315
Quantitative variations, 20t–27t overlap, 315
slip, 315
R slippage pattern, 160, 160t
Scope, 316
Rapid Application Development (RAD), 59 Scrum sessions, 53
Rational Unified Process (RUP), 57, Security flaws, 72, 335
242–244, 252 reuse and software, 73t
Rayleigh curve, 38, 38f, 253, 308–309 Security metrics, 316
Rayleigh, Lord, 37 SEI. See Software Engineering Institute (SEI)
RCA (root-cause analysis), 274, 314 Severity level, defect, 266–267
Reference model, 2, 4t Six-Sigma approach, 316
Reliability, 166 Size adjustment, 316
metrics, 309 Size-by-side agile and waterfall, 56t
Reliable taxonomy, 230 Sizing method, 147
Repair and rework costs, 309 SNAP. See Software nonfunctional assessment
Reports, interfaces, conversions, and process (SNAP)
enhancements (RICE) object, 272, 312 The Social Transformation of American Medicine
Request for proposal (RFP), 157 (book), 46, 133–134, 339
Requirement, 165, 309–310 Software
bug, 266 academic training, 302
creep, 310 activities, occupations, and combinations,
metrics, 310–311 32, 32t–33t
nonfunctional, 106, 293 applications size, 107t–113t
software, 293 benchmark, 106
toxic, 310 organizations, 49
Return on investment (ROI), 121, 251, 253, 311 bug removal, 119–120
Reusable components, 73–74, 312 circa 2026, 140
Reusable materials, 312 consequential damages scale, 170, 170t
Reuse and software construction methods, 314–315
quality levels at delivery, 72t cost
security flaws at delivery, 73t drivers and poor quality, 139–140,
Reuse potential, 73 141t–142t, 143t
software, 74t estimation, 159
Risk cost-estimating
avoidance probabilities, 312 methods, 248
metrics, 312–313, 313t tool, 1
severities, 312 cost-tracking systems, 97, 102

defect data, 116


data, missing, 116, 116t measurements and metrics, 116
potentials, 172, 176t measures, 138
prediction, 173t–175t and risk, 48
prevention and removal strategies, 170t SEI CMMI, 139, 139t
and quality measurements, missing data, worst metrics, 136
14, 14t–15t reliability, 166
report, 171t–172t reuse, variations in, 69–75
demographic, 243 impact on productivity, 70f
development schedules, ranges of, 308
complete costs, 7t–9t staffing, 38
methodologies, 59–62 patterns, 39t–40t
partial costs, 10t structures, 250
productivity, 41f testing, 132
productivity, ranges of, 305–306, 306t work patterns, 102–103, 102t
quality, ranges of, 306–308, 307t Software Engineering Institute (SEI), 48, 51,
DRE, 117–119, 117t–118t 139, 227, 245, 283
education and training company, 48 Software Metrics and Metrology (book), 48
employment statistics, 317 Software nonfunctional assessment process
engineering, 46, 236, 255 (SNAP), 105–106, 107t–113t, 123,
goals, 140, 333–340 150, 224, 240
hazardous methods of, 341–342 metric, 67, 161, 239–240, 317
needs for, 237 Software Productivity Research (SPR), 49,
technology used in, 340–341 245–247, 249
engineers, 121 Software quality assurance (SQA), 317–318
estimation tools, 127, 231 Software Quality Assurance Curricula circa
industry, 47 2016, 128t–130t
journals, 48 Software Risk Factors Circa 2016, 238t
literature, 83–84 Software Risk Master™ (SRM), 1, 41–42, 127,
metrics selection, 239–240, 240t 147, 166, 173, 241
occupation groups and specialists, 35, multiyear sizing, 162t–163t
36t–37t pattern-matching approach, 147, 149t
patterns, 71, 71t size predictions, 224, 225t
productivity measurements, 87 tool, 56, 150, 162, 224, 335
programming languages, variations in Software Size Measurement Practices circa
circa 2016, 63 2016, 222t–223t
influence on productivity, 63, 64t–67t Software Testing Courses circa 2016, 130t–132t
project Software usage and consumption metrics,
costs, unpaid overtime impact on, 318–320, 319t–320t
42, 42f Spell checking, automatic, 309
management tools, 319t–320t SPR. See Software Productivity Research (SPR)
manager, 87, 88t Sprint, 53, 320
schedules, 158 SQA (software quality assurance), 317–318
taxonomy, 322–323 The Squandered Computer (book), 48
quality, 115, 165–166 SRM. See Software Risk Master™ (SRM)
by application size, 140, 144–145, 144t Staffing level, 320–321, 321t
assurance, 132 Standard industry classification (SIC) code,
best metrics, 135–136 171, 291
company, 48, 123–134 Standish report (chaos report), 251, 274, 322
conferences, 48 Starr, Paul, 133–134
control, 47 Static analysis, 115–117, 133
costs, 119t–120t, 179 bug repairs, 121

Status-tracking tool, 329 User costs, 326–327


Stock equity program, 95 U.S. industry segments, 43
Story point metric, 240, 242, 287 U.S. Sarbanes–Oxley law, 252
Story points, 53, 322 U.S. Software Cost Drivers in Rank Order
Successful projects (definition), 322 for 2016, 141t–142t, 258t
Supplemental data, 227, 228t U.S. Software occupation groups, variations in
Synthetic metric, 290 compensation, 93, 94t
U.S. software personnel, 97, 326
T
Tangible value, 251, 328 V
Taxonomy patterns, 148t, 154t Validation metric, 290
TCO. See Total cost of ownership (TCO) Value stream mapping, 284
Team sizes, 40, 40f Variable costs, 328
Team Software Process (TSP), 51, 80, 115, Velocity, 328
242–244, 252 Venn diagram, 328–329
Technical debt, 324, 342 Verizon, 262
metric, 45, 139, 179, 182, 240, 309, 324 Virtual production rate, 103
Test Visual status and progress tracking, 329
case design
mathematical, 123, 135
poor, 135 W
coverage, 324–325
Warranty costs, 329–330
metrics, 324
War room, 329
The Social Transformation of American Medicine
Wastage, 47, 121
(book), 50
Waterfall
TickIT, 245
development method, 297
Time and materials contract, 256
model, 159
Total cost of ownership (TCO), 105, 244–245,
Watson, Thomas J., Jr., 115–116
271, 325, 325t
Work hours, 330
Toxic requirement, 310
per function point, 103, 276, 330
TSP. See Team Software Process (TSP)
World-wide military command and control
system (WWMCCS), 244
U
Unified modeling language (UML), 326 Y
Unpaid overtime, 41–42, 97, 103, 326
U.S. average for software defect potentials for Y2K, 266, 310
2016, 167, 167t
U.S. Average Ranges of Defect Potentials Circa Z
2016, 175t
Use-case metric, 240, 242, 287 Zero defects, 137–138, 330–331
Use-case points, 326 Zero-size software changes, 331
