Sunteți pe pagina 1din 204

23/11/2012

Functional testing - Wikipedia, the free encyclopedia

Functional testing
From Wikipedia, the free encyclopedia

Functional testing is a type of black box testing that bases its test cases on the specifications of the software
component under test. Functions are tested by feeding them input and examining the output, and internal
program structure is rarely considered (not like in white-box testing).[1]
Functional testing differs from system testing in that functional testing "verif[ies] a program by checking it
against ... design document(s) or specification(s)", while system testing "validate[s] a program by checking it
against the published user or system requirements" (Kaner, Falk, Nguyen 1999, p. 52).
Functional testing typically involves five steps[citation needed]:
1.
2.
3.
4.
5.

The identification of functions that the software is expected to perform


The creation of input data based on the function's specifications
The determination of output based on the function's specifications
The execution of the test case
The comparison of actual and expected outputs

See also
Non-functional testing
Acceptance testing
Regression testing
System testing
Software testing
Integration testing
Unit testing
Database testing

References
1. ^ Kaner, Falk, Nguyen. Testing Computer Software. Wiley Computer Publishing, 1999, p. 42. ISBN 0-47135846-0.

External links
JTAG for Functional Test without Boundary-scan
(http://www.corelis.com/blog/index.php/blog/2011/01/10/jtag-for-functional-test-without-boundaryscan)

Retrieved from "http://en.wikipedia.org/w/index.php?title=Functional_testing&oldid=510357783"


Categories: Software testing Computing stubs
This page was last modified on 2 September 2012 at 00:29.
Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may
en.wikipedia.org/wiki/Functional_testing

1/2

23/11/2012

Functional testing - Wikipedia, the free encyclopedia

apply. See Terms of Use for details.


Wikipedia is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.

en.wikipedia.org/wiki/Functional_testing

2/2

Quick Quote
Request a Demo
Login

Vision

Functionality Testing
What is Functionality Testing?
Functionality testing is employed to verify whether your product meets the intended specifications and functional requirements laid out in your development documentation.
What is the purpose of Functionality Testing?
As competition in the software and hardware development arena intensifies, it becomes critical to deliver products that are virtually bug-free. Functionality testing helps your
company deliver products with a minimum amount of issues to an increasingly sophisticated pool of end users. Potential purchasers of your products may find honest and often
brutal product reviews online from consumers and professionals, which might deter them from buying your software. nResult will help ensure that your product functions as
intended, keeping your service and support calls to a minimum. Let our trained professionals find functional issues and bugs before your end users do!
How can nResult help you deliver high quality products that are functionally superior to products offered by your competition?
We offer several types of functional testing techniques:
Ad Hoc Takes advantage of individual testing talents based upon product goals, level of user capabilities and possible areas and features that may create confusion.
The tester will generate test cases quickly, on the spur of the moment.
Exploratory The tester designs and executes tests while learning the product. Test design is organized by a set of concise patterns designed to assure that testers dont
miss anything of importance.
Combination The tester performs a sequence of events using different paths to complete tasks. This can uncover bugs related to order of events that are difficult to find
using other methods.
Scripted The tester uses a test script that lays out the specific functions to be tested. A test script can be provided by the customer/developer or constructed by
nResult, depending on the needs of your organization.
Let nResult ensure that your hardware or software will function as intended. Our team will check for any anomalies or bugs in your product, through any or all stages of
development, to help increase your confidence level in the product you are delivering to market. nResult offers detailed, reasonably priced solutions to meet your testing needs.

Services offered by nResult:


Accessibility Testing
With accessibility testing, nResult ensures that your software or hardware product is accessible and effective for those with disabilities.
Read more>>
Compatibility Testing
Make sure your software applications and hardware devices function correctly with all relevant operating systems and with computing environments.
Read more>>
Interoperability Testing
Make sure your software applications and hardware devices function correctly with all other products in the market.
Read more>>
Competitive Analysis
Stack up next to your competitors with a full competitive analysis report.
Read more>>
Performance Testing
Ensure that your software/web application or website is equipped to handle anticipated and increased network traffic with adequate performance testing.
Performance Testing includes Load Testing and Benchmarking.
Read more>>
Localization Testing
Make certain that your localized product blends flawlessly with the native language and culture.
Read more>>
Medical Device Testing
nResult provides solutions for complying with challenging and expensive testing requirements for your medical device.
Read more>>
Web Application Testing
Find and eliminate weaknesses in your websites usability, functionality, performance, and browser compatibilities.
Read more>>
Certification Testing
Add instant credibility to your product from one of the most trusted names in testing.
Read more>>
Security Testing
Test your product for common security vulnerabilities; gain piece of mind in an insecure world.
nresult.com/quality-assurance/functionality-testing/

2/4

Introduction to
Performance Testing
First Presented for:

PSQT/PSTT Conference
Washington, DC May, 2003

Scott Barber
Chief Technology Officer
PerfTestPlus, Inc.
www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 1

Agenda
Why Performance Test?
What is Performance related testing?
Intro to Performance Engineering Methodology
Where to go for more info
Summary / Q&A

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 2

Why Performance Test?


Speed - Does the application respond quickly enough for the
intended users?

Scalability Will the application handle the expected user


load and beyond? (AKA Capacity)

Stability Is the application stable under expected and


unexpected user loads? (AKA Robustness)

Confidence Are you sure that users will have a positive


experience on go-live day?

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 3

Speed
User Expectations
Experience
Psychology
Usage

System Constraints
Hardware
Network
Software

Costs
Speed can be expensive!

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 4

Scalability
How many users

before it gets slow?


before it stops working?
will it sustain?
do I expect today?
do I expect before the next upgrade?

How much data can it hold?

Database capacity
File Server capacity
Back-up Server capacity
Data growth rates

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 5

Stability
What happens if

there are more users than we expect?


all the users do the same thing?
a user gets disconnected?
there is a Denial of Service Attack?
the web server goes down?
we get too many orders for the same thing?

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 6

Confidence
If you know what the performance is

you can assess risk.


you can make informed decisions.
you can plan for the future.
you can sleep the night before go-live day.

The peace of mind that it will work on go-live day


alone justifies the cost of
performance testing.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 7

What is Performance Related Testing?


Performance Validation
Performance Testing
Performance Engineering

Compare & Contrast

What?
Detect

Diagnose
No

www.PerfTestPlus.com

tR

e so
l ve
d

Why?
Resolve

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 8

Performance Validation
Performance validation is the process by which software is
tested with the intent of determining if the software meets
pre-existing performance requirements. This process aims
to evaluate compliance.

Primarily used for


determining SLA compliance.
IV&V (Independent Validation and Verification).
validating subsequent builds/releases.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 9

Performance Testing
Performance testing is the process by which software is
tested to determine the current system performance. This
process aims to gather information about current
performance, but places no value judgments on the
findings.

Primarily used for


determining capacity of existing systems.
creating benchmarks for future systems.
evaluating degradation with various loads and/or configurations.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 10

Performance Engineering
Performance engineering is the process by which software is
tested and tuned with the intent of realizing the required
performance. This process aims to optimize the most
important application performance trait, user experience.

Primarily used for


new systems with pre-determined requirements.
extending the capacity of old systems.
fixing systems that are not meeting requirements/SLAs.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 11

Compare and Contrast


Validation and Testing:
Are a subset of Engineering.
Are essentially the same except:
Validation usually focuses on a single scenario and tests
against pre-determined standards.
Testing normally focuses on multiple scenarios with no predetermined standards.

Are generally not iterative.


May be conducted separate from software development.
Have clear end points.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 12

Compare and Contrast


Engineering:

Is iterative.
Has clear goals, but fuzzy end points.
Includes the effort of tuning the application.
Focuses on multiple scenarios with pre-determined
standards.
Heavily involves the development team.
Occurs concurrently with software development.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 13

Intro to PE Methodology
Evaluate System
Develop Test Assets
Baselines and Benchmarks
Analyze Results
Tune
Identify Exploratory Tests
Execute Scheduled Tests
Complete Engagement

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 14

Evaluate System
Determine performance requirements.
Identify expected and unexpected user activity.
Determine test and/or production architecture.
Identify non-user-initiated (batch) processes.
Identify potential user environments.
Define expected behavior during unexpected circumstances.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 15

Develop Test Assets


Create Strategy Document.
Develop Risk Mitigation Plan.
Develop Test Data.
Automated test scripts:
Plan
Create
Validate

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 16

Baseline and Benchmarks


Most important for iterative testing.
Baseline (single user) for initial basis of comparison and best
case.
Benchmark (15-25% of expected user load) determines actual
state at loads expected to meet requirements.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 17

Analyze Results
Most important.
Most difficult.
Focuses on:
Have the performance criteria been met?
What are the bottlenecks?
Who is responsible to fix those bottlenecks?
Decisions.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 18

Tune
Engineering only.
Highly collaborative with development team.
Highly iterative.
Usually, performance engineer supports and validates while
developers/admins tune.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 19

Identify Exploratory Tests


Engineering only.
Exploits known bottleneck.
Assists with analysis & tuning.
Significant collaboration with tuners.
Not robust tests quick and dirty, not often reusable/relevant
after tuning is complete.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 20

Execute Scheduled Tests


Only after Baseline and/or Benchmark tests.
These tests evaluate compliance with documented
requirements.
Often are conducted on multiple hardware/configuration
variations.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 21

Complete Engagement
Document:

Actual Results
Tuning Summary
Known bottlenecks not tuned
Other supporting information
Recommendation

Package Test Assets:


Scripts
Documents
Test data

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 22

Where to go for more information


http://www.PerfTestPlus.com (My site)
http://www.QAForums.com (Huge QA Forum)
http://www.loadtester.com (Good articles and links)
http://www.segue.com/html/s_solutions/papers/s_wp_info.htm

(Good

articles and statistics)


http://www.keynote.com/resources/resource_library.html
(Good articles and statistics)

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 23

Summary
We test performance to:
Evaluate Risk.
Determine system capabilities.
Determine compliance.

Performance Engineering Methodology:

Ensures goals are accomplished.


Defines tasks.
Identifies critical decision points.
Shortens testing lifecycle.

www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 24

Questions and Contact Information


Scott Barber
Chief Technology Officer
PerfTestPlus, Inc.

E-mail:
sbarber@perftestplus.com

www.PerfTestPlus.com

Web Site:
www.PerfTestPlus.com

Introduction to Performance Testing

2006 PerfTestPlus, Inc. All rights reserved.

Page 25

Software Performance Testing


Xiang Gan

Helsinki 26.09.2006
Seminar paper
University of Helsinki
Department of Computer Science

HELSINGIN YLIOPISTO HELSINGFORS UNIVERSITET UNIVERSITY OF HELSINKI


Tiedekunta/Osasto Fakultet/Sektion Faculty/Section

Faculty of Science

Laitos Institution Department

Department of Computer Science

Tekij Frfattare Author

Xiang Gan
Tyn nimi Arbetets titel Title

Software performance testing


Oppiaine Lromne Subject

Tyn laji Arbetets art Level

Aika Datum Month and year

Sivumr Sidoantal Number of pages

26.9.2006

Tiivistelm Referat Abstract

Performance is one of the most important aspects concerned with the quality of software. It
indicates how well a software system or component meets its requirements for timeliness. Till
now, however, no significant progress has been made on software performance testing. This
paper introduces two software performance testing approaches which are named workload
characterization and early performance testing with distributed application, respectively.
ACM Computing Classification System (CCS):
A.1 [Introductory and Survey],
D.2.5 [Testing and Debugging]

Avainsanat Nyckelord Keywords

software performance testing, performance, workload, distributed application


Silytyspaikka Frvaringstlle Where deposited

Muita tietoja vriga uppgifter Additional information

ii

Contents
1

Introduction .......................................................................................1

Workload characterization approach ..................................................2


2.1 Requirements and specifications in performance testing............................. 2
2.2 Characterizing the workload ...................................................................... 2
2.3 Developing performance test cases............................................................. 3

Early performance testing with distributed application.......................4


3.1 Early testing of performance ...................................................................... 5
3.1.1
3.1.2
3.1.3
3.1.4

Selecting performance use-cases...........................................................................5


Mapping use-cases to middleware ........................................................................6
Generating stubs....................................................................................................7
Executing the test ..................................................................................................7

Conclusion .........................................................................................8

References ...............................................................................................9

1 Introduction
Although the functionality supported by a software system is apparently important, it
is usually not the only concern. The various concerns of individuals and of the
society as a whole may face significant breakdowns and incur high costs if the
system cannot meet the quality of service requirements of those non-functional
aspects, for instance, performance, availability, security and maintainability that are
expected from it.
Performance is an indicator of how well a software system or component meets its
requirements for timeliness. There are two important dimensions to software
performance timeliness, responsiveness and scalability [SmW02]. Responsiveness is
the ability of a system to meet its objectives for response time or throughput. The
response time is the time required to respond to stimuli (events). The throughput of a
system is the number of events processed in some interval of time [BCK03].
Scalability is the ability of a system to continue to meet its response time or
throughput objectives as the demand for the software function increases [SmW02].
As Weyuker and Vokolos argued [WeV00], usually, the primary problems that
projects report after field release are not system crashes or incorrect systems
responses, but rather system performance degradation or problems handling required
system throughput. If queried, the fact is often that although the software system has
gone through extensive functionality testing, it was never really tested to assess its
expected performance. They also found that performance failures can be roughly
classified as the following three categories:
l
l
l

the lack of performance estimates,


the failure to have proposed plans for data collection,
the lack of a performance budget.

This seminar paper concentrates upon the introduction of two software performance
testing approaches. Section 2 introduces a workload characterization approach which
requires a careful collection of data for significant periods of time in the production
environment. In addition, the importance of clear performance requirements written
in requirement and specification documents is emphasized, since it is the
fundamental basis to carry out performance testing. Section 3 focuses on an
approach to test the performance of distributed software application as early as
possible during the entire software engineering process since it is obviously a large
overhead for the development team to fix the performance problems at the end of the
whole process. Even worse, it may be impossible to fix some performance problems
without sweeping redesign and re-implementation which can eat up lots of time and
money. A conclusion is made at last in section 4.

2 Workload characterization approach


As indicated [AvW04], one of the key objectives of performance testing is to
uncover problems that are revealed when the system is run under specific workloads.
This is sometimes referred to in the software engineering literature as an operational
profile [Mus93]. An operational profile is a probability distribution describing the
frequency with which selected important operations are exercised. It describes how
the system has historically been used in the field and thus is likely to be used in the
future. To this end, performance requirement is one of the necessary prerequisites
which will be used to determine whether software performance testing has been
conducted in a meaningful way.

2.1

Requirements and specifications in performance testing

Performance requirements must be provided in a concrete, verifiable manner


[VoW98]. This should be explicitly included in a requirements or specification
document and might be provided in terms of throughput or response time, and might
also include system availability requirements.
One of the most serious problems with performance testing is making sure that the
stated requirements can actually be checked to see whether or not they are fulfilled
[WeV00]. For instance, in functional testing, it seems to be useless to choose inputs
with which it is entirely impossible to determine whether or not the output is correct.
The same situation applies to performance testing. It is important to write
requirements that are meaningful for the purpose of performance testing. It is quite
easy to write a performance requirement for an ATM such as, one customer can
finish a single transaction of withdrawing money from the machine in less than 25
seconds. Then it might be possible to show that the time used in most of the test
cases is less than 25 seconds, while it only fails in one test case. Such a situation,
however, cannot guarantee that the requirement has been satisfied. A more plausible
piece of performance requirement should state that the time used in such a single
transaction is less than 25 seconds when the server at host bank is run with an
average workload. Assume that a benchmark has been established which can
accurately reflect the average workload, it is then possible to test whether this
requirement has been satisfied or not.

2.2

Characterizing the workload

In order to do the workload characterization, it is necessary to collect data for


significant periods of time in production environment. This can help characterize the
system workload, and then use these representative workloads to determine what the
system performance will look like when it is run in production on significantly large
workloads.

3
The workload characterization approach described by Alberto Avritzer and Joe
Kondek [AKL02] is comprised of two steps that will be illustrated as follows.
The first step is to model the software system. Since most industrial software
systems are usually too complex to handle all the possible characteristics, then
modeling is necessary. The goal of this step is thus to establish a simplified version
of the system in which the key parameters have been identified. It is essential that
the model be as close enough to the real system as possible so that the data collected
from it will realistically reflect the true systems behavior. Meanwhile, it shall be
simple enough as it will then be feasible to collect the necessary data.
The second step is to collect data while the system is in operation after the system
has been modeled, and key parameters identified. According to the paper [AKL02],
this activity should usually be done for periods of two to twelve months. Following
that, the data must be analyzed and a probability distribution should be determined.
Although the input space, in theory, is quite enormous because of the non-uniform
property of the frequency distribution, experience has shown that there are a
relatively small number of inputs which actually occur during the period of data
collection. The paper [AKL02] showed that it is quite common for only several
thousand inputs to correspond to more than 99% of the probability mass associated
with the input space. This means that a very accurate picture of the performance that
the user of the system tends to see in the field can be drawn only through testing the
relatively small number of inputs.

2.3

Developing performance test cases

After performing the workload characterization and determining what are the
paramount system characteristics that require data collection, now we need to use
that information to design performance test cases to reflect field production usage for
the system. The following prescriptions were defined by Weyuker and Vokolos
[WeV00]. One of the most interesting points in this list of prescriptions is that they
also defined how to design performance test cases in case the detailed historical data
is unavailable. Their by then situation was that a new platform has been purchased
but not yet available; plus software has already been designed and written explicitly
for the new hardware platform. The goal of such work is to determine whether there
are likely to be performance problems once the hardware is delivered and the
software is installed and running with the real customer base.
Typical steps to form performance test cases are as follows:
l
l

identify the software processes that directly influence the overall performance of
the system,
for each process, determine the input parameters that will most significantly
influence the performance of the system. It is important to limit the parameters
to the essential ones so that the set of test cases selected will be of manageable
size,

4
l

determine realistic values for these parameters by collecting and analyzing


existing usage data. These values should reflect desired usage scenarios,
including both average and heavy workloads.
if there are parameters for which historical usage data are not available, then
estimate reasonable values based on such things as the requirements used to
develop the system or experience gathered by using an earlier version of the
system or similar systems.
if, for a given parameter, the estimated values form a range, then select
representative values from within this range that are likely to reveal useful
information about the performance behavior of the system. Each selected value
should then form a separate test case.

It is, however, important to recognize that this list cannot be treated as a precise
preparation for test cases since every system is different.

3 Early performance
distributed application

testing

with

Testing techniques are usually applied towards the end of a project. However, most
researchers and practitioners agree that the most critical performance problems, as a
quality of interest, depend upon decisions made in the very early stages of the
development life cycle, such as architectural choices. Although iterative and
incremental development has been widely promoted, the situation concerned with
testing techniques has not been changed so much.
With the increasingly advance in distributed component technologies, such as J2EE
and CORBA, distributed systems are no longer built from scratch [DPE04]. Modern
distributed systems are often built on top of middlewares. As a result, when the
architecture is defined, a certain part of the implementation of a class of distributed
applications is already available. Then, it was argued that this enables performance
testing to be successfully applied at such early stages.
The method proposed by Denaro, Polini and Emmerich [DPE04] is based upon the
observation that the middleware used to build a distributed application often
determines the overall performance of the application. However, they also noted that
only the coupling between the middleware and the application architecture
determines the actual performance. The same middleware may perform quite
differently under the context of different applications. Based on such observation,
architecture designs were proposed as a tool to derive application-specific
performance test cases which can be executed on the early available middleware
platform on which a distributed application is built. It then allows measurements of
performance to be done in the very early stage of the development process.

3.1

Early testing of performance

The approach for early performance testing of distributed component-based


applications consists of four phases [DPE04]:
l
l
l
l

selection of the use-case scenarios relevant to performance, given a set of


architecture designs,
mapping of the selected use-cases to the actual deployment technology and
platform,
creation of stubs of components that are not available in the early stages of the
development, but are needed to implement the use cases, and
execution of the test.

The detailed contents in each phase are discussed in the following sub-sections.

3.1.1

Selecting performance use-cases

First of all, the design of functional test cases is entirely different from the case in
performance testing as already indicated in the previous section. However, as for
performance testing of distributed applications, the main parameters relating to it are
much more complicated than that described before. Table 1 is excerpted from the
paper [DPE04] to illustrate this point.

Table 1: Performance parameters [DPE04].


Apart from traditional concerns about workloads and physical resources,
consideration about the middleware configuration is also highlighted in this table (in
this case, it describes J2EE-based middleware). The last row of the table classifies

6
the relative interactions in distributed settings according to the place where they
occur. This taxonomy is far from complete, however, it was believed that such a
taxonomy of distributed interactions is key for using this approach. The next step is
the definition of appropriate metrics to evaluate the performance relevance of the
available use-cases according to the interactions that they trigger.

3.1.2 Mapping use-cases to middleware


At the early stage of development process, software architecture is generally defined
at a very abstract level. It usually just describes the business logic and abstract many
details of deployment platforms and technologies. From this point, it is necessary to
understand how abstract use-cases are mapped to possible deployment technologies
and platforms.
To facilitate the mapping from abstract use-cases to the concrete instances, software
connectors might be a feasible solution as indicated [DPE04]. Software connectors
mediate interactions among components. That is, they establish the rules that govern
component interaction and specify any auxiliary mechanisms required [MMP00].
According to the paper [MMP00], four major categories of connectors,
communication, coordination, conversion, and facilitation, were identified. It was
based on the services provided to interacting components. In addition, major
connector types, procedure call, data access, linkage, stream, event, arbitrator,
adaptor, and distributor, were also identified. Each connector type supports one or
more interaction services. The architecturally relevant details of each connector type
are captured by dimensions, and possibly, sub-dimensions. One dimension consists
of a set of values. Connector species are created by choosing the appropriate
dimensions and values for those dimensions from connector types. Figure 1 depicts
the software connector classification framework which might provide a more
descriptive illustration about the whole structure.
As a particular element of software architecture, software connector was studied to
investigate the possibility of defining systematic mappings between architectures
and middlewares. Well characterized software connectors may be associated with
deployment topologies that preserve the properties of the original architecture
[DPE04]. As indicated, however, further work is still required to understand many
dimensions and species of software connectors and their relationships with the
possible deployment platforms and technologies.

Figure 1: Software connector classification framework [MMP00].

3.1.3

Generating stubs

To actually implement the test cases, it needs to solve the problem that not all of the
application components which participate in the use-cases are available in the early
stages of development. Stubs should be used in place where the components miss.
Stubs are fake versions of components that can be used instead of the corresponding
components for instantiating the abstract use-cases. Stubs will only take care that the
distributed interactions happen as specified and the other components are coherently
exercised.
The main hypothesis of this approach is that performance measurements in the
presence of the stubs are decent approximations of the actual performance of the
final application [DPE04]. It results from the observation that the available
components, for instance, middleware and databases, embed the software that mainly
impact performance. The coupling between such implementation support and the
application-specific behavior can be extracted from the use-cases, while the
implementation details of the business components remain negligible.

3.1.4

Executing the test

Building the support to test execution involves more technical problems provided
scientific problems raised in the previous three sub-sections have been solved. In
addition, several aspects, for example, deployment and implementation of workload
generators, execution of measurements, can be automated.

Conclusion

In all, two software performance testing approaches were described in this paper.
Workload characterization approach can be treated as a traditional performance
testing approach that requires to carefully collecting a series of data in the production
field and that can only be implemented at the end of the project. In contrast, early
performance testing approach for distributed software applications seems to be more
novel since it encourages to implement performance testing early in the development
process, say, when the architecture is defined. Although it is still not a very mature
approach and more researches need to be conducted upon it according to its
advocators [DPE04], its future looks like to be promising since it allows to fix those
performance problems as early as possible which is quite attractive.
Several other aspects also need to be discussed. First of all, there has been very little
research published in the area of software performance testing. For example, with
the search facility IEEE Xplore, if one enters software performance testing in the
search field, there were only 3 results returned when this paper was written. Such a
situation indicates that the field of software performance testing as a whole is only in
its initial stage and needs much more emphasis in future. Secondly, the importance
of requirements and specifications is discussed in this paper. The fact, however, is
that usually no performance requirements are provided, which means that there is no
precise way of determining whether or not the software performance is acceptable.
Thirdly, a positive trend is that software performance, as an important quality, is
increasingly punctuated during the development process. Smith and Williams
[SmW02] proposed Software Performance Engineering (SPE) which is a systematic,
quantitative approach to constructing software systems that meet performance
objectives. It aids in tracking performance throughout the development process and
prevents performance problems from emerging late in the life cycle.

References
AKL02

Avritzer A., Kondek J., Liu D., Weyuker E.J., Software performance
testing based on workload characterization. Proc. of the 3rd
international workshop on software and performance, Jul. 2002, pp.
17-24.

AvW04

Avritzer A., and Weyuker E.J., The role of modeling in the performance
testing of E-commerce applications. IEEE Transactions on software
engineering, 30, 12, Dec. 2004, pp. 1072-1083.

BCK03

Bass L., Clements P., Kazman R., Software architecture in practice,


second edition. Addision Wesley, Apr. 2003.

DPE04

Denaro G., Polini A., Emmerich W., Early performance testing of


distributed software applications. Proc. of the 4th international
workshop on software and performance, 2004, pp. 94-103.

MMP00

Mehta N., Medvidovic N. and Phadke S., Towards a taxonomy of


software connectors. In proc. of the 22nd International conference on
software engineering, 2000, pp. 178-187.

Mus93

Musa J.D., Operational profiles in software reliability engineering.


IEEE Software, 10, 2, Mar. 1993, pp. 14-32.

SmW02

Smith C.U. and Williams L.G., Performance solutions: a practical


guide to creating responsive, scalable software. Boston, MA, Addision
Wesley, 2002.

VoW98

Vokolos F.I., Weyuker E.J., Performance testing of software systems.


Proc. of the 1st international workshop on software and performance,
Oct. 1998, pp. 80-87.

WeV00

Weyuker E.J. and Vokolos F.I., Experience with performance testing of


software systems: issues, an approach and a case study. IEEE
Transactions on Software Engineering, 26, 12, Dec. 2000, pp.
1147-1156.

Software Reliability Engineering: A Roadmap


Michael R. Lyu

Michael R. Lyu received the Ph.D. in computer science from


University of California, Los Angeles in 1988. He is a Professor in the
Computer Science and Engineering Department of the Chinese
University of Hong Kong. He worked at the Jet Propulsion Laboratory,
Bellcore, and Bell Labs; and taught at the University of Iowa. He has
participated in more than 30 industrial projects, published over 250
papers, and helped to develop many commercial systems and
software tools. Professor Lyu is frequently invited as a keynote or
tutorial speaker to conferences and workshops in U.S., Europe, and
Asia. He initiated the International Symposium on Software Reliability
Engineering (ISSRE) in 1990. He also received Best Paper Awards in
ISSRE'98 and in ISSRE'2003. Professor Lyu is an IEEE Fellow and
an AAAS Fellow, for his contributions to software reliability
engineering and software fault tolerance.

Software Reliability Engineering: A Roadmap


Michael R. Lyu
Computer Science and Engineering Department
The Chinese University of Hong Kong, Hong Kong
lyu@cse.cuhk.edu.hk

Abstract
Software reliability engineering is focused on
engineering
techniques
for
developing
and
maintaining software systems whose reliability can be
quantitatively evaluated. In order to estimate as well
as to predict the reliability of software systems, failure
data need to be properly measured by various means
during software development and operational phases.
Moreover, credible software reliability models are
required to track underlying software failure processes
for accurate reliability analysis and forecasting.
Although software reliability has remained an active
research subject over the past 35 years, challenges
and open questions still exist. In particular, vital
future goals include the development of new software
reliability engineering paradigms that take software
architectures, testing techniques, and software failure
manifestation mechanisms into consideration. In this
paper, we review the history of software reliability
engineering, the current trends and existing problems,
and specific difficulties. Possible future directions and
promising research subjects in software reliability
engineering are also addressed.

1. Introduction
Software permeates our daily life. There is probably
no other human-made material which is more
omnipresent than software in our modern society. It
has become a crucial part of many aspects of society:
home appliances, telecommunications, automobiles,
airplanes, shopping, auditing, web teaching, personal
entertainment, and so on. In particular, science and
technology demand high-quality software for making
improvements and breakthroughs.
The size and complexity of software systems have
grown dramatically during the past few decades, and
the trend will certainly continue in the future. The data
from industry show that the size of the software for

various systems and applications has been growing


exponentially for the past 40 years [20]. The trend of
such growth in the telecommunication, business,
defense, and transportation industries shows a
compound growth rate of ten times every five years.
Because of this ever-increasing dependency, software
failures can lead to serious, even fatal, consequences in
safety-critical systems as well as in normal business.
Previous software failures have impaired several highvisibility programs and have led to loss of business
[28].
The ubiquitous software is also invisible, and its
invisible nature makes it both beneficial and harmful.
From the positive side, systems around us work
seamlessly thanks to the smooth and swift execution of
software. From the negative side, we often do not
know when, where and how software ever has failed,
or will fail.
Consequently, while reliability
engineering for hardware and physical systems
continuously improves, reliability engineering for
software does not really live up to our expectation over
the years.
This situation is frustrating as well as encouraging. It
is frustrating because the software crisis identified as
early as the 1960s still stubbornly stays with us, and
software engineering has not fully evolved into a
real engineering discipline. Human judgments and
subjective favorites, instead of physical laws and
rigorous procedures, dominate many decision making
processes in software engineering. The situation is
particularly critical in software reliability engineering.
Reliability is probably the most important factor to
claim for any engineering discipline, as it
quantitatively measures quality, and the quantity can
be properly engineered. Yet software reliability
engineering, as elaborated in later sections, is not yet
fully delivering its promise. Nevertheless, there is an
encouraging aspect to this situation. The demands on,
techniques of, and enhancements to software are
continually increasing, and so is the need to understand

its reliability. The unsettled software crisis poses


tremendous opportunities for software engineering
researchers as well as practitioners. The ability to
manage quality software production is not only a
necessity, but also a key distinguishing factor in
maintaining a competitive advantage for modern
businesses.
Software reliability engineering is centered on a key
attribute, software reliability, which is defined as the
probability of failure-free software operation for a
specified period of time in a specified environment [2].
Among other attributes of software quality such as
functionality, usability, capability, and maintainability,
etc., software reliability is generally accepted as the
major factor in software quality since it quantifies
software failures, which can make a powerful system
inoperative. Software reliability engineering (SRE) is
therefore defined as the quantitative study of the
operational behavior of software-based systems with
respect to user requirements concerning reliability. As
a proven technique, SRE has been adopted either as
standard or as best current practice by more than 50
organizations in their software projects and reports
[33], including AT&T, Lucent, IBM, NASA,
Microsoft, and many others in Europe, Asia, and North
America. However, this number is still relatively small
compared to the large amount of software producers in
the world.
Existing SRE techniques suffer from a number of
weaknesses. First of all, current SRE techniques
collect the failure data during integration testing or
system testing phases. Failure data collected during the
late testing phase may be too late for fundamental
design changes. Secondly, the failure data collected in
the in-house testing may be limited, and they may not
represent failures that would be uncovered under
actual operational environment. This is especially true
for high-quality software systems which require
extensive and wide-ranging testing. The reliability
estimation and prediction using the restricted testing
data may cause accuracy problems. Thirdly, current
SRE techniques or modeling methods are based on
some unrealistic assumptions that make the reliability
estimation too optimistic relative to real situations. Of
course, the existing software reliability models have
had their successes; but every model can find
successful cases to justify its existence. Without crossindustry validation, the modeling exercise may become
merely of intellectual interest and would not be widely
adopted in industry. Thus, although SRE has been
around for a while, credible software reliability
techniques are still urgently needed, particularly for
modern software systems [24].

In the following sections we will discuss the past, the


present, and the future of software reliability
engineering. We first survey what techniques have
been proposed and applied in the past, and then
describe what the current trend is and what problems
and concerns remain. Finally, we propose the possible
future directions in software reliability engineering.

2.
Historical
software
engineering techniques

reliability

In the literature a number of techniques have been


proposed to attack the software reliability engineering
problems based on software fault lifecycle. We
discuss these techniques, and focus on two of them.

2.1. Fault lifecycle techniques


Achieving highly reliable software from the
customers perspective is a demanding job for all
software engineers and reliability engineers. [28]
summarizes the following four technical areas which
are applicable to achieving reliable software systems,
and they can also be regarded as four fault lifecycle
techniques:
1) Fault prevention: to avoid, by construction, fault
occurrences.
2) Fault removal: to detect, by verification and
validation, the existence of faults and eliminate them.
3) Fault tolerance: to provide, by redundancy, service
complying with the specification in spite of faults
having occurred or occurring.
4) Fault/failure forecasting: to estimate, by evaluation,
the presence of faults and the occurrences and
consequences of failures. This has been the main focus
of software reliability modeling.
Fault prevention is the initial defensive mechanism
against unreliability. A fault which is never created
costs nothing to fix. Fault prevention is therefore the
inherent objective of every software engineering
methodology. General approaches include formal
methods in requirement specifications and program
verifications, early user interaction and refinement of
the requirements, disciplined and tool-assisted
software design methods, enforced programming
principles and environments, and systematic
techniques for software reuse.
Formalization of
software engineering processes with mathematically
specified languages and tools is an aggressive
approach to rigorous engineering of software systems.
When applied successfully, it can completely prevent
faults. Unfortunately, its application scope has been

limited. Software reuse, on the other hand, finds a


wider range of applications in industry, and there is
empirical evidence for its effectiveness in fault
prevention. However, software reuse without proper
certification could lead to disaster. The explosion of
the Ariane 5 rocket, among others, is a classic example
where seemly harmless software reuse failed miserably,
in which critical software faults slipped through all the
testing and verification procedures, and where a
system went terribly wrong only during complicated
real-life operations.
Fault prevention mechanisms cannot guarantee
avoidance of all software faults. When faults are
injected into the software, fault removal is the next
protective means. Two practical approaches for fault
removal are software testing and software inspection,
both of which have become standard industry practices
in quality assurance. Directions in software testing
techniques are addressed in [4] in detail.
When inherent faults remain undetected through the
testing and inspection processes, they will stay with the
software when it is released into the field. Fault
tolerance is the last defending line in preventing faults
from manifesting themselves as system failures. Fault
tolerance is the survival attribute of software systems
in terms of their ability to deliver continuous service to
the customers. Software fault tolerance techniques
enable software systems to (1) prevent dormant
software faults from becoming active, such as
defensive programming to check for input and output
conditions and forbid illegal operations; (2) contain the
manifested software errors within a confined boundary
without further propagation, such as exception
handling routines to treat unsuccessful operations; (3)
recover software operations from erroneous conditions,
such as checkpointing and rollback mechanisms; and
(4) tolerate system-level faults methodically, such as
employing design diversity in the software
development.
Finally if software failures are destined to occur, it is
critical to estimate and predict them. Fault/failure
forecasting involves formulation of the fault/failure
relationship, an understanding of the operational
environment, the establishment of software reliability
models, developing procedures and mechanisms for
software reliability measurement, and analyzing and
evaluating the measurement results. The ability to
determine software reliability not only gives us
guidance about software quality and when to stop
testing, but also provides information for software
maintenance needs. It can facilitate the validity of
software warranty when reliability of software has

been properly certified. The concept of scheduled


maintenance with software rejuvenation techniques [46]
can also be solidified.
The subjects of fault prevention and fault removal
have been discussed thoroughly by other articles in this
issue. We focus our discussion on issues related to
techniques on fault tolerance and fault/failure
forecasting.

2.2.
Software
measurement

reliability

models

and

As a major task of fault/failure forecasting, software


reliability modeling has attracted much research
attention in estimation (measuring the current state) as
well as prediction (assessing the future state) of the
reliability of a software system. A software reliability
model specifies the form of a random process that
describes the behavior of software failures with respect
to time. A historical review as well as an application
perspective of software reliability models can be found
in [7, 28]. There are three main reliability modeling
approaches: the error seeding and tagging approach,
the data domain approach, and the time domain
approach, which is considered to be the most popular
one. The basic principle of time domain software
reliability modeling is to perform curve fitting of
observed time-based failure data by a pre-specified
model formula, such that the model can be
parameterized with statistical techniques (such as the
Least Square or Maximum Likelihood methods). The
model can then provide estimation of existing
reliability or prediction of future reliability by
extrapolation techniques. Software reliability models
usually make a number of common assumptions, as
follows. (1) The operation environment where the
reliability is to be measured is the same as the testing
environment in which the reliability model has been
parameterized. (2) Once a failure occurs, the fault
which causes the failure is immediately removed. (3)
The fault removal process will not introduce new faults.
(4) The number of faults inherent in the software and
the way these faults manifest themselves to cause
failures follow, at least in a statistical sense, certain
mathematical formulae. Since the number of faults (as
well as the failure rate) of the software system reduces
when the testing progresses, resulting in growth of
reliability, these models are often called software
reliability growth models (SRGMs).
Since Jelinsky and Moranda proposed the first
SRGM [23] in 1972, numerous SRGMs have been
proposed in the past 35 years, such as exponential
failure time class models, Weibull and Gamma failure

time class models, infinite failure category models,


Bayesian models, and so on [28, 36, 50]. Unified
modeling approaches have also been attempted [19].
As mentioned before, the major challenges of these
models do not lie in their technical soundness, but their
validity and applicability in real world projects.
Determine Reliability
Objective

Develop
Operational Profile

Perform Software Testing


Collect Failure Data
Apply Software Reliability
Tools

Continue
Testing

Select Appropriate Software


Reliability Models
No
Use Software Reliability Models
to Calculate Current Reliability

Reliability
Objective
met?
Yes

Start to Deploy
Validate Reliability in the Field
Feedback to Next Release

Figure 1. Software Reliability Engineering


Process Overview
Figure 1 shows an SRE framework in current practice
[28]. First, a reliability objective is determined
quantitatively from the customer's viewpoint to
maximize customer satisfaction, and customer usage is
defined by developing an operational profile. The
software is then tested according to the operational
profile, failure data collected, and reliability tracked
during testing to determine the product release time.
This activity may be repeated until a certain reliability
level has been achieved. Reliability is also validated in
the field to evaluate the reliability engineering efforts
and to achieve future product and process
improvements.
It can be seen from Figure 1 that there are four major
components in this SRE process, namely (1) reliability

objective, (2) operational profile, (3) reliability


modeling and measurement, and (4) reliability
validation. A reliability objective is the specification
of the reliability goal of a product from the customer
viewpoint. If a reliability objective has been specified
by the customer, that reliability objective should be
used. Otherwise, we can select the reliability measure
which is the most intuitive and easily understood, and
then determine the customer's "tolerance threshold" for
system failures in terms of this reliability measure.
The operational profile is a set of disjoint alternatives
of system operational scenarios and their associated
probabilities of occurrence. The construction of an
operational profile encourages testers to select test
cases according to the system's likely operational usage,
which contributes to more accurate estimation of
software reliability in the field.
Reliability modeling is an essential element of the
reliability estimation process. It determines whether a
product meets its reliability objective and is ready for
release. One or more reliability models are employed
to calculate, from failure data collected during system
testing, various estimates of a product's reliability as a
function of test time. Several interdependent estimates
can be obtained to make equivalent statements about a
product's reliability. These reliability estimates can
provide the following information, which is useful for
product quality management: (1) The reliability of the
product at the end of system testing. (2) The amount of
(additional) test time required to reach the product's
reliability objective. (3) The reliability growth as a
result of testing (e.g., the ratio of the value of the
failure intensity at the start of testing to the value at the
end of testing). (4) The predicted reliability beyond the
system testing, such as the product's reliability in the
field.
Despite the existence of a large number of models,
the problem of model selection and application is
manageable, as there are guidelines and statistical
methods for selecting an appropriate model for each
application. Furthermore, experience has shown that it
is sufficient to consider only a dozen models,
particularly when they are already implemented in
software tools [28].
Using these statistical methods, "best" estimates of
reliability are obtained during testing. These estimates
are then used to project the reliability during field
operation in order to determine whether the reliability
objective has been met. This procedure is an iterative
process, since more testing will be needed if the
objective is not met. When the operational profile is
not fully developed, the application of a test

compression factor can assist in estimating field


reliability. A test compression factor is defined as the
ratio of execution time required in the operational
phase to execution time required in the test phase to
cover the input space of the program. Since testers
during testing are quickly searching through the input
space for both normal and difficult execution
conditions, while users during operation only execute
the software with a regular pace, this factor represents
the reduction of failure rate (or increase in reliability)
during operation with respect to that observed during
testing.
Finally, the projected field reliability has to be
validated by comparing it with the observed field
reliability.
This validation not only establishes
benchmarks and confidence levels of the reliability
estimates, but also provides feedback to the SRE
process for continuous improvement and better
parameter tuning. When feedback is provided, SRE
process enhancement comes naturally: the model
validity is established, the growth of reliability is
determined, and the test compression factor is refined.

2.3. Software fault tolerance techniques and


models
Fault tolerance, when applicable, is one of the major
approaches to achieve highly reliable software. There
are two different groups of fault tolerance techniques:
single version and multi-version software techniques
[29].
The former includes program modularity,
system closure, atomicity of actions, error detection,
exception handling, checkpoint and restart, process
pairs, and data diversity [44]; while the latter, so-called
design diversity, is employed where multiple software
versions are developed independently by different
program teams using different design methods, yet
they provide equivalent services according to the same
requirement specifications. The main techniques of this
multiple version software approach are recovery
blocks, N-version programming, N self-checking
programming, and other variants based on these three
fundamental techniques.
Reliability models attempt to estimate the probability
of coincident failures in multiple versions. Eckhardt
and Lee (1985) [15] proposed the first reliability model
of fault correlation in design diversity to observe
positive correlations between version failures on the
assumption of variation of difficulty on demand space.
Littlewood and Miller (1989) [25] suggested that there
was a possibility that negative fault correlations may
exist on the basis of forced design diversity. Dugan
and Lyu (1995) [14] proposed a Markov reward model

to compare system reliability achieved by various


design diversity approaches, and Tomek and Trivedi
(1995) [43] suggested a Stochastic reward net model
for software fault tolerance. Popov, Strigini et al.
(2003) [37] estimated the upper and lower bounds for
failure probability of design diversity based on the
subdomain concept on the demand space. A detailed
summary of fault-tolerant software and its reliability
modeling methods can be found in [29]. Experimental
comparisons and evaluations of some of the models are
listed in [10] and [11].

3. Current trends and problems


The challenges in software reliability not only stem
from the size, complexity, difficulty, and novelty of
software applications in various domains, but also
relate to the knowledge, training, experience and
character of the software engineers involved. We
address the current trends and problems from a number
of software reliability engineering aspects.

3.1. Software reliability and system reliability


Although the nature of software faults is different
from that of hardware faults, the theoretical foundation
of software reliability comes from hardware reliability
techniques. Previous work has been focused on
extending the classical reliability theories from
hardware to software, so that by employing familiar
mathematical modeling schemes, we can establish
software reliability framework consistently from the
same viewpoints as hardware. The advantages of such
modeling approaches are: (1) The physical meaning of
the failure mechanism can be properly interpreted, so
that the effect of failures on reliability, as measured in
the form of failure rates, can be directly applied to the
reliability models. (2) The combination of hardware
reliability and software reliability to form system
reliability models and measures can be provided in a
unified theory. Even though the actual mechanisms of
the various causes of hardware faults and software
faults may be different, a single formulation can be
employed from the reliability modeling and statistical
estimation viewpoints. (3) System reliability models
inherently engage system structure and modular design
in block diagrams. The resulting reliability modeling
process is not only intuitive (how components
contribute to the overall reliability can be visualized),
but also informative (reliability-critical components
can be quickly identified).
The major drawbacks, however, are also obvious.
First of all, while hardware failures may occur
independently (or approximately so), software failures

do not happen independently. The interdependency of


software failures is also very hard to describe in detail
or to model precisely. Furthermore, similar hardware
systems are developed from similar specifications, and
hardware failures, usually caused by hardware defects,
are repeatable and predictable. On the other hand,
software systems are typically one-of-a-kind. Even
similar software systems or different versions of the
same software can be based on quite different
specifications.
Consequently, software failures,
usually caused by human design faults, seldom repeat
in exactly the same way or in any predictable pattern.
Therefore, while failure mode and effect analysis
(FMEA) and failure mode and effect criticality
analysis (FMECA) have long been established for
hardware systems, they are not very well understood
for software systems.

3.2. Software reliability modeling


Among all software reliability models, SRGM is
probably one of the most successful techniques in the
literature, with more than 100 models existing in one
form or another, through hundreds of publications. In
practice, however, SRGMs encounter major challenges.
First of all, software testers seldom follow the
operational profile to test the software, so what is
observed during software testing may not be directly
extensible for operational use. Secondly, when the
number of failures collected in a project is limited, it is
hard to make statistically meaningful reliability
predictions. Thirdly, some of the assumptions of
SRGM are not realistic, e.g., the assumptions that the
faults are independent of each other; that each fault has
the same chance to be detected in one class; and that
correction of a fault never introduces new faults [40].
Nevertheless, the above setbacks can be overcome
with suitable means. Given proper data collection
processes to avoid drastic invalidation of the model
assumptions, it is generally possible to obtain accurate
estimates of reliability and to know that these estimates
are accurate.
Although some historical SRGMs have been widely
adopted to predict software reliability, researchers
believe they can further improve the prediction
accuracy of these models by adding other important
factors which affect the final software quality
[12,31,48]. Among others, code coverage is a metric
commonly engaged by software testers, as it indicates
how completely a test set executes a software system
under test, therefore influencing the resulting
reliability measure. To incorporate the effect of code
coverage on reliability in the traditional software
reliability models, [12] proposes a technique using

both time and code coverage measurement for


reliability prediction. It reduces the execution time by a
parameterized factor when the test case neither
increases code coverage nor causes a failure. These
models, known as adjusted Non-Homogeneous
Poisson Process (NHPP) models, have been shown
empirically to achieve more accurate predictions than
the original ones.
In the literature, several models have been proposed
to determine the relationship between the number of
failures/faults and the test coverage achieved, with
various distributions. [48] suggests that this relation is
a variant of the Rayleigh distribution, while [31] shows
that it can be expressed as a logarithmic-exponential
formula, based on the assumption that both fault
coverage and test coverage follow the logarithmic
NHPP growth model with respect to the execution time.
More metrics can be incorporated to further explore
this new modeling avenue.
Although there are a number of successful SRE
models, they are typically measurement-based models
which are employed in isolation at the later stage of the
software development process. Early software
reliability prediction models are often too insubstantial,
seldom executable, insufficiently formal to be
analyzable, and typically not linked to the target
system. Their impact on the resulting reliability is
therefore modest. There is currently a need for a
creditable end-to-end software reliability model that
can be directly linked to reliability prediction from the
very beginning, so as to establish a systematic SRE
procedure that can be certified, generalized and refined.

3.3. Metrics and measurements


Metrics and measurements have been an important
part of the software development process, not only for
software project budget planning but also for software
quality assurance purposes. As software complexity
and software quality are highly related to software
reliability, the measurements of software complexity
and quality attributes have been explored for early
prediction of software reliability [39]. Static as well as
dynamic program complexity measurements have been
collected, such as lines of code, number of operators,
relative program complexity, functional complexity,
operational complexity, and so on. The complexity
metrics can be further included in software reliability
models for early reliability prediction, for example, to
predict the initial software fault density and failure rate.
In SRGM, the two measurements related to reliability
are: 1) the number of failures in a time period; and 2)
time between failures. An important advancement of

SRGM is the notation of time during which failure


data are recorded. It is demonstrated that CPU time is
more suitable and more accurate than calendar time for
recording failures, in which the actual execution time
of software can be faithfully represented [35]. More
recently, other forms of metrics for testing efforts have
been incorporated into software reliability modeling to
improve the prediction accuracy [8,18].

The amounts and types of data to be collected for


reliability
analysis
purposes
vary
between
organizations. Consequently, the experiences and
lessons so gained may only be shared within the same
company culture or at a high level of abstraction
between organizations.
To overcome this
disadvantage, systematic failure data analysis for SRE
purposes should be conducted.

One key problem about software metrics and


measurements is that they are not consistently defined
and interpreted, again due to the lack of physical
attributes of software. The achieved reliability
measures may differ for different applications, yielding
inconclusive results. A unified ontology to identify,
describe, incorporate and understand reliability-related
software metrics is therefore urgently needed.

Given field failure data collected from a real system,


the analysis consists of five steps: 1) preprocessing of
data, 2) analysis of data, 3) model structure
identification and parameter estimation, 4) model
solution, if necessary, and 5) analysis of models. In
Step 1, the necessary information is extracted from the
field data. The processing in this step requires detailed
understanding of the target software and operational
conditions. The actual processing required depends on
the type of data. For example, the information in
human-generated reports is usually not completely
formatted. Therefore, this step involves understanding
the situations described in the reports and organizing
the relevant information into a problem database. In
contrast, the information in automatically generated
event logs is already formatted. Data processing of
event logs consists of extracting error events and
coalescing related error events.

3.4. Data collection and analysis


The software engineering process is described
sardonically as a garbage-in/garbage-out process. That
is to say, the accuracy of its output is bounded by the
precision of its input. Data collection, consequently,
plays a crucial role for the success of software
reliability measurement.
There is an apparent trade-off between the data
collection and the analysis effort. The more accuracy
is required for analysis, the more effort is required for
data collection. Fault-based data are usually easier to
collect due to their static nature. Configuration
management tools for source code maintenance can
help to collect these data as developers are required to
check in and check out new updated versions of code
for fault removal. Failure-based data, on the other
hand, are much harder to collect and usually require
additional effort, for the following reasons. First, the
dynamic operating condition where the failures occur
may be hard to identify or describe. Moreover, the
time when the failures occur must be recorded
manually, after the failures are manifested. Calendar
time data can be coarsely recorded, but they lack
accuracy for modeling purposes. CPU time data, on the
other hand, are very difficult to collect, particularly for
distributed systems and networking environment
where multiple CPUs are executing software in parallel.
Certain forms of approximation are required to avoid
the great pain in data collection, but then the accuracy
of the data is consequently reduced. It is noted that
while manual data collection can be very labor
intensive, automatic data collection, although
unavoidable, may be too intrusive (e.g., online
collection of data can cause interruption to the system
under test).

In Step 2, the data are interpreted. Typically, this step


begins with a list of measures to evaluate. However,
new issues that have a major impact on software
reliability can also be identified during this step. The
results from Step 2 are reliability characteristics of
operational software in actual environments and issues
that must be addressed to improve software reliability.
These include fault and error classification, error
propagation, error and failure distribution, software
failure dependency, hardware-related software errors,
evaluation of software fault tolerance, error recurrence,
and diagnosis of recurrences.
In Step 3, appropriate models (such as Markov
models) are identified based on the findings from Step
2. We identify model structures and realistic ranges of
parameters. The identified models are abstractions of
the software reliability behavior in real environments.
Statistical analysis packages and measurement-based
reliability analysis tools are useful at this stage.
Step 4 involves either using known techniques or
developing new ones to solve the model. Model
solution allows us to obtain measures, such as
reliability, availability, and performability. The results
obtained from the model must be validated against real
data. Reliability and performance modeling and

evaluation tools such as SHARPE [45] can be used in


this step.
In Step 5, what if questions are addressed, using
the identified models. Model factors are varied and the
resulting effects on software reliability are evaluated.
Reliability bottlenecks are determined and the effects
of design changes on software reliability are predicted.
Research work currently addressed in this area
includes software reliability modeling in the
operational phase, the modeling of the impact of
software failures on performance, detailed error and
recovery processes, and software error bursts. The
knowledge and experience gained through such
analysis can be used to plan additional studies and to
develop the measurement techniques.

3.5. Methods and tools


In addition to software reliability growth modeling,
many other methods are available for SRE. We
provide a few examples of these methods and tools.
Fault trees provide a graphical and logical framework
for a systematic analysis of system failure modes.
Software reliability engineers can use them to assess
the overall impact of software failures on a system, or
to prove that certain failure modes will not occur. If
they may occur, the occurrence probability can also be
assessed. Fault tree models therefore provide an
informative modeling framework that can be engaged
to compare different design alternatives or system
architectures with respect to reliability. In particular,
they have been applied to both fault tolerant and fault
intolerant (i.e., non-redundant) systems. Since this
technique originates from hardware systems and has
been extended to software systems, it can be employed
to provide a unified modeling scheme for
hardware/software co-design. Reliability modeling for
hardware-software interactions is currently an area of
intensive research [42].
In addition, simulation techniques can be provided
for SRE purposes. They can produce observables of
interest in reliability engineering, including discrete
integer-valued quantities that occur as time progresses.
One simulation approach produces artifacts in an
actual software environment according to factors and
influences believed to typify these entities within a
given context [47]. The artifacts and environment are
allowed to interact naturally, whereupon the flow of
occurrences of activities and events is observed. This
artifact-based simulation allows experiments to be set
up to examine the nature of the relationships between
software failures and other software metrics, such as
program structure, programming error characteristics,

and test strategies. It is suggested that the extent to


which reliability depends merely on these factors can
be measured by generating random programs having
the given characteristics, and then observing their
failure statistics.
Another reliability simulation approach [28] produces
time-line imitations of reliability-related activities and
events. Reliability measures of interest to the software
process are modeled parametrically over time. The
key to this approach is a rate-based architecture, in
which phenomena occur naturally over time as
controlled by their frequencies of occurrence, which
depend on driving software metrics such as number of
faults so far exposed or yet remaining, failure
criticality, workforce level, test intensity, and software
execution time. Rate-based event simulation is an
example of a form of modeling called system dynamics,
whose distinctive feature is that the observables are
discrete events randomly occurring in time. Since
many software reliability growth models are also based
on rate (in terms of software hazard), the underlying
processes assumed by these models are fundamentally
the same as the rate-based reliability simulation. In
general, simulations enable investigations of questions
too difficult to be answered analytically, and are
therefore more flexible and more powerful.
Various SRE measurement tools have been
developed for data collection, reliability analysis,
parameter estimation, model application and reliability
simulation. Any major improvement on SRE is likely
to focus on such tools. We need to provide tools and
environments which can assist software developers to
build reliable software for different applications. The
partition of tools, environments, and techniques that
will be engaged should reflect proper employment of
the best current SRE practices.

3.6. Testing effectiveness and code coverage


As a typical mechanism for fault removal in software
reliability engineering, software testing has been
widely practiced in industry for quality assurance and
reliability improvement. Effective testing is defined as
uncovering of most if not all detectable faults. As the
total number of inherent faults is not known, testing
effectiveness is usually represented by a measurable
testing index. Code coverage, as an indicator to show
how thoroughly software has been stressed, has been
proposed and is widely employed to represent fault
coverage.

Positive

Reference
Horgan (1994) [17]
Frankl (1988) [16]
Rapps (1988) [38]
Chen (1992) [13]
Wong (1994)
Frate (1995)

Negative

Cai (2005) [8]


Briand (2000) [6]

Findings
High code coverage brings high software reliability and low fault rate.
A correlation between code coverage and software reliability was observed.
The correlation between test effectiveness and block coverage is higher than
that between test effectiveness and the size of test set.
An increase in reliability comes with an increase in at least one code coverage
measure, and a decrease in reliability is accompanied by a decrease in at least
one code coverage measure.
Code coverage contributes to a noticeable amount of fault coverage.
The testing result for published data did not support a causal dependency
between code coverage and fault coverage.

Table 1. Comparison of Investigations on the Relation of Code Coverage to Fault Coverage


Despite the observations of a correlation between
code coverage and fault coverage, a question is raised:
Can this phenomenon of concurrent growth be
attributed to a causal dependency between code
coverage and fault detection, or is it just coincidental
due to the cumulative nature of both measures? In one
investigation of this question, an experiment involving
Monte Carlo simulation was conducted on the
assumption that there is no causal dependency between
code coverage and fault detection [6]. The testing
result for published data did not support a causal
dependency between code coverage and defect
coverage.
Nevertheless, many researchers consider coverage as
a faithful indicator of the effectiveness of software
testing results. A comparison among various studies
on the impact of code coverage on software reliability
is shown in Table 1.

the benefits of applying operational profiles can be


found in a number of industrial projects [34].
Although significant improvement can be achieved
by employing operational profiles in regression or
system testing, challenges still exist for this technique.
First of all, the operational profiles for some
applications are hard to develop, especially for some
distributed software systems, e.g., Web services.
Moreover, unlike those of hardware, the operational
profiles of software cannot be duplicated in order to
speed the testing, because the failure behavior of
software depends greatly on its input sequence and
internal status. While in unit testing, different software
units can be tested at the same time, this approach is
therefore not applicable in system testing or regression
testing. As a result, learning to deal with improper
operational profiles and the dependences within the
operational profile are the two major problems in
operational profile techniques.

3.7. Testing and operational profiles

3.8. Industry practice and concerns

The operational profile is a quantitative


characterization of how a system will be used in the
field by customers. It helps to schedule test activities,
generate test cases, and select test runs. By allocating
development and test resources to functions on the
basis of how they are used, software reliability
engineering can thus be planned with productivity and
economics considerations in mind.

Although some success stories have been reported,


there is a lack of wide industry adoption for software
reliability engineering across various applications.
Software practitioners often see reliability as a cost
rather than a value, an investment rather than a return.
Often the reliability attribute of a product takes less
priority than its functionality or innovation. When
product delivery schedule is tight, reliability is often
the first element to be squeezed.

Using an operational profile to guide system testing


ensures that if testing is terminated and the software is
shipped because of imperative schedule constraints,
the most-used operations will have received the most
testing, and the reliability level will be the maximum
that is practically achievable for the given test time.
Also, in guiding regression testing, the profile tends to
find, among the faults introduced by changes, the ones
that have the most effect on reliability. Examples of

The main reason for the lack of industry enthusiasm


in SRE is because its cost-effectiveness is not clear.
Current SRE techniques incur visible overhead but
yield invisible benefits. In contrast, a companys target
is to have visible benefit but invisible overhead. The
former requires some demonstration in the form of
successful projects, while the latter involves avoidance

of labor-intensive tasks. Many companies, voluntarily


or under compulsion from their quality control policy,
collect failure data and make reliability measurements.
They are not willing to spend much effort on data
collection, let alone data sharing. Consequently,
reliability results cannot be compared or benchmarked,
and the experiences are hard to accumulate. Most
software
practitioners
only
employ
some
straightforward methods and metrics for their product
reliability control. For example, they may use some
general guidelines for quality metrics, such as fault
density, lines of code, or development or testing time,
and compare current projects with previous ones.
As the competitive advantage of product reliability is
less obvious than that of other product quality
attributes (such as performance or usability), few
practitioners are willing to try out emerging techniques
on SRE. The fact that there are so many software
reliability models to choose from also intimidates
practitioners.
So instead of investigating which
models are suitable for their environments or which
model selection criteria can be applied, practitioners
tend to simply take reliability measurements casually,
and they are often suspicious about the reliability
numbers obtained by the models. Many software
projects claim to set reliability objectives such as five
9s or six 9s (meaning 0.99999 to 0.999999
availability or 10-5 to 10-6 failures per execution hour),
but few can validate their reliability achievement.
Two major successful hardware reliability
engineering techniques, reliability prediction by
system architecture block diagrams and FME(C)A, still
cannot be directly applied to software reliability
engineering. This, as explained earlier, is due to the
intricate software dependencies within and between
software components (and sub-systems). If software
components can be decoupled, or their dependencies
can be clearly identified and properly modeled, then
these popular techniques in hardware may be
applicable to software, whereupon wide industry
adoption may occur.
We elaborate this in the
following section.

3.9. Software architecture


Systematic examination of software architectures for
a better way to support software development has been
an active research direction in the past 10 years, and it
will continue to be center stage in the coming decade
[41]. Software architectural design not only impacts
software development activities, but also affects SRE
efforts. Software architecture should be enhanced to
decrease the dependency of different software pieces

that run on the same computer or platform so that their


reliability does not interact. Fault isolation is a major
design consideration for software architecture. Good
software architecture should enjoy the property that
exceptions are raised when faults occur, and module
failures are properly confined without causing system
failures. In particular, this type of component-based
software development approach requires different
framework, quality assurance paradigm [9], and
reliability modeling [51] from those in traditional
software development.
A recent trend in software architecture is that as
information engineering is becoming the central focus
for todays businesses, service-oriented systems and
the associated software engineering will be the de facto
standards for business development.
Service
orientation requires seamless integration of
heterogeneous components and their interoperability
for proper service creation and delivery. In a serviceoriented framework, new paradigms for system
organizations and software architectures are needed for
ensuring adequate decoupling of components, swift
discovery of applications, and reliable delivery of
services. Such emerging software architectures include
cross-platform techniques [5], open-world software [3],
service-oriented architectures [32], and Web
applications [22]. Although some modeling approaches
have been proposed to estimate the reliability for
specific Web systems [49], SRE techniques for general
Web services and other service-oriented architectures
require more research work.

4. Possible future directions


SRE activities span the whole software lifecycle. We
discuss possible future directions with respect to five
areas: software architecture, design, testing, metrics
and emerging applications.

4.1. Reliability for software architectures and


off-the-shelf components
Due to the ever-increasing complexity of software
systems, modern software is seldom built from scratch.
Instead, reusable components have been developed and
employed, formally or informally. On the one hand,
revolutionary and evolutionary object-oriented design
and programming paradigms have vigorously pushed
software reuse. On the other hand, reusable software
libraries have been a deciding factor regarding whether
a software development environment or methodology
would be popular or not. In the light of this shift,
reliability engineering for software development is

focusing on two major aspects: software architecture,


and component-based software engineering.
The software architecture of a system consists of
software components, their external properties, and
their relationships with one another. As software
architecture is the foundation of the final software
product, the design and management of software
architecture is becoming the dominant factor in
software reliability engineering research.
Welldesigned software architecture not only provides a
strong, reliable basis for the subsequent software
development and maintenance phases, but also offers
various options for fault avoidance and fault tolerance
in achieving high reliability. Due to the cardinal
importance of, and complexity involved in, software
architecture design and modeling, being a good
software architect is a rare talent that is highly
demanded. A good software architect sees widely and
thinks deeply, as the components should eventually fit
together in the overall framework, and the anticipation
of change has to be considered in the architecture
design. A clean, carefully laid out architecture
requires up-front investments in various design
considerations, including high cohesion, low coupling,
separation of modules, proper system closure, concise
interfaces, avoidance of complexity, etc.
These
investments, however, are worthwhile since they
eventually help to increase software reliability and
reduce operation and maintenance costs.
One central research issue for software architecture
concerning reliability is the design of failure-resilient
architecture. This requires an effective software
architecture design which can guarantee separation of
components when software executes.
When
component failures occur in the system, they can then
be quickly identified and properly contained. Various
techniques can be explored in such a design. For
example, memory protection prevents interference and
failure propagation between different application
processes.
Guaranteed
separation
between
applications has been a major requirement for the
integration of multiple software services in
complicated modern systems. It should be noted that
the separation methods can support one another, and
usually they are combined for achieve better reliability
returns. Exploiting this synergy for reliability
assessment is a possibility for further exploration.
In designing failure-resilient architecture, additional
resources and techniques are often engaged. For
example, error handling mechanisms for fault detection,
diagnosis, isolation, and recovery procedures are
incorporated to tolerate component failures; however,

these mechanisms will themselves have some impact


on the system. Software architecture has to take this
impact into consideration. On the one hand, the added
reliability-enhancement routines should not introduce
unnecessary complexity, making them error-prone,
which would decrease the reliability instead of
increasing it. On the other hand, these routines should
be made unintrusive while they monitor the system,
and they should not further jeopardize the system
while they are carrying out recovery functions.
Designing concise, simple, yet effective mechanisms to
perform fault detection and recovery within a general
framework is an active research topic for researchers.
While software architecture represents the product
view of software systems, component-based software
engineering addresses the process view of software
engineering. In this popular software development
technique, many research issues are identified, such as
the following. How can reliable general reusable
components be identified and designed? How can
existing components be modified for reusability? How
can a clean interface design be provided for
components so that their interactions are fully under
control? How can defensive mechanisms be provided
for the components so that they are protected from
others, and will not cause major failures? How can it
be determined whether a component is risk-free? How
can the reliability of a component be assessed under
untested yet foreseeable operational conditions? How
can the interactions of components be modeled if they
cannot be assumed independent? Component-based
software engineering allows structure-based reliability
to be realized, which facilitates design for reliability
before the software is implemented and tested. The
dependencies among components will thus need to be
properly captured and modeled first.
These methods favor reliability engineering in
multiple ways. First of all, they directly increase
reliability by reducing the frequency and severity of
failures. Run-time protections may also detect faults
before they cause serious failures. After failures, they
make fault diagnosis easier, and thus accelerate
reliability improvements. For reliability assessment,
these failure prevention methods reduce the
uncertainties of application interdependencies or
unexpected environments. So, for instance, having
sufficient separation between running applications
ensures that when we port an application to a new
platform, we can trust its failure rate to equal that
experienced in a similar use on a previous platform
plus that of the new platform, rather than being also
affected by the specific combination of other
applications present on the new platform. Structure-

based reliability models can then be employed with


this system aspect in place. With this modeling
framework assisted by well-engineered software
architecture, the range of applicability of structurebased models can further be increased. Examples of
new applications could be to specify and investigate
failure dependence between components, to cope with
wide variations of reliability depending on the usage
environment, and to assess the impact of system risk
when components are checked-in or checked-out of the
system.

4.2. Achieving design for reliability


To achieve reliable system design, fault tolerance
mechanism needs to be in place. A typical response to
system or software faults during operation includes a
sequence of stages: Fault confinement, Fault detection,
Diagnosis, Reconfiguration, Recovery, Restart, Repair,
and Reintegration. Modern software systems pose
challenging research issues in these stages, which are
described as follows:
1. Fault confinement. This stage limits the spread of
fault effects to one area of the system, thus preventing
contamination of other areas. Fault-confinement can be
achieved through use of self-checking acceptance tests,
exception handling routines, consistency checking
mechanisms, and multiple requests/confirmations. As
the erroneous system behaviours due to software faults
are typically unpredictable, reduction of dependencies
is the key to successful confinement of software faults.
This has been an open problem for software reliability
engineering, and will remain a tough research
challenge.
2. Fault detection. This stage recognizes that
something unexpected has occurred in the system.
Fault latency is the period of time between the
occurrence of a software fault and its detection. The
shorter it is, the better the system can recover.
Techniques fall in two classes: off-line and on-line.
Off-line techniques such as diagnostic programs can
offer comprehensive fault detection, but the system
cannot perform useful work while under test. On-line
techniques, such as watchdog monitors or redundancy
schemes, provide a real-time detection capability that
is performed concurrently with useful work.
3. Diagnosis. This stage is necessary if the fault
detection technique does not provide information about
the failure location and/or properties. On-line, failureprevention diagnosis is the research trend. When the
diagnosis indicates unhealthy conditions in the system
(such as low available system resources), software

rejuvenation can be performed to achieve in-time


transient failure prevention.
4. Reconfiguration. This stage occurs when a fault is
detected and a permanent failure is located. The system
may reconfigure its components either to replace the
failed component or to isolate it from the rest of the
system. Successful reconfiguration requires robust and
flexible software architecture and the associated
reconfiguration schemes.
5. Recovery. This stage utilizes techniques to
eliminate the effects of faults. Two basic recovery
approaches are based on: fault masking, retry and
rollback. Fault-masking techniques hide the effects of
failures by allowing redundant, correct information to
outweigh the incorrect information. To handle design
(permanent) faults, N-version programming can be
employed. Retry, on the other hand, attempts a second
try at an operation and is based on the premise that
many faults are transient in nature. A recovery blocks
approach is engaged to recover from software design
faults in this case. Rollback makes use of the system
operation having been backed up (checkpointed) to
some point in its processing prior to fault detection and
operation recommences from this point. Fault latency
is important here because the rollback must go back far
enough to avoid the effects of undetected errors that
occurred before the detected error. The effectiveness
of design diversity as represented by N-version
programming and recovery blocks, however, continues
to be actively debated.
6. Restart. This stage occurs after the recovery of
undamaged information. Depending on the way the
system is configured, hot restart, warm restart, or cold
restart can be achieved. In hot restart, resumption of
all operations from the point of fault detection can be
attempted, and this is possible only if no damage has
occurred. In warm restart, only some of the processes
can be resumed without loss; while in cold restart,
complete reload of the system is performed with no
processes surviving.
7. Repair. In this stage, a failed component is
replaced. Repair can be off-line or on-line. In off-line
repair, if proper component isolation can be achieved,
the system will continue as the failed component can
be removed for operation. Otherwise, the system must
be brought down to perform the repair, and so the
system availability and reliability depends on how fast
a fault can be located and removed. In on-line repair
the component may be replaced immediately with a
backup spare (in a procedure equivalent to
reconfiguration) or operation may continue without the
faulty component (for example, masking redundancy

or graceful degradation). With on-line repair, system


operation is not interrupted; however, achieving
complete and seamless repair poses a major challenge
to researchers.
8. Reintegration. In this stage the repaired module
must be reintegrated into the system. For on-line repair,
reintegration must be performed without interrupting
system operation.
Design for reliability techniques can further be
pursued in four different areas: fault avoidance, fault
detection, masking redundancy, and dynamic
redundancy. Non-redundant systems are fault
intolerant and, to achieve reliability, generally use fault
avoidance techniques. Redundant systems typically use
fault detection, masking redundancy, and dynamic
redundancy to automate one or more of the stages of
fault handling. The main design consideration for
software fault tolerance is cost-effectiveness. The
resulting design has to be effective in providing better
reliability, yet it should not introduce excessive cost,
including performance penalty and unwarranted
complexity, which may eventually prove unworthy of
the investigation.

4.3. Testing for reliability assessment


Software testing and software reliability have
traditionally belonged to two separate communities.
Software testers test software without referring to how
software will operate in the field, as often the
environment cannot be fully represented in the
laboratory. Consequently they design test cases for
exceptional and boundary conditions, and they spend
more time trying to break the software than conducting
normal operations. Software reliability measurers, on
the other hand, insist that software should be tested
according to its operational profile in order to allow
accurate reliability estimation and prediction. In the
future, it will be important to bring the two groups
together, so that on the one hand, software testing can
be effectively conducted, while on the other hand,
software reliability can be accurately measured. One
approach is to measure the test compression factor,
which is defined as the ratio between the mean time
between failures during operation and during testing.
This factor can be empirically determined so that
software reliability in the field can be predicted from
that estimated during testing. Another approach is to
ascertain how other testing related factors can be
incorporated into software reliability modeling, so that
accurate measures can be obtained based on the
effectiveness of testing efforts.

Recent studies have investigated the effect of code


coverage on fault detection under different testing
profiles, using different coverage metrics, and have
studied its application in reducing test set size [30].
Experimental data are required to evaluate code
coverage and determine whether it is a trustworthy
indicator for the effectiveness of a test set with respect
to fault detection capability. Also, the effect of code
coverage on fault detection may vary under different
testing profiles. The correlation between code
coverage and fault coverage should be examined
across different testing schemes, including function
testing, random testing, normal testing, and exception
testing. In other words, white box testing and black
box testing should be crosschecked for their
effectiveness in exploring faults, and thus yielding
reliability increase.
Furthermore, evidence for variation between different
coverage metrics can also established. Some metrics
may be independent and some correlated.
The
quantitative relationship between different code
coverage metrics and fault detection capability should
be assessed, so that redundant metrics can be removed,
and orthogonal ones can be combined. New findings
about the effect of code coverage and other metrics on
fault detection can be used to guide the selection and
evaluation of test cases under various testing profiles,
and a systematic testing scheme with predictable
reliability achievement can therefore be derived.
Reducing test set size is a key goal in software testing.
Different testing metrics should be evaluated regarding
whether they are good filters in reducing the test set
size, while maintaining the same effectiveness in
achieving reliability.
This assessment should be
conducted under various testing scenarios [8]. If such
a filtering capability can be established, then the
effectiveness of test cases can be quantitatively
determined when they are designed. This would allow
the prediction of reliability growth with the creation a
test set before it is executed on the software, thus
facilitating early reliability prediction and possible
feedback control for better test set design schemes.
Other than linking software testing and reliability
with code coverage, statistical learning techniques may
offer another promising avenue to explore.
In
particular, statistical debugging approaches [26, 52],
whose original purpose was to identify software faults
with probabilistic modeling of program predicates, can
provide a fine quantitative assessment of program
codes with respect to software faults. They can
therefore help to establish accurate software reliability

prediction models based on program structures under


testing.

4.4. Metrics for reliability prediction


Today it is almost a mandate for companies to collect
software metrics as an indication of a maturing
software development process. While it is not hard to
collect metrics data, it is not easy to collect clean and
consistent data. It is even more difficult to derive
meaningful results from the collected metrics data.
Collecting metrics data for software reliability
prediction purposes across various projects and
applications is a major challenge. Moreover, industrial
software engineering data, particularly those related to
system failures, are historically hard to obtain across a
range of organizations. It will be important for a
variety of sources (such as NASA, Microsoft, IBM,
Cisco, etc.) across industry and academia to make
available real-failure data for joint investigation to
establish credible reliability analysis procedures. Such
a joint effort should define (1) what data to collect by
considering domain sensitivities, accessibility, privacy,
and utility; (2) how to collect data in terms of tools and
techniques; and (3) how to interpret and analyze the
data using existing techniques.
In addition to industrial data collection efforts, novel
methods to improve reliability prediction are actively
being researched. For example, by extracting rich
information from metrics data using a sound statistical
and probability foundation, Bayesian Belief Networks
(BBNs) offer a promising direction for investigation in
software engineering [7]. BBNs provide an attractive
formalism for different software cases. The technique
allows software engineers to describe prior knowledge
about software development quality and software
verification and validation (SV&V) quality, with
manageable visual descriptions and automated
inferences. The software reliability process can then be
modified with inference from observed failures, and
future reliability can be predicted. With proper
engagement of software metrics, this is likely to be a
powerful tool for reliability assessment of software
based systems, finding applications in predicting
software defects, forecasting software reliability, and
determining runaway projects [1].
Furthermore, traditional reliability models can be
enhanced to incorporate some testing completeness or
effectiveness metrics, such as code coverage, as well
as their traditional testing-time based metrics. The key
idea is that failure detection is not only related to the
time that the software is under testing, but also what
fraction of the code has been executed by the testing.

The effect of testing time on reliability can be


estimated using distributions from traditional SRGMs.
However, new models are needed to describe the effect
of coverage on reliability. These two dimensions,
testing time and coverage, are not orthogonal. The
degree of dependency between them is thus an open
problem for investigation.
Formulation of new
reliability models which integrate time and coverage
measurements for reliability prediction would be a
promising direction.
One drawback of the current metrics and data
collection process is that it is a one-way, open-loop
avenue: while metrics of the development process can
indicate or predict the outcome quality, such as the
reliability, of the resulting product, they often cannot
provide feedback to the process regarding how to
make improvement.
Metrics would present
tremendous benefits to reliability engineering if they
could achieve not just prediction, but also refinement.
Traditional software reliability models take metrics
(such as defect density or times between failures) as
input and produce reliability quantity as the output. In
the future, a reverse function is urgently called for:
given a reliability goal, what should the reliability
process (and the resulting metrics) look like? By
providing such feedback, it is expected that a closedloop software reliability engineering process can be
informative as well as beneficial in achieving
predictably reliable software.

4.5. Reliability
applications

for

emerging

software

Software engineering targeted for general systems


may be too ambitious. It may find more successful
applications if it is domain-specific. In this Future of
Software Engineering volume, future software
engineering techniques for a number of emerging
application domains have been thoroughly discussed.
Emerging software applications also create abundant
opportunities
for
domain-specific
reliability
engineering.
One key industry in which software will have a
tremendous presence is the service industry. Serviceoriented design has been employed since the 1990s in
the telecommunications industry, and it reached
software engineering community as a powerful
paradigm for Web service development, in which
standardized interfaces and protocols gradually
enabled the use of third-party functionality over the
Internet, creating seamless vertical integration and
enterprise process management for cross-platform,
cross-provider, and cross-domain applications. Based

on the future trends for Web application development


as laid out in [22], software reliability engineering for
this emerging technique poses enormous challenges
and opportunities. The design of reliable Web services
and the assessment of Web service reliability are novel
and open research questions. On the one hand, having
abundant service providers in a Web service makes the
design diversity approach suddenly appealing, as the
diversified service design is perceived not as cost, but
as an available resource. On the other hand, this
unplanned diversity may not be equipped with the
necessary quality, and the compatibility among various
service providers can pose major problems. Seamless
Web service composition in this emerging application
domain is therefore a central issue for reliability
engineering. Extensive experiments are required in the
area of measurement of Web service reliability. Some
investigations have been initiated with limited success
[27], but more efforts are needed.
Researchers have proposed the publish/subscribe
paradigm as a basis for middleware platforms that
support software applications composed of highly
evolvable and dynamic federations of components. In
this approach, components do not interact with each
other directly; instead an additional middleware
mediates their communications. Publish/subscribe
middleware decouples the communication among
components and supports implicit bindings among
components. The sender does not know the identity of
the receivers of its messages, but the middleware
identifies them dynamically.
Consequently new
components can dynamically join the federation,
become immediately active, and cooperate with the
other
components
without
requiring
any
reconfiguration of the architecture. Interested readers
can refer to [21] for future trends in middleware-based
software engineering technologies.
The open system approach is another trend in
software applications. Closed-world assumptions do
not hold in an increasing number of cases, especially in
ubiquitous and pervasive computing settings, where
the world is intrinsically open. Applications cover a
wide range of areas, from dynamic supply-chain
management, dynamic enterprise federations, and
virtual endeavors, on the enterprise level, to
automotive applications and home automation on the
embedded-systems level. In an open world, the
environment changes continuously. Software must
adapt and react dynamically to changes, even if they
are unanticipated. Moreover, the world is open to new
components that context changes could make
dynamically available for example, due to mobility.
Systems can discover and bind such components

dynamically to the application while it is executing.


The software must therefore exhibit a self-organization
capability. In other words, the traditional solution that
software designers adopted carefully elicit change
requests, prioritize them, specify them, design changes,
implement and test, then redeploy the software is no
longer viable. More flexible and dynamically
adjustable reliability engineering paradigms for rapid
responses to software evolution are required.

5. Conclusions
As the cost of software application failures grows and
as these failures increasingly impact business
performance, software reliability will become
progressively more important. Employing effective
software reliability engineering techniques to improve
product and process reliability would be the industrys
best interests as well as major challenges. In this paper,
we have reviewed the history of software reliability
engineering, the current trends and existing problems,
and specific difficulties. Possible future directions and
promising research problems in software reliability
engineering have also been addressed. We have laid
out the current and possible future trends for software
reliability engineering in terms of meeting industry and
customer needs. In particular, we have identified new
software reliability engineering paradigms by taking
software architectures, testing techniques, and software
failure manifestation mechanisms into consideration.
Some thoughts on emerging software applications have
also been provided.

References
[1] S. Amasaki, O. Mizuno, T. Kikuno, and Y. Takagi, A
Bayesian Belief Network for Predicting Residual Faults in
Software Products, Proceedings of 14th International
Symposium
on
Software
Reliability
Engineering
(ISSRE2003), November 2003, pp. 215-226,
[2] ANSI/IEEE, Standard Glossary of Software Engineering
Terminology, STD-729-1991, ANSI/IEEE, 1991.
[3] L. Baresi, E. Nitto, and C. Ghezzi, Toward Open-World
Software: Issues and Challenges, IEEE Computer, October
2006, pp. 36-43.
[4] A. Bertolino, Software Testing Research: Achievements,
Challenges, Dreams, Future of Software Engineering 2007,
L. Briand and A. Wolf (eds.), IEEE-CS Press, 2007.
[5] J. Bishop and N. Horspool, Cross-Platform
Development: Software That Lasts, IEEE Computer,
October 2006, pp. 26-35.
[6] L. Briand and D. Pfahl, Using Simulation for Assessing
the Real Impact of Test Coverage on Defect Coverage,

IEEE Transactions on Reliability, vol. 49, no. 1, March 2000,


pp. 60-70.

Efficiency, IEEE Transactions on Reliability, vol. 54, no. 4,


December 2005, pp. 583-591.

[7] J. Cheng, D.A. Bell, and W. Liu, Learning Belief


Networks from Data: An Information Theory Based
Approach, Proceedings of the Sixth International
Conference on Information and Knowledge Management,
Las Vegas, 1997, pp. 325-331.

[19] C.Y. Huang, M.R. Lyu, and S.Y. Kuo, "A Unified
Scheme of Some Non-Homogeneous Poisson Process
Models for Software Reliability Estimation," IEEE
Transactions on Software Engineering, vol. 29, no. 3, March
2003, pp. 261-269.

[8] X. Cai and M.R. Lyu, The Effect of Code Coverage on


Fault Detection Under Different Testing Profiles, ICSE
2005 Workshop on Advances in Model-Based Software
Testing (A-MOST), St. Louis, Missouri, May 2005.

[20] W.S. Humphrey, The Future of Software Engineering:


I, Watts New Column, News at SEI, vol. 4, no. 1, March,
2001.

[9] X. Cai, M.R. Lyu, and K.F. Wong, A Generic


Environment for COTS Testing and Quality Prediction,
Testing Commercial-off-the-shelf Components and Systems,
S. Beydeda and V. Gruhn (eds.), Springer-Verlag, Berlin,
2005, pp. 315-347.
[10] X. Cai, M.R. Lyu, and M.A. Vouk, An Experimental
Evaluation on Reliability Features of N-Version
Programming, in Proceedings 16th International
Symposium
on
Software
Reliability
Engineering
(ISSRE2005), Chicago, Illinois, Nov. 8-11, 2005.
[11] X. Cai and M.R. Lyu, An Empirical Study on
Reliability and Fault Correlation Models for Diverse
Software Systems, in Proceedings 15th International
Symposium
on
Software
Reliability
Engineering
(ISSRE2004), Saint-Malo, France, Nov. 2004, pp.125-136.
[12] M. Chen, M.R. Lyu, and E. Wong, Effect of Code
Coverage on Software Reliability Measurement, IEEE
Transactions on Reliability, vol. 50, no. 2, June 2001,
pp.165-170.
[13] M.H. Chen, A.P. Mathur, and V.J. Rego, Effect of
Testing Techniques on Software Reliability Estimates
Obtained Using Time Domain Models, In Proceedings of
the 10th Annual Software Reliability Symposium, Denver,
Colorado, June 1992, pp. 116-123.
[14] J.B. Dugan and M.R. Lyu, Dependability Modeling for
Fault-Tolerant Software and Systems, in Software Fault
Tolerance, M. R. Lyu (ed.), New York: Wiley, 1995, pp.
109138.
[15] D.E. Eckhardt and L.D. Lee, A Theoretical Basis for
the Analysis of Multiversion Software Subject to Coincident
Errors, IEEE Transactions on Software Engineering, vol. 11,
no. 12, December 1985, pp. 15111517.
[16] P.G. Frankl and E.J. Weyuker, An Applicable Family
of Data Flow Testing Criteria, IEEE Transactions on
Software Engineering, vol. 14, no. 10, October 1988, pp.
1483-1498.
[17] J.R. Horgan, S. London, and M.R. Lyu, Achieving
Software Quality with Testing Coverage Measures, IEEE
Computer, vol. 27, no.9, September 1994, pp. 60-69.
[18] C.Y. Huang and M.R. Lyu, Optimal Release Time for
Software Systems Considering Cost, Testing-Effort, and Test

[21] V. Issarny, M. Caporuscio, and N. Georgantas: A


Perspective on the Future of Middleware-Based Software
Engineering, Future of Software Engineering 2007, L.
Briand and A. Wolf (eds.), IEEE-CS Press, 2007.
[22] M. Jazayeri, Web Application Development: The
Coming Trends, Future of Software Engineering 2007, L.
Briand and A. Wolf (eds.), IEEE-CS Press, 2007.
[23] Z. Jelinski and P.B. Moranda, Software Reliability
Research, in Proceedings of the Statistical Methods for the
Evaluation of Computer System Performance, Academic
Press, 1972, pp. 465-484.
[24] B. Littlewood and L. Strigini, Software Reliability and
Dependability: A Roadmap, in Proceedings of the 22nd
International Conference on Software Engineering
(ICSE2000), Limerick, June 2000, pp. 177-188.
[25] B. Littlewood and D. Miller, Conceptual Modeling of
Coincident Failures in Multiversion Software, IEEE
Transactions on Software Engineering, vol. 15, no. 12,
December 1989, pp. 15961614.
[26] C. Liu, L. Fei, X. Yan, J. Han, and S. Midkiff,
Statistical Debugging: A Hypothesis Testing-based
Approach, IEEE Transaction on Software Engineering, vol.
32, no. 10, October, 2006, pp. 831-848.
[27] N. Looker and J. Xu, Assessing the Dependability of
SOAP-RPC-Based Web Services by Fault Injection, in
Proceedings of 9th IEEE International Workshop on Objectoriented Real-time Dependable Systems, 2003, pp. 163-170.
[28] M.R. Lyu (ed.), Handbook of Software Reliability
Engineering, IEEE Computer Society Press and McGrawHill, 1996.
[29] M.R. Lyu and X. Cai, Fault-Tolerant Software,
Encyclopedia on Computer Science and Engineering,
Benjamin Wah (ed.), Wiley, 2007.
[30] M.R. Lyu, Z. Huang, S. Sze, and X. Cai, An Empirical
Study on Testing and Fault Tolerance for Software
Reliability Engineering, in Proceedings 14th IEEE
International
Symposium
on
Software
Reliability
Engineering (ISSRE'2003), Denver, Colorado, November
2003, pp.119-130.
[31] Y.K. Malaiya, N. Li, J.M. Bieman, and R. Karcich,
Software Reliability Growth with Test Coverage, IEEE

Transactions on Reliability, vol. 51, no. 4, December 2002,


pp. 420-426.

Symposium, IEEE Computer Society Press, Los Alamitos,


CA, 2000, pp. 270-279.

[32] T. Margaria and B. Steffen, Service Engineering:


Linking Business and IT, IEEE Computer, October 2006,
pp. 45-55.

[47] A. von Mayrhauser and D. Chen, Effect of Fault


Distribution and Execution Patterns on Fault Exposure in
Software: A Simulation Study, Software Testing,
Verification & Reliability, vol. 10, no.1, March 2000, pp. 4764.

[33] J.D. Musa, Software Reliability Engineering: More


Reliable Software Faster and Cheaper (2nd Edition),
AuthorHouse, 2004.
[34] J.D. Musa, Operational Profiles in Software Reliability
Engineering, IEEE Software, Volume 10, Issue 2, March
1993, pp. 14-32.
[35] J.D. Musa, A. Iannino, and K. Okumoto, Software
Reliability: Measurement, Prediction, Application, McGrawHill, Inc., New York, NY, 1987.
[36] H. Pham, Software Reliability, Springer, Singapore,
2000.
[37] P.T. Popov, L. Strigini, J. May, and S. Kuball,
Estimating Bounds on the Reliability of Diverse Systems,
IEEE Transactions on Software Engineering, vol. 29, no. 4,
April 2003, pp. 345359.
[38] S. Rapps and E.J. Weyuker, Selecting Software Test
Data Using Data Flow Information, IEEE Transactions on
Software Engineering, vol. 11, no. 4, April 1985, pp. 367375.
[39] Rome Laboratory (RL), Methodology for Software
Reliability Prediction and Assessment, Technical Report RLTR-92-52, volumes 1 and 2, 1992.
[40] M.L. Shooman, Reliability of Computer Systems and
Networks: Fault Tolerance, Analysis and Design, Wiley,
New York, 2002.
[41] R. Taylor and A. van der Hoek, Software Design and
Architecture: The Once and Future Focus of Software
Engineering, Future of Software Engineering 2007, L.
Briand and A. Wolf (eds.), IEEE-CS Press, 2007.
[42] X. Teng, H. Pham, and D. Jeske, Reliability Modeling
of Hardware and Software Interactions, and Its
Applications, IEEE Transactions on Reliability, vol. 55, no.
4, Dec. 2006, pp. 571-577.
[43] L.A. Tomek and K.S. Trivedi, Analyses Using
Stochastic Reward Nets, in Software Fault Tolerance, M.R.
Lyu (ed.), New York: Wiley, 1995, pp. 139165.
[44] W. Torres-Pomales, Software Fault Tolerance: A
Tutorial, NASA Langley Research Center, Hampton,
Virginia, TM-2000-210616, Oct. 2000.
[45] K.S. Trivedi, SHARPE 2002: Symbolic Hierarchical
Automated Reliability and Performance Evaluator, in
Proceedings International Conference on Dependable
Systems and Networks, 2002.
[46] K.S. Trivedi, K. Vaidyanathan, and K. GosevaPostojanova, "Modeling and Analysis of Software Aging and
Rejuvenation", in Proceedings of 33 rd Annual Simulation

[48] M.A. Vouk, Using Reliability Models During Testing


With Nonoperational Profiles, in Proceedings of 2nd
Bellcore/Purdue Workshop on Issues in Software Reliability
Estimation, October 1992, pp. 103-111.
[49] W. Wang and M. Tang, User-Oriented Reliability
Modeling for a Web System, in Proceedings of the 14th
International
Symposium
on
Software
Reliability
Engineering (ISSRE03), Denver, Colorado, November 2003,
pp.1-12.
[50] M. Xie, Software Reliability Modeling, World Scientific
Publishing Company, 1991.
[51] S. Yacoub, B. Cukic, and H Ammar, A Scenario-Based
Reliability Analysis Approach for Component-Based
Software, IEEE Transactions on Reliability, vol. 53, no. 4,
2004, pp. 465-480.
[52] A.X. Zheng, M.I. Jordan, B. Libit, M. Naik, and A.
Aiken, Statistical Debugging: Simultaneous Identification
of Multiple Bugs, in Proceedings of the 23rd International
Conference on Machine Learning, Pittsburgh, PA, 2006, pp.
1105-1112.

23/11/2012

Software Reliability Testing - Wikipedia, the free encyclopedia

Software Reliability Testing


From Wikipedia, the free encyclopedia

Software reliability testing is one of the testing field, which deals with checking the ability of software to
function under given environmental conditions for particular amount of time by taking into account all precisions
of the software. In Software Reliability Testing, the problems are discovered regarding the software design and
functionality and the assurance is given that the system meets all requirements. Software Reliability is the
probability that software will work properly in specified environment and for given time.
Probability = Number of cases when we find failure / Total number of cases under consideration
Using this formula, failure probability is calculated by testing a sample of all available input states. The set of all
possible input states is called as input space. To find reliability of software, we need to find output space from
given input space and software.[1]

Contents
1 Overview
2 Objective of reliability testing
2.1 Secondary objectives
2.2 Points for defining objectives
3 Need of reliability testing
4 Types of reliability testing
4.1 Feature test
4.2 Load test
4.3 Regression test
5 Tests planning
5.1 Steps for planning
5.2 Problems in designing test cases
6 Reliability enhancement through testing
6.1 Reliability growth testing
6.2 Designing test cases for current release
7 Reliability evaluation based on operational testing
7.1 Reliability growth assessment and prediction
7.2 Reliability estimation based on failure-free working
8 See also
9 References
10 External links

Overview
To perform software testing, it is necessary to design the test cases and test procedure for each software
module. Data is gathered from various stages of development for reliability testing,like design stage, Operating
stage etc. The tests are limited because of some restrictions, like Cost of test performing and time restrictions.
Statistical samples are obtained from the software products to test for reliability of the software. when sufficient
data or information is gathered then statistical studies are done. time constraints are handled by applying fix
en.wikipedia.org/wiki/Software_Reliability_Testing

1/6

23/11/2012

Software Reliability Testing - Wikipedia, the free encyclopedia

dates or deadlines to the tests to be performed,after this phase designed of the software is stopped and actual
implementations started.As there are restriction on cost and time the data is gathered carefully so that each data
has some purpose and it gets expected precision.[2] To achieve the satisfactory results from reliability testing one
must take care of some reliability characteristics. for example Mean Time to Failure (MTTF)[3] is measured in
terms of three factors
1. Operating Time.
2. Number of on off cycles.
3. Calendar Time.
If the restrictions are on Operation time or if focus is on first point for improvement then one can apply
compressed time accelerations to reduce the test time. If the focus is on calendar time that is there are
predefined deadlines,then intensified stress testing is used.[2]
Software Reliability is measured in terms of Mean Time Between Failure(MTBF).[4]
MTBF consisting of mean time to failure (MTTF) and mean time to repair(MTTR). MTTF means difference of
time in two consecutive failures and MTTR is the time required to fix the failure.[5] .Reliability for good software
should be always between 0 to 1.Reliability increases when the errors or bugs from the programs are
removed.[6]
e.g. MTBF = 1000 hours for average software, then software should work for 1000 hrs for continuous
operations.

Objective of reliability testing


The main objective of the reliability testing is to test the performance of the software under given conditions
without any type of corrective measure with known fixed procedures considering its specifications.

Secondary objectives
1.
2.
3.
4.
5.

To find perceptual structure of repeating failures.


To find the number of failures occurring in specified amount of time.
To find the mean life of the software.
To know the main cause of failure.
After taking preventive actions checking the performance of different units of software.

Points for defining objectives


1. Behaviour of software should be defined in given conditions.
2. The objective should be feasible.
3. Time constraints should be provided.[7]

Need of reliability testing


Nowadays in maximum number of fields we find the application of computer software. Also some software are
used in many critical applications like in industries, in military, in commercial systems etc. For these software
from last century software engineering is developing. There is no complete measure to assess them . But to
assess them software reliability measure are used as tool. So software reliability is the most important aspect of
en.wikipedia.org/wiki/Software_Reliability_Testing

2/6

23/11/2012

Software Reliability Testing - Wikipedia, the free encyclopedia

any software.[8]
To improve the performance of software product and software development process through assessment of
reliability is required. Reliability testing is of great use for software managers and software practitioners. Thus
ultimately testing reliability of a software is important.[9]

Types of reliability testing


Software Reliability Testing requires to check features provided by the software,the load that software can
handle and regression testing.[10]

Feature test
feature test for software conducts in following steps
Each operation in the software is executed once.
Interaction between the two operations is reduced and
Each operation each checked for its proper execution.
feature test is followed by the load test.[10]

Load test
This test is conducted to check the performance of the software under maximum work load. Any software
performs better up to some extent of load on it after which the response time of the software starts degrading.
For example, a web site can be tested to see till how many simultaneous users it can function without
performance degradation. This testing mainly helps for Databases and Application servers.Load testing also
requires to do software performance testing where it checks that how well some software performs under
workload.[10]

Regression test
Regression testing is used to check if any bug fixing in the software introduced new bug. One part of the
software affects the other is determined. Regression testing is conducted after every change in the software
features. This testing is periodic. The period depends on the length and features of software.[10]

Tests planning
Reliability testing costs more as compare to other types of testing. Thus while doing reliability testing proper
management and planning is required. This plan includes testing process to be implemented, data about its
environment, test schedule, test points etc.

Steps for planning


1.
2.
3.
4.
5.

Find main aim of testing.


Know the requirements of testing.
Have a look over existing data and check for the requirements.
Considering priorities of test find out necessary tests.
Utilize time constraints, available money and manpower properly.

en.wikipedia.org/wiki/Software_Reliability_Testing

3/6

23/11/2012

6.
7.
8.
9.

Software Reliability Testing - Wikipedia, the free encyclopedia

Determine specifications of test.


Allot different responsibilities to testing teams.
Decide policies for providing report of testing.
Have control over testing procedure throughout testing procedure.[7]

Problems in designing test cases


There are some problem while going through this tests.
Test cases can be simply selected by selecting valid input values for each field of the software,but after
some changes in particular module the recorded input values again needs to check. Those values may not
test the new features introduced after older version of software.
There may be some critical runs in the software which are not handled by any test case.so careful test
case selection is necessary.[10]

Reliability enhancement through testing


Studies during development and design of software helps for reliability of product. Reliability testing is basically
performed to eliminate the failure mode of the software.life testing of the product should always done after the
design part is finished or at least complete design is finalize.[11] failure analysis and design improvement is
achieved through following testings.

Reliability growth testing


[11] This testing is used

to check new prototypes of the software which are initially supposed to fail
frequently.Failure causes are detected and actions are taken to reduce defects. suppose T is total accumulated
time for prototype.n(T) is number of failure from start to time T.The graph drawn for n(T)/T is a straight line.
This graph is called Duance Plot. one can get, how much reliability can be gained after all other cycles of test
and to fix it.

solving eq.1 for n(T),

where K is e^b. if value of alpha in the equation is zero the reliability can not be improved as expected for given
number of failure. for alpha greater than zero cumulative time T increases. this explains that number of the failure
doesn't depends on test lengths.

Designing test cases for current release


If in the current version of software release we are adding new operation,then writing a test case for that
operation is done differently.
first plan how many new test cases are to be written for current version.
If the new feature is part of any existing feature then share the test cases of new and existing features
among them.
en.wikipedia.org/wiki/Software_Reliability_Testing

4/6

23/11/2012

Software Reliability Testing - Wikipedia, the free encyclopedia

Finally combine all test cases from current version and previous one and record all the results.[10]
There is a predefined rule to calculate count of new test cases for the software. if N is the probability of
occurrence of new operations for new release of the software, R is the probability of occurrence of used
operations in the current release and T is the number of all previously used test cases then

Reliability evaluation based on operational testing


In reliability testing to test the reliability of the software use the method of operational testing. Here one checks
the working of software in its relevant operational environment. But constructing such an operational
environment is the main problem. Such type of simulation is observed in some industries like nuclear industries,
in aircraft etc. Predicting future reliability is a part of reliability evaluation. There are two techniques used for this:
Steady state reliability estimation
In this case we use the feed-backs of delivered software products. Depending on those results we predict
the future reliability of next version of product. It simply follows the way of sample testing for physical
products.
Reliability growth based prediction
This method uses the documentation of testing procedure. For example consider a developed software
and after that we are creating different new versions of that software. At that time we consider data about
testing of each version of that software and on the basis of that observed trend we predict the reliability of
software.[12]

Reliability growth assessment and prediction


In assessment and prediction of software reliability we use reliability growth model. During operation of software
data about its failure is stored in statistical form and is given as input to reliability growth model. Using that data
reliability growth model will evaluate the reliability of software. Lots of data about reliability growth model is
available with probability models claiming to represent failure process. But there is no model which best suited
for all conditions. So considering circumstances we have to chose one of the model. So today such type of
problem is overcome by using advanced techniques.

Reliability estimation based on failure-free working


In this case the reliability of the software is estimated with some assumptions like
If a bug is found out then it is sure that it is going to fix by someone.
Fixing of bug is not going to affect the reliability of software.
Each fix in the software is accurate.[12]

See also
Software testing
Load testing
en.wikipedia.org/wiki/Software_Reliability_Testing

5/6

23/11/2012

Software Reliability Testing - Wikipedia, the free encyclopedia

Regression testing
Reliability engineering

References
1. ^ Software Reliability. Hoang Pham.
2. ^ a b E.E.Lewis. Introduction to Reliability Engineering.
3. ^ "MTTF" (http://www.weibull.com/hotwire/issue94/relbasics94.htm) .
http://www.weibull.com/hotwire/issue94/relbasics94.htm.
4. ^ Roger Pressman. Software Engineering A Practitioner's Approach. McGrawHill.
5. ^ "Approaches to Reliability Testing & Setting of Reliability Test Objectives"
(http://www.softwaretestinggenius.com/articalDetails.php?qry=963) .
http://www.softwaretestinggenius.com/articalDetails.php?qry=963.
6. ^ Aditya P. Mathur. Foundations of Software Testing. Pearson publications.
7. ^ a b Reliability and life testing handbook. Dimitri kececioglu.
8. ^ A Statistical Basis for Software Reliability Assessment. M. xie.
9. ^ Software Reliability modelling. M. Xie.
10. ^ a b c d e f John D. Musa. Software reliability engineering: more reliable software, faster and cheaper.
McGraw-Hill. ISBN 0-07-060319-7.
11. ^ a b E.E.Liwis. Introduction to Reliability Engineering. ISBN 0-471-01833-3.
12. ^ a b "Problem of Assessing reliability". CiteSeerX: 10.1.1.104.9831
(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.9831) .

External links
Mean Time Between Failure (http://www.weibull.com/hotwire/issue94/relbasics94.htm/)
Software Life Testing (http://www.weibull.com/basics/accelerated.htm/)
Retrieved from "http://en.wikipedia.org/w/index.php?title=Software_Reliability_Testing&oldid=521833844"
Categories: Software testing
This page was last modified on 7 November 2012 at 15:12.
Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may
apply. See Terms of Use for details.
Wikipedia is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.

en.wikipedia.org/wiki/Software_Reliability_Testing

6/6

23/11/2012

Software performance testing - Wikipedia, the free encyclopedia

Software performance testing


From Wikipedia, the free encyclopedia

In software engineering, performance testing is in general testing performed to determine how a system
performs in terms of responsiveness and stability under a particular workload. It can also serve to investigate,
measure, validate or verify other quality attributes of the system, such as scalability, reliability and resource
usage.
Performance testing is a subset of performance engineering, an emerging computer science practice which
strives to build performance into the implementation, design and architecture of a system.

Contents
1 Performance testing types
1.1 Load testing
1.2 Stress testing
1.3 Endurance testing (soak testing)
1.4 Spike testing
1.5 Configuration testing
1.6 Isolation testing
2 Setting performance goals
2.1 Concurrency/throughput
2.2 Server response time
2.3 Render response time
2.4 Performance specifications
2.5 Questions to ask
3 Pre-requisites for Performance Testing
3.1 Test conditions
3.2 Timing
4 Tools
5 Technology
6 Tasks to undertake
7 Methodology
7.1 Performance testing web applications
8 See also
9 External links

Performance testing types


Load testing
Load testing is the simplest form of performance testing. A load test is usually conducted to understand the
behaviour of the system under a specific expected load. This load can be the expected concurrent number of
users on the application performing a specific number of transactions within the set duration. This test will give
out the response times of all the important business critical transactions. If the database, application server, etc.
en.wikipedia.org/wiki/Software_performance_testing

1/7

23/11/2012

Software performance testing - Wikipedia, the free encyclopedia

are also monitored, then this simple test can itself point towards any bottlenecks in the application software.

Stress testing
Stress testing is normally used to understand the upper limits of capacity within the system. This kind of test is
done to determine the system's robustness in terms of extreme load and helps application administrators to
determine if the system will perform sufficiently if the current load goes well above the expected maximum.

Endurance testing (soak testing)


Endurance testing is usually done to determine if the system can sustain the continuous expected load. During
endurance tests, memory utilization is monitored to detect potential leaks. Also important, but often overlooked
is performance degradation. That is, to ensure that the throughput and/or response times after some long period
of sustained activity are as good or better than at the beginning of the test. It essentially involves applying a
significant load to a system for an extended, significant period of time. The goal is to discover how the system
behaves under sustained use.

Spike testing
Spike testing is done by suddenly increasing the number of or load generated by, users by a very large amount
and observing the behaviour of the system. The goal is to determine whether performance will suffer, the system
will fail, or it will be able to handle dramatic changes in load.

Configuration testing
Rather than testing for performance from the perspective of load, tests are created to determine the effects of
configuration changes to the system's components on the system's performance and behaviour. A common
example would be experimenting with different methods of load-balancing.

Isolation testing
Isolation testing is not unique to performance testing but involves repeating a test execution that resulted in a
system problem. Often used to isolate and confirm the fault domain.

Setting performance goals


Performance testing can serve different purposes.
It can demonstrate that the system meets performance criteria.
It can compare two systems to find which performs better.
Or it can measure what parts of the system or workload causes the system to perform badly.
Many performance tests are undertaken without due consideration to the setting of realistic performance goals.
The first question from a business perspective should always be "why are we performance testing?". These
considerations are part of the business case of the testing. Performance goals will differ depending on the
system's technology and purpose however they should always include some of the following:

Concurrency/throughput
If a system identifies end-users by some form of log-in procedure then a concurrency goal is highly desirable. By
en.wikipedia.org/wiki/Software_performance_testing

2/7

23/11/2012

Software performance testing - Wikipedia, the free encyclopedia

definition this is the largest number of concurrent system users that the system is expected to support at any
given moment. The work-flow of your scripted transaction may impact true concurrency especially if the
iterative part contains the log-in and log-out activity.
If the system has no concept of end-users then performance goal is likely to be based on a maximum throughput
or transaction rate. A common example would be casual browsing of a web site such as Wikipedia.

Server response time


This refers to the time taken for one system node to respond to the request of another. A simple example would
be a HTTP 'GET' request from browser client to web server. In terms of response time this is what all load
testing tools actually measure. It may be relevant to set server response time goals between all nodes of the
system.

Render response time


A difficult thing for load testing tools to deal with as they generally have no concept of what happens within a
node apart from recognizing a period of time where there is no activity 'on the wire'. To measure render
response time it is generally necessary to include functional test scripts as part of the performance test scenario
which is a feature not offered by many load testing tools.

Performance specifications
It is critical to detail performance specifications (requirements) and document them in any performance test plan.
Ideally, this is done during the requirements development phase of any system development project, prior to any
design effort. See Performance Engineering for more details.
However, performance testing is frequently not performed against a specification i.e. no one will have expressed
what the maximum acceptable response time for a given population of users should be. Performance testing is
frequently used as part of the process of performance profile tuning. The idea is to identify the weakest link
there is inevitably a part of the system which, if it is made to respond faster, will result in the overall system
running faster. It is sometimes a difficult task to identify which part of the system represents this critical path, and
some test tools include (or can have add-ons that provide) instrumentation that runs on the server (agents) and
report transaction times, database access times, network overhead, and other server monitors, which can be
analyzed together with the raw performance statistics. Without such instrumentation one might have to have
someone crouched over Windows Task Manager at the server to see how much CPU load the performance
tests are generating (assuming a Windows system is under test).
Performance testing can be performed across the web, and even done in different parts of the country, since it is
known that the response times of the internet itself vary regionally. It can also be done in-house, although routers
would then need to be configured to introduce the lag what would typically occur on public networks. Loads
should be introduced to the system from realistic points. For example, if 50% of a system's user base will be
accessing the system via a 56K modem connection and the other half over a T1, then the load injectors
(computers that simulate real users) should either inject load over the same mix of connections (ideal) or simulate
the network latency of such connections, following the same user profile.
It is always helpful to have a statement of the likely peak numbers of users that might be expected to use the
system at peak times. If there can also be a statement of what constitutes the maximum allowable 95 percentile
response time, then an injector configuration could be used to test whether the proposed system met that
specification.
en.wikipedia.org/wiki/Software_performance_testing

3/7

23/11/2012

Software performance testing - Wikipedia, the free encyclopedia

Questions to ask
Performance specifications should ask the following questions, at a minimum:
In detail, what is the performance test scope? What subsystems, interfaces, components, etc. are in and
out of scope for this test?
For the user interfaces (UIs) involved, how many concurrent users are expected for each (specify peak
vs. nominal)?
What does the target system (hardware) look like (specify all server and network appliance
configurations)?
What is the Application Workload Mix of each system component? (for example: 20% log-in, 40%
search, 30% item select, 10% checkout).
What is the System Workload Mix? [Multiple workloads may be simulated in a single performance test]
(for example: 30% Workload A, 20% Workload B, 50% Workload C).
What are the time requirements for any/all back-end batch processes (specify peak vs. nominal)?

Pre-requisites for Performance Testing


A stable build of the system which must resemble the production environment as close as is possible.
The performance testing environment should not be clubbed with User acceptance testing (UAT) or
development environment. This is dangerous as if an UAT or Integration test or other tests are going on in the
same environment, then the results obtained from the performance testing may not be reliable. As a best practice
it is always advisable to have a separate performance testing environment resembling the production
environment as much as possible.

Test conditions
In performance testing, it is often crucial (and often difficult to arrange) for the test conditions to be similar to the
expected actual use. This is, however, not entirely possible in actual practice. The reason is that the workloads
of production systems have a random nature, and while the test workloads do their best to mimic what may
happen in the production environment, it is impossible to exactly replicate this workload variability - except in
the most simple system.
Loosely-coupled architectural implementations (e.g.: SOA) have created additional complexities with
performance testing. Enterprise services or assets (that share a common infrastructure or platform) require
coordinated performance testing (with all consumers creating production-like transaction volumes and load on
shared infrastructures or platforms) to truly replicate production-like states. Due to the complexity and financial
and time requirements around this activity, some organizations now employ tools that can monitor and create
production-like conditions (also referred as "noise") in their performance testing environments (PTE) to
understand capacity and resource requirements and verify / validate quality attributes.

Timing
It is critical to the cost performance of a new system, that performance test efforts begin at the inception of the
development project and extend through to deployment. The later a performance defect is detected, the higher
the cost of remediation. This is true in the case of functional testing, but even more so with performance testing,
due to the end-to-end nature of its scope. It is always crucial for performance test team to be involved as early
as possible. As key performance requisites e.g. performance test environment acquisition and preparation is
often a lengthy and time consuming process.
en.wikipedia.org/wiki/Software_performance_testing

4/7

23/11/2012

Software performance testing - Wikipedia, the free encyclopedia

Tools
In the diagnostic case, software engineers use tools such as profilers to measure what parts of a device or
software contributes most to the poor performance or to establish throughput levels (and thresholds) for
maintained acceptable response time.

Technology
Performance testing technology employs one or more PCs or Unix servers to act as injectors each emulating
the presence of numbers of users and each running an automated sequence of interactions (recorded as a script,
or as a series of scripts to emulate different types of user interaction) with the host whose performance is being
tested. Usually, a separate PC acts as a test conductor, coordinating and gathering metrics from each of the
injectors and collating performance data for reporting purposes. The usual sequence is to ramp up the load
starting with a small number of virtual users and increasing the number over a period to some maximum. The test
result shows how the performance varies with the load, given as number of users vs response time. Various
tools, are available to perform such tests. Tools in this category usually execute a suite of tests which will
emulate real users against the system. Sometimes the results can reveal oddities, e.g., that while the average
response time might be acceptable, there are outliers of a few key transactions that take considerably longer to
complete something that might be caused by inefficient database queries, pictures etc.
Performance testing can be combined with stress testing, in order to see what happens when an acceptable load
is exceeded does the system crash? How long does it take to recover if a large load is reduced? Does it fail in
a way that causes collateral damage?
Analytical Performance Modeling is a method to model the behaviour of an system in a spreadsheet. The model
is fed with measurements of transaction resource demands (CPU, disk I/O, LAN, WAN), weighted by the
transaction-mix (business transactions per hour). The weighted transaction resource demands are added-up to
obtain the hourly resource demands and divided by the hourly resource capacity to obtain the resource loads.
Using the responsetime formula (R=S/(1-U), R=responsetime, S=servicetime, U=load), responsetimes can be
calculated and calibrated with the results of the performance tests. Analytical performance modelling allows
evaluation of design options and system sizing based on actual or anticipated business usage. It is therefore much
faster and cheaper than performance testing, though it requires thorough understanding of the hardware
platforms.

Tasks to undertake
Tasks to perform such a test would include:
Decide whether to use internal or external resources to perform the tests, depending on inhouse expertise
(or lack thereof)
Gather or elicit performance requirements (specifications) from users and/or business analysts
Develop a high-level plan (or project charter), including requirements, resources, timelines and milestones
Develop a detailed performance test plan (including detailed scenarios and test cases, workloads,
environment info, etc.)
Choose test tool(s)
Specify test data needed and charter effort (often overlooked, but often the death of a valid performance
test)
Develop proof-of-concept scripts for each application/component under test, using chosen test tools and
strategies
en.wikipedia.org/wiki/Software_performance_testing

5/7

23/11/2012

Software performance testing - Wikipedia, the free encyclopedia

Develop detailed performance test project plan, including all dependencies and associated time-lines
Install and configure injectors/controller
Configure the test environment (ideally identical hardware to the production platform), router
configuration, quiet network (we dont want results upset by other users), deployment of server
instrumentation, database test sets developed, etc.
Execute tests probably repeatedly (iteratively) in order to see whether any unaccounted for factor might
affect the results
Analyze the results - either pass/fail, or investigation of critical path and recommendation of corrective
action

Methodology
Performance testing web applications
According to the Microsoft Developer Network the Performance Testing Methodology
(http://msdn2.microsoft.com/en-us/library/bb924376.aspx) consists of the following activities:
Activity 1. Identify the Test Environment. Identify the physical test environment and the production
environment as well as the tools and resources available to the test team. The physical environment
includes hardware, software, and network configurations. Having a thorough understanding of the entire
test environment at the outset enables more efficient test design and planning and helps you identify testing
challenges early in the project. In some situations, this process must be revisited periodically throughout
the projects life cycle.
Activity 2. Identify Performance Acceptance Criteria. Identify the response time, throughput, and
resource utilization goals and constraints. In general, response time is a user concern, throughput is a
business concern, and resource utilization is a system concern. Additionally, identify project success
criteria that may not be captured by those goals and constraints; for example, using performance tests to
evaluate what combination of configuration settings will result in the most desirable performance
characteristics.
Activity 3. Plan and Design Tests. Identify key scenarios, determine variability among representative
users and how to simulate that variability, define test data, and establish metrics to be collected.
Consolidate this information into one or more models of system usage to be implemented, executed, and
analyzed.
Activity 4. Configure the Test Environment. Prepare the test environment, tools, and resources
necessary to execute each strategy as features and components become available for test. Ensure that the
test environment is instrumented for resource monitoring as necessary.
Activity 5. Implement the Test Design. Develop the performance tests in accordance with the test
design.
Activity 6. Execute the Test. Run and monitor your tests. Validate the tests, test data, and results
collection. Execute validated tests for analysis while monitoring the test and the test environment.
Activity 7. Analyze Results, Tune, and Retest. Analyse, Consolidate and share results data. Make a
tuning change and retest. Improvement or degradation? Each improvement made will return smaller
improvement than the previous improvement. When do you stop? When you reach a CPU bottleneck,
the choices then are either improve the code or add more CPU.

See also
Stress testing (software)
en.wikipedia.org/wiki/Software_performance_testing

6/7

23/11/2012

Software performance testing - Wikipedia, the free encyclopedia

Benchmark (computing)
Web server benchmarking
Application Response Measurement

External links
Web Load Testing for Dummies (http://www.gomez.com/ebook-web-load-testing-for-dummiesgeneric/) (Book, PDF Version)
The Art of Application Performance Testing - O'Reilly ISBN 978-0-596-52066-3
(http://oreilly.com/catalog/9780596520670) (Book)
Performance Testing Guidance for Web Applications (http://msdn2.microsoft.com/enus/library/bb924375.aspx) (MSDN)
Performance Testing Guidance for Web Applications (http://www.amazon.com/dp/0735625700) (Book)
Performance Testing Guidance for Web Applications
(http://www.codeplex.com/PerfTestingGuide/Release/ProjectReleases.aspx?ReleaseId=6690) (PDF)
Performance Testing Guidance (http://www.codeplex.com/PerfTesting) (Online KB)
Enterprise IT Performance Testing (http://www.perftesting.co.uk) (Online KB)
Performance Testing Videos (http://msdn2.microsoft.com/en-us/library/bb671346.aspx) (MSDN)
Open Source Performance Testing tools (http://www.opensourcetesting.org/performance.php)
"User Experience, not Metrics" and "Beyond Performance Testing"
(http://www.perftestplus.com/pubs.htm)
"Performance Testing Traps / Pitfalls" (http://www.mercury-consultingltd.com/wp/Performance_Testing_Traps.html)
Retrieved from "http://en.wikipedia.org/w/index.php?title=Software_performance_testing&oldid=523786276"
Categories: Software testing Software optimization
This page was last modified on 19 November 2012 at 03:44.
Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may
apply. See Terms of Use for details.
Wikipedia is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.

en.wikipedia.org/wiki/Software_performance_testing

7/7

Systematic software testing


Peter Sestoft
IT University of Copenhagen, Denmark1
Version 2, 2008-02-25

This note introduces techniques for systematic functionality testing of software.

Contents
1 Why software testing?

2 White-box testing

3 Black-box testing

10

4 Practical hints about testing

14

5 Testing in perspective

15

6 Exercises

16

Why software testing?

Programs often contain errors (so-called bugs), even though the compiler accepts the program
as well-formed: the compiler can detect only errors of form, not of meaning. Many errors
and inconveniences in programs are discovered only by accident when the program is being
used. However, errors can be found in more systematic and effective ways than by random
experimentation. This is the goal of software testing.
You may think, why dont we just fix errors when they are discovered? After all, what
harm can a program do? Consider some effects of software errors:
In the 1991 Gulf war, some Patriot missiles failed to hit incoming Iraqi Scud missiles,
which therefore killed people on the ground. Accumulated rounding errors in the control
softwares clocks caused large navigation errors.
Errors in the software controlling the baggage handling system of Denver International
Airport delayed the entire airports opening by a year (19941995), causing losses of
around 360 million dollars. Since September 2005 the computer-controlled baggage
system has not been used; manual baggage handling saves one million dollars a month.
The first launch of the European Ariane 5 rocket failed (1996), causing losses of hundreds
of million dollars. The problem was a buffer overflow in control software taken over from
Ariane 4. The software had not been re-tested to save money.
1

Original 1998 version written for the Royal Veterinary and Agricultural University, Denmark.

Errors in a new train control system deployed in Berlin (1998) caused train cancellations
and delays for weeks.
Errors in poorly designed control software in the Therac-25 radio-therapy equipment
(1987) exposed several cancer patients to heavy doses of radiation, killing some.
A large number of other software-related problems and risks have been recorded by the RISKS
digest since 1985, see the archive at http://catless.ncl.ac.uk/risks.

1.1

Syntax errors, semantic errors, and logic errors

A program in Java, or C# or any other language, may contain several kinds of errors:
syntax errors: the program may be syntactically ill-formed (e.g. contain while x {},
where there are no parentheses around x), so that strictly speaking it is not a Java
program at all;
semantic errors: the program may be syntactically well-formed, but attempt to access
non-existing local variables or non-existing fields of an object, or apply operators to the
wrong type of arguments (as in true * 2, which attempts to multiply a logical value
by a number);
logical errors: the program may be syntactically well-formed and type-correct, but
compute the wrong answer anyway.
Errors of the two former kinds are relatively trivial: the Java compiler javac will automatically discover them and tell us about them. Logical errors (the third kind) are harder to deal
with: they cannot be found automatically, and it is our own responsibility to find them, or
even better, to convince ourselves that there are none.
In these notes we shall assume that all errors discovered by the compiler have been fixed.
We present simple systematic techniques for finding semantic errors and thereby making it
plausible that the program works as intended (when we can find no more errors).

1.2

Quality assurance and different kinds of testing

Testing fits into the more general context of software quality assurance; but what is software
quality? ISO Standard 9126 (2001) distinguishes six quality characteristics of software:
functionality: does this software do what it is supposed to do; does it work as intended?
usability: is this software easy to learn and convenient to use?
efficiency: how much time, memory, and network bandwidth does this software consume?
reliability: how well does this software deal with wrong inputs, external problems such
as network failures, and so on?
maintainability: how easy is it to find and fix errors in this software?
portability: how easy is it to adapt this software to changes in its operating environment,
and how easy is it to add new functionality?
The present note is concerned only with functionality testing, but note that usability testing
and performance testing address quality characteristics number two and three. Reliability can
be addressed by so-called stress testing, whereas maintainability and portability are rarely
systematically tested.
2

1.3

Debugging versus functionality testing

The purpose of testing is very different from that of debugging. It is tempting to confuse the
two, especially if one mistakenly believes that the purpose of debugging is to remove the last
bug from the program. In reality, debugging rarely achieves this.
The real purpose of debugging is diagnosis. After we have observed that the program does
not work as intended, we debug it to answer the question: why doesnt this program work?
When we have found out, we modify the program to (hopefully) work as intended.
By contrast, the purpose of functionality testing is to strengthen our belief that the
program works as intended. To do this, we systematically try to show that it does not work.
If our best efforts fail to show that the program does not work, then we have strengthened
our belief that it does work.
Using systematic functionality testing we might find some cases where the program does
not work. Then we use debugging to find out why. Then we fix the problem. And then we
test again to make sure we fixed the problem without introducing new ones.

1.4

Profiling versus performance testing

The distinction between functionality testing and debugging has a parallel in the distinction
between performance testing and profiling. Namely, the purpose of profiling is diagnosis. After
we have observed that the program is too slow or uses too much memory, we use profiling to
answer the question: why is this program so slow, why does it use so much memory? When
we have found out, we modify the program to (hopefully) use less time and memory.
By contrast, the purpose of performance testing is to strengthen our belief that the program is efficient enough. To do this, we systematically measure how much time and memory
it uses on different kinds and sizes of inputs. If the measurements show that it is efficient
enough for those inputs, then we have strengthened our belief that the program is efficient
enough for all relevant inputs.
Using systematic performance testing we might find some cases where the program is too
slow. Then we use profiling to find out why. Then we fix the problem. And then we test
again to make sure we fixed the problem without introducing new ones.
Schematically, we have:
Purpose \ Quality
Diagnosis
Quality assurance

1.5

Functionality
Debugging
Functionality testing

Efficiency
Profiling
Performance testing

White-box testing versus black-box testing

Two important techniques for functionality testing are white-box testing and black-box testing.
White-box testing, sometimes called structural testing or internal testing, focuses on the
text of the program. The tester constructs a test suite (a collection of inputs and corresponding
expected outputs) that demonstrates that all branches of the programs choice and loop
constructs if, while, switch, try-catch-finally, and so on can be executed. The
test suite is said to cover the statements of the program.
Black-box testing, sometimes called external testing, focuses on the problem that the program is supposed to solve; or more precisely, the problem statement or specification for the
3

program. The tester constructs a test data set (inputs and corresponding expected outputs)
that includes typical as well as extreme input data. In particular, one must include inputs
that are described as exceptional or erroneous in the problem description.
White-box testing and black-box testing are complementary approaches to test case generation. White-box testing does not focus on the problem area, and therefore may not discover
that some subproblem is left unsolved by the program, whereas black-box testing should.
Black-box testing does not focus on the program text, and therefore may not discover that
some parts of the program are completely useless or have an illogical structure, whereas
white-box testing should.
Software testing can never prove that a program contains no errors, but it can strengthen
ones faith in the program. Systematic software testing is necessary if the program will be
used by others, if the welfare of humans or animals depends on it (so-called safety-critical
software), or if one wants to base scientific conclusions on the programs results.

1.6

Test coverage

Given that we cannot make a perfect test suite, how do we know when we have a reasonably
good one? A standard measure of a test suites comprehensiveness is coverage. Here are some
notions of coverage, in increasing order of strictness:
method coverage: does the test suite make sure that every method (including function,
procedure, constructor, property, indexer, action listener) gets executed at least once?
statement coverage: does the test suite make sure that every statement of every method
gets executed at least once?
branch coverage: does the test suite make sure that every transfer of control gets executed at least once?
path coverage: does the test suite make sure that every execution path through the
program gets executed at least once?
Method coverage is the minimum one should expect from a test suite; in principle we know
nothing at all about a method that has not been executed by the test suite.
Statement coverage is achieved by the white-box technique described in Section 2, and is
often the best coverage one can achieve in practice.
Branch coverage is more demanding, especially in relation to virtual method calls (socalled virtual dispatch) and exception throwing. Namely, consider a single method call statement a.m() where expression a has type A, and class A has many subclasses A1, A2 and so on,
that override method m(). Then to achieve branch coverage, the test suite must make sure
that a.m() gets executed for a being an object classs A1, an object of class A2, and so on.
Similarly, there is a transfer of control from an exception-throwing statement throw exn to
the corresponding exception handler, if any, so to achieve branch coverage, the test suite must
make sure that each such statement gets executed in the context of every relevant exception
handler.
Path coverage is usually impossible to achieve in practice, because any program that
contains a loop will usually have an infinite number of possible execution paths.

White-box testing

The goal of white-box testing is to make sure that all parts of the program have been executed,
for some notion of part, as described in Section 1.6 on test coverage. The approach described
in this section gives statement coverage. The resulting test suite includes enough input data
sets to make sure that all methods have been called, that both the true and false branches
have been executed in if statements, that every loop has been executed zero, one, and more
times, that all branches of every switch statement have been executed, and so on. For every
input data set, the expected output must be specified also. Then, the program is run with
all the input data sets, and the actual outputs are compared to the expected outputs.
White-box testing cannot demonstrate that the program works in all cases, but it is a
surprisingly efficient (fast), effective (thorough), and systematic way to discover errors in the
program. In particular, it is a good way to find errors in programs with a complicated logic,
and to find variables that are initialized with the wrong values.

2.1

Example 1 of white-box testing

The program below receives some integers as argument, and is expected to print out the
smallest and the greatest of these numbers. We shall see how one performs a white-box test
of the program. (Be forewarned that the program is actually erroneous; is this obvious?)
public static void main ( String[] args )
{
int mi, ma;
if (args.length == 0)
/* 1 */
System.out.println("No numbers");
else
{
mi = ma = Integer.parseInt(args[0]);
for (int i = 1; i < args.length; i++)
/* 2 */
{
int obs = Integer.parseInt(args[i]);
if (obs > ma) ma = obs;
/* 3 */
else if (mi < obs) mi = obs;
/* 4 */
}
System.out.println("Minimum = " + mi + "; maximum = " + ma);
}
}

The choice statements are numbered 14 in the margin. Number 2 is the for statement.
First we construct a table that shows, for every choice statement and every possible outcome,
which input data set covers that choice and outcome:

Choice
1 true
1 false
2 zero times
2 once
2 more than once
3 true
3 false
4 true
4 false

Input property
No numbers
At least one number
Exactly one number
Exactly two numbers
At least three numbers
Number > current maximum
Number current maximum
Number current maximum and > current minimum
Number current maximum and current minimum

Input data set


A
B
B
C
E
C
D
E, 3rd number
E, 2nd number

While constructing the above table, we construct also a table of the input data sets:
Input data set
A
B
C
D
E

Input contents
(no numbers)
17
27 29
39 37
49 47 48

Expected output
No numbers
17 17
27 29
37 39
47 49

Actual output
No numbers
17 17
27 29
39 39
49 49

When running the above program on the input data sets, one sees that the outputs are wrong
they disagree with the expected outputs for input data sets D and E. Now one may
run the program manually on e.g. input data set D, which will lead one to discover that the
condition in the programs choice 4 is wrong. When we receive a number which is less than
the current minimum, then the variable mi is not updated correctly. The statement should
be:
else if (obs < mi) mi = obs;

/* 4a */

After correcting the program, it may be necessary to reconstruct the white-box test. It may
be very time consuming to go through several rounds of modification and re-testing, so it
pays off to make the program correct from the outset! In the present case it suffices to change
the comments in the last two lines of the table of choices and outcomes, because all we did
was to invert the condition in choice 4:
Choice
1 true
1 false
2 zero times
2 once
2 more than once
3 true
3 false
4a true
4a false

Input property
No numbers
At least one number
Exactly one number
Exactly two numbers
At least three numbers
Number > current maximum
Number current maximum
Number current maximum and < current minimum
Number current maximum and current minimum

Input data set


A
B
B
C
E
C
D
E, 2nd number
E, 3rd number

The input data sets remain the same. The corrected program produced the expected output
for all input data sets AE.
6

2.2

Example 2 of white-box testing

The program below receives some non-negative numbers as input, and is expected to print out
the two smallest of these numbers, or the smallest, in case there is only one. (Is this problem
statement unambiguous?). This program, too, is erroneous; can you find the problem?
public static void main ( String[] args )
{
int mi1 = 0, mi2 = 0;
if (args.length == 0)
/* 1
System.out.println("No numbers");
else
{
mi1 = Integer.parseInt(args[0]);
if (args.length == 1)
/* 2
System.out.println("Smallest = " + mi1);
else
{
int obs = Integer.parseInt(args[1]);
if (obs < mi1)
/* 3
{ mi2 = mi1; mi1 = obs; }
for (int i = 2; i < args.length; i++)
/* 4
{
obs = Integer.parseInt(args[i]);
if (obs < mi1)
/* 5
{ mi2 = mi1; mi1 = obs; }
else if (obs < mi2)
/* 6
mi2 = obs;
}
System.out.println("The two smallest are " + mi1
}
}
}

*/

*/

*/
*/

*/
*/

+ " and " + mi2);

As before we tabulate the programs choices 16 and their possible outcomes:

Choice
1 true
1 false
2 true
2 false
3 false
3 true
4 zero time
4 once
4 more than once
5 true
5 false
6 true
6 false

Input property
No numbers
At least one number
Exactly one number
At least two numbers
Second number first number
Second number < first number
Exactly two numbers
Exactly three numbers
At least four numbers
Third number < current minimum
Third number current minimum
Third number current minimum and < second least
Third number current minimum and second least
7

Input data set


A
B
B
C
C
D
D
E
H
E
F
F
G

The corresponding input data sets might be:


Input data set
A
B
C
D
E
F
G
H

Contents
(no numbers)
17
27 29
39 37
49 48 47
59 57 58
67 68 69
77 78 79 76

Expected output
No numbers
17
27 29
37 39
47 48
57 58
67 68
76 77

Actual output
No numbers
17
27 0
37 39
47 48
57 58
67 0
76 77

Running the program with these test data, it turns out that data set C produces wrong
results: 27 and 0. Looking at the program text, we see that this is because variable mi2
retains its initial value, namely, 0. The program must be fixed by inserting an assignment
mi2 = obs just before the line labelled 3. We do not need to change the white-box test,
because no choice statements were added or changed. The corrected program produces the
expected output for all input data sets AH.
Note that if the variable declaration had not been initialized with mi2 = 0, the Java
compiler would have complained that mi2 might be used before its first assignment. If so, the
error would have been detected even without testing.
This is not the case in many other current programming languages (e.g. C, C++, Fortran),
where one may well use an uninitialized variable its value is just whatever happens to be
at that location in the computers memory. The error may even go undetected by testing,
when the value of mi2 equals the expected answer by accident. This is more likely than it may
sound, if one runs the same (C, C++, Fortran) program on several input data sets, and the
same data values are used in several data sets. Therefore it is a good idea to choose different
data values in the data sets, as done above.

2.3

Summary, white-box testing

Program statements should be tested as follows:


Statement
if
for
while
do-while
switch
try-catch-finally

Cases to test
Condition false and true
Zero, one, and more than one iterations
Zero, one, and more than one iterations
One, and more than one, iterations
Every case and default branch must be executed
The try clause, every catch clause, and the finally clause
must be executed

A conditional expression such as (x != 0 ? 1000/x : 1) must be tested for the condition (x != 0) being true and being false, so that both alternatives have been evaluated.

Short-cut logical operators such as (x != 0) && (1000/x > y) must be tested for all
possible combinations of the truth values of the operands. That is,
(x != 0)
false
true
true

&&

(1000/x > y)
false
true

Note that the second operand in a short-cut (lazy) conjunction will be computed only if the
first operand is true (in Java, C#, C, and C++). This is important, for instance, when the
condition is (x != 0) && (1000/x > y), where the second operand cannot be computed if
the first one is false, that is, if x == 0. Therefore it makes no sense to require that the
combinations (false, false) and (false, true) be tested.
In a short-cut disjunction (x == 0) || (1000/x > y) it holds, dually, that the second
operand is computed only if the first one is false. Therefore, in this case too there are only
three possible combinations:
(x == 0)
true
false
false

||

(1000/x > y)
false
true

Methods The test suite must make sure that all methods have been executed. For recursive
methods one should test also the case where the method calls itself.
The test data sets are presented conveniently by two tables, as demonstrated in this
section. One table presents, for each statement, what data sets are used, and which property
of the input is demonstrated by the test. The other table presents the actual contents of the
data sets, and the corresponding expected output.

Black-box testing

The goal of black-box testing is to make sure that the program solves the problem it is
supposed to solve; to make sure that it works. Thus one must have a fairly precise idea of
the problem that the program must solve, but in principle one does not need the program
text when designing a black-box test. Test data sets (with corresponding expected outputs)
must be created to cover typical as well as extreme input values, and also inputs that are
described as exceptional cases or illegal cases in the problem statement. Examples:
In a program to compute the sum of a sequence of numbers, the empty sequence will
be an extreme, but legal, input (with sum 0).
In a program to compute the average of a sequence of numbers, the empty sequence
will be an extreme, and illegal, input. The program should give an error message for
this input, as one cannot compute the average of no numbers.
One should avoid creating a large collection of input data sets, just to be on the safe side.
Instead, one must carefully consider what inputs might reveal problems in the program, and
use exactly those. When preparing a black-box test, the task is to find errors in the program;
thus destructive thinking is required. As we shall see below, this is just as demanding as
programming, that is, as constructive thinking.

3.1

Example 1 of black-box testing

Problem: Given a (possibly empty) sequence of numbers, find the smallest and the greatest
of these numbers.
This is the same problem as in Section 2.1, but now the point of departure is the above
problem statement, not any particular program which claims to solve the problem.
First we consider the problem statement. We note that an empty sequence does not
contain a smallest or greatest number. Presumably, the program must give an error message
if presented with an empty sequence of numbers.
The black-box test might consist of the following input data sets: An empty sequence (A).
A non-empty sequence can have one element (B), or two or more elements. In a sequence with
two elements, the elements can be equal (C1), or different, the smallest one first (C2) or the
greatest one first (C3). If there are more than two elements, they may appear in increasing
order (D1), decreasing order (D2), with the greatest element in the middle (D3), or with the
smallest element in the middle (D4). All in all we have these cases:
Input property
No numbers
One number
Two numbers, equal
Two numbers, increasing
Two numbers, decreasing
Three numbers, increasing
Three numbers, decreasing
Three numbers, greatest in the middle
Three numbers, smallest in the middle

10

Input data set


A
B
C1
C2
C3
D1
D2
D3
D4

The choice of these input data sets is not arbitrary. It is influenced by our own ideas about
how the problem might be solved by a program, and in particular how it might be solved the
wrong way. For instance, the programmer might have forgotten that the sequence could be
empty, or that the smallest number equals the greatest number if there is only one number,
etc.
The choice of input data sets may be criticized. For instance, it is not obvious that data
set C1 is needed. Could the problem really be solved (wrongly) in a way that would be
discovered by C1, but not by any of the other input data sets?
The data sets C2 and C3 check that the program does not just answer by returning the
first (or last) number from the input sequence; this is a relevant check. The data sets D3 and
D4 check that the program does not just compare that first and the last number; it is less
clear that this is relevant.
Input data set
A
B
C1
C2
C3
D1
D2
D3
D4

3.2

Contents
(no numbers)
17
27 27
35 36
46 45
53 55 57
67 65 63
73 77 75
89 83 85

Expected output
Error message
17 17
27 27
35 36
45 46
53 57
63 67
73 77
83 89

Actual output

Example 2 of black-box testing

Problem: Given a (possibly empty) sequence of numbers, find the greatest difference between
two consecutive numbers.
We shall design a black-box test for this problem. First we note that if there is only
zero or one number, then there are no two consecutive numbers, and the greatest difference
cannot be computed. Presumably, an error message must be given in this case. Furthermore,
it is unclear whether the difference is signed (possibly negative) or absolute (always nonnegative). Here we assume that only the absolute difference should be taken into account, so
that the difference between 23 and 29 is the same as that between 29 and 23.
This gives rise to at least the following input data sets: no numbers (A), exactly one
number (B), exactly two numbers. Two numbers may be equal (C1), or different, in increasing
order (C2) or decreasing order (C3). When there are three numbers, the difference may be
increasing (D1) or decreasing (D2). That is:
Input property
No numbers
One number
Two numbers, equal
Two numbers, increasing
Two numbers, decreasing
Three numbers, increasing difference
Three numbers, decreasing difference
11

Input data set


A
B
C1
C2
C3
D1
D2

The data sets and their expected outputs might be:


Input data set
A
B
C1
C2
C3
D1
D2

Contents
(no numbers)
17
27 27
36 37
48 46
57 56 59
69 65 67

Expected output
Error message
Error message
0
1
2
3
4

Actual output

One might consider whether there should be more variants of each of D1 and D2, in which the
three numbers would appear in increasing order (56,57,59), or decreasing (59,58,56), or
increasing and then decreasing (56,57,55), or decreasing and then increasing (56,57,59).
Although these data sets might reveal errors that the above data sets would not, they do
appear more contrived. However, this shows that black-box testing may be carried on indefinitely: you will never be sure that all possible errors have been detected.

3.3

Example 3 of black-box testing

Problem: Given a day of the month day and a month mth, decide whether they determine a
legal date in a non-leap year. For instance, 31/12 (the 31st day of the 12th month) and 31/8
are both legal, whereas 29/2 and 1/13 are not. The day and month are given as integers, and
the program must respond with Legal or Illegal.
To simplify the test suite, one may assume that if the program classifies e.g. 1/4 and
30/4 as legal dates, then it will consider 17/4 and 29/4 legal, too. Correspondingly, one may
assume that if the program classifies 31/4 as illegal, then also 32/4, 33/4, and so on. There
is no guarantee that the these assumptions actually hold; the program may be written in a
contorted and silly way. Assumptions such as these should be written down along with the
test suite.
Under those assumptions one may test only extreme cases, such as 0/4, 1/4, 30/4, and
31/4, for which the expected outputs are Illegal, Legal, Legal, and Illegal.

12

Contents
0 1
1 0
1 1
31 1
32 1
28 2
29 2
31 3
32 3
30 4
31 4
31 5
32 5
30 6
31 6
31 7
32 7
31 8
32 8
30 9
31 9
31 10
32 10
30 11
31 11
31 12
32 12
1 13

Expected output
Illegal
Illegal
Legal
Legal
Illegal
Legal
Illegal
Legal
Illegal
Legal
Illegal
Legal
Illegal
Legal
Illegal
Legal
Illegal
Legal
Illegal
Legal
Illegal
Legal
Illegal
Legal
Illegal
Legal
Illegal
Illegal

Actual output

It is clear that the black-box test becomes rather large and cumbersome. In fact it is just as
long as a program that solves the problem! To reduce the number of data sets, one might
consider just some extreme values, such as 0/1, 1/0, 1/1, 31/12 and 32/12; some exceptional
values around February, such as 28/2, 29/2 and 1/3, and a few typical cases, such as 30/4,
31/4, 31/8 and 32/8. But that would weaken the test a little: it would not discover whether
the program mistakenly believes that June (not July) has 31 days.

13

Practical hints about testing


Avoid test cases where the expected output is zero. In Java and C#, static and nonstatic fields in classes automatically get initialized to 0. The actual output may therefore
equal the expected output by accident.
In languages such as C, C++ and Fortran, where variables are not initialized automatically, testing will not necessarily reveal uninitialized variables. The accidental value of
an uninitialized variable may happen to equal the expected output. This is not unlikely,
if one uses the same input data in several test cases. Therefore, choose different input
data in different test cases, as done in the preceding sections.
Automate the test, if at all possible. Then it can conveniently be rerun whenever the
program has been modified. This is usually done as so-called unit tests. For Java,
the JUnit framework from www.junit.org is a widely used tool, well supported by
integrated development environments such as BlueJ and Eclipse. For C#, the NUnit
framework from www.nunit.org is widely used. Microsofts Visual Studio Team System
also contains unit test facilities.
As mentioned in Section 3 one should avoid creating an excessively large test suite that
has redundant test cases. Software evolves over time, and the test suite must evolve
together with the software. For instance, if you decide to change a method in your
software so that it returns a different result for certain inputs, then you must look at
all test cases for that method to see whether they are still relevant and correct; in that
situation it is unpleasant to discover that the same functionality is tested by 13 different
test cases. A test suite is a piece of software too, and should have no superfluous parts.
When testing programs that have graphical user interfaces with menus, buttons, and
so on, one must describe carefully step by step what actions menu choices, mouse
clicks, and so on the tester must perform, and what the programs expected reactions
are. Clearly, this is cumbersome and expensive to carry out manually, so professional
software houses use various tools to simulate user actions.

14

Testing in perspective
Testing can never prove that a program has no errors, but it can considerably improve
the confidence one has in its results.
Often it is easier to design a white-box test suite than a black-box one, because one
can proceed systematically on the basis of the program text. Black-box testing requires
more guesswork about the possible workings of the program, but can make sure that
the program does what is required by the problem statement.
It is a good idea to design a black-box test at the same time you write the program.
This reveals unclarities and subtle points in the problem statement, so that you can take
them into account while writing the program instead of having to fix the program
later.
Writing the test cases and the documentation at the same time is also valuable. When
attempting to write a test case, one often realizes what information users of a method
or class will be looking for in the documentation. Conversely, when one makes a claim
(when n+i>arr.length, then FooException is thrown) about the behaviour of a class
or method in the documentation, that should lead to one or more test cases that check
this claim.
If you further use unit test tools to automate the test, you can actually implement
the tests before you implement the corresponding functionality. Then you can more
confidently implement the functionality and measure your implementation progress by
the number of test cases that succeed. This is called test-driven development.
From the testers point of view, testing is successful if it does find errors in the program;
in this case it was clearly not a waste of time to do the test. From the programmers
point of view the opposite holds: hopefully the test will not find errors in the program.
When the tester and the programmer are one and the same person, then there is a
psychological conflict: one does not want to admit to making mistakes, neither when
programming nor when designing test suites.
It is a useful exercise to design a test suite for a program written by someone else. This
is a kind of game: the goal of the programmer is to write a program that contains no
errors; the goal of the tester is to find the errors in the program anyway.
It takes much time to design a test suite. One learns to avoid needless choice statements
when programming, because this reduces the number of test cases in the white-box
test. It also leads to simpler programs that usually are more general and easier to
understand.2
It is not unusual for a test suite to be as large as the software it tests. The C5 Generic
Collection Library for C#/.NET (http://www.itu.dk/research/c5) implementation has
27,000 lines of code, and its unit test has 28,000 lines.
How much testing is needed? The effort spent on testing should be correlated with the
consequences of possible program errors. A program used just once for computing ones
taxes need no testing. However, a program must be tested if errors could affect the
safety of people or animals, or could cause considerable economic losses. If scientific
conclusions will be drawn from the outputs of a program, then it must be tested too.

A program may be hard to understand even when it has no choice statements; see Exercises 10 and 11.

15

Exercises

1. Problem: Given a sequence of integers, find their average.


Use black-box techniques to construct a test suite for this problem.
2. Write a program to solve the problem from Exercise 1. The program should take its
input from the command line. Run the test suite you made.
3. Use white-box techniques to construct a test suite for the program written in Exercise 2,
and run it.
4. Problem: Given a sequence of numbers, decide whether they are sorted in increasing
order. For instance, 17 18 18 22 is sorted, but 17 18 19 18 is not. The result must be
Sorted or Not sorted.
Use black-box techniques to construct a test suite for this problem.
5. Write a program that solves the problem from Exercise 4. Run the test suite you made.
6. Use white-box techniques to construct a test suite for the program written in Exercise 5.
Run it.
7. Write a program to decide whether a given (day, month) pair in a non-leap year is legal,
as discussed in Section 3.3. Run your program with the (black-box) test suite given
there.
8. Use white-box techniques to construct a test suite for the program written in Exercise 7.
Run it.
9. Problem: Given a (day, month) pair, compute the number of the day in a non-leap
year. For instance, (1, 1) is number 1; (1,2), which means 1 February, is number 32,
(1,3) is number 60; and (31,12) is number 365. This is useful for computing the distance
between two dates, e.g. the length of a course, the duration of a bank deposit, or the
time from sowing to harvest. The date and month can be assumed legal for a non-leap
year.
Use black-box techniques to construct a test suite for this problem.
10. We claim that this Java method solves the problem from Exercise 9.
static int dayno(int day, int mth)
{
int m = (mth+9)%12;
return (m/5*153+m%5*30+(m%5+1)/2+59)%365+day;
}

Test this method with the black-box test suite you made above.
11. Use white-box techniques to construct a test suite for the method shown in Exercise 10.
This appears trivial and useless, since there are no choice statements in the program
at all. Instead one may consider jumps (discontinuities) in the processing of data.
In particular, integer division (/) and remainder (%) produce jumps of this sort. For
mth < 3 we have m = (mth + 9) mod 12 = mth + 9, and for mth 3 we have
m = (mth + 9) mod 12 = mth 3. Thus there is a kind of hidden choice when going
from mth = 2 to mth = 3. Correspondingly for m / 5 and (m % 5 + 1) / 2. This can
be used for choosing test cases for white-box test. Do that.
12. Consider a method String toRoman(int n) that is supposed to convert a positive
integer to the Roman numeral representing that integer, using the symbols I = 1,
V = 5, X = 10, L = 50, C = 100, D = 500 and M= 1000. The following rules determine
the Roman numeral corresponding to a positive number:
16

In general, the symbols of a Roman numeral are added together from left to right,
so II = 2, XX = 20, XXXI = 31, and MMVIII = 2008.
The symbols I, X and C may appear up to three times in a row; the symbol M may
appear any number of times; and the symbols V, L and D cannot be repeated.
When a lesser symbol appears before a greater one, the lesser symbol is subtracted,
not added. So IV = 4, IX = 9, XL = 40 and CM = 900.
The symbol I may appear once before V and X; the symbol X may appear once
before L and C; the symbol C may appear once before D and M; and the symbols V,
L and D cannot appear before a greater symbol.
So 45 is written XLV, not VL; and 49 is written XLIX, not IL; and 1998 is written
MCMXCVIII, not IIMM.
Exercise: use black-box techniques to construct a test suite for the method toRoman.
This can be done in two ways. The simplest way is to call toRoman(n) for suitably chosen
numbers n and checking that it returns the expected string. The more ambitious way
is to implement (and test!) the method fromRoman described in Exercise 12 below, and
use that to check Roman.
13. Consider a method int fromRoman(String s) with this specification: The method
checks that string s is a well-formed Roman numeral according to the rules in Exercise 12, and if so, returns the corresponding number; otherwise throws an exception.
Use black-box techniques to construct a test suite for this method. Remember to include
also some ill-formed Roman numerals.

17

23/11/2012

Usability testing - Wikipedia, the free encyclopedia

Usability testing
From Wikipedia, the free encyclopedia

Usability testing is a technique used in user-centered interaction design to evaluate a product by testing it on
users. This can be seen as an irreplaceable usability practice, since it gives direct input on how real users use the
system.[1] This is in contrast with usability inspection methods where experts use different methods to evaluate a
user interface without involving users.
Usability testing focuses on measuring a human-made product's capacity to meet its intended purpose. Examples
of products that commonly benefit from usability testing are foods, consumer products, web sites or web
applications, computer interfaces, documents, and devices. Usability testing measures the usability, or ease of
use, of a specific object or set of objects, whereas general human-computer interaction studies attempt to
formulate universal principles.

Contents
1 History of usability testing
2 Goals of usability testing
3 What usability testing is not
4 Methods
4.1 Hallway testing
4.2 Remote Usability Testing
4.3 Expert review
4.4 Automated expert review
5 How many users to test?
6 See also
7 References
8 External links

History of usability testing


Henry Dreyfuss in the late 1940s contracted to design the state rooms for the twin ocean liners "Independence"
and "Constitution." He built eight prototype staterooms and installed them in a warehouse. He then brought in a
series of travelers to "live" in the rooms for a short time, bringing with them all items they would normally take
when cruising. His people were able to discover over time, for example, if there was space for large steamer
trunks, if light switches needed to be added beside the beds to prevent injury, etc., before hundreds of state
rooms had been built into the ship.[2]
A Xerox Palo Alto Research Center (PARC) employee wrote that PARC used extensive usability testing in
creating the Xerox Star, introduced in 1981.[3]
The Inside Intuit book, says (page 22, 1984), "... in the first instance of the Usability Testing that later became
standard industry practice, LeFevre recruited people off the streets... and timed their Kwik-Chek (Quicken)
usage with a stopwatch. After every test... programmers worked to improve the program."[4]) Scott Cook,
Intuit co-founder, said, "... we did usability testing in 1984, five years before anyone else... there's a very big
difference between doing it and having marketing people doing it as part of their... design... a very big difference
en.wikipedia.org/wiki/Usability_testing

1/6

23/11/2012

Usability testing - Wikipedia, the free encyclopedia

between doing it and having it be the core of what engineers focus on.[5]

Goals of usability testing


Usability testing is a black-box testing technique. The aim is to observe people using the product to discover
errors and areas of improvement. Usability testing generally involves measuring how well test subjects respond
in four areas: efficiency, accuracy, recall, and emotional response. The results of the first test can be treated as a
baseline or control measurement; all subsequent tests can then be compared to the baseline to indicate
improvement.
Efficiency -- How much time, and how many steps, are required for people to complete basic tasks?
(For example, find something to buy, create a new account, and order the item.)
Accuracy -- How many mistakes did people make? (And were they fatal or recoverable with the right
information?)
Recall -- How much does the person remember afterwards or after periods of non-use?
Emotional response -- How does the person feel about the tasks completed? Is the person confident,
stressed? Would the user recommend this system to a friend?
To assess the usability of the system under usability testing, quantitative and/or qualitative Usability goals (also
called usability requirements[6]) have to be defined beforehand.[7][6][8] If the results of the usability testing meet
the Usability goals, the system can be considered as usable for the end-users whose representatives have tested
it.

What usability testing is not


Simply gathering opinions on an object or document is market research or qualitative research rather than
usability testing. Usability testing usually involves systematic observation under controlled conditions to
determine how well people can use the product.[9] However, often both qualitative and usability testing are used
in combination, to better understand users' motivations/perceptions, in addition to their actions.
Rather than showing users a rough draft and asking, "Do you understand this?", usability testing involves
watching people trying to use something for its intended purpose. For example, when testing instructions for
assembling a toy, the test subjects should be given the instructions and a box of parts and, rather than being
asked to comment on the parts and materials, they are asked to put the toy together. Instruction phrasing,
illustration quality, and the toy's design all affect the assembly process.

Methods
Setting up a usability test involves carefully creating a scenario, or realistic situation, wherein the person performs
a list of tasks using the product being tested while observers watch and take notes. Several other test
instruments such as scripted instructions, paper prototypes, and pre- and post-test questionnaires are also used
to gather feedback on the product being tested. For example, to test the attachment function of an e-mail
program, a scenario would describe a situation where a person needs to send an e-mail attachment, and ask him
or her to undertake this task. The aim is to observe how people function in a realistic manner, so that developers
can see problem areas, and what people like. Techniques popularly used to gather data during a usability test
include think aloud protocol, Co-discovery Learning and eye tracking.

Hallway testing
en.wikipedia.org/wiki/Usability_testing

2/6

23/11/2012

Usability testing - Wikipedia, the free encyclopedia

Hallway testing (or Hall Intercept Testing) is a general methodology of usability testing. Rather than using an
in-house, trained group of testers, just five to six random people are brought in to test the product, or service.
The name of the technique refers to the fact that the testers should be random people who pass by in the
hallway.[10]
Hallway testing is particularly effective in the early stages of a new design when the designers are looking for
"brick walls," problems so serious that users simply cannot advance. Anyone of normal intelligence other than
designers and engineers can be used at this point. (Both designers and engineers immediately turn from being
test subjects into being "expert reviewers." They are often too close to the project, so they already know how to
accomplish the task, thereby missing ambiguities and false paths.)

Remote Usability Testing


In a scenario where usability evaluators, developers and prospective users are located in different countries and
time zones, conducting a traditional lab usability evaluation creates challenges both from the cost and logistical
perspectives. These concerns led to research on remote usability evaluation, with the user and the evaluators
separated over space and time. Remote testing, which facilitates evaluations being done in the context of the
users other tasks and technology can be either synchronous or asynchronous. Synchronous usability testing
methodologies involve video conferencing or employ remote application sharing tools such as WebEx. The
former involves real time one-on-one communication between the evaluator and the user, while the latter
involves the evaluator and user working separately.[11]
Asynchronous methodologies include automatic collection of users click streams, user logs of critical incidents
that occur while interacting with the application and subjective feedback on the interface by users.[12] Similar to
an in-lab study, an asynchronous remote usability test is task-based and the platforms allow you to capture
clicks and task times. Hence, for many large companies this allows you to understand the WHY behind the
visitors' intents when visiting a website or mobile site. Additionally, this style of user testing also provides an
opportunity to segment feedback by demographic, attitudinal and behavioural type. The tests are carried out in
the users own environment (rather than labs) helping further simulate real-life scenario testing. This approach
also provides a vehicle to easily solicit feedback from users in remote areas quickly and with lower
organisational overheads.
Numerous tools are available to address the needs of both these approaches. WebEx and Go-to-meeting are
the most commonly used technologies to conduct a synchronous remote usability test.[13] However,
synchronous remote testing may lack the immediacy and sense of presence desired to support a collaborative
testing process. Moreover, managing inter-personal dynamics across cultural and linguistic barriers may require
approaches sensitive to the cultures involved. Other disadvantages include having reduced control over the
testing environment and the distractions and interruptions experienced by the participants in their native
environment.[14] One of the newer methods developed for conducting a synchronous remote usability test is by
using virtual worlds.[15]

Expert review
Expert review is another general method of usability testing. As the name suggests, this method relies on
bringing in experts with experience in the field (possibly from companies that specialize in usability testing) to
evaluate the usability of a product.

Automated expert review


Similar to expert reviews, automated expert reviews provide usability testing but through the use of programs
en.wikipedia.org/wiki/Usability_testing

3/6

23/11/2012

Usability testing - Wikipedia, the free encyclopedia

given rules for good design and heuristics. Though an automated review might not provide as much detail and
insight as reviews from people, they can be finished more quickly and consistently. The idea of creating
surrogate users for usability testing is an ambitious direction for the Artificial Intelligence community.

How many users to test?


In the early 1990s, Jakob Nielsen, at that time a researcher at Sun Microsystems, popularized the concept of
using numerous small usability teststypically with only five test subjects eachat various stages of the
development process. His argument is that, once it is found that two or three people are totally confused by the
home page, little is gained by watching more people suffer through the same flawed design. "Elaborate usability
tests are a waste of resources. The best results come from testing no more than five users and running as many
small tests as you can afford."[10] Nielsen subsequently published his research and coined the term heuristic
evaluation.
The claim of "Five users is enough" was later described by a mathematical model[16] which states for the
proportion of uncovered problems U

where p is the probability of one subject identifying a specific problem and n the number of subjects (or test
sessions). This model shows up as an asymptotic graph towards the number of real existing problems (see figure
below).

In later research Nielsen's claim has eagerly been questioned with both empirical evidence[17] and more
advanced mathematical models.[18] Two key challenges to this assertion are:
1. Since usability is related to the specific set of users, such a small sample size is unlikely to be
representative of the total population so the data from such a small sample is more likely to reflect the
en.wikipedia.org/wiki/Usability_testing

4/6

23/11/2012

Usability testing - Wikipedia, the free encyclopedia

sample group than the population they may represent


2. Not every usability problem is equally easy-to-detect. Intractable problems happen to decelerate the
overall process. Under these circumstances the progress of the process is much shallower than predicted
by the Nielsen/Landauer formula.[19]
It is worth noting that Nielsen does not advocate stopping after a single test with five users; his point is that
testing with five users, fixing the problems they uncover, and then testing the revised site with five different users
is a better use of limited resources than running a single usability test with 10 users. In practice, the tests are run
once or twice per week during the entire development cycle, using three to five test subjects per round, and with
the results delivered within 24 hours to the designers. The number of users actually tested over the course of the
project can thus easily reach 50 to 100 people.
In the early stage, when users are most likely to immediately encounter problems that stop them in their tracks,
almost anyone of normal intelligence can be used as a test subject. In stage two, testers will recruit test subjects
across a broad spectrum of abilities. For example, in one study, experienced users showed no problem using
any design, from the first to the last, while naive user and self-identified power users both failed repeatedly.[20]
Later on, as the design smooths out, users should be recruited from the target population.
When the method is applied to a sufficient number of people over the course of a project, the objections raised
above become addressed: The sample size ceases to be small and usability problems that arise with only
occasional users are found. The value of the method lies in the fact that specific design problems, once
encountered, are never seen again because they are immediately eliminated, while the parts that appear
successful are tested over and over. While it's true that the initial problems in the design may be tested by only
five users, when the method is properly applied, the parts of the design that worked in that initial test will go on
to be tested by 50 to 100 people.

See also
ISO 9241
Software testing
Educational technology
Universal usability
Commercial eye tracking
Don't Make Me Think
Software performance testing
System Usability Scale (SUS)
Test method
Tree testing
RITE Method
Component-Based Usability Testing
Crowdsource testing
Usability goals

References
1. ^ Nielsen, J. (1994). Usability Engineering, Academic Press Inc, p 165
2. ^ NN/G Usability Week 2011 Conference "Interaction Design" Manual, Bruce Tognazzini, Nielsen Norman
Group, 2011
3. ^ http://interactions.acm.org/content/XV/baecker.pdf
4. ^ http://books.google.com/books?id=lRs_4U43UcEC&printsec=frontcover&sig=ACfU3U1xvA7en.wikipedia.org/wiki/Usability_testing

5/6

23/11/2012

Usability testing - Wikipedia, the free encyclopedia

f80TP9Zqt9wkB9adVAqZ4g#PPA22,M1
5. ^ http://news.zdnet.co.uk/itmanagement/0,1000000308,2065537,00.htm
6. ^ a b International Standardization Organization. ergonomics of human system interaction - Part 210 -: Human
centred design for interactive systems (Rep N9241-210). 2010, International Standardization Organization
7. ^ Nielsen, Usability Engineering, 1994
8. ^ Mayhew. The usability engineering lifecycle: a practitioner's handbook for user interface design. London,
Academic press; 1999
9. ^ http://jerz.setonhill.edu/design/usability/intro.htm
10. ^ a b "Usability Testing with 5 Users (Jakob Nielsen's Alertbox)"
(http://www.useit.com/alertbox/20000319.html) . useit.com. 13.03.2000.
http://www.useit.com/alertbox/20000319.html.; references Jakob Nielsen, Thomas K. Landauer (April 1993).
"A mathematical model of the finding of usability problems" (http://dl.acm.org/citation.cfm?
id=169166&CFID=159890676&CFTOKEN=16006386) . Proceedings of ACM INTERCHI'93 Conference
(Amsterdam, The Netherlands, 24-29 April 1993). http://dl.acm.org/citation.cfm?
id=169166&CFID=159890676&CFTOKEN=16006386.
11. ^ Andreasen, Morten Sieker; Nielsen, Henrik Villemann; Schrder, Simon Ormholt; Stage, Jan (2007). "What
happened to remote usability testing?". Proceedings of the SIGCHI conference on Human factors in computing
systems - CHI '07. p. 1405. doi:10.1145/1240624.1240838 (http://dx.doi.org/10.1145%2F1240624.1240838) .
ISBN 9781595935939.
12. ^ Dray, Susan; Siegel, David (2004). "Remote possibilities?". Interactions 11 (2): 10.
doi:10.1145/971258.971264 (http://dx.doi.org/10.1145%2F971258.971264) .
13. ^ http://www.boxesandarrows.com/view/remote_online_usability_testing_why_how_and_when_to_use_it
14. ^ Dray, Susan; Siegel, David (March 2004). "Remote possibilities?: international usability testing at a distance".
Interactions 11 (2): 1017. doi:10.1145/971258.971264 (http://dx.doi.org/10.1145%2F971258.971264) .
15. ^ Chalil Madathil, Kapil; Joel S. Greenstein (May 2011). "Synchronous remote usability testing: a new approach
facilitated by virtual worlds". Proceedings of the 2011 annual conference on Human factors in computing
systems. CHI '11: 22252234. doi:10.1145/1978942.1979267 (http://dx.doi.org/10.1145%2F1978942.1979267)
. ISBN 9781450302289.
16. ^ Virzi, R.A., Refining the Test Phase of Usability Evaluation: How Many Subjects is Enough? Human Factors,
1992. 34(4): p. 457-468.
17. ^ http://citeseer.ist.psu.edu/spool01testing.html
18. ^ Caulton, D.A., Relaxing the homogeneity assumption in usability testing. Behaviour & Information
Technology, 2001. 20(1): p. 1-7
19. ^ Schmettow, Heterogeneity in the Usability Evaluation Process. In: M. England, D. & Beale, R. (ed.),
Proceedings of the HCI 2008, British Computing Society, 2008, 1, 89-98
20. ^ Bruce Tognazzini. "Maximizing Windows" (http://www.asktog.com/columns/000maxscrns.html) .
http://www.asktog.com/columns/000maxscrns.html.

External links
Usability.gov (http://www.usability.gov/)
A Brief History of the Magic Number 5 in Usability Testing
(http://www.measuringusability.com/blog/five-history.php)
Retrieved from "http://en.wikipedia.org/w/index.php?title=Usability_testing&oldid=519424139"
Categories: Usability Software testing Educational technology Evaluation methods Tests
This page was last modified on 23 October 2012 at 17:34.
Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may
apply. See Terms of Use for details.
Wikipedia is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.

en.wikipedia.org/wiki/Usability_testing

6/6

23/11/2012

Usability Testing | Usability.gov

Search
Text Size
Home | Basics |

Methods

Templates

Resources Across Government

Articles & Discussion

Print

Download Reader

Guidelines

About Us

Home > Methods > Test & Refine the Site > Usability Testing

Planning the Project


Analyze Current Site
Design New Site
Test & Refine the Site

Usability Testing

Introduction | Test Plan | Preparation & Testing | Data Analyses & Report

Step-by-Step Guide

Introduction

We provide a step-by-step visual


map to guide you through the
user-centered design process.

Topics on This Page

Introduction to Usability Testing

Types of Evaluations

Four Things to Keep in Mind

Usability Testing

Cost

Heuristic Evaluations
Implement & Retest
Methods at a Glance

Introduction to Usability Testing


Usability testing is a technique used to evaluate a product by testing it
with representative users. In the test, these users will try to complete
typical tasks while observers watch, listen and takes notes.
Your goal is to identify any usability problems, collect quantitative data
on participants' performance (e.g., time on task, error rates), and
determine participant's satisfaction with the product.

When to Test
You should test early and test often. Usability testing lets the design
and development teams identify problems before they get coded (i.e.,
"set in concrete). The earlier those problems are found and fixed, the
less expensive the fixes are.

No Lab Needed
You DO NOT need a formal usability lab to do testing. You can do
effective usability testing in any of these settings:
a fixed laboratory having two or three connected rooms outfitted
with audio-visual equipment
a conference room, or the user's home or work space, with portable
recording equipment
a conference room, or the user's home or work space, with no
recording equipment, as long as someone is observing the user and
taking notes
remotely, with the user in a different location

What You Learn


You will learn if participants are able to complete identified routine tasks
successfully and how long it takes to do that. You will find out how
satisfied participants are with your Web site. Overall, you will identify
changes required to improve user performance. And you can match the
performance to see if it meets your usability objectives.

Four Things to Keep in Mind


1. Testing the Site NOT the Users
We try hard to ensure that participants do not think that we are
testing them. We help them understand that they are helping us
test the prototype or Web site.
2. Performance vs. Subjective Measures
We measure both performance and subjective (preference) metrics.
Performance measures include: success, time, errors, etc. Subjective
measures include: user's self reported satisfaction and comfort
ratings.
People's performance and preference do not always match. Often
users will perform poorly but their subjective ratings are very high.
Conversely, they may perform well but subjective ratings are very
low.
3. Make Use of What You Learn
Usability testing is not just a milestone to be checked off on the
project schedule. The team must consider the findings, set priorities,
and change the prototype or site based on what happened in the
usability test.
4. Find the Best Solution
Most projects, including designing or revising Web sites, have to
deal with constraints of time, budget, and resources. Balancing all
those is one of the major challenges of most projects.

Cost
Cost depends on the size of the site, how much you need to test, how
many different types of participants you anticipate having, and how
formal you want the testing to be. Remember to budget for more than
one usability test. Building usability into a Web site (or any product) is

www.usability.gov/methods/test_refine/learnusa/index.html

1/2

23/11/2012

Usability Testing | Usability.gov


one usability test. Building usability into a Web site (or any product) is
an iterative process.
Consider these elements in budgeting for usability testing:

Time
You will need time to plan the usability test. It will take the usability
specialist and the team time to get familiarized with the site and do dry
runs with scenarios. Budget for the time it takes to test users and for
analyzing the data, writing the report, and discussing the findings.

Recruiting Costs
Recruiting Costs: time of in-house person or payment to a recruiting firm.
Developing a user database either in-house or firm recruiting becomes
less time consuming and cheaper. Also allow for the cost of paying or
providing gifts for the participants.

Rental Costs
If you do not have equipment, you will have to budget for rental costs
for the lab or other equipment.
Back to Top

Acce ssibility | Privacy Policy | Vie we rs & Playe rs | USA.gov


This is an official U.S. Gove rnm e nt W e b site m anage d by the U.S. De partm e nt of He alth & Hum an Se rvice s.

www.usability.gov/methods/test_refine/learnusa/index.html

2/2

Usability Testing

18

Usability Testing
There are two major considerations when
conducting usability testing. The first is to ensure that the best possible
method for testing is used. Generally, the best method is to conduct
a test where representative participants interact with representative
scenarios. The tester collects data on the participants success, speed of
performance, and satisfaction. The findings, including both quantitative
data and qualitative observations information, are provided to designers
in a test report. Using inspection evaluations, in place of well-controlled
usability tests, must be done with caution. Inspection methods, such as
heuristic evaluations or expert reviews, tend to generate large numbers
of potential usability problems that never turn out to be actual usability
problems.
The second major consideration is to ensure that an iterative approach
is used. After the first test results are provided to designers, they should
make changes and then have the Web site tested again. Generally, the
more iterations, the better the Web site.

Research-Ba s e d We b D e s i g n & U s a b i l i t y G u i d e l i n e s

18:3 Prioritize
18:1
Use an Iterative
Tasks Design Approach
an iterative design approach to create the most
useful and usable Web site.

Comments: Iterative design consists of creating

Relative Importance:
Strength of Evidence:

paper or computer prototypes, testing the


prototypes, and then making changes based on
the test results. The test and make changes process is repeated until the
Web site meets performance benchmarks (usability goals). When these
goals are met, the iterative process ends.
The iterative design process helps to substantially improve the usability of
Web sites. One recent study found that the improvements made between
the original Web site and the redesigned Web site resulted in thirty percent
more task completions, twenty-five percent less time to complete the tasks,
and sixty-seven percent greater user satisfaction. A second study reported
that eight of ten tasks were performed faster on the Web site that had been
iteratively designed. Finally, a third study found that forty-six percent of
the original set of issues were resolved by making design changes to the
interface.

Sources: Badre, 2002; Bailey, 1993; Bailey and Wolfson, 2005; Bradley and

Johnk, 1995; Egan, et al., 1989; Hong, et al., 2001; Jeffries, et al., 1991; Karat,
Campbell, and Fiegel, 1992; LeDoux, Connor and Tullis, 2005; Norman and
Murphy, 2004; Redish and Dumas, 1993; Tan, et al., 2001.

See page xxii


for detailed descriptions
of the rating scales

R e s e a r c h - B a s e d We b D e s i g n & U s a b i l i t y G u i d e l i n e s

Usability Testing

Guideline: Develop and test prototypes through

189

Usability Testing

190

18:2 Solicit Test Participants Comments


Guideline: Solicit usability testing participants

comments either during or after the performance


of tasks.

Comments: Participants may be asked to give their

Relative Importance:
Strength of Evidence:

comments either while performing each task (think


aloud) or after finishing all tasks (retrospectively).
When using the think aloud method, participants report on incidents as soon
as they happen. When using the retrospective approach, participants perform
all tasks uninterrupted, and then watch their session video and report any
observations (critical incidents).
Studies have reported no significant difference between the think aloud versus
retrospective approaches in terms of the number of useful incident reports
given by participants. However, the reports (with both approaches) tended to
be positively biased and think aloud participants may complete fewer tasks.
Participants tend not to voice negative reports. In one study, when using the
think aloud approach, users tended to read text on the screen and verbalize
more of what they were doing rather than what they were thinking.

Sources: Bailey, 2003; Bowers and Snyder, 1990; Capra, 2002; Hoc and Leplat,
1983; Ohnemus and Biers, 1993; Page and Rahimi, 1995; Van Den Haak, De
Jong, and Schellens, 2003; Wright and Converse, 1992.

18:3 Evaluate Web Sites Before and After Making Changes


Guideline: Conduct before and after studies

when revising a Web site to determine changes in


usability.

Comments: Conducting usability studies prior to

Relative Importance:
Strength of Evidence:

and after a redesign will help designers determine if


changes actually made a difference in the usability
of the site. One study reported that only twenty-two percent of users were able
to buy items on an original Web site. After a major redesign effort, eighty-eight
percent of users successfully purchased products on that site.

Sources: John and Marks, 1997; Karat, 1994a; Ramey, 2000; Rehman, 2000;
Williams, 2000; Wixon and Jones, 1996.

Research-Ba s e d We b D e s i g n & U s a b i l i t y G u i d e l i n e s

18:3 Prioritize Tasks


18:4

Relative Importance:

preventing easy tasks from being easy.

Strength of Evidence:

Comments: When deciding which usability issues to

fix first, address the tasks that users believe to be easy but are actually difficult.
The Usability Magnitude Estimation (UME) is a measure that can be used to
assess user expectations of the difficulty of each task. Participants judge how
difficult or easy a task will be before trying to do it, and then make a second
judgment after trying to complete the task. Each task is eventually put into
one of four categories based on these expected versus actual ratings:

Tasks that were expected to be easy, but were actually difficult;

Tasks that were expected to be difficult, but were actually easy;

Tasks that were expected to be easy and were actually easy; and

Tasks that were expected to be difficult and were difficult to


complete.

Sources: Rich and McGee, 2004.

18:5 Distinguish Between Frequency and Severity


Guideline: Distinguish between frequency and

severity when reporting on usability issues and


problems.

Comments: The number of users affected

Relative Importance:
Strength of Evidence:

determines the frequency of a problem. To be


most useful, the severity of a problem should
be defined by analyzing difficulties encountered by individual users. Both
frequency and severity data can be used to prioritize usability issues that
need to be changed. For example, designers should focus first on fixing
those usability issues that were shown to be most severe. Those usability
issues that were encountered by many participants, but had a severity rating
of nuisance, should be given much less priority.

Sources: Woolrych and Cockton, 2001.

See page xxii


for detailed descriptions
of the rating scales

R e s e a r c h - B a s e d We b D e s i g n & U s a b i l i t y G u i d e l i n e s

Usability Testing

Guideline: Give high priority to usability issues

191

Usability Testing

192

18:2 Select
18:6
Solicit the
TestRight
Participants
Number of
Comments
Participants
Guideline: Select the right number of participants

when using different usability techniques. Using too


few may reduce the usability of a Web site; using
too many wastes valuable resources.

Relative Importance:
Strength of Evidence:

Comments: Selecting the number of participants to

use when conducting usability evaluations depends


on the method being used:
Inspection evaluation by usability specialists:
T
 he typical goal of an inspection evaluation is to have usability experts
separately inspect a user interface by applying a set of broad usability
guidelines. This is usually done with two to five people.
T
 he research shows that as more experts are involved in evaluating the
usability of the product, the greater the number of usability issues will
be identified. However, for every true usability problem identified, there
will be at least one usability issue that is not a real problem. Having more
evaluators does decrease the number of misses, but is also increases
the number of false positives. Generally, the more expert the usability
specialists, the more useful the results.
Performance usability testing with users:
E arly in the design process, usability testing with a small number of users
(approximately six) is sufficient to identify problems with the information
architecture (navigation) and overall design issues. If the Web site has
very different types of users (e.g., novices and experts), it is important to
test with six or more of each type of user. Another critical factor in this
preliminary testing is having trained usability specialists as the usability test
facilitator and primary observers.
O
 nce the navigation, basic content, and display features are in place,
quantitative performance testing (measuring times, wrong pathways,
failure to find content, etc.) can be conducted to ensure that usability
objectives are being met. To measure each usability objective to a
particular confidence level, such as ninety-five percent, requires a larger
number of users in the usability tests.
W
 hen the performance of two sites is compared (i.e., an original site and a
revised site), quantitative usability testing should be employed. Depending
on how confident the usability specialist wants to be in the results, the
tests could require a larger number of participants.

Research-Ba s e d We b D e s i g n & U s a b i l i t y G u i d e l i n e s

18:3 Prioritize Tasks

193

Sources: Bailey, 1996; Bailey, 2000c; Bailey, 2000d; Brinck and Hofer, 2002;
Chin, 2001; Dumas, 2001; Gray and Salzman, 1998; Lewis, 1993; Lewis,
1994; Nielsen and Landauer, 1993; Perfetti and Landesman, 2001; Virzi,
1990; Virzi, 1992.

18:7 Use the Appropriate Prototyping Technology


Guideline: Create prototypes using the most

Relative Importance:
appropriate technology for the phase of the
design, the required fidelity of the prototype, and
skill of the person creating the prototype.
Strength of Evidence:

Comments: Designers can use either paper-based

or computer-based prototypes. Paper-based


prototyping appears to be as effective as computer-based prototyping
when trying to identify most usability issues. Several studies have shown
that there was no reliable difference in the number of usability issues
detected between computer and paper prototypes. However, usability test
participants usually prefer interacting with computer-based prototypes.
Paper prototypes can be used when it is necessary to view and evaluate
many different (usually early) design ideas, or when computer-based
prototyping does not support the ideas the designer wants to implement, or
when all members of the design team need to be includedeven those that
do not know how to create computer-based prototypes.
Software tools that are available to assist in the rapid development of
prototypes include PowerPoint, Visio, including other HTML base tools.
PowerPoint can be used to create medium fidelity prototypes. These
prototypes can be both interactive and dynamic, and are useful when the
design requires more than a pencil-and-paper prototype.

Sources: Sefelin, Tscheligi and Giller, 2003; Silvers, Voorheis and Anders, 2004;
Walker, Takayama and Landay, 2002.

See page xxii


for detailed descriptions
of the rating scales

R e s e a r c h - B a s e d We b D e s i g n & U s a b i l i t y G u i d e l i n e s

Usability Testing

It is best to perform iterative cycles of usability testing over the course
of the Web sites development. This enables usability specialists and
designers to observe and listen to many users.

Usability Testing

194

18:8
Inspection
EvaluationComments
Results Cautiously
18:2 Use
Solicit
Test Participants
Guideline: Use inspection evaluation results
with caution.

Relative Importance:

Comments: Inspection evaluations include heuristic

Strength of Evidence:
evaluations, expert reviews, and cognitive
walkthroughs. It is a common practice to conduct
an inspection evaluation to try to detect and resolve
obvious problems before conducting usability tests. Inspection evaluations
should be used cautiously because several studies have shown that they appear
to detect far more potential problems than actually exist, and they also tend to
miss some real problems. On average, for every hit there will be about 1.3 false
positives and .5 misses.
Another recent study concluded that the low effectiveness of heuristic
evaluations as a whole was worrisome because of the low problem detection
rate (p=.09), and the large number of evaluators required (16) to uncover
seventy-five percent of the potential usability issues.
Another difficulty when conducting heuristic evaluations is that evaluators
frequently apply the wrong heuristic, which can mislead designers that are
trying to fix the problem. One study reported that only thirty-nine percent of
the heuristics were appropriately applied.
Evaluators seem to have the most success identifying usability issues that can be
seen by merely looking at the display, and the least success finding issues that
require users to take several steps (clicks) to a target.
Heuristic evaluations and expert reviews may best be used to identify potential
usability issues to evaluate during usability testing. To improve somewhat
on the performance of heuristic evaluations, evaluators can use the usability
problem inspector (UPI) method or the Discovery and Analysis Resource
(DARe) method.

Sources: Andre, Hartson and Williges, 2003; Bailey, Allen and Raiello, 1992;

Catani and Biers, 1998; Cockton and Woolrych 2001; Cockton and Woolrych,
2002; Cockton, et al., 2003; Fu, Salvendy and Turley, 1998; Fu, Salvendy and
Turley, 2002; Law and Hvannberg, 2002; Law and Hvannberg, 2004; Nielsen
and Landauer, 1993; Nielsen and Mack, 1994; Rooden, Green and Kanis,
1999; Stanton and Stevenage, 1998; Virzi, Sorce and Herbert, 1993; Wang and
Caldwell, 2002.

Research-Ba s e d We b D e s i g n & U s a b i l i t y G u i d e l i n e s

See page xxii


for detailed descriptions
of the rating scales

18:9
RecognizeTasks
the Evaluator Effect
18:3 Prioritize
conducting inspection evaluations.

Strength of Evidence:

Comments: The evaluator effect occurs when

multiple evaluators evaluating the same interface detect markedly different


sets of problems. The evaluators may be doing an expert review, heuristic
evaluation, or cognitive walkthrough.
The evaluator effect exists for evaluators who are novice or experienced,
while detecting cosmetic and severe problems, and when evaluating simple
or complex Web sites. In fact, when using multiple evaluators, any one
evaluator is unlikely to detect the majority of the severe problems that will
be detected collectively by all evaluators. Evaluators also tend to perceive
the problems they detected as more severe than the problems detected by
others.
The main cause of the evaluator effect seems to be that usability evaluation is a
complex cognitive activity that requires evaluators to exercise difficult judgments.

Sources: Hertzum and Jacobsen, 2001; Jacobsen, Hertzum and John, 1998;

Molich, et al., 1998; Molich, et al., 1999; Nielsen and Molich, 1990; Nielsen,
1992; Nielsen, 1993; Redish and Dumas, 1993; Selvidge, 2000.

18:10 Apply Automatic Evaluation Methods


Guideline: Use appropriate automatic evaluation

methods to conduct initial evaluations on Web sites.

Relative Importance:

Comments: An automatic evaluation method is

Strength of Evidence:
one where software is used to evaluate a Web
site. An automatic evaluation tool can help find
certain types of design difficulties, such as pages
that will load slowly, missing links, use of jargon, potential accessibility
problems, etc. While automatic evaluation methods are useful, they should
not be used as a substitute for evaluations or usability testing with typical
users. There are many commercially available automatic evaluation methods
available for checking on a variety of Web site parameters.

Sources: Brajnik, 2000; Campbell and Stanley, 1963; Gray and Salzman, 1998;
Holleran, 1991; Ivory and Hearst, 2002; Ramey, 2000; Scholtz, 1998; World
Wide Web Consortium, 2001.

R e s e a r c h - B a s e d We b D e s i g n & U s a b i l i t y G u i d e l i n e s

195

Usability Testing

Guideline: Beware of the evaluator effect when

Relative Importance:

Usability Testing

196

18:2 Solicit
Test Participants
Comments
18:11
Use Cognitive
Walkthroughs
Cautiously
Guideline: Use cognitive walkthroughs with caution.

Relative Importance:

Comments: Cognitive walkthroughs are often

conducted to resolve obvious problems before


Strength of Evidence:
conducting performance tests. The cognitive
walkthrough appears to detect far more potential
problems than actually exist, when compared with
performance usability testing results. Several studies have shown that only
about twenty-five percent of the potential problems predicted by the cognitive
walkthrough were found to be actual problems in a performance test. About
thirteen percent of actual problems in the performance test were missed
altogether in the cognitive walkthrough. Cognitive walkthroughs may best be
used to identify potential usability issues to evaluate during usability testing.

Sources: Blackmon, et al., 2002; Desurvire, Kondziela and Atwood, 1992;

Hassenzahl, 2000; Jacobsen and John, 2000; Jeffries and Desurvire, 1992; John
and Mashyna, 1997; Karat, 1994b; Karat, Campbell and Fiegel, 1992; Spencer,
2000.

18:12 Choosing Laboratory vs. Remote Testing


Guideline: Testers can use either laboratory or

remote usability testing because they both elicit


similar results.

Comments: In laboratory-based testing, the

Relative Importance:
Strength of Evidence:

participant and the tester are in the same physical


location. In remote testing, the tester and the
participant are in different physical locations. Remote testing provides
the opportunity for participants to take a test in their home or office. It is
convenient for participants because it requires no travel to a test facility.
Studies have evaluated whether remote testing is as effective as traditional,
lab-based testing. To date, they have found no reliable differences between
lab-based and remote testing in terms of the number of types of usability
issues identified. Also, they report no reliable differences in task completion
rate, time to complete the tasks, or satisfaction scores.

Sources: Brush, Ames and Davis, 2004; Hartson, et al., 1996; Thompson,
Rozanski and Rochester, 2004; Tullis, et al., 2002.

Research-Ba s e d We b D e s i g n & U s a b i l i t y G u i d e l i n e s

18:3 Prioritize
TasksRatings Cautiously
18:13
Use Severity
Relative Importance:

Comments: Most designers would like usability

specialists to prioritize design problems that they Strength of Evidence:


found either by inspection evaluations or expert
reviews. So that they can decide which issues to
fix first, designers would like the list of potential
usability problems ranked by each ones severity level. The research
literature is fairly clear that even highly experienced usability specialists
cannot agree on which usability issues will have the greatest impact on
usability.
One study had 17 expert review and usability test teams evaluate and test
the same Web page. The teams had one week to do an expert review,
or two weeks to do a usability test. Each team classified each usability
issue as a minor problem, serious problem, or critical problem. There was
considerable disagreement in which problems the teams judged as minor,
serious or critical, and there was little agreement on which were the top five
problems. Another study reported that heuristic evaluators overestimated
severity twenty-two percent of the time, and underestimated severity
seventy-eight percent of the time when compared with usability testing
results.

Sources: Bailey, 2005; Catani and Biers, 1998; Cockton and Woolrych, 2001;
Dumas, Molich and Jeffries, 2004; Hertzum and Jacobsen, 2001; Jacobsen,
Hertzum and John, 1998; Law and Hvannberg, 2004; Molich, 2005.

See page xxii


for detailed descriptions
of the rating scales

R e s e a r c h - B a s e d We b D e s i g n & U s a b i l i t y G u i d e l i n e s

Usability Testing

Guideline: Use severity ratings with caution.

197

Usability Testing
Basics
An Overview

Usability Testing Basics

Contents
Usability Testing Defined ........................................................................................................................ 3
Decide What to Test ............................................................................................................................ 3
Determine When to Test What ........................................................................................................... 4
Decide How Many to Test ................................................................................................................... 4
Design the Test..................................................................................................................................... 5
Consider the Where, When, and How .......................................................................................... 5
Scenarios and Tasks ....................................................................................................................... 5
Prepare to Measure the Experience ............................................................................................. 7
Select Data to Capture .................................................................................................................... 8
Recruit Participants .............................................................................................................................. 8
Recruitment Ideas ............................................................................................................................ 8
Compensation ................................................................................................................................... 9
Prepare for Test Sessions .................................................................................................................. 9
Setting ................................................................................................................................................ 9
Schedule Participants ...................................................................................................................... 9
Stakeholders ..................................................................................................................................... 9
Observers .......................................................................................................................................... 9
Script ................................................................................................................................................ 10
Questionnaires and Surveys ........................................................................................................ 10
Conduct Test Sessions ..................................................................................................................... 10
Begin with a Run-through ............................................................................................................. 10
At the Test Session ........................................................................................................................ 10
Facilitation ....................................................................................................................................... 11
After the Session ............................................................................................................................ 11
Analyze Your Study ........................................................................................................................... 12
Step 1: Identify exactly what you observed................................................................................ 12
Step 2: Identify the causes of any problems .............................................................................. 12
Step 3: Determine Solutions ......................................................................................................... 12
Deliverables......................................................................................................................................... 13
Appendix A: Participant Recruitment Screener ................................................................................. 15
Recruitment Script.............................................................................................................................. 15

About this Document


This document details usability testing basicshow to apply them with any product or prototype and when to
apply them during any point in the development process. It also discusses how to conduct, analyze, and
report on usability test findings. Then you can learn about how to do it all in Morae 3.

www.techsmith.com

Usability Testing Basics

www.techsmith.com

Usability Testing Basics

Usability Testing Defined


Usability tests identify areas where people struggle with a product and help you make recommendations for
improvement. The goal is to better understand how real users interact with your product and to improve the
product based on the results. The primary purpose of a usability test is to improve a design.
In a typical usability test, real users try to accomplish typical goals, or tasks, with a product under controlled
conditions. Researchers, stakeholders, and development team members watch, listen, collect data, and take
notes.
Since usability testing employs real customers accomplishing real tasks, it can provide objective performance
data, such as time on task, error-rate, and task success. There is also no substitute for watching users
struggle with or have great success in completing a task when using a product. This observation helps
designers and developers gain empathy with users, and help them think of alternative designs that better
support tasks and workflow.

Decide What to Test


Meet with stakeholders, including members of the development team when possible, to map out the goals for
the test and discuss what areas of the system or product you will evaluate. In order to gather all of the
information you will need to conduct your test, ask for feedback on:
Background:

Product description and reasons for requesting feedback

Participants:

The desired qualities of participants and characteristics of users or


customers of the product

Usability Goals:

What you hope to learn with this test

Key Points:

What kinds of actions/features the test tasks should coverthis also


may include a list of specific questions the team wants the usability test
to answer

Timeline:

The timeline for testingwhen the product or prototype will be ready for
testing, when the team would like to discuss the results, or any other
constraints

Additional
Information:

Anything else that needs to be taken into consideration

Be sure to identify user goals and needs as well. With this information you can then develop scenarios and
tasks for participants to perform that will help identify where the team can make improvements.
For example:
Who uses (or would use) the product?
What are their goals for using the product?
What tasks would those people want to or have to accomplish to meet that goals?
Are there design elements that cause problems and create a lot of support calls?
Are you interested in finding out if a new product feature makes sense to current users?

www.techsmith.com

Usability Testing Basics

Determine When to Test What


Usability testing can employ many methods and work with products at many levels of development. If there is
enough of an interface to complete tasksor or even imagine completing a taskit is possible to perform a
usability test. You can test a product at various stages of development:
Low-fidelity prototype or paper prototype
Hand drawn, mocked up, or wireframe version of a product or web site that allows for a paper
prototype style test before work begins or early in development.
High-fidelity prototype
An interactive system that can be used on a computer, such as a Flash version of a products user
interface and interactivity. High fidelity prototypes should include representative data and mimic the
experience users would have when using the finished product to accomplish tasks. Usually performed
as development progresses.
Alpha and Beta versions
These not-ready-for-release versions often are stable enough and rich enough to be sent or accessed
by remote participants for a usability test.
Release version
A product that has been released to customers, and is especially effective for testing the workflow of
the product from beginning to end.
Comparative or A/B
Multiple versions of a design are used in testing (often alternated between participants) to measure
differences in performance and satisfaction.

Decide How Many to Test


The number of participants varies based on the type and purpose of the test. Opinions vary, but at least four
participants from each group of user types (user types are determined by stakeholders and development
team members when determining testing goals) are usually needed to test a product. Different testing
techniques require different numbers of participants, as explained in this table.

Recommended Number of Participants by Testing Technique


BENCHMARK
METRICS

DIAGNOSTIC
(FORMATIVE)
EVALUATION

SUMMATIVE TESTING

How many?

8-24 users

4-6 users

6-12+ users

Metrics and
Measures

Focus on metrics for


time, failures, etc
Tests current process
or product

Less formal
Increased focus on
qualitative data

More formal
Metrics based on
usability goals

Why

Establish baseline
metrics

Find and fix


problems

Measure success of new


design

When

Before a design
project begins or early
in development

During design

At end of process

How often

Once

Iterative

Once

Source: Ginny Redish

www.techsmith.com

Usability Testing Basics

Design the Test


Document your test plan with a protocol. You may want to use a test planning checklist to help you track all the
details. Examples of each are available on the Morae Resource CD.
Scenario and task design are one of the most important factors to consider. Plan to have participants
accomplish typical tasks with the product under controlled conditions. The tasks should provide the data that
answers your design questions.

Consider the Where, When, and How


You will need to schedule rooms, labs and equipment, and know where your participants will be located.
Usability tests can take place in a lab, conference room, quiet office space, or a quiet public space. Morae
enables you to capture the product or screen and software data, facial expressions, and verbal comments.
UserVue, TechSmiths remote testing service, lets participants participate in a test from home or work.
Recordings from UserVue import seamlessly into Morae.

Scenarios and Tasks


Tasks are the activities you will ask your participant to do during a usability test; scenarios frame tasks and
provide motivation for the participant to perform those tasks. You may have one scenario or several, depending
on your tasks.
Both tasks and scenarios should be adjusted to meet goals and should be part of the conversation you have
with stakeholders about the test.

Tips for Writing Scenarios


You may find it easiest to write tasks first, then scenarios, or vice versa. In our examples, we start with
writing a scenario.
Imagine why your users would want to use your product in general, then specifically what would
motivate them to encounter the design elements you are evaluating. Scenarios should be a story that
provides motivation to your participants.
Effective tasks often contain scenario information, which give the test participant an understanding of
their motivation and context for accomplishing the task and all information needed to complete the
task. A scenario could be given to participants before beginning the tasks.
For example, to find out how users use the store on your Web site, a scenario could state, You have
been researching different types of video cameras to buy to record family videos and transfer them on
to your computer. You want to use the Web site to find information about video cameras and purchase
a camera based on your needs.
Another method is to fold scenario information into each task.
For example, a task might state, Purchase a video camera, but a task with scenario information
would give more detail: You want to purchase a video camera that is small and lightweight, and can
transfer video files on to your computer. You have a budget of $400. Find and purchase a camera that
meets your needs.

Scenario Dos and Donts


DO: Create a believable scenario
DONT: Create long, complex scenarios consider breaking them up into smaller scenarios for
smaller groups of tasks

www.techsmith.com

Usability Testing Basics

Tips for Writing Tasks


Tasks can contain as little or as much information as necessary to aid participants and give them context and
motivation. Different types of tests require different types of tasksa paper prototype might seek a more
open-ended task where another type of test may need very specific tasks. Tasks fall into three main
categories:
Prescribed tasks as the test designer you determine what the participant will do.
Example: You want to enhance your copy of SnagIt. Using TechSmith.com, download and install the
Typewriter style numbers for SnagIt.
Participant defined ask participants to tell you something they would normally do with your product,
and then have them do the task they described.
Example: Using SnagIt 9, take a capture of something that you normally capture in your work or
personal life - or something similar to what you normally would do - and enhance and share it the way
you normally would. Please feel free to customize or use SnagIt in any way that meets your needs.
Open ended allow participants to organically explore the product based on a scenario you provide.
Example: We are giving you $100 to buy software that will capture your screen. Using the internet,
find the software you want to buy. When you are done you can keep the software you purchase as
well as any remaining funds.
The order of tasks will often follow a natural flow of product use. When order does not matter for the user, the
order of tasks might need to be varied to avoid testing bias. It may be best to begin with a simple task to ease
the user into the testing situation and build confidence.

Task Dos and Donts


DO: Use the language of the participant, and write tasks that the participant might realistically expect
to do in his or her use of the product.
DO: Identify specific activities that represent typical tasks that your users would perform with your
product. The tasks should relate back to the goals for the test and relate to your scenario. There are
several types of tasks that you might use based on the data you are interested in collecting.
DO: Provide additional information such as a credit card number for payment transactions, receipts for
an expense reporting system, email addresses, etc.
DONT: Use any of the terms used in the product avoid clues about what to do. Avoid terms like
Click on your shopping cart to check out.
DONT: Lead the participant by using directions that are too explicit. Avoid language such as click on
the red button to begin.
DONT: Write so that you describe how to perform the task.
DONT: Write dependent tasks that require participants to complete one task before moving on; if data
or other artifacts from the first task are needed, provide them in subsequent tasks.

www.techsmith.com

Usability Testing Basics

Prepare to Measure the Experience


Usability testing tests a product under the most realistic circumstances possible while controlling the
conditions. This method of user research lets the researcher collect data, as measured in numbers
(quantitative) and documented as part of the test (qualitative). Different data are used to measure various
aspects of usability.

Key Evaluation Measures for Usability Testing


NAME

WHATS MEASURED

WHEN TO USE THIS MEASURE

Task
Success

Whether or not the participant was


successful, and to what degree. (For
example, completed with ease,
completed with difficulty, failed to
complete.)

Critical when effectiveness of the product is a


primary goal.

Time on
Task

The length of time it takes the


participant to complete a task. May
be averaged for all participants, and
can be compared between tests.

Critical when efficiency is a primary usability


goal, and when efficiency is a primary influence
on satisfaction.

Errors

A count of the errors each


participant makes in each task.
Errors may be categorized or
predefined.

Critical to both efficiency and effectiveness, use


this measure when you want to minimize the
problems a user may encounter in the product.

Learnability

A task is repeated at least once to


determine whether the time on task
is shorter, fewer errors are made, or
the task is more successful.

Important to measure whether the interface will


be easier to use over time.

Satisfaction

Enumerates participants overall


feelings about a product before,
during and/or after a test.

Allows the participants to quantify and describe


their emotional reaction to a product before,
during or after a study.

Mouse
Clicks

Measures the number of clicks that


a participant makes.

Measures the effectiveness and efficiency of a


product, suggests that a participant was able to
accomplish a task with less effort.

Mouse
Movement

Measures the distance the mouse


travels.

Measures efficiency, suggests that a participant


was able to accomplish a task with less effort.

Problem/
Issue Counts

Records, counts, ranks and/or


categorizes problems observed.

Provides an overview of the issues that may be


causing other measures to be less ideal. Allows
comparison across studies to determine
improvement. These are often weighted by how
severe an issue may be.

Optimal Path

Observes the path a participant


takes to accomplish a task, and
compares it to a predefined optimal
path.

Measures the variance from the ideal path.

Make Your
Own

With Rich Recording Technology


data, you can design the study that
fits your needs.

Unlimited

www.techsmith.com

Usability Testing Basics

Select Data to Capture


Rich Recording Technology will automatically record a wide set of data about user activity and input on the
computer. You can set up markers in Recorder that will let your observers log other activity, including:
Task start and end points (to record time on task)
Places where the participant:
Reaches a milestone
Makes an error
Fails to complete a task
Accesses help (or asks the facilitator)
Encounters a problem

Capture Qualitative Data


Some things are not measured with numbers. Reactions, quotes, facial expressions and participant behaviors
(like gesturing, pushing a chair back, and so on) are also important data points that require a human to
interpret. Alert your observers to make note when these things happen as well youll be able to highlight
them later, as you review your recording. Set markers for quotes and behaviors in the Morae Study
Configuration for observers to use.

Identify Success Paths


For each task you write, its good practice to have all stakeholders agree on the success paths so everyone
has a common understanding about when participants are successful and when they are not. You might
decide that there is only one success path or several depending on your product.
Test observers can then help count errors and problems associated with each task and you can identify when
participants are able to successfully complete tasks or not.

Recruit Participants
Recruiting is one of the most important components of a usability test. Your participants should adequately
reflect your true base of users and the user types you have decided to test, and represent a range of new and
experienced users in a way that would actually use your product.

Recruitment Ideas
Use your own customer databases or contacts
Hire an outside agency: look for market research firms if there are none specializing in usability
recruiting, good screeners are vital. There is a cost per candidate
Post on Craigs List: dont identify your company, just qualifications
Post something on your web site: start a usability testing page where site visitors can sign up to
participate
Place and ad in the paper: good for local audiences
When doing your own recruiting you should identify criteria that will help you select qualified
candidates. Experience with the product or the field, computer experience, age and other
demographics may be important to consider. See Appendix A: Participant Recruitment Screener.
When recruiting using an outside recruiting firm, a screener helps you get the right participants. A recruiting
screener is used to determine if a potential participant matches the user characteristics defined in the usability
test protocol. Include questions about demographics, frequency of use, experience level, etc.

www.techsmith.com

Usability Testing Basics

Ask questions that will help you filter out participants that dont match your target users and indicate when to
thank people for their time and let them know that they do not qualify. Ask enough questions so that you know
you have the right people. For example, qualified participants for a test of an online shopping site should have
access to a computer at home or at work and meet the other required demographics (age range, etc).

Compensation
You will need to think about what kind of compensation you will offer participants. Typically participants get
cash or an equivalent, a gift certificate, even merchandise from your company.

Prepare for Test Sessions


Now you have your test protocol and your participants, youre ready to get started.

Setting
The most important factors are that the environment be comfortable for participants, similar to their real-world
environment, and reasonably similar between participants.

Schedule Participants
Schedule your participants to have adequate time to work through the test at their own pace. Allow enough
time between sessions to reset, debrief and regroup.

Stakeholders
When working your stakeholders, help them understand how you will conduct your testing. Stakeholders need
to understand how you will be interacting with your participant during test sessions. They need to understand
that you are there to facilitate the test and observe behavior, not help the participant complete tasks. There
are two basic models:
Facilitator interacts with the participant you often get more qualitative information, especially
when the facilitator is good at asking neutral questions and encouraging participants to find their own
answers in the product.
Facilitator does not interact with the participant you can get more natural behavior, but
participants are left to struggle or quit on their own. You often will not get as much qualitative data as
participants may not talk out loud as much. You may get more accurate measures of time on task and
failure, however.

Observers
At least one person can be enlisted to help you log all of your recordings for the data points youve
set out. By having someone else log the sessions, the facilitator can concentrate on the test. At the
same time, the recording will capture a rich set of data for later analysis.
In addition to a designated person to help you observe and log data, there may be a long list of
stakeholders who will benefit from observing a test. They commonly include the developers,
managers, product managers, quality testing analysts, sales and marketing staff, technical support
and documentation.
Watching users actually struggle with a product is a powerful experience. Observing test sessions helps make
the team open to making changes.
Remember to warn your observers not to drive changes until you and the team have had an opportunity to
analyze all testing and decide upon changes that will address the root causes.

www.techsmith.com

Usability Testing Basics

Script
Create a facilitator script to help you and your facilitators present a consistent set of information to your
participants. The script will also serve as a reminder to you to say certain things to your participants and
provide them appropriate paper work at the right times.
In your script, remind participants that the usability test is an evaluation of the product and not of their
performance, that all problems they find are helpful and that their feedback is valuable, and let participants
know their data will be aggregated with the rest of the participant data and they will not be identified.

Questionnaires and Surveys


A typically usability study usually has at least two surveys (questionnaires), one administered before the
participant starts tasks and one administered at the end of the test, which collects subjective information such
as how satisfied participant were with the product and how easy it is to use. You can also administer a survey
after each task; these are typically used to measure satisfaction or ease of use for each task.
Pre-test survey collects demographic and product usage data about participants such as computer
use, online shopping habits, internet usage, age, gender, etc and should help product teams
understand more about typical customers and how they are reflected in the test participants
Post-task survey questions that rate subjective ease of use or satisfaction for each task. Ask other
questions or other types of questions when appropriate. Limit the number of post-task questions so as
not to overwhelm participants.
Post-test survey post-test surveys are often used to measure satisfaction; use SUS for a standard
satisfaction measure or use your own questions. See our templates for ideas of questions you might
use.

Survey Dos and Donts


DO: Check with your HR and legal departments to make sure there are no regulations or
requirements about the data you can collect.
DO: Use age ranges rather than specific ages when asking participants for their age.
DO: Include comment fields for questions where you want to hear more from participants.
DONT: Collect gender information; if you want to collect gender note that information separately
based on observation.

Conduct Test Sessions


Good practices for conducting your test start before the participant comes and follows through after he or she
leaves.

Begin with a Run-through


Run through your test yourself or with someone else to make sure the tasks make sense and can be
completed with the version of the product you are testing
Conduct a pilot test with a participant this participant can be a co-worker or someone you have
access to that would be part of the target audience
Allow enough time before the test session to make changes.

At the Test Session


Welcome your participant and make them comfortable

10

www.techsmith.com

Usability Testing Basics

Use the script to help you remember what you need to do and say
Ask participants to fill out the consent form (include a non-disclosure agreement if your company
requires one)
Remember your facilitation skills and start the test
Allow enough time between test sessions to set equipment and prepare for your next participant
See Usability Testing and Morae for details on using Morae when conducting your test sessions.

Facilitation
Technique matters: An impartial facilitator conducts the test without influencing the participant. The facilitator
keeps the test flowing, provides simple directions, and keeps the participant focused. The facilitator may be
located near the participant or in another room with an intercom system. Often, participants are asked to keep
a running narration (called the think-aloud protocol) and the facilitator must keep the participant talking.

Test Session Dos and Donts


DO: Ensure the participants physical comfort.
DO: Ask open ended questions
o What are you thinking right now?
o What are you trying to do?
o Is there anything else you might try?
o Where would you go?
o What did you expect to happen?
o You seemed surprised or frustrated?
o Exactly how did that differ from what you expected to happen?
o Would you expect that information to be provided?
o Please keep talking
DO: Provide open-ended hints only when asked: Do you see something that will help you?
DONT: Provide direction or tell the user how to accomplish the task.
DONT: Offer approval or disapproval with words, facial expressions or body language.
DONT: Crowd the participant physically; allow the participant to move, take a break or quit.
DONT: Make notes only when the participant does something interestingkeep the sound of your
keyboard or pen consistent so that you avoid giving clues to the participant.

Techniques for Task Failures


Occasionally, a participant will fail to complete or will outright quit trying to complete a task. Indirect hints or
encouragement such as is there anything on the screen to help you? may be used to encourage the
participant to explore, but at some point he or she should be allowed to fail.
If a participant fails a task but needs the information from that task to continue, a recommended technique is
to count the failure but have the participant try the required portion of the first task again. Doing this lets you
understand better how long it takes participants to get a particular interaction. You can then gage how easy
or hard it is to learn to perform the task and more about where they might be confused by your product.
Provide a more direct hint only as a very last resort.

After the Session


Reset your machine, clear data and save the participants work, if appropriate.

www.techsmith.com

11

Usability Testing Basics

Debrief with observers to note trends and issues


Clean the environment for the next participant.

Analyze Your Study


Analyzing is a three step process:
Step 1: Identify exactly what you observed
Step 2: Identify the causes of any problems
Step 3: Determine Solutions
Source: Whitney Quesenbery

Step 1: Identify exactly what you observed


Your analysis following the test lets you find the critical problems and issues that help you design a better
product. Review what youve seen and note:
How did people perform? Were they successful?
How long did it take them to complete a task?
What mistakes were made?
What problems did they encounter? Where?
How often and how many problems did they have?
How did they feel? What did they say? What significant things did they do?
What worked well?
You can begin with a review your recordings for those measures that you selected when you designed the
test. Create your project, import the recordings of each participant, and look for the data you defined when
you planned your test.

Step 2: Identify the causes of any problems


Ask yourself and the team a series of questions about the problems observed.
Was there a problem with workflow? The navigation? The terminology?
How severe was the problem? Did the participant fail or was the participant significantly delayed? Did
it present an obstacle? How difficult was the obstacle to overcome?
Why? Why did the participant have a problem? After you ask that question, ask Why? again.
Repeat that process until you reach the fundamental, underlying problem.

Step 3: Determine Solutions


In some cases, the researcher is tasked to make recommendations. In other environments, solutions are
determined by designers and/or development teams. The researcher can mentor solutions that address the
root causes of a usability problem and meet the needs of the user.
One technique is to have all stakeholders meet to review the findings and determine recommendations at a
debrief meeting. Diagramming problems, listing them and discussing each can produce a shared
understanding of how to address the problems.
Even when working alone, it is essential that you discuss your usability recommendations with your team
developers, marketing, sales to learn what works and what doesn't work from a business and technical
point of view. If you are the person from whom recommendations are expected, solicit other opinions and be
prepared to set your ideas aside.

12

www.techsmith.com

Usability Testing Basics

Tips for Great Recommendations


Use your data to form conclusions and drive design changes.
Remember to note that good things have happened; mention them first.
Make sure your recommendations address the original problem noted, and limit the recommendation
to that original problem. Create solutions that address the cause of the problem, not the symptoms.
Keep them short and to the point.
Make your recommendations specific. For example, rather than recommending a change in a
workflow, diagram the optimal workflow based on the test findings.
Address the needs of as many users (or types of users) as possible.
Recommend the least possible change, and then recommend a quick usability test to see if youve
solved the problem. If not, try another tweak, or move on to a larger change.
A picture or video is worth a thousand words: enhance your recommendations with wireframes, video
clips and annotated screenshots.
Use the language of your audience: executives, developers, etc.
Show an interest in what happens to your great recommendations. Ask follow-up questions if your
great recommendations are not followed. Maybe you can learn something.

Avoid making recommendations that:


Are based on opinions.
Are vague or not actionable.
Only list complaints.
Create a new set of problems for users.
Are targeted only to a single type of user, for example, a design targeted for expert users at the
expense of other types.
Source: Rolf Molich, et al

Deliverables
Deliverables reports, presentations, highlight videos and so on -- document what was done for future
reference. They often detail the usability problems found during the test plus any other data such as time on
task, error rate, satisfaction, etc. The Usability Test Report on the Morae Resource CD is one template you
might use to report results. Generally speaking, report or presentations will include:
Summary
Description of the product and the test objectives
Method
Participants
Context of the test
Tasks
Testing environment and equipment
Experiment design
What was measured (Metrics)
Results and findings
How participants fared (Graphs and tables)
Why they might not have done well (or why they did do well)

www.techsmith.com

13

Usability Testing Basics

Recommendations or Next Steps


Depending on the project objectives and stakeholders, the report can also take the form of a presentation.
Morae makes it easy for you to include highlight videos at important points, to illustrate the problem in the
participants own words.

Resources
A Practical Guide to Usability Testing, Revised by Joe Dumas and Ginny Redish, Intellect, 2
Edition, 1999

nd

Usability Testing and Research by Carol Barnum, Longman, 2002


Handbook of Usability Testing: how to Plan, Design and Conduct Effective Tests by Jeff Rubin and Dana
nd
Chisnell, Wiley 2 Edition, 2008

References
Usability and Accessibility STEC Workshop 2008 Whitney Quesenbery
Recommendations on Recommendations, Rolf Molich, Kasper Hornbaek, Steve Krug, Jeff Johnson,
Josephine Scott, 2008, accepted for publication in User Experience Magazine, issue 7.4, October 2008

14

www.techsmith.com

Usability Testing Basics

Appendix A: Participant Recruitment Screener


The usability test of the X Product requires 12 participants from 2 user groups.
USER TYPE

NUMBER

CHARACTERISTICS

Experienced
product users

Current product users/customers who have used X Product


for at least 1 year and use it at least 3 times a month
3 males, 3 females

New product
users

People who have no prior experience with X Product, but do


have at least 1 years experience using similar products
(e.g. data processing tools).
3 males, 3 females

Participation: All participants will spend about 60 minutes in the usability session. Incentive will be $50 in
cash.
Schedule: The usability tests will be conducted from May 5-7, 2008. Use schedule of available testing time
slots to schedule individual participants once they have passed the recruitment screener.
AVAILABLE TIME
SLOTS

TUES. MAY 5

WED. MAY 6

THURS. MAY 7

9-10 am

10:30-11:30 am
1-2 pm
2:30-3:30 pm
4-5 pm

Recruitment Script
Introduction
Hello, may I speak with ________. We are looking for participants to take part in a research study evaluating
the usability of the X Product. There will be $50 cash in compensation for the hour long session, which will
take the X Building located downtown. The session would involve one-on-one meeting with a researcher
where you would sit down in front of a computer and try to use a product while being observed and answering
questions about the product.
Would you be interested in participating?
If not: Thank you for taking the time to speak with me. If you know of anyone else who might be
interested in participating please have them call me, [Name], at 555-1234.

www.techsmith.com

15

Usability Testing Basics

Screening
I need to ask you a couple of questions to determine whether you meet the eligibility criteriaDo you have a
couple of minutes?
If not: When is a good time to call back?
Keep in mind that your answers to these questions to not automatically allow or disallow you take part in the
studywe just need accurate information about your background, so please answer as well as you can.
Have you ever used X product?
If yes:
How long have you used it for? [criteria: at least 1 yr.]
And how often do you use it? [criteria: at least 3 times a month]
If no:
Have you ever used any data processing products, such as [list competitor or similar products]?
[criteria: Yes]
If yes: How long have you used it for? [criteria: at least 1 yr.]
And how often do you use it? [criteria: at least 3 times a month]
Self-identify participant gender via voice and name and other cues.

Scheduling
If participant meets criteria: Will you be able to come to the X Building located downtown for one hour
between May 15 and 19? Free parking is available next to the building.
How is [name available times and dates]?
You will be participating in a one-on-one usability test session on [date and time]. Do you require any special
accommodations?
I need to have an e-mail address to send specific directions and confirmation information to. Thanks again!
If participant does not meet criteria: Unfortunately, you do not fit the criteria for this particular evaluation and
will not be able to participate. Thank you for taking the time to speak with me.
Use the screener questions in this script can in an email address for written recruitment.

16

www.techsmith.com

Chapter 13

Functional Testing
A functional specification is a description of intended program1 behavior,
distinct from the program itself. Whatever form the functional specification
takes whether formal or informal it is the most important source of information for designing tests. The set of activities for deriving test case specifications from program specifications is called functional testing.
Functional testing, or more precisely, functional test case design, attempts
to answer the question What test cases shall I use to exercise my program?
considering only the specification of a program and not its design or implementation structure. Being based on program specifications and not on the
internals of the code, functional testing is also called specification-based or
black-box testing.
Functional testing is typically the base-line technique for designing test
cases, for a number of reasons. Functional test case design can (and should)
begin as part of the requirements specification process, and continue through
each level of design and interface specification; it is the only test design technique with such wide and early applicability. Moreover, functional testing is
effective in finding some classes of fault that typically elude so-called whitebox or glass-box techniques of structural or fault-based testing. Functional testing techniques can be applied to any description of program behavior, from an informal partial description to a formal specification and at
any level of granularity, from module to system testing. Finally, functional
test cases are typically less expensive to design and execute than white-box
tests.

1 In this chapter we use the term program generically for the artifact under test, whether
that artifact is a complete application or an individual unit together with a test harness. This is
consistent with usage in the testing research literature.

47

Functional Testing

48

Required Background

Chapters 14 and 15:


The material on control and data flow graphs is required to understand
section 13.7, but it is not necessary to comprehend the rest of the chapter.
Chapter 27:
The definition of pre- and post-conditions can be helpful in understanding section 13.8, but it is not necessary to comprehend the rest of the
chapter.

13.1

Overview

In testing and analysis aimed at verification2 that is, at finding any discrepancies between what a program does and what it is intended to do
one must obviously refer to requirements as expressed by users and specified
by software engineers. A functional specification, i.e., a description of the expected behavior of the program, is the primary source of information for test
case specification.
Functional testing, also known as black-box or specification-based testing, denotes techniques that derive test cases from functional specifications.
Usually functional testing techniques produce test case specifications that
identify classes of test cases and be be instantiated to produce individual test
cases.
A particular functional testing technique may be effective only for some
kinds of software or may require a given specification style. For example,
a combinatorial approach may work well for functional units characterized
by a large number of relatively independent inputs, but may be less effective for functional units characterized by complex interrelations among inputs. Functional testing techniques designed for a given specification notation, e.g., finite state machines or grammars, are not easily applicable to other
specification styles.
The core of functional test case design is partitioning the possible behaviors of the program into a finite number of classes that can reasonably expected to consistently be correct or incorrect. In practice, the test case designer often must also complete the job of formalizing the specification far
enough to serve as the basis for identifying classes of behaviors. An important side effect of test design is highlighting weaknesses and incompleteness
of program specifications.
Deriving functional test cases is an analytical process which decomposes
specifications into test cases. The myriad of aspects that must be taken into
2 Here we focus on software verification as opposed to validation (see Chapter 2). The problems of validating the software and its specifications, i.e., checking the program behavior and its
specifications with respect to the users expectations, is treated in Chapter 12.

Draft version produced 20th March 2002

Overview

49


Test cases and test suites can be derived from several sources of information, including specifications (functional testing), detailed design and source code (structural testing), and hypothesized defects (fault-based testing). Functional test case design is an
indispensable base of a good test suite, complemented but never replaced by by structural and fault-based testing, because there are classes of faults that only functional testing effectively detects. Omission of a feature, for example, is unlikely to be revealed by
techniques which refer only to the code structure.
Consider a program that is supposed to accept files in either plain ASCII text, or
HTML, or PDF formats and generate standard PostScript. Suppose the programmer overlooks the PDF functionality, so the program accepts only plain text and HTML files. Intuitively, a functional testing criterion would require at least one test case for each item in
the specification, regardless of the implementation, i.e., it would require the program to
be exercised with at least one ASCII, one HTML, and one PDF file, thus easily revealing
the failure due to the missing code. In contrast, criterion based solely on the code would
not require the program to be exercised with a PDF file, since all of the code can be exercised without attempting to use that feature. Similarly, fault-based techniques, based on
potential faults in design or coding, would not have any reason to indicate a PDF file as a
potential input even if missing case were included in the catalog of potential faults.
A functional specification often addresses semantically rich domains, and we can use
domain information in addition to the cases explicitly enumerated in the program specification. For example, while a program may manipulate a string of up to nine alphanumeric characters, the program specification may reveal that these characters represent a
postal code, which immediately suggests test cases based on postal codes of various localities. Suppose the program logic distinguishes only two cases, depending on whether
they are found in a table of U.S. zip codes. A structural testing criterion would require
testing of valid and invalid U.S. zip codes, but only consideration of the specification and
richer knowledge of the domain would suggest test cases that reveal missing logic for
distinguishing between U.S.-bound mail with invalid U.S. zip codes and mail bound to
other countries.
Functional testing can be applied at any level of granularity where some form of specification is available, from overall system testing to individual units, although the level of
granularity and the type of software influence the choice of the specification styles and
notations, and consequently the functional testing techniques that can be used.
In contrast, structural and fault-based testing techniques are invariably tied to program structures at some particular level of granularity, and do not scale much beyond
that level. The most common structural testing techniques are tied to fine-grain program structures (statements, classes, etc.) and are applicable only at the level of modules
or small collections of modules (small subsystems, components, or libraries).

Draft version produced 20th March 2002

Functional Testing

50

account during functional test case specification makes the process error prone.
Even expert test designers can miss important test cases. A methodology for
functional test design systematically helps by decomposing the functional
test design activity into elementary steps that cope with single aspect of the
process. In this way, it is possible to master the complexity of the process and
separate human intensive activities from activities that can be automated.
Systematic processes amplify but do not substitute for skills and experience
of the test designers.
In a few cases, functional testing can be fully automated. This is possible
for example when specifications are given in terms of some formal model,
e.g., a grammar or an extended state machine specification. In these (exceptional) cases, the creative work is performed during specification and design
of the software. The test designers job is then limited to the choice of the test
selection criteria, which defines the strategy for generating test case specifications. In most cases, however, functional test design is a human intensive
activity. For example, when test designers must work from informal specifications written in natural language, much of the work is in structuring the
specification adequately for identifying test cases.

13.2

Random versus Partition Testing Strategies

With few exceptions, the number of potential test cases for a given program
is unimaginably huge so large that for all practical purposes it can be considered infinite. For example, even a simple function whose input arguments
legal inputs. In contrast to input spaces,
are two 32-bit integers has
budgets and schedules are finite, so any practical method for testing must select an infinitesimally small portion of the complete input space.
Some test cases are better than others, in the sense that some reveal faults
and others do not.3 Of course, we cannot know in advance which test cases
reveal faults. At a minimum, though, we can observe that running the same
test case again is less likely to reveal a fault than running a different test case,
and we may reasonably hypothesize that a test case that is very different from
the test cases that precede it is more valuable than a test case that is very
similar (in some sense yet to be defined) to others.
As an extreme example, suppose we are allowed to select only three test
cases for a program that breaks a text buffer into lines of 60 characters each.
Suppose the first test case is a buffer containing 40 characters, and the second
is a buffer containing 30 characters. As a final test case, we can choose a buffer
containing 16 characters or a buffer containing 100 characters. Although we
cannot prove that the 100 character buffer is the better test case (and it might
not be; the fact that 16 is a power of 2 might have some unforeseen significance), we are naturally suspicious of a set of tests which is strongly biased
toward lengths less than 60.

3 Note that the relative value of different test cases would be quite different if our goal were to
measure dependability, rather than finding faults so that they can be repaired.

Draft version produced 20th March 2002

Random versus Partition Testing Strategies

51


While the informal meanings of words like test may be adequate for everyday conversation, in this context we must try to use terms in a more precise and consistent manner. Unfortunately, the terms we will need are not always used consistently in the literature, despite the existence of an IEEE standard that defines several of them. The terms
we will use are defined below.
Independently testable feature (ITF): An ITF is a functionality that can be tested independently of other functionalities of the software under test. It need not correspond
to a unit or subsystem of the software. For example, a file sorting utility may be capable of merging two sorted files, and it may be possible to test the sorting and
merging functionalities separately, even though both features are implemented by
much of the same source code. (The nearest IEEE standard term is test item.)
As functional testing can be applied at many different granularities, from unit testing through integration and system testing, so ITFs may range from the functionality of an individual Java class or C function up to features of a integrated system
composed of many complete programs. The granularity of an ITF depends on the
exposed interface at whichever granularity is being tested. For example, individual
methods of a class are part of the interface of the class, and a set of related methods
(or even a single method) might be an ITF for unit testing, but for system testing the
ITFs would be features visible through a user interface or application programming
interface.
Test case: A test case is a set of inputs, execution conditions, and expected results. The
term input is used in a very broad sense, which may include all kinds of stimuli
that contribute to determining program behavior. For example, an interrupt is as
much an input as is a file. (This usage follows the IEEE standard.)
Test case specification: The distinction between a test case specification and a test case
is similar to the distinction between a program and a program specification. Many
different test cases may satisfy a single test case specification. A simple test specification for a sorting method might require an input sequence that is already in
sorted order. A test case satisfying that specification might be sorting the particular
vector (alpha, beta, delta.) (This usage follows the IEEE standard.)
Test suite: A test suite is a set of test cases. Typically, a method for functional testing
is concerned with creating a test suite. A test suite for a program, a system, or an
individual unit may be made up of several test suites for individual ITFs. (This usage
follows the IEEE standard.)
Test: We use the term test to refer to the activity of executing test cases and evaluating
their result. When we refer to a test, we mean execution of a single test case, except where context makes it clear that the reference is to execution of a whole test
suite. (The IEEE standard allows this and other definitions.)

Draft version produced 20th March 2002

Functional Testing

52

Accidental bias may be avoided by choosing test cases from a random distribution. Random sampling is often an inexpensive way to produce a large
number of test cases. If we assume absolutely no knowledge on which to
place a higher value on one test case than another, then random sampling
maximizes value by maximizing the number of test cases that can be created
(without bias) for a given budget. Even if we do possess some knowledge suggesting that some cases are more valuable than others, the efficiency of random sampling may in some cases outweigh its inability to use any knowledge
we may have.
Consider again the line-break program, and suppose that our budget is
one day of testing effort rather than some arbitrary number of test cases. If the
cost of random selection and actual execution of test cases is small enough,
then we may prefer to run a large number of random test cases rather than
expending more effort on each of a smaller number of test cases. We may in
a few hours construct programs that generate buffers with various contents
and lengths up to a few thousand characters, as well as an automated procedure for checking the program output. Letting it run unattended overnight,
we may execute a few million test cases. If the program does not correctly
handle a buffer containing a sequence of more than 60 non-blank characters
(a single word that does not fit on a line), we are likely to encounter this
case by sheer luck if we execute enough random tests, even without having
explicitly considered this case.
Even a few million test cases is an infinitesimal fraction of the complete
input space of most programs. Large numbers of random tests are unlikely
to find failures at single points (singularities) in the input space. Consider,
for example, a simple procedure for returning the two roots of a quadratic

and suppose we choose test inputs (values of the


equation
to
coefficients , , and ) from a uniform distribution ranging from
. While uniform random sampling would certainly cover cases in which

(where the equation has no real roots), it would be very unlikely to


test the case in which
and
, in which case a naive implementation
of the quadratic formula

will divide by zero (see Figure 13.1).


Of course, it is unlikely that anyone would test only with random values.
Regardless of the overall testing strategy, most test designers will also try some
special values. The test designers intuition comports with the observation
that random sampling is an ineffective way to find singularities in a large
input space. The observation about singularities can be generalized to any
characteristic of input data that defines an infinitesimally small portion of
the complete input data space. If again we have just three real-valued inputs
, , and , there is an infinite number of choices for which
, but random
sampling is unlikely to generate any of them because they are an infinitesimal
part of the complete input data space.
Draft version produced 20th March 2002

Random versus Partition Testing Strategies

- 1:
- 2:
- 3:
- 4:
- 5:
- 6:
- 7:
- 8:
- 9:
-10:
-11:
-12:
-13:
-14:
-15:
-16:
-17:
-18:
-19:
-20:
-21:
-22:
-23:
-24:
-25:
-26:
-27:
-28:
-29:
-30:
-31:
-32:
-33:
-34:
-35:
-36:
-37:
-38:
-39:
-40:
-41:

53

Figure 13.1: The Java class roots, which finds roots of a quadratic equation.
The case analysis in the implementation is incomplete: It does not properly
and
. We cannot anticipate all such
handle the case in which
faults, but experience teaches that boundary values identifiable in a specification are disproportionately valuable. Uniform random generation of even
large numbers of test cases is ineffective at finding the fault in this program,
but selection of a few special values based on the specification quickly uncovers it.

Draft version produced 20th March 2002

Functional Testing

54

The observation about special values and random samples is by no means


limited to numbers. Consider again, for example, breaking a text buffer into
lines. Since line breaks are permitted at blanks, we would consider blanks a
special value for this problem. While random sampling from the character
set is likely to produce a buffer containing a sequence of at least 60 non-blank
characters, it is much less likely to produce a sequence of 60 blanks.
The reader may justifiably object that a reasonable test designer would not
create text buffer test cases by sampling uniformly from the set of all characters, but would instead classify characters depending on their treatment,
lumping alphabetic characters into one class and white space characters into
another. In other words, a test designer will partition the input space into
classes, and will then generate test data in a manner that is likely to choose
data from each partition.4 Test designers seldom use pure random sampling;
usually they exploit some knowledge of application semantics to choose samples that are more likely to include special or trouble-prone regions of the
input space.
A testing method that divides the infinite set of possible test cases into a
finite set of classes, with the purpose of drawing one or more test cases from
each class, is called a partition testing method. When partitions are chosen
according to information in the specification, rather than the design or implementation, it is called specification-based partition testing, or more briefly,
functional testing. Note that not all testing of product functionality is functional testing. Rather, the term is used specifically to refer to systematic testing based on a functional specification. It excludes ad hoc and random testing, as well as testing based on the structure of a design or implementation.
Partition testing typically increases the cost of each test case, since in addition to generation of a set of classes, creation of test cases from each class
may be more expensive than generating random test data. In consequence,
partition testing usually produces fewer test cases than random testing for
the same expenditure of time and money. Partitioning can therefore be advantageous only if the average value (fault-detection effectiveness) is greater.
If we were able to group together test cases with such perfect knowledge
that the outcome of test cases in each class were uniform (either all successes, or all failures), then partition testing would be at its theoretical best.
In general we cannot do that, nor even quantify the uniformity of classes of
test cases. Partitioning by any means, including specification-based partition
testing, is always based on experience and judgment that leads one to believe
that certain classes of test case are more alike than others, in the sense that
failure-prone test cases are likely to be concentrated in some classes. When
we appealed above to the test designers intuition that one should try boundary cases and special values, we were actually appealing to a combination of
experience (many failures occur at boundary and special cases) and knowl4 We are using the term partition in a common but rather sloppy sense. A true partition
would separate the input space into disjoint classes, the union of which is the entire space. Partition testing separates the input space into classes whose union is the entire space, but the classes
may not be disjoint.

Draft version produced 20th March 2002

A Systematic Approach

55

edge that identifiable cases in the specification often correspond to classes of


input that require different treatment by an implementation.
Given a fixed budget, the optimum may not lie in only partition testing or
only random testing, but in some mix that makes use of available knowledge.
For example, consider again the simple numeric problem with three inputs,
, , and . We might consider a few special cases of each input, individually
and in combination, and we might consider also a few potentially-significant
). If no faults are revealed by these few test cases,
relationships (e.g.,
there is little point in producing further arbitrary partitions one might then
turn to random generation of a large number of test cases.

13.3 A Systematic Approach


Deriving test cases from functional specifications is a complex analytical process that partitions the input space described by the program specification.
Brute force generation of test cases, i.e., direct generation of test cases from
program specifications, seldom produces acceptable results: test cases are
generated without particular criteria and determining the adequacy of the
generated test cases is almost impossible. Brute force generation of test cases
relies on test designers expertise and is a process that is difficult to monitor
and repeat. A systematic approach simplifies the overall process by dividing
the process into elementary steps, thus decoupling different activities, dividing brain intensive from automatable steps, suggesting criteria to identify adequate sets of test cases, and providing an effective means of monitoring the
testing activity.
Although suitable functional testing techniques can be found for any granularity level, a particular functional testing technique may be effective only
for some kinds of software or may require a given specification style. For example, a combinatorial approach may work well for functional units characterized by a large number of relatively independent inputs, but may be less
effective for functional units characterized by complex interrelations among
inputs. Functional testing techniques designed for a given specification notation, e.g., finite state machines or grammars, are not easily applicable to
other specification styles. Nonetheless we can identify a general pattern of
activities that captures the essential steps in a variety of different functional
test design techniques. By describing particular functional testing techniques
as instantiations of this general pattern, relations among the techniques may
become clearer, and the test designer may gain some insight into adapting
and extending these techniques to the characteristics of other applications
and situations.
Figure 13.2 identifies the general steps of systematic approaches. The
steps may be difficult or trivial depending on the application domain and the
available program specifications. Some steps may be omitted depending on
the application domain, the available specifications and the test designers
expertise. Instances of the process can be obtained by suitably instantiating
Draft version produced 20th March 2002

56

Functional Testing
different steps. Although most techniques are presented and applied as stand
alone methods, it is also possible to mix and match steps from different techniques, or to apply different methods for different parts of the system to be
tested.
Identify Independently Testable Features Functional specifications can be
large and complex. Usually, complex specifications describe systems that can
be decomposed into distinct features. For example, the specification of a web
site may include features for searching the site database, registering users
profiles, getting and storing information provided by the users in different
forms, etc. The specification of each of these features may comprise several
functionalities. For example, the search feature may include functionalities
for editing a search pattern, searching the data base with a given pattern,
and so on. Although it is possible to design test cases that exercise several
functionalities at once, the design of different tests for different functionalities can simplify the test generation problem, allowing each functionality to
be examined separately. Moreover, it eases locating faults that cause the revealed failures. It is thus recommended to devise separate test cases for each
functionality of the system, whenever possible.
The preliminary step of functional testing consists in partitioning the specifications into features that can be tested separately. This can be an easy step
for well designed, modular specifications, but informal specifications of large
systems may be difficult to decompose into independently testable features.
Some degree of formality, at least to the point of careful definition and use of
terms, is usually required.
Identification of functional features that can be tested separately is different from module decomposition. In both cases we apply the divide and
conquer principle, but in the former case, we partition specifications according to the functional behavior as perceived by the users of the software under
test,5 while in the latter, we identify logical units that can be implemented
separately. For example, a web site may require a sort function, as a service
routine, that does not correspond to an external functionality. The sort function may be a functional feature at module testing, when the program under
test is the sort function itself, but is not a functional feature at system test,
while deriving test cases from the specifications of the whole web site. On
the other hand, the registration of a new user profile can be identified as one
of the functional features at system level testing, even if such functionality is
implemented with several modules (unit at the design level) of the system.
Thus, identifying functional features does not correspond to identifying single modules at the design level, but rather to suitably slicing the specifications
to be able to attack their complexity incrementally, aiming at deriving useful
test cases for the whole system under test.
5 Here the word user indicates who uses the specified service. It can be the user of the system,
when dealing with specification at system level; but it can be another module of the system,
when dealing with specifications at unit level.

Draft version produced 20th March 2002

A Systematic Approach

57

Identify
Independently
Testable
Features

Functional Specifications

Independently Testable Feature


fy
nti tive
Ide enta
s
s
pre lue
Re Va

Brute
Force
Testing

D
a M erive
od
el

Representative Values

Model

Ge

Semantic Constraints
Combinatorial Selection
Exaustive Enumeration
Random Selection

Finite State Machine


Grammar
Algebraic Specification
Logic Specification
Control/Data Flow Graph

se
Ca
st- ns
e
T io
te at
ra ific
ne pec
e
G S

ne
Sp rate
ec Te
ific st
ati -Ca
on se
s

Test Selection Criteria

Generate
Test Cases

Test Case Specifications

Manual Mapping
Symbolic Execution
A-posteriori Satisfaction

Instantiate
Tests

Test Cases

Scaffolding

Figure 13.2: The main steps of a systematic approach to functional program


testing.

Draft version produced 20th March 2002

58

Functional Testing
Independently testable features are described by identifying all the inputs
that form their execution environments. Inputs may be given in different
forms depending on the notation used to express the specifications. In some
cases they may be easily identifiable. For example, they can be the input alphabet of a finite state machine specifying the behavior of the system. In
other cases, they may be hidden in the specification. This is often the case of
informal specifications, where some inputs may be given explicitly as parameters of the functional unit, but other inputs may be left implicit in the description. For example, a description of how a new user registers at a web site
may explicitly indicate the data that constitutes the user profile to be inserted
as parameters of the functional unit, but may leave implicit the collection of
elements (e.g., database) in which the new profile must be inserted.
Trying to identify inputs may help in distinguishing different functions.
For example, trying to identify the inputs of a graphical tool may lead to a
clearer distinction between the graphical interface per se and the associated
calbacks to the application. With respect to the web-based user registration
function, the data to be inserted in the database are part of the execution
environment of the functional unit that performs the insertion of the user
profile, while the combination of fields that can be use to construct such data
is part of the execution environment of the functional unit that takes care of
the management of the specific graphical interface.
Identify Representative Classes of Values or Derive a Model The execution
environment of the feature under test determines the form of the final test
cases, which are given as combinations of values for the inputs to the unit.
The next step of a testing process consists of identifying which values of each
input can be chosen to form test cases. Representative values can be identified directly from informal specifications expressed in natural language. Alternativey, representative values may be selected indirectly through a model,
which can either be produced only for the sake of testing or be available as
part of the specification. In both cases, the aim of this step is to identify the
values for each input in isolation, either explicitly through enumeration, or
implicitly trough a suitable model, but not to select suitable combinations of
such values, i.e., test case specifications. In this way, we separate the problem of identifying the representative values for each input, from the problem
of combining them to obtain meaningful test cases, thus splitting a complex
step into two simpler steps.
Most methods that can be applied to informal specifications rely on explicit enumeration of representative values by the test designer. In this case,
it is very important to consider all possible cases and take advantage of the information provided by the specification. We may identify different categories
of expected values, as well as boundary and exceptional or erroneous values.
For example, when considering operations on a non-empty lists of elements,
we may distinguish the cases of the empty list (an error value) and a singleton element (a boundary value) as special cases. Usually this step determines
Draft version produced 20th March 2002

A Systematic Approach

59

characteristics of values (e.g., any list with a single element) rather than actual
values.
Implicit enumeration requires the construction of a (partial) model of the
specifications. Such a model may be already available as part of a specification or design model, but more often it must be constructed by the test
designer, in consultation with other designers. For example, a specification
given as a finite state machine implicitly identifies different values for the inputs by means of the transitions triggered by the different values. In some
cases, we can construct a partial model as a mean for identifying different
values for the inputs. For example, we may derive a grammar from a specification and thus identify different values according to the legal sequences of
productions of the given grammar.
Directly enumerating representative values may appear simpler and less
expensive than producing a suitable model from which values may be derived. However, a formal model may also be valuable in subsequent steps of
test case design, including selection of combinations of values. Also, a formal model may make it easier to select a larger or smaller number of test
cases, balancing cost and thoroughness, and may be less costly to modify and
reuse as the system under test evolves. Whether to invest effort in producing a
model is ultimately a management decision that depends on the application
domain, the skills of test designers, and the availability of suitable tools.
Generate Test Case Specifications Test specifications are obtained by suitably combining values for all inputs of the functional unit under test. If representative values were explicitly enumerated in the previous step, then test
case specifications will be elements of the Cartesian product of values selected for each input. If a formal model was produced, then test case specifications will be specific behaviors or combinations of parameters of the model,
and single test case specification could be satisfied by many different concrete inputs. Either way, brute force enumeration of all combinations is unlikely to be satisfactory.
The number of combinations in the Cartesian product of independently
selected values grows as the product of the sizes of the individual sets. For a
simple functional unit with 5 inputs each characterized by 6 values, the size
test case specifications, which may be
of the Cartesian product is
an impractical number for test cases for a simple functional unit. Moreover, if
(as is usual) the characteristics are not completely orthogonal, many of these
combinations may not even be feasible.
Consider the input of a function that searches for occurrences of a complex pattern in a web database. Its input may be characterized by the length
of the pattern and the presence of special characters in the pattern, among
other aspects. Interesting values for the length of the pattern may be zero,
one, or many. Interesting values for the presence of special characters may be
zero, one, or many. However, the combination of value zero for the length
of the pattern and value many for the number of special characters in the
Draft version produced 20th March 2002

Functional Testing

60

pattern is clearly impossible.


The test case specifications represented by the Cartesian product of all
possible inputs must be restricted by ruling out illegal combinations and selecting a practical subset of the legal combinations. Illegal combinations are
usually eliminated by constraining the set of combinations. For example, in
the case of the complex pattern presented above, we can constrain the choice
of one or more special characters to a positive length of the pattern, thus ruling out the illegal cases of patterns of length zero containing special characters.
Selection of a practical subset of legal combination can be done by adding
information that reflects the hazard of the different combinations as perceived
by the test designer or by following combinatorial considerations. In the former case, for example, we can identify exceptional values and limit the combinations that contain such values. In the pattern example, we may consider
only one test for patterns of length zero, thus eliminating many combinations
that can be derived for patterns of length zero. Combinatorial considerations
reduce the set of test cases by limiting the number of combination of values of
different inputs to a subset of the inputs. For example, we can generate only
tests that exhaustively cover all combinations of values for inputs considered
pair by pair.
Depending on the technique used to reduce the space represented by the
Cartesian product, we may be able to estimate the number of test cases generated with the approach and modify the selected subset of test cases according to budget considerations. Subsets of combinations of values, i.e., potential special cases, can be often derived from models of behavior by applying
suitable test selection criteria that identify subsets of interesting behaviors
among all behaviors represented by a model, for example by constraining the
iterations on simple elements of the model itself. In many cases, test selection
criteria can be applied automatically.
Generate Test Cases and Instantiate Tests The test generation process is
completed by turning test case specifications into test cases and instantiating
them. Test case specifications can be turned into test cases by selecting one
or more test cases for each item of the test case specification.

13.4

Category-Partition Testing

Category-partition testing is a method for generating functional tests from informal specifications. The main steps covered by the core part of the categorypartition method are:
A. Decompose the specification into independently testable features: Test designers identify features to be tested separately, and identify parameters and any other elements of the execution environment the unit depends on. Environment dependencies are treated identically to explicit
Draft version produced 20th March 2002

Category-Partition Testing

61

parameters. For each parameter and environment element, test designers identify the elementary parameter characteristics, which in the
category-partition method are usually called categories.



B. Identify Relevant Values: Test designers select a set of representative classes
of values for each parameter characteristic. Values are selected in isola-
tion, independent of other parameter characteristics. In the categorypartition method, classes of values are called choices, and this activity is
called partitioning the categories into choices.

C. Generate Test Case Specifications: Test designers indicate invalid combinations of values and restrict valid combinations of values by imposing
semantic constraints on the identified values. Semantic constraints restrict the values that can be combined and identify values that need not
be tested in different combinations, e.g., exceptional or invalid values.
Categories, choices, and constraints can be provided to a tool to automatically generate a set of test case specifications. Automating trivial and
repetitive activities such as these makes better use of human resources and
reduces errors due to distraction. Just as important, it is possible to determine the number of test cases that will be generated (by calculation, or by actually generating them) before investing any human effort in test execution. If
the number of derivable test cases exceeds the budget for test execution and
evaluation, test designers can reduce the number of test cases by imposing
additional semantic constraints. Controlling the number of test cases before
test execution begins is preferable to ad hoc approaches in which one may at
first create very thorough test suites and then test less and less thoroughly as
deadlines approach.
We illustrate the category-partition method using a specification of a feature from the direct sales web site of Chipmunk Electronic Ventures. Customers are allowed to select and price custom configurations of Chipmunk
computers. A configuration is a set of selected options for a particular model
of computer. Some combinations of model and options are not valid (e.g.,
digital LCD monitor with analog video card), so configurations are tested for
validity before they are priced. The check-configuration function (Table 13.3)
is given a model number and a set of components, and returns the boolean
value True if the configuration is valid or False otherwise. This function has
been selected by the test designers as an independently testable feature.
A. Identify Independently Testable Features and Parameter Characteristics
We assume that step starts by selecting the Check-configuration feature to
be tested independently of other features. This entails choosing to separate
testing of the configuration check per se from its presentation through a user
interface (e.g., a web form), and depends on the architectural design of the
software system.
Draft version produced 20th March 2002

Functional Testing

62





Model:


















Set of Components:








Check-Configuration:

Figure 13.3: The functional specification of the feature Check-configuration


of the web site of a computer manufacturer.

Draft version produced 20th March 2002

Category-Partition Testing

63

Step requires the test designer to identify the parameter characteristics,


i.e., the elementary characteristics of the parameters and environment elements that affect the units execution. A single parameter may have multiple elementary characteristics. A quick scan of the functional specification
would indicate model and components as the parameters of check configuration. More careful consideration reveals that what is valid must be determined by reference to additional information, and in fact the functional specification assumes the existence of a data base of models and components.
The data base is an environment element that, although not explicitly mentioned in the functional specification, is required for executing and thus testing the feature, and partly determines its behavior. Note that our goal is not
to test a particular configuration of the system with a fixed database, but to
test the generic system which may be configured through different database
contents.
Having identified model, components, and product database as the parameters and environment elements required to test the Check-configuration
functionality, the test designer would next identify the parameter characteristics of each.
Model may be represented as an integer, but we know that it is not to be
used arithmetically, but rather serves as a key to the database and other tables. The specification mentions that a model is characterized by a set of slots
for required components and a set of slot for optional components. We may
identify model number, number of required slots, and number of optional slots
as characteristics of parameter model.
Parameter components is a collection of pairs. The size
of a collection is always an important characteristic, and since components
are further categorized as required or optional, the test designer may identify
number of required components with non-empty selection and number of optional components with non-empty selection as characteristics. The matching between the tuple passed to Check-Configuration and the one actually
required by the selected model is important and may be identified as category Correspondence of selection with model slots. The actual selections are
also significant, but for now the test designer simply identifies required component selection and optional component selection, postponing selection of
relevant values to the next stage in test design.
The environment element product database is also a collection, so number of models in the database and number of components in the database are
parameter characteristics. Actual values of database entries are deferred to
the next step in test design.
There are no hard-and-fast rules for choosing categories, and it is not a
trivial task. Categories reflect the test designers judgment regarding which
classes of values may be treated differently by an implementation, in addition to classes of values that are explicitly identified in the specification. Test
designers must also use their experience and knowledge of the application
domain and product architecture to look under the surface of the specification and identify hidden characteristics. For example, the specification fragDraft version produced 20th March 2002

64

Functional Testing
ment in Table 13.3 makes no distinction between configurations of models
with several required slots and models with none, but the experienced test
designer has seen enough failures on degenerate inputs to test empty collections wherever a collection is allowed.
The number of options that can (or must) be configured for a particular
model of computer may vary from model to model. However, the categorypartition method makes no direct provision for structured data, such as sets
of pairs. A typical approach is to flatten collections and describe characteristics of the whole collection as parameter characteristics.
Typically the size of the collection (the length of a string, for example, or in
this case the number of required or optional slots) is one characteristic, and
descriptions of possible combination of elements (occurrence of a special
characters in a string, for example, or in this case the selection of required
and optional components) are separate parameter characteristics.
Suppose the only significant variation among pairs was
between pairs that are compatible and pairs that are incompatible. If we
treated each pair as a separate characteristic, and assumed
slots, the category-partition method would generate all combinations of
compatible and incompatible slots. Thus we might have a test case in which
the first selected option is compatible, the second is compatible, and the third
incompatible, and a different test case in which the first is compatible but the
second and third are incompatible, and so on, and each of these combinations could be combined in several ways with other parameter characteristics. The number of combinations quickly explodes, and moreover since the
number of slots is not actually fixed, we cannot even place an upper bound
on the number of combinations that must be considered. We will therefore
choose the flattening approach and select possible patterns for the collection
as a whole.
Should the representative values of the flattened collection of pairs be one
compatible selection, one incompatible selection, all compatible selections, all
incompatible selections, or should we also include mix of 2 or more compatible
and 2 or more incompatible selections? Certainly the latter is more thorough,
but whether there is sufficient value to justify the cost of this thoroughness is
a matter of judgment by the test designer.
We have oversimplified by considering only whether a selection is compatible with a slot. It might also happen that the selection does not appear in
the database. Moreover, the selection might be incompatible with the model,
or with a selected component of another slot, in addition to the possibility
that it is incompatible with the slot for which it has been selected. If we treat
each such possibility as a separate parameter characteristic, we will generate many combinations, and we will need semantic constraints to rule out
combinations like there are three options, at least two of which are compatible with the model and two of which are not, and none of which appears in
the database. On the other hand, if we simply enumerate the combinations
that do make sense and are worth testing, then it becomes more difficult to
be sure that no important combinations have been omitted. Like all design

Draft version produced 20th March 2002

Category-Partition Testing

65

decisions, the way in which collections and complex data are broken into parameter characteristics requires judgment based on a combination of analysis and experience.
B. Identify Relevant Values This step consists of identifying a list of relevant values (more precisely, a list of classes of relevant values) for each of the
parameter characteristics identified during step . Relevant values should
be identified for each category independently, ignoring possible interactions
among values for different categories, which are considered in the next step.
Relevant values may be identified by manually applying a set of rules known
as boundary value testing or erroneous condition testing. The boundary value
testing rule suggests selection of extreme values within a class (e.g., maximum and minimum values of the legal range), values outside but as close as
possible to the class, and interior (non-extreme) values of the class. Values
near the boundary of a class are often useful in detecting off by one errors
in programs. The erroneous condition rule suggests selecting values that are
outside the normal domain of the program, since experience suggests that
proper handling of error cases is often overlooked.
Table 13.1 summarizes the parameter characteristics and the corresponding relevant values identified for feature Check-configuration.6 For numeric
characteristics, whose legal values have a lower bound of , i.e., number of
models in database and number of components in database, we identify , the
erroneous value, , the boundary value, and
, the class of values greater
than , as the relevant value classes. For numeric characteristics whose lower
bound is zero, i.e., number of required slots for selected model and number
of optional slots for selected model, we identify as a boundary value, and
many as other relevant classes of values. Negative values are impossible here,
so we do not add a negative error choice. For numeric characteristics whose
legal values have definite lower and upper-bounds, i.e., number of optional
components with selection empty and number of optional components with
selection empty, we identify boundary and (when possible) erroneous conditions corresponding to both lower and upper bounds.
Identifying relevant values is an important but tedious task. Test designers may improve manual selection of relevant values by using the catalog approach described in Section 13.8, which captures the informal approaches
used in this section with a systematic application of catalog entries.

C. Generate Test Case Specifications A test case specification for a feature


is given as a combination of values, one for each identified parameter characteristic. Unfortunately, the simple combination of all possible relevant values for each parameter characteristic results in an unmanageable number of
test cases (many of which are impossible) even for simple specifications. For
6 At this point, readers may ignore the items in square brackets, which indicate the constraints
as identified in step of the category-partition method.

Draft version produced 20th March 2002

Functional Testing

66
Parameter: Model
Model number

Number of required slots for selected model(#SMRS)

Number of optional slots for selected model (#SMOS)

Parameter: Components
Correspondence of selection with
model slots

Number of required components


with selection empty

Required component selection

Number of optional components


with select empty

Environment element: Product database


Number of models in database
(#DBM)

Number of components
database (#DBC)

Optional component selection

in

Table 13.1: An example category-partition test specification for the the configuration checking feature of the web site of a computer vendor.
Draft version produced 20th March 2002

Category-Partition Testing

67

example, in the Table 13.1 we find 7 categories with 3 value classes, 2 categories with 6 value classes, and one with four value classes, potentially resulting in

test cases, which would be acceptable only if
the cost of executing and checking each individual test case were very small.
However, not all combinations of value classes correspond to reasonable test
case specifications. For example, it is not possible to create a test case from
a test case specification requiring a valid model (a model appearing in the
database) where the database contains zero models.
The category-partition method allows one to omit some combinations by
indicating value classes that need not be combined with all other values. The
label indicates a value class that need be tried only once, in combination with non-error values of other parameters. When constraints are
considered in the category-partition specification of Table 13.1, the number
of combinations to be considered is reduced to

. Note that we have treated component not in database as


an error case, but have treated incompatible with slot as a normal case of
an invalid configuration; once again, some judgment is required.
Although the reduction from 314,928 to 2,711 is impressive, the number
of derived test cases may still exceed the budget for testing such a simple feature. Moreover, some values are not erroneous per se, but may only be useful
or even valid in particular combinations. For example, the number of optional components with non-empty selection is relevant to choosing useful
test cases only when the number of optional slots is greater than 1. A number of non-empty choices of required component greater than zero does not
make sense if the number of required components is zero.
Erroneous combinations of valid values can be ruled out with the property
and if-property constraints. The property constraint groups values of a single
parameter characteristic to identify subsets of values with common properties. The property constraint is indicated with label property PropertyName,
where PropertyName identifies the property for later reference. For example, property RSNE (required slots non-empty) in Table 13.1 groups values
that correspond to non-empty sets of required slots for the parameter characteristic Number of Required Slots for Selected Model (#SMRS), i.e., values 1
and many. Similarly, property OSNE (optional slots non-empty) groups nonempty values for the parameter characteristic Number of Optional Slots for
Selected Model (#SMOS).
The if-property constraint bounds the choices of values for a parameter
characteristic once a specific value for a different parameter characteristic
has been chosen. The if-property constraint is indicated with label if PropertyName, where PropertyName identifies a property defined with the property
constraint. For example, the constraint if RSNE attached to values 0 and
number of required slots of parameter characteristic Number of required components with selection empty limits the combination of these values with
the values of the parameter characteristics Number of Required Slots for Selected Model (#SMRS), i.e., values 1 and many, thus ruling out the illegal combination of values 0 or number of required slots for Number of required com-

Draft version produced 20th March 2002

Functional Testing

68

ponents with selection empty with value 0 for Number of Required Slots for
Selected Model (#SMRS). Similarly, the if OSNE constraint limits the combinations of values of the parameter characteristics Number of optional components with selection empty and Number of Optional Slots for Selected Model
(#SMOS).
The property and if-property constraints introduced in Table 13.1 further
reduce the number of combinations to be considered to

. (Exercise Ex13.4 discusses derivation of this
number.)
The number of combinations can be further reduced by iteratively adding
property and if-property constraints and by introducing the new single constraint, which is indicated with label single and acts like the error constraint,
i.e., it limits the number of occurrences of a given value in the selected combinations to 1.
Introducing new property, if-property, and single constraints further does
not rule out erroneous combinations, but reflects the judgment of the test designer, who decides how to restrict the number of combinations to be considered by identifying single values (single constraint) or combinations (property
and if-property constraints) that are less likely to need thorough test according to the test designers judgment.
The single constraints introduced in Table 13.1 reduces the number of
,
combinations to be considered to
which may be a reasonable tradeoff between costs and quality for the considered functionality. The number of combinations can also be reduced by
applying combinatorial techniques, as explained in the next section.
The set of combinations of values for the parameter characteristics can
be turned into test case specifications by simply instantiating the identified
combinations. Table 13.2 shows an excerpt of test case specifications. The
error tag in the last column indicates test cases specifications corresponding
to the error constraint. Corresponding test cases should produce an error
indication. A dash indicates no constraints on the choice of values for the
parameter or environment element.
Choosing meaningful names for parameter characteristics and value classes
allows (semi)automatic generation of test case specifications.

13.5

The Combinatorial Approach

However one obtains sets of value classes for each parameter characteristic,
the next step in producing test case specifications is selecting combinations
of classes for testing. A simple approach is to exhaustively enumerate all
possible combinations of classes, but the number of possible combinations
rapidly explodes.
Some methods, such as the category-partition method described in the
previous section, take exhaustive enumeration as a base approach to generating combinations, but allow the test designer to add constraints that limit
Draft version produced 20th March 2002

The Combinatorial Approach

69

Table 13.2: An excerpt of test case specifications derived from the value
classes given in Table 13.1
growth in the number of combinations. This can be a reasonable approach
when the constraints on test case generation reflect real constraints in the
application domain, and eliminate many redundant combinations (for example, the error entries in category-partition testing). It is less satisfactory
when, lacking real constraints from the application domain, the test designer
is forced to add arbitrary constraints (e.g., single entries in the categorypartition method) whose sole purpose is to reduce the number of combinations.
Consider the parameters that control the Chipmunk web-site display, shown
in Table 13.3. Exhaustive enumeration produces 432 combinations, which
is too many if the test results (e.g., judging readability) involve human judgment. While the test designer might hypothesize some constraints, such as
observing that monochrome displays are limited mostly to hand-held devices, radical reductions require adding several single and property constraints without any particular rationale.
Exhaustive enumeration of all -way combinations of value classes for
parameters, on the one hand, and coverage of individual classes, on the
other, are only the extreme ends of a spectrum of strategies for generating
combinations of classes. Between them lie strategies that generate all pairs of
classes for different parameters, all triples, and so on. When it is reasonable
to expect some potential interaction between parameters (so coverage of individual value classes is deemed insufficient), but covering all combinations
is impractical, an attractive alternative is to generate -way combinations for
, typically pairs or triples.
How much does generating possible pairs of classes save, compared to
Draft version produced 20th March 2002

Functional Testing

70

Display Mode

Color

Language

Fonts

Screen size

Table 13.3: Parameters and values controlling Chipmunk web-site display


generating all combinations? We have already observed that the number of
all combinations is the product of the number of classes for each parameter,
and that this product grows exponentially with the number of parameters.
It turns out that the number of combinations needed to cover all possible
pairs of values grows only logarithmically with the number of parameters
an enormous saving.
A simple example may suffice to gain some intuition about the efficiency
of generating tuples that cover pairs of classes, rather than all combinations.
Suppose we have just the three parameters display mode, screen size, and
fonts from Table 13.3. If we consider only the first two, display mode and
screen size, the set of all pairs and the set of all combinations are identical,
pairs of classes. When we add the third parameter,
and contain
fonts, generating all combinations requires combining each value class from
fonts with every pair of display mode screen size, a total of 27 tuples; extending from to
parameters is multiplicative. However, if we are generating
pairs of values from display mode, screen size, and fonts, we can add value
classes of fonts to existing elements of display mode screen size in a way that
covers all the pairs of fontsscreen size and all the pairs of fontsdisplay mode
without increasing the number of combinations at all (see Table 13.4). The
key is that each tuple of three elements contains three pairs, and by careful
selecting value classes of the tuples we can make each tuple cover up to three
different pairs.
Table 13.5 shows 17 tuples that cover all pairwise combinations of value
classes of the five parameters. The entries not specified in the table () correspond to open choices. Each of them can be replaced by any legal value
for the corresponding parameter. Leaving them open gives more freedom for
selecting test cases.
Generating combinations that efficiently cover all pairs of classes (or triples,
or . . . ) is nearly impossible to perform manually for many parameters with
many value classes (which is, of course, exactly when one really needs to use

Draft version produced 20th March 2002

The Combinatorial Approach

71

Table 13.4: Covering all pairs of value classes for three parameters by extending the cross-product of two parameters
the approach). Fortunately, efficient heuristic algorithms exist for this task,
and they are simple enough to incorporate in tools.7
The tuples in Table 13.5 cover all pairwise combinations of value choices
for parameters. In many cases not all choices may be allowed. For example, the specification of the Chipmunk web-site display may indicate that
monochrome displays are limited to hand-held devices. In this case, the tuples covering the pairs Monochrome Laptop and Monochrome Full-size ,
i.e., the fifth and ninth tuples of Table 13.5, would not correspond to legal inputs. We can restrict the set of legal combinations of value classes by adding
suitable constraints. Constraints can be expressed as tuples with wild-card
characters to indicate any possible value class. For example, the constraints

Monochrome Laptop
Monochrome Full-size
indicates that tuples that contain the pair Monochrome Hand-held as
values for the fourth and fifth parameter are not allowed in the relation of Table 13.3. Tuples that cover all pairwise combinations of value classes without
violating the constraints can be generated by simply removing the illegal tuples and adding legal tuples that cover the removed pairwise combinations.
Open choices must be bound consistently in the remaining tuples, e.g., tuple
Portuguese Monochrome Text-only - must become
Portuguese Monochrome Text-only - Hand-held
Constraints can also be expressed with sets of tables to indicate only the
legal combinations, as illustrated in Table 13.6, where the first table indicates
7 Exercise Ex13.12 discusses the problem of computing suitable combinations to cover all
pairs.

Draft version produced 20th March 2002

Functional Testing

72

Table 13.5: Covering all pairs of value classes for the five parameters

that the value class Hand-held for parameter Screen can be combined with
any value class of parameter Color, including Monochrome, while the second table indicates that the value classes Laptop and Full-size for parameter
Screen size can be combined with all values classes but Monochrome for parameter Color.
If constraints are expressed as a set of tables that give only legal combinations, tuples can be generated without changing the heuristic. Although
the two approaches express the same constraints, the number of generated
tuples can be different, since different tables may indicate overlapping pairs
and thus result in a larger set of tuples. Other ways of expressing constraints
may be chosen according to the characteristics of the specifications and the
preferences of the test designer.
So far we have illustrated the combinatorial approach with pairwise coverage. As previously mentioned, the same approach can be applied for triples
or larger combinations. Pairwise combinations may be sufficient for some
subset of the parameters, but not enough to uncover potential interactions
among other parameters. For example, in the Chipmunk display example,
the fit of text fields to screen areas depends on the combination of language,
fonts, and screen size. Thus, we may prefer exhaustive coverage of combinations of these three parameters, but be satisfied with pairwise coverage of
other parameters. In this case, we first generate tuples of classes from the
parameters to be most thoroughly covered, and then extend these with the
Draft version produced 20th March 2002

The Combinatorial Approach

73

Hand-held devices
Display Mode

Language

Fonts

Color

Screen size

Laptop and Full-size devices


Display Mode

Language

Fonts

Color

Screen size

Table 13.6: Pairs of tables that indicate valid value classes for the Chipmunk
web-site display controller

Draft version produced 20th March 2002

Functional Testing

74
parameters which require less coverage.8

13.6

Testing Decision Structures

The combinatorial approaches described above primarily select combinations of orthogonal choices. They can accommodate constraints among choices,
but their strength is in generating combinations of (purportedly) independent choices. Some specifications, formal and informal, have a structure that
emphasizes the way particular combinations of parameters or their properties determine which of several potential outcomes is chosen. Results of the
computation may be determined by boolean predicates on the inputs. In
some cases, choices are specified explicitly as boolean expressions. More often, choices are described either informally or with tables or graphs that can
assume various forms. When such a decision structure is present, it can play
a part in choosing combinations of values for testing.
For example, the informal specification of Figure 13.4 describes outputs
that depend on type of account (either educational, or business, or individual), amount of current and yearly purchases, and availability of special prices.
These can be considered as boolean conditions, e.g., the condition educational account is either true or false (even if the type of account is actually
represented in some other manner). Outputs can be described as boolean
expressions over the inputs, e.g., the output no discount can be associated
with the boolean expression
individual account
current purchase
special offer price
business account
current purchase
current purchase
special offer price

tier 1 individual threshold


individual scheduled price
tier 1 business threshold
tier 1 business yearly threshold
business scheduled price

When functional specifications can be given as boolean expressions, a


good test suite should exercise at least the effects of each elementary condition occurring in the expression. (In ad hoc testing, it is common to miss a
bug in one elementary condition by choosing test cases in which it is masked
by other conditions.) For simple conditions, we might derive test case specifications for all possible combinations of truth values of the elementary conditions. For complex formulas, testing all combinations of elementary
conditions is apt to be too expensive; we can select a much smaller subset of
combinations that checks the effect of each elementary condition. A good
way of exercising all elementary conditions with a limited number of test
cases is deriving a set of combinations such that each elementary condition
can be shown to independently affect the outcome.

8 See

exercise Ex13.14 for additional details.

Draft version produced 20th March 2002

Testing Decision Structures

75

Pricing: The pricing function determines the adjusted price of a configuration for a
particular customer. The scheduled price of a configuration is the sum of the
scheduled price of the model and the scheduled price of each component in the
configuration. The adjusted price is either the scheduled price, if no discounts
are applicable, or the scheduled price less any applicable discounts.
There are three price schedules and three corresponding discount schedules,
Business, Educational, and Individual. The Business price and discount schedules apply only if the order is to be charged to a business account in good standing. The Educational price and discount schedules apply to educational institutions. The Individual price and discount schedules apply to all other customers.
Account classes and rules for establishing business and educational accounts
are described further in [. . . ].
A discount schedule includes up to three discount levels, in addition to the possibility of no discount. Each discount level is characterized by two threshold
values, a value for the current purchase (configuration schedule price) and a
cumulative value for purchases over the preceding 12 months (sum of adjusted
price).
Educational prices The adjusted price for a purchase charged to an educational account in good standing is the scheduled price from the educational price schedule. No further discounts apply.
Business account discounts Business discounts depend on the size of the current
purchase as well as business in the preceding 12 months. A tier 1 discount is
applicable if the scheduled price of the current order exceeds the tier 1 current
order threshold, or if total paid invoices to the account over the preceding 12
months exceeds the tier 1 year cumulative value threshold. A tier 2 discount
is applicable if the current order exceeds the tier 2 current order threshold, or
if total paid invoices to the account over the preceding 12 months exceeds the
tier 2 cumulative value threshold. A tier 2 discount is also applicable if both the
current order and 12 month cumulative payments exceed the tier 1 thresholds.
Individual discounts Purchase by individuals and by others without an established
account in good standing are based on current value alone (not on cumulative purchases). A tier 1 individual discount is applicable if the scheduled price
of the configuration in the current order exceeds the the tier 1 current order
threshold. A tier 2 individual discount is applicable if the scheduled price of the
configuration exceeds the tier 2 current order threshold.
Special-price non-discountable offers Sometimes a complete configuration is offered at a special, non-discountable price. When a special, non-discountable
price is available for a configuration, the adjusted price is the non-discountable
price or the regular price after any applicable discounts, whichever is less.

Figure 13.4: The functional specification of feature pricing of the Chipmunk


web site.

Draft version produced 20th March 2002

Functional Testing

76

A predicate is a function with a boolean (True or False) value. When the input argument of the predicate is clear, particularly when it describes some property of the input of
a program, we often leave it implicit. For example, the actual representation of account
types in an information system might be as three-letter codes, but in a specification we
may not be concerned with that representation we know only that there is some predicate educational-account which is either True or False.
An elementary condition is a single predicate that cannot be decomposed further. A
complex condition is made up of elementary conditions, combined with boolean connectives.
The boolean connectives include and ( ), or ( ), not ( ), and several less common derived connectives such as implies and exclusive or.

A systematic approach to testing boolean specifications consists in first


constructing a model of the boolean specification and then applying test criteria to derive test case specifications.
STEP 1: derive a model of the decision structure We can produce different
models of the decision structure of a specification, depending on the original
specification and on the technique we use for deriving test cases. For example, if the original specification prescribes a sequence of decisions, either in
a program-like syntax or perhaps as a decision tree, we may decide not to derive a different model but rather treat it as a conditional statement. Then we
can directly apply the methods described in Chapter 14 for structural testing,
i.e., basic condition, compound condition, or modified condition/decision
adequacy criteria. On the other hand, if the original specification is expressed
informally as in Figure 13.4, we can transform it into either a boolean expression or a graph or a tabular model before applying a test case generation technique.
Techniques for deriving test case specifications from decision structures
were originally developed for graph models, and in particular cause effect
graphs, which have been used since the early seventies. Cause-effect graphs
are tedious to derive and do not scale well to complex specifications. Tables,
on the other hand, are easy to work with and scale well.
A decision structure can be represented with a decision table where rows
correspond to elementary conditions and columns correspond to combinations of elementary conditions. The last row of the table indicates the expected outputs. Cells of the table are labeled either true, false, or dont care
(usually written ), to indicate the truth value of the elementary condition.
Thus, each column is equivalent to a logical expression joining the required
values (negated, in the case of false entries) and omitting the elementary conditions with dont care values.9
9 The

set of columns sharing a label is therefore equivalent to a logical expression in sum-of-

Draft version produced 20th March 2002

Decision tables are completed with a set of constraints that limit the possible combinations of elementary conditions. A constraint language can be
based on boolean logic. Often it is useful to add some shorthand notations
for common conditions that are tedious to express with the standard connectives, such as at-most-one(C1, . . . , Cn) and exactly-one(C1, . . . , Cn).
Figure 13.5 shows the decision table for the functional specification of feature pricing of the Chipmunk web site presented in Figure 13.4.
The informal specification of Figure 13.4 identifies three customer profiles: educational, business, and individual. Table 13.5 has only rows educational and business. The choice individual corresponds to the combination false, false for choices educational and business, and is thus redundant.
The informal specification of Figure 13.4 indicates different discount policies depending on the relation between the current purchase and two progressive thresholds for the current purchase and the yearly cumulative purchase. These cases correspond to rows 3 through 6 of table 13.5. Conditions
on thresholds that do not correspond to individual rows in the table can be
defined by suitable combinations of values for these rows. Finally, the informal specification of Figure 13.4 distinguishes the cases in which special offer
prices do not exceed either the scheduled or the tier 1 or tier 2 prices. Rows 7
through 9 of the table, suitably combined, capture all possible cases of special
prices without redundancy.
Constraints formalize the compatibility relations among different elementary conditions listed in the table: Educational and Business accounts are
exclusive; A cumulative purchase exceeding threshold tier 2, also exceeds
threshold tier 1; a yearly purchase exceeding threshold tier 2, also exceeds
threshold tier 1; a cumulative purchase not exceeding threshold tier 1 does
not exceed threshold tier 2; a yearly purchase not exceeding threshold tier 1
does not exceed threshold tier 2; a special offer price not exceeding threshold tier 1 does not exceed threshold tier 2; and finally, a special offer price
exceeding threshold tier 2 exceeds threshold tier 1.
STEP 2: derive test case specifications from a model of the decision structure Different criteria can be used to generate test suites of differing complexity from decision tables.
The basic condition adequacy criterion requires generation of a test case
specification for each column in the table, and corresponds to the intuitive
principle of generating a test case to produce each possible result. Dont care
entries of the table can be filled out arbitrarily, so long as constraints are not
violated.
The compound condition adequacy criterion requires a test case specification for each combination of truth values of elementary conditions. The
compound condition adequacy criterion generates a number of cases exponential in the number of elementary conditions, and can thus be applied only
to small sets of elementary conditions.
products form.

F
F
F
F
ND

F
F
F
T
SP

Individual
F
F
F
F
T
T
F
F
F
T
T1 SP

F
F
T
F
T2

F
F
T
T
SP

T
F
F
F
ND

T
F
F
T
SP

T
T
F
F
F
T1

T
T
F
F
T
SP

T
F
T
F
F
T1

Business
T
T
F
T
T
T
F
T
F
SP T2

T
T
T
T
SP

T
T
F
T2

T
T
T
SP

T
T
F
T2

T
T
T
SP

Constraints
at-most-one(Edu, Bus)
YP YT1
YP YT2
CP CT1
CP CT2
SP T1
SP T2

at-most-one(YP
at-most-one(CP
at-most-one(SP

YT1, YP YT2)
CT1, CP CT2)
T1, SP T2)

Abbreviations
Edu.
Bus.
CP
YP
CP
YP
SP
SP
SP

CT1
YT1
CT2
YT2
Sc
T1
T2

Educational account
Business account
Current purchase greater than threshold 1
Year cumulative purchase greater than threshold 1
Current purchase greater than threshold 2
Year cumulative purchase greater than threshold 2
Special Price better than scheduled price
Special Price better than tier 1
Special Price better than tier 2

Edu
ND
T1
T2
SP

Educational price
No discount
Tier 1
Tier 2
Special Price

78

Figure 13.5: The decision table for the functional specification of feature pricing of the Chipmunk web site of Figure 13.4.

Draft version produced 20th March 2002

Functional Testing

Education
T
T
F
T
Edu SP

Edu.
Bus.
CP CT1
YP YT1
CP CT2
YP YT2
SP Sc
SP T1
SP T2
Out

Testing Decision Structures

79

For the modified condition/decision adequacy criterion (MC/DC), each column in the table represents a test case specification. In addition, for each of
the original columns, MC/DC generates new columns by modifying each of

the cells containing True or False. If modifying a truth value in one column
results in a test case specification consistent with an existing column (agree-
ing in all places where neither is dont care), the two test cases are represented
by one merged column, provided they can be merged without violating constraints.
The MC/DC criterion formalizes the intuitive idea that a thorough test
suite would not only test positive combinations of values, i.e., combinations
that lead to specified outputs, but also negative combinations of values, i.e.,
combinations that differ from the specified ones and thus should produce
different outputs, in some cases among the specified ones, in some other
cases leading to error conditions.
Applying MC/DC to column 1 of table 13.5 generates two additional columns:
one for Educational Account = false and Special Price better than scheduled
price = false, and the other for Educational Account = true and Special Price
better than scheduled price = true. Both columns are already in the table
(columns 3 and 2, respectively) and thus need not be added.
Similarly, from column 2, we generate two additional columns corresponding to Educational Account = false and Special Price better than scheduled
price = true, and Educational Account = true and Special Price better than
scheduled price = false, also already in the table.
The generation of a new column for each possible variation of the boolean
values in the columns, varying exactly one value for each new column, produces 78 new columns, 21 of which can be merged with columns already in
the table. Figure 13.6 shows a table obtained by suitably joining the generated
columns with the existing ones. Many dont care cells from the original table
are assigned either true or false values, to allow merging of different columns
or to obey constraints. The few dont-care entries left can be set randomly to
obtain a complete test case specification.
There are many ways of merging columns that generate different tables.
The table in Figure 13.6 may not be the optimal one, i.e., the one with the
fewest columns. The objective in test design is not to find an optimal test
suite, but rather to produce a cost effective test suite with an acceptable tradeoff between the cost of generating and executing test cases and the effectiveness of the tests.
The table in Figure 13.6 fixes the entries as required by the constraints,
while the initial table in Figure 13.5 does not. Keeping constraints separate
from the table corresponding to the initial specification increases the number of dont care entries in the original table, which in turn increases the opportunity for merging columns when generating new cases with the MC/DC
criterion. For example, if business account = false, the constraint at-mostone(Edu, Bus) can be satisfied by assigning either true or false to entry educational account. Fixing either choice prematurely may later make merging
with a newly generated column impossible.

Draft version produced 20th March 2002

CT1
YT1
CT2
YT2
Sc
T1
T2

T
F
T
F
F
F
F
F
Edu

T
F
T
F
T
T
SP

F
F
F
F
F
F
F
F
ND

F
F
F
F
T
T
SP

F
F
T
F
F
F
F
T1

F
F
T
F
F
T
T
SP

F
F
T
T
T
F
F
T2

F
F
T
T
T
T
T
SP

F
T
F
F
F
F
F
F
ND

F
T
F
F
F
T
T
SP

F
T
T
F
F
F
F
F
T1

F
T
T
F
F
T
SP

F
T
F
T
F
F
F
F
F
T1

F
T
F
T
F
F
T
T
SP

F
T
T
T
F
F
F
F
T2

F
T
T
T
F
F
T
T
T
SP

F
T
T
F
T
F
F
T2

F
T
T
F
T
T
T
T
SP

F
T
F
T
F
T
F
F
T2

F
T
F
T
F
T
T
T
T
SP

T
F
F
T
F
F
F
F
F
Edu

T
F
F
F
T
F
SP

T
F
T
T
F
Edu

Abbreviations
Edu.
Bus.
CP CT1
YP YT1
CP CT2
YP YT2
SP Sc
SP T1
SP T2

Educational account
Business account
Current purchase greater than threshold 1
Year cumulative purchase greater than threshold 1
Current purchase greater than threshold 2
Year cumulative purchase greater than threshold 2
Special Price better than scheduled price
Special Price better than tier 1
Special Price better than tier 2

Edu
ND
T1
T2
SP

Educational price
No discount
Tier 1
Tier 2
Special Price

80

Figure 13.6: The set of test cases generated for feature pricing of the Chipmunk web site applying the modified adequacy
criterion.

Draft version produced 20th March 2002

Functional Testing

Edu.
Bus.
CP
YP
CP
YP
SP
SP
SP
Out

T
F
T
T
T
SP

F
F
T
F
T
T
T
SP

F
F
T
F
F
T
SP

Deriving Test Cases from Control and Data Flow Graphs

81

13.7 Deriving Test Cases from Control and Data Flow


Graphs
Functional specifications are seldom given as flow graphs, but sometimes
they describe a set of mutually dependent steps to be executed in a given
(partial) order, and can thus be modeled with flow graphs.
For example the specification of Figure 13.7 describes the Chipmunk functionality that processes shipping orders. The specification indicates a set of
steps to check for the validity of fields in the order form. Type and validity of
some of the values depend on other fields in the form. For example, shipping
methods are different for domestic and international customers, and allowed
methods of payment depend on the kind of customer.
The informal specification of Figure 13.7 can be modeled with a control
flow graph, where the nodes represent computations and branches model
flow of control consistently with the dependencies among computations, as
illustrated in Figure 13.8. Given a control or a data flow graph model, we
can generate test case specifications using the criteria originally proposed for
structural testing and described in Chapters ?? and ??.
Control flow testing criteria require test cases that exercise all the elements of a particular type in a graph. The node testing adequacy criterion
requires each node to be exercised at least once and corresponds to the statement testing structural adequacy criterion. It is easy to verify that test T-node
causes all nodes of the control flow graph of Figure 13.8 to be traversed and
thus satisfies the node adequacy criterion.

T-node
Case

Too
small
TC-1
No
No
TC-2
Abbreviations:
Too small
Ship where
Ship how
Cust type
Pay method
Same addr
CC Valid

Ship
where
Int
Dom

Ship
method
Air
Air

Cust
type
Bus
Ind

Pay
method
CC
CC

Same
addr
No

CC
valid
Yes
No (abort)

CostOfGoods MinOrder ?
Shipping address, Int = international, Dom = domestic
Air = air freight, Land = land freight
Bus = business, Edu = educational, Ind = individual
CC = credit card, Inv = invoice
Billing address = shipping address ?
Credit card information passes validity check?

The branch testing adequacy criterion requires each branch to be exercised at least once, i.e., each edge of the graph to be traversed for at least one
test case. Test T-branch covers all branches of the control flow graph of Figure 13.8 and thus satisfies the branch adequacy criterion.

Draft version produced 20th March 2002

Functional Testing

82

Process shipping order: The Process shipping order function checks the validity of orders and prepares the receipt.
A valid order contains the following data:
cost of goods If the cost of goods is less than the minimum processable order
(MinOrder) then the order is invalid.
shipping address The address includes name, address, city, postal code, and
country.
preferred shipping method If the address is domestic, the shipping method
must be either land freight, or expedited land freight, or overnight air.
If the address is international, the shipping method must be either air
freight or expedited air freight; a shipping cost is computed based on
address and shipping method.
type of customer which can be individual, business or educational
preferred method of payment Individual customers can use only credit cards,
while business and educational customers can choose between credit
card and invoice.
card information if the method of payment is credit card, fields credit card
number, name on card, expiration date, and billing address, if different
than shipping address, must be provided. If credit card information is not
valid the user can either provide new data or abort the order.
The outputs of Process shipping order are
validity Validity is a boolean output which indicates whether the order can be
processed.
total charge The total charge is the sum of the value of goods and the computed shipping costs (only if validity = true).
payment status if all data are processed correctly and the credit card information is valid or the payment is invoice, payment status is set to valid, the
order is entered and a receipt is prepared; otherwise validity = false.

Figure 13.7: The functional specification of feature process shipping order of


the Chipmunk web site.

Draft version produced 20th March 2002

Deriving Test Cases from Control and Data Flow Graphs

83

;;;;;
Process shipping order

CostOfGoods < MinOrder


no

shipping address

international

domestic

preferred shipping method = air


freight OR expedited air freight

preferred shipping method = land freight,


OR expedited land freight OR overnight air

calculate international shipping charge

calculate domestic shipping charge

total charge = goods + shipping


individual customer

no

no

no
yes

method of payement
yes
credit card

obtain credit card data: number,


name on card, expiration date

yes

invoice

billing address = shipping address


no

no

obtain billing address

valid credit card


information

yes

no

abort order?
yes

payement status = valid


enter order
prepare receipt
invalid order

Figure 13.8: The control flow graph corresponding to functionality Process


shipping order of Figure 13.7

Draft version produced 20th March 2002

Functional Testing

84
T-branch
Case

Too
Ship
Ship
Cust
Pay
Same
CC
small where method type method addr
valid
TC-1
No
Int
Air
Bus
CC
No
Yes
No
Dom
Land

TC-2
TC-3
Yes

TC-4
No
Dom
Air

No
Int
Land

TC-5
TC-6
No

Edu
Inv

TC-7
No

CC
Yes

No

CC

No (abort)
TC-8
TC-9
No

CC

No (no abort)
Abbreviations:
(as above)
In principle, other test adequacy criteria described in Chapter 14 can be
applied to more control structures derived from specifications, but in practice
a good specification should rarely result in a complex control structure, since
a specification should abstract details of processing.

13.8

Catalog Based Testing

The test design techniques described above require judgment in deriving value
classes. Over time, an organization can build experience in making these
judgments well. Gathering this experience in a systematic collection can speed
up the process and routinize many decisions, reducing human error. Catalogs
capture the experience of test designers by listing all cases to be considered
for each possible type of variable that represents logical inputs, outputs, and
status of the computation. For example, if the computation uses a variable
whose value must belong to a range of integer values, a catalog might indicate the following cases, each corresponding to a relevant test case:
1. The element immediately preceding the lower bound of the interval
2. The lower bound of the interval
3. A non-boundary element within the interval
4. The upper bound of the interval
5. The element immediately following the upper bound
The catalog would in this way cover the intuitive cases of erroneous conditions (cases 1 and 5), boundary conditions (cases 2 and 4), and normal conditions (case 3).
The catalog based approach consists in unfolding the specification, i.e.,
decomposing the specification into elementary items, deriving an initial set
Draft version produced 20th March 2002

Catalog Based Testing

85

of test case specifications from pre-conditions, post-conditions, and definitions, and completing the set of test case specifications using a suitable test
catalog.
STEP 1: identify elementary items of the specification The initial specification is transformed into a set of elementary items that have to be tested.
Elementary items belong to a small set of basic types:
Pre-conditions represent the conditions on the inputs that must be satisfied
before invocation of the unit under test. Preconditions may be checked
either by the unit under test (validated preconditions) or by the caller
(assumed preconditions).
Post-conditions describe the result of executing the unit under test.
Variables indicate the elements on which the unit under test operates. They
can be input, output, or intermediate values.
Operations indicate the main operations performed on input or intermediate variables by the unit under test
Definitions are shorthand used in the specification
As in other approaches that begin with an informal description, it is not
possible to give a precise recipe for extracting the significant elements. The
result will depend on the capability and experience of the test designer.
Consider the informal specification of a function for converting URL-encoded form data into the original data entered through an html form. An
informal specification is given in Figure 13.7.10
The informal description of cgi decode uses the concept of hexadecimal
digit, hexadecimal escape sequence, and element of a cgi encoded sequence.
This leads to the identification of the following three definitions:
DEF 1 hexadecimal digits are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D,
E, F, a, b, c, d, e, f
DEF 2 a CGI-hexadecimal is a sequence of three characters: , where
and are hexadecimal digits
DEF 3 a CGI item is either an alphanumeric character, or character , or a
CGI-hexadecimal
In general, every concept introduced in the description as a support for
defining the problem can be represented as a definition.
The description of cgi decode mentions some elements that are inputs
and outputs of the computation. These are identified as the following variables:
10 The informal specification is ambiguous and inconsistent, i.e., it is the kind of spec one is
most likely to encounter in practice.

Draft version produced 20th March 2002

Functional Testing

86














INPUT: encoded





OUTPUT: decoded









OUTPUT: return value



cgi decode:

Draft version produced 20th March 2002

Catalog Based Testing

87

VAR 1 Encoded: string of ASCII characters


VAR 2 Decoded: string of ASCII characters
VAR 3 return value: Boolean
Note the distinction between a variable and a definition. Encoded and decoded are actually used or computed, while hexadecimal digits, CGI-hexadecimal,
and CGI item are used to describe the elements but are not objects in their
own right. Although not strictly necessary for the problem specification, explicit identification of definitions can help in deriving a richer set of test cases.
The description of cgi decode indicates some conditions that must be satisfied upon invocation, represented by the following preconditions:
PRE 1 (Assumed) the input string Encoded is a null-terminated string of characters.
PRE 2 (Validated) the input string Encoded is a sequence of CGI items.
In general, preconditions represent all the conditions that should be true
for the intended functioning of a module. A condition is labeled as validated
if it is checked by the module (in which case a violation has a specified effect,
e.g., raising an exception or returning an error code). Assumed preconditions
must be guaranteed by the caller, and the module does not guarantee a particular behavior in case they are violated.
The description of cgi decode indicates several possible results. These can
be represented as a set of postconditions:
POST 1 if the input string Encoded contains alphanumeric characters, they
are copied to the corresponding position in the output string.
POST 2 if the input string Encoded contains characters +, they are replaced
by ASCII SPACE characters in the corresponding positions in the output
string.
POST 3 if the input string Encoded contains CGI-hexadecimals, they are replaced by the corresponding ASCII characters.
POST 4 if the input string Encoded is a valid sequence, cgi decode returns 0.
POST 5 if the input string Encoded contains a malformed CGI-hexadecimal,
i.e., a substring %xy, where either x or y is absent or are not hexadecimal digits, cgi decode returns 1
POST 6 if the input string Encoded contains any illegal character, cgi decode
returns a positive value.
The postconditions should, together, capture all the expected outcomes of
the module under test. When there are several possible outcomes, it is possible to capture them all in one complex postcondition or in several simple
Draft version produced 20th March 2002

Functional Testing

88

PRE 1
PRE 2
POST 1
POST 2

POST 3
POST 4
POST 5

POST 6
VAR 1
VAR 2
VAR 3
DEF 1
DEF 2
DEF 3
OP 1

(Assumed) the input string


is a null-terminated string of
characters
is a sequence of CGI items
(Validated) the input string
if the input string
contains alphanumeric characters, they
are copied to the output string in the corresponding positions.
contains characters +, they are reif the input string
placed in the output string by ASCII SPACE characters in the corresponding positions
contains CGI-hexadecimals, they are
if the input string
replaced by the corresponding ASCII characters.
is well-formed, cgi-decode returns 0
if the input string
if the input string
contains a malformed CGI hexadecimal, i.e., a substring %xy, where either x or y are absent or are
not hexadecimal digits, cgi decode returns 1
if the input string
contains any illegal character,
cgi decode returns a positive value
: a string of ASCII characters
: a string of ASCII characters
: a boolean

are ASCII characters in range [0 .. 9, A .. F,


a .. f]

are sequences %xy, where x and y are hexadecimal digits
is an alphanumeric character, or +, or a CGIA
hexadecimal
Scan
Table 13.8: Elementary items of specification cgi-decode

postconditions; here we have chosen a set of simple contingent postconditions, each of which captures one case.
Although the description of cgi decode does not mention explicitly how
the results are obtained, we can easily deduce that it will be necessary to scan
the input sequence. This is made explicit in the following operation:
OP 1 Scan the input string Encoded.
In general, a description may refer either explicitly or implicitly to elementary operations which help to clearly describe the overall behavior, like definitions help to clearly describe variables. As with variables, they are not strictly
necessary for describing the relation between pre- and postconditions, but
they serve as additional information for deriving test cases.
The result of step 1 for cgi decode is summarized in Table 13.8.
Draft version produced 20th March 2002

Catalog Based Testing

89

STEP 2 Derive a first set of test case specifications from preconditions, postconditions and definitions The aim of this step is to explicitly describe the
partition of the input domain:
Validated Preconditions: A simple precondition, i.e., a precondition that is
expressed as a simple boolean expression without and or or, identifies
two classes of input: values that satisfy the precondition and values that
do not. We thus derive two test case specifications.
A compound precondition, given as a boolean expression with and or
or, identifies several classes of inputs. Although in general one could
derive a different test case specification for each possible combination
of truth values of the elementary conditions, usually we derive only a
subset of test case specifications using the modified condition decision
coverage (MC/DC) approach, which is illustrated in Section 13.6 and in
Chapter ??. In short, we derive a set of combinations of elementary conditions such that each elementary condition can be shown to independently affect the outcome of each decision. For each elementary condition , there are two test case specifications in which the truth values
of all conditions except are the same, and the compound condition
as a whole evaluates to True for one of those test cases and False for the
other.
Assumed Preconditions: We do not derive test case specifications for cases
that violate assumed preconditions, since there is no defined behavior
and thus no way to judge the success of such a test case. We also do not
derive test cases when the whole input domain satisfies the condition,
since test cases for these would be redundant. We generate test cases
from assumed preconditions only when the MC/DC criterion generates
more than one class of valid combinations (i.e., when the condition is a
logical disjunction of more elementary conditions).
Postconditions: In all cases in which postconditions are given in a conditional form, the condition is treated like a validated precondition, i.e.,
we generate a test case specification for cases that satisfy and cases that
do not satisfy the condition.
Definition: Definitions that refer to input or output variables are treated like
postconditions, i.e., we generate a set of test cases for each definition
given in conditional form with the same criteria used for validated preconditions. The test cases are generated for each variable that refers to
the definition.
The elementary items of the specification identified in step 1 are scanned
sequentially and a set of test cases is derived applying these rules. While
scanning the specifications, we generate test case specifications incrementally. When new test case specifications introduce a finer partition than an
existing case, or vice versa, the test case specification that creates the coarser
Draft version produced 20th March 2002

Functional Testing

90

partition becomes redundant and can be eliminated. For example, if an existing test case specification requires a non-empty set, and we have to add
two test case specifications that require a size that is a power of two and one
which is not, the existing test case specification can be deleted.
Scanning the elementary items of the cgi decode specification given in
Table 13.7, we proceed as follows:
PRE 1: The first precondition is a simple assumed precondition, thus, according to the rules, we do not generate any test case specification. The

, but
only condition would be
this matches every test case and thus it does not identify a useful partition.
PRE 2: The second precondition is a simple validated precondition, thus we
generate two test case specifications, one that satisfies the condition
and one that does not:
TC-PRE2-1
TC-PRE2-2

: a sequence of CGI items


: not a sequence of CGI items

postconditions: all postconditions in the cgi decode specification are given


in a conditional form with a simple condition. Thus, we generate two
test case specifications for each of them. The generated test case specifications correspond to a case that satisfies the condition and a case that
violates it.
POST 1:
TC-POST1-1
acters
TC-POST1-2
acters

: contains one or more alphanumeric char-

: does not contain any alphanumeric char-

: contains one or more character +


: does not any contain character +

: contains one or more CGI-hexadecimals


: does not contain any CGI-hexadecimal

POST 2:
TC-POST2-1
Tc-POST2-2
POST 3:
TC-POST3-1
TC-POST3-2

POST 4: we do not generate any new useful test case specifications, because the two specifications are already covered by the specifications generated from POST 2.
Draft version produced 20th March 2002

Catalog Based Testing

91

POST 5: we generate only the test case specification that satisfies the
condition; the test case specification that violates the specification
is redundant with respect to the test case specifications generated
from POST 3
TC-POST5-1 :

contains one or more malformed CGI-hexadecimals

POST 6: as for POST 5, we generate only the test case specification that
satisfies the condition; the test case specification that violates the
specification is redundant with respect to most of the test case specifications generated so far.
TC-POST6-1

: contains one or more illegal characters

definitions none of the definitions in the specification of cgi decode is given


in conditional terms, and thus no test case specifications are generated
at this step.
The test case specifications generated from postconditions refine test case
specification TC-PRE2-1, which can thus be eliminated from the checklist.
The result of step 2 for cgi decode is summarized in Table 13.9.
STEP 3 Complete the test case specifications using catalogs The aim of this
step is to generate additional test case specifications from variables and operations used or defined in the computation. The catalog is scanned sequentially. For each entry of the catalog we examine the elementary components
of the specification and we add test case specifications as required by the catalog. As when scanning the test case specifications during step 2, redundant
test case specifications are eliminated.
Table 13.10 shows a simple catalog that we will use for the cgi decoder example. A catalog is structured as a list of kinds of elements that can occur in
a specification. Each catalog entry is associated with a list of generic test case
specifications appropriate for that kind of element. We scan the specification
for elements whose type is compatible with the catalog entry, then generate
the test cases defined in the catalog for that entry. For example, the catalog of
Table 13.10 contains an entry for boolean variables. When we find a boolean
variable in the specification, we instantiate the catalog entry by generating
two test case specifications, one that requires a True value and one that requires a False value.
Each generic test case in the catalog is labeled in, out, or in/out, meaning
that a test case specification is appropriate if applied to either an input variable, or to an output variable, or in both cases. In general, erroneous values
should be used when testing the behavior of the system with respect to input
variables, but are usually impossible to produce when testing the behavior of
the system with respect to output variables. For example, when the value of
an input variable can be chosen from a set of values, it is important to test
the behavior of the system for all enumerated values and some values outside the enumerated set, as required by entry ENUMERATION of the catalog.
Draft version produced 20th March 2002

Functional Testing

92

PRE 2
[

POST 1
[
[

]
]

POST 2
[
[

]
]

POST 3
[
[

]
]

POST 4
POST 5

POST 6
[

: not a sequence of CGI items




: contains alphanumeric characters
: does not contain alphanumeric characters


: contains +
: does not contain +


: contains CGI-hexadecimals
: does not contain a CGI-hexadecimal



: contains illegal characters

VAR 2

DEF 1

DEF 2

DEF 3

OP 1


: contains malformed CGI-hexadecimals

VAR 1

VAR 3

Table 13.9: Test case specificationsDraft


for cgi-decode
version produced
generated
20th
after
March
step2002
2
.

Catalog Based Testing


[ ]
[ ]

True
False


[ ]
[ ]

Each enumerated value


Some value outside the enumerated set


[ ]
[ ]
[ ]
[ ]
[ ]

93

(the element immediately preceding the lower bound)


(the lower bound)
A value between and
(the upper bound)
(the element immediately following the upper bound)


[ ]
(the constant value)
(the element immediately preceding the constant value)
[ ]
(the element immediately following the constant value)
[ ]
Any other constant compatible with
[ ]

[ ]
[ ]
[ ]

(the constant value)


Any other constant compatible with
Some other compatible value


[ ]
[ ]
[ ]
[ ]
[ ]
[ ]

Empty
A single element
More than one element
Maximum length (if bounded) or very long
Longer than maximum length (if bounded)
Incorrectly terminated


[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
[ ]


occurs at beginning of sequence
occurs in interior of sequence
occurs at end of sequence
occurs contiguously
does not occur in sequence
where is a proper prefix of
Proper prefix occurs at end of sequence
Table 13.10: Part of a simple test catalog.

Draft version produced 20th March 2002

Functional Testing

94

However, when the value of an output variable belongs to a finite set of values,
we should derive a test case for each possible outcome, but we cannot derive
a test case for an impossible outcome, so entry ENUMERATION of the catalog specifies that the choice of values outside the enumerated set is limited
to input variables. Intermediate variables, if present, are treated like output
variables.
Entry Boolean of the catalog applies to (VAR 3). The catalog
requires a test case that produces the value True and one that produces the
value False. Both cases are already covered by test cases TC-PRE2-1 and TCPRE2-2 generated for precondition PRE 2, so no test case specification is actually added.
Entry Enumeration of the catalog applies to any variable whose values are
chosen from an explicitly enumerated set of values. In the example, the values

in POST 5 are defined by
of (DEF 3) and of improper
enumeration. Thus, we can derive new test case specifications by applying
.
entry enumeration to POST 5 and to any variable that can contain
The catalog requires creation of a test case specification for each enumerated value and for some excluded values. For , which uses DEF 3, we
generate a test case specification where a CGI-item is an alphanumeric character, one where it is the character +, one where it is a CGI-hexadecimal,
and some where it is an illegal value. We can easily ascertain that all the required cases are already covered by test case specifications for TC-POST11, TC-POST1-2, TC-POST2-1, TC-POST2-2, TC-POST3-1, and TC-POST3-2, so
any additional test case specifications would be redundant.
From the enumeration of malformed CGI-hexadecimals in POST 5, we derive the following test cases: %y, %x, %ky, %xk, %xy (where x and y are hexadecimal digits and k is not). Note that the first two cases, %x (the second
hexadecimal digit is missing) and %y (the first hexadecimal digit is missing)
are identical, and %x is distinct from %xk only if %x are the last two characters
in the string. A test case specification requiring a correct pair of hexadecimal
digits (%xy) is a value out of the range of the enumerated set, as required by
the catalog.
The added test case specifications are:
TC-POST5-2

: terminated with %x, where x is a hexadecimal digit

TC-POST5-3 : contains %ky, where k is not a hexadecimal digit and y


is a hexadecimal digit.
TC-POST5-4
not.

: contains %xk, where x is a hexadecimal digit and k is

The test case specification corresponding to the correct pair of hexadecimal digits is redundant, having already been covered by TC-POST3-1. The
test case TC-POST5-1 can now be eliminated because it is more general than
the combination of TC-POST5-2, TC-POST5-3, and TC-POST5-4.
Draft version produced 20th March 2002

Catalog Based Testing

95

Entry Range applies to any variable whose values are chosen from a finite
range. In the example, ranges appear three times in the definition of hexadecimal digit. Ranges also appear implicitly in the reference to alphanumeric
characters (the alphabetic and numeric ranges from the ASCII character set)
in DEF 3. For hexadecimal digits we will try the special values / and : (the
characters that appear before 0 and after 9 in the ASCII encoding), the values 0 and 9 (upper and lower bounds of the first interval), some value between 0 and 9, and similarly @, G, A, F, and some value between A and
F for the second interval and , g, a, f, and some value between a and f
for the third interval.
These values will be instantiated for variable , and result in 30 additional test case specifications (5 values for each subrange, giving 15 values
for each hexadecimal digit and thus 30 for the two digits of CGI-hexadecimal).
The full set of test case specifications is shown in Table ??. These test case
specifications are more specific than (and therefore replace) test case specifications TC-POST3-1, TC-POST5-3, and TC-POST5-4.
For alphanumeric characters we will similarly derive boundary, interior
and excluded values, which result in 15 additional test case specifications,
also given in Table ??. These test cases are more specific than (and therefore
replace) TC-POST1-1, TC-POST1-2, TC-POST6-1.
Entry Numeric Constant does not apply to any element of this specification.
Entry Non-Numeric Constant applies to + and %, occurring in DEF 3 and
DEF 2 respectively. Six test case specifications result, but all are redundant.
(VAR 1),
(VAR 2), and
Entry Sequence applies to
(DEF 2). Six test case specifications result for each, of which only five are mutually non-redundant and not already in the list. From VAR 1 ( ) we
generate test case specifications requiring an empty sequence, a sequence
containing a single element, and a very long sequence. The catalog entry requiring more than one element generates a redundant test case specification,
which is discarded. We cannot produce reasonable test cases for incorrectly
terminated strings (the behavior would vary depending on the contents of
memory outside the string), so we omit that test case specification.
(VAR 2) would
All test case specifications that would be derived for
be redundant with respect to test case specifications derived for
(VAR
1).

(DEF 2) we generate two additional test case specFrom
: a sequence that terminates with % (the only
ifications for variable
way to produce a one-character subsequence beginning with %) and a sequence containing %xyz, where x, y, and z are hexadecimal digits.
(OP 1) and generates 17 test case specEntry Scan applies to
ifications. Three test case specifications (alphanumeric, +, and
) are
generated for each of the first 5 items of the catalog entry. One test case specification is generated for each of the last two items of the catalog entry when
Scan is applied to CGI item. The last two items of the catalog entry do not
apply to alphanumeric characters and +, since they have no non-trivial preDraft version produced 20th March 2002

Functional Testing

96

fixes. Seven of the 17 are redundant. The ten generated test case specifications are summarized in Table 13.11.
Test catalogs, like other check-lists used in test and analysis (e.g., inspection check-lists), are an organizational asset that can be maintained and enhanced over time. A good test catalog will be written precisely and suitably
annotated to resolve ambiguity (unlike the sample catalog used in this chapter). Catalogs should also be specialized to an organization and application
domain, typically using a process such as defect causal analysis or root cause
analysis. Entries are added to detect particular classes of faults that have been
encountered frequently or have been particularly costly to remedy in previous projects. Refining check-lists is a typical activity carried out as part of
process improvement. When a test reveals a program fault, it is useful to
make a note of which catalog entries the test case originated from, as an aid
to measuring the effectiveness of catalog entries. Catalog entries that are not
effective should be removed.

13.9

Deriving Test Cases from Finite State Machines

Finite state machines are often used to specify sequences of interactions between a system and its environment. State machine specifications in one
form or another are common for control and interactive systems, such as embedded systems, communication protocols, menu driven applications, threads
of control in a system with multiple threads or processes.
In several application domains, specifications may be expressed directly
as some form of finite-state machine. For example, embedded control systems are frequently specified with Statecharts, communication protocols are
commonly described with SDL diagrams, and menu driven applications are
sometimes modeled with simple diagrams representing states and transitions.
In other domains, the finite state essence of the systems are left implicit in
informal specifications. For instance, the informal specification of feature
Maintenance of the Chipmuk web site given in Figure 13.9 describes a set of
interactions between the maintenance system and its environment that can
be modeled as transitions through a finite set of process states. The finite
state nature of the interaction is made explicit by the finite state machine
shown in Figure 13.10. Note that some transitions appear to be labeled by
conditions rather than events, but they can be interpreted as shorthand for
an event in which the condition becomes true or is discovered (e.g., lack
component is shorthand for discover that a required component is not in
stock.
Many control or interactive systems are characterized by an infinite set of
states. Fortunately, the non-finite-state parts of the specification are often
simple enough that finite state machines remain a useful model for testing as
well as specification. For example, communication protocols are frequently
specified using finite state machines, often with some extensions that make
Draft version produced 20th March 2002

Deriving Test Cases from Finite State Machines

contains %xy,
with y in [B..E]
ter +
contains %xF
TC-DEF2-24
does not contain
TC-POST2-2
contains %xG
TC-DEF2-25
character +
contains %x
TC-DEF2-26
does not contain
TC-POST3-2
contains %xa
TC-DEF2-27
a CGI-hexadecimal
contains %xy,
TC-DEF2-28
terminates with
TC-POST5-2
with y in [b..e]
%x
contains %xf
TC-DEF2-29
is the empty seTC-VAR1-1
contains %xg
TC-DEF2-30
quence
contains %$
TC-DEF2-31
is a sequence
TC-VAR1-2
contains %xyz
TC-DEF2-32
containing a single charcontains /
TC-DEF3-1
acter
contains 0
TC-DEF3-2
is a very long seTC-VAR1-3
contains , with c
TC-DEF3-3
quence
in [1..8]
contains %/y
TC-DEF2-1
contains 9
TC-DEF3-4
contains %0y
TC-DEF2-2
contains :
TC-DEF3-5
contains %xy,
TC-DEF2-3
contains @
TC-DEF3-6
with x in [1..8]
contains A
TC-DEF3-7
contains %9y
TC-DEF2-4
contains , with c
TC-DEF3-8
contains %:y
TC-DEF2-5
in [B..Y]
contains %@y
TC-DEF2-6
contains Z
TC-DEF3-9
contains %Ay
TC-DEF2-7
contains [
TC-DEF3-10
contains %xy,
TC-DEF2-8
contains
TC-DEF3-11
with x in [B..E]
contains a
TC-DEF3-12
contains %Fy
TC-DEF2-9
contains , with c
TC-DEF3-13
contains %Gy
TC-DEF2-10
in [b..y]
contains %y
TC-DEF2-11
contains z
TC-DEF3-14
contains %ay
TC-DEF2-12
contains
TC-DEF3-15
contains %xy,
TC-DEF2-13
contains
TC-OP1-1
with x in [b..e]
contains +
TC-OP1-2
contains %fy
TC-DEF2-14
contains %xy
TC-OP1-3
contains %gy
TC-DEF2-15
contains $
TC-OP1-4
contains %x/
TC-DEF2-16
contains +$
TC-OP1-5
contains %x0
TC-DEF2-17
contains %xy$
TC-OP1-6
contains %xy,
TC-DEF2-18
contains
TC-OP1-7
with y in [1..8]
contains ++
TC-OP1-8
contains %x9
TC-DEF2-19
contains
TC-OP1-9
contains %x:
TC-DEF2-20
%xy%zw
contains %x@
TC-DEF2-21
contains
TC-OP1-10
contains %xA
TC-DEF2-22
%x%yz
are hexadecimal digits, is an alphanumeric character, represents
where
the beginning of the string, and $ represents the end of the string.
TC-POST2-1

contains charac-

TC-DEF2-23

97

Table 13.11: Summary table: Test case specifications for cgi-decode generated with a catalog.

Draft version produced 20th March 2002

Functional Testing

98

Maintenance:

Figure 13.9: The functional specification of feature Maintenace of the Chipmuk web site.

Draft version produced 20th March 2002

Deriving Test Cases from Finite State Machines

99

e
ct
je
e
at
im
st

5
Wait for
acceptance

Wait for
component

accept
estimate

Wait for
pick up

p ic k

Repair
(maintenance
station)

repair completed

(U una
)
S ble
t (a
or
n
e
n
UE t o r
po
m
re ep a
o
component
c
s id ir
k
en
l ac
arrives (a)
t)
8 Repair
lack component (b)

(regional
headquarters)

arrives (b)
l ac
kc
om
pon
en
t (c
component
)
arrives (c)

Repaired

l
sf u
c es
suc

ai r
rep

unable to
repair

component
unable to repair
(not (US or UE resident)

up

re
pa
ir

re

in
c o v al
nu nt r a id
mb ct
er

fu
l

Maintenance
(no warranty)

estimate
costs

Wait for
returning

return

es
s

by re q u
[ U S p h on e e st
( co o r UE o r we
n t ra
b
r
ct n esi de n
um
b er t]
)

su
cc

t a t t io n
u es
ta
re q n ce s )
a
n rr an t y
e
t
in
m a no w a
(

k up
p ic

request at
maintenance station
or by express cour ier
(contract number)

NO
Maintenance

Repairi
(main
headquarters)

Figure 13.10: The finite state machine corresponding to functionality Maintenance specified in Figure 13.9

Draft version produced 20th March 2002

Functional Testing

100

TC-1
TC-2
TC-3
TC-4

T-Cover
02410
0524560
035960
0357587897960

Table 13.12: A set of test specifications in the form of paths in a finite-state


machine specification. States are indicated referring to the numbers given in
Figure 13.10. For example, TC-1 is a test specification requiring transitions
(0,2), (2,4), (4,1), and (1,0) be traversed, in that order.

them not truly finite-state. A state machine that simply receives a message
on one port and then sends the same message on another port is not really
finite-state unless the set of possible messages is finite, but is often rendered
as a finite state machine, ignoring the contents of the exchanged messages.
State-machine specifications can be used both to guide test selection and
in construction of an oracle that judges whether each observed behavior is
correct. There are many approaches for generating test cases from finite state
machines, but most are variations on a basic strategy of checking each state
transition. One way to understand this basic strategy is to consider that each
transition is essentially a specification of a precondition and postcondition,
e.g., a transition from state to state on stimulus means if the system
is in state and receives stimulus , then after reacting it will be in state .
For instance, the transition labeled accept estimate from state Wait for acceptance to state Repair (maintenance station) of Figure 13.10 indicates that if an
item is on hold waiting for the customer to accept an estimate of repair costs,
and the customer accepts the estimate, then the maintenance station begins
repairing the item.
A faulty system could violate any of these precondition, postcondition
pairs, so each should be tested. For instance, the state Repair (maintenance
station) can be arrived through three different transitions, and each should
be checked.
Details of the approach taken depend on several factors, including whether
system states are directly observable or must be inferred from stimulus/response
sequences, whether the state machine specification is complete as given or
includes additional, implicit transitions, and whether the size of the (possibly
augmented) state machine is modest or very large.
A basic criterion for generating test cases from finite state machines is
transition coverage, which requires each transition to be traversed at least
once. Test case specifications for transition coverage are often given as sets of
state sequences or transition sequences. For example, T-Cover in Table 13.12
is a set of four paths, each beginning at the initial state, which together cover
all transitions of the finite state machine of Figure 13.10. T-Cover thus satisfies
the transition coverage criterion.
The transition coverage criterion depends on the assumption that the finiteDraft version produced 20th March 2002

Deriving Test Cases from Finite State Machines

101

state machine model is a sufficient representation of all the important state,


e.g., that transitions out of a state do not depend on how one reached that
state. Although it can be considered a logical flaw, in practice one often finds
state machines that exhibit history sensitivity, i.e., the transitions from a
state depend on the path by which one reached that state. For example, in
Figure 13.10, the transition taken from state Wait for component when the
component becomes available depends on how the state was entered. This is
a flaw in the model there really should be three distinct Wait for component
states, each with a well-defined action when the component becomes available. However, sometimes it is more expedient to work with a flawed statemachine model than to repair it, and in that case test suites may be based on
more than the simple transition coverage criterion.
Coverage criteria designed to cope with history sensitivity include single state path coverage, single transition path coverage, and boundary interior
loop coverage.The single state path coverage criterion requires each path that
traverses states at most once to be exercised. The single transition path coverage criterion requires each path that traverses transitions at most once to be
exercised. The boundary interior loop coverage criterion requires each distinct loop of the state machine to be exercised the minimum, an intermediate, and the maximum number of times11 . These criteria may be practical for
very small and simple finite-state machine specifications, but since the number of even simple paths (without repeating states) can grow exponentially
with the number of states, they are often impractical.
Specifications given as finite-state machines are typically incomplete, i.e.,
they do not include a transition for every possible (state, stimulus) pair. Often
the missing transitions are implicitly error cases. Depending on the system,
the appropriate interpretation may be that these are dont care transitions
(since no transition is specified, the system may do anything or nothing), self
transitions (since no transition is specified, the system should remain in the
same state), or (most commonly) error transitions that enter a distinguished
state and possibly trigger some error handling procedure. In at least the latter
two cases, thorough testing includes the implicit as well as the explicit state
transitions. No special techniques are required; the implicit transitions are
simply added to the representation before test cases are selected.
The presence of implicit transitions with a dont care interpretation is typically an implicit or explicit statement that those transitions are impossible,
e.g., because of physical constraints. For example, in the specification of the
maintenance procedure of Figure 13.10, the effect of event lack of component is specified only for the states that represent repairs in progress. Sometimes it is possible to test such sequences anyway, because the system does
not prevent such events from occurring Where possible, it may be best to
treat dont care transitions as self transitions (allowing the possibility of imperfect translation from physical to logical events, or of future physical layers

11 The boundary interior path coverage was originally proposed for structural coverage of program control flow, and is described in Chapter 14

Draft version produced 20th March 2002

Functional Testing

102

Advanced search: The Advanced search function allows for searching elements in the
website database.
The key for searching can be:
a simple string , i.e., a simple sequence of characters,
a compound string , i.e.,

a string terminated with character *, used as wild character, or


a string composed of substrings included in braces and separated
with commas, used to indicate alternatives.

a combination of strings , i.e., a set of strings combined with the boolean operators NOT, AND, OR, and grouped within parenthesis to change the priority of operators.
Examples:
laptop The routine searches for string laptop
DVD*,CD* The routine searches for strings that start with substring DVD
or CD followed by any number of characters
NOT (C2021*) AND C20* The routine searches for strings that start with substring C20 followed by any number of characters, except substring 21

Figure 13.11: The functional specification of feature advanced search of the


Chipmunk web site.

with different properties), or as error transitions (requiring that unanticipated


sequences be recognized and handled). If it is not possible to produce test
cases for the dont care transitions, then it may be appropriate to pass them
to other validation or verification activities, for example, by including explicit
assumptions in a requirements or specification document that will undergo
inspection.

13.10

Deriving Test Cases from Grammars

Sometimes, functional specifications are given in the form of grammars or


regular expressions. This is often the case in description of languages, such
as specifications of compilers or interpreters. More often syntactic structures are described with natural or domain specific languages, such as simple
scripting rules and complex document structures.
The informal specification of the advanced search functionality of the Chipmuk website shown in Figure 13.11 defines the syntax of the search pattern.
Not surprisingly, this specification can easily be expressed as a grammar. Figure 13.12 expresses the specification as a grammar in Bachus Naur Form (BNF).
Draft version produced 20th March 2002

Deriving Test Cases from Grammars

search
binop
term
regexp
choices

search

regexp

binop term

search

term

search

Char regexp
regexp

103

Char

choices

regexp choices

Figure 13.12: The BNF description of functionality Advanced search.

A second example is given in Figure 13.13, which specifies a product configuration of the Chipmuk website. In this case, the syntactic structure of
product configuration is described by an XML schema, which defines an element Model of type ProductConfigurationType. XML schemata are essentially
a variant of BNF, so it is not difficult to render the schema in the same BNF
notation, as shown in Figure 13.13.
In general, grammars are well suited to represent inputs of varying and unbounded size, boundary conditions, and recursive structures. None of which
can be easily captured with fixed lists of parameters, as required by most

methods presented in this chapter.
Generating test cases from grammar specifications is straightforward and
can easily be automated. To produce a string, we start from a non-terminal
symbol and we progressively substitute non-terminals occurring in the current string with substrings, as indicated by the applied productions, until we
obtain a string composed only of terminal symbols. In general at each step,
several rules can be applied. A minimal set of test cases can be generated
by requiring each production to be exercised at least once. Test cases can
be generated by starting from the start symbol and applying all productions.
The number and complexity of the generated test cases depend on the order of application of the productions. If we first apply productions with nonterminals on the right hand side, we generate a smaller set of test cases, each
one tending to be a large test case. On the contrary, first applying productions
with only terminals on the right hand side, we generate larger sets of smaller
test cases. An algorithm that favors non-terminals applied to the BNF for Advanced Search of Figure 13.11, generates the test case
not Char *, Char and (Char or Char)
that exercise all productions. The derivation tree for this test case is given
in Figure 13.15. It shows that all productions of the BNF are exercised at least

once.

The minimal set of test cases can be enriched by considering boundary
conditions. Boundary conditions apply to recursive productions. To generate test cases for boundary conditions we need to identify the minimum and
maximum number of recursive applications of a production and then generate a test case for the minimum, maximum, one greater than minimum and

Draft version produced 20th March 2002

Functional Testing

104

XML is a parenthetical language: descriptions of items are either enclosed in


angular parenthesis ( ) or terminated with /item clauses. Schema and
annotation ( xsd:schema ... and xsd:annotation ... /xsd:annotation ) give
information about the XML version and the authors. The first clause
( xsd:element ... describes a Model as an instance of type
ProductConfigurationType. The clause xsd:complexType ...
/xsd:complexType describes type ProductConfigurationType as composed
of

a field modelNumber of type String. Field modelNumber is required.


a possibly empty set of Components, each characterized by fields ComponentType and ComponentValue, both of type string.
a possibly empty set of OptionalComponents, each characterized by a
ComponentType of type string

Figure 13.13: The XML Schema that describes a Product configuration of the
Chipmuk website
Draft version produced 20th March 2002

Deriving Test Cases from Grammars

105

Model

modelNumber compSequence optCompSequence

compSequence

Component compSequence

optCompSequence

OptionalComponent

Component

ComponentType ComponentValue

OptionalComponent

ComponentType

modelNumber

ComponentType

ComponentValue

optCompSequence

Figure 13.14: The BNF description of Product Configuration.

<search>
<search> <binop> <term>
not <search>
<term>
<regexp>
Char <regexp>
{<choices>}

and

(<search>)

<search> <binop> <term>


<term>
<regexp>

or <regexp>
Char

Char

<regexp> , <choices>
*

<regexp>
Char

Figure 13.15: The derivation tree of a test case for functionality Advanced
Search derived from the BNF specification of Figure 13.12.

Draft version produced 20th March 2002

Functional Testing

106

Model Model

modelNumber compSequence optCompSequenc

compSeq1 limit=16 compSequence

Component compSequence

compSeq2 compSequence
optCompSeq1 limit=16 optCompSequence
optCompSeq2 optCompSequence

OptionalComponent optCompSequence

Comp Component

ComponentType ComponentValue

OptComp OptionalComponent

ComponentType

modNum modelNumber

CompTyp ComponentType

CompVal ComponentValue

Figure 13.16: The BNF description of Product Configuration extended with


production names and limits.

one smaller than maximum number of application of each production.


To apply boundary condition grammar based criteria, we need to add limits to the recursive productions. Names and limits are shown in Figure 13.16,
which extends the grammar of Figure 13.14. Compound productions are decomposed into their elementary components. Production names are used
for references purpose. Limits are added only to recursive productions. In
the example of Figure 13.16, the limit of both productions compSeq1 and optCompSeq1 is set to 16, i.e., we assume that each model can have at most 16
required and 16 optional components.
The boundary condition grammar based criteria would extend the minimal set by adding test cases that cover the following choices:

zero required components (compSeq1 applied 0 times)


one required component (compSeq1 applied 1 time)

fifteen required components (compSeq1 applied times)


sixteen required components (compSeq1 applied times)
zero optional components (optCompSeq1 applied 0 times)
one optional component (optCompSeq1 applied 1 time)

fifteen optional components (optCompSeq1 applied times)


sixteen optional components (optCompSeq1 applied times)
Draft version produced 20th March 2002

Choosing a Suitable Approach


weight
weight
weight
weight
weight
weight
weight
weight
weight
weight

107
Model
compSeq1
compSeq2
optCompSeq1
optCompSeq2
Comp
OptComp
modNum
CompTyp
CompVal

1
10
0
10
0
1
1
1
1
1

Figure 13.17: A sample seed that assigns probabilities to productions of the


BNF specification of the BNF of Product Configuration.

Additional boundary condition grammar based criteria can be defined by


also requiring specific combinations of applications of productions, e.g., requiring all productions to be simultaneously applied the minimum or the
maximum number of times. This additional requirement applied to the example of Figure 13.16 would require additional test cases corresponding to
the cases of (1) both no required and no optional components (both compSeq1
and optCompSeq1 applied 0 times), and (2) 16 required and 16 additional

components (both compSeq1 and optCompSeq1 applied times).

Probabilistic grammar based criteria assign probabilities to productions,


thus indicating which production to select at each step to generate test cases.
Unlike names and limits, probabilities are attached to grammar productions
as a separate set of annotations, called seed. In this way, we can generate
several sets of test cases from the same grammar with different seeds. Figure 13.17 shows a sample seed for the grammar that specify the product configuration functionality of the Chipmuk web site presented in Figure 13.16.
Probabilities are indicated as weights that determine the relative occurrences of the production in a sequence of applications that generate a test
case. The same weight for compSeq1 and optCompSeq1 indicates that test
cases are generated by balancing the applications of these two productions,
i.e., they contain the same number of required and optional components.
Weight disables the productions, which are then applied only when the application of competing productions reaches the limit indicated in the grammar.

13.11 Choosing a Suitable Approach


We have seen several approaches to functional testing, each applying to different kinds of specifications. Given a specification, there may be one or more
techniques well suited for deriving functional test cases, while some other
Draft version produced 20th March 2002

108

Functional Testing
techniques may be hard or even impossible to apply, or may lead to unsatisfactory results. Some techniques can be interchanged, i.e., they can be applied to the same specification and lead to similar results. Other techuiques
are complementary, i.e., they apply to different aspects of the same specification or at different stages of test case generation. In some cases, approaches
apply directly to the form in which the specification is given, in some other
cases, the specification must be transformed into a suitable form.
The choice of approach for deriving functional testing depends on several
factors: the nature of the specification, the form of the specification, expertieses and experience of test designers, the structure of the organization, the
availability of tools, the budget and quality constraints, and the costs of designing and implementing the scaffolding.
Nature and form of the specification Different approaches exploit different characteristics of the specification. For example, the presence of several
constraints on the input domain may suggest the category partition method,
while lack of constraints may indicate a combinatorial approach. The presence of a finite set of states could suggest a finite state machine approach,
while inputs of varying and unbounded size may be tackled with grammar
based approaches. Specifications given in a specific format, e.g., as finite state
machines, or decision structures suggest the corresponding techniques. For
example, functional test cases for SDL specifications of protocols are often
derived with finite state machine based criteria.
Experience of test designers and organization Experience of testers and
company procedures may drive the choice of the testing technique. For example, test designers expert in category partition may prefer this technique
over a catalog based approach when both are applicable, while a company
that works in a specific domain may require the use of catalogs suitably produced for the domain of interest.
Tools Some techniques may require the use of tools, whose availability and
cost should be taken into account when choosing a specific testing technique.
For example, several tools are available for deriving test cases from SDL specifications. The availability of one of these tools may suggest the use of SDL for
capturing a subset of the requirements expressed in the specification.
Budget and quality constraints Different quality and budget constraints
may lead to different choices. For example, the need of quickly check a software product without stringent reliability requirements may lead to chose a
random test generation approach, while a thorough check of a safety critical application may require the use of sophisticated methods for functional
test case generation. When choosing a specific approach, it is important to
Draft version produced 20th March 2002

Choosing a Suitable Approach

109

evaluate all cost related aspects. For example, the generation of a large number of random tests may require the design of sophisticated oracles, which
may raise the costs of testing over an acceptable threshold; the cost of a specific tool and the related training may go beyond the advantages of adopting
a specific approach, even if the nature and the form of the specification may
suggest the suitability of that approach.
Many engineering activities require careful trading off different aspects.
Functional testing is not an exception: successfully balancing the many aspects is a difficult and often underestimated problem that requires highly
skilled designers. Functional testing is not an exercise of choosing the optimal approach, but a complex set of activities for finding a suitable combination of models and techniques that can lead to a set of test cases that satisfy
cost and quality constraints. This balancing extends beyond test design to
software design for test. Appropriate design not only improves the software
development process, but can greatly facilitate the job of test designers, and
thus lead to substantial savings.
Too often test designers make the same mistake as non-expert programmers, that is to start generating code in one case, test cases in the other,
without prior analysis of the problem domain. Expert test designers carefully examine the available specifications, their form, domain and company
constraints for identifying a suitable framework for designing test case specifications before even starting to consider the problem of test case generation.

Open research issues


Functional testing is by far the most popular way of deriving test cases in industry, but both industrial practice and research are still far from general and
satisfactory methodologies. Key reasons for the relative shortage of results
are the intrinsic difficulty of the problem and the difficulty of working with
informal specifications. Research in functional testing is increasingly active
and progresses in many directions.
A hot research area concerns the use of formal methods for deriving test
cases. In the past three decades, formal methods have been mainly studied as
a means for formally proving software properties. Recently, a lot of attention
has been moved towards the use of formal methods for deriving test cases.
There are three main open research topics in this area:

definition of techniques for automatically deriving test cases from particular formal methods. Formal methods present new challenges and
opportunities for deriving test cases. We can both adapt existing techniques borrowed from other disciplines or research areas and define
new techniques for test case generation. The formal nature can support fully automatic generation of test cases, thus opening additional
problems and research challenges.
Draft version produced 20th March 2002

Functional Testing

110

adaptation of formal methods to be more suitable for test case generation. As illustrated in this chapter, test cases can be derived in two
broad ways, either by identifying representative values or by deriving a
model of the unit under test. The possibility of automatically generating test cases from different formal methods offers the opportunities of
a large set of models to be used in testing. The research challenge relies
in the capability of identifying a tradeoff between costs of generating
formal models and savings in automatically generating test cases. The
possibility of deriving simple formal models capturing only the aspects
of interests for testing has been already studied in some specific areas,
like concurrency, where test cases can be derived from models of the
concurrency structure ignoring other details of the system under test,
but the topic presents many new challenges if applied to wider classes
of systems and models.
identification of a general framework for deriving test cases from any
particular formal specification. Currently research is moving towards
the study of techniques for generating test cases for specific formal methods. The unification of methods into a general framework will constitute an additional important result that will allow the interchange of
formal methods and testing techniques.

Another hot research area is fed by the increasing interest in different specification and design paradigms. New software development paradigms, such
as the object oriented paradigm, as well as techniques for addressing increasingly important topics, such as software architectures and design patterns,
are often based on new notations. Semi-formal and diagrammatic notations
offer several opportunities for systematically generating test cases. Resarch is
active in investigating different possibilities of (semi) automatically deriving
test cases from these new forms of specifications and studying the effectiveness of existing test case generation techniques12 .
Most functional testing techniques do not satisfactory address the problem of testing increasingly large artifacts. Existing functional testing techniques do not take advantages of test cases available for parts of the artifact
under test. Compositional approaches for deriving test cases for a given system taking advantage of test cases available for its subsystems is an important
open research problem.

Further Reading
Functional testing techniques, sometimes called black-box testing or specificationbased testing, are presented and discussed by several authors. Ntafos [DN81]
makes the case for random, rather than systematic testing; Frankl, Hamlet,
12 Problems and state-of-art techniques for testing object oriented software and software architectures are discussed in Chapters ?? and ??

Draft version produced 20th March 2002

Choosing a Suitable Approach

111

Littlewood and Strigini [FHLS98] is a good starting point to the more recent
literature considering the relative merits of systematic and statistical approaches.
Category partition testing is described by Ostrand and Balcer [OB88]. The
combinatorial approach described in this chapter is due to Cohen, Dalal,
Fredman, and Patton [CDFP97]; the algorithm described by Cohen et al. is
patented by Bellcore. Myers classic text [Mye79] describes a number of techniques for testing decision structures. Richardson, OMalley, and Tittle [ROT89]
and Stocks and Carrington [SC96] are among more recent attempts to generate test cases based on the structure of (formal) specifications. Beizers Black
Box Testing [Bei95] is a popular presentation of techniques for testing based
on control and data flow structure of (informal) specifications.
Catalog-based testing of subsystems is described in depth by Maricks The
Craft of Software Testing [Mar97].
Test design based on finite state machines has been important in the domain of communication protocol development and conformance testing; Fujiwara, von Bochmann, Amalou, and Ghedamsi [FvBK 91] is a good introduction. Gargantini and Heitmeyer [GH99] describe a related approach applicable to software systems in which the finite-state machine is not explicit but
can be derived from a requirements specification.
Test generation from context-free grammars is described by Celentano et
al. [CCD 80] and apparently goes back at least to Hanfords test generator
for an IBM PL/I compiler [Han70]. The probabilistic approach to grammarbased testing is described by Sirer and Bershad [SB99], who use annotated
grammars to systematically generate tests for Java virtual machine implementations.

Related topics
Readers interested in the complementarites between functional and structural testing as well as readers interested in the testing decision structures and
control and data flow graphs may continue with the next Chapters that describe structural and data flow testing. Readers interested in finite state machine based testing may go to Chapters 17 and ?? that discuss testing of object
oriented and distributed system, respectively. Readers interested in the quality of specifications may goto Chapters 25 and ??, that describe inspection
techniques and methods for testing and analysis of specifications, respectively. Readers interested in other aspect of functional testing may move to
Chapters 16 and ??, that discuss technuqes for testing complex data structures and GUIs, respectively.

Draft version produced 20th March 2002

Functional Testing

112

Exercises
Ex13.1. In the Extreme Programming (XP) methodology [?], a written description of a desired feature may be a single sentence, and the first step to designing the implementation of that feature is designing and implementing a set
of test cases. Does this aspect of the XP methodology contradict our assertion
that test cases are a formalization of specifications?
Ex13.2. Compute the probability of selecting a test case that reveals the fault inserted in line 25 of program Root of Figure 13.1 by randomly sampling the
. Cominput domain, assuming that type double has range
pute the probability of selecting a test case that reveals a fault, asuming that
both lines 18 and 25 of program Root contains the same fault, i.e., missing
condition
. Compare the two probabilities.

Ex13.3. Identify independently testable units in the following specification.


Desk calculator Desk calculator performs the following algebraic operations: sum, subtraction, product, division, and percentage on integers and
real numbers. Operands must be of the same type, except for percentage,
which allows the first operator to be either integer or real, but requires the
second to be an integer that indicates the percentage to be computed. Operations on integers produce integer results. Program Calculator can be used
with a textual interface that provides the following commands:
Mx=N where Mx is a memory location,i.e., M0,.. M9 and N is a number. Integers are given as non-empty sequences of digits, with or without sign.
Real numbers are given as non-empty sequences of digits that include a
dot ., with or without sign. Real numbers can be terminated with an
optional exponent, i.e., character E followed by an integer. The command displays the stored number.
Mx=display , where Mx is a memory location and display indicates the value
shown on the last line.
operand1 operation operand2 , where operand1 and operand2 are numbers or memory locations or display and operation is one of the following symbols: +, -, *, /, %, where each symbol indicates a particular operation. Operands must follow the type conventions. The command displays the result or the string Error.
or with a graphical interface that provides a display with 12 characters and
the following keys:

, , , ,

, the 10 digits

, , , , , the operations
to display the result of a sequence of operations
Draft version produced 20th March 2002

Choosing a Suitable Approach

113

, to clear display

, , , ,

, where is pressed before a digit to indicate


the target memory, 0. . . 9, keys , , , pressed after
and a digit indicate the operation to be performed on the target memory: add display to memory, store display in memory, restore memory,
i.e., move the value in memory to the display and clear memory.


prints 65 (the
Example:
value 15 is stored in memory cell 3 and then retrieved to compute 80 15).

Ex13.4. Assume we have a set of parameter caracteristics (categories) and value


classes (choices) obtained by applying the category partition method to an
informal specification. Write an algorithm for computing the number of
combinations of value classes for each of the following restricted cases:

(Case 1) Parameter characteristics and value classes are given without


constraints
(Case 2) Only constraints error and single are used (without constraints
property and if-property)
(Case 3) Constraints are used, but constraints property and if-property
are not used for value classes of the same paramter characteristics, i.e.,
only one of these two types of contrain can be used for value classes of
the same parameter characteristic. Moreover, constraints are not nested,
i.e., if a value class of a given parameter characteristic is constrained
with if-property with repect to a set of different parameter characteristics , then cannot be further constrained with if-property.

Ex13.5. Given a set of parameter characteristics (categories) and value classes (choices)
obtained by applying the category partition method to an informal specification, explain either with a deduction or with examples why unrestricted
use of constraints property and if-property makes it difficult to compute the
number of derivable combinations of value classes.
Write heuristics to compute a reasonable upper bound for the number of
derivable combinations of value classes when constraints can be used without limits.
Ex13.6. Consider the following specification, which extends the specification of
the feature Check-configuration of the Chipmuk web site given in Figure
13.3. Derive a test case specification using the category partition method
and compare the test specification you obtain with the specification of Table 13.1. Try to identify a procedure for deriving the test specifications of the
new version of the functional specification from the former version. Discuss
the suitability of category-partition test design for incremental development
with evolving specifications.
Draft version produced 20th March 2002

114

Functional Testing
Check-Configuration: the Check-configuration function checks the validity of a computer configuration. The parameters of check-configuration
are:
Product line: A product line identifies a set of products sharing several
components and accessories. Different product lines have distinct
components and accessories.
Example: Product lines include desktops, servers, notebooks, digital cameras, printers.
Model: A model identifies a specific product and determines a set of
constraints on available components. Models are characterized by
logical slots for components, which may or may not be implemented
by physical slots on a bus. Slots may be required or optional. Required slots must be assigned a suitable component to obtain a legal configuration, while optional slots may be left empty or filled
depending on the customers needs.
Example: The required slots of the Chipmunk C20 laptop computer include a screen, a processor, a hard disk, memory, and an
operating system. (Of these, only the hard disk and memory are
implemented using actual hardware slots on a bus.) The optional
slots include external storage devices such as a CD/DVD writer.
Set of Components: A set of pairs, which must correspond to the required and optional slots associated with the model.
A component is a choice that can be varied within a model, and
which is not designed to be replaced by the end user. Available components and a default for each slot is determined by the model. The
special value empty is allowed (and may be the default selection)
for optional slots.
In addition to being compatible or incompatible with a particular
model and slot, individual components may be compatible or incompatible with each other.
Example: The default configuration of the Chipmunk C20 includes
20 gigabytes of hard disk; 30 and 40 gigabyte disks are also available. (Since the hard disk is a required slot, empty is not an allowed choice.) The default operating system is RodentOS 3.2, personal edition, but RodentOS 3.2 mobile server edition may also be
selected. The mobile server edition requires at least 30 gigabytes of
hard disk.
Set of Accessories: An accessory is a choice that can be varied within a
model, and which is designed to be replaced by the end user. Available choices are determined by a model and its line. Unlike components, an unlimited number of accessories may be ordered, and the
default value for accessories is always empty. The compatibility of
some accessories may be determined by the set of components, but
accessories are always considered compatible with each other.
Draft version produced 20th March 2002

Choosing a Suitable Approach

115

Example: Models of the notebook family may allow accessories including removable drives (zip, cd, etc.), PC card devices (modem,
lan, etc.), additional batteries, port replicators, carrying case, etc.
Ex13.7. Update the specification of feature Check-configuration of the Chipmuk
web site given in Figure 13.3 by using information from the test specification
provided in Table 13.1.
Ex13.8. Derive test specifications using the category partition method for the following Airport connection check function:
Airport connection check: The airport connection check is part of an
(imaginary) travel reservation system. It is intended to check the validity of a single connection between two flights in an itinerary. It is
described here at a fairly abstract level, as it might be described in a
preliminary design before concrete interfaces have been worked out.
Specification Signature: Valid Connection (Arriving Flight: flight, Departing Flight: flight) returns Validity Code
Validity Code 0 (OK) is returned if Arriving Flight and Departing Flight
make a valid connection (the arriving airport of the first is the departing airport of the second) and there is sufficient time between
arrival and departure according to the information in the airport
database described below.
Otherwise, a validity code other than 0 is returned, indicating why
the connection is not valid.
Data types
Flight: A flight is a structure consisting of
A unique identifying flight code, three alphabetic characters followed by up to four digits. (The flight code is not used by the
valid connection function.)
The originating airport code (3 characters, alphabetic)
The scheduled departure time of the flight (in universal time)
The destination airport code (3 characters, alphabetic)
The scheduled arrival time at the destination airport.
Validity Code: The validity code is one of a set of integer values with
the following interpretations
0: The connection is valid.
10: Invalid airport code (airport code not found in database)
15: Invalid connection, too short: There is insufficient time between
arrival of first flight and departure of second flight.
16: Invalid connection, flights do not connect. The destination airport of Arriving Flight is not the same as the originating airport
of Departing Flight.
20: Another error has been recognized (e.g., the input arguments
may be invalid, or an unanticipated error was encountered).
Draft version produced 20th March 2002

Functional Testing

116

Airport Database
The Valid Connection function uses an internal, in-memory table
of airports which is read from a configuration file at system initialization. Each record in the table contains the following information:
Three-letter airport code. This is the key of the table and can be
used for lookups.
Airport zone. In most cases the airport zone is a two-letter country code, e.g., us for the United States. However, where passage
from one country to another is possible without a passport, the
airport zone represents the complete zone in which passportfree travel is allowed. For example, the code eu represents the
European countries which are treated as if they were a single
country for purposes of travel.
Domestic connect time. This is an integer representing the minimum number of minutes that must be allowed for a domestic
connection at the airport. A connection is domestic if the originating and destination airports of both flights are in the same
airport zone.
International connect time. This is an integer representing the
minimum number of minutes that must be allowed for an international connection at the airport. The number -1 indicates
that international connections are not permitted at the airport.
A connection is international if any of the originating or destination airports are in different zones.
Ex13.9. Derive test specifications using the category partition method for the function SUM of Excel from the following description taken from the Excel
manual:
SUM: Adds all the numbers in a range of cells.
Syntax
SUM(number1,number2, ...)
Number1, number2, ...are 1 to 30 arguments for which you want
the total value or sum.
Numbers, logical values, and text representations of numbers
that you type directly into the list of arguments are counted. See
the first and second examples following.
If an argument is an array or reference, only numbers in that array or reference are counted. Empty cells, logical values, text, or
error values in the array or reference are ignored. See the third
example following.
Arguments that are error values or text that cannot be translated into numbers cause errors.
Draft version produced 20th March 2002

Choosing a Suitable Approach

117

Examples
SUM(3, 2) equals 5
SUM(3, 2, TRUE) equals 6 because the text values are translated
into numbers, and the logical value TRUE is translated into the
number 1.
Unlike the previous example, if A1 contains 3 and B1 contains
TRUE, then:
SUM(A1, B1, 2) equals 2 because references to nonnumeric values
in references are not translated.
If cells A2:E2 contain 5, 15, 30, 40, and 50:
SUM(A2:C2) equals 50
SUM(B2:E2, 15) equals 150
Ex13.10. Eliminate from the test specifications of the feature check configuration
given in Table 13.1 all constraints that do not correspond to infeasible tuples,
but have been added for the sake of reducing the number of test cases.
Compute the number of test cases corresponding to the new specifications.
Apply the combinatorial approach to derive test cases covering all pairwise
combinations.
Compute the number of derived test cases.
Ex13.11. Consider the value classes obtained by applying the category partition
approach to the Airport Connection Check example of Exercise Ex13.8. Eliminate from the test specifications all constraints that do not correspond to
infeasible tuples and compute the number of derivable test cases. Apply the
combinatorial approach to derive test cases covering all pairwise combinations, and compare the number of derived test cases.
Ex13.12. Given a set of parameter characteristics and value classes, write a heuristic algorithm that selects a small set of tuples that cover all possible pairs of
the value classes using the combinatorial approach. Assume that parameter
characteristics and value classes are given without constraints.
Ex13.13. Given a set of parameter characteristics and value classes, compute a
lower bound on the number of tuples required for covering all pairs of values
according to the combinatorial approach.
Ex13.14. Generate a set of tuples that cover all triples of language, screen-size, and
font and all pairs of other parameters for the specification given in Table
13.3.
Ex13.15. Consider the following columns that correspond to educational and individual accounts of feature pricing of Figure 13.4:
Draft version produced 20th March 2002

Functional Testing

118

Edu.
CP
CP
SP
SP
SP
Out

CT1
CT2
Sc
T1
T2

Education
T
T
F
T
Edu SP

F
F
F
ND

Individual
F
F
F
F
T
T
F
F
T
F
T
SP T1 SP

F
T
F
T2

F
T
T
SP

write a set of boolean expressions for the outputs and apply the modified
condition/decision adequacy criterion (MC/DC) presented in Chapter 14
to derive a set of test cases for the derived boolean expressions. Compare the
result with the test case specifications given in Figure 13.6.
Ex13.16. Derive a set of test cases for the Airport Connection Check example of
Exercise Ex13.8 using the catalog based approach.
Extend the catalog of Table 13.10 as needed to deal with specification constructs.
Ex13.17. Derive sets of test cases for functionality Maintenance applying Transition Coverage, Single State Path Coverage, Single Tranistion Path Coverage,
and Boundary Interior Loop Coverage to the FSM specification of Figure
13.9
Ex13.18. Derive test cases for functionality Maintenance applying Transition Coverage to the FSM specification of Figure 13.9, assuming that implicit transitions are (1) error conditions or (2) self transitions.
Ex13.19. We have stated that the transitions in a state-machine specification can
be considered as precondition, postcondition pairs. Often the finite-state
machine is an abstraction of a more complex system which is not truly finitestate. Additional state information is associated with each of the states, including fields and variables that may be changed by an action attached to a
state transition, and a predicate that should always be true in that state. The
same system can often be described by a machine with a few states and complicated predicates, or a machine with more states and simpler predicates.
Given this observation, how would you combine test selection methods for
finite-state machine specifications with decision structure testing methods?
Can you devise a method that selects the same test cases regardless of the
specification style (more or fewer states)? Is it wise to do so?

Draft version produced 20th March 2002

S-ar putea să vă placă și