
Reviews of the Third Edition

“The book is thorough and comprehensive in its coverage of principles and practices of program evaluation and
performance measurement. The authors are striving to bridge two worlds: contemporary public governance
contexts and an emerging professional role for evaluators, one that is shaped by professional judgement informed
by ethical/moral principles, cultural understandings, and reflection. With this edition the authors successfully
open up the conversation about possible interconnections between conventional evaluation in new public
management governance contexts and evaluation grounded in the discourse of moral-political purpose.”

—J. Bradley Cousins

University of Ottawa

“The multiple references to body-worn-camera evaluation research in this textbook are balanced and interesting,
and a fine addition to the Third Edition of this book. This careful application of internal and external validity for
body-worn cameras will be illustrative for students and researchers alike. The review of research methods is specific
yet broad enough to appeal to the audience of this book, and the various examples are contemporary and topical
to evaluation research.”

—Barak Ariel and Alex Sutherland

University of Cambridge, UK, and RAND Europe, Cambridge, UK

“This book provides a good balance between the topics of measurement and program evaluation, coupled with
ample real-world application examples. The discussion questions and cases are useful in class and for homework
assignments.”

—Mariya Yukhymenko

California State University, Fresno

“Finally, a text that successfully brings together quantitative and qualitative methods for program evaluation.”

—Kerry Freedman

Northern Illinois University

“The Third Edition of Program Evaluation and Performance Measurement: An Introduction to Practice remains an
excellent source book for introductory courses to program evaluation, and a very useful reference guide for
seasoned evaluators. In addition to covering in an in-depth and interesting manner the core areas of program
evaluation, it clearly presents the increasingly complementary relationship between program evaluation and
performance measurement. Moreover, the three chapters devoted to performance measurement are the most
detailed and knowledgeable treatment of the area that I have come across in a textbook. I expect that the updated
book will prove to be a popular choice for instructors training program evaluators to work in the public and not-
for-profit sectors.”

—Tim Aubry

University of Ottawa

“This text guides students through both the philosophical and practical origins of performance measurement and
program evaluation, equipping them with a profound understanding of the abuses, nuances, mysteries, and
successes [of those topics]. Ultimately, the book helps students become the professionals needed to advance not
just the discipline but also the practice of government.”

—Erik DeVries

Treasury Board of Canada Secretariat

Program Evaluation and Performance Measurement

Third Edition

This book is dedicated to our teachers, people who have made our love of learning a life’s work. From Jim McDavid:
Elinor Ostrom, Tom Pocklington, Jim Reynolds, and Bruce Wilkinson. From Irene Huse: David Good, Cosmo Howard,
Evert Lindquist, Thea Vakil. From Laura Hawthorn: Karen Dubinsky, John Langford, Linda Matthews.

Sara Miller McCune founded SAGE Publishing in 1965 to support the dissemination of usable
knowledge and educate a global community. SAGE publishes more than 1000 journals and over 800
new books each year, spanning a wide range of subject areas. Our growing selection of library products
includes archives, data, case studies and video. SAGE remains majority owned by our founder and after
her lifetime will become owned by a charitable trust that secures the company’s continued
independence.

Los Angeles | London | New Delhi | Singapore | Washington DC | Melbourne

Program Evaluation and Performance Measurement

An Introduction to Practice

Third Edition

James C. McDavid
University of Victoria, Canada

Irene Huse
University of Victoria, Canada

Laura R. L. Hawthorn

Copyright © 2019 by SAGE Publications, Inc.

All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or
mechanical, including photocopying, recording, or by any information storage and retrieval system, without
permission in writing from the publisher.

For Information:

SAGE Publications, Inc.

2455 Teller Road

Thousand Oaks, California 91320

E-mail: order@sagepub.com

SAGE Publications Ltd.

1 Oliver’s Yard

55 City Road

London, EC1Y 1SP

United Kingdom

SAGE Publications India Pvt. Ltd.

B 1/I 1 Mohan Cooperative Industrial Area

Mathura Road, New Delhi 110 044

India

SAGE Publications Asia-Pacific Pte. Ltd.

3 Church Street

#10–04 Samsung Hub

Singapore 049483

Printed in the United States of America.

This book is printed on acid-free paper.

18 19 20 21 22 10 9 8 7 6 5 4 3 2 1
Names: McDavid, James C., author. | Huse, Irene, author. | Hawthorn, Laura R. L.

Title: Program evaluation and performance measurement : an introduction to practice / James C. McDavid, University of Victoria, Canada, Irene
Huse, University of Victoria, Canada, Laura R. L. Hawthorn.

Description: Third Edition. | Thousand Oaks : SAGE Publications, Inc., Corwin, CQ Press, [2019] | Revised edition of the authors' Program
evaluation and performance measurement, c2013. | Includes bibliographical references and index.

Identifiers: LCCN 2018032246 | ISBN 9781506337067 (pbk.)

Subjects: LCSH: Organizational effectiveness–Measurement. | Performance–Measurement. | Project management–Evaluation.

Classification: LCC HD58.9 .M42 2019 | DDC 658.4/013–dc23 LC record available at https://lccn.loc.gov/2018032246

Acquisitions Editor: Helen Salmon

Editorial Assistant: Megan O’Heffernan

Content Development Editor: Chelsea Neve

Production Editor: Andrew Olson

Copy Editors: Jared Leighton and Kimberly Cody

Typesetter: Integra

Proofreader: Laura Webb

Indexer: Sheila Bodell

Cover Designer: Ginkhan Siam

Marketing Manager: Susannah Goldes

Contents
Preface
Acknowledgments
About the Authors
Chapter 1 • Key Concepts and Issues in Program Evaluation and Performance Measurement
Chapter 2 • Understanding and Applying Program Logic Models
Chapter 3 • Research Designs for Program Evaluations
Chapter 4 • Measurement for Program Evaluation and Performance Monitoring
Chapter 5 • Applying Qualitative Evaluation Methods
Chapter 6 • Needs Assessments for Program Development and Adjustment
Chapter 7 • Concepts and Issues in Economic Evaluation
Chapter 8 • Performance Measurement as an Approach to Evaluation
Chapter 9 • Design and Implementation of Performance Measurement Systems
Chapter 10 • Using Performance Measurement for Accountability and Performance Improvement
Chapter 11 • Program Evaluation and Program Management
Chapter 12 • The Nature and Practice of Professional Judgment in Evaluation
Glossary
Index

Preface

The third edition of Program Evaluation and Performance Measurement offers practitioners, students, and other
users of this textbook a contemporary introduction to the theory and practice of program evaluation and
performance measurement for public and nonprofit organizations. Woven into the chapters is the performance
management cycle in organizations, which includes: strategic planning and resource allocation; program and
policy design; implementation and management; and the assessment and reporting of results.

The third edition has been revised to highlight and integrate the current economic, political, and socio-
demographic context within which evaluators are expected to work. We feature more evaluation exemplars,
making it possible to fully explore the implications of the evaluations that have been done. Our main exemplar,
chosen in part because it is an active and dynamic public policy issue, is the evaluation of body-worn cameras
(BWCs), which have been widely deployed in police departments in the United States and internationally. Since
2014, as police departments have deployed BWCs, a growing number of evaluations, some experimental, some
quasi-experimental, and some non-experimental, have addressed questions about the effectiveness of BWCs in
reducing police use of force and citizen complaints and, more broadly, in improving the perceived fairness of the
criminal justice system.

We introduce BWC evaluations in Chapter 1 and follow those studies through Chapter 2 (program logics),
Chapter 3 (research designs), and Chapter 4 (measurement) as well as including examples in other chapters.

We have revised and integrated the chapters that focus on performance measurement (Chapters 8, 9 and 10) to
feature research and practice that addresses the apparent paradox in performance measurement systems: if they are
designed first and foremost to improve accountability, then over the longer term they often do little to further
improve program or organizational performance. Based on a growing body of evidence and scholarship, we argue
for a nuanced approach to performance measurement where managers have incentives to use performance results
to improve their programs, while operating within the enduring requirements to demonstrate accountability
through external performance reporting.

In most chapters, we have featured textboxes that introduce topics or themes in a short, focused way. For example,
we have included a textbox in Chapter 3 that introduces behavioral economics and nudging as approaches to
designing, implementing, and evaluating program and policy changes. As a second example, in Chapter 4, data
analytics is introduced as an emerging field that will affect program evaluation and performance measurement in
the future.

We have updated discussions of important evaluation theory-related issues but in doing so have introduced those
topics with an eye on what is practical and accessible for practitioners. For example, we discuss realist evaluation in
Chapter 2 and connect it to the BWC studies that have been done, to make the point that although realist
evaluation offers us something unique, it is a demanding and resource-intensive approach, if it is to be done well.

Since the second edition was completed in 2012, we have seen more governments and non-profit organizations
face chronic fiscal shortages. One result of the 2008–2009 Great Recession is a shift in the expectations for
governments – doing more with less, or even less with less, now seems to be more the norm. In this third edition,
where appropriate, we have mentioned how this fiscal environment affects the roles and relationships among
evaluators, managers, and other stakeholders. For example, in Chapter 6 (needs assessments), we have included
discussion and examples that describe needs assessment settings where an important question is how to ration
existing funding among competing needs, including cutting lower priority programs. This contrasts with the more
usual focus on the need for new programs (with new funding).

In Chapter 1, we introduce professional judgment as a key feature of the work that evaluators do and come back
to this theme at different points in the textbook. Chapter 12, where we discuss professional judgment in some
depth, has been revised to reflect trends in the field, including evaluation ethics and the growing importance of
professionalization of evaluation as a discipline. Our stance in this textbook is that an understanding of
methodology, including how evaluators approach cause-and-effect relationships in their work, is central to being
competent to evaluate the effectiveness of programs and policies. But being a competent methodologist is not
enough to be a competent evaluator. In Chapter 12 we expand upon practical wisdom as an ethical foundation for
evaluation practice. In our view, evaluation practice has both methodological and moral dimensions to it. We have
updated the summaries and the discussion questions at the end of the chapters.

The third edition of Program Evaluation and Performance Measurement will be useful for senior undergraduate or
introductory graduate courses in program evaluation, performance measurement, and performance management.
The book does not assume a thorough understanding of research methods and design, instead guiding the reader
through a systematic introduction to these topics. Nor does the book assume a working knowledge of statistics,
although there are some sections that do outline the roles that statistics play in evaluations. These features make
the book well suited for students and practitioners in fields such as public administration and management,
sociology, criminology, or social work where research methods may not be a central focus.

A password-protected instructor teaching site, available at www.sagepub.com/mcdavid, features author-provided resources that have been designed to help instructors plan and teach their courses. These resources include a test bank, PowerPoint slides, SAGE journal articles, case studies, and all tables and figures from the book. An open-access student study site is also available at www.sagepub.com/mcdavid. This site features access to recent, relevant full-text SAGE journal articles.

Acknowledgments

The third edition of Program Evaluation and Performance Measurement was completed substantially because of the
encouragement and patience of Helen Salmon, our main contact at SAGE Publications. As a Senior Acquisitions
Editor, Helen has been able to suggest ways of updating our textbook that have sharpened its focus and improved
its contents. We are grateful for her support and her willingness to countenance a year’s delay in completing the
revisions of our book.

Once we started working on the revisions, we realized how much the evaluation field had changed since 2012, when we completed the second edition. Completing the third edition a year later than planned is substantially due to our wanting to include new ideas, approaches, and exemplars, where appropriate.

We are grateful for the comments and informal suggestions made by colleagues, instructors, students, and
consultants who have used our textbook in different ways in the past six years. Their suggestions to simplify and in
some cases reorganize the structure of chapters, include more examples, and restate some of the conceptual and
technical parts of the book have improved it in ways that we hope will appeal to users of the third edition.

The School of Public Administration at the University of Victoria provided us with unstinting support as we
completed the third edition of our textbook. For Jim McDavid, being able to arrange several consecutive semesters with no teaching obligations made it possible to devote all of his time to this project. For Irene Huse, timely technical support for various computer-related needs and an office for textbook-related activities were critical to completing our revisions.

Research results from grant support provided by the Social Sciences and Humanities Research Council in Canada
continue to be featured in Chapter 10 of our book. What is particularly encouraging is how that research on
legislator uses of public performance reports has been extended and broadened by colleagues in Canada, the
United States, and Europe. In Chapter 10, we have connected our work to this emerging performance
measurement and performance management movement.

The authors and SAGE would like to thank the following reviewers for their feedback:

James Caillier, University of Alabama

Kerry Freedman, Northern Illinois University

Gloria Langat, University of Southampton

Mariya Yukhymenko, California State University Fresno

About the Authors

James C. McDavid

(PhD, Indiana, 1975) is a professor of Public Administration at the University of Victoria in British
Columbia, Canada. He is a specialist in program evaluation, performance measurement, and organizational
performance management. He has conducted extensive research and evaluations focusing on federal, state,
provincial, and local governments in the United States and Canada. His published research has appeared in
the American Journal of Evaluation, the Canadian Journal of Program Evaluation and New Directions for
Evaluation. He is currently a member of the editorial board of the Canadian Journal of Program Evaluation
and New Directions for Evaluation.

In 1993, Dr. McDavid won the prestigious University of Victoria Alumni Association Teaching Award. In
1996, he won the J. E. Hodgetts Award for the best English-language article published in Canadian Public
Administration. From 1990 to 1996, he was Dean of the Faculty of Human and Social Development at the
University of Victoria. In 2004, he was named a Distinguished University Professor at the University of
Victoria and was also Acting Director of the School of Public Administration during that year. He teaches
online courses in the School of Public Administration Graduate Certificate and Diploma in Evaluation
Program.
Irene Huse

holds a Master of Public Administration and is a PhD candidate in the School of Public Administration at
the University of Victoria. She was a recipient of a three-year Joseph-Armand Bombardier Canada Graduate
Scholarship from the Social Sciences and Humanities Research Council. She has worked as an evaluator and
researcher at the University of Northern British Columbia, the University of Victoria, and in the private
sector. She has also worked as a senior policy analyst in several government ministries in British Columbia.
Her published research has appeared in the American Journal of Evaluation, the Canadian Journal of
Program Evaluation, and Canadian Public Administration.
Laura R. L. Hawthorn

holds a Master of Arts degree in Canadian history from Queen’s University in Ontario, Canada and a
Master of Public Administration degree from the University of Victoria. After completing her MPA, she
worked as a manager for several years in the British Columbia public service and in the nonprofit sector
before leaving to raise a family. She is currently living in Vancouver, running a nonprofit organization and
being mom to her two small boys.

1 Key Concepts and Issues in Program Evaluation and
Performance Measurement

Introduction
Integrating Program Evaluation and Performance Measurement
Connecting Evaluation to the Performance Management System
The Performance Management Cycle
Policies and Programs
Key Concepts in Program Evaluation
Causality in Program Evaluations
Formative and Summative Evaluations
Ex Ante and Ex Post Evaluations
The Importance of Professional Judgment in Evaluations
Example: Evaluating a Police Body-Worn Camera Program in Rialto, California
The Context: Growing Concerns With Police Use of Force and Community Relationship
Implementing and Evaluating the Effects of Body-Worn Cameras in the Rialto Police Department
Program Success Versus Understanding the Cause-and-Effect Linkages: The Challenge of Unpacking the Body-Worn Police Cameras "Black Box"
Connecting Body-Worn Camera Evaluations to This Book
Ten Key Evaluation Questions
The Steps in Conducting a Program Evaluation
General Steps in Conducting a Program Evaluation
Assessing the Feasibility of the Evaluation
Doing the Evaluation
Making Changes Based on the Evaluation
Summary
Discussion Questions
References

Introduction
Our main focus in this textbook is on understanding how to evaluate the effectiveness of public-sector policies
and programs. Evaluation is widely used in public, nonprofit, and private-sector organizations to generate
information for policy and program planning, design, implementation, assessment of results,
improvement/learning, accountability, and public communications. It can be viewed as a structured process that
creates and synthesizes information intended to reduce the level of uncertainty for decision makers and
stakeholders about a given program or policy. It is usually intended to answer questions or test hypotheses, the
results of which are then incorporated into the information bases used by those who have a stake in the program
or policy. Evaluations can also uncover unintended effects of programs and policies, which can affect overall
assessments of programs or policies. On a perhaps more subtle level, the process of measuring performance or
conducting program evaluations—that is, aside from the reports and other evaluation products—can also have
impacts on the individuals and organizations involved, including attentive stakeholders and citizens.

The primary goal of this textbook is to provide a solid methodological foundation to evaluative efforts, so that
both the process and the information created offer defensible contributions to political and managerial decision-
making. Program evaluation is a rich and varied combination of theory and practice. This book will introduce a
broad range of evaluation approaches and practices, reflecting the richness of the field. As you read this textbook,
you will notice words and phrases in bold. These bolded terms are defined in a glossary at the end of the book.
These terms are intended to be your reference guide as you learn or review the language of evaluation. Because this
chapter is introductory, it is also appropriate to define a number of terms in the text that will help you get some
sense of the “lay of the land” in the field of evaluation.

In the rest of this chapter, we do the following:

Describe how program evaluation and performance measurement are complementary approaches to creating
information for decision makers and stakeholders in public and nonprofit organizations.
Introduce the concept of the performance management cycle, and show how program evaluation and
performance measurement conceptually fit the performance management cycle.
Introduce key concepts and principles for program evaluations.
Illustrate a program evaluation with a case study.
Introduce 10 general questions that can underpin evaluation projects.
Summarize 10 key steps in assessing the feasibility of conducting a program evaluation.
Finally, present an overview of five key steps in doing and reporting an evaluation.

Integrating Program Evaluation and Performance Measurement
The richness of the evaluation field is reflected in the diversity of its methods. At one end of the spectrum,
students and practitioners of evaluation will encounter randomized experiments (randomized controlled trials,
or RCTs) in which people (or other units of analysis) have been randomly assigned to a group that receives a
program that is being evaluated, and others have been randomly assigned to a control group that does not get the
program. Comparisons of the two groups are usually intended to estimate the incremental effects of programs.
Essentially, that means determining the difference between what occurred as a result of a program and what would
have occurred if the program had not been implemented. Although RCTs are not the most common method used
in the practice of program evaluation, and there is controversy around making them the benchmark or gold
standard for sound evaluations, they are still often considered exemplars of “good” evaluations (Cook, Scriven,
Coryn, & Evergreen, 2010; Donaldson, Christie, & Melvin, 2014).
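To make the idea of an incremental effect concrete, here is a minimal sketch of the logic of a randomized experiment. It is our own illustration, not drawn from any of the evaluations discussed in this book: the sample size, the baseline level, and the program effect of 5 units are all invented.

```python
# Illustrative sketch only: simulating an RCT to estimate the incremental effect
# of a program. All numbers below are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
n = 1000  # hypothetical participants

# Random assignment: roughly half receive the program, half form the control group.
treated = rng.random(n) < 0.5

# Hypothetical outcome: a baseline level plus a true program effect of +5 units,
# plus random noise representing everything else that influences the outcome.
true_effect = 5.0
outcome = 50.0 + true_effect * treated + rng.normal(0, 10, n)

# Because assignment was random, the difference in group means estimates the
# incremental effect: what occurred with the program minus what would have
# occurred without it.
estimate = outcome[treated].mean() - outcome[~treated].mean()
print(f"Estimated incremental effect: {estimate:.2f} (true effect = {true_effect})")
```

Random assignment makes the two groups comparable on average, so the difference between them, beyond chance variation, is the counterfactual comparison described above.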

Frequently, program evaluators do not have the resources, time, or control over program design or
implementation situations to conduct experiments. In many cases, an experimental design may not be the most
appropriate for the evaluation at hand. A typical scenario is to be asked to evaluate a policy or program that has
already been implemented, with no real ways to create control groups and usually no baseline (pre-program) data
to construct before–after comparisons. Often, measurement of program outcomes is challenging—there may be
no data readily available, a short timeframe for the need for the information, and/or scarce resources available to
collect information.

Alternatively, data may exist (program records would be a typical situation), but closer scrutiny of these data
indicates that they measure program or client characteristics that only partly overlap with the key questions that
need to be addressed in the evaluation. We will learn about quasi-experimental designs and other quantitative and
qualitative evaluation methods throughout the book.

So how does performance measurement fit into the picture? Evaluation as a field has been transformed in the past
40 years by the broad-based movement in public and nonprofit organizations to construct and implement systems
that measure program and organizational performance. Advances in technology have made it easier and less
expensive to create, track, and share performance measurement data. Performance measures can, in some cases,
productively be incorporated into evaluations. Often, governments or boards of directors have embraced the idea
that increased accountability is a good thing and have mandated performance measurement to that end.
Measuring performance is often accompanied by requirements to publicly report performance results for
programs.

The use of performance measures in evaluative work is, however, seldom straightforward. For example, recent
analysis has shown that in the search for government efficiencies, particularly in times of fiscal restraint,
governments may cut back on evaluation capacity, with expectations that performance measurement systems can
substantially cover the performance management information needs (de Lancer Julnes & Steccolini, 2015). This
trend to lean on performance measurement, particularly in high-stakes accountability situations, is increasingly
seen as being detrimental to learning, policy and program effectiveness, and staff morale (see, for example,
Arnaboldi et al., 2015; Coen & Roberts, 2012; Greiling & Halachmi, 2013; Mahler & Posner, 2014). We will
explore this conundrum in more depth later in the textbook.

This textbook will show how sound performance measurement, regardless of who does it, depends on an
understanding of program evaluation principles and practices. Core skills that evaluators learn can be applied to
performance measurement. Managers and others who are involved in developing and implementing performance
measurement systems for programs or organizations typically encounter problems similar to those encountered by
program evaluators. A scarcity of resources often means that key program outcomes that require specific data
collection efforts are either not measured or are measured with data that may or may not be intended for that
purpose. Questions of the validity of performance measures are important, as are the limitations to the uses of
performance data.

We see performance measurement approaches as complementary to program evaluation, and not as a replacement
for evaluations. The approach of this textbook is that evaluation includes both program evaluation and
performance measurement, and we build a foundation in the early chapters of the textbook that shows how
program evaluation can inform measuring the performance of programs and policies. Consequently, in this
textbook, we integrate performance measurement into evaluation by grounding it in the same core tools and
methods that are essential to assess program processes and effectiveness. We see an important need to balance these
two approaches, and our approach in this textbook is to show how they can be combined in ways that make them
complementary, but without overstretching their real capabilities. Thus, program logic models (Chapter 2),
research designs (Chapter 3), and measurement (Chapter 4) are important for both program evaluation and
performance measurement. After laying the foundations for program evaluation, we turn to performance
measurement as an outgrowth of our understanding of program evaluation (Chapters 8, 9, and 10). Chapter 6 on
needs assessments builds on topics covered in the earlier chapters, including Chapter 1. Needs assessments can
occur in several phases of the performance management cycle: strategic planning, designing effective programs,
implementation, and measuring and reporting performance. As well, cost–benefit analysis and cost–effectiveness
analysis (Chapter 7) build on topics in Chapter 3 (research designs) and can be conducted as part of strategic
planning, or as we design policies or programs, or as we evaluate their outcomes (the assessment and reporting
phase).

Below, we introduce the relationship between organizational management and evaluation activities. We expand on
this issue in Chapter 11, where we examine how evaluation theory and practice are joined with management in
public and nonprofit organizations. Chapter 12 (the nature and practice of professional judgment) emphasizes
that the roles of managers and evaluators depend on developing and exercising sound professional judgment.

Connecting Evaluation to the Performance Management System
Information from program evaluations and performance measurement systems is expected to play a role in the way
managers operate their programs (Hunter & Nielsen, 2013; Newcomer & Brass, 2016). Performance
management, which is sometimes called results-based management, emerged as an organizational management
approach that has been part of a broad movement of new public management (NPM) in public administration.
NPM has had significant impacts on governments worldwide since it came onto the scene in the early 1990s. It is
premised on principles that emphasize the importance of stating clear program and policy objectives, measuring
and reporting program and policy outcomes, and holding managers, executives, and politicians accountable for
achieving expected results (Hood, 1991; Osborne & Gaebler, 1992).

While the drive for NPM—particularly the emphasis on explicitly linking funding to targeted outcomes—has
abated somewhat as paradoxes of the approach have come to light (Pollitt & Bouckaert, 2011), particularly in
light of the global financial crisis (Coen & Roberts, 2012; OECD, 2015), the importance of evidence of actual
accomplishments is still considered central to performance management. Performance management systems will
continue to evolve; evidence-based and evidence-informed decision making depend heavily on both evaluation
and performance measurement, and will respond as the political and fiscal structure and the context of public
administration evolve. There is discussion recently of a transition from NPM to a more centralized but networked
New Public Governance (Arnaboldi et al., 2015; Osborne, 2010; Pollitt & Bouckaert, 2011), Digital-Era
Governance (Dunleavy, Margetts, Bastow, & Tinkler, 2006; Lindquist & Huse, 2017), Public Value Governance
(Bryson, Crosby, & Bloomberg, 2014), and potentially a more agile governance (OECD, 2015; Room, 2011). In
any case, evidence-based or evidence-informed policy making will remain an important feature of public
administration and public policy.

Increasingly, there is an expectation that managers will be able to participate in evaluating their own programs and
also be involved in developing, implementing, and publicly reporting the results of performance measurement.
These efforts are part of an organizational architecture designed to pull together the components to achieve
organizational goals. Changes to improve program operations and efficiency and effectiveness are expected to be
driven by evidence of how well programs are doing in relation to stated objectives.

American Government Focus on Program Performance Results

In the United States, successive federal administrations beginning with the Clinton administration in 1992 embraced program goal
setting, performance measurement, and reporting as a regular feature of program accountability (Joyce, 2011; Mahler & Posner, 2014).
The Bush administration, between 2002 and 2009, emphasized the importance of program performance in the budgeting process. The
Office of Management and Budget (OMB) introduced assessments of programs using a methodology called PART (Performance
Assessment Rating Tool) (Gilmour, 2007). Essentially, OMB analysts reviewed existing evaluations conducted by departments and
agencies as well as performance measurement results and offered their own overall rating of program performance. Each year, one fifth of
all federal programs were “PARTed,” and the review results were included with the executive branch (presidential) budget requests to
Congress.

The Obama administration, while instituting the 2010 GPRA Modernization Act (see Moynihan, 2013) and departing from top-down
PART assessments of program performance (Joyce, 2011), continued this emphasis on performance by appointing the first federal chief
performance officer, leading the “management side of OMB,” which was expected to work with agencies to “encourage use and
communication of performance information and to improve results and transparency” (OMB archives, 2012). The GPRA Modernization
Act is intended to create a more organized and publicly accessible system for posting performance information on the
www.Performance.gov website, in a common format. There is also currently a clear theme of improving the efficiencies and integration of
evaluative evidence, including making better use of existing data.

At the time of writing this book, it is too early to tell what changes the Trump administration will initiate or will keep from previous
administrations, although there is intent to post performance information on the Performance.gov website, reflecting updated goals and
alignment. Its current mission is “to assist the President in meeting his policy, budget, management and regulatory objectives and to fulfill
the agency’s statutory responsibilities” (OMB, 2018, p. 1).

Canadian Government Evaluation Policy

In Canada, there is a long history of requiring program evaluation of federal government programs, dating back to the late 1970s. More
recently, a major update of the federal government’s evaluation policy occurred in 2009, and again in 2016 (TBS, 2016a). The main
plank in that policy is a requirement that federal departments and agencies evaluate the relevance and performance of their programs on a
5-year cycle, with some exemptions for smaller programs and contributions to international organizations (TBS, 2016a, sections 2.5 and
2.6). Performance measurement and program evaluation are explicitly linked to accountability (resource allocation [s. 3.2.3] and reporting
to parliamentarians [s. 3.2.4]) as well as managing and improving departmental programs, policies, and services (s. 3.2.2). There have
been reviews of Canadian provinces (e.g., Gauthier et al., 2009), American states (Melkers & Willoughby, 2004; Moynihan, 2006), and
local governments (Melkers & Willoughby, 2005) on their approaches to evaluation and performance measurement. In later chapters, we
will return to this issue of the challenges of using the same evaluative information for different purposes (see Kroll, 2015; Majone, 1989;
Radin, 2006).

In summary, performance management is now central to public and nonprofit management. What was once an
innovation in the public and nonprofit sectors in the early 1990s has since become an expectation. Central
agencies (including the U.S. Federal Office of Management and Budget [OMB], the Government Accountability
Office [GAO], and the Treasury Board of Canada Secretariat [TBS]), as well as state and provincial finance
departments and auditors, develop policies and articulate expectations that shape the ways program managers are
expected to create and use performance information to inform their administrative superiors and other
stakeholders outside the organization about what they are doing and how well they are doing it. It is worthwhile
following the websites of these organizations to understand the subtle and not-so-subtle shifts in expectations and
performance frameworks for the design, conduct, and uses of performance measurement systems and evaluations
over time, especially when there is a change in government.

Fundamental to performance management is the importance of program and policy performance results being
collected, analyzed, compared (sometimes to performance targets), and then used to monitor, learn, and make
decisions. Performance results are also expected to be used to increase the transparency and accountability of
public and nonprofit organizations and even governments, principally through periodic public performance
reporting. Many jurisdictions have embraced mandatory public performance reporting as a visible sign of their
commitment to improved accountability (Van de Walle & Cornelissen, 2014).

The Performance Management Cycle
Organizations typically run through an annual performance management cycle that includes budget
negotiations, announcing budget plans, designing or modifying programs, managing programs, reporting their
financial and nonfinancial results, and making informed adjustments. The performance management cycle is a
useful normative model that includes an iterative planning–implementation–assessment–program adjustments
sequence. The model can help us understand the various points at which program evaluation and performance
measurement can play important roles as ways of providing information to decision makers who are engaged in
leading and managing organizations and programs to achieve results, and reporting the results to legislators and
the public.

In this book, the performance management cycle illustrated in Figure 1.1 is used as a framework for organizing
different evaluation topics and showing how the analytical approaches covered in key chapters map onto the
performance management cycle. Figure 1.1 shows a model of how organizations can integrate strategic planning,
program and policy design, implementation, and assessment of results into a cycle where evaluation and
performance measures can inform all phases of the cycle. The assessment and reporting part of the cycle is central to
this textbook, but we take the view that all phases of the performance management cycle can be informed by
evaluation and performance measurement.

We will use the performance management cycle as a framework within which evaluation and performance
measurement activities can be situated for managers and other stakeholders in public sector and nonprofit
organizations. It is important to reiterate, however, that specific evaluations and performance measures are often
designed to serve a particular informational purpose—that is, a certain phase of the cycle—and may not be
appropriate for other uses.

The four-part performance management cycle begins with formulating and budgeting for clear (strategic)
objectives for organizations and, hence, for programs and policies. Strategic objectives are then translated into
program and policy designs intended to achieve those objectives. This phase involves building or adapting
organizational structures and processes to facilitate implementing and managing policies or programs. Ex ante
evaluations can occur at the stage when options are being considered and compared as candidates for design and
implementation. We will look a bit more closely at ex ante evaluations later in the textbook. For now, think of
them as evaluations that assess program or policy options before any are selected for implementation.

Figure 1.1 The Performance Management Cycle

The third phase in the cycle is about policy and program implementation and management. In this textbook, we
will look at formative evaluations as a type of implementation-related evaluation that typically informs managers
how to improve their programs. Normally, implementation evaluations assess the extent to which intended
program or policy designs are successfully implemented by the organizations that are tasked with doing so.
Implementation is not the same thing as outcomes/results. Weiss (1972) and others have pointed out that
assessing implementation is a necessary condition to being able to evaluate the extent to which a program has
achieved its intended outcomes. Bickman (1996), in his seminal evaluation of the Fort Bragg Continuum of Care
Program, makes a point of assessing how well the program was implemented, as part of his evaluation of the
outcomes. It is possible to have implementation failure, in which case any observed outcomes cannot be attributed
to the program. Implementation evaluations can also examine the ways that existing organizational structures,
processes, cultures, and priorities either facilitate or impede program implementation.

The fourth phase in the cycle is about assessing performance results, and reporting to legislators, the public, and
other (internal or external) stakeholders. This phase is also about summative evaluation, that is, evaluation that is
aimed at answering questions about a program or policy achieving its intended results, with a view to making
substantial program changes, or decisions about the future of the program. We will discuss formative and
summative evaluations more thoroughly later in this chapter.

Performance monitoring is an important way to tell how a program is tracking over time, but, as shown in the
model, performance measures can inform decisions made at any stage of the performance cycle, not just the
assessment stage. Performance data can be useful for strategic planning, program design, and management-related
implementation decisions. At the Assessment and Reporting Results phase, “performance measurement and
reporting” is expected to contribute to accountability for programs. That is, performance measurement can lead to
a number of consequences, from program adjustments to impacts on elections. In the final phase of the cycle,
strategic objectives are revisited, and the evidence from earlier phases in the cycle is among the inputs that may
result in new or revised objectives—usually through another round of strategic planning.

Stepping back from this cycle, we see a strategic management system that encompasses how ideas and evaluative
information are gathered for policy planning and subsequent funding allocation and reallocation. Many
governments have institutionalized their own performance information architecture to formalize how programs
and departments are expected to provide information to be used by the managerial and political decision makers.
Looking at Canada and the United States, we can see that this architecture evolves over time as the governance
context changes and also becomes more complex, with networks of organizations contributing to outcomes. The
respective emphasis on program evaluation and performance measurement can be altered over time. Times of
change in government leadership are especially likely to spark changes in the performance information
architecture. For example, in Canada, the election of the current Liberal Government in the 2015 federal election
after nine years of Conservative Government leadership has resulted in a government-wide focus on implementing
high-priority policies and programs and ensuring that their results are actually delivered (Barber, 2015; Barber,
Moffitt, & Kihn, 2011).

Policies and Programs
As you have been reading this chapter, you will have noticed that we mention both policies and programs as
candidates for performance measurement and evaluation. Our view is that the methodologies that are discussed in
this textbook are generally appropriate for evaluating both policies and programs. Some analysts use the terms
interchangeably—in some countries, policy analysis and evaluation is meant to encompass program evaluation
(Curristine, 2005). We will define them both so that you can see what the essential differences are.

What Is a Policy?

Policies connect means and ends. At the core of policies are statements of intended outcomes/objectives (ends) and the means by which
government(s) or their agents (perhaps nonprofit organizations or even private-sector companies) will go about achieving these outcomes.
Initially, policy objectives can be expressed in election platforms, political speeches, government responses to questions by the media, or
other announcements (including social media). Ideally, before a policy is created or announced, research and analysis has been done that
establishes the feasibility, the estimated effectiveness, or even the anticipated cost-effectiveness of proposed strategies to address a problem
or issue. Often, new policies are modifications of existing policies that expand, refine, or reduce existing governmental activities.

Royal commissions (in Canada), task forces, reports by independent bodies (including think tanks), or even public inquiries
(congressional hearings, for example) are ways that in-depth reviews can set the stage for developing or changing public policies. In other
cases, announcements by elected officials addressing a perceived problem can serve as the impetus to develop a policy—some policies are a
response to a political crisis.

An example of a policy that has significant planned impacts is the British Columbia government’s November 2007 Greenhouse Gas
Reduction Targets Act (Government of British Columbia, 2007) that committed the provincial government to reducing greenhouse gas
emissions in the province by 33% by 2020. From 2007 to 2013, British Columbia reduced its per capita consumption of petroleum
products subject to the carbon tax by 16.1%, as compared with an increase of 3.0% in the rest of Canada (World Bank, 2014).

The legislation states that by 2050, greenhouse gas emissions will be 80% below 2007 levels. Reducing greenhouse gas emissions in
British Columbia will be challenging, particularly given the more recent provincial priority placed on developing liquefied natural gas
facilities to export LNG to Asian countries. In 2014, the BC government passed a Greenhouse Gas Industrial Reporting and Control Act
(Government of British Columbia, 2014) that includes a baseline-and-credit system for which there is no fixed limit on emissions, but
instead, polluters that reduce their emissions by more than specified targets (which can change over time) can earn credits that they can
sell to other emitters who need them to meet their own targets. The World Bank annually tracks international carbon emission data
(World Bank, 2017).

What Is a Program?

Programs are similar to policies—they are means–ends chains that are intended to achieve some agreed-on objective(s). They can vary a
great deal in scale and scope. For example, a nonprofit agency serving seniors in the community might have a volunteer program to make
periodic calls to persons who are disabled or otherwise frail and living alone. Alternatively, a department of social services might have an
income assistance program serving clients across an entire province or state. Likewise, programs can be structured simply—a training
program might just have classroom sessions for its clients—or be complicated—an addiction treatment program might have a range of
activities, from public advertising, through intake and treatment, to referral, and finally to follow-up—or be complex—a
multijurisdictional program to reduce homelessness that involves both governments and nonprofit organizations.

To reduce greenhouse gases in British Columbia, many different programs have been implemented—some targeting the government
itself, others targeting industries, citizens, and other governments (e.g., British Columbia local governments). Programs to reduce
greenhouse gases are concrete expressions of the policy. Policies are usually higher level statements of intent—they need to be translated
into programs of actions to achieve intended outcomes. Policies generally enable programs. In the British Columbia example, a key
program that was implemented starting in 2008 was a broad-based tax on the carbon content of all fuels used in British Columbia by
both public- and private-sector emitters, including all who drive vehicles in the province. That is, a carbon tax component is added to per-liter vehicle fuel costs.

Increasingly, programs can involve several levels of government, governmental agencies, and/or nonprofit organizations. A good example
is Canada’s federal government initiatives, starting in 2016, to bring all provinces on board with GHG reduction initiatives. These kinds
of programs are challenging for evaluators and have prompted some in the field to suggest alternative ways of assessing program processes
and outcomes. Michael Patton (1994, 2011) has introduced developmental evaluation as one approach, and John Mayne (2001, 2011)
has introduced contribution analysis as a way of addressing attribution questions in complex program settings.

In the chapters of this textbook, we will introduce multiple examples of both policies and programs, and the
evaluative approaches that have been used for them. A word on our terminology—although we intend this book
to be useful for both program evaluation and policy evaluation, we will refer mostly to program evaluations.

Key Concepts in Program Evaluation

Causality in Program Evaluations
In this textbook, a key theme is the evaluation of the effectiveness of programs. One aspect of that issue is whether
the program caused the observed outcomes. Our view is that program effectiveness and, in particular, attribution
of observed outcomes are the core issues in evaluations. In fact, that is what distinguishes program evaluation from
other, related professions such as auditing and management consulting. Picciotto (2011) points to the centrality of
program effectiveness as a core issue for evaluation as a discipline/profession:

What distinguishes evaluation from neighboring disciplines is its unique role in bridging social science
theory and policy practice. By focusing on whether a policy, a program or project is working or not (and
unearthing the reasons why by attributing outcomes) evaluation acts as a transmission belt between the
academy and the policy-making. (p. 175)

In Chapter 3, we will describe the logic of research designs and how they can be used to examine causes and effects
in evaluations. Briefly, there are three conditions that are widely accepted as being jointly necessary to establish a
causal relationship between a program and an observed outcome: (1) the program has to precede the observed
outcome, (2) the presence or absence of the program has to be correlated with the presence or absence of the
observed outcome, and (3) there cannot be any plausible rival explanatory factors that could account for the
correlation between the program and the outcome (Cook & Campbell, 1979).
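The three conditions can be illustrated with a small sketch, again using invented data rather than anything from an actual evaluation. The first two conditions can be checked against program records and outcome data; the third is a matter of research design and professional judgment, not computation.

```python
# Illustrative sketch only (invented data): checking the first two conditions for a
# causal claim and noting that the third cannot be settled by the data alone.
import numpy as np

# Hypothetical records: whether each unit received the program, and its outcome score.
program = np.array([1, 1, 1, 1, 0, 0, 0, 0])          # 1 = program group
outcome = np.array([72, 68, 75, 70, 61, 64, 58, 63])   # measured after the program period

# Condition 1 (temporal precedence): verified from program records, for example by
# confirming that program delivery dates precede the outcome measurement dates.

# Condition 2 (covariation): is program presence correlated with the outcome?
covariation = np.corrcoef(program, outcome)[0, 1]
print(f"Program-outcome correlation: {covariation:.2f}")

# Condition 3 (no plausible rival explanations): this is not a calculation. It depends
# on the research design (comparison groups, baselines, controls) and on professional
# judgment about other factors in the program environment that could produce the
# same correlation.
```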

In the evaluation field, different approaches to assessing causal relationships have been proposed, and the debate
around using experimental designs continues (Cook et al., 2010; Creswell & Creswell, 2017; Donaldson et al.,
2014). Our view is that the logic of causes and effects (the three necessary conditions) is important to understand,
if you are going to do program evaluations. Looking for plausible rival explanations for observed outcomes is
important for any evaluation that claims to be evaluating program effectiveness. But that does not mean that we
have to have experimental designs for every evaluation.

Program evaluations are often conducted under conditions in which data appropriate for ascertaining or even
systematically addressing the attribution question are hard to come by. In these situations, the evaluator or
members of the evaluation team may end up relying, to some extent, on their professional judgment. Indeed, such
judgment calls are familiar to program managers, who rely on their own observations, experiences, and
interactions to detect patterns and make choices on a daily basis. Scriven (2008) suggests that our capacity to
observe and detect causal relationships is built into us. We are hardwired to be able to organize our observations
into patterns and detect/infer causal relationships therein.

For evaluators, it may seem “second best” to have to rely on their own judgment, but realistically, all program
evaluations entail a substantial number of judgment calls, even when valid and reliable data and appropriate
comparisons are available. As Daniel Krause (1996) has pointed out, “A program evaluation involves human
beings and human interactions. This means that explanations will rarely be simple, and interpretations cannot
often be conclusive” (p. xviii). Clearly, then, systematically gathered evidence is a key part of any good program
evaluation, but evaluators need to be prepared for the responsibility of exercising professional judgment as they do
their work.

One of the key questions that many program evaluations are expected to address can be worded as follows:

To what extent, if any, were the intended objectives met?

Usually, we assume that the program in question is “aimed” at some intended objective(s). Figure 1.2 offers a
picture of this expectation.

Figure 1.2 Linking Programs and Intended Objectives

The program has been depicted in a “box,” which serves as a conceptual boundary between the program and the
program environment. The intended objectives, which we can think of as statements of the program’s intended
outcomes, are shown as occurring outside the program itself; that is, the intended outcomes are results intended to
make a difference outside of the activities of the program itself.

The arrow connecting the program and its intended outcomes is a key part of most program evaluations and
performance measurement systems. It shows that the program is intended to cause the outcomes. We can restate
the “objectives achievement” question in words that are a central part of most program evaluations:

Was the program effective (in achieving its intended outcomes)?

Assessing program effectiveness is the most common reason we conduct program evaluations and create
performance measurement systems. We want to know whether, and to what extent, the program’s actual results
are consistent with the outcomes we expected. In fact, there are two evaluation issues related to program
effectiveness. Figure 1.3 separates these two issues, so it is clear what each means.

Figure 1.3 The Two Program Effectiveness Questions Involved in Most Evaluations

The horizontal causal link between the program and its outcomes has been modified in two ways: (1) intended
outcomes have been replaced by the observed outcomes (what we actually observe when we do the evaluation),
and (2) a question mark (?) has been placed over that causal arrow.

We need to restate our original question about achieving intended objectives:

To what extent, if at all, was the program responsible for the observed outcomes?

Notice that we have focused the question on what we actually observe in conducting the evaluation, and that the
“?” above the causal arrow now raises the key question of whether the program (or possibly something else) caused
the outcomes we observe. In other words, we have introduced the attribution question—that is, the extent to
which the program was the cause or a cause of the outcomes we observed in doing the evaluation. Alternatively,
were there factors in the environment of the program that caused the observed outcomes?

We examine the attribution question in some depth in Chapter 3, and refer to it repeatedly throughout this book.
As we will see, it is often challenging to address this question convincingly, given the constraints within which
program evaluators work.

Figure 1.3 also raises a second evaluation question:

To what extent, if at all, are the observed outcomes consistent with the intended outcomes?

Here, we are comparing what we actually find with what the program was expected to accomplish. Notice that
answering that question does not tell us whether the program was responsible for the observed or intended outcomes.

Sometimes, evaluators or persons in organizations doing performance measurement do not distinguish the
attribution question from the “achievement of intended outcomes” question. In implementing performance
measures, for example, managers or analysts spend a lot of effort developing measures of intended outcomes.
When performance data are analyzed, the key issue is often whether the actual results are consistent with intended
outcomes. In Figure 1.3, the dashed arrow connects the program to the intended outcomes, and assessments of
that link are often a focus of performance measurement systems. Where benchmarks or performance targets have
been specified, comparisons between actual outcomes and intended outcomes can also be made, but what is
missing from such comparisons is an assessment of the extent to which observed and intended outcomes are
attributable to the program (McDavid & Huse, 2006).

Formative and Summative Evaluations
Michael Scriven (1967) introduced the distinction between formative and summative evaluations (Weiss, 1998a).
Since then, he has come back to this issue several more times (e.g., Scriven, 1991, 1996, 2008). Scriven’s
definitions reflected his distinction between implementation issues and evaluating program effectiveness. He
associated formative evaluations primarily with analysis of program design and implementation, with a view to
providing program managers and other stakeholders with advice intended to improve the program “on the
ground.” For Scriven, summative evaluations dealt with whether the program had achieved intended, stated
objectives (the worth of a program). Summative evaluations could, for example, be used for accountability
purposes or for budget reallocations.

Although Scriven’s (1967) distinction between formative and summative evaluations has become a part of any
evaluator’s vocabulary, it has been both elaborated and challenged by others in the field. Chen (1996) introduced
a framework that featured two evaluation purposes—improvement and assessment—and two program stages—
process and outcomes. His view was that many evaluations are mixed—that is, evaluations can be both formative
and summative, making Scriven’s original dichotomy incomplete. For Chen (1996), improvement was formative,
and assessment was summative—and an evaluation that is looking to improve a program can be focused on both
implementation and objectives achievement. The same is true for evaluations that are aimed at assessing programs.

In program evaluation practice, it is common to see terms of reference that include questions about how well the
program was implemented, how (technically) efficient the program was, and how effective the program was. A
focus on program processes is combined with concerns about whether the program was achieving its intended
objectives.

In this book, we will refer to formative and summative evaluations but will define them in terms of their intended
uses. This is similar to the distinction offered in Weiss (1998a) and Chen (1996). Formative evaluations are
intended to provide feedback and advice with the goal of improving the program. Formative evaluations in this
book include those that examine program effectiveness but are intended to offer advice aimed at improving the
effectiveness of the program. One can think of formative evaluations as manager-focused evaluations, in which the
continued existence of the program is not questioned.

Summative evaluations are intended to ask “tough questions”: Should we be spending less money on this program?
Should we be reallocating the money to other uses? Should the program continue to operate? Summative
evaluations focus on the “bottom line,” with issues of value for money (costs in relation to observed outcomes) as
alternative analytical approaches.

In addition to formative and summative evaluations, others have introduced further classifications.
Eleanor Chelimsky (1997), for example, makes a similar distinction to the one we make between the
two primary types of evaluation, which she calls (1) evaluation for development (i.e., the provision of evaluative
help to strengthen institutions and to improve organizational performance) and (2) evaluation for accountability
(i.e., the measurement of results or efficiency to provide information to decision makers). She adds to the
discussion a third general purpose for doing evaluations: evaluation for knowledge (i.e., the acquisition of a deeper
understanding about the factors underlying public problems and about the “fit” between these factors and the
programs designed to address them). Patton’s (1994, 2011) “developmental evaluation” is another approach,
related to ongoing organizational learning in complex settings, which differs in some ways from the formative and
summative approaches generally adopted for this textbook. Patton sees developmental evaluations as preceding
formative or summative evaluations (Patton, 2011). As we shall see, however, there can be pressures to use
evaluations (and performance measures) that were originally intended for formative purposes, to be repurposed
and “used” summatively. This is a challenge particularly in times of fiscal stress, where cutbacks in budget are
occurring and can result in evaluations being seen to be inadequate for the (new) uses at hand (Shaw, 2016).

Ex Ante and Ex Post Evaluations
Typically, evaluators are expected to conduct evaluations of ongoing programs. Usually, the program has been in
place for some time, and the evaluator’s tasks include assessing the program up to the present and offering advice
for the future. These ex post evaluations are challenging: They necessitate relying on information sources that may
or may not be ideal for the evaluation questions at hand. Rarely are baselines or comparison groups available, and
if they are, they are only roughly appropriate. In Chapters 3 and 5, we will learn about the research design options
and qualitative evaluation alternatives that are available for such situations. Chapter 5 also looks at mixed-methods
designs for evaluations.

Ex ante (before implementation) program evaluations are less frequent. Cost–benefit analyses can be conducted ex
ante, to prospectively address at the design stage whether a policy or program (or one option from among several
alternatives) is cost-beneficial. Assumptions about implementation and the existence and timing of outcomes, as
well as costs, are required to facilitate such analyses. We discuss economic evaluation in Chapter 7.

In some situations, it may be possible to implement a program in stages, beginning with a pilot project. The pilot
can then be evaluated (and compared with the existing “no program” status quo) and the evaluation results used as
a kind of ex ante evaluation of a broader implementation or scaling up of the program. Body-worn cameras for
police officers are often introduced on a pilot basis, accompanied by an evaluation of their effectiveness.

One other possibility is to plan a program so that before it is implemented, baseline measures of outcomes are
constructed, and appropriate data are gathered. The “before” situation can be documented and included in any
future program evaluation or performance measurement system. In Chapter 3, we discuss the strengths and
limitations of before-and-after research designs. They offer us an opportunity to assess the incremental impacts of
the program. But, in environments where there are other factors that could also plausibly account for the observed
outcomes, this design, by itself, may not be adequate.

Program evaluation clients often expect evaluators to come up with ways of telling whether the program achieved
its objectives—that is, whether the intended outcomes were realized and why—despite the difficulties of
constructing an evaluation design that meets conventional standards to assess the cause-and-effect relationships
between the program and its outcomes.

The Importance of Professional Judgment in Evaluations

One of the principles underlying this book is the importance of exercising professional judgment as program evaluations are designed,
executed, and acted on. Our view is that although sound and defensible methodologies are necessary foundations for credible evaluations,
each evaluation process and the associated evaluation context necessitates making decisions that are grounded in professional judgment.
Values, ethics, political awareness, and social/cultural perspectives are important, beyond technical expertise (Donaldson & Picciotto,
2016; House, 2016; Schwandt, 2015). There are growing expectations that stakeholders, including beneficiaries, be considered equitably
in evaluations, and expectations to integrate evaluative information across networked organizations (Stockmann & Meyer, 2016; Szanyi,
Azzam, & Galen, 2013).

Our tools are indispensable—they help us construct useful and defensible evaluations. But like craftspersons or artisans, we ultimately
create a structure that combines what our tools can shape at the time with what our own experiences, beliefs, values, and expectations
furnish and display. Some of what we bring with us to an evaluation is tacit knowledge—that is, knowledge based on our experience—
and it is not learned or communicated except by experience.

Key to understanding all evaluation practice is accepting that no matter how sophisticated our designs, measures, and other methods are, we
will exercise professional judgment in our work. In this book, we will see where professional judgment is exercised in the evaluation process
and will begin to learn how to make defensible judgments. Chapter 12 is devoted to the nature and practice of professional judgment in
evaluation.

The following case summary illustrates many of the facets of program evaluation, performance measurement, and
performance management that are discussed in this textbook. We will outline the case in this chapter, and will
return to it and other examples in later chapters of the book.

Example: Evaluating A Police Body-Worn Camera Program In Rialto,
California

The Context: Growing Concerns With Police Use of Force and Community
Relationships
Police forces in many Canadian and American cities and towns—as part of a global trend—have begun using
body-worn cameras (BWCs) or are considering doing so (Lum et al., 2015). Aside from the technological
advances that have made these small, portable cameras and their systems available and more affordable, there are a
number of reasons to explain their growing use. In some communities, relationships between police and citizens
are strained, and video evidence holds the promise of reducing police use of force, or complaints against the police.
Recordings might also facilitate resolution of complaints. Just the presence of BWCs might modify police and
citizen behaviors, and de-escalate potentially violent encounters (Jennings, Fridell, & Lynch, 2014). Recent high-
profile incidents of excessive police use of force, particularly related to minority groups, have served as critical
sparks for immediate political action, and BWCs are seen as a partial solution (Cubitt, Lesic, Myers, & Corry,
2017; Lum et al., 2015; Maskaly et al., 2017). Recordings could also be used in officer training. Aside from the
intent to improve transparency and accountability, the use of BWCs holds the potential to provide more objective
evidence in crime situations, thereby increasing the likelihood and speed of convictions.

On the other hand, implementation efforts can be hampered by police occupational cultures and their responses
to the BWC use policies. Also, because the causal mechanisms are not well understood, BWCs may have
unanticipated and unintended negative consequences on the interactions between police and citizens. There are
also privacy concerns for both police and citizens. Thus, police BWC programs and policies raise a number of
causality questions that have just begun to be explored (see Ariel et al., 2016; Ariel et al., 2018a, 2018b; Cubitt et
al., 2017; Hedberg, Katz, & Choate, 2017; Lum et al., 2015; Maskaly et al., 2017). The Center for Evidence-
Based Crime Policy at George Mason University (2016) notes, “This rapid adoption of BWCs is occurring within
a low information environment; researchers are only beginning to develop knowledge about the effects, both
intentional and unintentional, of this technology” (p. 1 of website). Some of the evaluations are RCTs (including
our example that follows).

The U.S. Bureau of Justice Assistance (2018) provides a website (Body-Worn Camera Toolkit:
https://www.bja.gov/bwc/resources.html) that now holds over 700 articles and additional resources about BWCs.
About half of these are examples of local governments’ policies and procedures. Public Safety Canada (2018) has
approximately 20 similar resources. The seminal study by Ariel, Farrar, and Sutherland, The Effect of Body-Worn
Cameras on Use of Force and Citizens’ Complaints Against the Police: A Randomized Controlled Trial (Ariel et al.,
2015) will be used in this chapter to highlight the importance of evaluating the implementation and outcomes of
this high-stakes program. Related studies will also be mentioned throughout this textbook, where relevant.

Implementing and Evaluating the Effects of Body-Worn Cameras in the
Rialto Police Department
The City of Rialto Police Department was one of the first in the United States to implement body-worn cameras
and systematically evaluate their effects on citizen–police interactions (Ariel, Farrar, & Sutherland, 2015). The
study itself took place over 12 months, beginning in 2012. Rialto Police Department was nearly disbanded in
2007 when the city considered contracting for police services with the Los Angeles County Sheriff’s Department.
Beset by a series of incidents involving questionable police officer behaviors including use-of-force incidents, the
city hired Chief Tony Farrar in 2012. He decided to address the problems in the department by investing in body-
worn cameras for his patrol officers and systematically evaluating their effectiveness. The evaluation addressed this
question: “Do body-worn cameras reduce the prevalence of use-of-force and/or citizens’ complaints against the
police?” (Ariel et al., 2015, p. 509). More specifically, the evaluation was focused on this hypothesis: Police body-
worn cameras will lead to increases in socially desirable behaviors of the officers who wear them and reductions in
police use-of-force incidents and citizen complaints.

To test this hypothesis, a randomized controlled trial was conducted that became known internationally as the
“Rialto Experiment”—the first such study of BWCs (Ariel et al., 2015). Over the year in which this program was
implemented, officer shifts (a total of 988 shifts) were randomly assigned to either “treatment-shifts” (489), where
patrol officers would wear a BWC that recorded all incidents of contact with the public, or to “control-shifts”
(499), where they did not wear a BWC. Each week entailed 19 shifts, and each shift was 12 hours in duration and
involved approximately 10 officers patrolling in Rialto. Each of the 54 patrol officers had multiple shifts where
they did wear a camera, and shifts where they did not.
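
To make the assignment mechanism concrete, the following is a minimal sketch, in Python, of shift-level random assignment. The shift identifiers are hypothetical, and the study's actual randomization protocol (for example, any blocking by week) is not reproduced here; the sketch only illustrates the idea of allocating 988 shifts between treatment and control conditions.

import random

# Hypothetical identifiers for the 988 shifts in the study year
shifts = [f"shift_{i}" for i in range(1, 989)]

random.seed(42)          # fixed seed so the illustration is reproducible
random.shuffle(shifts)   # a simple stand-in for the study's assignment procedure

treatment_shifts = set(shifts[:489])   # officers wear BWCs on these shifts
control_shifts = set(shifts[489:])     # no BWCs on these shifts

print(len(treatment_shifts), len(control_shifts))   # 489 and 499, as in the study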

The study defined a use-of-force incident as an encounter with “physical force that is greater than basic control or
‘compliance holds’—including the use of (a) OC spray [pepper spray], (b) baton (c) Taser, (d) canine bite or (e)
firearm” (Ariel et al., 2015, p. 521). Incidents were measured using four variables:

1. Total incidents that occurred during experiment shifts, as recorded by officers using a standardized police
tracking system;
2. Total citizen complaints filed against officers (as a proxy of incidents), using a copyrighted software tool;
3. Rate of incidents per 1,000 police–public contacts, where total number of police–public contacts was
recorded using the department’s computer-aided dispatch system; and
4. Qualitative incident analysis, using videotaped content.

Key Findings
Ariel et al. (2015) concluded that the findings supported the overall hypothesis that wearing cameras increased
police officers’ compliance with rules of conduct around use of force, due to increased self-consciousness of being
watched.

A feature of the evaluation was comparisons not only of the BWC shifts and the non-BWC shifts (the
experimental design) but comparisons with data from months and years before the initiation of the study, as well as
after implementation. Thus, the evaluation design included two complementary approaches. The data from the
before–after component of the study showed that complaints by citizens for the whole department dropped from
28 in the year before the study, to just three during the year it was implemented; almost a 90% drop. Use-of-force
incidents dropped from 61 in the year before implementation to 25 during implementation, a 60% drop.

When comparing the BWC shifts with the non-BWC (control) shifts, there were about half as many use-of-force
incidents for the BWC shifts (eight as compared with 17 respectively). There was not a significant difference in
number of citizen complaints, given how few there were during the year of the experiment.
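
The arithmetic behind these comparisons can be shown in a short Python sketch that uses only the counts reported above; no additional data from the study are assumed.

# Department-wide counts reported by Ariel et al. (2015)
complaints_before, complaints_during = 28, 3   # citizen complaints
force_before, force_during = 61, 25            # use-of-force incidents

complaint_drop = (complaints_before - complaints_during) / complaints_before
force_drop = (force_before - force_during) / force_before
print(f"Complaints dropped by {complaint_drop:.0%}")          # about 89%: "almost a 90% drop"
print(f"Use-of-force incidents dropped by {force_drop:.0%}")  # about 59%: roughly a 60% drop

# Experimental comparison of shifts during the study year
bwc_incidents, control_incidents = 8, 17
print(f"BWC-to-control incident ratio: {bwc_incidents / control_incidents:.2f}")  # about half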

The qualitative findings supported the main hypothesis in this evaluation.

Tying the findings back to the key questions of the study, the results indicated that wearing cameras did appear to
increase the degree of self-awareness that the police officers had of their behavior and thereby could be used as a
social control mechanism to promote socially desirable behavior.

More generally, the significance of the problem of police use of force in encounters with citizens is
international in scope. Since the Rialto evaluation, there have been a large number of evaluations of similar
programs in other U.S. cities, as well as cities in other countries (Cubitt et al., 2017; Maskaly et al., 2017). The
widespread interest in this technology as an approach to managing use-of-force incidents has resulted in a large
number of variations in how body-worn cameras have been deployed (for example, whether they must be turned
on for all citizen encounters—that was true in Rialto—or whether officers can exercise discretion on whether to
turn on the cameras), what is being measured as program outcomes, and what research designs/comparisons are
conducted (U.S. Bureau of Justice Assistance, 2018; Cubitt et al., 2017).

Program Success Versus Understanding the Cause-and-Effect Linkages: The
Challenge of Unpacking the Body-Worn Police Cameras “Black Box”
Even though the Rialto Police Department program was evaluated with a randomized controlled design, it
presents us with a puzzle. It has been recognized that it may not have simply been the wearing of cameras that
modified behaviors but an additional “treatment” wherein officers informed citizens (in an encounter) that the
interaction was being recorded (Ariel et al., 2018a, 2018b; White, Todak, & Gaub, 2017). In fact, at least four
different causal mechanisms can be distinguished:

1. One in which the cameras being on all the time changed police behavior.
2. A second in which the cameras being on all the time changed citizen behavior.
3. A third in which the cameras being on all the time changed police behavior and that, in turn, changed
citizen behavior.
4. A fourth in which the body-worn cameras affect citizen behavior and that, in turn, affects police behavior.

Collectively, they create a challenge in interpreting the extent to which the cameras themselves affect officer
behaviors and citizen behaviors. This challenge goes well beyond the Rialto experiment. By 2016, Barak Ariel and
his colleagues had found, after 10 studies, that “in some cases they [BWCs] help, in some they don’t appear to
change police behavior, and in other situations they actually backfire, seemingly increasing the use of force” (Ariel,
2016, p. 36). This conundrum highlights the importance of working to determine the underlying mechanisms
that cause a policy or program to change people’s behavior.

Ariel et al. (2017), Hedberg et al. (2017), and Gaub et al. (2016) are three of the most recent studies to explore
the contradictory findings from BWC research. The root of the problem is that we do not yet know what the
BWC mechanisms are that modify the behaviors of police or citizens when BWCs are in use. Are the mechanisms
situational, psychological, or organizational/institutional? If a theory of deterrence (see Ariel et al., 2018b;
Hedberg et al., 2017) cannot adequately explain police and citizen behavioral outcomes of the use of BWCs, do
other behavioral organizational justice theories (Hedberg et al., 2017; Nix & Wolfe, 2016) also have a role to play
in our understanding? Deterrence theory relates to individual reactions to the possibility of being under
surveillance, whereas organizational justice concepts, in the case of policing, relate to perceptions of procedural
fairness in the organization. Nix and Wolfe (2016) take a closer look at organizational justice in the policing
context and explain,

The third, and most important, element of organizational justice is procedural fairness. Over and above
outcome-based equity, employees look for supervisory decisions and organizational processes to be
handled in procedurally just manners—decisions are clearly explained, unbiased, and allow for
employee input. (p. 14)

So what mechanisms and theories might explain police and citizen changes in behavior when body-worn cameras
are introduced into the justice system? As Ariel (2016) noted as the subtitle of his recent paper, Body-worn cameras
give mixed results, and we don’t know why.

Connecting Body-Worn Camera Evaluations to This Book
Although this textbook will use a variety of evaluations from different fields to illustrate points about evaluation
theory and practice, body-worn-camera-related programs and their evaluations give us an opportunity to explore a
timely, critical policy issue with international reach. We will pick up on the ways that evaluations of body-worn
cameras intersect with different topics in our book: logic models, research designs, measurement issues,
implementation issues, and the uses of mixed methods to evaluate programs.

The BWC studies offer us timely examples that can help evaluators to understand the on-the-ground implications
of conducting defensible evaluations. Briefly, they are as follows:

Body-worn camera programs for police forces have come into being in response to high-stakes sociopolitical
problems—clearly there is rationale for such programs.
Evaluations of BWC initiatives fit into varying components of the performance management cycle,
including strategic planning and resource allocation, program and policy design, implementation and
management, and assessing and reporting results.
Ex ante studies have been conducted in some jurisdictions to examine police perceptions about the
possibility of initiating BWC programs, before a BWC system is purchased and implemented.
“Gold standard” randomized controlled trials have been conducted and have produced compelling
evidence, yet the results of multiple studies are contradictory.
Much can be learned from the internal validity and construct validity problems for BWC studies. For
example, even in randomized settings, it is difficult to keep the “experimental” and the “control” group
completely separate (in Rialto, the same officers were part of both the experimental and control groups,
suggesting diffusion effects—a construct validity problem).
Local and organizational culture seems to be at the root of puzzling and sometimes contradictory evaluation
results (an external validity issue).
Existing data and performance measures are inconsistently defined and collected across communities,
creating a challenge for evaluators wanting to synthesize existing studies as one of their lines of evidence.
Many evaluations of BWCs include quantitative and qualitative lines of evidence.
Implementation issues are as much a concern as the outcomes of BWC programs. There is so much
variability in the way the BWCs are instituted, the policies (or not) on their uses, and the contexts in which
they are introduced that it is difficult to pin down what this program is fundamentally about. (What is the
core technology?) This is both an implementation problem and a construct validity problem.
Governments and police forces are concerned with cost-based analyses and other types of economic
evaluations but face challenges in quantitatively estimating actual costs and benefits of BWCs.
BWC evaluators operate in settings where their options are constrained. They are challenged to develop a
methodology that is defensible and to produce reports and recommendations that are seen to be credible
and useful, even where, for example, there is resistance to the mandatory use of BWCs for the
“experimental” police (as compared with the control group).
The evaluators use their professional judgment as they design and implement their studies. Methods
decisions, data collection decisions, interpretations of findings, conclusions, and recommendations are all
informed by judgment. There is no template or formula to design and conduct such evaluations in particular
settings. Instead, there are methodological approaches and tools that are applied by evaluators who have
learned their craft and, of necessity, tackle each project as a craftsperson.

These points will be discussed and elaborated in other chapters of this textbook. Fundamentally, program
evaluation is about gathering information that is intended to answer questions that program managers and other
stakeholders have about a program. Program evaluations are always affected by organizational and political factors
and are a balance between methods and professional judgment.

Your own experience and practice will offer additional examples (both positive and otherwise) of how evaluations
get done. In this book, we will blend together important methodological concerns—ways of designing and
conducting defensible and credible evaluations—with the practical concerns facing evaluators, managers, and
other stakeholders as they balance evaluation requirements and organizational realities.

Ten Key Evaluation Questions
The previous discussion focused on one of the key questions that program evaluations are expected to answer—
namely, whether the program was successful in achieving its intended outcomes. Aside from the question of
program effectiveness, there are other questions that evaluations can address. They are summarized in Table 1.1.
To help us make sense of these 10 questions, we have included an open systems model (Figure 1.4) of a typical
program that shows how objectives, resources (inputs), outputs, and outcomes are linked. You can review that
model, locate the key words that are highlighted in Table 1.1, and see how the questions are related to each other.

Figure 1.4 An Open Systems Model of Programs and Key Evaluation Issues

Source: Adapted from Nagarajan and Vanheukelen (1997, p. 20).

Table 1.1 Ten Possible Evaluation Questions

1. What is the need for a program?

2. Is the program relevant?

3. Was the structure/logic of the program appropriate?

4. Was the program implemented as intended?

5. Was the program technically efficient?

6. Was the program responsible for the outcomes that actually occurred (effectiveness 1)?

7. Did the program achieve its intended objectives (effectiveness 2)?

8. Was the program cost-effective?

9. Was the program cost beneficial?

10. Was the program adequate?

1. What is the need for a program?
A needs assessment can occur either before program options are developed (an ex ante needs assessment) or during
a program’s implemented lifetime (an ex post needs assessment). Typically, needs assessments gather information using either
or both qualitative and quantitative methodologies, and compare existing programs or services with levels and
types of needs that are indicated by the data. These comparisons can suggest gaps that might be addressed by
developing or modifying programs, and allocating resources to reduce or eliminate these gaps.

Needs assessment done before a program is developed can inform the way that the objectives are stated, and
suggest performance measures and targets that would reduce needs gaps. If a needs assessment is done during the
time a program is implemented, it can be a part of an evaluation of the program’s effectiveness—is the program
achieving its intended outcomes, and does the program meet the needs of the stakeholder groups at which it was
targeted? Such an evaluation might suggest ways of improving the existing program, including refocusing the
program to better meet client needs. We will be discussing needs assessments in Chapter 6 of this textbook.

2. Is the program relevant?


Programs are aimed at objectives that are intended to reflect priorities of governments, boards of directors, or
other stakeholders. These priorities can change. Governments change, and differing views on social, economic, or
political issues emerge that suggest a need to reassess priorities and either adjust direction or embark on a new
course. Programs that were consistent with government or other stakeholder priorities at one point can become
less relevant over time.

Assessing the relevance of a program typically involves examining documents that outline the original (and
current) directions of the program, on the one hand, and comparing those with statements of current and future
priorities, on the other. Interviews with key stakeholders are usually an important part of relevance assessments.
Assessing the relevance of a program is different from assessing the need for a program or measuring its
effectiveness—assessments of relevance are almost always qualitative and rely substantially on the experience and
judgment of the evaluators as well as of stakeholders.

3. Was the structure/logic of the program appropriate?


Typically, programs address a problem or issue that has arisen in the public sector. Programs often elaborate
policies. The scope and reach of programs can vary a great deal, depending on the complexity of the problem.
When programs are being developed, researching options is useful. This often involves comparisons among
jurisdictions to see whether/how they have tackled similar problems and whether they have information about the
success of their strategies.

Selecting a strategy to address a problem is constrained by time, available resources, and prevailing political views.
Proposed solutions (programs) can be a compromise of competing organizational/stakeholder views, but this may
not be the most appropriate means to achieving a desired objective.

Assessing the appropriateness of a program focuses on the structure that is intended to transform resources into
results. Related questions include the following:

Does the logic of the program reflect evidence-based theories of change that are relevant for this situation
(if there are such theories of change)?
Does the logic of the program reflect smart or promising practices in other jurisdictions?
Is the logic of the program internally consistent?
Are all the essential components there, or are there one or more components that should be added to
increase the likelihood of success?
Overall, is the logic/design the best means to achieve the objectives, given the context in which the program will be implemented?

We discuss program theories and program logics in Chapter 2.

4. Was the program implemented as intended?


Assessing implementation involves an examination of the program inputs, program activities, and the outputs
from those activities. Programs or policies are implemented in environments that are affected by—and can affect
—the program. Program objectives drive the design and implementation process; inputs (typically budgetary
resources, human resources, and technologies) are converted into activities that, in turn, produce outputs. These
are explained in greater detail in Chapter 2.

Programs can consist of several components (components are typically clusters of activities), and each is associated
with a stream of activities and outputs. For example, a program that is focused on training unemployed persons so
that they can find permanent jobs may have a component that markets the program to prospective clients, a
component in which the actual training is offered, a component that features activities intended to connect trained
persons with prospective employers, and a component that follows up with clients and employers to solve
problems and increase the likelihood that job placements are successful.

Assessing such a program to see whether it has been fully implemented would involve looking at each component,
assessing the way that it had been implemented (what activities have happened), identifying and describing any
bottlenecks in the processes, and seeing whether outputs have been produced for different activities. Since the
outputs of most programs are necessary (but not sufficient) to produce outcomes, tracking outputs as part of
measuring program performance monitors program implementation and provides information that is an essential
part of an implementation evaluation.

Assessing program implementation is sometimes done in the first stages of an evaluation process, when
considering evaluation questions, clarifying the program objectives, understanding the program structure, and
putting together a history of the program. Where programs are “new” (say, 2 years old or less), it is quite possible
that gaps will emerge between descriptions of intended program activities and what is actually getting done. One
way to assess implementation is to examine the fidelity between intended and actual program components,
activities, and even outputs (Century, Rudnick, & Freeman, 2010). Indeed, if the gaps are substantial, a program
evaluator may elect to recommend an analysis that focuses on just implementation issues, setting aside other
results-focused questions for a future time.

5. Was the program technically efficient?


Technical efficiency involves comparing inputs with outputs, usually to assess the productivity of the program or
to calculate the costs per unit of output. For example, most hospitals calculate their cost per patient day. This
measure of technical efficiency compares the costs of serving patients (clients) with the numbers of clients and the
time that they (collectively) spend in the hospital. If a hospital has 100 beds, it can provide a maximum of 36,500
(100 × 365) patient days of care in a year. Administrative and resource-related constraints would typically reduce
such a maximum to some fraction of that number.

Knowing the expenditures on patient care (calculating this cost can be challenging in a complex organization like a
hospital) and knowing the actual number of patient days of care provided, it is possible to calculate the cost of
providing a unit of service (cost per patient day). An additional indicator of technical efficiency would be the
comparison of the actual cost per patient day with a benchmark cost per patient day if the hospital were fully
utilized. Economic evaluation issues are examined in Chapter 7.
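
As an illustration, the following minimal Python sketch works through the cost-per-patient-day calculation described above. The expenditure and utilization figures are hypothetical and are not drawn from any particular hospital.

# Hypothetical figures for a 100-bed hospital
beds = 100
max_patient_days = beds * 365          # 36,500 patient days at full utilization
actual_patient_days = 29_200           # hypothetical: roughly 80% of the maximum
patient_care_costs = 26_280_000        # hypothetical annual patient-care expenditure ($)

cost_per_patient_day = patient_care_costs / actual_patient_days
benchmark_cost_per_day = patient_care_costs / max_patient_days   # benchmark if fully utilized

print(f"Actual cost per patient day:    ${cost_per_patient_day:,.2f}")    # $900.00
print(f"Benchmark cost per patient day: ${benchmark_cost_per_day:,.2f}")  # $720.00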

6. Was the program responsible for the outcomes that actually occurred?
Effectiveness (1) in Figure 1.4 focuses on the linkage between the program and the outcomes that actually
happened. The question is whether the observed outcomes were due to the program or, instead, were due to some
combination of environmental factors other than the program. In other words, can the observed outcomes be
attributed to the program? We discuss the attribution issue in Chapter 3.

7. Did the program achieve its intended objectives?


Effectiveness (2) in Figure 1.4 compares the program objectives with the outcomes that actually occurred.
Attaining the intended outcomes is not equivalent to saying that the program caused these outcomes. It is possible
that shifts in environmental factors accounted for the apparent success (or lack of it) of the program. An example
of environmental factors interfering with the evaluation of a program in British Columbia occurred in a province-
wide program to target drinking drivers in the mid-1970s. The Counterattack Program involved public
advertising, roadblocks, vehicle checks, and 24-hour license suspensions for persons caught with alcohol levels
above the legal blood alcohol limit. A key measure of success was the number of fatal and injury accidents on
British Columbia provincial highways per 100 million vehicle miles driven—the expectation being that the
upward trend prior to the program would be reversed after the program was implemented. Within 5 months of
the beginning of that program, British Columbia also adopted a mandatory seatbelt law, making it impossible to
tell whether Counterattack was responsible (at a province-wide level) for the observed downward trend in
accidents. In effect, the seatbelt law was a rival hypothesis that could plausibly explain the
outcomes of the Counterattack Program.

Performance measures are often intended to track whether policies and programs achieve their intended objectives
(usually, yearly outcome targets are specified). Measuring performance is not equivalent to evaluating the
effectiveness (1) of a program or policy. Achieving intended outcomes does not tell us whether the program or
policy in question caused those outcomes. If the outcomes were caused by factors other than the program, the
resources that were expended were not used cost-effectively.

8. Was the program cost-effective?


Cost-effectiveness involves comparing the costs of a program with the outcomes. Ex post (after the program has
been implemented) cost–effectiveness analysis compares actual costs with actual outcomes. Ex ante (before
implementation) cost–effectiveness analysis compares expected costs with expected outcomes. The validity of ex
ante cost–effectiveness analysis depends on how well costs and outcomes can be forecasted. Cost–effectiveness
analyses can be conducted as part of assessing the effectiveness of the policy or program. Ratios of costs per unit of
outcome offer a way to evaluate a program’s performance over time, compare a program with other similar
programs elsewhere, or compare program performance with some benchmark (Yeh, 2007).

Key to conducting a cost–effectiveness evaluation is identifying an outcome that represents the program well
(validly) and can be compared with costs quantitatively to create a measure of unit costs. An example of a cost–
effectiveness ratio for a program intended to place unemployed persons in permanent jobs would be cost per
permanent job placement.

There is an important difference between technical efficiency and cost-effectiveness. Technical efficiency compares
the cost of inputs with units of outputs, whereas cost-effectiveness compares the cost of inputs with units of
outcomes. For example, if one of the components of the employment placement program is training for
prospective workers, a measure of the technical efficiency (comparing costs with units of output) would be the cost
per worker trained. Training could be linked to permanent placements, so that more trained workers would
presumably lead to more permanent placements (an outcome). Cost-effectiveness is discussed in Chapter 7.
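
A brief sketch can make the contrast concrete. The cost, output, and outcome figures below are hypothetical and serve only to show that the same expenditure produces different ratios depending on whether it is compared with outputs or with outcomes.

# Hypothetical figures for an employment placement program
program_cost = 1_000_000       # annual program cost ($)
workers_trained = 500          # an output of the training component
permanent_placements = 200     # an outcome of the program

cost_per_worker_trained = program_cost / workers_trained   # technical efficiency (output)
cost_per_placement = program_cost / permanent_placements   # cost-effectiveness (outcome)

print(f"Cost per worker trained: ${cost_per_worker_trained:,.0f}")   # $2,000
print(f"Cost per permanent placement: ${cost_per_placement:,.0f}")   # $5,000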

9. Was the program cost-beneficial?


Cost–benefit analysis compares the costs and the benefits of a program. Unlike technical efficiency or cost–
effectiveness analysis, cost–benefit analysis converts all the outcomes of a program into monetary units (e.g.,
dollars), so that costs and benefits can be compared directly. Typically, a program or a project will be
implemented and operate over several years, and expected outcomes may occur over a longer period of time. For
example, when a cost–benefit analysis of a hydroelectric dam is being conducted, the costs and the benefits would
be spread out over a long period of time, making it necessary to take into account when the expected costs and
benefits occur, in any calculations of total costs and total benefits.
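
One common way to take the timing of costs and benefits into account is to discount each year's net benefit back to a present value and sum the results. The following minimal sketch assumes hypothetical cash flows and a hypothetical discount rate; it is not an analysis of any actual project.

# Hypothetical multi-year costs and benefits ($), years 0 through 5
costs    = [50_000_000, 30_000_000, 0, 0, 0, 0]                    # construction up front
benefits = [0, 0, 15_000_000, 15_000_000, 15_000_000, 15_000_000]  # benefits begin in year 2
discount_rate = 0.05                                               # assumed discount rate

# Discount each year's net benefit (benefit minus cost) to its present value and sum
npv = sum((b - c) / (1 + discount_rate) ** year
          for year, (c, b) in enumerate(zip(costs, benefits)))

print(f"Net present value: ${npv:,.0f}")   # negative here: discounted benefits fall short of costs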

In many public-sector projects, particularly those that have important social dimensions, converting outcomes into
monetary benefits is difficult and often necessitates assumptions that can be challenged.

Cost–benefit analyses can be done ex ante or ex post—that is, before a program is implemented or afterward. Ex
ante cost–benefit analysis can indicate whether it is worthwhile going ahead with a proposed option, but to do so,
a stream of costs and outcomes must be assumed. If implementation problems arise, or the expected outcomes do
not materialize, or unintended impacts occur, the actual costs and benefits can diverge substantially from those
estimated before a program is implemented. Cost–benefit analysis is a subject of Chapter 7.

10. Was the program adequate?


Even if a program was technically efficient, cost-effective, and even cost-beneficial, it is still possible that the
program will not resolve the problem for which it was intended. An evaluation may conclude that the program
was efficient and effective, but the magnitude of the problem was such that the program was not adequate to
achieve the overall objective.

Changes in the environment can affect the adequacy of a program. A program that was implemented to train
unemployed persons in resource-based communities might well have been adequate in an expanding economy,
but if macroeconomic trends reverse, resulting in the closure of mills or mines, the program may no longer be
sufficient to address the problem at hand.

Anticipating the adequacy of a program is also connected with assessing the need for a program: Is there a
(continuing/growing/diminishing) need for a program? Needs assessments are an important part of the program
management cycle, and although they present methodological challenges, they can be very useful in planning or
revising programs. We discuss needs assessments in Chapter 6.

The Steps In Conducting A Program Evaluation
Our approach to presenting the key topics in this book is that an understanding of program evaluation concepts
and principles is important before designing and implementing performance measurement systems. When
performance measurement expanded across government jurisdictions in the 1990s, expectations were high for this
new approach (McDavid & Huse, 2012). In many organizations, performance measurement was viewed as a
replacement for program evaluation (McDavid, 2001; McDavid & Huse, 2006). Three decades of experience with
actual performance measurement systems suggests that initial expectations were unrealistic. Relying on
performance measurement alone to evaluate programs does not get at why observed results occurred (Effectiveness
[1]). Performance measurement systems monitor and can tell us whether a program “achieved” its intended
outcomes (Effectiveness [2]). Program evaluations are intended to answer “why” questions.

In this chapter, we will outline how program evaluations in general are done, and once we have covered the core
evaluation-related knowledge and skills in Chapters 2, 3, 4, and 5, we will turn to performance measurement in
Chapters 8, 9, and 10. In Chapter 9, we will outline the key steps involved in designing and implementing
performance measurement systems.

Designing and Conducting an Evaluation Is Not a Linear Process

Even though each evaluation is different, it is useful to outline the steps that are generally typical, keeping in mind that for each
evaluation, there will be departures from these steps. Our experience with evaluations is that as each evaluation is designed and conducted,
the steps in the process are revisited in an iterative fashion. For example, the process of constructing a logic model of the program may
result in clarifying or revising the program objectives and even prompt revisiting the purposes of the evaluation, as additional
consultations with stakeholders take place.

General Steps in Conducting a Program Evaluation
Rutman (1984) distinguished between planning for an evaluation and actually conducting the evaluation. The
evaluation assessment process can be separated from the evaluation study itself, so that managers and other
stakeholders can see whether the results of the evaluation assessment support a decision to proceed with the
evaluation. It is worth mentioning that the steps outlined next imply that a typical program evaluation is a project,
with a beginning and an end point. This is still the mainstream view of evaluation practice, but others have argued
that evaluation should be more than “studies.” Mayne and Rist (2006), for example, suggest that evaluators should
be prepared to do more than evaluation projects. Instead, they need to be engaged with organizational
management: leading the development of results-based management systems (including performance
measurement and performance management systems) and using all kinds of evaluative information, including
performance measurement, to strengthen the evaluative capacity in organizations. They maintain that creating and
using evaluative information has to become more real-time and that managers and evaluators need to think of each
other as partners in constructing knowledge management systems and practices. Patton (2011) takes this vision
even further—for him, developmental evaluators in complex settings need to be engaged in organizational change,
using their evaluation knowledge and skills to provide real-time advice that is aimed at organizational innovation
and development.

Table 1.2 summarizes 10 questions that are important as part of evaluation assessments. Assessing the feasibility of
a proposed evaluation project and making a decision about whether to go ahead with it is a strategy that permits
several decision points before the budget for an evaluation is fully committed. A sound feasibility assessment will
yield products that are integral to a defensible evaluation product.

The end product of the feasibility assessment phase is enough aggregated information that it should be
straightforward to implement the evaluation project, should it proceed. In Chapter 6, when we discuss needs
assessments, we will see that there is a similar assessment phase for planning needs assessments.

Five additional steps are also outlined in Table 1.2 for conducting and reporting evaluations. Each of the questions
and steps is elaborated in the discussion that follows.

Table 1.2 Checklist of Key Questions and Steps in Conducting Evaluation Feasibility Assessments and Evaluation Studies

Steps in assessing the feasibility of an evaluation

1. Who are the clients for the evaluation, and who are the stakeholders?

2. What are the questions and issues driving the evaluation?

3. What resources are available to do the evaluation?

4. Given the evaluation questions, what do we already know?

5. What is the logic and structure of the program?

6. Which research design alternatives are desirable and feasible?

7. What kind of environment does the program operate in, and how does that affect the comparisons
available to an evaluator?

8. What data sources are available and appropriate, given the evaluation issues, the program structure, and
the environment in which the program operates?

9. Given all the issues raised in Points 1 to 8, which evaluation strategy is most feasible, and which is
defensible?

10. Should the evaluation be undertaken?

Steps in conducting and reporting an evaluation

1. Develop the data collection instruments, and pre-test them.

2. Collect data/lines of evidence that are appropriate for answering the evaluation questions.

3. Analyze the data, focusing on answering the evaluation questions.

4. Write, review, and finalize the report.

5. Disseminate the report.

Assessing the Feasibility of the Evaluation


1. Who are the clients for the evaluation, and who are the other stakeholders?

Program evaluations are substantially user driven. Michael Patton (2008) makes a utilization focus a key criterion
in the design and execution of program evaluations. Intended users must be identified early in the process and
must be involved in the evaluation feasibility assessment. The extent of their involvement will depend on whether
the evaluation is intended to make incremental changes to the program or, instead, is intended to provide
information that affects the existence of the program. Possible clients could include but are not limited to

program/policy managers,
agency/ministry executives,
external agencies (including central agencies),
program recipients,
funders of the program,
political decision makers/members of governing bodies (including boards of directors), and
community leaders.

All evaluations are affected by the interests of stakeholders. Options for selecting what to evaluate, who will have
access to the results, how to collect the information, and even how to interpret the data generally take into account
the interests of key stakeholders. In most evaluations, the clients (those commissioning the evaluation) will have
some influence over how the goals, objectives, activities, and intended outcomes of the program are defined for the
purpose of the evaluation (Boulmetis & Dutwin, 2000). Generally, the more diverse the clients and audience for
the evaluation results, the more complex the negotiation process that surrounds the evaluation itself. Indeed, as
Shaw (2000) comments, “Many of the issues in evaluation research are influenced as much, if not more, by
political as they are by methodological considerations” (p. 3).

An evaluation plan, outlining items such as the purpose of the evaluation, the key evaluation questions, and the
intended audience(s), worked out and agreed to by the evaluators and the clients prior to the start of the
evaluation, is very useful. Owen and Rogers (1999) discuss the development of evaluation plans in some detail. In
the absence of such a written plan, they argue, “There is a high likelihood that the remainder of the evaluation
effort is likely to be unsatisfactory to all parties” (p. 71), and they suggest the process should take up to 15% of the
total evaluation budget.

2. What are the questions and issues driving the evaluation?

Evaluators, particularly as they are learning their craft, are well advised to seek explicit answers to the following
questions:

Why do the clients want it done?


What are the main evaluation issues that the clients want addressed? (Combinations of the 10 evaluation
questions summarized in Table 1.1 are usually in play).
Are there hidden agendas or covert reasons for wanting the policy or program evaluated? For example, how
might the program organization or the beneficiaries be affected?
Is the evaluation intended to be for incremental adjustments/improvements, major decisions about the
future of the program, or both?

Answering these questions prior to agreeing to conduct an evaluation is essential because, as Owen and Rogers
(1999) point out,

There is often a diversity of views among program stakeholders about the purpose of an evaluation.
Different interest groups associated with a given program often have different agendas, and it is essential
for the evaluator to be aware of these groups and know about their agendas in the negotiation stage. (p.
66)

Given time and resource constraints, an evaluator cannot hope to address all the issues of all program stakeholders
within one evaluation. For this reason, the evaluator must reach a firm agreement with the evaluation clients about
the questions to be answered by the evaluation. This process will involve working with the clients to help narrow
the list of questions they are interested in, a procedure that may necessitate “educating them about the realities of
working within a budget, challenging them as to the relative importance of each issue, and identifying those
questions which are not amenable to answers through evaluation” (Owen & Rogers, 1999, p. 69).

3. What resources are available to do the evaluation?

Typically, resources to design and complete evaluations are scarce. Greater sophistication in evaluation designs
almost always entails larger organizational expenditures and greater degrees of control by the evaluator. For
example, achieving the necessary control over the program and its environment to conduct experimental or quasi-
experimental evaluations generally entails modifying existing administrative procedures and perhaps even
temporarily changing or suspending policies (e.g., to create no-program comparison groups). This can have ethical
implications—withholding a program from vulnerable persons or families can cause harm (Rolston, Geyer, &
Locke, 2013). We discuss the ethics of evaluations in Chapter 12.

It is useful to distinguish among several kinds of resources needed for evaluations:

Time
Human resources, including persons with necessary knowledge, skills, and experience
Organizational support, including written authorizations for other resources needed to conduct the
evaluation
Money

It is possible to construct and implement evaluations with very modest resources. Bamberger, Rugh, Church, and
Fort (2004) have suggested strategies for designing impact evaluations with very modest resources—they call their
approach shoestring evaluation. Another recently introduced approach is rapid impact evaluation (Government
of Canada, 2018; Rowe, 2014). Agreements reached about all resource requirements should form part of the
written evaluation plan.

4. What evaluation work has been done previously?

Evaluators should take advantage of work that has already been done. There may be previous evaluations of the
current program or evaluations of similar ones in other jurisdictions. Internet resources are very useful as you are
planning an evaluation, although many program evaluations are unpublished and may be available only through
direct inquiries.

Aside from literature reviews, which have been a staple of researchers for as long as theoretical and empirical work
have been done, there is growing emphasis on approaches that take advantage of the availability of consolidations
of reports, articles, and other documents on the Internet. An example of a systematic review was the study done
by Anderson, Fielding, Fullilove, Scrimshaw, and Carande-Kulis (2003) that focused on cognitive outcomes for
early childhood programs in the United States. Anderson and her colleagues began with 2,100 possible
publications and, through a series of filters, narrowed those down to 12 studies that included comparison group
research designs, were robust in terms of their internal validity, and measured cognitive outcomes for the programs
being evaluated.
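
The logic of that kind of filtering can be sketched very simply. In the Python illustration below, the study records and the three inclusion criteria are hypothetical stand-ins, not the actual records or criteria used by Anderson and her colleagues.

# Hypothetical study records and inclusion criteria
studies = [
    {"id": "S1", "comparison_group": True,  "internally_valid": True,  "cognitive_outcomes": True},
    {"id": "S2", "comparison_group": False, "internally_valid": True,  "cognitive_outcomes": True},
    {"id": "S3", "comparison_group": True,  "internally_valid": False, "cognitive_outcomes": True},
]

# Apply the inclusion criteria as a series of filters
included = [s for s in studies
            if s["comparison_group"] and s["internally_valid"] and s["cognitive_outcomes"]]

print([s["id"] for s in included])   # only S1 survives all three filters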

The Cochrane Collaboration (2018) is an international project begun in 1993 that is aimed at conducting
systematic reviews of health-related interventions. They also produce the Cochrane Handbook for Systematic
Reviews of Interventions. These reviews can be useful inputs for governments and organizations that want to know
the aggregate effect sizes for interventions using randomized controlled trials that have been grouped and
collectively assessed.

The Campbell Collaboration (2018) is an organization that is focused on the social sciences and education.
Founded in 1999, its mission is to promote “positive social and economic change through the production and use
of systematic reviews and other evidence synthesis for evidence-based policy and practice.”

The Government Social Research Unit in the British government has published a series of guides, including The
Magenta Book: Guidance for Evaluation (HM Treasury, 2011). Chapter 6 in The Magenta Book, “Setting Out the
Evaluation Framework,” includes advice on using existing research in policy evaluations. Literature reviews and
quantitative and qualitative systematic reviews are covered. The main point here is that research is costly, and
being able to take advantage of what has already been done can be a cost-effective way to construct lines of
evidence in an evaluation.

An important issue in synthesizing previous work is how comparable the studies are. Variations in research
designs/comparisons, the ways that studies have been conducted (the precise research questions that have been
addressed), the sizes of samples used, and the measures that have been selected will all influence the comparability
of previous studies and the validity of any aggregate estimates of policy or program effects.

5. What is the structure and logic of the program?

Programs are means–ends relationships. Their intended objectives, which are usually a product of
organizational/political negotiations, are intended to address problems or respond to social/economic/political
issues or needs that emerge from governments, interest groups, and other stakeholders. Program structures are the
means by which objectives are expected to be achieved.

Logic models are useful for visually summarizing the structure of a program. They are a part of a broader
movement in evaluation to develop and test program theories when doing evaluations (Funnell & Rogers, 2011).
Program logic models are widely used to show the intended causal linkages in a program. There are many
different styles of logic models (Funnell & Rogers, 2011), but what they have in common is identifying the major
sets of activities in the program, their intended outputs, and the outcomes (often short, medium, and longer term)
that are expected to flow from the outputs (Knowlton & Phillips, 2009).

An example of a basic schema for a logic model is illustrated in Figure 1.5. The model shows the stages in a typical
logic model: program process (including outputs) and outcomes. We will be discussing logic models in some detail
in Chapter 2 of this textbook.

Figure 1.5 Linear Program Logic Model

Source: Adapted from Coryn, Schröter, Noakes, & Westine (2011) as adapted from Donaldson (2007, p.
25).

Logic models are usually about intended results—they outline how a program is expected to work, if it is
implemented and works as planned. Key to constructing a logic model is a clear understanding of the program
objectives. One challenge for evaluators is working with stakeholders, including program managers and executives,
to refine the program objectives. Ideally, program objectives should have five characteristics:

1. An expected direction of change for the outcome is specified.
2. An expected magnitude of change is specified.
3. An expected time frame is specified.
4. A target population is specified.
5. The outcome is measurable.

The government’s stated objective of reducing greenhouse gas emissions in British Columbia by 33% by the year
2020 is a good example of a clearly stated policy objective. From an evaluation standpoint, having an objective
that is clearly stated simplifies the task of determining whether that policy has achieved its intended outcome.
Political decision makers often prefer more general language in program or policy objectives so that there is
“room” to interpret results in ways that suggest some success. As well, many public-sector policy objectives are
challenging to measure.
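
These five characteristics can be treated as a simple checklist when reviewing stated objectives with stakeholders. The sketch below encodes the British Columbia emissions objective as a record and checks each characteristic; the field names are our own hypothetical shorthand, not part of any official statement of the objective.

```python
# Checklist sketch for the five characteristics of a well-stated program objective.
# Field names are hypothetical; the example paraphrases the BC emissions target.

objective = {
    "outcome": "greenhouse gas emissions in British Columbia",
    "direction": "reduce",                                 # 1. expected direction of change
    "magnitude": "33%",                                    # 2. expected magnitude of change
    "time_frame": "by 2020",                               # 3. expected time frame
    "target_population": "province of British Columbia",   # 4. target population
    "measurable": True,                                    # 5. the outcome can be measured
}

checks = {
    "direction specified": bool(objective.get("direction")),
    "magnitude specified": bool(objective.get("magnitude")),
    "time frame specified": bool(objective.get("time_frame")),
    "target population specified": bool(objective.get("target_population")),
    "outcome measurable": objective.get("measurable", False),
}

for characteristic, present in checks.items():
    print(f"{characteristic}: {'yes' if present else 'no'}")
```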

6. Which research design alternatives are desirable and appropriate?

Key to evaluating the effectiveness of a program are comparisons that allow us to estimate the incremental impacts
of the program, ideally over what would have happened if there had been no intervention. This is the attribution
question. In most evaluations, it is not feasible to conduct a randomized experiment—in fact, it is often not
feasible to find a control group. Under these conditions, if we want to assess program effectiveness, it is still
necessary to construct comparisons (e.g., among subgroups of program recipients who differ in their exposure to
the program) that permit some ways of estimating whether the program made a difference.

For evaluators, there are many issues that affect the evaluation design choices available. Among them are the
following:

Is it possible to identify one or more comparison groups that are either not affected by the program or
would be affected at a later time?
How large is the client base for the program? (This affects sampling and statistical options.)
Is the organization in which the program is embedded stable, or in a period of change? (This can affect the
feasibility of proceeding with the evaluation.)
How is the environment of this program different from other locales where a similar program has been
initiated?

Typically, evaluations involve constructing multiple comparisons using multiple research designs; it is unusual, for
example, for an evaluator to construct a design that relies on measuring just one outcome variable using one
research design. Instead, evaluations will identify a set of outcome (and output) variables. Usually, each outcome
variable will come with its own research design. For example, a policy of reducing alcohol-related fatal crashes on
British Columbia highways might focus on using coordinated police roadblocks and breathalyzer tests to affect the
likelihood that motorists will drink and drive. A key outcome variable would be a time series of (monthly) totals of
alcohol-related fatal crashes—data collected by the Insurance Corporation of British Columbia (ICBC). An
additional measure of success might be the cross-sectional survey-based perceptions of motorists in
jurisdictions in which the policy has been implemented. The two research designs—a single time series and a
cross-sectional survey design—have some complementary features that can strengthen the overall evaluation design.
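
For a single time series such as the monthly crash totals described above, the basic comparison is the level of the outcome before versus after the intervention, often estimated with a simple segmented (level-change) regression. The sketch below fits such a model by ordinary least squares on hypothetical monthly counts; it is not ICBC data.

```python
# Segmented regression sketch for a single interrupted time series.
# Monthly alcohol-related fatal crash counts are hypothetical, not ICBC data.
import numpy as np

pre  = [22, 25, 21, 24, 23, 26, 24, 22, 25, 23, 24, 26]   # 12 months before the policy
post = [19, 18, 20, 17, 19, 16, 18, 17, 18, 16, 17, 15]   # 12 months after the policy
y = np.array(pre + post, dtype=float)

t = np.arange(len(y))                      # time trend
step = (t >= len(pre)).astype(float)       # 0 before the policy, 1 after
X = np.column_stack([np.ones_like(t), t, step])

# Ordinary least squares: y = b0 + b1*time + b2*policy_in_effect
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"Estimated level change after the policy: {coef[2]:.1f} crashes per month")
```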

When we look at evaluation practice, many evaluations rely on research design options that do not have the
benefit of baselines or no-program comparison groups. These evaluations rely instead on a combination of
independent lines of evidence to construct a multifaceted picture of program operations and results.
Triangulating those results becomes a key part of assessing program effectiveness. An important consideration for
practitioners is knowing the strengths and weaknesses of different designs so that combinations of designs can be
chosen that complement each other (offsetting each other’s weaknesses where possible). We look at the strengths
and weaknesses of different research designs in Chapter 3.

7. What kind of environment does the program operate in, and how does that affect the
comparisons available to an evaluator?

Programs, as open systems, are always embedded in an environment. The ways that the environmental factors—
other programs, organizational leaders, other departments in the government, central agencies, funders, as well as
the economic, political, and social context—affect and are affected by a program are typically dynamic. Even if a
program is well established and the organization in which it is embedded is stable, these and other external
influences can affect how the program is implemented, as well as what it accomplishes. Many evaluators do not
have sufficient control in evaluation engagements to partial out all environmental factors, so qualitative
assessments, direct observation, experience, and judgment often play key roles in estimating (a) which factors, if
any, are in play for a program at the time it is evaluated and (b) how those factors affect the program process and
results. In sum, the comparisons that are appropriate for answering the evaluation questions are typically
conditioned by the contexts in which the program (and the evaluation) are embedded.

8. What information/data sources are available and appropriate, given the evaluation questions,
the program structure, the comparisons that would be appropriate, and the environment in which
the program operates?

In most evaluations, resources to collect data are quite limited, and many research design options that would be
desirable are simply not feasible. Given that, it is important to ask what data are available and how the constructs
in key evaluation questions would be measured, in conjunction with decisions about research designs. Research
design considerations (specifically, internal validity) can be used as a rationale for prioritizing additional data
collection.

Specific questions include the following:

What are the data (sources) that are currently available? (e.g., baseline data, other studies)
Are currently available data reliable and complete?
How can currently available data be used to validly measure constructs in the key evaluation questions?
Are data available that allow us to assess key environmental factors (qualitatively or quantitatively) that
would plausibly affect the program and its outcomes?
Will it be necessary for the evaluator to collect additional information to measure key constructs?
Given research design considerations, what are the highest priorities for collecting additional data?

The availability and quality of program performance data have the potential to assist evaluators in scoping an
evaluation project. Performance measurement systems that have been constructed for programs, policies, or
organizations are usually intended to periodically measure outputs and outcomes. For monitoring purposes, these
data are often arrayed in a time series format so that managers can monitor the trends and estimate whether
performance results are tracking in ways that suggest program effectiveness. Where performance targets have been
specified, the data can be compared periodically with the targets to see what the gaps are, if any.
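
The comparison of results against targets is straightforward arithmetic, as the brief sketch below illustrates with hypothetical quarterly figures.

```python
# Comparing periodic performance results against targets (hypothetical figures).
targets = {"Q1": 500, "Q2": 500, "Q3": 550, "Q4": 600}   # e.g., clients served per quarter
actuals = {"Q1": 480, "Q2": 510, "Q3": 530, "Q4": 560}

for quarter in targets:
    gap = actuals[quarter] - targets[quarter]
    status = "met" if gap >= 0 else "short"
    print(f"{quarter}: target {targets[quarter]}, actual {actuals[quarter]}, gap {gap:+d} ({status})")
```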

Some jurisdictions, including the federal government in Canada (Treasury Board of Canada Secretariat, 2016a, 2016b), have linked performance
data to program evaluations, with the stated goal of making performance results information—which is usually
intended for program managers—more useful for evaluations of program efficiency and effectiveness.

There is one more point to make with respect to potential data sources. Evaluations that focus a set of questions
on, for example, program effectiveness, program relevance, or program appropriateness, will usually break these
questions down further so that an evaluation question will yield several more specific subquestions that are tailored
to that evaluation. Collectively, answering these questions and subquestions is the agenda for the whole evaluation
project.

What can be very helpful is to construct a matrix/table that displays the evaluation questions and subquestions as
rows, and the prospective data sources or lines of evidence that will be used to address each question as columns.
In one table, then, stakeholders can see how the evaluation will address each question and subquestion. Given that
typical evaluations are about gathering and analyzing multiple lines of evidence, a useful practice is to make sure
that each evaluation subquestion is addressed by at least two lines of evidence. Lines of evidence typically include
administrative records, surveys, focus groups, stakeholder interviews, literature reviews/syntheses, and case studies
(which may involve direct observations).
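
A matrix like this can also be kept as a simple data structure and checked for coverage automatically. In the sketch below, the subquestions and lines of evidence are hypothetical; the check flags any subquestion supported by fewer than two lines of evidence.

```python
# Evaluation matrix sketch: subquestions (rows) mapped to lines of evidence (columns).
# Subquestions and evidence sources are hypothetical, for illustration only.

matrix = {
    "Is the program reaching its intended clients?": ["administrative records", "client survey"],
    "Are intended outcomes being achieved?": ["client survey", "stakeholder interviews", "literature synthesis"],
    "Is the program delivered as designed?": ["case study observations"],
}

for subquestion, evidence in matrix.items():
    flag = "" if len(evidence) >= 2 else "  <-- fewer than two lines of evidence"
    print(f"{subquestion} ({len(evidence)} lines){flag}")
```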

9. Given all the issues raised in Points 1 to 8, which evaluation strategy is most feasible and
defensible?

No evaluation design is unassailable. The important thing for evaluators is to be able to understand the underlying
logic of assessing the cause-and-effect linkages in an intended program structure, anticipate the key criticisms that
could be made, and have a response (quantitative, qualitative, or both) to each criticism.

Most of the work that we do as evaluators is not going to involve randomized controlled experiments or even
quasi-experiments, although some consider those to be the “gold standard” of rigorous social scientific research
(see, e.g., Cook et al., 2010; Donaldson, Christie, & Mark, 2014; Lipsey, 2000). Views of what constitutes sound
evaluation practice are far more diverse than that, but expectations of experimental rigor can still become an issue
for a particular evaluation, depending on the background or interests of the persons or organizations who might
criticize your work. It is essential to
understand the principles of rigorous evaluations to be able to proactively acknowledge limitations in an evaluation
strategy. In Chapter 3, we will introduce the four kinds of validity that have been associated with a structured,
quantitative approach to evaluation that focuses on discerning the key cause-and-effect relationships in a policy or
program. Ultimately, evaluators must make some hard choices and be prepared to accept the fact that their work
can—and probably will—be criticized, particularly for high-stakes summative evaluations.

10. Should the evaluation be undertaken?

The final question in an assessment of evaluation feasibility is whether to proceed with the actual evaluation. It is
possible that after having looked at the mix of

evaluation issues,
resource constraints,
organizational and political issues (including the stability of the program), and
research design options and measurement constraints,

the evaluator preparing the assessment recommends that no evaluation be done at this time. Although a rare
outcome of the evaluation assessment phase, it does happen, and it can save an organization considerable time and
effort that probably would not have yielded a credible product.

Evaluator experience is key to being able to negotiate a path that permits designing a credible evaluation project.
Evaluator judgment is an essential part of considering the requirements for a defensible study, and making a
recommendation to either proceed or not.

Doing the Evaluation

Up to this point, we have outlined a planning and assessment process for conducting program evaluations. That
process entails enough effort to be able to make an informed decision about proceeding or not with an evaluation.
The work also serves as a substantial foundation for the evaluation, if it goes ahead. If a decision is made to
proceed with the evaluation and if the methodology has been determined during the feasibility stage, there are five
more steps that are common to most evaluations.

1. Develop the measures, and pre-test them.

Evaluations typically rely on a mix of existing and evaluation-generated data sources. If performance data are
available, it is essential to assess how accurate and complete they are before committing to using them. As well,
relying on administrative databases can be an advantage or a cost, depending on how complete and accessible
those data are.

For data collection conducted by the evaluator or other stakeholders (sometimes, the client will collect some of the
data, and the evaluators will collect other lines of evidence), instruments will need to be designed. Surveys are a
common means of collecting new data, and we will include information on designing and implementing surveys
in Chapter 4 of this textbook.

For data collection instruments that are developed by the evaluators (or are adapted from some other application),
pre-testing is important. As an evaluation team, you usually have one shot at collecting key lines of evidence. To
have one or more data collection instruments that are flawed (e.g., questions are ambiguous, questions are not
ordered appropriately, some key questions are missing, some questions are redundant, or the instrument is too
long) undermines the whole evaluation. Pre-testing need not be elaborate; usually, asking several persons to
complete an instrument and then debriefing them will reveal most problems.

Some methodologists advocate an additional step: piloting the data collection instruments once they are pre-
tested. This usually involves taking a small sample of persons who would actually be included in the evaluation as
participants and asking them to complete the instruments. This step is most useful in situations in which survey
instruments have been designed to include open-ended questions—these questions can generate very useful data
but are time-consuming to code later on. A pilot test can generate a range of open-ended responses that can be
used to develop semi-structured response frames for those questions. Although some respondents in the full survey
will offer open-ended comments that are outside the range of those in the pilot test, the pre-coded options will
capture enough to make the coding process less time-consuming.
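
One way to turn pilot responses into pre-coded options is simply to tally the themes that appear and keep the most frequent ones, with an "other" category for the remainder. A minimal sketch, assuming the evaluator has already tagged each pilot response with a rough theme:

```python
# Building pre-coded response options from pilot open-ended answers (hypothetical data).
from collections import Counter

# Pilot responses already tagged with a rough theme by the evaluator.
pilot_themes = ["wait times", "staff courtesy", "wait times", "cost", "wait times",
                "staff courtesy", "accessibility", "cost", "wait times"]

theme_counts = Counter(pilot_themes)

# Keep the most common themes as pre-coded options; the rest fall into "other (please specify)".
precoded_options = [theme for theme, _ in theme_counts.most_common(3)] + ["other (please specify)"]
print(precoded_options)
```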

2. Collect the data/lines of evidence that are appropriate for answering the evaluation questions.

Collecting data from existing data sources requires both patience and thoroughness. Existing records, files,
spreadsheets, or other sources of secondary (existing) data can be well organized or not. In some evaluations the
consultants discover, after having signed a contract that made some assumptions about the condition of existing
data sources, that there are unexpected problems with the data files. Missing records, incomplete records, or
inconsistent information can increase data collection time and even limit the usefulness of whole lines of evidence.

One of the authors was involved in an evaluation of a regional (Canadian) federal-provincial economic
development program in which the consulting company that won the contract counted on project records being
complete and easily accessible. When they were not, the project methodology had to be adjusted, and costs to the
consultants increased. A disagreement developed around who should absorb the costs, and the evaluation process
narrowly avoided litigation.

Collecting data through the efforts of the evaluation team or their subcontractors also requires a high level of
organization and attention to detail. Surveying is a principal means of collecting evaluation-related data from
stakeholders. Good survey techniques (in addition to having a defensible way to sample from populations) involve
sufficient follow-up to help ensure that response rates are acceptable. Often, surveys do not achieve response rates
higher than 50%. (Companies that specialize in doing surveys usually get better response rates than that.) If
inferential statistics are being used to generalize from survey samples to populations, lower response rates weaken
any generalizations. A significant problem now is that people increasingly feel they are oversurveyed. This can
mean that response rates will be lower than they have been historically. In evaluations where resources are tight,
evaluators may have to accept lower response rates, compensating for that (to some extent) by
having multiple lines of evidence that offer opportunities to triangulate findings.
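
It can help to see how response rates and sample sizes translate into precision. The sketch below computes a response rate and a conservative 95% margin of error for a proportion, using hypothetical figures; note that no margin of error accounts for nonresponse bias.

```python
# Response rate and approximate 95% margin of error for a proportion (hypothetical figures).
import math

invited = 800
completed = 360
response_rate = completed / invited
print(f"Response rate: {response_rate:.0%}")

# Conservative margin of error (p = 0.5) for the completed sample.
p = 0.5
moe = 1.96 * math.sqrt(p * (1 - p) / completed)
print(f"Approximate 95% margin of error: +/- {moe:.1%}")
# A low response rate also raises nonresponse bias, which no margin of error captures.
```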

3. Analyze the data, focusing on answering the evaluation questions.

Data analysis can be quantitative (involves working with variables that are represented numerically) or qualitative
(involves analysis of words, documents, text, and other non-numerical representations of information, including
direct observations). Most evaluations use combinations of qualitative and quantitative data. Mixed methods have
become the dominant approach for doing evaluations, following the trend in social science research more generally
(Creswell & Plano Clark, 2017).

Quantitative data facilitate numerical comparisons and are important for estimates of technical efficiency, cost-
effectiveness, and the costs and benefits of a program. In many governmental settings, performance measures tend
to be quantitative, facilitating comparisons between annual targets and actual results. Qualitative data are valuable
as a way of describing policy or program processes and impacts, using cases or narratives to offer in-depth
understanding of how the program operates and how it affects stakeholders and clients. Open-ended questions can
provide the opportunity for clients to offer information that researchers may not have thought to ask for in the
evaluation.

A general rule that should guide all data analysis is to employ the least complex method that will fit the situation.
One of the features of early evaluations based on models of social experimentation was the reliance on
sophisticated, multivariate statistical models to analyze program evaluation data. Although that strategy addressed
possible criticisms by scholars, it often produced reports that were inaccessible, or perceived as untrustworthy from
a user’s perspective because they could not be understood. More recently, program evaluators have adopted mixed
strategies for analyzing data, which rely on statistical tools where necessary, but also incorporate visual/graphic
representations of findings.

In this book, we will not cover data analysis methods in detail. References to statistical methods are in Chapter 3
(research designs) and in Chapter 4 (measurement). In Chapter 3, key findings from examples of actual program
evaluations are displayed and interpreted. In an appendix to Chapter 3, we summarize basic statistical tools and
the conditions under which they are normally used. In Chapter 5 (qualitative evaluation methods), we cover the
fundamentals of qualitative data analysis as well as mixed-methods evaluations, and in Chapter 6, in connection
with needs assessments, we introduce some basics of sampling and generalizing from sample findings to
populations.

4. Write, review, and finalize the report.

Evaluations are often conducted in situations in which stakeholders will have different views of the effectiveness of
the program. Where the main purpose for the evaluation is to make judgments about the merit or worth of the
program, evaluations can be contentious.

A steering committee that serves as a sounding board/advisory body for the evaluation is an important part of
guiding the evaluation. This is particularly valuable when evaluation reports are being drafted. Assuming that
defensible decisions have been made around methodologies, data collection, and analysis strategies, the first draft
of an evaluation report will represent a synthesis of lines of evidence and an overall interpretation of the
information that is gathered. It is essential that the synthesis of evidence address the evaluation questions that
motivated the project. In addressing the evaluation questions, evaluators will be exercising their judgment.
Professional judgment is conditioned by knowledge, values, beliefs, and experience and can mean that members of
the evaluation team will have different views on how the evaluation report should be drafted.

Working in a team makes it possible for evaluators to share perspectives, including the responsibility for writing
the report. Equally important is some kind of challenge process that occurs as the draft report is completed and
reviewed. Challenge functions can vary in formality, but the basic idea is that the draft report is critically reviewed
by persons who have not been involved in conducting the evaluation. In the audit community, for example, it is
common for draft audit reports to be discussed in depth by a committee of peers in the audit organization who
have not been involved in the audit. The idea is to anticipate criticisms of the report and make changes that are
needed, producing a product behind which the audit office will stand. Credibility is a key asset for individuals and
organizations in the audit community, generally.

In the evaluation community, the challenge function is often played by the evaluation steering committee.
Membership of the committee can vary but will typically include external expertise, as well as persons who have a
stake in the program or policy. Canadian federal departments and agencies use blind peer review of evaluation-
related products (draft final reports, methodologies, and draft technical reports) to obtain independent assessments
of the quality of evaluation work. Depending on the purposes of the evaluation, reviews of the draft report by
members of the steering committee can be contentious. One issue for executives who are overseeing the evaluation
of policies is to anticipate possible conflicts of interest by members of steering committees.

In preparing an evaluation report, a key part is the recommendations that are made. Here again, professional
judgment plays a key role; recommendations must not only be backed up by evidence but also be appropriate,
given the context for the evaluation. Making recommendations that reflect key evaluation conclusions and are
feasible is a skill that is among the most valuable that an evaluator can develop.

Although each program evaluation report will have unique requirements, there are some general guidelines that
assist in making reports readable, understandable, and useful:

Rely on visual representations of findings and conclusions where possible.
Use clear, simple language in the report.
Use more headings and subheadings, rather than fewer, in the report.
Prepare a clear, concise executive summary.
Structure the report so that it reflects the evaluation questions and subquestions that are driving the
evaluation—once the executive summary, table of contents, lists of figures and tables, the introductory
section of the report, and the methodology section of the report have been written, turn to the evaluation
questions, and for each one, discuss the findings from the relevant lines of evidence.
Conclusions should synthesize the findings for each evaluation question and form the basis for any
recommendations that are written.
Be prepared to edit or even seek professional assistance to edit the penultimate draft of the report before
finalizing it.

5. Disseminate the report.

Evaluators have an obligation to produce a report and make a series of presentations of the findings, conclusions,
and recommendations to key stakeholders, including the clients of the evaluation. There are different views of how
much interaction is appropriate between evaluators and clients. One view, articulated by Michael Scriven (1997),
is that program evaluators should be very careful about getting involved with their clients; interaction at any stage
in an evaluation, including postreporting, can compromise their objectivity. Michael Patton (2008), by contrast,
argues that unless program evaluators get involved with their clients, evaluations are not likely to be used.

The degree and types of interactions between evaluators and clients/managers will depend on the purposes of the
evaluation. For evaluations that are intended to recommend incremental changes to a policy or program, manager
involvement will generally not compromise the validity of the evaluation products. But for evaluations in which
major decisions that could affect the existence of the program are in the offing, it is important to assure evaluator
independence. We discuss these issues in Chapters 11 and 12 of this textbook.

Making Changes Based on the Evaluation


Evaluations can and hopefully do become part of the process of making changes in the programs or the
organization in which they operate. Where they are used, evaluations tend to result in incremental changes, if any
changes can be attributed to the evaluation. It is quite rare for an evaluation to result in the elimination of a
program, even though summative evaluations are often intended to raise this question (Weiss, 1998a).

The whole issue of whether and to what extent evaluations are used continues to be an important topic in the
field. Although there is clearly a view that the quality of an evaluation rests on its methodological defensibility
(Fitzpatrick, 2002), many evaluators have taken the view that evaluation use is a more central objective for doing
evaluations (Amo & Cousins, 2007; Fleischer & Christie, 2009; Leviton, 2003; Mark & Henry, 2004; Patton,
2008). The following are possible changes based on evaluations:

Making incremental changes to the design of an existing policy or program
Making incremental changes to the way the existing policy or program is implemented
Increasing the scale of the policy or program
Increasing the scope of the policy or program
Downsizing the policy or program
Replacing the policy or program
Eliminating the policy or program

These changes would reflect instrumental uses of evaluations (direct uses of evaluation products). In addition,
there are conceptual uses (the knowledge from the evaluation becomes part of the background in the organization
and influences other programs at other times) and symbolic uses (the evaluation is used to rationalize or legitimate
decisions made for political reasons) (Kirkhart, 2000; Højlund, 2014; Weiss, 1998b). More recently, uses have
been broadened to include process uses (effects of the process of doing an evaluation) and misuses of evaluations
(Alkin & King, 2016; Alkin & King, 2017).

Some jurisdictions build in a required management response to program evaluations. The federal government of
Canada, for example, requires the program being evaluated to respond to the report with a management response
that addresses each recommendation, indicates whether the program agrees with it (and, if not, why not), and,
where it does agree, sets out the actions that will be taken to implement the recommendation (Treasury Board of
Canada Secretariat, 2016a, 2016b). This process is intended to ensure that there is instrumental use of each evaluation report.

Evaluations are one source of information in policy and program decision making. Depending on the context,
evaluation evidence may be a key part of decision making or may be one of a number of factors that are taken into
account (Alkin & King, 2017).

Evaluation as Piecework: Working Collaboratively With Clients and Peers

In this chapter, we have outlined a process for designing and conducting evaluations, front to back. But evaluation engagements with
clients can divide up projects so that the work is distributed. For example, in-house evaluators may do the overall design for the project,
including specifying the evaluation questions, the lines of evidence, and perhaps even the methodologies for gathering the evidence. The
actual data collection, analysis, and report writing may be contracted out to external evaluators. Working collaboratively in such settings,
where one or more stages in a project are shared, needs to be balanced with evaluator independence. Competent execution of specific tasks
is part of what is expected in today’s evaluation practice, particularly where clients have their own in-house evaluation capacity. In
Chapter 12, we talk about the importance of teamwork in evaluation—teams can include coworkers and people from other organizations
(including client organizations).

Summary
This book is intended for persons who want to learn the principles and the essentials of the practice of program evaluation and
performance measurement. The core of this book is our focus on evaluating the effectiveness of policies and programs. This includes an
emphasis on understanding the difference between outcomes that occur due to a program and outcomes that may have changed over time
due to factors other than the program (that is, the counterfactual). We believe that is what distinguishes evaluation from other related
fields. Given the diversity of the field, it is not practical to cover all the approaches and issues that have been raised by scholars and
practitioners in the past 40-plus years. Instead, this book adopts a stance with respect to several key issues that continue to be debated in
the field.

First, we approach program evaluation and performance measurement as two complementary ways of creating information that are
intended to reduce uncertainties for those who are involved in making decisions about programs or policies. We have structured the
textbook so that methods and practices of program evaluation are introduced first and then are adapted to performance measurement—
we believe that sound performance measurement practice depends on an understanding of program evaluation core knowledge and skills.

Second, our focus on program effectiveness is systematic. Understanding the logic of causes and effects as it is applied to evaluating the
effectiveness of programs is important and involves learning key features of experimental and quasi-experimental research designs; we
discuss this in Chapter 3.

Third, the nature of evaluation practice is such that all of us who have participated in program evaluations understand the importance of
values, ethics, and judgment calls. Programs are embedded in values and are driven by values. Program objectives are value statements—
they state what programs should do. The evaluation process, from the initial step of deciding to proceed with an evaluation assessment to
framing and reporting the recommendations, is informed by our own values, experiences, beliefs, and expectations. Methodological tools
provide us with ways of disciplining our judgment and rendering key steps in ways that are transparent to others, but many of these tools
are designed for social science research applications. In many program evaluations, resource and contextual constraints mean that the tools
we apply are not ideal for the situation at hand. Also, more and more, evaluators must consider issues such as organizational culture,
political culture, social context, and the growing recognition of the importance of “voice” for groups of people who have been
marginalized.

That is, there is more to evaluation than simply determining whether a program or policy is “effective.” Effective for whom? There is
growing recognition that as a profession, evaluators have an influence in making sure voices are equitably heard. Learning some of the
ways in which we can cultivate good professional judgment is a principal topic in Chapter 12 (the nature and practice of professional
judgment). Professional judgment is both about disciplining our own role in evaluation practice as well as becoming more self-aware (and
ethical) as practitioners.

Fourth, the importance of program evaluation and performance measurement in contemporary public and nonprofit organizations is
related to a continuing, broad international movement to manage for results. Performance management depends on having credible
information about how well programs and policies have been implemented and how effectively and efficiently they have performed.
Understanding how program evaluation and performance measurement fit into the performance management cycle and how evaluation
and program management work together in organizations is a theme that runs through this textbook.

Discussion Questions
1. As you were reading Chapter 1, what five ideas about the practice of program evaluation were most important for you?
Summarize each idea in a couple of sentences and keep them so that you can check on your initial impressions of the textbook as
you cover other chapters in the book.
2. Read the table of contents for this textbook and, based on your own background and experience, explain what you anticipate will
be the easiest parts of this book for you to understand. Why?
3. Again, having looked over the table of contents, which parts of the book do you think will be most challenging for you to learn?
Why?
4. Do you consider yourself to be a “words” person—that is, you are most comfortable with written and spoken language; a
“numbers” person—that is, you are most comfortable with numerical ways of understanding and presenting information; or
“both”—that is, you are comfortable combining qualitative and quantitative information?
5. Find a classmate who is willing to discuss Question 4 with you. Find out from each other whether you share a “words,”
“numbers,” or a “both” preference. Ask each other why you seem to have the preferences you do. What is it about your
background and experiences that may have influenced you?
6. What do you expect to get out of this textbook for yourself? List four or five goals or objectives for yourself as you work with the
contents of this textbook. An example might be, “I want to learn how to conduct evaluations that will get used by program
managers.” Keep them so that you can refer to them as you read and work with the contents of the book. If you are using this
textbook as part of a course, take your list of goals out at about the halfway point in the course and review them. Are they still
relevant, or do they need to be revised? If so, revise them so that you can review them once more as the course ends. For each of
your own objectives, how well do you think you have accomplished that objective?
7. What do you think it means to be objective? Do you think it is possible to be objective in the work we do as evaluators? In
anything we do? Offer some examples of reasons why you think it is possible to be objective (or not).

References
Alkin, M. C., & King, J. A. (2017). Definitions of evaluation use and misuse, evaluation influence, and factors
affecting use. American Journal of Evaluation, 38(3), 434–450.

Alkin, M. C., & King, J. A. (2016). The historical development of evaluation use. American Journal of Evaluation,
37(4), 568–579.

Amo, C., & Cousins, J. B. (2007). Going through the process: An examination of the operationalization of
process use in empirical research on evaluation. New Directions for Evaluation, 116, 5–26.

Ariel, B. (2016). The puzzle of police body cams: Body-worn cameras give mixed results, and we don’t know why.
IEEE Spectrum, 53(7), 32–37.

Ariel, B., Farrar, W. A., & Sutherland, A. (2015). The effect of police body-worn cameras on use of force and
citizens’ complaints against the police: A randomized controlled trial. Journal of Quantitative Criminology,
31(3), 509–535.

Ariel, B., Sutherland, A., Henstock, D., Young, J., Drover, P., Sykes, J., . . . & Henderson, R. (2016). Wearing
body cameras increases assaults against officers and does not reduce police use of force: Results from a global
multi-site experiment. European Journal of Criminology, 13(6), 744–755.

Ariel, B., Sutherland, A., Henstock, D., Young, J., Drover, P., Sykes, J., . . . & Henderson, R. (2017). “Contagious
accountability”: A global multisite randomized controlled trial on the effect of police body-worn cameras on
citizens’ complaints against the police. Criminal Justice and Behavior, 44(2), 293–316.

Ariel, B., Sutherland, A., Henstock, D., Young, J., Drover, P., Sykes, J., . . . & Henderson, R. (2018a). Paradoxical
effects of self-awareness of being observed: Testing the effect of police body-worn cameras on assaults and
aggression against officers. Journal of Experimental Criminology, 14(1), 19–47.

Ariel, B., Sutherland, A., Henstock, D., Young, J., & Sosinski, G. (2018b). The deterrence spectrum: Explaining
why police body-worn cameras ‘work’ or ‘backfire’ in aggressive police–public encounters. Policing: A Journal of
Policy and Practice, 12(1), 6–26.

Anderson, L. M., Fielding, J. E., Fullilove, M. T., Scrimshaw, S. C., & Carande-Kulis, V. G. (2003). Methods for
conducting systematic reviews of the evidence of effectiveness and economic efficiency of interventions to
promote healthy social environments. American Journal of Preventive Medicine, 24(3 Suppl.), 25–31.

Arnaboldi, M., Lapsley, I., & Steccolini, I. (2015). Performance management in the public sector: The ultimate
challenge. Financial Accountability & Management, 31(1), 1–22.

Bamberger, M., Rugh, J., Church, M., & Fort, L. (2004). Shoestring evaluation: Designing impact evaluations
under budget, time and data constraints. American Journal of Evaluation, 25(1), 5–37.

Barber, M. (2015). How to run a government: So that citizens benefit and taxpayers don’t go crazy. London, UK:
Penguin.

Barber, M., Moffit, A., & Kihn, P. (2011). Deliverology 101: A field guide for educational leaders. Thousand Oaks,
CA: Corwin.

Bickman, L. (1996). A continuum of care. American Psychologist, 51(7), 689–701.

Boulmetis, J., & Dutwin, P. (2000). The ABC’s of evaluation: Timeless techniques for program and project managers.
San Francisco, CA: Jossey-Bass.

Bryson, J. M., Crosby, B. C., & Bloomberg, L. (2014). Public value governance: Moving beyond traditional
public administration and the new public management. Public Administration Review, 74(4), 445–456.

Campbell Collaboration. (2018). Our vision, mission and key principles. Retrieved from
https://www.campbellcollaboration.org/about-campbell/vision-mission-and-principle.html

Center for Evidence-Based Crime Policy at George Mason University (2016). Retrieved from
http://cebcp.org/technology/body-cameras

Century, J., Rudnick, M., & Freeman, C. (2010). A framework for measuring fidelity of implementation: A
foundation for shared language and accumulation of knowledge. American Journal of Evaluation, 31(2),
199–218.

Chelimsky, E. (1997). The coming transformations in evaluation. In E. Chelimsky & W. R. Shadish (Eds.),
Evaluation for the 21st century: A handbook (pp. ix–xii). Thousand Oaks, CA: Sage.

Chen, H.-T. (1996). A comprehensive typology for program evaluation. Evaluation Practice, 17(2), 121–130.

Cochrane Collaboration. (2018). About us. Retrieved from www.cochrane.org/about-us. Also: Cochrane handbook
for systematic reviews of interventions, retrieved from http://training.cochrane.org/handbook

Coen, D., & Roberts, A. (2012). A new age of uncertainty. Governance, 25(1), 5–9.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings.
Chicago, IL: Rand-McNally.

Cook, T. D., Scriven, M., Coryn, C. L., & Evergreen, S. D. (2010). Contemporary thinking about causation in
evaluation: A dialogue with Tom Cook and Michael Scriven. American Journal of Evaluation, 31(1), 105–117.

Coryn, C. L., Schröter, D. C., Noakes, L. A., & Westine, C. D. (2011). A systematic review of theory-driven
evaluation practice from 1990 to 2009. American Journal of Evaluation, 32(2), 199–226.

Creswell, J. W., & Creswell, J. D. (2017). Research design: Qualitative, quantitative, and mixed methods approaches.
Thousand Oaks: Sage.

Creswell, J. W., & Plano Clark, V. (2017). Designing and conducting mixed methods research (3rd ed.). Thousand
Oaks, CA: Sage.

Cubitt, T. I., Lesic, R., Myers, G. L., & Corry, R. (2017). Body-worn video: A systematic review of literature.
Australian & New Zealand Journal of Criminology, 50(3), 379–396.

Curristine, T. (2005). Government performance: Lessons and challenges. OECD Journal on Budgeting, 5(1),
127–151.

de Lancer Julnes, P., & Steccolini, I. (2015). Introduction to symposium: Performance and accountability in
complex settings—Metrics, methods, and politics. International Review of Public Administration, 20(4),
329–334.

Donaldson, S. I. (2007). Program theory-driven evaluation science. New York, NY: Lawrence Erlbaum.

Donaldson, S. I., Christie, C. A., & Mark, M. M. (Eds.). (2014). Credible and actionable evidence: The foundation
for rigorous and influential evaluations. Los Angeles, CA: Sage.

Donaldson, S. I., & Picciotto, R. (Eds.). (2016). Evaluation for an equitable society. Charlotte, NC: Information
Age Publishing.

Dunleavy, P., Margetts, H., Bestow, S., & Tinkler, J. (2006). New public management is dead—Long live digital-
era governance. Journal of Public Administration Research and Theory, 16(3), 467–494.

Farrar, W. (2013). Self-awareness to being watched and socially-desirable behavior: A field experiment on the effect of
body-worn cameras and police use-of-force. Washington, DC: Police Foundation.

Fitzpatrick, J. (2002). Dialogue with Stewart Donaldson. American Journal of Evaluation, 23(3), 347–365.

Fleischer, D., & Christie, C. (2009). Evaluation use: Results from a survey of U.S. American Evaluation
Association members. American Journal of Evaluation, 30(2), 158–175.

Funnell, S., & Rogers, P. (2011). Purposeful program theory: Effective use of theories of change and logic models. San
Francisco, CA: Jossey-Bass.

Gaub, J. E., Choate, D. E., Todak, N., Katz, C. M., & White, M. D. (2016). Officer perceptions of body-worn
cameras before and after deployment: A study of three departments. Police Quarterly, 19(3), 275–302.

Gauthier, B., Barrington, G. V., Bozzo, S. L., Chaytor, K., Dignard, A., Lahey, R., . . . Roy, S. (2009). The lay of
the land: Evaluation practice in Canada in 2009. The Canadian Journal of Program Evaluation, 24(1), 1–49.

Gilmour, J. B. (2007). Implementing OMB’s Program Assessment Rating Tool (PART): Meeting the challenges
of integrating budget and performance. OECD Journal on Budgeting, 7(1), 1C.

Government of British Columbia. (2007). Greenhouse Gas Reduction Targets Act. British Columbia: Queen’s
Printer. Retrieved from
http://www.bclaws.ca/EPLibraries/bclaws_new/document/ID/freeside/00_07042_01#section12

Government of British Columbia. (2014). Greenhouse Gas Industrial Reporting and Control Act. Retrieved from
http://www.bclaws.ca/civix/document/id/lc/statreg/14029_01

Government of British Columbia. (2016). Climate leadership. Victoria, BC: Government of British Columbia.
Retrieved from http://climate.gov.bc.ca

Government of Canada. (2018). Guide to rapid impact evaluation. Retrieved from
https://www.canada.ca/en/treasury-board-secretariat/services/audit-evaluation/centre-excellence-
evaluation/guide-rapid-impact-evaluation.html

Greiling, D., & Halachmi, A. (2013). Accountability and organizational learning in the public sector. Public
Performance & Management Review, 36(3), 380–406.

Hedberg, E., Katz, C. M., & Choate, D. E. (2017). Body-worn cameras and citizen interactions with police
officers: Estimating plausible effects given varying compliance levels. Justice Quarterly, 34(4), 627–651.

HM Treasury, Government of the United Kingdom. (2011). Magenta book: Guidance for evaluation. Retrieved
from https://www.gov.uk/government/publications/the-magenta-book

Højlund, S. (2014). Evaluation use in the organizational context–changing focus to improve theory. Evaluation,
20(1), 26–43.

Hood, C. (1991). A public management for all seasons? Public Administration, 69(1), 3–19.

House, E. R. (2016). The role of values and evaluation in thinking. American Journal of Evaluation, 37(1),
104–108.

Hunter, D., & Nielsen, S. (2013). Performance management and evaluation: Exploring complementarities. New
Directions in Evaluation, 137, 7–17.

Jennings, W. G., Fridell, L. A., & Lynch, M. D. (2014). Cops and cameras: Officer perceptions of the use of
body-worn cameras in law enforcement. Journal of Criminal Justice, 42(6), 549–556.

Joyce, P. G. (2011). The Obama administration and PBB: Building on the legacy of federal performance-
informed budgeting? Public Administration Review, 71(3), 356–367.

Kirkhart, K. E. (2000). Reconceptualizing evaluation use: An integrated theory of influence. New Directions for
Evaluation, 88, 5–23.

Knowlton, L. W., & Phillips, C. C. (2009). The logic model guidebook. Thousand Oaks, CA: Sage.

Krause, D. R. (1996). Effective program evaluation: An introduction. Chicago, IL: NelsonHall.

Kroll, A. (2015). Drivers of performance information use: Systematic literature review and directions for future
research. Public Performance & Management Review, 38(3), 459–486.

Leviton, L. C. (2003). Evaluation use: Advances, challenges and applications. American Journal of Evaluation,
24(4), 525–535.

Lipsey, M. W. (2000). Method and rationality are not social diseases. American Journal of Evaluation, 21(2),
221–223.

Lindquist, E. A., & Huse, I. (2017). Accountability and monitoring government in the digital era: Promise,
realism and research for digital-era governance. Canadian Public Administration, 60(4), 627–656.

Lum, C., Koper, C., Merola, L., Scherer, A., & Reioux, A. (2015). Existing and ongoing body worn camera research:
Knowledge gaps and opportunities. Fairfax, VA: George Mason University.

Mahler, J., & Posner, P. (2014). Performance movement at a crossroads: Information, accountability and
learning. International Review of Public Administration, 19(2), 179–192.

Majone, G. (1989). Evidence, argument, and persuasion in the policy process. London, UK: Yale University Press.

Mark, M. M., & Henry, G. T. (2004). The mechanisms and outcomes of evaluation influence. Evaluation, 10(1),
35–57.

Maskaly, J., Donner, C., Jennings, W. G., Ariel, B., & Sutherland, A. (2017). The effects of body-worn cameras
(BWCs) on police and citizen outcomes: A state-of-the-art review. Policing: An International Journal of Police
Strategies & Management, 40(4), 672–688.

Mayne, J. (2001). Addressing attribution through contribution analysis: Using performance measures sensibly.
Canadian Journal of Program Evaluation, 16(1), 1–24.

Mayne, J. (2011). Contribution analysis: Addressing cause and effect. In K. Forss, M. Marra, & R. Schwartz
(Eds.), Evaluating the complex: Attribution, contribution, and beyond: Comparative policy evaluation (Vol. 18, pp.
53–96). New Brunswick, NJ: Transaction.

Mayne, J., & Rist, R. C. (2006). Studies are not enough: The necessary transformation of evaluation. Canadian
Journal of Program Evaluation, 21(3), 93–120.

McDavid, J. C. (2001). Program evaluation in British Columbia in a time of transition: 1995–2000. Canadian
Journal of Program Evaluation, 16(Special Issue), 3–28.

McDavid, J. C., & Huse, I. (2006). Will evaluation prosper in the future? Canadian Journal of Program
Evaluation, 21(3), 47–72.

McDavid, J. C., & Huse, I. (2012). Legislator uses of public performance reports: Findings from a five-year study.
American Journal of Evaluation, 33(1), 7–25.

Melkers, J., & Willoughby, K. (2004). Staying the course: The use of performance measurement in state governments.
Washington, DC: IBM Center for the Business of Government.

Melkers, J., & Willoughby, K. (2005). Models of performance-measurement use in local government:
Understanding budgeting, communication, and lasting effects. Public Administration Review, 65(2), 180–190.

Moynihan, D. P. (2006). Managing for results in state government: Evaluating a decade of reform. Public
Administration Review, 66(1), 77–89.

Moynihan, D. P. (2013). The new federal performances system: Implementing the new GPRA Modernization Act.
Washington, DC: IBM Center for the Business of Government.

Nagarajan, N., & Vanheukelen, M. (1997). Evaluating EU expenditure programs: A guide. Luxembourg:
Publications Office of the European Union.

Newcomer, K., & Brass, C. T. (2016). Forging a strategic and comprehensive approach to evaluation within
public and nonprofit organizations: Integrating measurement and analytics within evaluation. American Journal
of Evaluation, 37(1), 80–99.

Nix, J., & Wolfe, S. E. (2016). Sensitivity to the Ferguson effect: The role of managerial organizational justice.
Journal of Criminal Justice, 47, 12–20.

OECD. (2015). Achieving public sector agility at times of fiscal consolidation, OECD Public Governance Reviews.
Paris, France: OECD Publishing.

Office of Management and Budget. (2012). Office of Management and Budget [Obama archives]. Retrieved from
https://obamawhitehouse.archives.gov/omb/organization_mission/

Office of Management and Budget. (2018). Office of Management and Budget. Retrieved from
https://www.whitehouse.gov/omb

Osborne, D., & Gaebler, T. (1992). Reinventing government: How the entrepreneurial spirit is transforming the
public sector. Reading, MA: Addison-Wesley.

Osborne, S. P. (Ed.). (2010). The new public governance: Emerging perspectives on the theory and practice of public
governance. London, UK: Routledge.

Owen, J. M., & Rogers, P. J. (1999). Program evaluation: Forms and approaches (International ed.). London,
England: Sage.

Patton, M. Q. (1994). Developmental evaluation. Evaluation Practice, 15(3), 311–319.

Patton, M. Q. (2008). Utilization focused evaluation (4th ed.). Thousand Oaks, CA: Sage.

Patton, M. Q. (2011). Developmental evaluation: Applying complexity to enhance innovation and use. New York,
NY: Guilford Press.

Picciotto, R. (2011). The logic of evaluation professionalism. Evaluation, 17(2), 165–180.

Pollitt, C., & Bouckaert, G. (2011). Public management reform (2nd and 3rd ed.). Oxford, UK: Oxford University
Press.

Public Safety Canada. (2018). Searchable website: https://www.publicsafety.gc.ca/

Radin, B. (2006). Challenging the performance movement: Accountability, complexity, and democratic values.
Washington, DC: Georgetown University Press.

Rolston, H., Geyer, J., & Locke, G. (2013). Final report: Evaluation of the Homebase Community Prevention
Program. New York, NY: ABT Associates. Retrieved from
http://www.abtassociates.com/AbtAssociates/files/cf/cf819ade-6613–4664–9ac1–2344225c24d7.pdf

Room, G. (2011). Complexity, institutions and public policy: Agile decision making in a turbulent world.
Cheltenham, UK: Edward Elgar.

Rowe, A. (2014). Introducing Rapid Impact Evaluation (RIE): Expert lecture. Retrieved from
https://evaluationcanada.ca/distribution/20130618_rowe_andy.pdf

Rutman, L. (1984). Introduction. In L. Rutman (Ed.), Evaluation research methods: A basic guide (Sage Focus
Editions Series, Vol. 3, 2nd ed., pp. 9–38). Beverly Hills, CA: Sage.

Schwandt, T. (2015). Evaluation foundations revisited: Cultivating a life of the mind for practice. Stanford, CA:
Stanford University Press.

Scriven, M. (1967). The methodology of evaluation. In R. Tyler, R. Gagne, & M. Scriven (Eds.), Perspectives of
curriculum evaluation (AERA Monograph Series—Curriculum Evaluation, pp. 39–83). Chicago, IL: Rand
McNally.

Scriven, M. (1991). Beyond formative and summative evaluation. In M. W. McLaughlin & D. C. Phillips (Eds.),
Evaluation and education: At quarter century (pp. 18–64). Chicago, IL: University of Chicago Press.

Scriven, M. (1996). Types of evaluation and types of evaluator. Evaluation Practice, 17(2), 151–161.

Scriven, M. (1997). Truth and objectivity in evaluation. In E. Chelimsky & W. R. Shadish (Eds.), Evaluation for
the 21st century: A handbook (pp. 477–500). Thousand Oaks, CA: Sage.

Scriven, M. (2008). A summative evaluation of RCT methodology & an alternative approach to causal research.
Journal of Multidisciplinary Evaluation, 5(9), 11–24.

Shaw, I. (2000). Evaluating public programmes: Contexts and issues. Burlington, VT: Ashgate.

Shaw, T. (2016). Performance budgeting practices and procedures. OECD Journal on Budgeting, 15(3), 1–73.

Stockmann, R., & Meyer, W. (Eds.). (2016). The future of evaluation: Global trends, new challenges and shared
perspectives. London, UK: Palgrave Macmillan.

Szanyi, M., Azzam, T., & Galen, M. (2013). Research on evaluation: A needs assessment. Canadian Journal of
Program Evaluation, 27(1), 39–64.

Treasury Board of Canada Secretariat. (2016a). Policy on results. Retrieved from http://www.tbs-
sct.gc.ca/pol/doc-eng.aspx?id=31300&section=html

Treasury Board of Canada Secretariat. (2016b). Directive on results. Retrieved from https://www.tbs-
sct.gc.ca/pol/doc-eng.aspx?id=31306&section=html

U.S. Bureau of Justice Assistance. (2018). Body-worn camera toolkit. U.S. Department of Justice, Bureau of Justice
Assistance. Retrieved from https://www.bja.gov/bwc/resources.html

Van de Walle, S., & Cornelissen, F. (2014). Performance reporting. In M. Bovens, R. E. Goodin, & T.
Schillemans (Eds.), The Oxford handbook on public accountability (pp. 441–455). Oxford, UK: Oxford
University Press.

Weiss, C. H. (1972). Evaluation research: Methods for assessing program effectiveness. Englewood Cliffs, NJ: Prentice
Hall.

Weiss, C. H. (1998a). Evaluation: Methods for studying programs and policies (2nd ed.). Upper Saddle River, NJ:
Prentice Hall.

Weiss, C. H. (1998b). Have we learned anything new about the use of evaluation? American Journal of Evaluation,
19(1), 21–33.

White, M. D., Todak, N., & Gaub, J. E. (2017). Assessing citizen perceptions of body-worn cameras after
encounters with police. Policing: An International Journal of Police Strategies & Management, 40(4), 689–703.

World Bank. (2014). Developing in a changing climate. British Columbia’s carbon tax shift: An environmental and
economic success (Blog: Submitted by Stewart Elgie). Retrieved from
http://blogs.worldbank.org/climatechange/print/british-columbia-s-carbon-tax-shift-environmental-and-
economic-success

World Bank. (2017). State and Trends of Carbon Pricing 2017. Washington, DC: World Bank. © World Bank.
https://www.openknowledge.worldbank.org/handle/10986/28510 License: CC BY 3.0 IGO.

Yeh, S. S. (2007). The cost-effectiveness of five policies for improving student achievement. American Journal of
Evaluation, 28(4), 416–436.

2 Understanding and Applying Program Logic Models

Introduction 51
Logic Models and the Open Systems Approach 52
A Basic Logic Modeling Approach 54
An Example of the Most Basic Type of Logic Model 58
Working with Uncertainty 60
Problems as Simple, Complicated, and Complex 60
Interventions as Simple, Complicated, or Complex 61
The Practical Challenges of Using Complexity Theory in Program Evaluations 62
Program Objectives and Program Alignment With Government Goals 64
Specifying Program Objectives 64
Alignment of Program Objectives With Government and Organizational Goals 66
Program Theories and Program Logics 68
Systematic Reviews 69
Contextual Factors 70
Realist Evaluation 71
Putting Program Theory Into Perspective: Theory-Driven Evaluations and Evaluation Practice 74
Logic Models that Categorize and Specify Intended Causal Linkages 75
Constructing a Logic Model for Program Evaluations 79
Logic Models for Performance Measurement 81
Strengths and Limitations of Logic Models 84
Logic Models in a Turbulent World 85
Summary 86
Discussion Questions 87
Appendices 88
Appendix A: Applying What You Have Learned: Development of a Logic Model for a Meals on
Wheels Program 88
Translating a Written Description of a Meals on Wheels Program Into a Program Logic Model 88
Appendix B: A Complex Logic Model Describing Primary Health Care in Canada 88
Appendix C: Logic Model for the Canadian Evaluation Society Credentialed Evaluator Program 92
References 94

Introduction
Logic models are an almost-indispensable aid in designing, operating, and evaluating programs and policies.
Program logic models are graphic representations of the structure of programs; they simplify and illustrate the
intended cause-and-effect linkages connecting resources, activities, outputs, and outcomes. In this textbook, we see
logic models as a visual “results chain” (BetterEvaluation, 2017). The intent of this chapter is to build a step-by-
step understanding of what program logic models are, how to use them to create a structured road map of how a
program’s activities are meant to lead to its outcomes, and how building logic models facilitates program
evaluation and performance measurement.

In this chapter, we will also introduce the concept of evaluation constructs, which are the names for settings,
interventions, people, and outcomes that are intended to be the cornerstones of the evaluation. Constructs are key
to the creation of logic models and the evaluation that follows. We will also touch on the idea of mechanisms,
which are the underlying processes that can be considered as explanatory factors between program activities and
the intended outcomes (Pawson & Tilley, 1997). Similarly, logic models can be informed by the program’s theory
of change. Although the theoretic approaches can vary, in terms of logic modeling they can be foundational for an
evidence-informed understanding of how to strategically apply resources to address social, economic, and other
problems in particular program and policy contexts. As well, the process of developing a logic model for evaluation
provides an opportunity to guide development of performance measures that can be useful for monitoring of
programs and policies.

In this chapter, we discuss how logic models can be useful for simple, complicated, or complex interventions.
Although somewhat debated, evaluators have discovered that for complex problems or with complex
interventions, a phased and iterative development of logic models can be useful for understanding the logic chain
(Funnell & Rogers, 2011; Moore et al., 2015; Rogers, 2008). We also assess the strengths and limitations of logic
modeling as an approach to representing program structures.

Logic models are later featured in our discussions of research designs (Chapter 3), measurement procedures
(Chapter 4), and designing and implementing performance measurement systems (Chapter 9). Chapter 2 also
features three appendices. Appendix A is an exercise for users of this textbook; you are given a narrative description
of a Meals on Wheels program and asked to construct a corresponding program logic model. A solution is also
included with this exercise. Appendix B is an example of a program logic that was constructed from a large-scale
review of primary health care in Canada, and illustrates a logic model in a complex setting. Appendix C includes
the logic model for an evaluation of the Professional Designation Program (PDP) that is offered by the Canadian
Evaluation Society. The evaluation was done by a team from the Claremont Graduate School evaluation group—
the Canadian PDP program is the first of its kind, internationally.

Logic Models and the Open Systems Approach
To this point in the textbook, we have represented programs as “boxes” that interact with their environments.
Program activities in the box produce outputs that “emerge” from the box as outcomes and ideally affect whatever
aspect(s) of their environs the program is aimed at, in ways that are intended. This representation of programs is
metaphorical; that is, we are asserting that programs are “box like” and engage in purposeful activities that are
directed at social, economic, or physical problems or other conditions we wish to change. This metaphor is
severely simplified, however, because it does not show that the program or policy (i) can range along a continuum
of simplicity to complexity in terms of its problems of interest and its mechanisms for change, (ii) operates in a
context that can range from stable to unstable, and (iii) may be subject to emergent feedback loops that can occur
as something is being evaluated or as the program is implemented (Funnell & Rogers, 2011; Stame, 2004).

One might instead envision programs as occurring in an open system, illustrated by Morgan’s (2006)
“organization as organism” metaphor. The concept of an open system is rooted in biology
(Gould, 2002). Biological systems are layered. Organisms interact with their immediate environments; species
interact with each other and with their natural environment; and complex, geographically defined interactions
among species and their physical environments can be viewed as ecosystems. If we look at biological organisms,
they give and take in their interactions with their environment. They have boundaries, but their boundaries are
permeable; they are open systems, not closed, self-contained systems. The environment can change over time.
Interactions between organisms and their environments can result in adaptations; if they do not, then species may
become extinct. If we apply this metaphor to programs, program inputs are converted to activities within the
program (program structures perform functions), and program outputs are a key form of interaction between the
program and its environment.

Unlike the “organizations as machines” metaphor with its assembly line approach, the open-systems metaphor
encourages conceptualizing organizations—and programs within them—in more dynamic terms (Morgan, 2006).
Programs are always embedded in their environment, and assessing their implementation or their results involves
identifying and understanding the relationships between the program and its environment. A key part of the
dynamics is feedback loops: At the program or policy level of analysis, positive feedback can indicate that the
program is doing well or perhaps should do more of what it is currently doing, and negative feedback indicates the
program may be under-resourced or may need modification. When we introduced our open-systems model of a
program in Chapter 1 (Figure 1.4), the evaluation questions embedded in that model can be seen as feedback
channels. As well, in a well-functioning performance management system (see Figure 1.1), the entire performance
management cycle, from developing strategic objectives to using performance information to modify those
objectives, can be seen as an open system with feedback loops.

As an aside at this point but related to feedback loops, consider the following: While learning and positive change
can occur in the organization or its individuals as a program is implemented or assessed, in cases where the act of
being evaluated or measured has consequences that may threaten the level of funding or even the continued
existence of the program, there can be unintended negative implications (Funnell & Rogers, 2011; Smith, 1995).
We will expand on this in our performance measures discussions, particularly in Chapter 10.

At this initial stage, the key point to remember is that formative and summative evaluations are, essentially, an
attempt to credibly and defensibly determine the difference between the outcomes of a program and what would
have occurred if the program had not been initiated, or what might happen if it is implemented differently or
implemented in a different context. There is a set of skills needed to understand the logic of program outcomes
and the counterfactual, but as part of this, it is also vital that evaluators have a good sense of the organizational,
social, political, and fiscal context in which evaluative efforts are conducted.

The open systems approach has not only thrived but has also come to dominate our view of public and nonprofit programs.
Public- and nonprofit-sector managers are encouraged to see the performance management cycle in open systems
terms: strategic planning (which includes environmental scanning) develops objectives; policy and program
design attaches resources (and activities) to those objectives; program implementation emphasizes aligning
management components and systems so that objectives and outcomes can be achieved in specific contexts; and
performance measurement and program evaluation are intended as means of providing feedback to the managers
and to other stakeholders in a network of accountability and performance improvement relationships. The cycle is
completed when accountability-related reporting feeds into the next cycle of objective-setting and program
adjustments. This open systems view has a number of implications, pointing to the potential usefulness of logic
modeling.

In this chapter, we consider how basic logic models can be used to plan for evaluations and/or performance
measurement, using as an example recent programs in which city police officers are outfitted with body-worn
cameras. It is a fitting example because studies have reported seemingly contradictory findings (Ariel et al., 2016,
2017; Lum et al., 2015; Maskaly et al., 2017), and although body-worn cameras might seem like a simple
intervention, these programs exhibit many of the signs of complex interventions.

Implications of Understanding Policies and Programs as Open Systems

Where publicized “accountability” targets or goals are used as a basis for budgetary decisions, the result may be various kinds of individual
or organizational “gaming” related to evaluation or performance measurement. This contextual problem is especially salient in times of
fiscal restraint when programs are subject to being eliminated or having funding cuts. Even the creation of performance measurement
systems can be subject to gaming, when measures intended for program learning and improvement (e.g., managerial decisions) are also
expected to be called into use for program budget reallocations. These are things to remember:

1. Programs exist in dynamic environments, which both afford opportunities and offer constraints to programs and to evaluators.
2. Programs have permeable boundaries; that is, there is at least a conceptual demarcation between the program and its
environment. Usually, that boundary cannot be observed directly. But acknowledging there is a boundary affects how we “see”
programs and how we model them.
3. Programs are purposeful systems—that is, they are human constructions with which we intend to accomplish objectives we value.
Typically, program objectives are the result of decisions made politically and organizationally.
4. Programs have structures; structures produce activities/processes; and activities, in turn, produce results (outputs and outcomes),
all of which can be described with logic models.
5. The program and the system within which a program exists can range in complexity, which impacts the type of modeling used to
describe the program and its environmental context.

A Basic Logic Modeling Approach
Logic models play an important role in performance management. They can be used as a part of the strategic
planning process to clarify intended objectives and the program designs that are meant to achieve these objectives
or outcomes. In Canada, Australia, the United Kingdom, the United States, and many other countries, logic
modeling has become central to public-sector program evaluation systems and the performance management
architecture. Government and nonprofit agencies are often expected to conduct program evaluations and, in most
cases, to develop performance measurement systems. This occurs nationally, as well as at subnational levels and in
many nonprofit organizations. In Canada, for example, the federal agency that supports government-wide
reporting and evaluation activities has prepared a guide for federal department performance reports, which
includes a logic model template (Treasury Board of Canada Secretariat, 2010). In the United States, the
Government Accountability Office (GAO) provides similar resources, with evaluation design guidance that
includes the development of logic models (U.S. GAO, 2012). In the United Kingdom, HM Treasury (2011)
provides evaluation and logic-modeling guidance via The Magenta Book. Many organizations have used the Kellogg
Foundation Logic Model Development Guide (Kellogg, 2006). In addition, many government agencies provide
tailored evaluation and logic-modeling guidance, such as the exemplary resources of the UK’s Ministry of
Transportation (UK Government, 2017). Similarly, many nongovernment organizations provide evaluation
guidance specific to their policy arena (see, for example, Calgary Homeless Foundation, 2017). Although there is a
lot of common advice, evaluators do benefit from accessing the evaluation and logic modeling guidance most
relevant to the program that is to be studied.

Table 2.1 presents a basic framework for modeling program logics. We will discuss the framework and then
introduce several examples of logic models that have been constructed using the framework. The framework in
Table 2.1 does two things. First, it classifies the main parts of a typical logic model into inputs, components
(organizational/program), implementation activities, outputs, outcomes, and contextual environment. Second, it
offers a template to specify directionally (i.e., left to right) how inputs and outcomes are intended to be linked,
causally. We begin by defining key terms, and then show an example of a logic model (Table 2.2), and later a
figure (2.1) that includes cause-and-effect linkages.

Table 2.1 Basic Components of a Program Logic

Program Implementation

Inputs
- Money
- People (program providers)
- Equipment/Technology
- Facilities

Components
- Major clusters of program activities

Implementation Activities
- To provide . . .
- To give . . .
- To do . . .
- To make . . .

Outputs
- Work done
- Program activities completed

Intended Outcomes

Short-, Medium-, and Longer-Term Outcomes
- Intended by the design of the program
- Outcomes (or impacts) relate to program objectives

Environment/Context
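
For readers who find it helpful to see the framework in a different notation, the following minimal sketch (written in Python purely for illustration) records the Table 2.1 categories as labeled lists. All entries are placeholders rather than elements of any actual program.

```python
# A minimal, hypothetical sketch of the Table 2.1 categories for a generic program.
# All entries are illustrative placeholders, not elements of an actual program.
program_logic = {
    "inputs": ["money", "people (program providers)", "equipment/technology", "facilities"],
    "components": ["major clusters of program activities"],
    "implementation_activities": ["to provide ...", "to give ...", "to do ...", "to make ..."],
    "outputs": ["work done", "program activities completed"],
    "short_term_outcomes": [],
    "medium_term_outcomes": [],
    "longer_term_outcomes": [],
    "environment_context": ["contextual factors that may affect the program"],
}

# Reading the dictionary top to bottom mirrors the intended left-to-right causal direction.
for category, entries in program_logic.items():
    print(f"{category}: {entries}")
```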

Program inputs are the resources that are required to operate the program; they typically include money, people
(program providers), equipment (including technologies), and facilities. Program inputs are an important part of
logic models. It is typically possible to monetize inputs—that is, convert them to equivalent dollar/currency
values. Evaluations that compare program costs with outputs (technical efficiency), program costs with outcomes
(cost-effectiveness), or program costs with monetized value of outcomes (cost–benefit analysis) all require estimates
of inputs expressed in dollars (or some other currency). Performance measurement systems that focus on efficiency
or productivity also compare program costs with results.
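
To make these comparisons concrete, the brief sketch below works through invented figures for a hypothetical program whose inputs have been monetized for one year; none of the numbers come from an actual evaluation.

```python
# Invented, illustrative figures for one program year.
total_program_cost = 500_000.00            # monetized inputs, in dollars
clients_served = 1_250                     # an output
clients_employed_after_one_year = 300      # an outcome
monetized_value_of_outcomes = 650_000.00   # e.g., estimated earnings gains

cost_per_output = total_program_cost / clients_served                    # technical efficiency
cost_per_outcome = total_program_cost / clients_employed_after_one_year  # cost-effectiveness
net_benefit = monetized_value_of_outcomes - total_program_cost           # cost-benefit comparison

print(f"Cost per client served:   ${cost_per_output:,.2f}")
print(f"Cost per client employed: ${cost_per_outcome:,.2f}")
print(f"Net benefit:              ${net_benefit:,.2f}")
```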

Program components are clusters of activities in the program. They can be administrative units within an
organization that is delivering a program. For example, a job-training program with three components (intake,
skills development, and job placement) might be organized so that there are organizational work groups for each
of these three components. Alternatively, it might be that one work group does these three clusters of activities, a
situation common in the nonprofit sector.

Implementation activities are included for each component of a program logic model. These are modeled, to
some extent, on an approach to modeling program logics that was introduced by Rush and Ogborne (1991). In
their approach, they developed a category for program implementation activities that are necessary to produce
program outputs. Implementation activities are about getting the program running—that is, getting the things
done in the program itself that are necessary to have an opportunity to achieve the intended outcomes.
Implementation activities simply state the kinds of work that the program managers and workers need to do, not
the intended outputs and outcomes for the program. Typical ways of referring to implementation activities begin
with words like “to provide,” “to give,” “to do,” or “to make.” An example of an implementation activity for the
intake component of a job-training program might be “to assess the training needs of clients.” Another example
from the skills development component of the same program might be “to provide work skills training for
clients.”

Successful implementation does not assure us that the intended outcomes will be achieved, but implementation is
considered a necessary condition for program success. If the implementation activities do not occur, there is no
real point in trying to determine whether the program was efficient (technically efficient) or effective.

It is possible that when programs are developed, implementation becomes a major issue. If the staff in an agency
or department is already fully committed to existing programs, then a new program initiative may well be slowed
down or even held up at the point where implementation should occur. Furthermore, if a new program conflicts
with the experience and values of those who will be responsible for implementing it (including previous
unsuccessful attempts to implement similar programs), it may be that ways will be found to informally “stall” the
process, perhaps in the hope that a change in organizational direction or leadership would effectively cancel the
program. Implementation is the main focus of process evaluations. In the example of police body-worn cameras,
the evaluation studies have uncovered a variety of unexpected implications of inconsistent BWC implementation
(Ariel et al., 2017; Maskaly et al., 2017).

For successful implementation of performance measurement systems, resistance to change is an important issue
that must be managed. In Chapter 9, we discuss the challenge of sustaining organizational support for
performance measurement, particularly the role that program managers (and others) have in whether and how
the performance information is used.

Program outputs occur as a result of the activities, and can be viewed as the
transition from program activities to program outcomes. Outputs are typically tangible and countable. Examples
would include number of clients served, number of patients admitted into a hospital, or number of fire inspections
in a local government program. In Canada, universities are nearly all partially funded by governments, and a key
output for universities in their accountability relationships with governments is the numbers of students enrolled.
This output is linked to funding; not meeting enrollment targets can mean funding cuts in the next fiscal year.
Because outcomes are often difficult to quantify, program outputs are sometimes used for accountability and
funding purposes. This often occurs where networks of organizations, devolved from central government, are
expected to deliver programs that align with government priorities.

Program outcomes are the intended results that correspond to program objectives. Typically, programs will have
several outcomes, and it is common for these outcomes to be differentiated by when they are expected to occur. In
a program logic of a housing rehabilitation program that is intended to stabilize the population (keep current
residents in the neighborhood) by upgrading the physical appearance of dwellings in an inner-city neighborhood,
we might have a program process that involves offering owners of houses property tax breaks if they upgrade their
buildings. A short-term outcome would be the number of dwellings that have been rehabilitated. That, in turn,
would hopefully lead to reduced turnover in the residents in the neighborhood, a longer-term outcome which
might be called an “impact.”

There can be confusion between “outcomes” as measures (variables) of constructs in program logic models
(including short-, medium-, and longer-term program results) and “outcomes” defined as the change in such a
variable that is specifically attributable to the program or policy. In Figure 1.4 in Chapter 1, which illustrates the
open systems model of programs, when we refer to “actual outcomes,” we mean the outcomes observed in the
evaluation.

Observed outcomes may or may not be attributable to the program. If we conduct an evaluation and determine
that the actual outcome(s) are due to the program, we can say the program is effective. This is the issue we focus
on in evaluations where attribution and counterfactuals are important.

Observed outcomes might not be attributable to the program—perhaps a combination of factors other than the
program has produced the observed outcomes. Navigating threats to validity is about determining, as best we can,
whether observed outcomes are the result of the program.

Thomas Schwandt (2015) points out that “the literature is not in agreement on the definitions of outcome and
impact evaluation. Some evaluators treat them as virtually the same, others argue that outcome evaluation is
specifically concerned with immediate changes occurring in recipients of the program, while impact examines
longer-term changes in participants’ lives” (p. 4).

In this textbook, we distinguish between outcomes as “performance” measures (how do the observed outcomes
compare to program objectives?) and outcomes that are attributable to the program (were the observed outcomes
due to the program?). In Figure 1.4, this is the difference between Effectiveness 1 evaluation questions and
Effectiveness 2 questions, respectively.

Program impacts, then, in this textbook refer to longer-term outcomes that are attributable to the program.
Impacts are often seen as longer-term effects of a program in society. In situations where programs are being
evaluated in developing countries, one view of effectiveness evaluations is that they should focus on impacts: the
longer-term outcomes that can be attributed to the program. To do this well, impact evaluations emphasize the
importance of rigorous comparisons that yield estimates of the incremental effects of programs (Gertler, Martinez,
Premand, Rawlings, & Vermeersch, 2016).

Environmental factors can enhance the likelihood that a program will succeed—a regional call center opening in
a community at the same time an employment training program is implemented may make it easier for program
trainees to find work. Environmental factors can also impede the success of a program. For example, rapidly rising
real estate values can impact a city’s efforts to maintain a sufficient supply of low-cost housing. As rental units are
renovated to capture higher rents, the pool of low-cost options dwindles, in some cases increasing homelessness.
The inclusion of environmental factors helps acknowledge that the program is part of an open system and that
there are contextual factors to consider.

Specifying environmental factors that could affect the outcomes of programs is a step toward anticipating how
these factors actually operate as we are evaluating a program. In Chapter 3, we examine categories of rival
hypotheses that can complicate our efforts to examine the intended connections between programs and outcomes.

Initial, Intermediate, and Long-Term Outcomes

Logic models generally are displayed so that a time-related sequence (left to right) is implied in the model. That is, logic models are
displayed so that resources occur first, then activities, then outputs, then outcomes. Outcomes can be displayed as short-term,
intermediate, and long-term. The sequence of outcomes is meant to indicate intended causality. Intermediate outcomes follow from the
initial short-term outcomes, and the long-term outcomes are intended as results of intermediate outcomes.

Distinguishing between short-term, intermediate, and long-term outcomes recognizes that not all effects of program activities are
discernable immediately on completion of the program. For example, social assistance recipients who participate in a program designed to
make them long-term members of the workforce may not find employment immediately on finishing the program. The program,
however, may have increased their self-confidence and their job-hunting, interviewing, and resume-writing skills. These short-term
outcomes may, within a year of program completion, lead to the long-term outcome of participants finding employment. Such situations
remind us that some program outcomes need to be measured at one or more follow-up points in time, perhaps 6 months or a year (or
more, depending on the intended logic) after program completion.

An Example of the Most Basic Type of Logic Model
An important aim of logic modeling is to describe the programs in an organization in a way that identifies key
activities that lead to outputs and anticipated outcomes. A basic approach involves categorizing program structures
and processes so that outcomes (which are typically the focus of performance measurement and evaluation efforts)
can be distinguished from other program activities. An illustration of such a logic model is created—as an example
—from several police body-worn camera (BWC) programs across the United States, the United Kingdom, and
Canada. There have been many such programs launched in recent years, sparked in part by bystander videos of
police shootings of citizens and the subsequent political attention to racial factors in the shootings (Ruane, 2017).
In 2015, for example, President Obama launched a US$20 million Body-Worn Camera Pilot Partnership
Program for local and tribal law enforcement organizations (U.S. Whitehouse [archives], 2015). The major
features of this basic logic model, shown in Table 2.2, are categories for inputs, activities, outputs, and three kinds
of outcomes: (1) initial, (2) intermediate, and (3) long-term. The details are drawn from a broad overview of
recent police body-worn-camera review studies (Ariel et al., 2017; Lum et al., 2015; Maskaly et al., 2017; White,
2014).

The activities of the BWC programs are stated in general terms, and the model is intended to be a way of
translating a verbal or written description of the program into a model that succinctly depicts the program. The
outputs indicate the work done and are the immediate results of activities that occur given the program’s inputs.
Outcomes are intended to follow from the outputs and are succinct versions of the objectives of the program.

Table 2.2 Example Program Logic Model of Police Body-Worn Camera Programs

Inputs
- Funding for purchase of body-worn cameras
- Funding for initial and ongoing BWC tech system
- Training and technical assistance

Activities
- Establishment of BWC tech system
- Establishment of BWC usage policies
- Establishment of training policies/system
- Creation of public communications about BWCs
- Creation of BWC program internal communications system (e.g., newsletters, internal websites)

Outputs
- Number of BWCs available
- Number of officers trained
- Officer documentation to augment notes
- Real-life video-enhanced training for officers
- Communications to public
- Communications to officers (memos/newsletters)
- Citizen notifications that event will be recorded

Initial Outcomes
- Increased public awareness of BWCs
- Improved officer confidence in evidence for BWC-based prosecutions
- De-escalation of officer force
- Fewer assaults against police
- Reduced police incidents
- Changes in racial patterns of interactions
- Improved citizen willingness to be witness
- Improved officer training/preparation

Intermediate Outcomes
- Reduced use-of-force incidents by police
- Reduced complaints about misconduct
- Increased positive resolutions of complaints
- Increased efficiency in response to complaints
- Reduced number of arrests
- Earlier court case resolutions
- Higher prosecution rates
- Reduced city liabilities

Environmental Context (e.g., organizational culture, community crime factors, community history)

The bullets in this example illustrate broadly defined program constructs. Constructs are the words or phrases in
logic models that we use to describe programs and program results, including the cause-and-effect linkages in the
program. Program logic models can differ in the ways categories are labeled and in the level of detail in the
modeling process itself—specifically, in the extent to which the logic models are also intended to be causal models
that make explicit the intended connections between activities, outputs, and outcomes. As well, logic models can
differ in how they are presented. Aside from the ordering principle (e.g., left to right or top to bottom) that guides
the user from inputs to outcomes, some organizations prefer logic models that include a column specifically for
societal impacts that are intended to occur if the outcomes are achieved. As well, some organizations present logic
models in a vertical format—top to bottom, where the top might be the inputs and the bottom would be the
longer term outcomes.

A logic model like the one in Table 2.2 might be used to develop performance measures; the words and phrases in
the model could become the basis for more clearly defined program constructs that, in turn, could be used to
develop variables to be measured. But at this stage of development, the model has a limitation. It does not specify
how its various activities are linked to specific outputs, or how particular outputs are connected to initial
outcomes. In other words, the model offers us a way to categorize and describe program processes and outcomes,
but it is of limited use as a causal model of the intended program structure. As a basis for developing performance
measures, it is difficult to see which of the constructs in the model are more important and, hence, are candidates
for being used as constructs in an evaluation, or constructing performance measures.
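
One way to see what is missing is to try writing the intended linkages down explicitly. The hypothetical sketch below pairs a few of the BWC constructs from Table 2.2 as directed links from outputs to initial outcomes; the particular pairings are illustrative assumptions of ours, not linkages asserted by the BWC studies cited above.

```python
# Illustrative only: possible directed links from BWC outputs to initial outcomes.
# The pairings below are assumptions made for the sake of the example.
intended_links = [
    ("communications to public", "increased public awareness of BWCs"),
    ("number of officers trained",
     "improved officer confidence in evidence for BWC-based prosecutions"),
    ("citizen notifications that event will be recorded",
     "de-escalation of officer force"),
]

# Writing the links down also makes it easy to spot outputs that have no stated
# connection to any outcome yet.
all_outputs = {
    "number of BWCs available",
    "number of officers trained",
    "communications to public",
    "citizen notifications that event will be recorded",
}
linked_outputs = {output for output, outcome in intended_links}
print("Outputs with no stated link to an outcome:", all_outputs - linked_outputs)
```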

Most of the logic models presented in this chapter are linear—that is, a causal ordering is implied such that inputs
lead to activities and, hence, to outputs and outcomes. In recent years, there has been increased mention of the
level of complexity of programs or interventions (see Funnell & Rogers, 2011; Patton, 2010; Patton, McKegg, &
Wehipeihana, 2015). Complexity introduces the challenge of possible non-linearity of cause and effect. In the next
section, we discuss complexity and our approach to addressing simple, complicated, and complex interventions.

Working with Uncertainty
For logic models to be a useful tool in an evaluation, an important consideration when constructing them is the
level of complexity of the programs or policies being evaluated. It can be especially challenging to find a
straightforward yet sufficient way to capture in a logic model the complexity of a multi-component, multi-agency
program that operates in a fast-changing environment. Logic models, to be most useful, must straddle a fine line:
they need to capture the most important causal links and the relevant context of a program without becoming
overwhelming. That is, they need just enough detail to guide an evaluation and help assess and show the
incremental differences a program has made. For the foreseeable future, governments may find themselves in
fiscally restrained times, with increasing pressure on policy-makers to make difficult allocation decisions and to
account for these decisions (Stockmann & Meyer, 2016). Nonprofit organizations, as well, will face heightened
pressure to be accountable for delivering specified results with their public and donated funds. The role of public-
sector evaluation (and audit) has changed over time—and particularly in the post-GFC (global financial crisis)
fiscal environment, the evaluation function in many governments has become more systematized and outcome-
focused (Dahler-Larsen, 2016; Shaw, 2016). New public management reforms, such as the devolution of
government organizations with an emphasis on “incentivizing the managers to manage,” have made some
interventions more challenging to model because of multiple components delivered by multiple, networked
organizations. Overall, an increasing proportion of public-sector interventions are seen as “complex,” yet there is
more demand for unambiguous, black-and-white evaluations and performance measures to help inform and
defend policy and program decisions (Stockmann & Meyer, 2016).

Problems as Simple, Complicated, and Complex
“Complexity” can refer to problems that are intended to be addressed by intervention programs. That is, some
evaluators view the problems themselves—rather than the interventions—as varying in their complexity.
Glouberman and Zimmerman (2002) provide a much-cited model of problems distinguished as simple,
complicated, or complex. Using the examples of “following a recipe” (simple), “sending a rocket to the moon”
(complicated), and “raising a child” (complex), Glouberman and Zimmerman (p. 2) provide the following table to
illustrate the differences.

Table 2.3 Simple, Complicated and Complex Problems

Following a Recipe | Sending a Rocket to the Moon | Raising a Child
The recipe is essential | Formulae are critical and necessary | Formulae have a limited application
Recipes are tested to ensure easy replication | Sending one rocket increases assurance that the next will be OK | Raising one child provides experience but no assurance of success with the next
No particular expertise is required. But cooking expertise increases success rate | High levels of expertise in a variety of fields are necessary for success | Expertise can contribute but is neither necessary nor sufficient to assure success
Recipes produce standardized products | Rockets are similar in critical ways | Every child is unique and must be understood as an individual
The best recipes give good results every time | There is a high degree of certainty of outcome | Uncertainty of outcome remains
Optimistic approach to problem possible | Optimistic approach to problem possible | Optimistic approach to problem possible

Source: The Romanow Papers, Volume II: Changing Health Care in Canada, edited by Pierre-Gerlier Forest, Gregory Marchildon, and Tom McIntosh © University of Toronto Press 2004. Reprinted with permission of the publisher.

With “simple” problems we can expect that the process to achieve the objective is fairly straightforward and linear.
With a “complicated” problem, a greater level of expertise and coordination of components (and perhaps other
agencies) is required, but there is still reasonable certainty in the outcome, especially with experience and fine-
tuning over time. “Complex” problems, in contrast, can have the emergent, non-linear qualities of complex
systems and are more difficult to evaluate and to generalize (Funnell & Rogers, 2011; Glouberman &
Zimmerman, 2002).

Glouberman and Zimmerman (2002) argue that complex problems cannot necessarily be simplified to be studied
as component problems, because of the interdependent and dynamic qualities of the variables, both program-
related and in the environmental context. Patton (2010) makes a similar point about complexity in his
developmental evaluation formulation, and some argue that because most social programs exist in complex
systems, with inherent dynamic, emergent, non-linear, and thus unpredictable qualities, linear, rational tools
such as logic modeling and reductionist thinking are often inadequate for studying social interventions
(Mowles, 2014; Stacey, 2011; Stame, 2010).

Interventions as Simple, Complicated, or Complex
On the other hand, there are those who maintain that logic models are indeed useful for illustrating and modeling
even complex interventions (see Chen, 2016; Craig et al., 2013; Funnell & Rogers, 2011; Rogers, 2008).
Beginning with Glouberman and Zimmerman’s (2002) model, Funnell and Rogers (2011) take the view that
programmatic interventions—not the problems themselves—are simple, complicated, or complex, and that it is
feasible to construct logic models to illustrate the proposed causal linkages. From this pragmatic point of view,
even complex interventions are amenable to defensible evaluation.

Evaluators need to take into account the complexity of the intervention and its components, and when building
logic models for complicated and complex interventions, complexity can be handled with a phased or iterative
approach. Sometimes, nested logic models may be necessary (Funnell & Rogers, 2011).

The Practical Challenges of Using Complexity Theory in Program Evaluations
Much has been written about using systems thinking and complexity theory in evaluations, and interest is on the
rise (Craig et al., 2010; Gates, 2016; Mowles, 2014; Walton, 2016). But actual take-up in evaluative work has
been limited, chiefly because evaluations using complexity theory are beset with a number of practical challenges
(Walton, 2016). Governments and funders often have traditional expectations of evaluation approaches
and the type of information they will deliver, and there are real and perceived constraints in resources, time, and
data requirements for complexity theory approaches. Indeed, there is no standard agreement on the meaning of
complexity, and there are limits to our understanding of how to create a complexity-informed evaluation (Walton,
2016). Gates (2016) also notes that there has as yet been only limited discussion of how systems thinking and
complexity theory can be useful for accountability-related evaluations that are focused on program effectiveness
and cost-effectiveness. Stockmann and Meyer (2016) discuss this conundrum as a tension between science and
utility. Utility is defined from the point of view of usefulness for the decision makers, and the challenge is
described as follows:

If the expectations on both sides are not to be disappointed, the evaluation results (which are produced by
‘science’) have to fulfil certain conditions:

(1) They have to provide answers to the specific information requirements of the clients, in other words,
they have to be relevant to the decisions that are to be made.
(2) They have to be delivered in time, that is, within the previously agreed time frame in which the
clients’ decisions have to be made.
(3) They must not exceed a certain degree of complexity, in other words the information supplied has to
be recognizable as relevant in respect of the cognitive interest formulated by the client or in respect of the
decisions the latter has to make. (p. 239)

In situations where managers are working with programs meant to address complex problems, it is quite likely that
evaluations of outcomes will offer a mixed picture. In some cases, it might be necessary to have a greater focus on
the outputs than on the outcomes because of the many interdependent variables that are impacting the outcomes.

The problem of other causes of outcomes (rival hypotheses) is, in fact, central to the issue of attribution, which we
will discuss in Chapter 3. For programs addressing simple problems, outputs can easily be linked to outcomes, and
consideration of rival hypotheses is relatively straightforward. For complicated or complex problems, however,
rival hypotheses are a major concern; the logics of social programs tend to be more uncertain, reflecting the
importance of factors that can interrupt or interact with the intended cause-and-effect linkages.

So, while it is analytically possible to go into the far reaches of using complexity theory in evaluations, there is an
argument for pragmatism in how we are to evaluate programs and policies in a complex world (Chen, 2016;
Funnell & Rogers, 2011; Reynolds et al., 2016; Stockmann & Meyer, 2016). Our position in this textbook is that of
the pragmatic practitioner. Managers and decision-makers need straightforward information about the incremental
effects of programs, to be used for management decisions and budget allocations. Defensible and decision-relevant
methodologies and analyses are needed that reinforce confidence in the evaluative process and the evidence. This is
true, we argue, even in the case of complex interventions. An evaluator needs to keep in mind the purpose of an
evaluation, the people who are intended to use the evaluation to inform their decisions, and the social context of the
program.

Complex Interventions: Implications for Performance Measurement

In many jurisdictions, managers are expected to participate in developing performance measures of their own programs and to account for
program results. Current efforts in government departments, agencies, and nonprofit organizations to develop and use performance
measures raise the question of who should be accountable for program outcomes. This can be particularly problematic with complicated
or complex interventions because of the challenges in identifying the relevant causal variables, untangling their linkages, and taking into
account the rival hypotheses that weaken intended links to outcomes. As well, in these settings, the context can produce emergent effects
that are not anticipated in tracking systems. In cases where performance results are reported publicly, the political culture in the reporting
organization’s environment can be an important factor in how performance measurement results are used and, indeed, how credible they
are. We will discuss these issues in Chapter 10. For now, keep in mind that when program managers (or even
organization executives) are required to account for publicly reported performance results that are likely to be
used summatively, for accountability purposes or budget cuts, they may well respond to such incentives by choosing performance
measures (or performance targets) that have a high likelihood of making the program look effective, minimizing attention to areas that may
need improvement. In effect, if external performance reporting is high stakes, performance measures may be strategically chosen for good
optics, or performance results may be sanitized. This undermines the usefulness of performance information for internal management–
related uses, since those performance results may then be less relevant for decision-making (McDavid & Huse, 2012; Stockmann &
Meyer, 2016). So, similar to the imperative to consider context when creating logic models for evaluations, the level of complexity of an
intervention is an important factor when framing a logic model that is intended to develop performance measures for accountability
purposes.

Program Objectives and Program Alignment With Government Goals
One of the most important considerations when developing a logic model, whether it is for guiding an evaluation,
assisting in strategic planning, or simply facilitating a shared conversation about aligning the efforts of a program,
is getting initial clarity on this question: What are the objectives of the program? From the program’s objectives,
which may be fairly broad or vague, it is often possible to outline the outcomes that become part of the logic
model. It makes sense to begin a logic model first with the intended objectives in mind, then iteratively construct a
road map of how the inputs, activities, and outputs are intended to achieve the intended outcomes. In this section,
we discuss program objectives and program alignment with larger goals, and their importance when constructing
logic models.

Specifying Program Objectives
The performance management cycle introduced in Chapter 1 includes a policy and program design phase, and a
part of designing programs and policies is stating objectives. Ideally, organizations should begin and end each cycle
with a commitment to constructing/adjusting clear strategic objectives. These will lend themselves to clear
program mandates that, in turn, will facilitate implementation and evaluation. The circle is closed (or one loop in
the spiral is completed) when the evaluation/performance results are reported and used for the next round of
refining or redefining the strategic objectives.

Ideal Program Objectives

From both a program evaluation and a performance measurement standpoint, ideal program objectives should have at least four
characteristics:

1. They should specify the target population/domain over which expected program outcomes should occur.
2. They should specify the direction of the intended effects—that is, positive or negative change.
3. They should specify the time frame over which expected changes will occur.
4. Ideally, the outcomes embedded in program objectives should be measurable, although this is sometimes not feasible. When
measurable, they should specify the magnitude of the expected change.

These four criteria, if all realized, will greatly facilitate the work of evaluators. The evidence for program outcomes can then be analyzed
considering the population, direction, and time frame factors that were specified.

An example of a well-stated (hypothetical) program objective might be as follows:

The Neighborhood Watch Program that has been implemented in the Cherry Hill area of Boulder,
Colorado, will reduce reported burglaries in that part of the city by 20% in the next 2 years.
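
As a rough illustration of how the four characteristics can be checked, the sketch below encodes this hypothetical Neighborhood Watch objective as structured fields and flags any element that is missing. The field names and the checking rule are our own simplification, intended only to show how the criteria might be operationalized.

```python
# The hypothetical Neighborhood Watch objective, broken into the four ideal elements.
objective = {
    "target population/domain": "Cherry Hill area of Boulder, Colorado",
    "direction of intended effect": "reduce reported burglaries",
    "time frame": "next 2 years",
    "magnitude (if measurable)": "20% reduction",
}

# Flag any of the four ideal characteristics that is missing or left blank.
missing = [element for element, value in objective.items() if not value]
if missing:
    print("Objective is missing:", ", ".join(missing))
else:
    print("Objective states population, direction, time frame, and magnitude.")
```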

In most situations, however, program objectives are far less precisely stated. Programs and policies are usually put
together in a political context. In contrast to (normative) models of performance management that rely on a view
of organizations that is essentially devoid of the impacts of power and politics, most of us have experienced the
political “give and take” that is intrinsic in putting together the resources and support needed to mount an
initiative. Power dynamics are an intrinsic part of organizations, and getting things done means that individuals
and groups need to work within the formal and informal structures in the organization (de Lancer Julnes &
Holzer, 2001).

Morgan (2006), in his examination of various metaphors of organizations, includes a case for considering
organizations as political systems, with “interests, conflict, and power” (p. 149). One key to understanding the
background politics is that persons and groups who participate in setting objectives do not necessarily share the
same values. Resolving differences among values through bargaining or even conflict/conflict resolution processes
is the essence of politics. The implication for constructing program objectives is that competing and perhaps even
conflicting views will often need to be reflected in the objectives. That can mean that the words chosen will reflect
these different values and may result in objectives that are general, even vague, and seem to commit the program
to outcomes that will be difficult to measure. The objectives are, in fact, political statements and carry the freight
of political discourse—they promise something to stakeholders. Objectives also sometimes need to be broad
enough to allow for modification for various locales.

Unclear language creates challenges when the time comes to measure vague constructs. This is especially true for
social programs in which multiple agencies are involved. For example, the Troubled Families
Programme (Department for Communities and Local Government, 2016a, 2016b) in the United Kingdom had
broad objectives so that local agencies could tailor the services to the needs of their respective communities. The
objectives for the program, updated in 2015 after a pilot program, are stated as follows:

The new programme has three objectives:

For families: to achieve significant and sustained progress with 400,000 families with multiple, high-
cost problems;
For local services: to reduce demand for reactive services by using a whole family approach to transform
the way services work with these families; and,
For the taxpayer: to demonstrate this way of working results in cost savings. (p. 18)

These objectives are broad and suggest a logic for the program but lack many of the desirable characteristics of
program objectives we identified previously, particularly time frame, amount of outcomes expected, and the
measurability of outcomes.

Given the amount of interpretation that is required to identify outcomes that can be included in a logic model—
and can eventually be measured—it is important that an evaluator secure agreement on what the program is
actually intended to accomplish, before the evaluation begins. In the case of the Rialto body-worn cameras project,
key objectives of such programs were described as “reducing police use-of-force and complaints against officers,
enhancing police legitimacy and transparency, increasing prosecution rates and improving evidence capture by the
police” (Ariel, Farrar, & Sutherland, 2015, p. 510). The many evaluations of body-worn camera projects have
included various parts of these objectives, including police use-of-force incidence, assaults against police, citizen
complaints, crime reporting, and crime rate reductions (Cubitt, Lesic, Myers, & Corry, 2016; Maskaly et al.,
2017). Qualitative studies have addressed some of the less-quantifiable objectives, such as citizen perceptions after
police encounters (White, 2014) and officer perceptions before and after implementation (Gaub et al., 2016;
Smykla, Crow, Crichlow, & Snyder, 2016).

Not all objectives or even parts of objectives will be equally important to all stakeholders. And perhaps more
importantly, it is typically impossible to address all outcomes in one evaluation. Depending on the purposes of the
evaluation and the stakeholders involved, it may be possible to simplify and, hence, clarify objectives, and which
ones are key for the evaluation questions. This strategy relies on identifying a primary group of stakeholders and
being able to work with them to translate program objectives into language that is more amenable to evaluation.

Alignment of Program Objectives With Government and Organizational Goals
We will consider the place of context a little later in this chapter, but locating and specifying a program’s alignment
within the larger organizational and/or government goals is becoming more explicit in government performance
management guidelines. This is partly because of governments’ tendency to devolve some programs to nonprofit
organizations, necessitating a need to demonstrate how a network of efforts fit together in addressing an ultimate
goal or goals. And even within the public sector, the drive for finding efficiencies and improving effectiveness and
coordination of government agencies, especially in times of fiscal restraint, has amplified requirements to show
alignment of programs with ultimate government goals (Shaw, 2016).

An emphasis on alignment is critical for managing performance. Programs can be thought of as being embedded
open systems within organizations, which themselves are open systems (Reynolds et al., 2016). This image of
nested open systems suggests that outcomes from a program are designed to contribute to the objectives of an
organization. The objectives of the U.S. Department of Agriculture, for example, are intended to contribute to the
strategic objectives of the U.S. government, together with the objectives of other federal departments. In most
governments, there is some kind of strategic planning function that yields an array of goals or objectives, which are
intended to guide the construction of organizational objectives (the vision statement, the strategic goals, and the
mission statement) that, in turn, provide a framework for program objectives. Some governments take this nesting
of objectives further by constructing systems that cascade performance objectives from the government-wide level
down to work groups in organizations and even to individuals.

The government of Canada, for example, states the following:

The Treasury Board Policy on Results, which replaced the Policy on Management, Resources and
Results Structures, further strengthens the alignment of the performance information presented in DPs
[Departmental Plans], other Estimates documents and the Public Accounts of Canada. The policy
establishes the Departmental Results Framework (DRF) of appropriated organizations as the structure
against which financial and non-financial performance information is provided for Estimates and
parliamentary reporting. The same reporting structure applies irrespective of whether the organization is
reporting in the Main Estimates, the DP, the DRR or the Public Accounts of Canada. (TBS, 2017, p.
1).

In another challenge related to government alignment, a number of OECD countries, particularly under fiscal
constraint, periodically or on an ad hoc basis conduct a distinctive type of evaluation known as a spending review
(or a similarly named exercise). These reviews are most commonly done in a search for savings options and
significant reallocations. According to the OECD’s definition of spending reviews,

Spending review is the process of developing and adopting savings measures, based on the systematic
scrutiny of baseline expenditure. (Robinson, 2014, p. 3, emphasis added)

In Canada, the most recent name for the spending review is the resource alignment review, and the Treasury
Board’s Policy on Results makes clear that performance measures and evaluations are expected to be made
available for resource alignment reviews (Treasury Board of Canada Secretariat, 2016b). Over time, as the results
of this policy unfold in the Canadian context, there may be tension in trying to create performance management systems
where evaluations and performance measures are expected to be used for budget allocation decisions.
Organizations soon become acutely aware of the possibilities and react accordingly. In the United States, the
problem is evident:

Presently, performance information is not a widely used input into budget negotiations. The usefulness
of the performance reports generated in the executive is undermined by trust in the data within. At the
moment, there is little independently verifiable information for Congress to make allocative budget
choices using a performance-informed approach. Some Congress members, particularly those in
opposition, doubt the reliability of the data provided by agencies, citing political motivations in the
selection and presentation of information. (Shaw, 2016, pp. 127–128)

While conducting an evaluation, then, an evaluator must consider these implicit normative and behavioral forces,
going beyond the explicitly expressed goals of the organization or program. Also, understanding how organizations
change—whether and when they are open to change—is an asset. As complex systems with “path dependencies”
and inherent institutional inertia, organizations will tend to resist change until moments of punctuated equilibrium
(impacts of contextual factors such as political or economic crisis) open the doors and prompt policy or program
change (Haynes, 2008).

Program Theories and Program Logics
So far, we have introduced the idea of clarifying program objectives as a first step when constructing a logic model.
Next, in building logic models that have columns of components, activities, outputs, and/or outcomes further
broken down into sections that indicate hypothesized causal linkages (i.e., with arrows), it is helpful—some would
argue necessary—to have a foundational sense of the theories that are thought to be in play in achieving outcomes
from programs or policies. In Chapter 1, we introduced 10 evaluation questions, combinations of which typically
guide particular program evaluations. One of those questions focused on program appropriateness: “Was the
structure/logic of the policy or program appropriate?” This question is about examining alternative ways that the
program objectives could have been achieved, with a view to assessing whether the implemented program structure
was the best way to proceed.

Responding to this question involves examining how program logic models are developed, with consideration of
how local experience, research, theories, and factors like organizational inertia influence how a program is designed
and implemented. There is a growing interest in program theories—that is, ways of thinking about programs that
reflect our understanding of causal relationships among the factors that can be included in a program logic model.
With theory-driven evaluations, instead of treating programs as black boxes and simply asking whether the
program was causally connected with the observed outcomes, logic models are one way to elaborate the program
structure. We test the linkages in logic models as a part of the evaluation (Astbury & Leeuw, 2010). In the case of
body-worn cameras, one theory that can be considered when building the logic model is self-awareness (see Ariel et
al., 2017). Does wearing a body-worn camera (the activity) increase an officer’s self-awareness, resulting in more
socially desirable behavior on the part of the officer (short-term outcomes) and fewer physical altercations between
officers and citizens (medium-term outcomes)? Another possibility is this: Does the cognizance of being filmed
increase a citizen’s self-awareness, thus causing a “cooling off” effect that results in fewer altercations? Or perhaps
there is an interaction effect from both citizen and officer? (Ariel et al., 2017).

In a theory-driven evaluation, then, we not only want to know if the program was effective but also want to
consider how key constructs in our logic model—our “working theory” of the program in the context in which the
program was implemented—are linked to each other empirically and whether the empirical patterns we observe
correspond to the expected linkages among the constructs (Funnell & Rogers, 2011). Note that this does not
mean the theory is explicitly part of the logic model, but that it is implicitly considered when designing the logic
model. Theory development and understanding of the mechanisms of change will deepen as the accumulation of evaluations and other studies in a particular program or policy area adds to the body of knowledge, and that understanding is the backdrop to the logic model.

In Chapter 3, we expand on construct validity; one element of construct validity is the extent to which the
empirical relationships among variables in our evaluation correspond with the expected/theoretical relationships
among corresponding constructs as described in the logic model.

Systematic Reviews
Typically, assessing appropriateness involves comparing the program logic with other examples of programs that
have tackled the same or similar problems. Evaluators who are interested in program theories have taken advantage
of our growing ability, with electronic databases, to compare and assess large numbers of evaluations or research
studies that have already been completed. A systematic review of the results from evaluations in a given area is called a meta-analysis, which is distinguished from meta-evaluation in that the latter involves the critical assessment (as opposed to systematic review) of one or more completed evaluation projects. In program evaluation, systematic
reviews (that is, meta-analyses) can be done in a subfield (e.g., health-related programs that are focused on
smoking cessation) to synthesize the key findings from a large number of studies, even offering quantitative
estimates of the aggregate effects of interventions.
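For readers who want a concrete sense of what a quantitative aggregate estimate involves, the sketch below shows a minimal fixed-effect, inverse-variance pooling of study-level effects. It is illustrative only: the study labels, effect sizes, and standard errors are invented, and real syntheses involve far more elaborate screening, coding, and heterogeneity analysis.

```python
# Minimal fixed-effect meta-analysis sketch: pool study-level effects by
# inverse-variance weighting. Study labels and numbers are hypothetical.
from math import sqrt

studies = [
    # (label, effect estimate, standard error) -- invented for illustration
    ("Study A", 0.30, 0.10),
    ("Study B", 0.15, 0.08),
    ("Study C", 0.22, 0.12),
]

weights = [1 / se**2 for _, _, se in studies]             # w_i = 1 / SE_i^2
pooled = sum(w * e for (_, e, _), w in zip(studies, weights)) / sum(weights)
pooled_se = sqrt(1 / sum(weights))                        # SE of the pooled effect

print(f"Pooled effect: {pooled:.3f} (SE = {pooled_se:.3f})")
```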

As we have mentioned, there are several large-scale collaboratives that have been working on systematic reviews
since the early 1990s. The best known of these is the Cochrane Collaboration (2018), begun in 1993 with the
goal of conducting systematic reviews and syntheses of randomized controlled trials in the health field. Their web-
based and searchable systematic reviews are intended for policymakers and program designers worldwide. Another
recognized collaboration is the Campbell Collaboration (2018), named after Donald T. Campbell, a well-known
evaluator who was associated with applications of experimental and quasi-experimental research designs for social
programs and policies. Begun in 1999, the Campbell Collaboration focuses on systematic reviews of evaluations of
programs and policy interventions in education, criminal justice, and social welfare. Although the collaboration
does include syntheses of qualitative studies, its main emphasis is on experimental and quasi-experimental
evaluations.

Within evaluation, there are also growing numbers of ad hoc systematic reviews that can be found in academic
journals. These are intended to synthesize evaluations in a particular field or subfield with a view to describing
underlying patterns, trends, strengths and weaknesses, and key findings. Cubitt et al. (2016) and Maskaly et al.
(2017), for example, did systematic reviews of body-worn camera research that are useful for guiding further
evaluation research.

Another example of a systematic review was published in 2003 and focused on early childhood development
programs in the United States (Anderson, Fielding, Fullilove, Scrimshaw, & Carande-Kulis, 2003). The reviewers
searched five different computerized databases, looking for studies that included early childhood development–
related keywords in the titles or abstracts, were published between 1965 and 2000, and included some kind of
comparison group research design (either program vs. control group or before vs. after). The team began with 2,100
articles and, by the time they completed their screening process, ended up with 23 reports or publications (based
on 16 studies) that met all their search criteria. Among the products of this review was a logic model that offers an
overall synthesis of the cause-and-effect linkages that were supported by evidence in the studies. We have
reproduced this logic model in Figure 2.1.

Figure 2.1 Logic Model Synthesizing Key Causal Linkages Among Early Childhood Education Programs

Source: Adapted from Anderson, L. M., Fielding, J. E., Fullilove, M. T., Scrimshaw, S. C., & Carande-Kulis,
V. G. (2003).

This logic model, though it differs from the template we generally use in this textbook, can be viewed as a visual
representation of the theory of how early childhood education programs work, based on what we know. The
model, based on the synthesis conducted by Anderson et al. (2003), could be a helpful resource to those designing
specific early childhood education programs. Because it is a synthesis, however, one limitation is that it does not
offer us guidance on how local contextual factors would affect the workability of a specific program.

Program logic models are usually specific to a context. Programs and the logic models we construct to depict them
must take into account local factors (knowledge, experience, and program structures that reflect adaptations to the
local organizational, political, and social contexts), as well as embedded program theories that are more general in
scope and application. When we evaluate programs and, in particular, when we assess the cause-and-effect links in
logic models, we are typically examining the evidence-informed theories that are reflected in the program structure
at hand, combined with the embedded local contextual factors. We further examine this idea in what follows.

Contextual Factors
Our earlier discussion of open systems approaches to constructing logic models asserted that programs operate in
environments, and outcomes are generally intended to have their ultimate effect outside the program from
which they originate. In that case, what are some of the factors in the environments of programs that can offer
opportunities and constraints as programs work to achieve their objectives?

Table 2.4 summarizes some of the factors that exist in the environments of programs, which condition their
success or failure. The factors listed have been divided into those that are internal to the public sector and those
that are not. The list is not intended to be exhaustive but, instead, to alert evaluators to the fact that programs and
their evaluations occur in a rich and dynamic environment, which must be taken into account in the work that
they do.

Table 2.4 Examples of Factors in the Environments of Programs That Can Offer Opportunities and Constraints to the Implementation and Outcome Successes of Programs

Factors in the Public Sector:
Other programs
Senior executives
Other departments/agencies
Other governments or levels of government
Funding agencies
Elected officials
Regulatory agencies
Courts and tribunals
Changes to laws, rules, protocols, or changes in government

Factors in Society:
Clients^a
Interest/advocacy organizations or individuals
Media, including mass media and social media
Private-sector organizations, particularly for public–private partnerships
Nonprofit organizations that are working in the same sector as the program in question—increasingly there is an expectation of collaboration between government and nonprofit program providers
Exogenous trends and events, such as fiscal constraints, catastrophic events, events of viral impact

a. Note: Some programs have outcomes that are focused within a department or across departments of a
government, in which case, the clients would be in the public sector. An example might be a program to
address workplace harassment.

The key point to keep in mind about context is that while there will be far too many contextual factors to consider
them all while building the logic model, the program does not exist in a vacuum, and important contextual factors
should be considered for the logic model and be mentioned in the evaluation report. We will expand on this idea
next, when discussing another theory-related approach to evaluation: realist(ic) evaluation.

Realist Evaluation
Realist evaluation (Pawson, 2002a, 2002b, 2006, 2013; Pawson & Tilley, 1997) began in the late 1990s as a
critique of the “black box” approach to social interventions that was seen as dominating evaluation theory and
practice, particularly when experiments or quasi-experiments were conducted. Like other proponents of theory-
driven evaluations, Pawson and Tilley (1997) have argued for unpacking the program box and examining the
conditions under which a given program might be expected to work. But instead of building logic models to
elaborate cause-and-effect linkages within a given program structure, realist evaluators focus on configurations
they call context–mechanism–outcomes (CMOs). For them, causes and effects in programs are always mediated
by the context in which a program is implemented. They maintain, “Cause describes the transformative potential
of phenomena. One happening may well trigger another but only if it is in the right condition in the right
circumstances” (p. 34). It is a potential mistake, for example, to assume that clients are homogeneous. Client
motivation is a contextual variable; for clients who are motivated to participate in a given program, the program
will be more likely to be successful. Contextual factors can include program-related conditions, organizational
conditions, political context, cultural context, individual factors such as age or gender, and more. Mechanisms are
the underlying theory-related factors that contribute to triggering the causal relationships.

Realist evaluators believe that what we need to do is develop program knowledge that is based on the CMOs that
are associated with program successes and failures. If we want programs or policies to make a difference, then
evaluators must understand why they work, that is, what mechanisms are afoot and under what conditions they
operate to actually explain why a given link in a causal model of a program works.

Figure 2.2 depicts a basic context–mechanism–outcomes diagram adapted from Pawson and Tilley (1997, p. 58).

Figure 2.2

Source: Adapted from Pawson, R., & Tilley, N. (1997), p. 58.

We bring this approach to your attention for three reasons:

1. As evaluators, it is important to be aware of the various approaches that propose ways to bring the change
mechanism(s) or theories into the evaluation design conversation (e.g., theory-driven evaluation [Chen,
1990]; program theory [Funnell & Rogers, 2011]; realist evaluation [Pawson & Tilley, 1997]; theory of
change [Weiss, 1995]).
2. It is undeniably worthwhile to carefully consider the context and underlying mechanisms when constructing
the logic model and designing or evaluating a program. Although the overall context and the behavioral
mechanisms can be difficult or impossible to measure, they are part of the explanatory factors that underpin
the extent to which a program or policy fails or succeeds at eliciting hoped-for individual or societal

96
outcomes.
3. The context–mechanism–outcome diagram is deceptively simple, but in practice, it is difficult to include even
a broad selection of the possible mechanisms and contextual variables in a logic model—and even in an
evaluation itself. The mechanisms themselves are typically not measurable but become more evident over
time as the body of evaluative literature builds. It can be overwhelming to try to diagram and design an
evaluation that includes all possible mechanisms and contextual factors that may be driving the outcomes.

Realist evaluators argue that unless we understand the mechanisms in play and how they operate—or do not—in
particular contexts, we will not be able to understand, in a finer-grained way, why programs work or do not work.
Proponents of realistic evaluation would argue that by focusing on CMOs, we will be able to increase the
likelihood of program success. However, even the seemingly basic case of measuring the effects—on officers and citizens—of police body-worn cameras, in various communities and under various policies (e.g., high or low police discretion about when the camera is turned on), illustrates that an understanding of the mechanisms and the impacts of various contextual factors builds only slowly over time, as more and more evaluations occur (Maskaly et al., 2017).

Maxfield et al. (2017, pp. 70–71), based on the literature review by Lum et al. (2015), list the following examples of mechanisms that have been suggested by the body-worn camera literature so far:

1. The “self-awareness” mechanism . . .


2. The “oversight” mechanism . . .
3. The “compliance” mechanism . . .
4. The “rational choice” mechanism . . .
5. The “symbolic action” mechanism . . .
6. The “expectation” mechanism.

In addition to the examples of mechanisms, Maxfield et al. (2017) provide the following list of contexts to
consider (pp. 71–72):

1. The “community-based” context . . .


2. The “trigger” context . . .
3. The “culture” context . . .
4. “Subculture” context . . .
5. The “policy” context . . .
6. The “political” context . . .

These lists underscore the challenges of applying realist evaluation—the permutations and combinations of
contexts and mechanisms identified so far, for body-worn camera programs, suggest a research program that will
extend well into the future. In the meantime, police departments and policymakers need to make decisions based
on what is known: that knowledge, based on the evaluations done so far, is not definitive but is sufficient to
inform.
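
To give a rough sense of that combinatorial challenge, the sketch below simply crosses the two lists above: six mechanisms and six contexts already yield 36 candidate context–mechanism pairings to probe, before outcomes, outcome measures, or interactions among contexts are even considered. The labels are abbreviated from the lists above; this is an illustration, not an analysis plan.

```python
# Rough illustration of how quickly candidate CMO configurations multiply.
# Labels are abbreviated from the mechanism and context lists above.
from itertools import product

mechanisms = ["self-awareness", "oversight", "compliance",
              "rational choice", "symbolic action", "expectation"]
contexts = ["community-based", "trigger", "culture",
            "subculture", "policy", "political"]

pairings = list(product(contexts, mechanisms))
print(f"{len(pairings)} context-mechanism pairings to consider")  # 36
for context, mechanism in pairings[:3]:
    print(f"In a '{context}' context, does the '{mechanism}' mechanism operate?")
```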

Putting Program Theory Into Perspective: Theory-Driven Evaluations and
Evaluation Practice
Evaluation theory and, in particular, theory-driven evaluations have been a part of our field since the early 1990s
(Chen, 1990). Funnell and Rogers (2011), in their book on logic models and theories of change, have identified
theories of change for programs that focus on people. These theories can focus on individuals, families, groups,
organizations, communities, provinces, states, or even countries and are amalgams of theoretical perspectives from
different social science disciplines, as well as subfields of evaluation. Interest in the theories that are reflected in
logic models is emerging as a subfield in evaluation. Because program logic models usually focus on intended
changes to clients, work groups, organizations, and other units of analysis in the public and nonprofit sectors,
evaluators are working on classifying the theoretical models embedded in program logics as change-related
theories. Discerning overarching patterns that emerge when we step back from specific program logics can be
viewed as a form of meta-analysis, although the process of building theories of change also involves integrating
knowledge from social science disciplines, such as psychology, sociology, economics, and political science (Pawson,
2013).

What Funnell and Rogers (2011) and others (e.g., Patton, 2010; Pawson, 2013; Stame, 2010, 2013) are now
doing is moving beyond evaluation as a transdisciplinary process to conceptualizing evaluation in substantive
theoretical terms. In other words, they are addressing the question, What do we know about the substantive theories
or theoretical mechanisms that help us design and implement programs that are effective?

There is still much to be learned, however. In a recent systematic review of the practice of theory-driven evaluation
from 1990 to 2010, Coryn, Schröter, Noakes, and Westine (2011) assessed the extent to which 45 evaluations
that claim to be about theory-driven evaluation actually reflect the core principles and practices of such
evaluations, including elaborating and testing the theory or theories that are embedded in a program structure.
What they found suggests some substantial gaps between the theory and practice of theory-driven evaluations.
With respect to theory-guided planning, design, and execution of the evaluations in the sample, they concluded
this:

In many of the cases reviewed, the explication of a program theory was not perceptibly used in any
meaningful way for conceptualizing, designing or executing the evaluation reported and could easily
have been accomplished using an alternative evaluation approach (e.g. goal-based or objectives-
oriented). (p. 213)

Similarly, Breuer, Lee, De Silva, and Lund (2016) conducted a systematic review of the use of theory of change
(ToC) in the design and evaluation of public health interventions. Of the 62 papers that fit their criteria, they
found that “In many cases, the ToC seems to have been developed superficially and then used in a cursory way during
the evaluation” (p. 13).

This suggests that although program theory is an emerging issue for evaluators, significant advances are still needed in pragmatically establishing and testing the links in program logics as localized
instances of program theories. Building a more coherent knowledge base from ongoing research will add to the
viability of using program theory to systematically inform the development, implementation, and evaluation of
programs.

We have introduced program theories in Chapter 2 to underscore the importance of taking advantage of existing
evidence in the design and implementation of programs. Program logic models can sometimes take into
consideration program theory that is aimed at particular organizational and community settings, although there
will be local factors (organizational political issues, for example) that affect program design and program
implementation. When we are examining the intended causal linkages in logic models, we are assessing how well
that theory (as rendered in a particular context) holds up in the settings in which it has been implemented. Our main focus will continue to be on developing and testing, as best we can, the logic models that are typically
developed by evaluators and stakeholders as a practical way to describe the intended transformation of resources to
intended outcomes for specific programs.

Logic Models That Categorize and Specify Intended Causal Linkages
Figure 2.3 illustrates a logic model for a program that was implemented as an experiment in two Canadian
provinces in 1993, the results of which are still used as a basis for ongoing antipoverty research. The Self-
Sufficiency Project (Michalopoulos et al., 2002) was intended to test the hypothesis that welfare recipients (nearly
all program and control group participants were single mothers) who are given a monetary incentive to work will
choose to do so. The incentive that was offered to program recipients made it possible for them to work full time
and still receive some financial assistance support for up to 3 years. The program was implemented in British
Columbia (on Canada’s west coast) and New Brunswick (on Canada’s east coast). Social assistance recipients from
the two provinces were pooled, and a random sample of approximately 6,000 families was drawn. Each family was
approached by the evaluation team and asked if they would be willing to participate in a 3-year trial to see whether
an opportunity to earn income without foregoing social assistance benefits would increase labor force participation
rates.

Figure 2.3 The Income Self-Sufficiency Program: Logic Model

Most families agreed to participate in the experiment, acknowledging they might be assigned to the program group
or the control group. They were then randomly assigned to either a program group or a control group, and those
in the program group were offered the incentive to work. Each program family had up to 12 months to decide to
participate—social assistance recipients needed to find full-time employment within the first 12 months after they
were included in the program group to qualify for an income supplement. The income supplement made it
possible for recipients to work full time while retaining part of their social assistance benefits. Since the benefits
were determined on a monthly basis, for each month he or she worked full time, a participant would receive a
check that boosted take-home earnings by approximately 50% on average. If full-time employment ceased at any
point, persons would continue to be eligible for social assistance at the same levels as before the experiment began.

Persons in the control group in this study were offered the same employment placement and training services that
were available to all income assistance recipients, including those in the program group. The only difference
between the two groups, then, was that only the program group was eligible for the supplement.

The logic model in Figure 2.3 shows a straightforward program. There is one program component and one
implementation activity. In Figure 2.3, there is no separate column for inputs. In this program, the principal input
was the financial incentive, and this is reflected in the one program component.

In the model, the program outputs all focus on ways of counting participation in the program. Families who have
participated in the program—that is, have made the transition from depending solely on social assistance for their
income to working full time—are expected to experience a series of changes. These changes are outlined in the
intended outcomes. The three short-term outcomes identify key results that follow immediately from the program
outputs. Once a social assistance recipient has decided to opt into the program, he or she is expected to search for
full-time paid work (30 hours a week or more) and, if successful, to give up his or her monthly social assistance
payments. Participants are eligible for an income supplement that is linked to continued participation in the
program—if they drop out of the program, they give up the income supplement and become eligible for social
assistance payments again. The three short-term outcomes summarize these changes: (1) increased full-time
employment for program participants; (2) increased cash transfer payments connected with the income
supplement incentive; and (3) reduced short-term use of income assistance.

There are two more columns of intended outcomes in the logic model, corresponding to an expected sequence of
results that range from medium- to longer-term outcomes. Each short-term outcome is connected to a medium-
term outcome, and they, in turn, are connected to each other. The one-way arrows linking outputs and outcomes
are meant to convey intended causal linkages among the constructs in the model. Increased stable employment is
clearly a key outcome—it follows from the previous outcomes, and it is intended to cause the longer term
outcomes in the model. The overall longer term objectives for the program are depicted as the three longer term
outcomes: (1) reduced poverty, (2) reduced return to income assistance, and (3) increased tax revenues for
governments.

If the program actually operates the way it is represented in the logic model in Figure 2.3, it will have achieved its
intended outcomes. The logic model shows what is expected. The challenge for the evaluator is to see whether
what was expected actually occurred.

The framework we introduced in Table 2.2 is a template that can guide constructing logic models for many
different programs. The way the logic model for the Self-Sufficiency Program looks is specific to that program.
However, logic models have some common features.

Features of Logic Models

Some logic models will have only one component; many will have multiple components, depending on how complicated the program is.
Where a program has multiple components, there will be at least one implementation activity for each component.
Each component will have at least one output, but there can be several outputs for particular components.
Each logic model will have its own configuration of outcomes, with some having short-, medium-, and longer-term outcomes, and others having outcomes that are all expected to occur at the same time.
Each short-term outcome needs to be connected to one or more subsequent outcomes.
Although there is no requirement that the causal arrows depicted in logic models all have to be one-way arrows, using two-way
arrows complicates program logic models considerably. Assessing two-way causal linkages empirically is challenging.
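
A minimal sketch of how the features above can be treated as simple consistency checks on a draft model is shown below. The construct names are invented for illustration (loosely echoing the Meals on Wheels example in Appendix A of this chapter), and a real logic model review would rest on stakeholder discussion rather than code; the point is only that, structurally, a model is a set of categorized constructs plus one-way links that can be checked for repeated constructs and for outputs or short-term outcomes with no onward connection.

```python
# Hypothetical sketch: a logic model as constructs plus one-way causal links,
# with two simple consistency checks drawn from the features listed above.
from collections import Counter

constructs = [  # (name, category) -- names are invented examples
    ("Deliver hot meals", "activity"),
    ("Meals delivered", "output"),
    ("Improved nutritional intake", "short-term outcome"),
    ("Improved health", "medium-term outcome"),
]
links = [  # one-way intended causal linkages (from, to)
    ("Deliver hot meals", "Meals delivered"),
    ("Meals delivered", "Improved nutritional intake"),
    ("Improved nutritional intake", "Improved health"),
]

# Check 1: a construct should not be repeated elsewhere in the model.
name_counts = Counter(name for name, _ in constructs)
duplicates = [name for name, count in name_counts.items() if count > 1]

# Check 2: each output and short-term outcome should link onward to a later construct.
sources = {src for src, _ in links}
dangling = [name for name, category in constructs
            if category in ("output", "short-term outcome") and name not in sources]

print("Repeated constructs:", duplicates or "none")
print("Constructs with no onward link:", dangling or "none")
```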

In what follows, we introduce an example of a logic model based on constructs found in a selection of police body-worn-camera studies (Ariel et al., 2017; Lum et al., 2015; Maskaly et al., 2017; White, 2014).

Figure 2.4 Logic Model for the Body-Worn Cameras Program

Remember that logic models should offer a succinct visual image of how a program is expected to achieve its
intended outcomes. Working-group development of a logic model of a program is often of benefit to program
managers and other stakeholders, who may have a good intuitive understanding of the program process but have
never had the opportunity to discuss and construct a visual image of their program. If we look at the original
Rialto study (Ariel, Farrar, & Sutherland, 2015)—the effect of police body-worn cameras on use of force and
citizens’ complaints against the police—we can pull out the following constructs that could be used in a logic
model:

Inputs:
body-worn cameras, a web-based computerized video management system, and the Rialto Police Department’s incident tracking
system
Activities:
To track and record, electronically and with written police reports, the following: police shifts, use-of-force incidents (and details),
and citizen complaints
Outputs:
(i) cameras deployed, (ii) number of contacts between police officers and the public
Outcomes:
(i) citizen complaints, and (ii) unnecessary/excessive and reasonable use of force
Context:
(from research): “Mistrust and a lack of confidence may already characterize some communities’ perception of their local police force” (p. 510)

The Rialto study also discusses possible “situational, psychological, and organizational” (p. 512) strands of research and suggests possible theoretical constructs, such as “self-awareness and socially-desirable responding” (p. 511) and “deterrence theory” (p. 516), but notes that “how body-worn-cameras may be used to affect behavior and—specifically—that of police officers, is as yet unknown” (p. 517). This example is illustrative of the logic of a program implemented in one community, and that program and its evaluation have become the foundation of an emerging subfield. As of early 2018, that study had been cited over 200 times in further studies and reviews.

Constructing a Logic Model for Program Evaluations
The evaluator plays a key role in the process of logic model creation. It is not uncommon for program planners
and operators to be unfamiliar with logic models, and to have difficulty developing them. It may be necessary and
appropriate for the evaluator to explain what a logic model is and how it clarifies the structure of the program and
its objectives. Then, as the model is developed, the evaluator synthesizes different views of the program structure,
including the view(s) in current documents, and offers a visual interpretation of the program. The evaluator’s
background and familiarity with the type of program being evaluated play a part in the development of the logic
model.

Within the inputs, components, activities, outputs, and outcomes framework, the process of representing a
program as a logic model is typically iterative, relying on a combination of activities:

Reviewing any documentation that describes the program and its objectives (policy documents, legislative
mandates, working papers, memoranda, research studies, etc.)
Reviewing studies of related programs, including systematic reviews or meta-analyses
Meeting with the program managers to learn how they see the inputs, purposes, and activities of the
program
Meeting with other stakeholders, in situations where the program is funded or mandated
intergovernmentally or interorganizationally
Drafting a logic model
Discussing it with program managers/other stakeholders
Revising it so that it is seen as a workable model of the intended processes and outcomes of the program
Affirming that the logic model is adequate for the evaluation that is being undertaken

Ultimately, the evaluator is striving for a workable logic model—there will be “rough edges” to most logic models,
and there may not be complete agreement on how the model should look. It is essential to keep in mind that a
workable logic model will be detailed enough to represent the key parts of the program and the main causal
linkages, but it cannot hope to model all the details. Ron Corbeil (1986) has pointed out that evaluators can
succumb to “boxitis”—that is, a desire to get it all down on paper.

Helpful Hints in Constructing Logic Models

Brainstorming can be useful in group settings to build first drafts of logic models. If this approach is feasible, thoughtfully assemble an organizational and/or stakeholder team that can work together to discuss the program’s key outcomes and objectives, and the inputs and activities expected to lead to the outcomes. Be familiar with program documentation for ideas.
When you are building a first draft of a logic model, there are several ways to brainstorm a program structure. One is to begin
with a written description of the program (perhaps from a document that has been posted on a website), and as you are reading
the description, make notes. As you encounter sentences that describe a part of the program (e.g., components or results) or
describe how one part of the program is connected to another, write those down as phrases on a piece of blank paper and put a
circle around each one. By the time you are done, you will have a set of circles that you can provisionally connect to each other in
ways that are consistent with your understanding of the program description. That is your first version of the program structure.
You can reorganize the structure to reduce linkages that cross each other and work toward a first draft. You may have to read the
description and adjust the model several times.
When you are constructing logic models, state the constructs as simply and as precisely as possible; keep in mind that later on
these constructs will usually need to be measured, and if they are not stated clearly, the measures may be invalid.
Generally put one construct in each box in your logic model. This is especially useful in describing the outcomes of a program. If
more than one construct is included in a box, then it will be less clear how each construct is connected with other parts of the
model, as well as how each construct is connected to others in the box.
Constructs cannot cause themselves in logic models—it is necessary for you to decide where a given construct fits in the logic, and
having done so, do not repeat that construct at a later point in the model. It is a common mistake to have a construct as an
output and then, with minor rewording, include it as a short-term outcome. Another common mistake is to describe an
implementation activity and then repeat it later in the logic model. Constructs must be distinct throughout the logic model,
otherwise the model will be confusing. One way to tell whether constructs are distinct is to ask how they might be measured—if
the same measure (e.g., the same cluster of questions on a survey) seems appropriate for two or more constructs, you have
conceptual overlap that needs attention.
A useful tool is to write words or phrases down on sticky notes, and after having read the program description, have the group
place those on a large piece of paper or a whiteboard in roughly the way they are connected in the program. Connect the sticky
notes with arrows to indicate the intended causal linkages. Revise the structure to simplify it as needed, and transfer the working
logic model to one piece of paper.
Various approaches can be used iteratively to build a first draft of the model.

Program logic models are constructed using a variety of methods and information sources, but overall, the process
is essentially qualitative and involves the exercise of considerable judgment—it is a craft.

Logic Models for Performance Measurement
Program evaluation and performance measurement are complementary ways to evaluate programs. The same core
tools and methods that are useful for program evaluation are also useful for performance measurement. The way
we have structured this textbook is to introduce program logics (Chapter 2), research designs (Chapter 3),
measurement (Chapter 4), and qualitative methods (Chapter 5) as core methods for program evaluation, pointing
out along the way how these methods and tools are also useful for understanding, developing, and implementing
performance measurement systems. In Chapter 8, we take a deeper dive into performance measurement and show
how it fits into the performance management cycle that we introduced in Chapter 1. Increasingly, evaluators are
expected to play a role in developing and implementing performance measurement systems. This is, in part,
because of their expertise and, in part, because many organizations want performance measurement systems to
yield data that are also useful for program evaluations (Treasury Board of Canada Secretariat, 2016a).

Logic models are an important tool in developing performance measures for programs. Because logic models
identify key components, activities, outputs, and outcomes, and the linkages among them, logic models can be
used to frame discussions of what to measure when setting up a performance measurement and monitoring system. Often, performance measurement systems are focused on outcomes; the rationale for that approach is that
outcomes are the best indication of whether a program has delivered value to its stakeholders. Performance
measurement can be coupled with performance targets so that actual results can be compared with intended results
with a view to identifying gaps and, hence, a need to assess why those gaps occurred.
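
As a minimal sketch of that target-versus-actual comparison, consider the fragment below. The measure names, targets, and actuals are hypothetical; in practice the measures would be drawn from the outputs and outcomes in the program’s logic model, and the interesting work begins after a gap is flagged, when the organization asks why it occurred.

```python
# Hypothetical sketch of comparing actual results with performance targets
# for outcome-focused measures drawn from a program logic model.
measures = [
    # (performance measure, target, actual) -- values invented for illustration
    ("Clients stably housed after 12 months (%)", 80, 74),
    ("Clients exiting to independence (%)", 25, 27),
]

for name, target, actual in measures:
    gap = actual - target
    status = "target met" if gap >= 0 else f"gap of {abs(gap)}"
    print(f"{name}: target {target}, actual {actual} ({status})")
```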

Identifying key performance measures (sometimes called key performance indicators, or KPIs) has become a
significant issue for public-sector and nonprofit organizations. One strategy for deciding what to measure is to use
a framework or template that provides structured guidance. It is becoming common for national and subnational
governments to expect that entities will develop performance measurement systems that are integrated with the
outputs and outcomes that are included in the program’s logic model. This allows for the tracking and
accountability needed to help confirm that the expected results are occurring. As well, it is thought to reinforce
organizational alignment with government’s broader objectives. In cases where programs or policies are too complex for any one organization to be held singularly accountable for outcomes (outcomes may, for example, be affected by many factors, including external influences), organizations may be held accountable for at least the delivery of certain output targets.

Figure 2.5 displays a logic model that is part of the “system planning framework” for a program created by the
Calgary Homeless Foundation (2017, p. 20). The organization works with other Calgary NGOs, and its vision is
“Together, we will end homelessness in Calgary” (Calgary Homeless Foundation website). This model shows how
inputs for the program are converted into outputs, an outcome, and a longer-term impact. If we look at Figure 2.5,
the foundation’s work is aimed at improving the likelihood that “clients will remain stably housed,” with the
ultimate goal of “independence from the system” (p. 20).

Figure 2.5 Logic Model for the Calgary Homeless Foundation

Source: Calgary Homeless Foundation (2017). Used with permission.

The logic model includes several clusters of activities, all of which are potentially measurable. There are linkages
from three outputs (one of which is a cluster), leading to one outcome and one impact. In designing a
performance measurement system for Calgary homelessness organizations, “clients will remain stably housed” is a
key construct and would be a candidate for a KPI.

Strengths and Limitations of Logic Models
Conceptualizing programs as open systems that we can represent with logic models has the advantage of
facilitating ways of communicating about programs that words alone cannot. Visual models are worth “a thousand
words,” and we have come to rely on logic models in our work as evaluators. Internationally, logic modeling is
widely viewed as a language that makes it possible to describe programs in practically universal terms (Montague,
2000).

When we are describing programs, most of us expect to see an open system that can be depicted as a logic model.
We proceed as if that assumption is true. We work to fit programs into a logic modeling framework, yet there may
be pitfalls in too easily assuming we have done justice to the program itself and its environment. In Chapter 12,
we discuss the role that expectations play in our work as evaluators. Our contention is that we are all affected by
our expectations, and they can make a difference in what we “see” when we are doing our work.

Program logics do three things: (1) they categorize organizational work that is focused around values expressed as
program objectives, (2) they describe expected cause-and-effect linkages that are intended to achieve program
outcomes, and (3) they distinguish what is in the program from what is in its environment. Program components
facilitate our communication about the program and often correspond with the way the organization that delivers
the program is structured. Program components can turn out to be branches or divisions in a department or, at
least, be work units that are responsible for that component.

Program logic models that correspond to organizational charts can be useful when we are focusing on authority,
responsibility, and accountability. The models encourage an administratively rational view of how programs are
implemented. They are often used to develop organizational-level performance measures in situations where
public performance reporting for alignment and accountability is required. In Chapter 9, we will look at program
logic modeling for whole organizations. What we are talking about in Chapter 2 is logic modeling for programs
and policies. Complex organizations can include an array of programs.

Program managers are in a position where they can be expected to achieve success—programs are usually
announced as solutions to politically visible problems, but the “state of the art” in terms of understanding program
causal links (and being able to predict outcomes once the program is implemented) is simply not going to yield a
set of engineering principles that will always work, or even work most of the time. Lack of certainty around
program causal links, coupled with resource constraints, can be a scenario for program results that may not
“deliver” the outcomes expected. Under such conditions, is it fair to hold managers accountable for program
outcomes? Where do we draw the line between managerial responsibility (and, hence, accountability) and a more
diffused collective responsibility for solving societal problems through the use of our programs?

Organizational charts can be misleading when we think about program objectives that span organizational
boundaries or even government jurisdictions. Think again of a program to house homeless people, as an example.
Typically, homelessness is not just a problem of having no place to live. Many who are homeless also have mental
illnesses and/or have addiction problems. To address the needs of these people, programs would need to offer a
range of services that include housing but extend to addictions treatment, psychological services, health services,
and perhaps even services that liaise with police departments. Rarely are existing administrative structures set up to
do all these things. Homelessness programs usually reach across organizational boundaries, even across
governments, and require putting together administrative structures that include ways to link existing
organizations. The need for alignment between existing organizational structures and the policy and program
problems that need to be addressed is one reason why programs are complex.

As we have indicated, one way to answer these questions is to distinguish outputs from program outcomes. As seen
in Table 2.1, outputs are categorized as part of the program, whereas intended outcomes are arrayed separately,
from short-term to long-term outcomes. Program managers are generally willing to be held accountable for
outputs (the work done in a program) because outputs are typically more controllable by those delivering the
program. In fact, in some jurisdictions, there has been an emphasis on organizations being held accountable for outputs, and elected officials being held accountable for outcomes (Gill, 2008, 2011). Being held accountable for
outcomes can be problematical because other factors in the environment besides the program can influence the
outcomes.

Logic Models in a Turbulent World
When we think of logic models, we usually think of a “snapshot” of a program that endures. We want our logic
models to be valid descriptions of programs during and after the time we are doing our evaluations. Michael
Patton (2011) has suggested that when we do a formative or summative evaluation, we generally assume that we
have a stable program structure with which to work. A formative evaluation would aim to improve the program
structure (or the program implementation), and a summative evaluation would focus on assessing the merit and
worth of the program.

But what happens if the program environment is turbulent? What if program managers and other stakeholders are
constantly trying to adapt the program to changing circumstances? The program structure may not be settled but,
instead, continue to evolve over time. Patton (2011) offers an evaluation approach for complex settings in which
organizations and their environments are co-evolving. Developmental evaluation is intended for situations where
it is acknowledged that the program and the organization in which it is embedded are constantly changing—even
the objectives of the organization and its programs are subject to revision. In these circumstances, logic models
would, at best, capture the state of play only for a short period of time. Consequently, in situations like these, logic
models may be less valuable than narrative descriptions of programs that capture the key threads in the
organizational development and decision-making process.

Summary
Logic models are visual representations of programs that show how resources for a program are converted into activities and,
subsequently, into intended results. Program logic models provide evaluators, program managers, and other stakeholders with a visual
image that does two basic things: (1) it divides the program delivery process into categories—inputs, components, implementation
activities, and results (outputs and outcomes)—and (2) it displays the intended causal linkages. Based on an open systems metaphor, logic
models can also distinguish between what is in the program and what is in the program’s environment. Typically, program outcomes are
expected to have impacts in the program’s environment, consistent with the intended objectives.

Program logics are an important foundation for an evaluator’s efforts to understand whether and in what ways the program was effective
—that is, whether the program actually produced the outcomes that were observed and whether those outcomes are consistent with the
program objectives. They are also very useful for developing performance measurement systems to monitor program outputs and
outcomes. They assist evaluators by identifying key constructs in a program that are candidates for being translated into performance
measures.

Constructing logic models is an iterative process that relies on qualitative methods. Evaluators typically view documents that describe a
program, consult stakeholders, and consider other sources of information about the program as the logic model is being drafted.

Implicit in a logic model is a mix of local contextual factors that condition the options that are considered and ultimately are ranked when
programs or policies are designed or revised. As well, program designers and evaluators call on factors that reflect the underlying program
theory or theories. Program theory approaches are becoming a more prominent topic in program evaluation as evaluators and other
stakeholders wrestle with the questions, “What works? When does it work? How does it work?” (Astbury & Leeuw, 2010). Program
theories are informed by the results of prior program evaluations, as well as a wide range of social science–based knowledge and evidence.

The emphasis on program theories and program mechanisms is about creating substantive causal knowledge for each program field, as well as for the evaluation field as a whole. Logic modeling has become an essential tool for evaluators and program managers alike. Where
programs are established and where program environments are stable, logic models are an efficient way to communicate program structure
and objectives to stakeholders. In environments where there is a lot of uncertainty about program priorities or a lot of turbulence in the
environmental context in which the program is embedded, logic models need to evolve to adapt to the changes.

Discussion Questions
1. Read the program description of the Meals on Wheels program in Appendix A of this chapter, and use the contents of the chapter
to build a model of the program. To make your learning more effective, do not look at the solution (at the end of the chapter)
before you have completed the logic model.

Based on your experience of constructing the logic model for the Meals on Wheels program, what advice would you give to a
classmate who has not yet constructed his or her first logic model? What specific step-by-step guidance would you give (five key
pieces of advice)? Try to avoid repeating what the chapter says logic models are (that is, components, implementation activities, outputs, and outcomes). Instead, focus on the actual process you would use to review documents, contact stakeholders, and other
sources of information as you build a logic model. How would you know you have a “good” logic model when it is completed?
2. Logic models “take a picture” of a program that can be used in both program evaluations and performance measurement systems.
What are the organizational and program conditions that make it possible to construct accurate logic models?
3. What are the organizational and program conditions that make it challenging to develop accurate and useful logic models?
4. Why is it that formulating clear objectives for programs is so challenging?
5. Knowlton and Phillips (2009) include examples of program logic models where the program clients are treated as inputs to the
program. In Chapter 2, we have argued that program clients are not inputs but, instead, are exogenous (external) to the program.
What are the advantages and disadvantages of thinking of program clients as inputs?

Appendices

Appendix A: Applying What You Have Learned: Development of a Logic
Model for a Meals on Wheels Program

Translating a Written Description of a Meals on Wheels Program Into a


Program Logic Model
Introduction. The following is a written description of a typical Meals on Wheels program. These programs are
generally intended to deliver hot meals to elderly people in a community.

Program Description. Meals on Wheels is a program that, with the involvement of volunteers, takes meals to
individuals who have difficulty cooking for themselves. The program has two primary activities: (1) meal
distribution and (2) contact with clients. These activities work together to realize Meals on Wheels’ long-term
goals: to reduce clients’ use of the health care system and to allow clients to live independently in their own
homes.

To achieve these goals, Meals on Wheels ensures that its clients have the opportunity to have meals that improve
their nutritional intake. This, in turn, improves their quality of health. Social contact is provided by the volunteers
who deliver the meals and check to see whether clients need additional assistance. The result is that the clients feel
secure, are less isolated, are well fed, and have a better understanding of good nutrition and food handling.

The success of the program is, in part, determined by the number of meals eaten and/or delivered, as well as the
number of follow-up visits and amount of time spent with each client. In providing good nutrition and
community contact, Meals on Wheels allows its clients to be healthier and better equipped to live independently
at home.

Your Task. Using the written description of the Meals on Wheels program, construct a logic model of the
program. In your model, make sure to identify the program components, the outputs, and the short-, medium-,
and long-term outcomes. Also, make sure that you connect each output to one or more short-term outcomes and
also connect the short-term outcomes to medium-term outcomes and so on, so that another person who is not
familiar with the program could see how particular constructs in your model are connected to other constructs.
Figure 2A.1 illustrates one potential model.

Figure 2A.1 Logic Model for Meals on Wheels Program

Source: Adapted from Watson, D., Broemeling, A., Reid, R., & Black, C. (2004).

Appendix B: A Complex Logic Model Describing Primary Health Care in
Canada
Primary health care in Canada is a provincial responsibility under the Canadian constitution. Each of the 10
provinces has its own system, but all are focused around public provision of primary health care services. Public
money is the principal source of funds for a wide range of programs and services that are provided. Primary health
care includes services and products that are intended to address acute and episodic health conditions, as well as
manage chronic health conditions (Watson, Broemeling, Reid, & Black, 2004). Among the services that are
included are physician visits, hospital visits, diagnostic tests, and a wide range of clinical and nonclinical health-
related activities. Primary health care makes up the bulk of the Canadian health system.

In 2004, the Centre for Health Services and Policy Research at the University of British Columbia published a
report called A Results-Based Logic Model for Primary Health Care: Laying an Evidence-Based Foundation to Guide
Performance Measurement, Monitoring and Evaluation (Watson et al., 2004). The logic model that is the
centerpiece of this report was constructed over a period of 2 years and involved a literature review of existing
performance measurement and accountability systems in Canada and elsewhere, as well as consultations with a
wide range of stakeholders across Canada, including

approximately 200 primary health care practitioners from the health regions across British Columbia,
approximately 40 academics and professional association representatives,
approximately 10 researchers and consultants,
approximately 50 primary health care leaders and evaluation specialists working for provincial and territorial
ministries of health across Canada, and
approximately 350 participants who attended a session hosted at a national conference in primary health
care in May 2004.

The logic model was built using a Treasury Board of Canada template (Treasury Board of Canada Secretariat,
2001, cited in Watson et al., 2004) that is similar to the linear logic modeling template that was described in
Chapter 1 (Figure 1.8), in which program inputs are linked with activities, outputs, and then short-, medium-,
and long-term outcomes. The Treasury Board of Canada template is reproduced in Figure 2B.1 (from Watson et
al., 2004, p. 3). One feature of the template in Figure 2B.1 is a recognition that as a program structure is extended
to include intermediate and long-term outcomes, the influence of external factors grows in importance.

Figure 2B.1 Treasury Board of Canada Results-Based Logic Model Template

Source: Watson et al. (2004, p. 3).

Figure 2B.2 is the results-based logic model for primary health care that was developed by the Centre for Health Services and Policy Research at the University of British Columbia (Watson, Broemeling, & Wong, 2009; Watson
et al., 2004). Although the model is based on a linear logic modeling template, the model is complex. It includes
two-way relationships among intermediate outcome constructs and depicts a system that involves multiple health
care providers in multiple jurisdictions across Canada. It reflects the complex systems that have evolved for service
delivery. The model depicted in Figure 2B.2 is a descriptive aggregation of primary health care in Canada. It was
created by synthesizing information from multiple lines of evidence and represents a model of how the primary
health care system in Canada is intended to work.

Figure 2B.2 Results-Based Logic Model for Primary Health Care in Canada

One intended use of the model is to serve as a template for a cross-Canada federal–provincial effort to measure
and compare the performance of the health systems in the 10 provinces. In Canada, although health is a provincial
responsibility, the federal government provides significant financial support to all the provinces. In its efforts to
ensure that the financial contributions are producing results, the federal government negotiates performance
targets that focus on particular health service–related outcomes. Minimum waiting times for particular categories
of surgeries are an example of such a performance target, and those results are compared across Canada on at least
an annual basis.

Appendix C: Logic Model for the Canadian Evaluation Society Credentialed
Evaluator Program
In 2016, the Claremont Evaluation Centre, Claremont Graduate University, conducted a formative evaluation of
the Canadian Evaluation Society Credentialed Evaluator (CE) Designation Program (Fierro, Galport, Hunt,
Codd, & Donaldson, 2016). The program was started in 2009 and is the first program internationally to offer evaluation practitioners an opportunity to apply for a professional designation. The program evaluation used a
mixed-methods approach to gather multiple independent lines of evidence. In addition to secondary sources of
data, data were obtained via surveys or interviews from current members of the Canadian Evaluation Society
(CES), former members, evaluators who had never joined the CES, CES leadership, CES Board of Directors, CES
Credentialing Board members, organizations that commission evaluations in Canada, employers of evaluators,
partners for CES, and individuals who had been involved in the process leading up to creating the CE designation.

The logic model is complicated—there is an overall linear flow, but there are multiple components and multiple
linkages among the outputs and outcomes. Accompanying the logic model, we have included a table that lists the
assumptions and external factors that help to contextualize and elaborate the logic model. In Chapter 12, we will
come back to this evaluation when we discuss professionalization.

Table 2C.1 CES Professional Designation Program Logic Model With a Focus on Early Intended Outcomes

Assumptions

CE designation is viewed as relevant to and capable of addressing needs of evaluators and others who
play important roles in the professional practice of evaluation (e.g., commissioners of evaluation,
employers of evaluators, educators, policymakers).
Evaluators and others who play important roles in the professional practice of evaluation (e.g.,
commissioners of evaluation, employers of evaluators, educators, policymakers) see the value of and
desire professionalism of the field.
Most applicants are satisfied with the application and review process and view it as credible and fair.
Most CES members are satisfied with the PDP.
There is an existing/current demand for the CE designation.
Able to maintain high enough participation of Credentialing Board and sufficient PDP
infrastructure to meet demand.
Actions taken to improve PDP processes are successful.
Means for acquiring the necessary qualifications to achieve the CE designation are available and
feasible to obtain among evaluators who desire the designation.
Availability and accessibility of relevant training to support continuing education and maintenance
of CE designation.
Desire for ongoing maintenance of CE designation over evaluator’s career.
Sufficient pool of individuals who identify professionally as evaluators and stay in the profession.
Achievement and maintenance of a critical mass of CEs.

External Factors

Extent of alignment between CE designation requirements and other existing policies, procedures,
or requirements with which practicing evaluators need to comply.
Existing level of recognition among entities beyond CES that play an important role in the
professional practice of evaluation and level of the need for and value of CES, the CE designation,
and the professionalization of the field.
Preexisting and strong professional allegiance of evaluators trained outside evaluation.
Existence of self-sufficient evaluation subcultures.
Fiscal austerity that is not conducive to professional development and staff support.

Source: Fierro, Galport, Hunt, Codd, & Donaldson (2016, pp. 70–71). Logic model and table used with permission of the Canadian
Evaluation Society.

Figure 2C.1

Source: Fierro, Galport, Hunt, Codd, & Donaldson (2016).

References
Anderson, L. M., Fielding, J. E., Fullilove, M. T., Scrimshaw, S. C., & Carande-Kulis, V. G. (2003). Methods for
conducting systematic reviews of the evidence of effectiveness and economic efficiency of interventions to
promote healthy social environments. American Journal of Preventive Medicine, 24(Suppl. 3), 25–31.

Ariel, B., Farrar, W. A., & Sutherland, A. (2015). The effect of police body-worn cameras on use of force and
citizens’ complaints against the police: A randomized controlled trial. Journal of Quantitative Criminology,
31(3), 509–535.

Ariel, B., Sutherland, A., Henstock, D., Young, J., Drover, P., Sykes, J., . . . Henderson, R. (2016). Wearing body
cameras increases assaults against officers and does not reduce police use of force: Results from a global multi-
site experiment. European Journal of Criminology, 13(6), 744–755.

Ariel, B., Sutherland, A., Henstock, D., Young, J., & Sosinski, G. (2017). The deterrence spectrum: Explaining
why police body-worn cameras ‘work’ or ‘backfire’ in aggressive police–public encounters. Policing: A Journal of
Policy and Practice, 1–21.

Astbury, B., & Leeuw, F. L. (2010). Unpacking black boxes: Mechanisms and theory building in evaluation.
American Journal of Evaluation, 31(3), 363–381.

BetterEvaluation. (2017). Develop programme theory. BetterEvaluation: Sharing information to improve evaluation.
Retrieved from http://www.betterevaluation.org/plan/define/develop_logic_model

Breuer, E., Lee, L., De Silva, M., & Lund, C. (2016). Using theory of change to design and evaluate public health
interventions: A systematic review. Implementation Science, 11(1), 1–17.

Calgary Homeless Foundation. (2017). Calgary system planning framework. Retrieved from
http://calgaryhomeless.com/content/uploads/SSPF_V116_2017-03-15.pdf

Campbell Collaboration. (2018). Our Vision, Mission and Key Principles. Retrieved from
https://www.campbellcollaboration.org/about-campbell/vision-mission-and-principle.html

Chen, H. T. (1990). Theory-driven evaluations. Newbury Park, CA: Sage.

Chen, H. T. (2016). Interfacing theories of program with theories of evaluation for advancing evaluation practice:
Reductionism, systems thinking, and pragmatic synthesis. Evaluation and Program Planning, 59, 109–118.

Cochrane Collaboration. (2018). About us. Retrieved from www.cochrane.org/about-us. Also: Cochrane handbook
for systematic reviews of interventions. Retrieved from http://training.cochrane.org/handbook

Corbeil, R. (1986). Logic on logic charts. Program Evaluation Newsletter. Ottawa: Office of the Comptroller
General of Canada.

Coryn, C. L., Schröter, D. C., Noakes, L. A., & Westine, C. D. (2011). A systematic review of theory-driven
evaluation practice from 1990 to 2009. American Journal of Evaluation, 32(2), 199–226.

Craig, P., Dieppe, P., Macintyre, S., Michie, S., Nazareth, I., & Petticrew, M. (2013). Developing and evaluating
complex interventions: The new Medical Research Council guidance. International Journal of Nursing Studies,
50(5), 587–592.

Cubitt, T. I., Lesic, R., Myers, G. L., & Corry, R. (2017). Body-worn video: A systematic review of literature.
Australian & New Zealand Journal of Criminology, 50(3), 379–396.

Dahler-Larsen, P. (2016). The changing role of evaluation in a changing society. In R. Stockmann & W. Meyer
(Eds.), The future of evaluation: Global trends, new challenges, shared perspectives. London, UK: Palgrave
Macmillan.

de Lancer Julnes, P., & Holzer, M. (2001). Promoting the utilization of performance measures in public
organizations: An empirical study of factors affecting adoption and implementation. Public Administration
Review, 61(6), 693–708.

Department for Communities and Local Government. (2016a). The first troubled families programme 2012 to
2015: An overview. London, UK: National Archives.

Department for Communities and Local Government. (2016b). National evaluation of the troubled families
programme: Final synthesis report. London, UK: National Archives. Retrieved from
https://www.niesr.ac.uk/sites/default/files/publications/Troubled_Families_Evaluation_Synthesis_Report.pdf

Fierro, L. A., Galport, N., Hunt, A., Codd, H., & Donaldson, S. I. (2016). Canadian Evaluation Society:
Credentialed Evaluator Designation Program—Evaluation report. Claremont Graduate University: Claremont
Evaluation Center. Retrieved from https://evaluationcanada.ca/txt/2016_pdp_evalrep_en.pdf

Funnell, S., & Rogers, P. (2011). Purposeful program theory: Effective use of theories of change and logic models. San
Francisco, CA: Jossey-Bass.

Gates, E. (2016). Making sense of the emerging conversation in evaluation about systems thinking and complexity
science. Evaluation and Program Planning, 59, 62–73.

Gaub, J. E., Choate, D. E., Todak, N., Katz, C. M., & White, M. D. (2016). Officer perceptions of body-worn
cameras before and after deployment: A study of three departments. Police Quarterly, 19(3), 275–302.

Gertler, P., Martinez, S., Premand, P., Rawlings, L., & Vermeesch, C. (2016). Impact evaluation in practice (2nd
ed.). New York, NY: World Bank Group.

Gill, D. (2008). Managing for results in New Zealand—The search for the “Holy Grail”? In KPMG International
(Ed.), Holy Grail or achievable quest? International perspectives on public sector management (pp. 29–40).
Toronto, Canada: KPMG International.

Gill, D. (Ed.). (2011). The iron cage recreated: The performance management of state organisations in New Zealand.
Wellington, NZ: Institute of Policy Studies.

Glouberman, S., & Zimmerman, B. (2002). Complicated and complex systems: What would successful reform of
Medicare look like? (Discussion Paper No. 8). Ottawa, Ontario: Commission on the Future of Health Care in
Canada.

Gould, S. J. (2002). The structure of evolutionary theory. London, UK: Harvard University Press.

Haynes, P. (2008). Complexity theory and evaluation in public management: A qualitative systems approach.
Public Management Review, 10(3), 401–419.

HM Treasury, Government of the United Kingdom. (2011). Magenta book: Guidance for evaluation. Retrieved
from https://www.gov.uk/government/publications/the-magenta-book

Kellogg, W. K. (2006). Logic model development guide. Michigan: WK Kellogg Foundation. Retrieved from
https://www.wkkf.org/resource-directory/resource/2006/02/wk-kellogg-foundation-logic-model-development-guide

Knowlton, L. W., & Phillips, C. C. (2009). The logic model guidebook. Thousand Oaks, CA: Sage.

Lum, C., Koper, C., Merola, L. M., Scherer, A., & Reioux, A. (2015). Existing and ongoing body worn camera
research: Knowledge gaps and opportunities. Report for the Laura and John Arnold Foundation. Fairfax, VA:
Center for Evidence-Based Crime Policy, George Mason University.

Maskaly, J., Donner, C., Jennings, W. G., Ariel, B., & Sutherland, A. (2017). The effects of body-worn cameras
(BWCs) on police and citizen outcomes: A state-of-the-art review. Policing: An International Journal of Police
Strategies & Management, 40(4), 672–688.

Maxfield, M., Hou, Y., Butts, J. A., Pipitone, J. M., Fletcher, L. T., & Peterson, B. (2017). Multiple research
methods for evidence generation. In J. Knutsson & L. Tompson (Eds.), Advances in evidence-based policing (pp.
64–83). New York, NY: Routledge.

McDavid, J. C., & Huse, I. (2012). Legislator uses of public performance reports: Findings from a five-year study.
American Journal of Evaluation, 33(1), 7–25.

Michalopoulos, C., Tattrie, D., Miller, C., Robins, P. K., Morris, P., Gyarmati, D., . . . Ford, R. (2002). Making
work pay: Final report on the self-sufficiency project for long-term welfare recipients. Ottawa, Ontario, Canada:
Social Research and Demonstration Corporation.

Montague, S. (2000). Focusing on inputs, outputs, and outcomes: Are international approaches to performance
management really so different? Canadian Journal of Program Evaluation, 15(1), 139–148.

Moore, G. F., Audrey, S., Barker, M., Bond, L., Bonell, C., Hardeman, W., . . . Wight, D. (2015). Process
evaluation of complex interventions: Medical Research Council guidance. BMJ, 350, 1–7.

Morgan, G. (2006). Images of organization (Updated ed.). Thousand Oaks, CA: Sage.

Mowles, C. (2014). Complex, but not quite complex enough: The turn to the complexity sciences in evaluation
scholarship. Evaluation, 20(2), 160–175.

Patton, M. Q. (2011). Developmental evaluation: Applying complexity concepts to enhance innovation and use. New
York, NY: Guilford Press.

Patton, M. Q., McKegg, K., & Wehipeihana, N. (2015). Developmental evaluation exemplars: Principles in practice.
New York: Guilford Publications.

Pawson, R. (2002a). Evidence-based policy: In search of a method. Evaluation, 8(2), 157–181.

Pawson, R. (2002b). Evidence-based policy: The promise of “realist synthesis.” Evaluation, 8(3), 340–358.

Pawson, R. (2006). Evidence-based policy: A realist perspective. Thousand Oaks, CA: Sage.

Pawson, R. (2013). The science of evaluation: A realist manifesto. Thousand Oaks, CA: Sage.

Pawson, R., & Tilley, N. (1997). Realistic evaluation. Thousand Oaks, CA: Sage.

Reynolds, M., Gates, E., Hummelbrunner, R., Marra, M., & Williams, B. (2016). Towards systemic evaluation.
Systems Research and Behavioral Science, 33(5), 662–673.

Robinson, M. (2014). Spending reviews. OECD Journal on Budgeting, 13(2), 81–122.

Rogers, P. J. (2008). Using programme theory to evaluate complicated and complex aspects of interventions.
Evaluation, 14(1), 29–48.

Ruane, J. M. (2017). Re(searching) the truth about our criminal justice system: Some challenges. Sociological
Forum, 32(S1), 1127–1139.

Rush, B., & Ogborne, A. (1991). Program logic models: Expanding their role and structure for program planning
and evaluation. Canadian Journal of Program Evaluation, 6(2), 95–106.

Schwandt, T. (2015). Evaluation foundations revisited: Cultivating a life of the mind for practice. Stanford, CA:
Stanford University Press.

Shaw, T. (2016). Performance budgeting practices and procedures. OECD Journal on Budgeting, 15(3), 65–136.

Smith, P. (1995). On the unintended consequences of publishing performance data in the public sector.
International Journal of Public Administration, 18(2–3), 277–310.

Smykla, J. O., Crow, M. S., Crichlow, V. J., & Snyder, J. A. (2016). Police body-worn cameras: Perceptions of
law enforcement leadership. American Journal of Criminal Justice, 41(3), 424–443.

Stacey, R. (2011). Strategic management and organizational dynamics: The challenge of complexity. Gosport, UK:
Pearson Education Limited/Ashford Colour Press Ltd.

Stame, N. (2004). Theory-based evaluation and types of complexity. Evaluation, 10(1), 58–76.

Stame, N. (2010). What doesn’t work? Three failures, many answers. Evaluation, 16(4), 371–387.

Stame, N. (2013). A European evaluation theory tree. In M. C. Alkin (Ed.), Evaluation roots—A wider perspective
of theorists’ views and influences (pp. 355–370). Thousand Oaks, CA: Sage.

Stockmann, R., & Meyer, W. (Eds.). (2016). The future of evaluation: Global trends, new challenges, shared
perspectives. London, UK: Palgrave Macmillan.

Treasury Board of Canada Secretariat. (2010). Supporting effective evaluations: A guide to developing performance
measurement strategies—(Chapter 5—Logic Model). Retrieved from https://www.canada.ca/en/treasury-board-
secretariat/services/audit-evaluation/centre-excellence-evaluation/guide-developing-performance-measurement-
strategies.html

Treasury Board of Canada Secretariat. (2016a). Directive on results. Retrieved from https://www.tbs-
sct.gc.ca/pol/doc-eng.aspx?id=31306

Treasury Board of Canada Secretariat. (2016b). Policy on results. Retrieved from https://www.tbs-
sct.gc.ca/pol/doc-eng.aspx?id=31300

Treasury Board of Canada Secretariat. (2017). Departmental plans. Retrieved from
https://www.canada.ca/en/treasury-board-secretariat/services/planned-government-spending/reports-plans-priorities.html

U.K. Government. (2017). Transport social research and evaluation. Retrieved from
https://www.gov.uk/government/collections/social-research-and-evaluation#featured-research-reports-and-
guidance

U.S. Government Accountability Office. (2012). Designing evaluations: 2012 revision (GAO-12-208G). Retrieved
from https://www.gao.gov/assets/590/588146.pdf

U.S. White House (archives). (2015). Fact Sheet: Creating opportunity for all through stronger, safer
communities. Office of the Press Secretary. Retrieved from https://obamawhitehouse.archives.gov/the-press-
office/2015/05/18/fact-sheet-creating-opportunity-all-through-stronger-safer-communities

Walton, M. (2016). Expert views on applying complexity theory in evaluation: Opportunities and barriers.
Evaluation, 22(4), 410–423.

Watson, D., Broemeling, A., Reid, R., & Black, C. (2004). A results-based logic model for primary health care:
Laying an evidence-based foundation to guide performance measurement, monitoring and evaluation. Vancouver,
British Columbia, Canada: Centre for Health Services and Policy Research.

Watson, D., Broemeling, A., & Wong, S. (2009). A results-based logic model for primary health care: A
conceptual foundation for population-based information systems. Healthcare Policy, 5, 33–46.

Weiss, C. H. (1995). Nothing as practical as good theory: Exploring theory-based evaluation for comprehensive
community initiatives for children and families. In J. Connell et al. (Eds.), New approaches to evaluating
community initiatives: Concepts, methods, and contexts. Washington, DC: Aspen Institute.

White, M. D. (2014). Police officer body-worn cameras: Assessing the evidence. Washington, DC: Office of Justice
Programs, U.S. Department of Justice.

3 Research Designs for Program Evaluations

Contents
Introduction 98
Our Stance 98
What is Research Design? 104
The Origins of Experimental Design 105
Why Pay Attention to Experimental Designs? 110
Using Experimental Designs to Evaluate Programs 112
The Perry Preschool Study 112
Limitations of the Perry Preschool Study 115
The Perry Preschool Study in Perspective 116
Defining and Working With the Four Basic Kinds of Threats to Validity 118
Statistical Conclusions Validity 118
Internal Validity 118
Police Body-Worn Cameras: Randomized Controlled Trials and Quasi-Experiments 122
Construct Validity 124
The ‘Measurement Validity’ Component of Construct Validity 125
Other Construct Validity Problems 126
External Validity 129
Quasi-experimental Designs: Navigating Threats to Internal Validity 131
The York Neighborhood Watch Program: An Example of an Interrupted Time Series Research
Design Where the Program Starts, Stops, and Then Starts Again 136
Findings and Conclusions From the Neighborhood Watch Evaluation 137
Non-Experimental Designs 140
Testing the Causal Linkages in Program Logic Models 141
Research Designs and Performance Measurement 145
Summary 147
Discussion Questions 148
Appendices 150
Appendix 3A: Basic Statistical Tools for Program Evaluation 150
Appendix 3B: Empirical Causal Model for the Perry Preschool Study 152
Appendix 3C: Estimating the Incremental Impact of a Policy Change—Implementing and Evaluating
an Admission Fee Policy in the Royal British Columbia Museum 153
References 157

Introduction
Chapter 3 introduces the logic of research designs in program evaluations. Because we believe that questions about
program effectiveness are at the core of what evaluators do in their practice, in this chapter, we explore causes and
effects in evaluations and how to manage rival hypotheses that can confound our efforts to understand why
program outcomes happen. Evaluators are often in the position of being expected to render judgments about
program effectiveness and the extent to which the program was responsible for the actual outcomes.
Understanding the logic of research designs improves our ability to render defensible judgments.

In this chapter, we cover experimental, quasi-experimental, and non-experimental research designs. After
introducing experimental designs, we describe the Perry Preschool Study as an exemplar of experimental designs.
That study has had a major impact on early childhood education–related public policies in the United States and
continues to produce research results as its participants age. In this chapter, we describe the four general categories
of validities that are core to understanding the logic of research designs. Validity, broadly, relates to the extent to
which the research designs are capable of credibly describing causes and effects in the real world. The four types we
examine are as follows: (1) statistical conclusions validity, (2) internal validity, (3) construct validity, and (4)
external validity.

Because understanding internal validity is central to being able to assess causal linkages in program evaluations, we
describe the nine categories of threats to internal validity and offer examples of each. We introduce five important
quasi-experimental research designs, describe the possible threats to internal validity for each of those designs, and
offer an extended example of an evaluation that uses quasi-experimental designs to assess the effectiveness of
programs or policies. We include a second quasi-experimental example as Appendix C of this chapter.

Finally, we again explore program theory and the challenges of testing the causal linkages of program logic models.
In Appendix B of this chapter, we show how the Perry Preschool Study has been able to test the main causal
linkages in the logic of that program. Our last topic in Chapter 3 brings us back to a key theme in this textbook:
the relationships between program evaluation and performance measurement. We show how performance
monitoring can use research designs to make comparisons over time and how performance data can be useful in
conducting program evaluations.

Our Stance
Over the past four decades, the field of evaluation has become increasingly diverse in terms of what are viewed as
appropriate designs, methods, and practices. Experimental and quasi-experimental research designs in evaluations
are an important part of the field (Donaldson, Christie, & Mark, 2014). Some evaluators would argue that they
are the methodological core of what we do as evaluators; they offer ways to examine causes and effects, a central
issue when we assess program effectiveness. In this textbook, we do not advocate that all evaluations should be
based on experimental or quasi-experimental research designs. Instead, we are suggesting that any program
evaluator needs to understand how these designs are constructed and how to think through the rival hypotheses
that can undermine our efforts to assess cause-and-effect linkages in evaluations. We are advocating a way of
thinking about evaluations that we believe is valuable for a wide range of situations where a key question is whether
the program was effective—that is, whether the observed outcomes can be attributed to the program—regardless
of the research designs or the methods that are employed.

There are three conditions for establishing a causal relationship between two variables (Shadish, Cook, &
Campbell, 2002).

1. Temporal asymmetry, that is, the variable that is said to be the cause precedes the variable that is the effect.
2. Covariation, that is, as one variable varies, the other also co-varies either positively or negatively.
3. No plausible rival hypotheses, that is, no other factors that could plausibly explain the co-variation between
the independent and the dependent variable.

These three conditions are individually necessary and jointly sufficient to establish a causal relationship between
two variables. The first tends to be treated at a theoretical/conceptual level, as well as at an empirical level. We
hypothesize temporal asymmetry and then look for ways of observing it in our program implementations. The
second and third conditions are addressed by statistical conclusions validity and internal validity, respectively.

During the 1960s and into the 1970s, most evaluators would have agreed that a good program evaluation should
emulate social science research and, more specifically, that research designs should come as close to randomized
experiments as possible (Alkin, 2012; Cook & Campbell, 1979). The ideal evaluation would be one where people
were randomly assigned either to a program group or to a control group, key variables were measured, the
program was implemented, and after some predetermined period of exposure to the program, quantitative
comparisons were made between the two groups. In these evaluations, program success was tied to finding
statistically significant differences between program and control group averages on the outcome variable(s) of
interest.

Large-scale social experiments were implemented, and evaluations were set up that were intended to determine
whether the programs in question produced the outcomes that their designers predicted for them. Two such
experiments were the New Jersey Negative Income Tax Experiment (Pechman & Timpane, 1975) and the Kansas
City Preventive Patrol Experiment (Kelling, 1974a, 1974b). In the New Jersey experiment, the intent was to test
whether it would be feasible to combat poverty by providing certain levels of guaranteed income. An underlying
issue was whether a guaranteed income would undermine incentive to work. Samples of eligible low-income
families (earning up to 150% of the poverty line income) were randomly assigned to a control group or various
treatment groups where each group received a combination of minimum guaranteed family income plus a specific
negative income tax rate. An example of a treatment group worked like this: For a family whose income fell, say,
below $10,000 per year, there would be payments that were related to how far below $10,000 the family earned.
The lower the family income, the greater the payment—the greater the negative income tax (Pechman &
Timpane, 1975).
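To make the arithmetic of a negative income tax concrete, the following short Python sketch computes a hypothetical payment under one treatment condition. The break-even income and negative tax rate used here are illustrative assumptions, not the actual parameters of the New Jersey experiment.

def nit_payment(family_income, break_even=10_000, negative_tax_rate=0.5):
    """Hypothetical negative-income-tax payment.

    Families earning at or above the break-even income receive nothing;
    below it, the payment is a fraction (the negative tax rate) of the
    shortfall, so lower family incomes receive larger payments.
    """
    shortfall = max(break_even - family_income, 0)
    return negative_tax_rate * shortfall

for income in (0, 4_000, 8_000, 10_000, 12_000):
    print(f"income ${income:>6,}: payment ${nit_payment(income):>8,.2f}")

Under these assumed parameters, a family earning nothing would receive $5,000, a family earning $8,000 would receive $1,000, and a family earning $10,000 or more would receive nothing.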

The seminal guaranteed minimum income experiments from the 1970s, such as the New Jersey experiment and,
in Canada, the Manitoba Basic Annual Income Experiment (Hum et al., 1983), have relevance today (Forget,
2017; Simpson, Mason, & Godwin, 2017). Not only is poverty still an unresolved social and policy issue, but
globalization, growing income disparities, and job losses from rapid technological advances have refocused political
attention on this type of social experiment. The original studies provide some foundational information for new
experiments such as Ontario’s Basic Income Pilot and Finland’s Partial Basic Income Trial (Bowman, Mallett &
Cooney-O’Donaghue, 2017; Kangas, Simanainen, & Honkanen, 2017; Stevens & Simpson, 2017; Widerquist,
2005; Widerquist et al., 2013).

In another oft-cited example of an experimental research design, the Kansas City Preventive Patrol Experiment
(Kelling, 1974a, 1974b; Larson, 1982) was intended to test the hypothesis that the level of routine preventive
patrol in urban neighborhoods would not affect the actual crime rate (measured by victimization surveys of
residents), the reported crime rate, or citizen perceptions of safety and security (measured by surveys of residents).
In one part of Kansas City, 15 police patrol beats were randomly assigned to one of three conditions: (1) no
routine preventive patrol (police would only enter the beat if there was a call for their services), (2) normal levels of
patrol, and (3) 2 to 3 times the normal level of patrol. The experiment was run for a year, and during that time,
extensive measurements of key variables were made. The designers of the experiment intended to keep the
knowledge of which beats were assigned to which condition confidential and believed that if the level of patrol
could be shown to not affect key crime and citizen safety indicators, police departments elsewhere could save
money by modifying the levels of patrol that they deployed.

Although they provided valuable information, neither of these early social experiments was entirely successful.
This was not because the planned interventions failed but because of methodological problems that limited the
validity of the evaluations. In the New Jersey Negative Income Tax Experiment, one problem was the differential
dropout of participants from the experimental and control groups, which weakened the comparability of the groups
for statistical analyses (Lyall & Rossi, 1976; Watts & Rees, 1974). In the Kansas City Preventive Patrol
Experiment, even though the police department was told to keep the experimental conditions confidential, police
officers tended to respond to calls for service in the low-patrol zones with more visibility (sirens, lights) and more
patrol cars. Residents, when surveyed about whether they perceived different levels of patrol across the
experimental conditions, tended to perceive the same level of patrol in the low-patrol zones as did residents in the
control zones. That may have been due to the behaviors of the police officers themselves. Later in this chapter, we
will describe a threat to the construct validity of research designs in which those in the control group(s) try harder
to make up for the fact that they are not able to partake of the treatment. This effect is sometimes called the John
Henry effect, named after the famous song about an epic battle between a railroad worker and a steam drill where
the outcome is how fast they can drive railroad spikes (Cook & Campbell, 1979; Heinich, 1970).

Even if the problems with patrol officers undermining the experimental conditions are set aside, the biggest issue
with the Preventive Patrol Experiment was its limited generalizability to other jurisdictions (limited external
validity). How could any city announce to its residents that it was going to lower routine patrol levels? And if they
did not announce it, what would the political consequences be if people found out?

In the 1970s and into the 1980s, one key trend in evaluation practice was a move to evaluations based on quasi-
experimental designs, away from large-scale social experiments. Quasi-experiments did not rely on randomized
assignment of participants to program and control groups as the principal way to compare and measure
incremental program effects, but instead used comparisons that needed less evaluator control over the evaluation
setting. The goal was still the same—to assess cause-and-effect relationships between programs and outcomes
(Campbell & Stanley, 1966; Cook & Campbell, 1979; Shadish, Cook, & Campbell, 2002).

At the same time, at least two principal sets of criticisms of experiments and quasi-experiments emerged. The first
was from within the community of evaluation academics and practitioners and amounted to concerns that the
results of experiments could not easily be generalized. Cronbach (1982) suggested that the logic of experiments,
with its emphasis on assessing cause-and-effect linkages in particular settings, limited the generalizability of the
findings to other settings or other participants. For Cronbach, each evaluation setting was some combination of
units (people, usually), treatments (program[s]), observations (how things were measured), and settings (time and
place for the program implementation). He called these features of all program evaluations “UTOS.” For
Cronbach, even if an evaluation was well done, the results were limited to one time and one location and were
generally not very useful to policy makers. We will come back to this issue later in this chapter.

The second line of criticisms was more fundamental and had to do with an emerging view that using social science
methods, including experimental and quasi-experimental approaches, missed some fundamental things about the
meanings of human interactions, including the ways that people participate in programs and policies, either as
providers or as beneficiaries. Fundamentally, proponents of qualitative evaluation approaches emphasized the
importance of subjectivity in discerning what programs mean to beneficiaries and other stakeholders. Qualitative
evaluators challenged the assumption that it is possible to objectively measure human attributes—an important
assumption of the experimental and quasi-experimental approach. A key difference between advocates of
qualitative evaluations (e.g., Guba & Lincoln, 1989; Heshusius & Smith, 1986) and those who continued to
assert the superiority of experiments and quasi-experiments was the qualitative evaluators’ emphasis on
words/narratives of the participants as the basis for the research and analysis in evaluations. Advocates of
experiments and quasi-experiments tended to emphasize the use of quantitative techniques, often involving
applications of statistics to numerical data that compared average outcome scores for program versus control
groups and tested differences for statistical significance.

Whither the “Gold Standard”?

In the United States, the American Evaluation Association became involved in a divisive debate in 2003 on the appropriateness of making
experimental research designs the “gold standard” in evaluations of federally funded programs in education (Donaldson & Christie,
2005). Christie and Fleischer (2010) used the 2003 federal evaluation guidelines favoring experimental evaluation designs (these were
deemed to be scientifically based) and performed a content analysis of 117 evaluation studies published over a 3-year period (2004–2006)
in eight North American evaluation-focused journals. The authors chose this time span because it chronologically comes after the
scientifically based research movement was initiated in 2002. The scientifically based research movement “prioritizes the use of
randomized controlled trials (RCTs) to study programs and policies” (Christie & Fleischer, 2010, p. 326). What they discovered was that
in spite of U.S. federal government guidelines, in evaluation practice, experimental designs were used in only 15% of the studies. Quasi-
experimental designs were used in another 32%, and non-experimental designs were used in 48% of the studies—the latter being the
most common designs (Christie & Fleischer, 2010). In sum, evaluation practice continued to be diverse, and studies employed a wide
range of designs in spite of this change in U.S. government policy.

Today, we have even more diversity in our approaches to evaluation (see Alkin, 2012). At one end of the
spectrum, we continue to have a robust movement in evaluation that is committed to experiments and quasi-
experiments. In fact, with the advent of the Cochrane Collaboration (2018) in health research and evaluation in
the early 1990s and the Campbell Collaboration (2018) in social program evaluation in the late 1990s, as well as
the recent emergence of a collaborative that is aimed at promoting experiments and quasi-experiments in
evaluations of international development programs (Barahona, 2010; White, 2010), there is a growing interest in
conducting experimental and quasi-experimental evaluations, synthesizing the results across whole sets of
evaluations, and reporting aggregated estimates of program effects. We have included a textbox in this chapter that
introduces behavioral economics, nudging (Thaler & Sunstein, 2008), and their connections to evaluation. A
core feature of program and policy nudges is a commitment to evaluating them using experimental or quasi-
experimental methodologies.

At the other end of the spectrum, we have a wide range of qualitative approaches to research and evaluation that
reflect different academic disciplines and include different philosophies that underpin these approaches. In the
middle, we are seeing a growing trend to mixing quantitative and qualitative methods in the same evaluation
(Creswell & Plano Clark, 2011; Johnson & Christensen, 2017; Patton, 2008; Stufflebeam & Shinkfield, 2007).
We will discuss qualitative evaluation and pragmatic mixed-methods approaches in Chapter 5 of this textbook.

Although the field of evaluation is increasingly diverse philosophically and methodologically, we see a continued
interest in the central questions that underpin much of what is in this textbook: Did the program or policy achieve
its intended outcomes? Was the program or policy responsible, in whole or in part, for the observed outcomes? In other
words, objective-achievement and attribution continue to be central to the evaluation enterprise. Being able to
credibly address questions of program effectiveness is the core of what distinguishes evaluators from others who
assess, review, or audit programs, policies, managers, and organizations.

Behavioral Economics, Nudging, and Research Designs: Implications for Evaluation

Donald T. Campbell, in an interview with Kenneth Watson that was included in the first issue of the Canadian Journal of Program
Evaluation (Watson, 1986), spoke about using experimental and quasi-experimental research designs to evaluate disseminable packages. For
Campbell, a disseminable package is

. . . a program that can be replicated in other contexts with reasonably similar results. Let us call it a DP program. A textbook is a DP
program, so is 55 mile an hour speed limit, and the Japanese quality circles. The way they work depends on the situation, but in principle,
they are replicable. (Watson, 1986, p. 83)

Fast forward to the widespread international interest (OECD, 2017) in using experimental and quasi-experimental research designs to
evaluate nudges. Nudges are changes in policies and programs that are aimed at influencing choices while preserving the freedom to choose
(Thaler & Sunstein, 2008). Nudges are modeled on principles from behavioral economics—principles that are grounded in the findings
from experimental research. Behavioral economics originated in the experimental psychological research that Daniel Kahneman and Amos
Tversky started in the 1970s (Kahneman, 2011) to examine whether rational actor assumptions that are at the core of neoclassical
economics are borne out in fact.

In other words, do people, when faced with choices, behave the ways that microeconomists say they should behave? What Kahneman and
Tversky and others since have discovered is that human decision making does not line up with the assumptions in classical
microeconomic theory. We human beings are not rational in the ways economists say we should be, but we are consistent in our “biases.”
Thus, policy and program changes based on behavioral economics appear to have the potential to be disseminable packages. But how does
this movement to design and implement and then evaluate nudges relate to program evaluation?

Generally, nudges are simple program or policy changes (OECD, 2017). They involve changing one thing about a policy or a program
and then evaluating, using an experimental or quasi-experimental design, the success (and often, the cost-effectiveness) of that change. An
example might be changing where people sign their income tax forms: at the end of the form, or at the top of the
first page. Does signing at the top of the first page (and thereby declaring, before completing the form, that all that
follows is true) result in different patterns of claimed deductions than signing at the end of the form? This kind of
nudge would be "low touch" (French & Oreopoulos, 2017). French and Oreopoulos point out that "high-touch" nudges are "far more difficult to implement and evaluate" (p. 626).

High-touch nudges (their example is a program in the province of Manitoba, Canada, that trained employment assistance workers to
conduct motivational interviews with their clients) are complicated or even complex programs. They verge on the complexity of social
experiments that were done decades ago. Nudging, behavioral economics, and behavioral insights units in governments are here to stay.
Campbell’s advice about using experimental and quasi-experimental designs to evaluate disseminable packages looks to be sound advice
today, particularly given the resources and time required to design, implement, and evaluate large-scale social experiments.

What is Research Design?
Research design in evaluations is fundamentally about examining the linkage depicted in Figure 3.1.

Notice what we have done in Figure 3.1. We have taken the program, which we “unpacked” in different ways in
our Chapter 2 logic models, and repacked it. The detail of logic models has been simplified again so that the
program is back in a box for now.

Figure 3.1 Did the Program Cause the Observed Outcomes?

Why have we done this? Would it not make more sense to keep the logic models we have worked on so far and test
the causal linkages in such models? That way, we would be able to corroborate whether the intended linkages
between various outputs and outcomes are supported by evidence gathered in the evaluation. We will look at this
option later in this chapter, but for now, this diagram illustrates some basics.

The main reason we “repack” the logic models is that in this chapter, we want to introduce research designs
systematically. Meeting the requirements to examine a given cause-and-effect linkage (in our case, the link
between the program and an observed outcome) means we must find ways of testing it while holding constant
other factors (including other linkages) that could influence it. A typical program logic will have a number of
important causal linkages. In order to test these linkages using research designs, we would need to isolate each one
in turn, holding constant the linkages in the rest of the logic model, to know whether that particular linkage is
supported by evidence.

The problem lies in finding ways of holding everything else constant while we examine each linkage in turn. In
most evaluations of programs, we simply do not have the time or the resources to do this; it is not feasible. Thus,
in thinking about research designs, we tend to focus on the main causal linkage, which is the one between the
program as a whole (back into its box) and the observed outcomes. Notice that we are interested in the program to
observed outcomes linkage and not the program to intended outcomes linkage that we introduced in Figure 1.4 in
Chapter 1. The latter linkage is more the concern of performance monitoring systems, which are complementary
to program evaluations.

Later in this chapter, we look at ways that have been developed to more fully test program logics. One approach
that is quite demanding in terms of resources is to conduct an evaluation that literally tests all possible
combinations of program components in an experimental design (Cook & Scioli, 1972). Another one that is more
practical is to use several complementary research designs in an evaluation and test different parts of the program
logic with each one. These designs are often referred to as patched-up research designs (Cordray, 1986), and
usually, they do not test all the causal linkages in a logic model. We will look at an example of such a program
logic/evaluation design later in this chapter.

The Origins of Experimental Design
Experimental design originated in disciplines where it was essential to be able to isolate hypothesized cause-and-
effect relationships in situations where more than one factor could cause an outcome. In agricultural research in
the post–World War I period, for example, people were experimenting with different kinds of grain seeds to
produce higher yields. There was keen interest in improving crop yields—this was a period when agriculture was
expanding and being mechanized in the United Kingdom, the United States, Canada, and elsewhere.

Researchers needed to set their tests up so that variation in seed types was the only factor that could explain the
number of bushels harvested per acre. Alternatively, sometimes they were testing fertilizer (applied and not
applied) or whether the adjacent land was unplanted or not. Typically, plots of a uniform size would be set up at
an agricultural research station. Care would be taken to ensure that the soil type was uniform across all the plots
and was generalizable to the farmlands where the grains would actually be grown. That meant that experiments
would need to be repeated in different geographic locations as soil types, length of the frost-free season, and
rainfall varied.

Seed would be planted in each plot, with the amount of seed, its depth, and the kind of process that was used to
cover it being carefully controlled. Again, the goal was to ensure that seeding was uniform across the plots.
Fertilizer may have been added to all plots (equally) or to some plots to see if fertilizers interacted with the type of
seed to produce higher (or lower) yields.

The seed plots might have been placed side by side or might have had areas of unplanted land between each.
Again, that may have been a factor that was being examined for its effects on yield.

During the growing season, moisture levels in each plot would be monitored, but typically, no water would be
provided other than rainfall. It was important to know if the seed would mature into ripe plants with the existing
rainfall and the length of the season in that region. Because the seed plots were in the same geographic area, it was
generally safe to assume that rainfall would be equal across all the plots.

Depending on whether the level of fertilizer and/or the presence of unplanted land next to the seed plots were also
being deliberately manipulated along with the seed type, the research design might have been as simple as two
types of plots: one type for a new “experimental” seed and the other for an existing, widely used seed. Or the
research design might have involved plots that either received fertilizer or did not, and plots that were located next
to unplanted land or not.

Figure 3.2 displays a research design for the situation where just the seed type is being manipulated. As a rule, the
number of plots of each type would be equal. As well, there would need to be enough plots so that the researchers
could calculate the differences in observed yields and statistically conclude whether the new seed improved yields.
Statistical methods were developed to analyze the results of agricultural research experiments. Ronald A. Fisher, a
pioneer in the development of statistical tools for small samples, worked at the Rothamsted Experimental
(Agricultural) Station in England from 1919 to 1933. His book, Statistical Methods for Research Workers (Fisher,
1925), is one of the most important statistics textbooks written in the 20th century.

Figure 3.2 Research Design to Test Seed Yields Where Seed Type Is the Only Factor Being Manipulated

In Figure 3.2, “X” denotes the factor that is being deliberately manipulated—in this case, the seed type. More
generally, the “X” is the treatment or program that is being introduced as an innovation to be evaluated. Keep in
mind that we have “rolled up” the program so that we are testing the main link between the program and a key
observed outcome (in this case, bushels per acre). O1 and O2 are observations made on the variable that is
expected to be affected by the “X.” Treatments or programs have intended outcomes. An outcome that is
translated into something that can be measured is a variable. In our case, O1 and O2 are measures of the yield of
grain from each group of seed plots: so many bushels per acre (or an average for each group of plots).

Figure 3.3 displays a more complicated research design when seed type, fertilizer, and cultivated (nonseeded) land
nearby are all being manipulated. Clearly, many more seed plots would be involved, costing considerably more
money to seed, monitor, and harvest. Correspondingly, the amount of information about yields under differing
experimental conditions would be increased.

Figure 3.3 Research Design to Test Seed Yields Where Seed Type, Fertilizer, and Contiguous Unplanted
Land Are All Possible Causes of Grain Yield

Figure 3.3 is laid out to illustrate how the three factors (seed type, fertilizer, and contiguous cultivated land) that
are being manipulated would be “paired up” to fully test all possible combinations. In each of the cells of the
figure, there are the original two types of plots: (1) those with the new seed and (2) those without. The plots where
“X” has occurred in each of the four cells of the figure have the same new type of seed, and the plots in each of the
four cells that do not get “X” are planted with the same regular seed. Because each cell in Figure 3.3 represents a
different treatment, each “X” has been subscripted uniquely. In effect, the simpler research design illustrated in
Figure 3.2 has been reproduced four times: once for each of the combinations of the other two factors. Notice also
that the observations (“O”) of bushels per acre have also been subscripted so that for each of the eight
experimental outcomes (experimental seed vs. standard seed in each cell), we have a measure of the yield in bushels
per acre. For example, O1, O3, O5, and O7 are the observations of the experimental new seed type in Figure 3.3,
one for each of the four new seed conditions.

When analyzing the results for the research design shown in Figure 3.3, we would probably use a statistical
method called three-way analysis of variance. Basically, we would be able to see whether there was a statistically
significant difference in yields for each of the three main experimental conditions: (1) type of seed, (2) planted or
unplanted land next to the target plot, and (3) use of fertilizer. These are the three main effects in this experiment.
As well, we could examine the interactions among the main effects to see if different combinations of seed type,
fertilizer amount, and unplanted versus planted land produced yields that point to certain combinations being
better or worse than what we would expect from adding up the main effects. In Appendix A of this chapter, we
will summarize some basic statistical tools that are used by evaluators. In this textbook, we do not describe
statistical methods in detail but, along the way, will mention some tools and tests that are appropriate in different
situations. In Chapter 6, in particular, we discuss sampling methods and how to estimate sample sizes needed to
be able to generalize needs assessment results.
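Returning to the design in Figure 3.3, the following minimal Python sketch simulates plot yields for the 2 × 2 × 2 combination of factors and fits a three-way analysis of variance using the statsmodels library. The factor labels, effect sizes, and number of plots per cell are hypothetical choices made only for illustration.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Eight plots for each of the 2 x 2 x 2 combinations of factors.
design = pd.DataFrame(
    [(s, f, a) for s in ("new", "standard")
               for f in ("fertilized", "unfertilized")
               for a in ("unplanted_adjacent", "planted_adjacent")
               for _ in range(8)],
    columns=["seed", "fertilizer", "adjacent_land"],
)

# Hypothetical yields (bushels per acre): a main effect for the new seed,
# a smaller effect for fertilizer, plus random plot-to-plot noise.
design["bushels"] = (
    30
    + 5 * (design["seed"] == "new")
    + 2 * (design["fertilizer"] == "fertilized")
    + rng.normal(0, 3, len(design))
)

# Three-way ANOVA: three main effects and all of their interactions.
model = smf.ols("bushels ~ seed * fertilizer * adjacent_land", data=design).fit()
print(sm.stats.anova_lm(model, typ=2))

The printed ANOVA table reports an F test for each main effect (seed type, fertilizer, adjacent land) and for each of their interactions, which corresponds to the comparisons described above.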

This agricultural experimental research design is an example of the logic of experiments. The key feature of all
“true” experiments is random assignment (to treatment and control conditions) of whatever units are being
manipulated. The emphasis is on controlling any factors that interfere with sorting out the causal link between the
intervention and the outcome variable. Randomization is intended to do that. The nature of randomization is that
if it is well done, any pre-existing differences between units are distributed randomly when the units are assigned.
That, ideally, controls for any outside factors (rival hypotheses) and ensures that we can see what impact the
intervention/program/treatment has on the outcome, without having to worry about other factors interfering with
that linkage. The logic of this kind of design, which has its origins in agriculture, has been generalized to drug
trials, social experiments, and a wide range of other evaluation-related applications.
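The balancing property of random assignment can be illustrated with a small simulation. In this sketch, the "motivation" covariate, the sample size, and the group labels are hypothetical; the point is simply that, under random assignment, a pre-existing difference tends to be spread evenly across the program and control groups.

import numpy as np

rng = np.random.default_rng(7)

# 200 hypothetical participants with a pre-existing characteristic
# (say, a motivation score) that the evaluator does not control.
motivation = rng.normal(50, 10, size=200)

# Randomly assign half of the participants to the program group
# and half to the control group.
assignment = rng.permutation(np.repeat(["program", "control"], 100))

program_mean = motivation[assignment == "program"].mean()
control_mean = motivation[assignment == "control"].mean()

# With random assignment, the two group means typically differ only by
# chance, so the pre-existing characteristic is not a plausible rival
# hypothesis for any program-versus-control difference in outcomes.
print(f"program group mean: {program_mean:.1f}")
print(f"control group mean: {control_mean:.1f}")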

In program evaluations, experimental research designs work best where the following is true: the evaluator is
involved in the design of the evaluation before the program is implemented; there are sufficient resources to
achieve a situation where there is a program group or groups and a control group; and it is feasible to do a random
assignment of units to treatment versus control groups, and sustain those assignments long enough to test fairly
the effectiveness of the program. Usually, the treatment group gets whatever we are interested in testing. This is
usually an innovation, a new program, or a new way of delivering an existing program. Sometimes, more than one
treatment group is appropriate where we want to see the impacts of combinations of factors, similar to the
agricultural experiment described earlier. The feasibility of designing and implementing an experiment depends, in
part, on how complex the program is; larger, more complex programs tend to be more difficult to implement as
experiments, given the difficulties in ensuring uniform implementation and sustaining it for the life of the
experiment. Shorter programs or pilot programs, in terms of expected outcomes, tend to be easier to implement as
experiments.

If the experiment has “worked,” outcome differences (if any) can confidently be attributed to the program. We
can say that the program caused the observed difference in outcomes; that is, the causal variable occurred before
the observed effect, the causal variable co-varied with the effect variable, and there were no plausible rival
hypotheses.

Figure 3.4 suggests a visual metaphor for what research designs focusing on program effectiveness strive to achieve.
The causal linkage between the program and the observed outcomes is effectively isolated so that other, plausible
rival hypotheses are deflected. The line surrounding the program and its outcomes in the figure represents the
conceptual barrier against rival hypotheses that is created by a defensible research design.

Figure 3.4 Visual Metaphor for a Defensible Research Design

Table 3.1 shows two different experimental designs. They step back from the specifics of the agricultural
experiment we described earlier, and summarize the structure of typical experiments. The first design, which is
perhaps the most common of all experimental designs, involves measuring the outcome variable(s) before and after
the program is implemented. This before–after design is the classic experimental design and is often used when
evaluators have sufficient resources and control to design and implement a before–after outcome measurement and
data collection process. Having done so, it is possible to calculate the before–after changes in both the treatment
and control groups, compare the differences between the two groups, and confirm from the pre-test results that the
two groups are similar before the program begins. The second design is called the after-only experimental design and does
not measure the outcome variable before the treatment begins; it generally works well where random assignment
to treatment and control groups has occurred. Random assignment generally ensures that the only difference
between the two groups is the treatment. This is the design that was used in the agricultural experiment described
earlier. Both designs in Table 3.1 include “no-program” groups to achieve the “all-other-things-being-equal”
comparison, which permits us to see what differences, if any, the program (specifically) makes.

Table 3.1 Two Experimental Designs

Pre-test–Post-test Design (Classic Design)

1   R1   O1   X   O2
2   R2   O3        O4

Post-test Only Design (After-Only Design)

3   R1        X   O1
4   R2             O2

The random assignment of cases/units of analysis to treatment and control groups is indicated in Table 3.1 by the
letter “R” in front of both the treatment and control groups. This process is intended to create a situation where
the clients in the program and no-program groups are equivalent in all respects, except that one group gets the
program. Not pre-testing the two groups assumes that they are equivalent. Where the numbers of clients randomly
assigned are small (approximately fewer than 30 units each for the program and the control groups), pre-testing
can establish that the two groups are really equivalent, in terms of their measured sociodemographic variables and
the outcome measure.

When reviewing Table 3.1, keep in mind that the “X” designates the program or treatment of interest in the
evaluation. The “O”s indicate measurements of the outcome variable of interest in the evaluation. When we
measure the outcome variable before and after the program has been implemented, we are able to calculate the
average change in the level of outcome. For example, if our program was focused on improving knowledge of
parenting skills, we could measure the average gain in knowledge (the “after” minus the “before” scores) for the
program group (O2 − O1) and the control group (O4 − O3). When we compare the average gain in outcome levels
between the two groups after the program has been implemented, we can see what the incremental effect of the
program was. Typically, we would use a two-sample t test to conduct the program-control statistical comparisons.
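As a minimal sketch of that comparison, the following Python fragment simulates before-and-after knowledge scores for a program group and a control group, computes the gain scores, and applies a two-sample t test using SciPy. The sample sizes, score scales, and effect sizes are invented for the illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 40  # hypothetical number of participants in each group

# Simulated "before" and "after" knowledge scores (percentage correct).
program_before = rng.normal(55, 8, n)
program_after = program_before + rng.normal(12, 5, n)  # assumed program effect
control_before = rng.normal(55, 8, n)
control_after = control_before + rng.normal(3, 5, n)   # assumed practice effect only

# Gain scores: (O2 - O1) for the program group, (O4 - O3) for the control group.
program_gain = program_after - program_before
control_gain = control_after - control_before

# The two-sample t test on the average gains estimates the incremental
# effect of the program and its statistical significance.
t_stat, p_value = stats.ttest_ind(program_gain, control_gain)
print(f"estimated incremental gain: {program_gain.mean() - control_gain.mean():.1f} points")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")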

Where we do not have pre-test measures of the outcome variable, as in the after-only experimental design, we
would compare the averages of the program and control groups after the program was implemented (O1 − O2).
One additional thing to keep in mind is that in experimental evaluations where we have several outcomes of
interest, we have separate (but parallel) experimental designs for each variable. For example, if we are interested in
evaluating the attitude changes toward parenting as well as the knowledge gains from a parenting program, we
would have two research designs—one for each outcome. Likewise, where we have more than one treatment or a
combination of treatments, each combination is designated by a separate X, subscripted appropriately. This
situation is illustrated in our original agricultural example, in Figure 3.3.

The second design in Table 3.1 does have the potential of solving two problems that can arise in experiments.
Sometimes, pre-testing can be intrusive and can have its own effect on the post-test measurement. Furthermore,
the pre-test can interact with the program to affect the post-test average. Suppose you are evaluating a server
intervention program that is intended to train employees who serve alcoholic beverages in bars and other
establishments. One expected outcome might be improved knowledge of ways to spot customers who should not
be served any more drinks and how to say “no” in such situations.

If “knowledge level” is measured with the responses to a set of true–false statements before the training, it is
possible that measuring knowledge sensitizes the servers to the program and boosts their post-program average
scores. As well, if the same true–false questionnaire was used before and after the training, we might expect higher
scores in the control group simply because employees are familiar with the questions. Using the first design might
produce outcomes that are higher than they should be, misleading those who might want to generalize the results
to other program situations. The pre-test interacting with the program is an example of a construct validity
problem: How well does the “training” that occurs in the experiment properly parallel the concept of “training”
that is intended to be implemented as a program in other settings that are not evaluated? The pre-test boosting the
post-test average scores in the control group is a testing issue; this is an internal validity problem. That is, does
the pre-test act as a “practice” for both groups? Why is this so important? Because if the program is later
implemented, the servers will not be getting a “pre-test” before implementation!

Although rarely done, it is possible to combine the two experimental designs so that we have four groups—two of
which are pre-tested and two of which are not. This more elaborate research design is called the Solomon Four-
Group Design (Campbell & Stanley, 1966) and is specifically designed to address problems caused by pre-testing.
If we look at the four groups in Table 3.1 together, we could find out if taking a pre-test interacted with the
program to affect post-test results. We would compare the first and third rows of that table (the group that is both
pre-tested and post-tested and gets the training program, and the group that is not pre-tested but gets the training
program). If the average post-test score for the pre-tested group (who got the program) is higher than the average for the post-test-only group who got the program, we can conclude that the pre-test has boosted the average post-test score. Because we can now calculate the difference between the two groups who both got the training program, we can estimate how much the pre-test boosted the score and take that into account, and thus we have addressed the construct validity problem.

By comparing the second and fourth rows of the table, we can see whether pre-testing boosted post-test scores, in
the absence of the program. If that happened, then we have identified (and can compensate for) a testing threat to
the internal validity of the research design. We will consider both of these validity problems later in this chapter.
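
To see how these two comparisons work in practice, here is a minimal sketch with hypothetical post-test means for the four Solomon groups; the numbers are invented for illustration and do not come from any actual evaluation.

    # Hypothetical post-test means for the four Solomon groups (illustrative only)
    post_means = {
        "program_pretested": 78.0,     # R  O1  X  O2
        "control_pretested": 65.0,     # R  O3      O4
        "program_no_pretest": 72.0,    # R      X  O5
        "control_no_pretest": 60.0,    # R          O6
    }

    # Did the pre-test interact with the program? Compare the two program groups.
    pretest_by_program_effect = (post_means["program_pretested"]
                                 - post_means["program_no_pretest"])

    # Did the pre-test alone boost scores (a testing threat)? Compare the two control groups.
    testing_effect = (post_means["control_pretested"]
                      - post_means["control_no_pretest"])

    print("Estimated pre-test-by-program boost:", pretest_by_program_effect)
    print("Estimated testing (practice) effect:", testing_effect)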

Why Pay Attention to Experimental Designs?
The field of program evaluation continues to debate the value of experimental designs (Cook, Scriven, Coryn, &
Evergreen, 2010; Donaldson, Christie, & Mark, 2014; Scriven, 2008; Shadish et al., 2002). On the one hand, experimental designs are generally seen as costly, as requiring more control over the program setting than is usually feasible, and as vulnerable to a variety of implementation problems. On the other hand, for some evaluators and some government jurisdictions, experimental designs continue to be the “gold standard” when it comes to testing causal relationships (Donaldson et al., 2014; Gueron, 2017; Jennings & Hall, 2012).

Weisburd (2003), in a discussion of the ethics of randomized trials, asserts that the superior (internal) validity of
randomized experiments makes them the ethical choice in criminal justice evaluations:

At the core of my argument is the idea of ethical practice. In some sense, I turn traditional discussion of
the ethics of experimentation on its head. Traditionally, it has been assumed that the burden has been
on the experimenter to explain why it is ethical to use experimental methods. My suggestion is that we
must begin rather with a case for why experiments should not be used. The burden here is on the
researcher to explain why a less valid method should be the basis for coming to conclusions about
treatment or practice. The ethical problem is that when choosing non-experimental methods we may be
violating our basic professional obligation to provide the most valid answers we can to the questions that
we are asked to answer. (p. 350)

Although Weisburd’s (2003) view would be supported by some advocates of experimentation in evaluation,
practitioners also recognize that there can be ethical risks associated with randomized experiments. Shadish et al.
(2002) discuss the ethics of experimentation and point out that in the history of research with human participants,
there are examples that have shaped our current emphasis on protecting the rights of individuals, including their
right to informed consent, before random assignment occurs. In an evaluation involving individuals, an evaluator
should provide information about the purpose and the process of the evaluation, how the information gathered
will be used, and who will have access to the data and the reports. Participants should be informed of whether
their responses will be anonymous (i.e., an individual cannot be identified from their responses, even by the evaluator) and/or confidential (i.e., the evaluator knows the identity of the respondents but undertakes not to reveal it).

Deception has become a central concern with any research involving human participants, but it is highlighted in
situations where people are randomly assigned, and one group does not receive a program that conveys a possible
benefit. In situations where the participants are disadvantaged (socially, economically, or psychologically), even
informed consent may not be adequate to ensure that they fully understand the consequences of agreeing to
random assignment. Shadish et al. (2002) suggest strategies for dealing with situations where withholding
treatment is problematic. For example, persons assigned to the control group can be promised the treatment at a
later point.

Some program evaluators have argued that because opportunities to use experimental or even quasi-experimental
designs are quite limited, the whole idea of making experiments the paradigm for program evaluations that
examine causes and effects is misguided. That is, they argue that we are setting up an ideal that is not achievable
and expecting evaluators to deal with issues that they cannot be expected to resolve. As Berk and Rossi (1999)
argue, “There is really no such thing as a truly perfect evaluation, and idealized textbook treatments of research
design and analysis typically establish useful aspirations but unrealistic expectations” (p. 9).

The reality is that many situations in which evaluations are wanted simply do not permit the kind of control and
resources that experiments demand, yet we do proceed with the evaluation, knowing that our findings,
conclusions, and recommendations will be based, in part, on evidence that does not meet the standards implied by
the experimental approach. Evidence is the essential core around which any program evaluation is built, but the
constraints on resources and time available and the evaluator’s lack of control over program implementation will
usually mean that at least some issues that ideally should be settled with data from experiments will, in fact, be
settled with other lines of evidence, ultimately combined with sound professional judgments.

Using Experimental Designs to Evaluate Programs
In the field of evaluation, there is a rich literature that chronicles the experiences of researchers and practitioners
with studies in which a core feature is the use of randomized experiments. Although the field has diversified—and
continues to diversify—in terms of criteria for judging appropriate evaluation designs, randomized experiments
remain a key part of our profession. In the following two sections, we consider the Perry Preschool Study and
experimental designs of a selection of police body-worn cameras studies.

The Perry Preschool Study
Among projects that have relied on randomized experiments as their core research design, one of the most well
known is the Perry Preschool Study. It has been recognized as an exemplar among evaluations (Henry & Mark,
2003) and has been the focus of economists’ efforts to build theories on the formation of human capital
(Heckman, 2000). This project began in the early 1960s in Ypsilanti, Michigan, and even though the original
children (aged 3 and 4 years when the study began) have since grown up and are now into their fifth decade, the
research organization has grown up with the participants. The High/Scope Educational Research Foundation
continues to follow the program and control groups. The most recent monograph (the eighth since the study
began) was published in 2005, and there are plans to follow the program and control groups into the future
(Schweinhart, 2013; Schweinhart et al., 2005).

The project began in 1962 in an African American neighborhood in south Ypsilanti, Michigan, with the
researchers canvassing the neighborhood around the Perry elementary school for families with young children who
might be candidates for the study. The goal was to find low-socioeconomic-status families, and arrange for their
children to be tested with the Stanford–Binet Intelligence Test. Children who tested in a range of 70 to 85 were
considered as potential participants (Heckman et al., 2010). In the first year, a total of 28 children were included
in the study. Once eligibility was confirmed for the 28, they were matched on their IQ (intelligence quotient)
scores and randomly assigned (using a coin toss) to two groups. An exchange process was then used to move boys
and girls so that the gender mix in each of the two groups was about equal. As well, the sociodemographic
characteristics of parents (scholastic attainment, father’s or single parent’s employment level, and ratio of rooms
per person in the household) were taken into account, and children were moved between the two groups to
equalize family backgrounds. Once all the equalizing had been done, the two (now matched) groups as a whole
were randomly assigned to either the program or the control condition.
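
To illustrate the general idea of matching on a pre-test score and then randomizing, here is a minimal sketch in Python. It is not a reconstruction of the Perry protocol (which also exchanged children to balance gender and family background and assigned whole matched groups by a coin toss); the identifiers and scores are hypothetical.

    import random

    # Simplified sketch of matched random assignment on IQ scores.
    # NOT the exact Perry procedure; it only illustrates matching on a
    # pre-test score and then randomizing within matched pairs.
    children = [("child_%02d" % i, random.randint(70, 85)) for i in range(28)]

    # Sort by IQ so adjacent children form matched pairs
    children.sort(key=lambda c: c[1])

    program_group, control_group = [], []
    for i in range(0, len(children), 2):
        pair = [children[i], children[i + 1]]
        random.shuffle(pair)                # "coin toss" within the matched pair
        program_group.append(pair[0])
        control_group.append(pair[1])

    print(len(program_group), "assigned to program;", len(control_group), "to control")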

The same procedure was used in four successive years (1963–1967) to select four additional waves of program
participants (treatment and control groups), resulting in a total of 123 children being included in the experiment.
Of those, 58 were in the preschool group and 65 in the control group. Several additional adjustments were made
—the first was that children who came from single-parent families and could not participate in the school and
home-based visits that were included in the program were moved from the program to the control group. This
created an imbalance between the two groups, with single-parent households making up 31% of the control group and 9% of the program group (Berrueta-Clement, Schweinhart, Barnett, Epstein, & Weikart, 1984). This difference was statistically significant. The second adjustment was done to reduce experimental
diffusion, in which those in the program group mingle with those in the control group. (We will look at this
when we discuss construct validity.) This adjustment involved assigning all younger siblings of program
participants to the program group, regardless of gender. In the program groups, a total of eight families were
affected by this protocol, and among them, 19 children out of the 58 in the experimental group were from sibling
families (Heckman, 2007; Heckman & Masterov, 2004).

The basic research design was a two-group, before–after comparison of the preschool and no-preschool groups.
Initially, the focus was on cognitive change (change in IQ), but as additional waves of data collection were added,
more and different observations of outcome variables were included in the research. What started out as a
randomized before–after comparison group design evolved into a time series where some variables are tracked over
time, and new variables are added and then tracked.

Table 3.2 summarizes the basic research design for the Perry Preschool Study. The subscripted Os are the
observations of IQ, measured by the Stanford–Binet intelligence test, and the X is the preschool program itself.
Although we show one research design in Table 3.2, keep in mind that this same design was used for each of the
five waves of the recruitment process—they have all been rolled up into the design below.

Table 3.2 Basic Research Design for the Perry Preschool Program

R1   O1   X   O2
R2   O3        O4

Table 3.3 shows the research design that emerged with the longitudinal study. Other than IQ, no outcome variables were measured before the program began, so none of those variables could be compared with pre-test averages or percentages. The research design was a post-program randomized experiment, as described earlier in
this chapter in Table 3.1. The five waves of data collection included a total of 715 study outcomes—not all were
carried forward to successive waves, as the measures for each wave were age specific.

Table 3.3 Longitudinal Research Design for the Perry Preschool Program

R1   X   O(Grade School)   O(High School)   O(Young Adult)   O(Adult)   O(Middle Age)
R2       O(Grade School)   O(High School)   O(Young Adult)   O(Adult)   O(Middle Age)

The program was based on a cognitive development model (Campbell et al., 2001) that emphasized a structured
format for daily preschool sessions (2.5 hours each morning, Monday through Friday, from October through
May), a low ratio of children to teachers (about 6 to 1 in the program), visits by the parent(s) to the school, and
1.5-hour weekly home visits by the teachers to all the families in the program group. For the first wave of children
(1962–1963), the program lasted 1 year. For each of the four successive waves, the program ran for 2 years. At the
time the program was designed, the prevailing cognitive theory was that an enriched preschool experience would
increase the IQ scores of the children and give them a boost as they made the transition from preschool to grade
school. Longer term effects of this increase in measured intelligence were not known at the time.

What makes this study unique is how long the children have been followed from their initial preschool experience.
Major efforts to collect data initially focused on ages 3 to 11, to track the transition to grade school and measure
school-related performance. Although the children in the program group did experience an initial boost in their
IQ scores, that difference faded over time in grade school. By then, the research team was able to measure
differences in school performance, and those differences persisted over time. A second major data collection effort
was launched when the children were aged 14 and 15 years. A third project collected data at age 19, a fourth at age
27, and a fifth (and the most recent) at age 40. There are plans to do a follow-up at age 50 (Heckman, Ichimura,
Smith, & Todd, 2010). No other child development experiment has been conducted over as long a time.

As each wave of data was collected, more and different variables were added—in effect, the program theory
evolved and was elaborated over time as differences between the preschool and no-preschool groups persisted into
adolescence and adulthood. The initial focus on cognitive differences shifted to a focus on school performance.
That, in turn, evolved into a focus on social and interpersonal issues, economic issues (including employment),
and criminal justice–related encounters. Figure 3.5 has been reproduced from the summary of the most recent
monograph in the project (Schweinhart et al., 2005, p. 2) and displays a selected set of variables for the program
and no-program groups up to age 40. What you can see is a pattern of statistically significant differences that
suggests that the preschool group has performed better across the whole time span for this experiment. The initial
difference in the percentage of preschool children with IQ scores 90 points or higher moves forward to school-
related variables, then to employment and criminal justice–related incidents. These six variables are a small sample
of the findings that have been reported since the study began.

Figure 3.5 Major Findings: Perry Preschool Study at Age 40

Source: Schweinhart et al. (2005).

This pattern of differences has been the dominant feature of the findings to date. Schweinhart, Barnes, and
Weikart (1993) have generally reported group differences using straightforward percentages or averages that are
compared using tests of statistical significance and have relied on the logic of randomization to minimize the uses
of multivariate statistical techniques—they argue that the experimental design controls for variables that could
confound the intergroup comparisons over time.

Limitations of the Perry Preschool Study
The research design was not implemented exactly as intended (Heckman et al., 2010). Although the randomization process was initially carried out as planned, differences between the two groups of children and families accrued as the
study team implemented the program. Assigning to the control group single parents who could not be involved in
the school and home visits resulted, effectively, in a change to the demographic mix of the program and control
groups. Berrueta-Clement et al. (1984) have argued that this initial difference washed out by the time the children
were age 13, but that is after the fact—those children could not take part in the program because their parent was working, yet they were disadvantaged children who should have been part of the program group. In addition to
this problem, the research team decided to assign younger siblings to the same group as their older siblings. For
families with a child in the preschool program, that meant that any younger siblings were also assigned to the
program. Although this decision reduced cross-group diffusion of the program effects, it created two problems: (1)
These younger siblings were not randomly assigned, and (2) within families, there was the possibility of siblings
reinforcing each other over time (Heckman & Masterov, 2004). Although the research team (see Schweinhart et
al., 2005) has analyzed the data with only one sibling per family to account for this nonrandom assignment, this
does not resolve a construct validity problem: Could program impacts be confounded with sibling reinforcement
in those families where more than one child was in the program group? That is, was the construct “the program”
just the program or was it a combination of “the program and some kind of sibling reinforcement”?

There are other issues with the study as well. A key one is that girls generally performed better than boys in the
program group (Schweinhart et al., 2005). This difference emerged in the grade school and high school waves of
data collection and raises the question of why this program, with its emphasis on classroom time and teacher home
visits, would be more effective for girls. Schweinhart et al. (2005) suggest that one mechanism might be that the
teachers tended to see girls as doing better academically, resulting in them recommending fewer special education
alternatives for these students and hence increasing the likelihood that the girls would graduate from high school
on time.

When we look at this experiment more generally, we can see that teachers in grade school and high school played a
key role; they not only taught students in the two groups but made decisions that ended up being data points in
the study. What if the teachers knew which students were program participants and which were not? This is quite
possible, given that the experiment (at least through the elementary grades) was conducted in the catchment area
for one elementary school in Ypsilanti. What if they treated the students from the two groups differently—what if
they had different expectations for the children in the two groups? In educational research, the Pygmalion effect is
about how teacher expectations can influence their interactions with students and their assessments of student
performance (Rosenthal & Jacobson, 1992). That could have had an effect on the trajectories of the children in
the study as well as an effect on the validity of some of the variables in the study. Similarly, if teachers were aware
of who was in the control group, they might have tried to balance things out by providing extra attention to them,
creating a potential for compensatory equalization of treatments, another construct validity issue. We will
further examine construct validity issues in this chapter. In the Perry Preschool Project, some teacher-related
construct validity issues emerged in the literature years later (Derman-Sparks, 2016). Overall, it would not be
surprising if some of the teachers in grade school and high school had expectations and values that affected their
actions and the decisions they made regarding student performance and advancement.

If we look at the Perry Preschool Study as an example of an evaluation that uses an experimental research design,
how does it fare overall?

In addition to the problems already mentioned, there are several others: The overall size of the program and
control groups is quite small, creating problems for any statistical testing that relies on assumptions about the
distributions of scores on particular measures. As well, the researchers conducted multiple tests of statistical
significance with the same data. (Recall that over the course of all five waves, there were 715 separate measures, all
of which would have been examined for differences between the two groups.) A problem that arises when you
conduct multiple statistical tests with the same data is that a certain proportion of those tests will turn out to be
statistically significant by chance alone. If we are using the .05 level of significance, for example, then about 1 in 20 tests could be significant by chance alone. Many of the tests in this experiment used a .10 level of significance as the criterion for deciding whether a comparison was noteworthy (think of the .10 level as the probability that we could be wrong if we decide there is a significant difference between the two groups on a given measure). With 715 study outcomes, we would therefore expect roughly 72 (0.10 × 715) of the comparisons to appear statistically significant by chance alone.
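
The arithmetic behind this expectation can be laid out in a few lines; the Bonferroni adjustment at the end is shown only as one conventional correction for multiple testing, not as something the Perry researchers applied.

    # Expected number of "significant" results by chance alone, if all 715
    # comparisons were tested and no true differences existed (illustrative).
    n_tests = 715
    alpha = 0.10
    expected_false_positives = n_tests * alpha
    print(expected_false_positives)        # 71.5, i.e., roughly 72 comparisons

    # One conventional (conservative) correction is to shrink the per-test
    # alpha level, e.g., a Bonferroni adjustment:
    bonferroni_alpha = alpha / n_tests
    print(round(bonferroni_alpha, 6))      # about 0.00014 per test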

The program has, then, been controversial, in part because of its departures from the strict conditions laid down
to design and implement randomized experiments and in part because it is deemed to be too “rich,” in terms of
resources needed, to be implementable elsewhere. As well, even though Heckman and his colleagues have pointed
out that the cohorts of children and families that were included in the study were demographically similar to large
numbers of African Americans at that time (Heckman et al., 2010), the external validity (generalizability of the
study results) is limited.

The Perry Preschool Study in Perspective
The Perry Preschool experiment is unique. No other study has followed its participants for as long a period of time or retained them so successfully. In the age-27 and age-40 follow-ups, for example, between 90% and 96% of
the participants were successfully reached for interviews (Schweinhart et al., 2005). Very few studies focusing on
child development have attracted as much attention (Anderson et al., 2003; Henry & Mark, 2003). The Perry
Preschool Program has been compared with the much broader range of Head Start programs and is visibly more
successful than other programs that have focused on preschool interventions for low-income children (Datta,
1983).

Not surprisingly, the data from the experiment have been reanalyzed extensively. Perhaps the most definitive
reanalyses have been done by James Heckman and his colleagues at the University of Chicago. Heckman is a
Nobel Prize–winning economist who has taken an interest in the Perry Preschool Study. In several papers, he has
reanalyzed the experimental findings and redone the cost–benefit analyses that the High/Scope research team has
done along the way. In general, he has been able to address many of the limitations that we have mentioned in this
chapter. His overall conclusion (Heckman et al., 2010), after having conducted econometric analyses to adjust for
the research design and statistical shortcomings of the study, is that the Perry Preschool results are robust. For
example, some of his conclusions are as follows:

a. Statistically significant Perry treatment effects survive analyses that account for the small sample size of the
study.
b. Correcting for the effects of selectively reporting statistically significant responses, there are substantial
impacts of the program for both males and females. Experimental results are stronger for females at younger
adult ages and for males at older adult ages.
c. Accounting for the compromised randomization of the program often strengthens the case for statistically
significant and economically important estimated treatment effects for the Perry program as compared to
effects reported in the previous literature. (p. 2)

In addition, Heckman et al. (2010) concluded that the Perry participants are representative of a disadvantaged
African American population and that there is some evidence that the dynamics of the local economy in which
Perry was conducted may explain gender differences by age in earnings and employment status.

Despite its limitations, the Perry Preschool program was designed and implemented with substantial resources and was intensive when compared with other child development programs, including Head Start programs. The
teachers were well trained, were given salary bonuses to participate in the program, and were probably highly
motivated to make the program work. The program and control group participants have been followed since they
were 3 years old, and being a part of this experiment looks to be a lifetime affair.

It is an important study, particularly because it was successfully implemented as a randomized experiment, in a
field where there continues to be intense interest in understanding what works to change the developmental
trajectories for disadvantaged children. It has played a significant role in American public policy circles;
notwithstanding the differences between typical Head Start programs and the relatively costly, high-quality Perry
Preschool Program, the latter was a key part of the public policy decision by the Reagan administration to keep
Head Start in the early 1980s.

Defining and Working With the Four Basic Kinds of Threats to Validity
In this section, we will be covering the four basic kinds of threats to the validity of research designs and the
subcategories of threats within these basic categories. Over the past 45 years, major contributions have been made
to describing the ways that research designs for program evaluations can be threatened by validity problems.
Campbell and Stanley (1966) defined threats to the internal and external validity of research designs, and Cook
and Campbell (1979) defined and elaborated threats to validity by describing four different classes of validity
problems that can compromise research designs in program evaluations. These are statistical conclusions validity,
internal validity, construct validity, and external validity. There are various typologies of validity, and there is
not a consensus on defining and delineating the various kinds of validity (see Reichardt, 2011; Shadish, Cook, &
Campbell, 2002; Trochim, 2006), but we define them below in the manner in which we use them in this
textbook. Our strongest emphasis is on Shadish et al.’s (2002) approach to threats to internal, construct, and
external validity. It seems most relevant to our objective of providing a sound foundational understanding of how
to construct and conduct a defensible, credible program evaluation or performance measurement system.

Statistical Conclusions Validity
This kind of research design validity is primarily about correctly using statistical tests (descriptive and inferential
tests). In an evaluation that uses quantitative data, particularly where samples have been drawn from populations,
issues like sampling procedures, size of samples, and levels of measurement all influence which statistical tools
are appropriate. When we are analyzing the correlations between program variables and outcome variables, the
validity of the statistical conclusions depends on whether the assumptions for the statistical tests have been met.
Statistical conclusions validity is about establishing a significant correlation between the independent and
dependent variables, using statistical methods that are valid. As one example of a threat to statistical conclusions
validity, consider the Perry Preschool Study, where the problem of conducting many tests of significance on the
same database increased the likelihood that significant differences between the program and the control group
would occur by chance alone. Shadish et al. (2002) note that they prefer “what we call structural design features
from the theory of experimentation rather than to use statistical modeling procedures” (p. xvi). In other words,
the closer the research designs are to experimental designs, the simpler will be the appropriate statistical tests for
comparing program and control group outcomes.
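
As a minimal sketch of what attending to statistical conclusions validity can look like in practice, the following Python fragment (with hypothetical outcome scores) checks two common assumptions of the two-sample t test and falls back to a rank-based test when they are not met; it is illustrative only, not a prescription.

    import numpy as np
    from scipy import stats

    # Hypothetical outcome scores for a program and a control group (illustrative only)
    program = np.array([14, 18, 11, 22, 16, 19, 15, 20, 17, 13])
    control = np.array([12, 10, 15, 11, 9, 14, 13, 10, 12, 11])

    # Check two assumptions behind the two-sample t test:
    # roughly normal distributions and roughly equal variances.
    _, p_norm_prog = stats.shapiro(program)
    _, p_norm_ctrl = stats.shapiro(control)
    _, p_equal_var = stats.levene(program, control)

    if min(p_norm_prog, p_norm_ctrl) > 0.05 and p_equal_var > 0.05:
        stat, p_value = stats.ttest_ind(program, control)
        test_used = "two-sample t test"
    else:
        # Fall back to a rank-based test that does not assume normality
        stat, p_value = stats.mannwhitneyu(program, control, alternative="two-sided")
        test_used = "Mann-Whitney U test"

    print(test_used, "p =", round(p_value, 4))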

Internal Validity
Internal validity is about ruling out experiment-based rival hypotheses that could explain the causal linkage(s)
between the program variables (independent variables) and the observed outcomes (dependent variables). In other
words, internal validity concerns the hypothesized relationship, or causal inferences, between the dependent
variable(s) and the independent variable(s) in the study. There are nine categories of internal validity threats—each
of which can contain more specific threats. Note that with internal validity we are considering the validity of the
relationships between the variables we are measuring, which are representations of the constructs but are not the
constructs themselves. For example, when we are designing an evaluation of the empirical relationship between
body-worn camera usage (the independent variable) and the number of citizen complaints (the dependent
variable), the internal validity issues would apply to the relationship between those measured variables that occur
in the study.

For internal validity, then, we want to do what we can to ensure that we are getting a clean picture of the causal relationship in the study, and that it is not being affected by some other hidden factor or factors.

We can think of internal validity threats as potential errors in our research designs (Shadish et al., 2002). There is a
fine—and sometimes confusing—distinction between validity threats based on a weakness in the evaluation’s
design (internal validity) and validity threats based on the inferences made from the evaluation observations to the
constructs that they are supposed to represent (construct validity). This distinction will become clearer as we
outline some of the main subtypes of internal and construct validity. One thing to keep in mind as you read this
section of Chapter 3 is that these are possible threats to internal validity and not probable threats in a given
program evaluation.

1. History: External events or factors can coincide with the implementation of a policy or program. This threat
can happen in any research design where a program has been implemented, and the outcome variable is
measured before and after or just after implementation.

Example: The province-wide British Columbia CounterAttack program, aimed at reducing
accidents and injuries on British Columbia highways due to alcohol consumption, was introduced
in May 1977. The provincial seat belt law was introduced as a policy in October of the same year.
Because the seat belt law was intended to reduce accidents and injuries, it is virtually impossible to
disentangle the outcomes of the CounterAttack program and the seat belt policy. (e.g., Is it the
causal effect of the CounterAttack program or, as a rival hypothesis, is it the effect of the seat belt law
on accidents and injuries? Or perhaps both?)

2. Maturation: As program participants grow older, their development-related behaviors tend to change in
ways that could appear to be outcomes, particularly for programs that focus on children and adolescents.
This threat is a problem in research designs that measure an outcome variable before and after a program has
been implemented.

Example: A youth vandalism prevention program in a community is developed in a short stretch
of time during a period of rapid population growth. The population matures roughly as a cohort.
Children born into the community also mature as a cohort. If a program is developed to “combat
a rising youth vandalism problem” when the average youth age is 12, by the time the average age
is 16, the community may have outgrown the problem even without the program. (e.g., Is it the
effect of just the prevention program, the effect of aging of the cohort on level of vandalism, or
both?)

3. Testing: Taking the same post-test as had been administered as a pre-test can produce higher post-test scores
due to gaining familiarity with the testing procedure. This threat is relevant to any research design where
pre- and post-tests are used and the same instrument measures the outcome variable before and after
implementation of the program.

Example: Servers in a pub score higher after the server-training program on a test of “knowledge
level” that uses a pre–post measure of knowledge, not because they have increased their
knowledge during training but simply because they are familiar with the test from when it was
administered before the training. (Is it the effect of just the training program, the effect of having taken the pre-test on server knowledge level, or both?)

4. Instrumentation: This threat can occur if, as the program is implemented, the way in which key outcome
variables are measured is also changed. Research designs where there is only one group that gets the program
and the outcome variable is measured before and after the program is implemented are vulnerable to this
threat.

Example: A program to decrease burglaries is implemented at the same time that the records
system in a police department is automated: reporting forms change, definitions of different types
of crimes are clarified, and a greater effort is made to “capture” all crimes reported in the database.
The net effect is to “increase” the number of reported crimes. (Is it the effect of just the program,
the effects of changing the records system on number of burglaries, or both?)

5. Statistical regression: Extreme scores on a pre-test tend to regress toward the mean of the distribution for
that variable in a post-test. Thus, if program participants are selected because they scored low or high on the
pre-test, their scores on the post-test will tend to regress toward the mean of the scores for all possible
participants, regardless of their participation in the program. Research designs that have one measure of the
outcome before the program is implemented and one afterward are vulnerable to this threat.

Example: People are selected for an employment skills training program on the basis of low scores
on a self-esteem measure. On the post-test, their self-esteem scores increase. (Are the apparent
changes in self-esteem a result of the training program or a natural tendency that extreme scores on
the pre-test will tend to drift toward average on a second test?)

6. Selection: Persons/units of analysis chosen for the program may be different from those chosen for the
control group. This is a threat to internal validity that can apply to any research design where two or more
groups (one of which is the program group) are being compared.

Example: A program to lower recidivism among youth offenders selects candidates for the
program from the population in a juvenile detention center. In part, the candidates are selected
because they are thought to be reasonable risks in a halfway house living environment. If this
group was compared with the rest of the population in the detention center (as a control group),
differences between the two groups of youths, which could themselves predict recidivism, might
explain program outcomes/comparisons. (Are the differences just the effect of the program, the
effect of the selection process that resulted in pre-program baseline differences on recidivism, or both?)

7. Attrition/mortality: People/units of analysis may “drop out” over the course of the evaluation. This is a
problem in research designs where outcomes are measured before and after program implementation, and
there may be systematic differences in those who drop out of the program, as compared with those who
remain in the program.

Example: A program to rehabilitate chronic drug users may lose participants who would be least
likely to succeed in the program. If the pre-test “group” were simply compared with the post-test
group, one could mistakenly conclude that the program had been successful. (Is it just the effect
of the program, the effect of losing participants who were finding that the program was not effective, or both?)

8. Ambiguous temporal sequence in the “cause” and the “effect” variables: This threat can occur when it is not
clear whether a key variable in the program causes the outcome, or vice versa. This can be a validity problem
for any research design, including experimental designs, although it is important to specify how the causal reversal
would work. It is resolved by applying the theory that underlies the program intervention and making sure
that the program implementation was consistent with the program theory.

Example: A program that is intended to improve worker productivity hypothesizes that by
improving worker morale, productivity will improve. The data show that both morale and worker
productivity improve. But the program designers may well have missed the fact that improved
morale is not the cause but the effect. Or there is a reciprocal relationship between the two
variables such that improvements in morale will induce improvements in productivity, which, in
turn, will induce improved morale, which, in turn, will improve productivity. Evaluations of
complex programs, in which there are causal linkages that are reciprocal, can be challenging to do
because of this problem.

9. Selection-based interactions: Selection can interact with other internal validity threats so that the two (or
more) threats produce joint effects (additive and otherwise) on outcome variables. Any research design
where there is a program group and a control group is vulnerable to this class of threats. Keep in mind that
we are not talking about one threat but a range of possible threats to internal validity that can vary from one
evaluation to the next.

Example: A program to improve reading abilities in a school district is implemented so that
program classes are located in higher-income areas and control classes in lower-income areas.
Tests are given (pre, post) to both groups, and the findings are confounded not only by selection
bias but also by the fact that higher-income children tend to mature academically more quickly.
(Is the improvement in reading abilities due to just the program or the difference in the two groups
before the program, plus the fact that the two groups may be maturing at different rates?)

Pinpointing which internal validity threats are likely in a study helps identify what solutions may be feasible. That
is, each of the nine categories of internal validity threats suggests possible ways of mitigating particular problems,
although designing a study to sort out reciprocal or ambiguous causation can be challenging (Shadish et al., 2002).
To avoid the intrusion of history factors, for example, anticipate environmental events that could coincide with
the implementation of a policy or program and, ideally, deploy a control group so that the history factors affect both
groups, making it possible to sort out the incremental effects of the program.

The difficulty with that advice—or corresponding “solutions” to the other eight types of problems—is in having
sufficient resources and control over the program design and implementation to structure the evaluation to
effectively permit a “problem-free” research design. When we introduced experimental research designs earlier in
this chapter, we pointed out that randomization—that is, randomly assigning people or units of analysis to a
program and a control group—is an efficient way to control all possible threats to internal validity—the exception
being the possible problem of ambiguous temporal sequence. One reason that some evaluators have dubbed
randomized experiments the “gold standard” in program evaluations is that they are able to handle threats to
internal validity well. We saw that in the Perry Preschool Study, the research team relied very heavily on the
original randomization process to make longitudinal claims about the outcomes of the preschool experience.
Challenges to that study have focused, in part, on the several ways in which randomization was not properly carried out. Recent police body-worn camera studies provide informative
examples of benefits and challenges of RCTs and quasi-experiments.

Police Body-Worn Cameras: Randomized Controlled Trials and Quasi-
Experiments
We have mentioned the police body-worn camera (BWC) studies periodically before this point, and throughout
the following sections we are going to take the opportunity to highlight as examples some of the experimental and
quasi-experimental evaluations that have contributed to the growing pool of studies intended to determine the
impacts of this important and relatively new technology. To illustrate threats to construct validity and external
validity, these BWC studies and other examples will be used. The police BWC studies began with the seminal
Rialto study (Ariel, Farrar, & Sutherland, 2015) and have grown in number to a point where there have been at
least four systematic reviews of the studies since 2014 (Cubitt, Lesic, Myers, & Corry, 2017; Lum, Koper, Merola,
Scherer, & Reioux, 2015; Maskaly, Donner, Jennings, Ariel, & Sutherland, 2017; White, 2014). Many of these
BWC studies have avoided internal validity threats because they used the randomized controlled trial designs.
Barak Ariel and his colleagues have replicated the original study in other communities and done a variety of
similar studies.

A chief focus of these evaluations has been the effects of police BWCs on police use of force and citizens’ complaints. Additionally, over time, the focus of studies has included citizen behaviors, cost-
effectiveness of police BWCs, police perceptions of BWCs, and effects on the justice system. To provide a flavor of
the topics, Table 3.3 summarizes a selection of recent studies.

Table 3.3 Police Body-Worn Camera Evaluations

Title of Article | Research Design | Reference
The effect of police body-worn cameras on use of force and citizens’ complaints against police: A randomized controlled trial (the original study in the Rialto, California, Police Department) | RCT and time series | Ariel et al., 2015
Wearing body cameras increases assaults against officers and does not reduce police use of force: Results from a global multi-site experiment | RCTs | Ariel et al., 2016a
Report: Increases in police use of force in the presence of body-worn cameras are driven by officer discretion: A protocol-based subgroup analysis of ten randomized experiments | RCTs | Ariel et al., 2016b
Officer perceptions of body-worn cameras before and after deployment: A study of three departments | RCTs and pre- and post-testing | Gaub et al., 2016
Body-worn cameras and citizen interactions with police officers: Estimating plausible effects of varying compliance levels | Quasi-experimental | Hedberg, Katz, & Choate, 2017
Paradoxical effects of self-awareness of being observed: Testing the effect of police body-worn cameras on assaults and aggression against officers | RCTs and before-after analyses | Ariel et al., 2017a
“Contagious accountability”: A global multisite randomized controlled trial on the effect of police body-worn cameras on citizens’ complaints against the police (these studies replicate the Rialto study) | RCTs | Ariel et al., 2017b
A quasi-experimental evaluation of the effects of police body-worn cameras (BWCs) on response-to-resistance in a large metropolitan police department | Quasi-experimental | Jennings et al., 2017
The deterrence spectrum: Explaining why police body-worn cameras “work” or “backfire” in aggressive police–public encounters | Analysis of multiple RCTs | Ariel et al., 2018
Post-experimental follow-ups—Fade-out versus persistent effects: The Rialto police body-worn camera experiment four years on | RCT and time series | Sutherland, Ariel, Farrar, & Da Anda, 2017
“I’m glad that was on camera”: A case study of police officer’s perception of cameras | Qualitative | Sandhu, 2017

The BWC studies provide good illustrations of the challenges inherent even in randomized controlled trials, not to
mention quasi-experimental designs. As well, some provide examples of research designs that have elegant
triangulation of methods to try to overcome threats to validity and to broaden understanding of the mechanisms
of behavior change.

Construct Validity
When we construct a logic model, as we discussed in Chapter 2, we are stating what we expect to happen when the
program is implemented. The constructs in our logic models sit at a level above the measurable variables we actually work with in program evaluations, so construct validity is about moving between the level of constructs (and their intended causal linkages) and the level of variables (and their empirical correlations). In other
words, on the theoretical or conceptual plane, we would like to understand how Construct A affects Construct B
(e.g., does knowledge of being video recorded affect the behavior of police and the interactions of police and citizens),
but we need to operationalize these constructs—that is, work with the empirical plane—to design and implement
an evaluation where we can measure variables. So in the case of body-worn cameras, for example, the proposed
constructs are often translated into variables such as these: Does the presence of officers’ body-worn cameras (worn in
a specific way for the study at hand, and whether officers have discretion about when to turn them on, etc.) affect
the use-of-force incidents by police (recorded in a specified way in that department) or the number of citizen
complaints against police (reported and recorded in specific ways in that department) in that setting over a
particular period of time?

Figure 3.6 Linking the Theoretical and Empirical Planes in a Program Evaluation

Source: Adapted from Trochim, Donnelly, & Arora (2016).

Shadish et al. (2002) expanded upon earlier definitions of construct validity (see Cook & Campbell, 1979, and
Cronbach, 1982): “construct validity is now defined as the degree to which inferences are warranted from the
observed persons, settings, and cause and effect operations included in a study to the constructs that these
instances might represent” (Shadish et al., p. 38).

The ‘Measurement Validity’ Component of Construct Validity
Measurement, which we discuss in Chapter 4, is about translating constructs into observables: at the “level” of
observables, we work with variables, not the constructs themselves. Measurement validity is about assessing whether our empirical measures are valid with respect to the constructs they are intended to measure. Current approaches to measurement validity are focused on moving from the measured
variables in the empirical plane to the theoretical plane in Figure 3.6 (Borsboom, Mellenberg, & van Heerden,
2004). Construct validity includes measurement validity but is broader than that—it focuses on moving both ways
between the theoretical and empirical planes, and includes consideration of the validity of the relationship between
the causal proposition (from the conceptual plane) and the empirical correlation (from the empirical plane). That
is, “[it] concerns the match between study operations and the constructs used to describe those operations”
(Shadish et al., 2002, p. 72).

We can think of construct validity as being about responses to two questions: Am I implementing what I think I am
implementing? Am I measuring what I think I am measuring? It includes (but is not limited to) the question of
how valid the measures are of the constructs in an evaluation. Construct validity is also about being clear what the
constructs in a policy or program are; that is, being clear that in the evaluation no ambiguities have crept into the
ways that key constructs are defined (and linked to each other) and, at the same time, no ambiguities have crept into the
way they are actually measured, so that the empirical evaluation findings can be said to validly relate back to the
structure (the logic) of the policy or program. Figure 3.6 illustrates how the theoretical and empirical planes are
linked in our approach to construct validity. In doing a program evaluation (or developing and implementing a
performance measurement system), we work on the empirical plane. Construct validity is focused on the vertical
arrows in Figure 3.6—the links between the conceptual and empirical planes in an evaluation.

In any evaluation where data are gathered, there are choices to be made about what to measure, and the
measurement procedures that specify how the data will be collected. For example, in many of the body-worn
camera studies, the researchers are interested in measuring the number of police use-of-force incidents, as a
dependent variable. Broadly, although there are situations where use of force is legitimate, the variable is seen as a
measure related to the construct of deterrence of illegitimate force (Ariel et al., 2015, p. 518). That is, if there are
fewer use-of-force incidents after BWCs are implemented, that is seen as representing deterrence of illegitimate
force. Though it may seem a simple issue, the measurement of use of force has some challenges. Below is the Ariel
et al. (2015) description of how the study measured use of force:

Rialto Police Department used a system called Blue Team to track “recorded” use-of-force incidents.
This standardized tracking system enabled us to count how many reported incidents had occurred
during the experimental period in both experimental and control shifts, and to verify the details of the
incidents, such as time, date, location, and whether the officer or the suspect initiated the incident.
Rialto Police Department records instances of use-of-force, which encompasses physical force that is
greater than basic control or “‘compliance holds’—including the use of (a) OC spray, (b) baton (c)
Taser, (d) canine bite or (e) firearm”. These are the types of force responses that we considered as eligible
use-of force incidents. We operationalized the “use-of-force” dependent variable as whether or not force
was used in a given shift.

We acknowledge that police software cannot “measure” the use-of-force, and that it is nearly always up
to the individual officer to account for those incidents where force was used. Given the subjectivity of
this variable and the measurement problems we reviewed above, we therefore relied on these official
written reports, but not without hesitation. (pp. 521–522)
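
To make this operationalization concrete, here is a minimal sketch of how a shift-level, binary use-of-force outcome could be compared across randomly assigned treatment and control shifts; the counts are hypothetical, and the actual Rialto analyses were more elaborate than this simple chi-square comparison.

    from scipy import stats

    # Hypothetical shift-level counts (illustrative only, not the Rialto data):
    # rows = treatment shifts (BWC worn) and control shifts (no BWC);
    # columns = shifts with at least one recorded use-of-force incident vs. none.
    #                       force used   no force
    contingency_table = [[        8,        480 ],   # treatment shifts
                         [       17,        470 ]]   # control shifts

    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
    print("chi-square =", round(chi2, 2), "p =", round(p_value, 4))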

The “use” of body-worn cameras sits on the conceptual plane, but needs to be “operationalized” for measurement.
For body-worn cameras, there are different considerations in measuring/operationalizing their “use.” What model
of camera is used? How are they attached to a police officer’s uniform? How visible are they? When are they
turned on or off? Measurement validity will be further covered in the following chapter, but we did want to show
here how it is distinguished as a specific form of construct validity.

Other Construct Validity Problems


There are construct validity problems that are not defined as measurement validity issues. Mostly, they relate to the reactions of people who know they are involved in a study, which can differ from how people would react if the program were simply implemented without being part of a study. Let’s look at one example: Are the officers reacting
differently to BWC “use” when they know they are part of an evaluation study that will be analyzed and
published, as compared with when the BWC policy is implemented just as a new police department policy? When
we are exploring the relationship between “wearing the body-worn camera” and the volume of citizen complaints,
we need to consider the following: are we looking at a situation where that relationship also includes officers’
behavioral reactions to being part of a study, in either the “treatment” or “control” condition?

As another example, in the BWC studies, there is a possible construct validity problem that involves the way the
program construct (the intended cause in a network of conceptual cause-and-effect linkages) has been
operationalized. In the Rialto study and others that replicated it since (Ariel et al., 2017b), the independent
variable is generally the implementation of police wearing BWCs. Even though there has been randomization of
“treatment” and “control” conditions (an RCT), it is the shifts that are randomized, resulting in officers sometimes
wearing a BWC (when randomly assigned to a “treatment” shift) and sometimes not wearing one (when assigned to a “control” shift). The effect is that it becomes difficult to untangle the effects of actually wearing a BWC from
the effects of working with others who have worn a BWC (Maskaly et al., 2017). There is a diffusion of
treatments effect, which is a construct validity problem. We will come back to this example shortly.

Another example of a construct validity problem is in an evaluation of a server-training program in Thunder Bay,
Ontario. The evaluators assigned matched pairs of drinking establishments to program and no-program
conditions (Gliksman, McKenzie, Single, Douglas, Brunet, & Moffatt, 1993). Managers in the establishments had
been asked if they were willing to have their servers trained but were cautioned not to tell their servers about the
evaluation. Given the incentives for managers to look good or “cooperate,” it is possible that managers mentioned
the evaluation to their servers. The construct validity problem created again is as follows: What is the “program”—
is it server training, or is it server training plus the informal influence of bar managers? In the terminology that
Shadish, Cook, and Campbell (2002) use to specify different possible threats to construct validity, the server-
training construct problem is due to “inadequate explication of the construct” where “failure to adequately
explicate a construct may lead to incorrect inferences about the relationship between operation and the construct”
(p. 73).

Shadish, Cook, and Campbell (2002, p. 73) point out that there are 14 different possible threats to construct
validity. We will cover only the most relevant evaluation-related ones for this textbook. They are particularly well
illustrated by the BWC studies we discuss. As we suggested earlier, in the Rialto Police Department evaluation of
their BWC program, a construct validity problem was created by the possibility that when the program was
implemented, officers changed their behavior not only because they were wearing and using a BWC but because
the effects of the program were diffusing from the “treatment” shifts to the “control” shifts. This situation results
in ambiguity in how the “program” has been operationalized: Does it consist of wearing a BWC in all encounters
with citizens when frontline officers are on shift, or does it consist of reacting to learning about BWCs from others
in combination with the experience when in a “treatment” shift?

This construct validity problem also cropped up when the Rialto Police Department research design (shifts are the
main unit of analysis instead of officers) was implemented in seven other police departments (Ariel et al., 2017).
In all of those replications, there were no significant differences in citizen complaints against the police between the
treatment and the control groups, but in all cases, significant drops in the number of complaints (before versus
after BWC implementation) occurred for the whole police department (Ariel et al., 2017b, p. 302). In effect, the
diffusion of treatment problem in the original Rialto Police Department evaluation design was replicated—
occurring repeatedly when the same research design was replicated in other police departments.

Four possible threats to construct validity that can happen even when RCTs are the research design are as follows.

1. Diffusion of treatments: People can sometimes communicate about their program experiences/program
learning to members of the control group.

Example: Aside from the BWC example earlier, suppose that two groups of employees in a company are
selected to participate in a team-building experiment. One group participates in team-building workshops.
The other group (who may have an opportunity to take the workshop later) serves as the control group.
Employees communicate, and some of the skills are transferred informally. As we have discussed, in some of
the BWC studies, one of the chief obstacles is diffusion of treatment, as officers in the “control” shifts work
with officers who have worn cameras. Additionally, the officers get experience in both the “treatment” and
“control” shifts.
2. Compensatory equalization of treatments: The group that is not supposed to get the program is offered
components of the program, or similar benefits, because the program provider wishes to balance perceived
inequalities between the two groups.

Example: In the evaluation of the Head Start Program in the United States (Puma, Bell, Cook, & Heid,
2010), the original research design called for children in the control group not to be able to enroll in a local
Head Start program for 2 years. But the evaluators discovered that some families in the control group were
enrolling their children nevertheless, and by the end of the first year of the treatment, so many wanted to be
able to enroll that the program condition was shortened to 1 year from 2—the evaluators compensated the
control families by permitting them an early opportunity to enroll their children in the program.
3. Compensatory rivalry: The performance of the no-program group or individual improves because of a desire
to do as well as those receiving the program, and this diminishes the differences between the new program
and the existing program (also known as the “John Henry” effect).

Example: In the Kansas City Preventive Patrol Experiment (Kelling, 1974a), officers responding to calls for service in the no-patrol beat used sirens and lights more and responded in greater strength, which could be interpreted as compensatory rivalry—trying harder to get to the site of a call for service.
4. Resentful demoralization: The control group perceives unfair treatment at not receiving the program and reacts negatively.

Example: Those persons not getting a program to test the effects of class size on learning (halving the size of
classes) complain to the instructor and demand equal treatment. The administration refuses, and students
threaten to not take any of the remaining tests in the course.

Thus, as Shadish et al. (2002) point out, construct validity threats can be caused by the fact that people know they
are a part of an evaluation process. Participant expectations can influence behaviors, confounding attempts to
generalize the actual findings back to the constructs or program theory. Another way that participant behavior can
confound an experiment is the Hawthorne effect, named after the location where the original research occurred,
as described next.

In a worker productivity experiment in the 1930s in the United States, the experimenters discovered that being
part of an experiment produced an effect, regardless of the levels of the experimental variables being manipulated
(Roethlisberger, Dickson, & Wright, 1939). No matter what conditions the experimenters varied (e.g., lighting
level, speed of the assembly line, variability of the work), the results indicated that any manipulation increased
productivity because the workers knew they were being studied and consequently increased their productivity.
Construct validity was compromised by the behavior of the workers.

In the case of BWCs, looking at the original Rialto Police Department BWC experiment, the police chief was new
to the department and, when he came in, started preparations to deploy BWCs on all patrol officers. Further,
patrol officers could not turn off their cameras while on shift—all encounters with citizens would be recorded.
Some years before the BWC experiment was started in 2012, the Rialto Police Department had been threatened
with disbandment for a series of incidents including uses of force. It is possible that the entire department behaved as if they were being judged throughout the experiment. This could be interpreted as a Hawthorne effect.

More generally, Cook and Campbell (1979) and Shadish et al. (2002) present lists of circumstances that might
weaken construct validity in an evaluation. In summarizing the ways of minimizing this set of problems, they
suggest that evaluators need to do the following: make sure that constructs are clearly defined so that they can be
measured appropriately, make sure that constructs are differentiated so that they do not overlap as measures are
developed, and develop “good” measures—that is, measures that produce valid information.

External Validity
External validity threats include factors that limit the generalizability of the results of a policy or program
evaluation. Even if the research design has acceptable statistical conclusions validity, internal validity, and construct validity, those results apply to the evaluation in one setting, with particular participants, a particular treatment, and particular
measures. It is possible for these “local” factors to limit the extent to which the results can be generalized to other
times, places, treatment variations, and participants.

Shadish et al. (2002) suggest categories of external validity threats. In each one, the causal results obtained from a
given evaluation (even where there is acceptable statistical conclusions, internal, and construct validity) are
threatened by contextual factors that somehow make the results unique. They suggest four interaction effects that
reflect their concern with generalizing to other units of analysis (typically people), other policy or program
variations, other outcome variations, and other settings. Keep in mind that in any given evaluation, combinations
of these threats are possible—they do not have to operate mutually exclusively.

1. Interaction between the causal results of a policy or program and the people/participants

Example: When BWC programs are implemented, the police culture of one department may be more
hierarchical, with relatively more “top-down” control in combination with possible covert officer resistance
to the program. In another department, the officers may have a more collaborative culture and may be more
open to the BWC initiative because they feel their professionalism is less threatened.
2. Interaction between the causal results of a policy or program and the treatment variations

Example: In some of the BWC studies, officers were allowed discretion as to when and whether to turn on
the BWC, whereas in other jurisdictions the studies called for the cameras to be “on” at every police
interaction with citizens. The variations in the way BWC has been implemented resulted in surprising
differences in outcomes in different locations (Ariel et al., 2017b).
3. Interaction between the causal results of a policy or program and patterns of outcome variations

Example: A provincially run program that is intended to train unemployed workers for entry-level jobs
succeeds in finding job placements (at least 6 months long) for 60% of its graduates. A comparison with
workers who were eligible for the program but could not enroll due to space limitations suggests that the
program boosted employment rates from 30% to 60%—an incremental effect of 30%. Another province is
interested in the program but wants to emphasize long-term employment (2 years or more). Would the
program results hold up if the definition of the key outcome were changed?
4. Interaction between the causal results of a policy or program and the setting

Example: The Abecedarian Project (Campbell & Ramey, 1994) was a randomized experiment intended to
improve the school-related cognitive skills of children from poor families in a North Carolina community.
The setting was a university town, where most families enjoyed good incomes. The project focused on a
segment of the population that was small (poor families), relative to the rest of the community. The
program succeeded in improving cognitive, academic, and language skills. But could these results, robust
though they were for that study, be generalized to other settings where poor, predominantly minority
families resided?

There is a fifth threat to external validity that also limits generalizability of the causal results of an evaluation.
Shadish et al. (2002) call this “context-dependent mediation.” Basically, context-dependent mediation occurs
when a pre-existing feature of the environment in which the (new) program is implemented influences the program outcomes, and this feature is not present in other settings. An example would be a situation where a
successful crime prevention program in a community used existing neighborhood associations to solicit interest in
organizing blocks as Neighborhood Watch units. Because the neighborhood associations were well established and
well known, the start-up time for the crime prevention program was negligible. Members of the executives of the
associations volunteered to be the first block captains, and the program was able to show substantial numbers of
blocks organized within 6 months of its inception. The program success might have been mediated by the
neighborhood associations; their absence in other communities (or having to start from scratch) could affect the
number of blocks organized and the overall success of the program.

Figure 3.7 shows the four kinds of validity described by Cook and Campbell (1979) and suggests ways that they
can be linked.

Figure 3.7 The Four Kinds of Validity in Research Designs

As proposed in the diagram, statistical conclusions validity “feeds into” internal validity, and the two together
support construct validity. All three support external validity. The questions in Figure 3.7 indicate the key issue
that each kind of validity is intended to address. Notice that statistical conclusions validity and internal validity
focus on the variables as they are measured in a program evaluation. Construct validity and external validity are
both about generalizing; the former involves generalizing from the measured variables and their empirical
correlations back to the constructs and their intended relationships in the program model, and the latter is about
generalizing the evaluation results to other situations. We will talk more about the differences between constructs
and variables in Chapter 4 of this textbook.

Quasi-Experimental Designs: Navigating Threats to Internal Validity
Fundamentally, all research designs are about facilitating comparisons. In this textbook, we focus on research
designs in part because we want to construct comparisons that allow us to answer evaluation questions about
whether and to what extent programs are effective. Experimental designs, because they involve random assignment
of units of analysis to treatment and control groups, are constructed so that program and no-program situations
can be compared “holding constant” other variables that might explain the observed differences in program
outcomes. It is also possible to construct and apply research designs that allow us to compare program and no-
program conditions in circumstances where random assignment does not occur. These quasi-experimental
research designs typically are able to address one or more categories of possible internal validity threats, but not all
of them. An important point—possible threats to internal validity do not equate to probable threats. In other
words, each evaluation setting needs to be approached on its own terms, using possible threats to internal validity
as a guide but seeking information to determine whether any given possible threat is a probable threat.

Resolving threats to internal validity in situations where there are insufficient resources or control to design and
implement an experiment usually requires the judicious application of designs that reduce or even eliminate
threats that are most likely to be problematic in a given situation. Usually, the circumstances surrounding a
program will mean that some potential problems are not plausible threats. For example, evaluating a 1-week
training course for the servers of alcoholic beverages is not likely to be confounded by maturation of the
participants.

In other situations, it is possible to construct research designs that take advantage of opportunities to use
complementary data sources, each of which has its own research design, and combine these designs with ones
involving collecting data specifically for the evaluation. This creates patched-up research designs (Cordray, 1986).
Patched-up designs are usually stronger than any one of the quasi-experimental designs that compose the
patchwork but typically still can present internal validity challenges to evaluators. When we are working with less-
than-ideal research designs, we are usually trying to reduce the uncertainty about program effectiveness rather than
make definitive statements in response to the evaluation questions. The BWC studies that incorporated time series
comparisons provide a good example of cases where quasi-experimental approaches were used to triangulate with
RCTs that had unavoidable program diffusion limitations (e.g., Ariel et al., 2015; Sutherland, Ariel, Farrar, & Da
Anda, 2017).

Full quasi-experimental designs are research designs where people/units of analysis have not been randomly
assigned to the program and the control groups and where there are comparisons that help us assess intended
causal linkages in program logics. This means that we must ask whether the comparisons are robust; threats to
internal validity that would generally be “handled” by random assignment are now a potential problem. For
example, if two groups are being compared (program vs. no-program) and the group getting the program is
selected on a first-come, first-served basis, it is essential to find out whether being there first for the program is
more or less random or, instead, is related to some factor or factors that could also be related to how well the two
groups will “do” if they get the program. In the case of an employment training program, it might be that those
that got there first are better-off economically and heard about the program via the Internet, factors that may
make them more successful in the program than those who got there later. Comparing the two groups could
present us with an “apples and oranges” problem (a selection threat to the internal validity of this research design).
In the case of police BWCs, the fact that a precinct would volunteer for a study could create a different context
from a case where the precinct was resistant to the idea, feeling that the officers were losing their autonomy or
professionalism. As Maskaly et al. (2017) argue,

It is likely that those agencies most willing to engage in BWC research, particularly a tightly controlled
RCT, are those that are least in need of the potential benefits of the BWC. This means the results of the
current studies may actually be underestimating the effects of BWCs. Or, it is quite possible that the
police departments that have implemented BWCs thus far in these evaluations reviewed here are
implementing BWCs as best as they can and these effect sizes are as “good as they are going to get.” (p.
683)

Not all quasi-experimental research designs have comparison groups. In fact, most do not. Single time series
designs (interrupted time series designs) and before–after designs are two quasi-experimental designs that do not
include comparison groups. We will discuss these types of designs in greater detail later in the chapter.

Each quasi-experimental and non-experimental research design has its strengths and limitations in terms of
internal validity. Table 3.4 summarizes possible threats to the internal validity of different quasi-experimental and
non-experimental research designs.

Table 3.4 Quasi-Experimental and Non-Experimental Research Designs and Possible Threats to Internal Validity

For each of the following research designs, the threats listed (checked) are possible threats to internal validity. Where there are comparison groups, they have not been randomly assigned; in the design models, the program group is shown before the slash and the comparison group after it.

Before–after design (O X O): history, maturation, attrition/mortality, testing, instrumentation, statistical regression

Static group comparison design (X O / O): selection, maturation, attrition/mortality

Before–after comparison group design (O X O / O O): selection

Case study design (X O): history, maturation, attrition/mortality, instrumentation, statistical regression

Single time series design (O O O X O O O): history, instrumentation, statistical regression

Comparative time series design (O O O X O O O / O O O O O O): selection

When we look at Table 3.4, there are two considerations to keep in mind: a threat is checked for a design, first, because that category of threats is relevant for the design and, second, because it cannot be controlled given the comparisons built into the design. For example, if we look at the before–after design, you will see that “selection”
is not checked as a possible threat to internal validity. That is because selection can only be a threat where there is a
comparison group. For that same design, “attrition/mortality” is checked as a possible problem because it is
possible that we might not know who drops out of a program, and any before–after comparisons of outcome
scores could be biased due to the differential composition of the group. We have not indicated that “testing” is a
possible threat to the internal validity of case study designs or static group comparison designs. That is because
there are no pre-tests, and testing is a potential problem only when you have pre-tests. “History” is checked as a
threat to internal validity for the before–after design because it is relevant and cannot be controlled with the
comparisons in the research design. Our approach is generally consistent with that taken by Campbell and Stanley
(1966).

Table 3.4 also provides a summary of the essential features/model of each research design using the X and O
notation we introduced earlier in this chapter. Recall that the X is the program intervention that is being
evaluated, and the O is the measured outcome that is being examined in relation to the program. In any given
evaluation, there are typically several research designs (one for each outcome variable that is being examined). That
usually means that when we are assessing the internal validity threats, we have a more complex picture than Table
3.4 implies. Depending on how each construct in the logic model is measured (and whether comparisons are built
into those measures), we can have (overall) research designs that are “patched up”—that is, are combinations of
different designs with different strengths that may be able to compensate for each other’s plausible weaknesses.

Later in this chapter, we will look at an example of a program logic model for an evaluation of a crime prevention
program. We will see how measuring the outcomes produced several different research designs, and this will allow
us to explore some of the issues in assessing “patched-up” evaluation designs.

Some quasi-experimental designs in Table 3.4 are more robust than others. Before we get into a brief discussion of
each, note that none of these research designs can rule out the possibility of “ambiguity of temporal sequence”—
an internal validity threat that is not ruled out even for experimental designs. There are some evaluations where it
is possible to reverse the cause-and-effect relationship between the program and the outcome and find support for
that theoretically. But unless this reversal “makes sense” theoretically, this is not a plausible threat.

Among the quasi-experimental designs in Table 3.4, the before–after comparison group design and the
comparative time series designs have the fewest possible threats to internal validity. These two designs are often
considered to be workable substitutes for fully randomized experimental designs. Part of the reason is that these
designs can be coupled with statistical analyses that can compensate for selection threats to internal validity. Recall
that selection biases occur when differences between the two groups (program vs. no program)—usually
sociodemographic differences—could explain the program versus no-program outcome differences. Propensity
score analysis (usually done with logistic regression techniques), in which sociodemographic characteristics of all
participants (program and control) are used to predict the likelihood/probability that each person is in the
program or the control group, can be used to match individuals so that for each pair (one being in the actual
program group and one in the actual control group), they have the same or very similar propensity scores
(Heckman, Ichimura, Smith, & Todd, 1996). If there are 300 persons in the program group and 500 in the
control group, we would try to match as many of the 300 program participants as possible. There are guidelines
for how closely the propensity scores should match (Caliendo & Kopeinig, 2008). Propensity score analysis is a
relatively effective way to control for selection biases in evaluation designs and is frequently used in quasi-
experimental evaluations.
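
For readers who want to see the mechanics, the following is a minimal sketch of propensity score matching in Python, using pandas and scikit-learn. The data file, the covariates (age, income, education), the outcome and group columns, and the one-to-one nearest-neighbor matching with a 0.05 caliper are all hypothetical illustrations rather than the procedure used in any particular study.

```python
# A minimal sketch of propensity score matching on hypothetical data.
# Assumes pandas and scikit-learn are installed; all column names are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per person: sociodemographic covariates, a program/control
# indicator (1 = program, 0 = control), and a measured outcome.
df = pd.read_csv("participants.csv")
covariates = ["age", "income", "education"]

# Step 1: logistic regression predicts the probability (propensity score)
# that each person is in the program group, given the covariates.
model = LogisticRegression(max_iter=1000).fit(df[covariates], df["in_program"])
df["pscore"] = model.predict_proba(df[covariates])[:, 1]

# Step 2: match each program participant to the control-group member with
# the closest propensity score (nearest neighbor, without replacement,
# within a caliper of 0.05).
program = df[df["in_program"] == 1]
control = df[df["in_program"] == 0].copy()
matched_pairs = []
for _, person in program.iterrows():
    if control.empty:
        break
    distances = (control["pscore"] - person["pscore"]).abs()
    nearest = distances.idxmin()
    if distances[nearest] <= 0.05:
        matched_pairs.append((person["outcome"], control.loc[nearest, "outcome"]))
        control = control.drop(nearest)

# Step 3: compare mean outcomes across the matched pairs.
matched = pd.DataFrame(matched_pairs, columns=["program_outcome", "control_outcome"])
print("Mean difference (program minus matched control):",
      (matched["program_outcome"] - matched["control_outcome"]).mean())
```

In practice, evaluators would also check covariate balance after matching and consult the matching guidelines cited above before interpreting the difference in means as an estimate of the program effect.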

The single time series design is also relatively strong—it is sometimes called the interrupted time series design.
Although it is vulnerable to four possible classes of threats to internal validity, interrupted time series designs are
attractive because they not only lend themselves to statistical analyses to determine the impact of a program or
policy intervention (in the time series) but can also be displayed to show a visual image of the apparent impact of
the program. There are several statistical approaches for analyzing interrupted time series designs. One is to use the
information/data that is available before the program is implemented, to estimate whether the pre- and post-
program segments of the time series are significantly different in ways that are consistent with the intended
program outcome. In an appendix to this chapter, we describe an example of a single time series design being used
to evaluate a policy change (the introduction of entrance fees) in a museum. That example uses a statistical model
to forecast what would have happened without the program so that a comparison can be made between “what
would have happened without the program” with “what did happen with the program.” Later in this chapter, we
will also discuss how time series are useful in measuring and monitoring program performance.
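
To illustrate the statistical side of the single time series design, the sketch below uses segmented regression, one common approach, fitted with Python's statsmodels. The monthly data file, the column names, and the intervention month are hypothetical, and the museum example in the appendix uses a forecasting model rather than this exact specification.

```python
# A minimal segmented-regression sketch for an interrupted (single) time
# series, using hypothetical monthly data. Assumes pandas and statsmodels.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("monthly_outcome.csv")  # hypothetical columns: month (1..72), count
intervention_month = 37                  # hypothetical month when the program started

df["time"] = df["month"]                                                 # overall trend
df["program"] = (df["month"] >= intervention_month).astype(int)          # level shift
df["time_since"] = (df["month"] - intervention_month + 1).clip(lower=0)  # post-program trend

# count = b0 + b1*time + b2*program + b3*time_since + error
# b2 estimates the immediate change in level when the program starts;
# b3 estimates the change in slope after the program starts.
model = smf.ols("count ~ time + program + time_since", data=df).fit()
print(model.summary())

# In practice, autocorrelation in the series should also be checked, for
# example by refitting with Newey-West (HAC) standard errors:
# smf.ols("count ~ time + program + time_since", data=df).fit(
#     cov_type="HAC", cov_kwds={"maxlags": 12})
```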

The static-group comparison design is next on the list in terms of potential threats to internal validity. It is
vulnerable to five different classes of threats to internal validity. Although there is a program/no-program
comparison, there are no baseline measurements. That means that we cannot control for the following: pre-
program differences in the two groups, maturation of the participants, attrition, or selection-based interaction
effects.

The before–after design and the case study research design are both vulnerable to several different classes of
threats. Neither design is vulnerable to selection biases because there is no control group. The before–after design
is vulnerable to testing, given the existence of a pre-test—that is not a threat for the case study design. The case
study design is vulnerable to attrition/mortality as a threat because there is no way to keep track of who the
participants were before the program was implemented—this is also a threat for the before–after design when it is
not possible to keep track of who the pre-test participants are and exclude those who drop out at the post-test
stage.

The case study research design does not include any explicit comparisons that make it possible to see what
differences, if any, the program made. There is no pre–post comparison, no comparison group, and no baseline
measures of outcomes (before the program begins). In Chapter 4, we describe and assess retrospective pre-tests as
a way to work with case study research designs to measure variables pre-program by asking participants to estimate
their pre-program level of knowledge, skill, or competence (whatever the outcome variable is) retrospectively. This
approach is being used more widely where programs are focused on training or education, and no baseline
measures have been taken.

Keep in mind that the internal validity comparisons in Table 3.4 are constructed from the point of view of
someone who has assumed that randomized controlled trials (randomized experiments) are the most valid research
designs. It is also important to remember that the threats to internal validity that have been described are possible
threats and not necessarily plausible threats for a given evaluation.

The reality of many research/evaluation situations in the public and the nonprofit sectors is that case study designs
are all we have. By the time the evaluator arrives on the scene, the program has already been implemented, there is
no realistic way of getting a baseline for the key outcome measures, and there is no realistic way, given the
resources available, to construct a comparison group. As we mentioned previously, however, there are ways of
constructing intragroup comparisons that allow us to explore the differential outcomes of a program across
sociodemographic groups of program participants or across participants who have been served by the program in
different ways.

The York Neighborhood Watch Program: An Example of an Interrupted
Time Series Research Design Where the Program Starts, Stops, and Then
Starts Again
The York, Pennsylvania, neighborhood watch program was intended to reduce reported burglaries at both the
neighborhood and city levels. It was initially implemented in one area of the city, and a no-program “control” area
was established for comparison (Poister, McDavid, & Magoun, 1979).

Reported burglaries were tracked over time at both the neighborhood and citywide levels. In addition, a survey of
the block captains in neighborhood watch blocks where the program was implemented was conducted to solicit
their perceptions of the program, including estimates of resident attendance at neighborhood watch meetings.
Finally, key environmental factors were also measured for the entire period, the principal one being the
unemployment rate in the whole community.

Several research designs were embedded in the evaluation design. At the level of the neighborhood watch blocks,
the program was implemented, and the block captains were interviewed. The research design for this part of the
evaluation was a case study design:

XO

where X is the neighborhood watch program, and O is the measurement of block captain perceptions of program
activity.

Reported burglaries were compared between the neighborhoods that received the program and those that did not.

PROGRAM OOOOOOOOXOXOXOXOXOXOXOXOXO

and

NO PROGRAM OOOOOOOO O O O O O O O O O

where X is the neighborhood watch program, and O is the reported burglaries in the program and no-program
areas of the city. Notice that for the program area of the city, we show the “X”s and “O”s being intermingled.
That shows that the program continued to operate for the full length of this time series, once it was implemented.
This comparative time series design is typically stronger than the case study design because it includes a no-
program group. Among the threats to the internal validity of this design is the possibility that the program group
is not comparable with the no-program group (selection bias). That could mean that differences in reported
burglaries are due to the differences in the two types of neighborhoods, and not necessarily due to the program.

Reported burglaries were also compared before and after the program was implemented, citywide. In the
following, we show the before–after time series, but the program was actually implemented, withdrawn, and then
implemented again—we will discuss this shortly.

OOOOOOOOOOOXOXOXOXOXOXOXO

This single time series design is vulnerable to several internal validity threats. In this case, what if some external
factor or factors intervened at the same time that the program was implemented (history effects)? What if the way
in which reported burglaries were measured changed as the program was implemented (instrumentation)? What if
the citywide burglary rate had jumped just before the program was implemented (statistical regression)?

In this evaluation, several external factors (unemployment rates in the community) were also measured for the
same time period and compared with the citywide burglary levels. These were thought to be possible rival
hypotheses (history effects) that could have explained the changes in burglary rates.

Findings and Conclusions From the Neighborhood Watch Evaluation
The evaluation conclusions indicated that, at the block level, there was some activity, but attendance at meetings
was sporadic. A total of 62 blocks had been organized by the time the evaluation was conducted. That number was
a small fraction of the 300-plus city blocks in the program area alone. At the neighborhood level, reported
burglaries appeared to decrease in both the program and no-program areas of the city. Finally, citywide burglaries
decreased shortly after the program was implemented. But given the sporadic activity in the neighborhood watch
blocks, it seemed likely that some other environmental factor or factors had caused the drop in burglaries.

To explore possible causal relationships among key variables using more data, the evaluation time frame was
extended. Figure 3.8 displays a graph of the burglaries in the entire city from 1974 through 1980. During that
time, the police department implemented two programs: (1) a neighborhood watch program and (2) a team-
policing program. The latter involved dividing the city into team-policing zones and permanently assigning both
patrol and detective officers to those areas.

Figure 3.8 Burglary Levels in York, Pennsylvania: January 1974–February 1980

Figure 3.8 is divided into five time periods. The level of reported burglaries varies considerably, but by calculating
a 3-month moving average (burglaries for January, February, and March would be averaged and that average
reported for February; burglaries for February, March, and April would be averaged and that average reported for
March; and so on), the graph is stabilized somewhat. The 3-month moving average is displayed as the dashed line.
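
The centered 3-month moving average described above is simple to compute. The following sketch shows the calculation in Python with pandas, using made-up monthly burglary counts rather than the actual York data.

```python
# A minimal sketch of a centered 3-month moving average (hypothetical data).
import pandas as pd

# Made-up monthly reported burglaries for one year.
burglaries = pd.Series(
    [110, 125, 118, 130, 142, 128, 121, 135, 140, 152, 147, 139],
    index=pd.period_range("1974-01", periods=12, freq="M"),
)

# For February, average January through March; for March, average February
# through April; and so on. The first and last months have no complete
# 3-month window, so they are left as missing values.
moving_average = burglaries.rolling(window=3, center=True).mean()
print(moving_average)
```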

By inspecting the graph, we can see that the police department initially implemented the neighborhood watch
program, then shortly afterward moved to team policing as well. Both team policing and the neighborhood watch
program were in operation for Period 3, then neighborhood watch was cancelled, but team policing continued
(Period 4). Finally, because the detective division succeeded in its efforts to persuade the department to cancel the
team-policing program (detectives argued that being assigned to area-focused teams reduced their information
base and made them less effective—they wanted to operate citywide), the police department restarted the
neighborhood watch program (Period 5).

Inspection of Figure 3.8 indicates that burglaries were increasing in the period prior to implementing the
neighborhood watch program in 1976. Burglaries dropped, but within 5 months of the neighborhood watch
program being started up, team policing was implemented citywide. When two programs are implemented so
closely together in time, it is often not possible to sort out their contributions to the outcomes—in effect, one
program becomes a “history” rival hypothesis for the other. In this situation, the political/public response to a
perceived burglary problem consisted of doing as much as possible to eliminate the problem. Although
implementing the two programs may have been a good political response, it confounded any efforts to sort out the
effects of the two programs, had the evaluation time frame ended in 1977.

By extending the time series, it was possible to capture two additional program changes: withdrawal of team
policing in 1978 and the reinstatement of the neighborhood watch program at that point. Figure 3.9 depicts these
program changes between 1974 and 1980. The neighborhood watch program is shown as a single time series in
which the program is implemented (1976), withdrawn (1977–1978), and then implemented again (1978–1980).
This on-off-on pattern facilitates being able to detect whether the program affected reported burglaries,
notwithstanding some difficulties in drawing boundaries between the no-program and program periods. Because
some neighborhood watch blocks could continue operating beyond the “end” of program funding in 1977, it is
possible that some program outputs (e.g., block meetings) for that program persisted beyond that point.

Figure 3.9 Implementation and Withdrawal of Neighborhood Watch and Team Policing

Team policing, being an organizational change, would likely involve some start-up problems (e.g., officers getting
to know their assigned neighborhoods), but when it ended, there would be little carryover to the next segment of
the time series.

It is clear from Figure 3.8 that when team policing and neighborhood watch operated together (Period 3), the
citywide level of burglaries was lowest in the time series. When team policing operated alone (Period 4), burglaries
increased somewhat but were still substantially lower than they were for either of the periods (2 and 5) when
neighborhood watch operated alone.

Based on Figure 3.8 and the findings from evaluating the neighborhood watch program at the block and city
levels, it is reasonable (although not definitive) to conclude that the team-policing program was primarily
responsible for reducing burglaries. Our conclusion is not categorical—very few program evaluation findings are
—but is consistent with the evidence and serves to reduce the uncertainty around the question of relative program
effectiveness.

The evaluation of the York crime prevention programs employed several different research designs. Time-series
designs can be useful for assessing program outcomes in situations where data exist for key program logic
constructs before and after (or during) program implementation. The case study design used to survey the block
watch captains is perhaps the most vulnerable to internal validity problems of any of the possible research designs
evaluators can use. For case study designs, modeled as (X O), there is neither a pre-test nor a control group, so
unless we marshal several different lines of evidence that all speak to the question of whether the program caused
the observed outcomes (including the perceptions of stakeholders), we may not be able to reduce any of the uncertainty around that question.

As an example, suppose you have been asked to evaluate a program that offers small businesses subsidies to hire
people aged 17 to 24 years. The objectives of the program are to provide participants with work experience, to improve their knowledge of business environments, and to encourage them either to start their own business or to pursue business-related postsecondary education.

As a program evaluator, it would be worthwhile having a comparison group who did not get the program, so that
constructs like “increased knowledge of business practices” could be measured and the results compared. But that
may not be possible, given resource constraints. Instead, you might still be expected to evaluate the program for its
effectiveness and be expected to do so by focusing on the program alone.

One way to reduce uncertainty in the conclusions drawn is to acknowledge the limitations of a case study (X O)
design but apply the design to different stakeholder groups. In the business experience program evaluation, it would
make sense to survey (or interview) a sample of clients, a sample of employers, and the program providers. These
three viewpoints on the program are complementary and allow the evaluator to triangulate the perspectives of
stakeholders. In effect, the X O research design has been repeated for three different variables: (1) client
perceptions, (2) employer perceptions, and (3) program provider perceptions.

Triangulation is an idea that had its origins in the literature on measurement. We are adapting it to evaluation
research designs. As a measurement strategy, triangulation is intended to strengthen confidence in the validity of
measures used in social research.

Once a proposition has been confirmed by two or more independent measurement processes, the
uncertainty of its interpretation is greatly reduced. The most persuasive evidence comes through a
triangulation of measurement processes. If a proposition can survive the onslaught of a series of
imperfect measures, with all their irrelevant error, confidence should be placed in it. (Webb, 1966, p. 3)

In our situation, triangulation that is focused on the question of whether the program was effective can at least
establish whether there is a concurrence of viewpoints on this question, as well as other related issues. It does not
offer a firm solution to the problem of our vulnerable research design, but it offers a workable strategy for
increasing confidence in evaluation findings. In Chapter 5, we will talk about mixed-methods evaluation designs
where qualitative and quantitative lines of evidence are compared and triangulated.

Non-Experimental Designs
Non-experimental designs are ones that have no explicit comparisons built into the design. That means that there
is no no-program group, nor is there a before–after comparison for even the program group alone. In the practice
of program evaluation, non-experimental designs are quite common. Typically, they are used when the
evaluator(s) have been brought into the picture after the program has been implemented, and the prospects for a
comparison group or even a before–after comparison for the program group are dim.

Evaluators in such situations are limited in their ability to structure comparisons that offer robust capabilities to
discern incremental program impacts. One strategy that is often used is to construct internal comparisons; for
example, if the clients of a program differ in the extent to which they have used the services offered, we can
compare (even correlate) their outcomes with the “dosage” they received (Bickman, Andrade, & Lambert, 2002).
Another strategy is to divide the clients into subgroups (gender, age, education, employment status, or geography
as possible classification variables) and see how those subgroups compare in terms of measured outcomes. Notice
that we are not constructing a no-program comparison, but we can illuminate ways that program effectiveness
varies within the client group. Research designs in which the comparisons are internal to the program groups are
often called implicit designs or case study designs.
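
A minimal sketch of what these internal comparisons can look like is shown below, in Python with pandas. The client file and its columns (gender, age_group, sessions_attended as the “dosage” measure, and outcome_score) are hypothetical.

```python
# A minimal sketch of internal ("implicit design") comparisons within a
# program group, using a hypothetical client data set.
import pandas as pd

clients = pd.read_csv("clients.csv")
# hypothetical columns: gender, age_group, sessions_attended, outcome_score

# Dosage comparison: is the amount of service received associated with
# the measured outcome?
print("Dosage-outcome correlation:",
      clients["sessions_attended"].corr(clients["outcome_score"]))

# Subgroup comparison: how do measured outcomes vary across client groups?
print(clients.groupby(["gender", "age_group"])["outcome_score"].mean())
```

Comparisons like these do not estimate incremental program effects, but they can show how outcomes vary within the client group, which is often the most that a non-experimental design can support.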

In Chapter 5, we will discuss mixed-methods evaluation designs where the questions driving the evaluation are
addressed by a mix of quantitative and qualitative lines of evidence. In many of these situations, the main research
design is an X O design. But because we can examine both qualitative and quantitative sources of evidence and
compare them, we are able to strengthen the overall evaluation design through this triangulation process. In
effect, where we have a non-experimental design as our main research design, we can triangulate within lines of
evidence (e.g., compare findings by gender, age, or dosage level) or across lines of evidence (qualitative and
quantitative sources of data). Typically, program evaluations relying on non-experimental research designs include
both kinds of triangulations.

Testing the Causal Linkages in Program Logic Models
Research designs are intended as tools to facilitate examining causal relationships. In program evaluation, there has
been a general tendency to focus (and display) research designs on the main linkage between the program as a
whole and the observed outcomes. Although we can evaluate the program as a “black box” by doing this, the
increasing emphasis on elaborating program descriptions as logic models (see Chapter 2) presents situations where
our logic models are generally richer and more nuanced than our research designs. When we evaluate programs,
we generally want to examine the linkages in the logic model so that we can see whether (for example) levels of
outputs are correlated with levels of short-term outcomes and whether they, in turn, are correlated with levels of
longer-term outcomes. Research designs are important in helping us understand the logic of isolating each link so
that we can assess whether the intended causal relationships are corroborated; isolating individual linkages in a
program logic amounts to asking, “Is the independent variable in the linkage correlated with the dependent
variable, and are there any other factors that could explain that correlation?” Designing an evaluation so that each
program linkage can successively be isolated to rule out rival hypotheses is expensive and generally not practical.

Let us go back to the York crime prevention program and display the program logic in Table 3.5.

Table 3.5 Program Logic for the York Crime Prevention Program

Components: Neighborhood watch blocks

Implementation Activities: To organize neighborhood watch blocks in the target areas of the city

Outputs: Number of blocks organized; number of block meetings held

Short-Term Outcomes: Attendance at block meetings; increased awareness of burglary prevention techniques; increased application of prevention techniques; improved home security; reduction in burglaries in neighborhood watch blocks

Intended Longer-Term Outcomes: Reduction in burglaries committed citywide

The intended outcome in Table 3.5 is a reduction in burglaries committed, first at the program blocks level and
then citywide. To achieve that outcome, the program logic specifies a series of links, beginning with organizing
city blocks into neighborhood watch blocks. If the program logic works as intended, then our (embedded)
program theory will have been corroborated.

Like most evaluations, this one relies on several different lines of evidence. If we look at the logic model, beginning
with the outputs and moving through it to the main outcomes, we see that each of the key constructs will be
measured in different ways. For example, number of blocks organized is measured by counting and tracking blocks
that have been organized, over time. The research design that is connected with that measure is a single time
series. Moving through the program logic, other key constructs are measured in other ways, and each implies its
own research design. What we are saying is that when you look at any program logic model, asking yourself how
the key constructs are going to be measured will suggest what research design is implied by that measure.

Table 3.6 summarizes how the key constructs in the York crime prevention evaluation are measured. When we look at the key constructs in the logic model, we see that the evaluation involves three different types of research designs: a case study design for the constructs we are measuring from our interviews with block captains, a single time series for the number of neighborhood watch blocks organized, a comparative time series for the numbers of burglaries reported in the program and control areas of the city, and another single time series for the citywide monthly totals of reported burglaries.

Table 3.6 Summary of Key Constructs and Their Corresponding Research Designs for the York Crime Prevention Program

Construct in the logic model: The number of blocks organized (an output variable). What we are observing/measuring: Counts of blocks organized as recorded monthly by the police department. Research design implied by the measurement process: Single time series.

Construct in the logic model: Estimates of the numbers of meetings held (an output) and an estimate of attendance at block meetings (a short-term outcome). What we are observing/measuring: Perceptions of block captains, obtained by interviewing them at one point in time. Research design implied by the measurement process: Case study.

Construct in the logic model: Reported burglaries (an outcome). What we are observing/measuring: Monthly counts of reported burglaries, compared in the two areas of York that were “program” and “control.” Research design implied by the measurement process: Comparative time series.

Construct in the logic model: Reported burglaries citywide (an outcome). What we are observing/measuring: Monthly counts (totals) of citywide reported burglaries before and after the program was implemented. Research design implied by the measurement process: Single time series.

By reviewing Table 3.6, we can see that there were three different types of research designs in the evaluation and that none
of them facilitated an examination of the whole logic model. Each design focused on one construct in the model,
and the data collected permitted the evaluators to see how that part of the logic model behaved—that is, what the
numbers were and, for the time series designs, how they trended over time.

As an example, the block captain interviews focused on perceived attendance at meetings, which is an important
short-term outcome in the logic model. But they did not measure increased awareness of prevention techniques,
increased applications of techniques, or improved home security. In fact, those three constructs were not measured
at all in the evaluation. Measuring these constructs would have required a survey of neighborhood residents, and
there were insufficient resources to do that. Likewise, the time series designs facilitated tracking changes in blocks
that were organized and reported burglaries over time, but they were not set up to measure other constructs in the
program logic.

In sum, each research design addresses a part of the program logic and helps us see if those parts are behaving as
the logic intended. But what is missing is a way to test the connections between constructs. Even if blocks are
organized and neighbors attend block watch meetings, we do not know whether the steps leading to reduced
burglaries have worked as intended. This limitation is common in program evaluations. We can answer key
evaluation questions through gathering multiple lines of evidence intended to measure constructs in our program
logic models, but what we often cannot do is test particular linkages in the logic model. Petrosino (2000) has
suggested that to begin testing the program theory that is embedded in a logic model, we need to be able to test at
least one linkage—meaning that we need to have data that measure both ends of a link and allow us to examine the
covariation between the two ends.

In Chapter 4, we will discuss units of analysis in connection with measurement, but for now, what we are saying
is that examining a linkage between two constructs in a logic model requires that both the intended cause-and-
effect variables (both ends of that link) be measured using the same unit of analysis. In our logic model for the York
Crime Prevention Program, if we wanted to see whether “increased awareness of burglary prevention techniques”
was actually correlated with “increased application of prevention techniques,” we could have surveyed a sample of
York residents and asked them questions about their awareness and application of prevention techniques. Then,
we could see whether, for the people in our sample, a greater awareness of prevention techniques was correlated
with them using more of those on their homes.
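
A sketch of what testing that single linkage might look like appears below, assuming a hypothetical survey file in which each row is one York resident and both constructs are measured for the same unit of analysis.

```python
# A minimal sketch of testing one linkage in a logic model: does awareness
# of burglary prevention techniques co-vary with their application?
# The survey file and column names are hypothetical.
import pandas as pd
from scipy.stats import pearsonr

survey = pd.read_csv("resident_survey.csv")
# hypothetical columns: awareness_score, techniques_applied

r, p_value = pearsonr(survey["awareness_score"], survey["techniques_applied"])
print(f"Awareness-application correlation: r = {r:.2f} (p = {p_value:.3f})")
```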

The current movement in evaluation toward explicating program theories and testing them in evaluations was
discussed in Chapter 2 of this textbook. It is worth recalling that in a recent content analysis of a sample of self-
proclaimed theory-driven evaluations, the authors of that review found that many so-called theory-driven
evaluations did not actually test the program theory (Coryn, Schröter, Noakes, & Westine, 2011).

Why not design program evaluations in which the whole logic model is tested? One reason is that evaluators
usually do not have the control or the resources needed to set up such an evaluation design. In the York crime
prevention evaluation, a full test of the program logic in Table 3.5 (testing the intended linkages among outputs
and outcomes) would require that all the constructs be measured using the same unit of analysis. To see what that
might have looked like, suppose that random samples of residents in the target and control neighborhoods had
been enlisted to participate in a 4-year study of crime prevention effectiveness in the city. Initially (2 years before
the program started), each household (the main unit of analysis) would be surveyed to find out if any had
experienced burglaries in the past 12 months; householders would also be asked about their participation in any
crime prevention activities, their awareness of burglary prevention techniques, and their existing home security
measures.

This survey could be repeated (using the same sample—a cohort sample) in each year for 4 years (2 before the
program and 2 after implementation). After the program was implemented, the survey participants would also be
asked whether they participated in the program and, if they did, how frequently they attended block watch
meetings; whether the block watch meetings increased their awareness (they could be “tested” for their level of
awareness); whether they were taking any new precautions to prevent burglaries; and finally, whether their homes
had been burglarized in the previous 12 months.

Notice that this information “covers” the linkages in the logic model, and by comparing responses between the
target and control neighborhoods and by comparing responses within households over time, we could assess the
causal linkages in all parts of the model. Comparisons between program and no-program residents after the
program was implemented would indicate whether program residents were more likely to be aware of burglary
prevention methods, more likely to apply such methods, or more likely to have more secure homes, and whether
those links, in turn, were correlated with lower incidence of burglaries in those households. Again, the key point is
that the same unit of analysis (families in this example) has been used to measure all the constructs.

The Perry Preschool experimental study described earlier in this chapter has been set up in this way. The same
study participants (program and control group members) have been surveyed multiple times, and their progress in school, their encounters with the social service system, and their encounters with the criminal justice system have all been recorded.

In that study, the researchers constructed and tested a logic model for the experiment and showed how key
constructs were linked empirically for each participant from elementary school to age 40 (Schweinhart et al.,
2005). In effect, the model shows how the initial differences in cognitive outcomes have been transformed over
time into differences in educational attainment and social and economic success. We include this empirically
driven logic model (called a causal model) in Appendix B of this chapter. Given money, time, and control over the
program situation, it is possible to fully test program logics, as the Perry Preschool Study demonstrates. Testing
the program theory—using approaches that permit tests of logic models—is an important and growing part of the
field (Funnell & Rogers, 2011; Knowlton & Phillips, 2009). We have a ways to go, however, before we will be
able to say that theory-driven evaluations, appealing in theory, are realized in practice.

When we look at the practice of evaluation, we are often expected to conduct program evaluations after the
program has been implemented, using (mostly) existing data. These constraints usually mean that we can examine
parts of program logics with evidence (both qualitative and quantitative) and other parts with our own
observations, our experience, and our professional judgments.

Research Designs and Performance Measurement
The discussion so far in Chapter 3 has emphasized the connections between the comparisons that are implied by
research designs and the central question of determining program effectiveness in program evaluations. Research
design considerations need to be kept in mind if an evaluator wants to be in a position to conduct a credible
evaluation.

One of the reasons for emphasizing single time series research designs in this chapter is that the data for this kind
of comparison are often available from existing organizational records and overlap with a key source of
information in performance measurement systems. Administrative data are often recorded over time (daily,
weekly, monthly, quarterly, or yearly) and can be included in evaluations. Using administrative data saves time
and money but typically raises questions about the validity and reliability of such data. We will talk about the
validity and reliability of measures in Chapter 4.

In performance measurement systems, administrative data are often the main source of information that managers
and other stakeholders use. Data on outputs are often the most accessible because they exist in agency records and
are often included in performance measures. Outcome-focused performance measures, although generally deemed
to be desirable, often require additional resources to collect. In organizations that are strapped for resources, there
is a tendency to measure what is available, not necessarily what should be measured, given the logic models that
undergird performance measurement systems.

Performance measurement systems are often put together to serve the purposes of improving the efficiency and
effectiveness of programs (we can call these performance improvement–related purposes) and accountability
purposes. Accountability purposes include publicly reporting program results to stakeholders (Hatry, 1999;
McDavid & Huse, 2012). A key issue for any of us who are interested in developing and implementing credible
performance measurement systems is the expectation, on the one hand, that the measures we come up with will
tell us (and other stakeholders) how well the observed outcomes approximate the intended program objectives
and, on the other hand, that the measures we construct will actually tell us what the program (and not other causes)
has accomplished. This latter concern is, of course, our incrementality question: What differences did the program
actually make? Answering it entails wrestling with the question of the extent, if any, to which the program caused
the observed outcomes.

Performance measurement systems, by themselves, are typically not well equipped to tell stakeholders whether the
observed outcomes were actually caused by the program. They can describe the observed outcomes, and they can
tell us whether the observed outcomes are consistent with program outcomes, but there is usually a shortage of
information that would get at the question of whether the observed outcomes were the result of program activities
(Newcomer, 1997).

If we think of performance measurement as a process of tracking program-related variables over time, we can see
that many of the measures built into such systems are, in fact, time series. Variables are measured at regular
intervals, and the changes in their levels are assessed. Often, trends and levels of performance variables are
compared with targets or benchmarks. In some situations, where a change in program structure or activities has
been implemented, it is possible to track the before–after differences and see whether the observed changes in
levels and trends are consistent with the intended effects. Such tracking has become more commonplace and thus
more accessible to evaluators and other stakeholders, with the availability of technological tools (e.g., Internet
accessibility, government transparency initiatives, and data visualization software).

In situations where we want to use time series data to look for effects that are consistent (or inconsistent) with
intended outcomes, we need continuity in the way variables are measured. If we change the way the measures are
taken or if we change the definition of the measure itself (perhaps to improve its relevancy for current program
and policy priorities), we jeopardize its usefulness as a way to assess cause-and-effect linkages (i.e., we create
instrumentation problems).

179
In program evaluations and in performance measurement systems, outputs are typically viewed as attributable to
the program—one does not usually need an elaborate research design to test whether the outputs were caused by
the program. This means that performance measures that focus on outputs typically can claim that they are
measuring what the program actually produced.

When we look at the different research designs covered in this chapter, using time series designs is where it is
possible for program evaluation and performance measurement to overlap. Using administrative data sources that
track a variable over time facilitates dividing the time series so its segments show before, during, and perhaps even
after a program was implemented. This can give us a good start on a program evaluation and, at the same time,
describe how that performance measure trended over time.
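
For readers who want to see what this looks like in practice, the following Python sketch uses hypothetical monthly administrative data (the program start month, values, and units are all invented for illustration) to split a performance measure into pre- and post-implementation segments and compare their levels and trends.

# A minimal sketch (hypothetical data): splitting a monthly administrative
# time series into pre-implementation and post-implementation segments and
# comparing their levels and trends.
import numpy as np

rng = np.random.default_rng(42)

n_months = 48                               # 4 years of monthly observations
program_start = 24                          # program implemented at month 24 (assumed)

# Simulated performance measure: flat before, modest upward shift after.
series = 100 + rng.normal(0, 5, size=n_months)
series[program_start:] += 12

pre, post = series[:program_start], series[program_start:]

def trend(segment):
    """Slope of a simple linear trend fitted to one segment."""
    t = np.arange(len(segment))
    slope, _ = np.polyfit(t, segment, 1)
    return slope

print(f"Pre-program mean level : {pre.mean():.1f}")
print(f"Post-program mean level: {post.mean():.1f}")
print(f"Pre-program trend      : {trend(pre):+.2f} per month")
print(f"Post-program trend     : {trend(post):+.2f} per month")

A comparison of this kind describes the before–after pattern; as the chapter emphasizes, ruling out rival explanations for any shift still requires attention to the research design.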

180
Summary
This chapter focuses on how research designs can support program evaluators who want to assess whether and to what extent observed
outcomes are attributable to a program. Examining whether the program was effective is a key question in most evaluations, regardless of
whether they are formative or summative.

Research designs are not the same thing as evaluation designs. Evaluation designs include far more—they describe what the purposes of
the evaluation are, who the client(s) are, what the main evaluation questions are, what the methodology is, the findings as they relate to
the evaluation questions, the conclusions, and finally the recommendations. Research designs focus on how to structure the comparisons
that will facilitate addressing whether the program was effective.

Through randomized experimental designs, whereby units of analysis (often people) are assigned randomly to either the program or the
control group, it is possible to be more confident that the two groups are equal in all respects before the program begins. When the
program is implemented, the difference between the two groups in terms of outcomes should be due to the program itself. This makes it
possible to isolate the incremental effects of the program on the participants. Typically, in randomized experiments, we say that we have
controlled for threats to the internal validity of the research design, although there can be internal validity problems with the
implementation of experiments (Cronbach, 1982; Olds, Hill, Robinson, Song, & Little, 2000).

Randomized experiments usually require more resources and evaluator control to design and implement them well than are available in
many evaluations. But the logic of experimental designs is important to understand if evaluators want to address questions of whether the
program caused the observed outcomes. The three conditions for establishing causality—(1) temporal asymmetry, (2) covariation between
the causal variable and the effect variable, and (3) no plausible rival hypotheses—are at the core of all experimental designs and, implicitly
at least, are embedded in all evaluations that focus on program effectiveness.

In assessing research designs, we should keep in mind that the four different kinds of validity are cumulative. Statistical conclusions
validity is about using statistical methods correctly to determine whether the program and the outcome variable(s) co-vary. Covariation is
a necessary condition for causality. Internal validity builds on statistical conclusions validity and examines whether there are any plausible
rival hypotheses that could explain the observed covariation between the program and the outcome variable(s). Ruling out plausible rival
hypotheses is also a necessary condition for causality. Construct validity is about the generalizability of the data-based findings (the
empirical links between and among variables) back to the constructs and their intended linkages in the logic model. Finally, external
validity is about the generalizability of the results of the program evaluation to other times, other programs, other participants, and other
places.

Departures from randomized experimental designs can work well for determining whether a program caused the observed outcomes. One
of the most common quasi-experimental designs is the single time series, where a program is implemented partway through the time
series. Single and multiple time series make it possible to selectively address internal validity threats. Deciding whether a particular threat
to internal validity is plausible or not entails using what evidence is available, as well as professional judgment.

When we develop a logic model of a program, we are specifying a working theory of the program. Ideally, we want to test this theory in
an evaluation. Most program evaluations do not permit such testing because the resources are not there to do so. Rather, most evaluations
use several different research designs, each having the capability of testing a part of the logic model, not all of it collectively. The Perry
Preschool Study is an example of an evaluation that has been able to test the (evolving) theory that undergirds the program.

Full-blown theory-driven program evaluations are designed so that it is possible to test the full logic model. By specifying a single unit of
analysis that facilitates data collection for all the constructs in the model, statistical methods, combined with appropriate comparison
group research designs, can be used to test each linkage in the model, controlling for the influences on that link of other paths in the
model. The Perry Preschool Program evaluation is an example of such an approach. Although this approach to program evaluations
typically requires extensive resources and control over the evaluation process, it is growing in importance as we realize that linking logic
models to research designs that facilitate tests of causal linkages in the models is a powerful way to assess program effectiveness and test
program theories.

Performance monitoring often involves collecting and describing program results over time. Time series of performance results are a
useful contribution to program evaluations—they are where program evaluations and performance measurement systems overlap. Where
it is possible to gather performance data before and after a program has been implemented, in effect, we have variables that can be very
useful for assessing what differences, if any, the advent of the program had on the trends and levels in the time series.

181
Discussion Questions
1. The following diagram shows several weak research designs that have been used in an evaluation. The “O” variable is the same for
the entire diagram and is measured in such a way that it is possible to calculate an average score for each measurement. Thus, O1,
O2, O3, O4, and O5 all represent the same variable, and the numbers in parentheses above each represent the average score for
persons who are measured at that point. All the persons in Group 1 are post-tested; Group 2 had been randomly divided into two
subgroups, and one subgroup had been pre-tested and the other one had not. Notice that all the members in Group 2 got the
program. Finally, for Group 3, there was a pre-test only (to be post-tested later).
Examine the averages that correspond to the five measurements and decide which threat to the internal validity of the
overall research design is clearly illustrated. Assume that attrition is not a problem—that is, all persons pre-tested are also
post-tested. Explain your answer, using information from Table 3.7.

Table 3.7 What Threat to Validity Is Illustrated by This Patched-Up Research Design?

                     (6.0)
Group 1         X     O1

              (4.0)           (7.0)
Group 2   R    O2        X     O3

                              (6.0)
          R              X     O4

              (4.0)
Group 3        O5

2. What is a key difference between internal validity and external validity in research designs?
3. What is the difference between testing and instrumentation as threats to the internal validity of research designs?
4. What is the difference between history and selection as threats to the internal validity of research designs?
5. A nonprofit organization in a western state has operated a 40-hour motorcycle safety program for the past 10 years. The program
permits unlicensed, novice motorcycle riders to learn skills that are believed necessary to reduce accidents involving motorcyclists.
On completing the 1-week course, trainees are given a standard state driver’s test for motorcycle riders. If they pass, they are
licensed to ride a motorcycle in the state. The program operates in one city and the training program graduates about 400
motorcyclists per year. The objective of the program is to reduce the number of motor vehicle accidents involving motorcyclists.
Because the program has been targeted in one city, the effects would tend to be focused on that community. The key question is
whether this course does reduce motorcycle accidents for those who are trained. Your task is to design an evaluation that will tell
us whether the training program is effective in reducing motorcycle accidents. In designing your evaluation, pay attention to the
internal and construct validities of the design. What comparisons would you want to build into your design? What would you
want to measure to see whether the program was effective? How would you know if the program was successful?
6. Two program evaluators have designed an evaluation for a museum. The museum runs programs for school students, and this
particular program is intended to offer students (and teachers) an opportunity to learn about Indigenous American languages,
culture, and history.

The museum wants to know if the program improves students’ knowledge of Indigenous languages and culture. The evaluators
are aware that the museum has limited resources to actually conduct this evaluation, so they have been creative in the ways that
they are measuring program outcomes. One feature of their proposed evaluation design is a before–after comparison of knowledge
about Indigenous Americans for several classes of school children who visit the museum. The evaluators have built into their
design a control group—several classes of children who have not yet gone to the museum but are on a list of those who are
planning such a visit in this year.

To save time and effort, the evaluators are proposing to pre-test both the program and the control group to see if their knowledge
levels are similar before the program begins. But they have decided not to post-test the control group.

The rationale is as follows:

The control group and program group pre-test comparisons will provide a full range of understanding of what the knowledge
level is before the program. They will then use the pre- and post-program group test to determine the amount of learning achieved through the program. It is their view that since the control group is not receiving the program, their knowledge will
not be influenced by the program. One way of looking at this is that the pre-test of the control group is essentially their post-
test as well—since they are not receiving the program, their learning will not have changed. The evaluators are trying to
streamline things given the museum’s limited resources for this evaluation.

What are the strengths and weaknesses of their strategy?

183
Appendices

184
Appendix 3A: Basic Statistical Tools for Program Evaluation

Figure 3A.1 Basic Statistical Tools for Program Evaluation

185
Appendix 3B: Empirical Causal Model for the Perry Preschool Study
The Perry Preschool Study is widely considered to be an exemplar in the field of early childhood development.
One of its unique features is the length of time that the program and control groups have been observed and
repeatedly measured across their life spans. The original logic model for the program was based on cognitive
development theories exemplified by the work of Piaget (Berrueta-Clement et al., 1984). The researchers believed
that exposing children from relatively poor socioeconomic circumstances to an enriched preschool environment
would increase their measured intelligence and position them for a successful transition to school. In the Perry
Preschool Study, the initial IQ differences between the preschool and no-preschool groups tended to diminish, so
that by age 10, there were no significant differences. But as additional observations of the two groups were added
onto the original research design, other differences emerged. These continued to emerge over time and, by age 40,
a pattern of differences that reflected life span development could be discerned. The Perry Preschool researchers
have summarized key variables in the whole study in a nonrecursive causal model that we have reproduced in
Figure 3B.1.

This causal model is based on measures first taken when the children in the study (both the program and the
control groups) were 3 years old and extends to measures included in the latest age-40 wave (Schweinhart et al.,
2005). Path analysis is a technique that simultaneously examines all the linkages in Figure 3B.1, summarizing
both their relative strengths and their statistical significance. If we move through the model from left to right, we
can see that the preschool experience and pre-program IQ variables are both empirically connected with post-
program IQ (measured at age 5). The strengths of the two paths are indicated by the numbers embedded in the
arrows. Those numbers are called path coefficients; they are standardized coefficients that vary between −1 and +1, indicating how important each link is in the model and allowing us to compare the strengths of all the links directly. If we look at the link between the preschool
experience and post-program IQ, we see that the strength of that link is .477, which suggests that it is among the
strongest empirical links in the whole causal model. At the same time (and not surprisingly), pre-program IQ is
strongly and positively correlated with post-program IQ (a standardized path coefficient of .400). Together,
preschool experience (keep in mind that this is a “yes” or “no” variable for each child) and pre-program IQ explain
41.8% of the variance in post-program IQ; the .418 just below the post-program IQ box in the causal model is
the proportion of the variance in post-program IQ explained by the combination of preschool experience and pre-
program IQ.
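
For readers who are curious about the mechanics, the short Python sketch below uses simulated data (not the Perry Preschool data) to show how one link in a path model of this kind can be estimated: when the variables are standardized, the ordinary least squares regression weights are standardized path coefficients, and R-squared is the proportion of variance explained for that box. The sample size and effect sizes are arbitrary.

# Illustrative sketch only: simulated data, not the Perry Preschool data.
# Each standardized path coefficient can be estimated by regressing a
# standardized outcome on its standardized predictors; R-squared is the
# variance explained for that outcome.
import numpy as np

rng = np.random.default_rng(0)
n = 123                                  # sample size chosen arbitrarily

preschool = rng.integers(0, 2, n)        # 1 = program group, 0 = control
pre_iq = rng.normal(80, 10, n)
post_iq = 0.5 * pre_iq + 8 * preschool + rng.normal(0, 6, n)

def standardize(x):
    return (x - x.mean()) / x.std()

# Standardize everything so the regression weights are path coefficients.
X = np.column_stack([standardize(preschool), standardize(pre_iq)])
y = standardize(post_iq)

coefs, residuals, *_ = np.linalg.lstsq(X, y, rcond=None)
r_squared = 1 - residuals[0] / np.sum(y**2)

print(f"Path: preschool -> post-program IQ  {coefs[0]:.3f}")
print(f"Path: pre-IQ    -> post-program IQ  {coefs[1]:.3f}")
print(f"Variance explained (R-squared)      {r_squared:.3f}")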

If we move across the model from left to right, we see other boxes and arrows that connect them—each arrow also
includes a standardized path coefficient much like the ones we have already described. The most important
empirical link in the whole model is between post-program IQ and school-related commitment at age 15. In other
words, the research team discovered that the higher the IQ of the children at age 5, the stronger their school
commitment at age 15.

The other paths in the model can be described similarly. When we get to the end of the model (the right-hand
side), we see that educational attainment by age 40 is positively connected with earnings at age 40, and
educational attainment at age 40 is negatively connected with arrests: more education is associated with fewer
arrests.

This whole causal model is based on data from the study participants (program and control groups) taken in five
different waves of data collection. Because there is one unit of analysis (participants), the Perry Preschool evaluators
could construct this path model, which, in effect, is an empirical test of the (evolving) program theory in the
study. All the links among the variables can be examined at the same time, and all the paths in the model are
statistically significant at the .01 level. Other possible paths that might connect the variables in Figure 3B.1 are
apparently not statistically significant.

186
Figure 3B.1 A Causal Model of Empirical Linkages Among Key Variables in the Perry Preschool Study

Note: Path coefficients are standardized regression weights, all statistically significant at p < .01; coefficients in
each box are squared multiple correlations.

Source: Schweinhart et al. (2005, p. 5).

187
Appendix 3C: Estimating the Incremental Impact of a Policy Change—
Implementing and Evaluating an Admission Fee Policy in the Royal British
Columbia Museum
In July 1987, the Royal British Columbia Museum in Victoria, British Columbia, implemented an entrance fee
for the first time in the history of the museum. The policy was controversial but was felt to be necessary given cuts
in government support for the museum. Opponents predicted that monthly attendance (a key outcome measure)
would decrease permanently. Proponents of the new fee predicted that attendance would decrease temporarily but
would bounce back to pre-fee levels over time. The evaluation focused on this variable and looked at monthly
museum attendance as a single time series (McDavid, 2006).

Incrementality is a key part of assessing program or policy effectiveness—it can be stated as a question: What
would have happened if the program had not been implemented? In program evaluations, this is often called the
counterfactual—we want to be able to measure what difference the program made to the outcomes we have in
view. Earlier in this chapter, we looked at randomized experimental research designs as one way to construct the
counterfactual situation. In effect, the control group becomes the situation that would have occurred without the
program, and the program group lets us see what happens when we implement the program. Comparing the
outcome variables across the two groups gives an indication of what differences the program made.
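
The arithmetic behind that comparison is straightforward. The following Python sketch, which uses simulated outcome scores rather than data from any real evaluation, treats the difference in group means as the estimate of the incremental effect and adds an approximate confidence interval around it.

# A minimal sketch (simulated data): with random assignment, the control group
# approximates the counterfactual, so the difference in mean outcomes between
# the program and control groups estimates the program's incremental effect.
import numpy as np

rng = np.random.default_rng(11)
n = 500

outcome_control = rng.normal(50, 10, n)          # outcomes without the program
outcome_program = rng.normal(55, 10, n)          # outcomes with the program

incremental_effect = outcome_program.mean() - outcome_control.mean()
se = np.sqrt(outcome_program.var(ddof=1) / n + outcome_control.var(ddof=1) / n)

print(f"Estimated incremental effect: {incremental_effect:.1f}")
print(f"Approximate 95% interval: "
      f"{incremental_effect - 1.96 * se:.1f} to {incremental_effect + 1.96 * se:.1f}")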

What made this situation amenable to assessing incrementality was the availability of monthly museum attendance
data from 1970 to 1998. The admission fee was a permanent intervention in that time series, so we can see what
impact it made from July 1987 onward. Figure 3C.1 includes actual monthly attendance at the museum from
1970 through June of 1987 (the month before the fee was implemented).

Figure 3C.1 Forecasting Monthly Attendance From July 1987 to 1998

To construct a model of what attendance would have looked like if there had been no fee beyond June 1987, the
evaluators used a multivariate model (ordinary least squares multiple regression) to predict monthly attendance,
using the actual attendance from 1970 to June 1987 as input to the model and forecasting attendance from July
1987 to 1998. Figure 3C.1 shows that predicted attendance would have gradually increased and continued the
marked annual cycle of ups in the summer months when tourists arrived in Victoria and downs in the winter months when Victoria was gray and cold. Notice how closely the multiple regression model follows the actual attendance cycles over time. Using this approach, it was possible for the evaluators to be reasonably
confident that the forecast beyond June 1987 was robust. Other methods could have been used to construct that forecast—there is a family of multivariate statistical methods designed specifically for interrupted time series analysis. These methods are called ARIMA (autoregressive, integrated, moving average) modeling methods (Box, Jenkins, Reinsel, & Ljung, 2015; Cook & Campbell, 1979).
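
A simplified version of the forecasting logic is sketched below in Python, using invented monthly attendance figures rather than the museum's actual data. The model here (a linear trend plus monthly seasonal dummies, fitted by ordinary least squares to the pre-fee months only) is a stand-in for the evaluators' specification, not a reproduction of it; the point is that the gap between the forecast and the actual post-fee series is the estimate of incremental impact.

# A simplified sketch with hypothetical data (not the museum's actual figures):
# fit a trend-plus-monthly-seasonality regression to the pre-fee months only,
# forecast the post-fee months, and treat the forecast-minus-actual gap as
# the estimate of incremental impact.
import numpy as np

rng = np.random.default_rng(1)
n_months = 120
fee_start = 84                              # fee introduced at month 84 (assumed)

t = np.arange(n_months)
month_of_year = t % 12
seasonal = 20 * np.sin(2 * np.pi * month_of_year / 12)
attendance = 150 + 0.1 * t + seasonal + rng.normal(0, 5, n_months)
attendance[fee_start:] -= 40                # simulated permanent drop after the fee

# Design matrix: intercept, linear trend, and 11 monthly dummies.
dummies = np.eye(12)[month_of_year][:, 1:]
X = np.column_stack([np.ones(n_months), t, dummies])

# Fit on the pre-fee period only, then forecast the whole series.
beta, *_ = np.linalg.lstsq(X[:fee_start], attendance[:fee_start], rcond=None)
forecast = X @ beta

gap = forecast[fee_start:] - attendance[fee_start:]
print(f"Average monthly shortfall after the fee: {gap.mean():.1f} (hypothetical units)")

An ARIMA model of the kind cited above would handle the autocorrelation in such a series more rigorously, but the underlying comparison of forecasted with actual values is the same.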

What actually happened when the entrance fee was implemented in July 1987? Did attendance drop and then
bounce back, or did attendance drop and stay down? The answer is given by our estimate of the incremental
impact of this policy. Figure 3C.2 displays both the predicted and actual attendance for the years from the fee
increase (July 1987) to the end of the time series in 1998.

Figure 3C.2 Forecasted and Actual Museum Attendance From July 1987 to 1998

What we can see in Figure 3C.2 is how large the drop in actual attendance was and how attendance did not
recover. In fact, this outcome changed the whole business-planning model for the museum. The longer-term drop
in attendance and the associated shortfall in revenues (keep in mind that the attendance was expected to recover)
resulted in the museum moving away from its efforts to attract a broad cross section of the population and,
instead, toward attracting audiences who were more interested in high-profile travelling exhibits that could be
displayed for several months and for which higher fees could be charged.

Visually, Figure 3C.2 offers us an estimate of the incremental impact of the museum fee policy. Because we are
working with a single time series research design, we still need to check for rival hypotheses that might explain this
large and sustained drop. Since monthly attendance was measured in the same way before and after the fee was
implemented (patrons triggering a turnstile as they took the escalator up into the museum), it is unlikely that
instrumentation was a threat to internal validity. What about history variables? Did anything happen to the flows
of people who were coming and going that might have affected museum attendance? Figure 3C.3 displays ferry
traffic at the main ferry terminal that connects the Victoria area with the Vancouver area of the province—because
Victoria, British Columbia, is on an island, ferry traffic is an important measure of people coming and going.

Figure 3C.3 Number of Ferry Passengers Counted at Swartz Bay Terminal on Vancouver Island: January
1984 Through March 1989

We can see that counts of ferry passengers are regular and cyclical in the time series. They are similar to the yearly
cycle of museum attendance—lower in the winter months and higher in the summer. There is no marked drop-off
that coincides with the implementation of the admission fee. The increase in passengers in 1986 coincided with a
major international exposition in Vancouver that attracted tourists from all over the world, to both Vancouver and
Victoria. Overall, we can conclude with confidence that the introduction of an admission fee caused a change in
the pattern of museum attendance. The gap between forecasted and actual attendance is our estimate of the
incremental impact of this policy.

189
190
References
Alkin, M. C. (Ed.). (2012). Evaluation roots: A wider perspective of theorists’ views and influences. Thousand Oaks,
CA: Sage.

Anderson, L. M., Fielding, J. E., Fullilove, M. T., Scrimshaw, S. C., & Carande-Kulis, V. G. (2003). Methods for
conducting systematic reviews of the evidence of effectiveness and economic efficiency of interventions to
promote healthy social environments. American Journal of Preventive Medicine, 24 (3 Suppl.), 25–31.

Ariel, B., Farrar, W. A., & Sutherland, A. (2015). The effect of police body-worn cameras on use of force and
citizens’ complaints against the police: A randomized controlled trial. Journal of Quantitative Criminology,
31(3), 509–535.

Ariel, B., Sutherland, A., Henstock, D., Young, J., Drover, P., Sykes, J., & Henderson, R. (2016a). Wearing body
cameras increases assaults against officers and does not reduce police use of force: Results from a global multi-
site experiment. European Journal of Criminology, 13(6), 744–755.

Ariel, B., Sutherland, A., Henstock, D., Young, J., Drover, P., Sykes, J., & Henderson, R. (2016b). Report:
Increases in police use of force in the presence of body-worn cameras are driven by officer discretion: A
protocol-based subgroup analysis of ten randomized experiments. Journal of Experimental Criminology, 12(3),
453–463.

Ariel, B., Sutherland, A., Henstock, D., Young, J., Drover, P., Sykes, J., & Henderson, R. (2017a). Paradoxical
effects of self-awareness of being observed: Testing the effect of police body-worn cameras on assaults and
aggression against officers. Journal of Experimental Criminology, 1–29.

Ariel, B., Sutherland, A., Henstock, D., Young, J., Drover, P., Sykes, J., & Henderson, R. (2017b). “Contagious
accountability”: A global multisite randomized controlled trial on the effect of police body-worn cameras on
citizens’ complaints against the police. Criminal Justice and Behavior, 44(2), 293–316.

Ariel, B., Sutherland, A., Henstock, D., Young, J., & Sosinski, G. (2018). The deterrence spectrum: Explaining
why police body-worn cameras ‘work’ or ‘backfire’ in aggressive police–public encounters. Policing: A Journal of
Policy and Practice, 12(1), 6–26.

Barahona, C. (2010). Randomised control trials for the impact evaluation of development initiatives: A statistician’s
point of view (ILAC Working Paper No. 13). Rome, Italy: Institutional Learning and Change Initiative.

Berk, R. A., & Rossi, P. H. (1999). Thinking about program evaluation (2nd ed.). Thousand Oaks, CA: Sage.

Berrueta-Clement, J. R., Schweinhart, L. J., Barnett, W. S., Epstein, A. S., & Weikart, D. P. (1984). Changed
lives: The effects of the Perry Preschool Program on youths through age 19. Ypsilanti, MI: High/Scope Press.

Bickman, L., Andrade, A., & Lambert, W. (2002). Dose response in child and adolescent mental health services. Mental Health Services Research, 4(2), 57–70.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review,
111(4), 1061–1071.

Bowman, D., Mallett, S., & Cooney-O’Donoghue, D. (2017). Basic income: Trade-offs and bottom lines.
Melbourne, Australia: Brotherhood of St. Laurence.

Box, G. E., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series analysis: Forecasting and control
(5th ed.). Hoboken, NJ: John Wiley & Sons.

Caliendo, M., & Kopeinig, S. (2008). Some practical guidance for the implementation of propensity score
matching. Journal of Economic Surveys, 22(1), 31–72.

Campbell Collaboration. (2018). Our Vision, Mission and Key Principles. Retrieved from
https://www.campbellcollaboration.org/about-campbell/vision-mission-and-principle.html

Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago, IL:
Rand McNally.

Campbell, F. A., Pungello, E. P., Miller-Johnson, S., Burchinal, M., & Ramey, C. T. (2001). The development of
cognitive and academic abilities: Growth curves from an early childhood educational experiment. Developmental
Psychology, 37(2), 231–242.

Campbell, F. A., & Ramey, C. T. (1994). Effects of early intervention on intellectual and academic achievement:
A follow-up study of children from low-income families. Child Development, 65(2), 684–698.

Christie, C. A., & Fleischer, D. N. (2010). Insight into evaluation practice: A content analysis of designs and
methods used in evaluation studies published in North American evaluation-focused journals. American Journal
of Evaluation, 31(3), 326–346.

Cochrane Collaboration. (2018). About us. Retrieved from www.cochrane.org/about-us. Also: Cochrane handbook
for systematic reviews of interventions. Retrieved from http://training.cochrane.org/handbook

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Chicago,
IL: Rand McNally.

Cook, T. D., Scriven, M., Coryn, C. L., & Evergreen, S. D. (2010). Contemporary thinking about causation in
evaluation: A dialogue with Tom Cook and Michael Scriven. American Journal of Evaluation, 31(1), 105–117.

Cook, T. J., & Scioli, F. P. J. (1972). A research strategy for analyzing the impacts of public policy. Administrative
Science Quarterly, 17(3), 328–339.

192
Cordray, D. (1986). Quasi-experimental analysis: A mixture of methods and judgment. New Directions in
Evaluation, 31, 9–27.

Coryn, C. L., Schröter, D. C., Noakes, L. A., & Westine, C. D. (2011). A systematic review of theory-driven
evaluation practice from 1990 to 2009. American Journal of Evaluation, 32(2), 199–226.

Creswell, J. W., & Plano Clark, V. L. (2011). Designing and conducting mixed methods research. Thousand Oaks,
CA: Sage.

Cronbach, L. J. (1982). Designing evaluations of educational and social programs (1st ed.). San Francisco, CA:
Jossey-Bass.

Cubitt, T. I., Lesic, R., Myers, G. L., & Corry, R. (2017). Body-worn video: A systematic review of literature.
Australian & New Zealand Journal of Criminology, 50(3), 379–396.

Datta, L.-E. (1983). A tale of two studies: The Westinghouse-Ohio evaluation of Project Head Start and the
consortium for longitudinal studies report. Studies in Educational Evaluation, 8(3), 271–280.

Derman-Sparks, L. (2016). What I learned from the Ypsilanti Perry Preschool Project: A teacher’s reflections.
Journal of Pedagogy, 7(1), 93–106.

Donaldson, S. I., & Christie, C. (2005). The 2004 Claremont debate: Lipsey vs. Scriven—Determining causality
in program evaluation and applied research: Should experimental evidence be the gold standard? Journal of
Multidisciplinary Evaluation, 2(3), 60–77.

Donaldson, S. I., Christie, C. A., & Mark, M. M. (Eds.). (2014). Credible and actionable evidence: The foundation
for rigorous and influential evaluations. Thousand Oaks, CA: Sage.

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, Scotland: Oliver & Boyd.

Forget, E. L. (2017). Do we still need a basic income guarantee in Canada? Thunder Bay, ON: Northern Policy
Institute.

French, R., & Oreopoulos, P. (2017). Applying behavioural economics to public policy in Canada. Canadian
Journal of Economics/Revue canadienne d’économique, 50(3), 599–635.

Funnell, S., & Rogers, P. (2011). Purposeful program theory: Effective use of theories of change and logic models. San
Francisco, CA: Jossey-Bass.

Gaub, J. E., Choate, D. E., Todak, N., Katz, C. M., & White, M. D. (2016). Officer perceptions of body-worn
cameras before and after deployment: A study of three departments. Police Quarterly, 19(3), 275–302.

193
Gil-Garcia, J. R., Helbig, N., & Ojo, A. (2014). Being smart: Emerging technologies and innovation in the public
sector. Government Information Quarterly, 31, I1–I8.

Gliksman, L., McKenzie, D., Single, E., Douglas, R., Brunet, S., & Moffatt, K. (1993). The role of alcohol
providers in prevention: An evaluation of a server intervention program. Addiction, 88(9), 1195–1203.

Guba, E. G., & Lincoln, Y. S. (1989). Fourth generation evaluation. Newbury Park, CA: Sage.

Gueron, J. M. (2017). The politics and practice of social experiments: Seeds of a revolution. In A. V. Banerjee & E. Duflo (Eds.), Handbook of economic field experiments (Vol. 1, pp. 27–69). North-Holland.

Hatry, H. P. (1999). Performance measurement: Getting results. Washington, DC: Urban Institute Press.

Heckman, J. J. (2000). Policies to foster human capital. Research in Economics, 54(1), 3–56.

Heckman, J. J. (2007). The productivity argument for investing in young children. Applied Economic Perspectives
and Policy, 29(3), 446–493.

Heckman, J. J., Ichimura, H., Smith, J., & Todd, P. (1996). Sources of selection bias in evaluating social
programs: An interpretation of conventional measures and evidence on the effectiveness of matching as a
program evaluation method. Proceedings of the National Academy of Sciences of the United States of America,
93(23), 13416–13420.

Heckman, J. J., & Masterov, D. V. (2004). The productivity argument for investing in young children (Working
Paper No. 5, Invest in Kids Working Group Committee for Economic Development). Chicago, IL: University
of Chicago.

Heckman, J. J., Moon, S., Pinto, R., Savelyev, P., & Yavitz, A. (2010). A reanalysis of the High/Scope Perry
Preschool Program. Chicago, IL: University of Chicago.

Hedberg, E. C., Katz, C. M., & Choate, D. E. (2017). Body-worn cameras and citizen interactions with police
officers: Estimating plausible effects given varying compliance levels. Justice Quarterly, 34(4), 627–651.

Heinich, R. (1970). Technology and the management of instruction. Washington, DC: Department of Audio-Visual Instruction, Association for Educational Communications and Technology.

Henry, G. T., & Mark, M. M. (2003). Toward an agenda for research on evaluation. New Directions for
Evaluation, 97, 69–80.

Heshusius, L., & Smith, J. K. (1986). Closing down the conversation: The end of the quantitative-qualitative
debate among educational enquirers. Educational Researcher, 15(1), 4–12.

194
Hum, D. P. J., Laub, M. E., Metcalf, C. E., & Sabourin, D. (1983). Sample design and assignment model of the
Manitoba Basic Annual Income Experiment. University of Manitoba. Institute for Social and Economic
Research.

Jennings, E., & Hall, J. (2012). Evidence-based practice and the use of information in state agency decision
making. Journal of Public Administration Research and Theory, 22(2), 245–266.

Jennings, W. G., Fridell, L. A., Lynch, M., Jetelina, K. K., & Gonzalez, J. M. (2017). A quasi-experimental
evaluation of the effects of police body-worn cameras (BWCs) on response-to-resistance in a large metropolitan
police department. Deviant Behavior, 38(11), 1332–1339.

Johnson, R. B., & Christensen, L. B. (2017). Educational research: Quantitative, qualitative, and mixed approaches
(6th ed.). Los Angeles, CA: Sage.

Kahneman, D. (2011). Thinking, fast and slow. New York, NY: Macmillan.

Kangas, O., Simanainen, M., & Honkanen, P. (2017). Basic Income in the Finnish Context. Intereconomics,
52(2), 87–91.

Kelling, G. L. (1974a). The Kansas City preventive patrol experiment: A summary report. Washington, DC: Police
Foundation.

Kelling, G. L. (1974b). The Kansas City preventive patrol experiment: A technical report. Washington, DC: Police
Foundation.

Knowlton, L. W., & Phillips, C. C. (2009). The logic model guidebook. Thousand Oaks, CA: Sage.

Larson, R. C. (1982). Critiquing critiques: Another word on the Kansas City Preventive Patrol Experiment.
Evaluation Review, 6(2), 285–293.

Lum, C., Koper, C., Merola, L., Scherer, A., & Reioux, A. (2015). Existing and ongoing body worn camera research: Knowledge gaps and opportunities. Fairfax, VA: George Mason University, Center for Evidence-Based Crime Policy.

Lyall, K. C., & Rossi, P. H. (1976). Reforming public welfare: A critique of the negative income tax experiment. New York, NY: Russell Sage Foundation.

Maskaly, J., Donner, C., Jennings, W. G., Ariel, B., & Sutherland, A. (2017). The effects of body-worn cameras
(BWCs) on police and citizen outcomes: A state-of-the-art review. Policing: An International Journal of Police
Strategies & Management, 40(4), 672–688.

McDavid, J. C. (2006). Estimating the incremental impacts of programs and policies: The case of the Royal British Columbia Museum entrance fee. Presentation based partially on data from unpublished report by Donna Hawkins. (1989). The implementation of user fees at the Royal British Columbia Museum: A preliminary impact analysis. Unpublished manuscript, University of Victoria, Victoria, British Columbia, Canada.

McDavid, J. C., & Huse, I. (2012). Legislator uses of public performance reports: Findings from a five-year study.
American Journal of Evaluation, 33(1), 7–25.

Newcomer, K. E. (1997). Using performance measurement to improve public and nonprofit programs. In K. E.
Newcomer (Ed.), New directions for evaluation (Vol. 75, pp. 5–14). San Francisco, CA: Jossey-Bass.

OECD. (2017). Behavioural insights and public policy: Lessons from around the world. Paris, France: OECD.

Olds, D., Hill, P., Robinson, J., Song, N., & Little, C. (2000). Update on home visiting for pregnant women and
parents of young children. Current Problems in Pediatrics, 30(4), 109–141.

Patton, M. Q. (2008). Utilization-focused evaluation (4th ed.). Thousand Oaks, CA: Sage.

Pechman, J. A., & Timpane, P. M. (Eds.). (1975). Work incentives and income guarantees: The New Jersey negative
income tax experiment. Washington, DC: Brookings Institution Press.

Peck, L. R., Kim, Y., & Lucio, J. (2012). An empirical examination of validity in evaluation. American Journal of
Evaluation, 33(3), 350–365.

Petrosino, A. (2000). Answering the why question in evaluation: The causal-model approach. Canadian Journal of
Program Evaluation, 12(1), 1–25.

Poister, T. H., McDavid, J. C., & Magoun, A. H. (1979). Applied program evaluation in local government.
Lexington, MA: Lexington Books.

Puma, M., Bell, S., Cook, R., & Heid, C. (2010). Head Start impact study: Final report. Washington, DC:
Administration for Children and Families, U.S. Department of Health and Human Services.

Reichardt, C. S. (2011). Criticisms of and an alternative to the Shadish, Cook, and Campbell validity typology. In
H. T. Chen, S. I. Donaldson, & M. M. Mark (Eds.), Advancing validity in outcome evaluation: Theory and
practice. New Directions for Evaluation, 130, 43–53.

Roethlisberger, F. J., Dickson, W. J., & Wright, H. A. (1939). Management and the worker: An account of a
research program conducted by the Western Electric Company, Hawthorne works, Chicago. Cambridge, MA:
Harvard University Press.

Rosenthal, R., & Jacobson, L. (1992). Pygmalion in the classroom: Teacher expectation and pupils’ intellectual
development (Newly expanded ed.). New York, NY: Irvington.

196
Sandhu, A. (2017). ‘I’m glad that was on camera’: A case study of police officers’ perceptions of cameras. Policing
and Society, 1–13.

Schweinhart, L. J. (2013). Long-term follow-up of a preschool experiment. Journal of Experimental Criminology, 9(4), 389–409.

Schweinhart, L., Barnes, H. V., & Weikart, D. (1993). Significant benefits: The High-Scope Perry Preschool Study
through age 27 [Monograph]. Ypsilanti, MI: High/Scope Press.

Schweinhart, L., Montie, J., Xiang, Z., Barnett, W. S., Belfield, C. R., & Nores, M. (2005). The High/Scope Perry
Preschool Study through age 40: Summary, conclusions, and frequently asked questions. Ypsilanti, MI: High/Scope
Press.

Scriven, M. (2008). A summative evaluation of RCT methodology & an alternative approach to causal research.
Journal of Multidisciplinary Evaluation, 5(9), 11–24.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for
generalized causal inference. Boston, MA: Houghton Mifflin.

Simpson, W., Mason, G., & Godwin, R. (2017). The Manitoba Basic Annual Income Experiment: Lessons
learned 40 years later. Canadian Public Policy, 43(1), 85–104.

Stevens, H., & Simpson, W. (2017). Toward a National Universal Guaranteed Basic Income. Canadian Public
Policy, 43(2), 120–139.

Stufflebeam, D. L., & Shinkfield, A. (2007). Evaluation theory, models, and applications. San Francisco, CA:
Jossey-Bass.

Sutherland, A., Ariel, B., Farrar, W., & De Anda, R. (2017). Post-experimental follow-ups—Fade-out versus
persistence effects: The Rialto police body-worn camera experiment four years on. Journal of Criminal Justice,
53, 110–116.

Thaler, R., & Sunstein, C. (2008). Nudge: Improving decisions about health, wealth, and happiness. New Haven, CT: Yale University Press.

Trochim, W. M., Donnelly, J., & Arora, K. (2016). Research methods: The essential knowledge base (2nd ed.). Boston,
MA: Cengage.

Watson, K. F. (1986). Programs, experiments and other evaluations: An interview with Donald Campbell. The
Canadian Journal of Program Evaluation, 1(1), 83–86.

Watts, H. W., & Rees, A. (Eds.). (1974). Final report of the New Jersey Graduated Work Incentives Experiment.
Madison: Institute for Research on Poverty, University of Wisconsin–Madison.

197
Webb, E. J. (1966). Unobtrusive measures: Nonreactive research in the social sciences. Chicago, IL: Rand McNally.

Weisburd, D. (2003). Ethical practice and evaluation of interventions in crime and justice. Evaluation Review,
27(3), 336–354.

White, H. (2010). A contribution to current debates in impact evaluation. Evaluation, 16(2), 153–164.

White, M. D. (2014). Police officer body-worn cameras: Assessing the evidence. Washington, DC: Office of Justice
Programs, US Department of Justice.

Widerquist, K. (2005). A failure to communicate: What (if anything) can we learn from the negative income tax
experiments? The Journal of Socio-Economics, 34(1), 49–81.

Widerquist, K., Noguera, J. A., Vanderborght, Y., & De Wispelaere, J. (Eds.). (2013). Basic income: An anthology
of contemporary research. Chichester, West Sussex, UK: Wiley-Blackwell.

198
4 Measurement for Program Evaluation and Performance
Monitoring

Introduction 162
Introducing Reliability and Validity of Measures 164
Understanding the Reliability of Measures 167
Understanding Measurement Validity 169
Types of Measurement Validity 170
Ways to Assess Measurement Validity 171
Validity Types That Relate a Single Measure to a Corresponding Construct 172
Validity Types That Relate Multiple Measures to One Construct 172
Validity Types That Relate Multiple Measures to Multiple Constructs 173
Units of Analysis and Levels of Measurement 175
Nominal Level of Measurement 176
Ordinal Level of Measurement 177
Interval and Ratio Levels of Measurement 177
Sources of Data in Program Evaluations and Performance Measurement Systems 179
Existing Sources of Data 179
Sources of Data Collected by the Program Evaluator 182
Surveys as an Evaluator-Initiated Data Source in Evaluations 182
Working With Likert Statements in Surveys 185
Designing and Conducting Surveys 187
Structuring Survey Instruments: Design Considerations 189
Using Surveys to Estimate the Incremental Effects of Programs 192
Addressing Challenges of Personal Recall 192
Retrospective Pre-tests: Where Measurement Intersects With Research Design 194
Survey Designs Are Not Research Designs 196
Validity of Measures and the Validity of Causes and Effects 197
Summary 199
Discussion Questions 201
References 202

199
Introduction
In this chapter, we introduce the conceptual and practical aspects of measurement as they apply to program
evaluation and performance measurement. We first define and illustrate measurement reliability and validity with
examples, then turn to a discussion of the four types of measurement reliability. Measurement validity is more
detailed, and in this chapter, we offer a conceptual definition and then describe eight different ways of assessing
the validity of measures. After that, we describe levels of measurement and units of analysis. These concepts are
important in understanding the connections between how we collect evaluation data and how we analyze it.

Because evaluations usually involve gathering and analyzing multiple lines of evidence, we then discuss sources of
evaluation data that are typically available and then focus on surveying as an important way for evaluators to
collect their own data. Designing sound surveys is important to valid and reliable measurement of constructs, so
we outline ways that surveys can be designed and also mention the advantages and disadvantages of in-person,
telephone, mail-in, and online surveys. Finally, we look at several more specialized measurement-related topics: the
uses of retrospective pre-tests in evaluations, the differences between survey designs and research designs in
evaluations, and the difference between measurement validity and the validity of causes and effects in program
evaluations.

The perspective we are taking in Chapter 4 is generally consistent with how measurement is introduced and
discussed in the social sciences. We have relied on sub-fields of psychology when we describe key features of
measurement validity.

Program evaluation and performance measurement are both intended to contribute to evidence-based decision
making in the performance management cycle. In Chapter 2, we discussed logic models as visual representations
of programs or organizations, and we learned that describing and categorizing program structures and specifying
intended cause-and-effect linkages are the main purposes for constructing logic models. Logic models identify
constructs that are a part of the program theory in those models. In program evaluations and in performance
measurement systems, we need to decide which constructs will be measured—that is, which constructs will be
translated into variables to be measured by procedures for collecting data. Deciding which constructs to measure is
driven by the evaluation questions that are included in a program evaluation or a performance measurement
system.

Gathering evidence in a program evaluation or for performance measures entails developing procedures that can
be used to collect information that is convincingly related to the issues and questions that are a part of a decision
process. The measurement procedures that are developed for a particular evaluation project or for a performance
measurement system need to meet the substantive expectations of that situation and also need to meet the
methodological requirements of developing and implementing credible and defensible measures.

Measurement can be thought of in two complementary ways. First, it is about finding/collecting relevant data,
often in circumstances where both time and resources are constrained. Second, measurement is about a set of
methodological procedures that are intended to translate constructs into observables, producing valid and reliable
data. Understanding the criteria for gathering valid and reliable data will be the backbone of this chapter. Finding
relevant data can be thought of as a first step in measurement—once possible sources of data have been identified,
measurement methodologies help us to sort out which sources are (relatively) defensible from a methodological
perspective.

This chapter will focus on measuring outputs and outcomes, as well as measuring environmental factors that can
affect the program processes and offer rival hypotheses to explain the observed program results. This approach
provides us with a framework for developing our understanding of measurement in evaluations. The measurement
methods that are discussed in this chapter can also be applied to needs assessments (Chapter 6) and will serve us
well when we consider performance measurement systems in Chapters 8, 9, and 10.

As you read this chapter, keep in mind what Clarke and Dawson (1999) have to say about measurement in evaluations:

The evaluation enterprise is characterized by plurality and diversity, as witnessed by the broad range of
data-gathering devices which evaluators have at their disposal . . .  It is rare to find an evaluation study
based on only one method of data collection. Normally a range of techniques form the core of an overall
research strategy, thus ensuring that the information acquired has . . .  depth and detail. (pp. 65–67)

Figure 4.1 links this chapter to the logic modeling approach introduced in Chapter 2. The program, including its
outputs, is depicted as an open system, interacting with its environment. Program outputs are intended to cause
outcomes. Environmental factors, which we introduced in Chapter 2, can affect the program and, at the same time,
affect outcomes. In fact, environmental factors can affect the external validity of the program by mediating
between outputs and outcomes (Shadish, Cook, & Campbell, 2002). Our goal is to be able to measure the
outputs and outcomes in a program logic model and also to measure environmental factors that constitute
plausible rival hypotheses or mediating factors, in order to explain observed program outcomes.

Figure 4.1 Measurement in Program Evaluation and Performance Measurement

Some constructs in the logic model will be more important to measure than others. This will be based on the
evaluation questions that motivate a particular program evaluation, as well as the research designs/comparisons
that are being used for evaluations that focus, in part, on whether the program was effective. If the evaluation
focuses on program effectiveness, we will want to measure the outcomes that are central to the intended objectives.
If our interest is whether the program is technically efficient—that is, what the relationships are between costs and
outputs—we would measure outputs and also make sure that we have a robust way of estimating costs for the
program or even for individual components.

Typically, program managers want to know how a program is tracking in terms of its outputs—often outputs are
more controllable, and output measures are used in performance management. At the same time, performance
measurement systems that are intended to contribute to accountability expectations for a program or for an
organization are often aimed at measuring (and reporting) outcomes. Recall that in Chapter 2, we pointed out
that in logic models, constructs that have a relatively large number of incoming and outgoing links are a priori
candidates for being treated as key performance measures.

In Chapter 10, we will discuss possible trade-offs between performance measurement for (internal) performance improvement and (external) accountability. In a word, the higher the stakes in measuring and reporting
performance, the greater the chances that stakeholders will try to actively manipulate the system, potentially
affecting the validity and the reliability of the measures themselves (Gill, 2011).

202
Introducing Reliability and Validity of Measures
If we focus on measuring program outputs and outcomes, the process begins with building and validating the
program logic model. Table 4.1 presents the research design for a quasi-experimental evaluation of photo radar
cameras in Vancouver, Canada (Pedersen & McDavid, 1994). Photo radar is a technology that has been used
widely by governments as a way to reduce speeding and reduce the accidents and injuries associated with excessive
speed (Chen, Wilson, Meckle, & Cooper, 2000). The pilot study we describe (conducted in 1990) was a precursor
to the BC government implementing photo radar province-wide in 1996. The program, as implemented in that
setting, consisted of three components: (1) radar camera enforcement, (2) media publicity, and (3) signage along
the street where the cameras were being tested. The main objective was to reduce average vehicle speeds on the
street where the program was implemented.

The radar camera program was implemented on Knight Street (southbound) for a period of 8 weeks (October to
November 1990). A section of Granville Street (southbound) was used as a “control” street, and average vehicle
speeds were measured (southbound and northbound) on both Knight and Granville Streets for 1 week prior to the
intervention, throughout the intervention, and for 10 days after the program ended.

Table 4.1 Research Designs for Vancouver Radar Camera Evaluation

                    Before the Program    During the Program    After the Program
Knight Street       OOOOOOOO              XOXOXOXOXO            OOOOOOOO
Granville Street    OOOOOOOO              OOOOOOOOOOOO          OOOOOOOO

A central part of the program logic model is the key intended outcome: reduced vehicle speeds. Measuring vehicle
speeds was a key part of the program evaluation and was one of the dependent variables in the comparative time-
series research design illustrated in Table 4.1. Table 4.2 is a logic model of the radar camera program.

Table 4.2 Program Logic of the Vancouver Radar Camera Intervention

It categorizes the main activities of the program and classifies and summarizes the intended causal linkages among the outputs and outcomes. Each of the outputs and the intended outcomes is represented with words or phrases.
The phrase “reduced vehicle speeds” tells us in words what we expect the program to accomplish, but it does not
tell us how we will measure vehicle speeds.

“Reduced vehicle speeds” is a construct in the logic model, as are the other outputs and outcomes. Recall,
constructs are words or phrases that convey the meanings we have assigned to the constituents of the logic model.
If we think of logic models as visual summaries of the intended theory of the program, then the links in the model
are “if … then” statements. In other words, they are hypotheses that we may want to test in our program
evaluation. Typically, a hypothesis includes at least two constructs—one that references the cause and one that
focuses on the effect. For example, a key hypothesis in the radar camera program logic is “If we implement the
radar camera program, we will lower vehicle speeds.” In this hypothesis, there are two constructs: the radar camera
program and vehicle speeds.

Most of us have a reasonable idea of what it means to “reduce vehicle speeds.” But when you think about it, there
are a number of different ways we could measure that construct. Measurement, fundamentally, is about translating
constructs into observables. In other words, measurement is about operationalizing constructs: translating them
into a set of operations/physical procedures that we will use to record (in our example) the speeds of vehicles over
time so that we can tell whether they have been reduced. It is worth remembering that a particular operational
definition does not exhaust the possible ways we could have measured a construct. Often, we select one measurement
procedure because of resource constraints or the availability of measurement options, but being able to develop
several measures of a construct is generally beneficial, since it makes triangulation of measurement results possible
(Webb, 1966).

Some measurement procedures for a given construct are easier to do than others. Measurement procedures vary in
terms of their costs, the number of steps involved, and their defensibility (their validity and reliability). We will say
more about the latter issues shortly.

Figure 4.2 offers a summary of key terminology used to describe measurement processes. Constructs are where we
begin: When we build a program logic model we need to explain clearly what we mean. It is important to keep the
language describing constructs clear and simple—this will yield dividends when we are constructing measures.

204
Figure 4.2 Measuring Constructs in Evaluations

In the Troubled Families Program in Britain (Department for Communities and Local Government, 2016), an
important outcome was to “turn troubled families around.” Translating that into a measure involved making some
broad assumptions that became controversial—turning families around for the government was substantially
about reducing the costs of public services those families required. This “bottom line thinking” resulted in some
interest groups criticizing the validity of the measure.

Sometimes, we build logic models and end up with constructs that sound different but may mean the same thing
on closer reflection. Suppose we have a training program that is intended to improve job readiness in a client
population of unemployed youths. One construct might be “improved job search attitude” and another
“heightened self-esteem.” It may not be practical to develop measurement procedures for each of these, since it is
quite likely that a client who exhibits one would exhibit the other, and measures of one would be hard to
differentiate from measures of the other.

205
Understanding the Reliability of Measures
Constructs are translated into variables via measurement procedures. Depending on the measurement procedures,
the variables can be more or less reliable. Reliability generally has to do with whether a measurement result is
repeatable (Goodwin, 2002), such that we get the same (or a very similar) reading with our measurement
instrument if we repeat the measurement procedure in a given situation. Reliability also relates to achieving the
same or similar results if two or more people are doing the measuring. If we are measuring the speed of a vehicle
on Knight Street, a reliable measurement procedure would mean that we could measure and re-measure that
vehicle’s speed (at that moment) and get the same reading. We would say that getting the same speed on two
different measurements would be a consistent result.
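
If it were possible to measure the same vehicles twice under identical conditions, test-retest reliability would usually be summarized as the correlation between the two sets of readings. The brief Python sketch below illustrates this with simulated speeds; the amount of instrument error and the sample size are arbitrary assumptions.

# Illustrative sketch with simulated readings: test-retest reliability is
# commonly summarized as the correlation between two sets of readings taken
# on the same units under the same conditions.
import numpy as np

rng = np.random.default_rng(5)
true_speed = rng.normal(55, 8, 100)                 # hypothetical true speeds (km/h)

reading_1 = true_speed + rng.normal(0, 1.5, 100)    # first measurement, with error
reading_2 = true_speed + rng.normal(0, 1.5, 100)    # repeat measurement, with error

test_retest_r = np.corrcoef(reading_1, reading_2)[0, 1]
print(f"Test-retest reliability estimate: {test_retest_r:.2f}")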

Beyond this test-retest consistency, there are several other ways that we can assess reliability. In Chapter 5, we will discuss ways that narratives
(e.g., texts from interviews with stakeholders) can be coded in evaluations. If we conduct a survey, we may choose
to include questions where the respondents can offer their own views in open-ended responses. When we analyze
these open-ended responses, one approach is to create categories that are intended to capture the meanings of
responses and allow us to group responses into themes. Developing a coding scheme for open-ended questions
involves considerable judgment. Checking to see whether the categories are successful in distinguishing among
responses can be done by asking two or more persons to independently categorize the open-ended responses, using
the coding categories (Armstrong, Gosling, Weinman, & Marteau, 1997). The extent to which their decisions are
similar can be estimated by calculating an intercoder reliability coefficient (Hayes & Krippendorff, 2007; Holsti,
1969).
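
As a simple illustration, the Python sketch below computes percent agreement and Cohen's kappa for two hypothetical coders who have each assigned eight open-ended responses to themes; the responses and theme labels are invented. Kappa is one common intercoder coefficient; Krippendorff's alpha, discussed by Hayes and Krippendorff (2007), is a more general statistic that is not implemented here.

# Minimal sketch with hypothetical codes: two coders assign each open-ended
# response to one of a few themes; we compute simple percent agreement and
# Cohen's kappa (one common intercoder reliability coefficient).
from collections import Counter

coder_a = ["cost", "access", "access", "quality", "cost", "other", "quality", "access"]
coder_b = ["cost", "access", "quality", "quality", "cost", "access", "quality", "access"]

n = len(coder_a)
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Chance agreement from each coder's marginal category proportions.
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(coder_a) | set(coder_b))

kappa = (observed - expected) / (1 - expected)
print(f"Percent agreement: {observed:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")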

The third and fourth types of reliability are more technical and are applicable where evaluators are developing
their own measuring instruments (a set of survey items or a battery of questions) to measure some construct. For
example, if a survey is being developed as part of an evaluation of a housing rehabilitation program, it may be
desirable to develop Likert statement items that ask people to rate different features of their neighborhood. This
type of reliability would focus on evaluators developing two sets of Likert statements, both of which are intended
to measure resident perceptions of the neighborhood. These two parallel forms of the statements are then tested in
a pilot survey and are examined to see whether the results are consistent across the two versions of the perceptual
measures. This is sometimes called a split-half reliability test (Goodwin, 2002).

The fourth way of assessing reliability is often used where a set of survey items are all intended to measure the
same construct. For example, if a survey instrument focused on the perceived quality of police services in a
community, respondents might be asked to rate different features of their police services and their own sense of
safety and security. To determine whether a set of survey questions was a reliable measure of the construct “quality
of police services,” we could calculate a measure called Cronbach’s alpha (Carmines & Zeller, 1979). This statistic
is based on two things: (1) the extent to which the survey items correlate with each other and (2) the number of
items being assessed for their collective reliability. Cronbach’s alpha can vary between 0 (no reliability) and 1
(perfect reliability). Typically, we want reliability values of .80 or better, using this indicator.
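
Cronbach's alpha can be computed directly from item-level data using the standard formula based on the item variances and the variance of the total score. The sketch below is a minimal illustration with hypothetical ratings for four survey items; the data and item count are invented for the example.

# A minimal sketch (hypothetical data): estimating Cronbach's alpha for a set of
# survey items intended to measure one construct (e.g., perceived quality of
# police services). Rows are respondents, columns are items rated 1-5.
import numpy as np

items = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
    [3, 2, 3, 3],
])

k = items.shape[1]                                # number of items
item_variances = items.var(axis=0, ddof=1).sum()  # sum of the item variances
total_variance = items.sum(axis=1).var(ddof=1)    # variance of the total scores

alpha = (k / (k - 1)) * (1 - item_variances / total_variance)
print(f"Cronbach's alpha: {alpha:.2f}")           # values of .80 or better are usually wanted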

Reliability is not always easy to assess. In most program evaluations, the things we want to measure would not “sit
still” while we re-measured a given attribute—we typically get one opportunity to measure, and then, we move on.

Sometimes, we may be able to use a measuring procedure (instrument) that we already know is reliable—that is,
its ability to accurately reproduce a given measurement result in a given situation is already known. In the radar
camera intervention, the experimenters used devices called inductive loops to measure the speed of each vehicle.
An inductive loop is buried in the pavement, and when a vehicle passes over it, the speed of the vehicle is
measured. The inductive loop detects the metal in a passing vehicle, and the changes in the electric current passing
through the loop serve as a counting device as well as a sensor of vehicle speed. Because inductive loops are widely used
by engineers to measure both traffic volumes and speeds, they are generally viewed as a reliable way to measure
vehicle speed.

Very few measurement procedures are completely reliable. The usual situation is that a measuring instrument will,
if used repeatedly in a given situation, produce a range of results—some being “higher” than the true value and
some “lower.” The degree to which these results are scattered around the true value indicates how reliable the
measure is. If the scatter is tightly packed around the correct value, the measure is more reliable than if there is a
wide scatter. When we use statistics (e.g., correlations to describe the covariation between two variables), those
methods generally assume that the variables have been measured reliably. In particular, if we have two variables,
one of which is hypothesized to be the causal variable, we generally assume that the causal variable is measured
without error (Pedhazur, 1997). Departures from that assumption can affect our calculations of the covariation
between the two variables, generally underestimating the covariation. One final point should be made about
reliability. When we talk about the pattern of measurement results around the true value of a measure, we can
describe this as scatter. When the range of results is scattered so that about the same number of values are above as
are below the true value, we can say that the measurement error is random. In other words, the probability of a
given measurement result being higher or lower than the true value is about equal. This would be considered
random measurement error, and the results would still be considered reliable, if there is not too broad an area of
scatter. If the scattered results tend to be systematically higher (or lower) than the true value, then we say the
measure is biased. Bias is a validity problem.
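
A small simulation can make these points concrete. In the hypothetical sketch below (the variables and error magnitudes are invented), a causal variable is measured with little random error, with a lot of random error, and with a systematic offset; the observed correlation with the outcome shrinks as the random error grows, while the systematic offset leaves the correlation intact but misstates the level of the variable, which is a validity rather than a reliability problem.

# A minimal simulation of the attenuation point above.
import numpy as np

rng = np.random.default_rng(42)
n = 5000

true_x = rng.normal(size=n)                            # "true" causal variable
y = 0.8 * true_x + rng.normal(scale=0.6, size=n)       # outcome related to the true variable

x_reliable = true_x + rng.normal(scale=0.1, size=n)    # small random measurement error
x_unreliable = true_x + rng.normal(scale=1.0, size=n)  # large random measurement error
x_biased = true_x + 0.5                                # systematic error (bias)

for label, x in [("reliable", x_reliable),
                 ("unreliable", x_unreliable),
                 ("biased", x_biased)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"{label:>10}: observed correlation with outcome = {r:.2f}")
# The unreliable measure shows a clearly weaker correlation with the outcome;
# the biased measure keeps the correlation but misrepresents the variable's level.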

Understanding Measurement Validity
To illustrate the concept of validity in measurement, suppose, in our radar camera intervention, that vehicle
speeds were measured by a roadside radar device instead of in-ground inductive loops. Suppose further that in
setting the device up, the operator had not correctly calibrated it, so that it systematically underestimated vehicle
speeds. Even if the same speed value could be repeated reliably, we would say that that measure of vehicle speed
was invalid.

More generally, validity in measurement has to do with whether we are measuring what we intend to measure: Is
a given measure a “good” representation of a particular construct? Are the measurement procedures consistent
with the meaning of the construct? Are the measurement procedures biased or unbiased? In the radar camera
example from before, the measures of vehicle speeds are biased downwards, if they are systematically lower than
the actual speed.

Figure 4.3 is an illustration of the fundamental difference between validity and reliability in measurement (New
Jersey Department of Health, 2017). The figure uses a “target” metaphor to show how we can visualize reliability
and validity; the bull’s-eye in each of the two panels is a “perfect” measure—that is, one that accurately measures
what it is supposed to measure and is therefore both valid and reliable. Each dot represents a separate result from
using the same measurement process again and again in a given situation. In the first panel, we see a measurement
situation where the results cluster together and are therefore reliable, but they are off target, so they are biased and
hence not valid. The second panel is what we want, ideally. The measurement results are tightly clustered—that is,
they are reliable, and they are on target; they are valid.

Figure 4.3 The Basic Difference Between Reliability and Validity

Source: New Jersey Department of Health. (2017). Reliability and validity. New Jersey state health
assessment data. Retrieved from https://www26.state.nj.us/doh-shad/home/ReliabilityValidity.html.

Generally, we must have reliability to have validity. In other words, reliability is a necessary condition for validity.

Validity has an important judgmental component to it: Does a certain measurement procedure make sense, given
our knowledge of the construct and our experience with measures for other constructs? Suppose, for example, we
are evaluating a community crime prevention program. The key objective might be to prevent crimes from
happening in the community. But directly measuring the numbers and types of crimes prevented is difficult.
Direct measures might require us to develop ways of observing the community to determine how program outputs
(e.g., neighborhood watch signs on the streets and in the windows of houses) actually deter prospective criminals.

Usually, we do not do this—we do not have the resources. Instead, we rely on other measures that are available
and pay attention to their validity. Instead of measuring crimes prevented, we might use police records of the
numbers and types of crimes reported. The validity of such a measure assumes a systematic linkage between crimes
prevented and crimes reported; in fact, there is considerable evidence that reported crime levels are strongly
correlated with crime levels revealed through victimization surveys (Decker, 1977), although more recent studies
have indicated that the magnitude of the correlations varies depending on factors such as rural versus urban location and
poverty level (Berg & Lauritsen, 2016). In most evaluations, we would not have independent evidence that such
linkages exist; instead, we use our judgment and our knowledge of the situation to assess the validity of the
measure. Such judgments are important in our assessment of how valid measures are. We turn to a fuller
description of types of measurement validity in the next section.

Types of Measurement Validity
In Chapter 3, we introduced and discussed four different kinds of validity as they apply to determining and then
generalizing cause-and-effect relationships in research designs for program evaluations. What is critical to keep in
mind is that the four validities we learned about include construct validity. In this chapter, we are further
unpacking construct validity and looking, in particular, at the measurement validity part of construct validity. In
other words, measurement validity is not the same as the validity of research designs.

While there are some inconsistencies in publications that bring together measurement validity and internal
validity, we consider construct validity and measurement validity as related; as Trochim (2006) has suggested,
measurement validity is actually a part of what is meant by construct validity. This view was adopted in the 1999
revision of the Standards for Educational and Psychological Testing (Goodwin, 2002). Shadish, Cook, and
Campbell (2002) also take this view and see construct validity as being broader than measurement validity.
Different forms of measurement validity can be thought of as ways of getting at the question of the “fit” between
variables and corresponding constructs. Trochim (2006) suggests that the broader scope of construct validity can
be viewed as criteria that link two levels of discourse in evaluations. One level is about the theory of the program,
usually expressed in the logic model. At that level, we use language, models, and other verbal or visual ways to
communicate what the program (and its environment) is intended to be about. Constructs have theoretical
meaning based on how they are situated in the program theory that underpins a logic model. The other level is
about measures, variables, and observables. This level is focused on the empirical translation of constructs. It is the
level at which we are collecting data, assessing it for patterns that relate to evaluation questions, and drawing
evidence-based conclusions. Once we have interpreted our empirical results, we can then generalize back to the
theory that is embodied in our logic model. In effect, we are looking for correspondence between the empirical
and theoretical meanings of constructs.

For Trochim (2006) and for others (Cook & Campbell, 1979; Shadish et al., 2002), tying the two levels of
discourse together means that we are doing two things: (1) We are linking constructs to corresponding measures,
and (2) we are linking the observed patterns between and among variables to the predicted/intended patterns
among constructs in the program theory/logic model. Taken together, how well we succeed at these two sets of
tasks determines the construct validity of our logic modeling and measurement process.

Measurement links individual constructs to the level of observables—the level at which all evaluations and
performance measurements are conducted. The conventional types of measurement validity offer us ways of
assessing how well we have succeeded in this process of tying the two levels together. Some types of measurement
validity pertain to a single construct–variable pair and others pertain to expected connections between and among
construct–variable pairs.

Ways to Assess Measurement Validity
Because measurement validity is really a part of construct validity, the types of measurement validity that we
introduce here do not exhaust the types of construct validity. We can think of measurement validity as ways to
improve construct validity, understanding that construct validity includes additional issues, as we have indicated in
Chapter 3 (Shadish et al., 2002).

In this section, we introduce and discuss three clusters of measurement validities. Within each of these, we outline
several specific types of validity that can be thought of as subcategories. The first validity cluster focuses on the
relationship between a single measure and its corresponding construct. Within it, we will discuss face validity,
content validity, and response process validity. The second cluster focuses on the relationships between multiple
variables that are intended to measure one construct (internal structure validity). The third cluster focuses on
relationships between one variable–construct pair and other such pairs. Within it, we will discuss concurrent
validity, predictive validity, convergent validity, and discriminant validity. Table 4.3 shows how we can
categorize the different kinds of measurement validity. The eight kinds of measurement validity are defined briefly
in the table so that you can see what each is about.

Table 4.3 Types of Measurement Validity

Validity Types That Relate a Single Measure to a Corresponding Construct


Face Validity. This type of measurement validity is perhaps the most commonly applied one in program
evaluations and performance measurement situations. Basically, the evaluator or other stakeholders make a
judgment about whether the measure has validity on the face of it with respect to the construct in question. As an
example, suppose that the program logic for a Meals on Wheels program includes the intended outcome, “client
satisfaction with the service.” Using a survey-based question that asks clients of the program if they are satisfied
with the service they receive is, on the face of it, a valid measure of the logic model construct “client satisfaction
with the service.”

Content Validity. This type of measurement validity also involves judgment, but here, we are relying on experts
(persons familiar with the theoretical meaning of the construct) to offer their assessments of a measure (Goodwin,
2002). The issue is how well a particular measure of a given construct matches the full theoretically relevant range
of content of the construct. Suppose we think of the construct “head start program.” Given all that has been
written and the wide range of programs that have been implemented that call themselves “head start programs,”
we might have a good idea of what a typical program is supposed to include—its components and implementation
activities. Further suppose that in a community where there is a substantial population of poorer families, a local
nonprofit organization decides to implement its own version of a head start program. The intent is to give
preschool children in those families an opportunity to experience preschool and its intended benefits. The “fit”
between the general construct “head start program” and its design and implementation in this community would
be a measure of the content validity of the local construct, “head start program.”

Response Process Validity. This kind of validity was one of five categories created with the 1999 revisions to the
Standards for Educational and Psychological Testing (American Educational Research Association, 1999). It
focuses on the extent to which respondents to a measuring instrument that is being validated demonstrate
engagement and sincerity in the way that they have responded. If an instrument was being developed to measure
school-aged children’s attitudes toward science and technology, for example, we would want to know that the
process of administering the instrument and the ways that the children engaged with the instrument indicate that
they took it seriously. Goodwin (2002) suggests that debriefing a testing process with a focus group is a useful way
to determine whether the response process was valid.

Validity Types That Relate Multiple Measures to One Construct


Internal Structure Validity. Developing a measure can involve using a pool of items that are collectively intended to
be a measure of one construct. In developing the items, the evaluator will use face validity and content validity
methods to get an appropriate pool of potential questions. But until they are tested on one or more samples of
people who are representative of those for whom the measurement instrument was designed, it is not possible to
know whether the items behave collectively, as if they are all measuring the same construct.

As an example, an evaluator working on a project to assess the effectiveness of a leadership training program on
middle managers in a public-sector organization develops an instrument that includes a pool of Likert statements
with which respondents are expected to agree or disagree. (We will discuss Likert statements later in this chapter.)
Among the statements is a set of eight that is intended to measure employee morale. A random sample of 150
middle managers takes a pilot version of the instrument, and the evaluator analyzes the data to see if the set of
eight items cohere—that is, are treated by the respondents as if they pertain to an underlying dimension that we
could label “employee morale.” In other words, the evaluator is looking to see whether each respondent is
answering all eight items in a consistent way, indicating either higher or lower morale. Using a statistical technique
called confirmatory factor analysis (Goodwin, 1997, 2002), it is possible to see whether the eight items cluster
together and constitute one dimension in the data patterns. If one or more items do not cluster with the others,
then it can be assumed that they are not measuring the desired construct, and they can be set aside before the full
survey is conducted.
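
Confirmatory factor analysis is usually done with specialized software; as a rough stand-in for illustration only, the sketch below simulates 150 respondents and eight items and fits a one-factor exploratory model with scikit-learn to show how item loadings can be inspected for coherence. The data, loadings, and package choice are assumptions, not part of the text's example.

# A minimal sketch (simulated data): checking whether eight Likert items behave
# as if they measure one underlying dimension ("employee morale").
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(7)
n = 150
morale = rng.normal(size=n)                          # latent construct (unobserved)

# Seven items load on morale; item 8 is unrelated noise and should stand out.
loadings = np.array([0.8, 0.7, 0.75, 0.8, 0.65, 0.7, 0.6, 0.0])
items = morale[:, None] * loadings + rng.normal(scale=0.5, size=(n, 8))

fa = FactorAnalysis(n_components=1)
fa.fit(items)

for i, loading in enumerate(fa.components_[0], start=1):
    print(f"Item {i}: loading = {loading:+.2f}")
# An item whose loading is near zero (here, item 8) is probably not measuring the
# construct and could be set aside before the full survey is fielded.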

Validity Types That Relate Multiple Measures to Multiple Constructs


Concurrent Validity. Concurrent validity involves correlating a new measure of a construct with an existing, valid
measure of the same construct or, in some cases, different but related constructs. As an example, measurement of
blood serum cholesterol levels is a standard way of assessing risk of circulatory disease (atherosclerosis) for patients
and typically involves taking blood samples. An alternative way to measure cholesterol levels non-invasively was
developed in the 1980s—using an ultrasound device to measure the thickness of the carotid artery wall in the neck
of patients. To see whether this new measure was valid, two groups of patients were compared. One group was
known to have high cholesterol levels from previous blood tests. Another group was included as a control group—
these people did not have high cholesterol levels. The results indicated that in the high cholesterol group, wall
thickness of their carotid artery was significantly greater than for the control group. The results of this new
measure correlated with the existing measure. In effect, a non-invasive way to measure cholesterol levels was
demonstrated to work (Poli et al., 1988).
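
The hypothetical numbers below (not the Poli et al. data) sketch the known-groups logic of this concurrent validity check: the new non-invasive measure should clearly distinguish patients already known, from blood tests, to have high cholesterol from a control group. The measurements and the use of SciPy's t test are assumptions made for the illustration.

# A minimal sketch (hypothetical values): comparing the new measure (carotid
# artery wall thickness, in mm) between a known high-cholesterol group and a control group.
from scipy import stats

high_cholesterol = [1.10, 1.05, 1.20, 0.98, 1.15, 1.08, 1.12, 1.03]
control_group    = [0.80, 0.85, 0.78, 0.92, 0.88, 0.83, 0.81, 0.86]

t_stat, p_value = stats.ttest_ind(high_cholesterol, control_group)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A clearly higher mean wall thickness in the known high-cholesterol group is
# evidence that the new measure behaves as the established measure would predict.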

Predictive Validity. Predictive validity involves situations where a measure of one construct taken at one point in
time is used to predict how a measure of another construct will behave, at a future point in time. Two examples
will serve to illustrate predictive validity. In some graduate programs in Canadian and American universities,
applicants are expected to take a standardized test called the Graduate Record Examination (GRE). The GRE is
constructed so that higher scores are intended to indicate a higher aptitude on the skills that are tested. Research
on what factors predict success in graduate programs has generally concluded that high GRE scores predict higher
grades in graduate programs (Kuncel, Hezlett, & Ones, 2001). The GRE has good predictive validity with respect
to subsequent performance in graduate programs.

The second example offers an opportunity to highlight the broad and current interest in child development that
was suggested in Chapter 3 with the Perry Preschool Study. Walter Mischel and his colleagues (Mischel et al.,
2011) have conducted a series of studies, beginning in the 1960s at the Stanford University Bing Nursery School,
focusing on the longitudinal effects of children being able to delay gratification. Sometimes called “the
marshmallow studies,” Mischel and his colleagues constructed situations wherein preschool-aged children were
offered a choice: Consume a treat now, or wait and earn a more substantial treat later on. What these studies have
demonstrated is consistent positive correlations between the number of seconds that a child can delay gratification
and a range of psychological, behavioral, health, and economic outcomes to midlife (Mischel et al., 2011). To take
one example, for those who participated in the studies, the number of seconds of delayed gratification is positively
correlated with SAT (scholastic aptitude test) scores (Mischel et al., 2011). The initial measure of delayed
gratification demonstrates predictive validity with respect to the measure(s) of ability comprising the SAT battery
of tests. As an aside, more recently researchers have tested (on children) training techniques that are intended to
improve their capacity to delay gratification (Murray, Theakston, & Wells, 2016). The researchers speculate that if
children can be trained to delay gratification, they will have a different (better) social and economic trajectory
from those who are not trained.

Convergent Validity. This kind of measurement validity compares (correlates) one measure to another measure of a
related construct. Evidence of construct validity occurs where there are correlations among measures that are
expected (theoretically) to be related to each other. It can be illustrated by a situation where an evaluator is
assessing the effectiveness of an employment training program and, as part of her methodology, has surveyed a
sample of clients, asking them four questions that are intended to rate their overall satisfaction with the program.
Past research has shown that satisfied clients are usually the ones that tend to be more committed to participating
—that is, they attend the sessions regularly and learn the skills that are taught. In our situation, the evaluator also
measures attendance and has access to records that show how well each person did in the training modules. As part
of the analysis, the evaluator constructs an index of client satisfaction from the four questions and discovers that
persons who are more satisfied are also more likely to have attended all the sessions and are more likely to have
been rated by the instructors as having mastered the materials in the modules. The findings illustrate convergent
validity. In our example, the measure of client satisfaction is more valid because it has convergent validity with
measures of other constructs, such as participation and learning.
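
With hypothetical data, the convergent validity check described above amounts to inspecting a small correlation matrix among the satisfaction index and the measures of related constructs. The values, variable names, and use of pandas are assumptions for the sketch.

# A minimal sketch (hypothetical data): convergent validity as a pattern of
# expected positive correlations between a client-satisfaction index and
# measures of related constructs (attendance and instructor-rated mastery).
import pandas as pd

df = pd.DataFrame({
    "satisfaction_index": [3.5, 4.2, 2.1, 4.8, 3.9, 2.5, 4.5, 3.0],
    "sessions_attended":  [8,   10,  4,   12,  9,   5,   11,  7],
    "mastery_rating":     [3,   4,   2,   5,   4,   2,   5,   3],
})

print(df.corr(method="pearson").round(2))
# Correlations in the expected direction support convergent validity; they do
# not, by themselves, prove that the satisfaction index is a valid measure.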

Discriminant Validity. This type of construct validity compares (correlates) a measure with another measure of an
unrelated construct, one that should not exhibit correlational linkage with the first measure. In other words, the
two measures should not be related to each other. To illustrate how discriminant validity is estimated, we will
summarize an example mentioned in Shadish et al. (2002). Sampson, Raudenbush, and Earls (1997) conducted a
survey-based study of neighborhoods in Chicago in which a key hypothesis was that neighborhood efficacy
(collective efficacy) would be negatively related to violent crime rates: As their measure of neighborhood efficacy
increased, the violent crime rate would decrease. The neighborhood efficacy measure was constructed by
combining 10 Likert items included in a survey of 8,782 residents in 343 neighborhoods in Chicago. Theoretically,
neighborhood efficacy (the sense of trust that exists among neighbors and their greater willingness to intervene in
social disorder situations) was also related to other constructs (friendship and kinship ties, organizational
participation, and neighborhood services). The concern of the researchers was that the influences of these other
constructs (survey-based measures of other constructs) on violent crime incidence would render insignificant any
statistical relationship between neighborhood efficacy and crime rate. In other words, there would be no
discriminant validity between neighborhood efficacy and these other measures of neighborhood cohesion. To test
the discriminant validity of this new construct, the researchers used multivariate analysis that permitted them to
see whether collective efficacy was an important predictor of crime rate once these other, potentially competing
variables were statistically controlled. Their key finding was reported this way:

When we controlled for these correlated factors in a multivariate regression, along with prior homicide,
concentrated disadvantage [a measure of socioeconomic status of the neighborhoods], immigrant
concentration, and residential stability, by far the largest predictor of the violent crime rate was
collective efficacy. (Sampson et al., 1997, p. 923)

What they were able to demonstrate was that collective efficacy was a distinct and important construct in
explaining the rates of violent crimes (both reported and perceived) in Chicago neighborhoods (Shadish et al.,
2002).
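
The sketch below uses simulated data (not the Chicago study) to show the logic of that test: regress the outcome on the new measure plus a potentially competing measure and see whether the new measure's coefficient survives the statistical control. The variable names, simulated effects, and use of statsmodels are assumptions for illustration.

# A minimal sketch (simulated data) of a discriminant validity check via
# multivariate regression with a related construct statistically controlled.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 300

friendship_ties = rng.normal(size=n)
collective_efficacy = 0.5 * friendship_ties + rng.normal(size=n)
crime_rate = -0.7 * collective_efficacy + 0.1 * friendship_ties + rng.normal(size=n)

df = pd.DataFrame({
    "crime_rate": crime_rate,
    "collective_efficacy": collective_efficacy,
    "friendship_ties": friendship_ties,
})

model = smf.ols("crime_rate ~ collective_efficacy + friendship_ties", data=df).fit()
print(model.params.round(2))
print(model.pvalues.round(4))
# If the coefficient on collective_efficacy remains substantial and significant
# with the related construct controlled, the measure shows discriminant validity.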

Units of Analysis and Levels of Measurement
In our discussion thus far, we have relied on a program logic approach to illustrate the process of identifying
constructs that can be measured in an evaluation. Typically, we think of these constructs as characteristics of
people or, more generally, the cases or units of analysis across which our measurement procedures reach. For
example, when we are measuring client sociodemographic characteristics as environmental variables that could
affect client success with the program, we can think of clients as the units of analysis.

In a typical program evaluation, there will often be more than one type of unit of analysis. For example, in an
evaluation of a youth entrepreneurship program, clients may be surveyed for their assessments of the program,
service providers may be interviewed to get their perspective on the program operations and services to clients, and
business persons who hired clients might be interviewed by telephone. This evaluation would have three different
types of units of analysis.

Sometimes, in program evaluations, the key constructs are expressed in relation to time. In our example of the
radar camera program earlier in this chapter, vehicle speeds were measured as vehicles passed above the inductive
loops buried in the roadways. Speeds were averaged up to a daily figure for both Knight Street and Granville
Street. The unit of analysis in this evaluation is time, expressed as days. Units of analysis in our evaluations have
attributes that we want to measure because they are related to the constructs in our logic model. If one of our
units of analysis is “clients of a program,” we might want to measure their contact with the program providers as
one attribute—the number of sessions or hours of service (outputs) they received. This kind of measure, which is
often used in evaluations where clients are expected to be changed by their exposure to the program activities, is
sometimes called a dose-related measure (Domitrovich & Greenberg, 2000). When we measure constructs, we are
actually measuring relevant attributes of our units of analysis. Keep in mind that units of analysis usually have a lot
of attributes. Think of all the possible ways of measuring human attributes (physical, psychological, social). But in
evaluations, we are only interested in a small subset of possible attributes—the ones that are relevant to the
program at hand and relevant to the stakeholders who are involved in the design and implementation of that
program. Typically, a program logic reflects a theory of change—program logic models may or may not explicitly
represent that theory visually, but any time a program is designed and implemented, the theory of change is about
how that program is intended to operate in a given context with the clients at hand to produce the intended
outcomes.

Figure 4.2, shown earlier in this chapter, indicates that variables that have been defined through measurement
procedures can be classified according to their levels of measurement. The procedures that are used to collect data
will depend on the level of measurement involved. Fundamentally, all measurement involves classification—the
ability to distinguish between units of analysis on the attribute of interest to us. As we shall see, the three principal
levels of measurement (nominal, ordinal, and interval/ratio) are cumulative. Briefly, a nominal measure is the
most basic; ordinal is next and incorporates all the features of nominal measurement and adds one key feature; and
interval/ratio is the most sophisticated level of measurement—incorporating all the characteristics of both nominal
and ordinal measures and adding another key feature. Think of these three levels of measurement as steps on a
stairway—nominal is the first step, then ordinal, then interval/ratio.
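
In analysis software, the three levels of measurement often map onto different data types. The hypothetical sketch below stores a nominal variable and an ordered (ordinal) variable as pandas categoricals and keeps a ratio-level variable numeric; only the latter supports means and other parametric summaries. The client records and category labels are invented for the example.

# A minimal sketch (hypothetical client records) of the three levels of measurement.
import pandas as pd

clients = pd.DataFrame({
    "has_work_experience": ["yes", "no", "yes", "yes"],                # nominal
    "experience_level": ["none", "some", "a great deal", "some"],      # ordinal
    "fte_months_experience": [0.0, 6.5, 30.0, 12.0],                   # interval/ratio
})

clients["has_work_experience"] = pd.Categorical(clients["has_work_experience"])
clients["experience_level"] = pd.Categorical(
    clients["experience_level"],
    categories=["none", "some", "a great deal"],
    ordered=True,                     # ordinal: categories have a less-to-more order
)

print(clients.dtypes)
print(clients["experience_level"].min(), "to", clients["experience_level"].max())
print(clients["fte_months_experience"].mean())  # means only make sense at the interval/ratio level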

Each level of measurement produces data with properties that correspond to that level of
measurement. There are statistical methods appropriate for each level of measurement that we can use to analyze
the data. Like the levels of measurement themselves, statistics for nominal variables are the least sophisticated
(involve making the fewest assumptions about the characteristics of the data), ordinal statistics are more
sophisticated, and interval/ratio statistics are the most sophisticated. When we use statistical methods for interval
variables, we have to be reasonably sure that the assumptions for that level of measurement are met; otherwise, the
results we get will be biased. In an appendix to Chapter 3, we summarize some basic descriptive and inferential
statistical tools that are used to describe and generalize the findings for quantitative lines of evidence in
evaluations.

Nominal Level of Measurement
Classification is the most basic measurement procedure—we call it the nominal level of measurement. Basically,
each category in a nominal level of measurement has a “name” but does not have a specified “order.” Suppose that
one of the relevant environmental factors in a program evaluation was the previous work experience of program
clients. In a job training program, this might be an important alternative factor that explains client success, other
than their participation in the program. We could measure previous work experience as a nominal variable: The
person did or did not have work experience (a yes/no variable). Nominal variables are widely used in evaluations
because they entail the least demanding measurement procedures—basically, the evaluator needs to be able to
classify situations so that for each person/case/unit of analysis, the case will fall into one (but only one) category.

Nominal variables can have more than two categories. Suppose that an evaluator has interviewed a sample of
program clients and has simply recorded their responses to several general questions about their experiences with
the program. To see what kinds of patterns there are in these responses, the evaluator may want to develop a set of
categories that are based on the themes in the actual responses themselves but can be used to classify the responses
into groups of similar ones. The details of such a procedure are described in Chapter 5, but the evaluator is
basically creating, from the clients’ open-ended responses, a nominal variable, which can be used in analyzing the
information.

Nominal variables have two basic features: They permit the evaluator to classify every observation/response into
one—and only one—category, and all the observations/responses must fit into the existing categories. In our
example of the evaluator coding client responses, the challenge is to come up with categories/themes that do a
good job of grouping all the client responses but do not leave the evaluator with a large percentage in a category
that has to be labeled “miscellaneous” or “other.”

Ordinal Level of Measurement
With an ordinal level of measurement, the categories created have not only a label but also a less-to-more
order. In the example of a job training program, suppose we decided to measure previous work experience on a
“less-to-more” basis. Program clients might be categorized as having “no previous work experience,” “some
previous work experience,” and “a great deal of work experience.” We could design the measurement procedures
so that “some” and “a great deal” equated to ranges of months/years, but we might also want to have rules to take
into account full- or part-time work. The end result would be a variable that categorizes clients and ranks them in
terms of previous work experience. We might have to make judgment calls for some borderline cases, but that
would have been true for the previous “yes/no” version of this variable as well. In creating a variable that measures
previous work experience on a less-to-more basis, we have constructed an ordinal level of measurement. Note that
in our ordinal variable, we have also included the features of nominal variables: Each case (on the relevant variable)
must fit one and only one category.

Interval and Ratio Levels of Measurement
Interval-level measures are ones that have three characteristics: (1) Cases must fit into one and only one category
(same as nominal and ordinal measures), (2) all the cases can be ranked in terms of the degree of the attribute that
is being measured (same as for ordinal measures), and (3) there is a unit-based measure such that for each case, the
amount of the attribute can be measured. Ratio measures are the same as interval measures, with one exception:
Ratio-level measures have a unit of measurement with a natural zero point—that is, values of the attribute cannot
go below zero.

What we will do in this section of the chapter is look at interval and ratio levels of measurement together. From a
statistical analysis perspective, there are very few differences between the two kinds of measures in terms of the
kinds of statistical tools that are appropriate. The statistical methods that we use for interval-/ratio-level data are
often called parametric statistics—using these statistical tools with data requires making assumptions about how
the data (values of a variable of interest) are distributed for a sample and in the corresponding population.

In our example of measuring previous work experience, we could use a measurement procedure that involved
querying clients in some detail about their previous work experience: amounts, full-time, and part-time (how
many days or hours per week). Then, we could convert the information obtained from clients into a measure that
counts the number of full-time equivalent months of previous work experience. The conversion process would
necessitate rules for translating part-time into full-time equivalents and deciding on how many hours per week
constitutes full-time work.
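
The conversion rules might look something like the sketch below. The 35-hours-per-week cutoff for full-time work is a hypothetical choice for the illustration, not a rule from the text.

# A minimal sketch of converting reported work history into full-time-equivalent months.
FULL_TIME_HOURS = 35.0   # hypothetical definition of a full-time week

def fte_months(months_worked, hours_per_week):
    """Convert one spell of work into full-time-equivalent months."""
    ratio = min(hours_per_week / FULL_TIME_HOURS, 1.0)  # part-time pro-rated, capped at full-time
    return months_worked * ratio

# A client with 12 months at 20 hours/week plus 6 months of full-time work:
total = fte_months(12, 20) + fte_months(6, 40)
print(f"{total:.1f} FTE months")   # about 12.9 under these rules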

The number of full-time equivalent months of a person’s work experience is a ratio level of measurement because it
has a natural zero point. Although statistical methods used by evaluators do not generally distinguish between
interval and ratio levels of measurement, it is useful for us to show the essential differences. In our example of the
number of months of previous work experience, clients can have no previous work experience or some number of
months greater than zero. Because “zero” is a real or natural minimum for that measurement scale, it is possible
for us to compare the amounts of work experience across clients. We could say, for instance, that if one client
reported the equivalent of 6 months of work experience and another client reported 12 months, the ratio of work
experience for the two would be 1 to 2. In other words, the more experienced client has twice as much work
experience. Any time we can construct meaningful comparisons that give us ratios (twice as much, half as much,
and so on), we are using a ratio level of measurement.

Notice what happens if we try to apply our ratios method to an interval variable. Recall our discussion of the New
Jersey Negative Income Tax Experiment in Chapter 3 (Pechman & Timpane, 1975). The experimenters
conceptualized family income relative to some poverty-related benchmark. The poverty benchmark then became
“0” income for the experiment. If a family had more income in a given year than that benchmark, they would not
receive any “negative income benefits.” But, if a family’s income fell below the benchmark value, they would be
entitled to a benefit that increased the lower their income fell below the poverty level. If we were comparing two
families by constructing a ratio of their incomes using the poverty-level benchmark as our 0 point, we would run
into a problem. Suppose that one family earns $6,000 more than the benchmark and the other one earns $6,000
less than the benchmark. Since there is no natural 0 value in this experiment for income, we cannot construct a
ratio of their incomes. We cannot say that one, for instance, earns twice as much as the other. We can, however,
add and subtract their incomes (we can do this for any interval measure), and that is required to use the most
sophisticated statistical tools.

Typically, program evaluators use a mix of measures in an evaluation; some evaluations lend themselves to
“counting” types of measures (interval and ratio), others do not. There is a philosophical issue embedded in how
we measure in program evaluations. Some proponents of qualitative evaluation methods argue that words (e.g.,
narratives, detailed descriptions, discourse) are fundamentally more valid as ways of rendering the subjectivities of
experiences, viewpoints, and assessments of programs. We will discuss this issue in Chapter 5.

Proponents of quantitative evaluation methods tend to rely on numbers—hence, interval-/ratio-level measures of
constructs. We will discuss the issue of objectivity in Chapter 11—whether it is possible and how evaluators might
conduct their work to claim that they are being objective. Replicability is a hallmark of scientific investigation and
a key part of claims that evaluations can be objective. Advocates for objectivity point out that measurement
procedures that yield numbers can be structured so that results are repeatable; that is, another evaluator could
conduct the same measurement processes and ascertain whether the patterns of results are the same. In Chapter
11, we will include an example of where evaluators of body-worn camera programs in police departments have
replicated their work (with similar results) across a set of American cities.

Interval-/ratio-level variables also lend themselves to varied statistical manipulations, which can be very useful as
evaluators try to determine the incremental effects of programs. If you can conduct a multivariate statistical
analysis that includes both program measures and environmental variables as predictors of some outcome, it may
be possible to assess the effect of the program on the outcome variable, controlling for the environmental variables
in the analysis.
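
A minimal sketch of that kind of analysis, using simulated data: an ordinary least squares model with a program participation indicator and an environmental control, where the program coefficient approximates the incremental effect. The variable names, simulated effect sizes, and use of statsmodels are assumptions made for the illustration.

# A minimal sketch (simulated data): estimating the incremental effect of a program
# on an interval-level outcome while controlling for an environmental variable.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 400

prior_experience = rng.normal(12, 4, size=n)     # months of prior work (environmental variable)
in_program = rng.integers(0, 2, size=n)          # 1 = participated, 0 = comparison group
earnings = 2000 + 150 * in_program + 40 * prior_experience + rng.normal(0, 200, size=n)

df = pd.DataFrame({"earnings": earnings,
                   "in_program": in_program,
                   "prior_experience": prior_experience})

fit = smf.ols("earnings ~ in_program + prior_experience", data=df).fit()
print(fit.params.round(1))   # the in_program coefficient approximates the incremental effect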

Sources of Data in Program Evaluations and Performance Measurement Systems
Once a program structure has been described using logic modeling methods, the constructs identified in the program’s
processes and outcomes become candidates for measurement. Typically, the evaluation questions drive the
comparisons that would be relevant, and the research designs that are selected focus those comparisons on
particular variables. Most evaluations have more than one research design because there will be several variables
that will be important to address the evaluation questions. As we saw in Chapter 3, we rarely measure and test all
the linkages in the logic model, although examples of such evaluations exist: the Perry Preschool Study being one
(Schweinhart et al., 2005).

There are always limits to the amounts and kinds of data that can be gathered for a program evaluation. Program
evaluators may find, for example, that in evaluating a community-based small business support program, some
baseline measures, if they had been collected in the community before the program was implemented, would have
assisted in estimating the program’s actual outcomes. But if the data are not available, that eliminates one program
evaluation strategy (before–after comparisons or a single time series) for assessing the program’s incremental
effects.

Existing Sources of Data
Existing sources of data, principally from agency records, governmental databases, research databases, client
records, and the like, are used a great deal in program evaluations and in constructing performance measures.
Typically, when we are doing evaluations and other similar work, we use multiple lines of evidence, and
administrative data sources are a key contribution to many evaluations. It is important to keep in mind, whenever
these sources are being relied on, that the operational procedures used to collect the information may not be
known to the evaluator, and even if they are known, these data may have been collected with constructs in mind
that were important when the measurement procedures were designed and implemented but are not well
documented currently. Thus, when using existing data sources, the evaluator is always in the position of essentially
grafting someone else’s intentions onto the evaluation design.

Furthermore, existing data sources can be more or less complete, and the data themselves more or less reliable.
Suppose, for example, that the responsibility for recording client data in a family health center falls on two clerk-
receptionists. Their days are likely punctuated by the necessity of working on many different tasks. Entering client
data (recorded from intake interviews conducted by one or more of the nurses who see clients) would be one such
task, and they may not have the time to check out possible inconsistencies or interpretation problems on the forms
they are given by the nurses. The result might be a client database that appears to be complete and reliable but, on
closer inspection, has only limited utility in an evaluation of the program or the construction of performance
measures to monitor the program. Typically, when consultants are engaged to do an evaluation, they count on
administrative data being available and budget accordingly. When existing records are not complete or otherwise
not easily obtained, it can undermine the work plan for the evaluation.

Big Data Analytics in Program Evaluation and Performance Measurement: An Emerging Trend

The ubiquity of the Internet, social media platforms, wireless communications, satellite-based observation, statistical databases, and a
movement in some countries to open up government databases to public users (open data) have all contributed to potential opportunities
to integrate large secondary data sources into analytical work, including program evaluations and performance measurement/monitoring
systems.

Big Data has been described this way:

Big Data is a loose description for the general idea of integrating large data sets from multiple sources with the aim of delivering
some new, useful insight from those data. Some writers focus on the systems necessary to efficiently store, manage, and query
the data (Marz & Warren 2015). Other writers focus on the analytic tools needed to extract meaningful insights from these
massive data sources (Dean, 2014). (Ridgeway, 2018, p. 403)

Among the adopters of integrating these data sources are evaluators in developing countries (Bamberger, 2016). Development evaluations
typically include three levels of analysis: policy evaluations that focus on whole countries (e.g., reducing poverty); program evaluations
that focus on clusters of related activities that are aimed at a sector or a region of a country (e.g., building and repairing roads); and
projects that are focused on smaller geographic areas and specific sectors (paving an existing road that connects two cities). At the country
level, large-scale data sources that circumvent the limitations of existing government data sources are a way to estimate the macro effects of
program interventions.

An example of using unconventional data sources is described in a World Bank report that focuses on the use of cell phone–generated
data in Guatemala to estimate poverty. Using data on patterns of cell phone usage (cell phones are widely used in many developing
countries now) and algorithms to infer from patterns of usage, the study estimated the geographic distribution of consumption patterns,
mobility, and social interactions and used those to estimate regional poverty levels. The study concludes that these data sources can get
around the challenges of doing conventional surveys or censuses at a fraction of the cost (Hernandez et al., 2017).

In a recent book that examines the relationships between Big Data and evaluation, Petersson et al. (2017) summarize the potential of Big
Data this way:

When Big Data is anonymized, aggregated, and analyzed, it can reveal significant new insights and trends about human
behavior. The basic idea is that Big Data makes it possible to learn things that we could not comprehend with smaller amounts
of data, creating new insights and value in ways that change markets, organizations, relationships between citizens and
government… (p. 2)

So far, Big Data seems underutilized by evaluators (Petersson et al., 2017, p. 3). But like performance measurement in the 1990s, Big
Data is here to stay (Petersson et al., 2017, p. 11).

Existing data sources present yet another challenge: often, output measures end up being used as proxies for outcome
measures. Many public-sector and nonprofit organizations keep reasonably complete records of program outputs.
Thus, within the limits suggested previously, an agency manager or a program evaluator should be able to obtain
measures of the work done in the program. Program managers have tended to see themselves as responsible and
accountable for program outputs, so they have an incentive to keep such records for their own use and to report
program activities and outputs to senior managers, boards of directors, and other such bodies. But increasingly,
program evaluations and performance measurement systems are expected to focus on outcomes. Outcomes are
further along the causal chain than are outputs and are often more challenging to measure, given agency
resources. Also, program managers may experience some trepidation in gathering information on outcome
variables that they see as being substantially outside their control.

Given the pressure to report on outcomes, one possible “solution” is to report outputs and assume that if outputs
occur, outcomes will follow. Using measures of outputs instead of direct measures of outcomes is a process called
proxy measurement: Output measures become proxies for the outcome measures that are not available (Poister,
1978).

Because proxy measures entail an assumption that the outcomes they represent will occur, they can be
problematic. There may be independent evidence (from a previous evaluation of the program or from other
relevant evaluations conducted elsewhere) that the outputs lead to the proxied outcomes, but one must approach
such shortcuts with some caution. In Chapter 2, we introduced the idea of program complexity and the likelihood
that programs will “deliver the goods” if implemented fully. For simple programs such as highway maintenance
programs, evidence that the outputs occurred generally means that the outcomes also occurred. Most of the
programs we evaluate do not have such simple structures; most are likely to produce outputs, but that does not
give us a lot of leverage in assuming outcomes have occurred. We aspire to measure outcomes, and to examine
whether and to what extent the program caused those outcomes.

Managers who are expected to develop performance measurement systems for their programs are often in a
position where no new resources are available to measure outcomes. Key outcome constructs can be identified and
prioritized for measurement, but existing data sources may not map onto those logic model constructs
convincingly. This is a version of the “round peg in a square hole” conundrum that can characterize the fit
between social science research methodologies and their actual applications in program evaluations; plainly, the
utility of performance measurement systems will depend, in part, on whether the constructs that are measured are
tied to data in ways that are credible to stakeholders who would use the performance information.

Sources of Data Collected by the Program Evaluator
Most evaluations of programs involve collecting at least some data specifically for that purpose. There is a wide
variety of procedures for measuring constructs “from scratch,” and in this discussion, several of the main ones will
be reviewed. In Chapter 5, we will discuss interviews and focus groups as two ways of collecting qualitative data,
so we will not specifically cover them in this chapter.

Perhaps the single most important starting point for data in many program evaluations is the evaluator or
members of the evaluation team themselves. Program evaluations typically entail interacting with program
managers and other stakeholders and reviewing previous evaluations. Much of this interaction is informal;
meetings are held to review a draft logic model, for example, and each interaction creates opportunities to learn about the
program and develop an experiential “database,” which becomes a valuable resource as the evaluation progresses.
In Chapter 12, we discuss the importance of having several evaluator perspectives in a given evaluation. Team
members bring to the evaluation their own knowledge, experiences, values, and beliefs, and these lenses can be
compared and triangulated. What we are saying is that, in addition to triangulation of lines of evidence in most
program evaluations, there is value in triangulating the perspectives of evaluation team members.

Surveys as an Evaluator-Initiated Data Source in Evaluations
In addition to the evaluator’s own observations and informal measurements, program evaluations usually include
several systematic data collection efforts, and one means of gathering information is through surveys of program
clients, service providers, or other stakeholders. A survey of each group of stakeholders would constitute additional
lines of evidence in an evaluation. Surveys are also a principal means of collecting information in needs
assessments. In this chapter, we discuss survey design–related issues. It is common for surveys to be implemented
so that samples of respondents are selected to participate. We discuss different probability-based sampling
methods, including random sampling, in Chapter 6 when we describe how needs assessments are done. In
Chapter 5, we describe sampling methods that are appropriate for qualitative methods.

Fundamentally, surveys are intended to be measuring instruments that elicit information from respondents.
Typically, a survey will include measures for a number of different constructs. In some evaluations, survey-based
measures of all the key constructs in a logic model are obtained. For example, in the Perry Preschool evaluation
that we introduced in Chapter 3, surveys have been a principal way of collecting data for both the program and
control cohorts over time. Because individuals are the main unit of analysis, surveys of individuals help the
evaluators to build and test causal models of program effects, over time.

If this is feasible, it is then possible to consider using multivariate modeling techniques (structural equation
modeling) to examine the strength and significance of the linkages among the variables that correspond with the
logic model constructs (Shadish et al., 2002). Surveys generally involve some kind of interaction between the
evaluator and the respondent, although it is possible to conduct surveys in which the units of analysis are
inanimate: For example, a neighborhood housing rehabilitation program evaluation might include a visual survey
of a random sample of houses to assess how well they are being maintained. Surveys of the quality of
neighborhood streets, street lighting, and other such services have also been done (Parks, 1984), relying on
comparisons of neighborhood-level objective measures of specific services.

Surveys that focus on people are intended to measure constructs that are a key part of a program logic model, but
these are typically psychological or belief-related constructs. A program in a government ministry that is intended to
implement an electronic case management system for all clients might be evaluated, in part, by surveying the
affected employees before and after the changeover to electronic client files to see how the change has affected
their perceptions of their work and the timeliness of their responses to clients. One construct in the logic model of
such a program might be “employee morale.” If the electronic file changeover is smooth, morale should improve
or stay the same, given improved access to relevant information as cases are being processed and updated.
Perceptions are subjective and not directly observable, but by using surveys, we can indirectly measure cognitive
and affective constructs.

Figure 4.4 displays a stimulus–response model of the survey process, drawing attention to issues of survey validity
and reliability. On the upper left-hand side of the model, the survey questions we ask are the intended stimuli,
which, if all goes well, elicit (on the upper right side) the responses that become our data.

Figure 4.4 Measuring Mental Constructs

Our problem in many surveys is that other factors become unintended stimuli as the survey questions are posed.
These, in turn, produce “unintended responses” that are, from the evaluator’s perspective, mixed with the
responses to the intended questions. Distinguishing the responses to survey questions from the responses to
unintended stimuli is the essence of the reliability and validity challenges of using surveys. Possible sources of
unintended stimuli are included in Figure 4.4: characteristics of the interviewers, relevant whether
telephone, in-person, or group interviews are conducted (gender, tone of voice, phrasing of interview questions);
setting characteristics (location where the interview is conducted—e.g., asking employees about their relationships
with their fellow workers might elicit different responses if the interviews are conducted by telephone after hours
as opposed to interviews conducted at work); interviewee characteristics (e.g., elderly respondents could be hard of
hearing); instrument characteristics (e.g., beginning a survey by asking respondents to provide personal
demographic information may be seen as intrusive, resulting in more cautious responses to other questions or
respondents not responding to some questions); and the survey methods themselves. More and more surveys are
done using combinations of methods, which affect who is likely to respond, how comfortable respondents are
while navigating the technologies embedded in the survey process, and how seriously respondents take the whole
survey process.

Among the types of unintended stimuli, the easiest ones to control are those pertaining to the design of the
instrument itself. Suppose, for example, that we are conducting a mailed survey of forestry consultants who have
agreed to participate as stakeholders in the planning phase of a program to use chemicals and/or other means to
control insects and unwanted vegetation in newly reforested areas. In the survey is a series of statements with
which the respondent is expected to agree or disagree.

One such statement is as follows:

Improved pre-harvest planning, quicker reforestation, and better planting maintenance would reduce the need for chemical or mechanical
treatments.

Please circle the appropriate response.

Strongly Agree     Agree     Neither     Disagree     Strongly Disagree
      1              2          3            4                 5

The main problem with this statement is that because of the structure of the question, any response is ambiguous.
We cannot tell what the respondent is agreeing or disagreeing with since there are five distinct ideas included in
the statement. In addition to the three ideas in the first part of the statement (improved pre-harvest planning,
quicker reforestation, and better planting maintenance), there are two different treatments: (1) chemical or (2)
mechanical. Respondents could literally focus on any combination of these. In short, the question is not a valid
measure (it does not pass the face validity test) because we cannot tell which construct is being measured. For
instance, the respondent might agree that quicker reforestation would reduce the need for chemical treatments,
but not agree that pre-harvest planning would reduce the need. This problem can be remedied by making sure
that only one idea (a measure of one construct) is included in a given statement. Since these statements (called
Likert statements) are commonly used in surveys, this rule of simplicity is quite useful.

Working With Likert Statements in Surveys


Perhaps the most common way that survey-based measures address evaluation-related questions such as "Was the program effective?" is by constructing groups of survey questions that ask respondents (clients, service providers, or other stakeholders) to respond to statements, usually worded either positively or negatively, that rate features of programs or services or rate respondents' own perceptions, feelings, or attitudes as they relate to a program. Typically, these survey questions are structured so that respondents are asked to agree or disagree with each statement on a range from "strongly disagree" to "strongly agree." Each statement is worded so that only one feature of the program or of respondents' experiences or feelings is highlighted; we do not want to create conceptual ambiguities of the sort illustrated by the survey item included previously. The following is an example of a workable Likert question from
a mailed survey to residents in a neighborhood in which a homeless shelter was recently opened:

During the past six weeks, have you felt any change in your feeling of safety in your neighborhood (pick the one
that is closest to your feeling)?

1. I felt much safer in the past six weeks.
2. I felt somewhat safer.
3. There was no change in my feeling of safety.
4. I felt somewhat less safe.
5. I felt much less safe.

Individual Likert statements are ordinal variables—that is, each is a statement that includes a fixed number of
possible response categories that are arrayed from “less” to “more.” Respondents are asked to pick one and only
one response. One methodological issue in using Likert statements is whether these ordinal variables can, for
statistical purposes, be treated as if they were interval-level variables. The level of measurement that is assumed is
important because Likert statements, when treated as interval-level variables, can be added, subtracted, and
otherwise analyzed with interval-level statistical methods. This issue is one that has consumed considerable time
and research energy in the social sciences since Rensis Likert introduced Likert scales (Likert, 1932).

Carifio and Perla (2007) have reviewed recent contributions to this literature, citing Jamieson (2004), in
particular, as an example of the view that Likert “scales” cannot be treated as if they were interval-level variables.
Carifio and Perla (2007) point out that part of the confusion about Likert-type variables is the basic distinction
between individual Likert statements and sets of statements that collectively are intended to measure some
construct. What they argue is that treating each Likert statement as an interval-level variable and using parametric
statistical methods on them is generally not advisable. But clusters of statements, if they exhibit properties that
indicate that they are valid and reliable measures of constructs, can be treated as interval-level measures. As we
indicated earlier in this chapter when we were discussing internal structure validity, it is possible to empirically
analyze groups of Likert statements and determine whether clusters of statements within a group cohere in ways
that suggest we have valid measures of the intended constructs. As well, Cronbach’s alpha, which is a measure of
the reliability (internal consistency) of a cluster of Likert statements, can be calculated based on the inter-
correlations among a set of Likert statements that are intended to measure one construct (Cronbach, 1951).
Carifio and Perla (2007) summarize their argument this way:

If one is using a 5 to 7 point Likert response format [italics added], and particularly so for items that
resemble a Likert-like scale [italics added] and factorially hold together as a scale or subscale reasonably
well, then it is perfectly acceptable and correct to analyze the results at the (measurement) scale level
using parametric analyses techniques such as the F-Ratio or the Pearson correlation coefficients or its
extensions (i.e., multiple regression and so on), and the results of these analyses should and will be
interpretable as well. Claims, assertions, and arguments to the contrary are simply conceptually,
logically, theoretically and empirically inaccurate and untrue and are current measurement and research
myths and urban legends. (p. 115)
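
To make the internal consistency idea concrete, the following is a minimal sketch of how Cronbach's alpha might be computed for a small cluster of Likert items. The items, responses, and construct are invented for illustration and are not drawn from any study cited in this chapter.

```python
# A minimal sketch: Cronbach's alpha for a hypothetical cluster of Likert items.
# The responses below are invented for illustration only.
import numpy as np

# Rows are respondents; columns are three 5-point Likert items (1 = strongly disagree
# ... 5 = strongly agree) intended to measure the same construct.
responses = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [3, 3, 2],
    [4, 4, 4],
    [1, 2, 1],
])

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the total score)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item across respondents
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(f"Cronbach's alpha for the cluster: {cronbach_alpha(responses):.2f}")
```

In practice, alpha would be calculated separately for each construct-specific cluster, usually alongside checks of the cluster's internal structure.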

There are several points to keep in mind when designing Likert statements:

1. Because there are many survey instruments available from Internet sources, it is often possible to adapt a
publicly available set of Likert statements for your purposes. Most of these instruments have not been
validated beyond checks on their face and content validities, so it is important to keep in mind that when
developing Likert statements or modifying existing statements, we are usually relying on rough face validity
and perhaps content validity checks. The reliability of a cluster of Likert statements that are all intended to
measure one construct is usually determined after the data are collected, although it is possible to run a pilot
test of a survey instrument with a view to checking the reliability of construct-specific clusters. Going one
step further, it may be possible, if enough cases are included in a pilot test, to use confirmatory factor
analysis to see whether predicted clusters of statements actually cohere as separable dimensions (i.e.,
correspond to the structure of the factor loadings).
2. Likert statements should be balanced—that is, the number of negative response categories for each
statement should equal the number of positive categories. Typically, response categories to Likert statements
offer five choices that include (1) “strongly disagree,” (2) “disagree,” (3) “neutral,” (4) “agree,” and (5)
“strongly agree.” Respondents are asked to pick the choice that is closest to their opinion for that statement.
It is possible to construct Likert items that have four or six categories, taking out the “neutral” response
option in the middle, although typical Likert statements offer respondents an odd number of response
choices. The word “neutral” can also be replaced; the Likert item in the neighborhood survey for the
homeless shelter evaluation used the phrase “no change in my feeling of safety” to convey to respondents a
middle category indicating neutral feelings. It is also possible to insert a middle value like “neither agree nor
disagree” instead of “neutral,” but keeping the overall flow of the language for the verbal anchors in any
Likert statement smooth is important.
3. Other variants on Likert items are possible: 7- or even 9-point scales are sometimes used, although wordings
for each point on a scale need to be considered carefully so that verbal labels clearly indicate a continuum
from less to more (or more to less). Another variant is to verbally anchor just the end points of the scale so
that respondents can pick a number in the range of the scale: If the scale has 5 points with values of
“strongly disagree” and “strongly agree” as the endpoint anchors, the range of values can be specified from 1
to 5, and respondents can select which number corresponds with their own opinion. There is evidence that
having more categories in a Likert scale produces higher levels of reliability, and explicitly labeling each
category, instead of just the end points, also results in higher reliability (Weng, 2004). An example of a 10-point scale with lower and upper anchor points would be an item that focuses on citizens' confidence that their police will treat them fairly in encounters. The end points might be "no confidence at all," which would equal a 1, and "complete confidence," which would equal a 10. Respondents would mark
the point on the scale that is closest to their own view.
4. In constructing sets of Likert statements, a common and useful strategy is to mingle negatively and
positively worded statements so that respondents have to stop and think about whether their opinion is
positive or negative for that statement. Not doing so invites people who are in a hurry to pick one response
category, “agree,” for example, and check off that response from top to bottom. This is the problem of the
response set.
5. Given the concerns about the level of measurement for individual Likert statements, if you are going to be
using this approach to measure constructs in an evaluation, craft the survey instrument so that each
construct is measured by clusters of Likert statements (a minimum of three statements per cluster is
desirable). A common strategy is to add up the responses for the statements pertaining to one construct and
use the resulting total (or average) as your measure.
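
The following is a minimal sketch of points 4 and 5 above: reverse-coding a negatively worded item and then averaging a construct-specific cluster into a single score. The item names and responses are hypothetical.

```python
# A minimal sketch (hypothetical data) of reverse-coding and cluster scoring.
import pandas as pd

# Responses to a three-item cluster measuring one construct
# (1 = strongly disagree ... 5 = strongly agree). "q2_neg" is negatively worded.
df = pd.DataFrame({
    "q1_pos": [5, 4, 2, 3],
    "q2_neg": [1, 2, 4, 3],   # high agreement here indicates a *negative* view
    "q3_pos": [4, 4, 1, 3],
})

# Reverse-code the negatively worded item so that higher always means more positive.
df["q2_rev"] = 6 - df["q2_neg"]   # for a 5-point scale: reversed value = (max + 1) - raw value

# The construct measure is the cluster average (a summed total would work equally well).
df["construct_score"] = df[["q1_pos", "q2_rev", "q3_pos"]].mean(axis=1)
print(df[["q1_pos", "q2_rev", "q3_pos", "construct_score"]])
```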

Designing and Conducting Surveys


In designing, conducting, and coding data from surveys, evaluators can control some sources of validity and
reliability problems more easily than others. Aside from survey design considerations, there are trade-offs among
different ways of administering surveys. In-person interviews afford the most flexibility and can be used to check
whether respondents understand the questions or want to offer alternative responses, and unstructured questions
give respondents an opportunity to express themselves fully. Telephone surveys are somewhat less flexible, but
they still afford opportunities to confirm understanding of the questions. Mailed or Internet-based surveys are the
least flexible, requiring a questionnaire design that is explicit, easy to follow, and unlikely to mislead or confuse
respondents with the wording of questions. Internet-based surveys are becoming increasingly common, and they
have the advantage of being relatively low cost to administer. Like mailed surveys, online surveys are relatively inflexible, requiring that the instrument be designed so that any respondent can make sense of it page by page. Unlike mailed or e-mailed surveys, however, surveys hosted on a website allow the evaluator to control how respondents complete them, for example, by requiring that one page be completed before moving on to the next. As well, responses to Internet-hosted surveys can be transferred automatically to databases (Excel spreadsheets would be an example) that facilitate analysis, or exported to other software platforms for statistical analysis.

Increasingly, mixed approaches to surveying are being used by evaluators and other researchers. Information about
the sociodemographic makeup of the target population can be used to tailor the survey strategy. McMorris et al.
(2009) conducted an experiment that compared the effectiveness of different mixed-methods survey modes for
gathering data from young adults on sex-related behaviors and drug use. A group of 386 participants was
randomly assigned to two experimental conditions: (1) complete the survey online with an in-person interview
follow-up for those not completing the survey online or (2) complete the survey face to face with online follow-up
to increase the response rate. With a $20 incentive for survey participants, both groups achieved a 92% response
rate overall (McMorris et al., 2009). In terms of the costs of the two mixed strategies, the face-to-face first strategy
was more costly: $114 per completed interview, compared with $72 for the web-first condition. The quality and
the completeness of the data were similar, and the findings for the key variables were not significantly different
between the two groups.

In another study of response rates for mixed-mode surveys, Converse, Wolfe, Huang, and Oswald (2008)
compared conventional mailed surveys as a first contact strategy (with e-mail–web follow-up) with an e-mail–web
strategy in which prospective respondents were sent an e-mail directing them to a web-based questionnaire,
followed up by a mailed survey for those who did not respond. The participants for this study were teachers (N =
1,500) who were randomly divided into two groups and treated with either the mail-first or web-first surveys.
Dillman’s (2007) five-step “tailored design method” was used to contact all sample members:

1. A pre-notice letter was sent to each participant via conventional mail saying the survey was coming, and a $2
bill was included with the letter.
2. The survey instrument, instructions, and a self-addressed stamped return envelope were included for the
mail-first group, and an e-mail with a URL to the survey web location was sent to the web-first group.
3. A postcard reminder was sent by conventional mail (to the mail-first group) or e-mail (to the web-first
group).
4. A second reminder questionnaire or e-mail was sent.
5. A final contact was made with each group, administering the opposite treatment: the web-first group got a hard copy of the survey, and the mail-first group got the e-mail invitation to complete the survey online.

The overall response rate for the whole sample was 76%. But the mail-first group had a significantly higher
response rate (82.2%) than the web-first group (70.4%). In addition, the mail-first group had a lower non-
deliverable rate. In terms of overall cost per completed survey, the two mixed-methods approaches were quite
similar: $5.32 per response for the mail-first strategy and $4.95 for the web-first approach.
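
As a rough illustration of how such mode comparisons can be tallied, the sketch below computes response rates and cost per completed survey from invented counts and budget figures; the numbers approximate the pattern reported above but are not the study's actual data.

```python
# A rough illustration (invented counts and costs) of comparing survey strategies by
# response rate and cost per completed survey.
strategies = {
    #              invited, completed, total cost (dollars)
    "mail_first": (750, 615, 3270),
    "web_first":  (750, 530, 2620),
}

for name, (invited, completed, total_cost) in strategies.items():
    response_rate = completed / invited
    cost_per_completion = total_cost / completed
    print(f"{name}: response rate {response_rate:.1%}, "
          f"cost per completed survey ${cost_per_completion:.2f}")
```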

Notwithstanding the apparent cost advantages of web-based survey approaches, any other potential advantages
depend, in part, on who is being surveyed and for what purposes. Issues like computer access, accuracy of e-mail
addresses, and a sufficient level of comfort with online or e-mail-based surveying are all factors that affect the effectiveness of web-based methods. In some jurisdictions, privacy concerns about the servers
hosting web-based surveys are important, since the survey responses could, under some conditions, be accessed by
governmental authorities.

Structuring Survey Instruments: Design Considerations


Careful survey design takes time. It is essential that the designer(s) know what constructs are to be measured with the survey and that this information guide the development of the instrument's contents. A common
experience for program evaluators is to be developing a survey instrument with an evaluation steering committee
and realizing that each person has his or her own “pet questions” or issues. Those may not always relate to the
evaluation questions. Sometimes, the process of drafting the instrument will itself stimulate additional question
items. Evaluators need to continually ask, Why are we including that? What is that intended to measure? How will
that survey question help us to measure the constructs that, in turn, are part of the evaluation questions that
motivate the project? A useful strategy in winnowing prospective survey questions is to build a table that has rows
and columns. Conventionally, the rows are the broad evaluation questions that are being addressed in the program
evaluation (e.g., overall client satisfaction with the services) and the columns are the specific survey questions that are
intended to address broad evaluation questions (e.g., timeliness of service, appropriate assistance, pleasant service,
responsiveness to unique needs). For practicality, this can be reversed, and the survey questions can be arrayed in
the rows. In any case, for some evaluation questions, there will be no corresponding survey question(s) since other
lines of evidence are covering those, but what we are looking to eliminate are survey questions that do not connect
with any evaluation question.
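
One way to keep that winnowing explicit is to record the crosswalk in a simple table and flag any survey question that maps to no evaluation question. The sketch below uses hypothetical question labels.

```python
# A minimal sketch of an evaluation-question / survey-question crosswalk,
# using hypothetical question labels.
import pandas as pd

# Each survey question is mapped to the broad evaluation question it addresses
# (None = no connection, and therefore a candidate for removal).
crosswalk = pd.DataFrame({
    "survey_question": ["Q1 timeliness of service", "Q2 appropriate assistance",
                        "Q3 pleasant service", "Q4 committee member's pet question"],
    "evaluation_question": ["Overall client satisfaction", "Overall client satisfaction",
                            "Overall client satisfaction", None],
})

orphans = crosswalk[crosswalk["evaluation_question"].isna()]
print("Survey questions with no evaluation question (candidates to drop):")
print(orphans["survey_question"].to_list())
```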

There is one possible exception to the general guideline of making survey questions fit the evaluation question. In
program effectiveness evaluations, the evaluation questions will usually focus on intended outcomes (and perhaps
outputs). In some cases, it is also important to seek out information on unintended results of the program. For
example, in a program to offer income assistance recipients knowledge and skills to become job-ready, the main outcome of securing employment could be undermined by the need for program clients to secure child care if they go to work. This unintended result could differentially affect single parents who cannot offset the costs of child care by accessing a network of informal caregivers, such as their own parents and grandparents.

Although there is no one format that fits all survey designs, a general sequence of question types that applies in
many situations is as follows:

Begin the survey with factual, “warm-up” questions. Ask respondents to relate how they became connected
to the program, how long they have been connected, in what ways, and so on.
Ask about program-related experiences. Again, begin with factual questions. If the survey is intended for
program clients, the instrument can be structured to “walk the client through” their program process. For
example, questions might focus on when the respondent first became a client, how many visits were made to
program providers, what kind of follow-up was available to the client, and so on.
As program-related experiences are recalled, it may be appropriate to solicit respondent assessments of each
phase of the process. For example, if clients of a debtor assistance program are being surveyed, the first set of
program experience questions might focus on the initial interview between the client and the debt
counselor. Once the recalled facts of that interview have been recounted, it may be appropriate to ask for a
rating of that experience. As the client is “walked through” the program, each phase can be rated.
Overall ratings of a program should always come after ratings of specific experiences/steps in the process. If
overall ratings are solicited first, there are two risks: (1) The initial overall rating will “color” subsequent
ratings (create a halo effect), and (2) an overall rating without soliciting ratings of specific experiences first is
less likely to be based on a full recall of program experiences—in short, it will be less valid.

Demographic information should be solicited near or at the end of the survey. Demographic information (gender,
age, education, income) will be viewed by some as intrusive, and hence, there may be some reticence about
providing the information. Any effort to solicit such information should be differentiated from the rest of the
survey, and the respondent should be informed that these data are optional and that if any question is viewed as
too personal, they should not respond to it.

Instruments that have been drafted should be pre-tested before they are used in a program evaluation. Often, the
instrument design will be an amalgam of several viewpoints or contributors. Making sure that the questions are
clear and simply stated and that the instrument as a whole “works” are essential steps in increasing the validity and
reliability of the measures. Consider Question 10 in this example, which comes from an actual survey conducted
(but not pre-tested) in a U.S. suburban area. The topic of the survey was the possible amalgamation of 15 smaller
police departments into one, large department.

Question 8: Do you think that your police services would improve if your police department and all other police departments in the West
Shore area combined into one department?

_____________ Yes _____________ No _____________ Undecided

Question 9: Have you discussed this question of police consolidation with friends or neighbors?

_____________ Yes _____________ No _____________ Undecided

Question 10: Are you for or against combining your police department with police departments in surrounding municipalities?

_____________ Yes _____________ No _____________ Undecided

The problem with Question 10 was not discovered until the survey had been mailed to 1,000 randomly selected
homes in 15 communities. Fortunately for the project, many respondents detected the problem and simply circled
“for” or “against” in the question. But some did not, diminishing the value of the entire survey.

Table 4.4 summarizes some of the principal sources of validity and reliability problems in conducting and
processing survey results. Some of these problems are easier to control than others. Although controlling
instrument design can eliminate some problems, the evaluator needs to pay attention to the entire surveying
process to reduce sources of noise that interfere with interpretations of data. Training interviewers (including role-
playing interview situations where interviewers take turns interviewing and being interviewed) and pre-testing
instruments are important to the effective management of surveys, yielding substantial validity and reliability
benefits.

In Table 4.4, we have mentioned several potential validity-related problems for surveys that merit a short
explanation. Social desirability response bias can happen in surveys that focus on “undesirable” attitudes or
behaviors. For example, asking smokers how much they smoke or even whether they still smoke can underestimate
actual smoking rates, given the social desirability of saying that you are not smoking (West, Zatonski,
Przewozniak, & Jarvis, 2007). Theory of change response bias will be discussed shortly in the chapter, but for
now, think of it as a tendency on the part of participants in programs to believe that the program “must” have
made a difference for them. Thus, when we compare pre- and post-test assessments for participants (particularly
estimates of pre-program competence measured retrospectively), we can end up with a positive bias in the amount
of reported change (McPhail & Haines, 2010).

Table 4.4 Examples of Validity and Reliability Issues Applicable to Surveys

Source of the Problem | Validity: Bias | Reliability: Random Error
Interviewer (face to face, telephone) | Race, gender, appearance, interjections, reactions to responses | Inconsistency in the way questions are worded/spoken
Respondent | Age, gender, physical or psychological handicaps, suspicion, social desirability response bias, theory of change response bias | Wandering attention
Instrument | Biased questions, response set, question order, unbalanced Likert statements | Single measures to measure client perceptions of the program
Surveying situation/survey medium | Privacy, confidentiality, anonymity | Noise, interruptions
Data processing | Biased coding, biased categories (particularly for qualitative data) | Coding errors, intercoder reliability problems

Using Surveys to Estimate the Incremental Effects of Programs
In Chapter 3, we discussed the issue of program incrementality in some depth. Often, the fundamental question
in a program evaluation is what differences, if any, the program actually made. Research design is about
constructing comparisons that facilitate responding to this incrementality question. But what are our options in
evaluation situations where the evaluator(s) are expected to assess the effectiveness of a program after it has been
implemented? Suppose that program versus no-program comparisons are not feasible and that even baseline measures of key outcome-related variables are unavailable. We discuss this problem in Chapter 12 when we introduce the
roles that professional judgment plays in our practice as evaluators. But here, we want to suggest some
measurement-related approaches that can be helpful.

Addressing Challenges of Personal Recall
The first measurement challenge has to do with asking stakeholders in evaluations about their past behaviors.
Personal recall of uses of services, such as visits to health or social service agencies, waiting times before a
practitioner is available, and the kinds of services delivered, can be an important part of estimating the adequacy of
service coverage and the use patterns of different subpopulations of clients. Walking clients of a program through
their encounters with the program relies on their memories, but there are substantial challenges to being able to
count on memories as measures of constructs (Schwarz, 2007). Asking respondents to assess program-related
encounters assumes that people can recall events accurately enough to offer valid and reliable assessments of their
views of past events.

Assessing events that have occurred in the past is a common feature of program evaluation approaches; most
evaluations are retrospective. We often want to know how people rate their encounters with program providers
and, in some situations, how they rate different steps or stages (components) of the program process. As an
example, suppose we have an employment training program for single parents, and we want to know how they
rate the phases of the program process based on their experiences (intake interview, program orientation, training
sessions, post-training job counseling, employer matching, and 6-month program follow-up). We could ask them
to describe each phase and, in conjunction with that, to rate the phase from their perspective. Ideally, we would be
able to use the phase-specific ratings to see where the program was more or less well received by the clients.
Measuring the quality of public services more broadly by asking clients to rate different phases of their encounters
with services is a part of a movement in Canada and other countries that is focused on public-sector service quality
and, more specifically, what aspects of service encounters best predict overall client satisfaction (Howard, 2010).

Schwarz and Oyserman (2001) have outlined validity and reliability problems that can arise where surveys or
interviews ask respondents to recall their behaviors. They point out that in many evaluations, the time and
resources allocated for instrument design are insufficient to address substantial research questions that have been
identified for surveying. For example, when respondents are asked how many times they visited their doctor in the
past year, they may have different understandings of what a visit is (some may include telephone conversations,
others not), who their doctor is (family doctor, specialists, other health practitioners such as dentists,
chiropractors, or optometrists), how many times they visited (memory decays with time; visits that occurred before
the year began could be included), and how many times are desirable (some older respondents may deliberately
underestimate visits to avoid perceptions that they are using services too frequently or that they are not healthy).

To improve the likelihood that valid and reliable information about past behaviors will be elicited by surveying,
Schwarz and Oyserman (2001) outline for survey designers five steps that survey participants typically go through
in responding to a question, including “understanding the question; recalling relevant behavior; inference and
estimation; mapping the answer onto the response format; and, ‘editing’ the answer for reasons of social
desirability” (p. 129).

The following list of eight key points on questionnaire design is adapted from the concluding remarks of their
report:

1. Once the instrument is drafted, answer every question yourself. If you find questions difficult or confusing,
respondents will as well.
2. Respondents will use the instrument to make sense of the questions. Some features may elicit responses that
are invalid. Features of the instrument to pay attention to include the following: the response alternatives
offered, the time period for recalling behaviors, the content of related/preceding questions, the title of the
questionnaire, and the sponsor of the study.
3. Consult models of well-formed questions.
4. Pilot your questions to see how respondents interpret the wordings of questions and the anchors in rating
scales. Use the pilot to check the graphic layout of the instrument.
5. Familiarize yourself with the basics of how people recall and report events—the psychology of responding to
survey questions.
6. Give respondents enough time and be prepared to remind them that accuracy is important.
7. Consider using events calendars. These are tables that have time intervals across the top (e.g., months) and categories of possible events along the left side (e.g., visits to health care facilities); a sketch of one appears after this list.
8. Train interviewers so that they know the intended meanings of questions in the instrument. (Schwarz &
Oyserman, 2001, pp. 154–155)
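
A minimal sketch of an events calendar of the kind mentioned in point 7 above follows; the time intervals, event categories, and recalled events are all hypothetical.

```python
# A minimal sketch of an events calendar: time intervals across the top,
# categories of possible events down the side. All labels are hypothetical.
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
event_categories = ["Visit to family doctor", "Visit to walk-in clinic", "Hospital stay"]

# Start with an empty calendar; counts are filled in cell by cell during the interview.
calendar = pd.DataFrame(0, index=event_categories, columns=months)

# Example: a respondent recalls two doctor visits in February and one clinic visit in May.
calendar.loc["Visit to family doctor", "Feb"] = 2
calendar.loc["Visit to walk-in clinic", "May"] = 1
print(calendar)
```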

Sometimes, in the context of an evaluation, it is appropriate to ask the clients of a program directly to estimate the
incremental effects of the program for themselves. Although such estimates are usually subject to recall and other
kinds of biases, it is usually worthwhile taking advantage of a survey to pose questions of this sort.

In a client survey of a provincial government debtor assistance program, which had, as its objective, counseling
clients to avoid personal bankruptcy, respondents were asked a series of questions about their experiences with the
program (Rogers, 1983). These questions led up to the following question:

What would you have done if this counseling service had not been available? (Choose the most likely one):

a. Contacted your creditors and tried to work out your problems with them
b. Tried to work out the problems yourself
c. Tried to get assistance and advice from another provincial ministry
d. Gotten advice from a friend or acquaintance
e. Applied for bankruptcy
f. Other (please specify)____________________________________________

The number and percentage of clients who selected option “e” above would be one measure of the incremental
effect of the program; if the program had not been available, they would have chosen the option the program was
designed to avoid. Although this measure of incrementality will be subject to recall bias and theory of change bias,
it was part of an evaluation that used three different lines of evidence (client survey, service provider survey, and
organizational managers’ interviews) to elicit information on the extent to which the program succeeded in
steering clients away from bankruptcy.
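
A minimal sketch of how responses to a question like this might be tallied as a rough incrementality measure appears below; the response counts are invented and are not taken from the evaluation cited above.

```python
# A minimal sketch (invented counts) of tallying the "what would you have done"
# question as a rough measure of the program's incremental effect.
from collections import Counter

# Each element is one respondent's selected option (a-f), invented for illustration.
responses = ["a", "e", "b", "e", "d", "e", "c", "b", "e", "a", "e", "f"]

counts = Counter(responses)
n = len(responses)
pct_bankruptcy = counts["e"] / n * 100
print(f"{counts['e']} of {n} respondents ({pct_bankruptcy:.0f}%) said they would have "
      "applied for bankruptcy without the program.")
```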

Retrospective Pre-tests: Where Measurement Intersects With Research
Design
Bamberger, Rugh, Church, and Fort (2004) have developed an approach to evaluation they call “shoestring
evaluation.” The basic idea of their approach is to recognize that in many evaluation situations, the most
appropriate methodologies are not really feasible. Evaluators must learn to make do by using approaches that are workable within small budgets and short time frames and that do not require impractical data collection efforts.
Although they include Shadish et al.’s (2002) four kinds of validity as a part of assessing the methodological
adequacy of a particular evaluation approach, they are clear that evaluation practice often diverges from the
expectations established by evaluation methodologists.

One technique Bamberger and his colleagues (2004) discuss is using surveys and interviews to establish baselines
retrospectively. Although respondent recall is clearly an issue, there are patterns in recall bias (e.g., tending to
telescope events forward into the time frame of the recall query) that can make it possible to adjust recall of events
so that they are less biased.

Retrospective pre-tests are increasingly being accepted as an alternative to more demanding research designs. What
they are intended to do is “make up” for the fact that in some situations, no before–after comparison research
design is feasible. Instead of measuring outcome variables before the program is implemented (establishing a true
baseline), measures of the outcome-related variables are taken after the program is implemented, with a view to
asking people to estimate retrospectively what the values of those outcome variables were before they had
participated in the program. An example might be a neighborhood watch program where residents are surveyed
after the program is in place and, among other things, asked how aware they were of burglary prevention methods
before the program was implemented in their neighborhood (using a cluster of Likert statements). The evaluator
could compare these retrospective pre-test results with the respondents’ reported awareness after the program was
in their neighborhood. Differences between the “pre-test” and “post-test” results could be an indication of what
difference the program made in levels of awareness.
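
A minimal sketch of that comparison follows, using invented summed Likert-cluster awareness scores; a paired t-test is one simple way to summarize the reported change, although it does not remove the recall and theory of change biases discussed in this chapter.

```python
# A minimal sketch (invented scores) comparing retrospective pre-test ratings with
# post-program ratings of burglary-prevention awareness (summed Likert cluster scores).
import numpy as np
from scipy import stats

retro_pre = np.array([ 8,  9, 11, 10,  7, 12,  9,  8, 10, 11])  # recalled pre-program awareness
post      = np.array([12, 13, 14, 12, 10, 15, 11, 12, 13, 14])  # awareness after the program

mean_change = (post - retro_pre).mean()
t_stat, p_value = stats.ttest_rel(post, retro_pre)  # paired t-test on the same respondents
print(f"Mean reported change in awareness: {mean_change:.1f} scale points "
      f"(paired t = {t_stat:.2f}, p = {p_value:.3f})")
```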

Advocates of retrospective pre-tests have pointed to response shift bias as a problem for conventional before–after
comparisons. Response shift bias occurs when program participants use a pre-program frame of reference to
estimate their knowledge and skills before participating in the program (usually measured using a set of Likert
statements that ask respondents to rate themselves), and once they have been through the program, they have a
different frame of reference for rating the program effects on them. This shift in their frame of reference tends to
result in underestimations of the actual effects of the program compared with independent assessments of pre-test
knowledge and skills (Hoogstraten, 1985; Howard, 1980; Howard & Dailey, 1979; Howard, Dailey, & Gulanick,
1979; Mueller, 2015; Nimon, Zigarmi, & Allen, 2011; Schwartz & Sprangers, 2010). In effect, retrospective pre-
tests offer a more accurate estimation of the actual incremental effects of these programs than do conventional pre-
and post-test designs.

Some recent studies of response shift bias (e.g., Taylor, Russ-Eft, & Taylor, 2009) have concluded that
retrospective pre-tests are not a good idea because they tend to produce a theory of change response bias:
participants expect the program to have worked, and when tested retrospectively, they bias their pre-test estimates
of competence downward from independently measured actual levels.

However, other studies support the efficacy of retrospective pre-testing. In a recent study that is among the most
elaborate in examining validity problems associated with retrospective pretesting, Nimon, Zigarmi, and Allen
(2011) focused on a training program for organizational managers. Their participants (N = 163) were in 15 classes
that ran over a 4-day period. Each participant was randomly assigned to one of four conditions for purposes of
measuring pre- and post-program effects. All four groups took an “objective leadership competencies” test before
the training began; the test was designed to see whether managers could respond to a series of scenarios
appropriately. Two of the groups also took subjective pre-tests measuring their own self-ratings of competencies.
After the training was completed, all four groups completed subjective self-assessments of their competencies. As well,
after the training, all four groups took a retrospective pre-test that was aimed at getting them to estimate what
their pre-program skill/competency levels were. Specifically, post-program, two of the four groups completed the
retrospective pre-test at the same time as they did the post-test. Of those two groups, one had taken the subjective pre-
test, and one had not. The other two groups completed the retrospective pre-tests after completing the post-test.
(The retrospective pre-tests were included in a separate envelope, and participants were told to complete them 4 days later.) Again, one of those two groups had taken a subjective pre-test and one had not.

The design of the study was intended to see if response shift bias existed—that is, whether post-program
assessments of increased competence were more strongly correlated with an objective pre-test measure (i.e., the
“objective leadership competency” test) than with a subjective pre-test measure. Also, the study looked at the
possible effects of taking a subjective pre-test on post-test assessments and whether participants would tend to
inflate self-ratings of competencies (the implicit theory of change that Taylor et al., 2009, had suggested was at
work in their study).

What they found (they ended up with N = 139 cases due to incomplete data sets for some participants) was that
participants tended to overestimate their leadership skill levels before the program, compared with the objective
measure of their leadership skills taken as a pre-test. In other words, there was a significant response shift bias in
the subjective pre-tests. The (later) retrospective pre-tests tended to correlate more strongly with the objective skill
levels pre-program than they did with subjective pre-test scores.

The second finding was that for the two groups that took the retrospective pre-test 4 days after taking the post-
test, the correlations with objective pre-test scores were stronger. This suggests that if retrospective pre-tests are
being used, they will be more valid if administered separately from any other post-tests.

What are we to make of retrospective pre-tests more generally? Are they a way to gain some pre–post research
design leverage in situations where pre-testing was not done? When we look at this issue, it is clear that interest in
retrospective pre-testing is here to stay. The Nimon et al. (2011) findings are generally in line with the (growing)
bulk of the research in this area. Nimon et al. suggest that when retrospective pre-testing is done, it should be
done at a different time from any other subjective post-tests.

The recent retrospective pre-testing literature suggests that this approach has the potential to be useful in a
growing number of contexts. Mueller (2015), in reporting the results of a recent evaluation of a program to
increase Internet-mediated knowledge of disability insurance options in Germany, summarizes his review of the
literature on retrospective pre-testing methodology this way:

Recent studies have shown that RPM may be a viable option for evaluating interventions in areas such
as education (Cantrell, 2003; Coulter, 2012; Moore & Tananis, 2009; Nielsen, 2011), homeland
security (Pelfrey & Pelfrey, 2009), parenting (Hill & Betz, 2005; Pratt, McGuigan, & Katzev, 2000), or
health-related quality-of-life research (Kvam, Wisløff, & Fayers, 2010; Zhang et al., 2012). (p. 286)

Researchers like Nimon et al. (2011) are careful to point to the limitations of their studies. For evaluators, it is
prudent to continue to exercise caution in using this approach—the existing evidence suggests the efficacy of
retrospective pre-tests, but this approach has not been examined across the range of program evaluation-related
settings in which it might be applied.

Survey Designs Are Not Research Designs
Designing a survey is a demanding task. This chapter has suggested important issues, but it is not intended as a
detailed guide on this topic. Books and Internet resources that focus on surveying provide more information and
are worth consulting as needed (see, e.g., Alreck & Settle, 2004; Babbie, 2016; Dillman, 2011; Rea & Parker,
2014). Some sources are skeptical of the use of surveys for measuring constructs that focus on reported behaviors
and attitudes (Schwarz, 2007). It is worth remembering that other extant surveys can be a useful source of ideas
for constructing a survey-based measuring instrument, particularly if the surveys have been previously validated.

There is an important difference between survey designs and research designs. Surveys are a way to measure
constructs in an evaluation. They are, fundamentally, measuring instruments. As such, they are not intended to be
research designs. The latter are much broader and will typically include several complementary ways of measuring
constructs. Fundamentally, research designs focus on the comparisons that are needed to get at questions related to
whether the program was effective.

Surveys can be used to measure constructs in a wide variety of research designs. For example, in a quasi-
experimental evaluation of a program called the Kid Science Program (Ockwell, 1992), intended to improve
children’s attitudes toward science and technology, 10 classes of children aged 10 to 12 years who participated in a
1-day program at a local community college were matched with 10 classes of students who were on the waiting list
to participate. All the students in the 20 classes were surveyed before and after the 10 classes had participated in
the program to see what differences there were in their attitudes.

The survey was a key part of the overall program evaluation, but the research design was clearly independent of the survey as a measuring instrument. The research design for the student surveys is a before–after nonequivalent control group design: it measures children's attitudes before and after the program, for both the program and the control groups, with nonrandom assignment of classrooms to the two groups. Using the terminology introduced in Chapter 3, the design would look like the diagram below.

O1 X O2

O3 O4
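
A minimal sketch of how the four observation points in this design might be compared follows; the class-average attitude scores are invented, and the simple difference-in-differences calculation is offered only as an illustration, not as the analysis used in the Ockwell (1992) evaluation.

```python
# A minimal sketch (invented class-average attitude scores) of comparing the four
# observation points in the O1 X O2 / O3 O4 design above.
O1, O2 = 3.1, 3.8   # program classes: before and after the Kid Science day
O3, O4 = 3.2, 3.3   # waiting-list classes: before and after, no program

program_change = O2 - O1
comparison_change = O4 - O3
estimated_effect = program_change - comparison_change  # a simple difference-in-differences
print(f"Program-group change: {program_change:.1f}; comparison-group change: {comparison_change:.1f}")
print(f"Estimated incremental effect on attitude scores: {estimated_effect:.1f}")
```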

In the same evaluation, the teachers in the 10 participating classes were surveyed (interviewed in person) after the
visit to get at their perceptions of the effectiveness of the program. The research design for this part of the
evaluation was an implicit/case study design (XO). Surveys were used for both designs, illustrating clearly the
difference between surveys as measuring instruments and research designs as the basis for the comparisons in
evaluations.

Validity of Measures and the Validity of Causes and Effects
Validity of measures is a part of establishing the construct validity of an evaluation research design. Validity of
causes and effects focuses on the combination of statistical conclusions and internal validity for a research design.
Clearly, having valid measures of constructs is important to being able to generalize the evaluation results back to
the logic model of the program, but that is a different issue from establishing whether the program caused the
observed outcomes, at the level of the variables that have been included in the data collection and analysis.

Consider this example: Police departments routinely record the numbers of burglaries reported within their
jurisdiction. Reported burglaries—and all the steps involved in actually getting those numbers together—are often
considered to be a workable measure of the number of burglaries actually committed in the community.

But are reported burglaries a valid measure of burglaries actually committed? Evidence from criminal victimization
surveys suggests that in most communities, residents tend to report fewer burglaries than actually occur. In other
words, reported burglaries tend to underestimate all burglaries in the community.

A more valid measure might be based on periodic community surveys of householders. Carefully constructed
instruments could elicit burglary experiences in the previous 12 months, details of each experience, and whether
the police were called. That might be a more valid measure, but it is costly. As well, questions must be carefully
worded so as to avoid the issue of “telescoping” the recall of events that occurred outside of the designated time
frame into the past 12 months (Schaeffer & Presser, 2003).

The main point is that there are several possible alternative ways of measuring burglaries committed in the
community. Each one has different validity and reliability problems.

Now, suppose the police department designs and implements a program that is intended to reduce the number of
burglaries committed in the community. The program includes components for organizing neighborhood watch blocks, having one's property identified, and running a series of social media advertisements about ways of "burglarproofing" one's home. The police department wants to know if its program made a difference: Did the
program cause a reduction in burglaries committed in the community? Answering that question is different from
answering the question about alternative ways of validly measuring burglaries committed.

We need valid measures of constructs to be able to evaluate program effectiveness, but an additional and key part
of assessing effectiveness is examining whether the program was the cause (or even a cause) of the observed
outcomes. Assessing causes and effects ultimately requires valid measures, but the issue is much broader than that.
In Chapter 8, we will discuss performance measurement as a part of the whole evaluation field (and the
performance management cycle we introduced in Chapter 1), and one of the issues for using performance
measures to evaluate programs is the potential to conflate the question of how to develop valid measures with how
to tell whether the measurement results really tell us what the program has done. We need valid performance
measures to describe what is happening in a program. But we need program evaluation-like reasoning to get at
why the patterns we see in performance data occur.

Summary
Measurement is the process of translating constructs into valid and reliable procedures for collecting data (variables). The translation
process can produce nominal, ordinal, and interval/ratio levels of measurement of constructs. Assessing the validity of measures is an
essential part of determining whether they are defensible for a program evaluation or a performance measurement system. There may be a
tendency to rely on data that already exist from organizational records (particularly in constructing performance measures), but the
validity of these measures can be challenging to assess. It is uncommon for existing data sources to be clearly connected with constructs;
the match is at best approximate.

Measurement validity is not the same thing as construct validity. Construct validity is broader, as was discussed in Chapter 3. We can
think of measurement validity as a subset of the issues that compose construct validity.

Reliability of measures is about whether measures are consistent (repeated applications of a measuring instrument in a given context). We have included four kinds of reliability: test–retest, interrater, split-half, and internal consistency (often associated with calculating Cronbach's alpha for a set of Likert items that are intended to measure the same construct). Among them, interrater reliability is commonly used for coding qualitative data where narrative responses to questions have been grouped into themes, and we are testing the extent to which two different people can categorize the same responses into the same coding categories.

We can distinguish three different clusters of validities for measures that we use in evaluations. All of them pertain to one or more aspects
of the validity of the measures. Among the types of measurement validity, the first cluster includes three sub-types of validity. All of them
involve the relationship between one measure and one construct. Face validity is about what laypeople think of a measure—is it valid on
the face of it? The question that the second kind of validity (content validity) addresses is this: Given what is known of the theoretical
meaning of the construct (how the construct is connected to other constructs in existing research and prior evaluations), to what extent is
the measure a complete or robust representation of the corresponding construct? We tend to rely on knowledgeable stakeholder or expert
judgment as ways of estimating content validity. Response process validity is about ensuring that the measurement process was genuine—
that respondents were not gaming, or skewing (intentionally or unintentionally), the measurement process. We will come back to the
issue of gaming measures when we discuss uses of performance measurement systems in Chapter 10.

The second cluster of measurement validity indicators focuses on situations where we have multiple measures of one construct. Here, we
are using multivariate statistical methods to see if the measures of a construct cohere—that is, behave as if they are all part of a common
underlying dimension. Factor analysis is a common way to discern the dimensionality of data structures. The third cluster of validities
involves correlating two or more variables and seeing how those empirical patterns correspond with the expected (theoretical)
relationships between two or more constructs. In this cluster, one of the most interesting is predictive validity: To what extent does one
measure of a construct at one point in time predict another measure of another construct in the future? We used the Stanford
marshmallow studies to illustrate predictive validity. Convergent validity is where two variables, each representing a distinct construct,
correlate empirically, consistent with the expectation that the constructs are linked. Divergent validity is a similar idea, but here, we have two
variables that do not correlate empirically, which lines up with the expectation that their respective constructs also would not correlate.
And finally, concurrent validity (sometimes called criterion validity) is intended to measure whether a new (relatively untried) measure of
one construct (or related constructs) correlates with a valid measure of the same (or another related) construct, where both constructs are
measured concurrently.

Among the eight kinds of measurement validity, three of them—face validity, content validity, and response process validity—will most
likely be in play when evaluations are done. The other types of measurement validity require more data and more control of the
measurement process, and this kind of information is often outside the scope of the evaluations we do. Where we can, we take advantage
of other efforts to validate measures and, if appropriate, use such measures in our evaluations.

When evaluators collect their own data, surveys are often used as a measuring instrument. Surveys can be very useful in evaluations or for
performance measurement systems. Constructing and administering surveys to minimize the “noise” that can occur from failures to
anticipate the ways people will react to the design of the survey instrument, for example, is key to making surveys worthwhile. Likert
statements are a principal way that evaluators measure stakeholder perceptions of program effectiveness. Often, we construct our own
clusters of Likert statements, and when we do, we need to keep in mind the value of clearly worded, simple statements that are likely to be
valid (at least face valid) measures of the construct at hand. Chapter 5, on needs assessment, offers some additional ideas for planning and
conducting surveys.

Retrospective pre-tests, as one strategy for “capturing” pre-test observations of outcome variables, can be useful. In fact, there is
considerable and growing evidence that in programs that are intended to change participant levels of knowledge or skill, participants are
better able to estimate their pre-program skills and knowledge after they complete the program than before they participate. Participation
can more accurately calibrate one’s frame of reference and, thus, increase the likelihood that participants can offer valid pre-program
assessments.

Survey designs are not the same thing as research designs. In a typical program evaluation, we measure constructs using multiple lines of
evidence, including surveys. But surveys are a measurement instrument and not the comparisons implied in addressing the evaluation
questions.

Measurement validity is not the same as the validity of causes and effects. We need to keep in mind that measurement validity is a part of
construct validity, and validity of causes and effects focuses on statistical conclusions and internal validity.

Measurement is perhaps the most undervalued aspect of evaluations—and even more so in constructing performance measurement
systems, where there is a tendency to rely very heavily on data that already exist, without taking the time to find out whether the data have
been gathered in a reliable or valid way. Experiences with auditing performance measurement systems suggest that even in systems that
have taken the time to integrate performance measurement into the planning and budgeting cycle, there are significant problems with the
reliability of the data (see, e.g., Texas State Auditor’s Office, 2002). If performance measures are not reliable, then they cannot be valid.

Discussion Questions
1. What is the basic difference between the reliability and validity of measures?
2. What is the difference between face validity and content validity?
3. What is the difference between concurrent validity and predictive validity?
4. Would you agree that ordinal measures have all the characteristics of nominal measures? Why, or why not?
5. Are surveys a type of research design? Why, or why not?
6. What is response shift bias in before–after comparison designs? How is that different from theory of change bias?
7. What is the difference between the validity of measures and the validity of causes and effects?
8. Design four Likert scale items (with five points ranging from “strongly disagree” to “strongly agree”) that collectively measure
patron satisfaction with restaurant dining experiences. Discuss the face validity of these measures with a classmate.
9. What are the advantages and disadvantages of using online surveys?

References
Alreck, P. L., & Settle, R. B. (2004). The survey research handbook (3rd ed.). New York, NY: McGraw-Hill Irwin.

American Educational Research Association. (1999). Standards for educational and psychological testing.
Washington, DC: American Educational Research Association, American Psychological Association, National
Council on Measurement in Education, Joint Committee on Standards for Educational and Psychological
Testing.

Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The place of inter-rater reliability in qualitative
research: An empirical study. Sociology, 31(3), 597–606.

Babbie, E. (2015). The practice of social research (14th ed.). Boston, MA: Cengage.

Bamberger, M. (2016). Integrating Big Data into monitoring and evaluation of development programs. New York,
NY: United Nations Global Pulse.

Bamberger, M., Rugh, J., Church, M., & Fort, L. (2004). Shoestring evaluation: Designing impact evaluations
under budget, time and data constraints. American Journal of Evaluation, 25(1), 5–37.

Berg, M. T., & Lauritsen, J. L. (2016). Telling a similar story twice? NCVS/UCR convergence in serious violent
crime rates in rural, suburban, and urban places (1973–2010). Journal of Quantitative Criminology, 32(1),
61–87.

Cantrell, P. (2003). Traditional vs. retrospective pretests for measuring science teaching efficacy beliefs in
preservice teachers. School Science and Mathematics, 103(4), 177–185.

Carifio, J., & Perla, R. (2007). Ten common misunderstandings, misconceptions, persistent myths and urban
legends about Likert scales and Likert response formats and their antidotes. Journal of Social Sciences, 3(3),
106–116.

Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Beverly Hills, CA: Sage.

Chen, G., Wilson, J., Meckle, W., & Cooper, P. (2000). Evaluation of photo radar program in British Columbia.
Accident Analysis & Prevention, 32(4), 517–526.

Clarke, A., & Dawson, R. (1999). Evaluation research: An introduction to principles, methods, and practice.
Thousand Oaks, CA: Sage.

Converse, P. D., Wolfe, E. W., Huang, X., & Oswald, F. L. (2008). Response rates for mixed-mode surveys using
mail and e-mail/web. American Journal of Evaluation, 29(1), 99–107.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Chicago,
IL: Rand McNally.

Coulter, S. (2012). Using the retrospective pretest to get usable indirect evidence of student learning. Assessment & Evaluation in Higher Education, 37(3), 321–334.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.

Dean, J. (2014). Big Data, data mining, and machine learning: Value creation for business leaders and practitioners.
John Wiley & Sons.

Decker, S. H. (1977). Official crime rates and victim surveys: An empirical comparison. Journal of Criminal
Justice, 5(1), 47–54.

Department for Communities and Local Government. (2016). The First Troubled Families Programme 2012–
2015: An overview. London, UK: Department for Communities and Local Government.

Dillman, D. (2007). Mail and Internet surveys: The tailored design method (2nd ed.). Hoboken, NJ: Wiley.

Dillman, D. (2011). Mail and Internet surveys: The tailored design method—2007—Update with new Internet,
visual, and mixed-mode guide. Hoboken, NJ: Wiley.

Domitrovich, C. E., & Greenberg, M. T. (2000). The study of implementation: Current findings from effective
programs that prevent mental disorders in school-aged children. Journal of Educational and Psychological
Consultation, 11(2), 193–221.

Gill, D. (Ed.). (2011). The iron cage recreated: The performance management of state organisations in New Zealand.
Wellington, New Zealand: Institute of Policy Studies.

Goodwin, L. D. (1997). Changing conceptions of measurement validity. Journal of Nursing Education, 36(3),
102–107.

Goodwin, L. D. (2002). Changing conceptions of measurement validity: An update on the new standards. Journal
of Nursing Education, 41(3), 100–106.

Hayes, A., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data.
Communication Methods and Measures, 1(1), 77–89.

Hill, L. & Betz, D. (2005). Revising the retrospective pretest. American Evaluation Review, 26(4), 501–517.

Holsti, O. R. (1969). Content analysis for the social sciences and humanities. Reading, MA: Addison-Wesley.

244
Hoogstraten, J. (1985). Influence of objective measures on self-reports in a retrospective pretest-posttest design.
Journal of Experimental Education, 53(4), 207–210.

Howard, C. (2010). Are we being served? A critical perspective on Canada’s Citizens First satisfaction surveys.
International Review of Administrative Sciences, 76(1), 65–83.

Howard, G. S., Dailey, P. R., & Gulanick, N. A. (1979). The feasibility of informed pretests in attenuating
response-shift bias. Applied Psychological Measurement, 3(4), 481–494.

Howard, G. S. (1980). Response-shift bias: A problem in evaluating interventions with pre/post self-reports.
Evaluation Review, 4(1), 93–106.

Howard, G. S., & Dailey, P. R. (1979). Response-shift bias: A source of contamination of self-report measures.
Journal of Applied Psychology, 64(2), 144–150.

Jamieson, S. (2004). Likert scales: How to (ab)use them. Medical Education, 38(12), 1217–1218.

Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the predictive validity of
the Graduate Record Examinations. Psychological Bulletin, 127(1), 162–181.

Kvam, A., Wisloff, F., & Fayers, P. (2010). Minimal important differences and response shift in health-related
quality of life; a longitudinal study in patients with multiple myeloma. Health and Quality of Life Outcomes, 8,
1–8.

Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1–55.

Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable realtime data systems. Shelter
Island, NY: Manning Publications Co.

McMorris, B., Petrie, R., Catalano, R., Fleming, C., Haggerty, K., & Abbott, R. (2009). Use of web and in-
person survey modes to gather data from young adults on sex and drug use: An evaluation of cost, time, and
survey error based on a randomized mixed-mode design. Evaluation Review, 33(2), 138–158.

McPhail, S., & Haines, T. (2010). The response shift phenomenon in clinical trials. Journal of Clinical Best
Research Practices, 6(2), 1–8.

Mischel, W., Ayduk, O., Berman, M. G., Casey, B. J., Gotlib, I. H., Jonides, J., . . . Shoda, Y. (2011).
“Willpower” over the life span: Decomposing self-regulation. Social Cognitive and Affective Neuroscience, 6(2),
252–256.

Moore, D., & Tananis, C. (2009). Measuring change in a short-term educational program using a retrospective
pretest design. American Journal of Evaluation, 30(2), 189–202.

245
Mueller, C. E. (2015). Evaluating the effectiveness of website content features using retrospective pretest
methodology: An experimental test. Evaluation Review, 39(3), 283–307.

Murray, J., Theakston, A., & Wells, A. (2016). Can the attention training technique turn one marshmallow into
two? Improving children’s ability to delay gratification. Behaviour Research and Therapy, 77, 34–39.

New Jersey Department of Health. (2017). Reliability and validity. New Jersey state health assessment data.
Retrieved from https://www26.state.nj.us/doh-shad/home/ReliabilityValidity.html

Nielsen, R. (2011). A retrospective pretest–posttest evaluation of a one-time personal finance training. Journal of
Extension [Online], 1–8.

Nimon, K., Zigarmi, D., & Allen, J. (2011). Measures of program effectiveness based on retrospective pretest
data: Are all created equal? American Journal of Evaluation, 32(1), 8–28.

Ockwell, P. (1992). An evaluation of the Kid’s Science Program run by the Science & Technology Division of Camosun
College (Unpublished master’s report). University of Victoria, Victoria, British Columbia, Canada.

Parks, R. B. (1984). Linking objective and subjective measures of performance. Public Administration Review,
44(2), 118–127.

Pechman, J. A., & Timpane, P. M. (Eds.). (1975). Work incentives and income guarantees: The New Jersey negative
income tax experiment. Washington, DC: Brookings Institution.

Pedersen, K. S., & McDavid, J. (1994). The impact of radar cameras on traffic speed: A quasi-experimental
evaluation. Canadian Journal of Program Evaluation, 9(1), 51–68.

Pedhazur, E. J. (1997). Multiple regression in behavioral research: Explanation and prediction (3rd ed.). Fort Worth,
TX: Harcourt Brace College.

Pelfry, V. (2009). Curriculum evaluation and revision in a nascent field. Evaluation Review, 33(1), 54–82.

Petersson, J., Leeuw, F., Bruel, J., & Leeuw, H. (2017). Cyber society, big data and evaluation: An introduction.
In J. Petersson & J. Bruel (Eds.), Cyber society, Big Data and evaluation (Comparative Policy Evaluation) (pp.
1–18). New Brunswick, NJ: Transaction Publishers.

Poli, A., Tremoli, E., Colombo, A., Sirtori, M., Pignoli, P., & Paoletti, R. (1988). Ultrasonographic measurement
of the common carotid artery wall thickness in hypercholesterolemic patients: A new model for the
quantification and follow-up of preclinical atherosclerosis in living human subjects. Atherosclerosis, 70(3),
253–261.

Poister, T. H. (1978). Public program analysis: Applied research methods. Baltimore, MD: University Park Press.

246
Pratt, C., McGuigan, W., & Katzev, A. (2000). Measuring program outcomes: Using retrospective pretest
methodology. American Journal of Evaluation, 21(3), 341–349.

Rea, L. M., & Parker, R. A. (2014). Designing and conducting survey research: A comprehensive guide (4th ed.). San
Francisco, CA: Jossey-Bass.

Ridgeway, G. (2017). Policing in the era of Big Data. Annual Review of Criminology, 1, 401–419.

Rogers, P. (1983). An evaluation of the debtor assistance program (Unpublished master’s report). University of
Victoria, Victoria, British Columbia, Canada.

Sampson, R., Raudenbush, S., & Earls, F. (1997). Neighborhoods and violent crime: A multi-level study of
collective efficacy. Science, 277, 918–924.

Schaeffer, N. C., & Presser, S. (2003). The science of asking questions. Annual Review of Sociology, 29(1), 65–88.

Schwartz, C. E., & Sprangers, M. A. (2010). Guidelines for improving the stringency of response shift research
using the thentest. Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care
and Rehabilitation, 19(4), 455–464.

Schwarz, N. (2007). Cognitive aspects of survey methodology. Applied Cognitive Psychology, 21(2), 277–287.

Schwarz, N., & Oyserman, D. (2001). Asking questions about behavior: Cognition, communication, and
questionnaire construction. American Journal of Evaluation, 22(2), 127–160.

Schweinhart, L. J., Montie, J., Xiang, Z., Barnett, W. S., Belfield, C. R., & Nores, M. (2005). The High/Scope
Perry Preschool Study through age 40: Summary, conclusions, and frequently asked questions. Ypsilanti, MI:
High/Scope Press.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for
generalized causal inference. Boston, MA: Houghton Mifflin.

Taylor, P. J., Russ-Eft, D. F., & Taylor, H. (2009). Gilding the outcome by tarnishing the past: Inflationary
biases in retrospective pretests. American Journal of Evaluation, 30(1), 31–43.

Texas State Auditor’s Office. (2002). An audit report on fiscal year 2001 performance measures at 14 entities. Austin,
TX: Author.

Trochim, W. M. K. (2006). The research methods knowledge base (2nd ed.). Retrieved from
http://www.socialresearchmethods.net/kb/index.htm

247
Van Selm, M., & Jankowski, N. W. (2006). Conducting online surveys. Quality & Quantity, 40(3), 435–456.

Webb, E. J. (1966). Unobtrusive measures: Nonreactive research in the social sciences. Chicago, IL: Rand McNally.

Weng, L.-J. (2004). Impact of the number of response categories and anchor labels on coefficient alpha and test-
retest reliability. Educational and Psychological Measurement, 64(6), 956–972.

West, R., Zatonski, W., Przewozniak, K., & Jarvis, M. (2007). Can we trust national smoking prevalence figures?
Discrepancies between biochemically assessed and self-reported smoking rates in three countries. Cancer
Epidemiology Biomarkers & Prevention, 16(4), 820–822.

Zhang, X., Shu-Chuen, L., Feng, X., Ngai-Nung, L., Kwang-Ying, Y., Seng-Jin, Y.,… Thumboo, J. (2012). An
exploratory study of response shift in health-related quality of life and utility assessment among patients with
osteoarthritis undergoing total knee replacement surgery in a tertiary hospital in Singapore. Value in Health, 15,
572–578.

248
5 Applying Qualitative Evaluation Methods

Contents
Introduction 206
Comparing and Contrasting Different Approaches to Qualitative Evaluation 207
Understanding Paradigms and Their Relevance to Evaluation 208
Pragmatism as a Response to the Philosophical Divisions Among Evaluators 213
Alternative Criteria for Assessing Qualitative Research and Evaluations 214
Qualitative Evaluation Designs: Some Basics 216
Appropriate Applications for Qualitative Evaluation Approaches 216
Comparing and Contrasting Qualitative and Quantitative Evaluation Approaches 218
Designing and Conducting Qualitative Program Evaluations 221
1. Clarifying the Evaluation Purpose and Questions 222
2. Identifying Research Designs and Appropriate Comparisons 222
Within-Case Analysis 222
Between-Case Analysis 223
3. Mixed-Methods Evaluation Designs 224
4. Identifying Appropriate Sampling Strategies in Qualitative Evaluations 228
5. Collecting and Coding Qualitative Data 230
Structuring Data Collection Instruments 230
Conducting Qualitative Interviews 231
6. Analyzing Qualitative Data 233
7. Reporting Qualitative Results 237
Assessing the Credibility and Generalizability of Qualitative Findings 237
Connecting Qualitative Evaluation Methods to Performance Measurement 239
The Power of Case Studies 241
Summary 243
Discussion Questions 244
References 245

Introduction
Our textbook is aimed at supporting a wide range of evaluation-related projects. So far, we have emphasized the
importance of thinking systematically about what is involved in evaluating program effectiveness. Chapters 2, 3,
and 4 cover topics that support evaluations that include quantitative lines of evidence. In Chapter 5, we are
acknowledging that most program evaluations include both quantitative and qualitative lines of evidence (mixed
methods) and we spend time describing core qualitative methods that are used in program evaluations.

The field of evaluation is relatively young. You will often encounter textbooks or even discussions that mingle
methodologies and philosophical issues (Mertens & Wilson, 2012, is an example). In our field, philosophical
differences among evaluators are not far below the surface of how we practice our craft. Chapter 5 begins by
comparing and contrasting different philosophical and methodological approaches to qualitative evaluation,
pointing out that the diversity in approaches is one of the challenges in working with and conveying results using
qualitative methods. We introduce a set of criteria for assessing the quality of qualitative evaluations but point out
that these criteria are themselves contested by some qualitative evaluators.

The main part of this chapter is focused on the process of doing qualitative evaluations and working with different
qualitative methodologies. We include checklists to guide sampling, conduct qualitative interviews, and analyze
qualitative data. The chapter concludes with a discussion of several topics: the credibility and generalizability of
qualitative findings; using qualitative methods to construct performance measures (we introduce the Most
Significant Change [MSC] approach, which has been used in assessing programs in developing countries); and a
note on the power of (and the responsibilities involved in) using case studies in evaluations.

Qualitative evaluation methods are typically distinguished by their emphasis on interviews, focus groups, textual
sources, and other media that consist of words, either written or spoken. Fundamentally, qualitative evaluation
approaches rely on natural language-based data sources. In Chapters 2, 3, and 4, we introduced ideas and
methodologies that have their origins in the social sciences, particularly those disciplines that historically have been
associated with the core of evaluation as a field. Alkin, Christie, and Vo (2012) introduce the metaphor of a tree to
visually depict the main branches of evaluation theory in the whole field. The central branch, which can be traced
to the origins of the field, is a methodology/methods group of theorists (Donald T. Campbell is a key member of
that group) who collectively are guided by the centrality of methodologies to design and conduct evaluations.

The concepts and principles in Chapters 2, 3, and 4 reflect a methodology focus in which understanding causal
thinking is important. The contents of those chapters are a part of specialized research-focused languages that have
been important in the social sciences. For example, our discussion of measurement validity in Chapter 4 relies on
the discipline of psychology and the work that has been done to develop methodologies for validating measures of
constructs. Learning methodologically focused languages can be challenging because they are not the languages we
use every day. It takes practice to become comfortable with them.

When we include qualitative evaluation methods in a program evaluation, we generally are analyzing the
following: the narratives that are created when people interact with each other; organizations’ and governments’
textual/documentary materials; or other sources of information that are not numerical. Narratives can be as brief
as open-ended responses to survey questions, or as lengthy as in-depth interviews recorded with stakeholders. This
chapter will show how qualitative evaluation methods can be incorporated into the range of options available to
evaluators and their clients and will offer some comparisons between qualitative and quantitative evaluation
approaches.

In general, qualitative approaches are less structured than quantitative methods and are valuable in collecting and
analyzing data that do not readily reduce into numbers. Qualitative methods are particularly useful for exploratory
work and participatory (Cousins & Chouinard, 2015), utilization-focused (Patton, 2008), or empowerment
evaluations (e.g., see Fetterman, 2005; Fetterman & Wandersman, 2007). These approaches to evaluation involve
significant collaboration between the evaluator and stakeholders during most or all of the steps in the evaluation
process, from the planning and design to the final interpretation and recommendations, and tend to rely on
qualitative evaluation methods.

Comparing and Contrasting Different Approaches to Qualitative Evaluation
When qualitative evaluation approaches emerged as alternatives to the then-dominant social scientific
(quantitative) approach to evaluation in the 1970s, proponents of these new ways of evaluating programs were
part of a broader movement to remake the foundations and the practice of social research. Qualitative research has
a long history, particularly in disciplines like anthropology and sociology, and there have been important changes
over time in the ways that qualitative researchers see their enterprise. There is more diversity within qualitative
evaluation approaches than within quantitative approaches:

A significant difference between qualitative and quantitative methods is that, while the latter have
established a working philosophical consensus, the former have not. This means that quantitative
researchers can treat methodology as a technical matter. The best solution is one which most effectively
and efficiently solves a given problem. The same is not true for qualitative research where proposed
solutions to methodological problems are inextricably linked to philosophical assumptions and what
counts as an appropriate solution from one position is fatally flawed from another. (Murphy, Dingwall,
Greatbatch, Parker, & Watson, 1998, p. 58)

Qualitative evaluation methods can be viewed as a subset of qualitative research methods. We can think of
qualitative research methods being applied to evaluations. Denzin and Lincoln (2011) summarize the history of
qualitative research in their introduction to the Handbook of Qualitative Research. They offer an interpretation of
the history of qualitative research in North America as comprising eight historical moments, which overlap and
“simultaneously operate in the present” (p. 3). As they emphasize, the field of qualitative research is characterized
by tensions and contradictions. They begin their timeline with traditional anthropological research (1900s to
about 1950), in which lone anthropologists spent time in other cultures and then rendered their findings in
“objective” accounts of the values, beliefs, and behaviors of the Indigenous peoples. This approach was informed
by what we call a positivist theoretical framework, which we will explain in more detail later. The three most
recent eras include “the crisis of representation” (1986–1990); the postmodern era, “a period of experimental and
new ethnographies”; and “the methodologically contested present” (2000–onwards). While emphasizing that
qualitative research has meant different things in each of these eight movements, Denzin and Lincoln (2011)
nevertheless provide a generic definition of qualitative research:

Qualitative research is a situated activity that locates the observer in the world. Qualitative research
consists of a set of interpretive, material practices that make the world visible . . .  qualitative researchers
study things in their natural settings, attempting to make sense of, or to interpret, phenomena in terms
of the meanings people bring to them. (p. 3)

They later continue,

Qualitative researchers stress the socially constructed nature of reality, the intimate relationship between
the researcher and what is studied, and the situational constraints that shape inquiry. Such researchers
emphasize the value-laden nature of enquiry. They seek answers to questions that stress how social
experience is created and given meaning. In contrast, quantitative studies emphasize the measurement
and analysis of causal relationships between variables, not processes. (p. 8)

Understanding Paradigms and Their Relevance to Evaluation
In the field of program evaluation, academics and practitioners in the 1970s were increasingly under pressure to
justify the then-dominant social science–based approach as a way of thinking about and conducting evaluations.
Questions about the relevance and usefulness of highly structured evaluations (often experiments or quasi-
experiments) were being raised by clients and academics alike.

Thomas S. Kuhn (1962), in his book The Structure of Scientific Revolutions, asserted that when scientists “discover”
a new way of looking at phenomena, they literally see the world in a different way. He popularized the notion of a
paradigm, a self-contained theoretical and perceptual structure akin to a belief system that shapes what we think is
important when we do research, how we see the events and processes we are researching, and even whether we can
see particular events. Although Kuhn was describing the change in worldviews that happened in theoretical
physics when Einstein’s relativity theory began its ascendancy at the turn of the 20th century, replacing the then-
dominant Newtonian theory that had been the mainstay of physics since the 1700s, he used language and
examples that invited generalizing to other fields. In fact, because his book was written in a nontechnical way, it
became a major contributor to the widespread and continuing process of questioning and elaborating the
foundations of our knowledge and understanding in the social sciences and humanities.

Paradigms, for Kuhn (1962), were at least partly incommensurable. That is, adherence to one paradigm—and its
attendant way of seeing the world—would not be translatable into a different paradigm. Proponents of different
paradigms would experience an inability, at least to some extent, to communicate with their counterparts. They
would talk past each other because they would use words that refer to different things and literally see different
things even when they were pointing to the same object. They would, in essence, see the world via differing lenses.

In the 1970s, an alternative paradigm for evaluation was emerging, based on different assumptions, different ways
of gathering information, different ways of interpreting that information, and, finally, different ways of reporting
evaluation findings and conclusions. Part of this shift in paradigms involved making greater space for qualitative
methods.

Some authors in the evaluation field continue to argue that qualitative evaluation methods always have different
epistemological underpinnings from quantitative methods (see Bamberger, Rugh, & Mabry, 2012; Denzin &
Lincoln, 2011; Guba & Lincoln, 1989, 2005). However, many evaluators do not believe that qualitative methods
necessarily have a different epistemological underpinning from quantitative methods (Kapp & Anderson, 2010;
Johnson & Onwuegbuzie, 2004; Owen, 2006). These authors believe that the distinction between qualitative and
quantitative needs to be made at the level of methods, not at the level of epistemology (how we know what we
know). Methods are “the techniques used to collect or analyze data” in order to answer an evaluation question or
hypothesis (Crotty, 1998, p. 3). This is the view taken in this chapter, and it is consistent with the pragmatic
philosophical view that we discuss later in this chapter. In this section, we introduce some basic philosophical ideas
to show how qualitative and quantitative evaluation approaches have been viewed by some as being different and
how philosophical positions can shape the ways that evaluations are approached and what is considered to be
appropriate in terms of methodologies.

Table 5.1 has been adapted from Crotty (1998) to illustrate some basic concepts that underlie debates and
disagreements among some qualitative and quantitative evaluators.

Table 5.1 Underlying Epistemologies and Theoretical Perspectives in Evaluation

Epistemology: Objectivism assumes objects exist as meaningful entities independently of human "consciousness and experience."

Associated theoretical perspectives:
- Positivism is based on an epistemology of objectivism, and this perspective holds it is possible to fully comprehend the real world through the scientific method.
- Postpositivism: we can only incompletely understand the real world through the scientific method.

Epistemology: Constructionism assumes things do not exist as meaningful entities independently of human consciousness and experience. Constructivism focuses on meanings that individuals generate; social constructionism focuses on the social context that produces the meanings individuals use.

Associated theoretical perspectives:
- Interpretivism, sometimes called antipositivism, assumes that our descriptions of objects, be they people, social programs, or institutions, are always the product of interpretation, not neutral reports of our observations; the focus is on understanding people's situated interpretations of the social world.
- Phenomenology assumes our culture gives us ready-made interpretations of objects in the world. It focuses on trying to get past these ready-made meanings.
- Hermeneutics involves the understanding of social events (human interactions) by analyzing their meanings to the participants, as well as taking into account how the meanings are influenced by cultural contexts.
- Critical inquiry: This approach views the world in terms of conflict and oppression and assumes that it is the role of research to challenge the status quo and to bring about change.
- Feminism is a collection of movements that focus on the roles and rights of women in societies, including interpreting social and historical events and changes from women's perspectives.
- Pedagogy of the oppressed was developed by Paulo Freire and Anna Maria Araújo Freire (1994); it emphasized that there is inequality and domination in the world and focuses on increasing the consciousness of oppressed groups.

Source: Adapted from Crotty (1998, p. 5).

Basically, at the level of epistemologies and theoretical perspectives, we are speaking about divisions in approaches
to evaluation as a field. An epistemology is the (philosophical) theory of how we know what we know. In Table 5.1,
objectivism assumes that objects exist as meaningful entities independently of human “consciousness and
experience” (p. 5). From this perspective, objects, such as a tree, are understood to carry an intrinsic meaning.
When, as evaluators, we interact with objects in the world, we “are simply discovering a meaning that has been
lying there in wait … all along” (p. 8). In contrast, constructionists believe that meaningful reality does not exist
independently of human consciousness and experience. This does not mean that reality exists just in our mind.
Instead, it means that objects are not meaningful independent of human consciousness. Indeed, very few
constructionists have an antirealist ontology—the idea that reality consists only of ideas or is confined to the mind
(Kushner, 1996). However, within the field of evaluation, constructionism sometimes has been linked with
antirealism because the most prominent evaluators taking a constructionist approach—Guba and Lincoln (1989)
—rely on an antirealist ontology. Most constructionists (and other interpretive researchers) reject this
understanding of constructionism and have a realist ontology, accepting that reality is not confined to the mind
(Crotty, 1998; Kushner, 1996). While Guba and Lincoln’s (1989) position has been influential, it has also been
controversial. Some constructionists believe that we should focus on the way that individuals interpret the world
—these are the constructivists. The social constructionists believe that social contexts are critical for
understanding how meanings and realities are generated (Crotty, 1998). When we consider these underlying
philosophical perspectives, we can understand how paradigms in evaluation developed. Adherents of objectivism
would not comfortably “see” the world through a constructionist lens, and vice versa. To some extent, they would
be talking past each other as they tried to explain their beliefs and how those beliefs informed their evaluation
approaches and methods.

The epistemological and theoretical perspectives summarized in Table 5.1 can be connected with different
research and evaluation methodologies, but how and whether they are connected depends on the underlying
philosophical beliefs of the evaluators involved. For example, connecting an interpretivist approach to
experimental methodologies—in which human participants are randomly assigned to program and control groups
and differences are measured numerically and compared statistically—would not work for an evaluator who
believes that objectivism and constructionism are incompatible philosophical stances.

The New Chance welfare-to-work evaluation (Quint, Bos, & Polit, 1997), conducted by the Manpower
Demonstration Research Corporation (2012), provides a good example of evaluators using qualitative methods in
a strictly positivist way (Zaslow & Eldred, 1998). One of the components of this evaluation was observations of
interactions between mothers and their child carrying out the following activities: book reading, block game,
wheel game, sorting chips game, Etch-a-Sketch, and gift giving. Reflecting their strict positivist theoretical
perspective, the evaluators were concerned with keeping a strict distance between evaluators and the research
subjects and with creating a replicable research design. Interactions between evaluators and research subjects were
rigidly scripted, observations were recorded using strict protocols, and the quality of the parenting was assessed
against predetermined standards. Importantly, reflecting their positivist theoretical perspective, the evaluators
made the assumption that these criteria for good parenting were “value-neutral, ahistorical and cross-cultural”
(Crotty, 1998, p. 40). An interpretive theoretical perspective, in contrast, would have made explicit that these
criteria for good parents were a plausible and defensible way of defining good parenting but not a set of value-
neutral and universal measures.

Postpositivists modify the strict stance taken by positivists; while striving not to influence what/who they observe,
they willingly accept that there is no “Archimedean point from which realities in the world can be viewed free
from any influence of the observers’ standpoint” (Crotty, 1998, p. 40). Furthermore, postpositivists acknowledge
that observation of parenting—or any other social phenomena—always “takes place within the context of theories
and is always shaped by theories” (p. 33). For example, a postpositivist evaluator would be more willing to
acknowledge that he or she is observing parenting through the frame of theory rather than making completely
value-neutral and ahistorical assessments of good parenting. However, strict positivism and postpositivism have in
common an assumption that it is possible to describe objects in the social or natural world in isolation from the
person experiencing it.

Interpretivism is an important epistemological perspective within qualitative approaches to research and
evaluation. Interpretivists believe that the description of an object or an event is always shaped by the person or
culture describing it and that it is never possible to obtain a description that is not shaped in this way. Therefore,
the aim of the interpretivist approach is to look “for culturally derived and historically situated interpretations of
the social world” (Crotty, 1998, p. 67). Evaluators taking an interpretivist stance assume that our descriptions of
objects, be they people, social programs, or institutions, are always the product of interpretation, not neutral
reports of our observations.

Fish (1980) illustrates this perspective nicely in his well-known essay “How Do You Recognize a Poem When You
See One?” in which he recalls a summer camp during which he taught two courses. That summer, Fish taught one
course on linguistics and literary criticism and one on English religious poetry. Both courses were held in the same
classroom, and one followed directly after the other. One morning, as the students in the linguistics and literary
criticism course left the classroom, Fish looked at the list of authors he had written on the blackboard (Figure 5.1).
Students were expected to read these authors prior to the next class. Fish put a box around the names and wrote
“p. 43” above the box. As students for the next class trailed into the room, he drew their attention to the list of
names, told them it was a religious poem, and invited them to interpret it. Students enthusiastically took up the
challenge and began interpreting the poem. One student pointed out that Jacobs can be related to Jacob’s ladder,
an Old Testament allegory for ascent into heaven, and is linked to “Rosenbaum” (rose tree in German). Surely, one
student argued, this is an allusion to the Virgin Mary, who is often depicted as a rose without thorns and promotes
Christians’ ascent into heaven through the redemptive work of her son Jesus. Other students provided further
interpretations.

Figure 5.1 Is This a List of Names or a Religious Poem?

Source: Crotty (1998, p. 194).

In his essay, Fish (1980) points out that the students did not come to recognize “the poem” because of its features
but, instead, because he told them it was a poem. Fish concludes that any reading of a text (or objects in the social
world) is not “a matter of discerning what is there” but “of knowing how to produce what can thereafter be said to
be there” (pp. 182–183). One could object that only the first class discerned what was “really there,” namely, a list
of readings. Fish (1980) counters as follows:

The argument will not hold because the assignment we all see is no less the product of interpretation
than the poem into which it was turned … it requires just as much work, and work of the same kind, to
see this as an assignment as it does to see it as a poem. (p. 184)

Interpretivists would also point out that both the students and the object have a vital role to play in the
“generation of meanings” (Crotty, 1998, p. 48). While the students could have turned any list of names into a
poem, “they would make different sense of a different list” (p. 48). Within program evaluation, interpretive
approaches involve viewing our understandings of programs as “historically and culturally effected interpretations”
rather than neutral or eternal truths (p. 48).

Another example of the interpretive perspective is an evaluation of mental health services for severely emotionally
disturbed youths (Kapp & Anderson, 2010). Setting aside any attempt to provide a single objective definition of
program success, the evaluators instead aimed to uncover what success meant for the young clients, their parents,
and professionals caring for the young clients. Through excerpts from interviews they conducted as part of the
evaluation, Kapp and Anderson (2010) illustrate how each group has a different perspective on what it means for
the young clients to be successful. This focus on individual understandings also raises an important division within
interpretivism between those who focus on the meanings that individuals generate (constructivism) and those that
focus on the social context that produces meaning (social constructionism). Importantly, Kapp and Anderson did
not view the client’s actions in the world as simply “ideas in their minds.” However, in line with a constructivist
perspective, they assumed that there is no single way of defining success, only ways that different people make
sense of success.

Pragmatism as a Response to the Philosophical Divisions Among Evaluators
In the 1990s, the “paradigm wars” that had dominated the evaluation field in the 1980s began to wane (Patton,
1997). Qualitative evaluators, such as Patton (2002, 2015), and experts in mixed methods, such as Creswell
(2009, 2015) and Johnson and Onwuegbuzie (2004), argue that evaluators should focus on “what works”
situationally in terms of research and analytical methods (see also Bryman, 2009). Rather than partitioning
methods according to their presumed allegiance to underlying philosophical assumptions, pragmatists take the
view that if a method “works”—that is, yields information that best addresses an evaluation question in that
context—then use it. Usefulness becomes a key criterion for fitting evaluation approaches and methods to
particular situations and their requirements. This view of evaluation methods largely separates them from
underlying philosophical positions, and it means that qualitative and quantitative methods can be used
pragmatically in a wide range of evaluation situations.

In this textbook, we have adopted a pragmatic view of the evaluation enterprise. Like Creswell (2009, 2015),
Crotty (1998), and Johnson and Onwuegbuzie (2004), we see the distinction between qualitative and quantitative
approaches best being made at the level of methods, not at the level of theoretical perspectives or at the level of
epistemologies. Underlying philosophical differences are largely relegated to the background in the way we see the
practice of evaluation. However, it is important to understand these philosophical divisions. At times, this
understanding can help clarify an evaluation focus, question, or alternative perspectives (Johnson, 2017). For
instance, imagine you are asked to find out about clients’ experiences of a program. The question could be based
on the objectivist assumption that subjective meanings are important in clients’ lives but that these are “inferior”
(less real) to objective, scientific meanings. Alternatively, this question could be based on the interpretivist
assumption that all meaning is individually or socially generated.

We also follow Crotty (1998) and Morgan (2007) in understanding methods as “the techniques or procedures
used to gather and analyse data” (Crotty, 1998, p. 3) in order to answer a research question or hypothesis.
Methodologies are broader than specific methods and include “the strategy, plan of action, process, or design
lying behind [our] choice, and use, of particular methods” (p. 3). Evaluators’ methodologies link “their choice and
use of methods” (p. 3) to their research or evaluation objectives. Common evaluation methodologies include
experimental research, survey research, ethnography, and action research. Examples of methods include
sampling, direct observation, interviews, focus groups, statistical analysis, and content analysis.
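
To make the distinction between methodologies and methods more concrete, the following is a minimal sketch, our own illustration rather than anything drawn from the authors cited above, of the mechanical core of one method named here: content analysis, reduced to counting how often evaluator-chosen keyword groups appear in open-ended survey responses. The responses and keyword groups are hypothetical, and a real content analysis would also involve developing and testing a coding frame.

from collections import Counter
import re

# Hypothetical open-ended survey responses from program clients
responses = [
    "Staff were respectful and the wait time was short.",
    "I waited two hours; staff seemed rushed but still respectful.",
    "The intake process was confusing and the wait was long.",
]

# Hypothetical keyword groups standing in for codes in a coding frame
codes = {
    "wait_time": ["wait", "waited", "waiting"],
    "staff_conduct": ["respectful", "rushed", "rude"],
    "process_clarity": ["confusing", "clear", "unclear"],
}

counts = Counter()
for text in responses:
    words = re.findall(r"[a-z']+", text.lower())
    for code, keywords in codes.items():
        if any(k in words for k in keywords):
            counts[code] += 1  # count each response at most once per code

for code, n in counts.most_common():
    print(f"{code}: mentioned in {n} of {len(responses)} responses")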

Alternative Criteria for Assessing Qualitative Research and Evaluations
Although we take a pragmatic view of how qualitative and quantitative approaches can relate to each other in
evaluations, it is worthwhile understanding how the different philosophical perspectives summarized in Table 5.1
can be followed through with criteria for assessing the quality and credibility of qualitative research. The main
point of Table 5.1 is that assessing the quality of qualitative research depends on one’s perspective. One of the
most important reasons to be clear about our assumptions in doing qualitative research is that they help us be clear
about the criteria we are expecting stakeholders to use in assessing the findings (Crotty, 1998). Do we intend
others to view our findings as objective truths that are valid and generalizable (a positivist theoretical approach)?
Alternatively, do we intend that people view our findings as sound and plausible interpretations (a constructionist
theoretical approach) (Crotty, 1998)? Our judgment of the quality of research and the validity of the findings
depends on the criteria we use (Patton, 2015). As Patton (2015) explains, people often find it difficult to assess the
quality of qualitative research because they are unsure about the criteria to use.

Unlike with quantitative methods, there is no single universally accepted way of assessing the quality of qualitative
research. Different audiences are likely to bring different criteria to bear. An audience of community activist
evaluators is likely to bring different criteria to bear than an audience of government evaluators. Understanding
the diversity within qualitative evaluation approaches helps you understand the diversity in criteria that others are
likely to bring to bear on your evaluation findings. This understanding helps you anticipate their reactions and
position your “intentions and criteria in relation to their own expectations and criteria” (Patton, 2002, p. 543).

Based on Patton’s (2002) framework, Table 5.2 divides qualitative approaches into three primary types.
Importantly, in practice, the lines can be more blurred. For instance, a postpositivist project may wish to capture
multiple perspectives, though it will not see all perspectives as equally authoritative or valid. Not all criteria listed
may apply to all projects in that category. For instance, participatory projects are concerned with collaboration,
and many are not explicitly concerned with identifying the nature of injustice or lines of inquiry to improve social
justice.

Table 5.2 Alternative Sets of Criteria for Judging the Quality and Credibility of Qualitative Research

Positivist/Postpositivist Criteria:
- Attempts to minimize bias and ensure the objectivity of the inquirer
- Validity of the data—measures what it intends to measure
- Fieldwork procedures are systematically rigorous
- Triangulation (consistency of findings across methods and data sources) is used
- Coding and pattern analysis is reliable; that is, another coder would code the same way
- Findings correspond to reality
- Findings are generalizable (external validity)
- Evidence supports causal hypotheses
- Findings make contributions to theory building

Interpretivist/Social Construction and Constructivist Criteria:
- Evaluator's subjectivity is acknowledged (biases are discussed and taken into account)
- Trustworthiness of the research findings
- Authenticity of the research approach
- Multiplicity (capturing and respecting multiple perspectives)
- Reflexivity—attempting to act in the world while acknowledging that these actions necessarily express social, political, and moral values
- Praxis
- Attention to particularity by doing justice to the integrity of unique cases
- Verstehen—deep and empathetic understanding of others' meanings
- Presentation of findings makes contributions to dialogue

Critical Change Criteria (Feminist Inquiry, Empowerment Evaluation, Some Collaborative and Participatory Approaches):
- Aims to increase consciousness about social injustices
- Findings identify nature and causes of inequalities and injustices
- The perspective of the less powerful is represented
- Engagement with those with less power in a respectful and collaborative way
- Illuminates how the powerful exercise power and benefit from it
- The capacity of those involved to take action is increased
- Change-making strategies are identified
- Praxis
- Historical and values context are clear
- Consequential validity—takes into account the implications of using any measures as a basis for action and the social consequences of this research

Source: Adapted from Patton (2002, pp. 544–545).
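
As a concrete illustration of one positivist/postpositivist criterion in Table 5.2, that "another coder would code the same way," the short sketch below computes simple percent agreement and Cohen's kappa for two coders who applied the same labels to the same set of interview excerpts. This is our own illustration; the code labels and data are hypothetical, not from any study cited here.

from collections import Counter

# Hypothetical code labels applied by two coders to the same eight interview excerpts
coder_a = ["access", "cost", "access", "quality", "cost", "access", "quality", "access"]
coder_b = ["access", "cost", "quality", "quality", "cost", "access", "quality", "cost"]

n = len(coder_a)
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n  # simple percent agreement

# Agreement expected by chance, from each coder's marginal label frequencies
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
labels = set(freq_a) | set(freq_b)
expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)

kappa = (observed - expected) / (1 - expected)  # Cohen's kappa
print(f"Observed agreement: {observed:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")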

While there are still those within the evaluation community who believe that qualitative and quantitative methods
are based on fundamentally conflicting philosophical perspectives, there is broad acceptance among professional
evaluators and government evaluation agencies of a more pragmatic perspective that assumes that qualitative and
quantitative evaluation methodologies and methods are complementary. Again, the important thing when
incorporating qualitative methods is to be clear about the criteria that your audience will use to assess the quality
of your work, and to position your approach in relation to their expectations.

Stakeholders can come to an evaluation with their own paradigmatic lenses, and although pragmatism is an
increasingly widespread basis for contemporary evaluation practice, different social science and humanities
disciplines and even specific undergraduate or graduate programs can imbue their graduates with “worldviews.”
Stakeholders may also disagree among themselves about the criteria to use to assess evaluation quality. In these
cases, it is desirable to work to resolve these conflicts when developing the evaluation design. The advantage of
pragmatism is that it is permissive of different methodologies (both qualitative and quantitative) and encourages
the view that a mixed-methods approach to evaluations is methodologically appropriate. We will expand on this
idea in Chapter 12, where we discuss professional judgment.

Qualitative Evaluation Designs: Some Basics
What is qualitative evaluation? How is it distinguished from other forms of program evaluation? How do
qualitative evaluators do their work?

These are practical questions, and the main focus of this section will be to offer some answers to them. It is worth
saying, however, that qualitative evaluation methods have developed in many different ways and that there are a
number of different textbooks that offer evaluators ways to design, conduct, and interpret evaluations that rely on
qualitative data (e.g., Denzin & Lincoln, 2011; Patton, 2002, 2008, 2015; Miles, Huberman, & Saldana, 2014).
We encourage any reader interested in a more detailed understanding of qualitative evaluation than is offered in
this chapter to refer to these other resources and others listed in the references at the end of this chapter.

Patton (2003), in the Evaluation Checklists Project (Western Michigan University, 2010), maintains, “Qualitative
methods are often used in evaluations because they tell the program’s story by capturing and communicating the
participants’ stories” (p. 2). Qualitative methods frequently involve seeking to understand participants’ points of
view and experiences, seeking to understand and describe different ways of making sense of the world, seeking to
collect data in an exploratory and unstructured way, and seeking to capture descriptive, rich, and in-depth
accounts of experiences. In contrast, quantitative evaluations use numbers gathered from measures over
comparatively large samples and use statistical procedures for describing and generalizing the relationships between
and among variables. An evaluation may be entirely conducted using a qualitative approach, but it is more
common to combine qualitative and quantitative methods. We discuss mixed-methods designs later in the
chapter.

Appropriate Applications for Qualitative Evaluation Approaches
Qualitative methods are not appropriate for every evaluation. In Chapter 1, we described the differences between
summative evaluations and formative evaluations. Formative evaluations address questions such as the following:

How does the program actually operate in practice?


Has the program been implemented as planned?
What are the program objectives and target populations?
Can the program be evaluated?
What program outcomes were observed?
Why do the observed program outcomes occur?
How can the program process be changed?

In contrast, summative evaluations are concerned with questions such as the following:

Is the program worthwhile given the outcomes achieved?


Is the program offering value for the resources that it consumes?

You also learned that, while the most common reason we conduct evaluations is to learn about program
effectiveness, some evaluations focus on other questions, such as the relevance of a program, the appropriateness of
a program, or even the need for a program.

Patton (2002) describes nine particularly appropriate applications for qualitative methods within an evaluation.
Table 5.3 illustrates how the nine applications of qualitative methods described by Patton (2002) primarily fit
with a summative or formative evaluation intention (as described in Chapter 1). Where appropriate, we have also
referenced Patton (2015). This table illustrates two important things. First, qualitative evaluation is often focused
on “determining how the program actually operates in practice.” Second, while qualitative evaluation can address
summative evaluation questions and questions about effectiveness, qualitative approaches are particularly
appropriate for answering formative evaluative questions, such as questions related to program processes. The
Urban Change welfare-to-work project provides an example of where evaluators used case studies to answer a
formative evaluation question about why the observed program outcomes occurred. Specifically, through case
studies, they determined why some adolescents had poor education outcomes when their mothers were required to
participate in welfare-to-work programs (Gennetian et al., 2002).

Table 5.3 Nine Qualitative Evaluation Applications by Evaluation Type

1. Process studies: "looking at how something happens" (Patton, 2015)
   Formative: How does the program actually operate in the context in which it has been implemented?

2. Comparing programs: focus on diversity (Patton, 2015)
   Formative: How does the program context affect implementation?

3. Documenting development over time and system changes (Patton, 2015)
   Formative: How has the program changed over time (program structure, program objectives, resources, environmental factors)?

4. Implementation evaluation (Patton, 2015)
   Formative: Has the program been implemented as planned (fidelity to intended program design)?

5. Logic models and theories of change (Patton, 2015)
   Formative: What is/are the program objective(s) and target population(s)? What is the program structure? What are the assumptions being made as part of the theory of change?

6. Evaluability assessments (Patton, 2002)
   Formative and summative: Given the resources available and the political and organizational opportunities and constraints, should the program be evaluated?

7. Outcomes evaluation (Patton, 2002)
   Formative: Was the program effective? That is, did it achieve its intended objectives, and were the observed outcomes due to the program?
   Summative: Did the program achieve its intended outcomes, were those due to the program, and is the program worthwhile given the outcomes achieved?

8. Evaluating individualized outcomes (Patton, 2002)
   Formative: Was the program effective? That is, did it achieve its intended objectives for individual clients, and were those outcomes due to the program? What combinations of context, mechanism, and outcome variables were effective? Less effective?
   Summative: Is the program worthwhile given the outcomes achieved—has the program delivered value for the per client resources consumed? What combinations of context, mechanism, and outcome variables were effective? Less effective?

9. Prevention evaluation: examining the degree to which desired behavioral and attitudinal change linked to prevention occurs (Patton, 2015)
   Formative: Was the program effective? That is, did it achieve its intended objectives—did the program prevent the problem or condition for the clients involved? If so, how?
   Summative: Is the program worthwhile given the outcomes achieved? Given the pattern of outcomes, was the program worthwhile considering what was prevented?

Source: Adapted from Patton (2002) and Patton (2015).

Comparing and Contrasting Qualitative and Quantitative Evaluation
Approaches
Although all qualitative and quantitative evaluations have unique features, evaluation is very much about finding a
fit between methodologies and methods and the characteristics of a particular evaluation setting. It is worthwhile
summarizing some general features of typical qualitative and quantitative evaluations. Table 5.4 suggests an image
of qualitative and quantitative program evaluations that highlights the differences between the two approaches.
The features listed in the table for qualitative methods fit most closely with interpretivist approaches. Qualitative
approaches with positivist or postpositivist theoretical underpinnings will have fewer of these characteristics.

Table 5.4 Differences Between Qualitative and Quantitative Evaluation

Qualitative Evaluation Is Often Characterized by:
- Inductive approach to data gathering, interpretation, and reporting
- Holistic approach: looking for an overall interpretation for the evaluation results
- Verstehen: understanding the subjective lived experiences of program stakeholders (discovering their truths)
- Using natural language data sources in the evaluation process
- In-depth, detailed data collection
- Organizing/coding narratives to address evaluation questions
- Use of case studies
- The evaluator as the primary measuring instrument
- A naturalistic approach: does not explicitly manipulate the setting but instead evaluates the program as is

Quantitative Evaluation Is Often Characterized by:
- Hypotheses and evaluation-related questions (often embedded in logic models) are tested in the evaluation
- Emphasis on measurement procedures that lend themselves to numerical representations of variables
- Representative samples of stakeholder groups
- Use of sample sizes with sufficient statistical power to detect expected outcomes
- Measuring instruments that are constructed with a view to making them reliable and valid
- Using statistical methods (descriptive and inferential statistics) to discern patterns that either corroborate or disconfirm particular hypotheses and answer the evaluation questions
- Understanding how social reality, as observed by the evaluator, corroborates or disconfirms hypotheses and evaluation questions
- Evaluator control and ability to manipulate the setting, which improves the internal validity, the statistical conclusions validity, and the construct validity of the research designs

Source: Davies & Dart (2005).

Within interpretive qualitative evaluations, emphasis is placed on the uniqueness of human experiences, eschewing
efforts to impose categories or structures on experiences, at least until they are fully rendered in their own terms.
This form of qualitative program evaluation tends to build from these experiences upward, seeking patterns but
keeping an open stance toward the new or unexpected. The inductive approach starts with “the data,” namely,
narratives, direct and indirect (unobtrusive) observations, interactions between stakeholders and the evaluator,
documentary evidence, and other sources of information, and then constructs an understanding of the program and
its effects. Discovering the themes in the data, weighting them, verifying them with stakeholders, and finally,
preparing a document that reports the findings and conclusions are part of a holistic approach to program
evaluation. A holistic approach entails taking into account and reporting different points of view on the program,
its leadership, its operations, and its effects on stakeholders. Thus, an evaluation is not just conducted from the
program manager’s or the evaluator’s standpoint but takes into account beneficiaries’ and other stakeholders’

viewpoints. Later in this chapter, we will provide further suggestions for structuring a qualitative evaluation
project.
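
As a small illustration of the mechanical tail end of this inductive process, once excerpts have been labeled with emergent themes, a simple cross-tabulation of themes by stakeholder group can help surface patterns for the evaluator to weight and verify with stakeholders. This is our own sketch with hypothetical themes, groups, and excerpts; the interpretive work of deriving the themes is not something code can do.

from collections import defaultdict

# Hypothetical excerpts already labeled by the evaluator with emergent themes
coded_excerpts = [
    {"group": "clients", "theme": "feeling heard"},
    {"group": "clients", "theme": "long waits"},
    {"group": "clients", "theme": "feeling heard"},
    {"group": "staff", "theme": "workload pressure"},
    {"group": "staff", "theme": "long waits"},
    {"group": "managers", "theme": "reporting burden"},
]

# Cross-tabulate themes by stakeholder group
tally = defaultdict(lambda: defaultdict(int))
for excerpt in coded_excerpts:
    tally[excerpt["theme"]][excerpt["group"]] += 1

for theme, by_group in sorted(tally.items()):
    breakdown = ", ".join(f"{g}: {c}" for g, c in sorted(by_group.items()))
    print(f"{theme} -> {breakdown}")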

Although qualitative methods can form part of randomized controlled trial (RCT) designs (Lewin, Glenton,
& Oxman, 2009), qualitative methods are often used as part of naturalistic evaluation designs; that is, they do not
attempt to control or manipulate the program setting. Within naturalistic designs, the evaluator works with the
program as it is and works with stakeholders as they interact with or perform their regular duties and
responsibilities in relation to the program or with each other. Naturalistic also means that natural language is used
by the evaluator—the same words that are used by program stakeholders. There are no separate “languages of
research design or measurement,” for example, and usually no separate language of statistics.

Within qualitative evaluations based on an interpretivist approach, the evaluators themselves are the principal
measuring instrument. There is no privileged perspective in such an evaluation. It is not possible for an evaluator
to claim objectivity; instead, the subjectivity of the evaluator is acknowledged and the trustworthiness (credibility)
of the evaluator is emphasized (Patton, 2015). Evaluator observations, interactions, and renderings of narratives
and other sources of information are a critical part of constructing patterns and creating an evaluation report. A
principal means of gathering data is face-to-face interviews or conversations. Mastering the capacity to conduct
interviews and observations, while recording the details of such experiences, is a key skill for qualitative program
evaluators. We will summarize some guidelines for interviewing later in this chapter.

For quantitative evaluators or evaluations using qualitative methods within a postpositivist approach, the aim is for
the evaluators to put into place controls that ensure their findings are credible according to accepted research
design, measurement, and statistical criteria and contribute to testing hypotheses or answering evaluation
questions, which generally reflect a limited number of possible stakeholder perspectives.

Typically, a key evaluation question is whether the program produced/caused the observed and intended
outcomes; that is, was the program effective? Qualitative methods can be used to help answer this question;
however, the way that qualitative evaluators approach this task is typically quite different from quantitative
evaluators. Within quantitative evaluations, the logic of change underpinning this question tends to be linear and
is tied to the program logic model. Most qualitative evaluators eschew a linear cause-and-effect logic, preferring a
more holistic picture (Patton, 2015).

That is, influenced by their interpretive underpinnings, these evaluators aim to show how particular causal
relationships are embedded within complex networks of relationships in a specific space and time. The
International Development Research Centre’s outcome mapping for the evaluation of international development
projects (Earl, Carden, & Smutylo, 2001) is just one example of how qualitative methods avoid a linear cause-and-
effect logic while attempting to answer questions about the program’s effectiveness.

Outcome mapping is a process that can be used in complex programs to describe the changes that are expected
and then observed. Because development projects are typically large in scale and scope, outcome mapping aims to
document the performance of the program over time and then estimate whether the program contributed to the
development outcomes that were observed. Recall, quantitative evaluation is generally concerned with validity
and, in particular, with threats to statistical conclusions validity, internal validity, and construct validity that
would undermine the applications of methods that are intended to quantify the existence and significance of the
links between the program and the actual outcomes. Concerns with validity usually mean that quantitative
evaluators prefer having some control over the program design, program implementation, and the evaluation
process. A randomized controlled trial typically involves tight control over how the program is implemented,
including who the program’s clients are, how they are served by the program providers, and how those in the
control group are separated from the program group to avoid cross-contamination due to program–control group
interactions. Recall that for the body-worn cameras program implemented in Rialto, California, the same police
officers were included in both the program and control groups (the units of analysis were shifts instead of police
officers), and one effect of that was diffusion of the treatment (a construct validity threat) across all the patrol
officers in the department. Replications of the Rialto evaluation design in seven other U.S. cities ended up
replicating this diffusion effect (Ariel et al., 2017).

For qualitative evaluations, the emphasis on not manipulating the program setting often means that addressing
program effectiveness-related questions involves collecting and comparing perceptions and experiences of different
stakeholders and then coming to some overall view of whether and in what ways the program was effective.
Scriven (2008) has pointed out that observation is a frequently used method for gathering data that can be used to
discern causes and effects. Qualitative evaluations or evaluations that use mixed methods offer different ways for
evaluators to observe, interact, and construct conclusions using combinations of evidence and their professional
judgment.

Designing And Conducting Qualitative Program Evaluations
In Chapter 1 of this textbook, we introduced 10 steps that can serve as a guide to conducting an evaluability
assessment for a program evaluation. Although the 10 steps are helpful for both quantitative and qualitative
evaluations, there are other issues that need to be addressed if evaluators decide to go ahead with a project using
qualitative evaluation methods. Table 5.5 lists seven issues that we will discuss in more detail, which all bear on
when and how to use qualitative evaluation methodologies and methods. In discussing these issues, we provide
examples, including five qualitative research/evaluation studies from Canada, the United Kingdom, and the
United States.

Table 5.5 Qualitative Evaluation Design and Implementation Issues



1. Clarifying the evaluation purpose and questions

2. Identifying research designs and comparisons

3. Mixed-methods designs

4. Identifying appropriate sampling strategies

5. Collecting and coding qualitative data

6. Analyzing qualitative data

7. Reporting qualitative findings

1. Clarifying the Evaluation Purpose and Questions
To determine the purpose of the evaluation, you need to know its intended uses. As part of the evaluation planning
process, you and the sponsors of the evaluation will have determined some broad evaluation questions and issues.
Before conducting the evaluation, you need to make sure it is clear whether the evaluation is to be used to answer
formative evaluation questions (i.e., to improve the program) or summative evaluation questions (to render
judgments about the overall merit and worth of the program), or to answer both. Second, you need to establish if
specific evaluation questions will be determined in advance and, if so, to negotiate a written agreement around
these with the stakeholders.

It may be appropriate in an evaluation to begin qualitative data collection without a tightly fixed agenda, to learn
what the issues, concerns, and problems are from different perspectives so that an agenda or evaluation questions
can be established. Michael Scriven (1973) promoted the idea of goal-free evaluation, in which the evaluators
deliberately avoid focusing on intended program outcomes in order to elicit the range of actual outcomes (both
positive and negative) from stakeholders’ perspectives. The basic idea is to encourage the evaluator to see what has
actually happened (or not), without having the filter of program objectives in the way (Youker, Ingraham, &
Bayer, 2014).

However, a practical limitation on the use of unstructured approaches is their cost. Furthermore, evaluations are
usually motivated by issues or concerns raised by program managers or other stakeholders. Usually, evaluations are
commissioned by stakeholders with particular evaluation questions in mind. Those issues constitute a beginning
agenda for the evaluation process. The evaluator will usually have an important role in defining the evaluation
issues and may well be able to table additional issues. Nevertheless, it is quite rare for an evaluation client or clients
to support a fully exploratory (goal-free) evaluation.
2. Identifying Research Designs and Appropriate Comparisons
Qualitative data collection methods can be used in a wide range of research designs. Although they can require
considerable resources, from a pragmatic perspective, qualitative methods can be used as the primary means of
collecting data even in fully randomized experiments, where the data are compared and analyzed with the goal of
drawing conclusions around the program’s outcomes. More typically, the comparisons in program evaluations to
address questions about program effectiveness are not structured around experimental or even quasi-experimental
research designs. Instead, implicit designs are often used in order to create multiple lines of evidence for
triangulation.

Within-Case Analysis
Miles, Huberman, and Saldana (2014) indicate that two broad types of analysis are important, given that an
evaluator has collected qualitative data. One is to focus on single cases (perhaps individual clients of a program)
and conduct analyses on a case-by-case basis. These are within-case analyses. Think of a case as encompassing a
number of possibilities. In an evaluation of the Perry Preschool experiment (see Chapter 3), the individual
children in the study (program and control groups) were the cases. In the New Chance welfare-to-work
demonstration evaluation, mother–child pairs were the cases, and 290 were selected from the 2,322 families
participating in the demonstration (Zaslow & Eldred, 1998). In the U.K. Job Retention and Rehabilitation Pilot
(Farrell, Nice, Lewis, & Sainsbury, 2006), the clients were the cases, and 12 respondents from each of the three
intervention groups were selected, resulting in 36 cases. In the Troubled Families Program in Britain (Day,
Bryson, & White, 2016), families were the primary cases, although for some lines of evidence, local governments
were also cases.

In some evaluations, a “case” includes many individuals. For example, within an evaluation of mental health
services for severely emotionally disturbed youth who were involved in the juvenile justice system, a case
comprised “a youth, their parents, mental health professional and juvenile justice professional” (Kapp, Robbins, &
Choi, 2006, p. 26). A total of 72 interviews were completed, and these represented 18 cases. Cases are, in the
parlance of Chapter 4, units of analysis. When we select cases in a qualitative evaluation, we are selecting units of
analysis.

Cases can be described/analyzed in depth. In the Urban Change welfare-to-work evaluation, discussed later, the
evaluators presented three in-depth case studies that illustrated how adolescents were affected when their mothers
were required to participate in welfare-to-work programs (Gennetian et al., 2002). In case studies, events can be
reconstructed as a chronology. This is often a very effective way of describing a client’s interactions with a
program. Cases can also include quantitative data. Within the juvenile justice study, quantifiable information was
extracted from the sociodemographic form, as well as the interviews with all 72 respondents, and recorded in a
Statistical Package for the Social Sciences (SPSS) data set (Kapp et al., 2006).

In the juvenile justice study (Kapp, Robbins, & Choi, 2006), comparisons were also done within the cases.
Because cases can include multiple sources of data and multiple lines of evidence, it is possible to mine multi-
person cases for insights about how a program operates, how it affects different stakeholders, and even why
observed outputs and outcomes happened. Individual cases, because they can be presented as “stories” of how
clients, for example, interacted with a program, can be persuasive in an evaluation. We will come back to this issue
later in this chapter.

Between-Case Analysis
The second kind of comparison using cases is across cases. Commonly, evaluators compare within and across cases.
Selected (now adult) program participants in the Perry Preschool experiment, for example, were compared using
qualitative analysis (Berruetta-Clement, Schweinhard, Barnett, Epstein, & Weikart, 1984). Each person’s story
was told, but his or her experiences were also aggregated into between-group comparisons: men versus women, for example. Longitudinal studies add another layer of complexity. In the evaluation of the Pathways to Work pilot,
the researchers aimed to interview each of the 24 evaluation participants three times, at 3-month intervals (Corden
& Nice, 2007). To examine change, they created common “baselines” against which they assessed subsequent
change or absence of change (Corden & Nice, 2007). In the U.K. Job Retention and Rehabilitation Pilot, the
evaluators first analyzed individual cases and then compared across cases (Lewis, 2007). Because this evaluation
collected longitudinal data, the evaluators also had to conduct comparisons across time. Overall, Lewis and her
colleagues analyzed the data in seven different ways, including repeat cross-sectional analysis involving examining
how individuals changed between interviews and individual case narratives that aimed to capture the essence of the
journey the client had travelled.

Cases can be compared across program sites. Urban Change was implemented in multiple neighborhoods in
Cleveland and Philadelphia, so evaluators chose approximately 15 women from three neighborhoods in each city
and then compared client experiences across areas (Gennetian et al., 2002).
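To make the distinction between within-case and between-case analysis concrete, here is a minimal sketch in Python. The case records, sites, and themes are entirely hypothetical and are not drawn from the studies cited above; the point is simply that a "case" can be summarized on its own terms and then aggregated across subgroups, in the spirit of the men-versus-women comparisons described above.

```python
from collections import Counter, defaultdict

# Hypothetical coded case records: each case is a unit of analysis
# (e.g., a program participant) with attributes and coded themes.
cases = [
    {"case_id": "P01", "site": "Cleveland",    "gender": "female",
     "themes": ["employment_gain", "childcare_strain"]},
    {"case_id": "P02", "site": "Cleveland",    "gender": "male",
     "themes": ["employment_gain", "program_support_valued"]},
    {"case_id": "P03", "site": "Philadelphia", "gender": "female",
     "themes": ["childcare_strain", "transport_barrier"]},
    {"case_id": "P04", "site": "Philadelphia", "gender": "male",
     "themes": ["transport_barrier"]},
]

def within_case_summary(case):
    """Within-case analysis: describe a single case in its own terms."""
    return f'{case["case_id"]} ({case["site"]}, {case["gender"]}): ' \
           + ", ".join(case["themes"])

def between_case_counts(cases, group_by):
    """Between-case analysis: aggregate theme frequencies by subgroup."""
    counts = defaultdict(Counter)
    for case in cases:
        counts[case[group_by]].update(case["themes"])
    return counts

for case in cases:                                   # case-by-case view
    print(within_case_summary(case))

for group, counts in between_case_counts(cases, "gender").items():
    print(group, dict(counts))                       # cross-case, by gender

for site, counts in between_case_counts(cases, "site").items():
    print(site, dict(counts))                        # cross-case, by site
```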
3. Mixed-Methods Evaluation Designs
Mixed methods refer to evaluation designs that use both qualitative and quantitative sources of data. We can
think of mixed-methods evaluations as incorporating multiple lines of evidence. Bamberger et al. (2012) specify
the additional requirement that such designs must incorporate methods or theories from two or more disciplines.
Johnson (2017) posits, “It turns out that some researcher/practitioners find many positive features in more than
one paradigm” (p. 156, emphasis in original). For Bryman (2009), Creswell (2009), and Johnson and
Onwuegbuze (2004), mixed-methods designs are based implicitly on philosophical pragmatism, the working
assumption being that a design that combines qualitative and quantitative methods in situationally appropriate
ways can provide a richer, more credible evaluation than one that employs either qualitative or quantitative
methods alone.

Creswell’s (2009) framework for identifying and categorizing different mixed-methods designs has been influential
within the evaluation literature (see Bamberger et al., 2012). His framework categorizes mixed methods on the
basis of four factors—(1) timing, (2) weighting, (3) mixing, and (4) theorizing (see Table 5.6)—resulting in the
identification of a variety of mixed-methods strategies. If we look at Table 5.6, we can see, for example, that a
concurrent collection of qualitative and quantitative data is usually coupled with equal weighting of the data
sources and a subsequent integration of the two broad sources of data. The overall findings can be a part of explicit
or implicit efforts to construct explanatory conclusions. The same horizontal reading can be applied to the other
two rows in Table 5.6.

Table 5.6 Creswell’s Mixed-Methods Framework for Combining Qualitative and Quantitative Data

Timing of the Collection of Qualitative and Quantitative Data | Weighting of the Qualitative and Quantitative Data | Mixing Qualitative and Quantitative Data | Theorizing (Explanation)
No sequence—concurrent | Equal | Integrating | Explicit
Sequential—qualitative first | Qualitative | Connecting | Explicit
Sequential—quantitative first | Quantitative | Embedding | Implicit

Source: Creswell (2009, p. 207).

Timing refers to whether the qualitative and quantitative data will be collected at the same time (concurrently) or
collected sequentially. A very common approach is to collect qualitative and quantitative data at the same time
through a survey containing both closed-ended and open-ended questions. In an evaluation where the researcher
collects the qualitative data first, the aim is usually to explore the topic with participants first and then later collect
data from a larger (usually representative) sample that includes quantitative measures of constructs. Initial
qualitative research, including interpreting documents and interviewing stakeholders, can also be used to develop
the logic models that will form the basis of a program evaluation. In contrast, where qualitative data are collected
after quantitative data, the aim is usually to explore unexpected or puzzling quantitative findings. Qualitative
research may also be used after quantitative research has been completed to help determine how suggested changes
may be implemented.

Weighting refers to the priority given the qualitative methods or the quantitative methods within the evaluation. In
an experimental design, priority is usually given to the quantitative findings, and the qualitative research plays a
supportive, case-specific explanatory role. In the juvenile justice study (Kapp et al., 2006), interviews with the 72
respondents were the main source of qualitative data, but these were complemented by quantifiable information extracted from the sociodemographic profiles of the participants.

Mixing refers to when and how the analyst brings the qualitative and quantitative data/lines of evidence together.
Reviews frequently find that researchers do this poorly and that a common problem is failure to adequately utilize
the qualitative lines of evidence or failure to include the qualitative team at all stages of the research (Gardenshire
& Nelson, 2003; Lewin et al., 2009). Mixing can occur at any of the following three stages: (1) data collection, (2)
analysis, or (3) interpretation. At one extreme, the qualitative and quantitative data can be combined into one data
set (integrating), while at the other extreme, the two types of data can be kept completely separate at all stages.

Theorizing focuses on the ways that social science theories or other lenses (e.g., participatory or empowerment
evaluation) can frame a project. Creswell (2009) points out that theoretical lenses can be explicit—that is,
acknowledged as part of the research—or implicit: “All researchers bring theories, frameworks and hunches to
their inquiries, and those theories may be made explicit in a mixed methods study, or be implicit and not
mentioned” (p. 208).

One example of a mixed-methods approach is the U.K. Job Retention and Rehabilitation Pilot (Farrell et al.,
2006; Purdon et al., 2006). This pilot collected quantitative data (including administrative and survey data), as
well as longitudinal qualitative data. Quantitative and qualitative data were originally kept separate at all stages,
with separate teams gathering, analyzing, and reporting the data. Results were presented in stand-alone reports
that referenced each other (Farrell et al., 2006; Purdon et al., 2006).

A similar approach was used in the multi-stage evaluation of the Troubled Families Programme in Britain (Day et
al., 2016). In that evaluation, the main program outcomes were aimed at reducing the family-related problems
(family violence, lack of education, lack of employment, criminal justice encounters) that had been hypothesized
to be the root of the problems that resulted in riots in British cities in 2011.

As part of the evaluation, a qualitative evaluation that focused on a sample of 22 families in 10 local authorities
(local governments) was conducted. A total of 62 persons were included in the study, and 79 interviews were
conducted overall. Key to this evaluation was learning what the experiences of these families were with the
program. Five areas were explored in the interviews: awareness and initial engagement in the program; assessment
and identification of needs; family experiences with the intervention; key features of the family intervention; and,
finally, family experiences of changes since being involved in the program (Blades, Day & Erskine, 2016).

Each intervention lasted from 12 to 18 months, and by the end of it, nearly all of the families reported “some
degree of improvement in their circumstances, and specifically in relation to the problem issues at the start of the
intervention” (Blades, Day, & Erskine, 2016, p. 4). These positive findings contrasted with the generally negative
findings (no overall change) that were reported from the analysis of quantitative (secondary data) lines of evidence
(Bewley, George, Rienzo, & Portes, 2016).

In reconciling these contrasting findings in the Synthesis Report, Day et al. (2016) concluded that

the evaluation has presented a mixed picture with regard to the effectiveness and impact of the Troubled
Families Programme. As we have discussed throughout this report, the investment of £448 million in
developing family intervention provision across England provided an important opportunity to boost
local capacity and to expand the workforce across all 152 local authorities. The programme clearly raised
the profile of family intervention country-wide, and transformed the way services were being developed
for families in many areas. These achievements did not translate into the range and size of impacts that
might have been anticipated, however, based on the original aspirations for the programme. (Day et al.,
2016, pp. 80–81)

Even more succinctly, they concluded there was a “lack of evidence of any systemic or significant impact found by
the evaluation on the primary outcome measures for the programme” (Day et al., 2016, p. 81, emphasis added). In
this program evaluation, reconciling the lines of evidence went in favor of the quantitative sources of data.

In his mixed-methods approach, Creswell (2009) argues that there are six primary mixed-methods strategies. We
will highlight some details from the three that are most relevant to evaluations. One of the most common is a
sequential explanatory design, where the quantitative data are collected and analyzed prior to collecting and analyzing
the qualitative data. Typically, this approach involves giving greater weight to the quantitative methods and is used
when evaluators want to use qualitative methods to explore/explain puzzling findings that emerged within the
quantitative analysis. The New Hope program was designed to supplement the incomes of low-income people
living in two high-poverty areas of Milwaukee, and was an RCT pilot program. Within the New Hope program
evaluation, qualitative research was used to help explain perplexing findings in the survey and administrative data
(Gibson & Duncan, 2000; Miller et al., 2008). Participants in the treatment group were eligible for a range of
additional assistance, and analysts were perplexed by the wide variation in the rates at which people took
advantage of specific services. Contrary to the evaluators’ initial assumption that participants would use the entire
package of benefits, most made selective use of the benefits. Subsequent ethnographic research involving a sample
of 46 families, half from the program group and half from the control group, showed that differences in
perspectives regarding the benefits (e.g., in how people weighed the burden of longer work hours against the
income supplements and whether they considered the community service job option demeaning) helped account
for their patterns of service take-up (Gibson & Duncan, 2000; Miller et al., 2008).

A sequential exploratory design, which involves collecting the qualitative data first, is also very common. Typically,
qualitative data are collected from stakeholders to identify questions and issues that then drive more systematic
quantitative data collection strategies. Qualitative data play a supporting role in such designs. It is, of course,
possible to use qualitative data to design survey instruments that include both quantitative (closed-ended) and
qualitative (open-ended) questions. The open-ended responses can later be analyzed thematically to provide more
detailed information about evaluation-related questions.

Perhaps the most common approach is a concurrent triangulation approach. When using this strategy, the
evaluators collect both qualitative and quantitative data concurrently and compare the data sets to determine the
degree of convergence (Creswell, 2009). For example, you may use program management data, survey data, and
key informant interviews. The basic idea of this approach is that qualitative and quantitative lines of evidence are
complementary and, when used together, strengthen the overall evaluation design. Creswell (2009) says, “This
model generally uses separate quantitative and qualitative methods as a means to offset the weaknesses inherent
within one method with the strengths of the other” (p. 213). Findings that are consistent across multiple sources
are considered much more reliable than findings based on just one data source. Many program evaluations are
conducted using implicit research designs (XO), after the program is implemented (see Chapter 3). Implicit
research designs are not related to implicit theorizing, as indicated in Table 5.6. In implicit research designs, there
are no comparison groups, and there may not even be a before–after comparison for the program group. Mixed
methods may strengthen implicit designs. Qualitative methods can be used: to develop a better understanding of
the program theory and the program context; to assess the quality of the program intervention; to understand
contextual factors at different intervention sites; and to understand how cultural characteristics of the target
populations may have affected implementation. Mixed-methods evaluation designs and the triangulation
approach, in particular, have become a central feature of evaluation practice in governmental and nonprofit
settings.
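As a rough illustration of what "determining the degree of convergence" can look like when lines of evidence are tabulated, the following Python sketch records the broad direction of each line of evidence for each evaluation question and labels the pattern. The questions, evidence labels, and findings are invented for illustration; in practice this comparison rests on professional judgment rather than a mechanical rule.

```python
# Hypothetical triangulation matrix: for each evaluation question, the broad
# direction of the finding from each line of evidence ("+", "-", or "mixed").
lines_of_evidence = {
    "Did participation improve employment outcomes?": {
        "program_admin_data": "+",
        "participant_survey": "+",
        "key_informant_interviews": "mixed",
    },
    "Was the intake process accessible?": {
        "program_admin_data": "-",
        "participant_survey": "-",
        "key_informant_interviews": "-",
    },
}

def convergence(findings):
    """Label the degree of convergence across lines of evidence."""
    directions = set(findings.values())
    if len(directions) == 1:
        return "convergent"
    if "mixed" in directions and len(directions) == 2:
        return "partially convergent"
    return "divergent"

for question, findings in lines_of_evidence.items():
    print(f"{question} -> {convergence(findings)}")
```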

In concurrent triangulation evaluation designs, where different lines of evidence have been gathered and will
ultimately be compared in relation to the evaluation questions driving the project, it is possible for lines of
evidence to yield inconsistent and even contradictory findings. In the Troubled Families Program, for example,
the quantitative lines of evidence (statistical analysis of secondary data sources) suggested that overall the program
did not make much of a difference for the families involved. Key outcome measures when compared between
families in the program and matched families not in the program were not significantly different. But the
qualitative interviews with a sample of families in the program to explore their own perceptions of program impacts
indicated that, subjectively, the program had made important differences.

How to reconcile those two sets of findings? In the Troubled Families Program, given its high-stakes profile and
its national scope, the resolution was not straightforward, and many articles followed the original evaluation (see Sen & Churchill, 2016). Advocates for the qualitative findings objected strongly to the quantitative, summative
conclusions that the program was not effective. The debate over the program became a political issue that moved
the resolution away from any efforts to reconcile the findings methodologically (e.g., see Crossley & Lambert,
2017).

Most program evaluations are not that high stakes, so resolving inconsistent or contradictory findings comes down
to several strategies or combinations of them. First, the evaluation team can review the methodologies involved for
the lines of evidence in question and, if there are differences in the robustness of methods, use that information to
weight the findings. Second, program logic models, which typically are embedded in program theories, have
constructs and intended linkages that are usually informed by what is known about the appropriateness of a
program design. When evaluators encounter contradictory findings, how do those align with the expectations in
relevant program theory? It may be possible to resolve differences that way. Third, consistent with what we have
been saying so far in this textbook, reflective evaluators gain practical experience over time, and this is an asset in
interpreting lines of evidence. Ideally, a team of evaluators that are involved in a project would review and discuss
inconsistent or contradictory findings and use their professional judgment to weight lines of evidence. We will say
more about professional judgment in Chapter 12.
4. Identifying Appropriate Sampling Strategies in Qualitative Evaluations
Qualitative sampling strategies generally include deliberately selecting cases, an approach referred to as purposeful
sampling or theoretical sampling. Contrast this approach with a quantitative evaluation design that emphasizes
random samples of cases. Understanding sampling in qualitative methods is complicated by the fact that the
literature describes many different strategies, and there is little consistency in the terminology used.

In qualitative evaluations using interviews, the total number of cases sampled is usually quite limited, but in recent
years, many government-sponsored evaluations have used relatively large samples. For example, the Pathways to
Work pilot in the United Kingdom included a qualitative longitudinal study with three cohorts totaling 105
individuals, and more than 300 interviews (Corden & Nice, 2007). The New Chance evaluation involved
qualitative research with 290 mother–child pairs (Zaslow & Eldred, 1998). However, smaller samples of less than
40 are more common.

Table 5.7 is a typology of purposeful sampling strategies developed by qualitative researchers and evaluators. This
list is drawn from Miles and Huberman (1994, p. 28), Miles et al. (2014, p. 32), and Patton (2015, pp. 277–
287). Random probability sampling strategies that are used in quantitative research can also be used in qualitative
research, but these are not repeated in Table 5.7.

Table 5.7 Purposeful Sampling Strategies for Qualitative Evaluations

Type of Purposeful Sampling: The Purpose of This Type of Sampling Is

Comprehensive sampling: Selecting all the cases in a population to ensure that every possible instance of the phenomenon is included—this approach is resource-intensive
Maximum variation: To deliberately get a wide range of variation on characteristics of interest; documents unique, diverse, or common patterns that occur across variations
Homogeneous: To focus and simplify the study and facilitate group interviewing; used where one stakeholder perspective is central to the evaluation purposes
Reputational case: Picking cases based on input from an expert or key participant
Critical case: To highlight important cases or those that make a point dramatically; permits logical generalization and application to other cases—that is, if it is true in this case, then it is likely to be true in all other cases
Theoretical sampling: To test theory and to test or confirm/disconfirm the importance of emerging patterns; to test emerging concepts or theories (used in grounded theory approaches that build generalizations from case studies) or choosing cases as examples of theoretical constructs
Snowball or chain: To identify information-rich cases; well-situated people are asked who would be a good source, or current informants may be asked to identify further informants—this can be combined with reputational sampling
Extreme or deviant case: To elucidate a phenomenon by choosing extreme cases, such as notable successes or failures
Intensity: To seek rich but not extreme examples of the phenomenon of interest; similar logic to extreme sampling, but highly unusual cases are not selected
Typical case: Often, to describe a program to people not familiar with it; knowledgeable staff or participants are used to identify who or what is typical
Politically important cases: To attract additional attention to the study or to avoid it; cases that are politically sensitive are selected or avoided
Stratified purposeful: To ensure that there are cases from strategically important groups across which comparisons will be made; the population is divided into strata (e.g., socioeconomic status, gender, or race), and a second purposeful strategy, such as typical case, is used to select cases within each stratum
Quota sampling: Dividing up a population into major subgroups (strata) and picking one or more cases from each subgroup
Criterion: Used for quality assurance or audit of program or agency case records; all cases that meet certain criteria are chosen, for example, all those who declined treatment
Opportunistic: To take advantage of unexpected opportunities; involves making decisions about sampling during the data collection process based on emerging opportunities; this can overlap with snowball sampling
Convenience: To make sampling inexpensive and easy; this sampling method has the poorest rationale and the lowest credibility
Mixed-sampling strategy: To meet stakeholders’ multiple needs and interests; multiple purposeful strategies are combined or purposeful strategies are combined with random sampling strategies

Among the strategies identified in Table 5.7, several tend to be used more frequently than others. One of these is
snowball or chain sampling, which relies on a chain of informants who are themselves contacted, perhaps
interviewed, and asked who else they can recommend, given the issues being canvassed. This sampling strategy can
be combined with the reputational sampling approach included in Table 5.7.

Although snowball sampling is not random and may not be representative, it usually yields uniquely informed
participants. In a qualitative study of stakeholder viewpoints in an intergovernmental economic development
agreement, the 1991–1996 Canada/Yukon Economic Development Agreement (McDavid, 1996), the evaluator
initially relied on a list of suggested interviewees which included public leaders, prominent business owners, and
the heads of several interest group organizations (e.g., the executive director of the Yukon Mining Association).
Interviews with those persons yielded additional names of persons who could be contacted, some of whom were
interviewed and others who were willing to suggest further names (McDavid, 1996). One rough rule of thumb to
ascertain when a snowball sample is “large enough” is when you reach “saturation”—that is, when themes and
issues begin to repeat themselves across informants.
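One simple way to make the saturation rule of thumb operational is to track how many new themes each additional interview contributes and to watch for the count to flatten out. The Python sketch below assumes a hypothetical set of coded interviews; the theme labels are invented.

```python
# Hypothetical themes coded from each successive snowball interview,
# in the order the interviews were conducted.
interviews = [
    {"funding_gaps", "permitting_delays"},
    {"funding_gaps", "training_needs"},
    {"permitting_delays", "training_needs"},
    {"training_needs"},
    {"funding_gaps"},
]

seen = set()
for i, themes in enumerate(interviews, start=1):
    new_themes = themes - seen          # themes not raised in earlier interviews
    seen |= themes
    print(f"Interview {i}: {len(new_themes)} new theme(s) {sorted(new_themes)}")

# When several consecutive interviews add no new themes, that is one signal
# (not proof) that thematic saturation has been reached.
```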

A study of severely emotionally disturbed youth involved in the justice system used a form of typical case
sampling, with researchers and staff at community mental health centers using “their special knowledge of
juveniles involved in both systems to select subjects who represent this population” (Kapp et al., 2006, p. 24).
Opportunistic sampling takes advantage of the inductive strategy that is often at the heart of qualitative interviewing. An evaluation may start out with a sampling plan in mind (picking cases that are representative of
key groups or interests), but as interviews are completed, a new issue may emerge that needs to be explored more
fully. Interviews with persons connected to that issue may need to be conducted.

Mixed-sampling strategies are common. Within the Job Retention and Rehabilitation Pilot, the researchers used
a combination of stratified purposeful and maximum variation sampling strategies (Farrell et al., 2006). First,
they selected 12 respondents from each of the different intervention groups and nine service providers from four
of the six pilot locations. Second, they sought to ensure that the final samples reflected “diversity in sex, age,
occupation, employer type, industry sector, length of time off sick” (p. 150) and other characteristics. In pursuing
mixed strategies, it is important to be able to document how sampling decisions were made. One of the criticisms
of some qualitative sampling instances is that they have no visible rationale—they are said to be drawn
capriciously, and the findings may not be trusted (Barbour, 2001). Even if sampling techniques do not include
random or stratified selection methods, documentation can blunt criticisms that target an apparent lack of a
sampling rationale. Public auditors, who conduct performance audits, routinely use qualitative sampling strategies
but, in doing so, are careful to document who was sampled and the rationale for including interviewees (American
Institute of Certified Public Accountants, 2017).
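As a minimal sketch of how a stratified purposeful selection might be drawn and documented, the following Python code groups a hypothetical sampling frame by intervention group, selects up to a fixed quota per stratum, and records the selection rule. The quota, group names, and case IDs are assumptions for illustration, not features of the Job Retention and Rehabilitation Pilot.

```python
import random
from collections import defaultdict

random.seed(42)  # so the documented selection can be reproduced

# Hypothetical sampling frame: candidate interviewees with stratum labels.
frame = [
    {"id": f"C{i:03d}", "intervention_group": group}
    for i, group in enumerate(
        ["health_support", "health_support", "health_support", "health_support",
         "workplace_support", "workplace_support", "workplace_support",
         "combined_support", "combined_support", "combined_support"],
        start=1)
]

strata = defaultdict(list)
for case in frame:
    strata[case["intervention_group"]].append(case)

QUOTA = 3  # illustrative quota per stratum
sample, log = [], []
for group, members in strata.items():
    chosen = random.sample(members, min(QUOTA, len(members)))
    sample.extend(chosen)
    log.append(f"Stratum '{group}': {len(members)} candidates, "
               f"{len(chosen)} selected (rule: up to {QUOTA} per stratum).")

print([c["id"] for c in sample])
print("\n".join(log))   # documentation that blunts "capricious sampling" critiques
```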
5. Collecting and Coding Qualitative Data

Structuring Data Collection Instruments


Qualitative data collection instruments used for program evaluations are structured to some extent. While
qualitative evaluations may include informal conversational interviews, it is very unusual to conduct interviews
without at least a general agenda of topics (topic guide). Certainly, additional topics can emerge, and the
interviewer may wish to explore connections among issues that were not anticipated in the interview plan. Given
the tight time frames associated with most program evaluations, however, standardized open-ended interview
guides that contain a list of pre-planned questions—linked to the evaluation questions that are being addressed—
are also commonly used. While qualitative interview guides may contain some closed-ended questions, they
predominately contain open-ended questions. When deciding on what mix of questions to use, you should ensure
you know “what criteria will be used [by primary intended users] to judge the quality of the findings” and choose
your instruments accordingly (Patton, 2002, p. 13).

The Job Retention and Rehabilitation Pilot evaluation (Farrell et al., 2006) is an example of a qualitative
evaluation that uses structured data collection instruments. The guide used in the interview process for this
evaluation took the form of a topic guide but had some unique features, including a detailed script on the first two
pages. Also, each section began with an “aim” so that the interviewer could understand the overarching logic that
tied together the subtopics. The complete guide was 10 pages long and began with questions on the personal
circumstances and background of the interviewees. Evaluators sometimes begin with these types of questions
because they are generally noncontroversial. However, a disadvantage to structuring an interview in this way is
that interviewees may find answering these questions tedious and quickly disengage. Furthermore, placing these
relatively closed-ended questions early in an interview may establish a pattern of short responses that will make it
difficult to elicit in-depth narratives later in the interview.

Structuring data collection instruments does have several limitations. By setting out an agenda, the qualitative
evaluator may miss opportunities to follow an interviewee’s direction. If qualitative evaluation is, in part, about
reconstructing others’ lived experiences, structured instruments, which imply a particular point of view on what is
important, can significantly limit opportunities to empathetically understand stakeholders’ viewpoints. For
example, an unstructured approach may be appropriate if one is involved in participatory evaluative work with
Indigenous peoples (Chilisa, 2012; Chilisa & Tsheko, 2014; Drawson, Toombs, & Mushquash, 2017; Kovach,
2018; LaFrance & Nicholas, 2010), where the topics and cultural awareness suggest cross-cultural methodological
issues that cannot be subsumed in structured data collection approaches.

Cost considerations often place limits on the extent to which unstructured interviews can be used, so a careful
balance must be found.

Conducting Qualitative Interviews


A principal means of collecting qualitative data is interviewing. Although other qualitative techniques are also used
in program evaluations (e.g., documentary reviews/analyses, open-ended questions in surveys, direct observations),
face-to-face interviews are a key part of qualitative data collection options. Table 5.8 summarizes some important
points to keep in mind when conducting face-to-face interviews. The advice in Table 5.8 is not exhaustive but is
based on the authors’ experiences of participating in qualitative interviews and qualitative evaluation projects. For
additional information on this topic, Patton (2003) includes sections in his Qualitative Evaluation Checklist that
focus on fieldwork and open-ended interviewing. Patton’s experience makes his checklists a valuable source of
information for persons involved in qualitative evaluations.

Table 5.8 Some Basics of Face-to-Face Interviewing


Preparations for Conducting Interviews

Consider issues of social and cultural diversity when wording your questions and choosing interview
locations.
Consider pre-testing your data collection instrument with a few participants so that you can
determine if questions are being misinterpreted or misunderstood.
Consider incorporating principles from postcolonial indigenous interviewing, including unique
“Indigenous ways of knowing” (Drawson et al., 2017).
View the interview as a respectful dialogue rather than a one-way extraction of information with the
interviewer in a position of authority.
Develop an appreciation and understanding of the cultural background of the population being
interviewed. For example, if the interview is being conducted on traditional First Nations territory, it
may be appropriate to acknowledge this fact with the person being interviewed. Also, it may be
appropriate to ask for community permission before arranging interviews.
“Elite interviews,” with those in relative positions of power, usually come with the expectation that interviewers will have conducted sufficient background research.
Consider having one team member conduct the interview while another takes notes or, if you can,
use a tape-/electronic recorder.

Conducting Interviews

Remind the interviewee how she or he was selected for the interview.
Tell the interviewee what degree of anonymity and confidentiality you can and will honor.
Project confidence, and be relaxed—you are the measuring instrument, so your demeanor will affect
the entire interview.
For various reasons, individuals may not be familiar with the expectations within an interview and
may need encouragement to speak freely.
Inform participants—make sure they understand why they are being interviewed, what will happen
to the information they provide, and that they can end the interview or not respond to specific
questions as they see fit (informed consent).
Cautious flexibility is essential—it is quite possible that issues will come up “out of order” or that
some will be unexpected, but you will also need to avoid going far beyond the primary evaluation
questions.
Listening (and observing) are key skills—watch for word meanings or uses that differ
from your understanding. It is important to listen carefully to ensure that the interviewee has
actually answered the question you asked. Watch for nonverbal cues that suggest follow-up questions
or more specific probes.
Ask for clarifications—do not assume that you know or that you can sort something out later.
Ask questions or raise issues in a conversational way.
Show you are interested but nonjudgmental. This is particularly important when asking about
sensitive or controversial topics. You can use wording that suggests the behavior in question is
common, such as “As you may be aware, many people abuse alcohol (definition: drinking more than
5 drinks at a time, 5 out of 7 days a week) as a way to cope with their stress” or wording that
assumes the behavior and asks how frequently it occurs (Paloma Foundation & Wellesley Institute,
2010, p. 87).
Look at the person when asking questions or seeking clarifications, but be mindful of the cultural
appropriateness of eye contact.
Pace the interview so that it flows smoothly and you get at the questions that are the most important
for the evaluation.
Note taking is hard work: The challenge is to take notes, listen, and keep the conversation moving.
Note key phrases, knowing that after the interview you will review your notes and fill in gaps.
Pay attention to the context of the interview—are there situational factors (location of the interview,
interruptions, or interactions with other people) that need to be noted to provide background information as qualitative results are interpreted?

Immediately After the Interview

Label and store your recordings with ID numbers (or pseudonyms), as well as the interview date and
time. It is essential to create duplicate copies of audiotapes or back-up electronic recordings on your
computer.
Remember to keep all records secure, including password protection.
Your recall of a conversation decays quickly, so if you have not used a tape-/electronic recorder, you
should write up your notes immediately after the interview and fill in details that you did not have
time to record. In a few days (or as soon as you have done the next interview), you will have
forgotten important details.

6. Analyzing Qualitative Data
One of the wonderful, yet challenging, aspects of qualitative research is the vast volume of data that it generates.
In the Job Retention and Rehabilitation Pilot, the evaluators had 197 interviews to analyze (Farrell et al., 2006;
Lewis, 2007). Table 5.9 offers some suggestions about analyzing qualitative data, again, principally from face-to-
face interviews. As Patton (2003) reiterates in the Qualitative Evaluation Checklist, it is important that the data
are effectively analyzed “so that the qualitative findings are clear, credible, and address the relevant and priority
evaluation questions and issues” (p. 10). In most evaluations, the data analysis is the responsibility of the
evaluation team, but in participatory approaches, clients may be included in the process of analyzing the results
(see Jackson, 2008).

Table 5.9 Helpful Hints as You Analyze Qualitative Data


Getting Started

Recall why you conducted the interviews and how the interviews fit into the program evaluation. Related to this process is whether the interviews are a key line of evidence or a supplementary line of evidence.
Creswell’s (2009) options for mixed-methods designs suggest how qualitative data can be positioned in the overall evaluation.
What specific evaluation issues were you anticipating could be addressed by the interview data?
Does each section of your interview instrument address a particular evaluation issue? If so, you may begin by organizing responses within each section. If not, can you identify which sections address which evaluation issues?

Working With the Data

Your first decision is whether to transcribe the data. Transcription is expensive, with each hour of interview generating 3 to 6 hours of transcription work. Use of voice recognition software can reduce the tedium of transcription and can save time but has some limitations (Bokhove & Downey, 2018). Large-scale evaluations typically involve full transcription, so that teams can ensure the accuracy and completeness of the interview data. Within smaller scale and budget evaluations, it is often not practical to fully transcribe interviews.
If you have tape-/electronically recorded the interviews, you should listen to the recordings as you review your interview notes to fill in or clarify what was said.
If you choose to fully transcribe your interviews, you will need to decide what level of detail you want transcribed. Lapadat (2000) argues that you need to be clear about the purpose of your transcription. Evaluations that are focused on understanding interviewees’ emotions may need to capture details of speech patterns and intonations, whereas evaluations primarily focused on capturing factual data may require a relatively clean transcript in which false starts, fillers, and intonation are omitted (Lapadat, 2000).
Your next decision is whether to use pen/paper/note cards or qualitative software or a spreadsheet to support your coding. Qualitative evaluators rarely rely on pen/paper/note cards anymore. Typically, they use computer-assisted qualitative data analysis software (CAQDAS) to help with organizing and coding their data (Miles et al., 2014). Many evaluators have used Ritchie, Spencer, and O’Connor’s (2003) influential matrix-based thematic framework (the Framework) for summarizing and organizing data. The framework has been updated in the more recent book by Ritchie, Lewis, Nicholls, and Ormston (2013), with guidance for using CAQDAS with various complementary types of software.
One approach to qualitative data analysis is to use predetermined themes or categories (Preskill & Russ-Eft, 2005). This approach is appropriate if there is pre-existing research that allows you to determine what the likely categories will be. If you choose this approach, you need to determine precise definitions for each theme or category.
Within most qualitative evaluations, the themes or categories are at least partly determined by the data. At the same time, because data collection instruments within program evaluations are generally quite structured and evaluation questions have been determined in advance, the evaluator usually has a good starting point for developing themes.
For most qualitative evaluators, the first step is to familiarize themselves with the data by immersing themselves in reading transcripts, listening to recordings, and reading observational notes. During this stage, the analyst jots down ideas for possible themes—as penciled marginal notes or as electronic memos. There is a balance between looking for themes and categories and imposing your own expectations. When in doubt, look for evidence from the interviews. Pay attention to the actual words people have used—do not put words in interviewees’ mouths.
Thematic analysis can be focused on identifying words or phrases that summarize ideas conveyed in interviews. For example, interviews with government program evaluators to determine how they acquired their training identified themes such as university courses, short seminars, job experience, and other training. A succinct discussion of the process of thematic coding of qualitative data can be found in Thomas (2006).

Coding the Data: Identifying and Confirming Themes

Which of the preliminary themes still make sense? Which ones are wrong? What new themes emerge? What are the predominant themes? Think of themes as ideas: They can be broad (in which case lots of different sub-themes would be nested within each theme), or they can be narrow, meaning that there will be lots of them.
Are your themes different from each other? (They should be different.)
Have you captured all the variation in the interviews with the themes you have constructed?
How will you organize your themes? Alternatives might be by evaluation issue/question or by affect—that is, positive, mixed, negative views of the issue at hand.
List the themes and sub-themes you believe are in the interviews. Give at least two examples from the interviews to provide a working definition of each theme or subtheme.
Read the interviews again, and this time, try to fit the text/responses into your thematic categories.
If there are anomalies, adjust your categories to take them into account.
There is almost always an “other” category. It should be no more than 10% of your responses/coded information.
Could another person use your categories and code the text/responses approximately the way you have? Try it for a sample of the data you have analyzed. Calculate the percentage of agreements out of the number of categorizations attempted. This is a measure of intercoder reliability (Miles & Huberman, 1994; Miles et al., 2014); a small worked example follows this table.
For the report(s), are there direct quotes that are appropriate illustrations of key themes?
Coding of the data gathered for the Job Retention and Rehabilitation Pilot (Farrell et al., 2006) was based on the Framework approach (Ritchie et al., 2003) and was described as follows:

The first stage of analysis involves familiarization with the data generated by the interviews and
identification of emerging issues to inform the development of a thematic framework. This is a series of
thematic matrices or charts, each chart representing one key theme. The column headings on each chart
relate to key sub-topics, and the rows to individual respondents. Data from each case is then
summarized in the relevant cell … the page of the transcript … noted, so that it is possible to return to
a transcript to explore a point in more detail or extract text for a verbatim quotation… . Organising the
data in this way enables the views, circumstances and experiences of all respondents to be explored
within a common analytical framework which is both grounded in, and driven by, their accounts. The
thematic charts allow for the full range of views and experiences to be compared and contrasted both
across and within cases, and for patterns and themes to be identified and explored. The final stage
involves classificatory and interpretative analysis of the charted data in order to identify patterns,
explanations and hypotheses. (Farrell et al., 2006, p. 150)

To illustrate, in the Job Retention and Rehabilitation study, the thematic coding and analysis resulted in tables
being produced. In Table 5.10, the rows represent persons interviewed, and the columns represent the sub-
themes. Tables were produced for each of the four key themes identified in the study: (1) background of
participants; (2) going off sick, entry into the program, and returning to work; (3) uses of the program; and (4)
impacts of the program and other activities.

Table 5.10 Thematic Coding Chart Example

Source: Adapted from Lewis, J. (2007). Analysing qualitative longitudinal research in evaluations. Social Policy and
Society, 6(4), 545–556.

Table 5.10 represents a small part of the full chart for one of the four key themes, with several sub-themes and
several interviewees. In addition to tables for each key theme, one overall summary table was produced that
focused on the key themes across all the respondents.
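To suggest what one of these thematic charts can look like in miniature, the Python sketch below builds a matrix with respondents as rows and sub-themes as columns, each cell holding a brief summary plus a transcript page reference so the analyst can return to the source, as the Framework approach describes. The respondent IDs, sub-theme labels, and summaries are hypothetical.

```python
# One thematic chart: rows are respondents, columns are sub-themes, and each
# cell is a short summary with a pointer back to the transcript page.
chart = {
    "R01": {
        "reasons_for_absence": ("Back injury at work", "p. 3"),
        "route_into_program":  ("Referred by GP", "p. 5"),
    },
    "R02": {
        "reasons_for_absence": ("Stress-related leave", "p. 2"),
        "route_into_program":  ("Self-referred after seeing a leaflet", "p. 6"),
    },
}

sub_themes = ["reasons_for_absence", "route_into_program"]

# Print the chart so summaries can be compared within and across cases.
print("respondent | " + " | ".join(sub_themes))
for respondent, cells in chart.items():
    row = [f"{cells[t][0]} ({cells[t][1]})" if t in cells else "" for t in sub_themes]
    print(f"{respondent} | " + " | ".join(row))
```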

For example, under the key theme “Going off sick, entry to program, return to work,” there were 10 sub-themes,
including reasons behind sickness absence, how sick leave occurred, and expectations of the duration of sick leave
(Farrell et al., 2006, pp. iii–iv). In the final report, the evaluators also reported on 13 sub-themes around
employment outcomes and the perceived impact of the Job Retention and Rehabilitation Pilot Program. These
sub-themes included motivations to return to work, overall perceptions of the impact of the program on returns to
work, and health barriers to returning to work. Following the Ritchie et al. (2003) framework, Farrell et al. (2006)
then began to map and interpret the whole set of data. Mapping and interpreting involves “defining concepts,
mapping range and nature of phenomena, creating typologies, finding associations, providing explanations,
developing strategies, etc.” (Ritchie & Spencer, 1994, p. 186). Which of these specific tasks the evaluator chooses
“will be guided by the original research questions to be addressed, and by the types of themes and associations which have emerged from the data themselves” (p. 186).

Because the pilot included longitudinal data, the analysts also had to develop ways of analyzing change for clients
over time. To analyze change, the evaluators found it useful to query the data in terms of themes related to
change. Given the complexity of the project, they focused on participants’ decisions to go back to work and
constructed eight different questions that linked that decision to personal circumstances, the consequences of that
decision, the personal meaning of that decision, and whether that decision was viewed as a change for the
participant (Lewis, 2007).
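A minimal sketch of that kind of longitudinal querying is shown below in Python: each participant has a set of coded themes at each interview wave, and the comparison reports which themes appeared, persisted, or dropped out between waves. The participants, waves, and theme labels are invented.

```python
# Hypothetical coded themes for each participant at three interview waves.
waves = {
    "P01": [{"off_sick", "low_confidence"},
            {"part_time_return", "low_confidence"},
            {"full_return"}],
    "P02": [{"off_sick"},
            {"off_sick", "retraining"},
            {"retraining", "job_search"}],
}

for participant, themes_by_wave in waves.items():
    print(participant)
    for wave in range(1, len(themes_by_wave)):
        before, after = themes_by_wave[wave - 1], themes_by_wave[wave]
        print(f"  Wave {wave} -> {wave + 1}: "
              f"new={sorted(after - before)}, "
              f"dropped={sorted(before - after)}, "
              f"persisting={sorted(before & after)}")
```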

An Emerging Trend: Virtual Interviews and the Uses of Software to Record, Edit, Transcribe, and Analyze Qualitative Data

Given the costs of face-to-face interviews for qualitative evaluations, a growing trend is to conduct interviews using software platforms like
Skype, Gmail, or FaceTime. This option facilitates capturing the interviews electronically, and the files can be edited, transcribed, and
analyzed without working with paper copies. As an example of such an approach, De Felice and Janesick (2015) report on a project that
was focused on the lived experiences of Indigenous educators and teaching and learning endangered languages. Figure 5.2 summarizes the
project cycle phases from conducting the interviews to analyzing the data.

Figure 5.2 The Life Cycle for Virtual Interviewing

Source: De Felice & Janesick (2015, p. 1577). Reproduced with permission.

The process of transcribing the interviews, which is arguably the key step in the cycle, involved the interviewer listening to each interview
with headphones and simultaneously speaking the interviewee’s words so they were captured in files that could then be transcribed
electronically using Dragon (software that can be trained to understand a given person’s voice and competently transcribe speech in that
voice).

7. Reporting Qualitative Results
Generally, qualitative findings are based on lines of evidence that feature documentary or written narratives.
Interviews or open-ended responses to questions on surveys are important sources of qualitative data in
evaluations. One approach to reporting qualitative results is to initially rely on summary tables that display the
themes or categories that have been constructed through coding the data. Frequencies of themes for each variable
indicate how often they occur among the full range of responses that were elicited to particular interview or survey
questions. Depending on how complicated the coding scheme is, sub-themes can also be identified and reported.
An important part of presenting qualitative findings is to use direct quotes to illustrate patterns that have been
identified in the data. These can be narrative reports of experiences with the program, perceptions of how effective
the program was for that person, and even ways that the program could have been improved.
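
As a simple illustration of that tabulation step, the sketch below counts how often each coded theme occurs and keeps one quote per theme to pair with the summary table. The respondents, theme labels, and quotes are entirely hypothetical; in practice the coded data would be exported from qualitative analysis software or a coding spreadsheet.

```python
# Minimal sketch: tallying coded themes and pairing each with an illustrative quote.
# All data shown are hypothetical.
from collections import Counter

coded_responses = [
    {"respondent": "R01", "theme": "program helped confidence", "quote": "I felt ready to try again."},
    {"respondent": "R02", "theme": "program helped confidence", "quote": "The home visits kept me going."},
    {"respondent": "R03", "theme": "access barriers", "quote": "Getting appointments was hard."},
]

# Frequency of each theme across all coded responses.
theme_counts = Counter(r["theme"] for r in coded_responses)

# Keep the first quote encountered for each theme as an illustrative example.
example_quotes = {}
for r in coded_responses:
    example_quotes.setdefault(r["theme"], r["quote"])

for theme, count in theme_counts.most_common():
    print(f'{theme}: {count} response(s); e.g., "{example_quotes[theme]}"')
```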

A typical program evaluation report will include a clear statement of the purpose(s) of the evaluation, including
who the principal clients are (that is, who has commissioned the evaluation); the evaluation questions and sub-
questions that drive the project; the methodology, methods, and participants that are engaged to address the
questions; the findings from different lines of evidence as they bear upon each evaluation question; conclusions for
each evaluation question; and, depending on the terms of reference for the evaluation, recommendations and
(perhaps) lessons learned from the evaluation.

For qualitative evaluations, this overall pattern is generally appropriate, but the discussions of the findings will rely
on ways of reporting results that respect and include more emphasis on the voices (in their own words) of those
whose views have been included in the lines of evidence that compose the evaluation data.

The qualitative evaluation report from the National Evaluation of the Troubled Families Programme in Britain
(Blades, Day & Erskine, 2016) is an example of how to structure such a report. The qualitative findings were
reported in three sections: family engagement with the program; experiences with the program; and perceptions of
progress and outcomes. In each section of the report, there were sub-sections and, for each one, findings based on
the interviews.

Findings in each sub-section were first reported as an overall summary and then as a series of direct quotes from
participants to illustrate the findings. Persons quoted were identified by their role in the family (mother, father, or child), and the direct quotes were long enough to offer full sentences—often several sentences.
based on the evaluators having coded the interviews in sufficient detail to support an analysis that addressed key
evaluation questions, but that criterion was balanced with including the perspectives of those who had been
interviewed.

Assessing The Credibility And Generalizability Of Qualitative Findings
Analyzing qualitative data takes considerable time and effort relative to typical quantitative analysis.
Qualitative methods usually focus on fewer cases, but the unique attributes and completeness of the information
are viewed by proponents as outweighing any disadvantages due to lack of quantitative representativeness. A
challenge for evaluators who use qualitative methods is to establish the credibility of their results to those who are
perhaps more familiar with quantitative criteria (and the corollary can be true for quantitative evaluators). It is
important for those who manage evaluation projects to be familiar with how to establish credibility of qualitative
evaluation results throughout the evaluation process.

Miles et al. (2014) have identified 13 separate ways that qualitative data and findings can be queried to increase
their robustness. Their list emphasizes positivist or postpositivist concerns. Table 5.11 adapts and summarizes
these checks, together with a brief explanation of what each means.

Table 5.11 Ways of Testing and Confirming Qualitative Findings


1. Check the cases for representativeness by comparing case characteristics with characteristics of people
(units of analysis) in the population from which the cases were selected.
2. Check for researcher effects by asking whether and how the evaluator could have biased the data
collection or how the setting could have biased the researcher.
3. Triangulate data sources by comparing qualitative findings with other sources of data in the evaluation.
4. Weight the evidence by asking whether some sources of data are more credible than others.
5. Check outliers by asking whether “deviant” cases are really that way or, alternatively, whether the
“sample” is biased and the outliers are more typical.
6. Use extreme cases to calibrate your findings—that is, assess how well and where your cases sit in relation
to each other.
7. Follow up surprises—that is, seek explanations for findings that do not fit the overall patterns.
8. Look for negative evidence—that is, findings that do not support your own conclusions.
9. Formulate “if–then” statements based on your findings to see if interpretations of findings are internally
consistent.
10. Look for spurious relations that could explain key findings—if you have information on rival variables,
can you rule their influences out, based on your findings?
11. Replicate findings from one setting to another one that should be comparable.
12. Check out rival explanations using your own data, your judgment, and the expertise of those who
know the area you have evaluated.
13. Get feedback from informants by summarizing what they have contributed and asking them for their
concurrence with your summary.
Source: Adapted from Miles, Huberman, and Saldana (2014, pp. 293–310).

Although these 13 points offer complementary ways to increase our confidence in qualitative findings, some are
more practical than others. In program evaluations, two of these are more useful:

1. Triangulating data sources
2. Getting feedback from informants

Triangulation of data sources or lines of evidence is important to establish whether findings from qualitative
analyses accord with those from other data sources. Typically, complementary findings suggest that the qualitative
data are telling the same story as are other data. If findings diverge, then it is appropriate to explore other possible
problems. Earlier in this chapter, we discussed three strategies for dealing with divergent findings across lines of evidence (review the methodologies for gathering the lines of evidence; look at the alignment between particular
findings and the relevant theory/research related to those findings; and use evaluation team knowledge and
experience to make a judgment call).

Triangulation of qualitative and quantitative lines of evidence in an evaluation is the principal way that mixed-
methods evaluation designs can be strengthened. In effect, triangulation can occur among sources of qualitative
data, as well as among sources of qualitative and quantitative data.

Feedback from informants goes a long way toward establishing the validity of qualitative data and findings.
Asking those who have been interviewed (or even a sample of them) to review the data from their interviews can
establish whether the evaluators have rendered the data credibly—a step that is key to authentically representing their perspectives.

Participatory evaluations can include stakeholders in the data collection and analysis phases of the project. This is
intended to increase the likelihood that evaluation results will be utilized. Empowerment evaluations often are
intended to go further and facilitate program managers and other stakeholders taking ownership of the evaluation
process, including the data collection and analysis (Fetterman, Rodriguez-Campos, Wandersman, & O’Sullivan,
2014).

Connecting Qualitative Evaluation Methods To Performance Measurement
Performance measurement has tended to rely on quantitative measures for program- or policy-related constructs
(Poister, Aristigueta, & Hall, 2015). Program or organizational objectives are sometimes stated in numerical
terms, and annual numerical performance targets are established in many performance measurement systems.
Numbers lend themselves to visual displays (graphs, charts) and are relatively easy to interpret (trends, levels). But
for some government agencies and nonprofit organizations, the requirement that their performance targets be
represented in numbers forces the use of measures that are not seen by agency managers to reflect key outcomes.
Nonprofit organizations that mark their progress by seeing individual clients’ lives being changed often do not feel
that numerical performance measures weigh or even capture these outcomes.

As an alternative, Sigsgaard (2004) has summarized an approach to performance measurement that is called the
Most Significant Change (MSC) approach (Dart & Davies, 2003). Originally designed for projects in developing
nations, where aid agencies were seeking an alternative to numerical performance measures, the MSC approach
applies qualitative methods to monitoring and assessing performance and, more recently, has been used in
Indigenous evaluations (Grey, Putt, Baxter, & Sutton, 2016). It has something in common with the RealWorld
Evaluation approach (originally called the Shoestring Evaluation approach)—both are designed for situations
where evaluation resources are very limited, but there is a need to demonstrate results and do so in ways that are
defensible (Bamberger et al., 2012, p. xxxiv).

Sigsgaard (2004) describes how a Danish international aid agency (Mellemfolkeligt Samvirke) adopted the MSC
approach as an alternative to the traditional construction of quantitative logic models of projects in developing
countries. The main problem with the logic modeling approach was the inability of stakeholders to define
objectives that were amenable to quantitative measurement.

The MSC approach involves an interviewer or interviewers (who have been briefed on the process and intent of
the approach) asking persons who have been involved in the project (initially the recipients/beneficiaries of the
project) to identify positive or negative changes that they have observed or experienced over a fixed time, for one
or more domains of interest. Examples of a domain might be health care in a village involved in an aid project, or
farming in a rural area where a project has been implemented. By eliciting both positive and negative changes,
there is no evident bias toward project success. Then, these same persons are asked to indicate which change is the
most significant and why.

By interviewing different stakeholders, a series of change-related performance stories are recorded. Although they
might not all relate to the project or to the project’s objectives, they provide personal, authentic views on how
participants in the MSC interviews see their world and the project within it.

The performance stories are then reviewed by program management and, ultimately, by the governance level
(boards) in the donor organization (within and outside the country). At each level of authority or responsibility,
interviewees are asked to offer their own assessment of what the MSC(s) are from among the performance stories
that have been collected and to provide comments that can be taken as feedback from their “level” in the
evaluation process. Essentially, the set of performance stories is shared upwards and discussed among stakeholders
both horizontally and vertically within the program/organization and finally winnowed to a smaller set that is
deemed to encapsulate the performance of the program. Performance stories, thus reviewed and validated, are then
used to guide any changes that are elicited by the results that are communicated via the stories.

Figure 5.3 is taken from Davies and Dart (2005) and conveys the flow of stories and the feedback that is included
in the MSC approach to understanding program performance. The number of levels in the process will vary with
the context, of course.

Figure 5.3 Flow of Stories and Feedback in the Most Significant Change Approach

Sigsgaard (2004) sums up the experience of his aid organization with the MSC approach to qualitative
performance measurement:

There are also indications that the work with this simple approach has demystified [performance]
monitoring in general. The process of verification and the curiosity aroused by the powerful data
collected, will urge the country offices as well as the partners, to supplement their knowledge through
use of other, maybe more refined and controlled measures … The MSC system is only partially
participatory. Domains of interest are centrally decided on, and the sorting of stories according to
significance is hierarchic. However, I believe that the use of and respect for peoples’ own indicators will
lead to participatory methodologies and “measurement” based on negotiated indicators where all
stakeholders have a say in the very planning of the development process. Some people in the MSC
system have voiced a concern that the MSC method is too simple and “loose” to be accepted by our
back donor, Danida, and our staff in the field. The method is not scientific enough, they say. My
computer’s thesaurus programme tells me that science means knowledge. I can confidently recommend
the Most Significant Changes methodology as scientific. (p. 8)

The Power of Case Studies
One of the great appeals of qualitative evaluation is the ability to render personal experiences in convincing detail.
Narrative from even a single case, rendered to convey a person’s experiences, is a very powerful way to draw
attention to an issue or a point of view.

Most of us pay attention to stories, to narratives that chronicle the experiences of individuals in a time-related
manner. In the context of program evaluations, it is often much easier to communicate key findings by using case
examples. For many clients, tables do not convey a lot of intuitive meaning. Graphs are better, but narratives, in
some cases, are best. Patton (2003), in his checklist for qualitative evaluations, suggests this:

Qualitative methods are often used in evaluations because they tell the program’s story by capturing and
communicating the participants’ stories. Evaluation case studies have all the elements of a good story.
They tell what happened when, to whom, and with what consequences. (p. 2)

Performance stories are the essence of the Most Significant Change approach. Capturing individual experiences
and winnowing those until there is an agreed-upon performance story for a project or a program is very different
from quantitatively measuring performance against targets for a small number of outcome variables. Nonprofit
organizations (the United Way being an example) are increasingly creating annual performance reports that
convey key outputs in numerical terms but describe outcomes with stories of how the program has changed the
lives of individuals who are program clients.

In the mass media, news stories often focus on the experience of individuals, thus providing a single well-stated
opinion or carefully presented experience that can have important public policy implications. For example, the
tragic death of a single child in British Columbia, Canada, in 1994 at the hands of his mother became the basis for
the Gove Commission (Gove, 1995) and, ultimately, the reorganization of all existing child protection functions
into the provincial Ministry for Children and Families in 1996.

In program evaluations, case studies often carry a lot of weight, simply because we can relate to the experiences of
individuals more readily than we can understand the aggregated/summarized experiences of many. Even though
single cases are not necessarily representative, they are often treated as if they contained more evidence than just one
case. For program evaluators, there is both an opportunity and a caution in this. The opportunity is to be able to
use cases and qualitative evidence to render evaluation findings more credible and, ultimately, more useful. But
the caution is to conduct qualitative evaluations (or the qualitative components of multisource evaluations) so that
they are methodologically defensible as well as being persuasive.

Summary
Qualitative evaluation methods are essential tools that evaluators call on in their practice. Since the 1970s, when qualitative evaluation
methods were first introduced as an alternative to the then-dominant quantitative experimental/quasi-experimental paradigm, debates
about the philosophical underpinnings and methodological requirements for sound qualitative evaluation have transformed the theory
and practice of evaluation. Debates continue about the relative merits of qualitative versus quantitative methods, but many evaluators
have come to the view that it is desirable to mix qualitative and quantitative methods—they have complementary strengths, and the
weaknesses of one approach can be mitigated by calling on the other approach—and most program evaluations employ mixed methods.

Philosophically, many evaluators have embraced pragmatism. What that means is that mixing qualitative and quantitative methods to
build multiple independent lines of evidence in evaluations has become a standard practice in evaluations. The merits of particular
methods are decided situationally—pragmatism emphasizes the value of “what works” and focuses less on the epistemological links that
have been ascribed to qualitative or quantitative methods.

Even though pragmatism is emerging as a “solution” to earlier deep divisions among evaluators (the paradigm wars of the 1980s), there
continues to be considerable diversity in philosophical and methodological approaches to qualitative evaluation. Judging the quality of
qualitative evaluation depends on the philosophical ground on which one stands—there is no universally agreed-upon set of criteria. This
situation contrasts with evaluations where positivist or postpositivist philosophical assumptions mean that methodologies can be assessed
with a common set of criteria. In Chapter 3, we introduced the four kinds of validity connected with research designs: (1) statistical
conclusions validity, (2) internal validity, (3) construct validity, and (4) external validity. These all include methodological criteria for
judging the quality of evaluations that are consistent with positivist and postpositivist philosophical beliefs.

Qualitative evaluation often relies on case studies—in-depth analyses of individuals or groups (as units of analysis) who are stakeholders in
a program. Case studies, often rendered as narrative stories, are an excellent way to communicate the personal experiences of those
connected with a program. We, as human beings, are storytellers—indeed, stories and songs were the ways we transmitted
knowledge and culture before we had written language. Case studies convey meaning and emotion, rendering program experiences in
terms we can all understand.

Although performance measurement has tended to rely on quantitative indicators to convey results, there are alternatives that rely on
qualitative methods to elicit performance stories from stakeholders. The MSC approach has been developed to monitor performance of
development projects in countries where data collection capacities and resources may be very limited. In settings like these, qualitative
methods offer a feasible and effective way to describe and communicate performance results. As well, the MSC approach has recently been adapted
for evaluations in Indigenous communities.

Discussion Questions
1. What is a paradigm? What does it mean to say that paradigms are incommensurable?
2. Do you think paradigms are real? Why?
3. What is the pragmatic approach to evaluation?
4. How does pragmatism deal with the philosophical differences that divided the evaluation field in the 1980s?
5. What are the key characteristics of qualitative evaluation methods?
6. What does it mean for an evaluation to be naturalistic?
7. What is snowball sampling?
8. Suppose that you have an opportunity to conduct an evaluation for a state agency that delivers a program for single mothers. The
program is intended to assist pregnant women with their first child. The program includes home visits by nurses to the pregnant
women and then regular visits for the first 2 years of their child’s life. The objective of the program is to improve the quality of
parenting by the mothers and hence improve the health and well-being of the children. The agency director is familiar with the
quantitative, experimental evaluations of this kind of program in other states and wants you to design a qualitative evaluation that
focuses on what actually happens between mothers and children in the program. What would your qualitative evaluation design
look like? What qualitative data collection methods would you use to see what was happening between mothers and children?
How would you determine whether the quality of parenting had improved as a result of the program?
9. In the Discussion Questions at the end of Chapter 1 of this textbook, we asked you to think about your own preferences for
either numbers or words—whether you think of yourself as a words person, a numbers person, or a “balanced” person. Having
read Chapter 5, has your view of yourself changed at all? Why do you think that has happened?
10. If you were asked to tell someone who has not read this chapter what is the “essence” of qualitative evaluation methods, what four
or five points would you make to them?
11. We have introduced mixed-methods designs as a way to combine qualitative and quantitative evaluation approaches. When you
think of how to use combinations of qualitative and quantitative methods, which approach should be the one that has the “final
say”? Why do you think that? Discuss this question with someone else from your class or even in a group of three or four
classmates.

References
Alkin, M. C., Christie, C. A., & Vo, A. T. (2012). Evaluation theory. Evaluation Roots: A Wider Perspective of
Theorists’ Views and Influences, 386.

American Institute of Certified Public Accountants. (2017). Audit guide: Audit sampling. New York, NY: John
Wiley & Sons.

Ariel, B., Sutherland, A., Henstock, D., Young, J., Drover, P., Sykes, J., & Henderson, R. (2017). “Contagious
accountability”: A global multisite randomized controlled trial on the effect of police body-worn cameras on
citizens’ complaints against the police. Criminal Justice and Behavior, 44(2), 293–316.

Bamberger, M., Rugh, J., & Mabry, L. (2012). RealWorld evaluation: Working under budget, time, data, and
political constraints (2nd ed.). Thousand Oaks, CA: Sage.

Berrueta-Clement, J., Schweinhart, L., Barnett, W., Epstein, A., & Weikart, D. (1984). Changed lives: The effects
of the Perry Preschool experiment on youths through age 18. Ypsilanti, MI: High/Scope Press.

Bewley, H., George, A., Rienzo, C., & Portes, J. (2016). National evaluation of the Troubled Families Programme:
National impact study report. London, UK: Department for Communities and Local Government.

Blades, R., Day, L., & Erskine, C. (2016). National evaluation of the Troubled Families Programme: Families’
experiences and outcomes. London, UK: Department for Communities and Local Government.

Bokhove, C., & Downey, C. (2018). Automated generation of “good enough” transcripts as a first step to
transcription of audio-recorded data. Open Science Framework.

Bryman, A. (2009). Mixed methods in organizational research. In D. Buchanan, & A. Bryman (Eds.), The SAGE
handbook of organizational research methods (pp. 516–531). Thousand Oaks, CA: Sage.

Chilisa, B. (2012). Indigenous research methodologies. Thousand Oaks, CA: Sage.

Chilisa, B. (2012). Postcolonial indigenous research paradigms. In B. Chilisa (Ed.), Indigenous research
methodologies (pp. 98–127). Thousand Oaks, CA: Sage.

Chilisa, B., & Tsheko, G. N. (2014). Mixed methods in indigenous research: Building relationships for
sustainable intervention outcomes. Journal of Mixed Methods Research, 8(3), 222–233.

Corden, A., & Nice, K. (2007). Qualitative longitudinal analysis for policy: Incapacity benefits recipients taking
part in pathways to work. Social Policy and Society, 6(4), 557–569.

Cousins, J. B., & Chouinard, J. A. (2012). Participatory evaluation up close: An integration of research-based
knowledge. Charlotte, NC: IAP.

Creswell, J. W. (2009). Research design: Qualitative, quantitative, and mixed methods approaches (3rd ed.).
Thousand Oaks, CA: Sage.

Creswell, J. W. (2015). A concise introduction to mixed methods research. Thousand Oaks, CA: Sage.

Crossley, S., & Lambert, M. (2017). Introduction: ‘Looking for trouble?’ Critically examining the UK
government’s Troubled Families Programme. Social Policy and Society, 16(1), 81–85.

Crotty, M. (1998). The foundations of social research: Meaning and perspective in the research process. Thousand
Oaks, CA: Sage.

Dart, J., & Davies, R. (2003). A dialogical, story-based evaluation tool: The most significant change technique.
American Journal of Evaluation, 24(2), 137–155.

Davies, R., & Dart, J. (2005). The “most significant change” (MSC) technique: A guide to its use. Retrieved from
http://mande.co.uk/wp-content/uploads/2018/01/MSCGuide.pdf

Day, L., Bryson, C., White, C., Purdon, S., Bewley, H., Sala, L., & Portes, J. (2016). National evaluation of the
Troubled Families Programme: Final synthesis report. London, UK: Department for Communities and Local
Government.

De Felice, D., & Janesick, V. J. (2015). Understanding the marriage of technology and phenomenological
research: From design to analysis. Qualitative Report, 20(10), 1576–1593. Retrieved from
http://nsuworks.nova.edu/tqr/vol20/iss10/3

Denzin, N. K., & Lincoln, Y. S. (Eds.). (2011). Handbook of qualitative research (4th ed.). Thousand Oaks, CA:
Sage.

Drawson, A. S., Toombs, E., & Mushquash, C. J. (2017). Indigenous research methods: A systematic review.
International Indigenous Policy Journal, 8(2), 5.

Earl, S., Carden, F., & Smutylo, T. (2001). Outcome mapping: Building learning and reflection into development
programs. Ottawa, Ontario, Canada: International Development Research Centre.

Farrell, C., Nice, K., Lewis, J., & Sainsbury, R. (2006). Experiences of the job retention and rehabilitation pilot
(Department for Work and Pensions Research Report No 339). Leeds, England: Corporate Document Services.

Fetterman, D. (1994). Empowerment evaluation [Presidential address]. Evaluation Practice, 15(1), 1–15.

Fetterman, D. (2005). A window into the heart and soul of empowerment evaluation: Looking through the lens
of empowerment evaluation principles. In D. M. Fetterman & A. Wandersman (Eds.), Empowerment evaluation
principles in practice (pp. 1–26). New York, NY: Guilford Press.

Fetterman, D., & Wandersman, A. (2007). Empowerment evaluation: Yesterday, today, and tomorrow. American
Journal of Evaluation, 28(2), 179–198.

Fetterman, D., Rodriguez-Campos, L., Wandersman, A., & O’Sullivan, R. (2014). Collaborative, participatory
and empowerment evaluation: Building a strong conceptual foundation for stakeholder involvement approaches
to evaluation [Letter to the editor]. American Journal of Evaluation, 35, 144–148.

Fish, S. (1980). Is there a text in this class? The authority of interpretive communities. Cambridge, MA: Harvard
University Press.

Freire, P. (1994). Pedagogy of hope: Reliving pedagogy of the oppressed. New York, NY: Continuum.

Gardenshire, A., & Nelson, L. (2003). Intensive qualitative research challenges, best uses, and opportunities (MDRC
Working Paper on Research Methodology). Retrieved from http://www.mdrc.org/publications/339/full.pdf

Gennetian, L. A., Duncan, G. J., Knox, V. W., Vargas, W. G., Clark-Kauffman, E., & London, A. S. (2002).
How welfare and work policies for parents affect adolescents. New York, NY: Manpower Demonstration Research
Corporation.

Gibson, C. M., & Duncan, G. J. (2000, December). Qualitative/quantitative synergies in a random-assignment program evaluation. Presented at the Discovering Successful Pathways in Children’s Development: Mixed Methods in the Study of Childhood and Family Life Conference, Northwestern University, Evanston, IL.

Gove, T. J. (1995). Report of the Gove Inquiry into Child Protection in British Columbia: Executive summary.
Retrieved from http://www.qp.gov.bc.ca/gove/gove.htm

Grey, K., Putt, J., Baxter, N., & Sutton, S. (2016). Bridging the gap both-ways: Enhancing evaluation quality and
utilization in a study of remote community safety and wellbeing with Indigenous Australians. Evaluation
Journal of Australasia, 16(3), 15–24.

Guba, E. G., & Lincoln, Y. S. (1989). Fourth generation evaluation. Newbury Park, CA: Sage.

Guba, E. G., & Lincoln, Y. S. (2005). Paradigmatic controversies, contradictions, and emerging confluences. In
N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (3rd ed., pp. 191–216). Thousand Oaks,
CA: Sage.

Jackson, S. (2008). A participatory group process to analyze qualitative data. Progress in Community Health
Partnerships: Research, Education, and Action, 2(2), 161–170.

Johnson, R. B., & Onwuegbuzie, A. J. (2004). Mixed methods research: A research paradigm whose time has
come. Educational Researcher, 33(7), 14–26.

Kapp, S. A., & Anderson, G. (2010). Agency-based program evaluation: Lessons from practice. Thousand Oaks, CA:
Sage.

Kapp, S. A., Robbins, M. L., & Choi, J. J. (2006). A partnership model study between juvenile justice and community
mental health: Interim year-end report August 2006. Lawrence: School of Social Welfare, University of Kansas.

Kovach, M. (2018). Doing Indigenous methodologies: A letter to a research class. In N. Denzin & Y. Lincoln
(Eds.), The Sage handbook of qualitative research (pp. 214–234). Thousand Oaks, CA: Sage.

Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago, IL: University of Chicago Press.

Kushner, S. (1996). The limits of constructivism in evaluation. Evaluation, 2, 189–200.

LaFrance, J., & Nicholas, R. (2010). Reframing evaluation: Defining an Indigenous evaluation framework.
Canadian Journal of Program Evaluation, 23(2), 13–31.

Lapadat, J. (2000). Problematizing transcription: Purpose, paradigm and quality. International Journal of Social
Research Methodology, 3(3), 203–219.

Lewin, S., Glenton, C., & Oxman, A. (2009). Use of qualitative methods alongside randomised controlled trials
of complex healthcare interventions: Methodological study. BMJ, 339, b3496.

Lewis, J. (2007). Analysing qualitative longitudinal research in evaluations. Social Policy and Society, 6(4),
545–556.

Manpower Demonstration Research Corporation. (2012). About MDRC. Retrieved from http://www.mdrc.org/about.htm

McDavid, J. C. (1996). Summary report of the 1991–1996 Canada/Yukon EDA evaluation. Ottawa, Ontario,
Canada: Department of Indian and Northern Affairs.

Mertens, D. M., & Wilson, A. T. (2012). Program evaluation theory and practice: A comprehensive guide. New York, NY: Guilford Press.

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook (2nd ed.). Thousand
Oaks, CA: Sage.

Miles, M. B., Huberman, A. M., & Saldana, J. (2014). Qualitative data analysis. Thousand Oaks, CA: Sage.

Miller, C., Huston, A. C., Duncan, G. J., McLoyd, V. C., & Weisner, T. S. (2008). New Hope for the working
poor: Effects after eight years for families and children. New York, NY: MDRC. Retrieved from
http://www.mdrc.org/publications/488/overview.html

Morgan, D. L. (2007). Paradigms lost and pragmatism regained: Methodological implications of combining
qualitative and quantitative methods. Journal of Mixed Methods Research, 1(1), 48–76.

Murphy, E., Dingwall, R., Greatbatch, D., Parker, S., & Watson, P. (1998). Qualitative research methods in
health technology assessment: A review of literature. Health Technology Assessment, 2(16), 1–274.

Owen, J. M. (2006). Program evaluation: Forms and approaches. Sydney, New South Wales, Australia: Allen &
Unwin.

Paloma Foundation & Wellesley Institute. (2010). Working together: The Paloma-Wellesley guide to participatory
program evaluation. Toronto, Ontario, Canada: Author.

Patton, M. Q. (1997). Utilization-focused evaluation: The new century text (3rd ed.). Thousand Oaks, CA: Sage.

Patton, M. Q. (2002). Qualitative research & evaluation methods (3rd ed.). Thousand Oaks, CA: Sage.

Patton, M. Q. (2003). Qualitative evaluation checklist. Retrieved from http://www.wmich.edu/evalctr/archive_checklists/qec.pdf

Patton, M. Q. (2008). Utilization-focused evaluation: The new century text (4th ed.). Thousand Oaks, CA: Sage.

Patton, M. Q. (2015). Qualitative research & evaluation methods: Integrating theory and practice (4th ed.). London, UK: Sage.

Poister, T., Aristigueta, M., & Hall, J. (2015). Managing and measuring performance in public and nonprofit organizations: An integrated approach. San Francisco, CA: Jossey-Bass.

Preskill, H. S., & Russ-Eft, D. F. (2005). Building evaluation capacity: 72 activities for teaching and training. Thousand Oaks, CA: Sage.

Purdon, S., Stratford, N., Taylor, R., Natarajan, L., Bell, S., & Wittenburg, D. (2006). Impacts of the job retention
and rehabilitation pilot (Department for Work and Pensions Research Report No 342). Leeds, England:
Corporate Document Services.

Quint, J. C., Bos, H., & Polit, D. F. (1997). New Chance: Final report on a comprehensive program for young
mothers in poverty and their children. New York, NY: Manpower Demonstration Research Corporation.

Ritchie, J., & Spencer, L. (1994). Qualitative data analysis for applied policy research. In A. Bryman & R. G.
Burgess (Eds.), Analyzing qualitative data (pp. 173–194). New York: Routledge.

Ritchie, J., Lewis, J., Nicholls, C. M., & Ormston, R. (Eds.). (2013). Qualitative research practice: A guide for social
science students and researchers (2nd ed.). Los Angeles, CA: Sage.

Ritchie, J., Spencer, L., & O’Connor, W. (2003). Carrying out qualitative analysis. In J. Ritchie & J. Lewis
(Eds.), Qualitative research practice: A guide for social science students and researchers (pp. 219–262). London,
England: Sage.

Scriven, M. (1973). Goal-free evaluation. In E. R. House (Ed.), School evaluation: The politics and process (pp.
319–328). Berkeley, CA: McCutchan.

Scriven, M. (2008). A summative evaluation of RCT methodology: & an alternative approach to causal research.
Journal of Multidisciplinary Evaluation, 5(9), 11–24.

Sen, R., & Churchill, H. (2016). Some useful sources. Social Policy and Society, 15(2), 331–336.

Sigsgaard, P. (2004). Doing away with predetermined indicators: Monitoring using the most significant changes
approach. In L. Earle (Ed.), Creativity and constraint grassroots monitoring and evaluation and the international
aid arena (NGO Management & Policy Series No. 18, pp. 125–136). Oxford, England: INTRAC.

Thomas, D. R. (2006). A general inductive approach for analyzing qualitative evaluation data. American Journal of
Evaluation, 27(2), 237–246.

Western Michigan University. (2010). The evaluation center: Evaluation checklists. Retrieved from
http://www.wmich.edu/evalctr/checklists

Youker, B. W., Ingraham, A., & Bayer, N. (2014). An assessment of goal-free evaluation: Case studies of four
goal-free evaluations. Evaluation and Program Planning, 46, 10–16.

Zaslow, M. J., & Eldred, C. A. (1998). Parenting behavior in a sample of young mothers in poverty: Results of the New Chance Observational Study. Retrieved from http://www.mdrc.org/project_publications_8_34.html

6 Needs Assessments for Program Development and Adjustment

Introduction 249
General Considerations Regarding Needs Assessments 250
What Are Needs and Why Do We Conduct Needs Assessments? 250
Group-Level Focus for Needs Assessments 252
How Needs Assessments Fit Into the Performance Management Cycle 252
Recent Trends and Developments in Needs Assessments 254
Perspectives on Needs 255
A Note on the Politics of Needs Assessment 256
Steps in Conducting Needs Assessments 257
Phase I: Pre-Assessment 259
1. Focusing the Needs Assessment 260
2. Forming the Needs Assessment Committee (NAC) 266
3. Learning as Much as We Can About Preliminary “What Should Be” and “What Is”
Conditions From Available Sources 267
4. Moving to Phase II and/or III or Stopping 268
Phase II: The Needs Assessment 268
5. Conducting a Full Assessment About “What Should Be” and “What Is” 268
6. Needs Assessment Methods Where More Knowledge Is Needed: Identifying the
Discrepancies 269
7. Prioritizing the Needs to Be Addressed 278
8. Causal Analysis of Needs 280
9. Identification of Solutions: Preparing a Document That Integrates Evidence and
Recommendations 280
10. Moving to Phase III or Stopping 282
Phase III: Post-Assessment: Implementing a Needs Assessment 283
11. Making Decisions to Resolve Needs and Select Solutions 283
12. Developing Action Plans 284
13. Implementing, Monitoring and Evaluating 284
Needs Assessment Example: Community Health Needs Assessment In New Brunswick 285
Summary 291
Discussion Questions 292
Appendixes 293
Appendix A: Case Study: Designing a Needs Assessment for a Small Nonprofit Organization 293
The Program 293
Your Role 294
Your Task 294
References 295

Introduction
In Chapter 1, we introduced an open systems model of programs and key evaluation issues (Figure 1.4). That
model shows how needs often drive the program delivery process—meeting needs is the rationale for designing
and implementing programs (and policies).

Chapter 6 begins with an overview of the purposes of needs assessments and then adapts the steps outlined by
Altschuld and Kumar (2010) to take the reader through the basics of how to plan, undertake, analyze, and
communicate results from needs assessments. Then, we offer an example of an actual needs assessment to show
how the principles and steps in the chapter can be applied.

Needs assessments are a practical part of evaluation-related activities that are conducted in public and nonprofit
sectors. They are usually done to inform program development or strategic program planning. Needs assessments
can also be done specifically to modify ongoing program delivery. They can be conducted by organizations, such
as educational institutions, health care institutions or agencies, local or municipal agencies, state or provincial
agencies, federal agencies, and charities and other nonprofit organizations. They can also be conducted by
community-based collaborations—these needs assessments will span a range of program provision organizations in
a community. Although needs assessments can be resource intensive, budgetary constraints have heightened the call for them to help improve fiscal decision making.

In different chapters in this textbook, we have pointed to the fiscal climate in the post 2008–2009 Great Recession
period as a factor in the purposes for conducting evaluation-related activities. Needs assessments are also being
affected by the same fiscal constraints (more program demands and fewer resources). What that means is that
increasingly, needs assessments are being conducted as part of resource reduction or resource reallocation
scenarios. For example, in a recent report to Congress by the Government Accountability Office, six states that are
recipients of Medicaid funds were reviewed for their Medicaid-related needs assessments of prospective clients. An
underlying concern is that Medicaid funds are not sufficient to meet the needs of persons who may be eligible for
support, coupled with a concern that existing funds are misallocated. The report uncovered practices in some states suggesting either that client numbers were being capped when funds ran out for a given fiscal year or that there were conflicts of interest between the assessors and the agencies providing services; in some situations, the assessors were the same organizations that were looking to augment their client rosters (Government Accountability Office, 2017).

The availability of online public health databases and other sector-specific resource portals has facilitated changes
to both the practice and expectations of needs assessments. The process and results of needs assessments are now
more commonly expected to be a part of systems of knowledge building and sharing, rather than simply one-off
projects. For example, the Canadian Observatory on Homelessness (2018) supports the Homeless Hub, an
electronic, freely-available repository of Canadian and international research and guidance, including needs
assessments, on homelessness and related issues:

The Homeless Hub is a web-based research library and information centre representing an innovative
step forward in the use of technology to enhance knowledge mobilization and networking.

Building on the success of the Canadian Conference on Homelessness (2005), the Homeless Hub was
created to address the need for a single place to find homelessness information from across Canada. This
project began with an understanding that different stakeholders (in government, academia and the
social services sector) are likely to think about and utilize research in different ways. As such, the website
was built with different stakeholders in mind. (p. 1)

This chapter will discuss these recent developments and then take the reader through an example of a multistage
needs assessment that illustrates how common databases have been used to support a series of community health needs assessments in a Canadian province.

General Considerations Regarding Needs Assessments

What Are Needs, and Why Do We Conduct Needs Assessments?


When we talk about needs in terms of evaluation work, we are typically talking about gaps, either in
programs/services that are needed or in the condition of an individual or a group in terms of health, education,
and so on. Altschuld and Kumar (2010) define needs as “the measurable gap between two conditions; ‘what is’
(the current status or state) and ‘what should be’ (the desired status or state)” (p. 3). Similarly, in the adult
education field, Sork (2001) defines an educational need as “a gap or discrepancy between a present capability
(PC) and a desired capability (DC)” (p. 101). Also worthy of note, he maintains that needs assessment is “part
technical, part sociological and part ethical. It is dangerous to plan a needs assessment without considering all
three of these domains” (p. 100). Social betterment has been a stream of evaluation for decades (see Henry, 2003),
and more recently, the idea of equitable societies has become a more prominent component of the evaluation
conversation (Donaldson & Picciotto, 2016). Thus, in needs assessments, as in other forms of evaluation, values
play a role, and each evaluator benefits from reflecting on her or his stance on relevant values.

Needs assessments, then, are about technically, sociologically, and ethically defining the needs, developing
strategies for assessing the extent of the needs given what is currently available to address those needs, prioritizing
the needs to be addressed, and determining the way forward.

Needs assessments are done to provide evidence for choices in the provision of programs intended to benefit
society, within resource constraints and usually with an eye on value for money. Social values (what individuals,
groups or societies believe should be done) and political imperatives have an impact on decision-making processes.
In many countries, economic and fiscal challenges, particularly since the Great Recession in 2008–2009, have
heightened pressures to consider either cutbacks to programs and services or reallocation of limited resources to
higher priority services. As part of the strategic planning and decision-making cycle, making choices among
competing policies or programs—or perhaps reallocating resources—entails justifying the support of some
priorities over others.

The need for a national policy on social housing, for example, is an expression of concern about the homeless and
the consequences of being homeless, and a statement that we ought to ameliorate this problem. Similarly,
community-based mental health services in many jurisdictions are acknowledged as not meeting the needs of the
population being served, leading to needs assessments that help to identify how to improve and coordinate services
(Hanson, Houde, McDowell, & Dixon, 2007). An example of an emerging problem for societies is how the
widespread use of prescription opioids to treat painful medical conditions has morphed into an epidemic of illegal
drug use, fentanyl in particular, with the attendant numbers of overdose deaths. Prospective solutions to this
problem vary with values and political priorities—some advocate a crackdown on the illegal sources and the level
of opioid prescriptions by doctors, and others advocate decriminalizing drug use to connect drug users to
treatment options before they harm themselves (Beletsky & Davis, 2017; Davis, Green, & Beletsky, 2017).

The growing emphasis on strategic allocation of resources for health care, education, training, housing, justice, social services, and community infrastructure is a major contributor to the current interest in needs assessment,
with a view to finding the most relevant, effective, and cost-effective fit between needs and the design and delivery
of programs. Needs assessments, aside from identifying and helping prioritize needs, can also facilitate building
relationships among service providers and assist in finding ways to collaboratively deliver programs with a more
holistic view of the program recipients. That is one outcome from the community health needs assessment case we
will discuss later in this chapter. Approaches or solutions should be considered within relevant community and
cultural contexts.

In addition, needs assessments are increasingly mandated by regulations or laws (Soriano, 2012). For example, the
United States’ Patient Protection and Affordable Care Act (Affordable Care Act), enacted in 2010, mandates that
tax-exempt hospitals and various other public health providers conduct needs assessments every 3 years (Cain, Orionzi, & O’Brien, 2017; Folkemer et al., 2011), and the United Kingdom mandates assessments for unmet
needs of children in care (Axford, 2010). In some cases, no overall needs assessment is done prior to implementing
a program. The Troubled Families Programme in Britain, for example, was designed and implemented in response to
riots in British cities in 2011 and was intended to “turn around” families who were deemed to be most likely to be
the source of future problems for the criminal justice and social services systems. It was a response to a political
crisis. But after implementation, a key part of working with families was “workers [assessing] the needs of families
identified as troubled and coordinate a year-long program of intensive family support to tackle antisocial behavior,
misuse of drugs and alcohol, and youth crime” (Fletcher, Gardner, McKee, & Bonell, 2012, p. 1).

Group-Level Focus for Needs Assessments


While a needs assessment can be focused at the level of the individual, group, service provider, or system, for the
purposes of this book, we will follow McKillip’s (1998) approach, emphasizing the group or population level
“partly because the decisions that utilize need analysis have a public policy orientation and partly because the
methodologies used usually are not sensitive enough for use in individual-level decisions” (p. 262). This focus
corresponds with Altschuld and Kumar’s (2010) Level 1, which refers to group-level assessment, rather than the
individual diagnostic level. Their Level 2 is focused on the needs of service providers, and Level 3 “is the nature of what is required by the system that supports service providers and service recipients” (p. 11). Levels 2 and 3 certainly matter for Level 1, in that they need to be taken into consideration as part of the context of the needs assessment and of the capacity to deliver on recommendations that come out of a needs assessment process.

In some cases, diagnostic needs assessment tools that are meant for individuals can be useful for group or
population needs assessments, such as when individual-level data are rolled up to the local (e.g., municipal) or
service provider level. This aggregated data can be used to help agencies and institutions target resources to areas
that need additional improvement (Asadi-Lari, Packham, & Gray, 2003; Pigott, Pollard, Thomson, & Aranda,
2009; Wen & Gustafson, 2004).
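
The sketch below illustrates what such a roll-up might look like, assuming (hypothetically) that individual-level assessment results are available as a simple table with a client identifier, a municipality, and a count of unmet needs; it is not drawn from any of the instruments cited above.

```python
# Minimal sketch: rolling individual-level scores up to the municipal level
# so aggregated results can inform a group-level needs assessment.
# The data and column names are hypothetical.
import pandas as pd

individual = pd.DataFrame({
    "client_id":    [1, 2, 3, 4, 5, 6],
    "municipality": ["A", "A", "A", "B", "B", "B"],
    "unmet_needs":  [3, 1, 4, 0, 2, 5],   # e.g., count of unmet needs per client
})

# Aggregate per municipality: number of clients, average unmet needs,
# and how many clients report three or more unmet needs.
rolled_up = (individual
             .groupby("municipality")["unmet_needs"]
             .agg(clients="count",
                  mean_unmet="mean",
                  high_need=lambda s: (s >= 3).sum()))
print(rolled_up)
```

Aggregated this way, the results describe municipalities or service providers rather than any individual client, which is the group-level focus emphasized in this section.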

Examples of individual-level diagnostic tools that can also be used to inform group needs assessments are the
Cumulative Needs Care Monitor for patients with severe mental illness (Drukker, van Os, Bak, à Campo, &
Delespaul, 2010) and the Children’s Review Schedule, an instrument to assess the nature, range, and severity of
family problems of children in need (Sheppard & Wilkinson, 2010). In other cases, there are two versions of a
tool, such as the Camberwell Assessment of Need for assessing the needs of people with severe mental illness; the
clinical version is used to plan the care of individuals, and the research version can be used for population-level
comprehensive needs assessments or similar research (Slade, Thornicroft, Loftus, Phelan, & Wykes, 1999).

How Needs Assessments Fit Into the Performance Management Cycle


The performance management cycle, introduced in Chapter 1, includes several stages that relate to assessing needs.
Figure 6.1 connects needs assessments to the performance management cycle. Several of the major clusters of
activities (stages) in the cycle are associated with applications of needs assessment tools.

Figure 6.1 Needs Assessments and the Performance Management Cycle

At the strategic planning and resource stage of the cycle, strategic plans are developed or modified in the light of
information gathered to reflect the organization’s strengths, weaknesses, opportunities, and threats, and this
information is significant for needs assessments. Setting strategic goals in public-sector and nonprofit organizations
is often restricted by existing mandates and resources, although in some cases, there are new funding
opportunities. The importance of considering the timing of needs assessments in the performance management
cycle is emphasized by Stevens and Gillam (1998):

A key to successful needs assessment is the proper understanding of how it is related to the rest of the
planning process. Too much needs assessment is divorced from managers’ deadlines and priorities. If the
information and recommendations produced are not timely, they will not be useful. The results of needs assessments therefore need to be encapsulated in strategies or business plans. These need clear
definitions of objectives: describing what needs to be done, by whom, and by when. The key to
effecting change is an understanding of the opportunities that may facilitate and the obstacles that may
obstruct what is being attempted—knowing which “levers” to use. An understanding of the sources of
finance, their planning cycles, and the criteria used to fund new initiatives is essential. (p. 1451)

Frequently, needs assessments are conducted as programs or policies are being developed or modified. Program or
policy design is informed not only by strategic objectives but also by sources of information that can shape
program structure(s), operations, and intended outcomes. Later in this chapter, we will discuss causal analysis as a
part of conducting needs assessments. If we take the example of homelessness as a measure of the need for social
housing, we can go further to include the causes of homelessness and the consequences of being homeless in our
designs for programs and policy options: we want to construct programs that have the best chance of successfully addressing the problem, a focus on program or policy appropriateness as discussed in Chapter 1.

In some cases, when a program evaluation is being conducted (at the Assessment and Reporting Phase of the
performance management cycle), the evaluation can incorporate lines of evidence that focus on client/stakeholder
needs and therefore provide information for a needs assessment. For example, if an evaluation includes a survey of
existing program recipients, the survey instrument can solicit client experiences and assessments of their
interactions with the existing program (relevant for determining program effectiveness), as well as their perceptions
of ways in which the program could be modified to better meet their needs (relevant for addressing perceived gaps
between what is being offered and what is needed). The information on gaps between the services received from
the current program and perceived unmet needs can be one source of data used to modify the design and/or the
implementation of the program.

Recent Trends and Developments in Needs Assessments


Needs themselves, our understanding of how to address those needs, and the resources available for addressing
needs change over time (Stevens & Gillam, 1998). Sork (2001) points out, for example, that “what it means to be
a capable person/employee/citizen/parent is constantly changing. As expectations change, needs are made” (p.
102). As well, the practice of needs assessment continues to evolve (Altschuld, 2004; Altschuld & Kumar, 2010;
Soriano, 2012). Most recently, periodic assessment of the need for programs or services has become a part of core
evaluation expectations that are required by many governments and external funders. Public service providers are
not only being expected to conduct needs assessments as part of their strategic planning process, but in some
sectors, there is also a growing expectation that they access common databases and online portals, incorporate
standardized questions and measures, coordinate needs assessments with other service provision areas, share
information on common portals, and provide needs assessment evidence in requests for funding (Axford, 2010;
Government Accountability Office, 2017; Scutchfield, Mays, & Lurie, 2009; Tutty & Rothery, 2010).

Stergiopoulos, Dewa, Durbin, Chau, and Svoboda (2010), in “Assessing the Mental Health Service Needs of the
Homeless: A Level-of-Care Approach,” note, “During the last decade in mental health services research, several
systematic and standardized methods for assessing the needs for care have been developed” (p. 1032). Tutty and
Rothery (2010) point out that using standardized measures “has the advantage of building on the work that has
gone into identifying and conceptualizing potentially important needs and of using a measure for which reliability
and validity will often have been established” (p. 154).

Technological advances are changing the way organizations can access and share their data and are an important
driver of changes to the practice of needs assessment. Online portals are being developed to house standard
datasets that can be used for the foundational stages of a needs assessment and other research. For example, in
Ontario, the online resource provided by the Institute for Clinical Evaluative Sciences (https://www.ices.on.ca)
provides linked, province-wide health system administrative data, analytic tools, and full reports.

In the United States, the Data Resource Center for Children and Adolescent Health website, which supports
public health databases such as the National Survey of Children with Special Health Care Needs and the National Survey of Children’s Health, provides information on more than 100 indicators at the national, state, and health
region level (http://www.childhealthdata.org). The website is sponsored by a number of public-sector entities and
features interactive data query capabilities.

While needs assessments can be done at either the single-organization level or using a multi-organization
approach, over time there is likely to be better comparability across needs assessments, more emphasis on
reliability and validity of the tools, and more online sharing of the information. Community Health Assessment is
described in Friedman and Parrish (2009) as follows:

Community Health Assessment is the ongoing process of regular and systematic collection, assembly,
analysis, and distribution of information on the health needs of the community. This information
includes statistics on health status, community health needs/gaps/problems, and assets. The sharing of
findings with key stakeholders enables and mobilizes community members to work collaboratively
toward building a healthier community. (p. 4)

Before outlining the steps of needs assessment, we want to mention several important points about different
perspectives on needs and about the politics of needs assessments.

Perspectives on Needs
Needs can be approached from several different perspectives, and it is useful to have an understanding of the
terminology from various sectors in order to make an informed choice as to the best approach (combination of
methodologies) to assessing the needs of the target population (Axford, 2010). Kendall et al. (2015, p. 2)
summarize Bradshaw’s (1972) seminal typology of social need as follows:

Felt need:
Felt need is want, desire or subjective views of need which may, or may not become expressed need. It may be
limited by individual perceptions and by lack of knowledge of available services. Felt need may be influenced by
comparison with peers or others with a similar condition, and whether it is elicited may depend how (and by
whom) the question is asked.
Expressed need:
Expressed need is demand or felt need turned into action, and help is sought. However, to express need it is
necessary to have heard of a service, consider the service to be acceptable, accessible and priced appropriately.
Normative need:
Normative needs are defined by experts, professionals, doctors, policymakers. Often a desirable standard is laid
down and compared with the standard that actually exists. They are not absolute and different experts set different
standards. Assessment of normative needs can also be used to judge the validity of an expressed need.
Comparative need:
Comparative need has to do with equity: if some people are in receipt of a service and others, in similar
circumstances, are not, then the latter are considered to be in need. Relative availability will influence comparative
need as the benchmark is to achieve equal access.

In education needs assessments, “felt needs” are sometimes described as “perceived needs,” or “what the
individuals or the group have identified as what they want to learn”; “prescribed needs” reflect deficiencies in
“those areas that educators or program planners determine as inadequate and that need educational intervention”;
and “unperceived needs” are “what learners don’t know that they need to know” according to “teachers,
professional bodies, clients or patients, allied health professionals, and national and international organizations”
(Ratnapalan & Hilliard, 2002, p. 2).

Asadi-Lari et al. (2003) suggest that Bradshaw’s (1972) model defines needs from a social perspective. They
suggest that an economist’s approach incorporates the critical factor of cost containment and the cost-effectiveness
of outcomes when considering various options to address needs. Cost-effectiveness and “capacity to benefit” from
services are the focal point of this perspective, but Asadi-Lari et al. (2003) caution that “this terminology is
innovation-disorienting, that is it limits population healthcare needs to readily available services, ignoring
potential needs arising from emerging health technologies” (p. 2).

Finally, Scriven and Roth (1978) include “maintenance needs”: needs that are causally linked to a program (or
parts of a program) that has been withdrawn. For example, the withdrawal of a school lunch program may later
bring to light that this service was a cornerstone for helping meet overall learning needs for children in an
elementary school in a poorer area. Essentially, determining the perspective on the need that is going to be assessed
helps in defining and operationalizing how it will be measured.

Regardless of whether we believe that there is an intrinsic hierarchy of human needs (Maslow, 1943), translating
needs (or demands or wants) into policies and programs involves value-based choices. Assessing needs, reporting or
documenting such assessments, communicating/reporting the results of assessments, and implementing the
subsequent action plans will be scrutinized by a range of interested parties, some of whom will be supportive and
some of whom may work to prevent resources being allocated to those purposes. The defensibility of needs
assessments becomes an important issue in these circumstances, and underlines the importance of an informed and
strategic approach in the work, conducted as neutrally as possible while acknowledging the context of the project.

A Note on the Politics of Needs Assessment


Needs assessments, because they are fundamentally focused on prioritizing and then translating values into policies
and programs, are intrinsically political. They can be contentious—they have that in common with high-stakes
summative evaluations. More and more, needs assessments are occurring in the context of resource constraints, in
an environment where there is an expectation that identifying and prioritizing needs will be linked to reallocations
or reductions in expenditures. Proponents of expanding or creating a service based on a needs assessment can be
challenged on several grounds, including the following:

The provision of the program is wrong or a poor use of resources because the proposed objectives/intended
outcomes of the program are different from or directly challenge strongly held values or existing
commitments.

Political commitments extend to prevailing political ideologies of the relevant government jurisdiction—in fact,
political ideologies can drive the “need” for programs that are expected to align with political priorities and may or
may not reflect other perspectives on needs. An example in Canada was the Conservative federal government
priority (2006–2015) to be “tough on crime,” which involved building more prisons to house inmates convicted
under legislation that required judges to impose minimum sentences for crimes. As the legislation was being
debated, critics pointed to evidence from U.S. jurisdictions that had rescinded their minimum sentencing laws
because of the costs and unintended negative consequences of imprisonment (Piche, 2015). The government
proceeded with its program, notwithstanding evidence to the contrary (Latimer, 2015), and over time, the
constitutionality of components of mandatory minimum sentencing has been successfully challenged (Chaster,
2018).

Altschuld and Kumar (2010) contend that “by attending to politics throughout the process, the likelihood of
success will increase” (p. 20). Disagreements based on differences in values influence the political decision-making
process. “Policy decisions in public health [for example] are always influenced by factors other than evidence,
including institutional constraints, interests, ideas, values, and external factors, such as crises, hot issues, and
concerns of organized interest groups” (Institute of Medicine, Committee for the Study of the Future of Public
Health, 1988, p. 4).

The needs assessment itself is flawed, and the information produced is biased, inaccurate, or incomplete.

An important methodological consideration is an awareness of the incentives that various stakeholders have in
providing information. For example, service providers typically will be interested in preserving existing services
and acquiring more resources for expanded or new services. Methodological challenges can, when anticipated, be
addressed proactively (even if they cannot all be resolved).

Existing or prospective clients will generally be interested in improved services as well. Usually, the services they
consume do not entail paying substantial user fees, so there is a general bias toward wanting more services than
would be the case if they bore the full marginal costs associated with increasing services.

Other stakeholders (service providers who might be offering complementary or perhaps competing services) may
or may not have an a priori tendency to want services expanded. Their views can be very useful as a way to
triangulate the views of clients and service providers at the focus of a needs assessment. At the same time,
including service providers in needs assessments can create conflicts of interest where individuals are screened and,
if found to be in need, become their clients (GAO, 2017).

For both of the challenges noted previously, knowing who the key users of a needs assessment are will influence
how the reporting process unfolds. Examples of users include the following: service providers, funders, elected
officials, board members, current and prospective clients, and the general public. Engaging stakeholders in the
study as it is happening is one way to build relationships and increase their buy-in for the recommendations in the
final report. Sometimes, there are several stakeholder groups who are interested in a needs assessment—it is
valuable to create a written agreement describing the terms of reference for the study to be considered.

With this background in mind, we move to the steps in conducting a needs assessment.

Steps in Conducting Needs Assessments
This section provides guidance on the steps to be considered for a needs assessment. We will adapt the Altschuld
and Kumar (2010) framework, paying particular attention to how the purposes (formative, summative) and
timing (ex ante or ex post) of the needs assessment affect the steps in the process.

There are a growing number of resources that provide in-depth suggestions for specific program sectors. In
addition to the Altschuld and Kumar (2010) tool kit already mentioned, there are books such as McKillip’s (1987)
still-useful Need Analysis: Tools for the Human Services and Education, Soriano’s (2012) basic and practical
Conducting Needs Assessments: A Multidisciplinary Approach (2nd ed.), and Reviere, Berkowitz, Carter, and
Ferguson’s (1996) Needs Assessment: A Creative and Practical Guide for Social Scientists. Some of the recent
resources incorporate guidance on database availability and utilization, additional online tools, and details on the
growing expectations for collaborative, integrated, and standardized progress in needs assessments (see, e.g.,
Altschuld & Kumar, 2010; Axford, 2010; Byrne, Maguire, & Lundy, 2018; Folkemer et al., 2011; Miller &
Cameron, 2011; Strickland et al., 2011).

Existing needs assessment frameworks, although they do take into account politics and other related pressures,
often rely on internal participation from organizations in which current or prospective programs would be housed.
Implicitly, there is a view that, as in Collaborative, Participatory, or Empowerment Evaluation (Fetterman &
Wandersman, 2007; Fetterman, Rodríguez-Campos, & Zukoski, 2018), there is no inherent conflict of interest
between being a program provider and being involved in a needs assessment, regardless of its purposes.
the Altschuld and Kumar (2010) framework does mention needs assessments where resource constraints are a
factor, the overall emphasis is on collaboration and stakeholder participation—in essence, a process that is more
bottom-up than top-down.

Our view in this textbook is that it is prudent to be aware that evaluation purposes, including needs assessment
purposes, affect the incentives and disincentives that participants perceive for their own involvement. An evaluator
needs to be aware of the big picture, in addition to people’s natural motivations when they are asked to provide
information for lines of evidence. Where appropriate, this view will be reflected in our discussion of the steps for
conducting a needs assessment.

The steps in Table 6.1 suggest that needs assessments are linear—that they proceed through the steps in ways that
are similar to a checklist. But Altschuld and Kumar (2010) point out that needs assessments, like other types of
evaluations, typically encounter surprises, and it is possible that the steps become iterative:

Sometimes the process recycles back to earlier steps and retraces prior ground. Some steps may be out of
order or overlap so that it is difficult to distinguish phases from each other . . . a number of activities
may be underway simultaneously. (p. 30)

No matter how basic the needs assessment is, it is important to keep clear and comprehensive records during the
entire process, for the benefit of the decision makers, the organization, and future users of the needs assessment
information. As Altschuld and Kumar (2010) explain,

The process must be documented to show how priorities were determined and for use in later needs-
oriented decisions. Date all tables and keep accurate records of how the NAC [needs assessment
committee] accomplished its work. This will also enhance the evaluation of the needs assessment . . . 
Assessments generate a large amount of paperwork (tables, summaries, reports, meeting
agendas/minutes, etc.) that must be put into a format that aids decision-making. (p. 114)

Below, we will expand on some of the key issues involved within the steps of a needs assessment.

Table 6.1 Steps of a Needs Assessment

Phase I: Pre-assessment
Overarching phase descriptor: Focusing the needs assessment, and what do we know about possible needs? (This phase mainly takes advantage of existing data.)
Key steps:
1. Focusing the assessment
2. Forming the needs assessment committee
3. Learning as much as we can about preliminary “what should be” and “what is” conditions from available sources
4. Moving to Phase II and/or III or stopping

Phase II: Assessment
Overarching phase descriptor: Do we need to know more, will we have to conduct a much more intensive data collection effort, and do we have ideas about what are the causes of needs? (This phase may require an extensive investment of time, personnel, and resources for the collection of new data.)
Key steps:
5. Conducting a full assessment about “what should be” and “what is” conditions
6. Identifying discrepancies
7. Prioritizing discrepancies
8. Causally analyzing needs
9. Preliminary identification of solution criteria and possible solution strategies
10. Moving to Phase III or stopping

Phase III: Post-assessment
Overarching phase descriptor: Are we ready to take action, and have we learned enough about the need to feel comfortable with our proposed actions?
Key steps:
11. Making final decisions to resolve needs and selecting solution strategies
12. Developing action plans for solution strategies, communicating plans, and building bases of support
13. Implementing and monitoring plans
14. Evaluating the overall needs assessment endeavor (document with an eye to revisit and reuse)

Source: Altschuld & Kumar (2010, p. 34).

Phase I: Pre-Assessment
This initial stage of the needs assessment process relies mostly on existing information, and possibly informal
interviews, to begin structuring the assessment and to decide whether to proceed with a more formal
needs assessment. In other words, at the end of this first phase, a decision will be made to (a) continue to the next
phase of the needs assessment, (b) discontinue the assessment since it has become evident that the need is not
sufficient to warrant more work, or (c) move straight to the third phase, planning the actions to address the need.

If resources allow, this phase and the second phase of the process can be facilitated by an external advisor;
Altschuld and Kumar (2010) argue that “internal staff may be unable to adopt the neutral stance necessary for the
facilitation” (p. 58). This will be an issue where the overall purpose is summative. We will say more about this
shortly.

The pre-assessment phase includes the following: focusing the assessment, establishing a formal or informal NAC
[needs assessment committee], and learning as much as we can about preliminary “what should be” and “what is”
conditions from available sources. This phase can be considered the preliminary fact-finding mission, where one of
the main goals is to define what the needs problem is about. The problem may have both political and empirical
policy implications. Are there indications that a service gap may need to be addressed? Are there pressures to more
effectively coordinate services with other service providers? Are there fiscal pressures that may force
realignment/consolidation of services? Is there a demand for evidence of needs, to be provided in a renewed or new
funding proposal, necessitating a needs assessment?

1. Focusing the Needs Assessment


Being clear on who the end users of the assessment will be helps to determine who should be involved in the
needs analysis process, and being clear on the planned uses helps create some parameters for defining the problems
and for the subsequent consideration of possible solutions (McKillip, 1998). Focusing the needs assessment
involves at least six features:

a. Determining the purpose of the needs assessment and the question(s) that identify the nature of the gap(s)
to be addressed
b. Understanding the target population(s)
c. Considering the strategic context of the program(s)
d. Creating an inventory of existing services
e. Identifying possible service overlaps and potential collaborative services
f. Factoring in the resources available for the needs assessment

a. Determining the Purpose of the Needs Assessment and the Question(s) That Identify the
Nature of the Gap(s) to Be Addressed.

Determining the overall purpose of the needs assessment—is it formative or summative—is an important first
step. It is helpful to think of two general scenarios. Where a new program is under consideration to meet a need—
this is an ex ante needs assessment—stakeholder involvement would typically include existing program providers,
although one consideration is whether a proposed new program would be seen to be a competitor to existing
program offerings.

Where a needs assessment is examining existing programming (ex post needs assessment), the overall purpose is an
important factor in how to structure the process. We will distinguish between formative and summative purposes
and suggest the effects of purpose on the process.

A formative needs assessment that is not focused on determining the future existence of the existing program
(except to identify marginal changes/improvements) would typically involve program providers/program
managers. This type of needs assessment is essentially an evaluation conducted to improve a program. They would
be important sources of information on client patterns, waiting lists, and linkages among existing program
providers. Similar to involving program managers in program evaluations (we discuss this in Chapter 11), there is
not likely to be a conflict of interest between the purpose of the needs assessment and the interests of the
managers.

A summative needs assessment where the existing program is being reviewed for its continued relevance in relation
to current and future needs (is it meeting high-priority needs given the funding that is available?) would be higher
stakes for program managers/program providers and could create conflicts of interest for them. Program manager
involvement in such needs assessments is important, but the credibility of the results depends on the perceived
neutrality of the process and its products. Overall, program manager involvement needs to be balanced with the
importance of the independence and credibility of the needs assessment process, given its purposes.

Another objective at this stage is to get clarity on the question or questions that are driving the needs assessment.
Although much of this will come down to the nature of the gap that may need to be addressed (or addressed
differently), this is where we begin defining “what’s the problem” for a given population, not yet the extent of the
gap nor the solution to the problem. In terms of addressing “what is” and “what should be,” the problem will
typically be defined by value-driven questions, such as the following: “‘What ideally should be?’ ‘What is likely to
be?’ ‘What is expected to be?’ ‘What is feasible?’ ‘What is minimally acceptable?’ and so forth” (Altschuld, 2004, p.
11). These are the questions that are part of defining the gap that is the problem or obstacle to a desired state for a
targeted population, in areas such as social service needs, safety needs, transportation needs, training needs, and
health care needs.

This step also involves looking at the research that already has been conducted, as well as locating and gaining an
understanding of databases and standardized tools that may be of use in determining the scale and scope of the
problem, such as population and demographic trends, transportation services, service use data, and so on. If not
already known, this is also the time to determine whether a needs assessment is mandated as part of a specific
funding request (e.g., through legislation or regulations), particularly since there may be some specific expectations
for a needs assessment that must be included in a funding proposal. It will not typically be possible to measure a
gap at this first stage, but the preliminary research will help determine how to define the gap that will be assessed
later in the process. For example, through preliminary research (Waller, Girgis, Currow, & Lecathelinais, 2008),
support for palliative caregivers was identified as a gap that seemed to need to be addressed, and the later survey
(the “assessment” phase) determined that the need was for more help and advice in providing physical care.

Some sector-level online resources provide a rich source of helpful information. For example, the Community
Health Assessment Network of Manitoba website has a “resources” section that links to an array of informational
files and related websites.

“Key informants” can be a useful source of data early in the needs assessment process. As outlined by McKillip
(1998), they are “opportunistically connected individuals with the knowledge and ability to report on community
needs, [such as] lawyers, judges, physicians, ministers, minority group leaders, and service providers who are aware
of the needs and services perceived as important by the community” (pp. 272–273, as cited in Tutty & Rothery,
2010, p. 151).

b. Understanding the Target Population(s).

The second step when establishing the terms of reference for a needs assessment is to determine the target
population. For example, is it a certain age group in a community or possibly current users of a complementary
service? Sometimes, the target population will be already-identified clients of programs or services, but it is also
possible that there are potential clients with needs that are not being met or are being only partially met through a
related but non-optimal service or delivery method. For example, in rolling out the U.S. Affordable Care Act, the
federal government intended to increase the number of citizens covered by a health care plan—persons or families
not covered by existing options (Schoen, Doty, Robertson, & Collins, 2011).

There are two key dimensions related to the target population: (1) the sociodemographic characteristics of the
target population and (2) the geographic scope of a planned needs assessment (the boundaries of the target
population). The more precise the designation of the relevant characteristics and geography, the easier it is to focus
the methodologies of the study (sampling and data collection methods).

Among generally relevant sociodemographic attributes of current and prospective clients are the following: age,
gender, ethnicity, first language, literacy level, education, occupation, income, and place of residence. These can
often be used as indirect measures of need and can be used to cross-classify reported needs in order to zero in on
the subpopulations where reported needs are thought to be the greatest.

Demographic characteristics that describe existing populations are often available online, such as the American
FactFinder census resource (https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml) or the health statistics
provided by federal, state, and local agencies through the Association for Community Health Improvement
(http://www.healthycommunities.org) for the United States and various statistics available through Statistics
Canada (https://www.statcan.gc.ca/eng/start). Relevant databases are typically also available at the state, provincial,
regional, or even local level. Epidemiological databases (prevalence and incidence of factors related to or even
predictive of needs) and related service and demographic information that are already available can help inform
this stage of the needs assessment process.

When compiling data from these secondary sources, the Catholic Health Association (2012) suggests the
following:

Record all data sources (including web locations) and reporting periods . . .
Seek sources of data that are online and publicly available.
Use the most recent data available.
Incorporate data from prior years, if available. This will allow you to see changes over time.
Collect data for other regions, such as the entire state, or all counties within the state. This will allow for
comparisons and rankings.
Find data that will allow for evaluation of disparities. For example, the Census provides data by census tract
(statistical subdivisions of a county) and thus allows for identification of specific geographic areas that may
differ from neighboring geographies in terms of population, economic status and living conditions. (p. 56)

Census information or similar data from government agencies can be used to estimate the occurrence of individual
or group characteristics associated with known needs for services. For example, demographic information on the
age and gender distributions in a region or even sub-region might be used to roughly gauge the need for services
for the elderly. The Ontario Institute for Clinical Evaluative Sciences website (https://www.ices.on.ca) is an
example of a portal for this type of information. An American example of an omnibus source of secondary data is
the Oregon Open Data Portal (https://data.oregon.gov). Among the geographic “units” for which
sociodemographic data are available are Oregon counties and cities.

The assumption that age is strongly associated with the need for services that target older persons can be
corroborated by developing demographic profiles of existing clients for services for seniors where such services are
provided and then comparing client profiles with the population demographics (Soriano, 2012). Estimates of need
can be constructed by comparing total numbers of prospective clients in the population with the numbers served
by existing providers.

In addition to identifying the target populations, it is important to identify any comparison populations that
could be used to benchmark needs. For example, a needs assessment for job training services in a community that
has lost a key resource-based employer might include comparisons with other resource-dependent communities
that have already established such programs, to gauge how many clients are served in these other communities in
relation to their populations. The idea would be to establish a rough benchmark that suggests an appropriate scale
for a job training program based on the ratio of population to clients served in other, similar communities.
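To make the benchmarking arithmetic concrete, the short Python sketch below scales a comparison community's clients-served-per-capita ratio to the target community's population. The function name and the figures (120 clients served in a comparison community of 8,000 people, a target community of 5,000 people) are hypothetical illustrations, not data from any particular study.

```python
def benchmark_scale(comparison_clients: int, comparison_population: int,
                    target_population: int) -> float:
    """Scale a comparison community's clients-served-per-capita ratio
    to the target community's population."""
    clients_per_capita = comparison_clients / comparison_population
    return clients_per_capita * target_population

# Hypothetical figures: a similar resource-dependent community of 8,000 people
# serves 120 job-training clients; our community has 5,000 people.
estimated_clients = benchmark_scale(120, 8_000, 5_000)
print(round(estimated_clients))  # roughly 75 clients, as a rough program-scale benchmark
```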

c. Considering the Strategic Context of the Program(s).

Third, during this pre-assessment phase, it is also advisable to consider the strategic context, such as the political,
economic, demographic, and organizational factors that could affect the feasibility of making program changes or
additions. Establishing the strategic context (a form of organizational and environmental scanning) helps set the
parameters for the scope of the problem. Sork (2001) highlights the following potentially important context
factors:

Mission, aims and purposes of the organization or planning group
Internal and external political considerations
Authority and accountability relationships
Philosophical and material conditions that influence what is possible;
Co-operative and competitive relationships (internal and external). (p. 105)

Scans such as this can offer comparisons that indicate trends that could affect the way the organization wishes to
direct or redirect its efforts. As an example, an environmental scan for a community health strategic planning
process might concentrate on the following:

Demographic and socioeconomic trends
Health status of population and subgroups
Access to existing health care services
Use of existing health care services
Environmental factors (natural and social)
Trends in labor force participation for men and women
Projections of the demand for different types of health services in the future
Currently available programs
Sustainability of current health-related programs

These factors were considered in the community health needs assessment case that we will discuss later in this
chapter.

d. Creating an Inventory of Existing Services.

Assuming that the clients of the needs assessment have been identified and the target population has also been
specified, an important step in conducting a needs assessment is to develop an inventory of the programs/services
currently being offered in a given geographic area. This step is relevant for both ex ante and ex post needs
assessments. McKillip (1998) suggests that once a target population in a geographic area has been identified, it is
useful to contact existing service providers and find out the following:

Who is providing services to the target population
What services are actually being provided, including types, availability, and costs
Who the clients (respecting confidentiality) are, that is, what their demographic characteristics are and their
geographic locations in relation to the service providers

McKillip (1998) illustrates displays of service inventory information in a matrix format, with relevant client
demographic characteristics across the top of the matrix, the types of programs that are provided to that client
population down the side of the matrix, and the names of the provider agencies in the cells of the matrix. Table
6.2 illustrates part of a template that might be constructed for a needs assessment that is focused on services for
seniors in a given population, in a case where three agencies are providing services in an area.

Table 6.2 Program Activities Inventory for Elderly Persons in a Population

(Columns show relevant client characteristics)

Program Activities Provided | Older Than 65 Years | Person Living Alone | Years Living Alone | Physical Disabilities
Meals delivered to the home | Agency A | Agency A | Agency A | Agency A
Light housekeeping | | Agency B | |
Home care nursing | | | | Agency C

During the process of developing a service inventory, service providers could also be asked, in informal interviews,
to offer estimates of the extent to which existing clients’ needs are being met by the services provided to them
(Kernan, Griswold, & Wagner, 2003). Using service provider input in this manner has the advantage of acquiring
data from persons or agencies that are knowledgeable about current clients, past clients, and the patterns of service
requests, demands, and needs the providers observe. On the other hand, service providers clearly have a stake in
the outcome of needs assessments and may have an incentive to bias their estimates of the adequacy of existing
service to existing clients. Given the current and potentially future constrained fiscal climate for program
providers, it is prudent to pay attention to the fiscal context for this step. Where program providers see themselves
competing for the same scarce resources (for new or existing programming), we should be cautious about the
neutrality of their information.

e. Identifying Possible Service Overlaps and Potential Collaborative Services.

Related to the four steps already mentioned, it may be valuable to consider an approach that covers related needs
and services of a targeted population. For example, Axford, Green, Kalsbeek, Morpeth, and Palmer (2009), in
their meta-analysis of measuring children’s needs, emphasize the importance of “embracing a multi-dimensional
perspective of need that is rooted in a wider body of research about what human beings, and children in particular,
need” (p. 250). Swenson et al. (2008) focused their needs assessment on collaborative mental health services. Some
issues are increasingly being understood as more multidimensional than previously seen—such as homelessness,
mental health, and poverty (all three are related)—and organizations are working to coordinate their efforts
(Stergiopoulos, Dewa, Durbin, et al., 2010; Stergiopoulos, Dewa, Tanner, et al., 2010; Watson, Shuman,
Kowalsky, Golembiewski, & Brown, 2017).

f. Factoring in the Resources Available for the Needs Assessment.

Finally, while narrowing and focusing the main problems to be addressed in the needs assessment, it will be critical
to take into consideration the resources available for the research that would be required to conduct a full needs
assessment (Soriano, 2012). In identifying and obtaining resources for the assessment, the Catholic Health
Association (2012) suggests some items to consider when planning the budget:

Assessment approach (e.g., purpose, scope, partners, need for consultants)
Data collection and analysis resource needs
Facilitation of collaboration, planning, and priority setting
Report writing and dissemination
Operational expenses, including meeting supplies and communications costs (p. 41)

Summarizing the Pre-assessment Findings. Following the preliminary scoping research, Altschuld and Kumar (2010)
suggest meeting with those who originally suggested that a needs assessment may be necessary and discussing the
next steps. At this point, organize the information collected so far, particularly to consolidate the findings as a
resource that can help with the decisions about whether to create a Needs Assessment Committee (NAC) (to be
discussed shortly) or perhaps a study team. Altschuld and Kumar (2010) suggest creating a 5- to 10-page summary
with a few tables showing the relevant research so far, along with recommendations for next steps, including items
to discuss as part of deciding whether to create an NAC and whether to move further with the needs assessment
process. They suggest the following as discussion items:

What is the scope and size of the area(s) or topic(s) of interest?
Would it be of value to divide the area(s) or topic(s) into sub-clusters or themes?
Would it be useful to collect more information?
Would the collection of new data be warranted?
Will going to a full set of Phase I activities be needed? (i.e., further preliminary data collection)
Should the organization form an NAC, and if so, what individuals and groups should be involved?
What resources might the organization or other stakeholders be willing to commit to the needs assessment?
How much time would a full Phase I implementation require?
What is the importance to the organization of looking at needs and ultimately changing what it does in
accord with what might be learned? (p. 61)

2. Forming the Needs Assessment Committee (NAC)


The purpose of the needs assessment (ex ante versus ex post, and formative versus summative) will influence the
next steps in the process. If it is decided that the project should continue, a memorandum of understanding
(MOU) should be written to cover the formation and composition of an NAC or steering committee, as well as
additional details such as expected time lines, resources, and number and topics of meetings and brief descriptions
of the reports that would be expected at various stages. What is being suggested here is formalizing the terms of
reference for the needs assessment process.

There are a number of reasons to form an NAC. Key stakeholders and collaborators with various kinds of expertise
(e.g., facilitation, research, outreach, and administration) can contribute to the needs assessment, and there is value
in having a diverse range of perspectives and ideas. Resource limitations may dictate that the committee be less
formal, and although we will use that term—following Altschuld and Kumar’s (2010) model—the actual group
may well be decidedly ad hoc and informal, depending on the problem and the organizational capacity and
resources. There are several factors to take into consideration if forming an NAC—how formal or informal it is
and the balance between internal and external members. Where a needs assessment is summative—that is, where
the purpose is to determine continuing relevance and whether the program is continuing to meet priority need(s)
—internal (program or organizational) involvement should not dominate the committee and its work. The
committee typically would have an oversight function with possible subcommittees that have expertise relevant to
the needs assessment at hand. In a fairly comprehensive needs assessment, the formation of the NAC can turn out
to be pivotal to having the project successfully navigate additional research needs, stakeholder engagement, and
organizational buy-in for any recommended changes.

3. Learning as Much as We Can About Preliminary “What Should Be” and “What Is” Conditions From Available Sources

If an NAC has been formed, members may become involved in reviewing the preliminary information that has
been gathered so far, in order to narrow the topic areas before furthering any primary needs assessment research
(e.g., surveys, interviews, focus groups). The research team (ideally a subcommittee of the NAC) may also want to
check more deeply into the availability of records and archives and any common standards and mandates for
assessment. Beyond the secondary database sources mentioned earlier and the existing organizational reports or
evaluations, other examples of useful information at this stage would include the following:

Literature reviews, in particular systematic reviews or meta-analyses that cover the programming area
Organizational data, such as waitlists and referrals (keeping in mind earlier comments about the integrity of
such sources of data); in a community-based needs assessment, waitlists and referrals from all program
providers would be a relevant line of evidence
Relevant government reports

Literature reviews (both published and grey literature sources) can locate existing studies on the area of interest
and “meta-needs assessments,” which synthesize data from previously completed needs assessments on a particular
topic. This can help in planning the research approach to be used in a full needs assessment. As just one example,
Morris, King, Turner, and Payne (2015) conducted a narrative literature review of 28 studies of “family carers
providing support to a person dying in a home setting” and concluded, “there is evidence of gaps and deficits in
the support that family carers receive” (p. 488). Gaber (2000) points out that in some cases, a meta-needs
assessment can be a very useful substitute for an in-depth needs assessment when resources (including time) are
insufficient to conduct one.

Recent reviews of various fields have looked at a wide range of needs. Just a few examples are the following: unmet
supportive care needs of people with cancer (Harrison, Young, Price, Butow, & Solomon, 2009), lay caregivers
needs in end-of-life care (Grande et al., 2009; Hudson et al., 2010), selection and use of health services for infants’
needs by Indigenous mothers in Canada (Wright, Warhoush, Ballantyne, Gabel, & Jack, 2018), cultural
adaptations to augment health and mental health services (Healy et al., 2017), child welfare needs (Axford et al.,
2009; Rasmusson, Hyvönen, Nygren, & Khoo, 2010), and homelessness and mental health care (Watson et al.,
2017).

A literature search may also uncover existing standardized needs assessment instruments. Examples are Patterson et
al.’s (2014) “Sibling Cancer Needs Instrument” or Axford’s (2010) “Conducting Needs Assessments in Children’s
Services.”

Altschuld and Kumar (2010) suggest using tables to track the gathered information at each step, as the group
works to determine (and track) what it knows and what it needs to know to facilitate the next stage of the process.
Depending on who is involved, who the decision makers are, and their availability, the end of the first phase may
overlap with the beginning of the second phase. In particular, the agenda for a “first” meeting of the next phase
becomes part of the transition at the end of this first phase, as the work done so far is consolidated and an agenda
for the further possible work is created.

4. Moving to Phase II and/or III or Stopping


After going through the pre-assessment stage, it may turn out that the group responsible for the pre-assessment
recommends to the decision makers that no further analytical work be undertaken. This could occur for a number
of reasons, such as the unmet needs (the discrepancy) not being large enough to warrant new programmatic
action, or findings on unmet needs that are clear enough that the next stage will be to move directly to making the
necessary program changes. For example, in a summative needs assessment that is focused on whether the
program continues to align with the priorities of the funders/government, the results (degree of
relevance/alignment) could be determined from documentation, interviews with stakeholders, and existing records
of services provided. Whether to continue the program would depend, in part, on its relevance, given these lines of
evidence, and, if not relevant, whether the program can be re-focused to align it better with funding priorities.

At this stage of the pre-assessment phase, the group may also recommend that further research needs to be done,
such as surveys, focus groups, and/or interviews. A summary report should be prepared for the decision makers, if
they have not been included so far, advising them on what has been done so far, the sources of information, and
what has been determined to this point. The report can also include options to consider, ranking them for the
decision makers. Altschuld and Kumar (2010) recommend that the pre-assessment group attend the
presentation(s) to the decision makers, who will be choosing whether to (a) terminate the assessment, (b) further
delve into a prioritized set of needs, or (c) move directly to the post-assessment phase, when planned actions are
communicated and implemented.

Phase II: The Needs Assessment
For the needs assessment phase, Altschuld and Kumar (2010) lay out steps to further identify the discrepancies
between the conditions (what is versus what is desired), prioritize them, consider causal factors where possible, and
begin to identify potential solutions. This second phase can be seen as the core of the needs assessment process and
typically is more time-consuming and costly than the pre-assessment phase.

5. Conducting a Full Assessment About “What Should Be” and “What Is”
When more information is needed beyond the pre-assessment phase, this will typically involve building further
understanding via surveys, interviews, focus groups, or community forums—often using a combination of
quantitative and qualitative methods. These methods are covered in more detail in other chapters of this book, but
some key points will be highlighted here. To guide a meeting intended to organize next steps, Altschuld and
Kumar (2010) suggest including not only a summary of the progress so far but also an agenda organized around
the following questions:

Was enough found out about the areas of interest (the discrepancies) that expensive new data are not
warranted?
Do we have a solid sense of the discrepancies for the three levels (service recipients, providers, and the
organizational system)?
Are we clear and in agreement as to which needs are the priorities of the committee and the organization?
Should the focus shift to what is causing needs and determining final needs-based priorities, taking into
account what the organization can and cannot do?
Should we develop criteria for choosing and/or developing or locating solution strategies?
Do we know enough to jump directly into Phase III? (p. 80)

To facilitate focusing the Phase II assessment activities and before launching into further research, Altschuld and
Kumar (2010, p. 81) recommend crafting a template table (see Table 6.3). The needs assessment committee or
smaller subgroups will begin to work with the table at the first Phase II meeting and can populate the table as the
research continues.

The group can work with the table to generate information that will help them “sort out perceptions and what
underlies them” (p. 81) and guide the further needs assessment research work to be undertaken, whether it be
quantitative or qualitative. As indicated throughout this book, causal analysis is a structured process that involves
examining hypothesized cause–effect linkages and rival hypotheses, and within a needs analysis, one would usually
look to previous research (program theories, meta-analysis, systematic reviews, and case studies) and experience to
inform the “causal analysis” column. An organization’s understanding of causal factors related to its programs can
be iteratively built and tracked over time, based on literature reviews and further research.

Table 6.3 Next Steps

Potential Next Steps in Phase II

Area of Concern | What We Know | More Knowledge Desirable | Causal Analysis (Where Available) | Possible Solution Criteria | Possible Solutions
Area 1 | | | | |
• Subarea 1 | | | | |
• Subarea 2 | | | | |
Area 2 | | | | |
Area 3 | | | | |
Area n | | | | |

Source: Adapted from Altschuld & Kumar (2010, p. 81).

6. Needs Assessment Methods Where More Knowledge Is Needed: Identifying the Discrepancies

In a full needs assessment, research to better identify and analyze the discrepancies, or unmet needs, is a key
component. It is critical to have the resources and expertise to do this well. Like other kinds of evaluation, needs
assessments draw from various methods, such as focus groups, surveys, checklists, and interviews. Often,
combinations of methods are used, and this reflects the importance of triangulating information in needs
assessments (Tutty & Rothery, 2010). The most common qualitative and quantitative research approaches are
mentioned in this chapter but are covered in more depth in other chapters of this textbook.

Using Surveys in Needs Assessment

Surveys have been an important way to gather new data in needs assessments, and surveys of current or future
clients are often used to estimate unmet needs. Prospective or existing program clients can be surveyed to ascertain
their experiences and levels of satisfaction with existing services and can also be queried about gaps in the services
in relation to their perceived needs. Surveys can be a part of evaluations that combine assessments of the
effectiveness of the existing program and, at the same time, gather data to illuminate the perceived gaps between
the existing program and needs.

Developing and conducting a survey is deceptively simple (Axford, 2010). The selection of respondents,
development of a survey instrument that includes valid and reliable measures, and statistical analysis of the results
need appropriate expertise and resources.

To make sure a survey is worth the effort, Altschuld and Kumar (2010) suggest that the NAC confirm not only that
it has access to the expertise needed to advise on that question, but also that a reasonable response rate is likely
and that there are enough resources for the process of survey development, implementation, and analysis.

No matter which type of survey is developed (mailed, online, telephone, in person, or combinations), it must be
carefully planned and executed, including being pre-tested before being fully rolled out. Critically, you must know
the target population:

Finally, a solid knowledge of the social, political, economic, and demographic characteristics of the
community of focus is vital to addressing likely sensitivities and to taking into account both needs and
strengths of populations. It is likewise important to know the cultural and linguistic diversity of
populations of interest, as well as the literacy levels of potential respondents, to prepare valid and
reliable data collection instruments. (Soriano, 2012, p. 70)

Sensitive information may be collected during a needs assessment, including information about gaps in an
individual’s capabilities, so vigilance will be needed for both privacy and ethical considerations. In Sork’s (2001)
questions that guide a needs assessment, he includes, “What are the most important ethical issues you are likely to
encounter during the needs assessment and how will you deal with them?” (p. 109). This is a question to consider
when determining the methods and operational details of the needs assessment.

A key part of any survey or primary data collection from populations is soliciting informed consent from
participants. Briefly, informed consent involves fully disclosing the purposes of the data collection, the risks
involved (if any), and the extent to which individual data will be kept confidential. In Indigenous communities,
conducting needs assessments can involve securing the agreement of community leaders before seeking the
participation of community members in the project.

Given the desirability of being able to generalize survey results to the population for which needs are being
assessed, one potential type of survey is focused on a random or representative sample from a population.
Questions are posed that get at uses of existing services, as well as respondent estimates of the adequacy of services
vis-à-vis their needs. Respondents’ estimates of their uses of existing services can be used to develop demographic
profiles of current service users. Survey-based data on who uses services or programs can be used in conjunction
with census- or other population-based characteristics to develop estimates of the total possible usage of services in
other populations. For example, if 10% of the respondents to a population survey of senior citizens in a region
indicated that they have used Meals on Wheels services in the past year, and if our sample is representative of the
regional population, an estimate of total possible usage of Meals on Wheels for the population in that region can
be calculated. This calculation would be done by constructing a confidence interval around the sample
proportion of users and then multiplying the lower and upper limits of the confidence interval by the total
population of seniors in the region. In our example, if 10% of a survey sample of 500 seniors indicated they have
used Meals on Wheels in the past year, the 95% confidence interval for the population proportion of users is
between .0737 and .1260. In other words, we can be “95% sure” that the true percentage of seniors using Meals
on Wheels in the population is between 7.37% and 12.6%. Of course, this estimate of program usage could be
compared with agency records to get a sense of the degree to which the survey responses are valid.

Suppose we wanted to use this information to estimate the need for Meals on Wheels in another region. If we
knew the population of seniors in that region (e.g., 10,000), we could estimate that the number of seniors who
would use Meals on Wheels is between 737 and 1,260 persons. This approach to estimating need is similar to the
Marden approach for estimating the number of individuals in a population who are at risk of having alcohol-
related problems (Dewit & Rush, 1996).
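The calculation described above can be sketched in a few lines of Python using the standard normal-approximation (Wald) interval for a sample proportion, which reproduces the figures reported above to within rounding. The 10%-of-500 survey result comes from the example in the text; the 10,000-senior region is the hypothetical second region.

```python
import math

def proportion_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) 95% confidence interval for a sample proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Survey example: 10% of a random sample of 500 seniors report using Meals on Wheels.
lower, upper = proportion_ci(0.10, 500)
print(f"95% CI for the population proportion: {lower:.4f} to {upper:.4f}")
# approximately 0.0737 to 0.1263 (the chapter reports .0737 to .1260)

# Estimating possible usage in a second region with 10,000 seniors.
population = 10_000
print(f"Estimated users: {lower * population:.0f} to {upper * population:.0f}")
# roughly 737 to 1,263 persons
```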

One concern with surveys that ask for ratings of the need for services is that there is no constraint or trade-offs
among the number of “priority” services that can be identified. Often, analysts are faced with survey outcomes
that suggest that the number of areas of need is great and that differences among the ratings of the needed services
are small. That is, all of the ratings are skewed to the high end of the rating scale.

An alternative is to ask survey respondents to rank the importance of services, forcing them to prioritize. Although
ranking techniques are limited in their use (most respondents will not rank more than about six choices), in
situations where analysts want information on a limited number of alternatives, ranking is more valid than rating
the choices.

Whatever the focus of the needs assessment, there are resources and tool kits that can provide additional specific
details on the principles and pitfalls of surveys (see, e.g., Altschuld, 2010; Soriano, 2012; White & Altschuld,
2012). A reminder, too, that at every point in the needs assessment process, it is worthwhile to stay informed
about similar national or local efforts that address either matching problem areas or matching population groups
that could be served through partnerships.

Notes on Sampling in Needs Assessment

A key part of being able to defend the methodology of a needs assessment is a defensible sampling procedure. Of
course, sampling is an issue that arises in most data collection activities and, as such, spans all areas
of program evaluation. However, we discuss the needs assessment–related considerations of sampling here.

Ideally, sampling for needs assessment surveys should be random. That means that any respondent has an equal
chance of being selected, and no respondents or groups of respondents have been excluded from the sampling
process. Selecting a random sample requires us to be able to enumerate (list all those who are in the population)
the population and, using one of several methods (e.g., random number tables, computer software), pick our
intended respondents. Where it is not practical to enumerate a population, it may be possible to draw a systematic
sample. Typically, in systematic sampling, an estimate of the total population size is divided by the desired sample
size to obtain a skip factor that is used to count through the list of potential respondents (say, their listing in a
directory), picking cases that coincide with the skip interval. For example, a researcher may decide to interview
every fifth person who comes through the door of a seniors’ center over a period of time. By using a random
starting point in the first skip interval, it is possible to approximate a random sample.
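The skip-factor logic just described can be illustrated with a minimal Python sketch. The function and the visitor sign-in list are hypothetical illustrations of the approach, not part of any cited framework.

```python
import random

def systematic_sample(population_list: list, sample_size: int) -> list:
    """Systematic sample: compute a skip factor and start at a random
    point within the first skip interval."""
    skip = max(1, len(population_list) // sample_size)
    start = random.randrange(skip)          # random start within the first interval
    return population_list[start::skip][:sample_size]

# Illustrative use: a hypothetical sign-in list of 500 visitors to a seniors' center,
# sampled with a skip factor of 5 and a random starting point between 0 and 4.
visitors = [f"visitor_{i}" for i in range(1, 501)]
sample = systematic_sample(visitors, 100)
```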

One concern with systematic samples is that if the population listing is organized so that the order in which cases
appear corresponds to a key characteristic, then two different passes through the population listing that began at
different points in the first skip interval would produce two samples with different characteristics. If client files
were organized by the date they first approached a social service agency, for example, then two different passes
through the files would produce different average times the samples have been clients.

There are several other random sampling methods that result in samples designed for specific comparisons.
Stratified random samples are typically drawn by dividing a population into strata (e.g., men and women) and
then randomly sampling from each stratum. In populations where one group is dominant but the analyst wants to
obtain sufficient cases from all groups to conduct statistically defensible comparisons, stratified sampling will yield
samples that are representative of each group or stratum. Proportionate stratified samples are ones where the
proportion of cases sampled from each stratum (randomly sampled in each stratum) is the same as the relative
proportions of the strata in the population. If women are 25% of a population, a proportionate sample would be
25% women. A disproportionate stratified sampling method is sometimes used where an important group is
relatively small. For example, if a needs assessment for community health services were being conducted in a
region that had 5% Indigenous residents, a disproportionate stratified sample might select more Indigenous
residents than the 5% in the population warranted, in order to permit statistically valid comparisons between
Indigenous and non-Indigenous health needs. Again, this sampling approach would use random sampling within
each stratum.
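As a sketch of how proportionate and disproportionate stratified designs differ only in the per-stratum sample sizes, the Python fragment below draws a simple random sample within each stratum. The data structure, strata, and sizes are hypothetical and only illustrate the logic described above.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, sizes):
    """Random sample within each stratum; `sizes` maps each stratum to its sample
    size, so the same function covers proportionate and disproportionate designs."""
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    return {stratum: random.sample(members, min(sizes.get(stratum, 0), len(members)))
            for stratum, members in strata.items()}

# Hypothetical use: oversample Indigenous residents (5% of the population) so that
# comparisons between groups are statistically defensible.
# sample = stratified_sample(residents, lambda r: r["group"],
#                            {"Indigenous": 200, "non-Indigenous": 400})
```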

The cost of a needs assessment survey will vary with the size of the sample, so it is useful at the outset to have a
general idea of how much precision is desired in any generalizations from the sample back to the population.
Generally, the larger the sample, the more precise the measurement results.

Existing methods for determining sample sizes are awkward in that they force us to make assumptions that can be
quite artificial in needs assessments. To determine sample size (assuming we are going to use a random sample),
we need to know the following:

How much error we are willing to tolerate when we generalize from the sample to the population
What the population proportion of some key feature of our cases is (or is estimated to be), so we can use
that to pick a sample size

Typically, when we conduct a needs assessment, we are interested in a wide variety of possible generalizations from
the sample to the population. We are using the survey to measure multiple constructs that are related to some
agreed-on scope for the needs assessment. We may have decided to conduct a survey and are now interested in
estimating the sample size that we require to accurately estimate the perceived needs for the services in the
population. The methodology of determining sample sizes requires that we assume some population proportion of
need in advance of actually conducting the survey and then use that to estimate our required sample size. In effect,
we have to zero in on one service, “estimate” the perceived need (usually a very conservative estimate) in the
population for that service in advance of conducting the survey, and construct our sample size with respect to that
estimate.

Table 6.4 displays a typical sample size table (Soriano, 2012). Across the top are the expected proportions of
responses to one key item in the needs assessment survey (e.g., the proportion needing home care nursing
services), and down the left side are the percentages of sampling error when we generalize from a given sample
back to the population, assuming our sample is random.

Table 6.4 Sample Sizes for a 95% Level of Confidence Depending on Population Proportions Expected to Give a Particular Answer and Acceptable Sampling Error

Acceptable Sampling Error     Proportion of Population Expected to Give a Particular Answer
(plus or minus %)             5/95     10/90     20/80     30/70     40/60     50/50

 1                            1,900    3,600     6,400     8,400     9,600    10,000
 2                              479      900     1,600     2,100     2,400     2,500
 3                              211      400       711       933     1,066     1,100
 4                              119      225       400       525       600       625
 5                               76a     144       256       336       370       400
 6                                —      100       178       233       267       277
 7                                —       73       131       171       192       204
 8                                —        —       100       131       150       156
 9                                —        —        79       104       117       123
10                                —        —         —        84        96       100

Source: de Vaus (1990), in Soriano (2012, p. 93).

a. Samples smaller than this would be too small for analysis.

Suppose we “guesstimate” that 5% of the population would indicate a need for home care nursing services. That
would put us in the first column of the table. Now, suppose we wanted to be able to estimate the actual (as
opposed to the “guesstimated”) proportion of persons indicating a need for home care nursing to within ±2%. We
would need a random sample of 479 cases.

There is one additional factor that is implicit in Table 6.4. In addition to specifying our desired level of precision
in estimating the population proportion of persons needing home care (±2%), we must recognize that all of Table
6.4 is based on the assumption that we are willing to accept a 95% level of confidence in our generalizations to
the population. That means that even though we might, for example, conduct a needs assessment and estimate
that the actual population percentage of persons needing home care nursing is 7%, with a possible error of ±2%
either way, we would be able to say, with 95% confidence, that in the population, the percentage of persons
needing home care is between 5% and 9%.

Another way to look at this situation is to say that we are only “95% confident” that our estimating process has
captured the true population proportion of persons needing home care nursing. What that implies is that if we
were to do 100 needs assessment surveys in a given community, using sample sizes of 479 each time, in 5 of those
needs assessments, our estimation procedure would not capture the true population proportion, even though our
samples were random each time. Unfortunately, we do not know which of those needs assessments will produce
the misleading results.

Clearly, estimating sample sizes involves assumptions that are quite restrictive and, perhaps, not based on much
information. But to be able to defend the findings and conclusions from a needs assessment, the sampling
methodology must be transparent and consistent with accepted practices.

If you look carefully at Table 6.4, you will see that for any given level of sampling error, as the expected
population proportion that gives a particular answer moves toward 50% (say, a positive response to a survey
question about the need for home care nursing), the required sample size increases. So an evaluator conducting a
needs assessment can avoid having to make “guesstimates” in advance of the survey by assuming that the
population responses will be 50/50. That is the most conservative assumption and is eminently defensible.
However, it also requires much larger sample sizes for all levels of acceptable sampling error.
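
The values in Table 6.4 are close to what the standard formula for estimating a proportion produces, n = z^2 * p * (1 - p) / e^2, where z is roughly 2 for a 95% level of confidence, p is the expected population proportion, and e is the acceptable sampling error. The sketch below applies that formula; it is offered as an approximation of the table, not as the procedure used to generate it, and the example proportions are hypothetical.

```python
import math

def required_sample_size(expected_proportion, sampling_error, z=1.96):
    """Approximate sample size needed to estimate a population proportion
    within +/- sampling_error at the confidence level implied by z."""
    p, e = expected_proportion, sampling_error
    return math.ceil(z * z * p * (1 - p) / (e * e))

# Examples paralleling Table 6.4 (small differences reflect rounding of z).
print(required_sample_size(0.05, 0.02))  # 457; Table 6.4 lists 479
print(required_sample_size(0.50, 0.05))  # 385; Table 6.4 lists 400
print(required_sample_size(0.50, 0.02))  # 2401; the conservative 50/50 assumption
```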

There are a number of nonrandom sampling methods that can be used if random selection is not feasible. They
are also called convenience sampling methods, and as the name implies, they are based on sampling respondents
that are conveniently accessible. Soriano (2012) describes five convenience sampling methods, noting, though,
that “the appropriateness of a convenience sample depends on its representation of the target population. Its
appropriateness can range from totally justified to absolutely inadequate” (p. 83). The five methods are shown,
with brief explanations, in Table 6.5.

There are advantages and drawbacks to each of these methods, and it will be critical to have a defensible argument
to explain why, if nonrandom sampling was used, the sample is still likely to be representative of the population being
assessed. There are also statistical tools to help establish the resulting representativeness of the sample, but that is
beyond the scope of this chapter. Chapter 4 of this textbook provides additional information about survey
methods and measurement validity issues.

Table 6.5 Sampling Methods

Quota sampling: Sampling of a fixed number of participants with particular characteristics. An example would be sampling men and women to reach a quota of 50 persons in each group. Not random.

Systematic sampling: Closest to random sampling; systematically selecting respondents from a large population by, for example, dividing the number of addresses of all households in a population by the number of participants to be sampled and then surveying every nth household (Note: not based on requesting the service).

Interval sampling: Selection of participants using a periodic sequence (e.g., every eighth client who is requesting a service). This is similar to systematic sampling but does not involve a random start point for drawing the sample.

Judgment sampling: Using experts to select a sample of participants that should be representative of the population to be studied. This is also mentioned in Chapter 5 of this textbook. Not random.

Snowball sampling: Begins with a small group of accessible participants and expands as they recruit other participants who would fit the selection criteria. Also mentioned in Chapter 5. This is a key sampling method for qualitative data collection. Not random.

Measurement Validity Issues in Needs Assessment

In Chapter 4, we discussed the validity of measures and defined validity as the extent to which a measure does a
“good job” of measuring a particular construct. Fundamentally, measurement validity is about controlling bias. In
the example that follows, surveying prospective bus riders yields a biased measure of the construct “actual transit
ridership.”

In a community in northwestern Pennsylvania, the local Public Transit Commission was interested in expanding
the bus routes to attract more ridership. There were several areas of the community that were not served by
existing routes, so the commission hired a transit planner on contract to estimate the costs and revenues that
would result from a number of expansion options (Poister, 1978).

Among the methodologies selected by the planner was a household survey that targeted the areas of the city that
were currently not served by public bus routes. One question in the telephone survey asked respondents to
estimate their own projected usage of public buses if this form of transit ran through their neighborhood:

Now, turning to your own situation, if a city bus were to run through your neighborhood, say, within
three blocks of your house, how many times a week would you ride the bus?

_______ Less than once per week
_______ Once per week
_______ Two to three times per week
_______ Three to four times per week
_______ More than four times per week
_______ Would not ride the bus
_______ Don’t know
_______ No response

Survey results indicated that nearly 30% of respondents would become regular users of the buses (three or more
times per week). When the sample proportion of regular users was generalized to the population, expansion of the
bus system looked feasible. The increased ridership would generate sufficient revenue to more than meet the
revenues-to-costs target ratio.

But the transit planner, who had done other studies of this kind, did not recommend the bus routes be expanded.
In his experience, a 30% potential ridership would translate into an actual ridership of closer to 5%, which was
insufficient to meet the revenues-to-costs target ratio. Response bias reflects human nature; there are incentives
(e.g., being seen to be doing the socially desirable thing, or saying yes to wanting a service that would have little or
no personal implementation cost), and few disincentives, for respondents to indicate that they would indeed make
use of increased services. The transit planner was willing to use the survey results in his analysis but was also aware
that they were seriously biased in favor of more transit ridership. His experience allowed him to discount the bias
to a more realistic figure, but another person might not have been aware of this problem, resulting in a service
provision decision that would not have been cost-effective.

Our transit planning example suggests that using surveys to estimate needs is not straightforward. In addition to
the biases that crop up when we ask people about programs and services that they need or want, the ways that
instruments are put together (sequencing of questions, wording of questions) and the ways that they are
administered can also affect the validity of the information we collect. In Chapter 4, we discussed these concerns.

In general, it is important to keep in mind that needs assessments are subject to many of the threats to
measurement validity that we discussed in Chapter 4. In conducting needs assessments, we must do our best to
control for these elements of bias. Calsyn, Kelemen, Jones, and Winter (2001), for example, published an
interesting study of one common element of response bias in needs assessments—over-claiming of awareness of
agencies by current and prospective clients. In needs assessments, a respondent’s awareness of a particular agency is
often used as a measure of their use of—and therefore need for—that agency. However, for reasons such as age or
a desire to appear well informed, survey participants often claim awareness of agencies of which they do not, in fact,
have any knowledge. The study by Calsyn et al. (2001) concluded that one of the best ways to discourage such
response bias is to warn respondents ahead of time that the list of agencies being used in the needs assessment
contains the names of fictitious as well as real agencies. This warning tends to make respondents more cautious
about their answers and produces more accurate estimates of agency awareness.

Qualitative Methods in a Needs Assessment

Interviews, focus groups, and community forums are common qualitative methods used in needs assessments and
can be included either as a building block toward a survey or as a way to understand what survey results mean for
stakeholders. In the community health needs assessment case presented at the end of this chapter, large-scale
surveys had been done in the province of New Brunswick prior to the needs assessment, yielding both
demographic and perceptual findings on primary health care services. Qualitative lines of evidence were used to
interpret and weight the findings from the surveys and other statistical data sources.

Qualitative exploratory research can also be used to narrow the focus of inquiry, which then guides quantitative methods such
as surveying. For example, in a study targeting health care needs assessments (Asadi-Lari et al., 2003), the
researchers, in conjunction with a literature review, expert opinions, and discussions with medical staff, conducted
semi-structured interviews with 45 patients before developing a questionnaire to assess the health care needs of
patients with coronary artery disease.

Chapter 5 provides guidelines on conducting qualitative evaluations. And, as mentioned earlier, there is a growing
pool of resources, both in books and online, of needs assessment research methods for specific fields, such as
education, health care, housing, justice, and community infrastructure. Here, we will briefly summarize an
example where qualitative methods were used to conduct a rapid needs assessment in a neighborhood of
Johannesburg, South Africa (Lewis, Rudolph, & White, 2003).

An Example of a Qualitative Needs Assessment

The needs assessment of health promotion needs in the Hillbrow neighborhood of Johannesburg (Lewis et al.,
2003) offers an example of an approach to needs assessment called “rapid appraisal.” It also illustrates how
triangulation can be used in this kind of project.

The researchers in Hillbrow conducted a needs assessment where they consulted with and involved the
community itself, in an effort to make the conclusions as relevant as possible to local needs. “Rapid appraisal” is
designed to “gain insights into a community’s own perspective on its major needs, then to translate these into
action and, finally, to establish an on-going relationship between service providers and local communities” (Lewis
et al., 2003, p. 23). Problems they encountered when using this approach included issues of measurement
reliability and validity, which the researchers attempted to address through triangulation, using a four-step
methodology.

Step 1 involved a review of the available written records concerning the neighborhood. These had been produced
by institutions outside the community itself and were incomplete and questionable in their accuracy. Step 2
focused on fleshing out this background information with a series of semi-structured interviews with a small
number of key stakeholders who worked in the neighborhood and were in positions of influence. One issue that
emerged from these interviews was the lack of community engagement with youth and women, so Step 3 involved
two focus group discussions, one with area youth (14 participants) and the other with area women (12
participants). The main intent of these discussions was to get local people’s views on some of the issues raised in
Steps 1 and 2. These focus groups were facilitated to allow participants to direct the discussion and to focus on the
issues that were of greatest importance to them.

Step 4 was designed to create an opportunity for stakeholders and members of the neighborhood to consider the
information gathered and the issues raised in Steps 1 to 3 and to attempt to reach agreements about possible
courses of action. A community workshop was held with more than 80 participants, made up not only of
community representatives but also of service providers and decision makers. Emphasis was placed on ensuring
that all participants felt involved in the discussions and that divergent views were fully expressed. To start, key
messages from the earlier stages of the needs assessment were presented by representatives of the focus groups,
followed by a breakout of all the participants into smaller groups to discuss the issues. Each small group focused
on one predominant theme that emerged from the overall assessment, such as crime, Hillbrow’s physical
environment, or cultural offerings. These smaller groups helped participants interact and understand each other’s
viewpoints. In the final stage of the workshop, each group reported their key conclusions back to the whole group,
and together, participants worked to develop an action plan.

The authors of the study report that the “rapid-appraisal” methodology allowed them not only to gain many
different perspectives on the unmet health and social needs of Hillbrow but also to further the involvement of
community members and build partnerships for future action.

Qualitative methods can elicit ideas on interpretation of existing information, perspectives that may have been
missed, or ideas about how trends and the internal and external context affect both unmet needs and possible
solutions. Analysis of risks and resources can also figure into these considerations.

Once all the data have been collected for a needs assessment, the qualitative and quantitative results need to be
analyzed and summarized in a way that makes the information accessible and understandable, so that the team can
work together to prioritize the needs that the organization will consider addressing. Prioritization is the
next step of the needs assessment process.

7. Prioritizing the Needs to Be Addressed


Ideally, prioritizing needs is both evidence-based (which needs are most strongly reflected in the lines of evidence
that have been gathered and analyzed?) and values-based. If a needs assessment is conducted as part of a
summative review of an existing program, existing program objectives/outcomes will guide any prioritization of
needs—the driving question in such situations is usually linked to the continued relevance of the program (does it
continue to address a high-priority need?), its effectiveness (is the existing program effective in meeting the
need?), or possibly its cost-effectiveness.

In a formative needs assessment, where an existing program is being reviewed for ways of improving it, identified
needs will typically relate to ways of extending either the scale or scope of the program. Where a needs assessment is
conducted to identify needs with a view to designing new programs, prioritizing needs will usually rely on the
evidence from the study, weighted by the values (including political considerations) of stakeholders. Where there
is a client (one agency, for example) for a needs assessment, prioritizing needs will take into account that
perspective.

Fundamentally, credible needs assessments typically take a pragmatic approach in prioritizing needs: Priorities are
identified by the project work, and then those, when filtered through (often competing) value lenses, get ranked
for discussion purposes.

Incorporating criteria from the 2008 North Carolina Community Assessment Guidebook, Platonova, Studnicki,
Fisher, and Bridger (2010, p. 142) create examples of criteria to consider (such as "magnitude of health problem"
and "trend direction"), and they suggest what sorts of questions to ask when prioritizing needs and beginning to
consider solutions for community health needs. The questions are based on a literature review of criteria used for
prioritizing health issues and have been adapted for Table 6.6.

Table 6.6 Prioritizing Needs: Applying an Evidence-Based and a Values-Based Lens

Criteria for Helping Prioritize List of Needs: Discussion Items (Assessment Criteria and Possible Questions to Ask)

Evidence-based Questions

Magnitude of the problem: What percentage of your population does the problem affect?
Cost-effectiveness: Are the results worth financial investment?
Trend direction: Has the trend improved or worsened in the past 5 years?
Magnitude of difference to like jurisdictions or regions: How much worse is the problem in your jurisdiction compared with similar jurisdictions in your province or state?
Funds available: Are there sufficient funds available to address an issue?
External directives: Are there mandates, laws, or local ordinances that either prohibit or require you to address a certain issue?
Seriousness of consequences: Does the problem cause severe illness and/or premature deaths?

Values-based Questions

Community acceptability: Is the intervention consistent with the community values?
Prevention potential: Does the intervention keep people well?
Political pressure: Is the issue driven by populist feelings?

Source: Table adapted from Platonova, Studnicki, Fisher, and Bridger (2010, p. 142).
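
One simple way to operationalize criteria like those in Table 6.6 is to have the needs assessment committee score each candidate need against each criterion and then rank the weighted totals. The sketch below is purely illustrative: the needs, weights, and scores are hypothetical, and neither Platonova et al. (2010) nor this chapter prescribes reducing the values-based discussion to arithmetic.

```python
# Hypothetical multi-criteria scoring: each need is rated 1-5 on each criterion
# by the committee, and criterion weights reflect how much each should count.
criteria_weights = {"magnitude": 3, "trend_direction": 2,
                    "seriousness": 3, "community_acceptability": 2}

candidate_needs = {
    "home care nursing":   {"magnitude": 4, "trend_direction": 3,
                            "seriousness": 4, "community_acceptability": 5},
    "youth mental health": {"magnitude": 5, "trend_direction": 4,
                            "seriousness": 5, "community_acceptability": 4},
    "food security":       {"magnitude": 3, "trend_direction": 4,
                            "seriousness": 3, "community_acceptability": 5},
}

def weighted_score(scores):
    """Sum of criterion scores multiplied by their weights."""
    return sum(criteria_weights[criterion] * score
               for criterion, score in scores.items())

for need, scores in sorted(candidate_needs.items(),
                           key=lambda item: weighted_score(item[1]), reverse=True):
    print(f"{need}: {weighted_score(scores)}")
```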

8. Causal Analysis of Needs


Identifying and prioritizing needs is an important part of focusing on possible policy and program options for
development or adjustments. To support policy and program designs that yield appropriate interventions, causal
analysis of needs (what causes needs and what to do to meet needs) is an asset. Often, there are studies (both
published and unpublished) that examine how particular needs develop and also describe and assess options to
intervene to address a need. For example, if we look at the ongoing crisis with the use and misuse of opioids—we are
writing this book in 2018 in the midst of an international epidemic of overdoses and deaths from fentanyl and
related (and very powerful) opioids—there is a widely recognized need to address this problem and find ways of
mitigating both the uses of these drugs and the attendant incidence of overdoses and deaths (with the costs to
families, social service agencies, law enforcement agencies, and health agencies). A growing body of literature is
emerging that is examining the origins of this crisis and ways that it might be addressed (Alcoholism and Drug
Abuse Weekly, 2016; Barry, 2018; Beletsky & Davis, 2017).

A similar situation exists for the problem of homelessness, where there is an emerging consensus that housing first
programs are relatively effective in addressing chronic homelessness as a social, health, and law enforcement
problem (Aubry et al., 2015; Padgett, Henwood, & Tsemberis, 2016; Tsemberis, 2010). What complicates
addressing homelessness and many other high-profile needs-related problems is their inherent multi-jurisdictional
nature. Part of the challenge then is mobilizing and coordinating agencies that must reach across organizational
and jurisdictional boundaries to effectively address the causes of such problems. The phrase "wicked problems" has
been coined to characterize public policy challenges that are inherently complex (Head & Alford, 2015).

More generally, causal analysis involves taking advantage of literature-based and other sources of information to
develop an understanding of the causes and the consequences of a need so that appropriate interventions can be
designed. As well, lines of evidence that are gathered as part of a needs assessment (particularly via service
providers) can be a good source of context-specific understandings of how a need has developed and how it might
be addressed. The growing importance of theory-based evaluations, in which program theories are specified and
then matched with patterns of evidence to see to what extent the theory is supported, fits well
with understanding needs from a causal/theoretical perspective.

9. Identification of Solutions: Preparing a Document That Integrates Evidence and Recommendations


Typically, needs assessment reports are layered—that is, constructed to make it possible for users to obtain varying
levels of detail as they use the report. Altschuld and Kumar (2010), Soriano (2012), and online resources such as
Manitoba's Community Health Assessment Guidelines (Community Health Assessment Network of Manitoba,
2018) offer additional guidance on how to prepare a needs assessment report, including syntheses of the lines of
evidence used to measure needs, gaps between needs and current services, and the ways that trends will affect
needs into the future.

A needs assessment report should have at least these seven sections:

1. The executive summary is usually two to three pages in length and is focused mainly on the key findings and
recommendations from the study. Typically, the executive summary is intended for those who do not have
the time to read the whole document. The executive summary may be the source of text to be incorporated
into future documents that have requests for new or continued funding. It is usually simpler to write the
executive summary after having written the full report.
2. The introduction states the purposes of the needs assessment, including the key questions or issues that
prompted the study. Recalling our earlier mention of the importance of identifying the problem, the
introduction is where the problem(s) driving the needs assessment are summarized. Suspected needs gaps
would be identified as well—usually these are part of the problem statement for the needs assessment. This
part of the report also should include relevant contextual information: community or population
characteristics; the history of the program and the agency delivering it (where an existing program is being
reviewed), including its current and historical place in serving individuals in need; and other relevant
contextual factors. An important part of the report is a summary of how the needs assessment project was
governed—e.g., how a Needs Assessment Committee was formed and how members were involved in the
project. Names and organizational affiliations of NAC members can be included in an appendix to the
report.
3. The methods and lines of evidence section describes how information from different sources (both qualitative
and quantitative) was collected and used to address the key questions or issues that drive the study.
Often, it is worthwhile including a table that summarizes the questions that are driving the assessment and
the lines of evidence that address each question. Typically, such a table would list the questions as rows and
the lines of evidence as columns. For each question, we can show which lines of evidence address that
question. This section would also mention the data collection instruments, the methods used to collect the
data (including sampling methods), and a discussion of how the data were analyzed. Appendices can be used
to include the data collection instruments or provide more details on sampling and other related issues.
4. The findings section succinctly summarizes what we have learned from the lines of evidence that have been
gathered. Typically, this section begins with a description of the participants (how many, core demographics
if appropriate) for each line of evidence. Central to this section are succinct discussions of the findings for
each needs assessment question, organized by lines of evidence. Visual displays of information (graphs and
charts) are superior to tables for most audiences. Even bivariate relationships between variables can be
displayed graphically, in preference to cross-tabulations. Keep in mind that in this part of the report, we are
describing the findings as they relate to each of the needs assessment questions.
5. The discussion section is where the findings are interpreted. Soriano (2012) says that in this section, "the
writer has liberty to take meaning from the results of the study” (p. 178). This section threads together the
findings, weights them, and offers overall statements that address each of the questions that motivated the
study. It can also point to any limitations of the study, particularly limitations on generalizability that affect
whether we can extrapolate the findings from the population of interest to other population groups or other
geographic areas. In sum, the study report interprets the findings and summarizes the evidence that will be
relevant for stakeholders.
6. The conclusion and recommendations section offers advice to decision makers based on the study. Mandates
of needs assessments sometimes preclude making recommendations—negotiating the scope of the final
report is an essential part of the up-front work as the NAC and the project team are formed and the terms of
reference for the project are negotiated. Where recommendations are expected to be a part of the scope of
the project, they must be based on evidence from the study and offer ways of connecting the findings and
the conclusions of the study to policy or program options. Recommendations also need to be appropriate for
the context.

One of the advantages of creating a NAC is to have stakeholders at the table who can help move the needs
assessment forward to program development or program change and its implementation. Framing
recommendations that are broadly consistent with resource expectations will help further implementation.
For each recommendation, it is often desirable to summarize the advantages and disadvantages of
implementing it. Alternatively, for each recommendation, a rationale is offered, based on the evidence in the
report. If there have been recommendations, they will be the principal part of the executive summary and
need to be written in plain language.
7. Often, needs assessment reports include appendices that offer stakeholders more detail on methods used, data
sources, and analyses. Appendices permit a more detailed layering of the report for decision makers who
want to see these details. Large-scale needs assessments may include additional technical reports where
individual lines of evidence have been gathered and analyzed (e.g., a large-scale survey).

The process of drafting a report is iterative. Typically, as a first draft is completed, it is made available to members
of the NAC. The draft can be reviewed and fine-tuned to substantiate its overall credibility and defensibility, and
the recommendations discussed. Where users of the report are included in the NAC, their awareness of the process
will foster a “no-surprises” report and possibly enhanced subsequent engagement in any changes to be
implemented.

10. Moving to Phase III or Stopping


All the work done by the NAC should put the final decision-making team (which might be a sub-set of the NAC)
in a better position to decide on the “post-assessment options,” which can include the following (Altschuld &
Kumar, 2010):

Discontinue the needs assessment at this point, as the case is not sufficiently compelling
Carry out further research into the prioritized needs and potential solutions
Begin designing the action plan for the organization to address the specified high-priority needs, based on
the chosen recommendations

Credible research is necessary but not sufficient for an effective needs assessment process. Effective communication
of conclusions and recommendations involves producing a readable report and also includes appropriate
communication with stakeholders.

Building a support coalition for a needs assessment, principally via the NAC, makes it possible to engage
stakeholders throughout the whole process so that as findings, conclusions, recommendations, and final choices
are tabled, there are no or, at least, few surprises. In the same ways that Patton (2008) suggests that evaluators
engage with stakeholders to ensure that evaluation results are utilized, needs assessment teams can do the same
thing.

Altschuld and Kumar (2010) point out that while “needs assessment is a mechanism by which organizations
change, develop and learn" (p. 115), there may be significant organizational changes that include restructuring, due
to redirection of priorities. This makes it all the more important that a NAC remain involved and be particularly
“cautious and sensitive” (p. 115). They recommend that for the implementation phase, the NAC be augmented, if
appropriate, with additional managerial representation from the organization affected by the results, to ensure the
changes are “less outsider led” (p. 115). This option is relevant where a needs assessment is summative in its intent
and the results point to organizational changes that affect resources, jobs, or even careers.

Phase III: Post-Assessment: Implementing a Needs Assessment

11. Making Decisions to Resolve Needs and Select Solutions


The key objective of Phase II is to provide sufficient guidance to the decision makers so that they can effectively
consider the evidence base and the overall context in identifying and finalizing solutions. Thus, the end of the
“assessment” phase entails having the team meet with the decision makers to go over the findings to this point and
engage them in a discussion of the options, to work with them to determine the next steps. While we do not want
to go into a great amount of detail on the selection and implementation issues, questions to be asked include the
following:

What features of a solution will make it work in our situation?
What kinds of new or changed skills on the part of staff may be required?
How long might a start-up period be?
What are the short- and long-term outcomes?
How much will it cost to begin the solution, bring it up to a satisfactory level of performance, and maintain
a high quality of delivery/implementation?
How will different solutions affect the problem? (Altschuld & Kumar, 2010, p. 123)

Scaling the recommendations so that they are realistic, given the context, affects implementation. Needs
assessments that recommend adjustments to existing programs and services will typically be easier to implement
than recommendations that point to new programs or greatly expanded programs. Much of political decision
making is incremental, so needs assessments that are attuned to that fact will typically have recommendations that
are more likely to be implemented.

The decision makers may want to see further research done or have additional questions answered before
proceeding to the solution-oriented part of the process, if they feel that there is not enough firm evidence about
the needs and their causes, or the feasibility of addressing the needs with modified or new solutions. Depending
on the context, program or policy design alternatives may be available from the research that has been done, as
well as from the experience and expertise of the stakeholders involved. If the results of the needs assessment require
a new or untried solution, it may be necessary to elaborate solution options through further research.

12. Developing Action Plans


At this point, change begins for the organization as solution decisions have been made by the decision makers.
Altschuld and Kumar (2010) point out that while “needs assessment is a mechanism by which organizations
change, develop and learn” (p. 115), there may be significant organizational changes that may include loss of jobs
and restructuring, due to redirection of priorities. This makes it all the more important that a NAC remain
involved and be particularly “cautious and sensitive” (p. 115). Altschuld and Kumar (2010) recommend, however,
that for the implementation phase, the NAC should be reconfigured to include additional managerial
representation from the organization(s), to ensure the changes are “less outsider led” (p. 115). In some cases, it will
be advisable to pilot the program or policy solution before rolling out a full-scale change. Logic modeling,
discussed in Chapter 2, can facilitate designing and evaluating a pilot program or policy, based on the causal
analysis described earlier.

13. Implementing, Monitoring, and Evaluating


In much the same way that policy or program implementation would be expected to be followed by evaluation in
the performance management cycle, needs assessment projects that result in new or changed programs can be
followed up by performance monitoring and program evaluation. When we introduced the open systems model of
programs in Chapter 1, we suggested that program outcomes can be connected back to needs to see whether the
gap has been closed. The Catholic Health Association (2012) addresses “making the implementation strategy
sustainable over time” and has the following advice:

The implementation strategy should be dynamic. It will need to be updated as new information
becomes available: changes in community needs, changes in resource availability, and the effectiveness
of the implementation strategy and supporting programs.

Set time frames for periodic review of information about the community.
Monitor availability of resources required to carry out the implementation strategy.
Make evaluation part of the implementation strategy and all supporting community benefit programs.
Have in place processes that will ensure that evaluation findings are used to improve the strategy and
supporting programs. (p. 21)

Because needs assessments are intended to address gaps in policies or programs, support the development of new
programs or policies, support changes to existing programs or policies, or determine the continuing relevance of a
program in relation to current priorities in a jurisdiction, evaluating the implementation could be either formative
or summative. One issue that was identified earlier in this chapter was the unintended effects of stopping a
program where it turns out that that program was causally linked to other programs in ways that supported clients
indirectly. An evaluation would be a useful way to assess both direct and indirect effects of program changes—
implementing a new program or modifying or even eliminating an existing program.

In the next section, we provide an example of how one needs assessment approached the key steps of the pre-
assessment and assessment phases of the framework we have outlined in this chapter.

Needs Assessment Example: Community Health Needs Assessment in New
Brunswick
The framework that we have adapted (Altschuld & Kumar, 2010) is intended primarily for community-level or
organizational needs assessments. We will summarize a needs assessment that was conducted as part of a province-
wide initiative in New Brunswick, Canada (Government of New Brunswick, 2016), to improve primary health
care services. In presenting this case study, we note that the framework we have outlined in this chapter is more
comprehensive than the approach taken in New Brunswick. In general, for needs assessments, the context and
purposes will affect the process, including the combinations of pre-assessment, assessment, and post-assessment
steps.

Background

New Brunswick is on the east coast of Canada in a region that is experiencing economic struggles as the
demographic profile ages and as younger persons and families seek economic opportunities elsewhere in the
country. New Brunswick borders on Maine to the south and west, and there is considerable commerce and traffic
between the two jurisdictions. New Brunswick depends on natural resource extraction (forestry and mining) and
tourism, although it also has a major seaport and oil refinery (in the city of Saint John). It is at the eastern end of
a railroad network that extends across Canada.

The Community Health Needs Assessments were started in 2012 to address ongoing problems with existing
health services provided in the province. Among the problems identified province-wide were high proportions of families
without a family doctor; high levels of chronic diseases; and low access to primary care health services
(Government of New Brunswick, 2013, p. 3). The focus was on improving primary care services and, in doing so,
taking a population health perspective (focusing on the social determinants of health) to improve preventive
services and stabilize or reduce utilization of existing hospital and medical services. The theory of change that was
embedded in the social determinants of health perspective that underpinned the New Brunswick initiative
references 12 factors that influence the health outcomes of individuals, families, and communities: income and
social status; social support networks; education and literacy; employment and working conditions; physical
environment; biology and genetic environment; personal health practices and coping skills; healthy child
development; health services; gender; social environment; and culture (Government of New Brunswick, 2013, p.
5).

Health services are a factor but are not considered to be critical overall in predicting health outcomes in
comparison to other social determinants. The New Brunswick framework is adapted from a model developed by
the University of Wisconsin’s Population Health Institute (2018). That model weights the relative importance of
the determinants of health: health services, 10%; health-related behaviors, 40%; social and economic factors, 40%;
and physical environment, 10% (Government of New Brunswick, 2016, p. 9).
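
As a purely illustrative sketch of how such weights might combine, the snippet below forms a weighted composite from hypothetical domain scores (0 to 100, higher is better). The Wisconsin model itself is a conceptual weighting of determinants, so the calculation and the scores here are assumptions made only to show the arithmetic.

```python
# Weights cited in the text; the community's domain scores are hypothetical.
weights = {"health_services": 0.10, "health_behaviors": 0.40,
           "social_economic_factors": 0.40, "physical_environment": 0.10}

community_scores = {"health_services": 72, "health_behaviors": 55,
                    "social_economic_factors": 48, "physical_environment": 80}

composite = sum(weights[domain] * community_scores[domain] for domain in weights)
print(f"Weighted composite score: {composite:.1f}")  # 56.4 for these illustrative scores
```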

The Community Health Needs Assessments have been overseen by the two regional health authorities in the
province. In each of those two health regions, needs assessments have been conducted for each community in their
catchment area—a total of 28 communities in the province (Government of New Brunswick, 2016, p. 10). The
Nackawic, Harvey, McAdam and Canterbury Area (Community 23) Community Health Needs Assessment was
conducted in 2016.

Community 23 is located in the southwestern part of the province and shares a boundary with Maine. It is mostly
rural with a number of small villages. The total population was 11,266 in the 2011 Census, which was a decrease
of 1% since 2006. Seventeen percent of the population lives in low-income households (Government of New
Brunswick, 2016, p. 10). The needs assessment report summarizes in a table the data for the incidence of chronic
diseases over time—this was one area that was addressed in the needs assessment:

Data from the Primary Health Care Survey of New Brunswick (2014) shows rates for many chronic
diseases increasing between 2011 and 2014 in the Nackawic, Harvey, McAdam, Canterbury Area
[Community 23]. The data shows separate rates for the Nackawic, McAdam, Canterbury Area and the
Harvey Area (region represented by the postal code E6K). Especially concerning are the increasing rates
of asthma, depression, cancer, heart disease, chronic pain, and emphysema or Chronic Obstructive
Pulmonary Disease (COPD) (Government of New Brunswick, 2016, p.11).

The Needs Assessment Process

Focusing the Needs Assessment


The community needs assessments in New Brunswick were all part of a province-wide initiative to improve
primary health care, so the mandate for each needs assessment was substantially determined by common terms of
reference across the province. Each community had some latitude to identify local needs (gaps between what is
provided and what is needed), as well as ways of identifying existing and prospective ways that health-related
providers link with each other as they serve their clients. Part of identifying gaps was to look at potential
collaborations that were currently not being realized as ways of addressing those gaps. Overall, the province has
recognized that the costs of health care services are a major issue, so improving primary health care services is
linked to a desire to control hospital and medical facility costs.

Forming the Needs Assessment Committee


The health region in which Community 23 is located is one of two health regions in the province. The Horizon
Health Network serves the primarily English-speaking population in the province, while the Vitalité Health
Network serves the primarily French-speaking population in the northern part of the province. Horizon Health
Network has a needs assessment team (a Community Health Assessment Team) who took the analytical lead in
each of the community needs assessments in that part of the province. The Horizon team initiated the local
process in Community 23 in 2015 by soliciting members for a Management Committee. Members of the
Management Committee included regional and local leaders within the Horizon Health Network organization
who live/work in Community 23 and have in-depth knowledge of that community, the programs being provided,
and an understanding, based on their knowledge and experience, of the issues and challenges in the community.

The Management Committee and the research team together compiled a list of possible members for a
Community Advisory Committee (CAC) that would function as the main reviewing, decision-making, and
communications body for the Community Health Needs Assessment. An important criterion for reaching out to
solicit prospective members of the CAC was ensuring that there was representation related to each of the 12 social
determinants of health. A total of 20 persons were included on the committee, representing health care (including
primary care), social services, political leaders, schools, recreation, community development, employment, mental
health, and volunteer sectors. The CAC members committed to regular meetings with the Horizons research team
to participate in the methodological and data collection steps in the needs assessment, as well as the data
interpretation once all the lines of evidence had been collected. The CAC also acted as a link between the team
and stakeholders in the community. More specifically, their roles were to

attend approximately five two-hour meetings
perform a high-level review of currently available data on the Nackawic, Harvey, McAdam, Canterbury Area
provided by the CHA Team
provide input on which members of the community should be consulted as part of the CHNA
review themes that emerge through the CHNA consultation process
contribute to the prioritization of health and wellness themes. (Government of New Brunswick, 2016, p.15)

Learning About the Community Through a Quantitative Data Review


The common mandate for the community health needs assessments in the province made it possible to compile
community profiles based on existing secondary statistical data sources (federal and provincial), and make those
available to the Community 23 NAC. Combined with two province-wide primary health care surveys (2011 and
2014) that each sampled 13,500 residents in telephone interviews, the CHA research teams had access to
summaries that included information about: environmental quality; community safety, injury prevention; family
and social support; income; employment; education; stress; sexual activity; alcohol and drug use; eating habits and
physical exercise; satisfaction with health-related services; perceived quality of services; and perceived access to
services.

The research team prepared summaries based on these and other variables, which were reviewed and discussed in
meetings with the CAC. Patterns were identified, and questions about possible needs were framed, based in part on
comparisons with other communities (each community had a set of reference communities that functioned as peer
comparisons). Since the overall focus was on primary health care improvement from a social determinants of
health perspective, the questions were a first cut at identifying issues related to the 12 factors that
needed further exploration with other lines of evidence.

Learning About Key Issues in the Community Through Qualitative Interviews and Focus Groups


The questions coming out of the quantitative data review were an important input into how the qualitative lines
of evidence were collected. Key informant interviews were done with primary health care–related stakeholders
including professionals in mental health and addictions, seniors support, child and youth care, recreation, social
services, clergy, and health center staff in the three largest villages (Nackawic, Harvey, and McAdam).

Focus groups were held in the four main villages in Community Area 23 to solicit both professional and lay
perspectives on primary health care–related issues. These sessions (6–10 persons each, lasting up to 2 hours) were
moderated by a Horizons Health research team member, and all dialogue was audio recorded and later transcribed
to facilitate qualitative data analysis. Thematic analysis was done to identify the issues raised in the focus groups
(both the content and frequency of issues mentioned), as well as suggestions about actions that should be taken to
address gaps or other shortcomings in primary health care services.
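
As a minimal, hypothetical sketch of the frequency side of such a thematic analysis, the snippet below counts how often each code was applied across transcribed focus group segments; the codes and segments are invented and stand in for whatever coding scheme the analysts actually used.

```python
from collections import Counter

# Hypothetical coded transcript segments: each focus group excerpt has been
# tagged by the analyst with one or more theme codes.
coded_segments = [
    {"access to physicians", "wait times"},
    {"youth mental health"},
    {"access to physicians", "transportation"},
    {"food insecurity", "income"},
    {"youth mental health", "school programs"},
]

theme_counts = Counter(code for segment in coded_segments for code in segment)
for theme, count in theme_counts.most_common():
    print(f"{theme}: {count}")
```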

Triangulating the Qualitative and Quantitative Lines of Evidence


The Community Health Assessment Team from Horizon Health took both the qualitative and quantitative results
and compared the patterns to see what was shared between the two approaches, as well as what differed.
Summaries were prepared of issues identified and were presented to the CAC for a review of the findings. Findings
from complementary lines of evidence were tabled to facilitate a discussion of how to interpret the overall findings.

Prioritizing Primary Health-Related Issues in the Community


The CAC took the lead in an exercise to prioritize the issues that were most important in improving primary
health care in the community. The priorities are need-related, but in most cases, more work was identified as the
next step(s) to refine them to a point where they are actionable. In other words, the needs assessment process has
identified categories of needs but has not (yet) specified the particular interventions to address them.

Table 6.7 summarizes the priorities, the recommendations, and the population health factors that are relevant for
each one.

Table 6.7 Priorities, Recommendations, and Population Health Factors From the Nackawic, McAdam, and Canterbury Area (Community 23) Health Needs Assessment

Priority: A decrease in mental resiliency and coping skills among children and youth in the community
Determinants: Social support networks; social environment; healthy child development; personal health practices; and coping skills
Recommendation for next steps: Further consult with parents, educators, and mental health professionals about the types of mental resiliency and coping skills that children and youth are missing and, through partnerships, develop a plan to fill these learning gaps in the community.

Priority: The need to review the way in which mental health and addictions services are currently being delivered in the community to improve access to these services
Determinants: Income and social status; social support networks; employment and working conditions; personal health practices and coping skills; health services
Recommendation for next steps: Further consult with mental health professionals, health centre staff, and primary health care providers working in the community to determine what additional services are needed. Review outcomes with Horizon's Mental Health and Addictions leadership to determine how best to fill these gaps in service.

Priority: Food insecurity in the community
Determinants: Income and social status; education and literacy; employment and working conditions; physical environment; personal health and coping skills; and healthy child development
Recommendation for next steps: Working with key community partners, review the various elements of food insecurity affecting the community and develop a plan of action.

Priority: The need for improved supports in the community for families who are struggling and experiencing difficulties
Determinants: Social environment; income and social status; healthy child development; personal health practices and coping skills; social support networks; and employment and working conditions
Recommendation for next steps: Using a multi-sector approach that includes family support services, public health, educators, and community partners, revisit the current model of providing family support services and develop a more up-to-date approach to provision that better aligns with the challenges being faced by families in the community today.

Priority: The need to enhance collaboration between health centre staff, allied health professionals, and other partners
Determinants: Social environment; physical environment; and health services
Recommendation for next steps: Initiate a working group with staff and leadership representation from the Nackawic, McAdam, and Harvey health centres, the Dr. Everett Chalmers Regional Hospital (DECRH), other health care providers, as well as community partners, to develop a plan to improve communication and collaboration between these groups.

Priority: The need for more consistent access to physicians and nurse practitioners in the community to improve continuity of care
Determinants: Health services; social environment; and physical environment
Recommendation for next steps: Review current access issues, wait lists, and the status of the primary health care provider pool in the community and, working with Horizon and community leaders, determine a strategy to maintain and improve access to primary health care services in the community.

Priority: The need for more preventive, educational-type programming and services
Determinants: Social environment; personal health practices and coping skills; education and literacy; and health services
Recommendation for next steps: Review current access issues, wait lists, and the status of the primary health care provider pool in the community and, working with Horizon and community leaders, determine a strategy to maintain and improve access to primary health care services in the community.

Source: Adapted from the Government of New Brunswick (2016), pp. 22–28.

This needs assessment was carried to a Phase II endpoint in terms of our needs assessment framework. What still
needs to be done is to make the priorities and recommendations specific enough to make changes in programs
(adjusting existing programs, adding new program components). What is also evident is the emphasis on working
across community providers to realize benefits from coordination of existing programs and services. No resource
implications of the priorities and recommendations are explicitly indicated—given the overall budgetary situation
in the province of New Brunswick, and given a model in which only 10% of the effects on health outcomes are
attributed to health services, improving primary health care may amount to changes that are diffused across the
public and even private sectors. As well, it is evident that the "personal health and coping skills" of residents in
Community Area 23 are an important factor in improving primary health care.

If we review this needs assessment case in relation to the steps in a needs assessment, the New Brunswick case
generally follows the first 10 steps of the framework we discussed earlier in this chapter. In particular, the New
Brunswick community health needs assessment was strong on “forming the needs assessment committee” and
using both quantitative and qualitative lines of evidence to identify priorities and suggest next steps. What was less
evident in its documentation was the "causal analysis of needs." Other than the social determinants of health
framework (assumed to be valid for this whole initiative), there was no causal analysis at the level of priorities and
suggested next steps.

Summary
In this chapter, we have taken you through the connections between needs assessment and the performance management cycle, the
foundations of what needs assessments are and why they are conducted, and some of the recent changes and expectations in the field.
Needs are fundamentally about what we value in our society. If we can agree that a given service or program is a need (e.g., basic income
support payments), that is the beginning of finding ways through the political process to allocate resources to that need. But because
needs reflect our values, there can be sharp and enduring disagreements over whether it is desirable to fund services or programs, even if a
need has been demonstrated. This is even more of a challenge in times of ongoing fiscal restraint.

Because of competition for funds and because needs assessments can be contentious, it is important that they be conducted in ways that
are methodologically defensible. More recently, there is a growing pool of sector-specific online guidance resources, and there are
corresponding heightened expectations that needs assessments utilize standardized tools and statistical data. Our needs assessment case
from the province of New Brunswick, Canada, is a good example of how one set of guidelines, common data sources, and even similar
questions frame a series of community health needs assessments across the province.

In some cases, periodic needs assessments are expected as part of requests for funding from government or nonprofit entities. Meta-
analyses are also increasingly available and useful for focusing needs assessments.

In a broad range of health, social, and education sectors, there is a fairly consistent set of steps that need to be taken to perform a solid,
defensible needs assessment that can be useful to program decision makers and funders. This chapter has outlined these basic steps (13 of
them), but the reader is encouraged to follow up with readings related to needs assessments in his or her specific area.

Discussion Questions
1. There are different perspectives on how to define needs. How would you define a need when it comes to developing public-sector
policies and programs?
2. It is quite common to measure needs by asking people to describe their own needs (in surveys or focus groups). What are some
advantages and disadvantages of this approach?
3. When you look across the ways that “need” can be measured, which approach do you think is the most valid? Why?
4. In Chapter 6, we mentioned possible conflicts of interest between the purpose of a needs assessment and who should participate
in the needs assessment process. What are your own views on this issue? Why?
5. What is the difference between stratified and non-stratified random samples? Give an example of each.
6. Why is sampling important for needs assessments?
7. What are some of the most important factors that will enhance the likelihood of timely implementation of needs assessment
recommendations?

Appendixes

Appendix A: Case Study: Designing a Needs Assessment for a Small
Nonprofit Organization
The purpose of this case is to give you an opportunity to design a needs assessment, based on the situation
described below. In this chapter, we outline steps in designing and implementing a needs assessment. You can use
the steps as you develop your design, but do not commit to doing anything that is not realistic—that is, beyond
the means of the stakeholders in this case.

Your task in this case is to design a needs assessment for a nonprofit organization. There is not much money, the
organization involved is small, and the needs assessment is being demanded by a key funder of the agency’s
programs. When you have read the case, follow the instructions at the end. Once you have developed your needs
assessment design (we would suggest you work with one or two classmates to develop the design), discuss it with
other class members. This case will take about 2 to 3 hours to complete. If a whole class is doing this case, you should
allow at least half an hour for the groups to report on the main features of their designs.

The Program
A Meals on Wheels program in a community is currently being funded by a national charitable funding
organization and private donations. The funding organization is under considerable budget pressures because the
total donations have not kept up with the demand for funds. The board has recently adopted a policy requesting that
the program managers of all funded agencies demonstrate the continuing relevance of their programs in the
community in order to receive funding.

The program manager of Meals on Wheels is concerned that the needs assessment that she must conduct will be
used to make future funding reductions but does not feel she has a choice. She has limited resources to do any
kind of needs assessment on her own: basically, her own time, the time of an office staff member, and the time
of volunteers.

The Meals on Wheels program is intended to bring one hot meal a day to its clients, all of whom are elderly,
single members of the community. Most have physical limitations that make it difficult to cook their own food,
and some are experiencing memory loss and other problems that make it hard for them to remember when or even
how to prepare regular meals. There are 150 clients at this time in the program, and that number has been fairly
steady for the past several years, although other agencies are reporting more demand for services from elderly
people.

Volunteers, most of whom are elderly themselves, pick up the meals from a catering company in the city and are
assigned a group of deliveries each day. The volunteers have to be able to drive, and because they have other
commitments, most volunteers do not deliver meals every day.

In addition to making sure that the program clients get at least one hot meal a day, the volunteers can check to
make sure that the clients have not fallen or otherwise injured themselves. If volunteers find a client in trouble, they can
decide whether to call 911 directly or instead call the Meals on Wheels office.

Most of the volunteers have been delivering meals for at least 3 years. Their continued commitment and
enthusiasm are key assets for the program. The program manager recognizes their importance to the program and
does not want to do anything in the needs assessment that will jeopardize their support.

Your Role
The program manager approaches you and asks you to assist her with this project. She is clearly concerned that if
the needs assessment is not done, her funding will be cut. She is also concerned that if the study does not show
that her clients need the benefits of the services, she and her volunteers who offer the program will be vulnerable to cuts.

You are a freelance consultant—that is, you work on your own out of a home office and do not have access to the
time and resources that a consultant in a larger firm would have. She can pay you for your work on the design, but
any suggestions that you make to the program manager have to be realistic; that is, they cannot assume that large
amounts of money or other resources are available.

Your Task
Working in a team of two to three persons, draft a design for a needs assessment that is focused on whether there
is a continuing need for the Meals on Wheels program in the community. In your design, pay attention to the
steps in conducting a needs assessment that were discussed in this chapter. Make your design realistic—that is, do
not assume resources are just going to be available.

Outline your design in two to three pages (point form). Discuss it with other teams in your class.

References
Alcoholism and Drug Abuse Weekly. (2016). CDC report shows heroin and illicit fentanyl overdoses increasing.
28(2), 1–3.

Altschuld, J. (2004). Emerging dimensions of needs assessment. Performance Improvement, 43(1), 10–15.

Altschuld, J. (2010). Needs assessment phase II: Collecting data. Thousand Oaks, CA: Sage.

Altschuld, J., & Kumar, D. D. (2010). Needs assessment: An overview (Vol. 1). Thousand Oaks, CA: Sage.

Asadi-Lari, M., Packham, C., & Gray, D. (2003). Need for redefining needs. Health and Quality of Life Outcomes,
1(34).

Aubry, T., Tsemberis, S., Adair, C. E., Veldhuizen, S., Streiner, D., Latimer, E.,. . . & Hume, C. (2015). One-
year outcomes of a randomized controlled trial of Housing First with ACT in five Canadian cities. Psychiatric
Services, 66(5), 463–469.

Axford, N. (2010). Conducting needs assessments in children’s services. British Journal of Social Work, 40(1),
4–25.

Axford, N., Green, V., Kalsbeek, A., Morpeth, L., & Palmer, C. (2009). Measuring children’s needs: How are we
doing? Child & Family Social Work, 14(3), 243–254.

Barry, C. (2018). Fentanyl and the evolving opioid epidemic: What strategies should policy makers consider?
Psychiatric Services, 69(1), 100–103.

Bee, P., Barnes, P., & Luker, K. (2009). A systematic review of informal caregivers’ needs in providing home-
based end-of-life care to people with cancer. Journal of Clinical Nursing, 18(10), 1379–1393.

Beletsky, L., & Davis, C. S. (2017). Today’s fentanyl crisis: Prohibition’s Iron Law, revisited. International Journal
of Drug Policy, 46, 156–159.

Bradshaw, J. (1972). The concept of social need. New Society, 30, 640–643.

Byrne, B., Maguire, L., & Lundy, L. (2015). Reporting on best practice in cross-departmental working practices
for children and young people. Centre for Children’s Rights. Belfast: Queen’s University. Retrieved from
http://www.niccy.org/media/1655/juw-report-final-30-sept-15.pdf

Cain, C. L., Orionzi, D., O’Brien, M., & Trahan, L. (2017). The power of community voices for enhancing
community health needs assessments. Health Promotion Practice, 18(3), 437–443.

Calsyn, R. J., Kelemen, W. L., Jones, E. T., & Winter, J. P. (2001). Reducing overclaiming in needs assessment
studies: An experimental comparison. Evaluation Review, 25(6), 583–604.

Canadian Observatory on Homelessness. (2018). Homeless Hub—About us. Retrieved from http://homelesshub.ca/about-us

Catholic Health Association of the United States. (2012). Assessing and addressing community health needs.
Retrieved from http://www.chausa.org/Assessing_and_Addressing_Community_Health_Needs.aspx

Chaster, S. (2018). Cruel, unusual, and constitutionally infirm: Mandatory minimum sentences in Canada.
Appeal: Review of Current Law and Law Reform, 23, 89–119.

Community Health Assessment Network of Manitoba. (2009). Community health assessment guidelines 2009.
Retrieved from http://www.gov.mb.ca/health/rha/docs/chag.pdf

Davis, C., Green, T., & Beletsky, L. (2017). Action, not rhetoric, needed to reverse the opioid overdose epidemic.
Journal of Law, Medicine & Ethics, 45(1_suppl), 20–23.

Dewit, D. J., & Rush, B. (1996). Assessing the need for substance abuse services: A critical review of needs
assessment models. Evaluation and Program Planning, 19(1), 41–64.

Drukker, M., van Os, J., Bak, M., à Campo, J., & Delespaul, P. (2010). Systematic monitoring of needs for care
and global outcomes in patients with severe mental illness. BMC Psychiatry, 10(1). Retrieved from
http://www.biomedcentral.com/content/pdf/1471-244X-10-36.pdf

Donaldson, S. I., & Picciotto, R. (Eds.). (2016). Evaluation for an equitable society. Charlotte, NC: IAP.

Fetterman, D. M., Rodríguez-Campos, L., & Zukoski, A. P. (2018). Collaborative, participatory, and empowerment
evaluation: Stakeholder involvement approaches. New York, NY: Guilford Publications.

Fetterman, D., & Wandersman, A. (2007). Empowerment evaluation: Yesterday, today, and tomorrow. American
Journal of Evaluation, 28(2), 179–198.

Fletcher, A., Gardner, F., McKee, M., & Bonell, C. (2012). The British government’s Troubled Families
Programme. BMJ (Clinical research ed.), 344, e3403.

Folkemer, D., Somerville, M., Mueller, C., Brow, A., Brunner, M., Boddie-Willis, C., . . . Nolin, M. A. (2011,
April). Hospital community benefits after the ACA: Building on state experience (Issue Brief). Baltimore, MD: The
Hilltop Institute, UMBC. Retrieved from http://www.greylit.org/sites/default/files/collected_files/2012-09/HospitalCommunityBenefitsAfterTheACA-HCBPIssueBrief2-April2011.pdf

Friedman, D. J., & Parrish, R. G. (2009). Is community health assessment worthwhile? Journal of Public Health
Management and Practice, 15(1), 3–9.

Gaber, J. (2000). Meta-needs assessment. Evaluation and Program Planning, 23(2), 139–147.

Government Accountability Office. (2017). Medicaid: CMS should take additional steps to improve assessments of
individuals’ needs for home- and community-based services. Washington, DC: United States Government
Accountability Office.

Government of New Brunswick. (2013). Community health needs assessment guidelines for New Brunswick. St.
John, NB: Department of Health [online]. Retrieved from
http://en.horizonnb.ca/media/819151/chna_guide_en.pdf

Government of New Brunswick. (2016). Nackawic, Harvey, McAdam and Canterbury area: Community health
needs assessment. St. John, NB: Horizon Health Network. Retrieved from
http://en.horizonnb.ca/media/872324/nackawic_chna_en.pdf

Grande, G., Stajduhar, K., Aoun, S., Toye, C., Funk, L., Addington-Hall, J., . . . Todd, C. (2009). Supporting
lay carers in end of life care: Current gaps and future priorities. Palliative Medicine, 23(4), 339–344.

Hanson, L., Houde, D., McDowell, M., & Dixon, L. (2007). A population-based needs assessment for mental
health services. Administration and Policy in Mental Health, 34(3), 233–242.

Harrison, J. D., Young, J. M., Price, M. A., Butow, P. N., & Solomon, M. J. (2009). What are the unmet
supportive care needs of people with cancer? A systematic review. Supportive Care in Cancer, 17(8), 1117–1128.

Head, B. W., & Alford, J. (2015). Wicked problems: Implications for public policy and management.
Administration & Society, 47(6), 711–739.

Healey, P., Stager, M. L., Woodmass, K., Dettlaff, A. J., Vergara, A., Janke, R., & Wells, S. J. (2017). Cultural
adaptations to augment health and mental health services: A systematic review. BMC Health Services Research,
17(1), 8.

Henry, G. T. (2003). Influential evaluations. American Journal of Evaluation, 24(4), 515–524.

Hudson, P. L., Trauer, T., Graham, S., Grande, G., Ewing, G., Payne, S., . . . Thomas, K. (2010). A systematic
review of instruments related to family caregivers of palliative care patients. Palliative Medicine, 24(7),
656–668.

Institute of Medicine, Committee for the Study of the Future of Public Health. (1988). The future of public health.
Washington, DC: National Academies Press.

Jennekens, N., de Casterlé, B., & Dobbels, F. (2010). A systematic review of care needs of people with traumatic
brain injury (TBI) on a cognitive, emotional and behavioural level. Journal of Clinical Nursing, 19(9/10),
1198–1206.

Kendall, M., Buckingham, S., Ferguson, S., MacNee, W., Sheikh, A., White, P.,. . . & Pinnock, H. (2015).
Exploring the concept of need in people with very severe chronic obstructive pulmonary disease: A qualitative
study. BMJ Supportive & Palliative Care. [ePub ahead of print]

Kernan, J. B., Griswold, K. S., & Wagner, C. M. (2003). Seriously emotionally disturbed youth: A needs
assessment. Community Mental Health Journal, 39(6), 475–486.

Latimer, C. (2015, October 4). How we created a Canadian prison crisis. Toronto Star. Retrieved from
https://www.thestar.com/opinion/commentary/2015/10/04/how-we-created-a-canadian-prison-crisis.html

Lewis, H., Rudolph, M., & White, L. (2003). Rapid appraisal of the health promotion needs of the Hillbrow
Community, South Africa. International Journal of Healthcare Technology and Management, 5(1/2), 20–33.

MacIsaac, L., Harrison, M., Buchanan, D., & Hopman, W. (2011). Supportive care needs after an acute stroke.
Journal of Neuroscience Nursing, 43(3), 132–140.

Maslow, A. H. (1943). A theory of human motivation. Psychological Review, 50(4), 370–396.

McKillip, J. (1987). Need analysis: Tools for the human services and education (Applied social research methods
series, vol. 10). Thousand Oaks, CA: Sage.

McKillip, J. (1998). Need analysis: Process and techniques. In L. Bickman & D. J. Rog (Eds.), Handbook of
applied social research methods (pp. 261–284). Thousand Oaks, CA: Sage.

Miller, E., & Cameron, K. (2011). Challenges and benefits in implementing shared inter-agency assessment across
the UK: A literature review. Journal of Interprofessional Care, 25(1), 39–45.

Padgett, D., Henwood, B., & Tsemberis, S. (2016). Ending homelessness, transforming systems and changing lives.
New York, NY: Oxford University Press.

Patterson, P., McDonald, F. E. J., Butow, P., White, K. J., Costa, D. S. J., Millar, B., . . . Cohn, R. J. (2014).
Psychometric evaluation of the Sibling Cancer Needs Instrument (SCNI): An instrument to assess the
psychosocial unmet needs of young people who are siblings of cancer patients. Supportive Care in Cancer, 22(3),
653–665.

Phelan, M., Slade, M., Thornicroft, G., Dunn, G., Holloway, F., Wykes, T., . . . Hayward, P. (1995). The
Camberwell Assessment of Need: The validity and reliability of an instrument to assess the needs of people with
severe mental illness. British Journal of Psychiatry, 167(5), 589–595.

Piche, J. (2015). Playing the "Treasury Card" to contest prison expansion: Lessons from a public criminology
campaign. Social Justice, 41(3), 145–167.

Pigott, C., Pollard, A., Thomson, K., & Aranda, S. (2009). Unmet needs in cancer patients: Development of a
supportive needs screening tool (SNST). Supportive Care in Cancer, 17(1), 33–45.

Platonova, E., Studnicki, J., Fisher, J., & Bridger, C. (2010). Local health department priority setting: An
exploratory study. Journal of Public Health Management and Practice, 16(2), 140–147.

Poister, T. H. (1978). Public program analysis: Applied research methods. Baltimore, MD: University Park Press.

Rasmusson, B., Hyvönen, U., Nygren, L., & Khoo, E. (2010). Child-centered social work practice: Three unique
meanings in the context of looking after children and the assessment framework in Australia, Canada and
Sweden. Children and Youth Services Review, 32, 452–459.

Reviere, R., Berkowitz, S., Carter, C., & Ferguson, C. (1996). Needs assessment: A creative and practical guide for
social scientists. Washington, DC: Taylor & Francis.

Schoen, C., Doty, M. M., Robertson, R. H., & Collins, S. R. (2011). Affordable Care Act reforms could reduce
the number of underinsured US adults by 70 percent. Health Affairs, 30(9), 1762–1771.

Scriven, M., & Roth, J. (1978). Needs assessments: Concepts and practice. In S. B. Anderson & C. D. Coles
(Eds.), Exploring purposes and dimensions (New Directions in Program Evaluation, no. 1). San Francisco, CA:
Jossey-Bass.

Scutchfield, F. D., Mays, G. P., & Lurie, N. (2009). Applying health services research to public health practice:
An emerging priority. Health Research and Education Trust, 44(5), 1775–1787.

Sheppard, M., & Wilkinson, T. (2010). Assessing family problems: An evaluation of key elements of the
children’s review schedule. Children & Society, 24(2), 148–159.

Slade, M. (1999). CAN: Camberwell Assessment of Need: A comprehensive needs assessment tool for people with severe
mental illness. London, England: Gaskell.

Soriano, F. I. (2012). Conducting needs assessments: A multidisciplinary approach (2nd ed.). Los Angeles, CA: Sage.

Sork, T. J. (2001). Needs assessment. In D. H. Poonwassie & A. Poonwassie (Eds.), Fundamentals of adult
education: Issues and practices for lifelong learning (pp. 101–115). Toronto, Ontario, Canada: Thompson
Educational.

Stergiopoulos, V., Dewa, C., Durbin, J., Chau, N., & Svoboda, T. (2010). Assessing the mental health service
needs of the homeless: A level-of-care approach. Journal of Health Care for the Poor and Underserved, 21(3),
1031–1045.

Stergiopoulos, V., Dewa, C., Tanner, G., Chau, N., Pett, M., & Connelly, J. L. (2010). Addressing the needs of
the street homeless: A collaborative approach. International Journal of Mental Health, 39(1), 3–15.

Stevens, A., & Gillam, S. (1998). Needs assessment: From theory to practice. BMJ (Clinical research ed.),
316(7142), 1448–1452.

Strickland, B., van Dyck, P., Kogan, M., Lauver, C., Blumberg, S., Bethell, C., . . . Newacheck, P. W. (2011).
Assessing and ensuring a comprehensive system of services for children with special health care needs: A public
health approach. American Journal of Public Health, 101(2), 224–231.

Swenson, J. R., Aubry, T., Gillis, K., Macphee, C., Busing, N., Kates, N., . . . Runnels, V. (2008). Development
and implementation of a collaborative mental health care program in a primary care setting: The Ottawa Share
program. Canadian Journal of Community Mental Health, 27(2), 75–91.

Tsemberis, S. (2010). Housing first: The pathways model to end homelessness for people with mental illness and
addiction manual. Center City, MN: Hazelden.

Tutty, L., & Rothery, M. (2010). Needs assessments. In B. Thyer (Ed.), The handbook of social work research
methods (2nd ed., pp. 149–162). Thousand Oaks, CA: Sage.

Waller, A., Girgis, A., Currow, D., & Lecathelinais, C. (2008). Development of the palliative care needs
assessment tool (PC-NAT) for use by multi-disciplinary health professionals. Palliative Medicine, 22(8), 956–964.

Watson, D. P., Shuman, V., Kowalsky, J., Golembiewski, E., & Brown, M. (2017). Housing First and harm
reduction: A rapid review and document analysis of the US and Canadian open-access literature. Harm
Reduction Journal, 14(1), 30.

Wen, K.-Y., & Gustafson, D. (2004). Needs assessment for cancer patients and their families. Health and Quality
of Life Outcomes, 2(1), 1–12.

White, J., & Altschuld, J. (2012). Understanding the “what should be condition” in needs assessment data.
Evaluation and Program Planning, 35(1), 124–132.

Wisconsin University. (2018). Population Health Institute: Home. Retrieved from https://uwphi.pophealth.wisc.edu

7 Concepts and Issues in Economic Evaluation

Introduction
Why an Evaluator Needs to Know About Economic Evaluation
Connecting Economic Evaluation With Program Evaluation: Program Complexity and Outcome Attribution
Program Complexity and Determining Cost-Effectiveness of Program Success
The Attribution Issue
Three Types of Economic Evaluation
The Choice of Economic Evaluation Method
Economic Evaluation in the Performance Management Cycle
Historical Developments in Economic Evaluation
Cost–Benefit Analysis
Standing
Valuing Nonmarket Impacts
Revealed and Stated Preferences Methods for Valuing Nonmarket Impacts
Steps for Economic Evaluations
1. Specify the Set of Alternatives
2. Decide Whose Benefits and Costs Count (Standing)
3. Categorize and Catalog the Costs and Benefits
4. Predict Costs and Benefits Quantitatively Over the Life of the Project
5. Monetize (Attach Dollar Values to) All Costs and Benefits
6. Select a Discount Rate for Costs and Benefits Occurring in the Future
7. Compare Costs With Outcomes, or Compute the Net Present Value of Each Alternative
8. Perform Sensitivity and Distributional Analysis
9. Make a Recommendation
Cost–Effectiveness Analysis
Cost–Utility Analysis
Cost–Benefit Analysis Example: The High/Scope Perry Preschool Program
1. Specify the Set of Alternatives
2. Decide Whose Benefits and Costs Count (Standing)
3. Categorize and Catalog Costs and Benefits
4. Predict Costs and Benefits Quantitatively Over the Life of the Project
5. Monetize (Attach Dollar Values to) All Costs and Benefits
6. Select a Discount Rate for Costs and Benefits Occurring in the Future
7. Compute the Net Present Value of the Program
8. Perform Sensitivity and Distributional Analysis
9. Make a Recommendation
Strengths and Limitations of Economic Evaluation
Strengths of Economic Evaluation
Limitations of Economic Evaluation
Summary
Discussion Questions
References

Introduction
Chapter 7 introduces the concepts, principles, and practices of economic evaluation. We begin by connecting
economic evaluation to earlier themes in this book: program complexity, as described in Chapter 2, and
attribution, as described in Chapter 3. We introduce the three principal forms of economic analysis: (1) cost–
benefit analysis (CBA), (2) cost–effectiveness analysis (CEA), and (3) cost–utility analysis (CUA). Economic
evaluation has a rich history grounded in both economic theory and public-sector decision making, so we briefly
summarize some of those themes.

The main part of Chapter 7 is a conceptual and step-by-step introduction to CBA. We then describe CEA and
CUA, both of which share core concepts with CBA. We offer an example of an actual CBA that was done of the
High/Scope Perry Preschool Program (HSPPP), which we introduced in Chapter 3. Finally, we summarize the
strengths and limitations of economic evaluation.

Over the past several decades, the combination of greater demand for public services concurrent with increased
fiscal pressures, particularly in the health sector, has resulted in calls for economic evaluations to increase efficiency
in the use of public resources. This chapter is intended to help program evaluators (a) become knowledgeable and
critical users of economic evaluations, (b) see the relationship between economic evaluations and other
evaluations, and (c) identify potential weaknesses in the validity of specific studies.

The focus of program evaluation is typically on how well the actual results/outcomes of a program compare with
the intended outcomes and whether those actual outcomes can be attributed to the program. In the case of
economic evaluations generally, the goal is to explicitly compare costs and benefits, using an economic efficiency
criterion to choose between program, policy, intervention, or project alternatives, or between the status quo and
one or more alternatives. Choosing between alternatives includes decisions about project scale and treatment
“dosage.” Efficiency, of course, is not the only consideration when making program decisions. In reality, the
interrelationship between efficiency and social equity, as well as the influence of politics on public policy making,
means that economic evaluations are one component of a complex decision-making process (Gramlich, 1990;
Mankiw, 2015). Moreover, as discussed later in this chapter, certain types of economic evaluations explicitly
address in their analysis equity considerations, such as the distributional consequences of interventions (Atkinson
& Mourato, 2015; Cai, Cameron, & Gerdes, 2010; Pearce, Atkinson, & Mourato, 2006).

The three most important economic evaluation approaches, which are described and illustrated in the remainder
of this chapter, are CBA, CEA, and CUA. These approaches can address the following key issues:

A CBA can tell us whether the social benefits of an intervention exceed its social costs—and consequently, whether it should be undertaken on the basis of that criterion. In CBA, the efficiency criterion aims to maximize social utility, the aggregate utility of citizens, defined as the sum of individual citizens' utility. CBA, because it monetizes both costs and benefits, can also be used to choose between mutually exclusive alternative projects and to rank or prioritize projects when limited investment funds are available to finance a small number of projects.

A CEA can tell us which of two or more interventions minimizes the social costs per unit of a given outcome achieved, such as monetary cost per life saved, and it can therefore help us choose among alternative interventions to maximize the cost-effectiveness of interventions aiming to achieve a unit of common outcome. CEAs do not provide sufficient information to determine whether an intervention generates net social benefits/net social value (NSBs), because information on the monetized social value of the outcomes is unavailable or is not estimated. A cost–effectiveness analysis can look at just one program's per unit cost of achieving a specific outcome, but generally, the idea is to be able to compare two or more interventions, so that the results can be ranked and the best chosen.

With a CUA, the intent is to determine which of numerous possible interventions minimizes social costs per unit of a broad (multi-component) outcome achieved, such as quality-adjusted life-years (QALY). It is used to help make choices among alternative (e.g., health service) interventions to maximize QALY. This is done by comparing costs per QALY of interventions to an explicit threshold believed to represent the (monetary) social value of a QALY. While CUA is increasingly used in the sense just outlined, there are some unresolved ethical and methodological issues, such as how to discount additional years of life (Neumann, 2015). In practice, then, CUAs usually answer the same type of question as CEAs do, except that the outcome of interest in a CUA is QALY in contexts where the purpose of interventions is to improve life span and quality of life. Because CUAs measure cost per unit of utility, the outcome measure should represent social utility or preferences.

Why an Evaluator Needs to Know About Economic Evaluation
Budgetary and accountability trends, prompted in part by New Public Management (NPM) influences and, more
recently, by the Great Recession in 2008–2009 and the subsequent fiscal constraints, suggest that economic
evaluations will have an increasing role to play in program evaluations (Clyne & Edwards, 2002; Drummond,
Sculpher, Claxton, Stoddart, & Torrance, 2015; Sanders, Neumann, Basu, et al., 2016). In some fields (health
being a prime example) the importance of containing costs and assessing relative efficiency and effectiveness of
interventions continues to grow, as population demographics in many Western countries drive up health care
expenditures (OECD, 2017). While most program evaluators cannot expect to conduct an economic evaluation
without a background in economics, it is important to have an understanding of how economic evaluation
intersects with program evaluation. At a minimum, we want to be knowledgeable and critical readers of economic
evaluations that others have done.

It is important for evaluators to critically assess economic evaluations that may be used to justify program or policy
decisions. Guidelines and checklists have been developed to assist researchers, reviewers, and evaluators in
designing and evaluating research that can be used for such an assessment (Drummond & Jefferson, 1996;
Drummond et al., 2015; Fujiwara & Campbell, 2011; van Mastrigt et al., 2016). Higgins and Green (2011)
recommend the use of either the British Medical Journal checklist (Drummond & Jefferson, 1996) or the
Consensus on Health Economic Criteria (Evers, Goossens, de Vet, van Tulder, & Ament, 2005) to evaluate the
quality of studies included in a Cochrane-style systematic review. Both checklists are reproduced in Section 15.5.2
of the Cochrane Handbook for Systematic Reviews of Interventions, Version 5.1.0, available online (Higgins & Green,
2011).

Sometimes, program evaluations can be done in a way that later facilitates an economic evaluation of the outcomes
of an intervention or program (Dhiri & Brand, 1999; HM Treasury, 2018; Ryan, Tompkins, Markovitz, &
Burstin, 2017). Alternatively, prior to the implementation of a program, a program evaluator may review the
economic analyses that have already been done, in order to identify the variables/data that would be needed to
conduct future evaluations or economic evaluations.

The quality of economic evaluations has been and continues to be a concern (Anderson, 2010; Mallender &
Tierney, 2016; Mathes et al., 2014; Sanders et al., 2016). Jefferson, Demicheli, and Vale (2002) examined reviews
of economic evaluations in the previous decade in the health sector and concluded that “the reviews found
consistent evidence of serious methodological flaws in a significant number of economic evaluations” (p. 2809).

More recently, Sabharwal, Carter, Darzi, Reilly, and Gupte (2015) examined the methodological quality of health
economic evaluations (HEEs) for the management of hip fractures and concluded,

Most of these studies fail to adopt a societal perspective and key aspects of their methodology are poor.
The development of future HEEs in this field must adhere to established principles of methodology, so
that better quality research can be used to inform health policy on the management of patients with a
hip fracture. (p. 170)

Sanders et al. (2016) had similar concerns in their review of cost–effectiveness analyses in health and medicine but
noted the continuing evolution and increasing use of cost–effectiveness studies and provided a "reporting checklist
for cost–effectiveness analysis" (p. 1099). Gaultney, Redekop, Sonneveld, and Uyl-de Groot (2011) and Polinder et
al. (2012) looked at the quality of economic evaluations of interventions in multiple myeloma and in injury
prevention, respectively, and again, the studies suggest that continued vigilance in assessing the quality of
economic evaluations is in order.

Connecting Economic Evaluation With Program Evaluation: Program
Complexity and Outcome Attribution

Program Complexity and Determining Cost-Effectiveness of Program Success
In Chapter 2, we introduced simple, complicated, and complex program structures, pointing out that the more
complicated/complex a program is, the more challenging it is to coordinate and implement multiple factors to
achieve program success. An example of a simple program would be a program that is focused on maintaining
highways in a region to some set of audited standards. The program would consist of components that describe
clusters of maintenance activities, and most of the work would be done by machinery and their operators. If
we assume that the level of maintenance is generally constant over time—that is, the program passes an annual
audit based on minimum standards of maintenance completeness and quality—then we could construct a ratio of
the cost per lane-kilometer of highway maintained in that region. This measure of cost-effectiveness could be
compared over time, adjusting for general increases in the price level, allowing us to detect changes in cost-
effectiveness. Because the program is relatively simple in its structure, we can be reasonably sure that when we
implement the program and track it through to outcomes, the costs (program inputs) have indeed been
responsible for producing the observed outcome. Our measure of cost-effectiveness is a valid measure of program
accomplishment. Measuring performance would be roughly equivalent to evaluating the effectiveness of the
program.

In our highway maintenance example, we could use the cost per lane-kilometer to assess the effects of program-
related interventions or even changes in the program environment. Suppose highway maintenance in that
jurisdiction was contracted out, as has been done in many places. If we compared the cost per lane-kilometer
before and after outsourcing (privatizing) the program, including the cost of contracting, net of savings from lower
human resource management costs, we could estimate the change in cost-effectiveness due to that change in the
provision of the program. Of course, if winter weather in the region was severe for several years, we would expect
that to affect the cost-effectiveness of the service. A change in the weather affecting highway maintenance costs
that occurs simultaneously with outsourcing would make it difficult to disentangle the effect of the program change
from the effect of the exogenous (environmental) factor. So even in the case of "simple" program changes, it can sometimes be difficult
to determine precise cost savings.
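
The arithmetic behind such a comparison is simple enough to sketch. The following Python fragment uses entirely hypothetical figures (the lane-kilometers, costs, and inflation rate are all invented for illustration) to compute the cost per lane-kilometer before and after outsourcing, expressing the earlier year's costs in the later year's dollars so the two ratios are comparable.

# Hypothetical figures for a regional highway maintenance program.
lane_km = 2_400                      # lane-kilometers maintained (assumed constant over time)
cost_before = 36_000_000             # total maintenance cost in the year before outsourcing
cost_after = 34_500_000              # total cost after outsourcing, including contract administration
inflation = 0.04                     # assumed general price increase between the two years

# Express the "before" cost in the later year's dollars so the ratios are comparable.
cost_before_adjusted = cost_before * (1 + inflation)

ratio_before = cost_before_adjusted / lane_km
ratio_after = cost_after / lane_km

print(f"Cost per lane-km before outsourcing (adjusted): {ratio_before:,.0f}")
print(f"Cost per lane-km after outsourcing:             {ratio_after:,.0f}")
print(f"Estimated change in cost per lane-km:           {ratio_after - ratio_before:,.0f}")

As the paragraph above cautions, any difference between the two ratios could still reflect exogenous factors, such as unusually severe winters, rather than the outsourcing decision itself.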

An example of a program that is more complex would be a program to encourage parents to modify their
parenting behaviors.

Such a program involves changing the behavior of people in an open systems context, generally a challenging
outcome to achieve. Recall that in Chapter 2, the example of a complex program that was offered was raising a
child—in general, programs that are intended to change our knowledge, attitudes, beliefs, and our behaviors are
challenging to design and implement. The Troubled Families Program in Britain (Day, Bryson, & White, 2016;
Department for Communities and Local Government, 2016) is an example of a complex program that is intended
to change family behaviors in such ways that for the families in the program, there will be less family violence,
fewer crimes committed, more employment, and better school attendance by children in those families. An
important objective of the program was to save government costs. The evaluation results were contradictory—the
quantitative lines of evidence suggested no overall program effects, whereas the qualitative lines of evidence
suggested positive changes in the sample of families included in that particular evaluation (Bewley, George,
Rienzo, & Portes, 2016; Blades, Day, & Erskine, 2016).

Measuring the cost-effectiveness of such a program, even if we could agree on an outcome, would be questionable
—we would not be confident that actual outcomes were due to the program, at least not without a program
evaluation that addressed the important question of causal links between program outputs and outcomes. From a
performance measurement standpoint, outcome measures would not be valid indicators of what the program actually accomplished.

As we’ve noted earlier, nested logic models can be useful for capturing complexity. Anderson et al. (2011) propose
some strategies for using logic models for systematic reviews of complex health and social programs. For the health
sector, Lewin et al. (2017) have recently proposed a framework “tool,” the “intervention Complexity Assessment
Tool for Systematic Reviews” (abbreviated iCAT-SR) to assess and compare the complexity of interventions when
conducting systematic reviews.

The Attribution Issue


Establishing causality is as relevant to economic evaluations as it is to other program evaluations. Recall the
York crime prevention program, which had an objective of reducing burglaries committed in the community. In
Chapter 4, we considered some of the potential difficulties of developing valid measures of the construct
“burglaries committed” and ended up using “burglaries reported to the police.” One way of assessing the cost-
effectiveness of that program would be to calculate the ratio of program costs to the reduction in the number of burglaries
(cost per burglary avoided) in the city. This, however, presupposes that reductions in reported burglaries are due to
program efforts and, hence, can be connected to program costs. If general economic conditions in the region were
improving at the same time as the program was implemented, a reduction in burglaries may simply reflect better
alternative economic opportunities.

Levin and McEwan (2001), in their discussion of CEA, point out the importance of establishing causality before
linking costs with observed outcomes. They introduce and summarize the same threats to internal validity that we
discussed in Chapter 3. They point out that experiments are the strongest research designs for establishing
causality:

Randomized experiments provide an extremely useful guard against threats to internal validity such as
group nonequivalence. In this sense, they are the preferred method for estimating the causal relationship
between a specific…alternative and measures of effectiveness. (pp. 124–125)

If we are reasonably confident that we have resolved the attribution problem, then our ratio—cost per burglary
avoided—would be a summary measure of the cost-effectiveness of that program. Its usefulness depends, in part,
on whether reducing burglaries is the only program outcome and, in part, on how reliably we have estimated the
costs of the program. Even if reducing burglaries was the only objective and estimates of program costs were
reliable, in and of itself, the measure would have limited usefulness. However, it would be very useful if a general
estimate of the social cost of burglaries was available, as the cost of avoiding burglaries through the program could
then be compared with the social cost of burglaries and a net social benefit (NSB) calculated. NSB would be equal
to the social cost of avoided burglaries less the program cost.
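
As a minimal sketch of that arithmetic, with invented numbers rather than the York program's actual figures, the cost-effectiveness ratio and the NSB just described could be computed as follows:

# Hypothetical figures; attribution of the reduction to the program is assumed to be resolved.
program_cost = 400_000               # total incremental cost of the crime prevention program
burglaries_avoided = 250             # estimated reduction in burglaries attributable to the program
social_cost_per_burglary = 3_000     # assumed external estimate of the social cost of one burglary

# Cost-effectiveness ratio: program cost per burglary avoided.
cost_per_burglary_avoided = program_cost / burglaries_avoided

# NSB = social cost of the burglaries avoided (the benefit) less the program cost.
nsb = burglaries_avoided * social_cost_per_burglary - program_cost

print(f"Cost per burglary avoided: {cost_per_burglary_avoided:,.0f}")
print(f"Net social benefit (NSB):  {nsb:,.0f}")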

In the remainder of this chapter, we further examine the objectives and theoretical underpinnings of CBA, CEA,
and CUA; the key steps in conducting these types of analyses; and then use an example of a CBA study to
illustrate the basic steps needed to perform an economic analysis. We then review some of the controversies and
limitations of CBA, CEA, and CUA in the chapter summary.

Three Types of Economic Evaluation
Cost–benefit analysis (CBA), cost–effectiveness analysis (CEA), and cost–utility analysis (CUA) are the three main
types of economic evaluation applicable to public-sector program evaluation. With all these types of analyses, the
costs of the programs or potential programs are monetized, but for each, the benefits are quantified differently.
With CBA, both the costs and the resulting benefits to society are monetized to determine whether there is a net
social benefit (NSB). However, with CEA, costs per unit of a single non-monetized outcome, such as “life saved,”
are calculated, and with CUA, costs per unit of a measure of multi-component utility, such as “quality-adjusted
life-years” (QALY), are calculated. Because benefits are not monetized in CEA and CUA, these methods cannot
usually answer the question of whether an intervention provides NSBs, except that cost–utility ratios from CUAs
may be compared with a threshold or benchmark that, under restrictive assumptions, can be considered to reflect
social utility. CUAs and CEAs can be used to rank alternative approaches to achieving a particular objective.

In all cases, the effectiveness units (or outcome units), whether monetized or not, should be conceptually
consistent with the intended outcomes that were predicted for the program. Also, with all three types of economic
evaluation, the focus is on the additional, or incremental, costs and benefits/outcomes of a new or modified
program or program alternatives. Because many program outcomes are affected by external influences, identifying
incremental costs and benefits is usually a complicated methodological undertaking.

The Choice of Economic Evaluation Method
CEAs and CUAs are especially useful when the objective is to choose among a small number of alternative
interventions that achieve the same outcome in the context of a limited budget (Drummond et al., 2015). Analysts
may also choose to use CEA or CUA when the key benefits are difficult or controversial to monetize. For example,
a cost–effectiveness study might compare several seniors’ fall reduction programs on the basis of incremental cost
per fall prevented. It would be difficult to monetize all the benefits of a prevented fall because, apart from averted
medical costs, the benefits would also have to include other components such as the incremental value of the labor and
voluntary work of the persons whose falls were prevented and the value of the time gained by friends and family,
who might otherwise have been called on for caregiving duty. Yet if the intended outcome is clear and consistent
between alternative interventions, CEA or CUA can tell decision makers which programs to choose to maximize
the actual outcome within a given budget.

To fulfill the usual objective of CEA or CUA, which is to identify the more cost-effective alternative, two or more
alternative interventions are evaluated, unless a benchmark cost per unit of outcome is available for comparison.
Without a comparison intervention or a benchmark, the ratio of incremental costs to incremental benefits for a
single intervention does not provide relevant information on whether the intervention should be chosen over
another or at all.
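
Such comparisons are often summarized as an incremental cost-effectiveness ratio: the difference in costs between two alternatives divided by the difference in outcomes. The sketch below uses invented figures for two hypothetical seniors' fall prevention programs; it is illustrative only.

# Hypothetical alternatives: total incremental cost and falls prevented per year.
program_a = {"cost": 180_000, "falls_prevented": 90}
program_b = {"cost": 260_000, "falls_prevented": 120}

# Average cost per fall prevented for each program.
average_a = program_a["cost"] / program_a["falls_prevented"]
average_b = program_b["cost"] / program_b["falls_prevented"]

# Incremental ratio: the extra cost of each additional fall prevented by choosing B over A.
incremental = (program_b["cost"] - program_a["cost"]) / (
    program_b["falls_prevented"] - program_a["falls_prevented"])

print(f"Program A: {average_a:,.0f} per fall prevented")
print(f"Program B: {average_b:,.0f} per fall prevented")
print(f"Incremental cost per additional fall prevented (B vs. A): {incremental:,.0f}")

Whether the additional falls prevented by the more expensive program are worth the incremental cost is a judgment the ratio itself cannot settle; it depends on the budget and on how decision makers value the outcome.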

While CEA captures the main benefit in one outcome, such as the number of falls prevented, with CUA, several
(typically, two) outcomes are combined into one measurement unit. The most common outcome unit for CUA is
QALY gained, which combines the number of additional years of life with subjective ratings of the quality of life
expected in those years, to create a standardized unit for analysis that can be used to compare across various
programs for health or medical interventions. The quality-of-life ratings are standardized in that the value
normally ranges between 1 (perfect health) and 0 (death). CUA, then, is most commonly used in the health sector,
where it is both (1) important to capture the benefit of extra years lived and the quality of life in those extra years
lived and (2) difficult to monetize all of the social benefits of a treatment or program.
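
In its simplest form, the QALY arithmetic multiplies additional life-years by the quality weight expected in those years, and a CUA then compares interventions on incremental cost per QALY gained. The sketch below uses invented costs, life-years, and quality weights purely for illustration.

def qalys(years_gained, quality_weight):
    # QALYs gained = additional life-years weighted by quality of life (0 = death, 1 = perfect health).
    return years_gained * quality_weight

# Hypothetical interventions.
intervention_a = {"cost": 40_000, "qalys": qalys(years_gained=4.0, quality_weight=0.70)}  # 2.8 QALYs
intervention_b = {"cost": 90_000, "qalys": qalys(years_gained=5.0, quality_weight=0.80)}  # 4.0 QALYs

# Incremental cost per QALY gained by choosing B over A.
incremental_cost_per_qaly = (intervention_b["cost"] - intervention_a["cost"]) / (
    intervention_b["qalys"] - intervention_a["qalys"])

print(f"Incremental cost per QALY (B vs. A): {incremental_cost_per_qaly:,.0f}")

A decision maker might then compare that figure with an explicit, and often contested, threshold value per QALY, as discussed later in this section.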

CBA is most often used in determining whether a particular program will increase the economic welfare of a
society, as compared with alternative programs or the status quo. With CEA and CUA, specific outcomes, such as
lives saved by implementing a smoking cessation program, have already been established as desirable, and the
question is not whether to initiate or expand a particular program or project but how to most efficiently expend
resources to attain the desired outcomes (or, conversely, how to increase effectiveness while maintaining current
levels of expenditure).

CBA is grounded in welfare economics and requires the aggregation of willingness-to-pay (WTP) for benefits
and willingness-to-accept (WTA) compensation for losses (i.e., the benefits and costs of a program or investment)
across society in order to arrive at a measure of NSBs or utility. Because CBA calculates NSBs, it can tell us
whether a project is worth undertaking at all, with the decision criterion being that social benefits must
exceed social costs for a project to be deemed acceptable. While many costs can be measured using the available
market prices for inputs, many, if not most, CBA applications will include costs and/or benefits for which no
market prices exist or for which market prices do not reflect the full social costs and benefits. We will discuss how
this situation is addressed later in this chapter.

While CBA can be used to decide whether an intervention increases social welfare, and should thus be undertaken,
the same is not true in general for CEA or CUA. CEA or CUA can only tell us whether an intervention is socially
desirable if the denominator is considered a valid measure of social utility (value) and if a monetary benchmark
representing its social value is available. QALY, the most commonly used denominator for CUA, can be
considered a valid measure of social utility under certain restrictive assumptions about individual preferences, but
QALY and WTP “differ in their theoretical foundations, the unit by which health is measured, and in the relative
values they assign to different health risks,” and “the different assumptions underlying QALYs and WTP have
systematic effects on the quantified value of changes in current mortality risk” (Hammitt, 2002, p. 998).

Benchmarks for QALY have been used for resource allocation decisions, although this practice is still controversial
and is considered inappropriate for some applications, such as rare and serious conditions (Neumann, 2011;
Sanders et al., 2016).

Economic Evaluation in the Performance Management Cycle
Figure 7.1 shows the two main points in the performance management cycle where economic evaluations might
be conducted. Early in the cycle, economic evaluations can occur as programs or policies are being proposed or
designed. These are usually ex ante analyses—occurring as potential program or policy alternatives are being
compared. Ex ante analyses typically use existing theoretical models or experience to predict future costs and
benefits, assuming that the program is implemented and unfolds according to the intended program logic. For
example, Mills, Sadler, Peterson, and Pang (2017) used the results of a pilot falls prevention study in an institution
for the elderly to look forward and extrapolate the potential savings if the program was scaled up and implemented
for a year and the average savings possible over a 6-year period. Estimating the “total annual averted cost of falls,”
the cost–effectiveness analysis calculated that over the long term, for every U.S. dollar spent, there would be seven
dollars saved. Note here that they were not trying to calculate the total net social benefit of the falls prevention
program; that would be a cost–benefit analysis.
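
A hedged sketch of this kind of ex ante extrapolation follows. The figures are invented and are not those of the Mills et al. (2017) study; the point is only to show how pilot results might be scaled to a full year and compared with projected program costs.

# Hypothetical pilot results for a falls prevention program in one facility.
pilot_months = 3
pilot_falls_averted = 12
averted_cost_per_fall = 9_000        # assumed average treatment cost averted per fall prevented
annual_program_cost = 45_000         # projected cost of running the program for a full year

# Scale the pilot result to a full year, assuming the pilot rate holds.
annual_falls_averted = pilot_falls_averted * (12 / pilot_months)
annual_averted_cost = annual_falls_averted * averted_cost_per_fall

savings_per_dollar_spent = annual_averted_cost / annual_program_cost

print(f"Projected annual averted cost: {annual_averted_cost:,.0f}")
print(f"Averted cost per program dollar spent: {savings_per_dollar_spent:.1f}")

As noted above, a projection of averted costs against program costs is a cost-effectiveness style comparison, not an estimate of total net social benefit.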

Analyses can also be conducted ex post (after the program has been implemented or after completion). An ex post
analysis at the assessment and reporting phase in the cycle is based on the after-the-fact, rather than forecasted,
costs and benefits accruing to a program. These analyses depend, in part, on being able to assess the extent to
which the policy or program caused the outcomes that were observed.

Figure 7.1 Economic Evaluation in the Performance Management Cycle

Historical Developments in Economic Evaluation
Economic evaluation has a long history, particularly in the United States, where it began in the early 1800s with a
federal treasury report on the costs and benefits of water projects. The use of cost–benefit analysis (CBA) grew in
the 1930s, when it was seen as a tool to help decide how best to spend public funds during Roosevelt’s New Deal,
when large-scale job creation programs, including massive infrastructure projects (Smith, 2006), were used to
stimulate the depression-era economy. From the 1930s through the 1950s, various forms of CBA were applied to
water resource projects, such as flood control, navigation, irrigation, electric power, and watershed treatment and
to the related areas of recreation, fish, and wildlife (Poister, 1978). From this foundation, CBA began to be
applied to other public investments in the 1960s and 1970s.

The use of CBA increased in both Canada and the United States during the 1970s, with growing pressures to
determine value-for-money in public expenditures. Value-for-money has two distinct meanings. One is aligned
with an economist's perspective (Mason & Tereraho, 2007); CBA, CEA, and CUA are all consistent with that
definition.

A second version is based, in part, on agency costs and has been adapted by the public-sector auditing profession
as value-for-money auditing. Value-for-money auditing was introduced in Canada in 1978 by the then auditor
general, J. J. Macdonell (Canadian Comprehensive Auditing Foundation, 1985), as a way to broaden the purview
of auditors from their traditional focus on financial accountability to inclusion of the relationships between
resources and results. Unlike cost–benefit analysis (CBA), value-for-money audits typically use a mix of qualitative
methodologies to construct an understanding of the economy (Were the inputs to a program purchased
economically?), efficiency (What are the relationships between inputs and outputs?), and effectiveness (Have the
managers of the program implemented procedures that allow them to tell whether the program was effective?). As
we will discuss later, limiting an analysis to agency or budgetary costs and results does not provide sufficient
information to determine a program’s social value.

By the 1980s and 1990s, there were increasing federal requirements, particularly in the United States, for CBA for
large public projects. Regulations applying to the environment or to health and safety were often first subjected to
a CBA. Examples are President Reagan’s Executive Order 12291 in 1981 and, later, President Clinton’s Executive
Order 12866, which “require agencies to prepare Regulatory Impact Analysis (RIA) for all major federal
regulations” (Hahn & Dudley, 2004, p. 4). Apart from the assessment of regulatory issues, economic evaluations
are increasingly being used in almost all areas of public expenditure, including health, transportation, education,
pollution control, and protection of endangered species (Fuguitt & Wilcox, 1999). Economic evaluations are
included as a part of decision-making processes, and sometimes, their contribution is to highlight costs or benefits
that may not have been properly understood or to more clearly identify the winners and losers of a policy.

Evaluators need to be aware that the expression cost-effectiveness is frequently used in studies that focus only on
agency costs and in government reports where the concern is more related to transparency than to determining the
economic efficiency of a program. In British Columbia, Canada, for example, “cost-effectiveness” was built into
the accountability framework developed by the Auditor General and the Deputy Ministers’ Council (1996), yet
the only associated requirement is that government be “clear about its objectives and targets, the strategies it will
employ to meet its objectives, the full costs of these strategies, and its actual results” (p. 33). This information is to
be gleaned from “information required for managing at the program level” (p. 33).

Cost–Benefit Analysis
We begin with cost–benefit analysis (CBA) because, conceptually, CBA is the most comprehensive economic
evaluation method available. It is also the most demanding to conduct. CBAs are usually conducted by or under
the supervision of economists and should be informed by a review of the methodological literature and the
literature on comparable evaluations. CBA is routinely used in environmental and infrastructure studies. In CBA,
a project or intervention is deemed acceptable if its social benefits exceed its social costs.

Cost–benefit analysis is conducted to estimate the value or relative value of an intervention to society, which
comprises citizens of the relevant jurisdiction. While one may be naturally inclined to focus on government
revenues and expenditures only, costs and benefits to all members of society, including intangibles and
externalities, should be included. Intangibles include loss of alternative uses, such as the loss of recreational
opportunities for a park reassigned to social housing or the loss of leisure and household production of previously
unemployed labor. Externalities can be positive or negative and are defined as the social value of (cost or benefit
of) a good or bad outcome that is a by-product of economic activity, is not reflected in market prices, and affects
parties other than those engaging in the activity. Noise, pollution, and greenhouse gas emissions (GHGs) are all
examples of negative externalities. They impose costs on members of society who are not responsible for their
generation. Neither the producer nor the consumer of the good generating the externality pay for the damage they
impose on others. Positive externalities include items such as reduced risk of getting ill because other people are
getting vaccinated and the benefits of the efforts of our neighbors in beautifying their properties.

In economic evaluations taking on a social (jurisdictional) perspective, fees and charges for a program or
intervention that are paid by residents to a government agency are not considered benefits; these fees and charges
are merely a transfer from program users to all taxpayers within the relevant jurisdiction. However, fees and
charges collected from nonresidents are considered benefits. This approach differs from the “bottom-line
budgetary orientation” described in Boardman, Greenberg, Vining, and Weimer (2018, pp. 16–17), where not
only are toll bridge revenues from both residents and nonresidents included as benefits of a toll bridge project, but
indirect costs such as incomes lost to businesses in the construction zone and intangible benefits such as reduced
congestion and GHGs are ignored. In sum, an economic evaluation that is fully grounded in welfare economics
seeks to measure the aggregate social value of the costs and benefits of an intervention and not just the costs and
benefits to a particular government agency.

Social benefits and costs include market prices and also items for which there are no available market values, such
as the social value of public parks and the value of externalities. Additionally, on occasion, a project may use
resources that would otherwise be unemployed in an economy (unemployed capital or labor). If this is the case,
the social cost of that resource is the opportunity cost of that resource and not the price paid for the resource. In
the case of unemployed labor, for example, the value of leisure and household production of the unemployed
worker is the social opportunity cost of that labor. More generally, the opportunity cost of a resource is the social
value that resource would generate in its next most valued alternative. Note that using lower labor costs to reflect
the (social) opportunity cost of labor in the face of unemployment accounts for the social value of employment
generated by a project and is the appropriate method for valuing employment effects. Adding the wages of newly
employed labor as a benefit is not appropriate because labor inputs are costs, not benefits. Moreover, using the
wages of newly employed labor to reflect value assumes that unemployed labor (leisure and household production)
has no social value. The value of added employment is properly reflected by using the opportunity cost of leisure
and household production as the “cost” of newly employed labor, rather than using the full wages paid to that
labor: Benefits are not higher as a result of newly employed labor; instead, labor costs are lower than if the labor
had already been employed.
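
The point about unemployed labor can be made numerically. In the hypothetical sketch below, a project hires previously unemployed workers; their social cost is their opportunity cost (an assumed value of forgone leisure and household production), not the full wage bill, and the value of the employment created shows up as lower labor costs rather than as an added benefit. All figures are invented.

# Hypothetical project that hires 20 previously unemployed workers for one year.
workers = 20
annual_wage = 50_000                 # wage actually paid per worker
opportunity_cost_share = 0.40        # assumed value of forgone leisure/household production, as a share of the wage

other_costs = 1_200_000              # materials, equipment, and other market-priced inputs
social_benefits = 2_600_000          # assumed monetized social value of the project's outcomes

wage_bill = workers * annual_wage                            # labor valued at market wages
labor_opportunity_cost = wage_bill * opportunity_cost_share  # labor valued at its opportunity cost

nsb_wages = social_benefits - (other_costs + wage_bill)
nsb_opportunity_cost = social_benefits - (other_costs + labor_opportunity_cost)

print(f"NSB with labor costed at market wages:     {nsb_wages:,.0f}")
print(f"NSB with labor costed at opportunity cost: {nsb_opportunity_cost:,.0f}")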

The foregoing discussion is an illustration of the consequences of a widespread confusion between CBA and
economic impact analysis (EIA). While the objective of a CBA is to estimate the net social benefits (NSBs) of an
investment, the objective of an EIA is to estimate the impact of an investment on the gross domestic product
(GDP) of an economy. NSB and GDP are not equivalent concepts. GDP measures the monetary value of
exchanges of final goods and services in an economy. Final goods and services exclude those serving as inputs into
a final consumer good or service. NSB is a measure of social value, also expressed in monetary terms, which is
derived using techniques for eliciting this value from the public or inferred from market prices when they are
available. NSB incorporates the cost of externalities such as pollution, while GDP does not. NSB takes into
account the opportunity cost of inputs, while EIA does not. For these reasons, EIAs tend to overestimate benefits
compared with CBAs, a reflection of the inadequacy of GDP as a measure of social welfare. Taks, Kesenne,
Chalip, Green, and Martyn (2011) illustrate the difference by comparing the results of an EIA and a CBA for the
2005 Pan-American Junior Athletic Championships and find that while the EIA estimated an increase in
economic activity of $5.6 million, the CBA estimated a negative net benefit of $2.4 million.

In the remainder of this section, we discuss the concept of standing, how to value costs and benefits without a
market price, and the steps included in an economic evaluation.

Standing
At the beginning of a CBA or other type of economic evaluation, the decision of who has standing in the
evaluation must be made. To qualify as a true CBA, costs and benefits for the whole society should be included.
“Society” typically includes the residents of the jurisdiction of interest (local, provincial, state, or national). In
cases where the project or intervention has international consequences, an international perspective may be taken,
such as is frequently the case with climate change studies (Atkinson & Mourato, 2015; Pearce et al., 2006).
Agency bottom-line perspectives are sometimes used in CEA, meaning that the agency is the only “person” with
standing in the analysis. From a public policy perspective, such an approach is only appropriate if non-agency
costs and benefits are very small—for instance, if an agency were to compare the relative costs and benefits of
leasing versus purchasing office equipment.

While the idea of including all residents of a jurisdiction or nation may seem simple conceptually, it can be
confusing in practice. An illustration that distinguishes between the private and the social perspective will help
explain the differences between these two perspectives. Consider the imposition of a gasoline tax in a region to
finance an expansion of public transit in the same region. While transit is not free in the region, it is subsidized,
and while transit revenues are expected to increase as a result of increased ridership from the expansion, the
expected fare revenue increase is not sufficient to finance the expansion, hence the need for the gasoline tax.

Consider the four perspectives—transit users, private vehicle users, the transit authority, and society—assuming
for simplicity that the subsets of transit users and private vehicle users are mutually exclusive and collectively
exhaustive of the set of individuals composing the society. The costs and benefits for society, then, are the sum of
the costs and benefits for transit users, private vehicle users, and the transit authority. Table 7.1 shows a list of
costs and benefits and how they affect the bottom line for each of the perspectives. Economic theory predicts that
over time, some private vehicle users will switch to public transit because of the higher cost of driving relative to
using transit and the improved transit service. This is not fully accounted for in Table 7.1; individuals switching
travel mode should be treated as a separate category. Switching from using vehicles to using public transit should
result in reduced congestion, pollution, and GHGs. These benefits are shown in the bottom half of Table 7.1.
The last row in the table shows the overall net costs or benefits for each of the perspectives. Here is a summary of
the effects of expanding public transit and subsidizing it with a gasoline tax:

Gasoline taxes are a cost to private vehicle users paying the taxes and a benefit to the transit authority
collecting the taxes; they have no impact on society because the taxes are a transfer from private vehicle users
to the transit authority.
Additional fare charges are a cost to transit users paying the fares and a benefit to the transit authority
collecting the fares; they have no impact on society because the fares are a transfer from transit users to the
transit authority.
Transit service increases are a benefit to transit users and to society.
Costs of resources used in the expansion are a cost to the transit authority and society.
Reductions in externalities such as congestion, pollution, and GHG emissions benefit transit users and
private vehicle users and, thus, society.
Transit users benefit because the utility from the additional service exceeds the additional fares paid
(otherwise, they would not increase usage).
Private vehicle users’ net benefits are to be determined. If reductions in congestion, pollution, and GHG
emissions produce higher benefits than the cost of the gasoline tax, they will benefit.
By assumption, the effect on the transit authority is nil, as taxes and fares are supposed to finance the cost of
the expansion.
Society will benefit as long as the resource costs of the expansion are lower than the sum of the increased benefits to
transit users from the expanded service plus the value of the reduction in externalities. Fares and taxes are
transfers between transit users, private vehicle users, and the transit authority. Once these three perspectives
are added together to reflect the costs and benefits to society, the costs and benefits of fares and taxes to
different subgroups offset each other. The only remaining costs and benefits for society as a whole are the
value created from increased service and from reduced congestion, pollution, and GHG emissions and the
opportunity cost of the resources used in the expansion.

Table 7.1 Selected Costs and Benefits of Transit Expansion Financed Through the Gasoline Tax

                                              Transit Users | Private Vehicle Users | Transit Authority | Society

Tax, fares, and service effects before change in behavior
Gasoline tax                                  No effect     | Cost                  | Benefit           | No effect
Fares                                         Cost          | No effect             | Benefit           | No effect
Value (utility of increased service)          Benefit       | No effect             | No effect         | Benefit
Opportunity cost of resources for expansion   No effect     | No effect             | Cost              | Cost

Reduction of negative externalities as a result of switch from driving to transit
Reduced congestion                            Benefit       | Benefit               | No effect         | Benefit
Reduced pollution                             Benefit       | Benefit               | No effect         | Benefit
Reduced greenhouse gas emissions              Benefit       | Benefit               | No effect         | Benefit

Total                                         Benefit       | To be determined      | No effect         | To be determined
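
To make the aggregation in Table 7.1 concrete, the following minimal sketch (all dollar figures hypothetical) nets out each perspective and shows that the fare and tax transfers disappear at the societal level.

    # Hypothetical annual figures, in $ millions; positive = benefit, negative = cost.
    transit_users = {"fares paid": -10, "value of increased service": 16, "reduced externalities": 2}
    vehicle_users = {"gasoline tax paid": -12, "reduced externalities": 8}
    transit_authority = {"fare revenue": 10, "gasoline tax revenue": 12, "resource cost of expansion": -22}

    def net(perspective):
        return sum(perspective.values())

    society = net(transit_users) + net(vehicle_users) + net(transit_authority)
    print(net(transit_users), net(vehicle_users), net(transit_authority), society)   # 8 -4 0 4
    # The fare and tax transfers cancel when the perspectives are summed: society's net
    # benefit (4) equals the service value (16) plus the externality reductions (2 + 8)
    # minus the resource cost of the expansion (22).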

Benefits from the reduction of externalities will manifest themselves as reduced commuting time, which can be
priced at the value of leisure; improved health as a result of lower pollution, which can be valued as the increase in
the value of a statistical life resulting from the project; and reduced GHG emissions, which can be valued at the
estimated social costs of GHG emissions. The next section of this chapter discusses how we assign monetary value
to such benefits (and costs) that have no market price. GHG emission reductions are a special case because, unlike
pollution and congestion, GHG emissions affect the entire world and reducing emissions would provide benefits
to non-nationals. However, if a government has made international commitments to reduce GHG emissions, it
would be reasonable to expand “society” to global society when taking into account the benefits of GHG
reductions. This is particularly true if a government expects reciprocal commitments from international partners,
is shouldering a larger relative share of the global commitment to GHG reductions as a form of development aid,
or has recognized that its own per capita GHG emissions are relatively high. GHG emission reductions are an
important transportation policy objective, and including them in this example illustrates that externalities may
affect persons without standing and that special consideration may be given to costs or benefits to outsiders in a
CBA when jurisdictions are collaborating to reduce externalities that cross jurisdictional boundaries.

Please note that this analysis has been simplified for illustration purposes and should not serve as the basis for a
CBA of such an undertaking.

Valuing Nonmarket Impacts
As indicated in the transit example previously, a project that reduces negative externalities, such as congestion,
pollution, and GHG emissions, generates social benefits or utility that cannot be valued using market prices.
There have been numerous methodological advances made in the valuation of non-market costs and benefits over
the past three decades, especially in the field of environmental economics (Atkinson & Mourato, 2015). While
discussing the methodological improvements and institutional innovations over that time period is beyond the
scope of this chapter, we define the main approaches used in valuing nonmarket impacts.

Welfare economics defines social utility as the sum of the utilities of individuals with standing. There are two
measures of (changes in) utility for individuals in welfare economics: willingness-to-pay (WTP) and willingness-
to-accept (WTA). WTP measures the maximum amount an individual is willing to pay to acquire a good, and
WTA measures the minimum amount an individual would be willing to accept to forego a good (to incur a cost).
Market prices in competitive markets are taken to reflect WTP and WTA. In the absence of a market for a good,
WTP and WTA are estimated using a variety of methods, discussed later.

In the remainder of this section, we define methods used to derive WTP and WTA in the absence of a market.
These methods are classified as revealed preferences or stated preferences methods and include the hedonic
price method, the travel cost method, the averting behavior and defensive expenditure methods, the cost of
illness and lost output methods, the contingent valuation method, and choice modeling.

Revealed and Stated Preferences Methods for Valuing Nonmarket Impacts


Table 7.2 provides a brief description of the approaches used for valuing nonmarket impacts. The approaches are
classified into revealed and stated preferences methods. Revealed preference methods are indirect methods of
valuing impacts; they do not ask people for their valuations but infer them from their behavior. They include the
hedonic price method, the travel cost method, the averting behavior and defensive expenditure approaches, and
the cost of illness and lost output approaches (Pearce et al., 2006). Stated preference methods include the
contingent valuation method and choice modeling. Both stated preference methods use survey questionnaires to
elicit WTP and/or WTA from representative individuals or households whose welfare is expected to be affected by
the proposed interventions. For a discussion on survey methods used to elicit WTP and/or WTA information, see
Board et al. (2017).

Table 7.2 Revealed and Stated Preferences Methods

Revealed preferences methods

Hedonic price method: Regression analysis used to estimate the value of nonmarket attributes of a market good,
such as the amenity value of urban green space, which is factored into the housing prices of properties benefiting
from the green space. Similarly, it is the incremental additional cost of a scenic view when purchasing a home.

Travel cost method: A method used to estimate the use value of recreational amenities. Information on direct and
indirect expenditures made by visitors of recreational amenities, including the opportunity cost of travel time, is
used to estimate the lower-bound value of amenities to visitors.

Averting behavior and defensive expenditure methods: A method used to estimate the value of the avoided risk or
consequences, based on expenditures made by individuals or households to avoid risk or adverse consequences.

Cost of illness and lost output methods: A method used to estimate the cost of an illness, based on the cost of
treating the illness, the value of lost output as a result of the illness, and the cost of pain and suffering.

Stated preferences methods

Contingent valuation: A random sample of the relevant population is surveyed to elicit their willingness-to-pay
for, or willingness-to-accept compensation for, the expected consequences of an investment.

Choice modeling: A random sample of the relevant population is surveyed to elicit their willingness-to-pay for, or
willingness-to-accept compensation for, the expected consequences of each of two (or more) mutually exclusive,
multidimensional investments or policies.
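
As an illustration of the logic behind the first method in Table 7.2, the following sketch simulates a very simple hedonic price regression: house prices are generated from floor area plus a premium for adjoining green space, and ordinary least squares recovers that implicit, nonmarket premium. The data and coefficients are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    floor_area = rng.uniform(80, 250, n)            # square meters
    near_park = rng.integers(0, 2, n)               # 1 if the property adjoins green space
    # Hypothetical data-generating process: $1,500 per m2 and a $20,000 green-space premium.
    price = 50_000 + 1_500 * floor_area + 20_000 * near_park + rng.normal(0, 15_000, n)

    # Ordinary least squares: price regressed on a constant, floor area, and the green-space indicator.
    X = np.column_stack([np.ones(n), floor_area, near_park])
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    print(f"Estimated implicit value of green-space proximity: ${coef[2]:,.0f}")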

Steps for Economic Evaluations
The following are the nine major steps for a CBA (adapted from Boardman et al., 2018; Pearce et al., 2006). The
main steps of a CEA (and CUA) parallel those of CBA, except that with CEA and CUA the benefits are quantified
but not monetized. Each of the steps is discussed briefly in the following subsections, highlighting how they differ
between CBA, CEA, and CUA.

1. Specify the set of alternatives.
2. Decide whose benefits and costs count (standing).
3. Categorize and catalog the costs and benefits.
4. Predict costs and benefits quantitatively over the life of the project.
5. Monetize (attach dollar values to) all costs and benefits (for CEA and CUA, quantify benefits).
6. Select a discount rate for costs and benefits occurring in the future.
7. Compare costs with outcomes, or compute the NPV of each alternative.
8. Perform sensitivity and distributional analysis.
9. Make a recommendation.

1. Specify the Set of Alternatives


CBAs may consider one or more alternatives, although many, if not most, CBAs consider a single project, albeit
with potentially different approaches or levels of investment. For example, a state government may arrange for a
CBA of investment in a highway project that compares alternative approaches to the project, such as the location
of the road, the number of lanes of highway, and the number of overpasses that offer high-speed intersections with
other roads.

CEAs and CUAs should consider two or more alternatives, with the objective of recommending the alternative
with the lowest (marginal) cost per unit of (marginal) benefit/outcome unless—if only one intervention is to be
considered—a benchmark cost per unit, such as a cost-per-QALY threshold, is available for comparison. For instance, if a
health authority set a QALY threshold of $40,000, the equivalent policy statement is that medical interventions
costing less than $40,000 per QALY generated would be considered admissible for implementation.

The key point to keep in mind when specifying the alternatives to be compared is that as more alternatives are
added to the mix, the analysis becomes more complicated. It is important to carefully choose the alternatives and
keep their number manageable.

2. Decide Whose Benefits and Costs Count (Standing)


As discussed previously, the perspective for a CBA must be chosen early on. Perspective, or standing, will typically
reflect the underlying population of the jurisdiction of the government with control over the expenditure. CBAs
should take the social perspective; analyses that focus on agency costs and benefits are not considered CBAs. The
same holds for social CEAs and CUA analyses. However, some studies are financial CEAs or CUAs, merely
counting the net savings to the government agency of adopting an intervention (Neumann, 2009). While such a
study provides useful information for the agency, and information that could be used in a CEA, it should not be
classified as a social CEA (Newcomer, Hatry, & Wholey, 2015).

3. Categorize and Catalog the Costs and Benefits


This step involves listing all inputs (costs) and all outputs or outcomes (benefits) for each alternative, including
costs and benefits with available market prices and costs and benefits without. While all the social costs and
benefits for each of the project alternatives should be included originally, some costs and benefits may be
considered too small to quantify and others too difficult to quantify, especially in complex projects involving human
service programs. Often, only the key inputs and outputs or outcomes are counted and monetized, although the
catalog of costs and benefits should include those that cannot be quantified, so that a qualitative/conceptual
discussion of unquantifiable costs and benefits can be included in the analysis.

Types of physical and intangible costs and benefits vary between applications. All project alternatives will include
the costs of resources used to carry out a program, policy, or intervention. Applications involving educational
investments will include the benefits of improved educational outcomes, which for CBA purposes are usually
measured in terms of changes in learners’ productivity over their lives, but they could also include health and
justice outcomes if the population served is disadvantaged and has a high risk of poor health outcomes and/or
engaging in criminal activity. Applications involving justice policy would include changes in the incidence of
criminality and the consequent social cost of victimization and incarceration, including lost productivity of victims
and offenders. Applications in transportation infrastructure investments would include the benefits of reduced
vehicle costs, travel time, accidents, and pollutants.

When identifying the types of costs and benefits, an evaluator should do a review of the economic evaluation
literature related to the investment or alternatives under consideration to identify the costs and benefits that are
typically considered and the methods used to identify and monetize them in similar contexts. In Shemilt,
Mugford, Vale, Marsh, and Donaldson (2011), several authors provide useful discussions of review methods and
issues encountered in economic evaluations in the fields of health care, social welfare, education, and criminal
justice. Hanley and Barbier (2009), Pearce et al. (2006), and Atkinson and Mourato (2015) provide useful
background and discussions for environmental applications. The Transportation Research Board’s Transportation
Economics Committee (n.d.) has published a useful cost–benefit guide for transportation applications that
includes a number of case studies and lists the types of costs and benefits to be included in transportation CBAs.
Litman (2011) takes an approach to transportation policy that considers a more comprehensive set of costs and
benefits than have traditionally been included, including the health benefits of switching from passive modes of
transportation to active ones.

4. Predict Costs and Benefits Quantitatively Over the Life of the Project
Once we have catalogued the costs and benefits, we need to quantify them (or, in the case of an ex post analysis,
calculate them) for the time frame of the analysis. For example, if we have decided that “change in students’ test
scores” over the time span of alternative educational interventions on randomly assigned groups of students is the
denominator in a CEA, the next step would entail obtaining scores just before the interventions began and just
after they ended.

There are a number of different approaches to modeling or forecasting costs and benefits for policies, programs, or
projects that vary from one substantive field to another. For example, in health care, Markov simulation models
have been applied to model chronic disease progression to estimate the null case if no intervention is undertaken
(Briggs & Sculpher, 1998). An example of forecasting the benefits of a transportation project is included in
Boardman, Greenberg, Vining, and Weimer (2018), where they describe the cost–benefit analysis that was done
for the Coquihalla Highway Project in the interior of British Columbia, Canada. In that project, the cost–benefit
analysis examined two alternatives: building the highway so that it would be a toll road or building it as a public
highway without tolls. In the end, the decision of the government of the day was to build the toll-funded version
of the road (although the toll was removed years later).

5. Monetize (Attach Dollar Values to) All Costs and Benefits


This step involves attaching dollar values to all the quantified input components and, in the case of CBA, the
quantified outputs or outcomes indicators. Alternative methods such as hedonic pricing, contingent valuation,
choice modeling, or other methods discussed earlier in this chapter would be used at this stage to estimate social
values for costs and benefits for which no market prices are available. However, it is often the case that monetizing
social benefits is controversial, particularly in the health sector (Neumann et al., 2017).

In the case of CEAs and CUAs, the benefits are not monetized but are expressed in natural units, for example,
“number of falls prevented.”

When monetizing future costs and benefits, the following should be kept in mind. Monetary measures change in
their value over time as the purchasing power of a currency is eroded through inflation. Future costs and benefits
for economic evaluations should not be indexed to reflect inflation. In other words, economic evaluations should
use real costs and real benefits (where price inflation has been taken out of the estimates of costs and benefits) and
not nominal costs and benefits, which include the costs of inflation. However, if the relative cost of a resource
used or the relative value of a benefit generated is expected to change over time, the change should be taken into
account. For instance, as fresh water is becoming increasingly scarce, projects that use or save water over many
years should reflect estimates of its increasing relative value.
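
A minimal sketch of this distinction, using hypothetical figures: a cost quoted in future (nominal) dollars is deflated by expected general inflation before use, while an expected change in the relative price of the resource is kept.

    inflation = 0.02          # expected general inflation (hypothetical)
    relative_growth = 0.01    # expected growth in the resource's relative (real) price (hypothetical)
    year = 10
    cost_today = 1_000

    nominal_cost = cost_today * ((1 + inflation) * (1 + relative_growth)) ** year   # billed amount in year 10
    real_cost = nominal_cost / (1 + inflation) ** year                              # value to use in the evaluation
    print(round(nominal_cost, 2), round(real_cost, 2))
    # real_cost equals 1,000 * 1.01**10 (about 1,104.62): the inflation component is removed,
    # while the expected relative scarcity of the resource is kept.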

6. Select a Discount Rate for Costs and Benefits Occurring in the Future
Discounting is a relatively straightforward arithmetical application, but the theoretical foundations for
discounting are the subject of vigorous debates and controversies in the literature. In economic evaluations, costs
and benefits that occur over more than 1 year are discounted. “Discounting refers to the process of assigning a
lower weight to a unit of benefit or cost in the future than to that unit now” (Pearce et al., 2006, p. 184). The
weights attached to each future period are a function of the discount rate and the time distance of the future
period from the period to which the future costs and benefits are being discounted. The discount rate is related to
the real market interest rate, as discussed later. Once all costs and benefits are discounted to a common period,
usually the period in which the initial investment begins, they can be added together and their net present value
(NPV) calculated. That is, net present value is the economic value of a project once net present (discounted) costs
have been subtracted from net present (discounted) benefits. Non-monetized outcomes realized in the future and
used in the denominators of the effectiveness ratios in CEAs and CUAs should also be discounted (the estimated
quantities adjusted numerically to reflect the discount rate), in the same way that the money values are discounted.
The formulas used for discounting are presented in the next section.

Two arguments are advanced to support discounting. The first is based on the preferences of individuals, and the
second on opportunity costs; in practice, either can be considered consistent with welfare economics.
The first argument is that individuals need to be rewarded to save because they expect their incomes to grow over
time and would prefer to borrow now against future earnings so as to smooth consumption over their life cycle.
Moreover, individuals are assumed to be relatively impatient and myopic, and they take into account the
probability of the risk of death—as a result, individuals will only postpone consumption by saving a portion of
their income if they receive a premium to do so. The second argument is that financial capital is productive and
can be used to generate returns and therefore has an opportunity cost. The first argument represents the marginal
social rate of time preference (SRTP), while the second represents the marginal social opportunity cost of
capital (SOC). In a perfectly competitive market, SRTP (borrowing rate) and SOC (lending rate) should be equal,
yielding the real market interest rate, but for a variety of reasons, SRTP is smaller than SOC (Edejer et al., 2003;
Zhuang, Liang, Lin, & De Guzman, 2007).

In practice, prescribed discount rates vary considerably between national governments, with some favoring the
lower SRTP and others the higher SOC. Moreover, the use of high discount rates has been questioned on ethical
grounds, especially with projects that have long-term environmental consequences. To get a grasp of the
magnitude of the problem, consider an annual discount rate of 4%, which is, in fact, lower than many discount
rates used around the world (Zhuang et al., 2007). A benefit occurring 100 years from now would be weighted at
1/(1.04)^100, or about 2%, of its actual value. Discounting therefore makes the problems of climate change and other
environmental issues with long-term consequences seem to disappear. In recognition of this problem, scholars
have considered a variety of solutions, including zero discounting and declining rates of discount. In practice,
there is “an extraordinary range of practices” (Atkinson & Mourato, 2015). The alternative would be to use very
low discount rates, as suggested by Stern (2008). Stern argues that SRTP and SOC are based on an expectation of
growth, but the path we choose with respect to climate change will affect growth, and therefore, the discount rate
for a non-marginal intervention such as addressing climate change actually varies with the path we choose. A path
where climate change is not addressed would imply a lower discount rate as growth will be impaired and the
discount rate increases with expected growth. Moreover, Stern argues that, on ethical grounds, the discount rate
should not favor the current generation over future generations. Based on his decomposition and analysis of the
components of the discount rate, Stern’s CBA of climate change mitigation used a discount rate of 1.4%
(Ackerman, 2009). While there is no scholarly consensus on what discount rate to use, we favor arguments for
using SRTP, which has been adopted by the Government of the United Kingdom, where the discount rate
(SRTP) has been set (estimated) at 3.5% for short-term projects and declining rates adopted for long-term projects
(HM Treasury, 2003). Edejer et al. (2007) also recommend the use of SRTP for health applications, with an even
lower estimate of 3%, and they recommend 6% for sensitivity analysis.
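
The arithmetic of that sensitivity is easy to reproduce; the following sketch computes the present-value weight of a benefit occurring 100 years from now at the real rates mentioned in this section.

    # Present-value weight of one unit of benefit occurring 100 years from now.
    for rate in (0.014, 0.03, 0.035, 0.06):
        weight = 1 / (1 + rate) ** 100
        print(f"discount rate {rate:.1%}: weight = {weight:.3f}")
    # Roughly 0.249 at 1.4%, 0.052 at 3%, 0.032 at 3.5%, and 0.003 at 6%,
    # which is why the choice of rate dominates long-horizon results.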

The actual discount rates of 1.4%, 3%, 3.5%, and 6% discussed earlier are real rates, as opposed to nominal
interest rates, and are therefore not comparable with market interest rates. Because currencies lose their value over
time as a result of inflation, (observed) market rates for SRTP (the lending rate) and SOC (borrowing rate) are
nominal, meaning that they include an inflation factor to compensate lenders for the loss of value of the currency.
As discussed before, an economic evaluation should be conducted using real values rather than nominal values.
However, if nominal future costs and benefits are used in the analysis, then a nominal discount rate should also be
used.

7. Compare Costs With Outcomes, or Compute the Net Present Value of Each Alternative
In CEA and CUA, a ratio of incremental costs to incremental outcomes is formed for each of the alternatives
considered. The ratios are compared, with lower ratios representing more cost-effective alternatives. While
monetary NPV cannot be calculated in such analyses because benefits are not monetized, discounting is applicable
to costs and outcomes in these analyses as well. Outcomes represent benefits that could in theory be monetized,
and non-monetized quantities of benefits only differ from monetized ones by not being multiplied by a money
value per unit of benefit. Receiving the benefit in the future has an equivalent relationship to preferences, whether
the benefit is monetized or not. Discounting assigns lower values to future costs and benefits and can be applied to
flows of money or costs and benefits expressed as outcome quantities occurring in the future.

To discount costs and benefits, the timing of all identified costs and benefits is needed, along with the real
discount rate chosen for the intervention, although there are applications using time-declining discount rates, as
discussed previously. The formula for calculating the NPV of costs and benefits with a constant discount rate is
NPV = ∑t [(Bt − Ct) / (1 + D)^t],

where

∑ = the summation operator—the formula is applied for each value of t (each time period) and the results added,

Ct = costs during period t,

D = the discount rate (assumed constant), and

Bt = benefits during period t.

If costs are monetary and benefits are nonmonetary, then two separate calculations are required, one for costs and
one for benefits.

As a simple example, suppose that a cost is expected to accrue 5 years from the present. Assume a discount rate of
3.5% and that the expected amount of the cost in Year 5 is $100. The present value of that cost is

$100/(1.035)^5 = $84.20.

Thus, a cost of $100 five years from now at a discount rate of 3.5% has a present value of $84.20. As discussed
earlier, the higher the discount rate, the lower the NPV of a project, as projects typically involve an up-front
investment, followed by benefits in the future. Moreover, NPVs for projects with benefits far into the future
relative to the investments will be more sensitive to increases in the discount rate.
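
A minimal sketch of the formula above: with a 3.5% rate, the five-year $100 cost works out to the same $84.20 present value, and a stylized project stream (hypothetical figures) illustrates the full NPV calculation.

    def npv(benefits, costs, rate):
        """Net present value of period-by-period benefit and cost streams (period 0 = now)."""
        return sum((b - c) / (1 + rate) ** t
                   for t, (b, c) in enumerate(zip(benefits, costs)))

    # Present value of a single $100 cost five years from now at 3.5%:
    print(round(100 / 1.035 ** 5, 2))              # 84.2

    # Stylized project: a $1,000 outlay now and $250 of benefits in each of years 1-5.
    benefits = [0, 250, 250, 250, 250, 250]
    costs = [1_000, 0, 0, 0, 0, 0]
    print(round(npv(benefits, costs, 0.035), 2))   # about 128.8, so NPV > 0 at this rate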

8. Perform Sensitivity and Distributional Analysis


In ex ante CBAs, we cannot make perfect predictions about variables such as SRTP, the utilization rates of a
program, the success rates of a medical treatment, the density of traffic on a new highway, or other outcomes, so
the evaluator will need to do sensitivity analyses to show the range of possibilities for some of the key variables.
Sensitivity analysis simulates different scenarios for a CBA. Typically, discount rates and other assumptions are
varied and then tested for their impacts on the overall net present benefit calculation. Furthermore, some studies
may provide various estimates on the basis of a range of discount rates, especially if the choice of discount rate is
controversial. Normally, the results of sensitivity analyses are presented in tables or displayed graphically, so that
the decision makers can easily make comparisons across several possibilities. For example, a graph can be helpful in
showing where the net present benefit for a project or program option goes to 0 and then becomes negative given
a range of discount rates.
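
The break-even picture described above can be produced with a short sweep over discount rates; this sketch reuses the stylized project from the previous example (hypothetical figures).

    def npv(net_stream, rate):
        # net_stream[t] is the net benefit (benefit minus cost) in period t; period 0 is now.
        return sum(x / (1 + rate) ** t for t, x in enumerate(net_stream))

    net_stream = [-1_000, 250, 250, 250, 250, 250]   # same stylized project as in the previous sketch
    for rate in (0.02, 0.04, 0.06, 0.08, 0.10):
        print(f"{rate:.0%}: NPV = {npv(net_stream, rate):,.1f}")
    # The NPV falls as the rate rises and turns negative between 7% and 8%; plotting
    # these points gives decision makers the break-even graph described above.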

A full CBA with significant distributional impacts will also conduct a distributional analysis to identify how the
intervention’s costs and benefits are distributed among different segments of society (Atkinson & Mourato, 2008).
This information is especially important for policymakers who may wish to compensate the losers in
implementing a policy or program. For example, a CBA may have shown that there will be an overall net benefit
to society if the government invests in a program to flood a valley to create a hydroelectric project, but it will be
important to show which stakeholders will benefit overall (e.g., the users of the electricity, who may pay lower
prices) and which groups may lose (e.g., farmers or Indigenous people who live in the valley and will have to
relocate). In addition to providing information for potential compensation, distributional analysis may be used in
conjunction with distributional weights to arrive at alternative measures of NPV. Distributional weights may
serve to address the inequity inherent in the use of WTP to value benefits. WTP is a reflection of one’s income,
and therefore, without distributional weights, the preferences of higher-income groups are overrepresented in
WTP (and thus in NPV). Distributional weights assign higher values/weights to the costs and benefits of lower-
income or otherwise disadvantaged groups to rectify the imbalance.
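
A minimal sketch of how distributional weights change the bottom line (both the group net benefits and the weights are hypothetical):

    # Discounted net benefits by group, in $ millions (hypothetical).
    net_benefits = {"low income": -4, "middle income": 3, "high income": 6}
    # Equity weights favoring lower-income households (hypothetical).
    weights = {"low income": 1.5, "middle income": 1.0, "high income": 0.7}

    unweighted = sum(net_benefits.values())
    weighted = sum(weights[group] * nb for group, nb in net_benefits.items())
    print(unweighted, round(weighted, 1))   # 5 and 1.2: still positive, but much smaller once equity is weighed in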
9. Make a Recommendation
CBAs, CEAs, and CUAs are generally conducted for government agencies or their affiliates but also by scholars
examining a research or evaluation question that interests them. In the former case, a recommendation would be
in order, while in the latter case, the scholar may refrain from making recommendations and instead conclude
with a contextualization of the findings.

CEAs and CUAs typically compare alternative interventions, and the intervention that costs the least per unit of
outcome generated would be preferred. If only one intervention is evaluated, then a benchmark criterion should
be available for purposes of comparison. CBAs that examine one intervention can suggest whether the intervention
is worthwhile by using the NPV > 0 criterion for project acceptance. For CBAs that consider two or more
mutually exclusive interventions, the appropriate choice criterion is to choose the intervention with the highest
NPV, the assumption being that the size of the investment is irrelevant and the objective is to choose the
intervention that maximizes social value. This is the correct assumption because the opportunity cost of capital is
already taken into account via discounting. If CBAs are conducted that examine several interventions that are not
mutually exclusive and the objective is to maximize NPV by selecting projects that exhaust an agency’s fixed
capital budget, then the appropriate selection procedure is to rank projects on the basis of their NPV per unit of
capital outlay and select those with the highest NPV-to-capital outlay ratio until the budget is exhausted.
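
A minimal sketch of that selection rule, with hypothetical projects: rank by NPV per dollar of capital outlay and select down the list until the budget is exhausted.

    # Candidate projects: (name, NPV, capital outlay), in $ millions (hypothetical).
    projects = [("A", 30, 20), ("B", 25, 10), ("C", 12, 4), ("D", 8, 8)]
    budget = 35

    selected, remaining = [], budget
    for name, npv, outlay in sorted(projects, key=lambda p: p[1] / p[2], reverse=True):
        if outlay <= remaining:
            selected.append(name)
            remaining -= outlay
    print(selected, budget - remaining)   # ['C', 'B', 'A'], using $34M of the $35M budget
    # With lumpy projects the last dollar of budget may require comparing combinations,
    # but the ratio ranking is the basic procedure described above.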

While the prior discussion provides appropriate decision criteria and project selection procedures for different
contexts, it is important to remember that the economic analysis is only one part of the decision-making process. The
policymakers are given the evaluation information and then must take into consideration equity issues and
political, legal, and moral/ethical factors. The policymakers will also need to consider the various scenarios offered
in the sensitivity analysis. Fuguitt and Wilcox (1999), while acknowledging that professional judgment is
inevitably a part of the process of an economic evaluation, offer the following suggestions for the analyst’s role:

The analyst’s responsibility is to conduct the analysis with professional objectivity and minimize
inappropriate subjective influences. Specifically, in performing the analysis, the analyst must (1)
minimize any bias or otherwise misleading influences reflected in the measurements of individuals’
subjective preferences, (2) explicitly identify any value judgments embodied in the analysis and make
transparent their implications for the outcome and (3) where analyst or decision-maker discretion is
possible, choose an approach that minimizes one’s subjective influence and demonstrates the sensitivity
of the analysis to alternative (subjective) choices. (p. 18)

Thus, the reader of an economic evaluation report should expect to find a discussion of assumptions, value
judgments, technical choices, possible errors that were made during the evaluation, and even subjective biases or
conflicts of interest that may have affected the outcome of the analysis (Fuguitt & Wilcox, 1999). An example of
the latter is a cost–benefit analysis done to determine whether Zyban (a drug to inhibit the desire to smoke) was
relatively more effective than the patch, counseling, or a placebo. One of the principal investigators worked in the
company that produced Zyban—on the face of it, this conflict of interest could be said to undermine the
credibility of the findings. The results suggested that Zyban was cost-beneficial (Nielsen & Fiore, 2000).

Cost–Effectiveness Analysis
Cost–effectiveness analysis (CEA) is used to compare the costs of alternative interventions used to achieve a
particular outcome/impact, such as life-years saved, falls prevented, roadways maintained, 5-year cancer
survivability, or incremental improvements in test scores. We will consider situations where several outcomes have
been combined when we look at cost–utility analyses (CUAs) later in this chapter. CEA calculates the ratio of the
incremental costs of implementing the intervention to the incremental outcome produced; this is the cost–effectiveness
ratio. CEA is preferred to cost–benefit analysis (CBA) by many evaluators, especially in health, because it does not
require that a monetary value be placed on the health outcome, which greatly simplifies the analysis (Garber &
Phelps, 1997). If the purpose of an evaluation is to decide among two or more alternative interventions or
treatments, and the commissioning agency is committed to implementing one of the two alternatives, then CEA
provides a sufficient criterion for the evaluator to make a recommendation: Choose the intervention with the
lowest cost–effectiveness ratio. Moreover, if interventions are not mutually exclusive and if their outcomes are
independent from one another, the agency could rank various interventions according to their cost–effectiveness
ratios, as long as the outcome measure is common to the interventions. CEA is used in the social services and the
health sector but is becoming more important in the education sector and is also used in the crime prevention and
transportation sectors (Levin & McEwan, 2001; Tordrup et al., 2017).
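
A minimal sketch of the decision rule, comparing two hypothetical fall-prevention programs, each measured against the status quo:

    # (incremental cost, incremental falls prevented) relative to the status quo; hypothetical figures.
    programs = {"A": (120_000, 80), "B": (150_000, 120)}

    for name, (cost, falls_prevented) in programs.items():
        print(f"Program {name}: ${cost / falls_prevented:,.0f} per fall prevented")
    # A costs $1,500 and B costs $1,250 per fall prevented, so B is the more cost-effective choice.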

While CEA can be a powerful tool for choosing among alternatives, it does not speak to the question of whether
any of the alternatives should be undertaken. This is because it does not provide information about the monetized
value society places on the resulting outcomes. Moreover, restricting the denominator to one outcome ignores
other potential beneficial results. Unlike CBA and CUA, CEA is not grounded in welfare economics, but it
provides a relatively simple approach for comparing the cost-effectiveness of interventions with a common
outcome. A number of publications provide methodological guidance for the application of CEA (Drummond et
al., 2015; Edejer et al., 2003; Garber & Phelps, 1997; Weinstein, Siegel, Gold, Kamlet, & Russell, 1996).

Cost–effectiveness ratios are appealing tools when they are used summatively—that is, to make judgments about
the future of programs. However, it is necessary when doing CEA to remember that the most effective approaches
are not necessarily the least costly ones (Royse, Thyer, Padgett, & Logan, 2001) and to ensure, when comparing
alternative interventions, that the outcomes achieved by the different interventions are comparable in quality.

Cost–Utility Analysis
Even where quantitative information on program outcomes is available, programs typically consist of multiple
components, as well as multiple outcomes. The cost–effectiveness analysis (CEA) decision criterion is based on the
ratio of incremental costs to a single incremental outcome, which is not appropriate for complex interventions
with multiple outcomes that can vary between interventions. CEAs are especially useful for clinical trials
comparing equivalent drugs that target a single health measure but they are less useful for interventions with
multiple outcomes. Cost–utility analysis (CUA) is a variation of CEA that uses a utility index to represent
preferences in the denominator rather than a single outcome.

Its most common form is used in the health sector, with quality-adjusted life-years (QALY) gained used as an
outcome. It is useful for comparing the health and economic consequences of a wide variety of medical and
health-related interventions. In measuring the outcome, the method combines both the years of additional life and
the subjective value of those years, in cases where quality of life is affected. Once the cost of each type of
intervention is calculated, it is possible to create a ratio of cost per QALY. However, QALYs can only be taken to
be a representation of preferences under very restrictive assumptions, and they “yield systematically different
conclusions [from WTP] about the relative value of reducing health and mortality risks to individuals who differ
in age, pre-existing health conditions, income, and other factors” (Hammitt, 2002, p. 985), leading to “cost-per-
QALY thresholds discriminat[ing] on the basis of age and disability by favoring younger and healthier populations
who have more potential QALYs to gain” (Neumann, 2011, p. 1806). The use of QALY thresholds for resource
allocation decisions is therefore controversial and has been discontinued or was never adopted in some
jurisdictions. Although he recognizes the methodological issues that arise from using QALY for resource allocation
decisions, Neumann (2011) elaborates on what he perceives as the main reason for such decisions:

Above all, critics conflate QALYs with rationing. They do not distinguish QALYs as an outcome
measure from cost-per-QALY thresholds as a decision tool and seem to blame the QALY for revealing
uncomfortable choices in health care. They fault the measure for presenting an unacceptable intrusion
into the patient–physician relationship. They imply that QALYs represent an absence of clinical
judgment and a loss of control, which could shift from physicians and patients to economists or
bureaucrats who themselves do not provide care and who have a cost-containment agenda. (p. 1806)

Nevertheless, Neumann (2011) is a strong advocate for the use of QALY, and he is not alone in this support:

For all of its shortcomings, the QALY provides a helpful benchmark in considerations of comparative
value. Cost-per-QALY ratios have been endorsed by the US Panel on Cost-Effectiveness in Health and
Medicine, composed of physicians, health economists, ethicists, and other health policy experts. (p.
1807)

The use of QALY to measure health outcomes is an improvement over the use of a single indicator for health
outcomes; however, collecting information to construct QALYs is resource intensive. Evaluators have developed
several methods to determine the subjective valuations of the quality of life of various health outcomes to create
the QALY index. Three common methods are the health rating method, the time trade-off method, and the
standard gamble method. With the health rating method, “researchers derive a health rating (HR) from
questionnaires or interviews with health experts, potential subjects of treatment, or the general public” (Boardman
et al., 2018, p. 477). The time trade-off method involves gathering subjective ratings of various combinations of
length of life and quality of life in that time, whereas the standard gamble method involves having the subjects
make choices in decision tree scenarios with various timings of life, death, or survival with impaired health
(Boardman et al., 2018). The results of this research then facilitate the development of QALY tables, which can be
used to help determine the best use of health care resources. Kernick (2002, p. 111), for example, provides a table
of estimates of “cost per QALY of competing interventions” (in £1990), including interventions such as

GP advice to stop smoking (£270)
Antihypertensive therapy (£940)
Hip replacement (£1180)
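
As a minimal sketch of how such cost-per-QALY figures are assembled (all numbers hypothetical, and the life-year streams are left undiscounted for brevity):

    # Hypothetical intervention versus comparator.
    years_with, quality_with = 3.0, 0.8        # life-years and quality weight with the intervention
    years_without, quality_without = 2.0, 0.6  # life-years and quality weight without it
    incremental_cost = 30_000                  # discounted incremental cost of the intervention

    qalys_gained = years_with * quality_with - years_without * quality_without   # 2.4 - 1.2 = 1.2
    print(round(incremental_cost / qalys_gained))   # 25000, i.e., $25,000 per QALY gained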

Sensitivity analysis for CUAs should include examining a range of the following: costs, preference weights among
outcomes, estimates of effectiveness, and the cost-related discount rate (Neumann et al., 2015). The QALY
approach has had broad application in health policy analyses but has not been used extensively in other program
or policy areas (Levin & McEwan, 2001). While health outcomes are relevant in other policy areas, for instance,
studies of infrastructure or environmental interventions to improve air or water quality, infrastructure and
environmental studies are typically CBAs, and the value of improved health is estimated using WTP.

Cost–Benefit Analysis Example: The High/Scope Perry Preschool Program
In Chapter 3, we introduced the Perry Preschool Program evaluation as an example of a longitudinal randomized
controlled research design. Children who started the program as preschoolers in the 1960s and are now
approaching their sixth decade have been followed and periodically interviewed. When combined with additional
lines of evidence, this project continues to generate papers and reports (Derman-Sparks, 2016; Englund,
Reynolds, Schweinhart, & Campbell, 2014; Heckman et al., 2010).

Here, we discuss a U.S. evaluation that conducted a cost–benefit analysis (CBA) of the High/Scope Perry
Preschool Program (HSPPP) (Belfield, Nores, Barnett, & Schweinhart, 2006, p. 162). We examine the study in
terms of the nine steps for conducting a CBA that were summarized earlier in this chapter.

The HSPPP study involved 123 children, 58 of whom were randomly assigned to the treatment group and 65 to a
control group (Belfield et al., 2006). The Coalition for Evidence-Based Policy (n.d.) provides a description of the
program:

The Perry Preschool Project, carried out from 1962 to 1967, provided high-quality preschool education
to three- and four-year-old African-American children living in poverty and assessed to be at high risk of
school failure. About 75 percent of the children participated for two school years (at ages 3 and 4); the
remainder participated for one year (at age 4). The preschool was provided each weekday morning in
2.5-hour sessions taught by certified public school teachers with at least a bachelor’s degree. The average
child–teacher ratio was 6:1. The curriculum emphasized active learning, in which the children engaged
in activities that (i) involved decision making and problem solving, and (ii) were planned, carried out,
and reviewed by the children themselves, with support from adults. The teachers also provided a weekly
1.5-hour home visit to each mother and child, designed to involve the mother in the educational
process and help implement the preschool curriculum at home. (para. 2)

Randomized control trials (RCTs) can provide high-quality evidence on whether a program works. This is because
program participants and the control group are intended to be alike and comparable; differences between the
treatment and the control group can be attributed to the program, with a margin of statistical error that can be
attributed to random differences. While RCT evidence is considered high quality, the HSPPP experiment was a
demonstration project, and the results cannot be wholly generalized to a broader scale (Coalition for Evidence-
Based Policy, n.d., para. 1), although some later studies have included the results of this study in systematic multi-
study examinations and policy simulations (Anderson et al., 2003; Gilliam & Zigler, 2000; Hattie, 2008;
Reynolds & Temple, 2008).

As noted earlier, in CBA, both the costs and the benefits of a program are estimated in monetary units and
discounted. The discounted values are summed up (benefits as positive values, costs as negative) to arrive at the net
present value (NPV) of a program. Programs with a positive NPV have social benefits whose value exceeds the
social costs and are therefore considered to be worthwhile social investments.

In the Belfield et al. (2006) examination of the Perry Preschool Program, costs and benefits are calculated for
program participants and for the general public, which are then summed up to arrive at the costs and benefits to
society. They conclude that at a 3% discount rate, the program had net positive benefits to the general public that
are 12.9 times the costs, which when added to the benefits for program participants resulted in net positive
benefits to society that are 16.1 times the costs. Over time, other analyses have calculated other rates of return but
consistently have found that benefits have outweighed the costs (Heckman et al., 2010; Reynolds, Ou, Mondi, &
Hayakawa, 2017). Next, we discuss this study in terms of the nine steps of a typical CBA and point out some of
the key issues in assessing such a study critically.

1. Specify the Set of Alternatives
Some CBAs assess two or more alternative programs or interventions. The Belfield et al. (2006) study only
considers one program and calculates the program’s incremental costs and benefits compared with the absence of
the program (program vs. control group comparisons). As noted earlier in this chapter, unlike cost–effectiveness
analyses (CEAs), which need to compare the ratios of incremental costs to the incremental outcomes of two or
more programs to rank programs in terms of their cost-effectiveness in achieving outputs, CBAs can examine just
one program or investment in relation to the status quo because both costs and benefits are monetized, and it is
therefore possible to determine if benefits exceed costs (Net Present Value is greater than 0).

2. Decide Whose Benefits and Costs Count (Standing)
The Belfield et al. (2006) study takes a societal perspective. Researchers allocate costs and benefits to program
participants and to “general society” and then sum the costs and benefits to these two groups to arrive at the costs
and benefits to society. Although it is not clearly stated whether “society” refers to Michigan or the United States,
the costs and benefits to general society imply a national perspective, especially since the benefits are related to
participants’ actual and predicted behavior over their lifetime and participants cannot be expected to live in
Michigan for their entire lives.

3. Categorize and Catalog Costs and Benefits
Belfield et al. (2006) identify the following costs and benefits:

Costs

Program costs, including school district funding and program administration costs

The costs of increased educational attainment for program participants

Benefits

Incremental earnings of program participants

Reduced criminality of program participants

Administrative costs of reduced welfare receipts—welfare payments are a transfer from general society
(cost) to participants (benefit); net societal effects only include the cost of administering the welfare
program

The value of time saved or childcare costs saved to parents of program participants

The benefit of increased efficiency in progressing through educational programs for participants because
of lower grade retention and a reduced need for special education

Benefits discussed but not included in the calculations

Improved health outcomes for participants

Lower participant mortality rates

Intergenerational effects not yet detected

4. Predict Costs and Benefits Quantitatively Over the Life of the Project
The life of the project for the Belfield et al. (2006) study is the life of the program participants, although
potentially important intergenerational impacts were also discussed but not estimated. Participants had been last
interviewed at age 40, and program benefits to age 40 were calculated on the basis of survey responses. Program
benefits beyond that date were projections. The next section provides greater detail on how costs and benefits were
estimated.

5. Monetize (Attach Dollar Values to) All Costs and Benefits
Program costs and benefits were monetized as follows:

Program costs include costs taken from school district budgets and program administration costs. Operating
costs and capital expenses for the program were calculated. The undiscounted cost per participant in year
2000 dollars was calculated at $15,827. Expenditures by participants were not taken into account.
Lifetime earnings differences between participants and nonparticipants were calculated on the basis of (a)
responses from participants in the experiment about their labor market experiences to age 40 and (b)
extrapolation from labor market experiences to age 40 to estimate an earnings profile to age 65. Projecting
earnings from age 41 to age 65 presented methodological challenges because few career profiles are stable.
The authors provide detailed information on the calculations used to estimate earnings profiles. Overall
increases in earnings from program participation were estimated at between 11% and 34%.
Some of the benefits of increased earnings accrue to general society in the form of taxes and the remainder
accrue to participants. Taxes on earnings were estimated at 31% of earnings.
Reduced criminality resulted in the benefit of avoiding the costs of crime. Crime rates were compared
between program participants and nonparticipants. Program participants had lower crime rates than
nonparticipants. “The incidence of each type of crime across the program and no-program groups is
identified, and then these incidences are multiplied by the average cost of each type of crime” (Belfield et al.,
2006, p. 170). “Crime behaviors are divided into 11 categories: felonies of violent assault, rape, drugs,
property, vehicle theft, and other; and misdemeanors of assault/battery, child abuse, drugs, driving, and
other” (Belfield et al., 2006, p. 170). Criminal behavior was extrapolated beyond age 40. Murders were
subsumed under assaults because the difference in rates between participants (2%) and nonparticipants
(5%), along with the high cost of murder, would overshadow all other benefits and the authors preferred to
use a more conservative approach because of data limitations. Additionally, the number of arrests, which
understates the crime rate, was inflated by a factor derived from criminal justice studies to arrive at an
estimated actual crime rate. The average cost of crime includes victim costs, such as medical treatment,
property replacement, and reduced productivity, and criminal justice system costs for arrests, trials, and
sentencing.
The administrative costs of reduced welfare receipts reduce the costs to general society. Welfare reliance was
lower for program participants than it was for nonparticipants. For example, 71% of program participants
had ever received welfare by age 40 compared with 86% of nonparticipants. Welfare payments are a transfer
from general society (cost) to participants (benefit) and have no net impact on society. The societal effects,
therefore, only include the cost of administering the welfare program, estimated at 38 cents per dollar
disbursed, and the economic efficiency loss created by raising taxes to finance the program. The latter cost is
addressed in the sensitivity analysis.
The value of time saved or childcare costs saved to parents of program participants is a benefit to parents
and was estimated at $906 per participant on the basis of an earlier study of the program.
Increased efficiency in progressing through educational programs for participants is a benefit. This occurs
because of lower grade retention and a reduced need for special education. These cost savings were estimated
at $16,594 for program males and $7,239 for program females on the basis of an earlier study of the
program. Moreover, participants were more likely to complete high school on time, reducing participation
in adult schooling. Savings from this were estimated at $338 for males and $968 for females.
Increased postsecondary participation adds to costs, while decreased postsecondary participation reduces
costs. Cost differentials to the state and to individuals were both calculated to age 40. It was assumed that
no further education was undertaken after age 40.
As noted earlier, other benefits were identified but not quantified. Quantifying all benefits can be difficult
and costly, and if a program is already demonstrating that the more easily quantifiable benefits exceed costs
by a wide margin, which was the case for this program, it is also unnecessary.

6. Select a Discount Rate for Costs and Benefits Occurring in the Future
Belfield et al. (2006) applied discount rates of 3% and 7%. As noted earlier in this chapter, higher discount rates yield a lower present value of benefits: because program costs are incurred up front while most benefits are realized years later, discounting at a higher rate shrinks the value of those future benefits relative to costs. Discounting is standard in economic evaluation because the question at hand is whether an up-front investment yields sufficient benefits over time.
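
A minimal sketch of the discounting arithmetic illustrates the point; the $10,000 future benefit and the 25-year delay are hypothetical values chosen only to show how the 3% and 7% rates diverge.

# Sketch: why a higher discount rate lowers the present value of benefits that
# arrive later than costs. The benefit amount and delay are hypothetical.

def present_value(amount, rate, years):
    """Discount a single future amount back to today: PV = amount / (1 + rate)**years."""
    return amount / (1 + rate) ** years

future_benefit = 10_000
years_until_realized = 25

for rate in (0.03, 0.07):
    pv = present_value(future_benefit, rate, years_until_realized)
    print(f"PV of ${future_benefit:,} in {years_until_realized} years at {rate:.0%}: ${pv:,.0f}")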

7. Compute the Net Present Value of the Program
At a 3% discount rate, the program yielded net benefits, in 2000 dollars, of $49,190 for participants (i.e., the
average per participant), $180,455 for the general public, and (when these two figures are added together)
$229,645 for society over the lifetime of each participant. At a 7% discount rate, the program yielded net benefits
in 2000 dollars of $17,370 for participants, $67,029 for the general public, and (added together) $84,400 for
society over each lifetime.

Benefit–cost ratios were also calculated. These ratios are helpful in gauging how large benefits are in relation to
costs. At a 3% discount rate, the program yielded $12.90 in benefits per dollar invested to the general public and
$16.14 to society. At a 7% discount rate, the program yielded $5.67 in benefits per dollar invested to the general
public and $6.87 to society. As noted earlier, a number of potential benefits were not counted, and these results
are therefore conservative estimates.
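
The calculations behind net present value and benefit–cost ratios can be sketched as follows. The cost and benefit streams below are hypothetical (they are not the Perry Preschool figures); the point is only that NPV is the present value of benefits minus the present value of costs, and that the benefit–cost ratio divides the same two quantities. In practice, separate streams would be kept for participants and the general public so that the distributional breakdown reported above could be reproduced.

# Sketch of the step-7 calculations with hypothetical cost and benefit streams:
# NPV = PV(benefits) - PV(costs); benefit-cost ratio = PV(benefits) / PV(costs).

def pv_of_stream(stream, rate):
    """Present value of a stream of amounts, where stream[t] occurs t years from now."""
    return sum(amount / (1 + rate) ** t for t, amount in enumerate(stream))

# Hypothetical per-participant figures: two years of program costs up front,
# then a long tail of annual benefits.
costs = [8_000, 8_000]
benefits = [0, 0] + [3_000] * 35

for rate in (0.03, 0.07):
    pv_benefits = pv_of_stream(benefits, rate)
    pv_costs = pv_of_stream(costs, rate)
    npv = pv_benefits - pv_costs
    bc_ratio = pv_benefits / pv_costs
    print(f"rate={rate:.0%}  NPV=${npv:,.0f}  benefit-cost ratio={bc_ratio:.2f}")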

8. Perform Sensitivity and Distributional Analysis
The Belfield et al. (2006) study divides costs and benefits between program participants and the general public.
Costs and benefits for the two groups are added to arrive at costs and benefits to society. This type of
distributional analysis is the most basic, reflecting costs and benefits to the primary beneficiaries of a public
investment and costs and benefits to members of general society (taxpayers) who pay for the investment.
Typically, more complex distributional analysis would be undertaken for large investments that affect various
groups differently. Distributional analysis on the basis of income is common. Distributional analysis could also
consider regions, rural versus urban residents, and various demographic categories. The policy context usually
dictates the nature and extent of distributional analysis. Distributional analysis is common in CBAs but not in
CEAs or CUAs.

Sensitivity analysis should be undertaken for any type of economic evaluation. It entails recalculating results under more and less conservative assumptions, thereby providing a range of net benefit estimates rather than a single point estimate. To perform sensitivity analysis, Belfield et al. (2006) "recalculated
earnings, tax impacts, crime, and welfare receipts” (p. 182) for both the 3% and the 7% discount rates using
alternative data sources. Expressed in 2000 dollars, net benefits for program participants ranged between $13,943
and $49,190; net benefits for the general public ranged between $32,136 and $237,468; and net benefits for
society ranged between $81,321 and $486,247.
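
A sensitivity analysis can be sketched as a simple loop over alternative assumptions; the discount rates below echo those used in the study, while the benefit estimates and program cost are hypothetical.

# Sketch of a simple sensitivity analysis: recompute net benefits under
# alternative assumptions and report the resulting range rather than a single
# point estimate. All input values are hypothetical.
from itertools import product

def npv(annual_benefit, years, upfront_cost, rate):
    pv_benefits = sum(annual_benefit / (1 + rate) ** t for t in range(1, years + 1))
    return pv_benefits - upfront_cost

discount_rates = [0.03, 0.07]
benefit_estimates = [2_000, 3_500]   # conservative vs. optimistic annual benefit
upfront_cost = 15_000
years = 35

results = [npv(b, years, upfront_cost, r) for b, r in product(benefit_estimates, discount_rates)]
print(f"Net benefit range: ${min(results):,.0f} to ${max(results):,.0f}")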

9. Make a Recommendation
Belfield et al. (2006) discuss the external validity and generalizability of the Perry Preschool Experiment results.
While the results provide strong evidence for investment in enriched preschool programming for at-risk children,
they ask, “(1) Would the same returns be anticipated from a similar investment under current economic
conditions? and (2) Would the same returns be anticipated for groups other than children from low-income [families] at risk of school failure?" (p. 184). They argue, and present evidence to support, that preschool enrichment programs remain relevant under current economic conditions. They also suggest that while the program, as it was designed and delivered, would likely be a worthwhile large-scale public investment if directed at children at high risk of dropping out of school, the same cannot be said of the value of this program, or of alternative programs, for children from more advantaged backgrounds.

Recently, Heckman et al. (2010) re-examined the cost–benefit calculations for the HighScope Perry Preschool Program, citing seven improvements over two key earlier studies (Belfield et al., 2006; Rolnick & Grunewald, 2003) via "an extensive analysis of sensitivity to alternative plausible assumptions" (p. 114). One improvement, for example, was to use "local data on costs of education, crime, and welfare participation whenever possible, instead of following earlier studies in using national data to estimate these components of the rate of return" (p. 115). Among the differences, their calculations produced a significantly lower estimate of the crime reduction attributable to the program and revised earnings estimates; nevertheless, overall, "the estimated annual rates of return are above the historical return to equity of about 5.8% but below previous estimates reported in the literature" (p. 127). In the end, the several CBAs of the HighScope Perry Preschool Program have all found overall benefits that exceed costs, supporting later early childhood education investments in other jurisdictions (HighScope, 2018).

Strengths and Limitations of Economic Evaluation
CEA, CUA, and CBA have a strong appeal for evaluators and decision makers who desire a numerical, monetary-
focused conclusion to an evaluation. In addition, CBAs, when well conducted, offer a relatively complete
consideration of intended as well as unintended costs and benefits. All three methods of economic evaluation
compare resources with outcomes, either monetizing them (CBA) or constructing cost–effectiveness ratios (CEA
and CUA).

Historically, one “promise” of program evaluations was to be able to offer decision makers information that would
support resource allocation and reallocation decisions (Mueller-Clemm & Barnes, 1997). Economic evaluations,
particularly ones that explicitly compare program or project alternatives, can offer decision makers “bottom-line”
information that suggests the most cost-conscious choice.

There is considerable evidence from fields like health care that economic evaluations have become an essential part
of the process whereby programs, policies, and technologies are assessed. Growing costs, growing demands for
services, and resource scarcities collectively support analytical techniques that promise ways of helping determine
the best use of new or even existing funds.

Strengths of Economic Evaluation
Economic evaluation works best in situations where the intended logic of a program or a project is simple, where the effect of interventions has been demonstrated and quantified using high-quality RCTs generalizable to the population of interest, or where a sufficient number of high-quality, comparable quasi-experimental studies yield consistent, defensible estimates of positive intervention effects. In such cases, if resources are invested, we can be reasonably sure that the intended outcomes will
materialize and that we can provide reliable estimates of these outcomes. If we can forecast the benefits and costs
of a program or project, given a projected investment of resources, then we are in a position to conduct ex ante
analyses—offering decision makers information that indicates whether a project or a program has an NPV greater
than zero, for example. Ex ante analyses can be conducted at the policy or program planning and design stage of
the performance management cycle, strengthening the process of translating strategic objectives into well-
considered, implemented programs.

Methods of economic evaluation demand that analysts and other stakeholders identify the assumptions that
underlie an evaluation. Because CEA, CUA, and CBA all focus on comparing inputs and outcomes (monetized or
not), it is necessary for analysts to wrestle with issues like standing (from whose perspective is the evaluation being
done?), what to include as costs and benefits, how to measure the costs and benefits, how to discount to present
values, and how to rank program or project alternatives. Because details of the inputs and outcomes are
determined, later economic analyses can build on common elements in earlier similar efforts. Competent
consumers of an economic evaluation can discern what assumptions are explicit or implicit in the analysis—this
makes it possible to identify possible biases or ways in which values have been introduced into the process.

It is important to keep in mind that the execution of economic evaluations depends on the exercise of professional
judgment. A competent economic evaluator will rely on his or her experience to navigate the steps in the process.
As we will explain in Chapter 12, every evaluation entails professional judgment—its extent and nature will vary
from one evaluation to the next, but professional judgment is woven into the fabric of each evaluation.

Limitations of Economic Evaluation
The validity of the conclusions reached in an economic evaluation depends on the quality and completeness of the
data and the accuracy of the assumptions that undergird the analysis. For programs where key costs and outcomes
cannot be quantified or monetized, the validity of any ratios comparing costs with benefits is weakened. For
many social programs where program logics are complex, forecasting outcomes in advance of implementing the
program introduces considerable uncertainty, which reduces the value of ex ante economic evaluations. Moreover,
the methods used for CEA (including CUA) and CBA, such as choices of who has “standing” in the study, can
give rise to various ethical challenges.

Even ex post evaluations may rely on methodologies that introduce a substantial amount of uncertainty in the
findings. For example, a CBA of a mobile radio system that was piloted in the Vancouver, British Columbia,
police department (McRae & McDavid, 1988) relied on questionnaires administered to police officers to assess
the likelihood that having an in-car computer terminal (with information access to a Canada-wide arrest warrant
database) was instrumental in making an arrest. Over a period of 3 months, a total of 1,200 questionnaires were
administered for the arrests logged in that period. After identifying which arrests the officers attributed to the
mobile radio system, estimates of the incremental effects of the system were calculated.

Because one of the frequent arrest types was for outstanding parking tickets in the city, the city realized a
substantial benefit from the installation of the in-car terminals—officers could query vehicle license plates and find
out whether there were outstanding warrants associated with the owner of that vehicle. This benefit was monetized
and forecasted for the duration of the project and became a key part of the overall net present benefit calculation
supporting the expansion of the system to the rest of the fleet. Clearly, relying on officer assessments of
incrementality in the short term introduces the possibility of bias in estimating longer-term benefits.

Questionable assumptions can be made in any economic evaluation. In some cases, manipulation of key variables
affects the entire economic evaluation, biasing it so that the results are not credible. For instance, discount rates
may be manipulated to support one’s favored conclusion. Fuguitt and Wilcox (1999) offer an example that
illustrates what can happen when discount rates are manipulated to achieve a political objective:

President Nixon ordered the use of a relatively high discount rate for analyses of most federal projects;
thus, some relatively efficient projects were determined to be inefficient. Many authors relate this
requirement to Nixon’s promise to reduce public expenditures.…Moreover, in response to Western
states’ interests, federal water projects were excluded from this stipulation; analyses for these projects
were allowed to use a specified low discount rate.…By instituting different discount rates for two
separate categories of federal projects, Nixon effectively shaped all federal cost–benefit analyses produced
during his administration. (p. 20)

As discussed earlier, the choice of discount rate is contestable, and time-declining discount rates have recently been
proposed to value projects that have benefits extending far into the future. Cost–benefit analyses of policies or
programs that have environmental consequences have distributional and ethical concerns that are receiving
heightened criticism and scrutiny. As Atkinson and Mourato (2008) noted even a decade ago,

A broader array of evolving issues has come to the fore in extending economic appraisal to
contemporary environmental policy challenges, perhaps most notably climate change and biodiversity
loss. Some of these issues can be summarized as stemming from distributional concerns with regards to
how human well-being and wealth are distributed across generations as well as within generations:
respectively inter- and intra-generational equity. Other insights have emerged in response to reflections
on whether the extent and nature of the uncertainty and the irreversibility that characterize certain
environmental problems might require that decision making needs to be weighted more heavily in favor of precaution. (p. 318)

Pinkerton, Johnson-Masotti, Derse, and Layde (2002) and Pearce et al. (2006) discuss ethical and equity concerns
in economic evaluations. Pinkerton et al. (2002) discuss how using different methods for ranking medical
interventions can favor one group over another. For instance, the use of cost-per-QALY ratios will typically favor
women over men and the young over the old because women live longer than men and the young have more years
of life to live than the old. On the other hand, the use of economic productivity of those who are treated, as a
measure of benefits, will favor high-income over low-income groups. Pearce et al. (2006) discuss equity issues in
CBA, identifying three major concerns. First, when standing is limited to national or regional boundaries, the
costs and benefits that affect individuals outside these boundaries are ignored. This is of particular concern when
significant externalities or program effects cross the boundaries established in a CBA. Examples include GHGs,
acid rain, and the extraction and pollution of water sources that cross jurisdictional boundaries (Atkinson &
Mourato, 2008; Austen, 2015; International Joint Commission, 1988).

Second, discount rates that are set too high or too low raise intergenerational equity issues. This is particularly
relevant in environmental applications, where discount rates that are set too high threaten the sustainability of the
ecosystem for future generations (Atkinson & Mourato, 2008). Finally, because CBA is based on willingness-to-
pay, the preferences of the wealthy have more weight because they can afford to pay more. In traditional CBA, the
preferences of each person with standing are given equal weight (costs and benefits of different individuals are
simply added), yet some projects may have undesirable distributional consequences. To address this issue, CBAs
can weight the preferences of disfavored groups more heavily. Alternatively, policymakers who implement projects
with unfavorable distributional consequences can design complementary transfer schemes to compensate losers or
adversely affected disadvantaged groups. Distributional analysis in CBA provides policymakers with the
information needed to assess and address equity issues that arise from the implementation of a project or policy.

Reviews of economic evaluations suggest that many studies lack methodological quality (Gaultney et al., 2011;
Masucci et al., 2017; Polinder et al., 2012). Gaultney et al. (2011) conducted a review of 18 published economic
studies on multiple myeloma, concluding that “the quality of the methodology applied and its documentation can
be improved in many aspects” (p. 1458). Similarly, Polinder et al. (2012), who conducted a review of 48 studies
on injury prevention, concluded that “approaches to economic evaluation of injury prevention vary widely and
most studies do not fulfill methodological rigour” (p. 211). Identified weaknesses included the following:

The perspective of the analysis, the time horizon, data collection methods, or assumptions used in
developing models are often not clearly defined…many studies did not adopt the societal perspective…
in several studies not all relevant costs were included…[and] more than half of the studies failed to
discount benefits or costs that arise in different years. (p. 218)

Gaultney et al. (2011) found that studies often had an “inadequate description of the analysis, particularly for
costs,” while

Few studies incorporated standard methods expected in high-quality economic evaluations of healthcare, such as assessment of uncertainty in the estimates and discounting…[and m]any studies
relied on effectiveness estimates from non-experimental studies which were often based on different
patient populations. (p. 1465)

In addition, discussions of generalizability were uncommon (Gaultney et al., 2011).

The foregoing discussion suggests that conducting competent economic evaluations is challenging. But the public
and nonprofit sectors are moving to embrace cost-effectiveness and other criteria that imply comparisons of
program resources with results. This trend is part of the broader New Public Management movement, which continues to influence administrative reforms and emphasizes the importance of
accountability for results (Jakobsen, Baekgaard, Moynihan, & van Loon, 2017). Demonstrating value for money
is a broad expectation for governments now. Appropriate uses of CBA, CEA, and CUA can support policy and
program decisions at different points in the performance management cycle. Like all analytical approaches, they
are not suited to every evaluation purpose. Sound professional judgment is an important part of choosing whether
to use an economic evaluation to assess a program.

Summary
Economic evaluation is an approach to program and policy evaluation that relies on the principle that choices among programs and
policies need to take into account their benefits and costs from a societal point of view. The societal point of view means that costs and
benefits are not limited to those faced by an agency but include all costs and benefits faced by the residents of the relevant jurisdiction. It
also means that costs and benefits are not restricted to marketed costs and benefits but also include values that are not priced in the
market. The three main methods of conducting economic evaluations are cost–benefit analysis (CBA), cost–effectiveness analysis (CEA),
and cost–utility analysis (CUA).

The three methods differ in the ways costs and outcomes of programs or policies are treated. CEA compares the economic or opportunity
costs of alternative interventions against a quantitative measure of the key intended outcome, typically a count in natural units (e.g., cases prevented or life-years gained).

CUA compares the economic or opportunity costs of alternative interventions against a quantitative measure of the utility expected from
their implementation. For example, health interventions are often compared by calculating the ratio of incremental cost to incremental
quality-adjusted life-years (QALYs). To qualify as a measure of utility, the outcome used in the CUA ratio must be constructed in such a
way that it is a plausible representation of social preferences.
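
As an illustration of the cost–utility logic, the sketch below computes an incremental cost per QALY for two hypothetical interventions; both the costs and the QALY gains are invented for the example.

# Sketch of a cost-utility comparison: the incremental cost-effectiveness ratio
# (ICER) is the extra cost per additional quality-adjusted life-year (QALY).
# The two hypothetical interventions below are for illustration only.

interventions = {
    # name           (cost per patient, QALYs gained per patient)
    "usual care":   (4_000, 6.0),
    "new program":  (10_000, 6.8),
}

cost_old, qalys_old = interventions["usual care"]
cost_new, qalys_new = interventions["new program"]

icer = (cost_new - cost_old) / (qalys_new - qalys_old)
print(f"Incremental cost per QALY gained: ${icer:,.0f}")   # -> $7,500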

CBA compares the costs and benefits of program or policy alternatives, where benefits and costs have been monetized. CBA can compare
interventions but can also answer the question of whether a single intervention or investment is worth undertaking from a social
perspective. The answer is yes if the net present value (NPV) of the intervention exceeds zero, which means that the opportunity costs of
diverting scarce resources from the economy into this intervention are less than the expected benefits from the intervention.

The challenges in estimating costs and benefits vary with the types of programs or policies that are being compared. The biggest challenge
is in estimating intangible costs and benefits; they typically cannot be estimated directly and often require the use of complex methods to
estimate social preferences (willingness-to-pay [WTP] and willingness-to-accept [WTA]). Most methods used to estimate WTP and WTA
have methodological shortcomings that are discussed in the literature. There have been numerous methodological advances made in the
valuation of nonmarket costs and benefits over the past two decades, especially in the field of environmental economics.

All three approaches to economic evaluation depend on being able to specify the actual outcome component(s) for the program or policy
alternatives that are being compared. When programs have a high likelihood of achieving their intended outcomes, are uniquely
responsible for those outcomes, have limited or no unintended consequences, and the associated costs and benefits can be estimated using
market values, estimating future costs and benefits is relatively straightforward. But when the desired outcomes may be the result of other
factors besides the program, interventions have unintended consequences, or market values are not available for some of the costs or
benefits, estimating the costs and the actual benefits is challenging.

Economic evaluation is growing in importance as governments are increasingly expected to demonstrate value for money for their
expenditure choices. Evaluators wishing to become involved in economic evaluations should consult the recent literature on the
methodology related to the policy area of interest, and specific methodological approaches. Checklists similar to the steps we have
outlined in this chapter have been developed to assist researchers, reviewers, and evaluators in evaluating the quality of economic
evaluations. Existing systematic reviews of economic evaluations suggest that in general, there is a lot of room for improvement when
evaluations are compared to criteria like the nine steps we have outlined in this chapter.

Discussion Questions
1. What are the principal differences between cost–effectiveness analysis (CEA), cost–utility analysis (CUA), and cost–benefit
analysis (CBA)? Give an example of each approach.
2. Why are research design and the attribution of outcomes important to CBA, CEA, and CUA?
3. Value for money evaluations can be conducted from an economics or an auditing standpoint. What are the differences between
those two approaches?
4. What are opportunity costs? How do they differ from accounting or budgeted costs?
5. Why are future costs and benefits discounted?
6. What is the difference between nominal prices and real prices?
7. Give an example of non-marketed costs and non-marketed benefits.
8. What methods are used to put a value on non-marketed costs and benefits?
9. Does a CUA using quality-adjusted life-years (QALY) as the outcome yield the same results as a CBA using willingness-to-pay
(WTP) for longevity and health quality?
10. Review the British Medical Journal and the Consensus on Health Economic Criteria checklists reproduced in the Cochrane
Handbook for Systematic Reviews of Interventions (www.cochrane-handbook.org), and use one or both to evaluate the quality of an
economic evaluation.
11. The fire chief in a coastal city that surrounds a harbor has long expressed his concerns to the city manager that the harbor is not
adequately protected in case of a boat fire or a fire in a structure on the harbor. Marine fire protection is currently provided on a
contract basis with private tugboats at an annual cost of $12,000. For the past 7 years, the company has not been able to man its
fireboat between the hours of midnight and 7 a.m. This represents a very serious deficiency in the city’s fire defense plan in that
most of the serious fires occur during this time frame. In a memorandum to the city manager, the chief offers two options:

Option 1: Maintain the present level of service with the company, which provides a manned fireboat 17 hours a day and recalls
off-duty personnel for 7 hours a day, with a response time of approximately 60 to 90 minutes.

Twenty-year total cost with annual increases of 5% = $397,000.

Option 2: Have the fire department operate a city-owned fireboat, which could also be used for marine rescue, code enforcement,
and monitoring and containment of water pollution on a 24-hour basis.

CAPITAL COST = $175,000

MOORING FACILITY = $10,000

TOTAL MAINTENANCE OVER 20 YEARS = $50,000

TOTAL COST = $235,000

It is recommended that Option 2 be adopted for the following reasons:

The fireboat would be available for prompt response to fire and life rescue assignments on a 24-hour basis.

The cost saving projection over the 20-year period would be $162,000.

Do you agree with the fire chief’s recommendation? Why, or why not? Be specific in your response to this case. (A brief calculation sketch for checking these figures appears after the discussion questions.)
12. In a recent decision, the City Council in Victoria, British Columbia, approved a 10-year property tax holiday for a local developer who is renovating a downtown building and converting it into condominiums. The estimated cost to the city of this tax holiday is about $6 million over 10 years. In justifying the decision, the mayor said that because the developer has spent about $6 million on the building to earthquake-proof it, giving the developer an equivalent tax break was fair. Do you agree? Why?
13. Looking at the same case as outlined in Question 12, is it appropriate to compare the value of the $6 million in earthquake-related renovations to the $6 million in property taxes forgone over 10 years by the city? Are those two amounts equivalent? Why?
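
For Question 11, the following sketch shows one way to check the fire chief's arithmetic. The dollar figures come from the case itself; the calculation simply reproduces the undiscounted 20-year totals and notes issues that a fuller analysis, using the concepts from this chapter, would need to address.

# One way to check the fire chief's arithmetic in Question 11 (a sketch; the
# option figures come from the case, everything else is straightforward algebra).

contract_cost_year_1 = 12_000
annual_increase = 0.05
years = 20

# Option 1: 20 years of contract payments rising 5% per year (undiscounted,
# as the chief appears to have calculated).
option1_total = sum(contract_cost_year_1 * (1 + annual_increase) ** t for t in range(years))
print(f"Option 1 undiscounted 20-year total: ${option1_total:,.0f}")   # ~ $397,000

# Option 2: capital + mooring facility + 20 years of maintenance, also undiscounted.
option2_total = 175_000 + 10_000 + 50_000
print(f"Option 2 undiscounted 20-year total: ${option2_total:,.0f}")   # $235,000

# Points for discussion: neither total is discounted, the capital outlay occurs
# up front while contract payments are spread over 20 years, and Option 2's
# added services and ongoing operating costs are not valued in the memo.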

References
Ackerman, F. (2009). The Stern review vs. its critics: Which side is less wrong? Retrieved from
http://www.e3network.org/briefs/Ackerman_Stern_Review.pdf

Anderson, L. M., Petticrew, M., Rehfuess, E., Armstrong, R., Ueffing, E., Baker, P.,… Tugwell, P. (2011). Using
logic models to capture complexity in systematic reviews. Research Synthesis Methods, 2(1), 33–42.

Anderson, L. M., Shinn, C., Fullilove, M. T., Scrimshaw, S. C., Fielding, J. E., Normand, J., & Carande-Kulis,
V. G. (2003). The effectiveness of early childhood development programs: A systematic review. American
Journal of Preventive Medicine, 24(3), 32–46.

Anderson, R. (2010). Systematic reviews of economic evaluations: Utility or futility? Health Economics, 19(3),
350–364.

Atkinson, G., & Mourato S. (2008). Environmental cost–benefit analysis. Annual Review of Environment and
Resources, 33, 317–344.

Atkinson, G., & Mourato, S. (2015). Cost–benefit analysis and the environment. OECD Environment Working
Papers, No. 97. Paris, France: OECD.

Auditor General of British Columbia & Deputy Ministers’ Council. (1996). Enhancing accountability for
performance: A framework and an implementation plan—Second joint report. Victoria, British Columbia, Canada:
Queen’s Printer for British Columbia.

Austen, D. (2015). Comments on the development of a metal mining district in the headwaters of the Stikine, Taku,
and Unuk rivers, with examples from the KSM (Kerr-Sulphurets-Mitchell) Proposed Mine Environmental
Assessment. Bethesda, MD: American Fisheries Society. 1–16.

Belfield, C. R., Nores, M., Barnett, S., & Schweinhart, L. (2006). The High/Scope Perry Preschool Program:
Cost–benefit analysis using data from the age-40 follow up. Journal of Human Resources, 41(1), 162–190.

Bewley, B., George, A., Rienzo, C., & Portes, J. (2016). The impact of the Troubled Families Programme:
Findings from the analysis of national administrative data. London, UK: Department for Communities and
Local Government.

Blades, R., Day, L., & Erskine, C. (2016). National evaluation of the Troubled Families Programme: Families’
experiences and outcomes. London, UK: Department for Communities and Local Government.

Boardman, A., Greenberg, D., Vining, A., & Weimer, D. (2018). Cost–benefit analysis: Concepts and practice
(4th ed.). Cambridge, UK: Cambridge University Press.

Briggs, A., & Sculpher, M. (1998). An introduction to Markov modelling for economic evaluation.
Pharmacoeconomics, 13(4), 397–409.

Cai, B., Cameron, T. A., & Gerdes, G. R. (2010). Distributional preferences and the incidence of costs and
benefits in climate change policy. Environmental and Resource Economics, 46(4), 429–458.

Canadian Comprehensive Auditing Foundation. (1985). Comprehensive auditing in Canada: The provincial
legislative audit perspective. Ottawa, Ontario, Canada: Author.

Clyne, G., & Edwards, R. (2002). Understanding economic evaluations: A guide for health and human services.
Canadian Journal of Program Evaluation, 17(3), 1–23.

Coalition for Evidence-Based Policy. (n.d.). Social programs that work: Perry Preschool Project. Retrieved from
http://evidencebasedprograms.org/wordpress/?page_id=65

Day, L., Bryson, C., & White, C. (2016). National evaluation of the Troubled Families Programme: Final synthesis
report. London, UK: Department for Communities and Local Government.

Department for Communities and Local Government. (2016). The First Troubled Families Programme 2012 to
2015: An overview. London, UK: Department for Communities and Local Government.

Derman-Sparks, L. (2016). What I learned from the Ypsilanti Perry Preschool Project: A teacher’s reflections.
Journal of Pedagogy, 7(1), 93–106.

Dhiri, S., & Brand, S. (1999). Analysis of costs and benefits: Guidance for evaluators. London, England: Research,
Development and Statistics Directorate, Home Office.

Drummond, M. F., & Jefferson, T. O. (1996). Guidelines for authors and peer reviewers of economic
submissions to the BMJ. British Medical Journal, 313, 275–283.

Drummond, M. F., Sculpher, M. J., Claxton, K., Stoddart, G. L., & Torrance, G. W. (2015). Methods for the
economic evaluation of health care programmes. (4th ed.). Oxford, UK: Oxford University Press.

Edejer, T. T., Baltussen, R., Adam, T., Hutubessy, R., Acharya, A., Evans, D. B., & Murray, C. J. L. (Eds.).
(2003). Making choices in health: WHO guide to cost-effectiveness analysis. Geneva, Switzerland: World Health
Organization. Retrieved from http://www.who.int/choice/book/en

Englund, M., White, B., Reynolds, A., Schweinhart, L., & Campbell, F. (2014). Health outcomes of the
Abecedarian, Child-Parent, and HighScope Perry Preschool programs. In A. J. Reynolds, A. J. Rolnick, & J. A.
Temple (Eds.), Health and education in early childhood: Predictors, interventions, and policies. Cambridge, UK:
Cambridge University Press.

Evers, S., Goossens, M., de Vet, H., van Tulder, M., & Ament, A. (2005). Criteria list for assessment of
methodological quality of economic evaluations: Consensus on Health Economic Criteria. International Journal
of Technology Assessment in Health Care, 21(2), 240–245.

Fuguitt, D., & Wilcox, S. J. (1999). Cost–benefit analysis for public sector decision makers. Westport, CT: Quorum.

Fujiwara, D., & Campbell, R. (2011). Valuation techniques for social cost–benefit analysis: Stated preference, revealed
preference and subjective well-being approaches: A discussion of the current issues. London, UK: HM Treasury.

Garber, A. M., & Phelps, C. E. (1997). Economic foundations of cost-effectiveness analysis. Journal of Health
Economics, 16(1), 1–31.

Gaultney, J. G., Redekop, W. K., Sonneveld, P., & Uyl-de Groot, C. (2011). Critical review of economic
evaluations in multiple myeloma: An overview of the economic evidence and quality of the methodology.
European Journal of Cancer, 47, 1458–1467.

Gilliam, W. S., & Zigler, E. F. (2000). A critical meta-analysis of all evaluations of state-funded preschool from
1977 to 1998: Implications for policy, service delivery and program evaluation. Early Childhood Research
Quarterly, 15(4), 441–473.

Gramlich, E. (1990). A guide to cost–benefit analysis. Englewood Cliffs, NJ: Prentice Hall.

Hahn, R. W., & Dudley, P. (2004). How well does the government do cost–benefit analysis? (Working Paper No.
04–01). Washington, DC: American Enterprise Institute, Brookings Center for Regulatory Studies.

Hammitt, J. K. (2002). QALYs versus WTP. Risk Analysis, 22(5), 985–1001.

Hanley, N., & Barbier, E. B. (2009). Pricing nature: Cost–benefit analysis and environmental policy. Northampton,
MA: Edward Elgar.

Hattie, J. (2008). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. New York, NY:
Routledge.

Heckman, J. J., Moon, S. H., Pinto, R., Savelyev, P. A., & Yavitz, A. Q. (2010). The rate of return to the
HighScope Perry Preschool Program. Journal of Public Economics, 94(1–2), 114–128.

Higgins, J. P. T., & Green, S. (Eds.). (2011). Cochrane handbook for systematic reviews of interventions (Version
5.1.0). London, England: Cochrane Collaboration. Retrieved from www.handbook.cochrane.org

HighScope. (2018). Projects. Retrieved from https://highscope.org/research/projects

HM Treasury. (2018). The Green Book: Central government guidance on appraisal and evaluation. London,
England: TSO. Retrieved from https://www.gov.uk/government/publications/the-green-book-appraisal-and-
evaluation-in-central-governent

Huenemann, R. (1989). A persistent error in cost–benefit analysis: The case of the Three Gorges Dam in China.
Energy Systems and Policy, 13(2), 157–168.

International Joint Commission. (1988). Impacts of a Proposed Coal Mine in the Flathead River Basin.
International Joint Commission (IJC) Digital Archive. Retrieved from https://scholar.uwindsor.ca/ijcarchive/369

Jakobsen, M. L., Baekgaard, M., Moynihan, D. P., & van Loon, N. (2017). Making sense of performance
regimes: Rebalancing external accountability and internal learning. Journal of Public Administration Research and
Theory, 1–15.

Jefferson, T., Demicheli, V., & Vale, L. (2002). Quality of systematic reviews of economic evaluations in health
care. Journal of the American Medical Association, 287(21), 2809–2812.

Kernick, D. (2002). Measuring the outcomes of a healthcare intervention. In D. Kernick (Ed.), Getting health
economics into practice (pp. 101–115). Abingdon, UK: Radcliffe Medical Press.

Levin, H. M., & McEwan, P. J. (Eds.). (2001). Cost-effectiveness analysis: Methods and applications (2nd ed.).
Thousand Oaks, CA: Sage.

Lewin, S., Hendry, M., Chandler, J., Oxman, A. D., Michie, S., Shepperd, S.,… Welch, V. (2017). Assessing the
complexity of interventions within systematic reviews: Development, content and use of a new tool
(iCAT_SR). BMC Medical Research Methodology, 17(1), 76.

Litman, T. (2011). Smart congestion relief: Comprehensive analysis of traffic congestion costs and congestion reduction
benefits. Victoria, British Columbia, Canada: Victoria Transport Policy Institute.

Mallender, J., & Tierney, R. (2016). Economic analyses. In D. Weisburd, D. Farrington, & C. Gill (Eds.), What
works in crime prevention and rehabilitation (pp. 291–309). New York, NY: Springer.

Mankiw, N. G. (2015). Principles of economics (7th ed.). Stamford, CT: Cengage Learning.

Mason, G., & Tereraho, M. (2007). Value-for-money analysis of active labour market programs. Canadian
Journal of Program Evaluation, 22(1), 1–29.

Masucci, L., Beca, J., Sabharwal, M., & Hoch, J. (2017). Methodological issues in economic evaluations
submitted to the Pan-Canadian Oncology Drug Review. PharmacoEconomics, I(4), 255–263.

Mathes, T., Walgenbach, M., Antoine, S. L., Pieper, D., & Eikermann, M. (2014). Methods for systematic reviews of health economic evaluations: A systematic review, comparison, and synthesis of method literature. Medical Decision Making, 34(7), 826–840.

McRae, J. J., & McDavid, J. (1988). Computer-based technology in police work: A benefit–cost analysis of a
mobile digital communications system. Journal of Criminal Justice, 16(1), 47–60.

Mills, K. M., Sadler, S., Peterson, K., & Pang, L. (2018). An economic evaluation of preventing falls using a new
exercise program in institutionalized elderly. Journal of Physical Activity and Health, 15(6), 397–402.

Mueller-Clemm, W. J., & Barnes, M. P. (1997). A historical perspective on federal program evaluation in
Canada. Canadian Journal of Program Evaluation, 12(1), 47–70.

Neumann, P. J. (2009). Costing and perspective in published cost-effectiveness analysis. Medical Care, 47(7, Suppl. 1), S28–S32.

Neumann, P. J. (2011). What next for QALYs? Journal of the American Medical Association, 305(17), 1806–1807.

Neumann, P. J., Sanders, G. D., Basu, A., Brock, D. W., Feeny, D., Krahn, M.,… Salomon, J. A. (2017). Cost-
effectiveness in health and medicine. New York, NY: Oxford University Press.

Neumann, P. J., Thorat, T., Shi, J., Saret, C. J., & Cohen, J. T. (2015). The changing face of the cost-utility literature, 1990–2012. Value in Health, 18(2), 270–277.

Newcomer, K. E., Hatry, H. P., & Wholey, J. S. (2015). Handbook of practical program evaluation (4th ed.).
Hoboken, NJ: John Wiley & Sons.

Nielsen, K., & Fiore, M. (2000). Cost–benefit analysis of sustained-release bupropion, nicotine patch or both for
smoking cessation. Preventive Medicine, 30, 209–216.

OECD. (2017). Tackling wasteful spending on health. Paris, France: OECD.

Pearce, D., Atkinson, G., & Mourato, S. (2006). Cost–benefit analysis and the environment: Recent developments.
Paris, France: OECD.

Pinkerton, S. D., Johnson-Masotti, A. P., Derse, A., & Layde, P. M. (2002). Ethical issues in cost-effectiveness
analysis. Evaluation and Program Planning, 25, 71–83.

Poister, T. H. (1978). Public program analysis: Applied research methods. Baltimore, MD: University Park Press.

Polinder, S., Segui-Gomez, M., Toet, H., Belt, E., Sethi, D., Racioppi, F., & van Beeck, E. F. (2012). Systematic
review and quality assessment of economic evaluation studies of injury prevention. Accident Analysis and Prevention, 45, 211–221.

Reynolds, A. J., Ou, S. R., Mondi, C. F., & Hayakawa, M. (2017). Processes of early childhood interventions to
adult well-being. Child Development, 88(2), 378–387.

Reynolds, A. J., & Temple, J. A. (2008). Cost-effective early childhood development programs from preschool to
third grade. Annual Review of Clinical Psychology, 4, 109–139.

Rolnick, A., & Grunewald, R. (2003). Early childhood development: Economic development with a high public
return. The Region, 17(4), 6–12.

Royse, D. D., Thyer, B. A., Padgett, D. K., & Logan, T. K. (2001). Program evaluation: An introduction (3rd ed.).
Belmont, CA: Brooks/Cole.

Ryan, A. M., Tompkins, C. P., Markovitz, A. A., & Burstin, H. R. (2017). Linking spending and quality
indicators to measure value and efficiency in health care. Medical Care Research and Review, 74(4), 452–485.

Sabharwal, S., Carter, A., Darzi, L., Reilly, P., & Gupte, C. (2015). The methodological quality of health
economic evaluations for the management of hip fractures: A systematic review of the literature. The Surgeon,
Journal of the Royal Colleges of Surgeons of Edinburgh and Ireland, 13, 170–176.

Sanders, G. D., Neumann, P. J., Basu, A., Brock, D. W., Feeny, D., Krahn, M.,… Ganiats, T. G. (2016).
Recommendations for conduct, methodological practices and reporting of cost-effectiveness analyses: Second
panel on cost-effectiveness in health and medicine. JAMA, 316(10), 1093–1103.

Schweinhart, L. J., Heckman, J. J., Malofeeva, L., Pinto, R., Moon, S., & Yavitz, A. (2010). The cost-benefit analysis of the Preschool Curriculum Comparison Study. Final Report to the John D. and Catherine T. MacArthur Foundation. Ypsilanti, MI: HighScope. Retrieved from
https://highscope.org/documents/20147/43309/cost-benefit-analysis-preschool.pdf

Shemilt, I., Mugford, M., Vale, L., Marsh, K., & Donaldson, C. (Eds.). (2010). Evidence-based decisions and
economics: Health care, social welfare, education and criminal justice (2nd ed.). Chichester, England: Wiley-
Blackwell.

Smith, J. (2006). Building New Deal liberalism: The political economy of public works, 1933–1956. New York:
Cambridge University Press.

Smith, R. D., & Widiatmoko, D. (1998). The cost-effectiveness of home assessment and modification to reduce
falls in the elderly. Australian and New Zealand Journal of Public Health, 22(4), 436–440.

Stern, N. (2008). The economics of climate change. American Economic Review: Papers & Proceedings, 98(2),
1–37.

Taks, M., Kesenne, S., Chalip, L., Green, B. C., & Martyn, S. (2011). Economic impact analysis versus cost
benefit analysis: The case of a medium-sized sport event. International Journal of Sport Finance, 6, 187–203.

Tordrup, D., Chouaid, C., Cuijpers, P., Dab, W., van Dongen, J. M., Espin, J.,… Miguel, J. P. (2017). Priorities
for health economic methodological research: Results of an expert consultation. International Journal of
Technology Assessment in Health Care, 33(6), 609–619.

Transportation Economics Committee. (n.d.). Transportation benefit–cost analysis. Washington, DC: Transportation Research Board. Retrieved from http://bca.transportationeconomics.org

van Mastrigt, G. A., Hiligsmann, M., Arts, J. J., Broos, P. H., Kleijnen, J., Evers, S. M., & Majoie, M. H. (2016).
How to prepare a systematic review of economic evaluations for informing evidence-based healthcare decisions:
A five-step approach (part 1/3). Pharmacoeconomics & Outcomes Research, 16(6), 689–704.

Weinstein, M. C., Siegel, J. E., Gold, M. R., Kamlet, M. S., & Russell, L. B. (1996). Consensus statement:
Recommendations of the Panel on Cost-Effectiveness in Health and Medicine. Journal of the American Medical Association, 276(15), 1253–1258.

Zhuang, J., Liang, Z., Lin, T., & De Guzman, F. (2007). Theory and practice in the choice of social discount rate for
cost-benefit analysis: A survey (ERD Working Paper No. 94.1). Mandaluyong, Metro Manila, Philippines: Asian
Development Bank.

8 Performance Measurement as an Approach to Evaluation

Introduction
The Current Imperative to Measure Performance
Performance Measurement for Accountability and Performance Improvement
Growth and Evolution of Performance Measurement
Performance Measurement Beginnings in Local Government
Federal Performance Budgeting Reform
The Emergence of New Public Management
Steering, Control, and Performance Improvement
Metaphors that Support and Sustain Performance Measurement
Organizations as Machines
Government as a Business
Organizations as Open Systems
Comparing Program Evaluation and Performance Measurement Systems
Summary
Discussion Questions
References

Introduction
In this chapter, we introduce performance measurement as an approach that complements program evaluation in
assessing the effectiveness of programs and policies and is often expected to fulfill accountability functions. This is
the first of three chapters focused on performance measurement. The next chapter will delve into the design and
implementation of performance measurement systems, followed by a chapter on the use of performance
measurement for accountability and performance improvement. We begin this chapter with an overview of the
current imperative for performance measures, and then briefly discuss the two key performance measurement
purposes. We follow this with a look at the growth and evolution of performance measurement, beginning with its
origins in American cities at the turn of the 20th century. We show how performance measurement was adapted
to several waves of administrative and budgeting reforms since the 1960s and 1970s. We describe the fiscal
environment for governments in the 1970s and 1980s and the burden of deficits, debts, and taxes that prompted
many jurisdictions in the United States to pass laws to limit expenditures and/or tax increases.

With the continuing public expectations for governments to provide services efficiently and effectively, and a
parallel desire for limitations or reductions of taxation, performance measurement emerged as a key part of New
Public Management (NPM) reforms beginning in the 1980s (Hood, 1991). Results-focused performance
measurement became a principal means for demonstrating accountability through publicly reporting performance
results, with the assumption that performance targets and public reporting would provide incentives and pressures
to induce performance improvements (improved efficiency and effectiveness). More recently, some of the
assumptions behind using performance measures to steer or control public-sector management have been shown empirically to have shortcomings and even contradictions (Van Dooren & Hoffman, 2018), which we
will discuss further in Chapters 9 and 10.

Even though NPM is no longer center stage as an administrative reform movement, having partly been overtaken
by approaches that acknowledge the growing array of networks and interdependencies among organizations and
governments (Bourgon, 2011; de Lancer Julnes & Steccolini, 2015; Perrin, 2015), performance measurement is
here to stay (Feller, 2002; Poister, Aristigueta, & Hall, 2015).

In this chapter, we also examine three metaphors that have been used by both analysts and practitioners to
understand performance measurement intuitively. Finally, we turn to comparisons between program evaluation
and performance measurement as approaches to evaluating programs. Our view is that the two approaches are
complementary and can yield complementary lines of evidence in evaluations, yet there are challenges to using
measures or evaluations for multiple purposes (see, for example, Hatry, 2013; Lahey & Nielsen, 2013; Nielsen & Hunter, 2013; Perrin, 2015). Fundamentally, performance measurement describes program results, whereas program evaluation asks why those results occurred.

The Current Imperative to Measure Performance
Measuring the performance of programs, policies, organizations, governments, and the people who work in them
is nearly a universal expectation in the public and nonprofit sectors, particularly in Western countries. In
developing countries, performance measurement is now a part of the expectations by donors seeking
accountability for program and policy results (Davies & Dart, 2005; Gulrajani, 2014; OECD, 2008).

In the past 30 years, there has been phenomenal growth in the attention and resources being devoted to
performance measurement. This has been connected to a shift in expectations about the roles and responsibilities
of public and nonprofit organizations and their managers in particular—a shift that includes the importance of
performance management, monitoring, and the achievement of results. Pressures such as globalization, public debt
burdens, citizen dissatisfaction with public services, limited gains in public service efficiencies, chronic budget
shortages (particularly since the Great Recession in 2008/2009), and advances in information technology have led
many governments to adopt a pattern of public-sector reforms that includes performance measurement systems
(see, e.g., Borins, 1995; Gruening, 2001; Hood, 2000; Jakobsen, Baekgaard, Moynihan, & van Loon, 2017;
Pollitt & Bouckaert, 2011; Shand, 1996).

Increasingly, managers, executives, and their organizations are expected to be accountable for achieving intended
(and stated) outcomes. Traditional emphasis on inputs and processes—following the rules and regulations, and
complying with authoritative directives—is being supplemented and, in some cases, supplanted by an emphasis on
identifying, stating, and achieving objectives; planning and operating in a businesslike manner; and, like
businesses, being accountable for some performance-related “bottom line” (de Lancer Julnes & Steccolini, 2015;
Thomas, 2007; Van Dooren & Hoffman, 2018). For many public and nonprofit managers, evaluative criteria
such as value for money and cost-effectiveness are intended to link resources to outputs and outcomes in order to
produce evidence that is analogous to private-sector measures of success.

Managing for results is part of the broader emphasis on performance management in the public and nonprofit
sectors. Our earlier discussions of the performance management cycle in Chapter 1 pointed out that organizations
are expected to undertake strategic planning that develops goals and objectives. Strategic objectives, in turn, are
expected to drive policy and program development that leads to implementation, then to evaluation, and then
back to the strategic planning phase of the cycle.

Performance measurement, particularly in concert with evaluation, can inform all phases of the performance
management cycle. Most common is measurement to assess the extent to which intended results have been
achieved, from having implemented policies or programs. However, with tools such as improved databases and the
availability of big data, measures are also increasingly important for monitoring throughout the design,
implementation, and assessment phases. Currently, monitoring progress is at the core of most performance
measurement systems in order to facilitate comparisons between what was planned/targeted and what was
accomplished.

Performance Measurement for Accountability and Performance
Improvement
Broadly, there are two purposes that underlie most performance measurement systems: (1) accountability and (2)
improving performance (Behn, 2003; Hatry, 2013; Perrin, 2015). How these two sometimes-contradictory
purposes are balanced and how they evolve over time in different political/organizational cultures influences the
system design and the intended and unintended effects. We will look further at this issue in Chapter 10 when we
consider the impacts of political cultures on the ways that performance results are used.

Osborne and Gaebler (1992) and others have pointed out that performance measurement and reporting are
critical to the expectation that the taxpayers receive value for their tax dollars. This is fundamentally an
accountability-focused view of performance management. Performance measurement for public reporting in this
case is intended to be primarily summative. Usually, performance measurement and public reporting systems that
focus on accountability are top-down initiatives, driven by political decision makers, designed by senior public
officials and advisors, and implemented within government-wide performance frameworks. Public performance
reporting is intended to inform elected officials and the public and, through comparisons between what was
intended and what was accomplished, shine a light on performance shortcomings (McDavid & Huse, 2012).
Elected decision makers are expected to use public performance results to hold public servants to account and put
pressure on bureaucracies to perform better. We will look at the impacts of this approach to designing and
implementing performance measurement systems in Chapters 9 and 10.

The “improving performance” stream of measuring performance does not have to be coupled with accountability
reforms, although that is often expected to occur. Performance measurement systems can be designed and
implemented as bottom-up initiatives by managers or by organizations that intend to use performance
information in their day-to-day work. Although these performance results (usually selected measures) are
sometimes reported publicly, the main reason to measure performance is to provide information that can be used
by managers to see how their programs are tracking and to guide improvements. This performance information,
then, is intended to be used formatively (Hildebrand & McDavid, 2011). We will explore this “low-stakes”
approach to performance measurement and reporting in Chapter 10 and situate it in a range of possible
combinations of prioritizing accountability or prioritizing performance improvement (Jakobsen, Baekgaard,
Moynihan, & van Loon, 2017). In Chapter 10, we will look at whether and how public performance reporting
succeeds in improving organizational performance.

Growth and Evolution of Performance Measurement
Performance measurement is not new. Historically, it has been connected primarily with financial accountability
—being able to summarize and report the ways that resources have been expended in a given period of time.
Traditional public-sector budgets focus on resources for policies and programs—inputs that can be tracked and
accounted for at the end of each reporting period. Over time, the accounting profession expanded its domain as
the need for financial accountability grew: organizations became more complex, and the regulation of financial reporting created a need for expertise in assessing the completeness, honesty, and fairness of organizations’ “books” (Hopwood & Miller, 1994). In addition, because accounting emphasized
a systematic description of the monetary values of resources that were expended in organizations, efforts to
improve efficiency depended, in part, on being able to rely on information about the funds expended for activities.
Accounting provided a framework for calculating the inputs in efforts to estimate efficiency and productivity ratios
(inputs compared to outputs).

Performance Measurement Beginnings in Local Government
While we have tended to situate the beginnings of performance measurement in the United States in the 1960s,
with the development of performance management systems, such as planning, programming, and budgeting
systems (PPBS) and zero-based budgeting (ZBB) (Perrin, 1998; Wildavsky, 1975), performance measurement and
reporting was well developed in some American local governments early in the 20th century. Williams (2003) and
Lee (2006) discuss the development of budget-related performance and productivity measurement in New York
City, beginning as early as 1907 with the creation of the Bureau of Municipal Research. The bureau had a
mandate to gather and report statistical data on the costs, outputs, and some outcomes (e.g., infant mortality rates)
of municipal service delivery activities.

One of the innovations that the bureau instituted was an annual “Budget Exhibit”—a public display in city hall of
the annual performance report that included the following:

Facts and figures graphically displayed, intermingled with physical objects [that] informed the visitor of
the city’s activities—what had been and what was expected to be done with the taxpayer’s money.
(Sands & Lindars, 1912, quoted in Williams, 2003, p. 647)

Because the bureau adapted the municipal accounting system to make it possible to calculate the costs of service
activities, it was possible to produce information on the unit costs of services delivered. By comparing these figures
over time or across administrative units, it was possible to track efficiency. Providing information that could be
used to improve productivity offered a way to make public reporting a part of the political dialogue. Reporting
publicly meant that New York City administrative departments, the mayor, and the council were more
accountable; valid and reliable information about their service performance was available in a form that was
intended to be accessible to the public. In Chapter 10, we will describe how Britain has used high-profile public
reporting of performance results to pressure service providers to improve their efficiency and effectiveness (Bevan
& Hamblin, 2009; Bevan & Hood, 2006).

By 1916, there were comparable bureaus of municipal research in 16 northeastern U.S. cities—each having a
mandate similar to the New York bureau. This movement to measure and report local government performance
was part of a broader movement to reform U.S. urban local governments. Performance information, linked to
making and approving city budgets and delivering services, was aimed at educating members of the public about
their municipal services so that their votes for mayors and council members could be based on knowledge of
previous performance rather than the appeals of urban political party machines (Lee, 2006).

By World War II, 89 local governments in the United States were issuing performance reports in a range of
formats: city workers distributing reports to residents; local academics preparing overall performance reports;
newspapers publishing reports; posters displayed in subways, streetcars, and buses; and scouts delivering the
reports to homes in the community (Lee, 2006). During the 1950s and onward, the push for local government
performance reporting diminished. Although more local governments were tabling annual reports, there did not
appear to be a lot of interest in municipal reporting (there was a lack of demand for the information) and little
improvement in the quality of the information itself (Lee, 2006). The original Progressive Movement (Wiebe,
1962) that had driven the city government performance reporting movement succeeded in securing reforms to
local government structures and processes in the United States, principally by state legislative changes. Thus,
although local government performance measurement and reporting was here to stay, it became part of the
background to emerging interests in state and federal performance measurement.

Federal Performance Budgeting Reform
Because local governments had demonstrated that it was feasible to develop measures of service-related
performance and even attach costs to those services, local government performance measurement became one of
the drivers for efforts to measure performance for public-sector organizations at other levels of government (Hatry,
1974, 1980, 1999, 2002, 2006). Criticisms of bureaucracies that emerged in the United States in the 1960s and
1970s (Downs, 1965; Niskanen, 1971) focused, in part, on the incentives for public officials in organizations
dominated by hierarchies, rules, procedures, and an emphasis on process. For Downs and others, public
bureaucracies were inefficient and ineffective substantially because public officials were rewarded for following
bureaucratic rules rather than achieving value for money; the emphasis on process eclipsed the focus on achieving
policy and program objectives.

In the 1960s, a major reform movement in the United States was to develop and implement Planning, Programming, and
Budgeting Systems (PPBS) (Perrin, 1998). This budgeting approach was developed at the federal level (and in
some states) as a way to link budgeted expenditures to program results. Governments implementing PPBS were
expected to re-conceptualize their administrative departments as clusters of program-related activities. (These did
not have to coincide with existing organizational units or subunits.) Existing line-item budgets also had to be
redistributed among the programs in this new structure. Each program was expected to have identifiable
objectives. Program objectives were a key innovation—organizations were intended to state expected results for
their expenditures. Clusters of programs, conceptualized as open systems that converted resources into results in
environments that could influence how the programs operated, were intended to contribute to organizational
goals, and clusters of these goals were intended to contribute to broader sectoral goals and even government goals.
This emphasis on a goal- or objective-driven vision for governments can be seen as an important step toward later
performance management–related views of how government organizations should function.

A key part of building PPBS was the identification of performance measures for each program, and that, in turn,
depended on specifying objectives that could be translated into measures. The overall intention of these systems
was to be able to specify the costs of programs and then link costs to results so that measures of efficiency and cost-
effectiveness could be obtained and reported. Through a public reporting process in which elected decision makers
were the intended clients for this information, advocates of PPBS believed that the input-focused line-item budgeting
process could be transformed to focus on the linkages between expenditures and results. The main rationale
underlying PPBS was the belief that data on program efficiency and effectiveness would present decision makers
with information that they could use to improve government efficiency and effectiveness. The logic of this
argument had an impact on the ways that American (and later Canadian) jurisdictions conceptualized their
budgeting processes.
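
To make the distinction concrete, the sketch below uses invented figures (the program, costs, outputs, and outcomes are hypothetical) to show the two ratios PPBS designers wanted to report: efficiency as cost per output and cost-effectiveness as cost per outcome:

# Hypothetical program figures illustrating the PPBS-era ratios.
total_cost = 500000.0   # annual expenditure on a job-training program (assumed)
outputs = 2000          # e.g., training sessions delivered
outcomes = 400          # e.g., participants who found employment

efficiency = total_cost / outputs            # cost per unit of output
cost_effectiveness = total_cost / outcomes   # cost per unit of outcome

print(f"Efficiency: ${efficiency:.2f} per training session")
print(f"Cost-effectiveness: ${cost_effectiveness:.2f} per employed participant")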

PPBS encountered, in most jurisdictions, what would eventually be insurmountable implementation problems:
time and resource limitations, difficulties in defining objectives and measuring results, lack of management
information capacity, and lack of accounting system capacity to generate meaningful program-based costs. These
implementation problems resulted in a general abandonment of PPBS by the early 1970s in the United States
(Perrin, 1998). Canada undertook its own version at the federal level until the late 1970s (Savoie, 1990). Lack of
success with PPBS did not result in the abandonment of the basic idea of relating governmental costs to results,
however. Successors to PPBS included zero-based budgeting, which was similar to PPBS in linking costs to results
but insisted that program budgets be built from a zero base rather than the previous year’s base, and management
by objectives, which emphasized the importance of stating clear objectives and making them the focus of
organizational activities (Perrin, 1998). These alternatives did not prove to be any more durable than PPBS, but
the core idea—that it is desirable to focus on results—survived and has since flourished as a central feature of
contemporary performance measurement systems.

The Emergence of New Public Management
A key feature of the fiscal environment in which governments operated in the 1970s and early 1980s was large and
persistent operating deficits. Indeed, one of the appeals of zero-based budgeting was its emphasis on
deconstructing budgets and demanding a rationale for the full amount, not just the incremental increase or
decrease. The scale of government activities at all levels of government had tended to grow during this period,
demanding more resources and, where budgets did not balance, running deficits and debts or increasing taxes. The
combination of deficits and inflation produced an environment in which analysts and public officials were looking
for ways to balance budgets.

In Britain, the election of Margaret Thatcher in 1979 was a turning point. Her government systematically
restructured the way that the public sector operated. It emphasized reductions in expenditures and, hence, the
scope and scale of government activities. It introduced competition into the production and delivery of services,
including local government services, and it generally articulated a view that emphasized the importance of
diminishing the role of government in society and creating incentives and opportunities for the expansion of the
private sector. Public accountability for results was a key feature of the restructuring of government (Hood, 1989;
Pollitt, 1993).

Other countries followed Britain’s example. New Zealand is frequently cited as an exemplar of public-sector
reforms that were aimed at reducing the scale and scope of government and, at the same time, introducing a
broad, top-down performance management regime (Gill, 2011). In the United States, taxpayers were making it
increasingly clear that they were no longer willing to finance the growing demands for resources in the public
sector. Proposition 13 in California in 1978 was a taxpayer initiative that effectively capped local government tax
rates in that state. Similar initiatives spread rapidly across the United States so that during the late 1970s, 23 states
had local or state legislation aimed at limiting government expenditures (Danziger & Ring, 1982).

This taxpayers’ revolt at the state and local levels was accompanied by a different vision of government more
broadly. In the United States, Osborne and Gaebler’s (1992) book Reinventing Government articulated private-
sector principles that they saw exemplified in sustainable government organizations. These principles, paraphrased
below, reflected emerging efforts to reform governments internationally and amounted to a philosophy of
government:

1. Government should steer rather than row, creating room for alternatives to the public-sector delivery of
services.

2. Government should empower citizens to participate in ownership and control of their public services.

3. Competition among service deliverers is beneficial, creating incentives for efficiency and enhancing
accountability.

4. Governments need to be driven by a mission, not by rules.

5. Funding should be tied to measured outcomes rather than inputs, and performance information should be
used to improve results.

6. Governments should meet the needs of customers rather than focusing on interest groups and the needs of
the bureaucracy.

7. Enterprise should be fostered in the public sector, encouraging generation of funds, rather than just
spending.

8. Governments should focus on anticipating and preventing problems and issues rather than remediating
them. (Relatedly, strategic planning is essential to drive the framework for managing performance.)

9. Governments should use a participatory and decentralized management approach, building on teamwork
and encouraging innovation.

10. Governments should use market mechanisms to achieve public purposes.

The vision of government in society reflected in the normative principles set out by Osborne and Gaebler (1992)
was itself heavily influenced by economic theory articulated by public choice theorists. Niskanen (1971) and
Downs (1965) argued that to understand how governments and, in particular, government organizations
function, it is necessary to apply rational economic models to “bureaucratic” behaviors; in effect, private-sector
microeconomic models should be applied to the public sector. They were arguing that the same rational self-
interested model of how people behave in the private sector should be applied in the public sector.

New Public Management (NPM) became an increasingly central feature of the administrative and governance
landscape during the 1990s in the United States, Canada, and most Western democracies, although its
implementation varied (Hood & Peters, 2004; Pollitt & Bouckaert, 2011; Shand, 1996). With its emphasis on
reforming the public sector (Borins, 1995), NPM can be seen, in part, as a response to criticisms of bureaucratic
waste and a lack of responsiveness to political leaders in public organizations. NPM reflects two themes: (1)
accountability for results and (2) giving managers “freedom to manage.” Mandating a focus on outcomes,
measuring performance toward achieving objectives, and publicly reporting on whether intended outcomes were
achieved are all intended to drive public-sector performance improvements.

Linking achieving performance results to rewards for public servants was intended to shift the incentives so that
they were better aligned with intended program and organizational outcomes (Bevan & Hamblin, 2009; Poister,
Aristigueta, & Hall, 2015). In effect, the basic NPM theory is that if managers are given the freedom (and the
performance incentives) to work with resources to improve effectiveness and efficiency and are given performance
measurement–based rewards and sanctions for their efforts, then targets, performance measurement, and public
reporting operate as a package to improve performance at the same time that accountability is improved (Borins,
1995; Osborne & Gaebler, 1992; Poister, Aristigueta, & Hall, 2015).

The focus on managing for results, which had threaded its way through public-sector innovations and reforms
from the early years of the 20th century in local governments, was now being generalized and combined with the
fiscal imperative, based on a broad belief that private-sector business practices needed to be emulated in the public
sector to improve efficiency and effectiveness (based on the assumption that public-sector organizations are
populated by people with similar motives to those in businesses) (Thomas, 2007).

Similar to the principles originally articulated by Osborne and Gaebler (1992), NPM was centered, at least
rhetorically, on a core set of imperatives:

Providing high-quality services that citizens value; increasing the autonomy of public managers,
particularly from central agency controls; measuring and rewarding organizations and individuals on the
basis of whether they meet demanding performance targets; making available the human and
technological resources that managers need to perform well; and, appreciative of the virtues of
competition, maintaining an open-minded attitude about which public purposes should be performed
by the private sector, rather than the public sector. (Borins, 1995, p. 122)

The paradoxes of the NPM approach have become more evident (Hood & Peters, 2004; Perrin, 2015; Steane,
Dufour, & Gates, 2015; Thomas, 2007), NPM has been controversial (Denhardt & Denhardt, 2003; Hood,
1991; Pollitt & Bouckaert, 2011; Savoie, 1995), and, in some respects, it has been supplanted as a theoretical
framework for public-sector reform (Bourgon, 2011, 2017; de Lancer Julnes & Steccolini, 2015; De Vries, 2010;
Dunleavy, Margetts, Bastow, & Tinkler, 2006; Nielsen & Hunter, 2013; Van Dooren & Hoffman, 2018).
However, it continues to play a key role at an operational level in our thinking about the design, implementation,
and assessment of government programs and services (De Vries, 2010; Hatry, 2013; Kroll & Moynihan, 2017).

Steering, Control, and Performance Improvement
The concepts embodied in the NPM approach to government account for much of the current and widespread
emphasis on performance measurement in the public sector. This is primarily an accountability-focused view of
the design and implementation of performance measurement systems.

Performance measurement and reporting in the United States has been legislated in most states (Melkers, 2006)
and in the federal government (Kroll & Moynihan, 2017). Congress passed the Government Performance and
Results Act (GPRA, 1993) to mandate results-based management in U.S. federal departments and agencies.
Passage of that act marked the beginning of a continuing federal emphasis on measuring and reporting
performance results. The GPRA mandated government-wide strategic planning that was intended to drive
departmental and agency objectives and, hence, performance measurement.

In parallel to the GPRA, the Bush administration, in 2002, under the aegis of the Office of Management and
Budget (OMB), introduced a government-wide performance review process called PART (Program Assessment
Rating Tool) that cyclically reviewed and assessed all U.S. federal programs. The OMB analysts would conduct a
review of designated programs, taking into account agency-generated performance results, evaluations that had
been done, and their own summary assessments of program effectiveness. This program review function, although
modified in the U.S. by the Obama administration beginning in 2009, has become a feature of central agency
expenditure management systems in some Western countries (OECD, 2010; Shaw, 2016).

The Obama administration continued the emphasis on the importance of performance results (Kroll &
Moynihan, 2017). The GPRA was replaced by the Government Performance and Results Act Modernization Act
(2010), which included the creation of a performance improvement officer role for each federal agency to drive
internal efforts to measure and report on performance results, oversee the evaluation of programs, and generally
meet the agency requirements to manage performance and report results (U.S. Government Accountability Office,
2011). Performance improvement officers have a mandate to assess performance and to work with both agency
executives and central agencies (principally the OMB) in the government to identify ways to improve efficiency
and effectiveness.

The 2010 act continues to emphasize objective-setting, performance measurement, and reporting. As a result of
the current amendments, there is more emphasis on tailoring performance measures to administrative agency
needs, reflecting a modest decentralization of the performance management system as a whole to the agency level
(Kroll & Moynihan, 2017).

In Canada, although there is no comparable legislation to the GPRA, there is an important central agency role,
articulated as policies and procedures for all Canadian federal departments and agencies (Lahey & Nielsen, 2013;
Treasury Board of Canada Secretariat, 2017a, 2017b). Treasury Board is responsible for expenditure management
for the government and sets policies for both the performance measurement and program evaluation functions.
The Management Accountability Framework “is a key tool of oversight that is used by Treasury Board of Canada
Secretariat (TBS) to help ensure that federal departments and agencies are well managed, accountable and that
resources are allocated to achieve results” (TBS, 2016a, p. 1). The federal government of Canada has a
government-wide program evaluation mandate that requires all programs to be evaluated on a periodic basis.
Performance measurement is expected to support the program evaluation function in departments and agencies,
although the current federal focus on delivering results seems to have elevated outcomes-focused performance
measurement to at least the same status as program evaluation (Treasury Board of Canada Secretariat, 2016b).

Increasingly, the nonprofit sector in the United States and Canada has been expected to emulate the changes that
have occurred in the public sector. Long-practiced NPM precepts have become a part of a general movement
toward contractual relationships between funders and service providers (Boris, De Leon, Roeger, & Nikolova,
2010; Eikenberry & Kluver, 2004; Scott, 2003). Measuring and reporting performance is a key part of this
process.

Metaphors that Support and Sustain Performance Measurement
Measuring the performance of public and nonprofit programs, policies, and organizations can be connected to
metaphors that shape our perceptions and assumptions about what is possible and desirable as we conceptualize
government and nonprofit activities. These metaphors serve as “theories” or models that guide management
change efforts (Doll & Trueit, 2010; Morgan, 2006) and suggest a rationale for pursuing performance
measurement. In this chapter, we summarize three metaphors that have had important influences on how
government organizations and programs are “seen”: (1) as a machine, (2) as a business, and (3) as an open system.
The open system metaphor is connected with the view that programs, organizations, and their environments can
be complex. We discussed this issue in Chapter 2 and will note here that there is growing interest in the
implications of complexity for program evaluation and performance measurement (Bititci et al., 2012; Forss,
Marra, & Schwartz, 2011; Patton, 2011).

These metaphors have general appeal, being rooted in our everyday experiences. Applying them either explicitly or
implicitly makes it possible for evaluators and various stakeholders to better comprehend the meaning of the key
features of designing, implementing, and using performance measurement systems.

Organizations as Machines
This metaphor is rooted in a vision of organizations as instruments designed by people to produce tangible
outputs/results (Morgan, 2006). In American organization theory and practice, one source of this image of
organizations was the scientific management movement that was developed as an approach to improving the
efficiency of industrial production processes. Frederick Taylor (1911) was the key proponent of this approach to
organizing work flows to optimize unit production. A key element of his approach was the use of time and motion
studies to break down how individual workers contributed to a production process and then use analysis to re-
engineer the process to minimize wasted effort and increase efficiency (units of output per unit of input) (Kanigel,
1997). Although Taylorism waned in analytic prominence by the 1920s, this approach, with its emphasis on
quantitative measurement as a means of improving the performance of individuals in the workplace, continues to
be influential (Savino, 2016). Scientific management has contributed to the metaphorical view that organizations
can be understood as machines (Morgan, 2006).

The connection to performance measurement is this: Many performance measurement systems rely on visual
heuristics to offer users an at-a-glance way to tell how an organization is performing (Edwards & Thomas, 2005;
Kitchin & McArdle, 2015). The typical claim is that a suite of performance measures is to an organization what the
instruments on a dashboard are to an automobile or even an aircraft: appropriate performance measures can provide
users with readings that are analogous in functionality to those obtained from the instruments and gauges in cars
or airplanes.
the 20th century, originally as a “dashboard” system of indicators used by managers to monitor the progress of the
business (Epstein & Manzoni, 1998). Similarly, the “balanced scorecard” approach (Kaplan & Norton, 1996)
provides for cascading performance indicators nested within a framework of four interrelated perspectives: (1)
financial, (2) customer, (3) internal business process, and (4) learning and growth. We will discuss the balanced
scorecard approach to organizational performance measurement in Chapter 9.

If a suite of performance measures is a dashboard, then a public organization can be understood as a machine that
consists of complicated but understandable systems and subsystems that are linked and can be monitored in valid
and reliable ways. By measuring analogs to indicators like compass direction (alignment with strategic objectives),
managers and other stakeholders are expected to be able to “fly” or “drive” their organizations successfully. One
implication of relying on machine-like metaphors or dashboards to construct and display performance results is
that complex organizations are often simplified (validly or not)—they are defined as simple or complicated, but
not complex (Gloubermann & Zimmerman, 2002).

Government as a Business
Another metaphor that has come to strongly influence our thinking about government organizations and,
increasingly, nonprofit organizations is the lens of “government as a business.” This metaphor, which has guided
efforts to infuse business practices into governments in North America, the United Kingdom, and Australasia
(Pollitt, 1998), emphasizes the importance of clearly stated objectives, programs that are planned and managed to
achieve those objectives, efficiency (including the positive effects of competition, privatization, contracting out),
and, ultimately, attention to a bottom line that is analogous to bottom-line measures such as profit or market
share in the private sector. Performance measures are a key part of a governance and management philosophy that
emphasizes results and encourages managers to manage for outcomes.

New Public Management (NPM) embraces many of these same principles (Borins, 1995; Hood, 1991; Osborne
& Gaebler, 1992), although business thinking in and for the public sector predates NPM, being traceable, in part,
to tenets of the Progressive Movement in the United States (Buenker, Burnham, & Crunden, 1976). With respect
to state and local governments, the Progressive Movement emerged as a response to the widespread concerns with
political corruption and machine politics in American state and local governments around the turn of the 20th
century (Williams, 2003). Woodrow Wilson’s (1887) article, “The Study of Administration,” exemplified the
efforts of reformers who wanted to introduce political and organizational changes that would eliminate the
perceived ills of U.S. public-sector governance:

Bureaucracy can exist only where the whole service of the state is removed from the common political
life of the people, its chiefs as well as its rank and file. Its motives, its objects, its policy, its standards,
must be bureaucratic. (p. 217)

A key part of this movement was its emphasis on business-like practices for government organizations. Indeed, the
creation of New York’s Bureau of Municipal Research (Williams, 2003) was a part of this transformation of local
government. Performance reporting was intended to provide information to key decision makers (the mayor) that
would result in improved efficiency, as well as inform voters so that they could hold politicians accountable.

Although New Public Management as a reform movement is no longer dominant (Bourgon, 2011, 2017), its
emphasis on performance results, measuring and reporting results for accountability, and being business-like in
how government and nonprofit organizations are managed and changed are here to stay (Jakobsen, Baekgaard,
Moynihan, & van Loon, 2017; Thompson, 2007).

Organizations as Open Systems
In Chapter 2, we introduced the open systems metaphor and its effects on the way we see programs. “Open
systems” has become a dominant way that managers and analysts have come to view programs and organizations
(Morgan, 2006) and has exerted a major influence on the way we think about and structure program evaluations.

The key originating source of the open systems metaphor is the biological metaphor (Von Bertalanffy, 1968).
Gareth Morgan (2006) introduces the biological metaphor by pointing out that it is perhaps the dominant way
that organizations are now seen. Looking at the biological domain, organisms interact with their environments as
open systems and have structures that perform functions that, in turn, contribute to a goal of homeostasis (the
ability to maintain a steady state in relation to fluctuations in the environment).

Biological organisms, to maintain themselves, need to operate within certain parameters. For example, warm-
blooded animals have species-specific ranges of normal body temperature. Although normally self-correcting,
fluctuations above or below the normal range indicate that the organism is “not well,” and if the temperature
deviation is not corrected, permanent damage or even death will result. Medical assessment of vital signs normally
includes measuring body temperature, blood pressure, pulse rate, and respiration rate. Collectively, these are
hypothesized to indicate overall bodily functioning—they are connected to complex systems in the body but are
deemed to be valid indicators of system functioning.

Although we generally do not explicitly assert that organizations are organisms, the biological/open systems
metaphor exerts an important influence on the way we think about measuring performance. In a report that was a
part of the development of the performance measurement system in British Columbia, Canada (Auditor General
of British Columbia & Deputy Ministers’ Council, 1996), the State of Oregon’s exemplary efforts to create state-
wide performance measures and benchmarks are summarized this way:

The State of Oregon has generally been recognized as one of the leading jurisdictions in reporting state-
wide accountability information. It has defined a wide range of benchmarks to use as indicators of the
progress that the state has had in achieving its strategic vision. Just as blood pressure, cholesterol levels
and other such indicators serve as signs of a patient’s health, benchmarks serve as signs of Oregon’s
vision of well-being in terms of family stability, early childhood development, kindergarten to grade 12
student achievement, air and water quality, housing affordability, crime, employment and per capita
income. (p. 70)

It is appealing to compare the process of measuring a person’s health to the process of indicating the “health” (or
well-being) of a government, a public organization, an economy, or even a society. We generally agree that blood
pressure is a valid measure of some aspects of our physical health. We have well-established theories backed by
much evidence that departures from the accepted ratios of diastolic to systolic pressure, as measured by blood
pressure cuffs, result in health problems. Our research and experience with normal and abnormal blood pressures
have established widely accepted benchmarks for this performance measure.

Using this metaphor as a basis for measuring public-sector performance suggests that we also have an accurate
understanding of the cause-and-effect linkages in programs and even whole organizations, such that a performance
measure or a combination of them will indicate “how well” the organization is doing. Finding the right
performance measures, then, would be a powerful shorthand way to monitor and assess organizations.

Comparing Program Evaluation and Performance Measurement Systems
In the first four chapters of this book, we suggested that basic program evaluation tools are also a useful
foundation for performance measurement. Logic models (Chapter 2) can be used to construct models of programs
or organizations and, in so doing, identify key constructs that are included in cause-and-effect relationships that
predict intended outcomes and are candidates for performance measures. Research designs (Chapter 3) focus our
attention on the attribution question and guide analysts and managers in their efforts to interpret and report
performance measurement results. Measurement (Chapter 4) outlines criteria that can guide the process of
translating constructs (e.g., in logic models) into measures for which data can be collected. Together, these three
chapters focus on knowledge and skills that can be adapted by managers and evaluators who are involved in
designing, implementing, or assessing performance measurement systems.

Because core program evaluation knowledge and skills can be adapted to the design and implementation of
performance measurement systems, it is clear that there is substantial overlap between these two evaluation
approaches. However, there are important differences between program evaluation and performance
measurement. This section of the chapter offers comparisons between program evaluation and performance
measurement on a number of criteria. By contrasting performance measurement with program evaluation, it is
possible to offer an extended definition of performance measurement as an approach to evaluation.

We believe that the core knowledge and skills that are integral to becoming a competent program evaluator are
necessary to designing and implementing effective performance measurement systems. If evaluators do not
understand and know how to work with these basic concepts, expectations and indeed the designs of performance
measurement systems will be inappropriate and less likely to be useful, used, and sustainable. Table 8.1 (adapted
from McDavid & Huse, 2006) summarizes how core evaluation skills can be deployed for both program
evaluations and for performance measurement.

Table 8.1 Core Evaluation Skills That Can Be Applied to Both Program Evaluation and Performance Measurement

Core Skills | Applied to Program Evaluation | Applied to Performance Measurement
Logic modeling | Focus on program effectiveness | Building logic models to identify key constructs
Working with research designs | Understanding causality, rival hypotheses | Understanding causality: “what” versus “why” questions
Measuring constructs | Understanding validity and reliability of measures and data sources | Understanding validity and reliability of measures and data sources

Source: McDavid & Huse (2006).

The growth of performance measurement and its implementation in settings where resources are constrained or
even diminished has prompted some managers to question the value of program evaluation, viewing it as something
of a “luxury” (de Lancer Julnes, 2006; Scheirer & Newcomer, 2001). In contrast, others have argued for the
importance of both approaches to evaluation, pointing out that they can be mutually reinforcing (Hatry, 2013;
Kroll & Moynihan, 2017; Lahey & Nielsen, 2013; Newcomer, 2007; Nielsen & Hunter, 2013; Scheirer &
Newcomer, 2001; Treasury Board of Canada Secretariat, 2016b; Wholey, 2001).

In this textbook, the two approaches are presented as complementary evaluation strategies. Both program
evaluation and performance measurement are a part of the performance management cycle that was introduced in

Chapter 1. In that cycle, they are both intended to be a part of the feedback loop that reports, assesses, and
attributes outcomes of policies and programs.

Table 8.2 summarizes key distinctions between program evaluation and performance measurement. Some of these
distinctions have been noted by analysts who discuss the relationships between the two approaches to evaluation
(Hatry, 2013; Kroll & Moynihan, 2017; Lahey & Nielsen, 2013; McDavid & Huse, 2006; Scheirer &
Newcomer, 2001). Each of the comparisons in the table is discussed more fully in the subsections that follow.

Before we further expand on Table 8.2, we should note that implicit in the comparison is the view that program
evaluations are primarily projects that have a beginning and an endpoint, similar to research projects. This view of
evaluation has been challenged by Mayne and Rist (2006) and Mayne (2008), who make the point that although
evaluators will continue to do studies to assess programs or policies, there is an opportunity for evaluators to
become more engaged in organizations, to become a resource for managers and other stakeholders to build
capacity and move the culture of organizations toward embracing evaluation on a day-to-day basis. This is also a
point made by Patton (2011) in discussions of developmental evaluation. Although this perspective is emerging,
the dominant role that evaluators play is still focused on program-specific engagements that address questions and
issues that usually are tailored to particular evaluation projects.

Table 8.2 Comparisons Between Program Evaluation and Performance Measurement

Program Evaluation | Performance Measurement
1. Episodic (usually) | Ongoing
2. Issue-specific | Designed and built with more general issues in mind. Once implemented, performance measurement systems are generally suitable for the broad issues/questions that were anticipated in the design.
3. Measures are usually customized for each program evaluation. | Measures are developed and data are usually gathered through routinized processes for performance measurement.
4. Attribution of observed outcomes is usually a key question. | Attribution is generally assumed.
5. Targeted resources are needed for each program evaluation. | Because it is ongoing, resources are usually a part of the program or organizational infrastructure.
6. Program evaluators are not usually program managers. | Program managers are usually expected to play a key role in developing performance measures and reporting performance information.
7. The intended purposes of a program evaluation are usually negotiated up front. | The uses of the information can evolve over time to reflect changing information needs and priorities.

1. Program evaluations are episodic, whereas performance measurement is
ongoing.
Typically, program evaluations are projects that have a time frame. As was indicated in Chapter 1, a program
evaluation is a project that has a starting point, often driven by particular information needs or by organizational
policies governing periodic evaluations of programs. Developing the terms of reference for a program evaluation
typically marks the beginning of the process, and reporting (and perhaps publishing) the evaluation findings,
conclusions, and recommendations usually marks the endpoint of a program evaluation. In Chapter 1, we describe
the steps in assessing the feasibility of a program evaluation and then doing it.

Performance measurement systems, conversely, are designed and implemented with the intention of providing
regular and continuing monitoring of information for program and organizational purposes. Once implemented,
they usually become part of the information infrastructure in an organization. Current information technologies
make it possible to establish databases and update them routinely. As long as the data are valid, reliable, and
complete, they can be used by managers and other stakeholders to generate periodic reports or regularly refreshed
dashboards.

Where performance measurement results are used for external accountability, periodic reporting is the norm.
Typically, this is an annual event and may involve rolling up performance results that have been gathered by the
organization for its own uses (Hildebrand & McDavid, 2011). But it may also involve constructing and reporting
measures specifically for external reporting. Decoupling of internally used and externally reported performance
information is an issue that we will discuss in Chapter 10 of this textbook (Gill, 2011; McDavid & Huse, 2012).

2. Program evaluations are issue/context specific, whereas performance measurement systems are designed with more general issues in mind.
Program evaluations are usually developed to answer questions that emerge from stakeholder interests in a
program at one point in time. The client(s) of the evaluation are identified, terms of reference for the evaluation
are developed and tailored to that project, and resources are usually mobilized to do the work and report the
results. Where governments have an overarching framework for program evaluation, there can be core questions
that are expected to be addressed in all evaluations (Treasury Board of Canada Secretariat, 2016b), as well as
departmental budgets that are expected to cover the evaluation work in a fiscal year. Even there, each evaluation
will typically generate its own array of sub-questions that are geared to that project and context.

Some organizations have an ongoing infrastructure that supports program evaluation and, when evaluations are
required, use their own people and other resources to do the work (Lahey & Nielsen, 2013). Even where
infrastructure exists and a regular cycle is used to evaluate programs, there is almost always a stage in the process in
which the terms of reference are negotiated by key stakeholders in the evaluation process—these terms of reference
are at least partially situation-specific.

Contrast this with performance measurement systems that are intended to be ongoing information-gathering and
dissemination mechanisms and are usually determined internally. Typically, developing performance measures
entails agreement in advance on what general questions or issues will drive the system and, hence, what is
important to measure. Examples of general questions might include the following: What are the year-over-year
trends in key outcomes? and How do these trends conform to annual or multiyear targets? Key constructs in a program
and/or an organizational logic model can be identified, measures developed, and processes created to collect, store,
track, and analyze the results. Once a system is in place, it functions as a part of the organization’s information
infrastructure, and the measures, data structure, and reporting structure remain fairly stable until the program
focus or systems architecture is modified.
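
As a loose sketch (the construct names, measures, and data sources are hypothetical, not drawn from any particular system), a performance measurement system can be thought of as a stable mapping from logic model constructs to measures and routinized data sources:

# Hypothetical mapping of logic model constructs to performance measures
# and the routinized data sources used to populate them.
measurement_system = {
    "clients served (output)": {
        "measure": "number of clients completing intake per quarter",
        "data_source": "case management database",
    },
    "client employment (outcome)": {
        "measure": "% of clients employed six months after program exit",
        "data_source": "annual follow-up survey",
    },
}

for construct, spec in measurement_system.items():
    print(f"{construct}: {spec['measure']} [source: {spec['data_source']}]")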

One of the main kinds of comparisons used in analyzing performance results is to display performance measures
over time. Trends and comparisons between actual and targeted results can be displayed visually. To maximize the

potential usefulness of such data, keeping the same performance measures year over year is an advantage. At the
same time, as organizational structure and priorities shift and even as methods used to collect data change, it may
be appropriate to change or modify measures. Balancing continuity of measures with their relevance is an
important issue for many organizations (Gregory & Lonti, 2008; Malafry, 2016). Particularly where
organizational environments are turbulent (e.g., government departments are being reorganized), performance
measures can change frequently, and the performance measurement systems may be modified. De Lancer Julnes
and Steccolini (2015) note that with increasing hybridization of service delivery, “performance measurement
systems need to continuously change over time” (p. 332). In complex settings, constructing relevant performance
measures can be challenging—there is a point where neither program evaluations nor performance measurement
systems will be useful if organization/environmental interactions are chaotic.

3. For program evaluations, measures and lines of evidence are at least partially customized for each evaluation, whereas for performance measurement, measures are developed and data are gathered through routinized processes.
Since the terms of reference are usually specific to each program evaluation, the evaluation issues and the data
needed to address each issue are also tailored. The measures and the research design–related comparisons (the lines
of evidence) needed to answer evaluation questions typically require a mixture of primary and secondary data
sources. Primary data (and the instruments used to collect them) will reflect issues and questions that that
evaluation must address. Secondary data that already exist in an organization can be adapted to an evaluation, but
it is rare for an evaluation to rely entirely on pre-existing data. In some jurisdictions, performance measurement is
now expected to support the program evaluation function or even be equal to it (Lahey & Nielsen, 2013; Treasury
Board Secretariat, 2016b). Balancing program management needs and evaluator needs typically amounts to
balancing formative and summative evaluation–related purposes.

Performance measurement systems tend to rely heavily on existing sources of data, and the procedures for
collecting those data will typically be built into organizational routines. Program managers often have a role in the
data collection process, and program-level data may be aggregated upward to construct and track organizational
performance measures. Even where primary data are being collected for a performance measurement system,
procedures for doing so are usually routinized, permitting periodic comparisons of the actual performance results
and, usually, comparisons between actual and targeted results.

For example, WorkSafeBC in British Columbia, Canada, regularly collects client satisfaction data as part of its
performance measurement process. Each time a survey is conducted, a random sample of clients is drawn and a
private polling company (under contract to WorkSafeBC) is hired to administer a pre-set survey constructed so
that satisfaction ratings with WorkSafeBC service can be measured and compared over time and across
administrative regions. These are featured in the annual report (WorkSafeBC, 2018) and are distributed to
managers so that they can see how client satisfaction is tracking over time. The targets typically call for at least
77% of clients rating their overall experience as “good” or “very good.” Visually, the actual ratings can be
compared with the targets.
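
As a rough illustration only (the responses below are invented, and this is not WorkSafeBC's actual calculation method), the sketch shows how a satisfaction target of this kind can be checked against survey results:

# Hypothetical survey responses on a five-point overall-experience scale.
responses = ["very good", "good", "fair", "good", "poor", "very good",
             "good", "good", "very good", "fair", "good", "very good"]

target = 0.77  # at least 77% rating their overall experience "good" or "very good"

favourable = sum(1 for r in responses if r in ("good", "very good"))
share = favourable / len(responses)

print(f"Favourable ratings: {share:.0%} (target: {target:.0%})")
print("Target met" if share >= target else "Target not met")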

4. For program evaluations, research designs and the comparisons they entail
are intended as ways to get at the attribution issue, whereas for performance
measurement systems, attribution is generally assumed.
The literature on program evaluation continues to be energized by discussions of ways to design evaluations to
make it possible to determine the causal attribution of the actual outcomes of a program (Cook, Scriven, Coryn,
& Evergreen, 2010; Forss et al., 2011; Picciotto, 2011; Scriven, 2008). In Chapter 3, we introduced and discussed
the importance of internal validity in evaluation research designs as a set of criteria to assess the capacity of
research designs to discern causes and effects. Although there is by no means universal agreement on the centrality

of internal validity as a criterion for defensible research designs (Cronbach et al., 1981; Shadish, Cook, &
Campbell, 2002), the practice of program evaluation has generally emphasized the importance of sorting out the
incremental effects of programs and being able to make statements about the extent to which observed outcomes
are actually attributable to the program, as opposed to other causes. Causes and effects are perhaps the core issue
in evaluation. If we look at what distinguishes evaluation from other, related professions like auditing and
management consulting, evaluators are typically trained to understand the challenges of assessing program
effectiveness and doing so in ways that are credible (McDavid & Huse, 2006; Picciotto, 2011).

The attribution problem can be illustrated succinctly with Figure 8.1, which has been adapted from a research
design model for evaluating European Union expenditure programs (Nagarajan & Vanheukelen, 1997). A key
question for program evaluators is whether the observed outcome (in this figure, the creation of 75 new job
placements) is due to the training program in question. The purpose of the comparison group is to “calibrate” the
observed outcome of the program by offering evidence of what would have happened without the program. This is
sometimes called the counterfactual condition. In the comparison group, 50 people found new jobs during the same
time frame in which the program group found 75. Because both groups found jobs, the incremental outcome
of the program was 25 jobs. In other words, 25 new jobs can be attributed to the program—the other 50 can be
attributed to factors in the environment of the program.

Figure 8.1 How an Internally Valid Research Design Resolves the Attribution Problem

Source: Nagarajan & Vanheukelen (1997, p. 327).
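
The arithmetic behind Figure 8.1 can be written out directly. The sketch below simply restates the example's numbers and assumes the two groups are similar in size and composition:

# Numbers from the Figure 8.1 example.
program_group_jobs = 75      # new job placements observed in the program group
comparison_group_jobs = 50   # placements in the comparison (counterfactual) group

# The comparison group estimates what would have happened without the program,
# so the difference is the incremental outcome attributable to the program.
# (Assumes the groups are comparable in size and composition.)
incremental_jobs = program_group_jobs - comparison_group_jobs

print(f"Incremental outcome attributable to the program: {incremental_jobs} jobs")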

In Chapter 3 (Appendix C), we introduced an example of a time series based evaluation of a policy to implement
an admission fee to a museum in Victoria, British Columbia, and showed the visual impact of the policy on
monthly attendance. By comparing what would have happened to attendance over time with what actually
happened, the evaluators were able to show the incremental effects of the intervention on attendance. As we have
stressed in earlier chapters, if an organization is simply tracking the performance measures of outcomes over time,
it is not typically easy to determine whether the results are due to the program or are reflecting results that
occurred because of other influences, or are a combination of the two.

Exploring the Complementarity of Performance Measurement and Program Evaluation: When Do Performance Results Become
Program Evaluation Findings?

Performance results are often displayed over time—previous years are included so that a visual display of trends can be seen. We can see
whether performance measurement trends are consistent with intended improvements or not. Time series are important in program
evaluations as well. If we have a program outcome variable that is displayed before and after a program begins, we can see whether the
trend and level of that variable is consistent with program effectiveness. Monthly attendance in the Royal BC Museum is both a
performance measure (we can see the trend over time) and data for an interrupted time series research design (we can compare the level
and trend in monthly attendance before and after the admission fee was implemented). Knowing when a program was implemented gives us
a way to “convert” time series performance data into program evaluation data.
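
A minimal sketch of that conversion, using invented monthly attendance figures rather than the Royal BC Museum's actual data, and assuming the admission fee takes effect at a known month:

# Hypothetical monthly attendance; the admission fee is assumed to start at month 13.
attendance = [9800, 10100, 9900, 10300, 10050, 10200,    # months 1-6 (before)
              9950, 10150, 10000, 10250, 9900, 10100,    # months 7-12 (before)
              8100, 7900, 8300, 8000, 8200, 7950,        # months 13-18 (after)
              8150, 8050, 8250, 7900, 8100, 8000]        # months 19-24 (after)
first_post_fee_month = 12  # zero-based index of month 13, the first month with the fee

before = attendance[:first_post_fee_month]
after = attendance[first_post_fee_month:]

mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)

# A simple before/after comparison of levels; a fuller interrupted time series
# analysis would also model trends, seasonality, and other influences.
print(f"Mean monthly attendance before the fee: {mean_before:.0f}")
print(f"Mean monthly attendance after the fee:  {mean_after:.0f}")
print(f"Apparent change in level:               {mean_after - mean_before:.0f}")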

An example of using performance information in a program evaluation is the 2012 Evaluation Report for the Smoke-Free Ontario Strategy.
The program was started in 2011 but smoking-related data were available before that. Logic models were developed for the three major
components of the program: protection, smoking cessation, and youth prevention. For each logic model, outcome-related constructs were
identified, and for those, data were displayed in a time series format (at least five years before the program started—2005 to 2010). The
figure below displays survey-based trends in secondhand smoking exposure and is taken from the 2012 report:

Figure 8.2 Exposure to Secondhand Smoke at restaurants or bars, ages 15+, Ontario, 2005 to 2010

Source: Ontario Tobacco Research Unit. Smoke-Free Ontario Strategy Evaluation Report. Ontario Tobacco Research Unit, Special
Report, November 2012. p. 28. Used with permission.

This graph, like the others in the report, becomes a baseline measure for its program construct, and over time, it will be possible to compare pre-
program trends to post-program trends. By themselves, these graphs do not support a conclusion about program effectiveness, but this
information becomes an important line of evidence in the program evaluation.

Keep in mind that in Chapter 3, we emphasized that research designs for program evaluations are about what
comparisons are possible given particular data sources (single time series versus before–after comparison group
designs, for example). When we are measuring performance, we need to pay attention to what comparisons are
possible with the data at hand; for some data sources, we can meaningfully begin to ask, “What does the
incremental effect of the program appear to be, given the comparison at hand?” In effect, we are
moving beyond asking, “What were the results of the program?” to asking, “Why did those results occur?”

In performance measurement systems, it is rare to build in the capacity to conduct comparisons that can sort out

outcome attribution (Scheirer & Newcomer, 2001). Typically, when performance measures are being developed,
attention is paid to their measurement validity. (Are the measures valid indicators of constructs in the program logics,
particularly the outcomes?) The attribution question usually cannot be answered with the performance measures
alone. This limitation is often compounded by the reliance on existing data sources, in which measures
must be adapted to “fit” the constructs that are important for the performance measurement system.

Because outcomes occur in the environment of programs, factors other than the program can and usually do affect
outcome measures. For persons developing performance measures, a further challenge is to construct measures that
are not only valid indicators of constructs (measure what they are intended to measure) but also give us an
indication of what the program (or programs) is actually doing under a variety of conditions.

Unlike blood pressure, which chiefly indicates the performance of systems within our bodies, performance
measures that focus on outcomes typically indicate the behavior of variables outside programs themselves. Without
appropriate comparisons and the research designs that are typical of program evaluations, these performance
measures cannot attribute these outcomes to the program; they cannot demonstrate that the program caused or
did not cause the outcome measured. How well performance measures indicate what the program (or
organization) actually accomplished depends, in part, on the types of measures chosen. Output measures, for
example, are typically “closer” to the program process than are outcomes and are often viewed as being more
defensibly linked to what the program actually accomplished.

Thompson (1967) introduced program technologies as a concept that helps explain the likelihood that a well-
implemented program will deliver its intended outcomes. Some programs (e.g., highways maintenance programs)
have high-probability core technologies (based as they are on engineering knowledge), meaning that organizational
resources will quite reliably be converted into expected results. Low-probability program technologies (e.g., a drug
rehabilitation program) can be less likely to directly produce intended outcomes because the “state of the art” of
rehabilitating drug users is more like a craft, or even an art, than an engineering science. The complex
contingencies of the program's efforts and its external context create a more volatile situation. The core technology
of a program affects the likelihood that program outcomes can be “tracked” back to the program. Other things
being equal (sometimes called the ceteris paribus assumption), observed program outcomes from high-probability
programs will be more likely attributable to the program than outcomes from low-probability programs. In other
words, attribution will be a more salient problem for low-probability program technologies. Consequently,
performance measurement systems can be more confidently used to measure actual outcomes in programs with
high-probability technologies, whereas in programs with low-probability technologies, performance measurement
systems can rarely be used with confidence to attribute outcomes to the program. Program evaluation is more
appropriate.

5. Targeted resources are needed for each program evaluation, whereas for
performance measurement, because it is ongoing, resources are a part of the
program or organizational infrastructure.
Program evaluations can be designed, conducted, and reported by contracted consultants, in-house staff, or both.
Even where evaluations are conducted in-house, each evaluation typically includes a budget for primary data
collection and other activities that are unique to that study.

Availability of resources for developing, implementing, and reporting from performance measurement systems can
vary considerably, but typically, managers are expected to play a key role in the process (Treasury Board of Canada
Secretariat, 2016b). In organizations where budgets for evaluation-related activities have been reduced, managers
often are expected to take on tasks associated with performance measurement as part of their work. As we discuss
in the next two sections, this situation has both advantages and disadvantages.

6. For program evaluations, evaluators are usually not program managers, whereas for performance measurement, managers are usually key players in developing and reporting performance results.
Program evaluations in government organizations are typically conducted with the advice of a steering committee,
which may include program managers among the stakeholders represented on such a committee. In nonprofit
organizations, particularly smaller agencies, it is likely that program managers will play a key role in conducting
program evaluations. Looking across the field of program evaluation, there are approaches that emphasize
managerial involvement. One approach is empowerment evaluation (Fetterman, 2001a, 2001b; Fetterman,
Kaftarian, & Wandersman, 2015; Wandersman & Fetterman, 2007) that is based on the belief that evaluation
should be used to improve social justice and to empower individuals and organizations. Empowerment evaluations
are intended to be done by people connected with the program, who have detailed knowledge of the program. In
its beginnings, this evaluation approach was controversial, in part because of concerns about whether it is possible
for managers to credibly self-evaluate their programs (Stufflebeam, 1994).

In the field of program evaluation, there is considerable discussion about the pros and cons of managerial
involvement. The diversity of the evaluation field offers views that range from no managerial involvement
(Scriven, 1997) to managerial ownership of the evaluation process (Fetterman, 1994). Both Austin (1982) and
Love (1991) have argued that managerial involvement is essential to ensuring that evaluations are “owned” by
those who are in the best position to use them. Wildavsky (1979), on the other hand, has questioned whether
evaluation and management can ever be joined in organizations. The usual practice in program evaluation is for
evaluations to be conducted by people other than program managers, with degrees of managerial involvement. We
discuss managerial involvement in evaluations in Chapter 11.

In developing performance measurement systems, program managers are usually expected to play a central role
since one goal of such systems is to produce information that is useful for performance management. Managerial
involvement in developing performance measures makes good sense, since program managers are in a position to
offer input, including pointing to measures that do a good job of capturing the work the organization is doing
(outputs) and the results that are intended (outcomes). But if performance measurement systems are also used to
report externally (i.e., to the public) as part of accountability-related commitments, managers may perceive mixed
incentives if they become involved in developing and using performance information. In organizational settings
where the environment is critical of any reported performance shortcomings or where there is a concern that
results may be used to justify budget cuts, managers have an incentive to be cautious about creating performance
measures that might reveal performance shortcomings. In Chapter 10, we will discuss this problem and link it to
the challenges of using performance information for both public accountability and for performance management.

7. The intended purposes of a given program evaluation are usually negotiated up front, whereas for performance measurement, the uses of the information can evolve over time to reflect changing information needs and priorities.
Program evaluations can be formative or summative in intent, but typically, the purposes are negotiated as
evaluations are planned. Terms of reference are usually established by the evaluation client and are overseen by a
steering committee that is part of the program evaluation. The process and products are expected to conform to
the terms of reference. If consultants are doing some or all of the work, the terms of reference become the basis for
their contract, so changing the terms of reference would have cost implications for the organization conducting
the evaluation.

Performance measurement systems can also be used formatively or summatively (for program improvement or for
accountability, respectively), but because they are part of ongoing processes in organizational environments, it is
typical for the uses of information produced by these systems to evolve as organizational information needs and
priorities change. Because modern information technologies are generally flexible, it is often possible to refocus a
performance measurement system as information needs evolve. As well, typical performance measurement systems
will include measures that become less useful over time and can be removed from the system as a result. Managers or other stakeholders can add relevant measures over time as needed.

The province of Alberta in Canada, for example, has been measuring provincial performance since 1995 and, each
year, reports performance results for a set of province-wide measures. Each year, some of the measures are revised
for future reporting opportunities (Government of Alberta, 2017).
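
To make the idea of an evolving set of measures concrete, the following minimal Python sketch (our illustration only; the measure names and fields are hypothetical and not drawn from any jurisdiction's actual system) shows how measures might be added to, or retired from, a simple registry as information needs change.

    # A minimal, illustrative registry of performance measures.
    # Measure names and definitions are hypothetical.

    class MeasureRegistry:
        def __init__(self):
            self.measures = {}   # active measures: name -> definition
            self.retired = {}    # measures removed from active reporting

        def add(self, name, definition):
            """Add a measure when a new information need emerges."""
            self.measures[name] = definition

        def retire(self, name, reason):
            """Retire a measure that is no longer useful, keeping a record of why."""
            if name in self.measures:
                self.retired[name] = {"definition": self.measures.pop(name),
                                      "reason": reason}

    registry = MeasureRegistry()
    registry.add("client_wait_time_days",
                 "Average days from referral to first service contact")
    registry.add("forms_processed",
                 "Number of application forms processed per month")

    # As priorities shift toward outcomes, an output-only measure might be
    # retired and an outcome measure added in its place.
    registry.retire("forms_processed",
                    reason="Output measure superseded by outcome measures")
    registry.add("clients_reporting_improvement_pct",
                 "Percentage of surveyed clients reporting improvement after 6 months")

    print(sorted(registry.measures))   # currently active measures
    print(sorted(registry.retired))    # retired measures, with reasons retained

The point of the sketch is simply that a performance measurement system is a living inventory rather than a fixed list: retiring a measure, with a recorded rationale, is as much a part of maintaining the system as adding one.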

Pollitt, Bal, Jerak-Zuiderent, Dowswell, and Harrison (2010) offer an example of a performance measurement
system in the health sector in Britain that has evolved substantially over time. Initially, the system was designed for
health service managers and reflected their needs for formative performance information. Over time, as political
stakeholders changed, the system was used increasingly for external accountability purposes. This transformation,
perhaps unintended by those involved initially, affected how users of the performance information saw the risks
and benefits for themselves. In Chapter 10, we will discuss how the purposes of a performance measurement
system can affect the way that managers and other stakeholders participate in its development and how they use
the information that is produced.

Summary
Performance measurement in the United States began in local governments at the turn of the 20th century. Local government programs
and services lent themselves to measuring costs, outputs, and even outcomes. Since that time, performance measurement has been a part
of successive governmental reform movements that extend up to the present day. Although the earlier reform initiatives (principally in the 1960s and 1970s) did not fully realize their intended results, performance measurement adapted and survived. More recently, the New Public Management (NPM) reform movement, which began in the 1970s but gained traction in the 1980s, became part of government reforms in the United States, Canada, Europe, Australasia, and much of the rest of the world
(although it was adopted differently in the various nations). Performance measurement, focused on results (outputs and outcomes), has
become a central feature of a broad expectation that governments will be more accountable.

Performance measurement in the public and nonprofit sectors has been informed by several different metaphors of how organizations
function, which offer a plausible basis for constructing and implementing performance measures. Although there is no single theory that undergirds
measuring performance, conceptualizing organizations as machines, businesses, open systems, or organisms offers some guidance about
what to focus on and what to expect when performance is being measured. The dominant metaphor that is in play now is that
organizations are open systems. We use that metaphor in both program evaluation and performance measurement to construct logic
models that become the foundation for identifying performance measures.

Performance measurement and program evaluation are complementary ways of acquiring and analyzing information that is intended to
inform and reduce the uncertainty of program and policy decisions. They both rely on a common core of methodologies that is discussed
in Chapters 2, 3, and 4.

The increasing overlaps between these two approaches to evaluating programs reflect a growing trend toward integrating evaluation
databases into the information infrastructure of organizations. Managers are increasingly adopting a “just in time” stance with respect to
acquiring, analyzing, and reporting evaluation information (Mayne & Rist, 2006). Increasingly, performance measurement results are
expected to support program evaluations in organizations.

Although performance measurement can be a cost-effective alternative to program evaluation, particularly where the purpose is to describe
patterns of actual program and/or organizational results (“what happened?” questions), it typically does not allow the user to directly
address questions of why observed program results occurred.

Discussion Questions
1. Why did performance measurement have its origins in local governments in the United States?
2. Why did performance measurement survive different reform movements in the United States from the early 20th century to
the present day?
3. One of the metaphors that has provided a rationale for performance measurement in government is that government is a business.
In what ways is government businesslike?
4. What are some of the differences between governments and businesses? How would those differences affect performance
measurement?
5. Assume that you are an advisor to a public-sector organization, with several hundred employees, that delivers social service
programs. At present, the organization does not have the capability either to conduct program evaluations or to measure program
performance results. Suppose that you are asked to recommend developing either program evaluation capabilities (putting
resources into developing the capability of conducting program evaluations) or performance measurement capability (putting
resources into measuring key outputs and outcomes for programs) as a first step in developing evaluation capacity. Which one
would you recommend developing first? Why?
6. What are the key performance measures for a driver of an automobile in a city? In other words, what would you want to know
about the driver’s performance to be able to decide whether he or she was doing a good job of driving the vehicle? Where would
the data come from for each of your measures?
7. Do some Internet research on self-driving automobiles. What can you say about the performance of self-driving vehicles? Is there
a set of performance measures that are built into the software (the algorithms) and hardware that manages such vehicles? If so,
what are some of the performance measures? Are they the same measures that human drivers would use?
8. The movie Moneyball is based on a book by Michael Lewis (2003) and is about how a professional baseball team (the Oakland
Athletics) used performance measurement information to build winning baseball teams. What can we say about the advantages
and disadvantages of using performance measurement in public-sector or nonprofit organizations, based on the story of the
Oakland Athletics?

References
Auditor General of British Columbia & Deputy Ministers’ Council. (1996). Enhancing accountability for
performance: A framework and an implementation plan—Second joint report. Victoria, British Columbia, Canada:
Queen’s Printer for British Columbia.

Austin, M. J. (1982). Evaluating your agency’s programs. Beverly Hills, CA: Sage.

Behn, R. D. (2003). Why measure performance? Different purposes require different measures. Public
Administration Review, 63(5), 586–606.

Bevan, G., & Hamblin, R. (2009). Hitting and missing targets by ambulance services for emergency calls: Effects
of different systems of performance measurement within the UK. Journal of the Royal Statistical Society: Series A
(Statistics in Society), 172(1), 161–190.

Bevan, G., & Hood, C. (2006). Gaming in targetworld: The targets approach to managing British public services.
Public Administration Review, 66(4), 515–521.

Bititci, U., Garengo, P., Dörfler, V., & Nudurupati, S. (2012). Performance measurement: Challenges for
tomorrow. International Journal of Management Reviews, 14(3), 305–327.

Borins, S. (1995). The New Public Management is here to stay. Canadian Public Administration, 38(1), 122–132.

Boris, T., De Leon, E., Roeger, K., & Nikolova, M. (2010). Human service nonprofits and government
collaboration: Findings from the 2010 National Survey of Nonprofit Government Contracting and Grants.
Washington, DC: Urban Institute.

Bourgon, J. (2011). A new synthesis of public administration. Queen’s Policy Studies Series. Kingston, ON:
McGill-Queen’s University Press.

Bourgon, J. (2017). Rethink, reframe and reinvent: Serving in the twenty-first century. International Review of
Administrative Sciences, 83(4), 624–635.

Buenker, J. D., Burnham, J. C., & Crunden, R. M. (1976). Progressivism. Cambridge, MA: Schenkman.

Cook, T. D., Scriven, M., Coryn, C. L., & Evergreen, S. D. (2010). Contemporary thinking about causation in
evaluation: A dialogue with Tom Cook and Michael Scriven. American Journal of Evaluation, 31(1), 105–117.

Cronbach, L. (1980). Toward reform of program evaluation. San Francisco, CA: Jossey-Bass Social and Behavioral
Science Series.

Cronbach, L., Ambron, S., Dornbusch, S., Hess, R., Hornik, R., Phillips, D., . . . Weiner, S. (1981). Toward
reform of program evaluation. Educational Evaluation and Policy Analysis, 3(6), 85–87.

Danziger, J. N., & Ring, P. S. (1982). Fiscal limitations: A selective review of recent research. Public
Administration Review, 42(1), 47–55.

Davies, R., & Dart, J. (2005). The “Most Significant Change” (MSC) technique: A guide to its use. Retrieved
from http://mande.co.uk/wp-content/uploads/2018/01/MSCGuide.pdf.

de Lancer Julnes, P. (2006). Performance measurement: An effective tool for government accountability? The
debate goes on. Evaluation, 12(2), 219–235.

de Lancer Julnes, P., & Steccolini, I. (2015). Introduction to symposium: Performance and accountability in
complex settings—Metrics, methods, and politics. International Review of Public Administration, 20(4),
329–334.

Denhardt, J. V., & Denhardt, R. B. (2003). The new public service: Serving, not steering. Armonk, NY: M. E.
Sharpe.

De Vries, J. (2010). Is New Public Management really dead? OECD Journal on Budgeting, 10(1), 87.

Doll, W. E., & Trueit, D. (2010). Complexity and the health care professions. Journal of Evaluation in Clinical
Practice, 16(4), 841–848.

Downs, A. (1965). An economic theory of democracy. New York: Harper & Row.

Dunleavy, P., Margetts, H., Bastow, S., & Tinkler, J. (2006). New Public Management is dead—long live digital-
era governance. Journal of Public Administration Research and Theory, 16(3), 467–494.

Edwards, D., & Thomas, J. C. (2005). Developing a municipal performance-measurement system: Reflections on
the Atlanta Dashboard. Public Administration Review, 65(3), 369–376.

Eikenberry, A. M., & Kluver, J. D. (2004). The marketization of the nonprofit sector: Civil society at risk? Public
Administration Review, 64(2), 132–140.

Epstein, M., & Manzoni, J.-F. (1998). Implementing corporate strategy: From tableaux de bord to balanced
scorecards. European Management Journal, 16(2), 190–203.

Feller, I. (2002). Performance measurement redux. American Journal of Evaluation, 23(4), 435–452.

Fetterman, D. (1994). Empowerment evaluation. Presidential address. Evaluation Practice, 15(1), 1–15.

Fetterman, D. (2001a). Foundations of empowerment evaluation. Thousand Oaks, CA: Sage.

Fetterman, D. (2001b). The transformation of evaluation into a collaboration: A vision of evaluation in the 21st
century. American Journal of Evaluation, 22(3), 381–385.

Fetterman, D., Kaftarian, S., & Wandersman, A. (Eds.). (2015). Empowerment evaluation: Knowledge, and tools
for self-assessment, evaluation capacity building, and accountability (2nd ed.). Thousand Oaks, CA: Sage.

Forss, K., Marra, M., & Schwartz, R. (Eds.). (2011). Evaluating the complex: Attribution, contribution, and beyond.
New Brunswick, NJ: Transaction.

Gill, D. (Ed.). (2011). The iron cage recreated: The performance management of state organisations in New
Zealand. Wellington, NZ: Institute of Policy Studies.

Glouberman, S., & Zimmerman, B. (2002). Complicated and complex systems: What would successful reform of
Medicare look like? Commission on the Future of Health Care in Canada. Discussion Paper Number 8, Ottawa,
ON: Commission on the Future of Health Care in Canada.

Government of Alberta. (2017). 2016–2017 annual report. Edmonton: Government of Alberta. Retrieved from
https://open.alberta.ca/dataset/7714457c-7527-443a-a7db-dd8c1c8ead86/resource/e6e99166-2958-47ac-a2db-5b27df2619a3/download/GoA-2016-17-Annual-Report.pdf

Government Performance and Results Act of 1993, Pub. L. No. 103–62.

Government Performance and Results Act Modernization Act of 2010, Pub. L. No. 111–352.

Gregory, R., & Lonti, Z. (2008). Chasing shadows? Performance measurement of policy advice in New Zealand
government departments. Public Administration, 86(3), 837–856.

Gruening, G. (2001). Origin and theoretical basis of New Public Management. International Public Management
Journal, 4(1), 1–25.

Hatry, H. P. (1974). Measuring the effectiveness of basic municipal services. Washington, DC: Urban Institute and
International City Management Association.

Hatry, H. P. (1980). Performance measurement principles and techniques: An overview for local governments.
Public Productivity Review, 4(4), 312–339.

Hatry, H. P. (1999). Performance measurement: Getting results. Washington, DC: Urban Institute Press.

Hatry, H. P. (2002). Performance measurement: Fashions and fallacies. Public Performance & Management Review, 25(4), 352–358.

Hatry, H. P. (2006). Performance measurement: Getting results (2nd ed.). Washington, DC: Urban Institute Press.

Hatry, H. P. (2013). Sorting the relationships among performance measurement, program evaluation, and
performance management. New Directions for Evaluation, 137, 19–32.

Hildebrand, R., & McDavid, J. C. (2011). Joining public accountability and performance management: A case
study of Lethbridge, Alberta. Canadian Public Administration, 54(1), 41–72.

Hood, C. (1989). Public administration and public policy: Intellectual challenges for the 1990s. Australian Journal
of Public Administration, 48, 346–358.

Hood, C. (1991). A public management for all seasons? Public Administration, 69(1), 3–19.

Hood, C. (2000). Paradoxes of public-sector managerialism, old public management and public service bargains.
International Public Management Journal, 3(1), 1–22.

Hood, C., & Peters, G. (2004). The middle aging of New Public Management: Into the age of paradox? Journal of
Public Administration Research and Theory, 14(3), 267–282.

Hopwood, A. G., & Miller, P. (Eds.). (1994). Accounting as social and institutional practice. Cambridge, MA:
Cambridge University Press.

Ibrahim, N., Rue, L., & Byars, L. (2015). Human resource management (11th ed.). New York, NY: McGraw-Hill
Higher Education.

Jakobsen, M. L., Baekgaard, M., Moynihan, D. P., & van Loon, N. (2017). Making sense of performance
regimes: Rebalancing external accountability and internal learning. Perspectives on Public Management and
Governance.

Kanigel, R. (1997). The one best way: Frederick Winslow Taylor and the enigma of efficiency. New York, NY:
Penguin-Viking.

Kaplan, R. S., & Norton, D. P. (1996). The balanced scorecard: Translating strategy into action. Boston, MA:
Harvard Business School Press.

Kitchin, L., & McArdle, G. (2015). Knowing and governing cities through urban indicators, city benchmarking
and real-time dashboards. Regional Studies, Regional Science, 2(1), 6–28.

Kroll, A., & Moynihan, D. P. (2017). The design and practice of integrating evidence: Connecting performance management with program evaluation. Public Administration Review, 78(2), 183–194.

Lahey, R., & Nielsen, S. B. (2013). Rethinking the relationship among monitoring, evaluation, and results-based
management: Observations from Canada. New Directions for Evaluation, 137, 45–56.

Lee, M. (2006). The history of municipal public reporting. International Journal of Public Administration, 29(4),
453–476.

Lewis, M. (2003). Moneyball: The art of winning an unfair game. New York, NY: W. W. Norton.

Love, A. J. (1991). Internal evaluation: Building organizations from within. Newbury Park, CA: Sage.

Malafry, R. (2016). An analysis of performance measures in Alberta Health (Government of Alberta). University of
Victoria, School of Public Administration, Master’s Project Report.

Manitoba Office of the Auditor General. (2000). Business and performance measurement—Study of trends and
leading practices. Winnipeg, Manitoba, Canada: Author.

Mayne, J. (2008). Building an evaluative culture for effective evaluation and results management. Retrieved from
www.focusintl.com/RBM107-ILAC_WorkingPaper_No8_EvaluativeCulture_Mayne.pdf.

Mayne, J., & Rist, R. C. (2006). Studies are not enough: The necessary transformation of evaluation. Canadian
Journal of Program Evaluation, 21(3), 93–120.

McDavid, J. C. (2001). Program evaluation in British Columbia in a time of transition: 1995–2000. Canadian
Journal of Program Evaluation, 16(Special Issue), 3–28.

McDavid, J. C., & Huse, I. (2006). Will evaluation prosper in the future? Canadian Journal of Program
Evaluation, 21(3), 47–72.

McDavid, J. C., & Huse, I. (2012). Legislator uses of public performance reports: Findings from a five-year study.
American Journal of Evaluation, 33(1), 7–25.

Melkers, J. (2006). On the road to improved performance: Changing organizational communication through
performance management. Public Performance & Management Review, 30(1), 73–95.

Morgan, G. (2006). Images of organization (Updated ed.). Thousand Oaks, CA: Sage.

Nagarajan, N., & Vanheukelen, M. (1997). Evaluating EU expenditure programs: A guide. Ex post and intermediate
evaluation. Brussels, Belgium: Directorate-General for Budgets of the European Commission.

Newcomer, K. E. (2007). How does program performance assessment affect program management in the federal
government? Public Performance & Management Review, 30(3), 332–350.

Nielsen, S. B., & Hunter, D. E. (2013). Challenges to and forms of complementarity between performance
management and evaluation. New Directions for Evaluation, 137, 115–123.

Niskanen, W. A. (1971). Bureaucracy and representative government. New York, NY: Aldine-Atherton.

OECD. (2008). Effective aid management: Twelve lessons from DAC peer reviews. Paris, France: Author. Retrieved
from https://www.oecd.org/dac/peer-reviews/40720533.pdf

OECD. (2010). Value for money in government: Public administration after “New Public Management.” Paris,
France: Author.

Ontario Tobacco Research Unit. (2012). Smoke-Free Ontario strategy evaluation report. Ontario Tobacco Research
Unit, Special Report (November).

Osborne, D., & Gaebler, T. (1992). Reinventing government: How the entrepreneurial spirit is transforming the
public sector. Reading, MA: Addison-Wesley.

Patton, M. Q. (2011). Developmental evaluation: Applying complexity to enhance innovation and use. New
York, NY: Guilford Press.

Perrin, B. (1998). Effective use and misuse of performance measurement. American Journal of Evaluation, 19(3),
367–379.

Perrin, B. (2015). Bringing accountability up to date with the realities of public sector management in the 21st
century. Canadian Public Administration, 58(1), 183–203.

Picciotto, R. (2011). The logic of evaluation professionalism. Evaluation, 17(2), 165–180.

Poister, T. H., Aristigueta, M. P., & Hall, J. L. (2015). Managing and measuring performance in public and
nonprofit organizations (2nd ed.). San Francisco, CA: Jossey-Bass.

Pollitt, C. (1993). Managerialism and the public services (2nd ed.). Oxford, UK: Blackwell.

Pollitt, C. (1998). Managerialism revisited. In B. G. Peters & D. J. Savoie (Eds.), Taking stock: Assessing public
sector reforms (pp. 45–77). Montreal, Quebec, Canada: McGill-Queen’s University Press.

Pollitt, C., Bal, R., Jerak-Zuiderent, S., Dowswell, G., & Harrison, S. (2010). Performance regimes in health care:
Institutions, critical junctures and the logic of escalation in England and the Netherlands. Evaluation, 16(1), 13–29.

Pollitt, C., & Bouckaert, G. (2011). Public management reform: A comparative analysis: New Public Management,
governance, and the neo-Weberian state (3rd ed.). New York, NY: Oxford University Press.

Queensland State Government. (2017). Queensland Government Performance Management Framework Policy.
Retrieved from https://www.forgov.qld.gov.au/sites/default/files/performance-management-framework-
policy.pdf

Randall, M., & Rueben, K. (2017). Sustainable budgeting in the states: Evidence on state budget institutions and
practices. Washington, DC: Urban Institute. Retrieved from
https://www.urban.org/sites/default/files/publication/93461/sustainable-budgeting-in-the-states_2.pdf

Sands, H. R., & Lindars, F. W. (1912). Efficiency in budget making. ANNALS of the American Academy of
Political and Social Science, 41(1), 138–150.

Savino, D. M. (2016). Frederick Winslow Taylor and his lasting legacy of functional leadership competence.
Journal of Leadership, Accountability and Ethics, 13(1), 70–76.

Savoie, D. J. (1990). Reforming the expenditure budget process: The Canadian experience. Public Budgeting &
Finance, 10(3), 63–78.

Savoie, D. J. (1995). What is wrong with the New Public Management. Canadian Public Administration, 38(1),
112–121.

Scheirer, M. A., & Newcomer, K. (2001). Opportunities for program evaluators to facilitate performance-based
management. Evaluation and Program Planning, 24(1), 63–71.

Scott, K. (2003). Funding matters: The impact of Canada’s new funding regime on nonprofit and voluntary
organizations, summary report. Ottawa, Ontario, Canada: Canadian Council on Social Development.

Scriven, M. (1997). Truth and objectivity in evaluation. In E. Chelimsky & W. R. Shadish (Eds.), Evaluation for
the 21st century: A handbook (pp. 477–500). Thousand Oaks, CA: Sage.

Scriven, M. (2008). A summative evaluation of RCT methodology & an alternative approach to causal research.
Journal of Multidisciplinary Evaluation, 5(9), 11–24.

Shadish, W., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized
causal inference. Boston, MA: Houghton Mifflin.

Shand, D. A. (Ed.). (1996). Performance auditing and the modernisation of government. Paris, France: Organisation
for Economic Co-operation and Development.

Shaw, T. (2016). Performance budgeting practices and procedures. OECD Journal on Budgeting, 15(3), 1–73.

Steane, P., Dufour, Y., & Gates, D. (2015). Assessing impediments to NPM change. Journal of Organizational
Change Management, 28(2), 263–270.

Stufflebeam, D. L. (1994). Empowerment evaluation, objectivist evaluation, and evaluation standards: Where the
future of evaluation should not go and where it needs to go. Evaluation Practice, 15(3), 321–338.

Taylor, F. W. (1911). The principles of scientific management. New York, NY: Harper.

Thomas, P. G. (2007). Why is performance-based accountability so popular in theory and difficult in practice?
World Summit on Public Governance: Improving the Performance of the Public Sector. Taipei, May 1–3.

Thompson, J. D. (1967). Organizations in action. New York, NY: McGraw-Hill.

Thornton, D. (2011, March). Tax and expenditure limits II: Are there additional options? Policy Study, 1–27.
Retrieved from http://www.limitedgovernment.org/publications/pubs/studies/ps-11-2.pdf

Treasury Board of Canada Secretariat. (2016a). Management accountability framework. Retrieved from
https://www.canada.ca/en/treasury-board-secretariat/services/management-accountability-framework.html

Treasury Board of Canada Secretariat. (2016b). Policy on results. Retrieved from https://www.canada.ca/en/treasury-board-secretariat/services/management-accountability-framework.html

Treasury Board of Canada Secretariat. (2017a). Policies, directives, standards and guidelines. Ottawa, ON: Treasury
Board Secretariat. Retrieved from http://www.tbs-sct.gc.ca/pol/index-eng.aspx

Treasury Board of Canada Secretariat. (2017b). Evaluation of the Management Accountability Framework.
Ottawa, ON: Treasury Board Secretariat. Retrieved from https://www.canada.ca/en/treasury-board-
secretariat/corporate/reports/evaluation-management-accountability-framework.html

U.S. Government Accountability Office. (2011). GPRA Modernization Act implementation provides important
opportunities to address government challenges (GAO-11–617T). Retrieved from
http://www.gao.gov/assets/130/126150.pdf

Van Dooren, W., & Hoffmann, C. (2018). Performance management in Europe: An idea whose time has come
and gone? In E. Ongaro & S. van Thiel (Eds.), The Palgrave handbook of public administration and management
in Europe (pp. 207–225). London, UK: Palgrave Macmillan.

Von Bertalanffy, L. (1968). General system theory: Foundations, development, applications (Rev. ed.). New York,
NY: G. Braziller.

Wandersman, A., & Fetterman, D. (2007). Empowerment evaluation: Yesterday, today, and tomorrow. American
Journal of Evaluation, 28(2), 179–198.

Wiebe, R. H. (1962). Businessmen and reform: A study of the progressive movement. Cambridge, MA: Harvard
University Press.

Wholey, J. S. (2001). Managing for results: Roles for evaluators in a new management era. American Journal of
Evaluation, 22(3), 343–347.

Wildavsky, A. (1975). Budgeting: A comparative theory of budgetary processes. Boston, MA: Little, Brown.

Wildavsky, A. (1979). Speaking truth to power: The art and craft of policy analysis. Boston, MA: Little Brown.

Williams, D. W. (2003). Measuring government in the early twentieth century. Public Administration Review,
63(6), 643–659.

Wilson, W. (1887). The study of administration. Political Science Quarterly, 2(2), 197–222.

WorkSafeBC. (2018). WorkSafeBC 2017 annual report and 2018–2020 service plan. Richmond, BC:
WorkSafeBC. Retrieved from https://www.worksafebc.com/en/about-us/what-we-do/our-annual-report

9 Design and Implementation of Performance Measurement
Systems

Introduction 372
The Technical/Rational View and the Political/Cultural View 372
Key Steps in Designing and Implementing a Performance Measurement System 374
1. Leadership: Identify the Organizational Champions of This Change 375
2. Understand What Performance Measurement Systems Can and Cannot Do 377
3. Communication: Establish Multi-channel Ways of Communicating That Facilitate Top-Down,
Bottom-Up, and Horizontal Sharing of Information, Problem Identification, and Problem Solving
379
4. Clarify the Expectations for the Intended Uses of the Performance Information That Is Created 380
5. Identify the Resources and Plan for the Design, Implementation, and Maintenance of the
Performance Measurement System 383
6. Take the Time to Understand the Organizational History Around Similar Initiatives 384
7. Develop Logic Models for the Programs for Which Performance Measures Are Being Designed and
Identify the Key Constructs to Be Measured 385
8. Identify Constructs Beyond Those in Single Programs: Consider Programs Within Their Place in
the Organizational Structure 387
9. Involve Prospective Users in Development of Logic Models and Constructs in the Proposed
Performance Measurement System 390
10. Translate the Constructs Into Observable Performance Measures That Compose the Performance
Measurement System 391
11. Highlight the Comparisons That Can Be Part of the Performance Measurement System 395
12. Reporting and Making Changes to the Performance Measurement System 398
Performance Measurement for Public Accountability 400
Summary 402
Discussion Questions 403
Appendix A: Organizational Logic Models 404
References 405

Introduction
In this chapter, we cover the design and implementation of performance measurement systems. We begin by
introducing two complementary perspectives on public-sector organizations: (1) a technical/rational view that
emphasizes systems and structures and (2) a political/cultural view that emphasizes the dynamics that develop and
persist when we take into account people interacting to get things done. Then, we introduce and elaborate 12
steps that are important in designing and implementing performance measurement systems. These steps reflect
both the technical/rational and political/cultural perspectives on organizations. As we describe each step, we offer
advice and also point to possible pitfalls and limitations while working within complex organizations. The
perspective we take in this chapter is that the steps are for organizations that are designing and implementing
performance measurement systems from scratch. We realize that there will also be many situations where an
existing performance measurement system is being reviewed with the intention of making changes. We cover
those situations in Chapter 10, which discusses the uses of performance results.

The process of designing and implementing performance measurement systems uses core knowledge and skills that
are also a part of designing, conducting, and reporting program evaluations. In Chapter 8, we pointed out that
program evaluation and performance measurement share core knowledge and skills, including logic modeling and
measurement. In addition, understanding research designs and the four kinds of validity we described in Chapter
3 are valuable for understanding and working with the strengths and limitations of performance measurement
systems.

In Chapter 1, we outlined the steps that make up a typical program evaluation. In this chapter, we will do the
same for performance measurement systems, understanding that for each situation, there will be unique
circumstances that can result in differences between the checklist that follows and the process that is appropriate
for that context. Each of the 12 steps is elaborated to clarify issues and possible challenges. We distinguish
designing and implementing performance measurement systems from the uses of such systems. Usage is a critical
topic on its own, and we will elaborate on it in Chapter 10.

The Technical/Rational View and the Political/Cultural View
Designing and implementing performance measurement systems can be a significant organizational change,
particularly in public-sector organizations that have emphasized “bureaucratic” procedures instead of results.
Depending on the origins of such an initiative (external to the organization, internal, top-down, or manager-
driven), different actors and factors will be more or less important. When we design and implement performance
measurement systems that are intended to be sustainable (our view is that this is an important goal that makes
performance measurement systems worthwhile), we must go beyond normative frameworks (what “ought” to be
done) to consider the “psychological, cultural, and political implications of organizational change” (de Lancer
Julnes, 1999, p. 49). De Lancer Julnes and Holzer (2001) have distinguished a rational/technical lens and a
political/cultural lens as key to understanding the successful adoption, implementation, and use of performance
measures.

The technical/rational perspective is grounded in a view of organizations as rational means–ends systems that are
designed to achieve purposive goals. This view emphasizes the importance of systems and structures as keys to
understanding how organizations work and how to change them. With respect to performance measurement
systems, then, there are rational and technical factors to be kept in mind as they are designed and implemented.
These factors include having sufficient resources, training people appropriately, aligning management systems,
developing appropriate information systems (hardware and software), and developing valid and reliable
performance measures. It is important to have an overall plan that organizes the process, including who should be
involved at different stages, how the stages link together timing-wise, what is expected—and from whom—as each
stage is implemented, and how the overall system is expected to function once it has been implemented.

The political/cultural perspective on organizations emphasizes the people dynamics in organizations.


Organizations as political systems is one of the metaphors that Gareth Morgan (2006) includes in his seminal
book Images of Organization. This view of organizations involves understanding how people interact with and in
complex organizations. Performance management systems and structures play a role (they are part of the
institutional fabric), but individuals and coalitions can influence and even negate the results intended from these
systems. Organizational politics are an inevitable feature of organizational dynamics. Politics do not have to be
about political parties or formal political allegiances. Instead, they are essentially about the processes (both formal and
informal) that are used to allocate scarce resources among competing values. Even though there will be
organizational and program objectives, with resources being devoted to their achievement (the rational purposes of
organizations), there will also be values, interests and incentives, and coalitions of stakeholders who can influence
and facilitate design, implementation, and use of performance measurement systems. For the public sector
perspective, Deborah Stone (2012) covers this issue in Policy Paradox: The Art of Political Decision Making.

Overlaid on these two images of organizations are the wide range of environments in which organizations can be
embedded. What we will see in Chapter 10 is that some environments are more conducive to sustaining
performance measurement systems than others. Where performance measurement is focused on public reporting
in high-stakes, accountability-oriented environments, it can be challenging to construct and sustain useful
performance measurement systems. One “solution” that we will explore in Chapter 10 is to decouple the
performance measurement system that is used for (internal) performance management from the performance
measures that are used for external reporting (McDavid & Huse, 2012; Perrin, 2015; Van Dooren & Hoffman,
2018).

The 12 steps discussed in this chapter outline a process that is intended to increase the chances that a performance
measurement system will be successfully implemented and sustained. A key part of sustaining performance
measurement as an evaluative function in organizations is to use the performance information (Kroll, 2015;
Moynihan, Pandey, & Wright, 2012). In other words, there must be a demand for performance information, as
well as a supply. Supplying performance information (e.g., preparing, maintaining, and delivering performance
reports, presentations, and dashboards) where there is limited or no ongoing demand tends to undermine the
credibility of the system—lack of use is an indication that the system is not aligned with actual substantive organizational priorities.

In many situations, the conditions under which organizations undertake the development of performance
measures are less than ideal. In the summary to this chapter, we identify six steps, among the 12, that are most
critical if organizations want to focus on contributing to managerial and organizational efforts to improve
efficiency, effectiveness, and accountability. In situations where the system is intended to provide accountability-
focused measures or both improvement and accountability, there are a number of obstacles to consider. We touch
on them briefly in this chapter but explore them more fully in Chapter 10.

Fundamentally, our view is that the organizational cultural acceptance and commitment to performance
measurement and performance management systems are key to predicting whether such innovations will be
sustained in a substantive way. The 12 steps that follow are a mix of “system and technical” and “political and
cultural” considerations in designing, implementing, and sustaining performance measurement systems. Of note,
Poister, Aristigueta, and Hall (2015) offer an alternative set of steps (13 of them) that are aimed at guiding the
design and implementation of performance measurement and performance management systems. Like our
guidance, their steps are a mix of system/technical and “people-focused” considerations. On balance, however,
their approach focuses more on the design and implementation of such systems from a rational/technical
perspective compared with our approach.

Key Steps in Designing and Implementing a Performance Measurement
System
Table 9.1 summarizes 12 key steps in designing and implementing a performance measurement system. Each of
these steps can be viewed as a guideline—no single performance measurement development and implementation
process will conform to all of them. In some cases, the process may diverge from the sequence of steps. Again, this
could be due to local factors. Each of the steps in Table 9.1 is discussed more fully in the following sections. Our
discussion of the steps is intended to do two things: (1) elaborate on what is involved and (2) point out challenges
along the way. As you review the steps, you will see that most of them acknowledge the importance of both a
rational/technical and a political/cultural view of organizations. That is, beyond the technical issues, it is
important to consider the interactions among the people, incentives, history, and who wins and who loses. This
perspective is carried into Chapter 10, where we look at uses of performance measurement.

Table 9.1 Key Steps in Designing and Implementing a Performance Measurement System

1. Leadership: Identify the organizational champions of this change.

2. Understand what a performance measurement system can and cannot do and why it is needed.

3. Communication: Establish multichannel ways of communicating that facilitate top-down, bottom-up, and horizontal sharing of information, problem identification, and problem solving.

4. Clarify the expectations for the uses of the performance information that will be created.

5. Identify the resources and plan for the design, implementation, and maintenance of the performance
measurement system.

6. Take the time to understand the organizational history around similar initiatives.

7. Develop logic models for the programs or lines of business for which performance measures are being
developed.

8. Identify constructs that are intended to represent performance for aggregations of programs or the
whole organization.

9. Involve prospective users in reviewing the logic models and constructs in the proposed performance
measurement system.

10. Translate the constructs into observable measures.

11. Highlight the comparisons that can be part of the performance measurement system.

12. Report results, regularly review feedback from users, and, if needed, make changes to the performance measurement system.

One way to look at these 12 steps is to differentiate between those that are primarily technical/rational and those that are primarily political/cultural. The majority are more closely aligned with the political/cultural view of organizations: identifying the champions of this change; understanding what performance measurement systems can actually do (and not do); establishing and using communication channels; clarifying intended uses (for all the stakeholders involved); understanding the organizational history and its impacts on this change process; involving users in developing models and performance measures; and regularly reviewing and acting on user feedback. The remaining steps are more closely aligned with a technical/rational view of organizations: identifying resources; developing logic models; identifying constructs that span programs or the whole organization; identifying the comparisons that are appropriate given the intended uses; measuring constructs; and reporting performance results. One step (reporting and regularly reviewing feedback from users) straddles the boundary between the two perspectives. As we have noted, our approach emphasizes the importance of both perspectives and their complementarity in building and implementing sustainable performance measurement systems.

1. Leadership: Identify the Organizational Champions of This Change
The introduction of performance measurement, particularly measuring outcomes, is an important change in both
an organization’s way of doing business and its culture (de Lancer Julnes & Holzer, 2001). Unlike program
evaluations, performance measurement systems are ongoing, and it is therefore important that there be
organizational leaders who are champions of this change in order to provide continuing support for the process
from its inception onward. In many cases, an emphasis on measuring outcomes is a significant departure from
existing practices of tracking program inputs (money, human resources), program activities, and even program
outputs (work done). Most managers have experience measuring/recording inputs, processes, and outputs, so the
challenge in outcome-focused performance measurement is in specifying the expected outcomes (stating clear
objectives for programs, lines of business, or organizations) and facilitating organizational commitment to the
process of measuring and working with outcome-related results.

By including outcomes, performance measurement commits organizations to comparing their actual results with
stated objectives. In many jurisdictions, outcomes are parsed into annual targets, and actual outcomes are
compared with the targets for that year. Thus, performance measurement information is commonly intended
to serve multiple purposes, including enhancing managerial decision making, framing organizational alignment,
and promoting transparency and accountability (Perrin, 2015).
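
As a simple illustration of the target comparisons described above, the following Python sketch (the measures and target values are hypothetical) compares reported annual results against targets and flags shortfalls. Real reporting systems embed this logic in purpose-built software, but the underlying comparison is no more complicated than this.

    # Hypothetical annual targets and actual results for three outcome measures.
    targets = {
        "high_school_completion_rate_pct": 88.0,
        "avg_emergency_response_minutes": 9.0,    # lower is better
        "clients_housed_within_30_days_pct": 75.0,
    }
    actuals = {
        "high_school_completion_rate_pct": 86.5,
        "avg_emergency_response_minutes": 9.8,
        "clients_housed_within_30_days_pct": 78.2,
    }
    # For measures where a lower value is better, the comparison is reversed.
    lower_is_better = {"avg_emergency_response_minutes"}

    for measure, target in targets.items():
        actual = actuals[measure]
        met = actual <= target if measure in lower_is_better else actual >= target
        variance = actual - target
        status = "met" if met else "not met"
        print(f"{measure}: target {target}, actual {actual} ({variance:+.1f}) -> target {status}")

Note that the direction of improvement matters: for some measures (such as response times), a result below the target is the desired outcome, and the comparison logic has to reflect that.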

Because performance measurement systems are ongoing, it is important that the champions of this change support
the process from its inception onward; the whole process begins with leadership. Moynihan, Pandey, and Wright
(2012) suggest that leadership commitment is critical to the process and also affects performance information uses.
The nature of performance measures is that they create new information—a potential resource in public and
nonprofit organizations. Information can reduce uncertainty with respect to the questions it is intended to answer,
but the process of building performance measurement into the organization’s business can significantly increase
uncertainty for managers. The changes implied by measuring results (outcomes), reporting results, and (possibly)
being held accountable for those results can loom large as the system is being designed and implemented. If a
performance measurement system is implemented as a top-down initiative, managers may see this as a threat to
their existing practices. Some will resist this change, and if leadership commitment is not sustained, the transition
to performance measurement as a part of managing programs will wane over time (de Waal, 2003). A history of
partially implemented organizational changes will affect the likelihood of success in any new initiative. We
elaborate on this in the sixth step.

A results-oriented approach to managing has implications for public-sector accountability. In many jurisdictions,
public organizations are still expected to operate in ways that conform to process-focused notions of accountability
and within a performance reporting architecture. In Canada, for example, the Westminster parliamentary system
is intended to make the minister who heads each government department nominally accountable for all that
happens in his or her domain. The federal government has a Management, Resources, and Results Structure (MRRS) that
outlines departmental reporting requirements for expected outcomes, and “this articulation of program
architecture serves as the basis for performance monitoring, reporting, and annual strategic reviews” (Lahey &
Nielsen, 2013, p. 49). The adversarial nature of politics, combined with the tendency of the media and interest
groups to emphasize mistakes that become public (focus on controversies), can bias managerial behavior toward a
sanitized and procedurally focused approach to performance measurement system design and use, in which only “safe”
measures are introduced into the reporting system (Propper & Wilson, 2003). Navigating such environments
while working to implement performance measurement systems requires leadership that is willing to embrace
some risks, not only in developing the system but in encouraging and rewarding decisions and practices in which
performance results are used to inform decision making. We explore these issues in much greater detail in
Chapters 10 and 11 where we introduce learning cultures and link those to ways that program evaluation and
performance measurement function in organizations.

In most governmental settings, leadership at two levels is required for a performance measurement system. Senior
executives in a ministry or department must actively support the process of constructing and implementing a performance measurement system. But it is equally important that the political leadership be supportive of the
development, implementation, and use of a performance measurement system. The key intended users of
performance information that is publicly reported are the elected officials (of all the political parties) and the
public (McDavid & Huse, 2012).

In British Columbia, Canada, for example, the Budget Transparency and Accountability Act (Government of
British Columbia, 2001) specifies that service plans and annual service plan reports (ASPRs) are to be tabled in the
legislative assembly. The stated goal is to have legislators review these reports and use them as they scrutinize
ministry operations and future budgets. Each year, the public ASPRs are tabled in June and are based on the
actual results for the fiscal year ending March 31. Strategically, the reports were intended to figure in the
budgetary process for the following year, which begins in the fall. If producing and publishing these performance
reports is not coupled with scrutiny of the reports by legislators, then a key reason for committing resources to this
form of public accountability is undermined. In Chapter 10, we look at research results on the ways in which
elected officials in British Columbia actually use performance reports.

In summary, an initial organizational commitment to performance measurement, which typically includes designing the system, can produce “results” that are visible (e.g., a website with the performance measurement
framework), but implementing and working with the system over 3 to 5 years is a much better indicator of its
sustainability, and for this to happen, it is critical to have organizational champions of the process.

It is worth noting that Kotter (1995) suggests that for an organizational change to be sustained, a period of 5 to 10
years is required. Kotter’s perspective is more appropriate for a transformation of an organization’s culture—often
performance measurement systems are introduced into an existing culture without a goal of transforming it. In
Chapter 10 and Chapter 11, we discuss developing learning cultures as a way to transform the ways evaluative
information is used—changes of that magnitude may well take longer than 5 years.

2. Understand What Performance Measurement Systems Can and Cannot
Do
There are limitations to what performance measurement systems can do, yet in some jurisdictions, performance
measurement has been treated as a substitute for program evaluation (Kroll & Moynihan, 2018; Martin &
Kettner, 1996). Public-sector downsizing has diminished the resources committed to program evaluations, and
managers have been expected to initiate performance measurement instead (McDavid, 2001b; McDavid & Huse,
2012; Van Dooren & Hoffman, 2018). The emphasis on performance reporting for public accountability—and
the assumption that such reporting can drive performance improvements—is the principal reason for making performance
measurement the central evaluative approach in many organizations. We will look at this assumption in Chapter
10 when we discuss the uses of performance information when public reporting is mandated.

Performance measurement can be a powerful tool in managing programs or organizations. If the measures are
valid and the information is timely, reviewing emerging trends can help to identify possible problems (a negative-
feedback mechanism), as well as possible successes (positive feedback) (Behn, 2003). But performance
measurement results generally describe what is going on; they do not explain why it is happening (Hatry, 2013;
McDavid & Huse, 2006; Newcomer, 1997; Poister et al., 2015).

Recall the distinction between intended outcomes and actual outcomes introduced in Chapter 1. Programs are
designed to produce specified outcomes, and one way to judge the success of a program is to see whether the
intended outcomes have actually occurred. If the actual outcomes match the intended outcomes, we might be
prepared to conclude that the program was effective.

However, we cannot generally conclude that the outcomes are due to the program unless we have additional
information that supports the assumption that other factors in the environment could not have caused the
observed outcomes. (Simple program structures, for example, are typically composed of tight links between
outputs and outcomes.) Getting that information is at the core of what program evaluation is about, and it is
essential that those using performance measurement information understand this distinction. As Martin and
Kettner (1996) commented when discussing the cause-and-effect relationship that many people mistakenly
understand to be implied in performance measurement information, “Educating stakeholders about what outcome
performance measures really are, and what they are not, is an important—and little discussed—problem associated
with their use by human service programs” (p. 56).

Establishing the causal links between observed outcomes and the program that was intended to produce them is
our familiar attribution problem. Some analysts have explicitly addressed this problem for performance
measurement. Mayne (2001), through contribution analysis, offers a set of strategies intended to reduce the uncertainty about whether observed performance measurement outcomes can be attributed to the program. Briefly, his
suggestions are as follows: (1) develop an intended-results chain; (2) assess the existing research/evidence that
supports the results chain; (3) assess the alternative explanations for the observed results; (4) assemble the
performance story; (5) seek out additional evidence, if necessary; and (6) revise and strengthen the performance
story. Several of his suggestions are common to both program evaluation and performance measurement, as we
have outlined them in this book. His final (seventh) suggestion is to do a program evaluation if the performance
story is not sufficient to address the attribution question. This suggestion supports a key theme of this book—that
performance measurement and program evaluation are complementary, and each offers ways to reduce uncertainty
for managers and other stakeholders in public and nonprofit organizations (de Lancer Julnes & Steccolini, 2015;
Hatry, 2013; Kroll & Moynihan, 2018).
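
Mayne’s strategies are analytic rather than computational, but for readers who keep performance data in electronic form, the following minimal Python sketch (a hypothetical program, with invented evidence and judgments) suggests one way an analyst might organize the elements of a contribution story alongside the performance measures themselves.

    # A hypothetical, simplified record of a contribution story for one program.
    # The structure loosely mirrors Mayne's steps: a results chain, supporting
    # evidence, alternative explanations, and a judgment about the overall story.
    contribution_story = {
        "program": "Youth employment training (hypothetical)",
        "results_chain": [
            "training sessions delivered",          # output
            "participants gain job-search skills",  # immediate outcome
            "participants find employment",         # intermediate outcome
        ],
        "supporting_evidence": [
            "85% of participants completed training (administrative data)",
            "Follow-up survey: 60% employed within 6 months",
        ],
        "alternative_explanations": [
            "Improving regional labour market",
            "Parallel wage-subsidy program serving the same client group",
        ],
        "story_sufficient": False,  # if False, a program evaluation may be warranted
    }

    if not contribution_story["story_sufficient"]:
        print("Attribution remains uncertain; consider commissioning a program evaluation.")

The design choice here is simply to keep the performance story, the evidence, and the rival explanations together in one place, so that the decision to commission a full evaluation is an explicit, recorded judgment rather than an afterthought.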

3. Communication: Establish Multi-Channel Ways of Communicating That
Facilitate Top-Down, Bottom-Up, and Horizontal Sharing of Information,
Problem Identification, and Problem Solving
One pattern for developing a performance measurement system is to begin informally. Managers who want to
obtain information that they can use formatively to monitor their programs can take the lead in developing their
own measures and procedures for gathering and using the data. This bottom-up process is one that encourages a
sense of ownership of the system. In the British Columbia provincial government, this manager-driven process
spanned the period roughly from 1995 to 2000 (McDavid, 2001a). Some departments made more progress than
others, in part because some department heads were more supportive of this process than others. Because they
were driven by internal performance measurement needs, the systems that developed were adapted to local
organizational needs.

To support this evolutionary bottom-up process in the British Columbia government, the Treasury Board Staff (a
central agency responsible for budget analysis and program approval) hosted meetings of an informal network of
government practitioners who had an interest in performance measurement and performance improvement. The
Performance Measurement Resource Team held monthly meetings that included speakers from ministries and
outside agencies who provided information on their problems and solutions. Attendance and contributions were
voluntary. Horizontal information sharing was the principal purpose of the sessions.

When the Budget Transparency and Accountability Act (Government of British Columbia, 2001) was passed,
mandating performance measurement and public reporting government-wide, the stakes changed dramatically.
Performance measurement systems that had been intended for manager-centered formative uses were now exposed
to the requirement that a (strategic) selection of the performance results would be made public in an annual report
to the legislative assembly. This top-down directive to report performance for summative purposes needed to be
meshed with the bottom-up (formative) cultures that had been developed in some ministries.

A Case Study of How One Public-Sector Organization Navigated Conflicting Expectations for Performance Measurement:
Choosing Performance Measures for Public Reporting in a Provincial Ministry

Prior to 2001, middle managers in the ministry had developed performance measures that they used to monitor their own programs
(mostly focused on supporting colleges and universities in the province). When the government passed legislation in 2001 requiring
public performance reporting, the ministry executive, after meetings with managers who had expressed their concerns about having their
monitoring data made public, hosted a retreat of all middle and senior managers in the ministry. The goal was to come up with a list of
performance measures that would serve both public reporting and program monitoring purposes. Over the course of the day, breakout
sessions canvassed possible measures. A plenary session, aimed at wrapping this all up, quickly surfaced the fact that the breakout tables
had come up with long lists of possible measures. Managers insisted on having their own programs reflected in the organization-level
measures. Several openly said that unless their programs were “there” they would not be a priority in the future—“what gets measured
matters.”

The day ended without an agreement on a set of “strategic” measures. Guidelines from the government had specified at most a dozen
performance measures for the whole ministry, and when all the measures at the tables were put together, the total was closer to 100.

After discussions in the ministry, the executive suggested a two-tiered solution—strategic performance measures for public reporting and
internal measures for program managers. This compromise was provisional, but it opened a way forward. It took more time before
program managers trusted that their programs did not have to be directly reflected in the organizational performance measures. Over
time, it became apparent that the public performance reports were not being used by legislators to make decisions; they became ways of
demonstrating (arguably symbolic) accountability and providing positive publicity for the ministry. Managers had weathered a storm and
could continue to develop and use their own measures internally.
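
One simple way to operationalize the two-tiered compromise described in this case is to tag each measure with its intended reporting audience. The sketch below in Python is our own illustration (the measure names are invented and do not come from the ministry’s actual system); it simply separates a strategic, publicly reported subset from the larger pool of internal monitoring measures.

    # Hypothetical measures tagged by reporting tier: "public" measures feed the
    # annual public performance report; "internal" measures support program monitoring.
    measures = [
        {"name": "post_secondary_participation_rate", "tier": "public"},
        {"name": "graduate_employment_rate",          "tier": "public"},
        {"name": "program_application_backlog",       "tier": "internal"},
        {"name": "avg_days_to_process_student_grant", "tier": "internal"},
    ]

    public_report = [m["name"] for m in measures if m["tier"] == "public"]
    internal_dashboard = [m["name"] for m in measures if m["tier"] == "internal"]

    print("Public report measures:", public_report)
    print("Internal monitoring measures:", internal_dashboard)

Tagging measures in this way keeps the strategic public set small, as central guidelines typically require, while preserving the fuller set of measures that program managers rely on internally.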

Generally, public organizations that undertake the design and implementation of performance measurement
systems that are intended to be used internally must include the intended users (Kravchuk & Schack, 1996), the
organizational leaders of this initiative, and the methodologists (Thor, 2000). Top-down communications can
serve to clarify direction, offer a framework and timelines for the process, clarify what resources will be available,
and affirm the importance of this initiative. Bottom-up communications can question or seek clarification of definitions, timelines, resources, and direction. Horizontal communications can provide examples, share problem
solutions, and offer informal support.

The communications process outlined here exemplifies a culture that increases the likelihood that performance
management will take hold and be sustainable. Key to developing a sustainable performance management culture
is treating information as a resource, being willing to “speak truth to power” (Wildavsky, 1979), and not treating
performance information as a political weapon. Kravchuk and Schack (1996) suggest that the most appropriate
metaphor to build a performance culture is the learning organization. This construct was introduced by Senge
(1990) and continues to be a goal for public organizations that have committed to performance measurement as
part of a broader performance management framework (Agocs & Brunet-Jailly, 2010; Mayne, 2008; Mayne &
Rist, 2006). We discuss this view of organizations in Chapters 10 and 11.

4. Clarify the Expectations for the Intended Uses of the Performance
Information That Is Created
Developing performance measures is intended, in part, to improve performance by providing managers and other
stakeholders with information they can use to monitor and make adjustments to program processes. Having “real-
time” information on how programs are tracking is often viewed by managers as an asset and is an incentive to get
involved in constructing and implementing a performance measurement system. Managerial involvement in
performance measurement is an expectation and is linked to how successful performance measurement systems are
(Jakobsen, Baekgaard, Moynihan, & van Loon, 2017). To attract the buy-in that is essential for successful design
and implementation of performance measurement systems, we believe that performance measurement needs to be
used first and foremost for internal performance improvement. Public reporting can be a part of the process of
using performance measurement data, but it should not be the primary reason for developing a performance
measurement system (Hildebrand & McDavid, 2011; Jakobsen et al., 2017). A robust performance measurement
system should support using information to inform improvements to programs and/or the organization. It should
help identify areas where activities are most effective in producing intended outcomes and areas where
improvement could be made (de Waal, 2003; Laihonen & Mantyla, 2017).

Designing and implementing a performance measurement system primarily for public accountability usually entails
public reporting of performance results, and in jurisdictions where performance results can be used to criticize
elected officials or bureaucrats, there are incentives to limit reporting of anything that would reflect negatively on
the government of the day. Richard Prebble, a long-time political leader in New Zealand, outlines Andrew
Ladley’s “Iron Rule of the Political Contest”:

The opposition is intent on replacing the government.
The government is intent on remaining in power.
MPs want to get re-elected.
Party leadership is dependent on retaining the confidence of colleagues (which is shaped by the first three principles). (Prebble, 2010, p. 3)

In terms of performance measures to be reported publicly, this highlights that organizational performance
information will not only be used to review performance but will also likely be mined for details that can be used to
embarrass the government.

In Chapter 10, we will look at the issues involved in using performance measurement systems to contribute to
public accountability. Understanding and balancing the incentives for participants in this process is one of the
significant challenges for the leaders of an organization. As we mentioned earlier, developing and then using a
performance measurement system can create uncertainty for those whose programs are being assessed. They will
want to know how the information that is produced will affect them, both positively and negatively. It is essential
that the leaders of this process be forthcoming about the intended uses of the measurement system if they expect
buy-in.

If a system is designed for formative program improvement purposes, using it for summative purposes will change
the incentives for those involved (Van Dooren & Hoffman, 2018). Sustaining the internal uses of performance
information will involve meaningful engagement with those who have contributed to the (earlier) formative
process. Changing the purposes of a performance measurement system affects the likelihood that gaming will
occur as data are collected and reported (Pollitt, 2007; Pollitt, Bal, Jerak-Zuiderent, Dowswell, & Harrison, 2010;
Propper & Wilson, 2003). In Chapter 10, we will discuss gaming as an unintended response to incentives in
performance measurement systems.

Some organizations begin the design and implementation process by making explicit the intention that the
measurement results will only be used formatively for a 3- to 5-year period of time, for example. That can generate
the kind of buy-in that is required to develop meaningful measures and convince participants that the process is actually useful to them. Then, as the uses of the information are broadened to include external reporting, it may
be more likely that managers will see the value of a system that has both formative and summative purposes.

Pollitt et al. (2010) offer us a cautionary example, from the British health services, of the transformation of the
intended uses of performance information. Their example suggests that performance measurement systems that
begin with formative intentions tend, over time, to migrate to summative uses.

In the early 1980s in Britain, there were broad government concerns with hospital efficiency that prompted the
then Conservative government to initiate a system-wide performance measurement process. Right from the start,
the messages that managers and executives were given were ambiguous. Pollitt et al. (2010) note that

despite the ostensible connection to government aims to increase central control over the NHS, the
Minister who announced the new package described PIs [performance indicators] in formative terms.
Local managers were to be equipped to make comparisons, and the stress was on using them to trigger
inquiry rather than as answers in themselves, a message that was subsequently repeated throughout the
1980s. (p. 17)

However, by the early 1990s, the “formative” performance results were being reported publicly, and comparisons
among health districts (health trusts) were a central part of this transition. “League tables,” in which districts were
compared across a set of performance measures, marked the transition from formative to summative uses of the
performance information. By the late 1990s, league tables had evolved into a “star rating system,” in which
districts could earn up to three stars for their performance. The Healthcare Commission, a government oversight
and audit agency, conducted and published the ratings and rankings. Pollitt et al. (2010) summarize the transition
from a formative to a summative performance measurement system thus:

In more general terms, the move from formative to summative may be thought of as the result of PIs
[performance indicators] constituting a standing temptation to executive politicians and top managers.
Even if the PIs were originally installed on an explicitly formative basis (as in the UK), they constitute a
body of information which, when things (inevitably) go wrong, can be seized upon as a new means of
control and direction. (p. 21)

This change brought with it different incentives for those involved and ushered in an ongoing dynamic in which
managerial responses to performance-related requirements included gaming the measures—that is, manipulating
activities and/or the information to enhance performance ratings and reduce poor performance results in ways that
were not intended by the designers of the system. Recent articles have explored the difficult balancing act of trying
to use the same measures for multiple purposes (de Lancer Julnes & Steccolini, 2015; Van Dooren & Hoffman,
2018). This issue will be explored in greater detail in the next chapter.

5. Identify the Resources and Plan for the Design, Implementation, and
Maintenance of the Performance Measurement System
Aside from the 12-step strategic process we are outlining in this chapter, it is important to plan in detail the
project that comprises the design and implementation of a performance measurement system. How to do that?
Poister et al. (2015) suggest that a project management lens (using appropriate software) is helpful for “scheduling
work, assigning responsibilities, and tracking progress” (p. 425). They suggest a 1- to 2-year time frame is not
unusual to design and implement a new system. Our view is that ensuring sustainability takes longer than that.

Organizations planning performance measurement systems often face substantial resource constraints. One of the
reasons for embracing performance measurement is to do a better job of managing the (scarce) available resources.
If a performance measurement system is mandated by external stakeholders (e.g., a central agency, an audit office,
or a board of directors), there may be considerable pressure to plunge in without fully planning the design and
implementation phases.

Often, organizations that are implementing performance measurement systems are expecting to achieve efficiency
gains, as well as improved effectiveness. Downsizing may have already occurred, and performance measurement is
expected to occur within existing budgets. Those involved may have the expectation that this work can be added
onto the existing workload of managers—they are clearly important stakeholders and logically should be in the
best position to suggest or validate proposed measures. Under such conditions, the development work may be
assigned to an ad hoc committee of managers, analysts, co-op or intern students, other temporary employees, or
consultants.

Identifying possible performance measures is usually iterative, time-consuming work, but it is only a part of the
process. Full implementation requires both outputs (identifying data that correspond to the performance
constructs and collecting data for the measures, preparing reports and briefings, a website, progress reports,
testimonials by participants in the process) and outcomes (actually using performance results on a continuing basis
to improve the programs in the organization).

Although a “one-shot” infusion of resources can be very useful as a way to get the process started, it is not
sufficient to sustain the system. Measuring and reporting performance takes ongoing commitments of resources,
including the time of persons in the organization.

Training for staff who will be involved in the design and implementation of the performance measures is
important. On the face of it, a minimalist approach to measuring performance is straightforward. “Important”
measures are selected, perhaps by an ad hoc committee; data are marshaled for those measures; and the required
reports are produced. But a commitment to designing and implementing a performance measurement system that
is useful and sustainable requires an understanding of the process of connecting performance measurement to
managing with performance data (Kates, Marconi, & Mannle, 2001; Kroll & Moynihan, 2018).

In some jurisdictions, the creation of legislative mandates for public performance reporting has resulted in
organizational responses that meet the legislative requirements but do not build the capacity to sustain
performance measurement. Performance measurement is intended to be a means rather than an end in itself.
Unless the organization is committed to using the information to manage performance, it is unlikely that
performance measurement will be well integrated into the operations of the organization.

In situations where there are financial barriers to validly measuring outcomes, it is common for performance
measures to focus on outputs. In many organizations, outputs are easier to measure, and the data are more readily
available. Also, managers are often more willing to have output data reported publicly because outputs are easier to
attribute to a program. Some performance measurement systems have focused on outputs from their inception.
The best example of that approach has been in New Zealand, where public departments and agencies negotiate
output-focused contracts with the New Zealand Treasury (Gill, 2011; Hughes & Smart, 2018). However,
although outputs are important as a way to report work done, they cannot be entirely substituted for outcomes; the assumption that if outputs are produced, outcomes must have been produced is usually not defensible (see the
discussion of measurement validity vs. the validity of causes and effects in Chapter 4).

6. Take the Time to Understand the Organizational History Around Similar
Initiatives
Performance measurement is not new. In Chapter 8, we learned that in the United States, local governments
began measuring the performance of services in the first years of the 20th century (Williams, 2003). Since then,
there have been several waves of governmental reforms that have included measuring results.

In most public organizations, current efforts to develop performance measures come on top of other, previous
attempts to improve the efficiency and effectiveness of their operations. Public management as a field is replete
with innovations that often reflect assumptions/principles or practices that may or may not be substantively
evidence-based. New Public Management, which has had a 30-plus-year run, is based substantially on
microeconomic theory (public choice theory), metaphors (government should be business-like), and the
experiences of practitioners who have become consultants. Although there has been critical scholarship assessing New Public Management, only recently has systematic research begun to critically examine the core assumptions of this reform movement (Jakobsen et al., 2017; Kroll, 2015; Kroll & Moynihan, 2018).

Managers who have been a part of previous change efforts, particularly unsuccessful ones, have experience that will
affect their willingness to support current efforts to establish a system to measure performance. It is important to
understand the organizational memory of past efforts to make changes, and to gain an understanding of why
previous efforts to make changes have or have not succeeded. The organizational memory around these changes is
as important as a dispassionate view, in that participants’ beliefs are the reality that the current change will first
need to address.

Long-term employees will often have an in-depth understanding of the organization and its history. In
organizations that have a history of successful change initiatives, losing the people who were involved, through
retirements or downsizing, can be a liability when designing and implementing a performance measurement
system. Their participation in the past may have been important in successfully implementing change initiatives.
On the other hand, if an organization has a history of questionable success in implementing change initiatives,
organizational turnover may be an asset.

7. Develop Logic Models for the Programs for Which Performance Measures
Are Being Designed and Identify the Key Constructs to Be Measured
In Chapter 2, we discussed logic models as a way to make explicit the intended cause-and-effect linkages in a
program or even an organization. We discussed several different styles of logic models and pointed out that
selecting a logic modeling approach depends, in part, on how explicit one wants to be about intended cause-and-
effect linkages. A key requirement of logic modeling that explicates causes and effects is the presentation of which
outputs are connected to which outcomes.

Key to constructing and validating logic models with stakeholders is identifying and stating clear objectives,
including outputs and outcomes, for programs (Kravchuk & Schack, 1996). Although this requirement might
seem straightforward, it is one of the more challenging aspects of the logic modeling process. Often, program or
organizational objectives are put together to satisfy the expectations of stakeholders, who may not agree among
themselves about what a program is expected to accomplish. One way these differences are sometimes resolved is
to construct objectives that are general enough to appear to meet competing expectations. Although this solution
is expedient, it complicates the process of measuring performance.

Criteria for sound program objectives were discussed in Chapter 1. Briefly, objectives should state an expected
change or improvement if the program works (e.g., reducing the number of citizen complaints against police
officers), an expected magnitude of change (e.g., reducing the number of complaints by 20%), a target
audience/population (e.g., reducing the number of complaints against police officers in Minneapolis, Minnesota,
by 20%), and a time frame for achieving the intended result (e.g., reducing the number of complaints by 20% in
Minneapolis in 2 years).
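
To make these components concrete, here is a minimal sketch (in Python) of how a well-specified objective could be represented as structured data. The field names are illustrative assumptions, and the example values simply restate the Minneapolis illustration above; they are not drawn from any particular system.

from dataclasses import dataclass

@dataclass
class ProgramObjective:
    """One program objective, decomposed into the components discussed above."""
    construct: str         # what is expected to change (e.g., citizen complaints)
    direction: str         # expected direction of change
    magnitude: float       # expected size of the change, as a proportion
    population: str        # target audience/population
    timeframe_years: int   # time frame for achieving the intended result

# The running example from the text, expressed as structured data.
complaints_objective = ProgramObjective(
    construct="citizen complaints against police officers",
    direction="reduce",
    magnitude=0.20,
    population="police officers in Minneapolis, Minnesota",
    timeframe_years=2,
)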

In an ideal performance measurement system, both costs and results data are available and can be compared, but
for many programs, this kind of linkage throughout the program is a prohibitively information-intensive
proposition. An important driver behind the movement to develop planning, programming, and budgeting
systems (PPBS) in the 1960s was the expectation that cost–effectiveness ratios could be constructed, but the
initiative was not sustainable.

Performance measures can be designed to be useful in activity-based cost accounting (Arnaboldi & Lapsley, 2005;
Innes, Mitchell, & Sinclair, 2000; Vazakidis, Karaginnis, & Tsialta, 2010), which is one strategy suited for input-
output/outcome comparisons. This approach identifies program activities as the main unit of analysis in the
accounting system (for example, training long-term unemployed workers to be job-ready). If the full costs (both
operating and overhead) of this program activity can be calculated, then it is possible to calculate the cost per
person trained and even the cost per worker who is successful in securing employment. Many public-sector
organizations now have accounting systems that permit managers to cost out programs, although activity-based
accounting systems are not yet widespread. Information systems are more flexible than in the past, and the
budgetary and expenditure data are more complete, but it is still very difficult to design performance measurement
systems that link directly to budget decision making (Shaw, 2016).
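
As a minimal sketch of the unit-cost arithmetic described above, the calculation might look like the following. The dollar figures and counts are hypothetical and serve only to illustrate the cost-per-person-trained and cost-per-successful-placement ratios.

# Hypothetical full cost of the training activity (operating plus overhead)
operating_costs = 400_000.0
overhead_costs = 100_000.0
full_cost = operating_costs + overhead_costs

persons_trained = 250    # output: trainees who completed the program
persons_employed = 150   # outcome: trainees who secured employment afterward

cost_per_person_trained = full_cost / persons_trained          # 2,000.00
cost_per_successful_placement = full_cost / persons_employed   # about 3,333.33

print(f"Cost per person trained: ${cost_per_person_trained:,.2f}")
print(f"Cost per worker securing employment: ${cost_per_successful_placement:,.2f}")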

So, while there are certainly challenges to creating a performance measurement system where we can directly link
costs to a full, measurable suite of a program’s outcomes, logic models are useful as a means of identifying
constructs that are the most useful candidates for performance measurement (Funnel & Rogers, 2011). Although,
in some sense, logic models do constrain us in that they assume that programs exist in open systems that are stable
enough to be depicted as a static model, they do help us create conceptual boundaries that identify the critical
activities, outputs, and outcomes that help an organization’s decision makers track and adjust its actions. The
open systems metaphor also invites us to identify environmental factors that could affect the program, including
those that affect our outcome constructs. Although some performance measurement systems do not measure
factors that are external to the program or organization, it is worthwhile including such constructs as candidates
for measurement. Measuring these environmental factors (or at least accounting for their influences qualitatively)
allows us to begin addressing attribution questions, if only judgmentally. For example, a program or policy to increase the supply of low-income rental housing may also benefit from tracking local measures of
homelessness.

Deciding what should be included in a logic model and which types of performance measures are most
appropriate for a given program or organization can be assisted by some analytic guideposts. Some of the
guideposts will be provided by the managers in the organization, but the professional judgment of the evaluator
can help provide a framework that narrows the field of possibilities. For example, Wilson (1989) has suggested that the complexity of the program and the level of turbulence of the environment influence the options for which measures of outputs and outcomes can be developed. The environment of the program includes
the political system in which the program or organization is embedded. Table 9.2 adapts his approach to suggest a
typology describing the challenges and opportunities for measuring outputs and outcomes in different types of
organizations. Coping organizations (in which work tasks are fluid and complex and results are not visible—e.g.,
central government policy units), where both program characteristics and environments combine to limit
performance measurement development, will possibly have the most difficulty in defining measurable outputs and
outcomes. Production organizations (with simple, repetitive tasks, the results of which are visible and countable
—highway maintenance would be an example) are the most likely to be able to build performance measurement
systems that include outputs and outcomes. Craft organizations rely on applying mixes of professional knowledge
and skills to unique tasks to produce visible outcomes in a fairly stable environment—a public audit office would
be an example. Procedural organizations rely on processes to produce outputs that are visible and countable but
produce outcomes that are less visible—military organizations are an example. Thus, for example, craft and
procedural organizations differ in their capacities to develop output measures (procedural organizations can do this
more readily) and outcome measures (craft organizations can do this more readily).

Table 9.2 Measuring Outputs and Outcomes: Influences of Program Complexity and Turbulence of Organizational Context

(Rows: Can outputs be measured in a valid and reliable way? Columns: Can outcomes be measured in a valid and reliable way?)

Outputs: No; Outcomes: No. Coping organizations—the environment is turbulent, and programs are complex.

Outputs: No; Outcomes: Yes. Craft organizations—the environments are stable, and programs are complicated.

Outputs: Yes; Outcomes: No. Procedural organizations—the environment is turbulent, and programs are complicated.

Outputs: Yes; Outcomes: Yes. Production organizations—the environments are stable, and programs are simple.

Source: Adapted from Wilson (1989).
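
The typology in Table 9.2 can also be read as a simple decision rule. The short sketch below is hypothetical and for illustration only; it classifies an organization by the two questions that define the table's rows and columns, with the categories following Wilson (1989) as adapted above.

def wilson_type(outputs_measurable: bool, outcomes_measurable: bool) -> str:
    """Classify an organization using the two questions from Table 9.2."""
    if outputs_measurable and outcomes_measurable:
        return "production"
    if outputs_measurable:
        return "procedural"
    if outcomes_measurable:
        return "craft"
    return "coping"

# Examples drawn from the text
print(wilson_type(True, True))     # highway maintenance -> production
print(wilson_type(True, False))    # military organizations -> procedural
print(wilson_type(False, True))    # public audit office -> craft
print(wilson_type(False, False))   # central policy units -> coping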

8. Identify Constructs Beyond Those in Single Programs: Consider Programs
Within Their Place in the Organizational Structure
Program logic models are mainly intended to illuminate the structure of individual programs (Poister et al., 2015).
Organizational diagrams or logic models, however, can be quite general for large-scale organizations that consist of
a wide range of programs. In addition, as organizations get bigger, their structures, functions, and internal
dynamics become complex. In Chapter 2, we mentioned complexity as an emerging challenge for the field of
evaluation, and in this section of Chapter 9, we will suggest three different approaches to addressing organizational
complexity in public-sector and nonprofit organizations when building organizational performance measurement
systems.

Organizational logic models can be seen as an extension of program logic models, but because they typically
focus on a higher-level view of programs or business lines, the constructs will be more general. Some jurisdictions
use organizational logic models that depict the high-level intended linkages between strategic outcomes and
programs. These larger frameworks can help set some parameters and priorities for the program’s performance
measurement system.

Figure 9A.1 in Appendix A is a high-level logic model for the Canadian Heritage Department in the Canadian
federal government. It shows the Departmental Results Framework as a high-level program structure. All federal
departments and agencies in Canada are required to develop and periodically update their Departmental Results
Framework that summarizes departmental objectives/outcomes and how those are intended to be achieved
through the program structure (Treasury Board of Canada Secretariat, 2016b). The Canadian Heritage
Department model does not explicitly address the complexity that is inherent at that level. Instead, linearity is
“imposed” on the organizational structure. This model then becomes the backdrop for program-level logic models
that are more focused and are the mainstay in federal departmental program evaluations and in performance
measurement systems.

A second approach to parsing complexity is illustrated by the government of Alberta, a national leader in Canada
in strategic performance management. The government publishes an annual report called Measuring Up, which
describes and illustrates performance trends over the previous several years (Government of Alberta, 2017).
Included in the 2016–2017 report are summaries of five government-wide strategic priorities and 41 performance
measures. In effect, this approach relies on the face validity of the performance measures in the framework—the
question being whether they are valid measures of the strategic priorities. No overall performance
framework has been specified (logic model or scorecard), but instead, the strategic objectives for the whole
government have been listed and, for each one, performance measures included. Again, complexity is set aside in
preference to specifying an overarching (rational) objective-focused structure that is featured as the centerpiece of
the government performance framework. Such a structure is clearly purposive but does not try to show (in any
visual model) how government programs or administrative departments contribute to strategic objectives.

As a third example, some organizations prefer to use the balanced scorecard (Kaplan & Norton, 1996) to help
identify performance constructs. It was originally designed for the private sector, but some have adapted it to the
public sector (e.g., Moullin, 2016). This approach typically uses measures for target-setting and benchmarking, an
accountability-focused carrot-and-stick system that is not without its detractors (Nielsen, Lund, & Thomsen,
2017; Tillema, 2010). Setting targets can become a contentious process. If the salaries of senior managers are
linked to achieving targets (one incentive that is recommended to fully implement performance management
systems), there will be pressure to ensure that the targets are achievable. If reporting targets and achievements is
part of an adversarial political culture, there will again be pressure to make targets conservative (Davies &
Warman, 1998). Norman (2001) has suggested that performance measurement systems can result in
underperformance for these reasons. Hood (2006) points to the ratchet effect (a tendency for performance targets
to be lowered over time as agencies fail to meet them) as a problem for public-sector performance measurement in
Britain.

However, it is possible to design and use a performance measurement system using benchmarking for learning
rather than coercive purposes (Buckmaster & Mouritsen, 2017). The balanced scorecard approach includes a
general model of key organizational-level constructs that are intended to be linked to a central vision and strategy.
Typically, the balanced scorecard approach includes clusters of performance measures for four different
dimensions: (1) organizational learning and growth; (2) internal business processes; (3) customers; and (4) the
financial perspective. Under each dimension are objectives, measures, targets, and initiatives.

Overall, this approach suggests that the mission (a statement of strategic purposes) functions as an analogue to
longer term outcomes that one would expect to find in a logic model. By assuming the existence of the four
dimensions or their analogues for all organizations, it is possible to focus on performance measures for each
dimension. Complexity is taken into account by showing all four dimensions being linked to each other with
double-headed arrows, although strategy maps often depict linkages with one-way arrows. The balanced scorecard
is scalable and relies heavily on the intuitive appeal of the four dimensions and the face validity of performance
measures that are developed.
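
As an illustration of how the four dimensions might be organized in practice, the following sketch lays out a balanced scorecard as a simple data structure. Every objective, measure, target, and initiative shown is a hypothetical placeholder, not a recommended set.

scorecard = {
    "learning_and_growth": {
        "objectives": ["Build staff analytic capacity"],
        "measures": ["Share of staff completing data-literacy training"],
        "targets": [0.80],
        "initiatives": ["Annual training program"],
    },
    "internal_business_processes": {
        "objectives": ["Shorten application processing"],
        "measures": ["Median days to process an application"],
        "targets": [10],
        "initiatives": ["Process redesign project"],
    },
    "customers": {
        "objectives": ["Improve client satisfaction"],
        "measures": ["Share of clients rating service good or very good"],
        "targets": [0.85],
        "initiatives": ["Client feedback surveys"],
    },
    "financial": {
        "objectives": ["Deliver services within budget"],
        "measures": ["Actual versus budgeted expenditures"],
        "targets": [1.00],
        "initiatives": ["Quarterly budget reviews"],
    },
}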

To summarize, organizations often address environmental complexity only indirectly when building and implementing performance measurement systems. Although complexity has been recognized as a challenge for evaluation approaches, given the increasingly joined-up and distributed nature of programs and policies in both the public and nonprofit sectors, this recognition has not substantially translated into the ways performance measures are developed. In terms of outputs and outcomes, performance measurement continues to address
the what but not the why. In reality, we need to recognize that performance measurement frameworks, particularly
at an organizational level, are heuristics—useful but probably not indicative of causal linkages.

We will mention one more challenge for organizational performance measurement—the growing appreciation of
the “wickedness” of policy and program problems and the resulting implications for accountability-focused
performance measurement systems. Programs aimed at addressing wicked problems (Head & Alford, 2015)
typically cannot be assigned to one administrative department or even one government. An example is
homelessness. A social services department might have a mandate to provide funds to nonprofit organizations or
even private-sector developers to build housing for the homeless. Housing is costly, and states or provinces may be
reluctant to undertake such initiatives on their own. The nature of homelessness, with its high incidence of physical health problems, mental health challenges, and drug dependencies, means that housing the homeless,
even if funding and land to construct housing can be marshaled, is just part of a more comprehensive suite of
programs needed to address the complex causes of homelessness that individuals typically present (Aubry, Nelson,
& Tsemberis, 2015). Homelessness transcends government departmental boundaries and even levels of
government, involving local, state/provincial, and federal governments. Effectively addressing this kind of problem
requires collaboration among agencies and governments (and nonprofit and private-sector organizations) that cross
existing organizational and functional boundaries.

Horizontal/vertical initiatives like ones to address homelessness present challenges for measuring performance,
particularly where there is an expectation that reporting results will be part of being publicly accountable (Bakvis
& Juillet, 2004; Perrin, 2015). Developing performance measures involves a sharing of responsibility and
accountability for the overall program objectives. If permitted to focus simply on the objectives of each
government department or level of government during the design of the system, each contributor would have a
tendency to select objectives that are conservative—that is, do not commit the department to being responsible for
overall program or strategy outcomes. In particular, if legislation has been passed that emphasizes departments
being individually accountable (part of our Western administrative cultures), then broader objectives that are
multi-sectoral or multi-level in nature, such as climate change, may well be under-addressed.

A similar problem arises for many nonprofit organizations. In Canada and the United States, many funding
organizations (e.g., governments, private foundations, the United Way) are opting for a performance-based
approach to their relationship with organizations that deliver programs and services. Increasingly, funders expect
periodic results-focused performance information as a condition for grants funding, contracts, and particularly
renewals. Governments that have opted for contractual relationships with nonprofit service providers are
developing performance contracting requirements that specify deliverables and often tie funding to the provision of evidence that these results have been achieved (Bish & McDavid, 1988; Prentice & Brudney, 2015).

Nonprofit organizations are often quite small and are typically dedicated to the amelioration of a community
problem or issue that has attracted the commitment of members and volunteers. Being required to bid for
contracts and account for the performance results of the money they have received is added onto existing
administrative requirements, and many of these organizations have limited capacity to do these additional tasks.
Campbell (2002) has pointed out that in settings where targeted outcomes span several nonprofit providers, it
would be beneficial to have some collaboration among funders and for providers to agree on ways of directly
addressing the desired outcomes. If providers compete and funders continue to address parts of a problem, the
same inter-sectoral disregard that was suggested for government departments will happen in the nonprofit sector.

9. Involve Prospective Users in Development of Logic Models and
Constructs in the Proposed Performance Measurement System
Developing logic models of programs is an iterative process. Although the end product is meant to model the
programmatic and intended causal reasoning that transforms resources into results, it is essential that logic models
be constructed and validated with organizational participants and other stakeholders. Involvement at this stage of
the development process will validate key constructs for prospective users and set the agenda for developing
performance measures. Program managers, in particular, will have an important stake in the system. Their
participation in validating the logic models increases the likelihood that performance measurement results will be
useful for program improvements. Recall that Chapter 2 provides much more detail on the development and uses
of logic models.

Depending on the purposes of the performance measurement process, some constructs will be more important
than others. For example, if a logic model for a job training and placement program operated by a community
nonprofit organization has identified the number of persons who complete the training as an output and the number
who are employed full-time 1 year after the program as an outcome, the program managers would likely emphasize
the output as a valid measure of program performance—in part because they have more control over that
construct. But the funders might want to focus on the permanent employment results because that is really what
the program is intended to do.

By specifying the intended causal linkages, it is possible to review the relative placement of constructs in the model
and clarify which ones will be a priority for measurement. In our example, managers might be more interested in
training program completions since they are necessary for any other intended results to occur. Depending on the
clients, getting persons to actually complete the program could be a major challenge in itself. If the performance
measurement system is intended to be summative as well, then measuring the permanent employment status of
program participants would be important—although there would be a question of whether the program produced
the observed employment results.

If a performance measurement system is going to be designed and implemented as a public accountability initiative that is high stakes—that is, has resource-related consequences for those organizational units being
measured and compared—then the performance measures chosen should be ones that would be difficult to
“game” by those who are being held accountable. Furthermore, it may be necessary to periodically audit the
performance information to assess its reliability and validity (Bevan & Hamblin, 2009). Some jurisdictions—New
Zealand, for example—regularly audit the public performance reports that are produced by all departments and
agencies (Auditor General of New Zealand, 2017).

10. Translate the Constructs Into Observable Performance Measures That
Compose the Performance Measurement System
We learned in Chapter 4 that the process of translating constructs into observables involves measurement. For
performance measurement, secondary data sources are the principal means of measuring constructs. Because these
data sources already exist, their use is generally seen to be cost-effective. There are several issues that must be kept
in mind when using secondary data sources:

Can the existing data (usually kept by organizations) be adapted to measure constructs in the performance
measurement system? In many performance measurement situations, the challenge is to adapt what exists,
particularly data readily available via information systems, to what is needed to translate performance
constructs into reliable and valid measures. Often, existing data have been collected for purposes that are not
related to measuring and reporting on performance. Using these data raises validity questions. Do they really
measure what the performance measurement designers say that they measure? Or do they distort or bias the
performance construct so that the data are not credible? For example, measuring changes in employee job
satisfaction by counting the number of sick days taken by workers over time could be misleading. Changes
in the number of sick days could be due to a wide range of factors, making it a questionable measure of job
satisfaction.
Do existing data sources sufficiently cover the constructs that need to be measured? The issue here is
whether our intended performance measures are matched by what we can get our hands on in terms of
existing data sources. In the language we introduced in Chapter 4, this is a content validity issue.
A separate but related issue is whether existing data sources permit us to triangulate our measurements of
key constructs. In other words, can we measure a given construct in two or more independent ways, ideally
with different methodologies? Generally, triangulation increases confidence that the measures are valid.
Can existing data sources be manipulated by stakeholders if those data are included in a performance
measurement system? Managers and other organizational members generally respond to incentives (Le
Grand, 2010; Spekle & Verbeeten, 2014). If a performance measure becomes the focus of summative
program or service assessments and if the data for that measure are collected by organizational participants,
it is possible that the data will be manipulated to indicate “improved” performance (Bevan & Hamblin,
2009; Otley, 2003).

An example of this type of situation from policing was an experiment in Orange County, California, to link officer
salary increases in the police department to reduced reporting rates for certain kinds of crimes (Staudohar, 1975).
The agreement between the police union and management specified thresholds for the percentage reductions in
four types of crimes and the associated magnitudes of salary increases.

The experiment “succeeded.” Crime rates in the four targeted crimes decreased just enough to maximize wage
increases. Correspondingly, crime rates increased for several related types of crimes. A concern in this case is
whether the crime classification system may have been manipulated by participants in the experiment, given the
incentive to “reduce” crimes in order to maximize salary increases.

If primary data sources (those designed specifically for the performance measurement system) are being used,
several issues should be kept in mind:

Are there ongoing resources to enable collecting, coding, and reporting data? If not, then situations can
develop where the initial infusion of ad hoc resources to get the system started may include funding to
collect initial outcomes data (e.g., to conduct a client survey), but beyond this point, there will be gaps in
the performance measurement system where these data are no longer collected.
Are there issues with sampling procedures, instrument design, and implementation that need to be reviewed
or even done externally? In other words, are there methodological requirements that need to be established
to ensure the credibility of the data?
Who will actually collect and report the data? If managers are involved, is there any concern that their involvement could be seen to be in conflict with the incentives they perceive?
When managers review a draft of the proposed performance measures, if none pertain to their programs, they may conclude that they are being excluded and are therefore vulnerable in future budget allocations. (This was the main issue in the BC
government ministry case included earlier in this chapter.) It is essential to have a rationale for each measure
and some overall rationale for featuring some measures but not others. In many complex organizations,
performance measures that would be chosen by program managers to monitor their own programs would
not be “strategic”—there can be a divide between measures that face inward and those that face outward
(Laihonen & Mäntylä, 2017; Umashev & Willett, 2008).

In Chapter 4, we introduced measurement validity and reliability criteria to indicate the methodological
requirements for sound measurement processes in general. In many performance measurement situations,
resources are scarce and time limited to assess each measure in methodological terms. In terms of the categories of
validity discussed in Chapter 4, those developing and implementing performance measures usually pay attention
to face validity (On the face of it, does the measure do an adequate job of representing the construct?), content
validity (How well does the measure or measures represent the range of content implied by the construct?), and
response process validity (Have the participants in the measurement process taken it seriously?).

We are reminded of a quote that has been attributed to Sir Josiah Stamp, a tax collector for the government in
England during the 19th century:

The government is extremely fond of amassing great quantities of statistics. These are raised to the nth
degree, the cube roots are extracted and the results are arranged into elaborate and impressive displays.
What must be kept ever in mind, however, is that in every case, the figures are first put down by a
village watchman and he puts down anything he damn well pleases. (Source: Sir Josiah Stamp, Her
Majesty’s Collector of Inland Revenues, more than a century ago) (cited in Thomas, 2004, p. xiii)

Assessing other kinds of measurement validity (internal structure, concurrent, predictive, convergent, and
discriminant; see Chapter 4) is generally beyond the methodological scope in performance measurement
situations. The reliability of performance measures is often assessed with a judgmental estimate of whether the
measure and the data are accurate and complete—that is, are collected and recorded so that there are no important
errors in the ways the data represent the events or processes in question. In some jurisdictions, performance
measures are audited for reliability (see, e.g., Texas State Auditor’s Office, 2017).

An example of judgmentally assessing the reliability and validity of measures of program results might be a social
service agency that has included the number of client visits as a performance measure for the funders of its
counseling program. Suppose that the agency and the funders agree that the one measure is sufficient since
payments to the agency are linked to the volume of work done, and client visits are deemed to be a reasonably
accurate measure for that purpose. To assess the validity and reliability of that measure, one would want to know
how the data are recorded (e.g., by the social workers or by the receptionists) and how the files are transferred to
the agency database (manually or electronically as part of the intake process for each visit). Are there under- or
over-counting issues in the way the data are recorded? Do telephone consultations count as client visits? What if
the same client visits the agency repeatedly, perhaps even to a point where other prospective client appointments
are less available? Should a second measure of performance be added that tracks the number of clients served
(improving content validity)? Will that create a more balanced picture and create incentives to move clients
through the treatment process? What if clients change their names—does that get taken into account in recording
the number of clients served? Each performance measure or combination of measures for each construct will have
these types of practical problems that must be addressed if the data in the performance measurement system are to
be credible and usable.
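
A small sketch of how some of these counting questions play out in practice is shown below. The visit records, identifiers, and the decision to exclude telephone consultations are all hypothetical assumptions made for illustration.

# Hypothetical visit records; a stable client_id (rather than a name) avoids
# double-counting when clients change their names.
visit_records = [
    {"client_id": "A-101", "mode": "in_person"},
    {"client_id": "A-101", "mode": "in_person"},   # repeat visit by the same client
    {"client_id": "B-202", "mode": "telephone"},   # counted as a visit or not?
    {"client_id": "C-303", "mode": "in_person"},
]

# Here we assume the agreed definition counts only in-person visits.
in_person = [v for v in visit_records if v["mode"] == "in_person"]

total_visits = len(in_person)                                # volume of work done
distinct_clients = len({v["client_id"] for v in in_person})  # clients served

print(f"Client visits (in person): {total_visits}")     # 3
print(f"Distinct clients served:   {distinct_clients}") # 2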

In jurisdictions where public performance reporting is mandated, a significant issue is an expectation that
requiring fewer performance measures for a department will simplify performance reporting and make the performance report more concise and more readable. Internationally, guidelines exist that suggest a rule of
parsimony when it comes to selecting the number of performance measures for public reporting. For example, the
Canadian Audit and Accountability Foundation (CCAF-FCVI, 2002) has outlined principles for public
performance reporting, one of which is to “focus on the few critical aspects of performance” (p. 4). This same
principle is reflected in guidelines developed for performance reporting by the Queensland State Government in
Australia (Thomas, 2006).

Typically, the number of performance measures in public reports is somewhere between 10 and 20, meaning that
in large organizations, some programs will not be represented in the image of the department that is conveyed
publicly. A useful way to address managers wanting their programs to be represented publicly is to commit to
constructing separate internal performance reports. Internal reports are consistent with the balancing of formative
and summative uses of performance measurement systems. As we’ve noted, it is our view that unless a performance
measurement system is used primarily for internal performance management, it is unlikely to be sustainable.
Internal performance measures can more fully reflect each program and are generally seen to better represent the
accomplishments of programs.

One possible problem with any performance measurement system is the potential for ambiguity in observed
patterns of results. In an Oregon benchmarking report (Oregon Progress Board, 2003), affordability of housing
was offered as an indicator of the well-being of the state (presumably of the broad social and economic systems in
the state). If housing prices are trending downward, does that mean that things are getting worse or better? From
an economic perspective, declining housing prices could mean that (a) demand is decreasing in the face of a steady
supply; (b) demand is decreasing, while supply is increasing; (c) demand and supply are both increasing, but
supply is increasing more quickly; or (d) demand and supply are both decreasing, but demand is decreasing more
quickly. Each of these patterns suggests something different about the well-being of the economy. To complicate
matters, each of these scenarios would have different interpretations if we were to take a social rather than an
economic perspective. The point is that prospective users of performance information should be invited to offer
their interpretations of simulated patterns of such information (Davies & Warman, 1998). In other words,
prospective users should be offered scenarios in which different trends and levels of measures are posed. If these
trends or levels have ambiguous interpretations—“it depends”—then it is quite likely that when the performance
measurement system is implemented, similar ambiguities will arise as reports are produced and used.
Fundamentally, ambiguous measures invite conflicting interpretations of results and will tend to weaken the
credibility of the system.

An additional measurement issue is whether measures and the data that correspond with the measures should all
be quantitative. Poister et al. (2015) specify, “Performance measurement is the systematic orderly collection of
quantitative data along a set of key indicators of organizational or program performance” (p. 7). In Chapter 5, we
discussed the important contributions that qualitative evaluation methods can make to program evaluations. We
included an example of how qualitative methods can be used to build a performance measurement and reporting
system (Davies & Dart, 2005; Sigsgaard, 2002). There is a meaningful distinction between the information that is
conveyed qualitatively and that which is conveyed by numbers. Words can provide us with texture, emotions, and
a more vivid understanding of situations. Words can qualify numbers, interpret numbers, and balance
presentations. Most importantly, words can describe experiences—how a program was experienced by particular
clients, as opposed to the number of clients served, for example.

In performance measurement systems, it can be desirable to have both quantitative and qualitative measures/data.
Stakeholders who take the time to read a mixed presentation can learn more about program performance. But in
many situations, particularly where performance targets are set and external reporting is mandated, there is a bias
toward numerical information, since targets are nearly always stated quantitatively. If the number of persons on
social assistance is expected to be reduced by 10% in the next fiscal year, for example, the most relevant data will
be numerical. Whether that program meets its target or not, however, the percent reduction in the number of
persons on social assistance provides no information about the process whereby that happened or about other (perhaps unintended) consequences.

Performance measurement systems that focus primarily on providing information for formative uses should include deeper and richer measures than those used for public reporting. Qualitative information can provide
managers with feedback that is helpful in adjusting program processes to improve results. Also, qualitative
information can reveal to managers the client experiences that accompany the process of measuring quantitative
results.

Qualitative information presented as cases or examples that illustrate a pattern that is reported in the quantitative
data can be a powerful way to convey the meaning of the numerical information. Although single cases can only
illustrate, they communicate effectively. In some cases, narratives can be essential to conveying the meaning of
performance results.

11. Highlight the Comparisons That Can Be Part of the Performance
Measurement System
In addition to simulating different patterns of information for prospective users, it is important to ascertain what
kinds of comparisons are envisioned with performance data. Table 9.3 lists four different comparisons that can be
built into performance measurement systems (and reports). Often, mixes of these comparisons will be included in
a performance report.

Table 9.3 Comparisons That Can Be Included in Performance Measurement Reports

Comparisons of performance trends over time

Comparisons of a performance measure across similar administrative units

Comparisons between actual performance results and benchmarks or targets

Comparisons that publicly rank organizations in terms of their performance

A common comparison is to look for trends over time and make judgments based on interpretations of those
trends. (Are the trends favorable or not given intended linkages in the model?) An example of a publicly reported
performance measure that tracks trends over time is the WorkSafeBC measure of injured workers’ overall
satisfaction with their experience with the organization (their rating of “overall claim experience”).

Each year, WorkSafeBC arranges for an independent survey of about 800 injured workers who are randomly
selected from among those who made claims for workplace injuries (WorkSafeBC, 2018). Workers are asked to rate
their overall satisfaction on a 5-point scale from very poor to very good. This performance measure is one of 10 that
are included in the annual report. Figure 9.1 has been excerpted from the 2017 Annual Report (WorkSafeBC,
2018) and displays the percentages of surveyed workers who rated their overall satisfaction as good or very good
over time. Also displayed are the targets for this performance measure for the next 3 years. This format for a
performance measure makes it possible to see what the overall trend is and how that trend is expected to change in
the future. We can see that over time, approximately three quarters of injured workers have tended to be satisfied
with their claim experience. There have been historical variations in that percentage; for example, in 2009 and
2010 (not displayed) the percentages dropped to 65% and 69% respectively. Why might that have been the case?
Those two years coincided with the aftermath of the Great Recession in 2008, and in British Columbia, Canada,
there were layoffs in many workplaces. Circumstantially, the surveys of injured worker satisfaction may have also
“picked up” worker (dis)satisfaction more broadly.
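
The calculation behind a measure like the one in Figure 9.1 is straightforward; a minimal sketch, using invented survey ratings rather than WorkSafeBC data, is shown below.

# Ratings on the 5-point scale described above (invented for illustration)
ratings = ["very good", "good", "average", "good", "poor",
           "very good", "good", "average", "good", "very poor"]

satisfied = sum(1 for r in ratings if r in ("good", "very good"))
pct_satisfied = 100 * satisfied / len(ratings)

print(f"{pct_satisfied:.0f}% rated their claim experience good or very good")  # 60%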

Figure 9.1 Performance Measurement Results Over Time: Injured Workers’ Overall Satisfaction With Their
WorkSafeBC Claim Experience

Source: WorkSafeBC (2018, p. 50). Copyright © WorkSafeBC. Used with permission.

Another comparison that can be made using performance information is across similar administrative units. For
example, in Canada, the Municipal Benchmarking Network (MBN, 2016) facilitates voluntary comparative
performance measurement and reporting for local governments. Municipalities (15 are involved) provide annual
performance information for 36 municipal services (a total of 173 performance measures). Figure 9.2 displays a
comparison for one performance measure for one service (emergency medical services): the percentage of cardiac
arrest calls where the response is less than 6 minutes and the responder has a defibrillator on board. In the figure,
12 of the 15 cities are displayed for 2014, 2015, and 2016. (Those cities provide this service.) Of interest are the
overall comparisons and any trends over time within communities.

This figure also illustrates a third kind of comparison in which performance results are benchmarked—in this case,
against a standard of 6 minutes to respond with a defibrillator on board.

Figure 9.2 Percentages of Cardiac Emergency Responses in Selected Canadian Municipalities That Are Less
Than 6 Minutes and the Responder Has a Defibrillator on Board

Source: Municipal Benchmarking Network Canada. (2017, p. 57). 2016 MBNCanada Performance
Measurement Report. Copyright © MBNCanada. Used with permission.

A fourth type of comparison is a performance assessment of similar organizations (universities, for example) that
ranks them, summarizing the overall results in a table or graphic. Internationally, there is a growing movement to
rank countries on different indices. One international organization that features country rankings is the United
Nations Development Program (UNDP). In a 2016 report, countries are compared and ranked on their Human
Development Index scores (UNDP, 2016).

The British government has been a leader in using public rankings of institutions and governments. Public
rankings (sometimes called league tables) are influential in conferring status (and perceived success) on the highest
ranked or rated institutions. As an example, The Complete University Guide (2018) is an annual comparison of
129 universities in the UK (England, Scotland, and Wales) on 10 measures (entry standards, student satisfaction,
research quality, research intensity, graduate prospects, student–faculty ratio, spending on academic services,
spending on facilities [buildings, labs, and technology], degree graduates who achieve an honors GPA, and degree
completion rate). Each measure is weighted so that for each institution, an overall (index) score is calculated. (The
scores are normed so that the top score is 1,000 points; the bottom score is 337 points in the 2017 table.)
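The general logic of such a weighted, normed index can be illustrated with the Python sketch below. The measures, weights, and scores are hypothetical and do not reproduce The Complete University Guide's actual methodology; the point is simply that each institution's weighted total is rescaled so that the top institution receives 1,000 points.

```python
# Hypothetical weights for a subset of measures (not the Guide's actual weights)
weights = {"entry_standards": 1.0, "student_satisfaction": 1.5,
           "research_quality": 1.0, "graduate_prospects": 1.0}

# Hypothetical standardized scores for three institutions
scores = {
    "University A": {"entry_standards": 1.2, "student_satisfaction": 0.8,
                     "research_quality": 1.5, "graduate_prospects": 1.0},
    "University B": {"entry_standards": 0.4, "student_satisfaction": 1.1,
                     "research_quality": 0.2, "graduate_prospects": 0.6},
    "University C": {"entry_standards": -0.3, "student_satisfaction": 0.5,
                     "research_quality": -0.1, "graduate_prospects": 0.2},
}

# Weighted total for each institution, then normed so the top score equals 1,000
raw = {u: sum(weights[m] * s[m] for m in weights) for u, s in scores.items()}
top = max(raw.values())
index = {u: round(1000 * r / top) for u, r in raw.items()}

for university, points in sorted(index.items(), key=lambda kv: -kv[1]):
    print(university, points)
```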

Many organizations have opted for a visuals-rich way of reporting performance. Dashboards, in which key
performance measurement results are displayed graphically, can be interpreted at a glance. One common format uses
traffic-light colors: green indicating acceptable performance, orange indicating caution, and red indicating a
problem that needs a more careful look. The reporting process itself also matters: Is there a regular cycle of
reporting? Is there a process in place whereby reports are reviewed and critiqued internally before they are released
publicly? Often, agencies have internal vetting processes in which the authors of reports are expected to be able to
defend the report in front of their peers before the report is released. This challenge function is valuable as a way of
assessing the defensibility of the report and anticipating the reactions of stakeholders.
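A minimal sketch of the traffic-light logic described above is shown below; the 5 percent caution band and the example figures are illustrative assumptions rather than a standard rule.

```python
def dashboard_status(actual, target, caution_band=0.05):
    """Return 'green', 'orange', or 'red' for a higher-is-better measure."""
    if actual >= target:
        return "green"   # target met: acceptable performance
    if actual >= target * (1 - caution_band):
        return "orange"  # just short of target: caution
    return "red"         # well below target: needs a more careful look

# Example: 74% of injured workers satisfied against a hypothetical target of 76%
print(dashboard_status(actual=74, target=76))  # -> orange
```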

After the performance measurement system has been drafted, the organization can begin tracking, analyzing,
interpreting, and reporting, with expectations that the system and measures will need some modification.

12. Reporting and Making Changes to the Performance Measurement System
Reporting requirements will vary from organization to organization or from one governmental jurisdiction to the
next. Typically, the intended uses of the performance results will affect how formal the reporting process is. Some
organizations report their performance results using dashboards on their websites, others use manager-driven
performance reports that are available to elected decision makers (Hildebrand & McDavid, 2011), and still others
follow required templates that reflect government-wide reporting requirements (British Columbia Government,
2017). Poister, Aristigueta, and Hall (2015) suggest elements of possible performance reports in Chapter 6 of their
book and point out that the intended purposes will drive the format and contents of performance reports.

The international public accounting community has taken an interest in public performance reporting generally
and, in particular, the role that public auditors can play in assessing the quality of public performance reports
(CCAF-FCVI, 2008; Klay, McCall, & Baybes, 2004). The assumption is that if the quality of the supply of public
performance reports is improved—that is, performance reports are independently audited for their credibility—
they are more likely to be used, and the demand for them will increase. Legislative auditors, in addition to
recommending principles to guide public performance reporting, have been active in promoting audits of
performance reports (CCAF-FCVI, 2002, 2008; Klay et al., 2004). Externally auditing the performance reporting
process is suggested as an important part of ensuring the longer term credibility of the system—improving the
supply of performance information to increase its demand and use. With varying degrees of regularity and
intensity, external audits of performance reports are occurring in some jurisdictions at the national, state, provincial,
and/or local levels (Gill, 2011; Schwartz & Mayne, 2005). In Britain, for example, between 2003 and 2010, the
National Audit Office (NAO) conducted assessments of the performance measures that were integral to the Public
Service Agreements between departments and the government. The NAO audits focused on performance data
systems “to assess whether they are robust, and capable of providing reliable, valid information” (NAO, 2009).

Implementing a system with a fixed structure (logic models and measures) at one point in time will not ensure the
continued relevance or uses of the system in the future. Uses of and organizational needs for performance data will
evolve. There is a balance between the need to maintain continuity of performance measures, on the one hand,
and the need to reflect changing organizational objectives, structures, and prospective uses of the system, on the
other (Kravchuk & Schack, 1996). In many performance measurement systems, there are measures that are
replaced periodically and measures that are works in progress (Malafry, 2016). A certain amount of continuity in
the measures increases the capacity of measures to be compared over time.

Data displayed as a time series can, for example, show trends in environmental factors, as well as changes in
outputs and outcomes; by comparing environmental variable trends with outcome trends, it may be possible to
eyeball the influences of plausible rival hypotheses on particular outcome measures. Although this process depends
on the length of the time series and is judgmental, it does permit analysts to use some of the same tools that would
be used by program evaluators. In Chapter 3, recall that in the York crime prevention program evaluation, the
unemployment rate in the community was an external variable that was included in the evaluation to assist the
evaluators in determining whether the neighborhood watch program was the likely cause of the observed changes
in the reported burglary rate.
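A simple way to support this kind of "eyeballing" is to line up the outcome measure and the environmental variable year by year. The Python sketch below uses entirely hypothetical data in the spirit of the York example; an analyst would scan the two series (and their year-over-year changes) to judge whether movements in the outcome track the program or the rival variable.

```python
# Hypothetical annual data: an outcome measure and an environmental (rival) variable
years = [2014, 2015, 2016, 2017, 2018]
burglary_rate = [42, 39, 35, 36, 31]      # reported burglaries per 1,000 households
unemployment = [6.8, 6.5, 7.9, 8.1, 7.0]  # local unemployment rate (%)

print(f"{'Year':<6}{'Burglaries':<12}{'Unemployment':<12}")
for yr, b, u in zip(years, burglary_rate, unemployment):
    print(f"{yr:<6}{b:<12}{u:<12}")

# Year-over-year changes, displayed side by side for visual comparison
for i in range(1, len(years)):
    print(years[i], "burglary change:", burglary_rate[i] - burglary_rate[i - 1],
          " unemployment change:", round(unemployment[i] - unemployment[i - 1], 1))
```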

But continuity can also make performance measures less relevant over time. Suppose, for example, that a
performance measurement system was designed to pull data from several different databases, and the original
information system programming to make this work was expensive. Even if the data needs change, there may well
be a desire not to go back and repeat this work, simply because of the resources involved. Likewise, if a
performance measurement system is based on a logic model that becomes outdated, then the measures will no
longer fully reflect what the program(s) or the organization is trying to accomplish. But going back to redo the
logic model (which can be a time-consuming, iterative process) may not be feasible in the short term, given the
resources available. The price of such a decision might be a gradual reduction in the relevance of the system, which
may not be readily detected (substantive uses would go first, leaving symbolic uses).

With all the activity to design and implement performance measurement and reporting systems, there has been
surprisingly little effort to date to evaluate their effectiveness (McDavid & Huse, 2012; Poister et al., 2015). In
Chapter 10, we will discuss what is known now about the ways in which performance information is used, but it is
appropriate here to suggest some practical steps to generate feedback that can be used to modify and better sustain
performance measurement systems:

Develop channels for user feedback. This step is intended to create a process that will allow the users to
provide feedback and suggest ways to revise, review, and update the performance measures. Furthermore,
this step is intended to help identify when corrections are required and how to address errors and
misinterpretations of the data.
Create an expert review panel of persons who are both knowledgeable about performance measurement and
do not have a stake in the system that is being reviewed. Performance measurement should be conducted on
an ongoing basis, and this expert panel review can provide feedback and address issues and problems over a
long-term time frame. A review panel can also provide an independent assessment of buy-in and use of
performance information by managers and staff and can track the (intended and unintended) effects of the
system on the organization. The federal government in Canada, for example, mandates the creation of
Performance Measurement and Evaluation Committees for each department or agency (Treasury Board,
2016b). Among other tasks, these committees review and advise on “the availability, quality, utility and use
of planned performance information and actual performance information” (p. 1).

Performance Measurement for Public Accountability
Speaking broadly, performance results are used to improve programs in two ways: (1) externally, through public
and political accountability expectations, and (2) internally, by providing information to managers to improve
performance. Together, these purposes are often referred to as the foundation for performance management.

Many jurisdictions have embraced results-focused performance measurement systems with an emphasis on public
and political accountability (Christensen & Laegreid, 2015; Dubnick, 2005; Jakobsen et al., 2017). Performance
measurement systems can be developed so that the primary emphasis, as they are implemented, is on setting public
performance targets for each organization, measuring performance, and, in public reports, comparing actual results
with targeted results. Usually, performance reports are prepared at least annually and delivered to external
stakeholders. In most jurisdictions, elected officials and the public are the intended primary recipients.

The logic that has tended to underpin these systems is that the external pressures and transparency of public
performance reporting will drive performance improvement. Making public accountability the principal goal is
intended to give organizations the incentive to become more efficient and effective (Auditor General of British
Columbia, 1996). Performance improvements are expected to come about because elected officials and other
stakeholders put pressure, via public performance reports, on organizations to “deliver” targeted results. Fully
realized performance management systems are expected to include organizational performance incentives that are
geared toward improving performance (Moynihan, 2008; Poister et al., 2015).

Figure 9.3 is a model of key intended relationships between performance measurement, public reporting, public
accountability, and performance improvement. In the figure, public performance reporting is hypothesized to
contribute to both public accountability and performance improvement. Furthermore, performance improvement
and public accountability are expected to reinforce each other.

The model in Figure 9.3 reflects relationships among performance measurement, public reporting, public
accountability, and performance improvement that are (normatively) expected in governmental reforms in many
jurisdictions. Indeed, New Public Management, with its critical perspective on public-sector organizations and
governments, relies on a “carrot and stick” logic that reflects public choice assumptions about human motivation
and human behavior. As we noted earlier, this approach to performance reporting can introduce a higher-stakes
dimension to developing and implementing performance measurement systems, particularly in environments where
politics is adversarial. Once performance information is rendered in public reports, it can be used in ways that
have consequences (intended and unintended) for managers, executives, and elected officials. The literature
on using performance information is rich with findings that suggest that the characteristics of the political culture
in which government organizations are embedded can substantially influence both the quality and the uses of
performance information (de Lancer Julnes, 2006; de Lancer Julnes & Holzer, 2001; McDavid & Huse, 2012;
Thomas, 2006; Van Dooren & Hoffmann, 2018).

In this chapter, we have suggested that performance measurement systems, to be sustainable, need to be designed
and implemented so that managerial use of performance information is the central purpose. In contrast to the
relationships suggested in Figure 9.3, we believe that in many settings, particularly where the political culture is
adversarial, public performance reporting may undermine the use of the performance information for performance
improvement (McDavid & Huse, 2012). We will explore this challenge in Chapter 10.

Figure 9.3 A Normative Model of the Intended Relationship Between Public Accountability and
Performance Improvement

Summary
The 12 steps for designing and implementing performance measurement systems discussed in this chapter reflect both a technical/rational
and a political/cultural view of organizations. Both perspectives are important in designing and implementing sustainable performance
measurement systems. Collectively, these steps highlight that undertaking this process is a significant organizational change. It is quite
likely that in any given situation, one or more of these criteria will be difficult to address. Does that mean that, unless performance
measurement systems are designed and implemented with these 12 steps in view, the system will fail? No, but it is reasonable to assert that
each criterion is important and does enhance the overall likelihood of success. The performance measurement system may benefit from
the decoupling of internal and external performance measures—an issue further discussed in Chapter 10.

Our view is that for performance measurement systems to be sustainable, prioritization of managerial involvement is key. In this chapter,
we have developed an approach to performance measurement that emphasizes utilization of the information obtained for improving
performance. Performance measurement for public accountability is one purpose of such systems, but making that the main purpose will
tend to weaken managerial commitment to the system over time and thus undermine the usefulness of the measures for improving
efficiency and effectiveness.

Among the 12 steps, six are more critical. Each contributes something necessary for successful design and implementation, and again,
these reflect a mix of technical/rational and organizational-political/cultural perspectives.

1. Sustained leadership: Without this, the process will drift and eventually halt. Leadership is required for a period of 3 to 5 years.
2. Good communications: They are essential to developing a common understanding of the process, increasing the likelihood of buy-
in, and contributing to a culture of openness.
3. Clear expectations for the system: Being open and honest about the purposes behind the process is important so that key
stakeholders (managers and others) are not excluded or blindsided. Bait-and-switch tactics, in which one picture of prospective
uses is offered up front (formative uses) and is then changed (to summative uses) once the system is developed, tend to backfire.
4. Resources sufficient to free up the time and expertise needed: When resources are taken away from other programs to measure and
report on performance, the process is viewed as a competitor to important organizational work and is often given short shrift.
5. Logic models that identify the key program and organizational constructs: The process of logic modeling or building a framework is
very important to informing the selection of constructs and the development of performance measures.
6. A measurement process that succeeds in producing valid measures in which stakeholders have confidence: Too few performance
measurement systems pay adequate attention to measurement validity and reliability criteria that ultimately determine the
perceived credibility and usefulness of the system.

These six criteria can be thought of as individually necessary, but they will vary in importance in each situation. Performance
measurement is a craft. In that respect, it is similar to program evaluation. There is considerable room for creativity and professional
judgment as organizations address the challenges of measuring results.

Discussion Questions
1. You are a consultant to the head of a government agency (1,000 employees) that delivers social service programs to families. The
families have incomes below the poverty line, and most of them have one parent (often the mother) who is either working for low
wages or is on social assistance. The agency is under some pressure to develop performance measures as part of a broad
government initiative to make service organizations more efficient, effective, and accountable. In your role, you are expected to
give advice to the department head that will guide the organization into the process of developing and implementing a
performance measurement system. What advice would you give about getting the process started? What things should the
department head do to increase the likelihood of success in implementing performance measures? How should he or she work
with managers and staff to get them onside with this process? Try to be realistic in your advice—assume that there will not be
significant new resources to develop and implement the performance measurement system.
2. Performance measurement systems are often intended to improve the efficiency and effectiveness of programs or organizations
(improve performance). But, generally, organizations do not take the time to strategically assess whether their performance
measurement systems are actually making a difference, and if so, how? Suppose that the same organization that was referred to in
Question 1 has implemented its performance measurement system. Assume it is 3 years later. The department head now wants to
find out whether the system has actually improved the efficiency and effectiveness of the agency’s programs. Suppose that you are
giving this person advice about how to design an evaluation project to assess whether the performance measurement system has
“delivered.” Think of this as an opportunity to apply your program evaluation skills to finding out whether this performance
measurement system was successfully implemented. What would be possible criteria for the success of the system? How would
you set up research designs that would allow you to see whether the system had the intended incremental effects? What would you
measure to see if the system has been effective? What comparisons would you build into the evaluation design?

Appendix A: Organizational Logic Models

Figure 9A.1 Canadian Heritage Department Departmental Results Framework 2018–2019

Source: Departmental Results Framework and Program Inventory, Canadian Heritage, 2017, retrieved from
https://www.canada.ca/en/canadian-heritage/corporate/mandate/results-framework-program-inventory.html.
Reproduced with the permission of the Minister of Canadian Heritage, 2018.

References
Agocs, C., & Brunet-Jailly, E. (2010). Performance management in Canadian local governments: A journey in
progress or a dead-end? In E. Brunet-Jailly & J. Martin (Eds.), Local government in a global world: Australia and
Canada in comparative perspective. Toronto, Canada: University of Toronto Press, IPAC Publication.

Arnaboldi, M., & Lapsley, I. (2005). Activity based costing in healthcare: A UK case study. Research in Healthcare
Financial Management, 10(1), 61–75.

Aubry, T., Nelson, G., & Tsemberis, S. (2015). Housing first for people with severe mental illness who are
homeless: A review of the research and findings from the At Home/Chez Soi demonstration project. Canadian
Journal of Psychiatry, 60(11), 467–474.

Auditor General of British Columbia. (1996). 1996 annual report: A review of the activities of the office. Victoria,
British Columbia, Canada: Queen’s Printer.

Auditor General of New Zealand. (2017). Auditor-General’s Audit Standard 4: The audit of performance reports.
Retrieved from https://www.oag.govt.nz/2017/auditing-standards/docs/28-ag-4-performance-reports.pdf

Bakvis, H., & Juillet, L. (2004). The horizontal challenge: Line departments, central agencies and leadership. Ottawa,
Ontario: Canada School of Public Service.

Behn, R. D. (2003). Why measure performance? Different purposes require different measures. Public
Administration Review, 63(5), 586–606.

Bevan, G., & Hamblin, R. (2009). Hitting and missing targets by ambulance services for emergency calls: Effects
of different systems of performance measurement within the UK. Journal of the Royal Statistical Society. Series A
(Statistics in Society), 172(1), 161–190.

Bish, R., & McDavid, J. C. (1988). Program evaluation and contracting out government services. Canadian
Journal of Program Evaluation, 3(1), 9–23.

British Columbia Government. (2017). Annual Service Plan Reports 2017. Retrieved from
http://www.bcbudget.gov.bc.ca/Annual_Reports/2016_2017/default.htm

Brudney, J., & Prentice, C. (2015). Contracting with nonprofit organizations. In R. Shick (Ed.), Government
contracting (pp. 93–113). New York, NY: Routledge.

Buckmaster, N., & Mouritsen, J. (2017). Benchmarking and learning in public healthcare: Properties and effects.
Australian Accounting Review, 27(3), 232–247.

Campbell, D. (2002). Outcomes assessment and the paradox of nonprofit accountability. Nonprofit Management
& Leadership, 12(3), 243–259.

Canadian Heritage Department. (2017). Departmental results framework and program inventory. Retrieved from
https://www.canada.ca/en/canadian-heritage/corporate/mandate/results-framework-program-inventory.html

CCAF-FCVI. (2002). Reporting principles: Taking public performance reporting to a new level. Ottawa, Ontario,
Canada: Author.

CCAF-FCVI. (2008). Consultations on improving public performance reports in Alberta. Retrieved from
http://www.assembly.ab.ca/lao/library/egovdocs/2008/altrb/168825.pdf

Christensen, T., & Laegreid, P. (2015). Performance and accountability—a theoretical discussion and an empirical
assessment. Public Organization Review, 15(2), 207–225.

Davies, R., & Dart, J. (2005). The “Most Significant Change” (MSC) technique: A guide to its use. Retrieved from
http://mande.co.uk/wp-content/uploads/2018/01/MSCGuide.pdf

de Lancer Julnes, P. (1999). Lessons learned about performance measurement. International Review of Public
Administration, 4(2), 45–55.

de Lancer Julnes, P. (2006). Performance measurement: An effective tool for government accountability? The
debate goes on. Evaluation, 12(2), 219–235.

de Lancer Julnes, P., & Holzer, M. (2001). Promoting the utilization of performance measures in public
organizations: An empirical study of factors affecting adoption and implementation. Public Administration
Review, 61(6), 693–708.

de Lancer Julnes, P., & Steccolini, I. (2015). Introduction to Symposium: Performance and accountability in
complex settings—Metrics, methods, and politics. International Review of Public Administration, 20(4),
329–334.

de Waal, A. A. (2003). Behavioral factors important for the successful implementation and use of performance
management systems. Management Decision, 41(8), 688–697.

Dubnick, M. (2005). Accountability and the promise of performance: In search of the mechanisms. Public
Performance & Management Review, 28(3), 376–417.

Funnell, S. C., & Rogers, P. J. (2011). Purposeful program theory: Effective use of theories of change and logic models.
San Francisco, CA: John Wiley & Sons.

Gill, D. (Ed.). (2011). The iron cage recreated: The performance management of state organisations in New Zealand.
Wellington, New Zealand: Institute of Policy Studies.

Government of Alberta. 2016–17 annual report: Executive summary, consolidated financial statements and
Measuring Up. Edmonton, Alberta, Canada: Government of Alberta. Retrieved from
https://open.alberta.ca/dataset/7714457c-7527-443a-a7db-dd8c1c8ead86/resource/e6e99166-2958-47ac-a2db-5b27df2619a3/download/GoA-2016-17-Annual-Report.pdf

Government of British Columbia. (2001). Budget Transparency and Accountability Act [SBC 2000 Chapter 23]
(amended). Victoria, British Columbia, Canada: Queen’s Printer.

Hatry, H. P. (2013). Sorting the relationships among performance measurement, program evaluation, and
performance management. New Directions for Evaluation, 137, 19–32.

Head, B. W., & Alford, J. (2015). Wicked problems: Implications for public policy and management.
Administration & Society, 47(6), 711–739.

Hildebrand, R., & McDavid, J. (2011). Joining public accountability and performance management: A case study
of Lethbridge, Alberta. Canadian Public Administration, 54(1), 41–72.

Hood, C. (1991). A public management for all seasons? Public Administration, 69(1), 3–19.

Hood, C. (2006). Gaming in targetworld: The targets approach to managing British public services. Public
Administration Review, 66(4), 515–521.

Hood, C., & Peters, G. (2004). The middle aging of New Public Management: Into the age of paradox? Journal of
Public Administration Research and Theory, 14(3), 267–282.

Hughes, P., & Smart, J. (2018). You say you want a revolution: The next stage of public sector reform in New
Zealand. Policy Quarterly, 8(1).

Innes, J., Mitchell, F., & Sinclair, D. (2000). Activity-based costing in the UK’s largest companies: A comparison
of 1994 and 1999 survey results. Management Accounting Research, 11(3), 349–362.

Jakobsen, M. L., Baekgaard, M., Moynihan, D. P., & van Loon, N. (2017). Making sense of performance
regimes: Rebalancing external accountability and internal learning. Perspectives on Public Management and
Governance.

Kaplan, R. S., & Norton, D. P. (1996). The balanced scorecard: Translating strategy into action. Boston, MA:
Harvard Business School Press.

Kates, J., Marconi, K., & Mannle, T. E., Jr. (2001). Developing a performance management system for a federal
public health program: The Ryan White CARE ACT Titles I and II. Evaluation and Program Planning, 24(2),
145–155.

Klay, W. E., McCall, S. M., & Baybes, C. E. (2004). Should financial reporting by government encompass
performance reporting? Origins and implications of the GFOA-GASB conflict. In A. Khan & W. B. Hildreth
(Eds.), Financial management theory in the public sector (pp. 115–140). Westport, CT: Praeger.

Kotter, J. (1995, March/April). Leading change: Why transformation efforts fail. Harvard Business Review. Reprint
95204, 59–67. Retrieved from
https://www.gsbcolorado.org/uploads/general/PreSessionReadingLeadingChange-John_Kotter.pdf

Kravchuk, R. S., & Schack, R. W. (1996). Designing effective performance-measurement systems under the
Government Performance and Results Act of 1993. Public Administration Review, 56(4), 348–358.

Kroll, A. (2015). Drivers of performance information use: Systematic literature review and directions for future
research. Public Performance & Management Review, 38(3), 459–486.

Kroll, A., & Moynihan, D. P. (2018). The design and practice of integrating evidence: Connecting performance
management with program evaluation. Public Administration Review, 78(2), 183–194.

Lahey, R., & Nielsen, S. (2013). Rethinking the relationship among monitoring, evaluation, and results-based
management: Observations from Canada. New Directions for Evaluation, 2013(137), 45–56.

Laihonen, H., & Mäntylä, S. (2017). Principles of performance dialogue in public administration. International
Journal of Public Sector Management, 30(5), 414–428.

Le Grand, J. (2010). Knights and knaves return: Public service motivation and the delivery of public services.
International Public Management Journal, 13(1), 56–71.

Levine, C. H., Rubin, I., & Wolohojian, G. G. (1981). The politics of retrenchment: How local governments manage
fiscal stress (Vol. 130). Beverly Hills, CA: Sage.

Martin, L. L., & Kettner, P. M. (1996). Measuring the performance of human service programs. Thousand Oaks,
CA: Sage.

Mayne, J. (2001). Addressing attribution through contribution analysis: Using performance measures sensibly.
Canadian Journal of Program Evaluation, 16(1), 1–24.

Mayne, J. (2008). Building an evaluative culture for effective evaluation and results management. Retrieved from
www.focusintl.com/RBM107-ILAC_WorkingPaper_No8_EvaluativeCulture_Mayne.pdf

Mayne, J., & Rist, R. C. (2006). Studies are not enough: The necessary transformation of evaluation. Canadian
Journal of Program Evaluation, 21(3), 93–120.

McDavid, J. C. (2001a). Program evaluation in British Columbia in a time of transition: 1995–2000. Canadian
Journal of Program Evaluation, 16(Special Issue), 3–28.

McDavid, J. C. (2001b). Solid-waste contracting-out, competition, and bidding practices among Canadian local
governments. Canadian Public Administration, 44(1), 1–25.

McDavid, J. C., & Huse, I. (2006). Will evaluation prosper in the future? Canadian Journal of Program
Evaluation, 21(3), 47–72.

McDavid, J. C., & Huse, I. (2012). Legislator uses of public performance reports: Findings from a five-year study.
American Journal of Evaluation, 33(1), 7–25.

Morgan, G. (2006). Images of organization (Updated ed.). Thousand Oaks, CA: Sage.

Moullin, M. (2017). Improving and evaluating performance with the Public Sector Scorecard. International
Journal of Productivity and Performance Management, 66(4), 442–458.

Moynihan, D. P. (2005). Goal-based learning and the future of performance management. Public Administration
Review, 65(2), 203–216.

Moynihan, D. P. (2008). The dynamics of performance management: Constructing information and reform.
Washington, DC: Georgetown University Press.

Moynihan, D. (2008). Advocacy and learning: An interactive-dialogue approach to performance information use.
In W. Van Dooren & S. Van de Walle (Eds.), Performance information in the public sector: How it gets used (pp.
24–41). London, UK: Palgrave Macmillan.

Moynihan, D. P., Pandey, S. K., & Wright, B. E. (2012). Setting the table: How transformational leadership
fosters performance information use. Journal of Public Administration Research and Theory, 22(1), 143–164.

National Audit Office. (2009). Performance frameworks and board reporting: A review by the performance
measurement practice. Retrieved from
http://www.nao.org.uk/guidance__good_practice/performance_measurement1.aspx

Newcomer, K. E. (Ed.). (1997). Using performance measurement to improve public and nonprofit programs (New
Directions for Evaluation, No. 75). San Francisco, CA: Jossey-Bass.

Nielsen, C., Lund, M., & Thomsen, P. (2017). Killing the balanced scorecard to improve internal disclosure.
Journal of Intellectual Capital, 18(1), 45–62.

Norman, R. (2001). Letting and making managers manage: The effect of control systems on management action
in New Zealand’s central government. International Public Management Journal, 4(1), 65–89.

Norman, R., & Gregory, R. (2003). Paradoxes and pendulum swings: Performance management in New
Zealand’s public sector. Australian Journal of Public Administration, 62(4), 35–49.

Oregon Progress Board. (2003). Is Oregon making progress? The 2003 benchmark performance report. Salem, OR:
Author.

Otley, D. (2003). Management control and performance management: Whence and whither? British Accounting
Review, 35(4), 309–326.

Perrin, B. (2015). Bringing accountability up to date with the realities of public sector management in the 21st
century. Canadian Public Administration, 58(1), 183–203.

Poister, T. H, Aristigueta, M. P., & Hall, J. L. (2015). Managing and measuring performance in public and
nonprofit organizations (2nd ed.). San Francisco, CA: Jossey-Bass.

Pollitt, C. (2007). Who are we, what are we doing, where are we going? A perspective on the academic performance
management community. Köz-Gazdaság, 2(1), 73–82.

Pollitt, C., Bal, R., Jerak-Zuiderent, S., Dowswell, G., & Harrison, S. (2010). Performance regimes in health care:
Institutions, critical junctures and the logic of escalation in England and the Netherlands. Evaluation, 16(1),
13–29.

Prebble, R. (2010). With respect: Parliamentarians, officials, and judges too. Wellington, New Zealand: Victoria
University of Wellington, Institute of Policy Studies.

Propper, C., & Wilson, D. (2003). The use and usefulness of performance measures in the public sector. Oxford
Review of Economic Policy, 19(2), 250–267.

Schwartz, R., & Mayne, J. (Eds.). (2005). Quality matters: Seeking confidence in evaluating, auditing, and
performance reporting. New Brunswick, NJ: Transaction Publishers.

Senge, P. M. (1990). The fifth discipline: The art and practice of the learning organization (1st ed.). New York:
Doubleday/Currency.

Shaw, T. (2016). Performance budgeting practices and procedures. OECD Journal on Budgeting, 15(3), 1D.

Sigsgaard, P. (2002). MCS approach: Monitoring without indicators. Evaluation Journal of Australasia, 2(1),
8–15.

Speklé, R. F., & Verbeeten, F. H. (2014). The use of performance measurement systems in the public sector:
Effects on performance. Management Accounting Research, 25(2), 131–146.

Staudohar, P. D. (1975). An experiment in increasing productivity of police service employees. Public
Administration Review, 35(5), 518.

Stone, D. A. (2012). Policy paradox: The art of political decision making (3rd ed.). New York, NY: W. W. Norton.

Texas State Auditor’s Office. (2017). Performance Measures at the Cancer Prevention and Research Institute of Texas.
Texas State Auditor's Office, Report number 18-009. Retrieved from
http://www.sao.texas.gov/SAOReports/ReportNumber?id=18-009

The Complete University Guide. (2018). University League Tables 2018. Retrieved from
https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?v=wide.

Thomas, P. G. (2004). Performance measurement, reporting and accountability: Recent trends and future directions
(SIPP Public Policy Paper Series No. 23). Retrieved from http://www.publications.gov.sk.ca/details.cfm?
p=12253

Thomas, P. G. (2006). Performance measurement, reporting, obstacles and accountability: Recent trends and future
directions. Canberra, ACT, Australia: ANU E Press. Retrieved from
https://press.anu.edu.au/publications/series/australia-and-new-zealand-school-government-anzsog/performance-
measurement/download

Thor, C. G. (2000, May/June). The evolution of performance measurement in government. Journal of Cost
Management, 18–26.

Tillema, S. (2010). Public sector benchmarking and performance improvement: What is the link and can it be
improved? Public Money & Management, 30(1), 69–75.

Treasury Board of Canada Secretariat. (2016a). Policy on Results. Retrieved from https://www.tbs-
sct.gc.ca/pol/doc-eng.aspx?id=31300

Treasury Board of Canada Secretariat. (2016b). Directive on results. Retrieved from https://www.tbs-
sct.gc.ca/pol/doc-eng.aspx?id=31306

Umashev, C., & Willett, R. (2008). Challenges in implementing strategic performance measurement systems in
multi-objective organizations: The case of a large local government authority. Abacus, 44(4), 377–398.

United Nations Development Program. (2016). Human development report 2016: Human development for everyone.
New York, NY: UNDP.

Van Dooren, W., & Hoffmann, C. (2018). Performance management in Europe: An idea whose time has come and
gone? In E. Ongaro & S. van Thiel (Eds.), The Palgrave handbook of public administration and management
in Europe (pp. 207–225). London, UK: Palgrave Macmillan.

Vazakidis, A., Karagiannis, I., & Tsialta, A. (2010). Activity-based costing in the public sector. Journal of Social
Sciences, 6(3), 376–382.

Wildavsky, A. B. (1979). Speaking truth to power: The art and craft of policy analysis. Boston, MA: Little Brown.

Williams, D. W. (2003). Measuring government in the early twentieth century. Public Administration Review,
63(6), 643–659.

Wilson, J. Q. (1989). Bureaucracy: What government agencies do and why they do it. New York: Basic Books.

WorkSafeBC. (2018). 2017 annual report and 2018–2020 service plan. Retrieved from
https://www.worksafebc.com/en/resources/about-us/annual-report-statistics/2017-annual-report/2017-annual-report-2018-2020-service-plan?lang=en

10 Using Performance Measurement for Accountability and
Performance Improvement

Contents
Introduction 410
Using Performance Measures 411
Performance Measurement in a High-Stakes Environment: The British Experience 412
Assessing the “Naming and Shaming” Approach to Performance Management in Britain 415
A Case Study of Gaming: Distorting the Output of a Coal Mine 418
Performance Measurement in a Medium-Stakes Environment: Legislator Expected Versus Actual Uses
of Performance Reports in British Columbia, Canada 419
The Role of Incentives and Organizational Politics in Performance Measurement Systems With a
Public Reporting Emphasis 424
Performance Measurement in a Low-Stakes Environment: Joining Internal and External Uses of
Performance Information in Lethbridge, Alberta 425
Rebalancing Accountability-Focused Performance Measurement Systems to Increase Performance
Improvement Uses 429
Making Changes to a Performance Measurement System 432
Does Performance Measurement Give Managers the “Freedom to Manage?” 434
Decentralized Performance Measurement: The Case of a Finnish Local Government 435
When Performance Measurement Systems De-Emphasize Outputs and Outcomes: Performance
Management Under Conditions of Chronic Fiscal Restraint 437
Summary 439
Discussion Questions 440
References 441

Introduction
In Chapter 10 we discuss how to encourage the development and use of performance measures for learning and
organizational improvement, taking into account the role of incentives and organizational politics in designing
and implementing performance measurement systems. We examine the various approaches that organizations have
taken to simultaneously address external accountability and internal performance improvement. We review
empirical research that has examined performance measurement uses in different political environments: high risk,
where the consequences of reporting performance problems can be quite severe for both managers and political
leaders; moderate risk, where publicly reported performance results can attract negative attention from the
political opposition or the media; and low risk, where negative consequences are typically
not material concerns for organizations and their political leaders. Political and managerial risk are important
factors in how performance information is created (its credibility, accuracy, and completeness) and ultimately how
it is used for performance improvement and accountability purposes (Boswell, 2018; Bouckaert & Halligan, 2008;
Kroll & Moynihan, 2018).

Although since the 1970s performance measurement systems have tended to be designed and implemented to
drive improvements to efficiency and effectiveness through public and political accountability, there is an
emerging body of research and practice that emphasizes the importance of internal organizational learning as a
rationale for performance measurement (Jakobsen, Baekgaard, Moynihan, & van Loon, 2017; Moynihan, 2005;
Perrin, 2015; Van Dooren & Hoffmann, 2018). Organizations do modify existing performance measurement
systems, and to examine this we revisit the 12 criteria we introduced in Chapter 9 for building, implementing, and
sustaining such systems. Chapter 10 provides eight recommendations for how performance measurement systems
that have been designed primarily to meet external accountability expectations can be changed to better balance
accountability and performance improvement uses. Parenthetically, in Chapter 11 we will discuss the challenges of
transforming organizational cultures to place greater emphasis on evaluation (and learning from evaluation results)
as principal ways of supporting decision-making.

In the latter part of this chapter, we look at the persistence of accountability-focused performance measurement
systems that have evolved as part of the New Public Management doctrine that has underpinned administrative
reforms since the 1980s. We then describe a case of a Finnish local government where performance measurement
results are used for organizational learning and performance improvement. The managers are part of an ongoing
culture change that focuses on internal learning as a core value (Laihonen & Mäntylä, 2017). We reflect on the
contemporary reality for many public sector and nonprofit organizations: What happens to performance
measurement systems when governments experience chronic fiscal restraint, and where the main focus reverts to a
concern with resources (inputs) and less on outputs and outcomes?

Using Performance Measures
The performance management cycle that we introduced in Chapter 1 is a normative model. It displays intended
relationships between evaluative information and the four phases of the cycle: (1) strategic planning and resource
allocation; (2) policy and program design; (3) implementation and management; and (4) assessment and reporting of results.
Given the research that has been done that examines the way the performance management cycle “closes the
loop”—that is, actually makes performance results available to decision makers—it is appropriate to assess the
model in terms of performance measurement systems.

As part of our assessment, recall that in Chapter 9, we introduced the rational/technical lens and the
political/cultural lens through which we can “see” organizations and the process of developing and implementing a
performance measurement system. The model of the performance management cycle we introduced in Chapter 1
begins with a rational/technical view of organizations. This model of performance management places emphasis
on the assumption that people will behave as if their motives, intentions, and values are aligned with the
rational/technical “systems” view of the organization.

The political/cultural lens introduced in Chapter 9 suggests a view of performance management that highlights
what we will consider in the examples in Chapter 10. Performance measurement and public reporting systems that
are high stakes can encounter significant problems in terms of the ways that performance information is created,
compiled, reported, and actually used over time (Boswell, 2018). Some results are broadly consistent with the
normative performance management model, in that high-stakes public reporting does initially improve
performance (at least on the measures that are being specified). However, other behaviors can work to undermine
the accuracy and credibility of such systems, unless measurement systems and results are monitored/policed
through processes such as regular external audits. Audit and performance assessment systems are often costly to
operate, affecting the sustainability of these systems over time.

Figure 10.1 displays our original performance management cycle introduced in Chapter 1 but includes additional
details that highlight organizational politics and incentives. What Figure 10.1 highlights is that a plan to design
and implement a performance management system will of necessity need to navigate situations in which people,
their backgrounds and experiences, their organizational culture, and their sense of “who wins and who loses?” and
“what does this change do to our organization’s prospects?” will be key to how well, if at all, the performance
management system actually “performs.”

Figure 10.1 Public Sector Accountability and Performance Management: Impacts of Incentives and
Organizational Politics

One way to look at the performance management cycle that is depicted in Figure 10.1 is that in settings where the
stakes are high—the political culture is adversarial, the media and other interests are openly critical, and the public
reporting of performance results is highly visible (in other words, closer to the “naming and shaming” system that
was implemented in England between 2000 and 2005)—it is more likely that unintended effects in the
performance management cycle will emerge. Otley (2003) has suggested that over time, as gaming becomes more
sophisticated, it is necessary to counter it with more sophisticated monitoring and control mechanisms. In the
following sections, we take a closer look at three performance measurement systems, exemplifying varying levels of
external accountability pressures.

Performance Measurement in a High-Stakes Environment: The British
Experience
Britain is often cited as an exemplar of a national government committing to performance measurement and
performance management for public accountability and performance improvement. Pollitt, Bal, Jerak-Zuiderent,
Dowswell, and Harrison (2010), Bevan and Hamblin (2009), Hood, Dixon, and Wilson (2009), Le Grand
(2010), Bevan and Wilson (2013) and Pollitt (2018), among others, have examined the British experience of using
performance measurement and public reporting as a means to manage and improve governmental performance.
Over time, British governments have taken different approaches to performance measurement and public
reporting—we will look at research that shows how well they have worked to improve performance, given the high
priority placed on using accountability to drive performance.

Le Grand (2010) suggests that public servant motivation is a key variable that should be considered when
governments design performance systems. He terms the four governance models "trust, mistrust, voice, and
choice” (p. 67). A traditional assumption was that public servants were “knights” who were motivated by trust to
“do the right thing” (serve the public interest); they would be motivated intrinsically to improve performance to
meet targets. The United Kingdom and most of the Western world took a "public choice" turn in the early 1970s,
viewing public servants as more self-interested than altruistic, concerned individually and organizationally with
maximizing their own “utility” (Niskanen, 1971). Thus, the model moved from “trust” to “mistrust.” We have
discussed the rise of New Public Management in earlier chapters.

When we examine the British approach to performance management, we can see that three different approaches
were tried at different times and places: the first approach, in place before the Blair government came to power in
1997, involved performance results—to the extent that they were available—being used to inform and induce
improvements through target setting (Hood, Dixon, & Wilson, 2009). Pollitt et al. (2010) point out that the first
performance measurement system in the National Health Service (NHS) was developed in 1983 but was initially
intended to be formative, to be used by managers to monitor programs and improve performance locally,
notwithstanding the fact that national performance targets were part of the development of this system. In the
NHS, this approach gave way to the first published comparisons of performance results across health regions in
1994; this was the first version of the “league tables” approach that was used more widely later on.

Beginning in 1997 when the Labour government under Tony Blair was first elected, performance measurement
was a high priority. There had been earlier efforts in the NHS to implement results-based management regimes
(Pollitt et al., 2010), but the New Labour Government expanded the scope of performance measurement and
target setting. Initially, there was an emphasis on constructing performance measures for government departments
and agencies, setting targets, and reporting actual results compared with targets. This approach was used for about
3 years (1997–2000), and assessments of its results suggested that performance had not
improved, even though more money had been put into key services such as health and education (Le Grand,
2010).

By 2000, the Blair government had decided to use a second model, a much more centralized and directive
approach to improving public accountability and performance management. For the next five years (2000–2005),
high-stakes public rating and ranking systems were widely implemented (health, education, local government,
and police were among the organizations and services targeted). This regime has been called the
“targets and terror” approach to performance management (Bevan & Hamblin, 2009) and epitomized a high-
stakes approach to measuring and reporting performance to achieve public accountability and performance
improvements.

In the health sector, for example, the heart of this approach was a star rating system wherein public organizations
(hospitals and ambulance services were among health-related organizations to be targeted) were rated on their
overall performance from zero stars up to three stars. This public accountability approach was first applied in acute
care hospitals in 2001 in England and then extended in 2002 to cover ambulance services in England (Bevan &
Hamblin, 2009). Eventually, it was implemented in other parts of the public sector, including local governments
(McLean, Haubrich, & Gutierrez-Romer, 2007). The mechanism that was integral to the star rating system was to
challenge the reputation of each organization by publicizing its performance (Bevan & Hamblin, 2009; Hibbard,
2008). Hibbard, Stockard, and Tusler (2003) specify four criteria that are considered necessary to establish an
effective ranking system that has real reputational consequences for the organizations that are rated and ranked
(see also Bevan & Hamblin, 2009):

1. A ranking system must be established for the organizations in a sector.
2. The ranking results need to be published and disseminated widely.
3. The ranking results need to be easily understood by the public and other stakeholders so that it is obvious
which organizations are top performers and which are not.
4. Published rankings are periodically followed up to see whether performance has improved; one way to do
this is to make the rankings cyclical.

The process by which organizations (e.g., hospitals) were rated and ranked is detailed by Bevan and Hamblin
(2009). For hospitals, approximately 50 measures were used to rate performance, and these measures were then
aggregated so that for each hospital a “star rating” was determined. The performance measures for the hospitals
were based primarily on administrative data collected and collated by an independent agency. This agency (the
Healthcare Commission) assessed performance and announced star ratings publicly. The star ratings were
published in a league table that ranked all hospitals by their star ratings. These tables were widely disseminated.
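The aggregation step can be illustrated schematically. The Python sketch below bands a simple average of measure scores into a zero- to three-star rating; the measure names, scores, and cut-offs are invented for illustration and do not reproduce the Healthcare Commission's actual scoring rules, which drew on roughly 50 measures.

```python
def star_rating(measure_scores):
    """Band the mean of 0-100 measure scores into a 0-3 star rating (illustrative cut-offs)."""
    overall = sum(measure_scores.values()) / len(measure_scores)
    if overall >= 85:
        return 3
    if overall >= 70:
        return 2
    if overall >= 55:
        return 1
    return 0

# Hypothetical scores for one hospital on a handful of measures
hospital = {"waiting_times": 82, "cleanliness": 74, "cancelled_operations": 68,
            "staff_vacancies": 71, "financial_position": 77}
print(star_rating(hospital), "stars")  # -> 2 stars
```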

Bevan and Hamblin (2009) summarize the impacts of the first published three-star rankings in 2001 for acute care
hospitals: “the 12 zero-rated hospitals [in that year’s ratings] were described by the then Secretary of State for
Health as the ‘dirty dozen’; six of their chief executives lost their jobs” (p. 167). In 2004:

the chief executives of the nine acute hospitals that were zero rated, were “named and shamed” by the
Sun (on October 21st, 2004), the newspaper with a circulation of over 3 million in Britain: a two-page
spread had the heading “You make us sick! Scandal of Bosses running Britain’s worst hospitals” and
claimed that they were delivering “squalid wards, long waiting times for treatment and rock-bottom
staff morale.” (p. 167)

The whole process was high stakes for the organizations being rated and for their managers. It had real consequences.
What made the British approach to performance management unique, and offers us a way to see what difference it
made, is that the star rating system was implemented in England and not in Wales or Scotland; those latter two
countries (within Britain) controlled their own administration of all health-related organizations and services
(starting in 1999), even though the funding source was the (British) NHS (Propper, Sutton, Whitnall, &
Windmeijer, 2010).

In Wales and Scotland, there were performance targets and public reporting, but no rankings and no regimes of
“naming and shaming” or “targets and terror.” Bevan and Hamblin (2009) take advantage of this natural
experiment to compare performance over time in England with that in Wales and Scotland. What they
discovered was that with the oversight in place, the English approach had an effect: There were measurable
improvements in performance in England’s hospitals that did not occur in either Wales or Scotland (see also:
Pollitt, 2018).

A second natural experiment was also evaluated by Bevan and Hamblin (2009). For ambulance services, England
implemented a high-stakes summative star rating performance measurement system, whereas Scotland and Wales
did not. For emergency (Category A) calls, the UK NHS had established a target of having 75% of those calls
completed in eight minutes or less in England, Scotland, and Wales. In England, by 2003 most ambulance
organizations reported achieving that target (Bevan & Hamblin, 2009), but neither Wales nor Scotland did. The
star rating system in England again apparently produced the desired result of improving performance.

However, the star rating system also produced substantial unintended effects, which we will explore in greater
detail in the next sections of this chapter. In 2005, there was a national election in Britain and one of the issues
that surfaced was the unintended side effects of the performance management regime that was in place. As an
example: England's star rating system included measurement of whether patients were offered a doctor's
appointment within two working days of a request. Bevan and Hood (2006) report one politically detrimental
incident involving the prime minister, Tony Blair, and a questioner on the campaign trail:

In May 2005, during the British general election campaign, the prime minister was apparently
nonplussed by [perplexed about how to respond to] a complaint made during a televised question [and
answer] session that pressure to meet the key target that 100% of patients be offered an appointment to
see a general practitioner within two working days had meant that many general practices refused to
book any appointments more than two days in advance. A survey of patients found that 30% reported
that their general practice did not allow them to make a doctor’s appointment three or more working
days in advance. (p. 420)

The star rating system was scaled back in 2005 when the government was re-elected. Target setting and public
reporting were continued, but the public "naming and shaming" aspects of the system were largely abandoned in
favor of a less confrontational approach. This third model (continuing to the present) is roughly similar to
the first one in that objectives, targets, and reporting are all mandated, but not used in such a high-stakes manner.
But elements of the league tables system have carried over (Gibbons, Neumayer, & Perkins, 2015).

Assessing the “Naming and Shaming” Approach to Performance
Management in Britain
Bevan and Hamblin (2009), Otley (2003), Pollitt et al. (2010), and others have commented on problems that can
arise when performance measurement and management regimes utilize the kind of “naming and shaming”
strategies that were central to the English approach between 2000 and 2005. Bevan and Hamblin (2009) suggest
several problems with the high-stakes star rating system that was adopted in Britain. We will highlight three
problems here.

The first problem is that what gets measured matters, and by implication, what is not or cannot be measured does
not matter and may be neglected. A phrase that has been used to characterize this situation is “hitting the target
and missing the point” (Christopher & Hood, 2006). Wankhade (2011) looked at the English ambulance service
and found that the dominant focus on a response time target of eight minutes for “Category A” emergency
ambulance calls distorted the real work that was being done and forced these organizations to devalue the
importance of patient outcomes.

The second problem is related to the first one in that picking key performance measures often misrepresents the
complexity of the work being done by public organizations (Bevan & Hood, 2006; Himmelstein, Ariely, &
Woolhandler, 2014; Jakobsen et al., 2017). Picking performance measures is at least in part opportunistic;
measures represent values and priorities that are politically important at a given time but may not be sound
measures of the performance of core objectives in organizations.

The third problem is perhaps the most significant one: gaming performance measures is a widespread problem
and has been linked to the lack of credibility of performance results in many settings (Bevan & Hamblin, 2009;
Bevan & Hood, 2006; Christopher & Hood, 2006; Hood, 2006; Lewis, 2015). We will look more carefully at
gaming as an issue in the several NHS-related studies covered earlier, and then summarize another case that
illustrates gaming, based on Otley’s (2003) research on the coal mining industry in Britain.

Propper and Wilson (2003) describe gaming behaviors in terms of the relationship between principals (political
decision makers or executives) and agents (those who are actually delivering the programs or services): “As the
principal tries to get higher effort (and so better public services) by implementing performance measurement, the
response may be better [measured] services but also may be other less desired behaviour” (p. 252). Gaming occurs in situations where unintended behaviors result from the implementation of performance measurement systems;
the incentives actually weaken or undermine the intended uses of the system (Christopher & Hood, 2006).

In their examination of English ambulance services during the high-stakes regime from 2000 to 2005, Bevan and
Hamblin (2009) point out that many ambulance services (about one third) were manually “correcting” their
reported response times to come in “on target.” Furthermore, in an audit that was conducted in 2006 after
whistle-blowers had contacted the counter fraud service, the Department of Health “reported . . . that six of 31
trusts [ambulance organizations] had failed accurately to record the actual response times to the most serious life-
threatening emergency calls” (p. 182).

Figures 10.2 and 10.3 have been reproduced from Bevan and Hamblin (2009) and offer a visual interpretation of
gaming behaviors in the English ambulance trusts during the time that the high-stakes performance management
system was in place. First, Figure 10.2 illustrates a frequency distribution of ambulance response times (in
minutes), taken from one service (trust) that indicates a fairly linear distribution of the frequency of response times
and corresponding numbers of calls for service. This overall pattern suggests that the ambulance trust is reporting
response times accurately; there is no visible change in frequency of calls around the 8-minute target that was the
core of the system for ambulance services in England.

Figure 10.2 A Distribution of Response Times for One English Ambulance Service: No Gaming Is Evident

Source: Bevan and Hamblin (2009, p. 178).

Figure 10.3, in contrast, indicates a marked difference from the overall linear pattern of the frequency of
ambulance response times and the number of calls for service. Up to the 8-minute performance target, there are
apparently more ambulance calls the closer the response times are to that target. But beyond the target, the
response frequency drops off dramatically. The pattern in Figure 10.3 strongly suggests that ambulance response
times are being “adjusted” (gamed) so that they meet the 8-minute threshold; it is unlikely that the discontinuity
in Figure 10.3 could have occurred by chance.
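To make the kind of discontinuity shown in Figure 10.3 concrete, the short sketch below bins synthetic response times into one-minute intervals and compares the counts just below and just above the 8-minute target. The data, the share of “adjusted” calls, and the bin width are all invented for illustration; the point is only that a sharp drop across the target, much larger than between neighbouring bins, is the statistical footprint of gaming.

    # Illustrative sketch only: synthetic response times, not the NHS data.
    import random

    random.seed(42)

    # Simulate "adjusted" reporting: many calls that would miss the 8-minute
    # target are re-recorded as coming in just under it.
    true_times = [random.uniform(2, 20) for _ in range(5000)]
    reported = [t if t <= 8 or random.random() > 0.6 else random.uniform(7, 8)
                for t in true_times]

    def bin_counts(times, width=1.0, max_minutes=20):
        """Count calls in fixed-width response-time bins (minutes)."""
        counts = [0] * int(max_minutes / width)
        for t in times:
            idx = min(int(t / width), len(counts) - 1)
            counts[idx] += 1
        return counts

    counts = bin_counts(reported)
    just_under, just_over = counts[7], counts[8]   # the 7-8 and 8-9 minute bins
    drop = (just_under - just_over) / just_under
    print(f"Calls 7-8 min: {just_under}; calls 8-9 min: {just_over}; drop: {drop:.0%}")
    # A drop across the target far larger than between other adjacent bins is
    # the kind of pattern Bevan and Hamblin interpret as evidence of gaming.

A distribution like the one in Figure 10.2, by contrast, would show no unusual change in the bin counts around the 8-minute mark.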

Figure 10.3 Distribution of Response Times for One English Ambulance Service: Gaming Is Evident

Source: Bevan and Hamblin (2009, p. 179).

Hood (2006) points out that gaming was either not anticipated as the transition to high-stakes performance
measurement and reporting was made in Britain in 2000 or was possibly downplayed by those who had a stake in
“meeting the targets.” Hood puts it this way:

Why was there no real attempt to check such data properly from the start? The slow and half-hearted
approach to developing independent verification of performance data itself might be interpreted as a
form of gaming by the central managers (like the famous English admiral, Horatio Nelson, who put a
telescope to his blind eye to avoid seeing a signal he did not want to obey). (pp. 519–520)

Hood (2006) has suggested three general categories of gaming behaviors based on his research on performance
management in Britain. Ratchet effects occur when organizations try to negotiate performance targets that are easier to attain. An example from Bevan and Hamblin (2009) was the Welsh ambulance service, which could not meet the NHS target of reaching 75% of Category A calls within eight minutes and, over several successive years, succeeded in getting that target lowered year over year.

Threshold effects occur when a performance target results in organizational behaviors that distort the range of
work activities in an organization. Hood (2006) gives the example of

… schools that were set pupil-attainment targets on test scores, leading teachers to concentrate on a
narrow band of marginal students who are close to the target thresholds and to give proportionately less
attention to those at the extreme ends of the ability range. (p. 518)

The third kind of gaming is arguably the most important of the three proposed by Hood (2006). Output
distortions occur in situations where performance results are “adjusted” so that they line up with expectations.
Bevan and Hamblin (2009) quote Carvel (2006) who examined the actual methods used to measure English
ambulance response times:

Some did not start the clock as soon as a 999 call was received. Others did not synchronize the clocks in the emergency switchboard with those used by the paramedics. In some cases, ambulance organizations
re-categorized the urgency of the call after the job was done to make it fit the response time achieved
rather than the priority given when the original call was made. This would allow staff to downgrade an
emergency if the ambulance arrived late. (Bevan & Hamblin, 2009, p. 182)

Below, we examine another case of output distortion, in a coal mining setting.

A Case Study of Gaming: Distorting the Output of a Coal Mine


Otley (2003) recounts a story based on his early experience as a British mining engineer. His first project was to
develop a computer model of production in a coal mine. Using an existing model of how a single coal face
operated, he extended this model to a whole mine. Validating the model involved comparing the model’s
predicted mine outputs with data from the actual mine. The model predicted average output quite well but could
not predict the variability in output. Since the model was intended in part to assist in the design of an
underground transportation system, peak loads needed to be accurately estimated.

Otley assumed that he had made some kind of programming error; he spent several weeks searching for such an
error, to no avail. He decided to look at the actual raw data to see if anything emerged. The weekly data had
patterns. The mining output data showed that for a typical Monday through Thursday, actual tonnes of coal
produced conformed pretty closely to a budgeted target for each day. But on Friday, the actual tonnes could be
anything from much more to much less than the daily average. It turned out that the mine managers knew that
for every day of the week but Friday, they could report an output to headquarters that was close to the budgeted
output because the actual tonnes were only totaled up on Fridays. To reconcile their reported figures with the
weekly total (being on budget with actual production was their performance measure), they approached the Friday
output figure creatively: The mine managers had created an additional way of assuring that they met the weekly
production targets.

At the bottom of the mine shaft there was a bunker that was intended to be used to store coal that could not be
transported to the surface during a given day. The bunker was supposed to be emptied on Friday, so that it could
be used to buffer the next week’s daily production—the hoist that brought the coal to the surface was a
bottleneck, and the bunker was a way to work with this problem. But Otley discovered that the bunker was often
full on Monday mornings; the managers had determined that having a full bunker to start the week meant that
they had a leg up on that week’s quota, and since the penalty for under-producing was greater than any
consequence for overproducing, they responded to the incentives.
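A small simulation with invented numbers can show why Otley’s model reproduced the average output but not its variability. If managers report roughly the budgeted tonnage Monday through Thursday and let Friday absorb the difference between what was reported and what was actually produced, the weekly totals reconcile while nearly all of the day-to-day variance is pushed into Friday’s figure.

    # Illustrative sketch with invented figures, not Otley's actual mine data.
    import random
    import statistics

    random.seed(1)
    BUDGET = 1000          # budgeted tonnes per day (hypothetical)
    DAYS = 5               # Monday to Friday

    weeks_reported = []
    for _ in range(52):
        actual = [random.gauss(BUDGET, 150) for _ in range(DAYS)]
        # Monday-Thursday: report close to budget regardless of actual output.
        reported = [BUDGET + random.gauss(0, 10) for _ in range(DAYS - 1)]
        # Friday: reconcile so the reported week matches actual weekly output.
        reported.append(sum(actual) - sum(reported))
        weeks_reported.append(reported)

    mon_thu = [tonnes for week in weeks_reported for tonnes in week[:-1]]
    fridays = [week[-1] for week in weeks_reported]
    print("Mon-Thu reported: mean %.0f, st. dev. %.0f"
          % (statistics.mean(mon_thu), statistics.stdev(mon_thu)))
    print("Friday reported:  mean %.0f, st. dev. %.0f"
          % (statistics.mean(fridays), statistics.stdev(fridays)))
    # Daily figures look "on budget," but Friday carries nearly all of the
    # variance, which is what defeats a model calibrated to predict peak loads.

The bunker carryover that Otley describes would add a further distortion, shifting part of one week’s output into the next; the sketch omits it for simplicity.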

Mine managers had developed ways to game the performance measure for which they were accountable. For
Otley’s modeling process, the output data were not sufficiently accurate to be useful. This case study offers us a
simple example of how performance measurement, coupled with consequences for managers, can result in
distortions of performance results. Gaming continues to be a challenge for performance measurement systems
where targets and results have consequences for organizations, politicians, and for individual managers (Kelman &
Friedman, 2009; Kroll, 2015; Jakobsen et al., 2017; Moynihan, 2009).

Performance Measurement in a Medium-Stakes Environment: Legislator
Expected Versus Actual Uses of Performance Reports in British Columbia,
Canada
In 2000, the British Columbia (B.C.) Legislature passed the Budget Transparency and Accountability Act
(Government of BC, 2000), a law mandating annual performance plans (“service plans”) and annual performance
reports (“annual service plan reports”) for all departments and agencies. The law was amended in 2001
(Government of BC, 2001), and the first annual service plan reports were completed in June 2003. The timing of
this change made it possible to design an evaluation that had this research design:

O   X   O   X   O

The key outcome variables (the O’s in the research design) were: legislator expected uses (the first O) of the public
performance reports they would receive in 2003 (the first such round of reports), and then the later reported uses
of the reports in 2005 (the third round of reports) and 2007 (the fifth round of public reports). The X’s were the
annual performance reports that were based on organizational performance measures and reported results in
relation to targets. Although guidelines for constructing the performance reports changed somewhat over time (e.g., the number of measures for each ministry was reduced), the legislative requirements of the annual cycle did not change.

McDavid and Huse (2012) sent anonymous surveys to all elected members of the legislature in 2003, 2005, and
2007. In 2003 and 2005, the legislature was dominated by one political party (77 of the 79 members were Liberal Party members). We did not survey the two opposition (New Democratic Party) members, mainly to respect their anonymity. By 2007, a provincial election had been held and a substantial number of Opposition members had been elected, so all Liberal and NDP members were surveyed. Table 10.1 summarizes the response rates to the
three surveys.

Table 10.1 Legislator Response Rates for the 2003, 2005, and 2007 Surveys

                                                            2003    2005    2007
Total number of MLAs in Legislature                           79      79      79
Total number of survey respondents                            36      27      30
Total percentage of responding MLAs in the Legislature      45.6    34.2    38.0

Source: McDavid and Huse (2011, p. 14).

In each of the three surveys, the same outcome measures were used. The only difference between the 2003 survey
and the later two was that in 2003 the Likert statements were worded in terms of expected uses of the performance
reports since legislators had not received their first report when the survey was fielded. Fifteen separate Likert
statements were included, asking politicians to rate the extent to which they used (or, in the first survey, how they
expected to use) the public performance reports for those 15 purposes. Figure 10.4 shows the format and content
of the Likert statements in the first survey of expected uses, and the same 15 items were used in the subsequent
surveys of reported uses. The Likert statements were later clustered, for analysis, into five indices: (1) accountability
uses, (2) communications uses, (3) improving efficiency and effectiveness uses, (4) making policy decision uses, and (5)
making budget decision uses.
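To show the mechanics of rolling 15 Likert items up into the five indices, the sketch below averages item ratings within assumed clusters. The item-to-index assignments and the response values are hypothetical; they are not the actual coding used by McDavid and Huse.

    # Hypothetical item-to-index mapping, for illustration only.
    from statistics import mean

    INDEX_ITEMS = {
        "accountability uses": ["q1", "q2", "q3"],
        "communications uses": ["q4", "q5", "q6"],
        "efficiency and effectiveness uses": ["q7", "q8", "q9"],
        "policy decision uses": ["q10", "q11", "q12"],
        "budget decision uses": ["q13", "q14", "q15"],
    }

    def index_scores(response):
        """Average the 1-5 Likert ratings within each assumed cluster of items."""
        return {name: round(mean(response[item] for item in items), 2)
                for name, items in INDEX_ITEMS.items()}

    # One invented respondent, rated on a 1 (not at all) to 5 (great extent) scale.
    respondent = {f"q{i}": score for i, score in
                  enumerate([4, 5, 4, 3, 3, 2, 2, 2, 3, 1, 2, 2, 1, 1, 2], start=1)}
    print(index_scores(respondent))

Index scores computed this way can then be averaged across respondents and compared across survey years, which is essentially what Figures 10.5 through 10.7 display.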

Figure 10.4 Format for the Survey Questions on Expected Uses of Performance Reports in 2003

Survey responses from the governing party were grouped to distinguish between cabinet ministers (politicians who were the heads of departments and agencies) and backbenchers (elected officials in the governing party who had
no departmental oversight responsibilities). Figures 10.5 (cabinet ministers) and 10.6 (backbenchers) display key
findings from the three surveys for the governing (Liberal) party.

If we look at Figures 10.5 and 10.6 together, we see several trends. Initial expectations in 2003 about ways that
performance reports would be used were high. Cabinet ministers had even higher expectations than did their
backbench colleagues in the governing party. The drops from 2003 to actual reported uses in 2005 and 2007 are
substantial. For three of the clusters of uses (communication uses, efficiency and effectiveness uses, and policy uses),
when the 2005 and 2007 levels are averaged, the drops for cabinet ministers were greater than 50%. The overall pattern for backbench government members is similar to that for cabinet ministers, although when the two groups of elected officials are compared, the drops in reported uses were larger for cabinet ministers than for backbench members.

Figure 10.5 Clusters of Performance Reports Uses for Cabinet Ministers in 2003, 2005, and 2007

Source: McDavid and Huse (2011, p. 15).

Figure 10.6 Clusters of Performance Reports Uses for Liberal Backbench Members of the Legislature

Source: McDavid and Huse (2011, p. 16).

In the provincial election in the spring of 2005 (after the second survey was completed), the New Democratic
Party won 33 of the 79 seats in the legislature and so formed a larger opposition. Figure 10.7 compares the reported
uses of performance reports in 2007 by Government members of the legislature and members of the Opposition
party. Comparisons indicated that Opposition members generally used the reports less than Government
members. Although the responses indicated that Opposition members used the reports for general accountability purposes more than Government members did, the difference was not statistically significant. Overall,
the reports appeared to be relatively less useful for opposition members in their roles as critics of government
policies and programs.

Figure 10.7 Government (Liberal) and Opposition (New Democratic Party) Uses of the Performance
Reports in 2007

Source: McDavid and Huse (2011, p. 17).

The findings from this study are generally consistent with reports of under-utilization or even non-utilization of
public performance reports by elected officials elsewhere (Barrett & Greene, 2008; Bouckaert & Halligan, 2008;
Raudla, 2012; Steele, 2005; Sterck, 2007). The picture of legislator uses of public performance reports, based on
this empirical study, suggests that although there were legitimately high expectations before the first performance
reports were seen—expectations that reflect the intended uses of performance reports that have been a part of the
NPM literature—the two subsequent rounds of actual reports were not used nearly as much as had been expected
(McDavid & Huse, 2012).

If elected officials are not using performance reports or are using them sparsely in their roles and responsibilities,
an important link in the intended performance management cycle is weakened. As well, expectations that public
accountability will drive performance improvements via legislator scrutiny are then questionable. While there may
be process-related benefits associated with an annual cycle of setting targets, measuring performance against those
targets, and reporting the results (McDavid & Huse, 2012), those are not the real consequences of public reporting
that have been envisioned by advocates of this approach to public accountability and performance management.

The Role of Incentives and Organizational Politics in Performance
Measurement Systems With a Public Reporting Emphasis
To this point in Chapter 10, we have looked at the link between reporting performance results and the “real
consequences” when those results become public. The high-stakes approach to building in real consequences for
performance reporting—the public “naming and shaming” approach that was implemented in England between
2000 and 2005—appeared to work to improve performance compared with the less heavy-handed approaches
used in Wales and Scotland, but the English system created substantial gaming-related side effects that had to be countered by auditing and that contributed to the system being dismantled in 2005.

Britain has returned to a less high-stakes variant of performance measurement, although the government continues to be committed to targets and performance results (Bewley, George, Rienzo, & Porte, 2016) and, in some sectors, league tables (Gibbons, Neumayer, & Perkins, 2015). In the United States, where the Office of Management and Budget from 2002 to 2009 annually conducted summative assessments of the effectiveness of federal programs using the Program Assessment Rating Tool (PART) process, the administration has also pulled back from this approach—amending the Government Performance and Results Act (GPRA, 1993) in 2010 (GPRA Modernization Act, 2010) to focus more attention on performance management in individual departments and agencies.
Performance measurement is still required, as is reporting on a quarterly basis, but there is more emphasis on
balancing performance measurement and program evaluation (Willoughby & Benson, 2011).

Pollitt et al. (2010), who have extensively examined the British experience with performance management, suggest
that there is a pattern to the development of public performance measurement systems that consists of six stages:

1. The initial few, simple indicators become more numerous and comprehensive in scope;
2. Initially formative approaches to performance become summative, e.g., through league tables or targets;
3. The summative approach becomes linked to incentives and sanctions, with associated pressures for
“gaming”;
4. The initial simple indicators become more complex and more difficult for non-experts to understand;
5. “Ownership” of the performance regime becomes more diffuse, with the establishment of a performance
“industry” of regulators, academic units and others, including groups of consultants and analysts . . . ; and
6. External audiences’ trust in performance data and interpretations of them tends to decline. (p. 19)

What Pollitt et al. (2010) are describing is the evolution of performance measurement systems in adversarial
political cultures, at least in the example of Britain. Kristiansen, Dahler-Larsen, and Ghin (2017) have generalized this logic of escalation to performance management regimes more broadly. Their analysis suggests that “the basic
notion is that once a PI system is in place, there is an endogenous dynamic that results in the multiplication of
PIs, an increased technical elaboration of composite indices, a parallel growth in a specialist technocratic
community of ‘performance experts’ and a tighter coupling of PIs to targets, and targets to penalties and
incentives” (p. 2).

Even in a relatively moderate-stakes system of publicly-reported performance targets and results, such as the one in
the study reported by McDavid and Huse (2012), legislators generally under-utilize public performance reports
for budgetary decisions, policy-related decisions, and for improving efficiency and effectiveness (Barrett & Greene,
2008; Bouckaert & Halligan, 2008; Raudla, 2012; Steele, 2005; Sterck, 2007). Instead, performance reports
appear to be more useful for symbolic accountability and communications with constituents and other
stakeholders.

However, there is another way to achieve a linkage between public performance reporting, accountability, and
performance improvements. Instead of navigating a high-stakes environment for the process, it is possible, in some
settings, to work with a low-stakes approach. We will describe an example of such a setting—a local government
in Western Canada—and will use this case to transition to a way to combine accountability and performance
improvement uses of performance results by developing internal-facing performance measurement systems that operate in parallel to the externally-facing performance measurement and reporting systems.

Performance Measurement in a Low-Stakes Environment: Joining Internal
and External Uses of Performance Information in Lethbridge, Alberta
In Chapter 8, we pointed out that performance measurement had its origins in local governments in the United
States at the turn of the 20th century. Although there were no commonly-available computers or calculators, it
was still possible to construct performance information that included costs of local government services as well as
key outputs and often outcomes (Williams, 2003). When Niskanen (1971) wrote his seminal book on
bureaucracies, one of his recommendations was to increase the role of the private sector in providing government
programs and services. Variations on that recommendation were also made by Bish (1971) and others who were
writing about urban local governments in the United States. Hatry (1974, 1980) was among the first to champion
performance measurement for local governments. Contracting out of local government services became a
widespread practice during the 1970s and 1980s (McDavid, 2001; Savas, 1982, 1987), and part of the success of
that movement was due to the relative ease with which the performance of local government services could be
measured. Generally, local government programs and services have outputs and often outcomes that are tangible,
are countable, and are agreed upon.

Many local governments also deliver programs and services in political environments that are relatively
nonpartisan, or at least do not have a vociferous opposition. Indeed, a key goal of the Progressive Movement in
the United States during the late 1800s to post–World War I was to eliminate political parties from local elections
(Schaffner, Streb, & Wright, 2001), and to introduce businesslike practices into local government. From this
perspective, citizens who are served by local governments can be seen to be consumers who are able to see the
amount and quality of services they receive, and can offer performance feedback via choice, complaints, and other
mechanisms.

Research has been done to look at the way local governments use performance information (Askim, 2007; Kelly &
Rivenbark, 2014; Pollanen, 2005; Spekle & Verbeeten, 2014; Streib & Poister, 1999), and the Hildebrand and
McDavid (2011) case study of the Lethbridge local government systematically looked at how managers and
elected officials in a local government use performance information. Lethbridge is a community of 98,000 people
that is situated in the southern part of the province of Alberta, in Western Canada. Performance measurement had
been widely implemented across the city departments, but the measures had been developed by managers for their
own uses. Although public performance reports were not required, nearly all business units prepared such reports
for City Council each year.

In 2009, eight of nine members of City Council and 25 of 28 departmental managers were interviewed to solicit
their perceptions of the usefulness of performance information for certain purposes (Hildebrand & McDavid,
2011). Table 10.2 compares council members and business-unit managers, using the same interview questions, on several indicators of the perceived usefulness of performance information (rated on a scale of 1 to 5).

Table 10.2 Perceived Usefulness of Performance Information for Council Members and Managers

                                                         Council members   Business-unit managers

Is performance data currently useful for identifying strategic priorities?
  “Moderately useful” to “Very useful”                         63%                 74%
  Mean                                                         3.5                 4.1

Is performance data currently useful for supporting budget decisions?
  “Moderately useful” to “Very useful”                        100%                 91%
  Mean                                                         4.8                 4.8

Is performance data currently useful for supporting program evaluation decisions?
  “Moderately useful” to “Very useful”                         88%                 61%
  Mean                                                         4.1                 3.7

Would the citizenry find performance reports useful?
  “Moderately useful” to “Very useful”                         63%                 35%
  Mean                                                         3.9                 2.7

Source: Reproduced from Hildebrand and McDavid (2011, p. 56).

There is general agreement between councilors and managers on the extent to which performance information is
useful. One difference between councilors and managers was around how useful they thought citizens would find
the performance reports. Council members were more likely than managers to indicate that citizens would find
the reports useful. Because managers in Lethbridge had built the performance measures for their own uses, some
of the measures were technical, reflecting the business of their department, and managers possibly shared the view
that citizens would not find this information useful. Council members, on the other hand, had a perspective that
was perhaps more likely to consider citizens as stakeholders interested in the reports, given that performance
reports were a means to demonstrate the accountability of the city government.

Table 10.3 compares councilor and manager perceptions of the quality and credibility of the performance data
that were produced by departments in the city. Both council members and managers generally agreed on the high
quality of the performance information they were using. The one difference was around the extent to which
performance information was accurate; council members were more likely to take a “neutral” position, but when
asked to rate overall data believability, council members responded that they trusted the information produced by
their managers.

Table 10.3 Council Member and Manager Perceptions of Performance Data Quality and Credibility

To what degree do you agree with the following:
                                                         Council members   Business-unit managers

The performance data are relevant.
  “Agree” to “Strongly agree”                                 100%                100%
  Mean                                                         4.3                 4.6

The performance data are easy to understand.
  “Agree” to “Strongly agree”                                  75%                 91%
  Mean                                                         3.9                 4.1

The performance data are timely.
  “Agree” to “Strongly agree”                                  75%                 73%
  Mean                                                         4.0                 3.8

The performance data are accurate.
  “Agree” to “Strongly agree”                                  50%                 95%
  Mean                                                         3.8                 4.2

Overall, the performance data are believable.
  “Agree” to “Strongly agree”                                 100%                100%
  Mean                                                         4.4                 4.5

Source: Reproduced from Hildebrand and McDavid (2011, p. 59).

Both council members and managers were asked to respond to a Likert statement about the extent to which they
were concerned about publicly reporting performance results that are not positive. Their choices were: 1 = not at
all, 2 = hardly any degree, 3 = some degree, 4 = moderate degree, or 5 = great degree. Overall, the responses from
council members and managers were quite similar; neither group was substantially concerned about reporting
results that are not positive. Council members were somewhat more concerned than managers: 25% of them expressed at least a moderate degree of concern (response mean of 2.75), versus 16% of the business-unit managers (response mean of 2.04).

The Lethbridge local government case contrasts with high-stakes performance measurement and reporting in
adversarial political environments. In Lethbridge, performance measures had been developed by managers over
time, and public reporting was not the central purpose of the system. Both council members and managers found
performance information credible and useful, and neither group was substantially concerned with publicly
reporting performance results that were not positive. Most important, a non-adversarial political culture facilitated
developing and using performance information for both accountability and performance improvement purposes.
In other words, the absence of a high-stakes “naming and shaming” approach to the public airing of performance
results meant that the same performance data were used both to address public accountability and to improve
performance.

The Lethbridge findings, although they represent only one local government, provide an interesting contrast to
findings from studies of high-stakes and medium-stakes utilization of performance measures. As we have seen,
high-stakes top-down performance measurement and public reporting systems, in particular, have exhibited
significant challenges related to their efforts to increase both public accountability and performance improvement.
As the English experience has suggested, making a “naming and shaming” performance management system work
requires a major investment in monitoring, auditing, and measurement capacities to manage the behavioral side
effects of such an approach. Hood (2006) has suggested that these capacities were not sufficiently developed when
the system was implemented in Britain. Capacity is often a challenge when governments design and implement performance management systems (Bourgeois, 2016).

In Chapter 9, we suggested that unless managers are substantially involved in developing and implementing
performance measurement systems, the measures will quite possibly not be useful for performance
management/performance improvement and will not be sustainable or will just be continued for the production of pro forma performance reports (de Lancer Julnes & Steccolini, 2015; Perrin, 2015; Van Dooren & Hoffman,
2018). Our findings from examining performance measurement and public reporting in Lethbridge, Alberta, also
suggest that in political cultures where the “temperature” is lower, that is, performance results are not treated as
political ammunition, there is a higher likelihood that the goal of simultaneously realizing public accountability
and performance improvement will be achieved. It is important to keep in mind that securing managerial buy-in,
in the initial stages of developing a performance measurement system, is often linked to an initial goal of using
performance information formatively. The pattern summarized above is generally consistent with the evolution of
the performance measurement regime in the case of the British Columbia government (McDavid, 2001), where
an initial formative stage from about 1995, in which individual departments developed their own measures and
shared their experiences, gave way to legislated performance measurement and reporting requirements by 2000.
That summative, target-based public reporting system endures to the present.

Rebalancing Accountability-Focused Performance Measurement Systems to
Increase Performance Improvement Uses
Our discussion so far in Chapter 9 and this chapter has focused on organizations that are designing and
implementing a performance measurement system and doing so from scratch. The 12 steps in Chapter 9 outline a
process that is balanced between working with people and the culture (political and organizational) on the one
hand and rational and technical considerations on the other hand. Our view is that an approach with greater
emphasis on internal management information needs has a better chance of achieving the objective of contributing
to performance improvement uses, which are important to sustainable performance measurement systems.

The British national experience with performance measurement, performance management, and accountability
suggests that such systems are evolutionary. Pollitt et al. (2010) and Kristiansen, Dahler-Larsen and Ghin (2017)
suggest that performance measurement systems, their purposes, and their effects can change over time. One of the
legacies of New Public Management is the core assumption that public sector and nonprofit organizations must be
accountable—that unless there are systems in place to ensure accountability, taxpayer dollars will be wasted and
organizations and their managers will behave in self-interested ways. From this perspective, motivation is
problematical and the “trust and altruism” model that was mentioned by Bevan and Wilson (2013) and Le Grand
(2010) may actually be undermined (Jacobsen, Hvitved, & Andersen, 2014; Pandey & Moynihan, 2006).

Many organizations that might be interested in increasing performance improvement-related uses already have a performance measurement system in place. Because a key driver of existing performance measurement systems is the desire to strengthen accountability (Jakobsen, Baekgaard, Moynihan, & van Loon, 2017), what we have seen so far in Chapter 10 suggests that, in some contexts, focusing foremost on using measures for external accountability
tends, over time, to become counterproductive in supporting performance improvement uses. In other words, an
emphasis on accountability in high-stakes settings (in particular) appears to drive out performance improvement
uses as a realistic objective for performance measurement systems.

Hood and Peters (2004), Pandey (2010) and more recently Jakobsen, Baekgaard, Moynihan and van Loon (2017)
point to a paradox in expecting external performance reporting to drive performance improvements in efficiency
and effectiveness. Others, too, have examined the conundrum; Van Thiel and Leeuw’s (2002) article “The performance paradox in the public sector” has been cited almost 1,000 times. On the one hand, public sector performance systems are ubiquitous; at the same time, they are not consistently effective. Why is that? The Jakobsen et al. (2017) article suggests that because most performance measurement systems with public reporting are focused on external accountability, given the political contexts in which these organizations are embedded, they fail to deliver on their other ostensible purpose—improving performance: “… performance regimes are often experienced as externally imposed standards that encourage passivity, gaming, and evasion, and will therefore never be able to achieve performance gains that depend on purposeful professional engagement” (p. 1).

Jakobsen et al. (2017) identify three ways that performance measurement systems can be developed and implemented. The External Accountability (EA) approach is primarily top-down and is externally mandated. It is focused on external performance-based account-giving. The professional engagement regime (PER) is primarily bottom-up. It relies on those in the organization (managers and workers) taking the lead in developing performance measures of their programs—this approach focuses primarily on performance improvement. The Lethbridge case we described earlier in this chapter is in line with this approach. The advantages of this approach are that it recognizes the challenges of measuring performance in complex organizations (such as human service organizations), and it empowers and motivates organizational participants to get involved in developing, implementing, and using the performance measurement system.

The disadvantage of this approach is that measures that reflect the detailed and particularistic nature of program
delivery (particularly where persons are the main program recipients) may not be suitable for external reporting
purposes.

The third approach is the IL (internal learning) approach. This approach does not give up on the expectation
that there be external accountability—external performance reporting will generally be required given the ubiquity
of that expectation in contemporary administrative and political settings. But internally, there will be processes to
engage organizational managers and even front-line workers as key stakeholders, with a view to encouraging
elaborations of what performance means to them and how it would be measured to make the results useful. In
effect the external reporting requirements would be buffered by developing an internal learning culture wherein
performance information is a resource to support monitoring and decision-making. Jakobsen et al. (2017) suggest
that this IL approach can satisfy both external accountability and performance improvement expectations for
performance measurement systems. They end their discussion with three hypotheses that need to be further
researched but appear to be supported by the evidence at hand:

A shift from an external accountability regime to an internal learning performance regime will increase
autonomous motivation among employees and decrease perceptions of red tape.

A shift from an external accountability regime to an internal learning performance regime will decrease gaming
behavior and increase cooperation and learning.

A shift from an external accountability regime to an internal learning performance regime will increase
organizational performance on some aspects but may decrease performance on previously incentivized
performance measures.

Are there circumstances where a top-down EA approach is more likely to succeed in both its accountability and
performance improvement objectives? Jakobsen et al. (2017) suggest that where organizational programs are not
complex (that is, do not involve delivering programs or services where human services are the main focus), it is possible to design and implement performance measures that work on both counts. Recall Table 9.2 in Chapter 9, where we
compare four types of organizations: coping organizations, craft organizations, procedural organizations, and
production organizations. Among those four types, production organizations come closest to being ideal for EA
performance measurement systems that will deliver both accountability and performance improvement.

An example might be a highway maintenance program where crews and equipment patch the roads in a given
region of the province or state, as well as clearing snow, ensuring that there are no hazards on the roads, and monitoring road conditions to identify areas that need repairs such as resurfacing. The work is mostly technical, and although crews will interact with the driving public, the main tasks are routine. Performance measures include road roughness, traffic delays, responsiveness in mitigating road hazards, accidents, and user satisfaction.
Data for these and other measures would be gathered from different sources and, generally, road maintenance
crews can be monitored to see whether they are doing their work. Principal-agent problems are relatively
manageable.
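As a sketch of what such an EA-style measure set could look like when written down, the snippet below specifies each hypothetical measure with a unit, a target, and a rule for judging whether the reported result is on target. The measure names, targets, and results are invented for illustration and are not drawn from any actual highway maintenance program.

    # Invented measures and values for a hypothetical highway maintenance program.
    from dataclasses import dataclass

    @dataclass
    class Measure:
        name: str
        unit: str
        target: float
        result: float
        higher_is_better: bool = False

        def on_target(self) -> bool:
            # For most of these measures, lower values are better.
            return (self.result >= self.target if self.higher_is_better
                    else self.result <= self.target)

    measures = [
        Measure("Road roughness", "IRI, m/km", target=2.5, result=2.3),
        Measure("Traffic delay from maintenance work", "minutes", target=10, result=12),
        Measure("Response time to reported hazards", "hours", target=4, result=3.5),
        Measure("Accidents attributable to road condition", "per 100 km", target=1.0, result=0.8),
        Measure("User satisfaction", "% satisfied", target=80, result=84, higher_is_better=True),
    ]

    for m in measures:
        status = "on target" if m.on_target() else "off target"
        print(f"{m.name}: {m.result} {m.unit} (target {m.target}) - {status}")

For a production organization, measures like these can be defined once, populated from routine administrative data, and reported externally with relatively little distortion; the same exercise is far harder for the coping organization described next.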

In contrast, a social service agency that delivers programs to at-risk families would be closer to a coping
organization (the environment is turbulent, and the program is complex). Performance measurement in such
organizations is challenging. On the one hand, governments expect that programs are delivered efficiently and effectively, and to that end performance measurement systems are designed in part to monitor interactions between front-line workers and their clients—this is an EA approach that has been drilled down to the front-line workers. But social worker-client interactions are typically diverse and cannot easily be categorized. Performance measures from a front-line, bottom-up perspective would be richer, more qualitative, and linked
directly to serving individual client needs.

The top-down and bottom-up perspectives could produce performance measures that are different. In Chapter 9,
we included a case where an organization that had an existing performance improvement-focused measurement
system tried to build an externally-focused system for public reporting purposes. What became evident when the managers and executives tried to come up with one set of measures was that the bottom-up and top-down priorities substantially talked past each other.

The Internal Learning approach to building performance measurement systems acknowledges that these systems will have an external (accountability-focused) and an internal (learning-focused) face. How do these two
perspectives mesh? Where in the organization hierarchy do the internal performance measures and products meet
the external performance measures and related products? Jakobsen et al. (2017) do not address this issue in their
synthesis but we will offer an option.

One response to this question is that the two systems would exist in parallel. In effect the internal learning focus of
the performance measurement system would be decoupled from the external accountability focus. Decoupling has
been suggested as a strategy for reducing risks and increasing the likelihood that performance information will be
used by managers and others inside organizations. Kettl and Kelman (2007), Johnsen (1999, 2005), Brignall and
Modell (2000), Rautiainen (2010), and McDavid and Huse (2012), among others, have suggested that
decoupling is a strategy for realizing both the accountability and performance improvement objectives of
performance measurement systems. Decoupling means that public performance reporting is largely separated from
the internal performance measurement and management activities in departments and agencies. Information that
is developed for internal uses may then be viewed as being more trustworthy by managers because it is not
intended to be made public. It can be used formatively. Managers may create their own databases that are distinct
from existing organization-wide performance systems and even prepare their own internal performance reports
(Hatry, 2006). Gill (2011), in a large-scale study of the New Zealand public service, reports that some managers
have developed their own information sources and have at least partially decoupled internal uses of performance
information from external accountability-related reporting requirements.

Brignall and Modell (2000), in their general review of both private sector and public sector performance
measurement strategies, suggest that in settings where there is more dissonance between the expectations of
funders and internal stakeholders who are delivering programs, it can make sense for management to decouple
performance measures to be able to balance their interests. For us, that suggests that in high-stakes EA settings, it
may be logical to decouple internal and external performance measures; that would seem to be a way to buffer the
internal organization and make some progress toward building and sustaining a learning culture.

Making Changes to a Performance Measurement System
If an organization wants to change or re-orient its existing performance measurement system, the 12 steps that we introduced in Chapter 9 need to be modified. In the Summary to Chapter 9 we pared down the 12 steps to 6 essential ones: sustained leadership; good communications; clear expectations for the system; resources and planning; logic models; and a credible measurement process. Although all 12 steps are useful in situations where a performance measurement system is being re-focused (re-balanced so that performance improvement uses are a key goal), two other steps besides the six that are “core” stand out.

One additional step is taking the time to understand and assess the organizational history around performance
measurement and other related activities to gauge their impacts on the culture. Key to building an internal
learning culture is trust and trust-building—something that existed in the Lethbridge case.

Trust is prominent in the literature on the uses of performance information (Hildebrand & McDavid, 2011;
Kroll, 2015; Moynihan & Pandey, 2010). Van Thiel and Yesilkagit (2011) suggest the importance of trust as a
‘new mode of governance’ and show how trust can be built by politicians to manage agencies. Trust is a core value
in the new value-based approach to public governance (Bao, Wang, Larsen, & Morgan, 2013). We understand
trust to be an aspect of a relationship between actors (Schoorman, Mayer, & Davis, 1995) that leads to the
reciprocal expectation of two parties that the other one will not exploit potential weaknesses (Das & Teng, 2001;
van Thiel & Yesilkagit, 2011).

If the culture has been weakened in that respect (for example, by a high-stakes performance system that has
created incentives for gaming performance measures), then a critical part of re-focusing the performance measurement system is, at least internally, identifying credible internal performance measures that can be used to help improve program effectiveness and thereby begin to rebuild trust in the measurement system.

The second additional step is involving the prospective users in reviewing logic models and other parts of the
performance measurement system. This step goes hand in hand with another step—good communications (two-
way vertical and horizontal communications)—and is essential as a building block toward supporting a learning
culture. Competition based on performance incentives often weakens information sharing and that in turn
weakens any prospects for building a learning culture. Thus, the revised steps that are most relevant for balancing
performance measurement systems are as follows:

1. Sustained leadership: Without this, the change process will drift. Commitment to an internal learning culture
needs to be explicit from the outset. Given that external accountability expectations will continue, leaders
need to be prepared to buffer the organization so that there is space to focus on assessing and then building
toward a learning culture. This leadership is required over a period of 3 to 5 years.
2. Assess the culture: Take the time to understand the organizational history around similar initiatives. Use the
assessment to identify potential allies for the (re)balancing process.
3. Open communications: Open communications are a key part of building toward a view of performance information as a shared resource and not a prospective weapon to be wielded internally. Communications are the vehicle by which intentions are shared, elaborated, and, if necessary, refined. They are essential to developing a common understanding of the process and increasing the likelihood of buy-in.
4. Clear expectations for the system: A commitment to a balanced performance measurement system, where an
external accountability-focused system already exists, will require acknowledging the history of performance
measurement in the organization, its impacts on the way the organization has functioned, and clear
statements around valuing internal learning, including involvement of those in the organization who will be
developing measures and ultimately using performance results in their work. It is essential that managers are
not blindsided as the process unfolds.
5. Resources and planning sufficient to free up the time and expertise as needed: In addition to time and technical
supports, a key resource will be reliable and valid (viewed as such by those in the organization) information
gathering and disseminating capacity, to assess the culture and monitor progress as the performance measurement system is balanced.
6. Logic models that identify the key program and organizational constructs: Logic models that were adequate for
accountability-focused performance measures will not be adequate to support building a learning culture.
The models that are developed will be richer and will likely differ from one part of the organization to the
other (particularly for complex organizations that deliver a range of programs).
7. Open involvement in the process: Key to making this balancing process workable will be a commitment to involving those who are affected by the rebalanced performance measurement system. Key to a useful system will be engagement and buy-in.
8. A measurement process that succeeds in producing valid measures in which stakeholders have confidence: This
requirement underscores the need to invest the time and money in measurement methodologies that yield
results that are viewed by prospective users as being valid and reliable. Without credible measures, it will not
be possible to build toward a learning culture.

Five of these eight steps are people-related. What we are saying is that in re-balancing a performance measurement system, cultural and political considerations are more important overall than technical-rational considerations.

Does Performance Measurement Give Managers the “Freedom to Manage?”
When New Public Management was first introduced as an approach to public sector management in the 1990s, a core feature was its emphasis on managing for results and on structuring incentives to align organizational behaviors with the achievement of outputs and outcomes. One of the premises built into results-based management systems grounded in principal-agent theory was that if managers were incentivized to perform (given clear performance objectives and incentives to achieve them), they would be appropriately motivated (Poister,
Aristigueta, & Hall, 2015). Freed from the need to pay close attention to traditional process and procedural
requirements, they would have the latitude to work with inputs to their programs in ways that improved efficiency
and effectiveness.

But in many settings, particularly where performance measurement and public reporting are high stakes, managers
have not been freed up in this way. In fact, the results-based management requirements have been layered on top
of existing process-focused requirements (Moynihan, 2008). Particularly in an environment of fiscal constraint,
the overall effect has been a tendency to centralize government and organizational decision making. Targets are
not accompanied by more latitude, but instead by more control (Peters, Pierre, & Randma-Liiv, 2011; van der
Voet & Van de Walle, 2018).

Gill (2011), in The Iron Cage Recreated, examined the New Zealand public service and included a survey of more
than 1,700 public servants in organizations across the public service. One of his conclusions is that Max Weber’s
“Iron Cage” has been re-created in the New Zealand public service. Weber (1930) introduced this metaphor in his
writings on 19th-century bureaucracies. Weber recognized the importance of bureaucracies not just as instruments
in the emerging rational societies in Europe but also as a cultural phenomenon by which relationships would be
transformed in governments and between governments and societies. Bureaucracy, in addition to offering societies
ways of regularizing administration and governance, could also become an iron cage wherein behaviors and
relationships in government and in society become circumscribed by the values and expectations core to well-
functioning bureaucracies.

In New Zealand, performance management was implemented in 1988, but over time performance information
has tended to be used to demonstrate and control alignment, with objectives and targets cascading downward to
the front-line level. Although results-focused information is used by managers, performance information on inputs
and processes is used more. When bureaucracies are challenged either by their minister or by external stakeholders,
the impulse is to retreat to rules and processes. Two specific findings are particularly worth noting: Of 10 possible
influences on the daily work of the managers who were surveyed, they were least likely to agree or strongly agree
that their “work unit has a lot of freedom in how we allocate our budget and staff”, and most likely to agree or
strongly agree that “my work unit is mostly guided by established rules and procedures” (Gill, 2011, p. 385).

Using performance measurement systems to control and ensure alignment would be expected in an External
Accountability (EA) system, where performance measurement is primarily top-down and is premised on being
able to demonstrate that organizational performance is consistent with strategic objectives. Jakobsen et al. (2017)
recognize this problem implicitly and suggest that a self-conscious shift of a performance measurement system to a
more balanced approach (their Internal Learning option) would result in improved program performance and
continued external accountability (although not necessarily better performance based on the measures in the
former EA system).

Our approach in Chapters 8, 9, and 10 has generally aligned with the Jakobsen et al. (2017) perspective on the
whole field. They acknowledge that their synthesis of different perspectives on accountability and performance
improvement needs to be researched further. We have presented some research results in Chapter 10 that support
their view but also acknowledge that more needs to be done. Below, we examine one alternative: a jurisdiction
exemplifying efforts to support a continuing “performance dialogue.”

Decentralized Performance Measurement: The Case of a Finnish Local
Government
An emerging perspective on performance measurement and performance management internationally, one that reflects the Jakobsen et al. (2017) perspective, is the importance of continuing performance dialogues among internal
stakeholders (Laihonen & Mäntylä, 2017; OECD, 2010) to achieve learning-related benefits from performance
management systems. Moynihan (2005) introduced the idea of learning forums as a way to facilitate the use of
performance information by managers. He identified criteria to guide learning forums that have since been
adapted and applied in a Finnish municipality (Laihonen & Mäntylä, 2017, 2018).

Laihonen and Mäntylä are part of a group of researchers who have worked extensively with the city of Tampere,
Finland (2010–2016) to develop and implement a performance management system. Following Moynihan (2005)
and in line with Jakobsen et al. (2017), they have worked with the municipal administration to pilot a learning
forum as a means of supporting ongoing performance dialogue among managers in Tampere. The learning forum
is intended to be a vehicle for institutionalizing an Internal Learning (IL) approach to performance management.
The learning forum in 2016 was a pilot and was intended to be the first in an ongoing series. It was based on these
elements:

Routine event
Facilitation and ground rules to structure dialogue
Non-confrontational approach to avoid defensive reactions
Collegiality and equality among partners
Diverse set of organizational actors present who are responsible for producing outcomes under review
Dialogue centered, with dialogue focused on organizational goals
Basic assumptions are identified, examined, and suspended (especially for double-loop learning)
Quantitative knowledge that identifies successes and failures, including goals, targets, outcomes, and points of comparison
Experiential knowledge of process and work conditions that explain successes, failures, and the possibility of
innovations

(Laihonen & Mäntylä, 2017, p. 423, based on Moynihan, 2005).

Overall the pilot was successful, although it did suggest that using learning forums as a means to build toward a
new kind of performance-focused organization is not sufficient in itself. Three principles emerged from this case
study that are intended to guide further work in that municipality and by implication in other government
organizations that want to move towards an effective IL-focused performance management system:

Continuing performance dialogue often requires cultural change

Learning forums by themselves are not sufficient to induce a cultural change that embraces organizational
learning. Moynihan (2005) discovered this when he worked with several state agencies in different locations
and saw that where learning forums were aligned with the pre-existing organizational culture, they “took”
much more readily than in organizations where they were treated as one-offs by the main stakeholders.

Performance dialogue needs to provide a structure for the use of performance information

Learning forums are an efficient and effective way to bring stakeholders together (information analysts,
managers, and a group facilitator) to review performance results, elaborate on possible reasons for observed
results, and identify options for improvements and innovations.

Performance dialogue needs an initiator

Laihonen and Mäntylä have worked with the city of Tampere since 2010, and by now have established their credibility as researchers and participant observers. They recognize that establishing a culture that maintains
a performance dialogue requires facilitation. In their words:

Whether they are internal or external, it seems important that persons with this role serve as referees of
the information game. They need to understand the complexity of the public service system and the
underlying information architecture. They have to be able to turn often blurred business needs into
concrete and actionable information requests. The required personal capabilities are very different from
those of stereotyped public officials. (p. 425)

The Tampere, Finland, case, in ways that are analogous to the Lethbridge, Canada, case, suggests that it is possible
to build toward a culture that supports internal learning and, hence, managerial uses of performance information
to improve performance. The dominant "paradigm" around performance management still puts accountability first,
resting on the (now questionable) assumption that accountability uses of performance results will lead to performance
improvement uses. We are suggesting that in the post-NPM field of public administration, where complexity and rapid
change are becoming the norm, directly engaging with managers and giving them sufficient agency to be collaborators
is a promising strategy for realizing performance improvement as an objective of designing and implementing
performance measurement systems.

When Performance Measurement Systems De-Emphasize Outputs and
Outcomes: Performance Management Under Conditions of Chronic Fiscal
Restraint
There is one other consideration that affects efforts to build a performance measurement system from scratch or
rebalance one that exists. Since 2008, there has been a general change in government and governance contexts.
The Great Recession (Bermeo & Bartels, 2014; Jenkins, Brandolini, Micklewright, & Nolan, 2013) and its
aftermath have left governments facing greater fiscal and political uncertainty, driven by demands for programs
that address the emerging impacts of climate change and, at the same time, demands for programs that respond to
the changing demographics in most Western societies (aging populations, immigration pressures, refugee pressures).
Chronic fiscal pressures, and an unwillingness of some stakeholders to tolerate persistent government deficits,
have produced a demand for ongoing expenditure restraint. This pressure is felt throughout the public and
nonprofit sectors (Bozeman, 2010; Pandey, 2010; Raudla, Savi, & Randma-Liiv, 2013). There has been a resurgence
of interest in "cutback management" and increasing interest in "spending reviews," a specific type of evaluation
intended to inform budget cuts and reallocations (Catalano & Erbacci, 2018).

One effect on performance measurement regimes is a heightened emphasis on organizational control and, in
particular, expenditure control. Not only are such systems becoming more centralized, but decision-makers are also
prioritizing oversight of expenditures (inputs) over the measurement of outputs and outcomes (see, for example,
Randma-Liiv & Kickert, 2018). In Canada, a major expenditure review across the federal government in 2012 was
intended primarily to find an aggregate amount of cost savings across government departments (Dobell &
Zussman, 2018). Although evaluations focused on program effectiveness were available, there is scant evidence
that they were used (Dobell & Zussman, 2018; Shepherd, 2018).

For organizations that are building or re-focusing performance measurement systems, this backdrop will influence
the demands from governments and other public sector funders and hence, the strategies that are more likely to
succeed in meeting funder expectations while building a culture that supports performance improvement.
Consistent with the themes of this textbook, evaluators will need a foundation of technical understanding of
evaluation and performance measurement processes, and the professional judgement to appreciate and capitalize
on the economic, political, and organizational context that will impact the development and implementation of
these tools for the betterment of society.

Summary
The stance we have taken in Chapters 8, 9 and 10 in describing performance measurement, how such systems complement program
evaluation, and how to build and sustain performance measurement systems, is that organizational buy-in, specifically manager buy-in, is
essential to realize the potential of this evaluation approach. Buy-in cannot be decreed or demanded—it must be earned by negotiation
and trust-building. The “people” parts of the processes outlined in these three chapters in our textbook underscore the importance of
understanding and working with organizational cultures, and their interactions with the political cultures in which public sector and
nonprofit organizations are embedded.

Performance measurement and public reporting are now central to governmental efforts to demonstrate public accountability. The
performance management cycle we introduced in Chapter 1 suggests that public performance reporting is typically implemented with the
expectation that the consequences of transparency and incentives will serve as drivers for performance improvements. These assumptions
are reflected in Figure 9.3 in Chapter 9.

In Chapter 10, we have looked at the ways that performance information is used in different settings and whether public performance
reporting does improve performance. The initial experience with public reporting suggests that where performance reporting for a sector
is high stakes and is done by independent organizations that rate or rank performance in that sector, such as in England between 2000
and 2005, when a three-star rating system was used for hospitals, that challenging the reputations of public organizations will, at least
initially, improve performance.

But high-stakes performance measurement settings usually produce unintended side effects, and the key one is gaming of performance
results. In the English ambulance service case, gaming included modifying the reported response times for ambulances to meet the 8-
minute target for emergency responses to calls. Managing gaming responses to performance targets requires investing in strategies such as
regular audits of data systems and performance results. Gaming is dynamic; that is, it evolves over time. Audit-based responses to gaming
will reduce it but probably not eliminate it.

In most settings, public performance reporting is "high stakes" in the sense that negative political consequences can develop if reported
results are less than positive (de Lancer Julnes & Steccolini, 2015; Perrin, 2015; Van Dooren & Hoffmann, 2018). The risk of that
happening is situational. In even slightly adversarial political cultures, however, where risk aversion is a factor in administrative and even
policy decisions, public performance reports can become part of efforts to minimize risks, including making sure that they contain "good
news" or at least performance results that are not negative (Van Dooren & Hoffmann, 2018).

There is little empirical evidence that politicians make significant use of ex post performance information in their roles and responsibilities
(Shaw, 2016; Van Dooren & Van de Walle, 2016). A study that examined the ways that legislators use performance reports over time
shows that expectations were high before the first reports were received, but actual uses were very modest and focused mainly on
general (and perhaps symbolic) accountability uses as well as information dissemination uses (McDavid & Huse, 2012).

If organizations want to improve program performance, building cultures that facilitate internal learning is an advantage. In relatively
low-risk settings (e.g., most local governments) it is easier to use performance results for both internal performance management and
external accountability. There is a growing body of research, including the Lethbridge, Alberta case, that supports building internal
learning cultures.

One recent innovation in performance measurement and performance management is the effort to rebalance systems that were focused on
accountability uses so that they better support performance improvement uses of performance information. In Finland, one local government
(Tampere) has implemented learning forums to facilitate internal uses of performance results. While acknowledging that public
performance reporting for accountability is here to stay, this approach essentially decouples that function from the creation and uses of
performance information for internal performance dialogues that describe performance results and offer ways of improving performance.

Discussion Questions
1. What are the key differences between the technical/rational view of implementing performance management in organizations and
the political/cultural view in terms of the assumptions they make about people?
2. Some commentators have suggested that failures of performance measurement systems to live up to their promises are due to poor
or inadequate implementation. This view suggests that if organizations properly implement performance measurement, paying
attention to what is really needed to get it right, performance measurement will be successful. Another view is that performance
measurement itself is a flawed idea and that no amount of attention to implementation will solve its problems. What are your
views on this issue?
3. Will auditing performance reports increase their usefulness? Why?
4. Can managers in organizations be trusted to collect performance data for an organizational performance measurement system?
Why?
5. What is “buy-in” when we look at the design and implementation of performance measurement systems?
6. If you have had organizational experience (working, work term placements, or internships) either in the public sector, private
sector, or the nonprofit sector, what is your experience with whether performance measurement seems to centralize the
organization or, instead, decentralize it?
7. What does it mean for organizational managers to “game” performance measures? What are some ways of reducing the occurrence
of this problem?
8. What is the “ratchet effect” in setting targets for performance measures?

References
Askim, J. (2007). How do politicians use performance information? An analysis of the Norwegian local
government experience. International Review of Administrative Sciences, 73(3), 453–472.

Bao, G., Wang, X., Larsen, G. L., & Morgan, D. F. (2013). Beyond new public governance: A value-based global
framework for performance management, governance, and leadership. Administration & Society, 45(4),
443–467.

Barrett, K., & Greene, R. (2008). Grading the states ‘08: The mandate to measure. Governing, 21(6), 24–95.

Bermeo, N., & Bartels, L. (2014). Mass politics in tough times: Opinions, votes, and protest in the Great Recession.
London: Oxford University Press.

Bevan, G., & Hamblin, R. (2009). Hitting and missing targets by ambulance services for emergency calls: Effects
of different systems of performance measurement within the UK. Journal of the Royal Statistical Society. Series A
(Statistics in Society), 172(1), 161–190.

Bevan, G., & Hood, C. (2006). Health policy: Have targets improved performance in the English NHS? BMJ,
332(7538), 419–422.

Bevan, G., & Wilson, D. (2013). Does "naming and shaming" work for schools and hospitals? Lessons from
natural experiments following devolution in England and Wales. Public Money & Management, 33(4),
245–252.

Bewley, H., George, A., Rienzo, C., & Porte, J. (2016). National Evaluation of the Troubled Families Programme:
National Impact Study Report. London, UK: Department for Communities and Local Government.

Bish, R. (1971). The public economy of metropolitan areas. Chicago, IL: Markham.

Boswell, C. (2018). Manufacturing political trust: Targets and performance measurement in public policy. Cambridge,
UK: Cambridge University Press.

Bouckaert, G., & Halligan, J. (2008). Managing performance: International comparisons. New York: Routledge.

Bourgeois, I. (2016). Performance measurement as precursor to organizational evaluation capacity building.
Evaluation Journal of Australasia, 16(1), 11–18.

Bozeman, B. (2010). Hard lessons from hard times: Reconsidering and reorienting the “managing decline”
literature. Public Administration Review, 70(4), 557–563.

Brignall, S., & Modell, S. (2000). An institutional perspective on performance measurement and management in
the “new public sector.” Management Accounting Research, 11, 281–306.

Carvel, J. (2006). To improve, the NHS must admit its faults. The Guardian. Retrieved from
https://www.theguardian.com/society/2006/oct/25/health.comment

Catalano, G., & Erbacci, A. (2018). A theoretical framework for spending review policies at a time of widespread
recession. OECD Journal on Budgeting, 17(2), 9–24.

Das, T. K., & Teng, B-S. (2001). Trust, control, and risk in strategic alliances: An integrated framework.
Organization Studies, 22(2), 251–283.

de Lancer Julnes, P., & Steccolini, I. (2015). Introduction to Symposium: Performance and accountability in
complex settings—Metrics, methods, and politics. International Review of Public Administration, 20(4),
329–334.

Dobell, R., & Zussman, D. (2018). Sunshine, scrutiny, and spending review in Canada, Trudeau to Trudeau:
From program evaluation and policy to commitment and results. Canadian Journal of Program Evaluation,
32(3), 371–393.

Gibbons, S., Neumayer, E., & Perkins, R. (2015). Student satisfaction, league tables and university applications:
Evidence from Britain. Economics of Education Review, 48, 148–164.

Gill, D. (Ed.). (2011). The iron cage recreated: The performance management of state organisations in New Zealand.
Wellington, NZ: Institute of Policy Studies.

Government of British Columbia. (2000). Budget Transparency and Accountability Act: [SBC 2000 Chapter 23].
Victoria, British Columbia, Canada: Queen’s Printer.

Government of British Columbia. (2001). Budget Transparency and Accountability Act [SBC 2000 Chapter 23]
(amended). Victoria, British Columbia, Canada: Queen’s Printer.

Government Performance and Results Act of 1993, Pub. L. No. 103–62.

Government Performance and Results Act Modernization Act of 2010, Pub. L. No. 111–352.

Hatry, H. P. (1974). Measuring the effectiveness of basic municipal services. Washington, DC: Urban Institute and
International City Management Association.

Hatry, H. P. (1980). Performance measurement principles and techniques: An overview for local government.
Public Productivity Review, 4(4), 312–339.

Hatry, H. P. (2006). Performance measurement: Getting results (2nd ed.). Washington, DC: Urban Institute Press.

Hibbard, J. (2008). What can we say about the impact of public reporting? Inconsistent execution yields variable
results. Annals of Internal Medicine, 148, 160–161.

Hibbard, J., Stockard, J., & Tusler, M. (2003). Does publicizing hospital performance stimulate quality
improvement efforts? Health Affairs, 22(2), 84–94.

Hildebrand, R., & McDavid, J. (2011). Joining public accountability and performance management: A case study
of Lethbridge, Alberta. Canadian Public Administration, 54(1), 41–72.

Himmelstein, D., Ariely, D., & Woolhandler, S. (2014). Pay-for-performance: Toxic to quality? Insights from
behavioural economics. International Journal of Health Services, 44(2), 203–214.

Hood, C. (2006). Gaming in targetworld: The targets approach to managing British public services. Public
Administration Review, 66(4), 515–521.

Hood, C., Dixon, R., & Wilson, D. (2009). “Managing by numbers”: The way to make public services better?
Retrieved from http://www.christopherhood.net/pdfs/Managing_by_numbers.pdf

Hood, C., & Peters, G. (2004). The middle aging of new public management: Into the age of paradox? Journal of
Public Administration Research and Theory, 14(3), 267–282.

Jacobsen, C., Hvitved, J., & Andersen, L. (2014). Command and motivation: How the perception of external
interventions relates to intrinsic motivation and public service motivation. Public Administration, 92(4),
790–806.

Jakobsen, M. L., Baekgaard, M., Moynihan, D. P., & van Loon, N. (2017). Making sense of performance
regimes: Rebalancing external accountability and internal learning. Perspectives on Public Management and
Governance, 1–15.

Jenkins, S., Brandolini, A., Micklewright, J., & Nolan, B. (2013). The Great Recession and the distribution of
household income. London: Oxford University Press.

Johnsen, Å. (1999). Implementation mode and local government performance measurement: A Norwegian
experience. Financial Accountability & Management, 15(1), 41–66.

Johnsen, Å. (2005). What does 25 years of experience tell us about the state of performance measurement in
public policy and management? Public Money and Management, 25(1), 9–17.

Kelly, J. M., & Rivenbark, W. C. (2014). Performance budgeting for state and local government (2nd Ed.). New
York: Routledge.

Kelman, S., & Friedman, J. (2009). Performance improvement and performance dysfunction: An empirical
examination of distortionary impacts of the emergency room wait-time target in the English National Health
Service. Journal of Public Administration Research and Theory, 19(4), 917–946.

Kettl, D., & Kelman, S. (2007). Reflections on 21st century government management. Washington, DC: IBM
Center for the Business of Government.

Kroll, A. (2015). Drivers of performance information use: Systematic literature review and directions for future
research. Public Performance & Management Review, 38(3), 459–486.

Kroll, A., & Moynihan, D. P. (2018). The design and practice of integrating evidence: Connecting performance
management with program evaluation. Public Administration Review, 78(2), 183–194.

Laihonen, H., & Mäntylä, S. (2017). Principles of performance dialogue in public administration. International
Journal of Public Sector Management, 30(5), 414–428.

Laihonen, H., & Mäntylä, S. (2018). Strategic knowledge management and evolving local government. Journal of
Knowledge Management, 22(1), 219–234.

Le Grand, J. (2010). Knights and knaves return: Public service motivation and the delivery of public services.
International Public Management Journal, 13(1), 56–71.

Lewis, J. (2015). The politics and consequences of performance measurement. Policy and Society, 34(1), 1–12.

McDavid, J. C. (2001). Solid-waste contracting-out, competition, and bidding practices among Canadian local
governments. Canadian Public Administration, 44(1), 1–25.

McDavid, J. C., & Huse, I. (2012). Legislator uses of public performance reports: Findings from a five-year study.
American Journal of Evaluation, 33(1), 7–25.

McLean, I., Haubrich, D., & Gutierrez-Romer, R. (2007). The perils and pitfalls of performance measurement:
The CPA regime for local authorities in England. Public Money & Management, 27(2), 111–118.

Moynihan, D. P. (2005). Goal-based learning and the future of performance management. Public Administration
Review, 65(2), 203–216.

Moynihan, D. P. (2008). The dynamics of performance management: Constructing information and reform.
Washington, DC: Georgetown University Press.

Moynihan, D. (2009). Through a glass, darkly: Understanding the effects of performance regimes. Public
Performance & Management Review, 32(4), 592–603.

Moynihan, D., & Pandey, S. (2010). The big question for performance management: Why do managers use
performance information? Journal of Public Administration Research and Theory, 20(4), 849–866.

Niskanen, W. A. (1971). Bureaucracy and representative government. New York: Aldine-Atherton.

OECD. (2010). Qualitative assessments of recent reforms. In Public administration after "New Public
Management". Paris: OECD Publishing.

Otley, D. (2003). Management control and performance management: Whence and whither? British Accounting
Review, 35(4), 309–326.

Pandey, S. K. (2010). Cutback management and the paradox of publicness. Public Administration Review, 70(4),
564–571.

Perrin, B. (2015). Bringing accountability up to date with the realities of public sector management in the 21st
century. Canadian Public Administration, 58(1), 183–203.

Peters, B. G., Pierre, J., & Randma-Liiv, T. (2011). Global financial crisis, public administration and governance:
Do new problems require new solutions? Public Organization Review, 11(1), 13–27.

Poister, T. H., Aristigueta, M. P., & Hall, J. L. (2015). Managing and measuring performance in public and nonprofit
organizations (2nd ed.). San Francisco, CA: Jossey-Bass.

Pollanen, R. M. (2005). Performance measurement in municipalities: Empirical evidence in Canadian context.
International Journal of Public Sector Management, 18(1), 4–24.

Pollitt, C. (2018). Performance management 40 years on: A review. Some key decisions and consequences. Public
Money & Management, 38(3), 167–174.

Pollitt, C., Bal, R., Jerak-Zuiderent, S., Dowswell, G., & Harrison, S. (2010). Performance regimes in health care:
Institutions, critical junctures and the logic of escalation in England and the Netherlands. Evaluation, 16(1),
13–29.

Propper, C., & Wilson, D. (2003). The use and usefulness of performance measures in the public sector. Oxford
Review of Economic Policy, 19(2), 250–267.

Propper, C., Sutton, M., Whitnall, C., & Windmeijer, F. (2010). Incentives and targets in hospital care: A natural
experiment. Journal of Public Economics, 94(3), 301–335.

Randma-Liiv, T., & Kickert, W. (2018). The impact of fiscal crisis on public administration in Europe. In E.
Ongaro and S. van Thiel (eds.), The Palgrave handbook of public administration and management in Europe (pp.
899–917). London: Palgrave Macmillan.

Raudla, R. (2012). The use of performance information in budgetary decision-making by legislators: Is Estonia
any different? Public Administration, 90(4), 1000–1015.

Raudla, R., Savi, R., & Randma-Liiv, T. (2013). Literature review on cutback management. COCOPS—
(COordinating for COhesion in the Public Sector of the Future). Retrieved from
http://hdl.handle.net/1765/40927

Rautiainen, A. (2010). Contending legitimations: Performance measurement coupling and decoupling in two
Finnish cities. Accounting, Auditing & Accountability Journal, 23(3), 373–391.

Savas, E. S. (1982). Privatizing the public sector: How to shrink government. Chatham, NJ: Chatham House.

Savas, E. S. (1987). Privatization: The key to better government. Chatham, NJ: Chatham House.

Schaffner, B. F., Streb, M., & Wright, G. (2001). Teams without uniforms: The nonpartisan ballot in state and
local elections. Political Research Quarterly, 54(1), 7–30.

Schoorman, F. D., Mayer, R. C., & Davis, J. H. (1995). An integrative model of organizational trust. Academy of
Management Review, 20(3), 709–734.

Shaw, T. (2016). Performance budgeting practices and procedures. OECD Journal on Budgeting, 15(3), 1–73.

Shepherd, R. (2018). Expenditure reviews and the federal experience: Program evaluation and its contribution to
assurance provision. Canadian Journal of Program Evaluation, 32(3), 347–370.

Speklé, R. F., & Verbeeten, F. H. (2014). The use of performance measurement systems in the public sector:
Effects on performance. Management Accounting Research, 25(2), 131–146.

Steele, G. (2005, April). Re-aligning resources and expectations: Getting legislators to do what they “should.” Paper
presented at the 25th Anniversary Conference of CCAF-FCVI, Ottawa, Ontario, Canada.

Sterck, M. (2007). The impact of performance budgeting on the role of the legislature: A four-country study.
International Review of Administrative Sciences, 73(2), 189–203.

Streib, G. D., & Poister, T. H. (1999). Assessing the validity, legitimacy, and functionality of performance
measurement systems in municipal governments. American Review of Public Administration, 29(2), 107–123.

Thomas, P. G. (2006). Performance measurement, reporting, obstacles and accountability: Recent trends and future
directions. Canberra, ACT, Australia: ANU E Press. Retrieved from
http://epress.anu.edu.au/anzsog/performance/pdf/performance-whole.pdf

Van der Voet, J., & Van de Walle, S. (2018). How cutbacks and job satisfaction are related: The role of top-level
public managers’ autonomy. Review of Public Personnel Administration, 38(1), 5–23.

Van Dooren, W., & Hoffmann, C. (2018). Performance management in Europe: An idea whose time has come
and gone? In E. Ongaro and S. van Thiel (eds.), The Palgrave handbook of public administration and
management in Europe (pp. 207–225). London: Palgrave Macmillan.

Van Dooren, W., & Van de Walle, S. (Eds.). (2016). Performance information in the public sector: How it is used.
New York, NY: Palgrave Macmillan.

Van Thiel, S., & Leeuw, F. L. (2002). The performance paradox in the public sector. Public Performance &
Management Review, 25(3), 267–281.

Van Thiel, S., & Yesilkagit, K. (2011). Good neighbours or distant friends? Trust between Dutch ministries and
their executive agencies. Public Management Review, 13(6), 783–802.

Wankhade, P. (2011). Performance measurement and the UK emergency ambulance service. International Journal
of Public Sector Management, 24(5), 384–402.

Weber, M. (1930). The Protestant ethic and the spirit of capitalism. London, England: George Allen.

Williams, D. W. (2003). Measuring government in the early twentieth century. Public Administration Review,
63(6), 643–659.

Willoughby, K., & Benson, P. (2011). Program evaluation, performance budgeting and PART: The U.S. Federal
Government experience. Atlanta: Georgia State University.

11 Program Evaluation and Program Management

Introduction 446
Internal Evaluation: Views From the Field 447
Intended Evaluation Purposes and Managerial Involvement 450
When the Evaluations Are for Formative Purposes 450
When the Evaluations Are for Summative Purposes 452
Optimizing Internal Evaluation: Leadership and Independence 453
Who Leads the Internal Evaluation? 454
“Independence” for Evaluators 455
Building an Evaluative Culture in Organizations: an Expanded Role for Evaluators 456
Creating Ongoing Streams of Evaluative Knowledge 457
Critical Challenges to Building and Sustaining an Evaluative Culture 458
Building an Evaluative/Learning Culture in a Finnish Local Government: Joining Performance
Measurement and Performance Management 460
Striving for Objectivity in Program Evaluations 460
Can Program Evaluators Claim Objectivity? 462
Objectivity and Replicability 463
Implications for Evaluation Practice: a Police Body-Worn Cameras Example 466
Criteria for High-Quality Evaluations 467
Summary 470
Discussion Questions 471
References 472

Introduction
Chapter 11 examines the dynamics between the evaluation function and program management. This includes,
more specifically, the relationships between evaluators and managers, the role of evaluators in the production of
information for political decision-makers, and how these relationships are influenced by evaluation purposes and
organizational contexts. What is the role of the evaluator? Can a manager effectively conduct an evaluation of their
own program? Is an evaluator role in support of a “learning organization” feasible? Is it important for an internal
evaluator to establish independence and objectivity? What kind of difference does context make, such as in eras of
increasing fiscal restraint? On the subject of how evaluators relate to managerial and government clients, a range of
views have been offered by analysts and evaluators over the past four decades.

Aaron Wildavsky (1979), in his seminal book Speaking Truth to Power: The Art and Craft of Policy Analysis,
introduced his discussion of evaluation and organizations this way:

Why don’t organizations evaluate their own activities? Why don’t they seem to manifest rudimentary
self-awareness? How long can people work in organizations without discovering their objectives or
determining how well they are carried out? I started out thinking that it was bad for organizations not
to evaluate, and I ended up wondering why they ever do it. Evaluation and organization, it turns out,
are somewhat contradictory. (p. 212)

When he posed these questions, Wildavsky chiefly had in mind summative evaluations where the future of
programs, and possibly reallocation of funding, could be an issue. As we have seen so far, summative evaluations
are typically higher-stakes than formative evaluations. We will review the implications of the differences in this
chapter.

We will look at learning organizations as an ideal type and connect their attributes to how evaluators might work
with and in such organizations. More specifically, evaluative cultures (Mayne, 2008; Mayne & Rist, 2006; Scott,
2016) are intended to be embedded in learning organizations. In such cases, evaluative thinking and practices
become diffused throughout the organization, supporting learning, innovation, and what has been termed a
“performance dialogue.” We discuss the prospects for realizing such cultures in contemporary public sector
organizations. In doing so, we offer a range of viewpoints on the relationships between evaluators and the
organizations in which or for which they do their work.

In Chapter 10, we introduced internal learning cultures (Laihonen & Mäntylä, 2017) and summarized a case of a
Finnish local government where managerial learning forums have been introduced to discuss and use internal
performance measurement results. We will take another brief look at this example of strategic knowledge
management, because this innovation aligns well with building learning cultures in organizations and supporting
performance dialogues.

Aaron Wildavsky (1979) was among the first of many to raise questions about the challenges of “speaking truth to
power” in public sector organizations, and for evaluators this relates to the issues of objectivity and evaluator
independence. We will look in depth at what evaluator objectivity and independence mean, particularly in the
context of high- or medium-stakes accountability situations. This question is important because some professions
related to evaluation (public sector audit, for example) claim objectivity for their work and it is arguable that
“objectivity” is a foundational issue for the creation of credible and defensible performance information (Mayne &
Rist, 2006; McDavid & Huse, 2006).

Finally, based on the guidelines and principles offered by evaluation associations internationally, we offer some
general guidance for evaluators in positioning themselves as practitioners who want to make credible claims for
doing high-quality evaluations.

Internal Evaluation: Views From the Field
Wildavsky’s (1979) view of organizations as settings where “speaking truth to power” is a challenge is similar to
the political/cultural image of organizations offered by de Lancer Julnes and Holzer (2001) and in Chapter 9 of
this textbook. The phrase “speaking truth to power” has become part of the lexicon of the public policy arena;
Wildavsky’s book is still relevant and was republished with a new introduction in 2017. The chapter "The
Self-Evaluating Organization" is a worthwhile read, even 40 years later.

Wildavsky points out that the views and purposes of evaluators and managers may be in contrast, even within an
organization. Evaluators are described as people who question assumptions, who are skeptical, who are (somewhat)
detached, who view organizations/programs as means and not ends in themselves, whose currency is the credibility
and defensibility of their evidence and products, and who ultimately focus on the broad social needs that the
program is intended to address, rather than on organizational ends, which often involve preserving budgets and
defending programs.

In contrast, organizational/program managers are characterized as people who are committed to their individual
programs, who are advocates for what they do and what their programs do, and who do not want to see their
commitments curtailed or their resources diminished. They personally identify with their programs and the
benefits they confer to their clients.

How, then, do organizations resolve the questions of who designs and conducts an evaluation, who controls its
interpretation and reporting, and how the perspectives of evaluators and managers are balanced? Are these
rational/technical decisions or instead are they political decisions that reflect the culture in the organization?

In a typical scenario where they have roles similar to what Wildavsky has suggested, internal evaluators would have
substantial involvement in how new or revised programs or policies should be evaluated. Will baseline measures be
identified and data collected, for example? Will new programs (particularly in complex organizations) be
implemented as pilots so that evaluation results, reported once the evaluations are completed, can be used to inform
decisions about whether to scale up programs? How much independence will internal evaluators have to speak truth to
power (Head, 2013)?

The field of evaluation has hosted, and continues to host, vigorous debate on when, why, and how to prioritize
objectivity, independence, organizational participation, empowerment, and managerial involvement, in terms of
formative and summative purposes (Alkin, 2012; Julnes & Bustelo, 2017; King & Stevahn, 2015). Arnold Love
(1991) wrote a seminal book that is generally, though not universally, supportive of internal evaluation. His work
continues to influence the field, and his views have been amplified and refined by others (e.g., Sonnichsen, 2000;
Volkov, 2011a). Love began an interview (Volkov, 2011b) reflecting on 25 years of internal evaluation this
way:
way:

I would like to set the record straight about my position regarding internal evaluation. Because my
name is associated so closely with internal evaluation, there is often the misperception that I am
promoting internal evaluation as the preferred alternative to external evaluation. Nothing could be
further from my own position. I feel that internal evaluation is a valuable form of evaluation, but the
choice of any particular form (internal or external) depends on the purpose for the evaluation and a
careful consideration of who is in the best position to conduct the evaluation. In some cases, it is
internal evaluators, but in other cases it is external evaluators. (p. 6)

Love goes on to elaborate, in the interview, his own views on when internal evaluation is appropriate, and
although this quote suggests limits to the purview of internal evaluation, he is clearly an advocate for an expanded
role for this approach. In his 1991 book, Love elaborates on an approach that is premised on the assumption that
evaluators can be a part of organizations (i.e., paid employees who report to organizational executives) and can
contribute to improving the efficiency and effectiveness of programs.

Love (1991) outlines six stages in the development of internal evaluation capacity, beginning with ad hoc program
evaluations and ending with strategically focused cost–benefit analyses:

Ad hoc evaluations focused on single programs
Regular evaluations that describe program processes and results
Program goal setting, measurement of program outcomes, program monitoring, adjustment
Evaluations of program effectiveness, improving organizational performance
Evaluations of technical efficiency and cost-effectiveness
Strategic evaluations, including cost–benefit analyses

These six stages can broadly be seen as a gradation of the purposes of evaluations from formative to summative.
They also reflect the roles of evaluation and performance measurement in the performance management cycle.
Love (1991) highlights the importance of an internal working environment where organizational members are
encouraged to participate in evaluations, and where trust in evaluators and their commitment to the organization are
part of the culture. What Love is suggesting is that even though some of the six stages of developing evaluation
capacity are aligned with conventional accountability roles for evaluators, it is possible to modify an organizational
culture, over time, so that it embraces evaluation as a strategic internal organizational asset. Later in this chapter,
we will look at the prospects for building evaluative cultures and how evaluators can contribute to this process.

We are discussing internal evaluation here, but it is important to distinguish between internal evaluations led or
even conducted by the manager, and evaluations led by an arm’s-length internal evaluator or evaluation team. This
relates to evaluation as a discipline in its own right, and the importance of an evaluator’s technical knowledge,
interpersonal skills, and attention to context that we have covered in the earlier chapters. It also relates to the
formative or summative purpose of the evaluation. Perhaps the most difficult combination is evaluations done for
a summative, high-stakes purpose, led by an organization’s manager (Stufflebeam, 1994).

Daniel Stufflebeam has contributed to the field in many ways, most recently in a textbook that updates his
Context, Input, Process, Product (CIPP) model for conducting program evaluations (Stufflebeam & Zhang,
2017). When empowerment evaluation was fairly new to the field, Stufflebeam (1994) offered a trenchant
critique of its implications for program evaluators (and the whole field). We will focus on his spirited advocacy
for the separation of the roles of evaluators and managers. Empowerment evaluation was, in the mid-1990s, an
emerging approach premised on
evaluators building the capacity in (client) organizations to evaluate their own programs—to empower
organizations to evaluate their own programs in ways that improve social justice (Fetterman, 1994; Fetterman &
Wandersman, 2007; Fetterman, Rodriguez-Campos, Wandersman, & O’Sullivan, 2014). Stufflebeam challenged
this view, expressing his misgivings around whether managers or other stakeholders (who are not evaluators)
should make the decisions about the evaluation process, including methodologies, analysis, and reporting of
evaluation findings. In his view, ceding those roles amounted to inviting “corrupt or incompetent evaluation
activity” (p. 324):

Many administrators caught in political conflicts over programs or needing to improve their public
relations image likely would pay handsomely for such friendly, non-threatening, empowering evaluation
service. Unfortunately, there are many persons who call themselves evaluators who would be glad to sell
such services. Unhealthy alliances of this type can only delude those who engage in such pseudo
evaluation practices, deceive those whom they are supposed to serve, and discredit the evaluation field as
a legitimate field of professional practice. (p. 325)

Stufflebeam’s view is a strong critique of empowerment evaluation and, by implication, other evaluative
approaches that cede the central position that evaluation professionals have in conducting evaluations. What
Stufflebeam is saying is that program managers, aside from not being trained as evaluators, can be in a conflict of
interest when it comes to evaluating their own programs—his view accords with what Wildavsky said earlier.

An additional feature of Stufflebeam’s critical assessment of empowerment evaluation that stands out is his defense
of the importance of what he calls “objectivist evaluation” (p. 326) in professional evaluation practice. His
definition of objectivist evaluation also resonates with some of the themes articulated by Chelimsky (2008) and
Weiss (2013). For Stufflebeam (1994),

. . . objectivist evaluations are based on the theory that moral good is objective and independent of
personal or merely human feelings. They are firmly grounded in ethical principles, strictly control bias
or prejudice in seeking determinations of merit and worth . . . obtain and validate findings from
multiple sources, set forth and justify conclusions about the evaluand’s merit and/or worth, report
findings honestly and fairly to all right-to-know audiences, and subject the evaluation process and
findings to independent assessments against the standards of the evaluation field. Fundamentally,
objectivist evaluations are intended to lead to conclusions that are correct—not correct or incorrect
relative to a person’s position, standing or point of view. (p. 326)

Michael Scriven (1997), regarded as one of the evaluation field’s founders, proposed a view of how evaluators
should engage with their clients in evaluation work. For Scriven, objectivity is defined as “with basis and without
bias” (p. 480), and an important part of being able to claim that an evaluation is objective is to maintain an
appropriate distance between the evaluator and what/who is being evaluated (the evaluand). This issue continues
to resonate in the field (Markiewicz, 2008; Trimmer, 2016; Weiss, 2013).

In reviewing the works of seminal evaluators who emphasize evaluator objectivity, like Wildavsky (1979), Love
(1991), Stufflebeam (1994), Scriven (1997), Chelimsky (2008), and Weiss (2013), a number of issues surface.
One is the importance of evaluation purposes/uses and how those affect the relationships between evaluators and
managers. High-stakes evaluations bring with them concerns about heightened internal organizational
involvement and influence. Another is the importance of organizational culture. Trust and trust-building has been
a theme in how organizational cultures evolve, particularly when the focus is building evaluative capacity for a
learning organization. A third issue is whether and under what conditions it is possible for evaluators to be
objective in the work they do. To help untangle the factors, it may be helpful to view the internal evaluator as
having an intermediary role, helping produce and organize defensible and credible evaluative information in a way
appropriate to the timing, expected use, and expected users of the information (Meyer, 2010; Olejniczak,
Raimondo, & Kupiec, 2016). We will consider each of these issues in turn.

Intended Evaluation Purposes and Managerial Involvement

When the Evaluations Are for Formative Purposes


In Chapter 1 we introduced formative evaluations, which are typically done internally with a view to offering
program and organizational managers and other stakeholders information that they can use to improve the
efficiency and/or effectiveness of existing programs. Program improvement is the main purpose of such
evaluations. Generally, questions about the continuation of support for the program are not part of formative
evaluation terms of reference.

Typically, program evaluators depend on program managers to provide key information and to arrange access to
people, data sources, and other sources of evaluation information (Chelimsky, 2008). So, from an evaluator’s
standpoint, the experience of conducting a formative evaluation can be quite different from conducting a
summative evaluation. Securing and sustaining cooperation is affected by the purposes of the evaluation—
managerial reluctance or strategies to “put the best foot forward” might well be more expected where the stakes
include the future of the program itself.

Managers are more likely to view formative evaluations as “friendly” evaluations and, hence, are more likely to be
willing to cooperate with (and trust) the evaluators. They have an incentive to do so because the evaluation is
intended to assist them in improving program performance, without raising questions that could result in major
changes, including reductions to or even the elimination of a program.

Many contemporary evaluation approaches (implicitly or explicitly) support involvement of program managers in
the process of evaluating their programs. Participatory and empowerment evaluation approaches, for example,
emphasize the importance of having practitioners involved in evaluations, principally to increase the likelihood
that the evaluations will be used (Cousins & Chouinard, 2012; Cousins & Whitmore, 1998; Fetterman &
Wandersman, 2007; Fetterman et al., 2014; Smits & Champagne, 2008). Patton, in three successive books that
cover utilization-focused evaluation (2008), developmental evaluation (2011), and principles-focused evaluation
(2018), proposes that program or organization-related stakeholders be involved in the entire evaluation process; in
fact he sees all three approaches as realizations of his utilization-focused evaluation approach wherein the intended
uses of the evaluation process and products should drive how evaluations are designed and implemented.

Views on the appropriate level of involvement of program managers can vary considerably from one evaluation
approach to another. Earlier we cited Stufflebeam’s (1994) and Cousins’ (2005) concerns about managers self-
evaluating their programs. In a rebuttal of criticisms of empowerment evaluation, Fetterman and Wandersman
(2007) suggest that their approach is capable of producing unbiased evaluations and, by implication, evaluations
that are defensible. In response to criticism by Cousins (2005), they suggest,

. . . contrary to Cousins’ (2005) position that “collaborative evaluation approaches . . . [have] . . . an
inherent tendency toward self-serving bias” (p. 206), we have found many empowerment evaluations to
be highly critical of their own operations, in part because they are tired of seeing the same problems and
because they want their programs to work. Similarly, empowerment evaluators may be highly critical of
programs that they favor because they want them to be effective and accomplish their intended goals. It
may appear counterintuitive, but in practice we have found appropriately designed empowerment
evaluations to be more critical and penetrating than many external evaluations. (Fetterman &
Wandersman, 2007, p. 184)

Their view of how program managers and other internal stakeholders (empowerment evaluators) relate to their
own programs suggests that concerns with self-evaluation bias are, at least in some situations, unfounded. But we
have seen, in Chapter 10, how widespread the concerns are with how program managers will respond to
performance measurement requirements that amount to them having to self-report in high-stakes contexts where
negative performance results could have a deleterious effect on their programs and on themselves. This has been
a particular concern since the global financial crisis (Arnaboldi, Lapsley, & Steccolini, 2015; Van Dooren &
Hoffmann, 2018). Is it unreasonable to assume that these concerns will carry over to program evaluations and to
the managers of programs? Let us take a closer look at the situation in Canada.

Love (in his interview with Volkov, 2011b) cites the program evaluation function in the federal government of
Canada as an example of a robust internal evaluation presence. Created in the late 1970s (Dobell & Zussman,
2018; Shepherd, 2018), it has a nearly fifty-year history—a unique accomplishment among national-level program
evaluation functions, internationally. The function operates across federal departments and agencies, and each
department has its own internally-stationed evaluation unit that is responsible for conducting program evaluations
as required by the Treasury Board policies that guide and specify requirements for the federal evaluation function.
However, managers themselves do not lead the evaluations. Heads of the evaluation units report (directly or
indirectly) to the senior executive in the department, in part to ensure that reporting relationships do not go
through program-related managers or executives.

How well this function has performed over its history, however, depends to some extent on what the expectations
are for its products. Shepherd (2011) critically assessed the evaluation function against the (then) existing federal
evaluation policy (Treasury Board of Canada Secretariat, 2009), which placed a premium on accountability
expectations for program evaluations, including their relevance for senior political decision-makers. Shepherd
pointed out that program evaluations were not delivering on that expectation and that the whole
function risked becoming irrelevant if it did not re-orient itself. Similarly, Bourgeois and Whynot (2018)
completed a qualitative content analysis of program evaluations done in two federal departments during a recent
3-year period (2010–2013) and concluded that in general the evaluations were not useful for the kinds of strategic
political decision-making that had been envisioned in the federal evaluation policy in place at that time. Thus,
again, the balance of the evidence suggests that an internal evaluation function is useful for formative evaluation
purposes but is limited for the summative evaluation purposes expected by senior political decision-makers.

Where evaluation or performance measurement systems are focused internally on improving program and
organizational performance (formative uses), the stakes are lower and the spectre of conflict of interest between
managers wanting to preserve and enhance their programs, on the one hand, and the objectives of the system, on the
other hand, is not nearly as stark (Van Dooren & Hoffmann, 2018).

When the Evaluations Are for Summative Purposes


Program evaluations can, alternatively, be summative—that is, intended to render judgments (of merit, worth or
significance) on the value of the program (Scriven, 2013). Summative evaluations are more directly linked to
accountability requirements that are often built into the performance management cycle, which was introduced in
Chapter 1. Summative evaluations can focus on issues that are similar to those included in formative evaluations
(e.g., program effectiveness), but the intention is to produce information that can be used to make decisions about
the program’s future, such as whether to reallocate resources elsewhere, or whether to continue the program.

As we have noted, high-stakes performance measurement focused on accountability (summative performance
measurement) that is intended to drive performance improvement through external pressure to achieve results
tends also to produce unintended consequences, among which are gaming of the performance measures (Bevan &
Hamblin, 2009; Gao, 2015) and negative side effects on the morale of the program’s human resources (Arnaboldi,
Lapsley, & Steccolini, 2015). As Norris (2005) says, “Faced with high-stakes targets and the paraphernalia of the
testing and performance measurement that goes with them, practitioners and organizations sometimes choose to
dissemble” (p. 585).

Summative program evaluations in the context of fiscal restraint are generally viewed with more concern by
program managers. Program managers perceive different incentives in providing information or even participating
in such an evaluation. The future of their programs may be at stake, so candid involvement in such an evaluation
carries risks to them. Clearly, it is critical to consider the intended uses of the evaluation when assessing the
involvement and leadership roles of managers and internal evaluators in designing and conducting program
evaluations (Van Dooren & Hoffmann, 2018). This is a contextual issue where the evaluation benefits from the solid
practical wisdom of the evaluator.

Mayne (2018), who was among those involved in the creation of the function in Canada in the late 1970s, is
now of the view that program evaluations should, first and foremost, serve the decision-making needs of the
departments from which they come—in effect, the balance between summative and formative purposes should be
tilted toward formative evaluations that focus on ways of incrementally changing programs to improve them and
to support a learning environment.

Optimizing Internal Evaluation: Leadership and Independence
Internal evaluation is now a widespread feature of contemporary evaluation practice. How that actually looks
varies a lot from one organizational context to another. Volkov and Baron (2011), in their synthesis of the papers
in a special issue of New Directions for Evaluation, point out that the roles of internal evaluators have changed over
time:

Over the years, the internal evaluator has gone from being seen as a pawn in the hands of the
organization’s administrator, to an advocate, to a manager of evaluative information, to a motivator and
change agent for performance improvement. (pp. 107–108)

There can be a tension between departmental internal use of evaluative information to improve programs, and
central government’s desire to use the same information for more summative, possibly budgetary, purposes. Even
when a system is originally designed for formative purposes, it can evolve or be pushed toward summative, high-
stakes accountability-related uses, as Pollitt (2018) and Kristiansen, Dahler-Larsen, and Ghin (2017) have
suggested. That is what happened to performance measurement in National Health Service in Britain. The
evidence from the British experience with high-stakes performance information is that biases in how the
information is constructed and conveyed is a significant problem (Lewis, 2015; Gao, 2015). The problem can be
mitigated by independently auditing performance information and performance reports, but there is a risk that
even then, strategies and counter-strategies will develop that constitute an “arms race” (Otley, 2003).

Who Leads the Internal Evaluation?


Part of the debate around whether and in what ways managers should be involved in evaluating their own
programs revolves around the distinction between a focus on managers (Stufflebeam and Fetterman are debating the
involvement of managers as evaluators) and a focus on evaluation as an internally-stationed organizational
function. The relationships between internal program evaluators and program managers are multi-faceted. Some
jurisdictions specify how internal evaluators are expected to approach their work. For example, in the federal
government of Canada, internal evaluators (and the evaluation units in departments and agencies) are expected to
be neutral (Treasury Board, 2016)—that is, able to weigh all sides of an issue but ultimately take a position that
respects the distinct role they have in the organization. In the case of the Canadian federal government evaluation
function, persistent attempts by Treasury Board to focus program evaluations on the accountability-related needs
of decision-makers have tended to fall short (Shepherd, 2011). What has persisted instead is a function that is
generally oriented to the respective intra-departmental environments, meeting the needs of both executives and
program managers to improve programs (Bourgeois & Whynot, 2018).

Generally, the whole field of evaluation has moved toward valuing participation of stakeholders in evaluations. A
key reason is to improve the likelihood of evaluation use. By now, there are several conventional ways to involve
managers without ceding them a pivotal role in program evaluations. Typically, evaluations that are done in
government organizations have steering committees that include program representatives (who may or may not
have a vote in committee deliberations). Evaluation steering committees are typically responsible for overseeing the
entire process, from framing the terms of reference for the evaluation to reviewing and approving the draft final
report. Internal evaluators commonly draft the terms of reference with input from the steering committee.
Another common way to involve managers is to include them in the lines of evidence that are gathered: program
managers can be interviewed or surveyed (or perhaps included in focus groups) to solicit their
views on the questions driving the evaluation.

Olejniczak et al. (2016), in their discussion of internal evaluation units, point out that “the evaluation literature is
conspicuously silent on the role that evaluation units play in brokering knowledge between producers and end
users” (p. 174). Our perspective in this textbook is of the evaluator as a neutral, knowledgeable (co-)producer of
evaluative information, and while we do have some reservations about the idea of viewing evaluators as
“knowledge brokers”, we do see merit in the idea of the evaluator as an intermediary. It does help highlight some
of the non-technical competencies that evaluators should have. Olejniczak et al. have created a “knowledge brokering
framework” designed to drive “ongoing dialogue on policy issues” (p. 174).

This perspective of evaluators as intermediaries emphasizes their independence and factors such as the importance
of timeliness, credibility, appropriateness for use, effective delivery channels, and having a good match between the
evaluation design and the questions that decision-makers need answered. The value of building networks, paying
attention to the policy cycle timing, aggregating knowledge over time, and building capacity for an evaluative
culture are all part of the professional judgement component of being an evaluator, above and beyond
methodological know-how.

Overall, when we look at the ongoing relationships between internal evaluators and program managers, evaluators
need to have a leadership role in the evaluation. Program managers depend on evaluators to offer them a
perspective on their programs that will add value and suggest ways of improving efficiency and effectiveness. At
the same time, internal evaluators depend on program managers for information that is required to do program
evaluations. This two-way relationship is overlaid by the purposes for each program evaluation. Where the terms
of reference are formative, evaluators and managers have a shared objective of improving program performance.
But when the terms of reference are summative, the roles of evaluators and managers have the potential to clash.
Further, where a continued focus on summative terms of reference is mandated, it can become challenging for
evaluators to do their work internally. In essence, the internal evaluation function can become caught
in the middle between program management and political decision-makers. These potential tensions
explain why key evaluation organizations such as the Canadian Evaluation Society (2010), the Australasian
Evaluation Society (2013), and the American Evaluation Association (2018; see: Galport & Azzam, 2017; Julnes
& Bustelo, 2017) have highlighted the need for evaluator objectivity, interpersonal skills, situational analysis, and
reflective practice. These domains illustrate the need for evaluators to have competencies beyond technical,
methodological knowledge.

“Independence” for Evaluators


This last point leads to a related issue for internal evaluators. Evaluation, even as it aspires to become a profession,
at least in some countries (Canadian Evaluation Society, 2018; Fierro, Galport, Hunt, Codd, & Donaldson,
2016), does not have the same stature as other related professions, such as accounting or auditing. When internal
auditors do their work, they know that they have the backing of their professional association and should do their
work within guidelines that specify how they should handle a range of interactions with their clients. In situations
where audit findings or even audit methodologies conflict with organizational views, it is possible for auditors to
call on their profession to back them up. This offers them assurance that their work is protected from interference,
and hence, bias (Altschuld & Engle, 2015; Everett, Green & Neu, 2005; Halpern, Gauthier, & McDavid, 2014;
McDavid & Huse, 2015).

Evaluators do not have this kind of backup. Depending on local circumstances, they may enjoy considerable
independence, or not, but there is no professional association on which to call when the circumstances of a
particular evaluation indicate a conflict between the evaluator or evaluators and organizational managers. Again,
these kinds of situations are more likely to arise in summative evaluations.

The key point for evaluators is to have a keen awareness of the intended evaluation purposes, the organizational
and political context, and how to best establish defensibility and credibility for the process and the results.

We will take a closer look at evaluator professionalization in Chapter 12. The opportunities and challenges are
both institutional and ultimately personal for evaluation practitioners.

Building an Evaluative Culture in Organizations: An Expanded Role for Evaluators
Olejniczak et al. (2016) include “accumulating knowledge over time” and “promoting evidence-based culture” (p.
174) as roles for evaluators. Mayne and Rist (2006), Mayne (2008) and Patton (2011) are among the advocates
for a broader role for evaluation and evaluators in organizations. Like Love (1991) and Volkov (2011b), they hold
that it is possible to build internal organizational capacity to perform evaluation-related work that contributes to
changing the organization. Mayne (2008) has outlined the key features of an evaluative culture. We summarize his
main points in Table 11.1.

For Mayne (2008) and Mayne and Rist (2006), there are opportunities for evaluators that go well beyond doing
evaluation studies/projects—they need to play a role in knowledge management for the organization. Evaluators
need to engage with executives and program managers, offer them advice and assistance on a real-time basis, and
take a leading role in training and other kinds of learning events that showcase and mainstream evaluation and
knowledge-related products. Implied in this broader role is evaluator involvement in performance measurement
systems, including the design and implementation and the uses of performance information. The designated
stance of evaluators in such settings is to play a supportive role in building an organizational culture that values
and ultimately relies on timely, reliable, valid, and relevant information to make decisions on programs and
policies. In Wildavsky’s (1979) words, an evaluative culture is one wherein both managers and evaluators feel
supported in “speaking truth to power.”

Table 11.1 Characteristics of an Evaluative Culture in Organizations

An organization that has a strong evaluative culture:

Engages in self-reflection and self-examination by
   Seeking evidence on what it is achieving, using both monitoring and evaluation approaches
   Using evidence of results to challenge and support what it is doing
   Valuing candor, challenge, and genuine dialogue both horizontally and vertically within the organization

Engages in evidence-based learning by
   Allocating time and resources for learning events
   Acknowledging and learning from mistakes and poor performance
   Encouraging and modeling knowledge sharing and fostering the view that knowledge is a resource and not a political weapon

Encourages experimentation and change by
   Supporting program and policy implementation in ways that facilitate evaluation and learning
   Supporting deliberate risk taking
   Seeking out new ways of doing business

Source: Adapted from Mayne (2009, p. 1).

Another way to look at these expanded roles for evaluators is to recall Figure 1.1 in Chapter 1 that depicts the
performance management cycle. In that model, evaluation and performance measurement are involved in all four
phases of performance management: strategic planning and resource allocation; policy and program design;
implementation and management; and assessment and reporting of results. Recalling Figure 10.7, wherein that
model is overlaid with organizational cultural factors, success in transforming an organization to embody an
evaluative culture comes down to managing and mitigating the challenges that complex organizations present to
any organizational change.

Organizations with evaluative cultures can also be seen as learning organizations. Morgan (2006), following
Senge (1990), suggests that learning organizations have developed capacities to:

Scan and anticipate change in the wider environment to detect significant variations . . .
Question, challenge, and change operating norms and assumptions . . .
Allow an appropriate strategic direction and pattern of organization to emerge. (Morgan, 2006, p. 87)

Key to establishing a learning organization is what Argyris (1976) termed double-loop learning—that is, learning
that critically assesses existing organizational goals and priorities in light of evidence and includes options for
adopting new goals and objectives. Organizations must get outside their established structures and procedures and
instead focus on processes to create new information, which in turn can be used to challenge the status quo, make
changes and institutionalize new norms, values and goals. Key attributes are adaptability and improved capacity
for innovation.

Garvin (1993) has suggested five “building blocks” for creating learning organizations, which are similar to key
characteristics of organizations that have evaluative cultures: (1) systematic problem solving using evidence, (2)
experimentation and evaluation of outcomes before broader implementation, (3) learning from past performance,
(4) learning from others, and (5) treating knowledge as a resource that should be widely communicated.

Creating Ongoing Streams of Evaluative Knowledge
Streams of evaluative knowledge include both program evaluations and performance measurement results (Rist &
Stame, 2006). In Chapter 9, we outlined 12 steps that are important in building and sustaining performance
measurement systems in organizations, and in Chapter 10 we outlined the steps that play a role in changing an
existing performance measurement system—rebalancing it in effect so that performance improvement uses of
information are enabled. In both chapters we discussed the importance of real-time performance measurement
and results being available to managers for their monitoring and evaluative uses. By itself, building a performance
measurement system to meet periodic external accountability expectations will not ensure that performance
information will be used internally by organizational managers. The same point applies to program evaluation.
Key to a working evaluative culture would be the usefulness of ongoing evaluative information to managers, and
the responsiveness of evaluators to managerial priorities.

Patton (1994, 2011) and Westley, Zimmerman and Patton (2009) have introduced developmental evaluation as
an alternative to formative and summative program evaluations. Developmental evaluations view organizations in
some settings as co-evolving in complex environments. Organizational objectives (and hence program objectives)
and/or the organizational environment may be in flux. Conventional evaluation approaches that assume a
relatively static program structure for which it is possible to build logic models, for example, and conduct
formative or summative evaluations, may have limited application in co-evolving complex settings. Patton suggests
that evaluators should take on the role of organizational development specialists, working with managers and other
stakeholders as team members to offer evaluative information in real time so that programs and policies can take
advantage of both periodic and ongoing evaluative feedback. Additionally, evaluators would play a role in
structuring an information system where evaluative information could be pooled, monitored, and assessed over
time. This, again, touches on the provocative idea of evaluators as “knowledge brokers” (Olejniczak et al., 2016).

Critical Challenges to Building and Sustaining an Evaluative Culture
The prospects for developing evaluative/learning cultures have become a topic of considerable interest among
those who have followed the rise of performance measurement and performance management in governments,
internationally. Refocusing organizational managers on outcomes instead of inputs and offering them incentives to
perform to those (desired) outcomes has been linked to New Public Management ideals of loosening the process
constraints on organizations so that managers would have more autonomy to improve efficiency and effectiveness
(Hood, 1995). But as Moynihan (2008) and Gill (2011) have pointed out, what has tended to happen in settings
where political cultures are adversarial is that performance expectations (objectives, targets, and measures) have
been layered on top of existing process controls, instead of replacing them. In addition, the pressures of ongoing
fiscal constraints have impacted the processes of summative performance information production and use (Shaw,
2016). In effect, from a managerial perspective, there are more controls in place now that performance
measurement and reporting are part of the picture and less “freedom to manage.” Alignment and control become
the dominant expectation for performance measures and performance results.

The New Public Management-inspired imperative for top-down accountability-driven performance measurement
and performance management has not delivered on the twin promises of more accountability and better
performance (Jakobsen, Baekgaard, Moynihan, & van Loon, 2017; Van Dooren & Hoffmann, 2018). In fact,
there is a growing concern that a focus on accountability drowns out learning as an objective (Hoffman, 2016). In
our textbook, we have voiced similar views, particularly in Chapters 8, 9 and 10.

Mayne (2008), Mayne and Rist (2006), Patton (2011), and other proponents of evaluative cultures are offering us
a normative view of what “ought” to occur in organizations. To build and sustain an evaluative culture, Mayne
(2008) suggests, among other things, that

. . . managers need adequate autonomy to manage for results—managers seeking to achieve outcomes
need to be able to adjust their operations as they learn what is working and what is not. Managing only
for planned outputs does not foster a culture of inquiry about what are the impacts of delivering those
outputs. (p. 2)

But many public sector and nonprofit organizations have to navigate environments or governments that are
adversarial, engendering negative consequences for managers (and their political masters) if programs or policies are
not “successful,” or if candid information about the weaknesses in performance becomes public. What we must
keep in mind, much as we did in Chapter 10 when we were assessing the prospects for performance measurement
and public reporting systems to be used for both accountability and performance improvement, is that the
environments in which public and nonprofit organizations are embedded play an important role in the ways
organizational cultures evolve and co-adapt.

What effect does this have on building evaluative cultures? The main issue is the chilling impact on the willingness
to take risks. Where organizational environments are substantially risk-averse, that will condition and limit the
prospects for developing an organizational culture that encourages innovation. In short, building and sustaining
evaluative cultures requires not only supportive organizational leadership but also a political and organizational
environment (or ways of buffering the organization from that environment) that permits creating and using
evaluative results that are able to acknowledge below-par performance, when it occurs.

What are the prospects for building evaluative cultures? Managers, when confronted by situations where public
performance results need to be sanitized or at least carefully presented to reduce political risks, may choose to
decouple those measures from internal performance management uses, preferring instead to develop, as one
option, other measures that are internal to the organization (Brignall & Modell, 2000; Johnsen, 1999, 2005; Kettl
& Kelman, 2007; McDavid & Huse, 2012; Rautiainen, 2010).

In sum, many organizations find themselves at a crossroads: the promises of NPM, particularly its
accountability-focused drivers of performance improvement, have not been delivered, and new strategies are needed to address
wicked public policy problems. Van Dooren and Hoffman (2018), after reviewing the European experience with
performance management systems, advocate for learning as the goal for designing and implementing performance
measurement systems, and an emphasis on building trust. We would add that their perspective applies to program
evaluation as well. Developing evaluative cultures amounts to a commitment to re-thinking the purposes of
creating evaluative information:

With increasing evidence of its shortcomings, performance management finds itself at a crossroads. The
engineer’s logic—set targets, measure attainment, and punish or reward—has reached its limits. A
learning logic presents itself as a promising alternative. Instead of performance targets, we could have a
performance dialogue (Moynihan, 2008). Nevertheless, a learning system does not come without its
own drawbacks, either (Grieves, 2008; Lewis & Triantafillou, 2012). Organizational structures and
cultures must be revisited. Control mechanisms would need to be replaced by trust mechanisms.
Moreover, the true purpose of and the need for performance information would have to be uncovered
to avoid engaging in performance dialogues for the sake of simply doing so rather than to envision
genuine change and improvement. (p. 221)

Below, we briefly review the example of a Finnish local government approach to building an evaluative culture and
using a performance dialogue to build trust.

Building an Evaluative/Learning Culture in a Finnish Local Government:
Joining Performance Measurement and Performance Management
In Chapter 10, we described a case in which a medium-sized Finnish local government, working with a
team of university-based researchers, is in the process of changing its performance management system from being
mainly focused on external accountability (performance measurement and external reporting) to one where
accountability and program improvement are more balanced (Laihonen & Mäntylä, 2017). The key practice of
interest is strategic management of information, and a continuing performance dialogue. Working with an
approach first suggested by Moynihan (2005) and elaborated by Jakobsen et al. (2017), the research team and the
local government managers designed and implemented a learning forum event wherein departmental managers
came together to convey to each other performance results from their departments. A key part of this forum was a
set of guidelines for interactions that amounted to framing a collegial, non-confrontational environment for this
session. We summarized these guidelines in Chapter 10.

Performance results were discussed and interpreted by the group, mainly based on the experiential knowledge of
the managers around the table. The group also suggested ways of improving performance based on the results and
their interpretations of why observed results were occurring.

Of particular interest for this textbook is that, although this learning forum was focused on performance measurement
and using performance results, a key part of the discussion centred on understanding why the observed patterns of
results occurred. Interpretations of results became the basis for recommended improvements—improvements that
could be implemented and evaluated to see whether they worked. In effect, performance measurement and
program evaluation came together as complementary ways of improving performance.

A subsequent article about this case study (Laihonen & Mäntylä, 2018) highlights the importance of having a
“systematic management framework for gathering and utilizing information” in local government, with four
critical factors:

First, it should be driven by the city’s strategy. Second, it should be carefully integrated into the general
management system. Third, clear processes and responsibilities for refining the data are needed. Fourth,
the quality of the data must be guaranteed. (p. 219)

Conducting periodic learning forums accords with what Mayne (2008) is suggesting about ways of building
toward an evaluative culture. If this Finnish local government continues to use these forums and their results, in
ways that build a sense of trust and mutual support among the participants, important building blocks toward a
learning culture will be institutionalized.

Striving for Objectivity in Program Evaluations
We started this chapter with brief summaries of Stufflebeam and Scriven’s views of the importance of objectivity, a
topic that continues to resonate as a theme in the evaluation field. As evaluation organizations collaborate and
design competency frameworks, objectivity arises as a common thread. For example, the Competencies for
Canadian Evaluation Practice, under “reflective practice”, contains:

Provides independent and impartial perspective: (1) Able to speak truth to power while maintaining an
objective frame of mind. (2) Committed to present evaluation results as objectively as possible.
(Canadian Evaluation Society, 2010, p. 5)

The Australasian Evaluation Society’s (AES, 2013) Evaluators’ Professional Learning Competency Framework lists under “personal skills”
that the evaluator should “maintain an objective perspective” (p. 15).

Schweigert (2011) offers this view of the desired normative stance of internal evaluators in relation to their roles in
the organizations in which they work:

However invested internal evaluators may be in the success or direction of the organization, they occupy
a unique position within the organization as its view from the outside—viewing the organization’s work
and results with the eye of Adam Smith’s “impartial spectator” (Smith, 1790/1984), reflecting back to
their co-workers an objective view of their work (Schweigert, 2011, p. 48).

Chelimsky (2008), in her description of the challenges to independence that are endemic in the work that the U.S.
Government Accountability Office (GAO) does, makes a case for the importance of evaluations being objective:

The strongest defense for an evaluation that’s in political trouble is its technical credibility, which, for
me, has three components. First, the evaluation must be technically competent, defensible, and
transparent enough to be understood, at least for the most part. Second, it must be objective: That is, in
Matthew Arnold’s terms (as cited in Evans, 2006), it needs to have “a reverence for the truth.” And
third, it must not only be but also seem objective and competent: That is, the reverence for truth and
the methodological quality need to be evident to the reader of the evaluation report. So, by technical
credibility, I mean methodological competence and objectivity in the evaluation, and the perception by
others that both of these characteristics are present. (p. 411)

Clearly, Chelimsky sees the value in establishing that GAO evaluations are objective, and are seen to be objective.
At different points in time, “objective” has also been a desired attribute of the information produced in federal
evaluations in Canada: “Evaluation . . . informs government decisions on resource allocation and reallocation
by . . . providing objective information to help Ministers understand how new spending proposals fit” (Treasury
Board of Canada Secretariat, 2009, sec. 3.2).

As we indicated earlier in this chapter, Scriven’s (1997) view is that objectivity is an important part of evaluation
practice. Other related professions have asserted that professional practice is, or at least ought to be, objective. In
the 2017 edition of the Government Auditing Standards (GAO, 2017), government auditors are enjoined to
perform their work this way:

3.12 Auditors’ objectivity in discharging their professional responsibilities is the basis for the credibility
of auditing in the government sector. Objectivity includes independence of mind and appearance when
conducting engagements, maintaining an attitude of impartiality, having intellectual honesty, and being
free of conflicts of interest. Maintaining objectivity includes a continuing assessment of relationships
with audited entities and other stakeholders in the context of the auditors’ responsibility to the public.
The concepts of objectivity and independence are closely related. Independence impairments affect
auditors’ objectivity. (p. 15)

Indeed, if we see evaluators as competing with auditors or management consultants for clients, establishing
objectivity, to the extent possible, could be an important factor in maintaining the credibility of the performance
information.

Can Program Evaluators Claim Objectivity?
How do we defend a claim to a prospective client that our work is objective? Weiss (2013), in summing up the
nexus of methodology and context for program evaluators, makes this point: “In the quest for fair assessment,
advantages accrue not only to methodological expertise but also to sensitive observation, insight, awareness of
context and understanding. Evaluators will be willing to explore all the directions that the findings open up.
Inevitably, they won’t attain complete objectivity, but we can try for it” (p. 132).

Scriven (1997) develops an approach to objectivity that relies on a legal metaphor to understand the work of an
evaluator: For him, when we do program evaluations, we can think of ourselves as expert witnesses. We are, in
effect, called to “testify” about a program, we offer our expert views, and the “court” (our client) can decide what
to do with our contributions.

He takes the courtroom metaphor further when he asserts that in much the same way that witnesses are sworn to
tell “the truth, the whole truth, and nothing but the truth” (p. 496), evaluators can rely on a common-sense
notion of the truth as they do their work. If such an oath “works” in courts (Scriven believes it does), then despite
the philosophical questions that can be raised by a claim that something is true, we can and should continue to
rely on a common-sense notion of what is true and what is not. Is Scriven’s definition (or others) of objectivity
defensible?

Scriven’s main point is that program evaluators should be prepared to offer objective evaluations and that to do so,
it is essential that we recognize the difference between conducting ourselves in ways that promote our objectivity
and ways that do not. Even those who assert that there often cannot be ultimate truths in our work are, according
to Scriven, uttering a self-contradictory assertion: They wish to claim the truth of a statement that there are no
truths.

Although Scriven’s argument has a common-sense appeal, there are two main issues in the approach he takes.
First, his metaphor of evaluators as expert witnesses does have some limitations. In courts of law, expert witnesses
are routinely challenged by their counterparts and by opposing lawyers—they can be cross-examined. Unlike
Scriven’s evaluators, who do their work, offer their report, and then may absent themselves to avoid possible
compromises of their objectivity, expert witnesses in courts undergo a high level of scrutiny. Even where expert
witnesses have offered their version of the truth, it is often not clear whether that is their view or the views of a
party to a legal dispute. “Expert” witnesses can sometimes be biased.

Second, witnesses speaking in court can be severely penalized if it is discovered that they have lied under oath. For
program evaluators, it is far less likely that sanctions will be brought to bear even if it could be demonstrated that
an evaluator did not speak “the truth.” The reality is that in the practice of program evaluation for accountability
purposes, clients can shop for evaluators who are likely to provide a sanitized evaluation. Certainly, evaluators may
find there are pressures to support the socio-political ideology of the environment in which they work (and are
funded). Mathison (2018), in a thought-provoking “three reasons I believe evaluation has not and is not
contributing enough to the public good”, argues the following:

First, evaluation theory and practice (like many social practices) reflects the values, beliefs and
preferences of the time. As such, evaluation is constrained by dominant socio-political ideologies.
Second, evaluation fundamentally lacks independence: it is a service provided to those with power and
money and, in that relationship, becomes a practice constrained in its capacity to contribute to the
public good. And third, evaluation is fundamentally a conserving practice, working within domains
established by others, and more often than not maintaining the status quo. (p. 114)

That is sobering but, if nothing else, her perspective underlines our emphasis on methodological rigour to
substantiate an evaluation’s credibility and defensibility, and practical wisdom to see the bigger picture within
which one is working. It ties in with Picciotto’s (2015) concerns about contexts where “evaluation has been
captured by powerful interests whether globally, within countries or within organizations” (p. 150), and his
emphasis on the need for evaluator independence within a “democratic evaluation model” (p. 151).

We will explore these issues further in Chapter 12 but will focus here on one slice: the importance of transparency
and replicability for establishing objectivity. One component of methodological rigor is having a transparent
process and transparent reporting that invite both scrutiny and repeatability of evaluative findings.

Objectivity and Replicability
For scientists, objectivity has two important elements, both of which are necessary: transparency (scrutability) of methods and replicability of findings. Methods and procedures need
to be constructed and applied so that both the work done and the findings are open to scrutiny by one’s peers.
Although the process of doing a science-based research project does not by itself make the research objective, it is
essential that this process be transparent. Scrutability of methods facilitates repeating the research. If findings can
be replicated independently, the community of scholars engaged in similar work confers objectivity on the
research. Even then, scientific findings are not treated as absolutes. Future tests might raise further questions, offer
refinements, and generally increase knowledge.

This working definition of objectivity does not imply that objectivity confers “truth” on scientific findings.
Indeed, the idea that objectivity is about scrutability and replicability of methods and repeatability of findings is
consistent with Kuhn’s (1962) notion of paradigms. Kuhn suggested that communities of scientists who share a
“worldview” are able to conduct research and interpret the results. Within a paradigm, “normal science” is about
solving puzzles that are implied by the theoretical structure that undergirds the paradigm. “Truth” is agreement,
based on research evidence, among those who share a paradigm.

In program evaluation practice, much of what we call methodology is tailored to particular settings. Increasingly,
we are taking advantage of mixed qualitative–quantitative methods (Creswell, 2009; Hearn, Lawler, & Dowswell,
2003; Johnson & Onwuegbuzie, 2004) when we design and conduct evaluations, and our own judgment as
professionals plays an important role in how evaluations are designed and data are gathered, interpreted, and
reported. Owen and Rogers (1999) make this point when they state,

. . . no evaluation is totally objective: it is subject to a series of linked decisions [made by the evaluator].
Evaluation can be thought of as a point of view rather than a statement of absolute truth about a
program. Findings must be considered within the context of the decisions made by the evaluator in
undertaking the translation of issues into data collection tools and the subsequent data analysis and
interpretation. (p. 306)

The Federal Government of Canada’s OCG (Office of the Comptroller General, 1981) was among the
government jurisdictions that historically advocated the importance of objectivity and replicability in evaluations:

Objectivity is of paramount importance in evaluative work. Evaluations are often challenged by
someone: a program manager, a client, senior management, a central agency or a minister. Objectivity
means that the evidence and conclusions can be verified and confirmed by people other than the
original authors. Simply stated, the conclusions must follow from the evidence. Evaluation information
and data should be collected, analyzed and presented so that if others conducted the same evaluation and
used the same basic assumptions, they would reach similar conclusions. (Treasury Board of Canada
Secretariat, 1990, p. 28, emphasis added)

This emphasis on the replicability of evaluation findings and conclusions is similar to the way auditors define high-
quality work in their profession. It implies, at least in principle, that the work of one evaluator or one evaluation
team could be repeated, with the same results, by a second evaluation of the same program.

The OCG criterion of repeatability is similar in part to the way scientists do their work. Findings and conclusions,
to be accepted by the discipline, must be replicable (Asendorpf et al., 2013).

There is, however, an important difference between most program evaluation practice and the practice of scientific
disciplines. In the sciences, the methodologies and procedures that are used to conduct research and report the
results are intended to facilitate replication. Methods are scrutinized by one’s peers, and if the way the work has
been conducted and reported passes this test, it is then “turned over” to the community of peer researchers, where
it is subjected to independent efforts to replicate the results. In other words, meaningfully claiming objectivity would
require both the use of replicable methodologies and actual replications of the evaluations of programs and
policies. In practical terms, satisfying both of these criteria is rare.
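To make the idea of corroboration through replication a bit more concrete, the following is a minimal sketch in Python, using entirely hypothetical effect estimates and standard errors rather than figures from any actual evaluation; the function name and numbers are purely illustrative. It shows one common way of checking whether an original finding and an independent replication are statistically consistent: a simple z-test for the difference between two independent estimates.

import math

def consistency_z_test(effect_1, se_1, effect_2, se_2):
    """Two-sided z-test for the difference between two independent effect estimates."""
    z = (effect_1 - effect_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)
    # Two-sided p-value from the standard normal distribution
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p

# Hypothetical example: the original evaluation estimates a 12-point drop in the
# outcome of interest (standard error 3); an independent replication estimates a
# 9-point drop (standard error 4).
z, p = consistency_z_test(-12.0, 3.0, -9.0, 4.0)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
# A large p-value indicates that the two estimates are statistically compatible;
# it does not, by itself, establish that either estimate is unbiased.

Statistical compatibility of estimates is, of course, only one piece of replication; the methods themselves must also be documented transparently enough that an independent team could repeat them.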

In the sciences, if a particular set of findings cannot be replicated by independent researchers, the community of
research peers eventually discards the results as an artifact of the setting or the scientist’s biases. Transparent
methodologies are necessary but not sufficient to establish objectivity of scientific results. The initial reports of cold
fusion reactions (Fleischmann & Pons, 1989), for example, prompted additional attempts to replicate the reported
findings, to no avail. Fleischmann and Pons’s research methods proved to be faulty, and cold fusion did not pass the
test of replicability.

A recent controversy that also hinges on being able to replicate experimental results is the question of whether
high-energy neutrinos can travel faster than the speed of light. If such a finding were corroborated (reproduced by
independent teams of researchers), it would undermine a fundamental assumption of Einstein’s relativity theory—
that no particle can travel faster than the speed of light. The back-and-forth “dialogue” in the high-energy physics
community is illustrated by a publication claiming that one set of experimental results (apparently
replicating the original experiment) was wrong and that Einstein’s theory is safe (Antonello et al., 2012).

An ongoing controversy (as this edition of the textbook is being prepared) is a research program to determine
whether it is possible to transmit information between a pair of entangled quantum particles at a distance and in
doing so, exceed the speed of light (a fundamental constant that Einstein said could not be exceeded). Initial
research results (Reiserer et al., 2016) suggest that it is possible. Others have disputed this finding and have
embarked on a research program to independently test that result (Handsteiner et al., 2017). If it is possible to
transmit information instantaneously over cosmic distances, in principle it would be possible to design an
interstellar communication device that resembles the science fiction-inspired ansible popularized by Ursula Le
Guin (1974).

Although the OCG criterion of repeatability (Treasury Board of Canada Secretariat, 1990) in principle might be
desirable, it is rarely applicable to program evaluation practice. Even in the audit community, it is rare to repeat
the fieldwork that underlies an audit report. Instead, the fieldwork is conducted so that all findings are
documented and corroborated by more than one line of evidence (or one source of information). In effect, there is
an audit trail for the evidence and the findings.

Where does this leave us? Scriven’s (1997) criteria for objectivity—with basis and without bias—have defensibility
limitations inasmuch as they usually depend on the “objectivity” of individual evaluators in particular settings.
Not even in the natural sciences, where the subject matter and methods are more conducive to Scriven’s
definition, do researchers rely on one scientist’s assertions about “facts” and “objectivity.” Instead, the scientific
community demands that the methods and results be stated so that the research results can be corroborated or
disconfirmed, and it is via that process that “objectivity” is conferred. Objectivity is not an attribute of one researcher
or evaluator but instead is predicated of the process in the scientific community in which that researcher or evaluator
practices. In some professional settings where teams of evaluators work on projects together, it may be possible to
construct internal challenge functions and even share draft reports externally to increase the likelihood that the
final product will be viewed as defensible and robust. But repeating an evaluation to confirm the replicability of
the findings is rare. Below, we summarize a unique example of a program evaluation being transparent and
replicated: the police use of body-worn cameras.

Implications for Evaluation Practice: A Police Body-Worn Cameras Example


The widespread interest in body-worn cameras (BWCs) for police departments internationally has produced a
large number of studies, some of which have replicated the original Rialto, California experiment done in 2012–
2013. Seven U.S. urban police departments have implemented and evaluated the same program as Rialto. It is,
therefore, possible to compare key results.

Starting in Chapter 1 we introduced body-worn camera program evaluations as an ongoing series of projects
focused on this high-profile public policy intervention. The first evaluation project was completed in 2014 in
Rialto, California (Ariel, Farrar, & Sutherland, 2015) and since then a large number of studies have been done
(Maskaly, Donner, Jennings, Ariel, & Sutherland, 2017). Ariel et al. (2017) report results from seven replications of
the original evaluation that was done in Rialto. All seven U.S. cities are medium-sized (109,000 to 751,500
population), and the same evaluation design was used in all seven police departments: a before–after time series, plus
a randomized controlled trial for one year, in which patrol shifts were the unit of analysis and officers on program
(“treatment”) shifts had to have their cameras on all the time. A key dependent variable in the Rialto
experiment was the number of complaints against officers, and that variable was the main focus in the seven
replications. Figure 11.1 displays the before-after differences in the percentage of complaints against officers
(treatment group and control group) in all seven cities.

Replicating Program Evaluations: Body-worn Camera Experiments in Seven US Cities

Figure 11.1 Complaints Filed Against Officers in the Seven Experimental Sites: Before-After Percent Changes

Source: Ariel et al., 2017, p. 302. Reprinted with permission.

All seven police departments experienced large before-after differences in complaints for both the treatment and the
control shifts. Counter-intuitively, when the complaints are compared between the two groups for each
department, there are no significant differences—in other words, the drops happened in both the control and
treatment groups. This happened in Rialto as well. The authors of the report suggest that in all eight research
designs, diffusion of treatment (a construct validity threat) occurred (Ariel et al., 2017). Ariel and his colleagues
termed this “contagious accountability.”
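To illustrate the kind of comparison summarized in Figure 11.1, here is a minimal sketch in Python using invented complaint counts rather than the data reported by Ariel et al.; the site names, counts, and the simplified even split of the baseline between the two arms are all assumptions made only for demonstration. It computes before–after percent changes in complaints for treatment and control shifts at each site, along with the treatment–control difference.

# Invented complaint counts for two hypothetical sites (NOT the Ariel et al. data):
# complaints in the year before the trial, and complaints during the trial year
# attributed to treatment (camera-on) shifts and to control shifts.
sites = {
    "Site A": {"before": 120, "treatment_after": 14, "control_after": 16},
    "Site B": {"before": 80, "treatment_after": 11, "control_after": 12},
}

def pct_change(before, after):
    """Percent change from the baseline period to the trial year."""
    return 100.0 * (after - before) / before

for site, c in sites.items():
    # Simplifying assumption: half of the baseline complaints are attributed to
    # each arm; a real analysis would prorate by the number of shifts in each arm.
    baseline_per_arm = c["before"] / 2.0
    treat = pct_change(baseline_per_arm, c["treatment_after"])
    control = pct_change(baseline_per_arm, c["control_after"])
    print(f"{site}: treatment {treat:.1f}%, control {control:.1f}%, "
          f"difference {treat - control:.1f} percentage points")

A pattern of large drops in both arms combined with a small treatment–control difference is what Ariel et al. (2017) interpret as diffusion of treatment (“contagious accountability”); the published analysis, of course, rests on the randomized design and formal significance tests rather than on simple percent changes.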

For us, this series of replicated program evaluations is an exception to our usual practice in the field. We rarely
find these replications, but when we do, we have an opportunity to see whether programs have external validity:
“We demonstrated that the use of BWCs in police operations dramatically reduces the incidence of complaints
lodged against police officers, thus illustrating the treatment effect, first detected in a relatively small force in
Rialto, carries strong external validity” (Ariel et al., 2017, p. 302).

The realities of program evaluation practice can weaken claims that we, as evaluators, can be objective in the
work we do. Overall, evaluation is a craft that mixes methodologies and methods together with professional
judgment to produce products that are intended to be methodologically defensible, yet also tailored appropriately
to contexts and intended uses. One conundrum that evaluators encounter is that the more an evaluation is tailored
to the organizational, economic, social, political, and/or geographic context, the more difficult it becomes to argue
for external generalizability of the results. Replication of studies is often a luxury, but as we’ve seen with the body-
worn cameras study, it does help establish objectivity and generalizability.

Criteria for High-Quality Evaluations
A range of program evaluation standards, codes of conduct, and ethical guidelines has been developed (for
example: American Educational Research Association, 2011; American Evaluation Association, 2018; Australasian
Evaluation Society, 2013a, 2013b; Canadian Evaluation Society, 2010, 2012; Yarbrough, Shulha, Hopson, &
Caruthers, 2011). These intersecting resources emphasize various features that underpin ‘quality’ in evaluations
and professional practice, including evaluator objectivity in their work as contributors to the performance
management cycle.

Most professional associations are national in scope, but when we look across their best practice and ethics
guidelines, we see common themes. There are themes related to methodological rigour and related technical
competencies, and there are themes related to evaluators’ attitude, their awareness and appreciation of context, and
the importance of interpersonal skills. In 2012, the Canadian Evaluation Society adopted the Program Evaluation
Standards developed by Yarbrough et al. (2011) that address the usefulness, effectiveness, fairness, reliability, and
accountability of evaluations (see also, Buchanan & Kuji-Shikatani, 2013). The American Evaluation Association’s
(2018) Guiding Principles for Evaluators present standards pertaining to five overlapping issues: systematic inquiry;
evaluators’ competencies; personal and professional integrity; respect for people; and common good and equity.
The Australasian Evaluation Society’s (2013a) Evaluators’ Professional Learning Competency Framework presents
seven “domains of competence” intended to support ongoing improvement during various phases of the
evaluation process:

1. Evaluative Attitude and Professional Practice
2. Evaluation Theory
3. Culture, Stakeholders and Context
4. Research Methods and Systematic Inquiry
5. Project Management
6. Interpersonal Skills
7. Evaluation Activities

The Organisation for Economic Cooperation and Development’s (2010) Quality Standards for Development
Evaluation provide principles regarding the planning, implementation, and utility of evaluations, as well as
broader issues related to development evaluation, including transparency, ethics, working in partnerships, and
building capacity. These types of guidelines for conducting high quality evaluations, in conjunction with
professional ethical standards and prescribed core competencies, set the bar for professional evaluation practices.

It may be useful here to provide a real-world example of what a U.S. national service-delivery department includes
about evaluator qualifications, independence, and objectivity in its evaluation standards of practice. The United
States President’s Emergency Plan for AIDS Relief (PEPFAR) originated in 2003, and “expects all PEPFAR
implementing agencies and those who procure and implement evaluations to commit themselves at a minimum to
evaluation practices based on the standards of practice”, which include “ensure appropriate evaluator qualifications
and independence” (PEPFAR, 2017, p. 7):

Ensure that an evaluator has appropriate experience and capabilities. Manage any conflicts of interest of
the evaluators (or team) and mitigate any untoward pressures that could be applied to the evaluator or
evaluation team that would influence its independence.

It is important that the evaluation team members:

are qualified to conduct the evaluation through knowledge and experience;
disclose any potential conflict of interest with the evaluation;
are protected from any undue pressure or influence that would affect the independence of the evaluation or
objectivity of the evaluator(s).

What we can see here is an emphasis on evaluator qualifications, and a concern about pressures on an evaluator or
evaluation team that can undermine the quality of the evaluation. The standards continue:

Managing the independence of the evaluation includes informing and educating all those participating
in the evaluation (including those collecting data, funding, reviewing, or approving the evaluation) that
the planning, implementation and results of the evaluation should not be manipulated in any way to
suggest undue influence. (p. 7)

There are a number of criteria for high-quality evaluations, and in this chapter, we have tried to address the
evaluator’s role in the program management context, and the challenges to designing, conducting, and reporting
on an evaluation. Having a defensible and credible methodology is foundational, but: (1) although objectivity is
included in existing evaluation guidelines, predicating objectivity of a single evaluator(s) is questionable,
particularly in high-stakes contexts, (2) interpersonal and networking skills are needed, (3) evaluators must be
aware of the performance management cycle, the policy cycle, and the intended use of the performance
information (evaluations or performance measures), and (4) evaluators have a role to play in helping develop an
evaluative, learning culture in an organization.

Summary
The relationships between managers and evaluators are affected by the incentives that each side faces in particular program management
contexts. If evaluators have been commissioned to conduct a summative evaluation, program managers are likely to be less
forthcoming about their programs, particularly where the stakes are perceived to be high. This is what Wildavsky pointed to in 1979.
Expecting managers, under these conditions, to participate as neutral parties in an evaluation ignores the potential for conflicts of
commitments, which can affect the accuracy and completeness of information that managers provide about their own programs. This
problem parallels the problem that exists in performance measurement systems, where public, high-stakes, summative uses of performance
results will tend to incentivize gaming of the system by those who are affected by the consequences of disseminating performance results
(Gill, 2011).

Formative evaluations, where it is generally possible to project a “win-win” scenario for managers and evaluators, offer incentives for
managers to be forthcoming so that they benefit from an assessment based on an accurate and complete understanding of their programs.
Historically, a majority of evaluations have been formative. Although advocates for program evaluation and performance measurement
imply that evaluations can be used for resource allocation/reallocation decisions, it is comparatively rare to have an evaluation that does
that. There has been a gap between the promise and the performance of evaluation functions in governments in that regard (Muller-
Clemm & Barnes, 1997; Mayne, 2018; Shaw, 2016; Shepherd, 2018). On the other hand, even in cases where formative evaluations
risk being drawn into summative uses such as budget cuts, evidence has shown that internal evaluations
have been useful for program managers. As well, internal evaluations and strategic performance measures have shown value as part of a
“learning culture” where a “continuous performance dialogue” is the goal. This seems particularly feasible at the local government level
(Laihonen & Mäntylä, 2017, 2018).

Many evaluation approaches encourage or even mandate manager or organizational participation in evaluations. Where utilization of
evaluation results is a central concern of evaluation processes, managerial involvement has been shown to increase uses of evaluation
findings. Some evaluation approaches—empowerment evaluation is an important example—suggest that control of the
evaluation process should be devolved to those in the organizations and programs being evaluated. This view is contested in the evaluation
field (Miller & Campbell, 2006).

We consider objectivity as a criterion for evaluation practice in this chapter, pointing out that aside from technical imperatives for the
evaluator or evaluation team, there are other key issues in designing and conducting an evaluation or performance measurement system: a
keen awareness of the proposed use(s) for the performance information; the economic, social, and political context of the evaluative
process or project; and the timing and positioning of the evaluative product in terms of the program management cycle. Additional
related topics, such as evaluator ethics, are discussed in the following chapter.

Evaluators, accountants, and management consultants will continue to be connected with efforts by government and nonprofit
organizations to be more accountable. In some situations, evaluation professionals, accounting professionals and management consultants
will compete for work with clients. Because the accounting and audit professions assert that their work is ‘objective’, evaluators have to
address the issue of how to characterize their own practice so that clients can be assured that the work of evaluators meets standards of
rigor, defensibility, neutrality, and ethical practice.

Discussion Questions
1. Why are summative evaluations more challenging to do than formative evaluations?
2. What is a learning organization, and how is the culture of a learning organization supportive of evaluation?
3. What are the advantages and disadvantages of relying on internal evaluators in public sector and nonprofit organizations?
4. What is an evaluative culture in an organization? What roles should evaluators play in building and sustaining such a culture?
5. What would it take for an evaluator to claim that her or his evaluation work is objective? Given those requirements, is it possible
for any evaluator to say that his or her evaluation is objective? Under what circumstances, if any?
6. Suppose that you are a practicing evaluator and you are discussing a possible contract to do an evaluation for a nonprofit agency.
The agency director is interested in your proposal but, in the discussions, says that he wants an objective evaluation. If you are
willing to tell him that your evaluation will be objective, you likely have the contract. How would you respond to this situation?
7. Other professions such as medicine, law, accounting, and social work have guidelines for professional practice that can be
enforced—these guidelines constrain professional practice but at the same time, can protect practitioners from pressures that
contravene professional practice. Evaluation has guidelines—many national evaluation associations have standards for practice and
ethical guidelines—but they are not enforceable. What would be the advantages and disadvantages of the evaluation profession
having enforceable practice guidelines? Who would do the enforcing? How would enforcement happen? Who would gain and
who would lose from having enforceable practice guidelines?
8. Read the following paragraphs and then see whether you agree or disagree with the point of view being expressed in them:

Program managers know from their own experiences that in the course of a day, they will conduct several informal “evaluations”
of situations, resulting in some instances in important decisions. The data sources for many of these “evaluations” are a
combination of documentary, observational, and interaction-based evidence, together with their own experiences (and their own
judgment). Managers routinely develop working hypotheses or conjectures about situations or their interactions with people, and
“test” these hypotheses informally with subsequent observations, meetings, or questions. Although these “evaluation” methods are
informal, and can have biases (e.g., not having representative views on an issue, or weighting the gathered data inappropriately),
they are the core of much current managerial practice. Henry Mintzberg (1997), who has spent much of his career observing
managers to understand the patterns in their work, suggests that managerial work is essentially focused on being both a conduit
for and synthesizer of information. Much of that information is obtained informally (e.g., by face-to-face meetings, telephone
conversations, or casual encounters in and outside the workplace).

Program evaluators can also develop some of the same skills managers use to become informed about a program and its context.
Seasoned evaluators, having accumulated a wide variety of experiences, can often grasp evaluation issues, possible organizational
constraints, and other factors bearing on their work, in a short period of time.
9. In the field of program evaluation, there are important differences of opinion around the question of how “close” program
evaluators should get to the people and the programs being evaluated. In general, how close do you think that evaluators should
get to program managers? Why?

References
Alkin, M. C. (Ed.). (2012). Evaluation roots: A wider perspective of theorists’ views and influences. Thousand Oaks,
CA: Sage.

Altschuld, J. W., & Engle, M. (Eds.). (2015). Accreditation, certification, and credentialing: Relevant concerns for US
evaluators. New Directions for Evaluation, 145, 21–37.

American Educational Research Association. (2011). Code of ethics: American Educational Research Association—
approved by the AERA Council February 2011. Retrieved from
http://www.aera.net/Portals/38/docs/About_AERA/CodeOfEthics(1).pdf

American Evaluation Association. (2018). Guiding principles for evaluators. Retrieved from
http://www.eval.org/p/cm/ld/fid=51

Antonello, M., Aprili, P., Baibussinov, B., Baldo Ceolin, M., Benetti, P., Calligarich, E., . . . Zmuda, J. (2012). A
search for the analogue to Cherenkov radiation by high energy neutrinos at superluminal speeds in ICARUS.
Physics Letters B, 711 (3–4), 270–275.

Argyris, C. (1976). Single-loop and double-loop models in research on decision making. Administrative Science
Quarterly, 21(3), 363–375.

Ariel, B., Farrar, W., & Sutherland, A. (2015). The effect of police body-worn cameras on use of force and
citizens’ complaints against the police: A randomized controlled trial. Journal of Quantitative Criminology,
31(3), 509–535.

Ariel, B., Sutherland, A., Henstock, D., Young, J., Drover, P., Sykes, J., . . . & Henderson, R. (2017).
“Contagious accountability”: A global multisite randomized controlled trial on the effect of police body-worn
cameras on citizens’ complaints against the police. Criminal Justice and Behavior, 44(2), 293–316.

Arnaboldi, M., Lapsley, I., & Steccolini, I. (2015). Performance management in the public sector: The ultimate
challenge. Financial Accountability & Management, 31(1), 1–22.

Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., . . . & Perugini, M.
(2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27(2),
108–119.

Australasian Evaluation Society (AES). (2013a). Evaluators’ professional learning competency framework. Retrieved
from
https://www.aes.asn.au/images/stories/files/Professional%20Learning/AES_Evaluators_Competency_Framework.pdf

Australasian Evaluation Society. (2013b). Guidelines for the ethical conduct of evaluations. Retrieved from
https://www.aes.asn.au/images/stories/files/membership/AES_Guidelines_web_v2.pdf

Bevan, G., & Hamblin, R. (2009). Hitting and missing targets by ambulance services for emergency calls: Effects
of different systems of performance measurement within the UK. Journal of the Royal Statistical Society: Series A
(Statistics in Society), 172(1), 161–190.

Bourgeois, I., & Whynot, J. (2018). Strategic evaluation utilization in the Canadian Federal Government.
Canadian Journal of Program Evaluation, 32(3), 327–346.

Brignall, S., & Modell, S. (2000). An institutional perspective on performance measurement and management in
the ‘new public sector’. Management Accounting Research, 11(3), 281–306.

Buchanan, H., & Kuji-Shikatani, K. (2013). Evaluator competencies: The Canadian experience. Canadian Journal
of Program Evaluation, 28(3), 29–47.

Canadian Evaluation Society. (2010). Competencies for Canadian evaluation practice. Retrieved from
http://www.evaluationcanada.ca/txt/2_competencies_cdn_evaluation_practice.pdf

Canadian Evaluation Society. (2012). Program evaluation standards. Retrieved from
https://evaluationcanada.ca/program-evaluation-standards

Canadian Evaluation Society. (2018). About the CE Designation. Retrieved from https://evaluationcanada.ca/ce

Chelimsky, E. (2008). A clash of cultures: Improving the “fit” between evaluative independence and the political
requirements of a democratic society. American Journal of Evaluation, 29(4), 400–415.

Cousins, J. B. (2005). Will the real empowerment evaluation please stand up? A critical friend perspective. In D.
Fetterman & A. Wandersman (Eds.), Empowerment evaluation principles in practice (pp. 183–208). New York:
Guilford Press.

Cousins, J. B., & Whitmore, E. (1998). Framing participatory evaluation. New Directions for Evaluation, 80,
5–23.

Cousins, J. B., & Chouinard, J. A. (2012). Participatory evaluation up close: An integration of research based
knowledge. Charlotte, NC: Information Age Publishing, Inc.

Creswell, J. W. (2009). Research design: Qualitative, quantitative, and mixed methods approaches. Thousand Oaks,
CA: Sage.

de Lancer Julnes, P., & Holzer, M. (2001). Promoting the utilization of performance measures in public
organizations: An empirical study of factors affecting adoption and implementation. Public Administration
Review, 61(6), 693–708.

Dobell, R., & Zussman, D. (2018). Sunshine, scrutiny, and spending review in Canada, Trudeau to Trudeau: From program evaluation and policy to commitment and results. Canadian Journal of Program Evaluation, 32(3), 371–393.

Evans, H. (2006, June 18). Eye on the times. New York Times Book Review, p. 16.

Everett, J., Green, D., & Neu, D. (2005). Independence, objectivity and the Canadian CA profession. Critical
Perspectives on Accounting, 16(4), 415–440.

Fetterman, D. (1994). Empowerment evaluation. Evaluation Practice, 15(1), 1–15.

Fetterman, D., & Wandersman, A. (2007). Empowerment evaluation: Yesterday, today, and tomorrow. American
Journal of Evaluation, 28(2), 179–198.

Fetterman, D., Rodríguez-Campos, L., Wandersman, A., & O’Sullivan, R. G. (2014). Collaborative,
participatory, and empowerment evaluation: Building a strong conceptual foundation for stakeholder
involvement approaches to evaluation (A response to Cousins, Whitmore, & Shulha, 2013). American Journal
of Evaluation, 35(1), 144–148.

Fierro, L., Galport, N., Hunt, A., Codd, H., & Donaldson, S. (2016). Canadian Evaluation Society Credentialed
Evaluator Designation Program. Claremont Graduate University, Claremont Evaluation Centre. Retrieved from
https://evaluationcanada.ca/txt/2016_pdp_evalrep_en.pdf

Fleischmann, M., & Pons, S. (1989). Electrochemically induced nuclear fusion of deuterium. Journal of
Electroanalytical Chemistry, 261(2A), 301–308.

Galport, N., & Azzam, T. (2017). Evaluator training needs and competencies: A gap analysis. American Journal of
Evaluation, 38(1), 80–100.

Gao, J. (2015). Performance measurement and management in the public sector: Some lessons from research
evidence. Public Administration and Development, 35(2), 86–96.

Garvin, D. A. (1993). Building a learning organization. Harvard Business Review, 71(4), 78–90.

Gill, D. (Ed.). (2011). The iron cage recreated: The performance management of state organisations in New Zealand.
Wellington, NZ: Institute of Policy Studies.

Government Accountability Office. (2017). Government Auditing Standards: Exposure Draft. Retrieved from:
https://www.gao.gov/assets/690/683933.pdf

Grieves, J. (2008). Why we should abandon the idea of the learning organization. The Learning Organization,
15(6), 463–473.

Halpern, G., Gauthier, B., & McDavid, J. C. (2014). Professional standards for evaluators: The development of
an action plan for the Canadian Evaluation Society. The Canadian Journal of Program Evaluation, 29(3), 21.

Handsteiner, J., Friedman, A. S., Rauch, D., Gallicchio, J., Liu, B., Hosp, H., . . . & Mark, A. (2017). Cosmic Bell test: Measurement settings from Milky Way stars. Physical Review Letters, 118(6), 060401, 1–8.

Head, B. W. (2013). Evidence-based policymaking–speaking truth to power? Australian Journal of Public Administration, 72(4), 397–403.

Hearn, J., Lawler, J., & Dowswell, G. (2003). Qualitative evaluations, combined methods and key challenges:
General lessons from the qualitative evaluation of community intervention in stroke rehabilitation. Evaluation,
9(1), 30–54.

Hoffmann, C. (2016). At a crossroads—How to change ways towards more meaningful performance management?
Antwerp: University of Antwerp.

Hood, C. (1995). The “New Public Management” in the 1980s: Variations on a theme. Accounting, Organizations
and Society, 20(2–3), 93–109.

Jakobsen, M. L., Baekgaard, M., Moynihan, D. P., & van Loon, N. (2017). Making sense of performance regimes: Rebalancing external accountability and internal learning. Perspectives on Public Management and Governance, 1(2), 127–141.

Johnsen, Å. (1999). Implementation mode and local government performance measurement: A Norwegian experience. Financial Accountability & Management, 15(1), 41–66.

Johnsen, Å. (2005). What does 25 years of experience tell us about the state of performance measurement in public policy and management? Public Money & Management, 25(1), 9–17.

Johnson, R., & Onwuegbuzie, A. (2004). Mixed methods research: A research paradigm whose time has come.
Educational Researcher, 33(7), 14–26.

Julnes, G., & Bustelo, M. (2017). Professional evaluation in the public interest(s). American Journal of Evaluation,
38(4), 540–545.

King, J. A., & Stevahn, L. (2015). Competencies for program evaluators in light of adaptive action: What? So
What? Now What? New Directions for Evaluation, 145, 21–37.

Kettl, D., & Kelman, S. (2007). Reflections on 21st century government management. Washington, DC: IBM
Center for the Business of Government.

Kristiansen, M., Dahler-Larsen, P., & Ghin, E. M. (2017). On the dynamic nature of performance management regimes. Administration & Society, 1–23. [Online first]

Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago, IL: University of Chicago Press.

Laihonen, H., & Mäntylä, S. (2017). Principles of performance dialogue in public administration. International
Journal of Public Sector Management, 30(5), 414–428.

Laihonen, H., & Mäntylä, S. (2018). Strategic knowledge management and evolving local government. Journal of
Knowledge Management, 22(1), 219–234.

Le Guin, U. (1974). The dispossessed. New York: Harper Collins.

Lewis, J. (2015). The politics and consequences of performance measurement. Policy and Society, 34(1), 1–12.

Lewis, J., & Triantafillou, P. (2012). From performance measurement to learning: A new source of government
overload? International Review of Administrative Sciences, 78(4), 597–614.

Love, A. J. (1991). Internal evaluation: Building organizations from within. Newbury Park, CA: Sage.

Markiewicz, A. (2008). The political context of evaluation: What does this mean for independence and
objectivity? Evaluation Journal of Australasia, 8(2), 35–41.

Maskaly, J., Donner, C., Jennings, W. G., Ariel, B., & Sutherland, A. (2017). The effects of body-worn cameras
(BWCs) on police and citizen outcomes: A state-of-the-art review. Policing: An International Journal of Police
Strategies & Management, 40(4), 672–688.

Mathison, S. (2018). Does evaluation contribute to the public good? Evaluation, 24(1), 113–119.

Mayne, J. (2008). Building an evaluative culture for effective evaluation and results management. Retrieved from
https://ageconsearch.umn.edu/bitstream/52535/2/ILAC_WorkingPaper_No8_EvaluativeCulture_Mayne.pdf

Mayne, J. (2018). Linking evaluation to expenditure reviews: Neither realistic nor a good idea. Canadian Journal of Program Evaluation, 32(3), 316–326.

Mayne, J., & Rist, R. C. (2006). Studies are not enough: The necessary transformation of evaluation. Canadian
Journal of Program Evaluation, 21(3), 93–120.

McDavid, J. C., & Huse, I. (2006). Will evaluation prosper in the future? Canadian Journal of Program Evaluation, 21(3), 47–72.

McDavid, J. C., & Huse, I. (2012). Legislator uses of public performance reports: Findings from a five-year study.
American Journal of Evaluation, 33(1), 7–25.

McDavid, J. C., & Huse, I. (2015). How does accreditation fit into the picture? New Directions for Evaluation,
145, 53–69.

Meyer, M. (2010). The rise of the knowledge broker. Science Communication, 32(1), 118–127.

Miller, R., & Campbell, R. (2006). Taking stock of empowerment evaluation: An empirical review. American
Journal of Evaluation, 27(3), 296–319.

Mintzberg, H. (1997). The manager’s job: Folklore and fact. In R. P. Vecchio (Ed.), Leadership: Understanding the dynamics of power and influence in organizations (pp. 35–53). Notre Dame, IN: University of Notre Dame Press.

Morgan, G. (2006). Images of organization (Updated ed.). Thousand Oaks, CA: Sage.

Moynihan, D. P. (2005). Goal-based learning and the future of performance management. Public Administration
Review, 65(2), 203–216.

Moynihan, D. P. (2008). The dynamics of performance management: Constructing information and reform.
Washington, DC: Georgetown University Press.

Muller-Clemm, W. J., & Barnes, M. P. (1997). A historical perspective on federal program evaluation in Canada.
Canadian Journal of Program Evaluation, 12(1), 47–70.

Norris, N. (2005). The politics of evaluation and the methodological imagination. American Journal of Evaluation,
26(4), 584–586.

Office of the Comptroller General of Canada. (1981). Guide on the program evaluation function. Ottawa, Ontario,
Canada: Treasury Board of Canada Secretariat.

Olejniczak, K., Raimondo, E., & Kupiec, T. (2016). Evaluation units as knowledge brokers: Testing and calibrating
an innovative framework. Evaluation, 22(2), 168–189.

Organisation for Economic Cooperation and Development. (2010). DAC guidelines and reference series: Quality
standards for development evaluation. Paris, France: Author. Retrieved from
http://www.oecd.org/dac/evaluation/qualitystandards.pdf

Otley, D. (2003). Management control and performance management: Whence and whither? The British
Accounting Review, 35(4), 309–326.

Owen, J. M., & Rogers, P. J. (1999). Program evaluation: Forms and approaches (International ed.). Thousand
Oaks, CA: Sage.

Patton, M. Q. (1994). Developmental evaluation. Evaluation Practice, 15(3), 311–319.

Patton, M. Q. (2008). Utilization-focused evaluation (4th ed.). Thousand Oaks, CA: Sage.

Patton, M. Q. (2011). Developmental evaluation: Applying complexity to enhance innovation and use. New York:
Guilford Press.

Patton, M. (2018). Principles-focused evaluation: The guide. New York, NY: The Guilford Press.

PEPFAR. (2017). U.S. President’s Emergency Plan for AIDS Relief: Evaluation Standards of Practice, v. 3. U.S.
Department of State, Office of the U.S. Global AIDS Coordinator and Health Diplomacy. Retrieved from
www.pepfar.gov/documents/organization/276886.pdf

Picciotto, R. (2015). Democratic evaluation for the 21st century. Evaluation, 21(2), 150–166.

Pollitt, C. (2018). Performance management 40 years on: A review. Some key decisions and consequences. Public
Money & Management, 38(3), 167–174.

Rautiainen, A. (2010). Contending legitimations: Performance measurement coupling and decoupling in two Finnish cities. Accounting, Auditing & Accountability Journal, 23(3), 373–391.

Reiserer, A., Kalb, N., Blok, M. S., van Bemmelen, K. J., Taminiau, T. H., Hanson, R., . . . & Markham, M. (2016). Robust quantum-network memory using decoherence-protected subspaces of nuclear spins. Physical Review X, 6(2), 1–8.

Rist, R. C., & Stame, N. (Eds.). (2006). From studies to streams: Managing evaluative systems (Vol. 12). New
Brunswick, NJ: Transaction.

Schweigert, F. (2011). Predicament and promise: The internal evaluator and ethical leader. In B. Volkov & M. Baron (Eds.), Internal Evaluation in the 21st Century. New Directions for Evaluation, 132, 43–56.

Scriven, M. (1997). Truth and objectivity in evaluation. In E. Chelimsky & W. R. Shadish (Eds.), Evaluation for
the 21st century: A handbook (pp. 477–500). Thousand Oaks, CA: Sage.

Scriven, M. (2013). Key evaluation checklist. Retrieved from http://michaelscriven.info/images/KEC.25.2013.pdf

Scott, C. (2016). Cultures of evaluation: Tales from the end of the line. Journal of Development Effectiveness, 8(4), 553–560.

Senge, P. M. (1990). The fifth discipline: The art and practice of the learning organization (1st ed.). New York:
Doubleday/Currency.

Shaw, T. (2016). Performance budgeting practices and procedures. OECD Journal on Budgeting, 15(3), 1–73.

Shepherd, R. (2011). In search of a balanced Canadian federal evaluation function: Getting to relevance.
Canadian Journal of Program Evaluation, 26(2), 1–45.

Shepherd, R. (2018). Expenditure reviews and the federal experience: Program evaluation and its contribution to
assurance provision. Canadian Journal of Program Evaluation, 32(3), 347–370.

Smith, A. (1984). The theory of moral sentiments (D. D. Raphael & A. L. Macfie, Eds.; 6th ed.). Indianapolis, IN: Liberty Fund. (Original work published 1790)

Smits, P., & Champagne, F. (2008). An assessment of the theoretical underpinnings of practical participatory
evaluation. American Journal of Evaluation, 29(4), 427–442.

Sonnichsen, R. C. (2000). High impact internal evaluation. Thousand Oaks, CA: Sage.

Stufflebeam, D. L. (1994). Empowerment evaluation, objectivist evaluation, and evaluation standards: Where the
future of evaluation should not go and where it needs to go. Evaluation Practice, 15(3), 321–338.

Stufflebeam, D., & Zhang, G. (2017). The CIPP evaluation model: How to evaluate for improvement and accountability. New York, NY: The Guilford Press.

Treasury Board of Canada Secretariat. (1990). Program evaluation methods: Measurement and attribution of
program results (3rd ed.). Ottawa, Ontario, Canada: Deputy Comptroller General Branch, Government Review
and Quality Services.

Treasury Board of Canada Secretariat. (2009). Policy on evaluation [Rescinded]. Retrieved from http://www.tbs-
sct.gc.ca/pol/doc-eng.aspx?id=15024

Treasury Board of Canada Secretariat. (2016). Directive on Results. Retrieved from: https://www.tbs-
sct.gc.ca/pol/doc-eng.aspx?id=31306

Trimmer, K. (2016). The pressures within: Dilemmas in the conduct of evaluation from within government. In Political pressures on educational and social research: International perspectives (pp. 180–191). Oxon, United Kingdom: Routledge.

Van Dooren, W., & Hoffmann, C. (2018). Performance management in Europe: An idea whose time has come
and gone? In E. Ongaro & S. van Thiel (Eds.) The Palgrave handbook of public administration and management
in Europe (pp. 207–225). London: Palgrave Macmillan.

Volkov, B. B. (2011a). Beyond being an evaluator: The multiplicity of roles of the internal evaluator. In B. B.
Volkov & M. E. Baron (Eds.), Internal Evaluation in the 21st Century. New Directions for Evaluation, 132,
25–42.

Volkov, B. B. (2011b). Internal evaluation a quarter-century later: A conversation with Arnold J. Love. In B. Volkov & M. Baron (Eds.), Internal Evaluation in the 21st Century. New Directions for Evaluation, 132, 5–12.

Volkov, B., & Baron, M. (2011). Issues in internal evaluation: Implications for practice, training, and research.
New Directions for Evaluation, 132, 101–111.

Weiss, C. (2013). Rooting for evaluation: Digging into beliefs. In M. Alkin (Ed.), Evaluation roots: A wider perspective of theorists’ views and influences (2nd ed.). Thousand Oaks, CA: Sage.

Westley, F., Zimmerman, B., & Patton, M. (2009). Getting to maybe: How the world is changed. Toronto: Vintage
Canada.

Wildavsky, A. B. (1979). Speaking truth to power: The art and craft of policy analysis. Boston, MA: Little, Brown.

Wildavsky, A. (2017). Speaking truth to power: The art and craft of policy analysis (2nd ed., special edition). New York, NY: Routledge.

Yarbrough, D. B., Shulha, L. M., Hopson, R. K., & Caruthers, F. A. (2011). The program evaluation standards: A
guide for evaluators and evaluation users (3rd ed.). Thousand Oaks, CA: Sage.

12 The Nature and Practice of Professional Judgment in
Evaluation

Contents
Introduction
The Nature of the Evaluation Enterprise
Our Stance
Reconciling the Diversity in Evaluation Theory With Evaluation Practice
Working in the Swamp: The Real World of Evaluation Practice
Ethical Foundations of Evaluation Practice
Power Relationships and Ethical Practice
Ethical Guidelines for Evaluation Practice
Evaluation Association-Based Ethical Guidelines
Understanding Professional Judgment
What Is Good Evaluation Theory and Practice?
Tacit Knowledge
Balancing Theoretical and Practical Knowledge in Professional Practice
Aspects of Professional Judgment
The Professional Judgment Process: A Model
The Decision Environment
Values, Beliefs, and Expectations
Cultural Competence in Evaluation Practice
Improving Professional Judgment in Evaluation
Mindfulness and Reflective Practice
Professional Judgment and Evaluation Competencies
Education and Training-Related Activities
Teamwork and Improving Professional Judgment
The Prospects for an Evaluation Profession
Summary
Discussion Questions
Appendix
Appendix A: Fiona’s Choice: An Ethical Dilemma for a Program Evaluator
Your Task
References

Introduction
Chapter 12 combines two central themes in this textbook: the importance of a defensible methodological core for evaluations and the importance of professional judgment in evaluation practice. We summarize our stance (outlined in Chapter 3) and point out that the theoretical and methodological richness that now characterizes our field must be understood within the realities of current evaluation practice, where economic, organizational, and political pressures may constrain or misdirect choices about design, implementation, or reporting. A theme in this textbook is that credible and defensible methodology is our foundation, but that a good evaluator also needs to understand the public sector environment and develop navigation tools for his or her evaluation practice.

First, we introduce several ethical lenses relevant to evaluation work and connect them to our view that evaluation practice has a moral and ethical dimension. We describe recent work that draws attention to the issue of ethical space for evaluation in the face of pressures to align with dominant values in public sector organizations and governments. Ethical professional practice requires evaluators to reflect on the idea of agency. The Greek concept of practical wisdom (phronesis) is explored as a guide to ethical practice.

We introduce ethical guidelines from several evaluation associations and describe the ethical principles that are
discernable in the guidelines. We connect those principles to our discussion of ethical frameworks and to the
challenges of applying ethical principles to particular situations.

We then turn to understanding professional judgment in general—how different kinds of judgments are involved
in evaluation practice and how those relate to the methodological and ethical dimensions of what evaluators do.
We relate professional judgment to evaluator competencies and suggest ways that evaluators can improve their
professional judgment by being effective practitioners and by acquiring knowledge, skills, and experience through
education, reflection, and practice. Evaluative work, from our point of view, has ethical, societal, and political
implications.

The final part of our chapter is our reflections on the prospects for an evaluation profession in the foreseeable
future.

The Nature of the Evaluation Enterprise
Evaluation is a structured process that creates, synthesizes, and communicates information intended to reduce the level of uncertainty for stakeholders about the effectiveness of a given program or policy. It is intended to answer questions (see the list of evaluation questions discussed in Chapter 1) or test hypotheses, the results of which are then incorporated into the information bases used by those who have a stake in creating, implementing, or adjusting programs or policies, ideally for the public good. Evaluative information can be used for program or organizational improvement, or for accountability and budgetary needs. It is a broad field: there can be various uses for evaluations, and stakeholders can include central budget authorities, departmental decision-makers, program managers, program clients, and the public.

Our Stance
This textbook is substantially focused on evaluating the effectiveness of programs and policies. Central to
evaluating effectiveness is examining causes and effects. We are not advocating that all program evaluations should
be centered on experimental or quasi-experimental research designs. Instead, what we are advocating is that an
evaluator needs to understand how these designs are constructed and needs to understand the logic of causes and
effects that is at the core of experiments and quasi-experiments. In particular, it is important to identify and think
through the rival hypotheses that can weaken our efforts to examine program effectiveness. In other words, we are
advocating a way of thinking about evaluations that is valuable for a wide range of public sector situations where
one of the key questions is whether the program was effective, or how it could become more effective. That
includes asking whether the observed outcomes can be attributed to the program; our view is that different research
designs, including qualitative approaches, can be appropriate to address questions around program effectiveness,
depending on the context. In many cases, multiple lines of evidence may be necessary.

Sound methodology is necessary to evaluate the effectiveness of programs, but it is not sufficient. Our view is that
evaluation practice also entails making judgments—judgments that range in scope and impact but are an intrinsic
part of the work that we do. Fundamentally, professional judgments include both “is” and “ought” components;
they are grounded in part in the tools and practices of our craft but also grounded in the ethical dimensions of
each decision context. Part of what it means to be a professional is to be able to bring to bear the ethics and values
that are appropriate for our day-to-day practice.

We will explore the nature and practice of making judgments in evaluations, but for now we want to be clear that
because of the intrinsically political nature of evaluations, embedded as they are in value-laden environments and
power relationships, it is important for evaluators who aspire to become professionals to recognize that the contexts for evaluations (interpersonal, organizational, governmental, economic, cultural, and societal) all influence
and are potentially influenced by the judgments that we make as a part of the work that we do.

Later in this chapter we outline an approach to understanding and practicing professional judgment that relies in
part on understanding professional practice that originated in Greek philosophy some 2500 years ago. The
Aristotelian concept of phronesis (translated in different ways but often rendered as practical wisdom, practical
reasoning, or practical ethics) is now recognized as a component of a balanced approach to professional practice—
a way of recognizing and valuing the autonomy of professionals in the work they do, in contradistinction to
restricting professional practice with top-down manuals, regulations and such that codify practice and are
intended to make practice and its “products” uniform and predictable. This latter approach, evident in licensed
human services in particular (Evans & Hardy, 2017; Kinsella & Pitman, 2012), can have the effect of reducing or
hampering professional discretion/judgment in interactions with clients. Some argue that professional practice
under such conditions is ethically compromised (Evans & Hardy, 2017).

Reconciling the Diversity in Evaluation Theory With Evaluation Practice
Alkin (2013) has illustrated that the field of evaluation, in its relatively short time as a discipline, has developed a wide (and growing) range of theoretical approaches. The Evaluation Theory Tree, depicted in Figure
12.1, suggests the range of approaches in the field, although it is not comprehensive. For example, the Valuing
part of the tree has been questioned for not separately representing social justice-related evaluation theories as a
distinct (fourth) set of branches on the tree (Mertens & Wilson, 2012).

Figure 12.1 The Evaluation Theory Tree

Source: Alkin, 2013, p. 12.

Inarguably, there is a wide range of ways that evaluators approach the field. This theoretical richness has been
referenced in different chapters of this textbook. One reason why evaluators are attracted to the field is the
opportunity to explore different combinations of philosophical and methodological approaches. But our field is
also grounded in practice, and understanding some of the contours of actual evaluation practice is important as we pursue the nature and practice of professional judgment in the work we do. Public sector evaluations should be designed to address the public interest, but there are differing views on how choices in the realm of ‘public interest’ should be determined.

Working in the Swamp: The Real World of Evaluation Practice
Most evaluation practice settings continue to struggle with optimizing methodological design in the public sector
milieu of “wicked problems.” Typical program evaluation methodologies rely on multiple, independent lines of
evidence to bolster research designs that are case studies or implicit designs (diagrammed in Chapter 3 as XO
designs, where X is the program and O is the set of observations/data on the outcomes that are expected to be
affected by the program). That is, the program has been implemented at some time in the past, and now the
evaluator is expected to assess program effectiveness—perhaps even summatively. There is no pre-test and no
control group; there are insufficient resources to construct these comparisons, and in most situations, comparison
groups are not feasible. Although multiple data sources permit triangulation of findings, that does not change the
fact that the basic research design is the same; it is simply repeated for each data source (which is a strength since
measurement errors would likely be independent) but is still subject to the prospective weaknesses of that design.
In sum, typical program evaluations are conducted after the program is implemented, in settings where the
evaluation team has to rely on evidence about the program group alone (i.e., there is no control group). In most
evaluation settings, these designs rely on mixed qualitative and quantitative lines of evidence.

In such situations, some evaluators would advocate not using the evaluation results to make any causal inferences
about the program. In other words, it would be argued that such evaluations ought not to be used to try to address
the question: “Did the program make a difference, and if so, what difference(s) did it make?” Instead the
evaluation should be limited to describing whether intended outcomes were actually achieved, regardless of
whether the program itself “produced” those outcomes. That is essentially what performance measurement systems
do.

But many evaluations are commissioned with the need to know whether the program worked, and why. Even
formative evaluations often include questions about the effectiveness of the program (Chen, 1996; Cronbach,
1980; Weiss, 1998). Answering “why” questions entails looking at causes and effects.

In situations where a client wants to know if and why the program was effective, and there is clearly insufficient
time, money, and control to construct an evaluation design that meets criteria for answering those questions using
an experimental design, evaluators have a choice. They can advise their client that wanting to know whether the
program or policy worked—and why—is perhaps not feasible, or they can proceed with the understanding that
their work may not be as defensible as some research textbooks (or theoretical approaches) would advocate.

Usually, some variation of the work proceeds. Although RCT comparisons between program and no-program
groups are not possible, comparisons among program recipients (grouped by socio-demographic variables or
perhaps by how much exposure they have had to the program), comparisons over time for program recipients who
have participated in the program, and comparisons with other stakeholders or clients are all possible. We maintain
that the way to answer causal questions without research designs that can rule out most rival hypotheses is to
acknowledge that in addressing issues such as program effectiveness (which we take to be the central question in
most evaluations and one of the distinguishing features of our field) we cannot offer definitive findings or
conclusions. Instead, our findings, conclusions, and our recommendations, supported by the evidence at hand and
by our professional judgment, will reduce the uncertainty associated with the question.

In this textbook, our point of view is that in all evaluations, regardless of how sophisticated they are in terms of
research designs, measures, statistical tools, or qualitative analytical methods, evaluators will use one form or
another of professional judgment in the decisions that comprise the process of designing and completing an
evaluation project. Moreover, rather than focusing exclusively on the judgment of merit and worth, we are saying
that judgment calls are reflected in decisions that are made throughout the process of providing information during
the performance management cycle.

Where research designs are weak in terms of potential threats to their internal validity, as evaluators we rely to a greater extent on our own experience and our own (sometimes subjective) assessments, which in turn are conditioned by ethical considerations and our values, beliefs, and expectations. These become part of the basis on
which we interpret the evidence at hand and are also a part of the conclusions and the recommendations. This
professional judgment component in every evaluation complements and even supplements the kinds of
methodologies we deploy in our work. We believe it is essential to be aware of what professional judgments consist
of and learn how to cultivate and practice sound professional judgment.

Ethical Foundations of Evaluation Practice
In this section of Chapter 12, we introduce basic descriptions of ethical frameworks that have guided
contemporary public administration. We then introduce the growing body of theory and practice that advocates
for including “practical wisdom” as a necessity for an ethical stance in everyday professional practice. Importantly,
practical wisdom is intended to create space for professionals to exercise their judgment and to take into account
the social context of the decisions they make in the work they do.

We distinguish three different approaches to ethics that are all relevant to public administration and by
implication to the work that goes on in or with both public organizations and governments. An understanding of
these three approaches gives evaluators a bit of a map to understand the political and organizational context
surrounding the evaluation design, implementation, and reporting process.

The “duty” approach to ethics (sometimes called deontological ethics) was articulated by Immanuel Kant in part
as a reaction to what he saw as contemporary moral decay (he lived from 1724 to 1804) and is based on being able
to identify and act on a set of unchanging ethical principles. For Kant, “situational, or relativistic ethics invited
moral decay. Without immutable, eternal, never-changing standards, a person or a society was trapped on a
slippery slope where anything was allowed to achieve one’s goals.” (Martinez, 2009, p. xiii). Duty ethics has
evolved over time but is linked to contemporary administrative systems that have codified and elaborated policies
and rules that determine how to respond to a wide range of decision-making situations. In effect this approach
relies on nested rules and regulations to guide public officials in their duties and responsibilities (Langford, 2004).
Where the existing rules fall short, new rules can be elaborated to cover those (heretofore)
unanticipated situations. Over time, duty ethics applications can suffer from accretion. Rules pile on rules to a
point where procedures can dominate decision-making, and processing slows down administrative activities and
decisions. The rules establish consistency and equality of treatment, but efficiency and effectiveness may be
sacrificed because of red tape.

An important criticism of this approach by proponents of New Public Management in the 1980s and 1990s
(Osborne & Gaebler, 1992) was that depending on processes to guide decision-making displaces a focus on
achieving results; efficiency is reduced and effectiveness is under-valued.

A second approach that is now an important part of contemporary administrative and governmental settings, and
is arguably replacing rules-based ethical regimes, is a focus on results-based values. In contemporary administrative
settings, this approach has evolved into values-based ethical frameworks wherein sets of core values (desirable
qualities or behaviors for individuals or groups) can be identified for public servants, and those values are
promulgated as the foundation for making ethical decisions (Langford, 2004). It is a basis for NPM norms of
“letting the manager manage” in an environment of performance incentives and alignment with organizational
objectives.

Langford (2004), in a trenchant critique of the Canadian federal government’s values-based ethical framework,
points out that statements of core values are hard to pin down and hard to translate into guidance for particular
decision-making situations. His comments on “value shopping” suggest that this approach to ethics engenders
organizational conflicts:

Beyond the inherent silliness of valuing anything and everything, lies the spectre of endless value
conflict. For the cynical, a long list of core values affords an opportunity to “value shop.” The longer the
list, the more likely it is that a federal public servant, facing a hard choice or questions from superiors
about an action taken, could rationalize any position or rule interpretation by adhering to one core
value rather than to another. What is an opportunity for the cynical is a nightmare for more responsible
public servants. Where one sees the obligation to advance the value of service in a particular situation,
another might see the value of accountability as dominant, and another might feel compelled by the
demands of fairness. Value conflict is the inevitable result of large core-value sets. (p. 439)

A third approach is consequentialism—an approach to ethical decision-making that focuses on making choices
based on a weighing of the social cost and benefit consequences of a decision. Although different formulations of
this approach have been articulated, they generally have in common some kind of formal or informal process
wherein decision makers weigh the “benefits” and the “costs” of different courses of action and make a choice that,
on balance, has the most positive (or least negative) results for society or key stakeholders. Langford, in his critique
of the values-based ethics regime put into place by the federal government of Canada in the 1990s, argues that
public servants are inherently more likely to be consequentialists: “While undoubtedly removed from
contemporary philosophical debates about consequentialism, virtually all public servants intuitively resort to the
premium attached in all democratic societies to being able to defend actions or rules in terms of their impacts on
all affected stakeholders in specific situations.” (Langford, 2004, p. 444).

Consequentialism, based as it is on a philosophical tradition that emphasizes weighing ethical decisions in terms of
“benefits versus costs,” has commonalities with utilitarianism (Mill, Bentham, & Ryan, 1987).
However, the consequentialist approach has been criticized for being incapable of taking into account human
rights (an equality- and fairness-based duty ethics perspective).

For example, in a recent evaluation of an ongoing program in New York City that focused on providing wrap-
around services to those who were at risk of being homeless (Rolston, Geyer, Locke, Metraux, & Treglia, 2013), a
sample of homeless or near-homeless families was given a choice to participate in a random assignment process
(half of those agreeing would receive the program and the other half would be denied the program for two years—
the duration of the experiment). Families not choosing to be randomly assigned were denied the service for up to
four months while sufficient families were recruited to run the two-year RCT.

A consequentialist or even values-based ethical perspective could be used to defend the experiment; the
inconvenience/costs to those families who were denied the program would have to be weighed against the benefits
to all those families who receive the program into the future, if the program showed success (consequentialist). It
could be seen as an innovative, efficient way to test a program’s effectiveness (values-based). Indeed, the evaluation
did go forward, and demonstrated that the program reduced homelessness, so there was a commitment to
continue funding it on that basis. However, from a human rights perspective (duty ethics) the informed consent
process was arguably flawed. The at-risk families who were asked to participate were vulnerable, and expecting
them to provide their “free and informed consent” (Government of Canada, 2010, p. 1) in a situation where the
experimenters enjoyed a clear power-over relationship appeared to be unethical.

Another approach has re-emerged as a way to guide contemporary professional practice (Evans & Hardy, 2017;
Flyvbjerg, 2004; Melé, 2005). The views of Aristotle, among the ancient Greek thinkers, have provided ideas for
how to situate ethics into the practical day-to-day lives of his (and our) contemporaries. For Aristotle, five different
kinds of knowledge were intended to cover all human endeavors: episteme (context-independent/universal
knowledge); nous (intuition or intellect); sophia (wisdom); techne (context-dependent knowledge used to produce
things); and phronesis (practical wisdom, practical reasoning, or practical ethics) (Mejlgaard et al., 2018).

Phronesis has been defined as: “Deliberation about values with reference to praxis. Pragmatic, variable, context-
dependent. Oriented toward action. Based on practical value-rationality.” (Flyvbjerg, 2004, p. 287). Flyvbjerg
adds, “Phronesis concerns values and goes beyond analytical, scientific knowledge (episteme) and technical
knowledge or know how (techne) and involves what Vickers (1995) calls ‘the art of judgment’” (Flyvbjerg, 2004, p.
285, emphasis added).

Mejlgaard et al., (2018), in referring to previous work Flyvbjerg published, suggest five questions that comprise a
framework for making practical ethical decisions: Where are we going? Is this desirable? What should be done? Who
gains and who loses? And by what mechanisms? (p. 6). Schwandt (2018) uses these questions to challenge
contemporary evaluation practice. He highlights the tensions that can occur between one’s beliefs about ethical
conduct, one’s political stance, and one’s professional obligations.

Professional practitioners (social workers, teachers, and healthcare workers are examples) sometimes find
themselves being constrained by organizational and governmental expectations to behave in ways that are
consistent with organizational objectives (efficiency and cost-cutting, for example), over client-focused program
objectives or overall social good (Evans & Hardy, 2017). Situating a practical wisdom perspective on ethical
decision-making, Evans and Hardy (2017) suggest that this fusion of ancient and modern opens up possibilities
for seeing ethical decision-making in pragmatic terms:

An alternative approach is “ethics“ that sees ethical theories as resources to help us think about these
fundamental issues. Concern for consequences, rights, procedural consistency, individual ethical creativity and
virtue are not mutually exclusive; they do not reflect different schools but are necessary tools that can be drawn
on to analyse the nature of the ethical problem and identify an ethical response. For O’Neil (1986, p. 27),
ethical thinking " … will require us to listen to other appraisals and to reflect on and modify our own …
Reflective judgment so understood is an indispensable preliminary or background to ethical decisions about
any actual case” (p. 951).

This is a subtle point, but worth highlighting: There is not necessarily one “best” model of ethics; professional
judgment entails being aware of the various types of ethical pressures that may be in play in a given context, and
being able to reflectively navigate the situation.

Similarly, Melé (2005), in his discussion of ethical education in the accounting profession, highlights the
importance of cultivating the (Aristotelian) virtues-grounded capacity to make moral judgments:

In contrast to modern moral philosophy, the Aristotelian view argues that moral judgment “is not merely an
intellectual exercise of subsuming a particular under rules or hyper-norms. Judgment is an activity of
perceiving while simultaneously perfecting the capacity to judge actions and choices and to perceive being”
(Koehn, 2000, p. 17). (p. 100).

In a nutshell, as part of one’s professional judgment as an evaluator, ethical reflection is necessary because it is
practically inevitable that an evaluator, at some point, will find herself or himself in a situation that requires an
ethical decision and response. An evaluator’s personal “agency” can be challenged by power relationships. We
explore that topic next.

Power Relationships and Ethical Practice
Flyvbjerg (2004) acknowledges that Aristotle and other proponents of this ethical approach (Gadamer, 1975) did
not include power relationships in their formulations. The current interest in practical wisdom is coupled with a
growing concern that professionals, working in organizations that operate under the aegis of neo-liberal principles
that prioritize effective and efficient administration (Emslie & Watts, 2017; Evans & Hardy, 2017; House, 2015;
Petersen & Olsson, 2015) are subject to pressures that can cause ethical tension: The “ethical ‘turn’ in the social
work academy over the past few years has occurred partly in response to concerns that contemporary practice,
occurring with a framework of neo-liberal managerialism, is actually unethical.” (Evans & Hardy, 2017, p. 948).

Sandra Mathison (2017), in a keynote speech to the Australasian Evaluation Society, draws a connection between
the dominant sociopolitical ideologies that have paralleled the development of the evaluation field, and the
normative focus of the evaluation field itself: social democracy (1960 to roughly 1980), neo-liberalism (1980 to
the present day) and populism (present day into the future). Her concern is that, notwithstanding some
evaluators’ continued focus on the goal of improving social justice (e.g., Astbury, 2016; Donaldson & Picciotto,
2016; House, 2015; Mertens & Wilson, 2012), “by most accounts, evaluators’ work isn’t contributing enough to
poverty-reduction, human rights, and access to food, water, education and health care.” (p. 1). In summary, her
view is that the field, and evaluation practice in particular, is “not contributing enough to the public good.” (p. 2).
Mathison (2017) argues that we are still in the neo-liberal era, notwithstanding the recent emergence of populism
and the uncertainties that it brings. The dominant view of evaluation (and policy analysis) is that “evaluation has
become a tool of the state … constantly monitoring and assessing public policies, the conduct of organizations,
agencies and individuals, even serving as the final evaluator” (p. 4).

Proponents of practical wisdom as an ethical stance are asserting that valuing more robust professional autonomy
for practitioners is a way to push back against the pressures to which Mathison and others point. In effect,
advocates for incorporating practical wisdom into the ethical foundations for practice are saying that by
acknowledging the moral dimensions of professional practice, and fostering the development of moral dispositions
in those who practice, it is more likely that practitioners will be able and willing to reflect on the consequences of
their decisions for their clients and for other stakeholders, and have ethical considerations impact their actual
practice. This is more than consequentialism; instead, it is about taking a critical stance on the importance of
improving social justice by addressing the power-related implications of professional practice.

Ethical Guidelines for Evaluation Practice
As the field of evaluation grows and diversifies internationally (Stockmann & Meyer, 2016), and as evaluation
practitioners encounter a wider range of political, social and economic contexts, there is a growing concern that
the field needs to come to grips with the implications of practicing in a wide range of political and cultural
contexts, some of which challenge evaluators to take into account power imbalances and inequalities (House,
2015; Mathison, 2017; Picciotto, 2015; Schwandt, 2017). What, so far, have evaluation societies established to
address norms for ethical practice?

Evaluation Association-Based Ethical Guidelines
The evaluation guidelines, standards, and principles that have been developed by various evaluation associations all
address, in different ways, ethical practice. Although evaluation practice is not guided by a set of professional
norms that are enforceable (Rossi, Lipsey, & Freeman, 2004), ethical guidelines are an initial normative reference
point for evaluators. Increasingly, organizations that involve people (e.g., clients or employees) in research are
expected to take into account the rights of their participants across the stages of the evaluation. In universities, for
example, human research ethics committees routinely scrutinize research plans to ensure that they do not violate
the rights of participants. In both the United States and Canada, there are national policies or regulations that are
intended to protect the rights of persons who are participants in research (Government of Canada, 2014; U.S.
Department of Health and Human Services, 2009).

The past quarter century has witnessed significant developments in the domain of evaluation ethics guidelines.
These include publication of the original and revised versions of the Guiding Principles for Evaluators (AEA, 1995,
2004, 2018), and the second and third editions of the Program Evaluation Standards (Sanders, 1994; Yarbrough,
Shulha, Hopson, & Caruthers, 2011). The 2011 version of the Program Evaluation Standards has been adopted by
the Canadian Evaluation Society (CES, 2012b). Two examples of books devoted to program evaluation ethics
(Morris, 2008; Newman & Brown, 1996) as well as chapters on ethics in handbooks in the field (Sieber, 2009;
Simons, 2006) are additional resources. More recently, Schwandt (2007, 2015, 2017) and Scriven (2016) have
made contributions to discussions about both evaluation ethics and professionalization.

The AEA is active in promoting evaluation ethics with the creation of the Ethical Challenges section of the
American Journal of Evaluation (Morris, 1998), now a rotating feature of issues of the journal. Morris (2011) has
followed the development of evaluation ethics over the past quarter century and notes that there are few empirical
studies that focus on evaluation ethics to date. Additionally, he argues that “most of what we know (or think we
know) about evaluation ethics comes from the testimonies and reflections of evaluators”—leaving out the crucial
perspectives of other stakeholders in the evaluation process (p. 145). Textbooks on evaluation vary in the amount of attention they pay to evaluation ethics; in some, it is the first topic of discussion, on which the rest of the chapters rest, as in Qualitative Researching by Jennifer Mason (2002) and in Mertens and Wilson (2012). In others, the topic arises later, or it is left out entirely.

Table 12.1 summarizes some of the ethical principles that can be discerned in the AEA’s Guiding Principles for
Evaluators (AEA, 2018) and the Canadian Evaluation Society (CES) Guidelines for Ethical Conduct (CES,
2012a).

The ethical principles summarized in the right-hand column of Table 12.1 are similar to lists of principles/values that have been articulated by other professions. For example, Melé (2005) identifies these values in the Code of the American Institute of Certified Public Accountants (AICPA): service to others or public interest; competency; integrity; objectivity; independence; professionalism; and accountability to the profession (p. 101). Langford (2004) lists these core values for the Canadian federal public service: integrity; fairness; accountability; loyalty; excellence; respect; honesty and probity (p. 438).

These words or phrases identify desirable behaviors but do so in general terms. Recalling Langford’s (2004)
assessment of the values-based ethical framework put into place in the Canadian federal government in the 1990s,
a significant challenge is how these values would be applied in specific situations. Multiple values that could apply
could easily put practitioners into situations where choices among conflicting values have to be made.

For example, the “keeping promises” principle in Table 12.1 suggests that contracts, once made, are to be honored
by evaluators. But consider the following example: An evaluator makes an agreement with the executive director of
a nonprofit agency to conduct an evaluation of a major program that is delivered by the agency. The contract
specifies that the evaluator will deliver three interim progress reports to the executive director, in addition to a
final report. As the evaluator begins her work, she learns from several agency managers that the executive director
has been redirecting money from the project budget for office furniture, equipment, and her own travel expenses
—none of these being connected with the program that is being evaluated. In her first interim report, the
evaluator brings these concerns to the attention of the executive director, who denies any wrongdoing and
reminds the evaluator that the interim reports are not to be shared with anyone else—in fact threatens to
terminate the contract if the evaluator does not comply. The evaluator discusses this situation with her colleagues
in the firm in which she is employed and decides to inform the chair of the board of directors for the agency. She
has broken her contractual agreement and in doing so is calling on another ethical principle. At the same time, the
outcome of this decision (a deliberative judgment) could have consequences for the evaluation
engagement and possibly for future evaluation work for that group of professionals.

Of note, the frameworks in Table 12.1 include guidelines aimed at outlining responsibilities for the common good
and equity (AEA, 2018). While the AEA’s (2004) fifth general guiding principle was “Responsibilities for general
and public welfare”, the updated version of this principle is “Common good and equity”. It states: “Evaluators
strive to contribute to the common good and advancement of an equitable and just society” (AEA, 2018, p. 3).

Our earlier discussion of practical wisdom as an attribute of professional practice goes beyond current ethical
guidelines in that respect. It suggests that in particular situations, different mixes of ethical principles (and
stakeholder viewpoints) can be in play, and evaluators who aspire to be ethical practitioners need practice in making ethical decisions, drawing on exemplars, the experiences of other practitioners, observation, discussions with
peers, and case studies. Learning from one’s own experiences is key. Fundamentally, cultivating practical wisdom
is about being able to acquire virtues (permanent dispositions) “that favor ethical behavior” (Melé, 2005, p. 101).
Virtues can be demonstrated, but learning them is a subjective process (Melé, 2005).
included a case that provides you with an opportunity to grapple with an example of the ethical choices that
confront an evaluator who works in a government department. We discussed internal evaluation in Chapter 11,
and this case illustrates the tensions that can occur for internal evaluators. The evaluator is in a difficult situation
and has to decide what decision she should make, balancing ethical principles and her own well-being as the
manager of an evaluation branch in that department. There is no right answer to this case. Instead, it gives you an
opportunity to see how challenging ethical choice making can be, and it gives you an opportunity to make a
choice and build a rationale for your choice.

The case is a good example of what is involved in exercising deliberative judgment—at least in a simulated
setting. Flyvbjerg (2004) comments on the value of case-based curricula for schools of business administration, “In
the field of business administration and management, some of the best schools, such as Harvard Business School,
have understood the importance of cases over rules and emphasize case-based and practical teaching. Schools like
this may be called Aristotelian” (p. 288).

Table 12.1 Ethical Principles in the American Evaluation Association (AEA) Guiding Principles and the Canadian Evaluation Society (CES) Guidelines for Ethical Conduct

Systematic inquiry
AEA Guiding Principles: Evaluators conduct data-based inquiries that are thorough, methodical, and contextually relevant.
CES Guidelines for Ethical Conduct: Evaluators should apply systematic methods of inquiry appropriate to the evaluation.
Ethical Principles for Evaluators: (1) Commitment to technical competence; (2) Openness and transparency in communicating strengths and weaknesses of evaluation approach.

Competence
AEA Guiding Principles: Evaluators provide skilled professional services to stakeholders.
CES Guidelines for Ethical Conduct: Evaluators are to be competent in their provision of service.
Ethical Principles for Evaluators: (1) Commitment to the technical competence of the evaluation team; (2) Commitment to the cultural competence of the team.

Integrity
AEA Guiding Principles: Evaluators behave with honesty and transparency in order to ensure the integrity of the evaluation.
CES Guidelines for Ethical Conduct: Evaluators are to act with integrity in their relationships with all stakeholders.
Ethical Principles for Evaluators: (1) Being honest; (2) Keeping promises; (3) No conflicts of interest—disclose any roles, relationships or other factors that could bias the evaluation engagement; (4) Commitment to integrity.

Respect for people
AEA Guiding Principles: Evaluators honor the dignity, well-being, and self-worth of individuals and acknowledge the influence of culture within and across groups.
CES Guidelines for Ethical Conduct: Evaluators should be sensitive to the cultural and social environment of all stakeholders and conduct themselves in a manner appropriate to the environment.
Ethical Principles for Evaluators: (1) Free and informed consent; (2) Privacy and confidentiality; (3) Respect the dignity and self-worth of all stakeholders; (4) When feasible, foster social equity so that those that have given to the evaluation may benefit from it; (5) Understand, respect and take into account social and cultural differences among stakeholders; (6) Maximize the benefits and reduce unnecessary harms.

Common good and equity
AEA Guiding Principles: Evaluators strive to contribute to the common good and advancement of an equitable and just society.
CES Guidelines for Ethical Conduct: See above (under 'Respect for people'), AND: Evaluators are to be accountable for their performance and their product.
Ethical Principles for Evaluators: (1) Take into account the public interest and consider the welfare of society as a whole; (2) Balance the needs and interests of clients and other stakeholders; (3) Communicate results in a respectful manner; (4) Honor commitments made during the evaluation process; (5) Commitment to full and fair communications of evaluation results.

Understanding Professional Judgment
The competent practitioner uses his or her learned, experiential, and intuitive knowledge to assess a situation and
offer a diagnosis (in the health field, for example) or a decision in other professions (Eraut, 1994; Cox &
Pyakuryal, 2013). Although theoretical knowledge is a part of what competent practitioners rely on in their work,
practice is seen as more than applying theoretical knowledge. It includes a substantial component that is learned
through practice itself. Although some of this knowledge can be codified and shared (Schön, 1987; Tripp, 1993),
part of it is tacit—that is, known to individual practitioners, but not shareable in the same ways that we share the
knowledge in textbooks, lectures, or other publicly accessible learning and teaching modalities (Schwandt, 2008;
Cox & Pyakuryal, 2013). Evaluation context is dynamic, and evaluators need to know how to navigate the waves
of economic, organizational, political, and societal change. We explore these ideas in this section.

What Is Good Evaluation Theory and Practice?
Views of evaluation theory and practice, and in particular of what they ought to be, vary widely (Alkin, 2013).
At one end of the spectrum, advocates of a highly structured (typically quantitative) approach to evaluations tend
to emphasize the use of research designs that ensure sufficient internal validity and statistical conclusion validity
so that the key causal relationships between the program and its outcomes can be tested. According to this view,
experimental designs—typically randomized controlled trials—are the benchmark of sound evaluation designs, and
departures from this ideal are associated with problems that either require specifically designed (and usually more
complex) methodologies to resolve, or are simply not resolvable—at least not to the point where plausible
threats to internal validity are controlled. The emphasis on scientific methodology has waxed and waned in
evaluation over the years.

Robert Picciotto (2015) has suggested that there have been four waves in “the big tent of evaluation” (p. 152),
each reflecting the dominant political ideology of the time (Vedung, 2010). The first wave was Donald
Campbell’s “experimenting society” approach to evaluation wherein programs were conceptualized as disseminable
packages that would be rigorously evaluated at the pilot stage and then, depending on the success of the program,
either rolled out more broadly or set aside. An important feature of Campbell’s approach was the belief that
programs could be more or less effective, but that assessing effectiveness did not “blame or shame” those who
operated the program. Evaluations were ways of systematically learning “what worked”.

The second wave was a reaction to this positivist or post-positivist view of what was sound evaluation. This second
wave was “dialogue-oriented, constructivist, participatory and pluralistic” (Picciotto, 2015, p. 152). We have
outlined ontological, epistemological and methodological elements of this second wave in Chapter 5 of the
textbook, where we discussed qualitative evaluation.

The third wave, which generally supplanted the second, paralleled the ideological neo-liberal, new public
management shift that happened in the 1980s and beyond. That shift “swelled and engulfed the evaluation
discipline: it was called upon to promote free markets; public-private partnerships and results-based incentives in
the public sector” (p. 152). An important feature of this wave was a shift from governments valuing program
evaluation to valuing performance measurement systems. The field of evaluation, after initially resisting
performance measurement and performance management (Perrin, 1998), has generally accepted that performance
measurement is “here to stay” (Feller, 2002, p. 438). An accountability and compliance-focused “what works”
emphasis often dominates both program evaluation and performance measurement systems. Picciotto sees our
current fourth wave as “a technocratic, positivist, utilization-focused evaluation model highly reliant on impact
assessments” (p. 153). While acknowledging that “scientific concepts are precious assets for the evaluation
discipline”, he argues:

We are now surfing a fourth wave. It has carried experimental evaluation to the top of the methodological
pyramid. It is evidence based and it takes neo-liberalism for granted. The scientific aura of randomization
steers clear of stakeholders’ values. By emphasizing a particular notion of impact evaluation that clinically
verifies “what works” it has restored experimentalism as the privileged approach to the evaluation enterprise.
By doing so it has implicitly helped to set aside democratic politics from the purview of evaluation—the
hallmark of the prior dialogical wave. (p. 153)

An example of the enduring influence of “results-based” neo-liberalism on government policies is the recent set of
changes made to the evaluation policy of the Canadian federal government. In 2016, the Policy on Results
(Treasury Board, 2016a) was implemented, rescinding the earlier Policy on Evaluation (Treasury Board, 2009).
The main thrust now is a focus on measuring and reporting performance—in particular, implementing policies
and programs and then measuring and reporting their outcomes. This approach is a version of “deliverology”—an
approach to performance management that was adopted by the British government with the guidance of Sir
Michael Barber (Barber, 2007). Program evaluation, still required for many federal departments and agencies, is
not featured in the Policy on Results. Instead, it is outlined in the Directive on Results that is intended to detail the
implementation of the policy (Treasury Board, 2016b). It is arguable that program evaluation has to some extent
been supplanted by this focus on performance measurement (Shepherd, 2018).

Tacit Knowledge
Polanyi (1958) described tacit knowledge as the capacity we have as human beings to integrate “facts” (data and
perceptions) into patterns. He defined tacit knowledge in terms of the process of discovering theory: “This act of
integration, which we can identify both in the visual perception of objects and in the discovery of scientific
theories, is the tacit power we have been looking for. I shall call it tacit knowing” (Polanyi & Grene, 1969, p.
140). Pitman (2012) defines tacit knowledge this way, “Tacit knowledge carries all of the individual characteristics
of personal experience, framed within the epistemic structures of the knowledge discipline that is utilized in the
professional’s practice” (p. 141).

For Polanyi, tacit knowledge cannot be communicated directly. It has to be learned through one’s own
experiences—it is by definition personal knowledge. Knowing how to ride a bicycle, for example, is in part tacit.
We can describe to others the physics and the mechanics of getting onto a bicycle and riding it, but the experience
of getting onto the bicycle, pedaling, and getting it to stay up is quite different from being told how to do so.

Ethical decision making has been described as tacit (Mejlgaard et al., 2018; Pitman, 2012). This suggests that
experience is an important factor in cultivating sound ethical decision-making (Flyvbjerg, 2004; Mejlgaard et al.,
2018).

One implication of acknowledging that what we know is in part personal is that we cannot teach everything that is
needed to learn a skill. The learner can be guided with textbooks, examples, and demonstrations, but that
knowledge (Polanyi calls it impersonal knowledge) must be combined with the learner’s own capacity to tacitly
know—to experience the realization (or a series of them) that he or she understands/intuits how to use the skill.

Clearly, from this point of view, practice is an essential part of learning. One’s own experience is essential for fully
integrating impersonal knowledge into working/personal knowledge. But because the skill that has been learned is
in part tacit, when the learner tries to communicate it, he or she will discover that, at some point, the best advice is
to suggest that the new learner try it and “learn by doing.” This is a key part of craftsmanship.

Balancing Theoretical and Practical Knowledge in Professional Practice
The difference between the applied theory and the practical know-how views of professional knowledge has been
characterized as the difference between knowing that (publicly accessible, propositional knowledge and skills) and
knowing how (practical, intuitive, experientially grounded knowledge that involves wisdom, or what Aristotle
called praxis) (Eraut, 1994; Fish & Coles, 1998; Flyvbjerg, 2004; Kemmis, 2012; Schwandt, 2008).

These two views of professional knowledge highlight different views of what professional practice is and indeed
ought to be. The first view can be illustrated with an example. In the field of medicine, the technical/rational view
of professional knowledge and professional practice continues to support efforts to construct and use expert
systems—software systems that can offer a diagnosis based on a logic model that links combinations of symptoms
in a probabilistic tree to possible diagnoses (Fish & Coles, 1998). By inputting the symptoms that are either
observed or reported by the patient, the expert system (embodying the public knowledge that is presumably
available to competent practitioners) can treat the diagnosis as a problem to solve. Clinical decision making
employs algorithms that produce a probabilistic assessment of the likelihood that symptoms and other technical
information will support one or another alternative diagnosis. More recently, Arsene, Dumitrache, and Mihu
(2015) describe an expert system for medical diagnoses that incorporates expert sub-systems for different parts
(systems) of the body—the circulatory system, for example—that each work with information inputs and
communicate with their counterpart sub-systems to produce an overall diagnosis. The growing importance of
artificial intelligence (AI) systems suggests that there will be more applications of this approach in medicine in the
future.
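
The symptom-to-diagnosis logic described above can be made concrete with a minimal sketch. This is not the system described by Arsene, Dumitrache, and Mihu (2015); it is a hypothetical Python illustration in which the condition names, symptoms, and probabilities are invented, and the scoring is a simple naive-Bayes-style calculation of the kind such expert systems build on.

# Illustrative sketch only: a toy "expert system" that ranks hypothetical diagnoses
# from reported symptoms. All names and numbers below are invented for illustration.

PRIORS = {"flu": 0.05, "common_cold": 0.20, "allergy": 0.10}           # P(condition)
LIKELIHOODS = {                                                         # P(symptom | condition)
    "flu":         {"fever": 0.85, "cough": 0.70, "sneezing": 0.30},
    "common_cold": {"fever": 0.20, "cough": 0.60, "sneezing": 0.70},
    "allergy":     {"fever": 0.02, "cough": 0.25, "sneezing": 0.90},
}

def rank_diagnoses(observed):
    """Return candidate conditions ranked by a normalized, posterior-style score."""
    scores = {}
    for condition, prior in PRIORS.items():
        score = prior
        for symptom, present in observed.items():
            p = LIKELIHOODS[condition].get(symptom, 0.5)  # 0.5 = uninformative symptom
            score *= p if present else (1.0 - p)
        scores[condition] = score
    total = sum(scores.values()) or 1.0
    # Normalize so the scores read as relative plausibility rather than true probabilities.
    return sorted(((c, s / total) for c, s in scores.items()),
                  key=lambda pair: pair[1], reverse=True)

patient = {"fever": True, "cough": True, "sneezing": False}
for condition, plausibility in rank_diagnoses(patient):
    print(condition, round(plausibility, 2))

A working expert system of the kind Arsene et al. describe would combine many such sub-systems and far richer knowledge bases; the point of the sketch is only to show how symptom patterns can be turned into a ranked, probability-like assessment.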

Alternatively, the view of professional knowledge as practical know-how embraces the perspective of professional
practice as craftsmanship or even artistry. Although it highlights the importance of experience in becoming a
competent practitioner, it also complicates our efforts to understand the nature of professional evaluation practice.
If practitioners know things that they cannot share and their knowledge is an essential part of sound practice, how
do professions find ways of ensuring that their members are competent?

Schwandt (2008) recognizes the importance of balancing applied theory and practical knowledge in evaluation.
His concern is with the tendency, particularly in performance management systems where practice is
circumscribed by a focus on outputs and outcomes, to force “good practice” to conform to some set of
performance measures and performance results:

The fundamental distinction between instrumental reason as the hallmark of technical knowledge and
judgment as the defining characteristic of practical knowledge is instinctively recognizable to many
practitioners (Dunne & Pendlebury, 2003). Yet the idea that “good” practice depends in a significant way on
the experiential, existential knowledge we speak of as perceptivity, insightfulness, and deliberative judgment is
always in danger of being overrun by (or at least regarded as inferior to) an ideal of “good” practice grounded
in notions of objectivity, control, predictability, generalizability beyond specific circumstances, and
unambiguous criteria for establishing accountability and success. This danger seems to be particularly acute of
late, as notions of auditable performance, output measurement, and quality assurance have come to dominate
the ways in which human services are defined and evaluated. (p. 37)

The idea of balance is further explored in the section below, where we discuss various aspects of professional
judgment.

Aspects of Professional Judgment
What are the different kinds of professional judgment? How does professional judgment impact the range of
decisions that evaluators make? Can we construct a model of how professional judgment relates to evaluation-
related decisions?

Fish and Coles (1998) have constructed a typology of four kinds of professional judgment in the health care field.
We believe that these are useful for understanding professional judgment in evaluation. Each level builds on the
previous one, and the kind of judgment involved differs at each level. At one end of the continuum, practitioners apply
technical judgments that are about specific issues involving routine tasks. Typical questions would include the
following: What do I do now? How do I apply my existing knowledge and skills to do this routine task? In an
evaluation, an example of this kind of judgment would be how to select a random sample from a population of
case files in a social service agency.
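
As a minimal illustration of such a routine technical task, the following Python sketch draws a simple random sample of case identifiers from an exported list of case files. The file name, column name, and sample size are assumptions made for the example, not features of any particular agency's records system.

# Illustrative sketch: draw a simple random sample of case files for review.
import csv
import random

def sample_case_ids(csv_path, id_column="case_id", sample_size=50, seed=2024):
    """Read case identifiers from a CSV export and return a simple random sample."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        case_ids = [row[id_column] for row in csv.DictReader(f)]
    random.seed(seed)  # recording the seed makes the sample reproducible and documentable
    return random.sample(case_ids, min(sample_size, len(case_ids)))

# Hypothetical usage:
# selected = sample_case_ids("case_files.csv", sample_size=100)

Even here, judgment enters: choosing the sample size, deciding whether to stratify, and documenting the procedure so that it can be replicated are all part of what looks like a purely technical decision.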

The next level is procedural judgment, which focuses on procedural questions and involves the practitioner
comparing the skills/tools that he or she has available to accomplish a task. Practitioners ask questions such as
“What are my choices to do this task?” “From among the tools/knowledge/skills available to me, which
combination works best for this task?” An example from an evaluation would be deciding how to include clients in
an evaluation of a social service agency program—whether to use a survey (and if so, internet, mailing, telephone,
interview format, or some combination) or use focus groups (and if so, how many, where, how many participants
in each, how to gather them).

The third level of professional judgment is reflective. It again assumes that the task or the problem is a given, but
now the practitioner is asking the following questions: How do I tackle this problem? Given what I know, what
are the ways that I could proceed? Are the tools that are easily within reach adequate, or instead, should I be trying
some new combination or perhaps developing some new ways of dealing with this problem? A defining
characteristic of this third level of professional judgment is that the practitioner is reflecting on his or her
practice/experience and is seeking ways to enhance his or her practical knowledge and skills and perhaps innovate
to address a given situation.

The fourth level of professional judgment is deliberative. The example earlier in this chapter described an
evaluation of a homelessness prevention program in New York City (Rolston et al., 2013) wherein families were
selected to participate through a process in which at least some arguably did not have the capacity to offer free
and informed consent. Members of the evaluation team decided to implement a research design (an RCT) that
was intended to maximize internal validity and to privilege that goal over the personal circumstances and the needs of the
families facing homelessness. What contextual and ethical factors should the evaluators have considered in that
situation? No longer are the ends or the tasks fixed, but instead the professional is taking a broader view that
includes the possibility that the task or problem may or may not be an appropriate one to pursue. Professionals at
this level are asking questions about the nature of their practice and connecting what they do as professionals with
ethical and moral considerations. The case study in Appendix A of this chapter is an example of a situation that
involves deliberative judgment.

It is important to keep in mind that evaluation practice typically involves some compromises. We are often
“fitting round pegs into square holes.” In some settings, even routine technical decisions (e.g., should we use
significance tests where the response rate to our survey was 15 percent?) can have a significant “what should I do?”
question attached to them. As we move from routine to more complex decisions, “what should I do” becomes
more important. Addressing this question involves calling on one’s experience, and our experiences reflect our
values, beliefs, expectations, and ethical stance. Ethics are an important part of what goes into our judgments as
professionals. As we noted earlier in this chapter, professional practice is intrinsically tied to ethics; developing
professional judgment involves developing practical wisdom.

The Professional Judgment Process: A Model
Since professional judgment spans the evaluation process, it will influence a wide range of decisions that evaluators
make in their practice. The four types of professional judgment that Fish and Coles (1998) describe suggest
decisions of increasing complexity from discrete technical decisions to deliberative decisions. Figure 12.2 displays a
more detailed model of the way that professional judgment is involved in evaluator decision making. The model
focuses on single decisions—a typical evaluation would involve many such decisions of varying complexity. In the
model, evaluator ethics, values, beliefs, and expectations, together with both shareable and practical (tacit)
knowledge combine to create a fund of experience that is the foundation for professional judgments. In turn,
professional judgments influence the decision at hand.

There is a feedback loop that connects the decision environment to the evaluator via her/his shareable knowledge.
There are also feedback loops that connect decision consequences with shareable knowledge and ethics, as well as
practical know-how (tacit knowledge) and the evaluator’s values, beliefs and expectations.

This model is dynamic: the factors in the model interact over time in such ways that changes can occur in
professional judgment antecedents, summed up in evaluator experience. Later in this chapter we will discuss
reflective practice.

The model can be unpacked by discussing the constructs in it. Some constructs have been elaborated in this
chapter already (ethics, shareable knowledge, practical know-how, and professional judgment), but it is
worthwhile to define each one explicitly in one table. Table 12.2 summarizes the constructs in Figure 12.2 and
offers a short definition of each. Several of the constructs will then be discussed further to help us understand what
roles they play in the process of forming and applying professional judgment.

Figure 12.2 The Professional Judgment Process

Table 12.2 Definitions of Constructs in the Model of the Professional Judgment Process

Ethics: Moral principles that are intended to guide a person’s decisions about “right” and “wrong,” and typically distinguish between acceptable and unacceptable behaviors. For evaluators, professional guidelines, standards or ethical frameworks are part of the ethical influences on decisions, either directly or indirectly through professional associations (for example). However, there is more to one’s ethical decision-making than what is found in the guidelines.

Values: Values are statements about what is desirable, what ought to be, in a given situation. Values can be personal or more general. Values can be a part of ethical frameworks. They can be about choices, but not necessarily about right and wrong.

Beliefs: Beliefs are about what we take to be true, for example, our assumptions about how we know what we know (our epistemologies are examples of our beliefs).

Expectations: Expectations are assumptions that are typically based on what we have learned and what we have come to accept as normal. Expectations can limit what we are able to “see” in particular situations.

Shareable knowledge: Knowledge that is typically found in textbooks or other such media; knowledge that can be communicated and typically forms the core of the formal training and education of professionals in a field.

Practical know-how: Practical know-how is the knowledge that is gained through practice. It complements shareable knowledge and is tacit—that is, acquired from one’s professional practice and is not directly shareable.

Experience: Experience is the subjective amalgam of our knowledge, ethics, values, beliefs, expectations, and practical know-how at a given point in time. For a given decision, we have a “fund” of experience that we can draw from. We can augment or change that fund with learning from the consequences of the decisions we make as professionals and from the (changing) environments in which our practice decisions occur.

Professional judgment: Professional judgment is a subjective process that relies on our experience and ranges from technical judgments to deliberative judgments.

Decision: In a typical evaluation, evaluators make hundreds of decisions that collectively define the entire evaluation process. Decisions are choices—a choice made by an evaluator about everything from discrete methodological issues to global values-based decisions that affect the whole evaluation (and perhaps future evaluations) or even the evaluator’s career.

Consequences: Each decision has consequences—for the evaluator and for the evaluation process. Consequences can range from discrete to global, commensurate with the scope and implications of the decision. Consequences both influence and are influenced by the decision environment.

Decision environment: The decision environment is the set of contextual factors that influences the decision-making process, and the stock of knowledge that is available to the evaluator. Among the factors that could impact an evaluator decision are client expectations, future funding opportunities, resources (including time and data), power relationships, and constraints (legal, institutional, and regulatory requirements that specify the ways that evaluator decisions are to fit a decision environment). Evaluator decisions can also influence the decision environment—the basic idea of “speaking truth to power” is that evaluator decisions will be conveyed to organizational/political decision-makers. Mathison (2017) suggests that evaluators should “speak truth to the powerless” (p. 7) as a way of improving social justice, as an evaluation goal.

The Decision Environment
The particular situation or problem at hand, and its context, influence how a program evaluator’s professional
judgment will be exercised. Each opportunity for professional judgment will have unique characteristics that will
demand that it be approached in particular ways. For example, a methodological issue will typically require a
different kind of judgment from one that centers on an ethical issue. Even two cases involving a similar question
of methodological choice will have facts about each of them that will influence the professional judgment process.
We would agree with evaluators who argue that methodologies need to be situationally appropriate, avoiding a
one-size-fits-all approach (Patton, 2008). The extent to which the relevant information about a particular situation
is known or understood by the evaluator will affect the professional judgment process—professional judgments are
typically made under conditions of uncertainty.

The decision environment includes constraints and incentives, both real and perceived, that affect professional
judgment. Some examples include the expectations of the client, the professional’s lines of accountability, tight
deadlines, complex and conflicting objectives, organizational environment, political context, cultural
considerations, and financial constraints. For people working within an organization—for example, internal
evaluators—the organization also presents a significant set of decision-related factors, in that its particular culture,
goals, and objectives will have an impact on the way the professional judgment process unfolds.

Values, Beliefs, and Expectations


Professional judgment is influenced by personal characteristics of the person exercising it. It must always be kept
in mind that “judgment is a human process, with logical, psychological, social, legal, and even political overtones”
(Gibbins & Mason, 1988, p. 18). Each of us has a unique combination of values, beliefs, and expectations that
make us who we are, and each of us has internalized a set of professional norms that make us the kind of
practitioner that we are (at a given point in time). These personal factors can lead two professionals to make quite
different professional judgments about the same situation (Tripp, 1993).

Among the personal characteristics that can influence one’s professional judgment, expectations are among the
most important. Expectations have been linked to paradigms: perceptual and theoretical structures that function
as frameworks for organizing one’s perspectives, even one’s beliefs about what is real and what is taken to be
factual. Kuhn (1962) has suggested that paradigms are formed through our education and training. Eraut (1994)
has suggested that the process of learning to become a professional is akin to absorbing an ideology.

Our past experiences (including the consequences of previous decisions we have made in our practice) predispose
us to understand or even expect some things and not others, to interpret situations, and consequently to behave in
certain ways rather than in others. As Abercrombie (1960) argues, “We never come to an act of perception with an
entirely blank mind but are always in a state of preparedness or expectancy, because of our past experiences” (p.
53). Thus, when we are confronted with a new situation, we perceive and interpret it in whatever way makes it
most consistent with our existing understanding of the world, with our existing paradigms. For the most part, we
perform this act unconsciously. We are often not even aware of how our particular worldview influences how we
interpret and judge the information we receive on a daily basis in the course of our work, or how it affects our
subsequent behavior.

How does this relate to our professional judgment? Our expectations can lead us to see things we are expecting to
see, even if they are not actually there, and to not see things we are not expecting, even if they are there.
Abercrombie (1960) calls our worldview our “schemata” and illustrates its power over our judgment process with
the following figure (Figure 12.3).

Figure 12.3 The Three Triangles

Source: Abercrombie, 1960.

In most cases, when we first read the phrases contained in the triangles, we do not see the extra words. As
Abercrombie (1960) points out, “it’s as though the phrase ‘Paris in the Spring,’ if seen often enough, leaves a kind
of imprint on the mind’s eye, into which the phrase in the triangle must be made to fit” (p. 35). She argues that
“if [one’s] schemata are not sufficiently ‘living and flexible,’ they hinder instead of help [one] to see” (p. 29). Our
tendency is to ignore or reject what does not fit our expectations. Thus, similar to the way we assume the phrases
in the triangles make sense and therefore unconsciously ignore the extra words, our professional judgments are
based in part on our preconceptions and thus may not be appropriate for the situation. Later in this chapter we
will discuss reflective practice.

Cultural Competence in Evaluation Practice
The globalization of evaluation (Stockmann & Meyer, 2016) and the growth of national evaluation associations
point to the fact that evaluation practice has components that reflect the culture(s) in which it is embedded.
Schwandt (2007), speaking of the AEA case, notes that “the Guiding Principles (as well as most of the ethical
guidelines of academic and professional associations in North America) have been developed largely against the
foreground of a Western framework of moral understandings” (p. 400) and are often framed in terms of
individual behaviors, largely ignoring the normative influences of social practices and institutions.

The American Evaluation Association (AEA, 2011) produced a cultural competence statement that is not
intended to be generalized beyond the United States and describes cultural competence this way:

Cultural competence is not a state at which one arrives; rather, it is a process of learning, unlearning and
relearning. It is a sensibility cultivated throughout a lifetime. Cultural competence requires awareness of self,
reflection on one’s own cultural position, awareness of others’ positions, and the ability to interact genuinely
and respectfully with others (AEA, 2011, p. 3).

The same document defines culture: “Culture can be defined as the shared experiences of people, including their
languages, values, customs, beliefs, and mores. It also includes worldviews, ways of knowing and ways of
communicating” (p. 2). Although work is being done to update the evaluator competencies (King & Stevahn,
2015), the cultural competencies document (AEA, 2011) continues to stand apart from the competency
framework.

One issue that stands out in reflecting on cultural competencies is power relationships (Chouinard & Cousins,
2007; Lowell, Kildea, Liddle, Cox & Paterson, 2015). Chouinard and Cousins, in their synthesis of Indigenous
evaluation-related publications, connect the creation of knowledge in cross-cultural evaluations with a post-
modern view of the relationship between knowledge and power, “To move cultural competence in evaluation
beyond the more legitimate and accepted vocabulary, beyond mere words, we must appreciate that there is no
resonant universal social science methodologies and no neutral knowledge generation. Knowledge, as Foucault
(1980) suggests, is not infused with power, it is an effect of power” (p. 46). This view accords with the perspective
taken by those who are critical of professional practice for having been perhaps “captured” by neo-liberal values
(see Donaldson & Picciotto, 2016; Evans & Hardy, 2017; House, 2015; Picciotto, 2015; Schwandt, 2017). An
essential part of incorporating practical wisdom as a way to approach practice is to acknowledge the moral nature
of professional practice and the importance of keeping in view the power relationships in which practitioners are
always embedded (Mejlgaard et al., 2018). Schwandt (2018) in a discussion of what it means for us to be
evaluation practitioners suggests:

Because boundaries are not given, we have to “do” something about boundaries when we make judgments of
how to act in the world. Thus, ‘what should we do?’ is a practical, situated, time- and place-bound question.
Developing good answers to that question is what practical reasoning in evaluation is all about—a
commitment to examining assumptions, values, and facts entailed in the questions: ‘What do we want to
achieve/Where are we going?’ ‘Who gains and who loses by our actions, and by which mechanisms of power?’
‘Is this development desirable?’ ‘What, if anything, should we do about it?’. (p. 134)

With this in mind, we move on to examine how to go about improving one’s professional judgment.

Improving Professional Judgment in Evaluation
Having reviewed the ways that professional judgment is woven through the fabric of evaluation practice and
having shown how professional judgment plays a part in our decisions as evaluation practitioners, we can turn to
discussing ways of self-consciously improving our professional judgment. Key to this process is becoming aware of
one’s own decision-making processes. Mowen (1993) notes that our experience, if used reflectively and analytically
to inform our decisions, can be a positive factor contributing to good professional judgment. Indeed, he goes so
far as to argue that “one cannot become a peerless decision maker without that well-worn coat of experience . . . 
the bumps and bruises received from making decisions and seeing their outcomes, both good or bad, are the
hallmark of peerless decision makers” (p. 243).

Mindfulness and Reflective Practice
Self-consciously challenging the routines of our practice is an effective way to begin to develop a more mindful
stance. In our professional practice, each of us will have developed routines for addressing situations that occur
frequently. As Tripp (1993) points out, although routines

… may originally have been consciously planned and practiced, they will have become habitual, and so
unconscious, as expertise is gained over time. Indeed, our routines often become such well-established habits
that we often cannot say why we did one thing rather than another, but tend to put it down to some kind of
mystery such as “professional intuition.” (p. 17)

Mindfulness as an approach to improving professional practice is becoming more appreciated and understood
(Dobkin & Hutchinson, 2013; Epstein, 2017; Riskin, 2011). Dobkin and Hutchinson (2013) report that 14
medical schools in Canada and the United States teach mindfulness to their medical and dental students and
residents (p. 768). More generally, it is now seen as a way to prevent “compassion fatigue and burnout” in health
practitioners (Dobkin & Hutchinson, 2013, p. 768).

Mindfulness is aimed at improving our capacity to become more aware of our values and morals, expectations,
beliefs, assumptions, and even what is tacit in our practice.

Epstein (2003) characterizes a mindful practitioner as one who has cultivated the art of self-observation
(cultivating the compassionate observer). The objective of mindfulness is to see what is rather than what one wants
to see or even expects to see. Mindful self-monitoring involves several things: “access to internal and external data;
lowered reactivity [less self-judging] to inner experiences such as thoughts and emotions; active and attentive
observation of sensations, images, feelings, and thoughts; curiosity; adopting a nonjudgmental stance; presence,
[that is] acting with awareness . . . ; openness to possibility; adopting more than one perspective; [and] ability to
describe one’s inner experience” (Epstein, Siegel, & Silberman, 2008, p. 10).

Epstein (1999) suggests that there are at least three ways of nurturing mindfulness: (1) mentorships with
practitioners who are themselves well regarded in the profession; (2) reviewing one’s own work, taking a
nonjudgmental stance; and (3) meditation to cultivate a capacity to observe one’s self. He goes further (Epstein,
2017) to suggest that cultivating mindfulness is not just for individual practitioners but is also for work teams and
organizations.

Professionals should consistently reflect on what they have done in the course of their work and then investigate
the issues that arise from this review. Reflection should involve articulating and defining the underlying principles
and rationale behind our professional actions and should focus on discovering the “intuitive knowing implicit in
the action” (Schön, 1988, p. 69).

Tripp (1993) suggests that this process of reflection can be accomplished by selecting and then analyzing critical
incidents that have occurred during our professional practice in the past (critical incident analysis). This approach
is used to assess and improve the quality of human services (Arora, Johnson, Lovinger, Humphrey, & Meltzer,
2005; Davies & Kinloch, 2000). A critical incident can be any incident that occurred in the course of our practice
that sticks in our mind and hence, provides an opportunity to learn. What makes it critical is the reflection and
analysis that we bring to it. Through the process of critical incident analysis, we can gain an increasingly better
understanding of the factors that have influenced our professional judgments. For it is only in retrospect, in
analyzing our past decisions, that we can see the complexities underlying what at the time may have appeared to
be a straightforward, intuitive professional judgment. “By uncovering our judgments . . . and reflecting upon
them,” Fish and Coles (1998) maintain, “we believe that it is possible to develop our judgments because we
understand more about them and about how we as individuals come to them” (p. 285).

Another key way to critically reflect on our professional practice and understand what factors influence the
formation of our professional judgments is to discuss our practice with our colleagues (Epstein, 2017). Colleagues,
especially those who are removed from the situation at hand or under discussion, can act as “critical friends” and
can help in the work of analyzing and critiquing our professional judgments with an eye to improving them. With
different education, training, and experience, our professional peers often have different perspectives from us.
Consequently, involving colleagues in the process of analyzing and critiquing our professional practice allows us to
compare with other professionals our ways of interpreting situations and choosing alternatives for action.
Moreover, the simple act of describing and summarizing an issue so that our colleagues can understand it can
reveal, and provide much insight into, the professional judgments we have made.

Professional Judgment and Evaluation Competencies
There is continuing interest in the evaluation field in specifying the competencies that define sound evaluation
practice (King & Stevahn, 2015). Building on previous work (Ghere, King, Stevahn, & Minnema, 2006; King,
Stevahn, Ghere, & Minnema, 2001; Stevahn, King, Ghere, & Minnema, 2005a; Stevahn, King, Ghere, &
Minnema, 2005b; Wilcox & King, 2013), King and Stevahn (2015) say:

The time has come at last for the field of program evaluation in the United States to address head-on an issue
that scholars and leaders of professional evaluation associations have discussed periodically over 30 years: What
is the set of competencies that an individual must have to conduct high-quality program evaluations? (p. 21)

This push is part of a broader international effort to develop evaluation competencies and link those to
professionalization of the evaluation discipline (King & Stevahn, 2015; Stockmann & Meyer, 2016; Wilcox &
King, 2013).

In an earlier study, 31 evaluation professionals in the United States were asked to
rate the importance of 49 evaluator competencies and then to try to come to a consensus about the ratings, given
feedback on how their peers had rated each item (King et al., 2001). The 49 items were grouped into four broad
clusters of competencies: (1) systematic inquiry (most items were about methodological knowledge and skills), (2)
competent evaluation practice (most items focused on organizational and project management skills), (3) general
skills for evaluation practice (most items were on communication, teamwork, and negotiation skills), and (4)
evaluation professionalism (most items focused on self-development and training, ethics and standards, and
involvement in the evaluation profession).

Among the 49 competencies, one was “making judgments” and referred to making an overall evaluative judgment,
as opposed to a number of recommendations, at the end of an evaluation (King et al., 2001, p. 233). Interestingly,
it was rated the second lowest on average among all the competencies. This finding suggests that judgment,
comparatively, is not rated to be that important (although the item average was still 74.68 out of 100 possible
points). King et al. (2001) suggested that “some evaluators agreed with Michael Scriven that to evaluate is to
judge; others did not” (p. 245). The “reflects on practice” item, however, was given an average rating of 93.23—a
ranking of 17 among the 49 items. For both of these items, there was substantial variation among the practitioners
about their ratings, with individual ratings ranging from 100 (highest possible score) to 20. The discrepancy
between the low overall score for “making judgments” and the higher score for “reflects on practice” may be
related to the difference between making a judgment, as an action, and reflecting on practice, as a personal quality.

If we look at linkages between types of professional judgment and the range of activities that comprise evaluation
practice, we can see that some kinds of professional judgment are more important for some clusters of activities
than others. But for many evaluation activities, several different kinds of professional judgment can be relevant.
Table 12.3 summarizes the steps we introduced in Chapter 1 to design and implement a program evaluation. For
each step, we have offered a (subjective) assessment of what kinds of professional judgment are involved. You can
see that for all the steps, there are multiple kinds of professional judgments involved and many of the steps involve
deliberative judgments—these are the ones that are most directly related to developing a morally-grounded
evaluation practice.

Table 12.4 displays the steps involved in designing and implementing a performance measurement system (taken
from Chapter 9). What you can see is that for all the steps there are multiple kinds of professional judgment
involved and, for nearly all of them, deliberative judgment-related decisions. This reflects the fact that designing
and implementing a performance measurement system is both a technical process and an organizational change
process, involving a wide range of organizational/political culture-related decisions. We have not displayed the list
of steps involved in re-balancing a performance measurement system (included in Chapter 10) but the range and
kinds of judgments involved would be similar to corresponding steps in Table 12.4.

Table 12.3 Types of Professional Judgment That Are Relevant to the Program Evaluation Framework in This Textbook
(Types of professional judgment: technical, procedural, reflective, deliberative)

Steps in designing and implementing a program evaluation
1. Who are the clients for the evaluation, and the stakeholders? (two of the four types of judgment)
2. What are the questions and issues driving the evaluation? (all four types of judgment)
3. What resources are available to do the evaluation? (all four types of judgment)
4. Given the evaluation questions, what do we already know? (all four types of judgment)
5. What is the logic and structure of the program? (three of the four types of judgment)
6. Which research design alternatives are desirable and feasible? (all four types of judgment)
7. What kind of environment does the program operate in and how does that affect the comparisons available to an evaluator? (two of the four types of judgment)
8. What data sources are available and appropriate, given the evaluation issues, the program structure, and the environment in which the program operates? (all four types of judgment)
9. Given all the issues raised in Points 1 to 8, which evaluation strategy is most feasible and defensible? (two of the four types of judgment)
10. Should the evaluation be undertaken? (two of the four types of judgment)

Steps in conducting and reporting an evaluation
1. Develop the data collection instruments and pre-test them. (three of the four types of judgment)
2. Collect data/lines of evidence that are appropriate for answering the evaluation questions. (three of the four types of judgment)
3. Analyze the data, focusing on answering the evaluation questions. (three of the four types of judgment)
4. Write, review, and finalize the report. (all four types of judgment)
5. Disseminate the report. (all four types of judgment)

Table 12.4 Types of Professional Judgment That Are Relevant to the Performance Measurement Framework in This Textbook
(Types of professional judgment: technical, procedural, reflective, deliberative)

Steps in designing and implementing a performance measurement system
Leadership: Identify the organizational champions of this change. (two of the four types of judgment)
Understand what a performance measurement system can and cannot do and why it is needed. (two of the four types of judgment)
Communication: Establish multichannel ways of communicating that facilitate top-down, bottom-up, and horizontal sharing of information, problem identification, and problem solving. (all four types of judgment)
Clarify the expectations for the uses of the performance information that will be created. (two of the four types of judgment)
Identify the resources and plan for the design, implementation, and maintenance of the performance measurement system. (all four types of judgment)
Take the time to understand the organizational history around similar initiatives. (two of the four types of judgment)
Develop logic models for the programs or lines of business for which performance measures are being developed. (three of the four types of judgment)
Identify constructs that are intended to represent performance for aggregations of programs or the whole organization. (all four types of judgment)
Involve prospective users in reviewing the logic models and constructs in the proposed performance measurement system. (all four types of judgment)
Translate the constructs into observable measures. (three of the four types of judgment)
Highlight the comparisons that can be part of the performance measurement system. (all four types of judgment)
Report results and then regularly review feedback from users and, if needed, make changes to the performance measurement system. (all four types of judgment)

Education and Training-Related Activities
Developing sound professional judgment depends substantially on being able to develop and practice the craft of
evaluation. Schön (1987) and Tripp (1993), among others (e.g., Greeff & Rennie, 2016; Mejlgaard et al., 2018;
Melé, 2005), have emphasized the importance of experience as a way of cultivating sound professional judgment.
Although textbook knowledge is also an essential part of every evaluator’s toolkit, a key part of evaluation curricula
are opportunities to acquire experience and by implication, tacit knowledge.

There are at least six complementary ways that evaluation education and training can be focused to provide
opportunities for students and new practitioners to develop their judgment skills. Some activities are more
discrete—that is, relevant for developing specific skills—and are focused mainly on technical and procedural
judgment-related skills. These are generally limited to a single course or even a part of a course. Others are more
generic, offering opportunities to acquire experience that spans entire evaluation processes. These are typically
activities that integrate coursework into work experiences. Table 12.5 summarizes ways that academic programs
can inculcate professional judgment capacities in their students.

The types of learning activities in Table 12.5 are typical of many programs that train evaluators, but what is
important is realizing that each of these kinds of activities contributes directly to developing a set of skills that all
practitioners need and will use in all their professional work. In an important way, identifying these learning
activities amounts to making explicit what has largely been tacit in our profession.

Table 12.5 Learning Activities to Increase Professional Judgment Capacity in Novice Practitioners

Course-based activities

Problem/puzzle solving
Types of professional judgment involved: technical and procedural judgment
Example: Develop a coding frame and test the coding categories for intercoder reliability for a sample of open-ended responses to an actual client survey that the instructor has provided (a brief worked sketch follows this table).

Case studies
Types of professional judgment involved: technical, procedural, reflective, and deliberative judgment
Example: Make a decision for an evaluator who finds himself or herself caught between the demands of his or her superior (who wants evaluation interpretations changed) and the project team who see no reason to make any changes.

Simulations
Types of professional judgment involved: technical, procedural, reflective, and deliberative judgment
Example: Using a scenario and role playing, negotiate the terms of reference for an evaluation.

Course projects
Types of professional judgment involved: technical, procedural, reflective, and deliberative judgment
Example: Students are expected to design a practical, implementable evaluation for an actual client organization.

Program-based activities

Apprenticeships/internships/work terms
Types of professional judgment involved: technical, procedural, reflective, and deliberative judgment
Example: Students work as apprentice evaluators in organizations that design and conduct evaluations, for extended periods of time (at least 4 months).

Conduct an actual program evaluation
Types of professional judgment involved: technical, procedural, reflective, and deliberative judgment
Example: Working with a client organization, develop the terms of reference for a program evaluation, conduct the evaluation, including preparation of the evaluation report, deliver the report to the client, and follow up with appropriate dissemination activities.

Teamwork and Improving Professional Judgment
Evaluators and managers often work in organizational settings where teamwork is expected. Successful teamwork
requires establishing norms and expectations that encourage good communication, sharing of information, and a
joint commitment to the task at hand. In effect a well-functioning team is able to develop a learning culture for
the task at hand. Being able to select team members and foster a work environment wherein people are willing to
trust each other, and be open and honest about their own views on issues, is conducive to generating information
that reflects a diversity of perspectives. Even though there will still be individual biases, the views expressed are
more likely to be valid than the perceptions of a dominant individual or coalition in the group. Parenthetically, an
organizational culture that emulates features of learning organizations (Garvin, 1993; Mayne, 2008) will tend to
produce information that is more valid as input for making decisions and evaluating policies and programs.

Managers and evaluators who have the skills and experience to network with others and, in doing so, be
reasonably confident that honest views about an issue are being offered, have a powerful tool to complement their
own knowledge and experience and their own systematic inquiries.

The Prospects for an Evaluation Profession
What does it mean to be a professional? What distinguishes a profession from other occupations? Eraut (1994)
suggests that professions are characterized by the following: a core body of knowledge that is shared through the
training and education of those in the profession; some kind of government-sanctioned license to practice; a code
of ethics and standards of practice; and self-regulation (and sanctions for wrongdoings) through some kind of
professional association to which members of the practice community must belong.

The idea that evaluation is a profession, or aspires to be a profession, is an important part of discussions of the
scope and direction of the enterprise (Altschuld, 1999; Altschuld & Engle, 2015; Stockmann & Meyer, 2016).
Modarresi, Newman, and Abolafia (2001) quote Leonard Bickman (1997), who was president of the American
Evaluation Association (AEA) in 1997, in asserting that “we need to move ahead with professionalizing evaluation
or else we will just drift into oblivion” (p. 1). Bickman and others in the evaluation field were aware that other
related professions continue to carve out territory, sometimes at the expense of evaluators. Picciotto (2011) points
out, however, that “heated doctrinal disputes within the membership of the AEA have blocked progress [toward
professionalization] in the USA” (p. 165). More recently, Picciotto (2015) suggests that professionalizing
evaluation is now a global issue wherein a significant challenge is working in contexts that do not support the
democratic evaluation model that has underpinned the development of the field. He suggests, “The time has come
to experiment with a more activist and independent evaluation model grounded in professional autonomy reliant
on independent funding sources and tailor made to diverse governance environments.” (p. 164).

Professionalizing evaluation now appears to be a global movement, judging by the growing number of Voluntary
Organizations of Professional Evaluation (VOPEs), their memberships, and the parallel efforts by some national
evaluation organizations to implement first steps in making it possible for evaluation practitioners to distinguish
themselves professionally (Donaldson & Donaldson, 2015). Donaldson and Donaldson summarize the global lay of the land this way:

During the 2015 International Year of Evaluation we learned about the profound growth and expansion
of VOPEs. While there were relatively few VOPEs prior to 1990, we have witnessed exponential growth
over the past 25 years (Donaldson, Christie, & Mark, 2015; Segone & Rugh, 2013). Rugh (personal
communication, 2015, October) reported that there are now approximately 227 VOPEs (170 verified)
representing 141 countries (111 verified) consisting of a total of approximately 52000 members. At the
same time, there has been a rapid expansion of University courses, certificates and degree programs in
evaluation and major growth in the number of VOPEs and other training organizations providing
evaluation workshops, online training, and other professional development experiences in evaluation
(LaVelle & Donaldson, 2015) (Donaldson & Donaldson, 2015, p. 2).

The growth in evaluation-related voluntary organizations is occurring against a background of the diversity in the
field. Donaldson and Donaldson (2015) point out that the core of the evaluation field is its theories, and a
contemporary reading of the field suggests that theoretical perspectives continue to emerge and differentiate
themselves (Alkin, 2013; Mertens & Wilson, 2012; Stockmann & Meyer, 2016). On the one hand, this richness
suggests a dynamic field that is continually enriched by the (now) global contributions of scholars and
practitioners.

But if we look at evaluation as a prospective profession, this diversity presents a challenge to efforts to define the
core competencies that are typically central to any profession. Imas (2017) summarizes the global evaluation
situation this way, “today any person or group can create their own set of competencies. And indeed, that is not
only what is happening but also what is being encouraged” (p. 73). She goes on to point out that “most fields
recognized as professions, such as health care, teaching, counseling, and so on, have typically developed
competencies … by asking a group of distinguished practitioners… . to first generate [an] initial list of
competencies, then to institute an expert review process to edit and refine them. The competencies are then made
available to professionals in the field” (p. 71).

Competencies are typically used to structure education/training programs and guide practice. In the evaluation
field, bottom-up efforts continue to dominate efforts to define core competencies (King & Stevahn, 2015).
Although more likely to be representative of the range of existing theories and practice, they may trade off breadth
with depth. Among the recommendations in an evaluation of the Canadian Evaluation Society Professional
Designation Program (Fierro, Galport, Hunt, Codd, & Donaldson, 2016) is one to facilitate recognizing
specializations for persons who are successful in acquiring the Credentialed Evaluator (CE) designation. In effect,
the 49 competencies that are the basis for the CE assessment process (Canadian Evaluation Society, 2018) would
be refined to formally acknowledge different theoretical and methodological approaches to evaluation practice.

One way to approach professionalization is to focus on the steps or stages involved. Altschuld and Austin (2005)
suggest there are three stages: credentialing, certification, and licensing for practitioners. Credentialing involves
demonstrating completion of specified requirements (knowledge, skills, experience, and education/training). A
profession that credentials its practitioners offers this step on a voluntary basis and cannot exclude practitioners
who do not obtain the credential. The Canadian Evaluation Society Credentialed Evaluator designation is such a
program (Canadian Evaluation Society, 2018). Certification involves testing competencies and other professional
attributes via an independent testing process that may involve examinations and practice requirements (practicums
or internships, for example). Typically, those who pass the certification process are issued document(s) attesting to
their competence to be practitioners. The profession cannot exclude those who do not seek (voluntary)
certification or who fail the process. Finally, licensing involves government jurisdictions issuing permits to practice
the profession; persons without a license cannot practice. Persons who are licensed to practice are typically
certified; for such professions, certification is a step toward obtaining a license to practice.

Aside from practitioner-focused steps, it is also possible for professions to accredit formal education/training
programs (typically offered by universities) so that students who complete those programs are certified and can (if
appropriate) apply to become licensed practitioners. Accreditation typically involves periodic peer reviews of
programs, including the qualifications of those teaching, the resources for the programs, the contents of the
program, the qualifications of the students (the demand for the program), and other factors that are deemed to
predict student competencies (McDavid & Huse, 2015).

Globally, the prospects for the field of evaluation evolving to be more professionalized are promising, judging by
the interest in evaluation and the growth in evaluation-related associations. Some countries (Canada, Britain and
Japan) are taking the next step—credentialing evaluators who are interested in differentiating themselves
professionally (UK Evaluation Society, 2018; Wilcox & King, 2013). But there is also evidence of limited
movement, particularly among those countries that have taken the lead in professionalizing evaluation so far
(United States, Canada, Australia and New Zealand) where “the development can be described as stagnation, with
even a certain decline in the number of programs (primarily in Psychology)” (Stockmann & Meyer, 2016, p. 337).
In the evaluation of the Professional Designation Program (Fierro, Galport, Hunt, Codd, & Donaldson, 2016),
the evaluators asked Canadian Evaluation Society Board members “if they believed that recognition of evaluation
as a profession in Canada was increasing, decreasing or remaining the same. While no one reported a decrease in
recognition, the board members were split on whether it was increasing or remaining the same” (p. 17).

Although it is challenging to offer an overall assessment of the future of evaluation, it seems clear that the
recognition of evaluation as a separate discipline/profession/body of practice is growing globally. But taking the
next steps toward professionalization is far more challenging. The experience of the Canadian Evaluation Society
in embarking on a program to credential evaluators suggests that building and sustaining an interest and
involvement in evaluation at this next level is promising but not yet assured.

Stockmann and Meyer (2016) sum up their volume on the global prospects for evaluation this way:

To sum up: the global trends for the future of evaluation are still positive, even if many pitfalls can be
identified. While evaluation is steadily on the increase, this continuously produces new challenges for the
integration of evaluation as a scientific, practical and politically useful endeavor. Today, the shared
perspective of being one global evaluation community dominates and many different ways of doing
evaluations are accepted. The tasks for the future will be more scientific research on evaluation and improved
utilization in public policy. This will be a dance on the volcano—as it ever has been. (p. 357)

612
Summary
Program evaluation is partly about understanding and applying methodologies and partly about exercising sound professional judgment
in a wide range of practice settings. But because most evaluation settings offer only roughly appropriate opportunities to apply the tools
that are often designed for social science research settings, it is essential that evaluators learn the craft of fitting square pegs into
round holes.

This chapter emphasizes the central role played by professional judgment in the practice of professions, including evaluation, and the
importance of cultivating sound professional judgment. Michael Patton, through his alter ego Halcolm, puts it this way (Patton, 2008, p.
501):

Forget “judge not and ye shall not be judged.”

The evaluator’s mantra: Judge often and well so that you get better at it.

—Halcolm

Professional judgment is substantially based on experience, and our experiences are founded on what we know, what we learn, what we
value, and what we believe. Professional judgment also has an important ethical component. Professional practice consists in part of
relying on our knowledge and skills, but it is also grounded in what we believe is right and wrong. Even evaluators who are making “pure
methodological decisions” are doing so based on their beliefs about what is right and wrong in each circumstance. Rights and wrongs are
based in part on values—there is no such thing as a value-free stance in our field—and in part on ethics, what is morally right
and wrong.

Professional programs, courses in universities, textbooks, and learning experiences are opportunities to learn and practice professional
judgment skills. Some of that learning is tacit—it can be acquired only through experience. Practica, internships, and apprenticeships
are all good ways of tying what we can learn from books, teachers, mentors, and our peers (working in teams is an asset in that way) to
what we can “know” experientially.

Although professional guidelines are an asset as we navigate practice settings, they are not enforceable and, because they are mostly based
on (desired) values, they are both general and can even conflict in a given situation. How we navigate those conflicts—how we choose among
moral values when we work—is an important part of what defines us as practitioners. In our field there is a growing view that
evaluators should do more to address inequalities and injustices globally. As our field globalizes, we also encounter practice
situations where our clients do not want evaluators to address social justice issues. How we respond to these challenges will, in part, define
our efforts to become a profession.

613
Discussion Questions
1. Take a position for or against the following proposition and develop a strong one-page argument that supports your position.
This is the proposition: “Be it resolved that experiments, where program and control groups are randomly assigned, are the Gold
Standard in evaluating the effectiveness of programs.”
2. What do evaluators and program managers have in common? What differences can you think of as well?
3. What is tacit knowledge? How does it differ from public/shareable knowledge?
4. In this chapter, we said that learning to ride a bicycle is partly tacit. For those who want to challenge this statement, try to
describe learning how to ride a bicycle so that a person who has never before ridden a bicycle could get on one and ride it right
away.
5. What other skills can you think of that are tacit?
6. What is mindfulness, and how can it be used to develop sound professional judgment?
7. Why is teamwork an asset for persons who want to develop sound professional judgment?
8. In this chapter we introduced three different ethical frameworks. Which one aligns most closely with your own ethical approach?
Why?
9. What is practical wisdom as an ethical approach in professional practice? How is it different from the three ethical frameworks we
introduced in this chapter?
10. What do you think would be required to make evaluation more professional—that is, have the characteristics of a profession?

614
Appendix

615
Appendix A: Fiona’s Choice: An Ethical Dilemma for a Program Evaluator
Fiona Barnes did not feel well as the deputy commissioner’s office door closed behind her. She walked back to her
office wondering why bad news seems to come on Friday afternoons. Sitting at her desk, she went over the events
of the past several days and the decision that lay ahead of her. This was clearly the most difficult situation that she
had encountered since her promotion to the position of Director of Evaluation in the Department of Human
Services.

Fiona’s predicament had begun the day before, when the new commissioner, Fran Atkin, had called a meeting
with Fiona and the deputy commissioner. The governor was in a difficult position: In his recent election
campaign, he had made potentially conflicting campaign promises. He had promised to reduce taxes and had also
promised to maintain existing health and social programs, while balancing the state budget.

The week before, a loud and lengthy meeting of the commissioners in the state government had resulted in a
course of action intended to resolve the issue of conflicting election promises. Fran Atkin had been persuaded by
the governor that she should meet with the senior staff in her department, and after the meeting, a major
evaluation of the department’s programs would be announced. The evaluation would provide the governor with
some post-election breathing space. But the evaluation results were predetermined—they would be used to justify
program cuts. In sum, a “compassionate” but substantial reduction in the department’s social programs would be
made to ensure the department’s contribution to a balanced budget.

As the new commissioner, Fran Atkin relied on her deputy commissioner, Elinor Ames. Elinor had been one of
several deputies to continue on under the new administration and had been heavily committed to developing and
implementing key programs in the department, under the previous administration. Her success in doing that had
been a principal reason why she had been promoted to deputy commissioner.

On Wednesday, the day before the meeting with Fiona, Fran Atkin had met with Elinor Ames to explain the
decision reached by the governor, downplaying the contentiousness of the discussion. Fran had acknowledged
some discomfort with her position, but she believed her department now had a mandate. Proceeding with it was
in the public’s interest.

Elinor was upset with the governor’s decision. She had fought hard over the years to build the programs in
question. Now she was being told to dismantle her legacy—programs she believed in that made up a considerable
part of her budget and person-year allocations.

In her meeting with Fiona on Friday afternoon, Elinor had filled Fiona in on the political rationale for the
decision to cut human service programs. She also made clear what Fiona had suspected when they had met with
the commissioner earlier that week—the outcomes of the evaluation were predetermined: They would show that
key programs where substantial resources were tied up were not effective and would be used to justify cuts to the
department’s programs.

Fiona was upset with the commissioner’s intended use of her branch. Elinor, watching Fiona’s reactions closely,
had expressed some regret over the situation. After some hesitation, she suggested that she and Fiona could work
on the evaluation together, “to ensure that it meets our needs and is done according to our standards.” After
pausing once more, Elinor added, “Of course, Fiona, if you do not feel that the branch has the capabilities needed
to undertake this project, we can contract it out. I know some good people in this area.”

Fiona was shown to the door and asked to think about it over the weekend.

Fiona Barnes took pride in her growing reputation as a competent and serious director of a good evaluation shop.
Her people did good work that was viewed as being honest and fair, and they prided themselves on being able to
handle any work that came their way. Elinor Ames had appointed Fiona to the job, and now this.

616
Your Task
Analyze this case and offer a resolution to Fiona’s dilemma. Should Fiona undertake the evaluation project?
Should she agree to have the work contracted out? Why?

A. In responding to this case, consider the issues on two levels: (1) look at the issues taking into account Fiona’s
personal situation and the “benefits and costs” of the options available to her and (2) look at the issues from an
organizational standpoint, again weighing the “benefits and the costs”. Ultimately, you will have to decide how to
weigh the benefits and costs from both Fiona’s and the department’s standpoints.

B. Then look at this case and address this question: Is there an ethical “bottom line” such that, regardless of the
costs and benefits involved, it should guide Fiona’s decision? If there is, what is the ethical bottom line? Again,
what should Fiona do? Why?

617
References
Abercrombie, M. L. J. (1960). The anatomy of judgment: An investigation into the processes of perception and
reasoning. New York: Basic Books.

Alkin, M. C. (Ed.). (2013). Evaluation roots: A wider perspective of theorists’ views and influences (2nd ed.).
Thousand Oaks: Sage.

Altschuld, J. (1999). The certification of evaluators: Highlights from a report submitted to the Board of Directors
of the American Evaluation Association. American Journal of Evaluation, 20(3), 481–493.

Altschuld, J. W., & Austin, J. T. (2005). Certification. Encyclopedia of Evaluation. Thousand Oaks, CA: Sage.

Altschuld, J. W., & Engle, M. (Eds.). (2015). Accreditation, Certification, and Credentialing: Relevant Concerns for
U.S. Evaluators: New Directions for Evaluation (No. 145). Hoboken, NJ: John Wiley & Sons.

American Evaluation Association. (1995). Guiding principles for evaluators. New Directions for Program
Evaluation, 66, 19–26.

American Evaluation Association. (2004). Guiding principles for evaluators. Retrieved from
http://www.eval.org/p/cm/ld/fid=51

American Evaluation Association. (2011). American Evaluation Association Statement on Cultural Competence in
Evaluation. Retrieved from: http://www.eval.org/ccstatement

American Evaluation Association. (2018). Guiding principles for evaluators. Retrieved from
http://www.eval.org/p/cm/ld/fid=51

Arsene, O., Dumitrache, I., & Mihu, I. (2015). Expert system for medicine diagnosis using software agents.
Expert Systems with Applications, 42(4), 1825–1834.

Arora, V., Johnson, J., Lovinger, D., Humphrey, H. J., & Meltzer, D. O. (2005). Communication failures in
patient sign-out and suggestions for improvement: A critical incident analysis. BMJ Quality & Safety, 14(6),
401–407.

Astbury, B. (2016). Reframing how evaluators think and act: New insights from Ernest House. Evaluation, 22(1),
58–71.

Barber, M. (2007). Instruction to Deliver: Tony Blair, the Public Services and the Challenge of Achieving Targets.
London: Politico's.

618
Bickman, L. (1997). Evaluating evaluation: Where do we go from here? Evaluation Practice, 18(1), 1–16.

Canadian Evaluation Society. (2012a). CES guidelines for ethical conduct. Retrieved from
https://evaluationcanada.ca/ethics

Canadian Evaluation Society. (2012b). Program evaluation standards. Retrieved from
https://evaluationcanada.ca/program-evaluation-standards

Canadian Evaluation Society. (2018). About the CE Designation. Retrieved from https://evaluationcanada.ca/ce

Chen, H. T. (1996). A comprehensive typology for program evaluation. Evaluation Practice, 17(2), 121–130.

Chouinard, J. A., & Cousins, J. B. (2007). Culturally competent evaluation for Aboriginal communities: A review
of the empirical literature. Journal of MultiDisciplinary Evaluation, 4(8), 40–57.

Cox, R. & Pyakuryal, S. (2013). Tacit Knowledge. In H. Fredrickson and R. Ghere (Eds.), Ethics in Public
Management (pp. 216–239) (2nd ed.). New York: Sharpe.

Cronbach, L. J. (1980). Toward reform of program evaluation (1st ed.). San Francisco: Jossey-Bass.

Davies, H., & Kinloch, H. (2000). Critical incident analysis: Facilitating reflection and transfer of learning. In V.
Cree and C. Maccaulay (Eds.), Transfer of Learning in Professional and Vocational Education (pp. 137–147),
London, UK: Routledge.

Dobkin, P., & Hutchinson, T. (2013). Teaching mindfulness in medical school: Where are we now and where are
we going? Medical Education, 47(8), 768–779.

Donaldson, S., & Donaldson, S. I. (2015). Visions for using evaluation to develop more equitable societies. In S.
Donaldson and R. Picciotto (Eds.), Evaluation for an equitable society (pp. 1–10). Charlotte, NC: Information
Age Publishing.

Donaldson, S., Christie, C., & Mark, M. (Eds.). (2015). Credible and Actionable Evidence: The Foundation for
Rigorous and Influential Evaluations. (2nd Ed.). Thousand Oaks, CA: Sage.

Donaldson, S., & Picciotto, R. (Eds.) (2016). Evaluation for an equitable society. Charlotte, NC: Information Age
Publishing.

Dunne, J., & Pendlebury, S. (2003). Practical reason. In N. Blake, P. Smeyers, R. Smith, & P. Standish (Eds.),
The Blackwell guide to the philosophy of education (pp. 194–211). Oxford: Blackwell.

Emslie, M., & Watts, R. (2017). On technology and the prospects for good practice in the human services:
Donald Schön, Martin Heidegger, and the case for phronesis and praxis. Social Service Review, 91(2), 319–356.

Epstein, R. M. (1999). Mindful practice. Journal of the American Medical Association, 282(9), 833–839.

Epstein, R. M. (2003). Mindful practice in action (I): Technical competence, evidence-based medicine, and
relationship-centered care. Families, Systems & Health, 21(1), 1–9.

Epstein, R. M. (2017). Mindful practitioners, mindful teams, and mindful organizations: Attending to the core
tasks of medicine. In P. Papadokos and S. Bertman (Eds.), Distracted doctoring (pp. 229–243). Cham,
Switzerland: Springer.

Epstein, R. M., Siegel, D. J., & Silberman, J. (2008). Self-monitoring in clinical practice: A challenge for medical
educators. Journal of Continuing Education in the Health Professions, 28(1), 5–13.

Eraut, M. (1994). Developing professional knowledge and competence. Washington, DC: Falmer Press.

Evans, T., & Hardy, M. (2017). The ethics of practical reasoning—exploring the terrain. European Journal of
Social Work, 20(6), 947–957.

Feller, I. (2002). Performance measurement redux. The American Journal of Evaluation, 23(4), 435–452.

Fierro, L., Galport, N., Hunt, A., Codd, H., & Donaldson, S. (2016). Canadian Evaluation Society credentialed
evaluator designation program: Evaluation report. Claremont Evaluation Centre. Retrieved from
https://evaluationcanada.ca/txt/2016_pdp_evalrep_en.pdf

Fish, D., & Coles, C. (1998). Developing professional judgement in health care: Learning through the critical
appreciation of practice. Boston, MA: Butterworth-Heinemann.

Flyvbjerg, B. (2004). Phronetic planning research: Theoretical and methodological reflections. Planning Theory &
Practice, 5(3), 283–306.

Foucault, M. (1980). Power/knowledge: Selected interviews and other writings, 1972–1977. New York: Pantheon.

Gadamer, H. (1975). Truth and method. London, UK: Sheed and Ward.

Garvin, D. A. (1993). Building a learning organization. Harvard Business Review, 71(4), 78–90.

Ghere, G., King, J. A., Stevahn, L., & Minnema, J. (2006). A professional development unit for reflecting on
program evaluator competencies. American Journal of Evaluation, 27(1), 108–123.

620
Gibbins, M., & Mason, A. K. (1988). Professional judgment in financial reporting. Toronto, Ontario, Canada:
Canadian Institute of Chartered Accountants.

Government of Canada (2014). Tri-council policy statement: Ethical conduct for research involving humans.
Retrieved from http://www.pre.ethics.gc.ca/eng/policy-politique/initiatives/tcps2-eptc2/Default/

Greeff, M., & Rennie, S. (2016). Phronesis: beyond the research ethics committee—a crucial decision-making
skill for health researchers during community research. Journal of Empirical Research on Human Research Ethics,
11(2), 170–179.

House, E. R. (2015). Evaluating: Values, biases, and practical wisdom. Charlotte, NC: Information Age Publishing.

Hursthouse, R. (1999). On virtue ethics. Oxford: Oxford University Press.

Imas, L. M. (2017). Professionalizing evaluation: A golden opportunity. In R. Van Den Berg, I. Naidoo, and S.
Tamondong (Eds.), Evaluation for Agenda 2030: Providing evidence on progress and sustainability. Exeter, UK:
International Development Evaluation Association.

Kemmis, S. (2012). Phronesis, experience and the primacy of praxis. In E. Kinsella and A. Pitman (Eds.),
Phronesis as professional knowledge: Practical wisdom in the professions (pp. 147–162). Rotterdam:
SensePublishers.

Koehn, D. (2000). What is practical judgment? Professional Ethics: A Multidisciplinary Journal, 8(3.4), 3–18.

Kinsella, E. A., & Pitman, A. (2012). Phronesis as professional knowledge. In Phronesis as Professional Knowledge
(pp. 163–172). Rotterdam: SensePublishers.

King, J. A., Stevahn, L., Ghere, G., & Minnema, J. (2001). Toward a taxonomy of essential evaluator
competencies. American Journal of Evaluation, 22(2), 229–247.

King, J. A., & Stevahn, L. (2015). Competencies for program evaluators in light of adaptive action: What? So
What? Now What? New Directions for Evaluation, (145), 21–37.

Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago, IL: University of Chicago Press.

Langford, J. W. (2004). Acting on values: An ethical dead end for public servants. Canadian Public
Administration, 47(4), 429–450.

LaVelle, J. M., & Donaldson, S. I. (2015). The state of preparing evaluators. New Directions for Evaluation, (145),
39–52.

Lowell, A., Kildea, S., Liddle, M., Cox, B., & Paterson, B. (2015). Supporting aboriginal knowledge and practice
in health care: Lessons from a qualitative evaluation of the strong women, strong babies, strong culture
program. BMC Pregnancy and Childbirth, 15(1), 19–32.

Martinez, J. M. (2009). Public administration ethics for the 21st century. Santa Barbara, CA: ABC-CLIO.

Mason, J. (2002). Qualitative researching (2nd ed.). Thousand Oaks, CA: Sage.

Mathison, S. (2017). Does evaluation contribute to the public good? Keynote address to the Australasian Evaluation
Society, September 4.

Mayne, J. (2008). Building an evaluative culture for effective evaluation and results management. Retrieved from
http://www.focusintl.com/RBM107-ILAC_WorkingPaper_No8_EvaluativeCulture_Mayne.pdf

McDavid, J. C., & Huse, I. (2015). How does accreditation fit into the picture? New Directions for Evaluation,
(145), 53–69.

Mejlgaard, N., Christensen, M. V., Strand, R., Buljan, I., Carrió, M., i Giralt, M. C.,… & Rodríguez, G. (2018).
Teaching responsible research and innovation: A phronetic perspective. Science and Engineering Ethics, 1–19.

Melé, D. (2005). Ethical education in accounting: Integrating rules, values and virtues. Journal of Business Ethics,
57(1), 97–109.

Mertens, D. M., & Wilson, A. T. (2012). Program evaluation theory and practice: A comprehensive guide. New
York: Guilford Press.

Mill, J. S., & Bentham, J. (1987). Utilitarianism and Other Essays (A. Ryan, Ed.). London: Penguin
Books.

Modarresi, S., Newman, D. L., & Abolafia, M. Y. (2001). Academic evaluators versus practitioners: Alternative
experiences of professionalism. Evaluation and Program Planning, 24(1), 1–11.

Morris, M. (1998). Ethical challenges. American Journal of Evaluation, 19(3), 381–382.

Morris, M. (Ed.). (2008). Evaluation ethics for best practice: Cases and commentaries. New York: Guilford Press.

Morris, M. (2011). The good, the bad, and the evaluator: 25 years of AJE ethics. American Journal of Evaluation,
32(1), 134–151.

Mowen, J. C. (1993). Judgment calls: High-stakes decisions in a risky world. New York: Simon & Schuster.

Newman, D. L., & Brown, R. D. (1996). Applied ethics for program evaluation. Thousand Oaks, CA: Sage.

622
O’Neill, O. (1986). The power of example. Philosophy, 61, 5–29.

Osborne, D., & Gaebler, T. (1992). Reinventing government: How the entrepreneurial spirit is transforming
government. Reading, MA: Addison-Wesley.

Patton, M. Q. (2008). Utilization-focused evaluation (4th ed.). Thousand Oaks, CA: Sage.

Perrin, B. (1998). Effective use and misuse of performance measurement. American Journal of Evaluation, 19(3),
367–379.

Petersen, A. & Olsson, J. (2015). Calling evidence-based practice into question: Acknowledging phronetic
knowledge in social work. British Journal of Social Work, 45, 1581–1597.

Picciotto, R. (2011). The logic of evaluation professionalism. Evaluation, 17(2), 165–180.

Picciotto, R. (2015). Democratic evaluation for the 21st century. Evaluation, 21(2), 150–166.

Pitman, A. (2012). Professionalism and professionalisation. In A. Kinsella and A. Pitman (Eds.). Phronesis as
professional knowledge (pp. 131–146). Rotterdam: SensePublisher.

Polanyi, M. (1958). Personal knowledge: Towards a post-critical philosophy. IL: University of Chicago Press.

Polanyi, M., & Grene, M. G. (1969). Knowing and being: Essays. IL: University of Chicago Press.

Riskin, L. L. (2011). Awareness and the legal profession: An introduction to the mindful lawyer symposium.
Journal of Legal Education, 61, 634–640.

Rolston, H., Geyer, J., Locke, G., Metraux, S., & Treglia, D. (2013, June 6). Evaluation of Homebase
Community Prevention Program. Final Report, Abt Associates Inc.

Rossi, P. H., Lipsey, M. W., & Freeman, H. E. (2004). Evaluation: A systematic approach. Thousand Oaks, CA:
Sage.

Segone, M., & Rugh, J. (2013). Evaluation and civil society: Stakeholders’ perspectives on national evaluation
capacity development. Published by UNICEF, EvalPartners and IOCE in partnership with CLEAR, IEG World
Bank, Ministry for Foreign Affairs of Finland, OECD Development Assistance Committee Network on Development
Evaluation, UNEG and UN Women.

Sanders, J. R. (1994). Publisher description for the program evaluation standards: How to assess evaluations of
educational programs. Retrieved from http://catdir.loc.gov/catdir/enhancements/fy0655/94001178-d.html

623
Schön, D. A. (1987). Educating the reflective practitioner: Toward a new design for teaching and learning in the
professions (1st ed.). San Francisco, CA: Jossey-Bass.

Schön, D. A. (1988). From technical rationality to reflection-in-action. In J. Dowie and A. S. Elstein (Eds.),
Professional judgment: A reader in clinical decision making (pp. 60–77). New York: Cambridge University Press.

Schwandt, T. A. (2007). Expanding the conversation on evaluation ethics. Evaluation and Program Planning,
30(4), 400–403.

Schwandt, T. A. (2008). The relevance of practical knowledge traditions to evaluation practice. In N. L. Smith
and P. R. Brandon (Eds.), Fundamental issues in evaluation (pp. 29–40). New York: Guilford Press.

Schwandt, T. (2015). Evaluation foundations revisited: Cultivating a life of the mind for practice. Stanford, CA:
Stanford University Press.

Schwandt, T. A. (2017). Professionalization, ethics, and fidelity to an evaluation ethos. American Journal of
Evaluation, 38(4), 546–553.

Schwandt, T. A. (2018). Evaluative thinking as a collaborative social practice: The case of boundary judgment
making. New Directions for Evaluation, 2018(158), 125–137.

Scriven, M. (2016). The last frontier of evaluation: Ethics. In S. Donaldson and R. Picciotto (Eds.), Evaluation for
an equitable society (pp. 11–48). Charlotte, NC: Information Age Publishers.

Seiber, J. (2009). Planning ethically responsible research. In L. Bickman and D. Rog (Eds.), The Sage handbook of
applied social research methods (2nd ed., pp. 106–142). Thousand Oaks, CA: Sage.

Shepherd, R. (2018). Expenditure reviews and the federal experience: Program evaluation and its contribution to
assurance provision. Canadian Journal of Program Evaluation, 32(3) (Special Issue), 347–370.

Simons, H. (2006). Ethics in evaluation. In I. Shaw, J. Greene, and M. M. Mark (Eds.), The Sage handbook of
evaluation (pp. 243–265). Thousand Oaks, CA: Sage.

Stevahn, L., King, J. A., Ghere, G., & Minnema, J. (2005a). Establishing essential competencies for program
evaluators. American Journal of Evaluation, 26(1), 43–59.

Stevahn, L., King, J. A., Ghere, G., & Minnema, J. (2005b). Evaluator competencies in university-based training
programs. Canadian Journal of Program Evaluation, 20(2), 101–123.

Stockmann, R., & Meyer, W. (2016). The Future of Evaluation: Global Trends, New Challenges and Shared
Perspectives. London: Palgrave Macmillan.

624
Treasury Board of Canada (2009). Policy on Evaluation (rescinded). Archive retrieved from https://www.tbs-
sct.gc.ca/pol/doc-eng.aspx?id=15024

Treasury Board of Canada. (2016a). Policy on Results. Retrieved from https://www.tbs-sct.gc.ca/pol/doc-eng.aspx?id=31300

Treasury Board of Canada. (2016b). Directive on Results. Retrieved from https://www.tbs-sct.gc.ca/pol/doc-eng.aspx?id=31306

Tripp, D. (1993). Critical incidents in teaching: Developing professional judgement. London, England: Routledge.

UK Evaluation Society. (2018). Voluntary evaluator peer review: Next steps. Retrieved from
https://www.evaluation.org.uk/index.php/events-courses/vepr/203-vepr-update

U.S. Department of Health and Human Services. (2009). Code of Federal Regulations—Title 45: Public Welfare;
Part 46: Protection of Human Subjects. Revised January 15, 2009: Effective July 14, 2009. Retrieved from
https://www.hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr-46/index.html

Vickers, S. (1995). The Art of Judgment. Thousand Oaks, CA: Sage.

Weiss, C. H. (1998). Evaluation: Methods for studying programs and policies (2nd ed.). Upper Saddle River, NJ:
Prentice Hall.

Wilcox, Y., & King, J. A. (2013). A professional grounding and history of the development and formal use of
evaluator competencies. The Canadian Journal of Program Evaluation, 28(3), 1–28.

Yarbrough, D., Shulha, L., Hopson, R., & Caruthers, F. (2011). Joint Committee on Standards for Educational
Evaluation: A guide for evaluators and evaluation users (3rd ed.). Los Angeles, CA: Sage.

625

626
Glossary

accountability:
responsibility for the fiscal, administrative, and programmatic activities that occur in organizational units
over which one has formal authority
action research:
collaborative, inclusive research with the objective of action resulting in the promotion of social change
adequacy:
the extent to which the program outcomes were sufficient to meet the needs for a program
after-only experimental design:
a research design that does not measure program/control differences before the treatment begins
ambiguous temporal sequence:
this internal validity threat arises where the “cause” and the “effect” variables could plausibly be reversed so
that the effect variable becomes the cause
antirealist ontology:
the idea that reality consists only of ideas or is confined to the mind
appropriateness:
the extent to which the theory/logic of a program is the best means to achieve the objectives of the program
attributes:
measurable characteristics of interest about the units of analysis (cases) in the evaluation
attribution:
the extent to which the program, and not some other factor(s) in the program environment, caused the
observed outcomes
averting behavior method:
in cost–benefit analysis, the estimation of the social cost of an avoided risk to health or safety based on the cost
of the behavior to avoid the risk
balanced scorecard:
a type of organizational performance measurement system, originated by Robert Kaplan and David Norton,
that typically includes clusters of performance measures for four different dimensions: organizational
learning and growth, internal business processes, customers, and the financial perspective
baseline measures:
measures of outcome-related variables taken before a program is implemented
before–after design:
a research design that compares measurements or data points before the program is implemented with
measurements after implementation
benchmark/gold standard:
a standard or point of reference (often some standard of best practices) against which program processes,
outcomes, or an evaluation design can be compared
bias:
a systematic distortion in a measurement instrument or measurement results that results in data that tend to
be either too high or too low in relation to the true value of a measure
Campbell Collaboration:
an online resource dedicated to producing systematic reviews on the effects of education, criminal justice,
and social welfare interventions
case studies:
methods of inquiry that focus on intensive data collection and analysis that investigates only a few units of
analysis
case study design:
a research design where the comparisons are internal to the program group, and often there are no
opportunities to construct comparisons (before–after or program group[s]) to assess the incremental effects
of the program

627
cases:
see: units of analysis
causal chain:
a set of connected causal relationships
causal linkages (same as cause-and-effect linkages):
intended causal relationships between the constructs in a program logic model
causal relationship:
one variable is said to cause another where the causal variable occurs before the effect variable; the cause and
effect variables covary; that is, as one changes, the other one also changes (either positively or negatively);
and there are no other variables that could plausibly account for the covariation between the cause and the
effect variables
ceteris paribus:
all other things being equal—that is, all influences are held constant except the one that is of immediate
interest
choice modeling:
in cost–benefit analysis, survey-based methods used to value and compare complex alternative social
investments with multiple varying dimensions and attributes
closed-ended questions:
questions that are structured so that all the categories of possible responses are constructed by the evaluator
before the data are collected
Cochrane Collaboration:
an online resource dedicated to systematically reviewing experimental studies in health care, medicine, and
related fields to determine the effectiveness of interventions
comparative time-series design:
a research design that relies on data collected at multiple points in time (before and after program
implementation) for both the program group and control groups
comparison group:
a group of units of analysis (usually people) who are not exposed to the program and who are compared
with the program group
compensatory equalization of treatments:
a construct validity threat where the group that is not supposed to get the program is offered components of
the program, or similar benefits, because the program provider wishes to balance perceived inequalities
between the two groups
compensatory rivalry:
a construct validity threat where the performance of the no-program group or individual improves because
of a desire to do as well as those receiving the program, and this diminishes the differences between the new
program and the existing programs; also known as the “John Henry effect”
conceptual framework:
a set of related constructs that provides us with definitions and categories that permit us to structure our
thinking about social processes
conceptual use (of evaluation):
the knowledge from the evaluation becomes part of the background in the organization and influences other
programs at other times
concurrent validity:
validity related to the strength of the correlation between a new measure of a construct and an existing
(presumed valid) measure of a construct
confidence interval:
when sample descriptive statistics are calculated (e.g., a mean) and then generalized to the population from
which the sample was randomly drawn, the interval of probable values of the population mean, centered
around the sample mean, is the confidence interval
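To make this concrete, here is a minimal Python sketch (the satisfaction scores and the choice of a 95% level are hypothetical, not drawn from the text) showing how a confidence interval around a sample mean might be calculated:

```python
import statistics

# Hypothetical sample of client satisfaction scores (1-7 scale)
sample = [5, 6, 4, 7, 5, 6, 5, 4, 6, 5, 7, 6, 5, 4, 6, 5, 6, 7, 5, 6]

n = len(sample)
mean = statistics.mean(sample)
std_error = statistics.stdev(sample) / n ** 0.5  # standard error of the mean

# 1.96 is the critical value for a 95% level of confidence (large samples);
# a t-value would be slightly larger for a sample this small.
margin = 1.96 * std_error
print(f"95% confidence interval: {mean - margin:.2f} to {mean + margin:.2f}")
```

If the sample were drawn randomly from a population of program clients, intervals constructed this way would be expected to capture the population mean in about 95% of repeated samples.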
confirmatory factor analysis:
the use of factor analysis (a multivariate statistical procedure for data analysis) to confirm the underlying
dimension(s) in an empirical assessment of the internal structure of a measure
consequentialism:
in ethics, this approach emphasizes the importance of the consequences (positives and negatives) in making
moral decisions
construct validity:
the extent to which the variables used to measure program constructs convincingly represent the constructs
in the program logic model
constructionists:
evaluators who believe that meaningful reality does not exist independently of human consciousness and
experience
constructivist:
a philosophical view of the world that assumes that people’s perceptions are relative and that reality is
socially constructed; there is no foundational reality
constructs:
the words or phrases in logic models that we use to describe programs and program results, including the
cause-and-effect linkages in the program
content analysis:
qualitative analysis of textual materials; determining common themes, coding the themes, and, in some
cases, quantifying the coded information
content validity:
the extent to which a measure “captures” the intended range of the content of a construct
context-dependent mediation:
occurs when pre-existing features of the environment in which the (new) program is implemented influence
the program outcomes
contingent valuation:
in cost–benefit analysis, survey-based methods used to construct a hypothetical market to elicit the value of a
social investment
contribution analysis:
an approach to evaluation, originally developed by John Mayne in 2001, to model and facilitate the analysis
of the links between programs and actual performance results (outputs and outcomes), and avoid making
unsupportable claims about the degree of program attribution
control group:
a group of units of analysis (usually people) who are not exposed to the experiment or program and who are
compared with the program group
convergent validity:
the extent to which measures of two or more constructs that are theoretically related correlate or covary with
each other
coping organizations:
organizations where work tasks change a lot and results are not readily visible; least likely to be successful in
measuring outputs and outcomes; a departmental communications office is an example
correlation:
the extent to which the variance of one variable covaries with the variance of another variable; correlations
can be either positive or negative and can vary in strength between –1 (perfect negative correlation) and +1
(perfect positive correlation)
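As an illustration, this minimal Python sketch (the data are hypothetical) computes Pearson's r directly from its definition, covariation relative to the spread of the two variables:

```python
# Hypothetical data: hours of tutoring received and test score gains
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5

r = covariation / (spread_x * spread_y)  # varies between -1 and +1
print(f"correlation = {r:.2f}")  # positive here: as x increases, y tends to increase
```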
cost-based analyses:
(see cost–benefit analysis, cost–effectiveness analysis, cost–utility analysis)
cost–benefit analysis:
an evaluation of the costs and benefits of a policy, program, or project wherein all the current and future
costs and benefits are converted to current dollars
cost-effectiveness:
the ratio of program inputs (expressed in monetary units) to program outcomes
cost–effectiveness analysis:
a comparison of the costs and outcomes of policy, program, or project alternatives such that ratios of costs
per unit of outcome are calculated
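The arithmetic is simply a ratio of costs to units of outcome, as in this hypothetical Python sketch (the program names, costs, and outcomes are invented for illustration):

```python
# Hypothetical comparison of two job-training alternatives
alternatives = {
    "Program A": {"cost": 500_000, "employed": 200},
    "Program B": {"cost": 750_000, "employed": 250},
}

for name, data in alternatives.items():
    cost_per_outcome = data["cost"] / data["employed"]
    print(f"{name}: ${cost_per_outcome:,.0f} per participant employed")

# Program A: $2,500 per participant employed
# Program B: $3,000 per participant employed
```

On this criterion alone, Program A is the more cost-effective alternative, although the ratio by itself says nothing about the total scale of outcomes each alternative achieves.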
cost of illness:
in economic evaluation, a measure of the burden of illness on society, including the direct costs of treatment
and indirect costs such as reduced output and the cost of pain and suffering
cost–utility analysis:
a comparison of costs and estimated utilities of program outcomes that weights and combines outcomes so
that the alternatives can be compared
counterfactual:
the outcomes that would have happened without the implementation of the program
covariation:
as the values of one variable change (either increasing or decreasing), the values of the other variable also
change; covariation can be positive or negative
craft organizations:
organizations where work involves applying mixes of professional knowledge and skills to unique tasks to
produce visible outcomes; a public audit office would be an example
criterion validity:
see concurrent validity
critical incidents:
in terms of evaluation, can be any incident that occurs in the course of one’s practice that sticks in one’s
mind and, hence, provides an opportunity to learn
Cronbach’s alpha:
a statistic based on the extent to which the responses to closed-ended survey items correlate with each other,
taking into account the number of items being assessed for their collective reliability; it can vary between 0
(no reliability) and 1 (perfect reliability)
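For readers who want to see the calculation, here is a minimal Python sketch of the standard formula, alpha = (k / (k − 1)) × (1 − sum of item variances / variance of total scores); the Likert responses are hypothetical:

```python
def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def cronbach_alpha(items):
    """items: one list of scores per survey item, for the same respondents."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_variances = sum(sample_variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variances / sample_variance(totals))

# Hypothetical responses of six people to three 5-point Likert items
item1 = [4, 5, 3, 4, 2, 5]
item2 = [4, 4, 3, 5, 2, 4]
item3 = [5, 5, 2, 4, 3, 5]
print(round(cronbach_alpha([item1, item2, item3]), 2))  # about 0.89 for these data
```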
declining rates of discount:
see: time-declining rates of discount
decoupling:
in performance measurement, arranging separate sets of measures for management (improvement) purposes
and external accountability purposes
defensive expenditures method:
in cost–benefit analysis, the estimation of the social cost of an avoided risk to health or safety based on the
expenditures made to avoid the risk
deliberative judgment:
professional judgment that involves making decisions that have the potential to affect whether an evaluator
engages in a particular task or even an evaluation project
deontological ethics:
an ethical approach that judges decisions and actions based on their adherence to universal standards of right
and wrong
dependent variables:
a variable that we expect will be affected by one or more independent variables—in most evaluations, the
observed outcomes are dependent variables
developmental evaluation:
an alternative to formative and summative program evaluations, designed to contribute to ongoing
organizational innovations; organizations are viewed as co-evolving with complex environments, with
program objectives and program structures in flux
diffusion of treatments:
a construct validity threat where interactions between the program group and the control group offer ways
for the control group to learn about the intended treatment, weakening the intended differences between
the two groups
discounting:
the process of determining the net present value of a dollar amount of costs or benefits

630
discount rate:
the rate of interest used in discounting costs and benefits—that is, converting all costs and benefits over the
life of the policy, program, or project into net present values
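A minimal Python sketch of discounting, using a hypothetical five-year stream of net benefits and an illustrative 5% discount rate (neither comes from the text):

```python
# Hypothetical net benefits (benefits minus costs) by year, in dollars;
# year 0 carries the up-front program cost.
net_benefits = [-100_000, 20_000, 30_000, 35_000, 35_000, 30_000]
discount_rate = 0.05

npv = sum(amount / (1 + discount_rate) ** year
          for year, amount in enumerate(net_benefits))
print(f"Net present value: ${npv:,.0f}")  # roughly $28,800 at a 5% rate
```

Because each future amount is divided by (1 + rate) raised to the year, a higher discount rate shrinks distant benefits more sharply, which is why the choice of rate can change whether a program shows a positive net present value.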
discriminant validity:
the extent to which the measures of two or more constructs that are not theoretically related do not correlate
with each other
disproportionate stratified sample:
similar to stratified sample (see definition of stratified sampling below), except that one or more strata are
randomly sampled, so that the number of cases selected is greater (or less) than the fraction/proportion that
that stratum is of the whole population
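A minimal Python sketch of the idea, with invented strata and sample sizes: the smaller stratum is deliberately oversampled so it can be analyzed on its own (estimates would later be reweighted to the population):

```python
import random

random.seed(1)  # for a reproducible illustration

# Hypothetical sampling frame split into two strata of unequal size
strata = {
    "urban clients": list(range(900)),  # 90% of the population
    "rural clients": list(range(100)),  # 10% of the population
}

# A proportionate sample of 100 would include only about 10 rural clients;
# here 40 are drawn instead (a disproportionately large fraction).
sample_sizes = {"urban clients": 60, "rural clients": 40}

sample = {name: random.sample(units, sample_sizes[name])
          for name, units in strata.items()}
for name, units in sample.items():
    print(name, len(units))
```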
distributional analysis:
in cost–benefit analysis, an analysis of the net costs or benefits of an investment to different groups in society
distributional weights:
in distributional analysis conducted in cost–benefit analysis, the weights assigned to the net costs or benefits
to different groups in society to sum them up to arrive at net social costs of benefits; net costs or benefits to
disadvantaged groups, such as low-income people, are typically assigned higher weights so that the utility of
disadvantaged groups is given higher recognition in the social welfare calculus
double-loop learning:
learning that critically assesses existing organizational goals and priorities in light of evidence and includes
options for adopting new goals and objectives
duty ethics:
human actions are expected to be guided by universal standards of right and wrong that apply at all times
economic efficiency:
the net social value of a project or program, estimated by subtracting the discounted social costs from the
discounted social benefits
effectiveness:
the extent to which the observed outcomes of a program are consistent with the intended objectives; also,
the extent to which the observed outcomes can be attributed to the program
efficiency:
attaining the most program outputs possible for each program input (usually expressed in monetary units)
empowerment evaluation:
the use of evaluation concepts, techniques, and findings to facilitate program managers and staff evaluating
their own programs and thus improving practice and fostering self-determination in organizations
environmental factors:
organizational, institutional, and interpersonal factors in the surroundings of a program that may have an
effect on its operations and the intended outcomes
environmental scan:
an analysis of trends and key factors (both positive and negative) in an organization’s environment that may
have an impact on it now or in the future
epidemiological databases:
in needs assessments, databases providing information on the prevalence and incidence of factors related to
or even predictive of specific needs
episteme:
in Greek philosophy, universal knowledge, sometimes equated with scientific knowledge
epistemological beliefs:
beliefs about how we can know ourselves and the physical and social world around us
epistemology:
the philosophical analysis of theories of knowledge; in evaluation, how we know what we know as evaluators
and researchers
ethnographies:
qualitative studies that rely on developing and conveying an authentic sense of the knowledge and belief
systems that constitute a given culture

631
evaluation:
the systematic assessment of a program or policy using absolute (merit-based) or relative (worth-based)
criteria
evaluation assessment:
a systematic study of the options, including their strengths and weaknesses, when a program evaluation is
being planned
evaluation study:
the process of designing, conducting, and reporting the results of a program evaluation
evaluative cultures:
organizational cultures that emphasize the importance of evidence-based decision making and learning, in
which evaluative approaches are widely accepted and used
evidence-based decision making:
a philosophy of management that emphasizes the importance of using defensible evidence as a basis for
making decisions—sometimes associated with performance management
ex ante analyses:
analyses (usually cost–benefit, but they can also be cost–effectiveness or cost–utility analysis) that are done
before a program, policy, or project is implemented
ex ante evaluation:
an evaluation that is conducted before a program is implemented
existence value:
in environmental economics, the value of an asset to be contemplated, viewed or admired, or left untouched
to serve as wildlife habitat
exogenous:
characteristics that affect the program but are not affected by it, such as demographic attributes of
stakeholders
experimental design:
a research design involving one or more treatment (program) and control groups, where program and
control participants are randomly assigned to the groups, ensuring that the groups are statistically equivalent
except for the experience of participating in the program itself
experimental diffusion:
see diffusion of treatments
experimental research:
see: experimental design
ex post analyses:
analyses that are done after a policy, program, or project is implemented
ex post evaluation:
an evaluation that is conducted after a program has been implemented
External Accountability (EA) approach:
development and implementation of a primarily top-down, externally mandated performance measurement
systems, focused on external performance-based account-giving
external validity:
the extent to which the results of an evaluation can be generalized to other times, other people, other
treatments, and other places
externality:
in economics, a “good” or “bad” affecting individuals not involved in decisions regarding its production or
consumption; its price does not reflect these effects
face validity:
where an evaluator or experts judge that a measurement instrument appears to be adequately measuring the
construct that it is intended to measure
focus group:
a group of persons (usually a maximum of 12) selected for their relevance for a particular evaluation
question, who discuss their individual and collective opinions; usually focus groups are facilitated by one or
more persons who guide the discussion and record the proceedings for further qualitative analysis
formative evaluation:
an evaluation designed to provide feedback and advice for improving a program
gaming performance measures:
occurs in situations where unintended, less desirable behaviors result from the implementation of
performance measures intended to improve performance and accountability
goal:
a broad statement of intended outcomes for a program, line of business, or organization—goals are typically
intended to guide the formation of (more specific) objectives that are linked to the goals
grey literature:
research that has not been published commercially, but is made publicly available (examples include
government reports, nonprofit reports, and databases of collected research information)
halo effect:
in surveys or interviews, the risk that if overall ratings are solicited first, the initial overall rating will “color”
subsequent ratings
Hawthorne effect:
a construct validity threat where there are unintended results caused by the subjects knowing that they are
participants in an evaluation process and thus behaving differently than they would if there was no
evaluation being conducted
health rating method:
in cost–utility analysis, health ratings are calculated from questionnaires or interviews asking respondents to
numerically rank health states to derive QALY
hedonic price:
in economics, the implicit price of a good’s attributes estimated from expenditures on goods or services with
multiple attributes
history:
an internal validity threat where changes in the program environment coincide with or mask program
effects, biasing the results of the evaluation
holding constant:
the process of using either research design or statistics to isolate one intended cause-and-effect linkage so
that it can be tested with evidence
holistic approach:
seeking patterns that provide an overall understanding of the evaluation data, including and integrating the
perspectives of different stakeholders
hypothesis:
statement(s), structured in an if-then format to examine cause and effect, intended to be testable
implementation activities:
statements of what needs to happen to get a program to produce outputs; they focus on program
implementation actions and not on program outcomes
implicit design:
a posttest-only design with no control group—the evaluation occurs after the program is implemented, and
there are no non-program comparison groups
incommensurable:
two theories or approaches are incommensurable when there is no neutral language with which to compare
them; in the philosophy of science, incommensurability of scientific theories would entail not being able to
compare the two theories to determine which one is better supported by existing evidence
incremental effects:
outcome changes that result from a program (see attribution)
independent variables:
an observable characteristic of the units of analysis that we expect to cause some other variable; in a research
design where we have a program and a control group, the presence or absence of the program for each unit
of analysis becomes an independent variable

633
index:
a measure based on combining the data (either weighted or unweighted) from two or more other measures,
usually from a survey
indirect costs:
costs caused by a policy, program, or project that occur in the environment and are not intended
individually necessary and jointly sufficient conditions:
when we speak about conditions for determining whether a relationship between two variables is causal, we
specify three criteria: temporal asymmetry, covariation, and no plausible rival hypotheses; these are
individually necessary for causality and together are jointly sufficient to determine whether a cause and effect
relationship exists
inductive approach:
a process that begins with data and constructs patterns that can be generalized
instrumental use (of evaluation):
direct uses of evaluation products in decision-making
instrumentation:
an internal validity threat where changes in the measurement instrument(s) used in the evaluation coincide
with the implementation of the program, making it very difficult to distinguish program effects from effects
due to changes in the measurement processes
intangible costs:
costs that cannot easily be expressed in dollar terms
interaction effects:
in multivariate statistical analysis, interaction effects are the joint, nonadditive effects of two or more
independent variables on a dependent variable
intercoder reliability:
a calculation of the extent to which individual analysts’ decisions in coding qualitative data are similar
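The simplest version of the calculation is percent agreement, as in this Python sketch with hypothetical codes assigned by two analysts; chance-corrected statistics such as Cohen's kappa are commonly reported as well:

```python
# Hypothetical theme codes assigned by two analysts to the same ten excerpts
coder_a = ["trust", "access", "trust", "cost", "access",
           "trust", "cost", "access", "trust", "cost"]
coder_b = ["trust", "access", "cost", "cost", "access",
           "trust", "cost", "trust", "trust", "cost"]

agreements = sum(a == b for a, b in zip(coder_a, coder_b))
print(f"Intercoder agreement: {agreements / len(coder_a):.0%}")  # 80% here
```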
interest rate:
see: real prices and interest rates and nominal prices/costs and interest rates
Internal Learning (IL) approach:
a performance measurement approach intended to satisfy both external accountability and performance
improvement expectations; managers and frontline workers are engaged, and the external reporting
requirements are intended to be buffered by developing an internal learning culture
internal structure validity:
validity related to the coherence of a pool of items that are collectively intended to be a measure of a
construct—can be estimated using multivariate statistical methods, such as factor analysis
internal validity:
the extent to which there are no plausible rival hypotheses that could explain the linkage between a program
and its observed outcomes—an internally valid research design eliminates all plausible rival hypotheses,
allowing a “clean” test of the linkage
interpretivism:
sometimes called antipositivism, this perspective assumes that our descriptions of objects, be they people,
social programs, or institutions, are always the product of interpretation, not neutral reports of our
observations
interrupted time-series designs:
research designs that feature a before-versus-after comparison of an outcome variable—that is, multiple
observations of the variable before the program is implemented are compared with multiple observations of
the same variable after program implementation
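As a simplified numerical sketch (the monthly counts are hypothetical), the comparison at the heart of the design contrasts the series before and after the intervention point; a full analysis would model level, trend, and seasonality rather than compare means alone:

```python
import statistics

# Hypothetical monthly counts of an outcome, twelve months before and
# twelve months after program implementation
before = [42, 45, 41, 44, 46, 43, 45, 44, 42, 46, 45, 43]
after = [38, 36, 37, 35, 34, 36, 33, 35, 34, 33, 32, 34]

print("mean before:", round(statistics.mean(before), 1))  # 43.8
print("mean after: ", round(statistics.mean(after), 1))   # 34.8
```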
interval level of measurement:
a level of measurement where there is a unit of measurement; that is, the distances between adjacent values of the variable are equal
intervals, but there is no natural zero value in the scale
knowledge management:
the strategies and processes in organizations for acquiring, organizing, distributing, and using knowledge for
management and decision making

634
league tables:
a set of data for ranking or comparing performance measures of organizations or institutions
learning organization:
an organization that is characterized by double-loop learning, that is, acquiring and using information to
correct its performance in relation to current objectives as well as assessing and even changing its objectives
level of confidence:
in generalizing to a population from the description of a sample (e.g., the mean), how much error we are
willing to tolerate in estimating a range of values that are intended to include the population mean—the
higher the level of confidence we pick (e.g., the 99% level instead of the 95% level), the greater the
likelihood that our range of values (our confidence interval) will, in fact, capture the true population mean
levels of analysis problem:
a situation in performance measurement where performance data for one level in an organization (e.g., the
program level) are used to (invalidly) infer performance at another level (e.g., individuals working within the
programs)
levels of measurement:
a hierarchy of measurement procedures that begins with classification (nominal measurement), proceeds
through ranking (ordinal measurement), then to interval (counting the unit amounts of a characteristic),
and ends with ratio measures (interval measures that have a natural zero point)
Likert statements:
items worded so that survey respondents can respond by agreeing or disagreeing with the statement, usually
on a 5-point scale from strongly agree to strongly disagree
lines of evidence:
see: multiple independent lines of evidence
logic models:
see: program logic model
longitudinal:
observations of variables for units of analysis over time
lost output method:
in economics, the estimation of the social value of lost output arising as a consequence of an action or event
main effects:
in analysis of variance with two or more independent variables, the main effects are the statistical
relationships between, in turn, each of the independent variables and the dependent variable
marginal cost:
the cost of producing one additional unit of output
maturation:
an internal validity threat where natural changes over time in the subjects being studied coincide with the
predicted program effects
maximum variation sampling:
sampling deliberately to get a wide range of variation on characteristics of interest; this documents unique, diverse, or
common patterns that occur across variations
means–ends relationship:
a causal relationship between or among factors such that one is affected by the other(s)—one is said to cause
the other
measurement:
the procedures that we use to translate a construct into observable data
measurement instrument:
the instrument that implements the procedures we use to translate a construct into observable data
measurement scale:
a measuring instrument that is divided into units or categories (nominal, ordinal, interval/ratio)
measurement validity:
the validity of the empirical constructs (the measured variables) in the empirical plane in their representation
of the concepts in the theoretical plane

635
mechanisms:
see: program theories
meta-analysis:
a synthesis of existing program evaluation studies in a given area, designed to summarize current knowledge
about a particular type of program
meta-evaluation:
the evaluation of one or more completed evaluation projects
methods:
approaches to the collection of data for an evaluation
methodologies:
the strategies or designs that underlie choices of method; for example, experimental research is a
methodology that encompasses many methods, whereas random sampling is one method for collecting
quantitative data
mix of methods/mixed methods:
utilizing a combination of qualitative and quantitative methods in an evaluation
mixed methods:
see: mix of methods
mixed sampling strategies:
see: mix of methods
mortality:
an internal validity threat where the withdrawal of subjects during the evaluation process interferes with
before–after comparisons
multiple independent lines of evidence:
in an evaluation, findings from different perspectives and methods that facilitate the triangulation of evidence in assessing a program's merit and worth
multivariate statistical analysis:
statistical methods that allow for the simultaneous assessment of the influence of two or more independent
variables on one or more dependent variables
naturalistic approach:
an approach to evaluation that generally does not involve manipulating the program setting when gathering
information, instead relying on primarily qualitative methodologies that reduce the intrusiveness of the
evaluation
necessary condition:
an event or factor that must occur for a program to be successful but whose occurrence does not guarantee
the program’s success
needs assessment:
a study that measures the nature and extent of the need for a program, conducted either before a new
program is developed or during its lifetime
neo-liberal(ism):
a philosophical and political movement particularly important since the early 1980s that is characterized by
an emphasis on private sector-related values (efficiency, reduced government regulation, reduced
government expenditures) to reform governmental institutions and make government policy
net present benefit:
the total value of program benefits expressed in current dollars minus the total value of program costs
expressed in current dollars
net present value:
the present monetary value of the benefits less the present value of the costs
net social benefit:
the economic value of a project or program once net present (discounted) costs have been subtracted from
net present (discounted) benefits
net social value:
see: net social benefit
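The preceding entries describe a calculation that can be sketched briefly. In this hypothetical Python example (the dollar amounts and the 5% discount rate are invented), future benefits and costs are discounted to present values, and the difference is the net present value (net social benefit):

```python
# Hypothetical yearly streams of benefits and costs; index 0 is the current year
benefits = [0, 40_000, 50_000, 60_000]
costs = [80_000, 10_000, 10_000, 10_000]
rate = 0.05  # assumed real discount rate of 5%

def present_value(stream, r):
    # Divide each year-t amount by (1 + r)**t to express it in today's dollars
    return sum(amount / (1 + r) ** t for t, amount in enumerate(stream))

npv = present_value(benefits, rate) - present_value(costs, rate)
print(f"Net present value: ${npv:,.0f}")
```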

636
new public management:
an approach to public sector reform that emphasizes business-like practices for organizations, including
performance measurement and managing for results
nominal:
the most basic level of measurement, where the variable consists of two or more mutually exclusive
categories
nominal prices/costs and interest rates:
prices and interest rates in current values without any adjustment for inflation
nonrecursive causal models:
quantitative causal models that allow reciprocal (two-way) causal relationships or feedback loops among the variables in the model; models restricted to one-way causal relationships are termed recursive
nous:
in Greek philosophy, intelligence or intellectual ability
objective:
a statement of intended outcomes that is focused and time specific, that is, intended to be achievable in a
specified time frame
objectivism:
the assumption that objects exist as meaningful entities independently of human consciousness and
experience
objectivity:
a two-stage process involving scrutable methods and replication of findings by independent, evidence-based
testing of hypotheses or answering of research questions
observables:
when we translate constructs into (measurable) variables, these variables are the observables in the evaluation
open-ended questions:
questions where the respondent can answer in his or her own words, and then categories are created by the
analyst to classify the responses after the data are collected
open system:
a bounded structure of means–ends relationships that affects and is affected by its environment
open systems approach:
conceptualizing programs as open systems, that is, sets of means–ends linkages that affect and are affected by
their environment
open systems metaphor:
a metaphor from biology or from engineering that offers a way of describing programs as open systems
operating costs:
costs associated with items that contribute to the operation of a program, such as salaries and supplies
operationalized:
when we measure constructs, we sometimes say that the constructs have been translated into measurement
operations that are intended to collect data
opportunistic sampling:
a sampling strategy where participants are selected based on their connection to emerging research questions
opportunity costs:
the value of the next best economic activity that would be forgone if a project or program proceeds
ordinal level of measurement:
a level of measurement where the variable is a set of categories that are ranked on some underlying
dimension
organizational logic models:
logic models for whole organizations where business units are linked to organizational goals through
business-line objectives and, hence, to strategies and performance measures
outcome mapping:
an evaluative strategy that focuses on monitoring the performance of a program by tracking the effects or
influences of the program on stakeholders, including those who are not direct program recipients

637
output distortions:
occur in situations where performance results are “adjusted” or gamed so that they line up with performance
expectations
paradigm:
a particular way of seeing and interpreting the world—akin to a belief system
parametric statistics:
statistical methods that are used for interval/ratio-level data
patched-up research designs:
where several research designs in program evaluations have been combined with the intention of reducing
the weaknesses in any one of them
path analysis:
a technique in recursive causal modeling that examines the linkages among constructs in a program, summarizing their correlations and their statistical significance with respect to an outcome variable
performance dialogues:
an emerging perspective on performance measurement and performance management internationally that emphasizes regular organizational performance discussions and learning forums among internal stakeholders
performance management:
organizational management that relies on evidence about policy and program accomplishments to connect
strategic priorities to outcomes and make decisions about current and future directions
performance management cycle:
a normative model of organizational planning and actions that emphasizes the importance of stating clear
goals and objectives, translating these into policies and programs, implementing them, and then assessing
and reporting outcomes so that the goals and objectives can be appropriately modified
performance measurement:
the process of designing and implementing quantitative and qualitative measures of program results,
including outputs and outcomes
performance measures:
quantitative and qualitative measures of program or organizational results, including outputs and outcomes
phenomenology:
a philosophical approach that assumes our culture gives us ready-made interpretations of objects in the world, and that focuses on trying to get past these ready-made meanings in our interpretations of phenomena
philosophical pragmatism:
see: pragmatism
phronesis:
an approach to ethical decision-making that is attributed to Aristotle and emphasizes practical, multi-faceted
situation-specific (moral) knowledge and experience as the foundation for day-to-day practice
plausible rival hypotheses:
variables that are shown either by evidence or by judgment to influence the relationship between a program
and its intended outcome(s) in a manner not originally hypothesized
political/cultural perspective:
a view of organizations that emphasizes the people dynamics in organizations, rather than the systems and
structures in which they are embedded
political culture:
the constellation of values, attitudes, beliefs, and behavioral propensities that characterize the political
relationships among individuals, groups, and institutions in a society
politics:
the authoritative (either formally or informally) allocation of values within organizations
population:
a group of people, who may or may not be from the same geographic area, who receive or could receive
services from public sector or nonprofit organizations

638
positivist:
a philosophical view of the world that assumes that our perceptions are factual, that there is a reality we have access to through our perceptions, and that the process of testing hypotheses involves comparing predictions to patterns of facts
post-test-only design:
a research design where measurements occur only after participants have been exposed to the program
practical wisdom:
see: phronesis
pragmatic stance:
mixing qualitative and quantitative methodologies in ways that are intended to be situationally appropriate
pragmatism:
the philosophical view that theory, including epistemological beliefs, is guided by practice; a pragmatic view
of evaluation is that different methodologies, rather than being tied to particular underlying epistemologies,
can be employed by evaluators situationally in ways that mix qualitative and quantitative methodologies
praxis:
in Greek philosophy, action in the world, sometimes equated with (professional) practice
predictive validity:
the extent to which a measure of one construct can be used to predict the measures of other constructs in
the future
pre-test:
a test of a measurement instrument prior to its actual use, designed to identify and correct problems with
the instrument
pre- and post-test assessments:
see: pre-test–post-test design, below
pre-test–post-test design:
a research design where measurements occur before and after participants are exposed to the program, and the two sets of results are compared
primary data:
data gathered by the evaluator specifically for a current needs assessment or evaluation
procedural judgment:
picking a method to complete a task in an evaluation—for example, choosing focus groups or interviews as a
way to involve program providers in an evaluation
procedural organizations:
organizations where work tasks rely on processes to produce outputs that are visible and countable but
produce outcomes that are less visible; they can produce output measures more readily than outcome
measures—military organizations are an example
process use (of evaluation):
the effects that participating in the process of designing and implementing an evaluation has on those involved, apart from any use of the evaluation findings themselves
production organizations:
organizations where work tasks are clear and repetitive, and the results are visible and countable; they are
most likely to be able to build performance measurement systems that include outputs and outcomes
Professional Engagement Regime (PR):
primarily bottom-up development and implementation of performance measures, relying on those in the
organization (managers and workers) taking the lead, and focusing primarily on performance improvement
professional judgment:
combining experience, which is influenced by ethics, beliefs, values, and expectations, with evidence to make
decisions and construct findings, conclusions, and recommendations in program evaluations
program:
a set of related, purposive activities that is intended to achieve one or several related objectives
program activities:
the work done in a program that produces the program outputs
program components:
major clusters of activities in a program that are intended to drive the process of producing outcomes

639
program constructs:
words or phrases that describe key features of a program
program effectiveness:
the extent to which a program achieves its intended outcomes
program environment:
the surroundings and conditions within which a program is situated
program evaluation:
a systematic process for gathering and interpreting information intended to answer questions about a
program
program impacts:
longer term outcomes that are attributable to the program
program implementation:
converting program inputs into the activities that are needed to produce outputs
program inputs:
the resources consumed by program activities
program logic model:
a way of representing a program as an open system that categorizes program activities and outlines the intended flow from activities to outputs and, in turn, to outcomes
program logics:
models of programs that categorize program activities and link activities to results (outputs and outcomes)
program objectives:
statements of intended outcomes for programs, which should ideally (a) specify the target group, the
magnitude and direction of the expected change, and the time frame for achieving the result; and (b) be
measurable
program outcomes (intended):
the intended results occurring in the environment of a program
program outcomes (observed):
what a program appears to have achieved, discerned through a process of measurement
program outputs:
the work produced by program activities
program processes:
the activities in a program that produce its outputs
program rationale:
the ways (if any) that a program fits into the current and emerging priorities of the government or agency
that has sponsored/funded it
program theories:
ways of thinking about programs (evidence-based and otherwise) that reflect our understanding of the causal
relationships among the constructs that can be included in a program logic model; program theory helps us
understand why a program logic model is constructed the way it is
Progressive Movement:
around the turn of the 20th century, a movement of reformers who wanted to introduce political and
organizational changes that would eliminate the perceived ills of U.S. public sector governance; a response to
the widespread concern about political corruption and machine politics in American state and local
governments
propensity score analysis:
a statistical technique wherein sociodemographic characteristics of all participants (program and control) are
used to predict the likelihood/probability that each person is in either the program or the control group
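A minimal sketch of the idea, assuming the scikit-learn library is available and using invented data: a logistic regression predicts, from sociodemographic characteristics, the probability (the propensity score) that each person is in the program group rather than the control group.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical characteristics (age, years of education) and group membership (1 = program, 0 = control)
X = np.array([[25, 12], [32, 16], [41, 10], [29, 14], [55, 11], [38, 15]])
in_program = np.array([1, 1, 0, 1, 0, 0])

model = LogisticRegression().fit(X, in_program)
propensity = model.predict_proba(X)[:, 1]  # estimated probability of being in the program group
print(np.round(propensity, 2))
# Program and control cases with similar propensity scores can then be matched or weighted.
```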
proportionate stratified sample:
similar to a stratified sample (see the definition below), but in each stratum, the number of cases selected is
proportional to the fraction/percentage that that stratum is of the whole population
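As a simple arithmetic sketch with invented strata, each stratum's share of a 200-case sample mirrors its share of the population:

```python
# Hypothetical strata and a planned overall sample of 200 cases
strata = {"urban": 6000, "suburban": 3000, "rural": 1000}
population = sum(strata.values())
sample_size = 200

for name, size in strata.items():
    allocation = round(sample_size * size / population)  # proportional allocation
    print(f"{name}: {allocation} cases ({size / population:.0%} of the population)")
# urban: 120 cases, suburban: 60 cases, rural: 20 cases
```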
proxy measurement:
a measure that substitutes for another; for example, using measures of outputs to measure outcomes

640
purposeful sampling:
see: purposive sampling strategies
purposive sampling strategies:
the deliberate selection of specific cases; selection strategies used in qualitative research as an alternative to
random selection
qualitative evaluation approaches:
evaluation methods that rely on narrative, that is, nonnumerical, data
qualitative program evaluations:
evaluations that rely on words (instead of numbers) as the principal source of data
quality-adjusted life-years (QALY):
a method of estimating utility that assigns a preference weight to each health state, determines the time spent in each state, and estimates quality-adjusted life expectancy as the sum of the products of each preference weight and the time spent in that state
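As a worked sketch with invented preference weights and durations, the quality-adjusted total is the sum of the products of each health-state weight and the time spent in that state:

```python
# Hypothetical health states: (preference weight on a 0-1 scale, years spent in the state)
health_states = [
    (1.0, 2.0),  # 2 years in full health
    (0.7, 5.0),  # 5 years in a moderately impaired state
    (0.4, 3.0),  # 3 years in a severely impaired state
]

qalys = sum(weight * years for weight, years in health_states)
print(f"Quality-adjusted life-years: {qalys:.1f}")  # 2.0 + 3.5 + 1.2 = 6.7
```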
QALY threshold:
in cost–utility analysis, a dollar amount used to determine whether an intervention generates a net positive
social value; interventions costing less than the threshold amount per QALY generated are considered to
generate net positive social value
quantitative evaluation methods:
evaluation methods that rely on numerical sources of data
quasi-experimental:
research designs that do not involve random assignment to program and control groups but do include
comparisons (comparison groups or time series) that make it easier to sort out the cause-and-effect linkages
that are being tested
randomized experiments/randomized controlled trials (RCTs):
research designs that involve randomly assigning the units of analysis (usually people) to program and
control groups and comparing the groups in terms of outcome variables
random sample:
a sample that is selected using a process where each member of the population has an equal or known
chance of being selected, which enables the research results to be generalized to the whole population
ratchet effect:
a tendency for performance targets to be lowered over time as agencies fail to meet them
ratio measures:
a level of measurement where the intervals between values of the variable are equal and there is a natural zero value in the scale
real benefits:
monetary value of benefits after being adjusted for inflation
real costs:
monetary value of costs after being adjusted for inflation
realist evaluation:
developing program-related knowledge based on the context, mechanisms, and outcomes (CMOs)
associated with program successes and failures
real prices and interest rates:
prices and interest rates adjusted for inflation
real rates:
see: real prices and interest rates
recommendations:
in an evaluation report, suggested actions for the client(s) of the evaluation that are based on the findings
and conclusions of the report
reflective judgment:
a judgment, drawing on an evaluator's knowledge and experience, about how to solve a methodological or ethical challenge in an evaluation; for example, whether to include a non-program comparison group in a situation where the comparison group would not have access to the program for the duration of the evaluation

641
regression analysis:
statistical analysis process used to estimate relationships among independent and dependent variables
relevance:
the extent to which the objectives of a program are connected to the assessed needs and/or government
priorities
reliability:
the extent to which a measurement instrument produces consistent results over repeated applications
representative sample:
when the statistical characteristics of a sample (e.g., demographic characteristics) match those same
characteristics for the population, we say that the sample is representative
research design:
the overall method and procedures that specify the comparisons that will be made in an evaluation
resentful demoralization:
a construct validity threat where members of the control group react negatively as a result of being in the
control group, which biases the results of the evaluation
response process validity:
the extent to which the respondents to an instrument that is being validated demonstrate engagement and
sincerity in the way they participate
response set:
a potential issue in sets of survey statements where negatively and positively worded statements are not mingled, inducing respondents who are in a hurry to pick one response category ("agree," for example) and check off that response from top to bottom
response-shift bias:
participants use a pre-program frame of reference to estimate their knowledge and skills before participating
in a program, and once they have been through the program have a different frame of reference for rating
the program effects on them
results-based management:
a philosophy of management that emphasizes the importance of program or organizational evidence or
results in managing the organization, its programs, and its people
retrospective pre-test:
a case study research design where "pre-program" variables are measured (typically after program participation) by asking participants to retrospectively estimate their pre-program level of knowledge, skill, or competence (whatever the outcome variable is)
revealed preferences methods:
in economics, methods used to calculate the social value of the preferences of individuals revealed through
their market behavior
rival hypotheses:
factors in the environment of a program that operate on both the program and its intended outcomes in
such a way that their effects could be mistaken for the outcomes that the program itself produces
robustness:
resilience or methodological defensibility of a procedure or process
sampling:
the selection of cases or units of analysis from a population so that we can generalize the findings from the
sample to the population
sampling error:
estimated range of population percentages that could be true given a particular sample percentage—the
greater the sample size, the smaller the sampling error
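As a rough sketch of the sample-size relationship (using the normal approximation at the 95% level of confidence and a hypothetical sample percentage of 50%, the worst case for sampling error):

```python
import math

p = 0.5  # hypothetical sample proportion
for n in (100, 400, 1600):
    margin = 1.96 * math.sqrt(p * (1 - p) / n)  # approximate 95% margin of error
    print(f"n = {n}: sampling error of about +/- {margin:.1%}")
# Quadrupling the sample size roughly halves the sampling error.
```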
sampling strategy/procedure:
the process through which a sample is selected
scrutability:
characteristics of methods and procedures in research that make them transparent and replicable

642
secondary data:
data that have been previously gathered for purposes other than the current needs assessment or evaluation
selection:
an internal validity threat where differences between the program group and the control group before the
program is implemented could account for observed differences in outcomes between the program and
control groups
selection-based interactions:
where selection interacts with other threats to internal validity to bias the results of an evaluation
sensitivity analysis:
an analysis in which the major assumptions in an analytical exercise are varied within plausible ranges to evaluate the effects on projected impacts or outcomes
shoestring evaluation:
a combination of tools designed to facilitate methodologically sound evaluations while operating under tight
budget, time, and data constraints
single time-series design:
a pre-test–post-test design, with no control group, where there are multiple observations before and after the
program is implemented (see interrupted time-series designs)
skip factor:
a fixed number that defines how many cases need to be counted from a population list of all cases before the
next case is drawn; skip factors are used in systematic sampling
snowball sampling:
a sampling strategy where additional participants are identified based on information provided by previous
participants
social constructionism:
an epistemological view that it is the social context that produces the meanings individuals use
social desirability response bias:
a bias that can occur in surveys focusing on "undesirable" attitudes or behaviors, where respondents alter their responses to correspond to answers they feel are more socially positive
social opportunity cost of capital:
the real rate of return on a marginal investment taking into account the range of public and private sector
investment alternatives
social rate of time preference:
the interest rate at which society is willing to substitute present for future consumption or, equivalently, the
interest rate society requires to postpone consumption
sophia:
in Greek philosophy, wisdom, sometimes equated with philosophical wisdom
split-half reliability:
the use of two parallel sets of Likert statements, which are examined to see whether the results are consistent
across the two versions of the measures
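A minimal sketch using invented half-scores for eight respondents and the numpy library: the two halves are correlated, and (as one common follow-up) the Spearman-Brown formula adjusts the correlation to estimate the reliability of the full set of statements.

```python
import numpy as np

# Hypothetical total scores on two parallel halves of a Likert scale
half_a = np.array([12, 15, 9, 18, 14, 11, 16, 13])
half_b = np.array([11, 16, 10, 17, 15, 10, 15, 14])

r = np.corrcoef(half_a, half_b)[0, 1]  # Pearson correlation between the half-scores
full_scale = 2 * r / (1 + r)           # Spearman-Brown estimate for the full-length scale
print(f"Split-half correlation: {r:.2f}; Spearman-Brown estimate: {full_scale:.2f}")
```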
standard gamble method:
in cost–utility analysis, to derive QALY, respondents are asked to make choices in decision tree scenarios
with various timings of life, death, or survival with impaired health
standing:
in economic evaluation, the status determining whether the preferences of an individual or members of a
group are included or excluded in estimates of social value
stated preferences methods:
in economics, survey-based methods used to elicit social value for (nonmarket) assets, goods, or services
static-group comparison:
a design where there is a program/no-program comparison, but there are no baseline measurements, so we cannot control for the following: pre-program differences in the two groups, maturation of the participants, attrition, or selection-based interaction effects

643
statistical conclusions validity:
the extent to which we can be confident that we have met the statistical requirements needed to calculate the existence and strength of the covariation between the independent (cause) and dependent (effect) variables
statistical regression:
the tendency whereby extreme scores on a pre-test tend to regress toward the mean of the distribution for
that variable in a post-test
statistically significant:
refers to the likelihood that a given statistical test result could have occurred by chance if the null hypothesis
is true; conventionally, criteria are established that are used to either accept or reject the null, and if a test
result is consistent with a decision to reject the null hypothesis, we say that the test outcome is statistically
significant
strategy:
summary of related activities that are intended to contribute to the achievement of an objective
stratified sample/stratified random sample/stratified purposeful sample:
a sample that divides a population into groups or strata and then selects cases from each one; in a stratified random sample the within-stratum selection is probabilistic (random), whereas in a stratified purposeful sample cases are selected deliberately within each stratum
summative evaluation:
an evaluation of the merit or worth of a program, designed to provide feedback and advice about whether or
not a program should be continued, expanded, or contracted
survey:
a measuring instrument where information is gathered from units of analysis (usually people) generally
through the use of a questionnaire that usually combines open- and closed-ended questions
symbolic use (of evaluation):
the evaluation is used to rationalize or legitimize decisions made for political reasons
systematic review:
a structured comparison and synthesis of evaluations or research studies that is intended to distill common
themes or summarize evidence that pertains to a research question
systematic sample:
a sample drawn where the ratio of the population size to the sample size is used to calculate a skip factor
(defined above as intervals from which cases are sampled); the first case in the sample is randomly selected in
the first interval, and from that point onward, each additional case is selected by counting a fixed number
(the skip factor) and then selecting that case
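A small sketch of the procedure with a hypothetical list of 1,000 cases and a desired sample of 50: the skip factor is 1,000 / 50 = 20, the starting case is chosen at random within the first interval, and every 20th case is selected after that.

```python
import random

population = list(range(1, 1001))  # hypothetical list of 1,000 case IDs
sample_size = 50
skip = len(population) // sample_size  # skip factor = 20

random.seed(42)                      # fixed seed so the illustration is reproducible
start = random.randint(0, skip - 1)  # random start within the first interval
sample = population[start::skip]     # every 20th case from the starting point
print(len(sample), sample[:5])
```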
tacit knowledge:
the capacity we have as human beings to integrate “facts”—data and perceptions—into patterns we can use,
but that are very difficult to communicate verbally or textually to others
techne:
in Greek philosophy, knowledge that is involved in craftsmanship and is focused on doing tasks
technical efficiency:
attaining the most program outputs possible for each program input (usually expressed in monetary units)
technical judgments:
decisions that are routine, given a task at hand—an example would be using a random numbers table to
pick a random sample from a list of cases
technical/rational perspective:
a view of organizations as complex, rational means–ends systems that are designed to achieve purposive goals
temporal asymmetry:
where the independent variable precedes the dependent variable
testing:
a threat to internal validity where taking a pre-test familiarizes the participants with the measurement
instrument and unintentionally biases their responses to the post-test
thematic analysis:
the process of categorizing the ideas (words, phrases, sentences, paragraphs) in narrative data
theoretical sampling:
in qualitative evaluation, sampling cases to reflect theoretical expectations about their characteristics so as to
examine whether the actual patterns in the data collected correspond to the patterns predicted by the theory

644
theory-driven evaluation:
“unpacking” the causal structure of the program; testing the hypothesized linkages, generally facilitated by
program theories/logic models
theory of change:
ways of thinking about programs that reflect our understanding of causal relationships among the factors
that can be included in the design and implementation of the evaluation
theory of change response bias:
a tendency on the part of participants in programs to believe that the program “must” have made a
difference for them, resulting in a positive bias in their estimates of the amount of change
three-way analysis of variance:
a statistical analysis method, where there are two independent variables and one dependent variable, designed to determine whether there is a statistically significant relationship between, in turn, each of the independent variables and the dependent variable, as well as whether there is a statistically significant interaction between the independent variables
threshold effects:
effects that occur when a performance target results in organizational behaviors that distort the range of work activities in an organization so as to meet, rather than exceed, targets
time-declining rates of discount:
the use of lower rates of discount for later costs and benefits rather than for costs and benefits that occur
sooner
time series:
systematic measurement of variables over time
time trade-off method:
in cost–utility analysis, to derive QALY, subjective ratings of various combinations of length of life and
quality of life in that time are gathered
total economic value:
the value of ecological assets categorized into use and nonuse value
travel cost method:
a method used to estimate the use value of recreational amenities through calculation of the direct and
indirect costs incurred to travel to a recreation site
treatment groups:
persons (usually) who are provided with a program or some other intervention that is being evaluated
triangulation:
the process of collecting data from a variety of sources and/or by using a variety of measurement procedures
to answer an evaluation question
typical case sampling:
sampling by questioning knowledgeable staff or participants to identify who, or what, is typical of a program
unit of analysis:
the cases (often people) that are the main focus of an evaluation; we measure (observe) characteristics of the
units of analysis, and these observations become the data we analyze in the evaluation
utilization:
the extent to which the program evaluation process and results (findings, conclusions, and
recommendations) are deemed by stakeholders to be useful to them
utilization-focused evaluation:
an evaluative approach, pioneered by Michael Quinn Patton, that involves designing and implementing
evaluations to focus on the utilization of evaluation results by stakeholders
validity in measurement:
the extent to which a measuring instrument measures what it is intended to measure
value-for-money:
an important normative goal for public officials and other stakeholders who are concerned with whether
taxpayers and citizens are receiving efficient and effective programs and services for their tax dollars

645
variable (dependent):
a variable that we expect will be affected by one or more independent variables; in most evaluations, the
observed outcomes are dependent variables
variable (independent):
an observable characteristic of the units of analysis that we expect to cause some other variable; in a research
design where we have a program and a control group, the presence or absence of the program for each unit
of analysis becomes an independent variable
variables:
in program evaluations and performance measurement systems, variables are the products of our efforts to
measure constructs; they are the observables that take on discrete values across the units of analysis in an
evaluation and are analyzed and reported
verstehen:
understanding based on a process of learning the subjective realities of people
welfare economics:
a branch of economics that focuses on the utility-based measurement of the welfare of a society
willingness-to-accept:
in economics, a measure of the social value of a consequence based on the minimum sum of money
individuals need to be paid to accept that consequence
willingness-to-pay:
in economics, a measure of the social value of a consequence based on the maximum sum of money
individuals are willing to pay to secure that consequence

646

647
Index

Abercrombie, M. L. J., 497–498


Abolafia, M. Y., 506
Accountability, 4
performance measurement for, 343, 377–378, 411–429
performance paradox in public sector, 429–430
public, 377–378, 381, 400–401
rebalancing performance measurement systems focused on, 429–437
results-oriented approach to, 376–377
Action plans, 284
Adequacy, program, 27
After-only experimental design, 108
Alignment of program objectives with government and organizational goals, 66–67
Alkin, M. C., 206, 480
Allen, J., 195–196
Altschuld, J., 249, 250, 252, 257, 258
on phase III: post-assessment, 284
on phase II: the needs assessment, 268–270, 280, 283
on pre-assessment phase in needs assessment, 259–268
Ambiguous temporal sequence and internal validity, 121
American Evaluation Association (AEA), 454, 467, 506
on cultural competence in evaluation practice, 498
ethical guidelines, 487–488, 489–490 (table)
gold standard and, 102
Guiding Principles for Evaluators, 467, 487
American Journal of Evaluation, 487
Anderson, G., 213
Anderson, L. M., 32, 70, 303
Antirealist ontology, 210
Appropriateness, policy, 253
Ariel, B., 18–20, 122, 126, 466
Aristigueta, M. P., 374, 398
Arora, K., 125
Arsene, O., 493
Asadi-Lari, M., 256
“Assessing the Mental Health Service Needs of the Homeless: A Level-of-Care Approach,” 254
Association for Community Health Improvement, 262
Atkinson, G., 315, 330
Attributes, 175–176
Attribution, 14, 479
economic evaluation and, 303–304
performance measurement and, 358–361
program evaluations and, 358–361
Attrition/mortality and internal validity, 121
Austin, M. J., 362
Australasian Evaluation Society, 454, 461, 467–468
Averting behavior method, 312, 313 (table)
Axford, N., 265, 267

Baekgaard, M., 429–430

648
Bal, R., 363, 412
Balanced scorecard, 388
Bamberger, M., 194, 224
Barbier, E. B., 315
Barnes, H. V., 115
Baron, M., 453
Baseline measures, 16, 135, 360
in internal evaluation, 447
in retrospective pre-tests, 194
sources of data and, 179
in surveys, 192
Before-after designs, 133
Belfield, C. R., 323–328
Belief-related constructs, 183
in model of professional judgment process, 496 (table), 497–498
Benchmarks, 4
Benefits
categorizing and cataloging, 324–325
discount rate and, 313, 316–317, 326
monetizing, 315–316, 325–326
net present, 318
predicting, 315, 325
real, 316
Berk, R. A., 111
Berkowitz, S., 258
Berrueta-Clement, J. R., 115
Between-case analysis, 223–224
Bevan, G., 412, 414–416, 429
Bias
in decision making, 103
response shift, 195
social desirability response, 191
theory of change response, 191, 195
as validity problem, 168
Bickman, L., 9, 506
Big data analytics, 180–181
Bish, R., 425
Blair, T., 413, 415
Boardman, A., 308, 315
Body-worn cameras. See Police body-worn camera program, Rialto, California
Bourgeois, I., 452
Bradshaw, J., 255, 256
Brainstorming, 80–81
Breuer, E., 74
Brignall, S., 432, 459
Britain
assessing the “naming and shaming” approach to performance measurement in, 415–418
National Health Service as high-stakes environment in, 412–415
U.K. Job Retention and Rehabilitation Pilot, 223, 225
Britain, Troubled Families Program in, 166–167
as complex program, 302–303
mixed methods in, 225–226

649
needs assessment and, 251
qualitative evaluation report, 237
within-case analysis, 223
British Columbia CounterAttack program, 119
British Medical Journal, 301
Bryman, A., 224
Budget Transparency and Accountability Act, Canada, 377, 379, 419–424
Bush, G. W., 6, 349
Business, government as, 351–352

Calgary Homeless Foundation, 82, 83 (figure)


Calsyn, R. J., 276–277
Camberwell Assessment of Need, 252
Campbell, D. T., 69, 103, 118, 127, 129, 131, 134, 170, 390, 491
Campbell Collaboration, 69, 102
Canada
alignment of program objectives with government goals in, 67
Budget Transparency and Accountability Act, 377, 379, 419–424
Calgary Homeless Foundation, 82, 83 (figure)
Canada/Yukon Economic Development Agreement, 230
Canadian Audit and Accountability Foundation (CCAF-FCVI), 393, 398
Canadian Heritage Department, 387
Canadian Observatory on Homelessness, 250
community health needs assessment in New Brunswick, 285–290
complex logic model describing primary health care in, 89–91
Evaluation Report for the Smoke-Free Ontario Strategy, 359–360
government evaluation policy in, 7
Homeless Hub, 250
Institute for Clinical Evaluative Sciences, 254
joining internal and external uses of performance information in Lethbridge, Alberta, 425–429
legislator expected versus actual uses of performance reports in, 419–424
logic model for Canadian Evaluation Society Credentialed Evaluator program, 92, 93 (figure)
Management Accountability Framework, 350
New Public Management in, 350
Office of the Comptroller General (OCG), 464–465
Ontario Institute for Clinical Evaluative Sciences, 263
performance data linked to program evaluations in, 35–36
policy in, 10–11
Policy on Results, 67, 491
programs in, 11
resource alignment review in, 67
“tough on crime” policies in, 256–257
Treasury Board of Canada Secretariat (TBS) (See Treasury Board of Canada Secretariat (TBS))
WorkSafeBC, 358, 395–396
Canada/Yukon Economic Development Agreement, 230
Canadian Audit and Accountability Foundation (CCAF-FCVI), 393, 398
Canadian Evaluation Society (CES), 454, 461, 487
Credentialed Evaluator program, 92, 93 (figure), 507
Program Evaluation Standards, 467
Canadian Heritage Department, 387
Canadian Journal of Program Evaluation, 103
Canadian Observatory on Homelessness, 250
Carande-Kulis, V. G., 32

650
Carifio, J., 185–186
Carter, A., 301
Carter, C., 258
Carvel, J., 418
Cases, 175
See also Units of analysis
Case study designs, 34, 141
power of, 241–242
Catholic Health Association, 262, 265, 284
Causal analysis of needs, 280
Causal chain, 181
Causality
economic evaluation and, 303–304
in program evaluation, 12–14
Causal linkages, 33
tested in program logic models, 141–145
Causal relationships
attribution issue and, 303
defined, 12
established between variables, 99
gold standards and, 110
good evaluation theory and practice and, 490
internal validity and, 119
mechanisms and, 72
in Neighborhood Watch Evaluation, 137
in program logic models, 141
program theories and, 68
in qualitative relationships, 208, 220
Causes and effects, validity of, 197–198
CBA. See Cost–benefit analysis (CBA)
CEA. See Cost–effectiveness analysis (CEA)
Center for Evidence-Based Crime Policy, George Mason University, 18
Centre for Health Services and Policy Research, University of British Columbia, 89
Ceteris paribus assumption, 361
Chain sampling, 228
Chalip, L., 309
Chau, N., 254
Chelimsky, E., 15, 449, 461
Chen, H.-T., 14–15
Children’s Review Schedule, 252
Choice modeling method, 312, 313 (table)
Chouinard, J. A., 499
Christie, C. A., 102, 206
Church, M., 194
Clarke, A., 163
Clinton, B., 6, 307
Closed-ended questions, 231
Cochrane Collaboration, 32, 69, 102
Cochrane Handbook for Systematic Reviews of Interventions, 32, 301
Code of the American Institute of Certified Public Accountants (AICPA), 487
Coles, C., 494, 495
Collaboration, 389

651
Collaborative and participatory approaches, 215 (table)
Collaborative evaluation, 451
Collaborative services, 265
Communication in performance measurement design, 379–380, 433
Community Health Assessment Network of Manitoba, 255, 261, 280
Community health needs assessment in New Brunswick, 285–290
Comparative need, 255
Compensatory equalization of treatments, 116, 128
Compensatory rivalry, 128
Competencies for Canadian Evaluation Practice, 461
Complex interventions, 61–62
Complexity theory, 62–63
Complex problems, 60–61
Complicated interventions, 61–62
Complicated problems, 60–61
Conceptual uses of evaluations, 41
Concurrent validity, 171, 173
Confidence interval, 271
Confirmatory factor analysis, 173
Connelly, J. L., 254
Consensus on Health Economic Criteria, 301
Consequences in model of professional judgment process, 496 (table)
Consequentialism, 483–484
Constructivism, 210, 213
as criteria for judging quality and credibility of qualitative research, 215 (table)
Constructs, 51, 59–60, 165
beyond those in single programs, 387–390
expressed in relation to time, 175
involving prospective users in development of, 390–391
Likert statements and, 185–186
measured in evaluations, 166
in model of professional judgment process, 496 (table)
operationalizing, 166
performance measurement design and key, 385–386, 387 (table)
psychological or belief-related, 183
translated into observable performance measures, 391–395
validity types that related multiple measures to multiple, 173–175
Construct validity, 21, 68, 98, 118, 170–171
measurement validity component, 125–126
other problems in, 126–129
Content validity, 171, 172, 392
Context-dependent mediation, 130
Context-mechanism-outcomes (CMOs), 71–73
Contextual factors in program logics, 70–71
Contingent valuation method, 312, 313 (table)
Control groups, 4
Convenience sampling methods, 274
Convergent validity, 171, 174
Converse, P. D., 188
Cook, T. D., 118, 127, 129, 131, 170
Coping organizations, 386, 387 (table)
Corbeil, R., 80

652
Correlation, 12
Coryn, C. L., 74
Cost-based analyses, 21
See also Economic evaluation
Cost–benefit analysis (CBA), 5, 27, 299, 300, 308–309
growth in the 1930s, 307
High Scope/Perry Preschool Program example, 322–328
internal evaluation capacity, 448
standing in, 309–312, 324–325
steps in, 313–319
strengths and limitations of, 328–332
valuing nonmarket impacts, 312, 313 (table)
See also Economic evaluation
Cost–effectiveness analysis (CEA), 5, 26–27, 299, 300, 301, 307–308, 320–321
attribution and, 303
in needs assessment, 256
program complexity and, 302–303
steps in, 313–319
strengths and limitations of, 328–332
See also Economic evaluation
Cost of illness method, 312, 313 (table)
Costs
categorizing and cataloging, 324–325
comparison by computing net present value, 317–318
discount rate and, 313, 316–317, 326
intangible, 314–315
marginal, 257
monetizing, 315–316, 325–326
nominal, 316
predicting, 315, 325
real, 316
Cost–utility analysis (CUA), 299, 300, 321–322
steps in, 313–319
strengths and limitations of, 328–332
See also Economic evaluation
Counterfactuals, 53, 154, 358
Cousins, J. B., 451, 499
Covariation, 99
Craft organizations, 386, 387 (table)
Credibility and generalizability of qualitative findings, 237–239
Creswell, J. W., 213, 224, 226
Critical change criteria for judging quality and credibility of qualitative research, 215 (table)
Critical incidents, 500
Critical reflection, 500–501
Cronbach, L. J., 101
Cronbach’s alpha, 168, 185
Crotty, M., 209, 213–214
CUA. See Cost–utility analysis (CUA)
Cubitt, T. I., 69
Cultural competence in evaluation practice, 498–499
Culture, organizational, 433
evaluative, 449, 456–460

653
Cumulative Needs Care Monitor, 252

Dahler-Larsen, P., 425, 429, 453


Dart, J., 240
Darzi, L., 301
Data analysis, 38–39
Databases
administrative, 37
Big Data, 180, 342
capturing all data in analysis of, 120
existing, 179–180
experiential, 182
governmental, 179, 250, 254, 329
for needs assessments, 254, 261–262, 267
in performance measurement, 342, 356, 393, 399, 432
statistical conclusions validity and, 118
of survey data, 187–188
in systematic reviews, 69
Data collection, 38
from existing sources, 179–182
by program evaluators, 182
qualitative, 230–233
Data Resource Center for Children and Adolescent Health, U.S., 254
Data sources, 179–191
census information, 262–263
collected by the program evaluator, 182
epidemiological databases, 262
existing, 179–182
primary, 357
secondary, 262, 357
surveys as, 182–191
triangulation of, 239
Davies, R., 240
Dawson, R., 163
Day, L., 226
Decentralized performance measurement, 435–437
Decision environment in model of professional judgment process, 496 (table), 497
Decision making
consequentialism and, 483–484
evidence-based, 162
in model of professional judgment process, 496 (table)
needs assessment in, 257
to resolve needs and select solutions, 283–284
using economic evaluation, 319–320, 327–328
Declining rates of discount, 317
Decoupling, 431–432
De Felice, D., 240
Defensive expenditure method, 312, 313 (table)
De Lancer Julnes, P., 357, 373
Deliberative judgment, 488, 494
Demicheli, V., 301
Denzin, N. K., 207–208
Deontological ethics, 482

654
Dependent variables, 124
in body-worn camera experiment, 466
in causal relationships, 99
construct validity and, 125–126
inter-, 62
internal validity and, 118–119
in program logic models, 141, 165
statistical conclusions validity and, 118
Derse, A., 330
De Silva, M., 74
Deterrence theory, 79
Developmental evaluation, 15, 85, 458
Dewa, C., 254
Diffusion of treatments effects, 21, 127–128
Digital-Era Governance, 6
Dillman, D., 188
Discount rate, 313, 316–317, 326
Discrepancies identification in needs assessments, 269–278
Discriminant validity, 171, 174–175
Disproportionate stratified sampling, 272
Distortions, output, 418–419
Distributional analysis, 318–319, 327
Distributional weights, 319
Dixon, R., 412
Dobkin, P., 500
Donaldson, C., 315
Donaldson, S., 507
Donaldson, S. I., 507
Donnelly, J., 125
Double-loop learning, 457
Downs, A., 348
Dowswell, G., 363, 412
Dumitrache, I., 493
Duty ethics, 482–483

Earls, F., 174


Economic efficiency, 299–300, 308
types of, 304–305
Economic evaluation
attribution issue in, 303–304
categorizing and cataloging costs and benefits in, 314–315
choice of method, 304–305
computing net present value of each alternative, 317–318, 327
connected to program evaluation, 302–303
deciding whose benefits and costs count in, 314, 324–325
discounting in, 313, 316–317, 326
ethical and equity considerations in, 330–331
historical developments in, 307–308
introduction to, 299–304
making recommendations using, 319–320, 327–328
monetizing all costs and benefits in, 315–316, 325–326
in the performance management cycle, 306
predicting costs and benefits quantitatively over life of project, 315, 325

655
sensitivity and distributional analysis in, 318–319, 327
specifying set of alternatives in, 314, 324
steps for, 313–319
strengths and limitations of, 328–332
why evaluator needs to know about, 300–301
See also Cost–benefit analysis (CBA); Cost–effectiveness analysis (CEA); Cost–utility analysis (CUA);
Program evaluation
Economic impact analysis (EIA), 309
Edejer, T. T., 317
Effectiveness, program, 26
Effect of Body-Worn Cameras on Use of Force and Citizens’ Complaints Against the Police: A Randomized
Controlled Trial, The, 18
Efficiency
economic, 299–300, 304–305, 308
technical, 25
Empowerment evaluation, 215 (table), 449
England. See Britain
Environmental factors, 13, 34–35, 57
Environmental scanning, 53
Epidemiological databases, 262
Epistemology, 209–211
interpretivist, 211–213
pragmatism and, 213–214
Epstein, R. M., 500
Equitable societies, 250
Equity concerns in economic evaluation, 330–331
Eraut, M., 506
Ethics
consequentialism and, 483–484
deontological, 482
duty, 482–483
economic evaluation and, 330–331
example of dilemma in, 511–512
foundations of evaluation practice, 482–486
guidelines for evaluation practice, 486–488, 489–490 (table)
in model of professional judgment process, 496 (table)
power relationships and, 485–486
practical wisdom and, 484, 486
Evaluation. See Economic evaluation; Program evaluation
Evaluation assessment process, 28
Evaluation association-based ethical guidelines, 486–488, 489–490 (table)
Evaluation design mixed methods, 224, 224–228
Evaluation feasibility assessment, 30–37
checklist, 29 (table)
client identification, 30
decision making, 36–37
information/data sources, 35–36
most feasible strategy, 36
previous work, 32
program environment, 34–35
program structure and logic, 32–33
questions and issues, 30–31

656
research design alternatives, 33–34
resources, 31
Evaluation Report for the Smoke-Free Ontario Strategy, 359–360
Evaluation study, 28
Evaluation theory
balanced with practical knowledge in professional practice, 492–493
diversity of, 480–481
good, 490–492
Evaluation Theory Tree, 480
Evaluative cultures, 449
building, 456–460
creating ongoing streams of evaluative knowledge in, 457–458
critical challenges to building and sustaining, 458–459
in local government, 460
Evaluative knowledge, streams of, 457–458
Evaluators, 453
building evaluative cultures, 456–460
data collection by, 182
education and training-related activities for, 504–505
ethical dilemma example for, 511–512
independence for, 455
leadership by, 454–455
objectivity claims by, 462–463
professional prospects for, 506–508
qualifications for, 468–469
Evaluators’ Professional Learning Competency Framework, 461, 467–468
Evans, T., 485
Evidence-based decision making, 162
Ex ante analyses, 306
Ex ante evaluations, 8, 16, 27
Ex ante studies, 21
Existing sources of data, 179–182
Expectations in model of professional judgment process, 496 (table), 497–498
Experimental designs, 4, 101–103, 490
evaluation using, 112–117
gold standard in, 4, 21, 36, 102, 110, 122
origins of, 104–110
Perry Preschool Study, 112–117
why pay attention to, 110–111
Experimental diffusion, 113
Ex post analysis, 306
Ex post evaluations, 15–16, 27
Expressed need, 255
External accountability (EA) approach, 430–431, 435
Externalities, 308
External validity, 21, 98, 118, 129–131

Face validity, 171, 172, 392


Farrar, T., 18
Farrar, W. A., 18–20
Feasibility issues, 10
See also Evaluation feasibility assessment
Federal performance budgeting reform, 345–346

657
Feedback
from informants, 239
in performance measurement systems, 399
Feedback loops, 52, 495
Felt need, 255
Feminist inquiry, 215 (table)
Ferguson, C., 258
Fielding, J. E., 32
Fish, D., 494, 495
Fish, S., 212
Fisher, R. A., 105
Fleischer, D. N., 102
Fleischmann, M., 465
Flyvbjerg, B., 484, 485, 488
Focus groups, 36, 288
data sources, 182
interval evaluation, 454
needs assessment, 267–269
procedural judgment and, 494
qualitative evaluation methods and, 206, 214, 277–278, 288
validity and, 172
Formative evaluations, 8, 14–15, 446
program management and, 450–452
as qualitative evaluation approach, 216–217, 217–218 (table)
Formative needs assessment, 260–261
Fort, L., 194
Fort Bragg Continuum of Care Program, 9
Foucault, M., 499
Framework approach, 234–235
F-Ratio, 186
Friedman, D. J., 255
Fuguitt, D., 319–320, 330
Fullilove, M. T., 32
Funnell, S., 62, 74

Gaber, J., 267


Gaebler, T., 343, 347–348
Gaming performance measures, 381–382, 411, 416, 433
in open systems, 53
in public reporting systems, 424
as widespread problem, 416–419, 453
Garvin, D. A., 457
Gates, E., 62
Gaub, J. E., 20
Gaultney, J. G., 301, 331
Generalizability
and balancing theoretical and practical knowledge in professional practice, 493
context-dependent mediation and, 130
in economic evaluation, 331–332
external validity, 116, 129, 327, 467
limitations on, 282
of qualitative findings, 206, 237–239
research design and, 101

658
Ghin, E. M., 425, 429, 453
Gill, D., 432, 434, 458
Gillam, S., 252–254
Globalization of evaluation, 498–499
Glouberman, S., 60–62
Gold standard, 4, 21, 36, 102, 110, 122
Goodwin, L. D., 172
Government Accountability Office (GAO), U.S., 54, 461
Government as business, 351–352
Government Auditing Standards, 461–462
Government goals, program alignment with, 64–67
Government Performance and Results Act, U.S., 349, 424
Government Performance and Results Act Modernization Act, U.S., 6–7, 349
Graduate Record Examination (GRE), 173
Great Recession, 2008-2009, 249, 251, 300
Green, B. C., 309
Green, S., 301
Green, V., 265, 267
Greenberg, D., 308, 315
Greenhouse gas emissions (GHG), cost–benefit analysis on, 309–312
Grey literature, 267
Group-level focused needs assessment, 252
Guba, E. G., 210
Guiding Principles for Evaluators, 467, 487
Gupte, C., 301

Hall, J. L., 374, 398


Halo effect, 190
Hamblin, R., 412, 414–416
Handbook of Qualitative Research, 207
Hanley, N., 315
Hardy, M., 485
Harrison, S., 363, 412
Hatry, H. P., 425
Hawthorne effect, 128
Head Start Program, U.S., 117, 128, 172
Health economic evaluations (HEEs), 301
Health rating method, 322
Heckman, J. J., 116, 117, 328
Hedberg, E., 20
Hedonic price method, 312, 313 (table)
Hibbard, J., 414
Higgins, J. P. T., 301
High/Scope Educational Research Foundation, 112
High Scope/Perry Preschool Program cost-benefit analysis, 322–328
High-stakes environment, performance measurement in, 412–415
High-touch nudges, 103
Hillbrow neighborhood qualitative needs assessment, 277–278
History and internal validity, 119
History of program, 25
HM Treasury, 54
Hoffmann, C., 459
Holistic approach, 219

659
Holzer, M., 373
Homeless Hub, 250
Homelessness programs, 11, 82, 83, 84–85
Canadian Observatory on Homelessness, 250
design and implementation of, 386, 389
needs assessments in, 253, 267, 280
performance measurement, 484, 494
possible service overlaps, 265
Homeostasis, 352
Hood, C., 388, 412, 415, 418, 429
Huang, X., 188
Huberman, A. M., 222, 228
Huse, I., 420, 425, 432, 459
Hutchinson, T., 500
Hypotheses, 165

Images of Organization, 373


Implementation activities, 55–56
Implementation issues, 21, 24–25
performance measurement, 383–384
Implicit designs, 141, 222
Incentives, 424–425
Incremental changes, 41
Incremental effects
data sources and, 179
defined, 4
internal validity and, 122
interval-ratio-level variables and, 179
limitations of economic evaluation and, 329–330
in program evaluations, 358–359
program impacts and, 57, 63
using surveys to estimate, 192–196
Independence in optimizing internal evaluation, 453–455
Independent variables, 124
internal validity and, 118–119
in police body-worn camera experiment, 126
in program logic models, 141, 151
Index of client satisfaction, 174
Indirect costs, 308
Inductive approach, 219
Informants, feedback from, 239
Informed consent, 111, 232, 270, 484
Institute for Clinical Evaluative Sciences, Canada, 254
Instrumental uses of evaluations, 41
Instruments
internal validity and, 120
structuring data collection, 230–231
structuring survey, 189–191
Intangible costs, 314–315
Intangibles, 308
Intended causal linkages, 75–79
Intended outcomes, 13, 33
Interactions among main effects, 107

660
Intercoder reliability, 167, 191, 234, 505
Internal evaluation
leadership and independence in optimizing, 453–455
six stages in development of, 448
views from the field on, 447–455
Internal learning (IL) approach, 430, 435–437
Internal structure validity, 171, 172–173
Internal validity, 21, 35, 98, 110
ambiguous temporal sequence and, 121
attrition/mortality and, 121
defined, 118–119
history and, 119
instrumentation and, 120
maturation and, 119, 132
quasi-experimental designs for addressing, 131–140
selection and, 120–121, 132
selection-based interactions and, 121
statistical regression and, 120
testing and, 120
International Research Development Centre, 220
Interpretivism, 211–213
as criteria for judging quality and credibility of qualitative research, 215 (table)
Interrupted time series design, 133, 134–135
York Neighborhood Watch Program, 136–140
Interval/ratio level of measurement, 176, 177–179
Interval sampling, 272, 275 (table)
Interventions, simple, complicated, and complex, 61–62
Interviews, qualitative, 231–234
virtual, 236
Inventories of existing services, 264–265
Iron Cage Revealed, The, 434

Jakobsen, M. L., 429–430, 435, 460


Jamieson, S., 185
Janesick, V. J., 240
Jefferson, T., 301
Jerak-Zuiderent, S., 363, 412
Job Retention and Rehabilitation Pilot, 230
qualitative data analysis, 233, 234–235
structured data collection instruments, 231
Johnsen, A., 432, 459
Johnson, R. B., 213, 224
Johnson-Masotti, A. P., 330
Jones, E. T., 276–277
Judgment sampling, 272, 275 (table)

Kahneman, D., 103


Kalsbeek, A., 265, 267
Kansas City Preventive Patrol Experiment, 99–101
Kapp, S. A., 213
Kelemen, W. L., 276–277
Kellogg Foundation Logic Model Development Guide, 54
Kelman, S., 432

661
Kendall, M., 255
Kernick, D., 322
Kesenne, S., 309
Kettl, D., 432
Kettner, P. M., 378
Kid Science Program, 197
King, J. A., 501
Knowledge
Aristotle’s five kinds of, 484
balancing theoretical and practical, 492–493
creating ongoing streams of evaluative, 457–458
power and, 499
shareable, in model of professional judgment process, 496 (table)
tacit, 17, 490, 492
Kotter, J., 377
Krause, D., 12
Kravchuk, R. S., 380
Kristiansen, M., 425, 429, 453
Kuhn, T. S., 208–209, 463, 497
Kumar, D. D., 249, 250, 252, 257, 258
on phase III: post-assessment, 284
on phase II: the needs assessment, 268–270, 280, 283
on pre-assessment phase in needs assessment, 259–268

Laihonen, H., 435–437


Langford, J. W., 483, 487
Layde, P. M., 330
Leadership
in optimizing internal evaluation, 453–455
sustained, 432
League tables, 397
Learning
in craftsmanship, 492
double-loop, 457
Lee, L., 74
Lee, M., 344
Leeuw, F. L., 429
Le Grand, J., 412, 413, 429
Le Guin, U., 465
Lethbridge, Alberta, Canada, performance measurement, 425–429
Level of confidence, 274
Levels of measurement, 118, 176
interval/ratio, 176, 177–179
nominal, 176–177
ordinal, 176, 177
Levin, H. M., 303
Lewin, S., 303
Likert, R., 185
Likert statements, 167, 173, 420
in surveys, 185–187
Lincoln, Y. S., 207–208, 210
Linear program logic model, 33 (figure)
Lines of evidence, 36, 38, 357–358

662
triangulating qualitative and quantitative, 288
Literature reviews, 32, 36
complex logic model describing primary health care in Canada, 89–91
in needs assessments, 267, 269, 277, 279
in police body-worn camera study, 73
Litman, T., 315
Local government, performance measurement in, 344–345
decentralized, 435–437
evaluative cultures and, 460
joining internal and external uses of performance information, 425–429
Logic modeling, 33
changes to performance measurement systems and, 433
features of, 78
involving prospective users in development of, 390–391
organizational, 387, 404 (figure)
performance measurement and, 353–363
performance measurement design and, 385–386, 387 (table)
See also Program logic models
Logic of causes and effects, 12
Longitudinal studies, 113–114
Lost output method, 312, 313 (table)
Love, A. J., 362, 448, 452, 456
Low-stakes environment, performance measurement in, 425–429
Low-touch nudges, 103
Lund, C., 74

Macdonell, J. J., 307
Machines, organizations as, 52, 351
Magenta Book: Guidance for Evaluation, The, 32, 54
Mailed surveys, 188
Main effects, 107
Maintenance needs, 256
Manpower Demonstration Research Corporation, 211
Mäntylä, S., 435–437
Mapping, outcome, 220–221
Marginal costs, 257
Marsh, K., 315
“Marshmallow studies,” 174
Martin, L. L., 378
Martyn, S., 309
Maskaly, J., 69, 132
Maslow’s hierarchy of needs, 256
Mason, J., 487
Mathison, S., 463, 486
Maturation and internal validity, 119, 132
Maxfield, M., 73
Maximum variation sampling, 230
Mayne, J., 11, 28, 355, 378, 453, 456, 458
McDavid, J. C., 420, 425, 432, 459
McEwan, P. J., 303
McKillip, J., 252, 258, 261, 264
McMorris, B., 188
Meals on Wheels program, 88, 89 (figure), 271
needs assessment, 293–294
Means-ends relationships, 32
Measurement, 5
as about a set of methodological procedures intended to translate constructs into observables
as about finding/collecting relevant data, 162
defined, 166
introduction to, 162–164
levels of (See Levels of measurement)
performance (See Performance measurement)
proxy, 181
reliability (See Reliability)
survey (See Surveys)
triangulation, 34, 140, 141
units of analysis in, 4, 144, 175–176
validity (See Validity)
Measurement instrument, 167
Mechanisms, 51, 72–73
Medicaid, 249
Medium-stakes environment, performance measurement in, 419–424
Mejlgaard, N., 484
Melé, D., 485
Mertens, D. M., 487
Meta-analysis, 69
Meta-evaluation, 69
Methodologies, 38–40, 63, 214
conflict between social science research and program evaluation, 181–182
decision environment and, 497
defensible, 16
economic evaluation, 307, 329
empowerment evaluation, 449
in good evaluation theory and practice, 490
independence for evaluators and, 455
measurement, 163
need for program and, 22
needs assessment, 252, 255, 262, 275
performance measurement, 391
in policies and programs, 10, 102
professional judgment and, 499
qualitative, 206, 209, 211, 214–215, 218, 221, 228, 239, 241
in real world of evaluation practice, 481–482
for replicability, 464
for retrospective pre-tests, 194
validity in measurement, 434
Methods, 213–214
economic evaluation, 304, 312, 328–331
ethical principles and, 489
evaluation, 206–207, 209, 211
evaluation theory tree and, 480
measurement, 163
for measuring QALY, 322
mixed, 92, 213–214, 216, 221, 224, 224–228
monetizing, 315
needs assessment, 254, 262, 268–272, 281–282
objectivity and replicability in, 463–468
output distortions and, 418
performance measurement, 357, 394
professional judgment and, 482
qualitative, 214–217, 215, 220, 237–239, 238, 277–278
revealed and stated preferences, 312, 313
sampling, 105, 274–275
statistical, 105, 176–179
validity, 172
Meyer, W., 62, 508
Mihu, I., 493
Miles, M. B., 222, 228, 238
Mills, K. M., 306
Mindfulness, 499–501
Mischel, W., 173–174
Mixed methods, 38, 216, 221
evaluation designs, 224, 224–228
logic model, 92
in surveys, 188
timing in, 224–225
weighting in, 225
when and how of mixing in, 225
Mixed-sampling strategies, 230
Modarresi, S., 506
Modell, S., 432, 459
Moral judgments, 485
Morgan, D. L., 214
Morgan, G., 52, 65, 352, 373, 457
Morpeth, L., 265, 267
Morris, M., 487
Most Significant Change (MSC) approach, 239–241, 242
Motivation, public servant, 413
Mourato, S., 315, 330
Mowen, J. C., 499
Moynihan, D. P., 376, 429–430, 435–436, 458, 460
Mueller, C. E., 196
Mugford, M., 315
Multiple regression analysis, 150, 150–151, 154, 186
Multivariate statistical analysis, 179

“Naming and shaming” approach to performance management, 415–418
National Health Service, Britain, 413–415
National Survey of Children’s Health, U.S., 254
National Survey of Children with Special Health Care Needs, U.S., 254
Naturalistic designs, 219–220
Necessary conditions, 9, 56
Need Analysis: Tools for the Human Services and Education, 258
Needs assessment, 5, 22–24, 27
causal analysis of needs in, 280
community health needs assessment in New Brunswick example, 285–290
conducting full, 268–269
defined, 250
developing action plans in, 284
general considerations, 250–257
group-level focus, 252
identification of solutions in, 280–282
identifying discrepancies in, 269–278
implementing, monitoring, and evaluating, 284–285
introduction to, 249–257
as linear, 258
making decisions to resolve needs and select solutions in, 283–284
measurement validity issues in, 275–277
moving to phase III or stopping, 282–283
need for, 250–252
needs assessment committees (NAC), 265–267
in the performance management cycle, 252–254
perspectives on needs and, 255–256
phase I: pre-assessment, 259–268
phase II: needs assessment, 268–283
phase III: post-assessment, 283–285
politics of, 256–257
prioritizing needs to be addressed in, 278–279, 288–290
purpose of, 260–261
qualitative methods in, 277–278
recent trends and developments in, 254–255
resources available for, 265–266
sampling in, 271–275
for small nonprofit organization, 293–294
steps in conducting, 257–285
surveys in, 270–271
target populations for, 262–263
Needs Assessment: A Creative and Practical Guide for Social Scientists, 258
Neo-liberalism, 486, 491–492, 499
Net present benefit, 318
Net present value (NPV), 316, 317–318, 323, 327
Net social benefits/net social value (NSBs), 300, 309
Neumann, P. J., 321–322
New Brunswick community health needs assessment, 285–290
New Chance welfare-to-work evaluation, 211, 222–223
sampling in, 228
within-case analysis, 222–223
New Deal, 307
New Hope program, 226–227
New Jersey Negative Income Tax Experiment, 99–100, 178
Newman, D. L., 506
New Public Governance, 6
New Public Management (NPM), 6, 60
accountability in, 429
economic evaluation and, 300–301, 332
emergence of, 346–349
ethical foundations and, 483
evaluative cultures and, 458
government as business and, 352
performance measurement giving managers “freedom to manage” in, 434–435
performance measurement in, 341
See also Performance measurement
Nimon, K., 195–196
Niskanen, W. A., 348, 425
Nix, J., 20
Noakes, L. A., 74
Nominal costs, 316
Nominal interest rates, 317
Nominal level of measurement, 176–177
Non-experimental designs, 140–141
Nonmarket impacts, valuing of, 312, 313 (table)
Nonrecursive causal model, 152
Normative need, 255
Norris, N., 453
North Carolina Community Assessment Guidebook, 279
Nudges, 103

Obama, B., 6, 58, 349
Objectives, program, 64–67
Objectivism, 210
Objectivity, 40, 450, 460–467
evaluators claims of, 462–463
replicability and, 463–465
Observed outcomes, 13, 57, 104
Office of Management and Budget (OMB), 6–7, 349, 424
Office of the Comptroller General (OCG), Canada, 464–465
Ogborne, A., 55
Olejniczak, K., 454, 456
Ontario Institute for Clinical Evaluative Sciences, 263
Onwuegbuzie, A. J., 213, 224
Open-ended questions, 37–38
coding scheme for, 167
in surveys, 227, 231
Open systems, 34
implications of understanding policies and programs as, 53–54
logic models and, 52–53
organizations as, 352–353
Operationalization of program construct, 126–127
Opioid addiction problem, 251
Opportunistic sampling, 230
Opportunity cost, 309
Ordinal level of measurement, 176, 177
Oregon Open Data Portal, 263
Organisation for Economic Cooperation and Development (OECD), 468
Organizational charts, 84–85
Organizational logic models, 387, 404 (figure)
Organizational politics, 424–425
Organizations
building evaluative culture in, 456–460
coping, 386, 387 (table)
craft, 386, 387 (table)
as machines, 52, 351
as open systems, 352–353
as organisms, 52
procedural, 386, 387 (table)
production, 386, 387 (table)
self-evaluating, 447
Osborne, D., 343, 347–348
Oswald, F. L., 188
Otley, D., 411, 415, 416, 418–419
Outcomes, 56–57
causal chain, 181
initial, intermediate, and long-term, 58
intended, 13, 33
mapping of, 220–221
observed, 13, 57, 104
performance measurement systems de-emphasizing, 437–438
program effectiveness and, 26
Outputs, 56
distortions, 418–419
performance measurement systems de-emphasizing, 437–438
Overlaps, identifying service, 265
Owen, J. M., 30–31, 464
Oyserman, D., 192–193

Palmer, C., 265, 267
Pandey, S. K., 376, 429
Pang, L., 306
Paradigms, 208–213
defined, 208
as incommensurable, 209
pragmatism and, 213–214
Parametric statistics, 178
Parrish, R. G., 255
PART (Program Assessment Rating Tool), 6
Patched-up research design, 104
Path analysis, 152
Pathways to Work pilot, 228
Patient Protection and Affordable Care Act, U.S., 251–252
Patterson, P., 267
Patton, M. Q., 11, 15, 28–30, 40, 61, 85, 355, 451
on evaluative culture, 458
on evaluators engaging with stakeholders, 283
on qualitative evaluation, 213, 214, 216, 217, 231–233, 241–242
Pawson, R., 71
Pearce, D., 315, 330, 331
Pearson correlation, 186
Performance dialogues, 435, 460
Performance improvement
accountability measures for, 411–429
“naming and shaming” approach to, 415–418
performance measurement for, 343
rebalancing accountability-focused performance measurement systems to increase uses, 429–437
steering, control, and, 349–350
Performance information, joining of internal and external uses of, 425–429
Performance management cycle, 8–10, 21
economic evaluation in, 306
needs assessment in, 252–254
ratchet effects in, 388, 418
threshold effects in, 418
Performance measurement, 163–164
for accountability and performance improvement, 343
addressing general issues, 356–357
attribution and, 358–361
beginnings in local government, 344–345
big data analytics in, 180–181
comparing program evaluation and, 353–363
comparisons included in system for, 395–398
complex interventions and, 63
under conditions of chronic fiscal restraint, 437–438
conflicting expectations for, 379–380
connecting qualitative evaluation methods to, 239–241
current imperative for, 342–343
decentralized, 435–437
decoupling in, 431–432
de-emphasizing outputs and outcomes, 437–438
emergence of New Public Management and, 346–349
evaluators, 362–363
external accountability (EA) approach in, 430–431, 435
federal performance budgeting reform, 345–346
gaming, 416
giving managers “freedom to manage,” 434–435
growth and evolution of, 344–350
in high-stakes environment, 412–415
integration with program evaluation, 4–5
intended purposes of, 363, 380–382
internal learning (IL) approach in, 430–431, 435–437
introduction to, 341
logic models for, 81–82, 83 (figure)
in low-stakes environment, 425–429
making changes to systems of, 432–434
in medium-stakes environment, 419–424
metaphors that support and sustain, 350–353
Most Significant Change (MSC) approach, 239–241, 242
“naming and shaming” approach to, 415–418
as ongoing, 356, 376
ongoing resources for, 361–362
organizational cultural acceptance and commitment to, 374
professional engagement (PR) regime in, 430–431
for public accountability, 400–401
rebalancing accountability-focused, 429–437
research designs and, 145–146
role of incentives and organizational politics in, 424–425
routinized processes in, 357–358
sources of data, 179–191
steering, control, and performance improvement with, 349–350
validity issues, 388, 391–393, 482
Performance measurement design
changes in, 432–434
clarifying expectations for intended uses in, 380–382
communication in, 379–380, 433
developing logic models for programs for which performance measures are being designed and identifying key constructs to be measured in, 385–386, 387 (table)
highlighting the comparisons that can be part of the system, 395–398
identifying constructs beyond those in single programs in, 387–390
identifying resources and planning for, 383–384
introduction to, 372
involving prospective users in development of logic models and constructs in, 390–391
key steps in, 374–399
leadership and, 375–377
reporting and making changes included in, 398–399
taking time to understand organizational history around similar initiatives in, 384–385
technical/rational view and political/cultural view in, 372–374
translating constructs into observable performance measures in, 391–395
understanding what performance measurement systems can and cannot do and, 377–378
Performance monitoring, 9
Performance paradox in the public sector, 429–430
Perla, R., 185–186
Perry Preschool Study, 145, 179
compensatory equalization of treatments in, 116
conclusions from, 116–117
empirical causal model for, 152–153
High Scope/Perry Preschool Program cost–benefit analysis, 322–328
limitations of, 115–116
as longitudinal study, 113–114
research design, 112–115
within-case analysis, 222
Personal recall, 192–194
Peters, G., 429
Peterson, K., 306
Petersson, J., 181
Petrosino, A., 143
Pett, M., 254
Philosophical pragmatism, 224
Photo radar cameras, Vancouver, Canada, 164–165
Phronesis, 479, 484
Picciotto, R., 12, 463, 491, 506
Pinkerton, S. D., 330
Pitman, A., 492
Planning, programming, and budgeting systems (PPBS), 344
Plausible rival hypotheses, 12, 163, 399
causal relationship between two variables and, 99
visual metaphor for, 108, 108
Poister, T. H., 374, 383, 394, 398
Polanyi, M., 492
Police body-worn camera program, Rialto, California
as basic type of logic model, 58–60, 78–79
connecting this book to, 21–22
construct validity, 127
context of, 17–18
implementing and evaluating effects of, 18–19
key findings on, 19
measurement validity, 126–127
program logic model for, 59, 59–60
program success versus understanding the cause-and-effect linkages in, 20
randomized controlled trials andquasi-experiments, 122–124
replication of evaluation of, 466–467
Policies, 10–11
appropriateness of, 253
incremental impact of changes in, 153–156
as open systems, 53–54
results-based neo-liberalism, 491–492
Policy on Evaluation, 491
Policy on Results, 491
Policy Paradox: The Art of Political Decision Making, 373
Polinder, S., 301, 331
Political/cultural perspective, 373
Politics
and incentives in performance measurement systems, 424–425
of needs assessments, 256–257
organizational, 373
Pollitt, C., 363, 382, 412, 413, 415, 424–425, 429, 453
Pons, S., 465
Populism, 486
Positivism, 208, 211
as criteria for judging quality and credibility of qualitative research, 215 (table)
Post-assessment phase in needs assessment, 283–285
Postpositivism, 211
as criteria for judging quality and credibility of qualitative research, 215 (table)
Post-test assessments, 191
Post-test only experimental design, 108–109
Post-test only group, 110
Power
ethical practice and, 485–486
knowledge and, 499
Practical know-how in model of professional judgment process, 496 (table)
Practical wisdom, 484, 486
Pragmatism, 213–214
philosophical, 224
Pre-assessment phase in needs assessment, 259–268
focusing the needs assessment in, 260–266, 286
forming the needs assessment committee (NAC) in, 266, 286–287
literature reviews in, 267, 287–288
moving to phase II and/or III or stopping after, 268
Predictive validity, 171, 173
President’s Emergency Plan for AIDS Relief (PEPFAR), U.S., 468
Pre-test assessments, 37, 191
retrospective, 135, 194–196
Pre-test-post-test experimental design, 108–109
Prioritizing needs to be addressed, 278–279, 288–290
Problems, simple, complicated, and complex, 60–61
Procedural judgment, 494
Procedural organizations, 386, 387 (table)
Process uses of evaluations, 41
Production organizations, 386, 387 (table)
Professional engagement regime (PR), 430
Professional judgment, 5
aspects of, 493–495
balancing theoretical and practical knowledge, 492–493
education and training-related activities and, 504–505
ethics in (See Ethics)
evaluation competencies and, 501–502, 502–504 (table)
good evaluation theory and practice and, 490–492
importance of, 16–17
improving, 499–506
introduction to, 478
mindfulness and reflective practice in, 499–501
nature of the evaluation enterprise and, 478–482
process in, 495–498
tacit knowledge and, 17, 490, 492
teamwork and improving, 505–506
types of, 494
understanding, 490–499
Program activities, 24–25
Program Assessment Rating Tool (PART), 424
Program complexity, 302–303
Program components, 55
Program effectiveness, 13, 217
Program environment, 13, 34–35, 57
Program evaluation, 3, 4
after needs assessment, 284–285
American model of, 6–7
attribution issue in (See Attribution)
basic statistical tools for, 150–151
big data analytics in, 180–181
Canadian federal model of, 7
causality in, 12–14
collaborative, 451
comparing performance measurement and, 353–363
connected to economic evaluation, 302–303
connected to performance management system, 5–8
constructing logic model for, 79–81
Context, Input, Process, Product (CIPP) model, 449
criteria for high-quality, 467–469
cultural competence in, 498–499
defining, 3
developmental, 15, 85, 458
diversity of theory on, 480
as episodic, 356
ethical foundations of, 482–486
ethical guidelines for, 486–488, 489–490 (table)
evaluators in, 362–363
ex ante, 8, 16, 27
ex post, 15–16, 27
formative, 8, 14–15, 216–217, 217–218 (table), 446, 450–452
holistic approach to, 219
importance of professional judgment in, 16–17
improving professional judgment in, 499–506
inductive approach to, 219
integration with performance measurement, 4–5
intended purposes of, 363
internal, 447–455
as issue/context specific, 356–357
key concepts in, 12–17
key questions in, 22–27
linking theoretical and empirical planes in, 124 (figure)
making changes based on, 41–42
measurement in, 163–164
measures and lines of evidence in, 357–358
measuring constructs in, 166
nature of the evaluation enterprise and, 478–482
objectivity in, 40, 450, 460–467
paradigms and their relevance to, 208–213
police body-worn camera program, Rialto, California, 17–22
process, 37–41
professional judgment and competencies in, 501–502, 502–504 (table)
realist, 71–74
real world of, 481–482
shoestring, 31
sources of data, 179–191
steps in conducting (See Program evaluation, steps in conducting)
summative, 9, 14–15, 216–217, 217–218 (table), 446, 452–453
targeted resources for, 361–362
theory-driven, 68, 74–75
See also Economic evaluation
Program evaluation, steps in conducting
doing the evaluation, 37–41
feasibility, 30–37
general, 28–29
Program Evaluation Standards, 467, 487
Program impacts, 57
Program implementation, 8–9
after needs assessment, 284–285
Program inputs, 24–25, 55
Program logic models, 5, 33
basic logic modeling approach, 54–60
brainstorming for, 80–81
construction of, 79–81
defined, 51
introduction to, 51–54
for Meals on Wheels program, 88, 89 (figure)
open systems approach and, 52–53
for performance measurement, 81–82, 83 (figure)
for police body-worn camera programs, 59, 59–60
primary health care in Canada, 89–91
program objectives and program alignment with government goals, 64–67
program theories and program logics in, 68–75
“repacking,” 104
strengths and limitations of, 84–85
surveys in, 183
testing causal linkages in, 141–145
that categorize and specify intended causal linkages, 75–79
in a turbulent world, 85
working with uncertainty, 60–63
See also Logic modeling
Program logics, 68, 84
contextual factors, 70–71
systematic reviews, 69–70
Program management
formative evaluation and, 450–452
summative evaluation and, 452–453
Program managers, 84, 164
performance measurement and, 181–182
Programmed Planned Budgeting Systems (PPBS), 345–346
Program monitoring after needs assessment, 284
Program objectives, 64–67
Program processes, 15
Programs, 11
intended outcomes of, 13
as open systems, 53–54
strategic context of, 263–264
Program theories, 68, 74–75
Progressive Movement, 345
Propensity score analysis, 134
Proportionate stratified samples, 272
Propper, C., 416
Proxy measurement, 181
Psychological constructs, 183
Public accountability, 377–378, 381
performance measurement for, 400–401, 411–429
performance paradox in, 429–430
Public Safety Canada, 18
Public Transit Commission, Pennsylvania, study, 275–277
Public Value Governance, 6
Purposeful sampling, 228, 229 (table)

QALY. See Quality-adjusted life-years (QALY)
Qualitative data, 38
analysis of, 233–236
collecting and coding, 230–233
triangulating, 288
within-case analysis of, 222–223
Qualitative evaluation, 101–102
alternative criteria for assessing qualitative research and, 214–216
basics of designs for, 216–221
comparing and contrasting different approaches to, 207–216, 218–221
differences between quantitative and, 219 (table)
diversity of approaches in, 207
introduction to, 206–207
mixed methods designs, 224–228
naturalistic designs, 219–220
outcomes mapping in, 220–221
paradigms and, 208–213
performance measurement connected to methods in, 239–241
power of case studies and, 241–242
summative versus formative, 216–217, 217–218 (table)
Qualitative interviews, 231–234
in community health needs assessment in New Brunswick, 288
Qualitative needs assessment, 277–278
Qualitative program evaluation, 219
collecting and coding data in, 230–233
credibility and generalizability of, 237–239
data analysis in, 233–236
designing and conducting, 221–237
interviews in, 231–234
purpose and questions clarification, 222
reporting results of, 237
research designs and appropriate comparisons for, 222–224
sampling in, 228–230
Qualitative Researching, 487
Quality-adjusted life-years (QALY), 300, 304, 305
cost–utility analysis and, 321–322
threshold for, 314
Quality Standards for Development Evaluation, 468
Quantitative data, 38
triangulating, 288
Quantitative evaluation, 179
differences between qualitative and, 219 (table)
mixed methods designs, 224–228
Quasi-experimental designs, 101
addressing threats to internal validity, 131–140
police body-worn cameras, 122–124
Quota sampling, 275 (table)

Randomized experiments/randomized controlled trials (RCTs), 4, 21, 481
consequentialism and, 484
construct validity, 127–128
High Scope/Perry Preschool Program cost–benefit analysis, 323
police body-worn cameras, 122–124
qualitative methods, 219–220
Random sampling, 272
Ratchet effect, 388, 418
Ratio measures, 176, 177–179
Rationale, 21
Raudenbush, S., 174
Rautiainen, A., 432, 459
Reagan, R., 307
Real benefits, 316
Real costs, 316
Realist evaluation, 71–74
Real rates, 317
RealWorld Evaluation approach, 240
Redekop, W. K., 301
Reflective judgment, 494
Reflective practice, 461, 499–501
Regression, 150
coefficients, 150
logistic, 134
multiple regression analysis, 150, 150–151, 154, 186
multivariate, 175
revealed preferences methods, 313
statistical, 120, 133, 137
Regulatory Impact Analysis (RIA), U.S., 307
Reilly, P., 301
Reinventing Government, 347
Relevance, program, 24
Reliability, 164–175
Cronbach’s alpha, 168
difference between validity and, 169–170
intercoder, 168
Likert statement, 168
split-half, 167
in surveys, 191
understanding, 167–168
Replicability and objectivity, 463–465
body-worn cameras study, 466–467
Reports
needs assessment, 280–282
dissemination, 40–41
qualitative program evaluation, 237
writing, review and finalizing of, 39–40
Research designs, 5
case study, 34, 141
characteristics of, 104–110
conditions for establishing relationship between two variables, 99
evaluation feasibility assessment, 33–34
experimental (See Experimental designs)
feasibility issues, 10
gold standard, 4, 21, 102
holding other factors constant, 104
implicit, 141
naturalistic, 219–220
non-experimental, 140–141
patched-up, 104
performance measurement and, 145–146
Perry Preschool Study, 112–117
qualitative (See Qualitative evaluation)
quasi-experimental (See Quasi-experimental designs)
“repacking” logic models, 104
survey instruments, 189–191
threats to validity and, 118–131
treatment groups, 99–100
why pay attention to experimental designs in, 110–111
Response process validity, 171, 172, 392
Response set, 187
Response shift bias, 195
Results-Based Logic Model for Primary Health Care: Laying an Evidence-Based Foundation to Guide Performance Measurement, Monitoring and Evaluation, A, 89
Results-based management, 5–6
See also New Public Management (NPM)
Results-based neo-liberalism, 491–492
Results reporting
in performance measurement systems, 398–399
qualitative program evaluation, 237
Retrospective pre-tests, 135, 194–196
Revealed preferences, 312, 313 (table)
Reviere, R., 258
Ritchie, J., 235
Rist, R. C., 28, 355, 456, 458
Rival hypotheses, 26, 57
plausible, 99, 108
Rogers, P. J., 30–31, 62, 74, 464
Roosevelt, F. D., 307
Rossi, P. H., 111
Roth, J., 256
Rothery, M., 254
Royal British Columbia Museum admission fee policy, 153–156
Rugh, J., 194
Rush, B., 55
Rutman, L., 28

Sabharwal, S., 301
Sadler, S., 306
Saldaña, J., 222
Sample sizes, 273–274
Sampling, 34, 118
level of confidence, 274
methods, 105, 274–275
mixed, 230
in needs assessments, 271–275
opportunistic, 230
purposeful, 228, 229 (table)
qualitative evaluations, 228–230
random, 272
sizes of, 273–274
snowball or chain, 228, 230, 272, 275 (table)
theoretical, 228
typical case, 230
Sampling error, 273
Sampson, R., 174
Sanders, G. D., 301
Scale, Likert-like, 186
Schack, R. W., 380
Schön, D. A., 504
Schröter, D. C., 74
Schwandt, T., 57, 487, 493
Schwarz, N., 192–193
Schweinhart, L., 115
Scrimshaw, S. C., 32
Scriven, M., 12, 14–15, 40, 222, 256, 450, 460–462
on ethical evaluation, 487
Secondary sources, 262, 357
Selection and internal validity, 120–121, 132
Selection-based interactions and internal validity, 121
Self-awareness and socially-desirable responding, 79
Self-evaluating organizations, 447
Self-Sufficiency Project, 75–77
Senge, P. M., 380, 457
Sensitivity analysis, 317, 318–319, 327
Sequential explanatory design, 226–227
Sequential exploratory design, 227
Shadish, W. R., 111, 118, 125, 127–130, 170
on validity, 174, 194
Shareable knowledge in model of professional judgment process, 496 (table)
Shaw, I., 30
Shemilt, I., 315
Shepherd, R., 452
Shoestring evaluation, 31
“Sibling Cancer Needs Instrument,” 267
Sigsgaard, P., 239, 240
Simple interventions, 61–62
Simple problems, 60–61
Single time series design, 34, 133, 134–135
Skip factors, 272
Snowball sampling, 228, 230, 272, 275 (table)
Social constructivism, 210, 213
as criteria for judging quality and credibility of qualitative research, 215 (table)
Social democracy, 486
Social desirability response bias, 191
Social need, types of, 255–256
Social opportunity cost of capital (SOC), 316–317
Social rate of time preference (SRTP), 316–317
Solomon Four-Group Design, 110
Sonneveld, P., 301
Soriano, F. L., 258, 280
Sork, T. J., 250, 263, 270
Speaking Truth to Power: The Art and Craft of Policy Analysis, 446
Special Issue of New Directions in Evaluation, 453
Specifying set of alternatives in economic evaluation, 314
Split-half reliability, 167
Stamp, J., 393
Standard gamble method, 322
Standards for Educational and Psychological Testing, 170
Standing in cost–benefit analysis, 309–312, 314, 324–325
Stanford–Binet Intelligence Test, 112–114
Stanford University Bing Nursery School, 174
Stanley, J. C., 118, 134
Stated preferences method, 312, 313 (table)
Static-group comparison design, 135
Statistical conclusions validity, 98, 118, 131
Statistical Methods for Research Workers, 105
Statistical Package for the Social Sciences (SPSS), 223
Statistical regression and internal validity, 120
Statistical significance, 113
Steccolini, I., 357
Stergiopoulos, V., 254
Stern, N., 317
Stevahn, L., 501
Stevens, A., 252–254
Stimulus-response model of survey process, 183
Stockard, J., 414
Stockmann, R., 62, 508
Stone, D., 373
Strategic context of programs, 263–264
Stratified purposeful sampling, 230
Stratified random samples, 272
Streams of evaluative knowledge, 457–458
Structure/logic of programs, 24
Structure of Scientific Revolutions, The, 208
“Study of Administration, The,” 352
Stufflebeam, D., 449–450, 460
Summative evaluation, 9, 14–15, 446, 452–453
as qualitative evaluation approach, 216–217, 217–218 (table)
Summative needs assessment, 261
Surveys
conducting, 187–189
designs, 187–189, 196–197
estimating incremental effects of programs using, 192–196
as evaluator-initiated data source in evaluations, 182–184
Likert statements in, 185–187
in medium-stakes environment, 420–423
in needs assessments, 270–271
open-ended questions in, 227, 231
personal recall and, 192–194
retrospective pre-test, 135, 194–196
steps in responding to, 193
stimulus-response model of, 183
structuring instruments for, 189–191
unintended responses in, 184
validity and reliability issues applicable to, 191
Sustained leadership, 432
Sutherland, A., 18–20
Swenson, J. R., 265
Symbolic uses of evaluations, 41
Systematic review, 32
Systematic sampling, 272, 275 (table)

Tacit knowledge, 17, 490, 492
Tailored design method surveys, 188
Taks, M., 309
Tanner, G., 254
Target populations for needs assessment, 262–263
Target setting, performance measurement systems, 388–395
Taylor, F., 351
Teamwork and professional judgment, 505–506
Technical efficiency, 25
Technical judgments, 494
Technical/rational perspective, 373
Temporal asymmetry, 99
Testing procedures and internal validity, 120
Thatcher, M., 347
Theoretical sampling, 228
Theorizing in mixed methods, 225
Theory-driven evaluations, 68, 74–75
Theory of change (ToC), 51, 74–75
response bias, 191, 195
Thompson, J. D., 361
Three triangles in model of professional judgment process, 498
Three-way analysis of variance, 107
Threshold effects, 418
Tilley, N., 71
Time series, 113
interrupted, 133–140
single, 34, 133, 134–135
York Neighborhood Watch Program, 136–140
Time trade-off method, 322
Timing in mixed methods, 224–225
Traditional anthropological research, 208
Travel cost method, 312, 313 (table)
Treasury Board of Canada Secretariat (TBS), 7, 82, 350, 355
accountability expectations, 452
core questions in program evaluation, 356
key role for managers in, 362
logic model template, 54, 91
objectivity in evaluation of, 461, 464
program structure, 387
repeatability in evaluation of, 465
resource alignment review, 67
Treatment groups, 99
Triangulation, 34, 140, 141
of data sources, 239
of qualitative and quantitative lines of evidence, 288
Tripp, D., 499–500, 504
Trochim, R., 125, 170
Troubled Families Program in Britain, 166–167
as complex program, 302–303
mixed methods in, 225–226
needs assessment and, 251
qualitative evaluation report, 237
within-case analysis, 223
Trump, D., 7
Tusler, M., 414
Tutty, L., 254
Tversky, A., 103
Typical case sampling, 230

U.K. Job Retention and Rehabilitation Pilot, 223, 225
Uncertainty, working with, 60–63
Unintended responses, 184
United States, the
Data Resource Center for Children and Adolescent Health, 254
early childhood programs in, 32, 69, 98, 128
federal performance budgeting reform in, 345–346
focus on government program performance results in, 6–7
gold standard in, 102
Government Accountability Office, 54, 461
Government Performance and Results Act, 349, 424
Government Performance and Results Act Modernization Act, 349
Medicaid program, 249
New Deal era in, 307
North Carolina Community Assessment Guidebook, 279
Office of Management and Budget (OMB), 6–7, 349, 424
Oregon Open Data Portal, 263
Patient Protection and Affordable Care Act, 251
performance measurement in local governments in, 344–345
police body-worn camera study in Rialto (See Police body-worn camera program, Rialto, California)
President’s Emergency Plan for AIDS Relief (PEPFAR), 468
Regulatory Impact Analysis (RIA), 307
resource alignment review in, 67
Units of analysis, 4, 144, 175–176
in surveys, 182
Urban Change welfare-to-work project, 217
within-case analysis, 223
U.S. Bureau of Justice Assistance, 18
Utilization focus, 30
UTOS, 101
Uyl-de Groot, C., 301

Vale, L., 301, 315
Validity, 164–175
bias as problem in, 168
of causes and effects, 197–198
concurrent, 171, 173
construct, 21, 68, 118, 124–129, 170–171
content, 171, 172, 392
convergent, 171, 174
difference between reliability and, 169–170
discriminant, 171, 174–175
external, 21, 98, 118, 129–131
face, 171, 172, 392
four basic threats to, 118–131
internal, 21, 35, 98, 110, 118–122, 131–140
internal structure, 171, 172–173
measurement, 125–126, 197–198
in needs assessments, 275–277
of performance measures, 5
predictive, 171, 173
response process, 171, 172, 392
statistical conclusions, 98, 118, 131
in surveys, 191
types of, 170–171
understanding, 169–170
ways to assess, 171–175
Value-for-money, 307
Values in model of professional judgment process, 496 (table), 497–498
Van Dooren, W., 459
Van Loon, N., 429–430
Van Thiel, S., 429, 432
Variables, 106
ambiguous temporal sequence and, 121
dependent (See Dependent variables)
independent (See Independent variables)
nominal, 176–177
ordinal, 177
Vickers, S., 484
Vining, A., 308, 315
Virtual interviews, 236
Vo, A. T., 206
Volkov, B., 453
Voluntary Organizations for Professional Evaluation (VOPEs), 506–507

Wankhade, P., 415
Watson, K., 103
Web-based surveys, 188
Weber, M., 434
Weighting in mixed methods, 225
Weikart, D., 115
Weimer, D., 308, 315
Weisburd, D., 110–111
Weiss, C., 449
Weiss, C. H., 9, 15
Welfare economics, 305
Westine, C. D., 74
Whynot, J., 452
Wicked problems, 481
Wilcox, S. J., 319–320, 330
Wildavsky, A. B., 362, 446–447, 456
Williams, D. W., 344
Willingness-to-accept (WTA), 305, 312
Willingness-to-pay (WTP), 305, 312
Wilson, A. T., 487
Wilson, D., 412, 416, 429
Wilson, J. Q., 386
Wilson, W., 352
Winter, J. P., 276–277
Wisdom, practical, 484
Within-case analysis, 222–223
Wolfe, E. W., 188
Wolfe, S. E., 20
Workable logic models, 80
WorkSafeBC, 358, 395–396
Wright, B. E., 376

Yarbrough, D. B., 467
Yesilkagit, K., 432
York Neighborhood Watch Program, 136–140
findings and conclusions, 137–140
program logic, 141–145

Zero-based budgeting (ZBB), 344
Zigarmi, D., 195–196
Zimmerman, B., 60–62