
The Balanced Scorecard:

Judgmental effects of information organization and diversity

Marlys Gascho Lipe


Rath Chair in Accounting
University of Oklahoma
Price College of Business
Norman, OK 73019
email: MLipe@ou.edu
Steven Salterio
Associate Professor of Accounting
University of Waterloo
School of Accountancy
Waterloo, Ontario, Canada N2L 3G1
sesalterio@uwaterloo.ca

November 17, 1998

We are grateful to the Canadian Academic Accounting Association for project funding and to the
University of Alberta's Southam Fellowship for funding for the second author. Helpful
comments were provided by Carla Carnaghan, Mike Gibbins, Yves Grendron, Joan Luft, Alan
Webb, and workshop participants at the Universities of Alberta, Oklahoma, Texas, and North
Texas.
Data Availability: Data are available upon request from the second author.


ABSTRACT
This paper examines the judgmental effects of the balanced scorecard. The balanced scorecard
contains a large number of financial and nonfinancial performance measures divided into four
categories. Based on judgment and decision making theory we examine whether the
organization of the information and the diverse (common and unique) measures contained in the
scorecard result in differences in managerial performance evaluation judgments. We find that
relative performance evaluations are affected by organizing the measures into the balanced
scorecard categories. Further, when scorecards for two divisions contain some common and
some unique measures, performance evaluations are affected only by the common measures.

Key Words: Balanced scorecard, performance evaluation, nonfinancial performance measures,
information organization, chunking.

I. Introduction
The balanced scorecard (BSC) was first developed by Robert Kaplan and David Norton
as a mechanism to complement traditional financial measures of business unit performance
(Kaplan and Norton 1992). The diverse set of measures on the BSC spans the areas of financial
performance, customer relations, internal business processes, and the organization's learning and
growth activities (Kaplan and Norton 1992). This large set of measures is designed to capture
the firm's desired business strategy (Kaplan and Norton 1993, 1996a) and to include drivers of
performance in all areas of importance to the firm.
The stated purpose in developing a managerial tool which includes information beyond
financial measures is to improve managerial decision making. While determining whether the
BSC improves managers' judgments and decisions can be difficult, a reasonable starting point is
to determine whether the BSC changes managers' judgments and decisions. In this study we
explore whether and how the BSC affects judgments given common human information
processing characteristics and limitations. Specifically, we investigate the judgmental effects of
the BSC's (1) information organization, (2) inclusion of nonfinancial measures, and (3) inclusion
of measures that are unique to a particular business unit versus those that are common to multiple
units.
The four-category organization of the BSC is likely to affect managers' use of the large
volume of measures included in the scorecard by suggesting a way to combine and simplify the
data. Information organization has been shown to affect auditors' going-concern judgments
(Ricchiute 1992), information search in a personnel evaluation task (Wang 1995), and recall of

internal controls (Frederick 1991). Similarly, the information organization in the BSC format is
likely to affect managerial performance evaluations made with the scorecard.
Kaplan and Norton (1996a) argue that because the BSC reduces reliance on financial
measures its use will lead to better business decisions. Business people, however, often view
financial performance as the ultimate test of the success of the firm (Newman 1991). Therefore,
the nonfinancial measures on the scorecard may be unused by managers (Ittner, Larcker and
Meyer 1998) leading to decisions based solely on financial measures. Judgment and decision
making research suggests, however, that making measures readily available in an easy to process
format will encourage their use (e.g., Tversky and Kahneman 1973).
Finally, measures on the BSC are specifically chosen for each business unit; therefore,
there may be only a few measures that all subsidiaries or units have in common (Kaplan and
Norton 1996b). Judgment and decision making research suggests that decision makers place
great weight on common measures (Slovic and MacPhillamy 1974) at the expense of others.
Therefore, it is possible that the unique measures designed for each unit might be underused or
even ignored in managers' judgments.
The remainder of the paper is organized as follows. In the next section we will briefly
describe the BSC and its use, as envisioned by Kaplan and Norton (1996b). In the third section
we review the judgment and decision making literature applicable to the study of the BSC and
state our research hypotheses. Section four describes the experimental work used to test the
hypotheses and our results, and the final section summarizes the conclusions which can be drawn
from the study.

II. The Balanced Scorecard
In a best-selling book Kaplan and Norton (1996b) provide a blueprint for organizations
interested in implementing a BSC. In the interests of brevity, only three important aspects of the
scorecard will be described here: the measures included, the strategic use of the scorecard, and
the tie to performance evaluation and compensation.
The BSC, according to Kaplan and Norton, should contain measures related to financial
performance, customer relations, internal business processes, and measures related to learning
and growth in the organization. The specific measures chosen for each business unit in the
organization will likely differ somewhat as they should be tailored to that unit.
Financial measures can include the traditional general measures such as return on assets
and net income; however, Kaplan and Norton emphasize choosing measures particularly relevant
to the business unit (e.g., revenues per employee for a sales unit or research and development
expense/sales for a pharmaceutical division). Measures related to customers include results of
customer surveys, sales from repeat customers, and customer profitability. Internal business
process measures relate specifically to the operational processes of the business unit. For
example, a petroleum distributor may measure investment in new product development and
dealer quality (Kaplan and Norton 1996b, 111-113). The final set of performance measures,
those related to learning and growth, are likely the most difficult to select. Kaplan and Norton
(1996b, 127) suggest measures related to employee capabilities, information systems
capabilities, and employee motivation and empowerment.
Kaplan and Norton (1993, 1996a) view the scorecard as a strategic management tool that
should explicate the drivers of performance, as well as provide measures of performance. The
BSC can provide early benefits to the organization due to the process of determining the causal

drivers of a unit's success during construction of the scorecard (Kaplan and Norton 1996b, 148).
In addition, the scorecard is expected to provide continuing benefits to the organization as
management uses the scorecard in evaluation and decision making. This study investigates
issues related to the attainability of these evaluation and decision making benefits.
Kaplan and Norton (1996b, 217-223) suggest there is a problem with continuing to tie
compensation and evaluation to traditional measures while asking subordinates to focus on the
scorecard measures. They state "[t]hat the balanced scorecard has a role to play in the
determination of incentive compensation is not in doubt" (1996a, 82). However, they are
reluctant to provide specific recommendations regarding how to link the BSC to compensation.
In light of Kaplan and Norton's reticence on the issue of compensation, our experiments provide
evidence regarding use of the BSC for performance measurement and evaluation purposes, rather
than for compensation purposes. Use of the BSC for performance measurement and evaluation
is described by Kaplan and Norton (1996a) as part of the "communicating and linking" process,
the second of four processes in BSC implementation.
The BSC is a relatively complex measurement system. In order to determine the BSC's
potential benefits and costs to the organization, it is important to consider how it will interact
with the cognitive capabilities and characteristics of managers. The next section reviews
judgment and decision making research which provides insights regarding managers' abilities to
process and use the information found on the BSC.
III. Judgment and Decision Making Research
Information organization
Several areas of research have investigated the impact of information organization on
cognition. First, research on learning and memory indicates that schematically organized

information is recalled better than information without this organization (Frederick 1991;
Rabinowitz and Mandler 1983). In addition, more experienced people can recall a larger amount
of information than less experienced people (Chase and Simon 1973; Baddeley 1994;
MacGregor 1987); this is attributed to the experienced person's ability to combine (or "chunk") a
number of pieces of information and treat the combination as one entry in working memory
(Servan-Schreiber and Anderson 1990). Thus, information organization both affects learning
and memory and is affected by experience and knowledge.
Additionally, accounting researchers have investigated the judgment and decision making
effects of organizing financial information in different ways. Blocher and Davis (1990) found
that when business students had a large number of cues to use in categorizing business invoices
as high or low risk for error, presenting the cues in a table led to greater decision accuracy than
presenting the cues graphically. Other accounting studies of tabular and graphical display were
done by Kaplan (1988), Blocher, Moffie, and Zmud (1986), and Moriarity (1979). In addition to
the impact of display type, financial statement placement of particular items has also been shown
to affect users' judgments. Hopkins (1996) showed that placement of hybrid securities in the
liability section, equity section, or mezzanine section had an effect on financial analysts' stock
price judgments. Similarly, Hirst and Hopkins (1998) showed that presenting comprehensive
income in the income statement affected financial analysts' stock price judgments differently
than presenting the information in the statement of changes in equity. The specific placement of
particular pieces of information affected the use of the information.
As shown in the studies cited above, the organization of information can affect the use and
processing of that information. The BSC logically organizes a large volume of measures into
four categories. This organization should suggest to the user a simplified processing strategy

whereby the four categories, as opposed to the large volume of specific measures, are used in
evaluating performance. This is likely to affect the evaluations made since this simplification
leads to the use of only four combined (or "chunked") cues instead of the larger number, which
may be cognitively difficult to handle. In a classic study, Miller (1956) showed that people are
only able to handle seven plus or minus two items in working memory at any point in time.
Related studies in accounting confirm that providing a greater quantity of cues often overloads
decision makers, leading to judgments of lower quality (Chewning and Harrell 1990; Iselin 1988,
1993).
Thus, we expect that organizing performance measures into the four BSC categories will
affect managerial evaluations. For example, if a manager is performing poorly on a number of
measures, this may be evaluated differently if it is clear that those measures are components of
only one of the four BSC categories. Kaplan and Norton (1996b) suggest that while many
performance measures have been collected by organizations in the past (e.g., see calls for the use
of multiple performance measures by Nanni, Dixon and Vollmann (1990)), these multiple
measures were not sufficiently organized to be useful. We propose the following hypothesis,
stated in alternate form:
H1. Evaluations using the balanced scorecard format will differ
from evaluations based on the same measures without the
scorecard format (organization hypothesis).
Availability of nonfinancial measures
As indicated previously, the BSC includes a diverse set of performance measures,
including both financial and nonfinancial measures. Financial measures have traditionally been
used in performance evaluation. Indeed, Ittner, Larcker, and Rajan (1997) show that financial

measures are the sole measures used in determining executive rewards in over 60% of the firms
they studied. Furthermore, Newman (1991, 11) exhorts managers to abandon methods which use
nonfinancial measures as they "distort incentives and change behavior, often in undesirable
ways." Schein (1996) finds that CEOs believe that financial measures are the most important
measures of performance. Thus, contrary to the hopes of Kaplan and Norton (1996a), the BSC's
inclusion of nonfinancial measures may not affect managers' judgments if they continue to focus
on the financial measures.
Judgment and decision making research indicates, however, that the availability of
information often leads to its use (Tversky and Kahneman 1973). Joyce and Biddle (1981), for
example, showed that when asked to estimate the prevalence of fraud, auditors were affected by
irrelevant anchors such as the reference points included in the question. Auditors who were
asked whether the incidence of fraud was more than 0.1% subsequently estimated fraud
prevalence lower than those who were first asked whether the incidence of fraud was more than
20%. Thus, it appears that providing information can lead decision makers to attempt to use it
(see also, Hackenbrack 1992).
As the BSC provides nonfinancial measures in an accessible package, it is likely that
evaluators will utilize this information. In fact, Schiff and Hoffman (1996) provided financial
and nonfinancial performance measures to a group of operations and finance executives in a lens
model experiment and found that most participants appeared to place some weight on both types
of measures in evaluating departments and managers. Further, in a BSC field study, Ittner,
Larcker and Meyer (1998) found that nonfinancial performance measures added explanatory
power to financial performance measures in explaining differences in bank managers'

performance evaluations. Thus, due to the effect of making such information readily available,
we hypothesize that:
H2. Nonfinancial measures in the balanced scorecard will affect
performance evaluations (nonfinancial measures hypothesis).
Common and unique measures
Each business unit in the organization will have its own BSC. Units at the same
organizational level may have some common measures in addition to other measures which are
unique to their business unit and strategy. Previous judgment and decision making research
suggests that people process common and unique information in predictable ways. Slovic and
MacPhillamy (1974) show that when two alternatives have a common attribute along with
unique attributes, the common one is weighted more in judgments about the alternatives. For
example, if a university admissions committee knows the grade point averages of two
candidates, along with a list of extracurricular activities of one and a standardized test score of
the other, the grade points will likely determine rankings or choices between the two candidates.
Stone and Schkade (1991) indicate that even the commonality of the scales on which the
attributes are measured affects the strategies of decision makers and their use of the attributes.
Payne, Bettman, and Johnson (1993) posit that people are adaptive in their decision making
strategies. That is, people choose simplifying decision strategies in response to the specific
information set available in the task or environment. Reliance on common attributes is one such
simplifying strategy. This adaptive behavior is not necessarily conscious, that is, people may use
and process information in a simplified manner without awareness of doing so.
The studies cited above relate directly to choice or ranking tasks; the evaluation of
managers may not initially appear to fit this category. However, when multiple managers are

being evaluated, the evaluation task is implicitly, if not explicitly, comparative. For this reason,
we predict that:
H3. Performance evaluations using the balanced scorecard will be
more affected by common measures than by unique measures
(common measures hypothesis).
In this section, prior judgment and decision making research has been used to advance
three hypotheses related to use of the BSC. The next section describes the experiments and the
results of tests of hypotheses 1 to 3.
IV. Method and Results
Overview of experiments
The experiments are set in a common context and follow a similar procedure.
Participants are presented with a case where they are asked to take the role of a senior executive
of WCS Incorporated, a firm specializing in retailing women's apparel. WCS has multiple
divisions, the two largest of which are the focus of the case materials. The WCS mission
statement is quoted, the managers of the two business units are introduced, and the strategies of
the individual business units are described. Multiple performance measures are presented in a
variety of combinations or formats depending on the experimental treatment as described below.
The participant is then asked to evaluate the performance of each of the two unit managers on a
scale with seven descriptive labels and numerical endpoints of "0" and "100" (see Table 1 for a
sample evaluation form).
-------Please place Table 1 about here-------

After providing the manager evaluations, the participants completed a questionnaire.
This asked the participants to rank the two divisional managers, asked for demographic

information, provided manipulation checks (discussed further in the results below), and gathered
data regarding task difficulty, realism, and understandability.
In experiment one the two divisions described are RadWear and PlusWear, retail
divisions specializing in clothing for the urban teen-ager and in large-sized clothing,
respectively. The strategies of the two divisions are described as follows.
RadWear's management determined that its growth must
take place through an aggressive strategy of opening new stores.
RadWear also determined that it must increase the number of
brands offered to keep the attention and capture the clothing
dollars of its teenage customers. RadWear concluded that its
competition radius is fairly small due to the low mobility of young
teens.
PlusWear's management decided to grow its sales through
expanding the range of apparel in its stores. Thus, sportswear was
added and the accessory line was increased. PlusWear also
decided to concentrate on a few high and mid-quality brand names
so that its mature shoppers would be familiar and comfortable with
the brands offered. Each PlusWear store views its main
competition as those stores within a 15-mile radius that offer large-sized women's wear.
The performance measures for each division are appropriate for retailers and capture these
strategies.
Experiment one provides evidence related to H1 and H2 (i.e., the organization hypothesis
and the nonfinancial measures hypothesis). Experiment two is designed to test H3, the common
measures hypothesis. Although experiment two also includes RadWear, a different second
division is used (WorkWear). Further details regarding experiment two will be provided later.
Experiment One
In experiment one we focus on whether the BSC format makes a difference in divisional
manager performance evaluation (H1). To investigate this the performance measure presentation
format is manipulated while information content and volume are held constant. Evidence related

to the impact of nonfinancial measures (H2) is also provided by having performance on
nonfinancial measures differ across two divisions within a firm.
Subjects
Sixty-four first year MBA students at a private university and fourteen first and second
year MBA students at a public university served as experimental participants. There were no
significant differences in the demographics or responses of the two groups of students so their
responses are combined. The students had, on average, four years of work experience and 62%
were male.
Design and procedure
All participants received a diverse set of performance measures, a description of how the
measures were calculated, and the comparison of each measure to its expectation or target for
each of the two divisions (see Table 2 for the BSC version of the task). Further, all participants
were told that the performance measures were "carefully chosen to represent important aspects
of a business unit['s] performance" and were "drivers of the unit's success and linked to its
strategy and mission."
-------Please place Table 2 about here-------

The between-subjects (Ss) manipulation was the organization of the performance
measures. The organization hypothesis (H1) suggests that the BSC's organization provides a
natural chunking or combination which may help decision makers process and use a large
amount of data. Therefore, the BSC group received the twenty measures divided into the four
BSC categories (financial measures, customer satisfaction measures, operational measures, and
learning measures) while other participants received the same set of twenty measures without the
BSC format (NOFORM group). For the NOFORM group the measures were presented in one of

two orders, alphabetical or random. In addition to the format manipulations across subject
groups, the order of presentation of the two divisions was counterbalanced across subjects within
each group (BSC and NOFORM).
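The resulting design cells can be enumerated in a short sketch. This is illustrative only: the cell labels and the round-robin assignment rule are our assumptions, not the authors' actual procedure.

```python
from itertools import cycle, product

# Between-Ss factor: measure format (BSC, or NOFORM in one of two orders).
formats = ["BSC", "NOFORM-alphabetical", "NOFORM-random"]
# Counterbalanced within each format group: which division appears first.
division_orders = [("RadWear", "PlusWear"), ("PlusWear", "RadWear")]

# Every combination of format and division order is one design cell.
cells = list(product(formats, division_orders))  # 6 cells in total

def assign(participant_ids):
    """Rotate participants through the cells so each cell fills evenly."""
    return {pid: cell for pid, cell in zip(participant_ids, cycle(cells))}

assignment = assign(range(12))  # 12 participants cover each cell twice
```

A rotation like this keeps cell sizes balanced as participants arrive, which is one common way to implement counterbalancing.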
For all participants, the financial measures indicated that performance was somewhat
above expectations for both divisions (note in Table 2 that two financial measures were above
targets, one below target, and two on target). Further, for all participants, one division was
somewhat above expectations in its customer related measures and the second division was
somewhat below expectations in the customer related measures (note that Table 2 shows four
RadWear customer measures better than target and four PlusWear worse than target). The two
remaining groups of measures (internal business processes and learning and growth) were
approximately at expectations for all participants (note that Table 2 shows one measure above
target, one below target and three on target). Therefore, there was one within-subjects
manipulation: the divisions being above or below the customer-related performance measure
targets. This manipulation allows for testing of H2, the nonfinancial measures hypothesis.
Subsequent to the evaluation task, the participants were asked to identify and rank the five
measures that most influenced their evaluations of the managers.
Dependent Measure
All subjects evaluated each manager using the evaluation form and scale shown in Table
1. Although there is no normative model for performance evaluation, our stimuli provide a
situation where there is an appropriate ordering of such scores. Thus, our analyses focus on the
relative evaluations of the two managers. Specifically, we expect that there will be a main effect
for division, showing that differential divisional performance on the nonfinancial measures
affects their managers' relative evaluations. Additionally, we expect an interaction of

organization and division, showing that information organization affects these relative
evaluations. Since judgments are strongly affected by comparison cases (Hsee 1996, 1998), it is
unlikely that information organization would push all judgments up or all judgments down;
rather, if information organization affects information processing it is likely to affect the
comparative or relative judgments.
Results
Checks on the effectiveness of the manipulations revealed that participants receiving the
BSC format felt that the performance measures were more logically organized and usefully
categorized than those receiving the NOFORM performance measures (both p-values < 0.01).
No other differences were noted for these groups for questions regarding difficulty of the task,
emphasis on financial measures, or comprehensiveness of measures provided (all p-values >
0.10). Within the NOFORM group, no differences were found for subjects with the alphabetic
versus the random order for any of these questions (all p-values > 0.10). Also, the order of the
presentation of divisions had no effects on responses to the manipulation check questions (all
p-values > 0.10). Although it was not related to the hypotheses, division order did interact with
division in affecting performance evaluations (F=10.57, p<.002). Thus, order is included in the
statistical analysis but it is not discussed further.
Analyzing the individual manager evaluations via a repeated measures 2 x 2 x 2 Analysis
of Variance (ANOVA) with scorecard organization and division order as between-Ss factors and
division as a within-Ss factor (see Table 3), indicates statistically significant effects for division
(F=97.14, p<.001) and the interaction of division and organization (F=5.70, p<.02). These
results show that RadWear's manager is evaluated higher than PlusWear's and that the
scorecard's organization affects the relative evaluations of the two divisional managers.
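In a mixed design like this one, the division-by-organization interaction amounts to comparing each evaluator's RadWear-minus-PlusWear difference score across the BSC and NOFORM groups. The following sketch uses made-up evaluation pairs, not the study's data, to show the computation:

```python
from statistics import mean, variance

def difference_scores(pairs):
    """Within-subject division effect: each evaluator's RadWear rating
    minus that same evaluator's PlusWear rating."""
    return [rad - plus for rad, plus in pairs]

def pooled_t(a, b):
    """Two-sample pooled-variance t statistic on the difference scores;
    a reliably nonzero value corresponds to the interaction of the
    between-Ss factor (format) with the within-Ss factor (division)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

# Made-up (RadWear, PlusWear) evaluation pairs -- illustrative only.
bsc_pairs = [(70, 53), (68, 50), (72, 55), (66, 52)]
noform_pairs = [(74, 50), (70, 45), (76, 52), (72, 48)]

d_bsc = difference_scores(bsc_pairs)
d_noform = difference_scores(noform_pairs)
t_stat = pooled_t(d_noform, d_bsc)
```

The sign and size of `t_stat` track whether the presentation format changes the relative evaluation of the two managers, which is the effect H1 concerns.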

-------Please place Table 3 about here-------

The interaction of organization and division supports H1, showing that the organization of
information affects the relative evaluations of the managers. These relative performance
evaluations are different when the performance measures are organized into the BSC categories
versus provided without this organization. As shown in Panel B of Table 3, participants with the
four-category organization of measures evaluated RadWear's manager 17.27 points higher than
PlusWear's while participants with the NOFORM format evaluated RadWear's manager 24.29
points higher than PlusWear's. Perhaps PlusWear's inferior performance seems broader and
more systemic when the measures are provided in a random list, whereas the BSC shows that
PlusWear is only inferior on one (albeit important) dimension. Since the BSC format affects the
relative evaluations of the managers, it appears that the BSC in some way changes the processing
of the multitude of measures included, perhaps integrating them into four chunks.

Hypothesis 2 predicts that nonfinancial measures included in the BSC will affect manager
evaluations. Using only the subjects in the BSC condition of experiment one, an ANOVA shows
that the evaluations of the two divisional managers are significantly different (F=26.15, df=1,42,
p<.01). As shown in Table 3, Panel B, the BSC subjects provided mean (standard deviation)
evaluations for RadWear of 69.77 (14.21) and for PlusWear of 52.50 (16.93). Hypothesis 2 is
predicated on the idea that the availability of the nonfinancial measures leads to their use; not
surprisingly then, since all subjects in experiment one were presented with all measures,
divisional differences due to customer related measures also had a significant impact in the
overall ANOVA (F=97.14, df=1,74, p<.01). This overall judgmental effect of making
nonfinancial measures available is also consistent with H2, the nonfinancial measures
hypothesis.

Experiment one provides evidence on the importance of the BSC organizational format.
This organization had an effect on relative performance evaluations. In addition, nonfinancial
measures were found to affect judgments of managers' performance.
Experiment Two
In the second experiment, we concentrate on the issue of common and unique measures.
Since the BSC requires that performance measures be chosen to specifically address the
operations and concerns of each business unit, it is likely that the measures chosen for the units
will include some items that are common to all units and others that are unique to each. The
common and unique measures are manipulated in this experiment.
Once again the stimuli case described two divisions of WCS; RadWear (the urban teen
clothing retailer) was one of these. However, the second division must differ substantially from
the first in order to logically include a significant number of distinct performance measures. For
this reason, a second retailing division was not desirable. Instead, the second division was
WorkWear, a division selling work uniforms through catalogs and direct sales calls. The
operations and strategy of WorkWear were described as follows.
WorkWear sells its product through direct sales contact
with business clients. Thus, WorkWear's customers are the
business managers (often personnel managers) who decide on their
firm's uniform supplier. These managers are busy professionals
with many responsibilities. They generally want to spend little
time on the choice and purchase of uniforms but have high
standards for durability and cleaning-ability. When these
managers have changes in their work force they often need new
uniforms in a short time frame.
Although WCS has historically focused on women's
clothing, WorkWear's management decided to grow its sales by
including a few basic uniforms for men. It is expected that this
will make WorkWear a more attractive supplier for businesses who
want to purchase uniforms from a single supplier. WorkWear also
decided to print a catalog so that clients could place some orders
without a direct sales visit, particularly for repeat or replacement

orders; this should help to retain some sales which might otherwise
be lost due to time considerations.
Subjects
Fifty-eight full- and part-time first year MBA students at a public university served as
experimental participants. The students had, on average, more than five years of work
experience and 63% were male.
Design and procedure
The experiment used a 2 x 2 between subjects design, in conjunction with a 2-level
within-Ss factor similar to that used in experiment one (i.e., the complete design is 2 x 2 x 2).
The first independent factor indicates the particular pattern of performance for the two business
units when considering their common measures. Thus, RadWear could have better performance
on the common measures than WorkWear (COM-Rad) or WorkWear could outperform
RadWear on the common measures (COM-Work). Similarly, the second factor indicates the
particular pattern of performance for RadWear and WorkWear when considering their unique
measures. So RadWear could have better performance on its unique measures than WorkWear
has on its unique measures (UNIQ-Rad) or vice versa (UNIQ-Work). Each subject evaluated
managers for both divisions; this is the within-Ss factor.
Sixteen-measure balanced scorecards were designed for the two divisions of WCS
Incorporated. Four performance measures were used in each of the BSC categories; two of the
measures in each category were used for both divisions (i.e., were common across divisions)
and the other two were uniquely designed for the division. A list of the performance measures
used is presented in Table 4. For all measures, each division performed better than its target.
The percentage above target, however, varied in the design as indicated above. All data were
carefully designed so the common and unique items had the same excess performance. For

19
example, in COM-Rad the first common financial measure (i.e., Return on Sales) was 8.33%
above target for RadWear and 4.17% above target for WorkWear. Similarly, in UNIQ-Rad, the
first unique financial measure was 8.33% above target for RadWear and 4.25% above target for
WorkWear. Although the exact percentages varied slightly due to rounding, even these small
variations were counterbalanced and controlled. 9 Further, the percent better than target,
calculated to the second digit, was included as a column in the exhibits presented to the
participants (i.e., exhibit columns included the measures name or label, the target, the actual,
and percent better than target).
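The "percent better than target" column can be sketched as follows. The function name is ours, and the target/actual figures are illustrative values chosen to reproduce the 8.33% and 4.17% excesses mentioned above:

```python
def pct_better_than_target(actual, target):
    """Excess performance as a percentage of target, to two digits.

    Illustrative helper (not from the paper): with a target of 24%,
    an actual of 26% is 8.33% above target and an actual of 25% is
    4.17% above target.
    """
    return round((actual - target) / target * 100, 2)

print(pct_better_than_target(26, 24))  # 8.33
print(pct_better_than_target(25, 24))  # 4.17
```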
-------Please place Table 4 about here-------

Dependent Measure
As in experiment one, all subjects evaluated each manager using the evaluation form and
scale shown in Table 1. Also as in experiment one, the relative evaluations are the measures of
interest. Here we want to determine whether relative performance on common and unique
measures affects the evaluations of the managers. If common measures affect these evaluations,
this will result in an interaction of division and common measures. If unique measures affect
these relative evaluations, this will result in an interaction of division and unique measures.
Results
A manipulation check showed that the participants recognized that different performance
measures were employed by the two divisions (p<0.0001). Further manipulation checks
showed that participants agreed that the two divisions sold to different markets (p<0.0001) and
that it was appropriate for the divisions to employ different performance measures (p<0.0001).
No variation in manipulation check results was found across experimental treatments. In
addition, there were no differences across experimental treatments for measures of usefulness of
categorization on the BSC, ease of understanding, case difficulty, case realism, and the degree of
emphasis placed on financial measures (all p-values > 0.10).
A 2 x 2 x 2 repeated measures ANOVA was performed to test H3. The results are
presented in Panel A of Table 5. The only statistically significant effect was due to the
interaction of common measures and division (F=30.69, df=1,54, p < 0.01), indicating that the
pattern of performance on common measures affected the managers' relative evaluations while
the pattern for unique measures did not. 10 Panel B of Table 5 indicates that when common
measures favored RadWear, RadWear's manager was evaluated 6.05 points higher than
WorkWear's manager. Similarly, when common measures favored WorkWear, WorkWear's
manager was evaluated 7.17 points higher than RadWear's manager. In contrast, when unique
measures favored RadWear (WorkWear), there was little difference in the evaluations of the
managers: a mean difference of 0.64 (1.76). 11 Thus, the results corroborate the judgment and
decision making research finding that common measures dominate unique measures in their
judgmental effect. 12
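The logic of this result can be illustrated with the difference-score approach described in footnote 10. This is a minimal sketch with invented, noise-free responses (not the paper's data): the mean of (RadWear evaluation minus WorkWear evaluation) flips sign with the common-measures pattern but stays flat across the unique-measures pattern.

```python
import statistics

# (common pattern, unique pattern, RadWear eval, WorkWear eval) -- invented
data = [
    ("COM-Rad",  "UNIQ-Rad",  74, 68),
    ("COM-Rad",  "UNIQ-Work", 73, 67),
    ("COM-Work", "UNIQ-Rad",  70, 77),
    ("COM-Work", "UNIQ-Work", 71, 78),
]

def mean_diff(rows):
    """Mean of (RadWear evaluation - WorkWear evaluation) over rows."""
    return statistics.mean(r - w for _, _, r, w in rows)

by_common = {lvl: mean_diff([d for d in data if d[0] == lvl])
             for lvl in ("COM-Rad", "COM-Work")}
by_unique = {lvl: mean_diff([d for d in data if d[1] == lvl])
             for lvl in ("UNIQ-Rad", "UNIQ-Work")}
print(by_common)  # sign flips with the common-measures pattern
print(by_unique)  # essentially flat across the unique-measures pattern
```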
-------Please place Table 5 about here-------

V. Limitations and Conclusion
Our experimental design has several limitations. First, our experimental participants
were not involved in the choice of measures to be included in the BSC. Thus, we are unable to
investigate the effect of such involvement. Greater involvement would likely lead to greater
reliance on the BSC; however, our experiments do not appear to suffer from a lack of such
reliance. A second limitation is that our participants were novices in the use of the BSC and did
not necessarily have business experience in the retail sector from which we pulled much of our
case materials. However, the MBA students' work experience and academic preparation
certainly provided them with prior opportunities to be subject to, and to use, performance
measures. Further, they were familiar enough with the retail business to indicate an
understanding of the appropriateness of the plans and measures used in the case. Although the
effects we tested relate to basic issues of cognition, we, of course, do not know whether or how
further experience may affect the observed effects.
Since there is no accepted normative model for performance evaluation, it is impossible
to assess the accuracy of our participants' evaluations. Therefore, we make no claims as to
whether judgments are better or worse with the scorecard; we can only test whether they are
different. Further, we note that there is wide variation in the evaluations provided by our
subjects. This is, perhaps, a result of the lack of a normative model for such evaluations. The
issue of whether the balanced scorecard leads to improved managerial evaluations is left for
future research; however, a method to assess the quality of such judgments is a prerequisite for
such a study.
We investigated the effect of known human information processing characteristics on use
of the balanced scorecard. The organization provided by the balanced scorecard affected the
relative judgments made. Our results also indicate that managerial evaluations were affected by
nonfinancial performance measures included in the scorecard. This provides support for Kaplan
and Norton's claim that the balanced scorecard may reduce management's focus on financial
measures. Indeed, in a study of a large bank system, Ittner, Larcker, and Meyer (1998) found
that use of the balanced scorecard led to improvements in nonfinancial measures. However, they
also found that profitability and performance on some other financial measures declined with use
of the scorecard. Thus, whether the use of the BSC leads to improved decision making is still in
question.

Our second experiment found that unique measures received little attention in the
evaluation of managers. It appears that the experimental participants succumbed to the
predictable simplifying strategy of concentrating on common measures in evaluating multiple
managers. This strategy undermines one of the major espoused benefits of the BSC, namely, that
each business unit will have a scorecard which uniquely captures its business strategy. If the
unique measures are underweighted in ex post evaluations of the business unit and its manager,
these measures are likely to receive little ex ante weight in decision making within the unit.
Interestingly, Johnson (1984) showed that when faced with noncomparable alternatives, people
tend to compare the alternatives on higher-level, more abstract or general attributes. Such
higher-level attributes are readily available in the BSC in that managers and units could be
compared on the general attributes of financial measures or customer-related measures rather
than on specific unique measures. Future research could determine whether this decision
strategy is used with the BSC.
The balanced scorecard has received significant attention in the business press. A survey
by Towers Perrin (1996, reported in Ittner, Larcker, and Meyer 1998) found 57 responding
organizations using the BSC. Despite this widespread attention and growing use, the human
information processing demands of the BSC have received little consideration. We investigated
the impact of the BSC's information organization and inclusion of diverse measures on
performance judgments. Many other judgment issues deserve research attention. For example,
how are trade-offs made across BSC categories, how are the individual measures weighted, and
how does the differential reliability and objectivity of measures affect the weight placed on the
measures? It may also be interesting to study the use of the BSC in group decision making, as it
is certainly designed to be used by both individuals and groups. We encourage further research
in these areas.

References
Baddeley, A. 1994. The magical number seven: Still magic after all these years? Psychological
Review (April): 353-356.
Blocher, E., and C. Davis. 1990. Presentation format and information load effects on judgment
and recall in a risk analysis task. In Accounting, Communication, and Monitoring, S.
Moriarity (ed.), University of Oklahoma Press: 138-157.
Blocher, E., R. Moffie, and R. Zmud. 1986. Report format and task complexity: Interaction in
risk judgments. Accounting, Organizations, and Society 11(6): 457-470.
Chase, W. G., and H. Simon. 1973. The mind's eye in chess. In Visual Information Processing,
	edited by W. G. Chase. New York: Academic Press.
Chewning, E., and A. Harrell. 1990. The effect of information load on decision makers' cue
utilization levels and decision quality in a financial distress decision task. Accounting,
Organizations and Society 15 (6): 527-542.
Dempsey, S., J. F. Gatti, D. J. Grinnell, and W. L. Cats-Baril. 1997. The use of strategic
performance variables as leading indicators in financial analysts' forecasts. The Journal
of Financial Statement Analysis (Summer): 61-79.
Frederick, D. 1991. Auditors' representation and retrieval of internal control knowledge. The
Accounting Review: 240-258.
Hackenbrack, K. 1992. Implications of seemingly irrelevant evidence in audit judgment.
Journal of Accounting Research (Spring): 126-136.
Hirst, E., and P. Hopkins. 1998. Comprehensive income reporting and analysts' valuation
judgments. Journal of Accounting Research (supplement): forthcoming.

Hopkins, P. 1996. The effect of financial statement classification of hybrid financial instruments
on financial analysts' stock price judgments. Journal of Accounting Research
(supplement): 33-50.
Hsee, C. 1996. The evaluability hypothesis: An explanation for preference reversals between
joint and separate evaluations of alternatives. Organizational Behavior and Human
Decision Processes 67(3): 247-257.
Hsee, C. 1998. Less is better: When low-value options are valued more highly than high-value
options. Journal of Behavioral Decision Making 11(2): 107-121.
Iselin, E. 1988. The effects of information load and information diversity on decision quality in a
structured decision task. Accounting, Organizations and Society 13 (2): 147-164.
Iselin, E. 1993. The effects of the information and data properties of financial ratios and
statements on managerial decision quality. Journal of Business Finance & Accounting 20
(2): 249-266.
Ittner, C., D. Larcker, and M. Meyer. 1998. The use of subjectivity in multi-criteria reward
systems. Wharton Working Paper.
Ittner, C., D. Larcker, and M. Rajan. 1997. The choice of performance measures in annual bonus
contracts. The Accounting Review 72 (April): 231-255.
Johnson, M. 1984. Consumer choice strategies for comparing noncomparable alternatives.
Journal of Consumer Research 11: 741-753.
Joyce, E., and G. Biddle. 1981. Anchoring and adjustment in probabilistic inference in auditing.
Journal of Accounting Research (Spring): 120-145.
Kaplan, R., and D. Norton. 1992. The balanced scorecard - measures that drive performance.
Harvard Business Review (January-February): 71-79.

Kaplan, R., and D. Norton. 1993. Putting the balanced scorecard to work. Harvard Business
Review (September-October): 134-147.
Kaplan, R., and D. Norton. 1996a. Using the balanced scorecard as a strategic management
system. Harvard Business Review (January-February): 75-85.
Kaplan, R., and D. Norton. 1996b. The Balanced Scorecard. Boston, MA: Harvard Business
School Press.
Kaplan, S. 1988. An examination of the effect of presentation format on auditors' expected value
judgments. Accounting Horizons 2(3): 90-95.
MacGregor, J. N. 1987. Short-term memory capacity: Limitation or optimization?
Psychological Review. (January): 107-108.
Miller, G. 1956. The magical number seven, plus or minus two: Some limits on our capacity for
information processing. The Psychological Review (March): 81-96.
Moriarity, S. 1979. Communicating financial information through multidimensional graphics.
Journal of Accounting Research 17(1): 205-224.
Nanni, A., R. Dixon, and T. Vollmann. 1990. Strategic control and performance measurement.
Journal of Cost Management 4 (Summer): 33-42.
Newman, G. 1991. The absolute measure of corporate excellence. Across the Board (October):
10-12.
Payne, J., J. Bettman, and E. Johnson. 1993. The Adaptive Decision Maker. Cambridge:
Cambridge University Press.
Rabinowitz, M., and J. Mandler. 1983. Organization and information retrieval. Journal of
Experimental Psychology: Learning, Memory, and Cognition 9(3): 430-439.

Ricchiute, D. 1992. Working-paper order effects and auditors' going-concern decisions. The
Accounting Review (Jan): 46-58.
Schein, E. 1996. Three cultures of management: The key to organizational learning. Sloan
Management Review (Fall): 9-20.
Schiff, A., and L. R. Hoffman. 1996. An exploration of the use of financial and nonfinancial
measures of performance by executives in a service organization. Behavioral Research
in Accounting 8: 134-153.
Servan-Schreiber, E., and J. R. Anderson. 1990. Learning artificial grammars with competitive
chunking. Journal of Experimental Psychology: Learning, Memory, and Cognition 16
(4): 592-608.
Slovic, P., and D. MacPhillamy. 1974. Dimensional commensurability and cue utilization in
comparative judgment. Organizational Behavior and Human Performance 11: 172-194.
Stone, D., and D. Schkade. 1991. Effects of attribute scales on process and performance in
multiattribute choice. Organizational Behavior and Human Decision Processes 59: 261-287.
Towers Perrin. 1996. Inside the balanced scorecard. Compuscan Report, January, p. 1-5.
Tversky, A. and D. Kahneman. 1973. Availability: A heuristic for judging frequency and
probability. Cognitive Psychology 5: 207-232.
Wang, Z. 1995. Task constraints and user-system interaction process under personnel decision
support. Ergonomics 39 (5): 1049-1056.

Footnotes
1. The case was developed by the authors, one of whom has prior experience with
apparel retailing. The Kenyon Stores example discussed by Kaplan and Norton (1996b)
provided preliminary guidance on possible performance measures.
2. The mission statement says, "We will be an outstanding apparel supplier in each of the
specialty niches served by WCS." This kind of vague, but inspirational, mission statement is
discussed by Kaplan and Norton (1996b).
3. Participants received separate exhibits for RadWear and PlusWear. Measures for both
divisions are included in Table 2 for efficiency of exposition.
4. The order of measures for the latter was chosen by random draw with the only proviso
that adjacent measures should not come from the same BSC category.
5. Tests of ANOVA model assumptions indicate no problems except for nonnormality of
the error terms (p<.01). This is not unusual for a study including repeated measures. Although
the reported F tests are quite robust to this nonnormality, we also ran an additional ANOVA
using the difference in performance evaluations of the managers of the two divisions as the
dependent variable (RadWear's evaluation minus PlusWear's evaluation). For this analysis, all
ANOVA assumptions are met and the results indicate that participants receiving the BSC format
evaluated the relative performance of the managers differently than those who received the same
measures without the BSC format (F=5.70, p<0.02). This corroborates the results of the repeated
measures analysis.
6. Sixty-four of the subjects provided memos explaining their evaluations. The
NOFORM subjects mentioned, on average, 22.6 individual performance measures in these
memos. They also referred to (self-generated) groups of measures 1.1 times on average. Thus,

about 95% of the measures mentioned by these subjects were individual measures. Interestingly,
the BSC subjects referred to an average of 18.7 individual measures and 8.1 groups of measures
(self-generated or provided). Thus, about 30% of the measures mentioned were in groups.
7. Subjects also indicated the five measures which most influenced their evaluations for
each divisional manager. Five points were assigned to the item ranked highest for each manager,
four for the second highest, etc. Measures with the highest average rankings were Sales Growth
(average points of 2.47), Customer Satisfaction (2.08), New Store or Line Sales (1.85), Repeat
Sales (1.67), Return on Sales (1.60), and Mystery Shopper Rating (1.24). It is interesting to note
that this list includes three financial measures and three customer-related measures. Listings for
BSC subjects and NOFORM subjects were quite similar. In fact, although in slightly different
orders, the top six items for the two groups were the same.
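The scoring rule in footnote 7 can be sketched as follows. The two subjects' rankings below are invented for illustration; only the 5-4-3-2-1 point scheme comes from the text:

```python
from collections import defaultdict

# Each inner list: one subject's five most influential measures, best first
# (rankings invented for illustration).
rankings = [
    ["Sales Growth", "Customer Satisfaction", "Repeat Sales",
     "Return on Sales", "Mystery Shopper Rating"],
    ["Customer Satisfaction", "Sales Growth", "Return on Sales",
     "Repeat Sales", "New Store Sales"],
]

points = defaultdict(float)
for ranked in rankings:
    for position, measure in enumerate(ranked):
        points[measure] += 5 - position  # 5 points for first, 4 for second, ...

avg_points = {m: p / len(rankings) for m, p in points.items()}
print(avg_points["Sales Growth"])  # 4.5
```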
8. Of course, we would like our evaluators to perceive the common and unique measures
included in our materials to be equally diagnostic and useful. A study by Dempsey, Gatti,
Grinnell and Cats-Baril (1997), provides some independent verification of this. Dempsey, et al.
had financial analysts rate a set of strategic measures as to frequency of use and predictive value;
ten of our measures (five common and five unique) were included in their list. There were no
significant differences in the frequency of use and predictive value for the common measures and
the unique measures we employed.
9. Overall, the sum of excess performance (percentage above target) for common
measures was 85.18 for the better division and 51.99 for the worse division (a difference of
33.19) while the sum of excess performance for unique measures ranged from 84.96 to 85.08 for
the better division and 51.20 to 51.95 for the worse division (a difference of 33.01 or 33.88).

The slight variation for the unique measures was due to the different units or bases for the
measures related to RadWear versus WorkWear.
10. Again, tests indicate that the ANOVA model assumption of normality of error terms
is violated. However, using the difference in manager evaluations as the dependent variable
satisfies all ANOVA assumptions and again corroborates the reported results (i.e., only common
measures have a significant effect on the difference in the managers evaluations, F= 30.69, p <
.01).
11. Analyzing the differences in managerial performance evaluations via regression
indicates that common measures have a slope coefficient of 10.87 (t=3.28, p<0.01) while unique
items' slope coefficient is 0.08 (t=0.02, p>0.10). The adjusted R² of the regression is 0.34.

12. Subjects also indicated which manager they would prefer to promote. Chi-square
tests indicate that these choices are independent of the pattern of unique measures (χ² = 0.64,
df=1, p>0.10, Fisher Exact test) but are not independent of the pattern of common measures
(χ² = 14.11, df=1, p<0.01, Fisher Exact test). This finding further corroborates the results
reported in the text. In addition, this choice question mirrors the dependent variable employed in
the judgment and decision making research upon which our hypothesis was based.
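The independence tests in footnote 12 can be reproduced with a short stdlib-only sketch. The two 2 x 2 promotion-choice tables below are invented (the paper does not report the raw counts); the function itself is the standard two-sided Fisher exact test:

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability that cell (0,0) equals x
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

linked = [[12, 3], [2, 13]]    # promotion choice tracks the measure pattern
balanced = [[7, 8], [8, 7]]    # promotion choice unrelated to the pattern
print(fisher_exact_two_sided(linked) < 0.01)   # True
print(fisher_exact_two_sided(balanced) > 0.10) # True
```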

Table 1
Sample Evaluation Form Employed in Both Experiments

WCS Inc.
Initial Evaluation Form

Year: ___1996______
Manager: ____Chris Peters_____________
Division: ____RadWear_______________
Evaluator: __________________________

1. Indicate your initial performance evaluation for this manager by placing an X somewhere on
the scale below. Note that some label interpretations are provided below.

  0                       50                       100
  |----|----|----|----|----|----|----|----|----|----|
  Reassign  very poor  poor  average  good  very good  excellent

Excellent: far beyond expectations, manager excels
Very good: considerably above expectations
Good: somewhat above expectations
Average: meets expectations
Poor: somewhat below expectations, needs some improvement
Very Poor: considerably below expectations, needs considerable improvement
Reassign: sufficient improvement unlikely
Table 2
Measures, Targets, and Actuals in Balanced Scorecard Format

RadWear Balanced Scorecard
(PlusWear items in Parentheses)
Targets and Actuals for 1996

Measure                                       Target      Actual
Financial:
1. Return on sales.......................... 24% (22)    25% (23)
2. Sales growth............................. 35% (30)    38% (33)
3. New store sales (New lines sales)........ 30% (25)    26% (22)
4. Market share relative to retail space.... $80 (70)    $80 (70)
5. Return on expenses....................... 42% (36)    42% (36)

Customer-Related:
1. Repeat sales............................. 30% (40)    33% (36)
2. Customer satisfaction rating............. 95 (97)     96 (96)
3. Mystery shopper program rating........... 96 (96)     98 (94)
4. Returns by customers as % of sales....... 10% (7)     9% (8)
5. Out of stock items....................... 10% (14)    10% (14)

Internal Business Processes:
1. Average major brand names/store
   (Average % of product range)............. 32 (88%)    34 (90%)
2. Sales from new market leaders
   (Sales from top brand names)............. 25% (28)    22% (25)
3. Returns to suppliers..................... 5% (3)      5% (3)
4. Average markdowns........................ 15% (12)    15% (12)
5. Voided sales transactions................ 3 (2)       3 (2)

Learning and Growth:
1. Hours of employee training/employee...... 10 (8)      11 (9)
2. Average tenure of sales personnel........ 1.4 (2.1)   1.2 (1.9)
3. Employee suggestions/employee............ 3 (2)       3 (2)
4. Sales personnel taking manager test...... 30% (36)    30% (36)
5. Stores computerizing..................... 85% (85)    85% (85)

Note: Calculations for the measures were described within the case.

Table 3
ANOVA Results for Experiment One
Manager Evaluations

PANEL A: ANOVA Results

Variable                 df         SS          MS        F      p
Between Ss:
  Organization            1      41.25       41.25     0.14   0.71
  Order                   1       4.10        4.10     0.01   0.91
  Organ. x Order          1       0.52        0.52     0.00   0.97
  Error                  74  22,567.86      304.97

Within Ss:
  Division                1  13,917.31   13,917.31    97.14   0.00
  Div. x Organization     1     817.27      817.27     5.70   0.02
  Div. x Order            1   1,513.64    1,513.64    10.57   0.00
  Div. x Organ. x Order   1     344.02      344.02     2.40   0.13
  Error                  74  10,601.94      143.27

PANEL B: Descriptive Statistics
Means (Standard Deviations)

Scorecard Format    RadWear    PlusWear   Evaluative Difference
                                          (RadWear - PlusWear)
BSC                  69.77      52.50           17.27
                    (14.21)    (16.93)         (20.32)
NOFORM               74.32      50.03           24.29
                    (10.24)    (18.02)         (15.03)
Both Formats         71.76      51.42
Table 4
Performance Measures Used in Experiment Two
____________________________________________________
Type 1     Measure
____________________________________________________
Financial Measures:
C          Return on Sales
C          Sales Growth
U-Rad      New Store Sales
U-Rad      Market Share Relative to Retail Space
U-Work     Revenues per sales visit
U-Work     Catalog profits
Customer-Related Measures:
C          Repeat Sales
C          Customer Satisfaction Rating
U-Rad      Mystery shopper program rating
U-Rad      Returns by customers as a % of sales
U-Work     Captured customers
U-Work     Referrals
Internal Business Process Measures:
C          Returns to Suppliers
C          Average Markdowns
U-Rad      Average major brand names per store
U-Rad      Sales from new market leaders
U-Work     Orders filled within one week
U-Work     Catalog orders filled with errors
Learning and Growth Measures:
C          Hours of employee training per employee
C          Employee suggestions per employee
U-Rad      Average tenure of sales personnel
U-Rad      Stores computerizing
U-Work     % Sales managers with MBA degrees
U-Work     Data base certification of clerks
____________________________________________________
1. C indicates a common measure. U-Rad is a unique measure for RadWear, a teen-wear retail
division. U-Work is a unique measure for WorkWear, a uniform division which sells through
catalogs and sales calls. Participants received a brief verbal description of the calculation of
each measure.

Table 5
Results for Experiment Two
Manager Evaluations

Panel A: ANOVA Results

Variable                  df         SS         MS        F      p
Between Ss:
  Common                   1     143.86     143.86     0.54   0.47
  Unique                   1      57.16      57.16     0.21   0.65
  Common x Unique          1     434.13     434.13     1.63   0.21
  Error                   54  14,402.36     266.71

Within Ss:
  Division                 1       9.15       9.15     0.22   0.64
  Div. x Common            1   1,265.38   1,265.38    30.69   0.00
  Div. x Unique            1      42.04      42.04     1.02   0.32
  Div. x Com. x Unique     1      40.07      40.07     0.97   0.33
  Error                   54   2,226.44      41.23

PANEL B: Descriptive Statistics
Means (Standard Deviations)

Measures    Division      Favor RadWear   Favor WorkWear
Common      Rad               74.21           70.00
                             (11.08)         (12.86)
            Work              68.24           77.17
                             (14.26)         (11.09)
            Difference:
            Rad-Work           6.05           -7.17

Unique      Rad               72.00           72.20
                             (11.02)         (13.19)
            Work              71.36           73.97
                             (11.70)         (14.97)
            Difference:
            Rad-Work           0.64           -1.76