
Journal of Business Research 57 (2004) 98–107

The use of expert judges in scale development


Implications for improving face validity of measures of
unobservable constructs
David M. Hardesty a,*, William O. Bearden b
a Department of Marketing, School of Business Administration, University of Miami, 523D Jenkins Building, Coral Gables, FL 33124-6554, USA
b Department of Marketing, Moore School of Business, University of South Carolina, Columbia, SC 29208, USA

Abstract
A review of the assessment of face validity in consumer-related scale development research is reported, suggesting that concerns over the
lack of consistency and guidance regarding item retention during the expert judging phase of scale development are warranted. After
analyzing data from three scale development efforts, guidance regarding the application of different decision rules to use for item retention is
offered. Additionally, the results suggest that research using new, changed, or previously unexamined scale items should, at a minimum, be
judged for face validity.
© 2003 Elsevier Science Inc. All rights reserved.
Keywords: Face validity; Content validity; Scale development

1. Introduction
Concerns regarding the lack of consistency and guidance
regarding how to use the expertise of judges to determine
whether an item should be retained for further analysis in
the scale development process motivated this investigation.
Moreover, there is confusion regarding the difference
between face and content validity and no previous research
has addressed directly the procedures used by consumer and
marketing researchers to determine item retention during
face validity assessment. Therefore, our first objective was
to assimilate and explain the difference between content and
face validity. Our second objective was to review the use of
expert judges in previous consumer and marketing research.
Finally, and based on this review, our third objective was to
test three frequently employed decision rules used in consumer and marketing research, in order to investigate the
relative effectiveness of alternative decision rules for use in
assessing face validity of scale items being considered in
measure development processes.
We begin by differentiating between face and content
validity, and then we describe the importance of having face
valid items. Next, the results of a review of marketing and

* Corresponding author. Tel.: +1-305-284-5011.


E-mail address: hardesty@miami.edu (D.M. Hardesty).
0148-2963/$ – see front matter © 2003 Elsevier Science Inc. All rights reserved.
doi:10.1016/S0148-2963(01)00295-8

consumer behavior-related scale development efforts are presented. The review consisted of an assessment of the scale
development articles reviewed in Bearden and Netemeyer's
(1999) second edition compilation of marketing scales. These
scales were chosen since they are among the most frequently
employed and more rigorously constructed scales used by
consumer and marketing researchers. The review was undertaken for two primary reasons. First, we wanted to determine
the prevalence of expert judging as a tool to aid in item face
validity assessment; and second, we wanted to gain an understanding of how previous researchers used expert judging to
reduce an initial item pool and to determine which items should
be analyzed further. After establishing that there has been a lack of
consistency regarding the rules used for item retention, several
data sets were analyzed in an attempt to provide future
researchers with guidance regarding item retention decisions.
Finally, the article concludes with remarks concerning the
implications and limitations of our research, as well as a
discussion of future research avenues.
2. Face validity assessment
2.1. The importance of having face valid items
Churchill (1979) proposed a widely accepted general
paradigm for developing measures of marketing constructs,
and the second step of this paradigm, the generation and
editing of items, is the focus of this article. Implicit in this
stage is the process whereby generated items are judged for
face and/or content validity. Specifically, in the second step of
the above scale development process, an initial item pool is
established, which should possess content and face validity
(Churchill, 1979). Content, as well as face validity, have been
defined variously by previous researchers (e.g., Schreisheim
et al., 1993; Nunnally and Bernstein, 1994) and the distinction between the two concepts is not always clear. As one
noted example, Nunnally and Bernstein (1994) defined
content validity as the degree to which a measure's items
represent a proper sample of the theoretical content domain of
a construct. In order for the criterion of content validity to be
met by the initial pool of items, these items need to be face
valid. Face validity has been defined as reflecting the extent
to which a measure reflects what it is intended to measure
(Nunnally and Bernstein, 1994). Similarly, Allen and Yen
(1979), Anastasi (1988), and Nevo (1985) defined face
validity as the degree that respondents or users judge that
the items of an assessment instrument are appropriate to the
targeted construct and assessment objectives.
Often, face and content validity have been used interchangeably even though there is an important conceptual
difference. One helpful way to distinguish between face and
content validity is to consider the domain of a construct
being represented by a dartboard. In order for the criterion
of content validity to be established, darts must land
randomly all over the board to obtain a proper representation of the construct. Therefore, if darts were located on
only the left-hand side of the board (i.e., items were
measuring only half of the domain of a construct), the
measure would not be content valid. Relatedly, if items
are generated that are too similar and do not tap the full
domain of the construct (i.e., the entire dartboard), content
validity is not established. Using the dartboard analogy, an
item has face validity if it hits the dartboard; otherwise, the
item does not represent the intended construct. Therefore,
researchers must ensure that the items in the initial pool
reflect the desired construct (i.e., hit the dartboard). This
validity assessment is necessary since inferences are made
based on the final scale items and, therefore, they must be
deemed face valid if we are to have confidence in any
inferences made using the final scale form.
Importantly, if items from a scale are not face valid, the
overall measure cannot be a valid operationalization of the
construct of interest. Hence, face validity is a necessary but
not sufficient condition for ensuring construct validity. That
is, items must reflect what they are intended to measure (i.e.,
face validity) and represent a proper sample of the domain
of a construct (i.e., content validity), and pass other tests of
validity (e.g., discriminant, convergent, and predictive validity), in order for a measure to have construct validity.
Unfortunately, consumer researchers often fail to include
an evaluation of the face validity of items when developing
measures. Similar to the admonition regarding content
adequacy by Schreisheim et al. (1993), it seems appropriate
to suggest that consumer and marketing research that
employs new, untested, or modified measures should provide evidence of face validity for the items being used.
Including a judging phase to help ensure face validity of
scale items should not normally be a tremendous burden on
researchers and may dramatically improve the scales that are
being used in consumer and marketing research. After all,
sound scales are necessary for any scientific discipline to
move forward.

3. Face validity assessment in prior consumer and marketing research: a review
Our analyses began with a review that focused on
measures summarized in Bearden and Netemeyer's (1999)
Handbook of Marketing Scales. Bearden and Netemeyer's
book of marketing scales contains a summary of approximately 200 multi-item scales that assess a variety of
consumer and marketing unobservable constructs. Each
scale included in their text met the following conditions:
(1) the measure was developed from a reasonable theoretical
base and/or conceptual definition; (2) the measure was
composed of several (i.e., at least three) items or questions;
(3) the measure was developed within the marketing or
consumer behavior literature and was used in, or was
relevant to, the marketing or consumer behavior literature;
(4) at least some scaling procedures were employed in scale
development; and (5) estimates of reliability and/or validity
existed (Bearden and Netemeyer, 1999, p. 1). Our review of
these measures indicated that some form of expert judging
was definitely used to evaluate face validity of items in 39
of these scales. In reviewing each of the 39 scales that
reported expert judging of the face validity of items, the
following information was gathered: (1) name of construct;
(2) author names; (3) initial number of items; (4) number of
items remaining after judging; (5) number of items in the
final scale; (6) number of judges; and (7) the decision rule
used for item retention. Table 1 summarizes the results from
this review.
There were a number of occasions where each of the
above pieces of information, even for the 39 scales using
expert judging, was either not included or was vague. So, of
approximately 200 of the most rigorously tested scales in
consumer and marketing research, only about 19.5% (or
n = 39) of the articles definitely reported the use of expert
judging to aid in face validity assessment. While it is
possible that expert judging was conducted but not reported,
the percentage seems surprisingly low given the importance
of having face valid items in the development of psychometrically sound scales (cf. Churchill, 1979) and given the
support for expert judging we found in the literature. As
shown in Table 1, individual constructs or facets were
developed based on an initial item pool consisting of from
10 to 180 items. On average, each facet or overall construct


Table 1
Expert judging of face validity: 39 measures reported in Bearden and Netemeyer's (1999) Handbook of Marketing Scales
Initial number of items
Number of items after judging
Number of items in the final scale
Number of expert judges

Compliant Interpersonal Orientation (Cohen, 1967)

10

10

10

Aggressive Interpersonal Orientation (Cohen, 1967)

15

15

15

Detached Interpersonal Orientation (Cohen, 1967)

10

10

10

Preference for Consistency (Cialdini et al., 1995)


Consumer Self-Actualization Test: CSAT (Brooker, 1975)

72
150

60
150

18
20

not reported
4

Self-Concepts, Person Concepts, and Product Concepts (Malhotra, 1981)
Separateness–Connectedness Self-Schema
(Wang and Mowen, 1997)

70

27

15

60

32

Achievement and Physical Vanity (Netemeyer et al., 1995)

99

60

5, 5

Consumer Impulsiveness (Puri, 1996)

25

12

12

colleagues in
the marketing
department
marketing
professors
and PhD
students
3

Country Image (Martin and Eroglu, 1993)

60

29

14

Consumer Ethnocentrism (Shimp and Sharma, 1987)

180

100

17

Market Mavenism (Feick and Price, 1987)


Consumer Independent Judgment Making and Consumer
Novelty Seeking (Manning et al., 1995)

40
74, 60

19
16, 16

6
6, 8

a group
5, 5

Use Innovativeness, 5 facets: Creativity/Curiosity (CC), Risk Preferences (RP), Voluntary Simplicity (VS), Creative Reuse (CR), Multiple Use Potential (MUP) (Price and Ridgway, 1983)
Consumer Susceptibility to Interpersonal Influence
(Bearden et al., 1989)

70

60

13 (CC), 9 (RP),
5 (VS), 10 (CR),
7 (MUP)

several

135 (Study 1)
86 (Study 2)

86

12

62

12

ECOSCALE: Environmentally Responsible Consumer (Stone et al., 1995)

50

not reported

31

Leisure (Unger and Kernan, 1983)

42

36

26

a group of
university
professors
10

Decision rule for item retention


9 of 10 items received seven of seven based on construct
definition; one item received six out of seven
13 of 15 items received seven of seven based on construct
definition; two items received six out of seven
7 of 10 items received seven of seven based on construct
definition; three items received six out of seven
items deleted for redundancy or poor face validity
judges pretested items for clarity, familiarity and wording,
and the like
judges made a list of about 35 items and all agreed on 27
items to retain
items were judged for face validity

judges consistently rated the items as at least somewhat characteristic

adjectives that appeared either ambiguous or unrelated were removed
average interjudge agreement and reliability was obtained
for 29 of the 60 word pairs; the Holsti (1969) procedure
was used to determine an average interjudge agreement
and reliability
at least five of six chose the items to be in the facet under
consideration
not reported
items that were judged to be not representative by any of
the judges or evaluated as clearly representative by fewer
than three of the judges were not retained
reduced to 60 based on judgment of several experts

at least four of five judges chose items to be in the facet under consideration
three judges rated clearly representative and one rated
somewhat representative
the professors knew that the survey was trying to measure
involvement with the environment; also, the questionnaire
was generally well received thus supporting face validity
judges were asked to indicate which dimension each item
represented; items were eliminated if three or more
assigned incorrect classifications

Name of construct (authors)

168

43, 23

20

3, 5

Involvement Revisited (Zaichkowsky, 1994)

168

35

10

3, 5

Purchasing Involvement (Slama and Tashchian, 1985)


Exploratory Acquisition of Product and Exploratory
Information Seeking (Baumgartner and
Steenkamp, 1996)
Shopping Value (Babin et al., 1994)

150
89, 89

75
41, 28

33
10

30
5

71

53

15

Coupon Proneness (CP)/Value Consciousness (VC) (Lichtenstein et al., 1990)

33 (CP)
Study 1
33 (VC)
Study 1
25 (CP)
Study 2
18 (VC)
Study 2
104

25

18

25

15

72

5, 5, 5

52

124

31

Physical Distribution Service Quality (Bienstock et al., 1997)

45

36

15

33

Consumer Alienation from the Marketplace (four variants) (Allison, 1978)

115

50

35

35

Consumer Discontent (Lundstrom and Lamont, 1976)

118

99

82

10

Ethical Behavior (Ferrell and Skinner, 1988)

70

not given

11

Trust, Expertise, and Attractiveness of Celebrity Endorsers (Ohanian, 1990)
Consumer Skepticism Toward Advertising
(Obermiller and Spangenberg, 1998)

each word pair was rated: (1) clearly representative, (2) somewhat representative, or (3) not representative; word pairs that were not rated representative for any of three choices were dropped; then, a second judging phase using the same procedure was implemented; items were deleted if less than 12 of 15 ratings were representative
all three judged as clearly or somewhat representative and
then five new judges rated as clearly or somewhat
representative at least 80% of the time
75% agreement that the item is appropriate
four of five judges classified items correctly

each judge was given a description of hedonic and utilitarian value and asked to sort the items into hedonic, utilitarian, or other; any item classified as representative by all three judges was retained; five additional items were retained following a discussion among the judges
at least four of five judges chose items to be in the facet
under consideration

four of five judges rated items as being at least somewhat representative

items with 75% or more agreement as belonging to a certain construct were thus retained for further analysis
judges were asked to rate each item as a very good, good,
fair, or poor representation of its content; items were
retained that were rated very good by at least three judges
and poor by none
if fewer than three mentioned the item as being
appropriate and the authors judged the item to lack face
validity it was deleted
75% of judges agreed items would differentiate between
alienated and nonalienated consumers and 60% or more
attributed the item to the same variant of consumer
alienation
statements that did not fit into either pro- or anti-business
sentiments were eliminated
items were removed if any judge felt it lacked face validity

Involvement (Zaichkowsky, 1985)


Table 1 (continued)
Initial number
of items

Number of items
after judging

Number of items
in the final scale

Number of
expert judges

Business Ethics (Reidenbach and Robin, 1990)


Excellence (Sharma et al., 1990)

33
200

33
34, 31

8
16

3
8, 18

Market Orientation (Narver and Slater, 1990)

not given

not given

6, 4, 5, 3, 3

3, 3

Work–Family Conflict and Family–Work Conflict (Netemeyer et al., 1996)
Performance of Industrial Salespersons
(Behrman and Perreault, 1982)

57, 53

22, 21

5, 5

4, 4

100

65

31

a number of
judges

Consumer Orientation of Salespeople (SOCO) (Saxe and Weitz, 1982)

104

70

24

24

Buying Influence (Kohli and Zaltman, 1988)

not given

not given

Social Power (Swasy, 1979)

150

85

31

a panel of
judges
6

Distributor Power and Manufacturer Power (Butaney and Wortzel, 1988)
Power Sources (Gaski and Nevin, 1985)

27, > 40

22, 21

17, 21

12

not given

not given

not given

Channel Leadership (Schul et al., 1983)


Reseller Performance (Kumar et al., 1992)

19
>100

9
34

15, 6, 6, 15, 10, 10, 5, 2
3, 3, 3
5, 5, 4, 4, 4, 4, 4, 4

about 50
>21

Decision rule for item retention


judges partitioned the items into moral philosophies
judges were asked to sort the items into eight groups;
statements were retained if at least seven of eight placed
them in the same dimension; then, judges were asked to
indicate whether or not each item represented the attribute
it was meant to represent; statements on which 70% of
judges agreed upon were retained
items were submitted to two panels and were rated highly
consistent with market orientation by all
items were retained if judged at least somewhat
representative by all
performance items that were evaluated as ambiguous, not
well categorized, not representative of the majority of
industrial selling job situations, or simply not important
were eliminated or modified
judges rated the items as clearly, somewhat, or not
representative; items were retained if at least 50% of the
judges rated them clearly representative; 10 unrelated
items were also included to monitor judging; one judge
was subsequently deleted
judges critiqued the structure and content of items
five of six judges classified the item as an indicator for the
same power type and the item was not classified as an
indicator of the other categories three or more times
not reported
supplier management made additions and deletions to
assess face validity
not reported
21 graduate students did an item sort task to assign items
to the hypothesized facet

had 65 items in the initial item pool. After judging, the
number of items remaining ranged from 3 to 150 and
averaged approximately 32. The average number of judges
used was approximately 10 per construct or facet, with the
number of judges employed ranging from 3 to 52. Finally,
the number of items in the final scale ranged from 2 to 82
and averaged approximately 12.
As the results of our review suggest, the establishment of
face validity has historically involved a mix of different
judgmental procedures and approaches. Judges are often
exposed to individual items and asked to evaluate the degree
to which items are representative of a construct's conceptual
definition. One common way of judging items is to use
some variant of the method employed by Zaichkowsky
(1985), whereby each item is rated by a panel of judges
as "clearly representative," "somewhat representative," or
"not representative" of the construct of interest. Of the 39
articles where expert judging was reported in aiding the
assessment of face validity, 10 used Zaichkowsky's exact
procedure or one very similar. As an example of one of the
modified procedures, Obermiller and Spangenberg (1998)
extended Zaichkowsky's procedure to include four possibilities (very good, good, fair, or poor representation of the construct). One interesting method employed by
Saxe and Weitz (1982), who also used the Zaichkowsky
procedure, was including 10 unrelated items to assess the
quality of judging. This procedure resulted in the subsequent deletion of the responses from one of the judges.
Another common method using expert judges is the
assignment of items to either an overall construct definition
or, for multifaceted constructs, one of the construct's dimension definitions. In this approach, a panel of judges is given
the definition of each construct or construct dimension, as
well as a list of all items. Judges are then asked to assign
each item to one of the construct definitions or assign the
item to a category labeled "other." Variations of this
procedure have been used for multidimensional constructs
(cf. Ohanian, 1990), conceptually different constructs being
developed simultaneously (cf. Baumgartner and Steenkamp,
1996), as well as unidimensional constructs (cf. Shimp and
Sharma, 1987). This procedure or a similar variant was used
by 14 of the authors who used expert judging to aid in the
assessment of face validity of scale items. The remainder of
the authors used either some general procedure to assess the
face validity of items, or failed to report adequately the
nature of the procedures employed.
Regardless of the procedure employed, authors must
determine which items to retain for further analysis. Scale
developers often use different rules for determining which
items to retain (cf. Bearden et al., 1989; Lichtenstein et al.,
1990; Zaichkowsky, 1994; Netemeyer et al., 1996). For
example, a number of authors have used expert judges to
delete ambiguous, redundant, or unrelated items (cf.
Brooker, 1975; Behrman and Perreault, 1982; Gaski and
Nevin, 1985; Cialdini et al., 1995; Puri, 1996). Other
researchers have used expert judges to generally evaluate
the quality of the survey (cf. Kohli and Zaltman, 1988;
Stone et al., 1995). Malhotra (1981) used judges to agree on
a subset of items to use in further analysis. Finally, Reidenbach and Robin (1990) used expert judges to partition
items into facets, not as a means of deletion.
When researchers have employed Zaichkowsky's (1985)
procedure or a similar one, several rules for item deletion
have emerged. In many instances, items were deleted when
evaluated by any judge as being not representative (i.e., a
poor indicator) of the construct (cf. Bearden et al., 1989;
Netemeyer et al., 1995, 1996). Other authors used decision
rules that focused on the overall evaluations of all of the
judges. For example, Lichtenstein et al. (1990) and Zaichkowsky (1985, 1994) decided that items would be retained
if at least 80% of the judges rated an item as at least
somewhat representative of the construct. Similarly, Sharma
et al. (1990) retained items that 70% of judges coded as
representative versus not representative of corporate excellence. One final set of rules that emerged from our review
contained references to the number of judges who evaluated
an item as completely representative of the construct. For
example, Obermiller and Spangenberg (1998) required at
least three of four judges to rate an item as being a very
good representation of consumer skepticism toward advertising. Similarly, Saxe and Weitz (1982) and Manning et al.
(1995) required at least 50% and 60% of their judges,
respectively, to rate an item as completely representative in
order for it to be retained.
Researchers using the other dominant procedure (i.e.,
placing items into facets or dimensions based on definitions)
also used different rules when determining which items to
retain. Allison (1978) required that at least 60% (21 of 35)
of the judges place an item into the same facet. Babin et al.
(1994) used the strictest rule in that all three of their judges
had to assign items to the same facet. For the remainder of
the authors, between 75% and 88% of the judges involved
had to assign an item to the same construct (cf. Swasy,
1979; Unger and Kernan, 1983; Slama and Tashchian, 1985;
Baumgartner and Steenkamp, 1996; Shimp and Sharma,
1987; Bearden et al., 1989; Lichtenstein et al., 1990;
Ohanian, 1990; Sharma et al., 1990). Martin and Eroglu
(1993) employed the Holsti (1969) procedure to determine
an average interjudge agreement and reliability. These
values were then used to determine which items to retain.
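The Holsti (1969) procedure referenced above rests on a simple agreement coefficient, CR = 2M/(n1 + n2). The following is a generic sketch of that formula with hypothetical coding data, not Martin and Eroglu's (1993) actual implementation.

```python
def holsti_agreement(coder1, coder2):
    """Holsti's (1969) coefficient of reliability for two coders:
    CR = 2M / (n1 + n2), where M is the number of judgments the coders
    agree on and n1, n2 are the number of judgments each coder made."""
    if len(coder1) != len(coder2):
        raise ValueError("coders must judge the same items")
    m = sum(1 for a, b in zip(coder1, coder2) if a == b)
    return 2 * m / (len(coder1) + len(coder2))

# Hypothetical: two judges code six word pairs as representative (1) or not (0);
# they agree on five of the six pairs.
print(holsti_agreement([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 1]))  # 5/6 = 0.8333...
```

With more than two judges, the coefficient is typically computed for each pair of judges and averaged, which matches the "average interjudge agreement" reported in Table 1.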
It should be noted that some authors used more than one
phase of judging and therefore may have employed more
than one procedure or decision rule (cf. Allison, 1978;
Zaichkowsky, 1985, 1994; Bearden et al., 1989; Lichtenstein et al., 1990; Sharma et al., 1990).
In summary, Zaichkowsky's (1985) procedure and
assigning items to construct definitions are the two dominant procedures that have been followed by marketing and
consumer researchers when assessing face validity of scale
items. When using the latter procedure of assigning items to
construct definitions, researchers have required that at least
60% of judges assign an item to the desired construct or
construct facet. Most of these authors have deemed at least
75% agreement as a minimum cutoff for item retention.
Therefore, although previous researchers have employed
many decision criteria, there seems to be a good bit of
consistency and guidance regarding the minimum required
degree of agreement necessary between judges.
Alternatively, when using Zaichkowsky's (1985) procedure to evaluate the face validity of scale items, there is
less guidance regarding the rule(s) used to retain items.
Some researchers required that no judge rate an item as
not representative in order to be retained, while others
considered all judge ratings and the number of representative or completely representative ratings. Consequently,
there is apparently limited direction in the literature
regarding specific rules that should be used for judging
face validity of scale items. In the following section, we
investigate the relative effectiveness of several approaches
for determining the adequate number of items for retention (cf. Churchill, 1979).
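The assignment-to-definition procedure with a minimum-agreement cutoff can be stated compactly. The sketch below illustrates the 75% rule described above; the construct labels and judge responses are hypothetical, and this is not code from any of the reviewed studies.

```python
from collections import Counter

def retain_by_assignment(assignments, intended, threshold=0.75):
    """Retain an item if at least `threshold` of the judges assigned it
    to the intended construct. `assignments` holds the construct label
    each judge chose for this item (including an "other" option)."""
    agree = Counter(assignments)[intended]
    return agree / len(assignments) >= threshold

# Hypothetical: eight judges sort one item; six assign it to the
# intended construct "VC", so agreement is 6/8 = 0.75 and it is retained.
judges = ["VC", "VC", "VC", "CP", "VC", "VC", "other", "VC"]
print(retain_by_assignment(judges, "VC"))  # True
```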

4. Comparing several expert judging decision rules


Given the inconsistencies noted in the literature review in
terms of expert judging procedures, we decided to investigate which of the three rules most highly correlated with an
item being ultimately included in a scale. In doing so, data
sets along with their expert judging information from three
recent scale development efforts were obtained. All three of
the available judging data sets were variants of the method
described in Zaichkowsky (1985). That is, judges rated each
item as completely, somewhat, or not at all representative of the construct or facet of interest.
4.1. Data sets included
The first data set considered is based on a scale development effort recently published in the Journal of Consumer Research (Bearden et al., 2001). Data regarding
development of these measures of consumer self-confidence
were obtained since the developmental judging procedures
employed by Bearden et al. (2001) enabled a test of all three
decision rules across various aspects of consumer self-confidence. Specifically, the construct measures reflect six
facets of consumer self-confidence: (1) information acquisition; (2) consideration set formation; (3) personal outcome
decision making; (4) social outcome decision making; (5)
persuasion knowledge and (6) marketplace interfaces. Seven
judges were used to assess the degree to which each item
was representative of each of the six facets of the scale.
Zaichkowsky's (1985) procedure was followed and judges
indicated whether the items were completely representative, somewhat representative, or not representative
of the facet of interest. Items were deleted that did not
average at least somewhat representative of the construct
being measured across the seven judges. This rule resulted
in the original edited item pool being reduced from 145
items to 83 items. The application of these judging procedures reduced the number of items across the six facets to
13, 11, 14, 12, 13, and 20 items, respectively.
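The averaging rule just described amounts to retaining an item when its mean rating across judges reaches at least "somewhat representative." A sketch with hypothetical ratings (not the Bearden et al., 2001, code or data), scoring completely = 3, somewhat = 2, not = 1 as in Section 4.2:

```python
POINTS = {"completely": 3, "somewhat": 2, "not": 1}

def retain_by_average(ratings):
    """Retain an item if its mean judge rating is at least
    'somewhat representative' (a mean score of 2 or higher)."""
    scores = [POINTS[r] for r in ratings]
    return sum(scores) / len(scores) >= POINTS["somewhat"]

# Hypothetical ratings from seven judges: total 15 points, mean 15/7 = 2.14.
print(retain_by_average(["completely", "somewhat", "somewhat", "completely",
                         "somewhat", "not", "somewhat"]))  # True
```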
The second data set considered addressed the development of two separate scales (Netemeyer et al., 1996)
and was used to test the two remaining decision rules. The
two scales were the Work–Family Conflict Scale and the
Family–Work Conflict Scale. In their research, Netemeyer
et al. (1996) employed the expertise of four judges. These
four judges rated items as clearly, somewhat, or not representative of each construct. Netemeyer et al. decided that all
four judges had to have indicated that the item was at least
somewhat representative of the construct of interest. This
rule resulted in reducing the initial item pool from 110 items
to 43 items. The numbers of items remaining for the Work–Family
Conflict and Family–Work Conflict measures were
22 and 21, respectively.
The final scale data were from Netemeyer et al.'s (1995)
development of the achievement vanity and physical vanity
scales. In their research, Netemeyer et al. employed two
judging phases. First, three of four judges had to rate items
as at least somewhat representative for the item to be
retained. Then, in a second phase, all four new judges had
to rate items at least somewhat representative to be retained.
These two phases resulted in 60 items being considered for
further analysis from an initial pool of 99 items.
4.2. Decision rules tested
In the following analyses, three decision rules were
considered. First, a rule labeled sumscore was evaluated
(e.g., Lichtenstein et al., 1990; Sharma et al., 1990). Sumscore is defined as the total score for an item across all
judges. For example, if there were four judges for a
particular data set and one judge indicated an item was
completely representative, two judges indicated the item
was somewhat representative, and the final judge indicated
that the item was not representative, the item received a
sumscore of eight points. This value was calculated as three
points for the completely representative judgment, four
points for the two somewhat representative judgments, and
one point for the not representative judgment. Importantly,
this decision rule was included since many previous
researchers considered all of the judges when assessing face
validity of items.
The second decision rule considered was labeled complete (e.g., Obermiller and Spangenberg, 1998; Saxe and
Weitz, 1982). Complete was operationalized as the number
of judges that rated an item as completely representative of
the construct. For the above example, the item received a
complete score of one point, since only one of the four
judges rated the item as completely representative of the
construct. As an example, Saxe and Weitz (1982) required at
least 50% of their judges to rate an item as completely
representative in order for the item to be retained.

Finally, a third decision rule, not representative, was
considered (cf., Bearden et al., 1989; Netemeyer et al.,
1995, 1996). The not representative rule was operationalized as the number of judges indicating that the item was not
representative of the construct of interest. For the above
example, the item received a not representative score of one
point since one judge rated the item as not at all representative. This third rule was considered based upon the
recognition that some researchers were only concerned with
deleting items judged as not representative.
4.3. Results
For each of the three decision rules, the correlation between
the decision rule score and the ultimate inclusion of the item
in the scale (coded as 1 if included, 0 if excluded) was
calculated. That is, a comparison was made between the
expert judging scores and whether or not the item ended
up being included in the final scale. Table 2 summarizes
the correlation between each of the decision rules and the
inclusion of the items that make up each final scale or
scale facet.
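This analysis amounts to correlating each rule's item scores with a 0/1 indicator of final-scale inclusion (i.e., a point-biserial correlation). A sketch follows; the item scores and inclusion indicators are made-up illustrations, not the actual judging data.

```python
# Sketch of the validation analysis: correlate a decision-rule score with a
# binary indicator of whether each item was included in the final scale.
# The data below are hypothetical, for illustration only.
from statistics import mean

def pearson(x, y):
    """Pearson correlation; with a 0/1 y this equals the point-biserial r."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

sumscores = [12, 9, 7, 11, 6, 10]  # hypothetical item sumscores
included = [1, 1, 0, 1, 0, 1]      # 1 = item retained in the final scale
print(round(pearson(sumscores, included), 3))  # 0.892
```

A large positive correlation indicates that items the judges scored highly under a given rule tended to be the items that survived the full scale development process.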
For the consumer self-confidence data, the sumscore,
complete, and not representative decision rules were each
significantly correlated with four of the six facets, as well as
the overall measure, of consumer self-confidence. Based on

Table 2
Three decision rules employed by researchers in the use of expert judging
for the face validity of items^a

Name of construct                    Sumscore   Complete   Not representative
Information Acquisition              .705***    .709***    .428*
Consideration Set Formation          .566**     .556**     .481*
Personal Outcome Decision Making     .335       .248       .548**
Social Outcome Decision Making       .730***    .683***    .742***
Persuasion Knowledge                 .501**     .544**     .128
Marketplace Interfaces               .183       .182       .152
Overall Consumer Self-Confidence     .395***    .475***    .316***
Work-Family Conflict                 .401**     .396*      not applicable
Family-Work Conflict                 .242       .246       not applicable
Achievement Vanity                   .151*      .119       not applicable
Physical Vanity                      .063       .035       not applicable

Information Acquisition, Consideration Set Formation, Personal Outcome
Decision Making, Social Outcome Decision Making, Persuasion Knowledge, and
Marketplace Interfaces make up the six facets of Consumer Self-Confidence
(Bearden et al., 2001). Work-Family Conflict and Family-Work Conflict are
from Netemeyer et al. (1996). Achievement Vanity and Physical Vanity were
developed by Netemeyer et al. (1995).
Sumscore represents the sum of the ratings from all judges for each item.
Complete represents the number of judges rating an item as completely
representative. Not representative corresponds to the number of judges
rating an item as not representative.
^a The values in the first three columns represent the correlation between
the decision rule and whether or not the scale item was included in the
final scale.
* P < .10.
** P < .05.
*** P < .01.

the sizes of the correlations, it appears that the not representative decision rule is not as effective at predicting
ultimate inclusion of an item in a scale as the alternative
two rules (i.e., sumscore and complete). That is, researchers
should not simply focus on the number of judges who rate an
item as not representative at all for a construct when determining whether to retain or delete items (cf. Bearden et al.,
1989; Netemeyer et al., 1995, 1996). In order to evaluate the
sumscore and the complete decision rules further, we considered two additional data sets. The nature of the decision
rules used by these two sets of authors precludes further
testing of the not representative decision rule.
The Work-Family Conflict and the Family-Work Conflict Scales were considered first (Netemeyer et al., 1996).
These authors decided that all judges had to have indicated
that the item was at least somewhat representative of the
construct of interest in order to be retained. As shown in
Table 2, both the sumscore and complete decision rule are
statistically related to the inclusion of items in the final
scale for one of the two constructs. Based on this, there is
little difference between the use of the sumscore and
complete decision rules. In order to further evaluate these
two rules, the physical and achievement vanity scales of
Netemeyer et al. (1995) were evaluated. Specifically,
Netemeyer et al., in the second phase of their judging
procedures, decided that all four judges had to rate an item
as at least somewhat representative to be retained. As can
be seen in Table 2, only the sumscore decision rule is
statistically related to the inclusion of items in the final
achievement vanity scale. Neither of the decision rules was
statistically related for the physical vanity scale. Again,
however, the overall magnitude of both the sumscore and
complete decision rules seems to suggest that they are
performing similarly.

5. Implications, limitations, and future research directions
In summary, there is an apparent lack of consistency in
the literature in terms of how researchers use the opinions
of expert judges in aiding the decision of whether or not to
retain items for a scale. We have taken a first step in
attempting to provide researchers with some direction
regarding the decision rule to use during the judging phase
of the scale development process. In our analyses, the not
representative decision rule was found least capable of
predicting the eventual inclusion of an item in a scale
based upon the pool of items considered by Bearden et al.
(2001). These limited data suggest that the not representative rule may not be the best rule to employ. However, it
is important to note that future researchers should further
explore the not representative decision rule using additional data. Again, our tests were restricted solely to the
items comprising the six facets of consumer self-confidence developed by Bearden et al.

Subsequently, two more sets of data (e.g., Netemeyer et al., 1995, 1996) were investigated to further evaluate the
sumscore and complete decision rules. Both of these rules
predicted item inclusion in the final scales similarly. However, the sumscore decision rule slightly outperformed the
complete decision rule. At least for two of the developmental item pools (i.e., Bearden et al., 2001; Netemeyer
et al., 1996), these two rules did fairly well at predicting
whether an item would eventually be included in a final
scale. Notably, the present findings support the important
ability of expert judges to enhance eventual scale reliability
and hence, subsequent validity. Therefore and similar to
others (cf. Schreisheim et al., 1993), we suggest that any
research using new, changed, or previously unexamined
scale items, should at a minimum be judged by a panel of
experts for face validity. Having said this, we by no means
are arguing that subsequent stages of the scale development
process be ignored. There are, however, occasions where
consumer researchers develop a set of items that may not be
the focal point of the article and, therefore, do not engage in
the entire scale development process. It is in these occasions
that the use of expert judging appears especially desirable.
In performing this step, the face validity of the scale is
increased at only limited cost in time or funds. Also, the
items that are ultimately used correlate fairly well in some
instances with those that would have survived an exhaustive
scale development process.
It needs to be noted that simply judging items may not
guarantee the selection of the most appropriate items for a
scale. For example, in our research, the items comprising the
two dimensions of the final vanity scales did not correlate
highly with either the sumscore or complete decision rules.
Therefore and as stated previously, expert judging should
not be used as a substitute for the scale development
process. Rather, expert judging should be used to obtain
some justification for the face validity of items when those
items are not the focal point of the research. One additional
conclusion that is clear from our literature review is the lack
of any consistency regarding item face validity evaluation in
the literature. Future research is warranted to establish
procedures that researchers can use to strengthen scale
development efforts.
As a result of our data analysis, the sumscore decision
rule performed somewhat more effectively at predicting
whether an item is eventually included in a scale, and
appears, therefore, to be a reasonable rule for researchers
to employ. A caveat associated with this finding is the
realization that cutoff values for when to delete and when
to retain items are still in need of inquiry. Future researchers
with access to other data sets may be able to provide
guidance regarding such cutoff values. Additionally, and
as suggested by a reviewer of this manuscript, a logical next
step in the assessment of the sumscore and other decision
rules would be to collect data and test the scales for overall
construct validity. Assessing reliability, unidimensionality,
discriminant and convergent validity, and nomological

validity would be a more rigorous test in discerning the


value of the various decision rules. One final avenue of
research, which seems to be important, but was not considered here, is evaluating the other prevailing way in which
expert judges are used. That is, an evaluation of the
technique of asking judges to assign items to dimensions
or facets of multidimensional scales seems warranted.

References
Allen MJ, Yen WM. Introduction to measurement theory. Monterey (CA):
Brooks/Cole, 1979.
Allison NK. A psychometric development of a test for consumer alienation
from the marketplace. J Mark Res 1978;15:565-75.
Anastasi A. Psychological testing. New York: Macmillan, 1988.
Babin BJ, Darden WR, Griffin M. Work and/or fun: measuring hedonic and
utilitarian shopping value. J Consum Res 1994;20:644-56 (March).
Baumgartner H, Steenkamp J-BEM. Exploratory consumer buying behavior:
conceptualization and measurement. Int J Res Mark 1996;13:121-37.
Bearden WO, Netemeyer RG. Handbook of marketing scales: multi-item
measures for marketing and consumer behavior research. Thousand
Oaks (CA): Sage Publications, 1999.
Bearden WO, Netemeyer RG, Teel JE. Measurement of consumer susceptibility
to interpersonal influence. J Consum Res 1989;15:473-81.
Bearden WO, Hardesty DM, Rose RL. Consumer self-confidence: refinements
in conceptualization and measurement. J Consum Res 2001;28:121-34.
Behrman DN, Perreault WD. Measuring the performance of industrial
salespersons. J Bus Res 1982;10:355-70.
Bienstock CC, Mentzer JT, Bird MM. Measuring physical distribution
service quality. J Acad Mark Sci 1997;25:31-44 (Winter).
Brooker G. An instrument to measure consumer self-actualization. In:
Schlinger MJ, editor. Advances in consumer research, vol. 2. Ann Arbor
(MI): Association for Consumer Research, 1975. p. 563-75.
Butaney G, Wortzel LH. Distributor power versus manufacturer power: the
customer role. J Mark 1988;52:52-63 (January).
Churchill G. A paradigm for developing better measures of marketing
constructs. J Mark Res 1979;16:64-73 (February).
Cialdini RB, Frost MR, Newsom JT. Preference for consistency: the
development of a valid measure and the discovery of surprising behavioral
implications. J Pers Soc Psychol 1995;69(2):318-28.
Cohen JB. An interpersonal orientation to the study of consumer behavior.
J Mark Res 1967;4:270-8.
Feick LF, Price LL. The market maven: a diffuser of marketplace
information. J Mark 1987;51:83-97.
Ferrell OC, Skinner SJ. Ethical behavior and bureaucratic structure in
marketing research organizations. J Mark Res 1988;25:103-9 (February).
Gaski JF, Nevin JR. The differential effects of exercised and
unexercised power sources in a marketing channel. J Mark Res 1985;22:
130-42 (May).
Holsti O. Content analysis for the social sciences and humanities. Reading
(MA): Addison-Wesley Publishing, 1969.
Kohli AK, Zaltman G. Measuring multiple buying influences. Ind Mark
Manage 1988;17:197-204.
Kumar N, Stern LW, Achrol RS. Assessing reseller performance from the
perspective of the supplier. J Mark Res 1992;29:238-53 (May).
Lichtenstein DR, Netemeyer RG, Burton S. Distinguishing coupon proneness
from value consciousness: an acquisition-transaction utility
theory perspective. J Mark 1990;54:54-67.
Lundstrum WJ, Lamont LM. The development of a scale to measure consumer
discontent. J Mark Res 1976;13:373-81.
Malhotra NK. A scale to measure self-concepts, person concepts, and
product concepts. J Mark Res 1981;16:456-64.
Manning KC, Bearden WO, Madden TJ. Consumer innovativeness and the
adoption process. J Consum Psychol 1995;4(4):329-45.
Martin IM, Eroglu S. Measuring a multi-dimensional construct: country
image. J Bus Res 1993;28:191-210.
Narver JC, Slater SF. The effect of a market orientation on business
profitability. J Mark 1990;54:20-35 (October).
Netemeyer RG, Burton S, Lichtenstein DR. Trait aspects of vanity:
measurement and relevance to consumer behavior. J Consum Res 1995;21:
612-26 (March).
Netemeyer RG, Boles JS, McMurrian R. Development and validation of
Work-Family Conflict and Family-Work Conflict Scales. J Appl Psychol
1996;81(4):400-10.
Nevo B. Face validity revisited. J Educ Meas 1985;22:287-93.
Nunnally JC, Bernstein IH. Psychometric theory. New York: McGraw-Hill,
1994.
Obermiller C, Spangenberg ER. Development of a scale to measure consumer
skepticism toward advertising. J Consum Psychol 1998;7(2):159-86.
Ohanian R. Construction and validation of a scale to measure celebrity
endorsers' perceived expertise, trustworthiness, and attractiveness. J
Adver 1990;19(3):39-52.
Price LL, Ridgway NM. Development of a scale to measure use
innovativeness. In: Bagozzi RP, Tybout AM, editors. Advances in consumer
research, vol. 10. Ann Arbor (MI): Association for Consumer Research,
1983. p. 679-84.
Puri R. Measuring and modifying consumer impulsiveness: a cost-benefit
accessibility framework. J Consum Psychol 1996;5(2):87-113.
Reidenbach RE, Robin DP. Toward the development of a multidimensional
scale for improving evaluations of business ethics. J Bus Ethics 1990;
9:639-53.
Saxe R, Weitz BA. The SOCO scale: a measure of the customer orientation
of salespeople. J Mark Res 1982;19:343-51 (August).
Schreisheim CA, Powers KJ, Scandura TA, Gardiner CC, Lankau MJ.
Improving construct measurement in management research: comments
and a quantitative approach for assessing the theoretical content
adequacy of paper-and-pencil survey-type instruments. J Manage 1993;
19(2):385-417.
Schul PL, Pride WM, Little TL. The impact of channel leadership behavior
on interchannel conflict. J Mark 1983;47:21-34 (Summer).
Sharma S, Netemeyer R, Mahajan V. In search of excellence revisited: an
empirical investigation of Peters and Waterman's attributes of
excellence. In: Bearden WO, Parasuraman A, editors. Enhancing knowledge
development in marketing, vol. 1. Chicago (IL): American Marketing
Association, 1990. p. 322-8.
Shimp TA, Sharma S. Consumer ethnocentrism: construction and validation
of the CETSCALE. J Mark Res 1987;24:280-9.
Slama ME, Tashchian A. Selected socioeconomic and demographic
characteristics associated with purchasing involvement. J Mark 1985;49:
72-82 (Winter).
Stone G, Barnes JH, Montgomery C. ECOSCALE: a scale for the measurement
of environmentally responsible consumers. Psychol Mark 1995;
12:595-612 (October).
Swasy JL. Measuring the bases of social power. In: Wilkie WL, editor.
Advances in consumer research, vol. 6. Ann Arbor (MI): Association
for Consumer Research, 1979. p. 340-6.
Unger LS, Kernan JB. On the meaning of leisure: an investigation of some
determinants of the subjective experience. J Consum Res 1983;9:381-92
(March).
Wang CL, Mowen JC. The separateness-connectedness self-schema: scale
development and application to message construction. Psychol Mark
1997;14:185-207 (March).
Zaichkowsky JL. Measuring the involvement construct. J Consum Res
1985;12:341-52 (December).
Zaichkowsky JL. The personal involvement inventory: reduction, revision,
and application to advertising. J Adver 1994;23:59-70 (December).
