
CATEGORY E  Product Shipment, Commercial Software, and Usage--Software Development and Maintenance Aids, and Applications Software

This category includes software-based critical tools for implementing military systems, the CAD/CAM systems, the critical software development tools described in subsection 3.2, and the many commercial application packages that have absolutely no military significance. Therefore, the CNCTEG recommends that this category be controlled initially, and that a high priority effort be placed on further subdividing it so that many of the products in the category may be freed from control, thus reducing the administrative workload related to export licenses for software having no military significance.

CATEGORY F  Product Shipment, Commercial Software, and Usage--Widely Available Operating Systems, Network Oriented Software, Database Management Software, and User Interface Software

The CNCTEG recommends that no export controls be applied to this category. The product shipment--commercial software category consists of products that are widely available throughout the free world in easily replicable form. In the case of usage, the products are not actually transported into the communist country. Some controls may be applied indirectly in this area as a part of Category B. Additional work is recommended to remove these controls when practical.

In general, the CNCTEG believes that if a copy of a generally available program is known to be available in the communist countries, whether critical or not, then no great value lies in controlling the export of additional copies.


Human Aspects of Computing
H. Ledgard, Editor

Studying Programmer Behavior Experimentally: The Problems of Proper Methodology

Ruven E. Brooks
The University of Texas Medical Branch, Galveston, Texas

The application of behavioral or psychological techniques to the evaluation of programming languages and techniques is an approach which has found increased applicability over the past decade. To use this approach successfully, investigators must pay close attention to methodological issues, both to insure the generalizability of their findings and to defend the quality of their work to researchers in other fields. Three major areas of methodological concern, the selection of subjects, materials, and measures, are reviewed. The first two of these areas continue to present major difficulties for this type of research.

Key Words and Phrases: psychology of programming, software psychology
CR Category: 4.6

Received 10/79; revised 11/79; accepted 12/79

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
Author's present address: R.E. Brooks, Dept. of Psychiatry and Behavioral Sciences, The University of Texas Medical Branch, Galveston, TX 77550.
© 1980 ACM 0001-0782/80/0400-0207 $00.75.

207    Communications of the ACM, Volume 23, Number 4 (April 1980)
1. Introduction

Over the past decade, a number of studies have appeared which use behavioral or psychological methodologies to evaluate programming language constructs and techniques; some examples include Boies and Gould [1], Gould and Drongowski [7], Shneiderman et al. [18], Shneiderman [19], and Sime et al. [20, 21, 22]. Studies of this kind involve direct observation and manipulation of the behavior of individual programmers, in contrast to studies on programming products [11, 14] or on the social dynamics of programming groups [23].

Such studies have a large potential for impacting the construction, selection, and use of programming tools and techniques in a number of different ways. First, they may serve to affirm or refute behavioral assumptions underlying particular programming concepts. As an example, the work of Sime et al. [22] lends support to the claim that for nonprogrammers nesting is superior to jumps as a method for representing conditionals. Often, such work may lead to further refinement or explication of the behavioral assumption; in the study just mentioned, scope markers for nesting which carried redundant information were found to be superior to those which only marked the scope end.

A second impact that behavioral studies may have is the provision of quantitative information on the relative effectiveness of programming techniques. The work of Sackman [15] on time-sharing versus batch processing provides comparative figures on labor and hardware costs for each operating mode, as well as showing that the differences were small compared to individual differences. Similarly, the work of Shneiderman et al. [18] provides an order-of-magnitude estimate of the benefits of flowcharts.

Finally, behavioral studies may provide verification of explicit models of programmer behavior [2, 19]. Once such models are verified, they can be used in a number of ways, such as a basis for selecting programming language features or to generate new programming tools and techniques. As an example, verification of a theory which asserted that variable naming was more important to comprehensibility than control structures would lead to increased work on variable naming systems.

1.1 The Need for Good Methodology

The goal of this article is to explicate some of the methodological problems that research on behavioral aspects of programming entails and to suggest some tactics for addressing these problems.

It would seem that the first duty of this kind of research, as a newly emerging approach to an existing set of problems, is to demonstrate powerful and useful findings; if a crude methodology is adequate to this task, the development of a more refined, sophisticated methodology is a task that can wait for further maturation in the area. There are, however, two significant reasons why workers in this area should have a particularly strong, early concern for a well-developed methodology.

The first is that while behavioral research in computer science is relatively new, other disciplines, most notably psychology, have developed and elaborated the methodological techniques for over a century. In particular, psychometric theory, analysis of variance, and, more recently, the multivariate techniques are a powerful set of analytic tools for unraveling complex behavioral phenomena. Researchers in these other behavioral disciplines have come to expect the use of such tools as an integral part of behavioral research and to judge research, in part, by the skill with which these tools are used. Behavioral research, even if done in a computer science context, will inevitably be judged by these same standards of methodological rigor. In order to maintain credibility with behavioral researchers in other areas, behavioral researchers in computer science must pay close attention to methodological issues.

A second reason for an early concern with methodology has to do with the impact that methodology availability has on the choice of experimental questions. In theory, the process of designing an experiment begins with the identification of an important theoretical question, selected without regard to its experimental measurability. Further assumptions are made as needed to bridge between the theoretical question and an actual experimental situation. Since it is this translation process that requires the greatest amount of both ingenuity and effort, there is a subtle pressure to give preference to those theoretical questions that are the easiest to answer experimentally. A strong concern for methodology does not per se counteract this tendency; however, it does encourage researchers to be aware of the role methodological issues play in the design of experiments.

2. Major Methodological Issues

In the paradigmatic situation in which these methodologies are employed, a researcher has some hypothesis about the behavioral effect of a programming language construct or technique. In order to test this hypothesis, he or she must construct a situation in which the behavioral effect can be observed and measured. Creating such situations typically involves the selection of appropriate subjects, the administration of stimulus materials likely to evoke an experimental effect, and the use of an appropriate measure of the extent to which effects have occurred. These issues of subjects, materials, and measures are the fundamental issues in the choice of methodology.

2.1 Choosing Subjects

In selecting subjects the experimenter must satisfy two, sometimes contradictory, criteria. On one hand, he or she would like the subjects to be representative. The purpose of doing a study is to make a statement about behavior which will be true for some large population
such as all Fortran programmers or all real-time programmers. Usually, the subjects are only a small sample of the population, but they are selected by some method which justifies the claim that they are representative of the population as a whole. Thus, an experimenter may draw all his subjects from one firm on the basis that employees from the firm are typical of programmers in general.

The other criterion that experimenters must satisfy is that the subjects be relatively uniform in regard to their characteristics and abilities at the point at which they are selected to participate in the experiment. Suppose that the goal of an experiment is to measure the effect of two control structures on comprehensibility and that two different groups of subjects are used. If one group contains a disproportionate number of high ability subjects, then the experimental outcome may reflect this difference, rather than the effect of control structure.

These two criteria become contradictory when the parent population is very heterogeneous. In this situation, a representative sample perforce reflects this heterogeneity. The most widely used method for handling this kind of situation is simply to use enough subjects so that the chances will be extremely small that all the low or high ability subjects will fall into the same group. Unfortunately, in the study of programming behavior, the number of subjects required can be very large. Studies of different aspects of programmer performance have found that ability differences range from 4 to 1 [7] to 25 to 1 [15] across experienced programmers with equivalent backgrounds. For some commonly used experimental designs, this range of variability can imply a need for hundreds of subjects in order to obtain significant results. The use of too few subjects under these conditions entails a strong risk that subject differences will obscure experimental effects. Considering their report of large subject differences, this may explain the failure to obtain statistically significant results in Experiments II and III of Shneiderman et al. [18].

One tempting solution to obtaining uniform subjects is to use students in programming classes, as was done by Shneiderman et al. [18] and Gannon [6]. There are two problems with doing this. The first one is that it is by no means a solution to the individual differences problem; Shneiderman et al. report large individual differences both within and between programming classes. The second problem is that at the current time, there is little justification for the assumption that results found with beginning programmers are applicable to experienced ones. In order for such an assumption to be justified, it must be the case that the experienced programmer follows the same procedures and practices in problem solving as does the naive programmer, albeit at a faster rate. In the case of chess, a task with certain analogies to programming, very experienced players (masters) use entirely different procedures than novice players, and the work of Youngs [26] on debugging suggests that experienced programmers use different debugging strategies than beginners do. Hence, in the absence of a body of studies that find the same results with experienced and student programmers, researchers are not justified in generalizing between the two groups.

The caution given in the previous paragraph is probably equally applicable to intermediate programming students. Shneiderman [19] has shown that even a few months' difference in experience can have a significant effect on performance. An intermediate programming student may have taken 5 or 6 programming courses and have spent a maximum of perhaps a thousand hours on programming, including writeups and class time. A programmer with only three years of experience has probably spent five times as much time programming, so that the intermediate student programmer probably resembles the beginning programmer more than he or she does the experienced one. The results of studies, such as those on flowcharts or the work of Gannon [6] on data typing, which use such intermediate students are therefore in need of replication using experienced programmers.

Since control of variability by sample size alone is rarely feasible, the experimenter will frequently want to take other measures. A potentially attractive approach is based on assessing the abilities of subjects prior to their participation in the experiment and then either grouping them on the basis of ability (stratification) or adjusting the measurements of their performance in the experiment for their initial ability level (covariance analysis). Unfortunately, a problem lies in the selection of an ability measure. The obvious choice, length of experience, was used to select subjects in those studies in which the difference was first found, and recent work using a wider range of experience [17] also failed to find a significant relationship. Their study did, however, suggest that variety of experience has some predictive value, particularly for programmers with less than three years of experience.

Given the uncertain status of biographic measures, an alternative approach would be to use a pretest of programming skills to assess subject ability. The difficulties of this approach lie in deciding what to use as the pretest measure. If the question under investigation concerns comparing programming languages with GOTOs with those with block structuring, then what sort of control structure should be used in the test materials? Perhaps because of this difficulty in selecting the pretest, no studies have yet appeared which use this pretest approach.

The solution to the subject variability problem that, to date, appears most successful is the use of within-subject experimental designs. These designs are based on exposing each subject to all levels of the experimental variables under investigation; for example, in an experiment to investigate the effects of four different structuring techniques, each subject would use all four techniques. The advantage to this approach is that the analysis is then based on the relative performance of each technique with each subject.
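The variance-canceling logic of such designs can be illustrated with a small simulation. The sketch below is purely illustrative: the 12-subject sample size, the spread of subject abilities (of the same order as the differences reported above), the 15-unit technique effect, and the noise level are all invented for the example.

```python
import random
import statistics

random.seed(0)

# Twelve hypothetical programmers whose baseline ability spans a wide
# range, as in the large individual differences reported in the text.
abilities = [random.uniform(10.0, 250.0) for _ in range(12)]
effect = 15.0  # invented "true" advantage of technique B, arbitrary units

# Each subject is measured under both techniques (a within-subject design).
scores_a = [a + random.gauss(0.0, 5.0) for a in abilities]
scores_b = [a + effect + random.gauss(0.0, 5.0) for a in abilities]

# Between-subject view: the spread of raw scores is dominated by ability.
between_sd = statistics.stdev(scores_a + scores_b)

# Within-subject view: each subject serves as his or her own control, so
# the paired differences contain only the effect plus measurement noise.
diffs = [b - a for a, b in zip(scores_a, scores_b)]
within_sd = statistics.stdev(diffs)

print(f"spread of raw scores:       {between_sd:6.1f}")
print(f"spread of paired B-A diffs: {within_sd:6.1f}")
print(f"mean paired difference:     {statistics.mean(diffs):6.1f}")
```

With ability differences this large, the raw scores vary far more between subjects than the paired differences do, which is why the within-subject analysis can detect the same effect with many fewer subjects.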

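A companion concern in such designs, one the discussion above does not address explicitly, is that each subject's order of exposure can introduce practice or fatigue effects. A standard counterbalancing device is the Latin square, in which every condition occupies every serial position exactly once. A minimal sketch for the four-technique example (the technique labels are placeholders):

```python
# Cyclic Latin square: row r, column c holds items[(r + c) % n], so each
# technique appears exactly once in every row (subject group's order) and
# exactly once in every column (serial position).
techniques = ["T1", "T2", "T3", "T4"]

def latin_square(items):
    n = len(items)
    return [[items[(r + c) % n] for c in range(n)] for r in range(n)]

for order in latin_square(techniques):
    print(" -> ".join(order))
```

Assigning each subject group one row balances order against technique, without requiring every one of the 4! possible orderings.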
A drawback of within-subject designs is that they often require the preparation of large amounts of stimulus materials. If an experiment is being done to compare five commenting techniques and each subject reads different programs with all five techniques, then a total of 5 x 5 = 25 different program listings would be needed. Despite their requirements for stimulus materials, within-subject designs have been used successfully in a variety of studies [8, 9, 17].

2.1.1 A comment on subject selection. In this discussion, the wide range of intersubject variability has been treated as a source of variance that must be compensated for in experiments on other factors. At the same time, however, the source and properties of this variance represent important research questions in their own right. Specification of the exact behaviors and processes which differ between better and poorer programmers and the establishment of etiologies for these differences are research problems that are as intriguing as those concerned with differences in programming languages and methodologies. In addition, progress on this individual differences issue offers the potential for better controls on individual variation for research in other areas.

2.2 Choosing Materials

The most important criterion in the choice of experimental materials is that they tap the experimental difference being tested; if, for example, the experiment is designed to assess the impact of good structure on program modifiability, the materials used should vary in level of structuredness. The major problem in achieving this criterion is to design materials that are usable with the resources available for the experiment, but which are representative of some wider class of programs. If the programs which are created are too abnormal or pathological, then readers of the studies are likely to be unwilling to apply their results to the real world; if a test of branching constructs requires programs that are half branch statements to obtain significant results, then the reader might reasonably conclude that branching constructs make very little difference at best.

Though none of the programs used in published studies appear to be exceptional in regard to their internal structure, they do come under criticism in regard to their lengths. At a time when several systems with more than a million lines of source code exist, programs of less than 500 lines will usually turn out to be toy or student programs. An argument can be made that programs in the 50- to 100-line range are the same size as the modules that constitute the larger systems, and that, therefore, the results obtained from using these smaller programs are, at least partially, generalizable to larger systems. By now, however, there is universal acknowledgment that writing a large system is not just a matter of scaling up the manpower required for a small system; it requires a set of techniques which are qualitatively and quantitatively different, and the factors which cause such projects to succeed or fail are entirely different than those involved in small systems.

Unfortunately, all of the published studies in this area have used programs of less than 500 lines in size. To legitimately generalize from these experiments with small programs to large ones requires a psychological characterization or model of the effects of program size. Since such models are not yet available, all of the work done to date is in need of replication using larger programs before any claims can be made about the generality of the results.

In addition to being appropriate to the experimental difference being investigated, stimulus materials must meet several other criteria. One of the more difficult of these is insuring comparability across different experimental conditions. If an experiment requires the use of several different programs in different conditions, then the programs must be comparable in all significant respects except those which are experimental manipulations. The difficulty lies in deciding, at the present level of knowledge about the psychology of programming, which aspects are actually significant. In the absence of adequate theory, an obvious approach for insuring comparability is to match the programs on their intrinsic characteristics, such as their length, the language in which they are written, and the kinds of data structures and variables which they use. For most studies, it will be extremely difficult to find a set of programs which match well on all the possible measures.

An alternative to attempting to match on multiple characteristics is to use the measures of program complexity developed in software science [10, 12]. This work is aimed at deriving measures of the difficulty or complexity of a program from the operations performed within the program. The measures are intended to be independent of the task the program is performing and of the language in which the program is written, but to correlate well with other indicators of program complexity.

Experimental work indicates that these measures can be used to predict both the time and the accuracy of making modifications in small programs [4, 17]. Researchers considering these measures should, however, also be aware that they leave nearly half of program variability unaccounted for, and that they appear to be most effective with inexperienced programmers and with unstructured programs.

A final consideration in selecting materials is that they be of an appropriate level of difficulty to produce data with desirable statistical characteristics. Experiment I, in the study of Shneiderman et al. [18], provides an illustration of what happens if the problems are too easy. In that experiment, scores for the two experimental groups were 94 and 95 respectively, while the maximum possible score was 100. With the means so close to the upper limit, a number of subjects must have received the maximum score. Not only has this probably acted to reduce the magnitude of the experimental difference, but

it also invalidates any statistical analyses, such as the t-test, which make the assumption of an underlying normal distribution. Similar results would occur in a study in which many of the subjects received scores of zero.

2.3 Selecting an Experimental Measure

From the point of view of their effects on programmer behavior, most innovations in programming languages or techniques can influence the ease with which programs can be constructed and/or the ease with which existing programs can be understood. The experimental tasks used in these studies will, therefore, be aimed at measuring changes in either or both of these properties. Shneiderman [19] discusses a range of such tasks and presents experimental evidence in favor of a particular one, "memorization-recall." The aim of the discussion here is to elaborate on some of the issues he raises.

2.3.1 Program construction measures. An extremely powerful claim for any programming technique or language feature is that it improves the program writing process, either by reducing the effort needed to write programs or by producing a better final product. The obvious approach to assessing such a claim is to have subjects write programs with and without the new technique; for example, the advantages of the case statement over the switch statement could be assessed by having subjects solve the same problem using languages which differed only in the feature in question.

A wide variety of measures are available for making this assessment. If the claim under test is that the new feature reduces programming effort, then an important measure is the time taken to write the program. Unfortunately, this measure suffers from the problems, discussed in a later section, that are common to time measures in general. An additional problem of insuring accurate measurement arises if subjects are allowed to set their own work periods. For these reasons, the time measures must be supplemented with other measures, such as the number of debugging runs or the ratio of total lines written to final program size. Sackman [15] illustrates the use of a number of such measures.

If the claim is that the new feature improves program quality, then the measurement issue becomes more difficult. For a situation in which "quality" is considered equivalent to execution efficiency, runtime provides an empirical measure of program quality. At the other extreme, if quality is interpreted to involve issues of programming style, then no objective measures are readily available and the experimenter will have to be concerned with issues of rating scale construction and rater reliability.

2.3.2 Debugging and modification measures. While this class of measures may be used to directly test claims about the ease of debugging or modification, a more frequent use will probably be to verify assertions about program comprehensibility. Such tests involve an inference: To modify or debug a program, the programmer must first understand it. In practice, this inference may be difficult to guarantee; Gould [8] contains examples of debugging tactics which require little or no understanding of the program involved, and many modification tasks, such as altering output formats, can also be accomplished without understanding. To use modification or debugging tasks successfully to measure comprehension, the experimenter should be sure that the task cannot be accomplished either by looking for statements of a particular type or by simple checks on the properties of variables. The use of multiple bugs or modifications in a single program serves as an additional safeguard that comprehension is not limited to a single section of the program.

An additional problem peculiar to modification tasks is that the specification of the modification may, itself, give away information about the program. The following modification was used by Sheppard et al. [16]:

    Subroutine EUCLD currently finds the greatest common divisor of two integers (M, N). Modify the subroutine so that it finds the greatest common divisor of three integers (L, M, N).

Clearly, this modification task will not reveal much about how well subjects understood what the EUCLD subroutine did in the original program.

2.3.3 Memorization-recall and reconstruction. This class of tasks involves presenting a programmer with a program for some time interval, removing the program, and then asking the programmer to reproduce it. Appropriate scoring systems are used to quantify the functional, as well as literal, accuracy of the reproductions. Shneiderman [19] has advocated this technique for evaluating the comprehensibility of programs, and he has demonstrated its sensitivity to differences in program structure and in comment location.

The implicit assumption behind the use of these tasks is that, with restricted study times, the easier a program is to comprehend, the easier it will be to learn. This assumption can be based on one of two models. The model Shneiderman used is based on analogy to the work of Chase and Simon [3] on chess. It assumes that the programmer learns the program from the bottom up by organizing his or her knowledge into ever-larger units. The ease of comprehending the program will affect the effectiveness of this reorganization process. On the basis of this model, Shneiderman advocates the term "memorization-recall" for this kind of task.

An alternative model is that during the learning phase programmers extract enough information to reconstruct the program. This information is a mixture of low-level details such as variable names, constants, and expressions and higher level information about algorithms and global structure. The ease of comprehension of the program will affect the ease with which the higher level information can be extracted.

These two models are not exclusive, and which model

is a better description of actual events will depend on the experimental instructions and on individual strategies. The memorization-recall model will be a better fit if subjects are encouraged to be as literally accurate as possible, while the reconstruction model is more likely to be appropriate if subjects are told to write a program that is as close as possible to the original.

From the point of view of the experimenter, the choice of a model will have a decided impact on the construction of experimental materials and on the interpretation of study results. The memorization-recall model is consistent with an interpretation of "understanding" a program as complete knowledge of all the details of the program's construction. Appropriate scoring techniques will be ones that check for literal accuracy, such as those used by Shneiderman.

The problem with the memorization-recall model is that it is applicable only to isolated modules or to toy or student programs. Even though a programmer is thoroughly familiar with, say, a typical compiler, he or she will certainly be unable to reproduce it literally. For systems of realistic size, instructions which encourage subjects to behave consistently with a reconstruction model will probably be more appropriate. The drawback of such instructions will be that it will be necessary to develop a scoring scheme that compares programs on underlying structure, rather than on literal equivalence.

2.3.4 Question answering. If the goal of the experiment is primarily to assess comprehension of a program, then an attractive task is simply to have the subject study a program and then to ask him or her questions about it. These tasks may be used alone or in conjunction with construction, debugging, or memorization tasks.

The kinds of questions used can range from completely open-ended ("How does this program work?") to completely structured multiple-choice. Open-ended and short-answer questions have the advantage that it is fairly easy to construct a comprehensive set of questions. On the other hand, scoring is often difficult; unless an elaborate formal scoring scheme has been created, it is often difficult to tell how much more accurate one description of how a program works is than another.

A variant on the open-ended question is, instead, to ask the programmer to talk aloud as he or she looks at the program. These protocols can then be used as the basis for constructing models of the problem-solving process such as those of Newell and Simon [13].

The primary drawback to multiple-choice questions is that construction of a sufficiently large set of questions often requires considerable effort, since the questions must satisfy a number of criteria. First, while the correct response should be the one that is chosen most often, the question must not inadvertently reveal the answer to another question. Given the difficulty of balancing all these requirements, extensive pretesting is virtually a necessity in the use of multiple-choice questions.

2.3.5 Hand execution. The accuracy or speed with which a programmer can hand execute or simulate a program has been used as a measure of program quality [24]. A major problem with the use of this task for this purpose or for measuring comprehensibility is that there is no easy way to insure that the hand execution involves knowledge of the overall structure or organization of the program, since hand (or machine) execution can be accomplished on a statement basis. Combination of this task with question answering about the overall structure would yield greater confidence in interpreting the experimental results.

2.3.6 A comment on time measures. The previous discussion has not advocated any particular experimental task as being uniformly superior. The choice of a particular task will depend on the experimental question and on the subject population and resources available to the experimenter; most studies will use several different tasks. A general comment can, however, be made about the use of time as a variable.

A first consideration that must be kept in mind for use of this measure is that the time must be measured in such a way as to exclude irrelevant behavior; for example, the experimenter is rarely interested in the time taken to understand the task instructions and would like to exclude this from total program time. While a simple solution to this problem would be to time from the point at which the subject actually begins to write code, some subjects may begin coding before they understand the problem and then go back to the problem description. A potential remedy for this problem is to instruct subjects that they will be allowed to read the problem description only once and that they must be sure they understand it before they begin to generate code. After subjects have read the problem, they would be asked to answer questions about it, and whenever a subject gives incorrect answers, he or she would be asked to go back and reread the problem description.

A similar difficulty in the use of time measures occurs because not all parts of a program are relevant to the hypothesis under test. If the hypothesis is about control structures for iteration, then including the time spent on I/O in the program writing time may give false results, particularly if the I/O turns out to be unusually difficult. One solution to this problem, used by Brooks [2], is to supply the programmer with a partially completed program containing all the code except that which is of
other responses are also chosen a moderate fraction of experimental interest. Another alternative would be to
the time. Second, if the goal of the experiment is to time the programmer only when he is working on the
measure comprehension of the entire program, then the relevant sections of the program; the difficulty in imple-
questions must cover all aspects of the program to ap- menting this method would be determining reliably what
proximately the same degree. Finally, the content of one the programmer is working on at given instant.
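Implementing this last alternative requires a record of which section the programmer is working on at each moment. Assuming the experimenter can capture that as a log of (timestamp, section) events, perhaps from periodic observation of the subject's work, the bookkeeping might be sketched as follows; the event format and function name are illustrative, not taken from any of the studies cited:

```python
# Sketch: total up the time spent in each program section from a
# hypothetical event log, so that time spent on sections irrelevant
# to the hypothesis (e.g., I/O code) can be excluded from the measure.

def time_per_section(events, end_time):
    """events: list of (timestamp, section) pairs, sorted by timestamp,
    each marking the moment the subject switched to that section.
    Returns a dict mapping section -> total seconds spent in it."""
    totals = {}
    # Pair each event with the next one; the final event runs to end_time.
    for (start, section), (nxt, _) in zip(events, events[1:] + [(end_time, None)]):
        totals[section] = totals.get(section, 0) + (nxt - start)
    return totals

# Hypothetical session: the subject alternates between the iteration
# code of experimental interest and the I/O code to be excluded.
log = [(0, "iteration"), (300, "io"), (420, "iteration"), (900, "io")]
totals = time_per_section(log, end_time=960)
relevant = totals["iteration"]   # time on the sections under test
excluded = totals["io"]          # time to drop from the analysis
```

The hard experimental problem, as the text notes, is obtaining the event log reliably in the first place; the arithmetic afterward is trivial.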
212 Communications of the ACM, Volume 23, Number 4, April 1980
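Time scores of the sort described above are also awkward to analyze: a square-root or arcsin transformation [25] is one way to pull a long-tailed distribution of completion times closer to normal before applying standard statistics. A minimal sketch, using completion times invented for illustration:

```python
import math
import statistics

# Hypothetical completion times in minutes: most subjects cluster
# near 20, while a few slow subjects stretch the right tail.
times = [18, 19, 20, 20, 21, 22, 24, 35, 55, 90]

# The square-root transformation compresses the long right tail,
# making the distribution more nearly normal before t-tests or ANOVA.
transformed = [math.sqrt(t) for t in times]

# The gap between mean and median shrinks after transformation,
# a rough indicator of reduced skew.
raw_gap = statistics.mean(times) - statistics.median(times)
new_gap = statistics.mean(transformed) - statistics.median(transformed)
```

Whether a square-root, logarithmic, or arcsin transformation is most appropriate depends on how severe the skew is; the choice should be justified against the observed distribution, as Winer [25] discusses.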
In addition to the problems of determining exactly what to time, time measures often have the additional characteristic of having skewed distributions, with the slowest subjects taking several times as much time as the mode. In order to use most statistical techniques, which rely on a roughly normal distribution, this problem must be corrected. One way is simply to throw out the extreme values on the reasonable premise that those subjects either did not understand the problem or that they, in fact, belonged to a different population than the rest of the subjects. Another method is to use statistical transformations, such as a square root or arcsin [25], to make the distribution more normal.

3. Prospects for Future Improvements

The intent of this paper has been to review some of the methodological problems faced in the behavioral evaluation of programming language constructs and techniques. Of the three topics surveyed, two, the selection of subjects and the selection of materials, currently remain sources of difficulty in performing behavioral studies.

The impacts of the persistence of these two areas as problems are twofold. First, since the experimenter does not have good control over them, they introduce additional error variance into experiments. This variance may act to obscure the true differences arising from the particular programming feature or technique being studied.

More importantly, the lack of knowledge about the description and specification of differences among subjects and programs has a damaging effect on the generalizability of the experimental findings. If there is no effective, replicable way to describe the characteristics of subjects or programs used in an experiment, then there is no way to determine whether the results of the experiment apply to real world situations.

Desirable as it might be to come up with such descriptors, prospects for their development in the immediate future do not appear good. The essential problem is that there is a strong interaction between the characteristics of the programmer and those of the program; the difficulty in modifying, rewriting, or understanding a program may be as much a function of the programmer's familiarity with that type of program as it is of intrinsic characteristics of the program itself. As Curtis et al. [4] point out, measures such as Halstead's or McCabe's that are based only on control flow or operational characteristics of the program can never tap these additional sources of variation. An analogous problem is likely to occur with measures of programmer ability that are based on gross experience or training measures.

What approaches, then, show promise? Any successful characterization of the program-programmer interaction will probably be based on a model of the process or processes used by a programmer in interacting with a program. The development of such theories or models of the cognitive processes involved in programming is, therefore, likely to be a prerequisite to progress on these methodological issues.

Acknowledgment. This paper benefited from the comments of B. Shneiderman on an earlier draft.

Received 8/79; revised 11/79; accepted 1/80

References
1. Boies, S.J., and Gould, J.D. Syntactic errors in computer programming. Human Factors 16 (1974), 253-257.
2. Brooks, R. Using a behavioral theory of program comprehension in software engineering. Proc. 3rd Internat. Conf. on Software Eng., IEEE, New York, 1978, pp. 196-201.
3. Chase, W.G., and Simon, H.A. Perception in chess. Cognitive Psychol. 4 (1973), 55-81.
4. Curtis, B., Sheppard, S.B., and Milliman, P. Third time charm: Stronger prediction of programmer performance by software complexity metrics. Proc. 4th Internat. Conf. on Software Eng., IEEE, New York, 1979, pp. 356-360.
5. Gannon, J.D. An experiment for the evaluation of language features. Internat. J. of Man-Machine Studies 8 (1976), 61-73.
6. Gannon, J.D. An experimental evaluation of data type conventions. Comm. ACM 20, 8 (Aug. 1977), 584-595.
7. Gould, J.D., and Drongowski, P. An exploratory study of computer program debugging. Human Factors 16 (1974), 258-276.
8. Gould, J.D. Some psychological evidence on how people debug computer programs. Internat. J. of Man-Machine Studies 7 (1975), 151-182.
9. Green, T.R.G. Conditional program statements and their comprehensibility to professional programmers. J. of Occupational Psychol. 50 (1977), 93-109.
10. Halstead, M.H. Elements of Software Science. Amer. Elsevier Pub. Co., New York, 1977.
11. Knuth, D.E. An empirical study of FORTRAN programs. Software Practice and Experience 1, 2 (1971), 105-133.
12. McCabe, T.J. A complexity measure. IEEE Trans. on Software Eng. SE-2 (1976), 308-320.
13. Newell, A., and Simon, H.A. Human Problem Solving. Prentice-Hall, Englewood Cliffs, N.J., 1972.
14. Rubey, R.J. A comparative evaluation of PL/1. Datamation 20 (Dec. 1968).
15. Sackman, H. Man-Computer Problem Solving: Experimental Evaluation of Time-Sharing and Batch Processing. Auerbach, New York, 1970.
16. Sheppard, S.B., Borst, M.A., Curtis, B., and Love, T. Predicting programmers' ability to modify software. TR-388100-3, General Electric Co., Arlington, Va., 1978.
17. Sheppard, S.B., Borst, M.A., Milliman, P., and Love, T. Modern coding practices and programmer performance. Computer 12 (1979).
18. Shneiderman, B., Mayer, R., McKay, D., and Heller, P. Experimental investigation of the utility of detailed flowcharts in programming. Comm. ACM 20, 6 (June 1977), 373-381.
19. Shneiderman, B. Measuring computer program quality and comprehension. Internat. J. of Man-Machine Studies 9 (1977), 465-478.
20. Sime, M.E., Arblaster, A.T., and Green, T.R.G. Reducing programming errors in nested conditionals by prescribing a writing procedure. Internat. J. of Man-Machine Studies 9 (1977), 119-126.
21. Sime, M.E., Green, T.R.G., and Guest, D.J. Psychological evaluation of two conditional constructions used in computer languages. Internat. J. of Man-Machine Studies 5 (1973), 105-113.
22. Sime, M.E., Green, T.R.G., and Guest, D.J. Scope marking in computer conditionals--a psychological evaluation. Internat. J. of Man-Machine Studies 9 (1977), 107-118.
23. Weinberg, G., and Shulman, E. Goals and performance in computer programming. Human Factors 16, 1 (1974).
24. Weissman, L. A methodology for studying the psychological complexity of computer programs. Unpub. Ph.D. diss., Univ. of Toronto, Canada, 1974.
25. Winer, B. Statistical Principles in Experimental Design. McGraw-Hill, New York, 1971.
26. Youngs, E.A. Human errors in programming. Internat. J. of Man-Machine Studies 6 (1974), 361-376.