Abstract
Usability is an important step in the software and product design cycle. There are a number of methodologies, such as the talk aloud
protocol and cognitive walkthrough, that can be employed in usability evaluations. However, many of these methods are not designed to
include users with disabilities. Legislation and good design practice should provide incentives for researchers in this field to consider more
inclusive methodologies. We carried out two studies to explore the viability of collecting gestural protocols from sign language users who
are deaf using the think aloud protocol (TAP) method. Results of our studies support the viability of gestural TAP as a usability
evaluation method and provide additional evidence that the cognitive systems used to produce successful verbal protocols in people who
are hearing seem to work similarly in people who speak with gestures. The challenges for adapting the TAP method for gestural language
relate to how the data was collected and not to the data or its analysis.
© 2005 Elsevier Ltd. All rights reserved.
Keywords: Usability; Usability evaluation methods; Deaf; Think aloud protocol; Gestural think aloud protocol
1071-5819/$ - see front matter © 2005 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ijhcs.2005.11.001
ARTICLE IN PRESS
490 V. Louise Roberts, D.I. Fels / Int. J. Human-Computer Studies 64 (2006) 489–501
The theoretical underpinning applicable to the TAP method and to this research is that information held or recently held in short-term memory (STM) and working memory is accurately revealed through verbalizations. What is important for GTAP to be successful is that there is a trace of recent thought in working memory that may be articulated via sign language. Research with ASL participants (Bellugi et al., 1974; Campbell and Wright, 1990; MacSweeney, 1998; Wilson and Emmorey, 1998, 2003) provides support for STM or working memory processes in individuals who are deaf that are similar in function to those of individuals who are hearing. Also, these studies indicate that participants are able to access and articulate items in working memory. Thus, the use of gestural TAP with individuals who are deaf and sign language speakers is further supported.
There have been two other studies with participants who were deaf that have been characterized as think aloud studies: the reading comprehension studies of Andrews and Mason (1991) and Schirmer (2003). The former asked participants to read a cloze passage and, upon reaching the omitted word, to think their thoughts aloud. At the same time, the researchers engaged the participant in conversation and asked probing questions. This study was not a typical think aloud method as it was more likely to draw introspective as well as retrospective responses and did not follow the standard procedure where prompts to participants are limited to “keep talking” or “thoughts?” Schirmer (2003) asked participants to think aloud at natural breaks in the text, such as page breaks, rather than as a concurrent thought stream. While it is clear that TAP is a very useful tool in understanding reading strategies, there is no body of research that explicitly verifies that gestural TAP and verbal TAP are equivalent cognitively or produce the same results.
The think aloud method or protocol (TAP) was first used in research on problem-solving behaviour (Van Someren et al., 1994). Pioneered by Newell and Simon (1972) and refined by Ericsson and Simon (1984), TAP requires that participants speak their thoughts aloud as they complete an assigned task.
The TAP method is not associated with any one model of cognition or memory (Ericsson and Simon, 1984; Green, 1998) and may be best viewed as a method for testing hypotheses (Ericsson and Simon, 1984; Green, 1998). For example, the usability researcher tests hypotheses held by developers about how certain features will be used by the audience or how well the needs and expectations of the audience are met by the product. Furthermore, “one of the uses of think aloud protocol is to assist in forming hypotheses about areas where not much is known” (Wiedenbeck et al., 1989, p. 25). Hence, TAP is particularly useful for the iterative development cycle of software.
As an inclusive UEM, TAP has an advantage over structured elicitation techniques such as interviews and surveys: the think aloud method makes it easy for the subjects because they are allowed to use their own language (Van Someren et al., 1994, p. 26) and own words. In this way, the protocol method may be ideal for speakers of ASL, who will be able to articulate their thoughts in their first language using their known words, grammar and syntax. For most native ASL speakers, oral language and its written form is a second language in which they are less fluent (Schirmer, 2000). For this reason, written questionnaires and responses or paper-based surveys are not always an appropriate means for equitable user testing. Gestural TAP enables spontaneous user feedback about how a task is being accomplished without the additional and potentially interfering cognitive processes required for individuals to work in a language that is not their first language. Also, research has shown that the concurrent verbal protocols obtained using TAP methods are superior to retrospective protocols obtained after the task is completed (Nisbett and Wilson, 1977; Ericsson and Simon, 1984; Kuusela and Paul, 2000). Furthermore, in a study that compared four usability methods (logged data, questionnaires, interviews and verbal protocol analysis), TAP was shown to be “the most effective single method at highlighting usability problems” (Henderson et al., 1995, p. 426). Indeed, in some cases, when UEMs were paired, the paired UEMs still failed to identify as many unique problems as were identified through TAP alone (Henderson et al., 1995).
Signed languages are complete, natural languages that are distinct from spoken languages (Stokoe, 2001) and have a distinct phonology, morphology, syntax and vocabulary (Messing, 1999). Thus ASL has the same potential to provide rich protocols as do spoken languages. Although oral and gestural languages rely on different modes for communication, the underlying neural structures are actually very similar (MacSweeney, 1998). In hearing individuals, the left hemisphere of the brain is responsible for speech production and language comprehension (McNeill, 1992). Corina (1998) reviewed 16 studies of individuals who communicate with sign and have suffered brain lesions and found that the studies provided sufficient evidence to support the notion that the left hemisphere of the brain is also responsible for sign language comprehension and sign production.
Indeed, this notion that language and gesture rely on similar brain regions and processes is supported by research with individuals with aphasia (McNeill, 1992). Furthermore, Bates and Dick (2002) in their review of language and gesture research found that “compelling links have been observed, involving specific aspects of gesture that precede or accompany each of the major language milestones from 6 to 30 months” (p. 293), and “that there is research to support links between gesture and language development in both typical and atypical populations” (p. 295). The researchers conclude that “the division of labor in the brain does not seem to break down neatly into language versus nonlanguage” (p. 305). Thus there is support for the notion that the collection of gestural protocols makes similar cognitive demands on the participant as the collection of verbal protocols.
1.1. Scope
2. Method
Fig. 1. Lab configuration for Solitaire Game Study.
2.1. Solitaire game study
2.1.1. Materials
For the game component, Microsoft Solitaire on a laptop computer was used. One video camera was used to record the screen and spoken English interpretation simultaneously (see Fig. 1 for schematic).
2.1.3.2. Think aloud protocol. Participants in the game study were asked to speak/sign their thoughts at any time or at least before making a move/using the mouse. After short periods of silence (about 10 s), several moves without utterances, or indications of thought such as facial expressions or head nodding, the investigator would remind the participant to “keep talking.”
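The silence-based reminder rule in the think aloud procedure can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function name `prompt_times`, the timestamp representation (seconds from session start) and the example values are hypothetical, and the sketch covers only the silence trigger of about 10 s, not the other triggers the investigator used (several moves without utterances, facial expressions or head nodding).

```python
def prompt_times(utterance_times, session_end, max_silence=10.0):
    """Return the times (in seconds) at which a "keep talking" reminder would be due.

    Assumption: one reminder per silent gap, issued once the gap exceeds
    max_silence seconds. The 10 s default follows the "about 10 s" in the text.
    """
    prompts = []
    previous = 0.0  # session start
    for t in sorted(utterance_times) + [session_end]:
        if t - previous > max_silence:
            # The reminder falls max_silence seconds after the last utterance.
            prompts.append(previous + max_silence)
        previous = t
    return prompts


# Hypothetical session: utterances at 3 s, 8 s and 30 s, session ends at 35 s.
print(prompt_times([3, 8, 30], 35))
```

In practice the investigator judged these moments by observation; the sketch only shows how the timing rule could be applied to a transcript after the fact.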
grouping factor is language. Protocols collected for the game study are categorized as belonging to one of the following categories: (1) responses to the cards (e.g., expression of need); (2) play (e.g., card value); (3) strategy (e.g., plan for best play); (4) general comments about Solitaire; (5) error (e.g., having made a mistake); (6) TAP (e.g., effect of TAP on play); (7) technical (comment about hardware); and (8) usability (e.g., comments about game software environment).
[Lab schematic; labels: C, Interpreter, Participant, C, Investigator]
The verbal protocols were transcribed from videotape, chunked into segments and then coded for analysis. The Intraclass Correlation statistic was used to determine the reliability of the coding strategy between two separate coders. The ASL was translated and recorded simultaneously during the study.
Participants were asked to complete a pre-study questionnaire. The Solitaire pre-study questionnaire asked some background information questions as well as questions about the participant’s experience with Solitaire and skill level. The participant had the option to have any written material translated into ASL and to respond in ASL.
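The inter-coder reliability check mentioned above relies on the Intraclass Correlation statistic. The text does not state which ICC form was used, so the following is a minimal sketch that assumes the one-way random-effects form, ICC(1,1); the function name `icc_oneway` and the example ratings are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    ratings: list of tuples, one tuple per rated unit (e.g., per participant),
    holding the value assigned by each coder (e.g., a category count).
    Assumption: the source does not say which ICC form was used.
    """
    n = len(ratings)       # number of rated units
    k = len(ratings[0])    # number of coders
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-unit and within-unit mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical counts from two coders for three participants.
print(round(icc_oneway([(1, 2), (3, 4), (5, 6)]), 3))
```

Values close to 1 indicate that most of the variance lies between rated units, not between coders.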
2.2.1. Materials
A software video player interface was developed and tested for the usability component. The interface allowed users to view an educational video about traffic court along with an ASL interpretation of the video. The original educational video was produced as a standard National Television Systems Committee (NTSC) video with traditional verbatim closed captions. The ASL interpretation was provided in two formats: an Acted ASL version, where actors in costume played the parts of the original actors, and a Standard ASL version using a single translator for all of the parts. The Acted ASL version showed ASL actors in two separate video windows along with the original video. The Standard ASL version showed the translator in one separate window along with the original video.
Video controls in the interface allowed the user to play, pause, move forward or rewind the video. Viewing preferences could also be set by the user and included adjustment of video and interpretation window positions, proximity of the windows, relative size of the windows and use of borders around the windows. A short training video was provided to assist participants in learning the interface as well as in practicing GTAP. The educational video was provided by Canadian Learning Television and is part of a full social studies curriculum.
For the usability study, subjects were provided with a PC that included keyboard, mouse, speakers and 17-in colour monitor. Each PC was further equipped with key-stroke logging software. Two video cameras were used to record the signs made by the participant and by the investigator so that they could be translated at the close of the study. Please see Fig. 2 for schematic.
2.2.2. Subjects
For the usability study, there were 17 (9 female and 8 male) participants who were deaf and fluent in ASL and ranged in education level from high school to college/university. The Canadian Learning Television video on traffic court was aimed at the age and education level of the participants.
2.2.3. Procedure
2.2.3.1. Task. Individuals in the usability study were asked to watch a 10-min video segment of a traffic court educational video. The first viewing of the video was followed by the comprehension test. Next, participants were allowed to adjust viewing preferences before viewing the same educational video segment but with a different ASL interpretation format from the first viewing. The session closed with a short questionnaire and discussion.
2.2.3.2. Think aloud protocol. Usability participants were asked to sign their thoughts about the task as they were performing it. After short periods of silence (about 10 s), or
indications of thought such as facial expressions or head nodding, the investigator would remind the participant to “keep talking.”
2.2.4. Outcome measures and data analysis
The law video usability study is a within-subjects design. The experimental factors of interest in this study are viewing order and interpretation type (Acted ASL and Standard ASL). The law video data is aggregated into four overarching categories that relate to the specific research questions of the usability study: (1) interpretation, (2) format, (3) technical issues and (4) content. All of the analyses are based on these macro-categories. To understand the quality of the comments in these macro-categories, further sub-categories of positive and negative are used.
A counterbalanced order for the levels of presentation technique was established and participants received the order that was assigned to their subject number. Participants were assigned subject numbers consecutively and the order of subjects was simply determined by the order that arose from the subjects’ availability for the study.
As in the game study, video data was transcribed and coded. The Intraclass correlation statistic was used to verify the consistency and codability of the data. In this study, the ASL was translated after the study.
Participants were asked to complete a pre-study questionnaire. The usability pre-study questionnaire asked questions about the participant’s experience with online video, computers and ASL interpretations of video material. The participant had the option to have any written material translated into ASL and to respond in ASL.
2.3. Statistical issues
Non-parametric tests were used to compare the means of the two groups in both studies as the number of subjects is considered small. A higher alpha level for small groups will increase the power of the test and improve the chance of making a correct decision about the null hypothesis, so the level for tests of both sets of data was set to α = .10 (Stevens, 1996, pp. 6 and 175). However, because the data is multivariate, the alpha level for univariate analysis will be calculated as .01 (roughly α/p, where p = the number of categories) in order to reduce the risk of a type I false rejection error (Stevens, 1996, p. 160). Power is particularly important in the Solitaire game study since the expectation is that there will be no difference between the two groups. It should be noted that no data is removed in order to increase power.
3. Results
3.1. Solitaire game study
A sample set of the transcripts for the protocols was divided into utterance “chunks” that represented a complete move, such as playing a card, and coded by two independent raters. Inter-rater reliability (ICC) of the chunking was high at 76% for 416 sample utterances, and ICC for the coding components was high, ranging from 68 to 100% for all categories. Chunking and coding tasks were then continued by only one of the raters.
The average number of comments for the play category was nearly twice as high for the Oral group as for the ASL group (53 and 28, respectively). However, a 1 min snapshot of the data at the third minute of each session shows no difference in the number of play comments. This 3rd min was randomly selected to provide a normalized unit of time to eliminate differences in counts due to time elapsed during play, which varied from participant to participant. The mean numbers of comments for each category are shown in Tables 1 and 2.
The standard deviation for all category counts is high for both full session and minute-three protocols.
Table 1
Descriptive group data (n = 9) for category counts for minute three of session protocol
Category            ASL                     Oral
                    Mean    Count   SD      Mean    Count   SD
Play                4.33    39      4.33    4.33    39      2.74
Strategy             .22     2       .44     .33     3       .50
Cards               1.44    13      1.33    2.22    20      1.39
Solitaire            .22     2       .44     .22     2       .44
Error                .11     1       .33     .33     3       .50
TAP                  .11     1       .33     .11     1       .33
Distracted           .56     9      1.33     .00     0       .00
Technical            .11     1       .33     .00     0       .00
Usability            .00     0       .00     .00     0       .00
Conversation         .89     8      1.05    1.11    10      1.27
No. of utterances   7.44    76      3.00    7.11    78      2.37
Table 2
Descriptive group data (n = 9) for category counts for complete session protocol
Category            ASL               Oral
                    Mean    SD        Mean    SD
Play                27.67   18.76     52.89   26.90
Strategy             1.67    1.94      2.78    2.22
Cards               10.56    6.06     14.67    8.76
Solitaire            1.11    1.36      2.22    1.56
Error                1.11    1.45      2.00    1.73
TAP                  1.67    1.66       .56    1.01
Distracted           2.56    5.66       .00     .00
Technical             .33     .50       .33     .50
Usability             .56     .88       .22     .44
Conversation        11.44    6.19      5.78    5.19
No. of utterances   53.56   16.45     72.56   27.12
Multivariate analysis of the dependent variable, language, showed no effect of group; however, the observed
power was low at .33. Wilcoxon t-tests were carried out for each of the categories for utterances made for the entire session and for utterances made during the 3rd min slice only. The individual results are summarised in Tables 3 and 4. No significant difference (p < .01 level) on number of comments produced in a category was found in either the whole session or minute-three data.
Participants in both groups averaged 7 utterances during minute three of the verbal protocol. Category sums related to playing the game (play, strategy, cards, error) and categories not related to the game (solitaire, distracted, TAP, technical, usability, conversation) were aggregated and are shown in Table 5. The overall ratio of game comments to non-game comments was 4:1. A Wilcoxon t-test for the two aggregate categories showed no significant difference between the groups.
Table 3
Wilcoxon t-test results for whole session data
Category            Test statistic   Asymptotic significance
Play                64                .06
Strategy            73.5              .28
Cards               75.5              .38
Solitaire           69.5              .14
Error               73.5              .26
TAP                 68.5              .11
Distracted          72                .07
Technical           85.5             1.00
Usability           79                .47
Conversation        63.5              .05
No. of utterances   68.5              .13
Table 4
Wilcoxon t-test results for minute three data. Protocol analysis
Category            Test statistic   Asymptotic significance
Play                80.5              .66
Strategy            82.5              .78
Cards               81                .61
Solitaire           85.5             1.00
Error               76.5              .27
TAP                 85.5             1.00
Distracted          76.5              .15
Technical           81                .32
Usability           85.5             1.00
Conversation        73                .25
No. of utterances   83.5              .86
Table 5
Aggregate category sums for minute three of verbal protocol
         Group total          Average              Ratio
         Game    Non-game     Game     Non-game
ASL       56     16            6.22    1.78        3.5:1
Oral      64     14            7.11    1.56        5.6:1
Total    120     30           13.33    3.33        4:1
3.2. Law video usability study
This task is a case study of the adapted TAP method in a usability study. The grouping factors in this study were interpretation treatment (acted and standard interpretation) and viewing order of the video interpretation treatment.
Due to technical difficulties, only 13 participants of the initial 17 have full data sets (26 ten-min video sessions) and only these full sets are included in the analysis. Only the verbal protocols for the law video viewing sessions are analysed because these sessions constitute the GTAP usability portion of the study.
An initial coding scheme of nine categories (see Table 6) was developed. Seven of the categories were aggregated into four categories: the ASL interpretation (1 and 2), format (3, 4 and 7), technical issues (5) and content (6) of the law video. Categories 8 and 9 stood alone. Comments were coded as positive or negative for each aggregate category and simply counted in categories 8 and 9. The ICC results for the four main categories, which include all of the sub-categories, are high and are shown in Table 7.
Table 8 shows the average divided category counts for the data for each interpretation type and for each viewing order. T-tests were used to determine whether viewing order and interpretation type affected the quantity of positive and negative comments. There were significantly more positive comments during the second session (t(22) = 2.08, p < .05). No effect of interpretation type on number of comments was found.
Multivariate analysis of the ten grouping variables was conducted to determine whether order or interpretation type had an effect on the number of comments made. A significant effect of order was found with alpha at the .10 level [F(10, 13) = 2.31, p < .1]. The observed power of the effect is moderate at .81.
The categories were also grouped by order only and interpretation type only and Wilcoxon tests were run. Table 9 shows the findings for the tests that are significant or approached significance.
Many issues were identified by the participants. For example, two participants identified seven different issues for the acted interpretation and nine different issues for the standard interpretation; their comments are shown in Table 10.
Each of the 26 ten-min video sessions yielded at least one coded comment, with the average number of comments per session being 8.4. Table 11 provides a summary of the number of comments in each category. Examples of comments made by participants are shown in Fig. 3.
No significant correlation between interpretation type or viewing order and total number of comments was found; however, correlations between these variables and three of the four coding categories were found (see Table 12).
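The test statistics reported for the Solitaire study (Tables 3 and 4) are consistent with a Wilcoxon rank-sum test on two groups of nine, and the adjusted univariate alpha follows the α/p rule from the statistical issues section. The sketch below is an assumption-laden illustration, not the authors' procedure: it assumes the reported statistic is the smaller of the two rank sums and that the asymptotic significance comes from a tie-corrected normal approximation without continuity correction. The function names are hypothetical. The example data are reconstructed from Table 1, where the ASL group has a total count of 1 for the technical category and the Oral group a count of 0.

```python
import math


def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Return (smaller rank sum, two-sided asymptotic p) for two independent groups."""
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    n1, n2 = len(group_a), len(group_b)
    n = n1 + n2
    w_a, w_b = sum(ranks[:n1]), sum(ranks[n1:])
    # Tie correction: sum of (t^3 - t) over groups of tied values.
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        p = 1.0  # every observation is tied, so the groups cannot differ
    else:
        z = (w_a - n1 * (n + 1) / 2) / math.sqrt(variance)
        p = math.erfc(abs(z) / math.sqrt(2))
    return min(w_a, w_b), p


FAMILY_ALPHA = 0.10                           # alpha chosen for the small samples
N_CATEGORIES = 10                             # coding categories in Tables 1 to 4
ADJUSTED_ALPHA = FAMILY_ALPHA / N_CATEGORIES  # roughly alpha / p = .01

# Technical category, minute three: one ASL participant made one comment.
asl_technical = [1, 0, 0, 0, 0, 0, 0, 0, 0]
oral_technical = [0] * 9
statistic, p_value = rank_sum_test(asl_technical, oral_technical)
print(statistic, round(p_value, 2), p_value < ADJUSTED_ALPHA)
```

Under these assumptions the technical category gives a statistic of 81 with a significance of about .32, and a category with no comments in either group gives 85.5 and 1.00, matching the corresponding rows of Table 4.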
Table 6
Protocol analysis for usability study and sample comments
Table 8
Mean and standard deviation for split category counts divided by interpretation type and viewing order
Order    Interpretation   n    Statistic   [ten category columns; column headings not recoverable]
First    Interpreted       8   Mean   1.38   3.00    .00    .38    .00   1.50    .13    .75    .38   2.25
                               SD     1.51   3.38    .00   1.06    .00   1.41    .35   1.75    .74   5.20
         Acted             5   Mean   1.40   1.80    .00   2.40    .00   3.20    .20    .80    .00    .60
                               SD     2.19   1.48    .00   1.67    .00   2.86    .45   1.10    .00    .89
         Total            13   Mean   1.38   2.54    .00   1.15    .00   2.15    .15    .77    .23   1.62
                               SD     1.71   2.79    .00   1.63    .00   2.15    .38   1.48    .60   4.09
Second   Interpreted       5   Mean   1.40    .40   2.40   1.40    .00   1.80    .00    .00    .00    .40
                               SD     1.95    .55   2.19   1.34    .00   1.30    .00    .00    .00    .89
         Acted             8   Mean   2.62    .63   1.50   2.38   1.00   1.13    .13    .38    .00   1.88
                               SD     4.27    .92   1.77   2.07   1.07   1.89    .35    .52    .00   2.17
         Total            13   Mean   2.15    .54   1.85   2.00    .62   1.38    .08    .23    .00   1.31
                               SD     3.51    .78   1.91   1.83    .96   1.66    .28    .44    .00   1.89
Total    Interpreted      13   Mean   1.38   2.00    .92    .77    .00   1.62    .08    .46    .23   1.54
                               SD     1.61   2.92   1.75   1.24    .00   1.33    .28   1.39    .60   4.12
         Acted            13   Mean   2.15   1.08    .92   2.38    .62   1.92    .15    .54    .00   1.38
                               SD     3.56   1.26   1.55   1.85    .96   2.43    .38    .78    .00   1.85
         Total            26   Mean   1.77   1.54    .92   1.58    .31   1.77    .12    .50    .12   1.46
                               SD     2.73   2.25   1.62   1.75    .74   1.92    .33   1.10    .43   3.13
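Table 8 reports counts for categories split into positive and negative comments. As a rough illustration of how the aggregation described in Section 3.2 could be applied to coded comments, the sketch below folds the nine initial categories into the four aggregate categories and leaves categories 8 and 9 as simple counts. The category numbers and groupings come from the text; the names `AGGREGATE` and `tally`, the valence symbols and the sample data are hypothetical.

```python
from collections import Counter

# Grouping stated in the text: interpretation (1 and 2), format (3, 4 and 7),
# technical issues (5) and content (6). Categories 8 and 9 stand alone.
AGGREGATE = {
    1: "interpretation", 2: "interpretation",
    3: "format", 4: "format", 7: "format",
    5: "technical",
    6: "content",
}


def tally(comments):
    """Count coded comments.

    comments: iterable of (category_number, valence) pairs, where valence is
    "+" or "-". Aggregate categories are split by valence; categories 8 and 9
    are simply counted, so their valence is ignored.
    """
    counts = Counter()
    for category, valence in comments:
        if category in AGGREGATE:
            counts[(AGGREGATE[category], valence)] += 1
        else:
            counts[(f"category {category}", None)] += 1
    return counts


# Hypothetical coded comments from one session.
session = [(1, "+"), (2, "-"), (3, "-"), (7, "-"), (5, "+"), (8, "+"), (9, "-")]
for key, count in sorted(tally(session).items(), key=str):
    print(key, count)
```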
Table 10
Issues identified in two randomly selected protocols for each interpretation mode
Participant B 1. The interpretation and movie are not synchronized 1. Closed captioning would make this easier
2. Both are speaking at the same time 2. This is boring
3. The popping in is difficult to follow 3. The popping in and out is awkward
4. This is boring
Table 11
Count of coded comments for each category
Acted Interpreted
They should have deaf actors do the whole thing
Boring/long
They should have put window on bottom like close-captioning
I liked the straight caption better than the acted one
Dark background makes it hard to read the interpreter
I lose concentration trying to look up at the interpreter
I would like to change the border colour.
It would be good to be able to set the span of the frames to own setting [not pre-determined setting].
Why does the actor disappear? They should just freeze the frame.
The person is still talking, why aren’t they signing?
Fig. 3. Sample participant comments from law video usability sessions.
individuals who knew how to play Solitaire and had played the computer version before were selected to reduce the possibility that any comments were related to learning how to play the game as learning was not one of the factors measured in this study. Also, all deaf participants were culturally deaf and not hard of hearing or hearing impaired. Furthermore, all deaf participants used ASL for communication. Next, the alpha level was adjusted so that the probability of having a type II error decreased. These considerations ensure a powerful test given the small sample size. Indeed, the analysis indicates no difference between the groups despite statistically optimizing the chances of finding a difference.
In the Solitaire study there may be a linguistic preference for certain general types of comments (e.g., it is easier for ASL speakers to converse with the translator than produce game-type comments). To examine this, two overarching categories are considered in the protocol analysis: game comments and non-game comments. The game category is composed of those coding categories that specifically refer to an aspect of playing the Solitaire game such as play,
Table 12
Correlation matrix for split categories by order and interpretation
n = 26.
*Correlation is significant at the .05 level (2-tailed).
strategy, cards and error. The non-game category is composed of the coding categories that relate to the situation such as solitaire, distracted, TAP, technical, usability and conversation. When controlled for session length differences, on average, oral language participants made 1.11 more game comments and only a fraction fewer non-game comments per session than did gestural language participants. However, they are not significantly different. Thus, it is likely that both groups were able to concentrate on the game equally well while carrying out the think aloud method. Gesture does not seem to impede ability to play the game or ability to produce protocols.
4.2. The viability of GTAP for usability evaluation
We found that using GTAP in a usability study context, the law video usability study, produced usability results that are viable and useful. This study was a particularly interesting testing ground for GTAP as an inclusive UEM because it involved the complex management and presentation of multiple video windows (one video without ASL and a second video with the ASL interpretation) as well as a user interface for customization of the presentation. The study involved presentation of a novel ASL interpretation format (acted interpretation) as well as introduction to new viewing preference choices. The video interpretation and customization formats were specifically designed for individuals who communicate with ASL. The following analysis provides an interpretation of the GTAP usability data collected for this study.
The experimental factors of interest for the law video usability study were viewing order and interpretation type. The coding categories for the law video usability study arose from the questions being asked by developers and researchers. The greatest concern for the researchers was the participant’s response to Acted ASL interpretation as well as to the available viewing preferences (size of video boxes, position of boxes, proximity of boxes and outline of boxes). Developers were further concerned with the viewer interface controls/buttons. Potential confounds for participant responses were identified as production/technical issues such as lighting and video content that may not interest all participants. The inter-rater reliability for all categories is good, indicating that the categorization process and the resulting data are reliable.
In the law video usability study, there were a total of 262 comments generated by 17 participants (or approximately 15 comments per participant) in all categories and sub-categories. The majority of comments appeared in the negative sub-category of technical issues (25 for the Acted and 21 for the Standard). The fewest comments were made in the positive sub-category of the Content of Video category (2 for Acted and 1 for Standard). There were three sub-categories where no comments were generated for either the Acted or Standard groups (positive Viewing Preferences and negative and positive Ease of Using Video Interface). As demonstrated in Table 10, participants were able to identify a large number of issues with the interpretation modes. For example, from the randomly selected protocols of just two participants, seven different issues in the acted interpretation and nine different issues in the standard interpretation were identified. These two participants identified the same issue only once. The protocols of these two participants exemplify the variability between participant comments and the way that this variability benefits the usability process.
Differences in interpretation type and viewing order factors were found in four categories (these differences approached significance at the .01 level with a Wilcoxon t-test). Correlation coefficients seen in Table 12 also show these relationships. For viewing order, negative ASL and positive Technical approached significance (p = .016 and .015, respectively) and for video, negative format and positive technical approached significance (p = .017 and .015, respectively). The results shown in Table 8 indicate that for the positive technical category, all comments were made during the second viewing but only when the Acted ASL video was played (there were no positive technical comments for the Standard video version in either order nor were there positive technical comments for the Acted ASL video when it was the first video played). The technical category definition included anything that might be controlled in production such as synchronizing acted interpretation to main video, lighting, interpreter position, costumes/clothing. The Acted ASL video violated more user expectations with regards to costumes, interpreter
position and synchronicity. Indeed, the producers of the Acted ASL video deliberately broke with the standard presentation style of ASL interpretation video in order to explore new methods of presenting interpretation of existing videos and their possible benefit to viewers. When viewers were in a position to compare the acted interpretation to a standard interpretation, they made both positive and negative comments about the technical aspects of the acted video. During the first viewing, when direct comparison between the acted interpretation and a standard interpretation was not available, no positive technical comments were made about the acted interpretation.
Another possible cause of this result is a difference in the production quality of the two versions. The Acted ASL version may have been perceived as higher quality than the standard interpretation and therefore drew more favourable comments than it did without the comparison. Interestingly, at no time were positive technical comments made about the standard interpretation. However, this lack of favourable technical comments is not necessarily indicative of a low-quality production. Rather, it may be that the standard interpretation met the normal expectations of the participants, as it is the traditional style of presenting ASL-interpreted video material. Thus, the standard interpretation video did little to surprise the viewer and little to elicit positive comments regarding production.
It is unlikely that a practice effect resulting from repetition of the video dialogue caused the increase in positive technical comments, since this effect was not shown for both video categories. This discrepancy indicates that comparison between the two videos is a likely contributor to the finding that positive technical comments were made only during Acted ASL videos shown after a Standard ASL video.
Participants produced more negative ASL comments during the first video session than during the second. This difference may be attributed to practice effects. Both videos have the same script, so a viewer would be likely to understand the interpretation better in the second session, when seeing the script signed for the second time. This “practice” of the script means that the participant would be less likely to notice, or have their focus drawn to, aspects of the interpretation that would otherwise be difficult to understand.
The objective for using TAP in a software usability study was to gather rich data that sheds light on how an individual is responding to and using the user interface. The concurrent nature of the data helped the researcher to make connections between specific actions or aspects of the interface and thoughts voiced by the participant. In the usability study reported here, participants reported their thoughts with an average frequency that approached 1/min (0.84/min). In addition, each verbal protocol yielded at least one coded comment, and the average number of comments per session was 8.4. Participants were only required to watch the video segments and were not instructed to make any specific use of the user interface during the playing of the video. Standards for the optimum number of comments per session do not exist for the TAP method, and establishing a standard is beyond the scope of this study. However, it seemed reasonable that the number of comments produced would vary from task to task, and the objective of our study was not to compare the frequency of comments between tasks but to show that comments would be generated. For the law video sessions, the production of 8.4 coded comments per session was substantial given the more passive role of the participant.
4.3. The relationship between gestural language type and method of data collection
The results of our studies also seem to indicate that the TAP method requires very little adaptation to allow for the collection of gestural protocols. The greatest modification to the conventional TAP protocol is the addition of an interpreter (i.e., there are now two observers/experimenters involved). However, this modification is relatively minor and does not cause much disruption to the flow of the protocol as long as the interpreter is adequately briefed and prepared.
In the Solitaire game study, the interpreter provided real-time translation during the study. This translation was recorded along with the participant’s actions as they occurred. In the usability study, a more conventional usability approach was taken to data collection and recording: the participant actions and comments were recorded together, and translation/transcription occurred after the study was completed. We found that having simultaneous translation during the actual study simplified the analysis because it avoided a two-step process. However, simultaneous translation has the potential to introduce slight time delays because the translator must wait until the signer has said enough to translate the syntax and grammatical structures correctly. This may cause difficulties in studies where accurate time data for the beginning and ending of thoughts are required.
4.3.1. Preliminary guidelines for using GTAP
There are some simple steps that we used and would recommend to optimize collection of a gestural protocol. First, it is useful to explain the rationale of the TAP method to the interpreter. This step enables the interpreter to better assist the researcher in obtaining the desired results, since interpretation to ASL from English (and vice versa) is not a direct one-to-one mapping.
Second, it is important to discuss the handling of speaking prompts with the interpreter and participant before the data collection begins. The interpreter should understand the importance of using brief, neutral cues such as “thoughts?” and “keep talking” when prompting a participant who has fallen silent. Another way to handle prompting is to tell the participant that they will be
physically tapped when they fall silent for too long. In this way, the researcher may be positioned behind the participant, an ideal location from which to operate the camera, to observe the user and screen without being a visual distraction for the participant, and to prompt the participant with a shoulder tap when necessary.
Third, the researcher should consider the importance of text interactions with the participant and remove them whenever possible. For example, it is customary to have hearing participants read a passage aloud as a warm-up for thinking aloud. For deaf participants, it may make more sense to practice signing thoughts while performing a simple task on or off the computer. Participants may also be asked to sign rote phrases such as counting or the alphabet.
Fourth, some deaf individuals have residual hearing; thus, when an interface has an audio component, speakers and sound should be on. Finally, when taping the actual gestures of the participants, have a lab coat available for participants to wear. On video and even in person, the ease of viewing signs is hampered by patterned tops. A covering such as a lab coat will create a neutral background, making the signs more visible to the interpreter and to anyone viewing the video of the gestural protocol. Understanding and attending to these variables and how they affect the gestural protocol are keys to using the method optimally.
Gestural TAP is built on the foundation of research that supports TAP as a valid and reliable UEM. The collection of gestural protocols requires only minimal change to the TAP method: an interpreter and some consideration of hand requirements during the task. Equipment, coding and analysis requirements are virtually unchanged when concurrent interpretation of the utterances is collected. This minor change means that the method may be readily applied by field practitioners, particularly as replications of this research further support the similarity between spoken and gestural protocols.
Gestural TAP enables inclusive use of TAP, the most effective UEM available (Corina and McBurney, 2001). These methods are important to enable developers to meet inclusive technology mandates and to help foster an environment of universal design. There is ample evidence that retrofitting environments, whether physical or digital, to accommodate individuals with special needs is more costly than building inclusive environments in the first place. Developers, however, cannot effectively meet the needs of disabled users unless they include these users in their usability evaluations. This GTAP research and research on other inclusive UEMs are urgently needed so that developers and usability engineers are equipped with methods that have been tested and refined.
This study began with the research question: are the outcomes of ASL speakers comparable to English speakers’ outcomes? Certainly the failure to find significant differences between the groups on most of the variables lends support to the idea that oral and gestural language verbal protocols are similar, especially given the steps taken to increase statistical power and precision.
Acknowledgements
Funding for this project, Creating Barrier-Free, Broadband Learning Environments Project #34, is provided by the E-Learning program of CANARIE. The authors wish to acknowledge gratefully Earl Woodruff and Richard Volpe of the University of Toronto for their assistance with this project, the Canadian Hearing Society for assistance in recruiting participants, and all of the participants who gave up their valuable time to participate in our studies.
References
Andrews, J.F., Mason, J.M., 1991. Strategy usage among deaf and hearing readers. Exceptional Children 57, 536–545.
Bates, E., Dick, F., 2002. Language, gesture, and the developing brain (Special issue: Converging method approach to the study of developmental science). Developmental Psychobiology 40 (3), 293–310.
Bellugi, U., Klima, E.S., Siple, P., 1974. Remembering in signs. Cognition 3 (2), 93–125.
Campbell, R., Wright, H., 1990. Deafness and immediate memory for pictures: dissociations between “inner speech” and the “inner ear”? Journal of Experimental Child Psychology 50 (2), 259–286.
Corina, D.P., 1998. Sign language aphasia. In: Coppens, P. (Ed.), Aphasia in Atypical Populations. Erlbaum, Hillsdale, NJ, pp. 261–309.
Corina, D.P., McBurney, S.L., 2001. The neural representation of language in users of American Sign Language. Journal of Communication Disorders 34 (6), 455–471.
Ericsson, K.A., Simon, H.A., 1984. Protocol Analysis: Verbal Reports as Data. MIT Press, Cambridge, MA.
Ericsson, K.A., Simon, H.A., 1998. How to study thinking in everyday life: contrasting think-aloud protocols with descriptions and explanations of thinking. Mind, Culture, & Activity 5 (3), 178–186.
Green, A., 1998. Verbal Protocol Analysis in Language Testing Research: a Handbook. Cambridge University Press, Cambridge, UK.
Henderson, R., Podd, J., Smith, M., Varela-Alvarez, H., 1995. An examination of four user-based software evaluation methods. Interacting with Computers 7 (4), 412–432.
Jones, L., Pullen, G., 1992. Cultural differences: deaf and hearing researchers working together. Disability & Society 7 (2), 189–196.
Kuusela, H., Paul, P., 2000. A comparison of concurrent and retrospective verbal protocol analysis. American Journal of Psychology 113 (3), 387–404.
MacSweeney, M., 1998. Cognition and deafness. In: Gregory, S. (Ed.), Issues in Deaf Education. D. Fulton Publishers, London, p. xi 292.
McNeill, D., 1992. Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, Chicago.
Messing, L.S., 1999. An introduction to signed languages. In: Messing, L.S., Campbell, R. (Eds.), Gesture, Speech and Sign. Oxford University Press, Oxford, UK, New York, p. xxv 227.
Newell, A., Simon, H.A., 1972. Human Problem Solving. Prentice-Hall, Englewood Cliffs, NJ.
Nisbett, R.E., Wilson, T.D., 1977. Telling more than we can know: verbal reports on mental processes. Psychological Review 84 (3), 231–259.
Schirmer, B.R., 2000. Language and Literacy Development in Children Who are Deaf, second ed. Allyn and Bacon, Boston.
Schirmer, B.R., 2003. Using verbal protocols to identify the reading strategies of students who are deaf. Journal of Deaf Studies & Deaf Education 8 (2), 157–170.
Someren, M.W.v., Barnard, Y.F., Sandberg, J., 1994. The Think Aloud Method: a Practical Guide to Modelling Cognitive Processes. Academic Press, London, San Diego.
Stevens, J., 1996. Applied Multivariate Statistics for the Social Sciences, third ed. Lawrence Erlbaum Associates, Mahwah, NJ.
Stokoe, W.C., 2001. The study and use of sign language. Sign Language Studies 1 (4), 369–406.
Van Someren, M.W., Barnard, Y.F., Sandberg, J., 1994. The Think Aloud Method: a Practical Guide to Modelling Cognitive Processes. Academic Press, London, San Diego.
Wiedenbeck, S., Lampert, R., Scholtz, J., 1989. Using protocol analysis to study the user interface. Bulletin of the American Society for Information Science June/July, 25–26.
Wilson, M., Emmorey, K., 1998. A “word length effect” for sign language: further evidence for the role of language in structuring working memory. Memory & Cognition 26 (3), 584–590.
Wilson, M., Emmorey, K., 2003. The effect of irrelevant visual input on working memory for sign language. Journal of Deaf Studies & Deaf Education 8 (2), 97–103.